Overview
This series of documents describes how to deploy deep learning in EKS, from direct TensorFlow deployment to subsequent Kubeflow deployment, and is intended to provide a comprehensive scheme for implementing container-based deep learning.
Prerequisites
This document proceeds to run a deep learning task in EKS by using a self-built cluster after the steps in Building Deep Learning Container Image are completed.
The self-built image has been uploaded to the image repository ccr.ccs.tencentyun.com/carltk/tensorflow-model and can be pulled directly for use with no rebuild required.
Directions
Creating EKS cluster
Please create an EKS cluster as instructed in Connecting to a Cluster.
Note:
As you need to run a GPU-based training task, when creating a cluster, please pay attention to the supported resources in the AZ of the selected container network and be sure to select an AZ that supports GPU as shown below:

Creating CFS file system (optional)
The container is automatically deleted and its resources are automatically released after the task ends. Therefore, to persistently store models and data, we recommend you mount an external storage service such as CBS, CFS, or COS.
In this example, CFS is used as an NFS disk to persistently store data with frequent reads and writes.
Creating CFS file system
- Log in to the CFS console and enter the File System page.
- Click Create. On the Create File System page that pops up, select the file system type and click Next: Detailed Settings.
- On the Detailed Settings page, set the relevant configuration items. For more information on CFS types and configurations, please see Creating File Systems and Mount Targets.
Note:
The CFS file system must be created in the region of the cluster.
- After confirming that everything is correct, click Buy Now and make the payment to create a file system.
- On the File System page, click the ID of the file system whose mount target path you need to obtain to enter the file system details page.
- Select the Mount Target Info tab and get the file system mount information next to Mount to Linux as shown below:
Note:
Note down the IPv4 address in the mount target details, such as 10.0.0.161:/, which will be used as the NFS path in subsequent mount configuration.
Creating training task
This task uses the MNIST handwritten digit recognition dataset and a two-layer CNN as an example. The sample image is the self-built image created in the previous chapter. If you need to use a custom image, please see Creating Deep Learning Container Image. Two task creation methods are provided below:
Because a deep learning training task is a run-to-completion workload, Job deployment is used as an example in this document. For more information on how to deploy a Job, please see Job Management.
The following is the example of deployment in the console:
- In the Volume (optional) configuration item, select Using NFS disk and enter the name and IPv4 address of the CFS file system created previously as shown below:

- In the Mount Target configuration item in Containers in the Pod, select the volume and configure the mount target as shown below:

Note:
- As the dataset may need to be downloaded online, you need to configure public network access for the cluster. For more information, please see Public Network Access.
- After selecting a GPU model, the CPU and memory resources you assign to the container in the request and limit must meet the resource specifications of that model. The values do not need to be accurate down to the ones place. When configuring in the console, you can also delete the default configuration and leave it empty to configure "unlimited" resources, which also have corresponding billing specifications. This approach is recommended.
- The container running command is inherited from Docker's CMD field, whose preferred form is exec. The exec form does not invoke a shell, so no normal shell processing takes place. Therefore, if you want to run a command in the shell form, you need to add "sh" and "-c" at the beginning. When you enter multiple commands and parameters in the console, enter each one on its own line (separated by line breaks).
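The difference between the exec form and the shell form can be reproduced locally. The following is a minimal sketch and not part of the deployment itself: it assumes a POSIX shell and uses a single-quoted string to stand in for an exec-form argument that no shell ever processes. The MODEL_DIR value mirrors the sample environment variable used in this document.

```shell
# Environment variable as the container would see it (sample value).
export MODEL_DIR=/tf/model

# Exec form: the argument reaches the program verbatim, because no shell expands it.
exec_form=$(printf '%s\n' '--model_dir=$MODEL_DIR')

# Shell form: "sh -c" starts a shell, which expands $MODEL_DIR from the environment.
shell_form=$(sh -c 'printf "%s\n" "--model_dir=$MODEL_DIR"')

echo "exec form:  $exec_form"
echo "shell form: $shell_form"
```

The first line keeps the literal text $MODEL_DIR, while the second contains /tf/model, which is why a command that relies on environment variables is prefixed with "sh" and "-c".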
You can also use a YAML file to create a task.
- Prepare a YAML file. Below is the sample file gpu_pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: tf-cnn
  annotations:
    eks.tke.cloud.tencent.com/gpu-type: T4
spec:
  containers:
  - name: tf-cnn
    image: hkccr.ccs.tencentyun.com/carltk/tensorflow-model:latest
    env:
    - name: MODEL_DIR
      value: /tf/model
    - name: DATA_DIR
      value: /tf/data
    command:
    - "sh"
    - "-c"
    - "python3 official/vision/image_classification/mnist_main.py \
      --model_dir=$MODEL_DIR \
      --data_dir=$DATA_DIR \
      --train_epochs=5 \
      --distribution_strategy=one_device \
      --num_gpus=1 \
      --download"
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
    volumeMounts:
    - name: tf-model-cfs
      mountPath: /tf
  volumes:
  - name: tf-model-cfs
    nfs:
      path: /
      server: 10.0.1.8
  restartPolicy: OnFailure
- Run the following command to complete deployment:
kubectl create -f [yaml_name]
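Before creating the Pod, the manifest can be checked without creating any object. The sketch below assumes kubectl 1.18 or later (for the --dry-run=client option) and a kubeconfig that already points at the EKS cluster. deploy_training_pod is a hypothetical helper name used only for illustration, not part of kubectl.

```shell
# Hypothetical helper: validate a manifest first, then create the resource.
deploy_training_pod() {
  manifest="$1"
  # Dry run: parse and validate the manifest without creating the Pod.
  kubectl create --dry-run=client -f "$manifest" || return 1
  # Create the Pod for real once the dry run has passed.
  kubectl create -f "$manifest"
}
```

With the sample file from this document, the helper would be called as deploy_training_pod gpu_pod.yaml.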
Note:
In addition to the precautions mentioned above for the console, you also need to pay attention to the following:
- You need to use annotations to declare resource assignment in the YAML file. For more information, please see Annotation. Note that different GPU models correspond to different CPU and memory options. We recommend you enter the values as needed.
- Here, NFS is used as the data volume. If you want to use other data volumes for persistent storage, please see Instructions for Other Storage Volumes.
- In annotations, you can keep only eks.tke.cloud.tencent.com/gpu-type; no other items are needed. If /gpu-count is specified, cpu and mem must also be specified. In this document, we recommend you not add other items: omitting them does not affect the result, whereas items that do not follow the specifications may cause OOM errors.
- For nvidia.com/gpu in GPU scheduling, only limits is required. If only annotations is specified, an error will be reported that no cards are found. If only limits is specified, its value is also used as the requests value. If requests is also specified, its value must be the same as that of limits. For more information, please see Schedule GPUs. As detailed above, adding cpu and memory settings to requests and limits is also not recommended here.
Viewing running result
You can view the running result either in the console or on the command line:
You can run commands to view events or logs:
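As a sketch of the command-line route, the standard kubectl subcommands get, describe, and logs cover the Pod status, its events, and the training output. The example assumes the Pod name tf-cnn from the sample YAML file and a kubeconfig that points at the EKS cluster. show_training_result is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: print status, events, and logs of a training Pod.
show_training_result() {
  pod_name="$1"
  # Current phase of the Pod (for example Pending, Running, or Succeeded).
  kubectl get pod "$pod_name"
  # Detailed description, including the Events list (scheduling, image pull, mounts).
  kubectl describe pod "$pod_name"
  # Output written by the training script to stdout and stderr.
  kubectl logs "$pod_name"
}
```

For the sample task, call show_training_result tf-cnn. To follow the log while training is still running, kubectl logs -f tf-cnn can be used instead.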
Viewing storage
If you have configured NFS as instructed above, you can go to the mount target to view NFS storage:
- Run the following command to enter the relevant mount directory to check whether it exists:
cd /mound_data
- Enter the model directory and check whether it contains the relevant data.
- Enter the data directory and check whether it contains the relevant data.
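The steps above presume that the CFS file system is already mounted on a Linux host that can reach the mount target. If it is not, a mount along the following lines may work. This is a sketch under assumptions: the NFS version and the noresvport option are illustrative, and the exact command should be copied from Mount to Linux on the Mount Target Info tab. mount_cfs is a hypothetical helper name.

```shell
# Hypothetical helper: mount the root of a CFS file system over NFS.
mount_cfs() {
  server="$1"   # IPv4 address of the CFS mount target
  target="$2"   # local directory to mount onto
  # Create the local mount directory if it does not exist yet.
  sudo mkdir -p "$target"
  # Mount options are assumptions; prefer the command shown in the CFS console.
  sudo mount -t nfs -o vers=4.0,noresvport "${server}:/" "$target"
}
```

For example, mount_cfs 10.0.0.161 /mound_data mounts the sample address noted earlier, after which the model and data directories can be listed with ls.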

Relevant Operations
Using GPU to deploy deep learning task in TKE
Deployment in TKE is almost the same as that in EKS. Taking deployment through kubectl with a YAML file as an example, TKE has the following differences:
- When creating a TKE node, you should select a node with GPU. For more information, please see Using a GPU Node.
- As the node has built-in GPU resources, annotations and resources are not needed. In practice, you can keep annotations, which TKE will not process. We recommend you comment out resources, as it may cause unreasonable resource requirements.
FAQs
If you encounter any problems when performing this practice, please see FAQs for troubleshooting.