Running PyTorch Training Job

Building the training image is straightforward: start from an official PyTorch 1.0 image, copy the training code above into it, and configure the entrypoint. If the image has no entrypoint, you can specify the startup command when submitting the PyTorchJob instead.

Note:

The training code is written for PyTorch 1.0. Because APIs can be incompatible across PyTorch versions, you may need to adjust the training code above before running it on a different version.
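
One way to surface a version mismatch early is to check the installed version when the training script starts. The following is a minimal sketch, not part of the original training code: the helper names `major_minor` and `matches_expected` are hypothetical, and the sample strings stand in for the value of `torch.__version__`.

```python
import re

# Version the training code was written for, per the note above.
EXPECTED = (1, 0)


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.0.1.post2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def matches_expected(version: str) -> bool:
    """Return True when the version has the expected major and minor numbers."""
    return major_minor(version) == EXPECTED


# In a training script the argument would be torch.__version__ (assumption:
# the script chooses to warn or exit when this returns False).
print(matches_expected("1.0.1.post2"))   # True
print(matches_expected("1.13.1+cu117"))  # False
```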

Submitting the job

Prepare a PyTorchJob YAML file (named pytorch_job_mnist_nccl.yaml in the commands below) that defines one master and one worker.

Note:

Replace the <training image> placeholder with the address of the uploaded training image.
Because GPU resources are requested in the resource configuration, the training backend is set to "nccl" in args. For jobs that use no (NVIDIA) GPU resources, use another backend such as gloo.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: <training image>
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: <training image>
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
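
The `--backend` value in `args` is passed to the training script, which has to map it to a distributed backend. The sketch below shows one way a script might do that, under stated assumptions: `resolve_backend` and `parse_args` are hypothetical helpers, not taken from the original training code, and the fallback from nccl to gloo is a design choice of this sketch rather than PyTorch behavior.

```python
import argparse

# Backend names accepted by torch.distributed; "nccl" requires NVIDIA GPUs.
KNOWN_BACKENDS = ("nccl", "gloo", "mpi")


def resolve_backend(requested: str, gpu_count: int) -> str:
    """Return the backend to use, falling back to gloo when no GPU is visible."""
    if requested not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "nccl" and gpu_count == 0:
        return "gloo"
    return requested


def parse_args(argv=None):
    """Parse the --backend flag that the PyTorchJob passes through args."""
    parser = argparse.ArgumentParser(description="hypothetical training entrypoint")
    parser.add_argument("--backend", default="gloo", choices=KNOWN_BACKENDS)
    return parser.parse_args(argv)


# Simulate the container arguments from the manifest on a node without GPUs.
args = parse_args(["--backend", "nccl"])
print(resolve_backend(args.backend, gpu_count=0))  # gloo
```

In a real script, `gpu_count` would typically come from `torch.cuda.device_count()`, and the resolved name would be passed as the `backend` argument of `torch.distributed.init_process_group`. Failing loudly instead of falling back is an equally reasonable choice when a GPU is mandatory.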

Run the following command to use kubectl to submit the PyTorchJob:
```
kubectl create -f ./pytorch_job_mnist_nccl.yaml
```

Run the following command to view the PyTorchJob:

kubectl get -o yaml pytorchjobs pytorch-dist-mnist-nccl

Run the following command to view the Pods created by the PyTorchJob:

kubectl get pods -l pytorch_job_name=pytorch-dist-mnist-nccl
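
To check Pod status from a script rather than by eye, the same query can be issued with `-o json` and the result summarized. This is a minimal sketch: `summarize_pods` and `fetch_pods` are hypothetical helpers, and the Pod names in the sample data are illustrative, not output captured from a cluster.

```python
import json
import subprocess


def summarize_pods(pod_list: dict) -> dict:
    """Map each Pod name to its phase (for example Pending, Running, Succeeded)."""
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    }


def fetch_pods(selector: str) -> dict:
    """Run `kubectl get pods -l <selector> -o json` and parse the output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-l", selector, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Offline example with a hand-written Pod list; names are illustrative only.
sample = {"items": [
    {"metadata": {"name": "pytorch-dist-mnist-nccl-master-0"},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "pytorch-dist-mnist-nccl-worker-0"},
     "status": {"phase": "Pending"}},
]}
print(summarize_pods(sample))
```

Against a live cluster, `summarize_pods(fetch_pods("pytorch_job_name=pytorch-dist-mnist-nccl"))` would use the same label selector as the command above.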
