Running PyTorch Training Job

Last updated: 2023-05-19 17:07:37

    This document describes how to run a PyTorch training job.

    Prerequisites

    • PyTorch Operator has been installed in your AI environment.
    • Your AI environment has GPU resources.

    Directions

    The following steps are based on the official distributed training examples of PyTorch-Operator.

    Preparing the training code

    This example uses the mnist.py code sample from the official Kubeflow website.

    Creating a training image

    To create the training image, start from an official PyTorch 1.0 image, copy the training code above into the image, and configure the entrypoint. If you do not configure an entrypoint in the image, you can specify the startup command when you submit the PyTorchJob.

    Note:

    The training code is written for PyTorch 1.0. Because APIs may be incompatible across PyTorch versions, you may need to adjust the training code before running it on another PyTorch version.
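The job spec later in this document passes `--backend` to the training script through `args`. The following is a minimal sketch of how a training script might consume that argument; it is not the actual mnist.py code. The helper names `parse_args`, `resolve_backend`, and `should_distribute` are hypothetical, and the sketch assumes the operator exposes the number of replicas to each Pod through a `WORLD_SIZE` environment variable.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command-line arguments passed via `args` in the job spec."""
    parser = argparse.ArgumentParser(description="Distributed training sketch")
    parser.add_argument(
        "--backend",
        choices=["gloo", "nccl", "mpi"],
        default="gloo",
        help="torch.distributed communication backend",
    )
    return parser.parse_args(argv)


def resolve_backend(requested, cuda_available):
    """Fall back to gloo when nccl is requested but no CUDA device is visible."""
    if requested == "nccl" and not cuda_available:
        return "gloo"
    return requested


def should_distribute(environ=None):
    """Treat the run as distributed only when WORLD_SIZE is greater than 1.

    Assumption: WORLD_SIZE is set in the Pod environment by the operator.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get("WORLD_SIZE", "1")) > 1
```

In a real script, `cuda_available` would typically come from `torch.cuda.is_available()`, and the resolved backend would be passed to `torch.distributed.init_process_group(backend=...)`. PyTorch is left out of the sketch so that it runs without a GPU environment.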

    Submitting the job

    1. Prepare a PyTorchJob YAML file (named pytorch_job_mnist_nccl.yaml in the commands below) that defines one master and one worker.

      Note
      • Replace the <training image> placeholder with the address of the uploaded training image.
      • Because the resource configuration requests GPU resources, set the training backend to "nccl" in args. For jobs that do not use NVIDIA GPU resources, use another backend such as gloo.
      apiVersion: "kubeflow.org/v1"
      kind: "PyTorchJob"
      metadata:
        name: "pytorch-dist-mnist-nccl"
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    image: <training image>
                    args: ["--backend", "nccl"]
                    resources:
                      limits:
                        nvidia.com/gpu: 1
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    image: <training image>
                    args: ["--backend", "nccl"]
                    resources:
                      limits:
                        nvidia.com/gpu: 1
      
    2. Run the following kubectl command to submit the PyTorchJob:

      kubectl create -f ./pytorch_job_mnist_nccl.yaml
      
    3. Run the following command to view the PyTorchJob:

      kubectl get -o yaml pytorchjobs pytorch-dist-mnist-nccl
      
    4. Run the following command to view the Pods created by the PyTorchJob:

      kubectl get pods -l pytorch_job_name=pytorch-dist-mnist-nccl
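The full YAML output of the view command can be long when you only want to know whether the job is running or finished. The sketch below shows one way to summarize it, assuming the job is fetched as JSON (for example with `kubectl get -o json pytorchjobs pytorch-dist-mnist-nccl`) and that its `status.conditions` list holds entries with `type` and `status` fields in chronological order. The function name `current_condition` and the sample document are hypothetical and are not real cluster output.

```python
import json


def current_condition(job):
    """Return the type of the most recent condition whose status is "True".

    Assumption: conditions are appended in chronological order, so the last
    active entry reflects the current state of the job.
    """
    conditions = job.get("status", {}).get("conditions", [])
    active = [c for c in conditions if c.get("status") == "True"]
    if not active:
        return None
    return active[-1].get("type")


# Illustrative sample only; a real job document contains many more fields.
SAMPLE_JOB_JSON = """
{
  "kind": "PyTorchJob",
  "metadata": {"name": "pytorch-dist-mnist-nccl"},
  "status": {
    "conditions": [
      {"type": "Created", "status": "True"},
      {"type": "Running", "status": "False"},
      {"type": "Succeeded", "status": "True"}
    ]
  }
}
"""

sample_job = json.loads(SAMPLE_JOB_JSON)
sample_state = current_condition(sample_job)
```

To use it against a live cluster, you could save the kubectl JSON output to a file and pass the result of `json.load` to `current_condition`. A return value of `None` means the job has not reported any active condition yet.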
      