Download the PyTorch Elastic repository and install it. Run in your terminal:
! git clone https://github.com/pytorch/elastic.git
! pip install torchelastic
Download and install Classy Vision:
! git clone https://github.com/facebookresearch/ClassyVision.git
! pip install classy_vision
If needed, install Docker:
! sudo apt install docker-compose
To run torchelastic manually you'll also need etcd:
! sudo apt install etcd-server
Set this environment variable to your installed torchelastic version. This tutorial only works with version 0.2.0 or later:
! export VERSION=<torchelastic version>
The easiest way to get started is to use our example Docker image. Run the following in your shell:
$ docker run --shm-size=2g --gpus=all torchelastic/examples:$VERSION --standalone --nnodes=1 --nproc_per_node=$NUM_CUDA_DEVICES /workspace/classy_vision/classy_train.py --device=gpu --config_file /workspace/classy_vision/configs/template_config.json
This will download and launch our example Docker container and start training on the current machine using torchelastic and Classy Vision. If you don't have GPUs available, simply drop the --gpus=all flag. This is fine as a sanity check, but elasticity is really intended to help with training on multiple nodes. The next section will walk you through that.
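The commands in this tutorial reference `$NUM_CUDA_DEVICES` without defining it. Below is a minimal sketch of one way to set it, assuming `nvidia-smi` is on your PATH when GPUs are present. The helper name `count_gpus` and the fallback to a single process on machines without GPUs are illustrative choices, not part of torchelastic or Classy Vision.

```shell
# Print the number of GPUs listed by nvidia-smi, or 0 if none can be listed.
# count_gpus is a hypothetical helper written for this sketch.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # `nvidia-smi -L` prints one "GPU <index>: ..." line per device.
        nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

NUM_CUDA_DEVICES=$(count_gpus)

# Assumption: fall back to one worker process when no GPU is found.
if [ "$NUM_CUDA_DEVICES" -lt 1 ]; then
    NUM_CUDA_DEVICES=1
fi

export NUM_CUDA_DEVICES
echo "NUM_CUDA_DEVICES=$NUM_CUDA_DEVICES"
```

Run this in the same shell session before the `docker run` or launcher commands so the variable is expanded when they execute.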
Now let's replicate what the Docker example in the previous section did, to see how things work under the hood. torchelastic provides a drop-in replacement for torch.distributed.launch that is compatible with Classy Vision's classy_train.py. The main difference is that torchelastic requires launching an etcd server so that the workers know how to communicate with each other. First, create a Classy Vision project. In your shell, run this:
! classy-project my-project
Launch the etcd server:
! etcd --enable-v2 --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 --advertise-client-urls http://127.0.0.1:2379
This might fail if you already have an etcd server running. torchelastic requires etcd v2 in order to work properly, so make sure to kill any etcd instances that are already running.
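To find out whether something is already listening on the etcd client port, you can probe it over HTTP. The sketch below assumes `curl` is installed and relies on etcd serving a `/version` path on its client URL. The names `ETCD_ENDPOINT` and `etcd_is_up` are hypothetical and exist only for this example.

```shell
# Endpoint matching the --rdzv_endpoint used in this tutorial (hypothetical variable name).
ETCD_ENDPOINT="${ETCD_ENDPOINT:-127.0.0.1:2379}"

# Succeed if a server at host:port answers GET /version with a success status.
# etcd serves this path, so a success here suggests an etcd instance is running.
etcd_is_up() {
    curl --silent --fail --max-time 2 "http://$1/version" >/dev/null 2>&1
}

if etcd_is_up "$ETCD_ENDPOINT"; then
    echo "A server is already answering on $ETCD_ENDPOINT"
else
    echo "No etcd server detected on $ETCD_ENDPOINT"
fi
```

The same check can be repeated after starting etcd to confirm that it is reachable before launching training.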
Once the etcd server is up, launch training with torchelastic:
! python -m torchelastic.distributed.launch --nproc_per_node=$NUM_CUDA_DEVICES --rdzv_endpoint 127.0.0.1:2379 \
    ./classy_train.py --config configs/template_config.json --distributed_backend ddp
That's it! The training script should start running with torchelastic enabled.
See the torchelastic documentation for full details.
torchelastic is meant to help with distributed training on multiple machines. In this part, we will simulate a multi-machine setup by launching multiple containers on the same host. Set this environment variable to the location of your ClassyVision repository:
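Judging by the command that follows, the variable in question is `CLASSY_VISION_HOME`. Here is a minimal sketch of setting and checking it, assuming the repository was cloned into your home directory. The default path and the helper name `has_elastic_example` are assumptions, so adjust the path to wherever you ran `git clone`.

```shell
# Assumed clone location; an existing CLASSY_VISION_HOME value takes precedence.
export CLASSY_VISION_HOME="${CLASSY_VISION_HOME:-$HOME/ClassyVision}"

# Succeed only if the elastic example folder exists under the given root.
# has_elastic_example is a hypothetical helper written for this sketch.
has_elastic_example() {
    [ -d "$1/examples/elastic" ]
}

if has_elastic_example "$CLASSY_VISION_HOME"; then
    echo "Found $CLASSY_VISION_HOME/examples/elastic"
else
    echo "examples/elastic not found under $CLASSY_VISION_HOME" >&2
fi
```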
In your shell, run:
cd $CLASSY_VISION_HOME/examples/elastic
classy-project my_project
This will set up a Classy Vision project within the examples folder, which our containers will use as the training script. Now launch the containers:
That's it! This will launch two containers: one running the etcd server, and another doing training. You should see the output from both the etcd server and from the training script in your terminal.