This tutorial will demonstrate how to use PyTorch Elastic with Classy Vision.
Download the PyTorch Elastic repository and install it. Run in your terminal:
! git clone https://github.com/pytorch/elastic.git
! pip install torchelastic
Download and install Classy Vision:
! git clone https://github.com/facebookresearch/ClassyVision.git
! pip install classy_vision
If needed, install Docker:
! sudo apt install docker-compose
To run torchelastic manually you'll also need etcd:
! sudo apt install etcd-server
Set this environment variable to your current torchelastic version. This tutorial only works for version >= 0.2.0:
! export VERSION=<torchelastic version>
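If you are unsure which version is installed, a small check along these lines may help. It is only a sketch: the helper names (`parse_version`, `meets_minimum`, `installed_torchelastic_version`) are made up for this example, and the parser deliberately ignores pre-release suffixes, so `0.2.0rc1` is treated the same as `0.2.0`.

```python
import re
from importlib import metadata

MINIMUM_VERSION = "0.2.0"  # minimum version stated in this tutorial


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad with zeros so that "0.2" compares equal to "0.2.0".
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed, minimum=MINIMUM_VERSION):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def installed_torchelastic_version():
    """Return the installed torchelastic version string, or None if it is missing."""
    try:
        return metadata.version("torchelastic")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torchelastic_version()
    if version is None:
        print("torchelastic is not installed")
    elif meets_minimum(version):
        print(f"torchelastic {version} is new enough; use VERSION={version}")
    else:
        print(f"torchelastic {version} is older than {MINIMUM_VERSION}")
```

The printed version string is the value to put into VERSION.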
The easiest way to get started is to use our example docker image. Run the following in your shell:
export NUM_CUDA_DEVICES=2
docker run --shm-size=2g --gpus=all torchelastic/examples:$VERSION \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=$NUM_CUDA_DEVICES \
    /workspace/classy_vision/classy_train.py \
    --device=gpu \
    --config_file /workspace/classy_vision/configs/template_config.json
If you don't have GPUs available, simply drop the --gpus=all flag. The command downloads and launches our example Docker container and starts training on the current machine using torchelastic and Classy Vision. This is fine as a sanity check, but elasticity is really intended to help with training on multiple nodes. The next section will walk you through that.
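If you prefer to launch the container from a script, the command above can be assembled as an argument list. The following is a minimal sketch: `build_docker_command` is a hypothetical helper, not part of torchelastic or Classy Vision, and it toggles only the --gpus=all flag, as described above. Every other argument is copied from the shell command.

```python
import shlex


def build_docker_command(version, num_procs, use_gpu=True):
    """Assemble the docker invocation from this tutorial as a list of arguments."""
    cmd = ["docker", "run", "--shm-size=2g"]
    if use_gpu:
        # Drop this flag on machines without GPUs.
        cmd.append("--gpus=all")
    cmd += [
        f"torchelastic/examples:{version}",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={num_procs}",
        "/workspace/classy_vision/classy_train.py",
        "--device=gpu",
        "--config_file",
        "/workspace/classy_vision/configs/template_config.json",
    ]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_docker_command("0.2.0", num_procs=2)))
```

To actually start the container, the returned list could be passed to `subprocess.run(cmd, check=True)`.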
Now let's replicate what the Docker example in the previous section did, to see how things work under the hood. torchelastic provides a drop-in replacement for torch.distributed.launch that is compatible with Classy Vision's classy_train.py. The main difference is that torchelastic requires launching an etcd server so that the workers know how to communicate with each other. First, create a Classy Vision project. In your shell, run this:
! classy-project my-project
%cd my-project
Launch the etcd server:
! etcd --enable-v2 --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 --advertise-client-urls http://127.0.0.1:2379
This might fail if you already have an etcd server running. torchelastic requires the etcd v2 API (hence the --enable-v2 flag) in order to work properly, so make sure to kill any etcd instances that are already running.
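One way to tell whether something is already listening on the etcd client port is to attempt a TCP connection to it. The sketch below assumes the endpoint used in this tutorial (127.0.0.1:2379). `port_in_use` is a hypothetical helper, and it only detects that some process accepts connections on the port, not that the process is etcd.

```python
import socket


def port_in_use(host="127.0.0.1", port=2379, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if port_in_use():
        print("Something is already listening on 127.0.0.1:2379")
    else:
        print("Port 2379 looks free")
```

Run it before starting etcd to check for a stale instance, or afterwards to confirm the server is accepting connections.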
Start training:
! python -m torchelastic.distributed.launch --nproc_per_node=$NUM_CUDA_DEVICES --rdzv_endpoint 127.0.0.1:2379 \
./classy_train.py --config_file configs/template_config.json --distributed_backend ddp
That's it! The training script should start running with torchelastic enabled.
For full details on how torchelastic.distributed.launch works, see the torchelastic documentation.
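The launch command reads the number of worker processes from NUM_CUDA_DEVICES. If you would rather derive that number than hard-code it, something like the following could work. It is a sketch under assumptions: `default_nproc_per_node` is a hypothetical helper, and falling back to a single process when PyTorch or CUDA is unavailable is a choice made for this example, not something torchelastic prescribes.

```python
def default_nproc_per_node():
    """Return the number of visible CUDA devices, or 1 if none can be detected."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; fall back to one process.
        return 1
    # device_count() returns 0 on machines without CUDA, so clamp to at least 1.
    return max(torch.cuda.device_count(), 1)


if __name__ == "__main__":
    print(f"export NUM_CUDA_DEVICES={default_nproc_per_node()}")
```

The printed line can be pasted into the shell before running the launch command.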
torchelastic is meant to help with distributed training on multiple machines. In this part, we will simulate a multi-machine setup by launching multiple containers on the same host. Set this environment variable to the location of your ClassyVision repository:
export CLASSY_VISION_HOME=~/ClassyVision
In your shell, run:
cd $CLASSY_VISION_HOME/examples/elastic
classy-project my_project
This will set up a Classy Vision project within the examples folder, which our containers will use as the training script. Now launch the containers:
docker-compose up
That's it! This will launch two containers: one running the etcd server, and another doing training. You should see the output from both the etcd server and from the training script in your terminal.