Stanford Practical Machine Learning - Web Data Scraping
Last updated: 1 year ago
Web Scraping
- The goal is to extract data from websites
- Noisy, weak labels, can be spammy
- Available at scale
- E.g. price comparison/tracking websites
- Many ML datasets are obtained by web scraping
- E.g. ImageNet, Kinetics
- Web crawling vs. scraping
- Crawling: indexing whole pages on the Internet
- Scraping: extracting particular data from the web pages of a website
- “curl” often doesn’t work
- Website owners use various ways to stop bots
- Use a headless browser: a web browser without a GUI
- You need many fresh IP addresses, which are easy to obtain through public clouds
- Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, and GCP 0.25%
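The headless-browser approach might look like the following minimal sketch. It assumes Selenium 4 and a local Chrome install, neither of which these notes name. `headless_chrome_args` and `fetch_rendered_html` are hypothetical helper names, and the window size is an arbitrary choice.

```python
def headless_chrome_args():
    """Command-line flags for running Chrome without a GUI (assumed setup)."""
    return ["--headless=new", "--disable-gpu", "--window-size=1280,800"]


def fetch_rendered_html(url):
    """Load `url` in headless Chrome and return the rendered HTML.

    Selenium is imported lazily so this module still imports on machines
    where Selenium is not installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in headless_chrome_args():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after the page's JavaScript has run
    finally:
        driver.quit()  # always release the browser process
```

Calling `fetch_rendered_html(url)` returns the page as a string, which can then be handed to any HTML parser.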
Case Study
- Query houses sold near Stanford
- You can replace the city and state in the URL for other places
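Swapping the city and state could be done with a small URL builder, sketched below. The listing site's real URL is not given in these notes, so `INDEX_URL_TEMPLATE` (on `example.com`) and the `index_url` helper are hypothetical placeholders.

```python
from urllib.parse import quote

# Hypothetical URL pattern; the real site and path layout are assumptions.
INDEX_URL_TEMPLATE = "https://example.com/{city}-{state}/sold/{page}_p/"


def index_url(city, state, page=1):
    """Build the index-page URL for one city, state, and result page."""
    city_slug = quote(city.strip().lower().replace(" ", "-"))
    state_slug = quote(state.strip().lower())
    return INDEX_URL_TEMPLATE.format(city=city_slug, state=state_slug, page=page)


# The first three index pages for one place.
urls = [index_url("Stanford", "CA", page) for page in range(1, 4)]
```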
Crawl individual pages
- Get the house IDs from the index pages
- The house detail page by ID:
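One way to pull house IDs out of an index page is to scan its links, as in the sketch below. It uses only the Python standard library. The `/homedetails/<id>/` link pattern, the `example.com` detail URL, and the sample HTML are all assumptions for illustration, not the real site's layout.

```python
import re
from html.parser import HTMLParser

# Hypothetical link pattern: detail pages look like /homedetails/<numeric id>/
DETAIL_LINK = re.compile(r"/homedetails/(\d+)/?")


class HouseIdParser(HTMLParser):
    """Collect unique house IDs from the <a href> links of an index page."""

    def __init__(self):
        super().__init__()
        self.house_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = DETAIL_LINK.search(href)
        # Keep first-seen order and skip duplicate links to the same house.
        if match and match.group(1) not in self.house_ids:
            self.house_ids.append(match.group(1))


def extract_house_ids(index_html):
    parser = HouseIdParser()
    parser.feed(index_html)
    return parser.house_ids


def detail_url(house_id):
    # Hypothetical detail-page URL built from an ID.
    return f"https://example.com/homedetails/{house_id}/"


# Made-up index page fragment for demonstration.
sample = """
<ul>
  <li><a href="/homedetails/19506780/">House A</a></li>
  <li><a href="/homedetails/19506780/">House A (photo link)</a></li>
  <li><a href="/homedetails/2083644/">House B</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
house_ids = extract_house_ids(sample)
```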
Extract data
- Identify the HTML elements through Inspect
- Repeat the previous process to extract other field data
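Once Inspect reveals which element holds a field, the extraction step can be sketched as a lookup by CSS class. The version below uses the standard library's `html.parser` rather than any particular scraping library. The class names (`price`, `sold-date`) and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ClassTextParser(HTMLParser):
    """Collect the text inside the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside the target element
        self.done = False   # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1  # nested element inside the target
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entered the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_field(html, css_class):
    """Return the whitespace-normalized text of the element, or None."""
    parser = ClassTextParser(css_class)
    parser.feed(html)
    text = " ".join("".join(parser.chunks).split())
    return text or None


# Made-up detail page fragment; real pages will use different class names.
sample = """
<div class="summary">
  <span class="price">$1,250,000</span>
  <span class="sold-date">Sold on <b>01/02/21</b></span>
</div>
"""
fields = {name: extract_field(sample, name) for name in ("price", "sold-date", "zip")}
```

Fields that are missing from a page come back as `None`, which makes it easy to spot layout changes when repeating the process for other fields.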
- Use AWS EC2 t3.small (2GB memory, 2 vCPUs, $0.02 per hour)
- 2 GB of memory is necessary because the browser needs a lot of memory; CPU and bandwidth are usually not an issue
- You can use spot instances to reduce the price
- The cost to crawl 1M houses is $16.6
- The speed is about 3s per page.
- 8.3 hours if using 100 instances.
- Extra costs include storage and restarting instances when an IP is banned.
Cost is low
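The cost and time figures follow from simple arithmetic, reproduced in the sketch below using only the numbers quoted above. The exact result is about $16.67, which these notes quote as $16.6. Storage and restart overhead are not included.

```python
pages = 1_000_000        # houses to crawl
seconds_per_page = 3     # crawl speed quoted above
price_per_hour = 0.02    # t3.small price in USD per instance-hour
instances = 100

# Total machine time, independent of how many instances share the work.
instance_hours = pages * seconds_per_page / 3600
compute_cost = instance_hours * price_per_hour

# Elapsed time when the work is split evenly across all instances.
wall_clock_hours = instance_hours / instances

print(f"instance-hours: {instance_hours:.1f}")
print(f"compute cost:   ${compute_cost:.2f}")
print(f"wall clock:     {wall_clock_hours:.1f} h on {instances} instances")
```

Adding instances shortens the wall-clock time but leaves the compute cost unchanged, since the total instance-hours stay the same.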
Crawl Images
- Get all image URLs
- A house listing has ~20 images
- The crawling cost is still reasonable: ~$300
- Storing these images is expensive: ~$300 per month
- You can reduce the image resolution or send the data back
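Collecting the image URLs of a listing can be sketched as gathering every `<img src>` on the detail page and resolving it against the page URL. The sketch uses only the standard library. The `example.com` URLs and the sample HTML are made up, and real pages may load images lazily or through other attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageUrlParser(HTMLParser):
    """Collect unique, absolute image URLs from <img src> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        url = urljoin(self.base_url, src)  # resolve relative paths
        if url not in self.image_urls:
            self.image_urls.append(url)


def extract_image_urls(page_html, page_url):
    parser = ImageUrlParser(page_url)
    parser.feed(page_html)
    return parser.image_urls


# Made-up detail page fragment for demonstration.
sample = """
<div class="gallery">
  <img src="/photos/1.jpg" alt="front">
  <img src="https://cdn.example.com/photos/2.jpg" alt="kitchen"/>
  <img src="/photos/1.jpg" alt="front again">
  <img alt="missing source">
</div>
"""
image_urls = extract_image_urls(sample, "https://example.com/homedetails/19506780/")
```

At roughly 20 images per listing, 1M listings yield about 20M URLs, which is why storage dominates the cost.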
Legal Considerations
- Web scraping isn’t illegal by itself
- But you should
- NOT scrape data that contains sensitive information (e.g. private data involving usernames/passwords, personal health/medical information)
- NOT scrape copyrighted data (e.g. YouTube videos, Flickr photos)
- Follow Terms of Service that explicitly prohibit web scraping
- Consult a lawyer if you are doing it for profit