Stanford Practical Machine Learning: Web Scraping
Last updated: 1 year ago
This chapter covers scraping data from web pages.
Web Scraping
- The goal is to extract data from websites
- The data is noisy, weakly labeled, and can be spammy
- Available at scale
- E.g. price comparison/tracking website
- Many ML datasets are obtained by web scraping
- E.g. ImageNet, Kinetics
- Web crawling vs. scraping
- Crawling: indexing whole pages on the Internet
- Scraping: scraping particular data from web pages of a website
Tools
- “curl” often doesn’t work
- Website owners use various ways to stop bots
- Use headless browser: a web browser without a GUI
- You need a lot of new IPs, easy to get through public clouds
- Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, GCP 0.25%
You can use Selenium to do the scraping.
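As a minimal sketch of the headless-browser approach, the following drives Chrome through Selenium. It assumes the `selenium` package and a Chrome browser are installed. The `--headless=new` flag is an assumption that fits recent Chrome versions, and the helper names `headless_chrome_args` and `fetch_page_source` are illustrative, not taken from the lecture.

```python
def headless_chrome_args(window_size=(1920, 1080)):
    """Return command-line flags for running Chrome without a GUI."""
    width, height = window_size
    return [
        "--headless=new",  # assumption: recent Chrome; older versions use "--headless"
        f"--window-size={width},{height}",
    ]


def fetch_page_source(url):
    """Load `url` in headless Chrome and return the rendered HTML."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in headless_chrome_args():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Calling `fetch_page_source("https://example.com")` would return the HTML after the browser has rendered the page, which is the main advantage over a plain HTTP client such as curl.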
Case Study
- Query houses sold near Stanford
- You can replace the city and state in the URL for other places
Crawl individual pages
- Get the house IDs from the index pages
- The house detail page by ID:
Splicing the ID into the URL gives the link to each house's detail page and its information.
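The two steps above (collect IDs from an index page, then build each detail URL from its ID) might look roughly like the sketch below. The URL layout `https://example.com/homedetails/<id>/`, the sample HTML, and the IDs in it are all hypothetical, since the real site's URL pattern is not preserved in these notes.

```python
import re

# Hypothetical URL layout; the real site's pattern is not shown in these notes.
DETAIL_URL = "https://example.com/homedetails/{house_id}/"
ID_PATTERN = re.compile(r'href="[^"]*/homedetails/(\d+)/?"')


def extract_house_ids(index_html):
    """Return the unique house IDs linked from an index page, in page order."""
    seen = {}
    for house_id in ID_PATTERN.findall(index_html):
        seen.setdefault(house_id, None)  # a dict keeps first-seen order
    return list(seen)


def detail_url(house_id):
    """Build the detail-page URL by splicing the ID into the URL template."""
    return DETAIL_URL.format(house_id=house_id)


# Made-up index page fragment for illustration.
sample_index = """
<a href="/homedetails/19506780/">House A</a>
<a href="https://example.com/homedetails/2077829067/">House B</a>
<a href="/homedetails/19506780/">House A (duplicate link)</a>
"""

ids = extract_house_ids(sample_index)
urls = [detail_url(i) for i in ids]
print(urls)
```

Deduplicating the IDs matters in practice because index pages often link to the same listing several times (thumbnail, title, price).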
Extract data
- Identify the HTML elements through Inspect
In Chrome's Developer Tools, look at the structure of the page, then extract the content we need from it.
- Repeat the previous process to extract other field data
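Once Inspect has shown which element holds a field, pulling its text out can be sketched with the standard library alone. The class names (`price`, `sold-date`) and the sample HTML are hypothetical, and `ClassTextExtractor` is an illustrative helper. A parsing library such as BeautifulSoup would serve the same purpose.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested element inside a match
        elif self.target_class in classes:
            self.depth = 1               # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_text(html, target_class):
    """Return the stripped text of each element with class `target_class`."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Made-up detail page fragment for illustration.
sample_page = """
<div class="summary">
  <span class="price">$1,250,000</span>
  <span class="sold-date">Sold on 05/01/21</span>
</div>
"""
print(extract_text(sample_page, "price"))
print(extract_text(sample_page, "sold-date"))
```

Repeating the process for another field then only means calling `extract_text` with the class name found for that field.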
Cost
- Use AWS EC2 t3.small (2GB memory, 2 vCPUs, $0.02 per hour)
- 2GB is necessary because the browser needs a lot of memory; CPU and bandwidth are usually not an issue
- Can use spot instance to reduce the price
- The cost to crawl 1M houses is $16.6
- The speed is about 3s per page.
- 8.3 hours if using 100 instances.
- Extra costs include storage and restarting instances when an IP is banned.
Cost is low
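The figures above follow from simple arithmetic, sketched below. It assumes each instance loads one page at a time and ignores the extra costs just mentioned; `crawl_cost` is an illustrative helper name.

```python
def crawl_cost(num_pages, seconds_per_page=3.0, price_per_hour=0.02,
               num_instances=100):
    """Return (instance-hours, cost in USD, wall-clock hours)."""
    instance_hours = num_pages * seconds_per_page / 3600
    cost = instance_hours * price_per_hour
    wall_clock_hours = instance_hours / num_instances
    return instance_hours, cost, wall_clock_hours


hours, cost, wall = crawl_cost(1_000_000)
print(f"{hours:.1f} instance-hours, ${cost:.2f}, {wall:.1f} hours on 100 instances")
```

This prints `833.3 instance-hours, $16.67, 8.3 hours on 100 instances`, which matches the roughly $16.6 and 8.3 hours quoted above.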
Crawl Images
- Get all image URLs
- A house listing has ~20 images
- The crawling cost is still reasonable: ~$300
- Storing these images is expensive: ~$300 per month
- You can reduce the image resolution, or send the data back
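Collecting the image URLs from a listing page can be sketched as follows, assuming the images appear as ordinary `<img src="...">` tags. The base URL, the sample paths, and the `collect_image_urls` helper are hypothetical. Pages that load images through JavaScript would need the rendered HTML from a headless browser first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLCollector(HTMLParser):
    """Collect absolute, de-duplicated image URLs from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            url = urljoin(self.base_url, src)  # resolve relative paths
            if url not in self.urls:
                self.urls.append(url)


def collect_image_urls(html, base_url):
    """Return every distinct image URL found in `html`, in page order."""
    collector = ImageURLCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls


# Made-up listing fragment for illustration.
sample_listing = """
<img src="/photos/a.jpg">
<img src="https://cdn.example.com/b.jpg">
<img src="/photos/a.jpg">
"""
image_urls = collect_image_urls(sample_listing, "https://example.com/homedetails/1/")
print(image_urls)
```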
Legal Considerations
- Web scraping isn’t illegal by itself
- But you should
- NOT scrape data that contains sensitive information (e.g. private data involving usernames/passwords, personal health/medical information)
- NOT scrape copyrighted data (e.g. YouTube videos, Flickr photos)
- Follow Terms of Service that explicitly prohibit web scraping
- Consult a lawyer if you are doing it for profit
References
Stanford Practical Machine Learning: Web Scraping
https://alexanderliu-creator.github.io/2023/08/23/stanford-pratical-machine-learning-wang-ye-shu-ju-zhua-qu/