Stanford Practical Machine Learning - Web Scraping

This chapter mainly introduces web scraping.

Web Scraping

  • The goal is to extract data from websites
    • Noisy, weak labels, can be spammy
    • Available at scale
    • E.g. price comparison/tracking website
  • Many ML datasets are obtained by web scraping
    • E.g. ImageNet, Kinetics
  • Web crawling vs. scraping
    • Crawling: indexing whole pages on the Internet
    • Scraping: extracting particular data from the web pages of a website

Tools

  • “curl” often doesn’t work
    • Website owners use various ways to stop bots
  • Use headless browser: a web browser without a GUI
  • You need a lot of new IPs, which are easy to get through public clouds
    • In all IPv4 IPs, AWS owns 1.75%, Azure 0.55%, GCP 0.25%

You can use Selenium for data scraping:

from selenium import webdriver

# Start Chrome without a GUI (headless mode)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome = webdriver.Chrome(options=chrome_options)

# Load the target page; `url` holds the address to scrape
chrome.get(url)
page = chrome.page_source
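The rendered HTML can also be written to disk, so the parsing in the case study below can run offline against saved pages; a minimal sketch (the file name is just an example):

# Save the rendered page for later offline parsing
with open('index_page.html', 'w') as f:
    f.write(page)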

Case Study

Crawl individual pages

  • Get the house IDs from the index pages
from bs4 import BeautifulSoup

# Parse a saved index page and collect the links to individual listings
page = BeautifulSoup(open(html_path, 'r'), 'html.parser')
links = [a['href'] for a in page.find_all('a', 'list-card-link')]
ids = [l.split('/')[-2].split('_')[0] for l in links]

By concatenating each ID back into a URL, you can recover the link to every listing page and fetch its details, as sketched below.
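A minimal sketch of that concatenation; the URL template here is an assumption for illustration, not the site's actual scheme:

# Hypothetical detail-page URL template; find the real pattern on the site
base = 'https://www.zillow.com/homedetails/'
detail_urls = [f'{base}{i}_zpid/' for i in ids]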

Extract data

  • Identify the HTML elements through Inspect

In Chrome's developer tools (Inspect), look at the page structure, then pull out the content we need from the page.

# `result` collects the extracted fields for one listing
result = {}

# The 'Sold' info lives in spans inside the details chip
sold_items = [a.text for a in page.find(
    'div', 'ds-home-details-chip').find('p').find_all('span')]

for item in sold_items:
    if 'Sold:' in item:
        result['Sold Price'] = item.split(' ')[1]
    if 'Sold on' in item:
        result['Sold On'] = item.split(' ')[-1]
  • Repeat the previous process to extract the other fields (a sketch follows)
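For instance, another field could be pulled the same way; the class name below is an assumption, to be replaced with whatever Inspect shows:

# Hypothetical class name; find the real one via Inspect as above
facts = page.find('div', 'ds-bed-bath-living-area')
if facts is not None:
    result['Facts'] = facts.get_text(' ', strip=True)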

Cost

  • Use AWS EC2 t3.small (2GB memory, 2 vCPUs, $0.02 per hour)
    • 2GB is necessary as the browser needs a lot of memory; CPU and bandwidth are usually not an issue
    • Can use spot instance to reduce the price
  • The cost to crawl 1M houses is $16.6
    • The speed is about 3s per page.
    • 8.3 hours if using 100 instances.
    • Extra costs include storage and restarting instances when an IP gets banned (a quick cost check follows)
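The arithmetic behind these numbers, as a quick check:

pages = 1_000_000        # houses to crawl
secs_per_page = 3        # ~3s per page
usd_per_hour = 0.02      # one t3.small instance

machine_hours = pages * secs_per_page / 3600   # ~833 instance-hours
cost = machine_hours * usd_per_hour            # ~$16.7
wall_clock = machine_hours / 100               # ~8.3 hours on 100 instances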

The cost is low.

Crawl Images

  • Get all image URLs
import re

# Match photo URLs in the raw HTML, then keep just the photo IDs
p = r'https://photos\.zillowstatic\.com/fp/([\d\w\-_]+)\.jpg'
ids = [s.split('-')[0] for s in re.findall(p, html)]
urls = [f'https://photos.zillowstatic.com/fp/{i}-uncropped_scaled_within_1536_1152.jpg'
        for i in ids]
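With the URLs in hand, a minimal download loop might look like the sketch below (using `requests`; retries and IP-ban handling are omitted):

import requests

for u in urls:
    r = requests.get(u, timeout=10)
    if r.ok:
        # Name each file after the last URL segment
        with open(u.split('/')[-1], 'wb') as f:
            f.write(r.content)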
  • A house listing has ~20 images
    • The crawling cost is still reasonable: ~$300
    • Storing these images is expensive: ~$300 per month
      • You can reduce the image resolution, or send the data back to local storage instead of keeping it in the cloud (a rough check follows)
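Sanity-checking the storage figure; the S3 price here is an assumed ballpark, not a number from the slides:

houses = 1_000_000
imgs_per_house = 20
usd_per_gb_month = 0.023                   # assumed S3-standard ballpark

total_gb = 300 / usd_per_gb_month          # ~13,000 GB for $300/month
mb_per_img = total_gb * 1024 / (houses * imgs_per_house)   # ~0.7 MB per image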

Legal Considerations

  • Web scraping isn’t illegal by itself
  • But you should
    • NOT scrape data that contains sensitive information (e.g. private data involving usernames/passwords, personal health/medical information)
    • NOT scrape copyrighted data (e.g. YouTube videos, Flickr photos)
    • Respect Terms of Service that explicitly prohibit web scraping
  • Consult a lawyer if you are doing it for profit

