10 Examples of Web Scraping in Use
We’re studying Ruby in the Flatiron School curriculum, so I researched real-life examples of web scraping, a common Ruby task. I understand web scraping to be the process of finding and pulling information from public websites, but who does it, and why? Here are five common uses with examples, in capsule form (not all of them were scraped using Ruby).
Walmart vs. Amazon
Companies with an e-commerce component regularly use software to send waves of web crawlers across a multitude of sites at a time. Crawlers and bots search for information such as product reviews, contact information to use for marketing, or prices for comparison websites, and save copies in a spreadsheet or database.
The searches retrieve and save enormous amounts of information. As Columbia University puts it, “Google scraped (emphasis mine) the web to catalogue all of the information on the internet and make it accessible.” A Google search, in essence, is a scrape.
Walmart was checking prices on Amazon.com several million times a day in 2016 before Amazon blocked its bots, according to Reuters; the story was a tech news sensation. In e-commerce, a customer toggling between two competitors’ sites can choose the product that’s 50 cents cheaper with a click, so the stakes are high.
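The price-checking described above can be sketched in a few lines of Ruby. Everything here is invented for illustration: the HTML snippets stand in for product pages a crawler would actually fetch, and the `price` class, store names, and dollar figures are hypothetical.

```ruby
# Minimal sketch of a price-comparison scrape. In a real crawler the HTML
# would be fetched, e.g. html = Net::HTTP.get(URI("https://example.com/item")),
# and parsed with a library like Nokogiri; a regex suffices for this sketch.

def extract_price(html)
  # Find a <span class="price">$NN.NN</span> element and return the number.
  match = html.match(/class="price"[^>]*>\$?([\d.]+)</)
  match && match[1].to_f
end

# Invented snippets standing in for two retailers' product pages.
store_pages = {
  "Store A" => '<span class="price">$19.99</span>',
  "Store B" => '<span class="price">$19.49</span>',
}

prices   = store_pages.transform_values { |html| extract_price(html) }
cheapest = prices.min_by { |_, price| price }
puts "Cheapest: #{cheapest[0]} at $#{cheapest[1]}"  # → Cheapest: Store B at $19.49
```

A production scraper would add polite crawling (rate limits, robots.txt) and persist each result to a spreadsheet or database, as described above.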
Proven
The skincare startup Proven scrapes millions of customer reviews, information on thousands of beauty products, and scientific articles to offer personalized health and beauty products. The brand both consults with each customer and creates a product specifically for that customer, according to Forbes.
Co-founder Amy Yuan searched thousands of ingredients, products, consumer reviews, and scientific journal articles to find a product that worked for her own skin, according to Forbes. She and co-founder Ming Zhao extended that project with machine learning and artificial intelligence algorithms to understand the correlations between individuals’ skin and the ingredients that work for them.
Naked Apartments
Type in what you’re looking for in a New York apartment, and Naked Apartments retrieves the real estate options, scraped from the web, that fit the bill. It also allows brokers, with permission, to contact an apartment seeker, and it ranks the best sellers (a distinction they can’t buy). Zillow bought the Webby Award-nominated company in 2016.
Naked Apartments launched its in-demand listings service in 2010 and turned a profit in 2011. Travel site Trivago.com scrapes to offer comparison prices, and Indeed.com offers job listings and other directories gathered from the web. Businesses of almost any size also scrape for competitor prices, leads, and search engine results to track SEO and marketing.
Community and social justice organizations
MCSafetyFeed.org
MCSafetyFeed scrapes public information in cooperation with Monroe County, New York, to maximize open data in government. The site provides a history of all the 911 calls that come in to the Monroe County dispatch center, plus additional data about each call that isn’t immediately available on the dispatch website or Monroe’s RSS feed. Read more about the tech aspects here.
Community projects may provide scraped digital data, or information from reports requested from governing bodies, or a combination. The goal is usually just to get the information out there for civic transparency and accountability.
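Reading a public feed like the one MCSafetyFeed draws on takes only Ruby’s standard library. This is a hedged sketch: the XML below is invented to resemble a dispatch RSS feed, and the real feed’s fields will differ.

```ruby
# Sketch: pull call entries out of an RSS-style feed using only Ruby's
# standard library (REXML). The feed XML is invented sample data; a real
# scraper would fetch the live feed first, e.g. with Net::HTTP.get.
require "rexml/document"

feed = <<~XML
  <rss><channel>
    <item><title>Medical - Main St</title><pubDate>2016-05-01 14:02</pubDate></item>
    <item><title>Fire Alarm - Elm Ave</title><pubDate>2016-05-01 14:10</pubDate></item>
  </channel></rss>
XML

doc = REXML::Document.new(feed)

# Collect each <item> into a plain Ruby hash for storage or display.
calls = doc.get_elements("//item").map do |item|
  { title: item.elements["title"].text, time: item.elements["pubDate"].text }
end

calls.each { |call| puts "#{call[:time]}  #{call[:title]}" }
```

Archiving these hashes over time is what turns a transient feed into the kind of searchable history a civic-transparency project publishes.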
Pfizer’s disclosures of payments to doctors
Journalism is obviously another powerful way to get information to the people. ProPublica offers a guide for journalists who want to scrape public records themselves for reporting. The guide leads aspiring data journalists through the process of scraping pharmaceutical giant Pfizer’s record of payments made to doctors.
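The core extraction in that kind of project, turning rows of an HTML disclosure table into data you can total, looks roughly like this in Ruby. The table fragment, doctor names, and dollar amounts are all invented for illustration, not taken from Pfizer’s actual disclosures.

```ruby
# Sketch: convert rows of an HTML disclosure table into Ruby data and sum them.
# The markup and figures below are invented sample data.
rows = <<~HTML
  <tr><td>Dr. Smith</td><td>$1,500</td></tr>
  <tr><td>Dr. Jones</td><td>$250</td></tr>
HTML

# Each scan match yields [name, amount-with-commas]; strip commas and cast.
payments = rows.scan(%r{<td>([^<]+)</td><td>\$([\d,]+)</td>}).map do |name, amount|
  [name, amount.delete(",").to_i]
end

total = payments.sum { |_, amount| amount }
puts "Total disclosed: $#{total}"  # → Total disclosed: $1750
```

From here a journalist would typically write the pairs out to CSV for sorting and cross-referencing against other records.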
Another example of journalistic scraping is James Tozer’s Economist article on the countries likely to have liberal abortion laws. Read more about his entrance into data journalism here.
Biostatistics
“What if you had an idea for an ecological study, but the data you needed wasn’t available to you?” — a pitch for Columbia University’s Mailman School of Public Health courses on web scraping and population health methods. The answer: you can compile the data yourself.
In public health, big data helps chart disease occurrence based on time, location and demographics, according to the SUNY Downstate Medical Center Department of Epidemiology and Biostatistics. Students use data to identify the relative contributions of biological, behavioral, socio-economic, and environmental risk factors to disease incidence.
Google Flu Trends
It feels more accurate to describe the complex Google Flu Trends as “a big data tool for epidemiologists” than as an instance of “scraping.” (Browse the Google project’s summary on Wikipedia.)
The idea was that when people have the flu, they search for flu-related information on Google, indicating the time and location of a potential outbreak. The goal was to predict outbreaks weeks earlier than the Centers for Disease Control and Prevention (CDC). Read more about the outcome, and how the process might be refined in the future, on Wired.
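The underlying idea, counting flu-related searches by place and time so a spike stands out, can be sketched in a few lines of Ruby. The query log below is invented sample data, not Google’s, and the region/week fields are hypothetical.

```ruby
# Sketch of the Flu Trends idea: tally searches by (region, week) so that a
# week-over-week jump in one region hints at an outbreak. Data is invented.
queries = [
  { term: "flu symptoms", region: "NY", week: 40 },
  { term: "fever remedy", region: "NY", week: 40 },
  { term: "flu shot",     region: "CA", week: 40 },
  { term: "flu symptoms", region: "NY", week: 41 },
  { term: "chills",       region: "NY", week: 41 },
  { term: "flu symptoms", region: "NY", week: 41 },
]

# Group the log by (region, week) and reduce each group to a count.
counts = queries.group_by { |q| [q[:region], q[:week]] }.transform_values(&:size)

region, week = counts.keys.max_by { |key| counts[key] }
puts "Most searches: #{region}, week #{week} (#{counts[[region, week]]})"
# → Most searches: NY, week 41 (3)
```

The real system layered statistical modeling on top of counts like these; the hard part, as the Wired piece discusses, was keeping the model calibrated against CDC ground truth.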
For more on the topic of tech and public health, epidemiology in particular, check out “Epidemiology in the Era of Big Data,” from the National Center for Biotechnology Information, a branch of the National Institutes of Health.
Academics and student work
Perception of French hip hop
Developer Alexandre Robin helped a friend write a student paper on the perception of French hip hop through the decades by scraping 7,000 newspaper articles using Node. Read more about his project here.
Maestro
Flatiron alumni Clinton Nguyen and Jason Decker used web scraping to refine browser searches with Maestro: “Sign up to post your own educational trails and follow them, or simply search through on the main page for user-submitted trails that are already out there.”
Originally included in the Flatiron School newsletter.
Looking forward to your thoughts about web scraping below!