Brief History of Web Scraping
May 14, 2021
Data, web scraping
Web scraping is becoming a widely known term. Most people associate it with web data extraction, one of the simplest and most efficient ways of copying large amounts of information online. But did you know that the technology behind it was born for a completely different purpose, and that it took almost two decades to turn into the web scraping we are familiar with now?
Here is the timeline:
The birth of the World Wide Web
The origins of very basic web scraping date back to 1989, when British scientist Tim Berners-Lee created the World Wide Web. The original idea was a platform for automatically sharing information between scientists at universities and institutes around the world. However, the World Wide Web also introduced three features that remain key elements of every web scraping tool today:
- URLs, which we now use to point a scraper at a specific website,
- embedded hyperlinks, which let us navigate through that website,
- and web pages, which contain various types of data: text, images, audio, video, etc.
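To make these three elements concrete, here is a minimal sketch of how a scraper might combine them: it starts from a URL, reads a web page, and collects the hyperlinks it could follow next. The page content, the `example.com` address, and the `LinkCollector` name are all made up for illustration, and the sketch uses only Python's standard library instead of fetching anything over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links become absolute URLs a scraper could visit next.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical URL and page content, used instead of a real network request.
PAGE_URL = "https://example.com/articles/index.html"
PAGE_HTML = """
<html><body>
  <h1>Articles</h1>
  <a href="history.html">History</a>
  <a href="/about">About</a>
  <a href="https://example.org/data">External data</a>
</body></html>
"""

collector = LinkCollector(PAGE_URL)
collector.feed(PAGE_HTML)
for link in collector.links:
    print(link)
```

A real scraper would download each page before parsing it and would keep track of the links it has already visited, but the basic loop of URL, page, and hyperlinks stays the same.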
First web browser
Continuing his work, within two years Tim Berners-Lee had created the very first web browser and the first web page served over HTTP, both run from his NeXT computer, which doubled as the web server. This gave people a way to access and interact with the World Wide Web.
The Wanderer
Not long after, in 1993, the first concept of crawling was born. The Wanderer, or more precisely the World Wide Web Wanderer, was developed by Matthew Gray at the Massachusetts Institute of Technology. It was the first of its kind: a Perl-based web crawler whose sole purpose was to measure the size of the web. The same year, the Wanderer was used to generate an index called the Wandex. Even though its author does not claim it, the Wanderer together with the Wandex had the potential to become the first general-purpose World Wide Web search engine.
JumpStation
However, that same year, 1993, JumpStation was born and became the first actual crawler-based web search engine. It laid the groundwork for big names such as Google, Bing, Yahoo, and the other web search tools we use today. As crawlers like it went on to index millions of web pages, the internet turned into an openly accessible source of data in various forms.
BeautifulSoup
A bit more than a decade later, in 2004, came BeautifulSoup, an HTML parsing library written in the Python programming language. BeautifulSoup made it easier to make sense of a site's structure and parse the content inside its HTML containers, saving programmers hours of work. By then the internet had become an immense, easily searchable source of information that anyone with a computer and a connection could access, and people had started to take advantage of this by extracting the information available to them. For some time websites did not restrict downloading their content; slowly that changed. Meanwhile, the volume of data being collected made manual copy-pasting impractical, so other ways of obtaining the information were bound to be developed.
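As a rough illustration of what "parsing the contents within HTML containers" means, the sketch below pulls the text out of every element that carries a given class. To stay dependency-free it uses Python's built-in `html.parser` rather than BeautifulSoup itself, so it shows the idea, not BeautifulSoup's API. The sample markup, the class names, and the `ClassTextExtractor` and `extract` helpers are hypothetical, and the sketch assumes well-formed HTML in which every opening tag is closed.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Nested tag inside a matching element: track it so we know
            # when the outer element really ends.
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, css_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return [text.strip() for text in parser.results]


# Hypothetical product listing used as sample input.
SAMPLE_HTML = """
<ul>
  <li><span class="title">Blue mug</span> <span class="price">4.50</span></li>
  <li><span class="title">Red <b>large</b> mug</span> <span class="price">6.00</span></li>
</ul>
"""

titles = extract(SAMPLE_HTML, "title")
prices = extract(SAMPLE_HTML, "price")
print(list(zip(titles, prices)))
```

With BeautifulSoup installed, the same lookup shrinks to roughly `soup.find_all("span", class_="title")`, which hints at the hours of work the library saved.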
Rise of visual web scrapers
Soon after, web scraping as we know it was born. The visual web scraping software Web Integration Platform version 6.0, launched by Stefan Andresen, allowed users to highlight the information they needed on a web page and structure that data into a usable Excel file or database. This gave non-programmers an opportunity to easily extract data from the web.
Nowadays, as technologies and industries progress, companies are looking to gain an advantage over their competition. And because the amount of information available on the internet keeps growing rapidly, web scraping is becoming one of the most prominent and widely used methods of acquiring data at scale across various industries and business spheres.
Future of web scraping
Web scraping has grown immensely in recent years and is almost guaranteed to keep growing. Currently, commercial web scraping is mostly used to gain a competitive advantage: collecting leads, tracking competitors, monitoring prices, and so on. However, as technologies such as artificial intelligence develop, and as data becomes even more accessible and crucial to different aspects of life, web scraping will advance with them and produce new and remarkable applications that we look forward to experimenting with.