I have participated in numerous debates regarding the appropriate terminology for a certain process, thing, or event, and often found myself losing friends or making enemies by the end of such discussions.
I am taking another crack at it here. In this first post on my technical blog, I will lay out the differences between web scraping and web crawling. I will also cover some basic terminology used in the scraping discipline and explain why it is used that way.
To begin with, the correct term is web scraping, not scrapping. Scrapping usually refers to disposing of unwanted items, as per the thesaurus. Whenever I come across companies using "scrapping" in their job descriptions, I do not apply. In fact, I have rejected, and will continue to reject, applicants who say they did "web scrapping" in their past jobs.
Another commonly held misconception in the community is “Web Scraping with BeautifulSoup”. While BeautifulSoup is a versatile Python library, it does not actually scrape web pages. Rather, it parses the HTML or XML of a response into a tree you can search, and that response has to be obtained by another library capable of connecting to the internet, such as requests, httpx, or aiohttp. At one point, I had to check the BeautifulSoup documentation to make sure I had not overlooked a crucial feature of the library.
Even my favorite educational website for Python programming, realpython.com, made the same mistake in a tutorial that talks about “web scraping with BeautifulSoup”. It is a great article though.
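To make the fetch-versus-parse split concrete, here is a minimal sketch that uses only the Python standard library, so it runs without installing anything. In it, `fetch` plays the role that requests, httpx, or aiohttp would play, and `parse_title` plays the role that BeautifulSoup would play. The function names, the `TitleParser` class, and the sample HTML are my own assumptions for illustration, not anything from the libraries mentioned above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Network step: the job an HTTP client such as requests would do."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Parsing step: the job BeautifulSoup would do. It never opens a connection."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside <title>...</title>.
        if self._in_title:
            self.title += data


def parse_title(html: str) -> str:
    """Extract the page title from an HTML string, wherever it came from."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# The parser is happy with a plain string; no request is made here.
sample_html = "<html><head><title> Sneakers </title></head><body></body></html>"
print(parse_title(sample_html))  # prints "Sneakers"
```

Notice that `parse_title` works on a hard-coded string. A parser only needs markup as input, which is the whole point: the "scraping" of the web happens in `fetch`, not in the parser.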
I can guess what you are thinking: do these things really matter? Yes. I started my career as a Junior Data Engineer and have spent eight years in the industry specializing in web data extraction with Python, so these things do matter to me. And I would like to share my insights with my peers.
Now, let's move on to the main discussion: web scraping and web crawling. Let's consider a use case. Suppose you are given a project to extract data from the Amazon website, where the client needs the data in two different formats. First, the client will provide you with a list of products and ask you to deliver the price data on a daily basis. Second, the client wants to monitor the price fluctuations of all the products in a sub-category, such as sneakers. 👟
How would you set up your scraper to handle both use cases? In the latter case, you have no idea how many sneakers will be listed, and you need to gather the data for the entire sub-category every day. In this scenario, incomplete coverage costs you the integrity and accuracy of what your business delivers. Therefore, the design of your scraper will differ greatly from the first use case, where you only had to work out how many product URLs there are and what resources you need to complete the task on time.
The design of the bot is the most significant difference between the two processes. To scrape a list of provided URLs, you can design a system that requests the URLs at a throttled rate and stores the page responses in a queue. These responses can then be consumed by a parser to extract the desired data. On the other hand, when the crawler enters a node, such as page 1 of the listing, it must identify only the hyperlinks in the response that lead to another product and send them to the next iteration. This adaptability must be taken into consideration when building the bot.
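The two designs can be sketched side by side. This is a toy under loud assumptions: `FAKE_SITE` is a hypothetical in-memory stand-in for a shop so the code runs offline, `fetch` just looks pages up in it, and the URL patterns (`/p/` for products, `/sneakers` for listing pages) are invented for illustration. I also let the crawler follow pagination links, which is my own assumption about how it would reach page 2 of a listing.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory site so the sketch runs without network access.
FAKE_SITE = {
    "https://shop.example/sneakers?page=1":
        '<a href="/p/1">Shoe 1</a><a href="/p/2">Shoe 2</a>'
        '<a href="/sneakers?page=2">Next</a><a href="/help">Help</a>',
    "https://shop.example/sneakers?page=2":
        '<a href="/p/3">Shoe 3</a><a href="/p/1">Shoe 1</a>',
    "https://shop.example/p/1": "<h1>Shoe 1</h1>",
    "https://shop.example/p/2": "<h1>Shoe 2</h1>",
    "https://shop.example/p/3": "<h1>Shoe 3</h1>",
}


def fetch(url):
    """Stand-in for a real HTTP request."""
    return FAKE_SITE.get(url, "")


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def scrape(urls, delay_seconds=0.0):
    """Scraping: the URL list is known up front, so the work is bounded."""
    responses = deque()
    for url in urls:
        responses.append((url, fetch(url)))  # queue for a parser to consume
        time.sleep(delay_seconds)  # crude throttle between requests
    return responses


def crawl(start_url):
    """Crawling: discover product URLs by following links from a listing."""
    frontier = deque([start_url])  # listing pages still to visit
    seen = {start_url}
    products = []
    while frontier:
        url = frontier.popleft()
        for link in extract_links(url, fetch(url)):
            if link in seen:
                continue  # skip duplicates such as a product listed twice
            seen.add(link)
            path = urlparse(link).path
            if path.startswith("/p/"):
                products.append(link)  # hand off to the scraping step
            elif path == "/sneakers":
                frontier.append(link)  # another listing page to expand
            # anything else (e.g. /help) is ignored
    return products


queued = scrape(["https://shop.example/p/1", "https://shop.example/p/2"])
print(len(queued))  # prints 2
print(crawl("https://shop.example/sneakers?page=1"))
```

`scrape` does exactly as much work as the list it is given, while `crawl` does not know its workload until the frontier runs dry. On a real site you would also want a page limit, politeness delays, and error handling, none of which are shown here.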
One must be straightforward, while the other must adapt to the node it is crawling into, which can potentially expand into a humongous network 🕸️. Consider how many pages Amazon has for sneakers.
This is why Scrapinghub (now Zyte) named their crawler classes 'spiders', which traverse a network with the support of a 'Twisted' backbone.
In conclusion, while scraping is a relatively straightforward process, crawling can be compared to searching for a needle in a haystack. Despite producing similar output, the design and construction of the two processes are vastly different.
I sincerely thank you for the time you have spent reading this article, and I would appreciate any feedback or appreciation sent to my email.
Great article. It always bothered me that people said Web scrapping.
I got tired of saying "No, I don't scrap web, I just scrape required data from the web" 😄.
I like that you also highlighted that BeautifulSoup is not a web scraping library.
Thanks for your article and the clarification.
And also thanks for recommending my newsletter, hope we can collaborate in the future.