The Bot Scraping Problem

Jul 7, 2022

Bot scraping has plagued search engines for more than a decade, and no cure is in sight. The practice lets content thieves sidestep copyright law by copying a website's code, which contains both the information stored on the site and the structure of the site itself. Because most modern websites are built with HTML, CSS, and JavaScript, deploying a bot (or several bots) to scrape that code is easy.
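To illustrate why this is easy, here is a minimal sketch in Python that pulls the readable text out of a page's markup using only the standard library's `html.parser` module. The `TextExtractor` class, the `extract_text` function, and the `SAMPLE_PAGE` string are hypothetical names invented for this illustration; a real scraper would also fetch pages over the network, which is left out here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from a page, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, standing in for a downloaded article.
SAMPLE_PAGE = """
<html><head><title>Example</title><style>p { color: red; }</style></head>
<body><h1>Breaking news</h1><p>The full article text sits in the markup.</p>
<script>console.log("tracking");</script></body></html>
"""

article_text = extract_text(SAMPLE_PAGE)
```

A few dozen lines are enough to recover the title, headline, and body text, which is the point: nothing about ordinary page markup stands in a scraper's way.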

Once stolen, the scraped code can be used in many ways. Large companies sometimes scrape search engines to collect data from search engine results pages (SERP data), which they then use to tune their own search engines and improve the experience for their users. For example, in 2011 Google accused Microsoft of using data drawn from Google's search results to improve its Bing search engine, and Microsoft acknowledged using clickstream data from opted-in users as one of its ranking signals. Even with two corporations of this size at odds, no litigation was filed, which can be attributed either to the difficulty of pursuing civil recourse or to a desire to avoid drawing attention to the matter.

Currently, the biggest problem bot scrapers pose to the internet is content theft. Web scrapers can steal part or all of an article seconds after it is posted, run it through a thesaurus bot to change the wording slightly, and repost it immediately. The result is a chain of lesser-known media outlets posting essentially the same article on the latest piece of news in order to divert valuable, monetizable traffic to their own websites.

Content theft is hard to address and almost impossible for smaller publishers to overcome. Bots also do not always deliver accurate results, which can produce accidental misinformation campaigns as multiple bots scrape inaccurate information and immediately repost it. The result is the legitimization of inaccurate information, since the sheer volume of articles carrying nearly identical messaging lends credence to the content even when it is incorrect.

Despite these concerns, Google appears unbothered by the issue, as indicated by its silence on the topic. Google does offer a service that lets users file takedown requests for stolen media under the Digital Millennium Copyright Act (DMCA). Some companies use this tool liberally; Remove Your Media, for example, has issued over five hundred million DMCA claims. However, not every publisher can afford the services of a company like Remove Your Media, and the individuals who publish on their own sites do not have the time to check every article and file a request for every piece of stolen content.

Although Google quietly advises against scraping in its Google Search Central documentation, the practice and the problems it causes remain pervasive. Some argue that the burden of ending scraping falls on Google, as the only entity with enough reach in search to impede the practice. However, even if Google or legal regulators imposed harsh penalties for scraping and content theft, the anonymity of the bots would make enforcement nearly impossible.

Since 2007, the year after Google acquired YouTube, Google has run a system on YouTube that automatically scans videos for copyright abuse and deprioritizes in search those found to be in violation. Known as Content ID, this system could be adapted to scan articles for duplicate content in much the same way it scans videos. While perhaps not a permanent solution, such a roadblock might at least temporarily curtail the problem, whether it takes the form of copyright infringement or misinformation campaigns, especially with national elections on the horizon.
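As a rough illustration of what scanning articles for duplicate content could look like, the sketch below compares two texts by their overlapping three-word sequences ("shingles") and scores them with Jaccard similarity. This is a generic near-duplicate technique chosen for illustration, not a description of how Content ID works; the function names, the sample sentences, and the 0.5 threshold are all assumptions.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences in the text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all distinct shingles; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_copied(original: str, candidate: str,
                 threshold: float = 0.5, k: int = 3) -> bool:
    """Flag the candidate when its similarity meets the assumed threshold."""
    return jaccard(shingles(original, k), shingles(candidate, k)) >= threshold


# Hypothetical example texts.
original = ("the city council voted on tuesday to approve "
            "the new budget for road repairs")
spun = ("the city council voted on tuesday to approve "
        "the new budget for street repairs")
unrelated = "a local bakery won an award for its sourdough bread"

spun_score = jaccard(shingles(original), shingles(spun))
unrelated_score = jaccard(shingles(original), shingles(unrelated))
```

Swapping a single word leaves most shingles intact, so a lightly reworded copy still scores high while an unrelated article scores zero. A thesaurus bot that replaces many words would lower the score considerably, so a production system would likely need something more robust than this comparison.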
