Google helps users verify web crawlers, Adds ‘GoogleOther’

Google released additional details this week on how to verify whether a web crawler accessing a server really is one of Google’s own. This is key for website owners who need to know that their content is being indexed accurately by Google Search. Meanwhile, Google’s internal teams have gained a dedicated in-house crawler, dubbed GoogleOther, to help out with R&D.

Web crawlers, or “bots”, are the automated programs used by the likes of Google and Bing to browse the web and collect information about web pages. A bot starts out by visiting a seed URL, then follows the hyperlinks on that page to discover and index new pages. The collected data is used to build a searchable index of web pages that powers search results for users.
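
That discovery loop is simple enough to sketch. The following is a minimal, illustrative crawler in Python, not a description of how Googlebot itself works: it starts from a seed URL, follows hyperlinks breadth-first, and records each fetched page in a small in-memory index.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: start at a seed URL, follow links, record pages."""
    seen = {seed_url}
    queue = deque([seed_url])
    index = {}  # URL -> raw HTML, a stand-in for a real search index

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or broken pages
        index[url] = html
        # Naive link extraction; real crawlers use a proper HTML parser
        for href in re.findall(r'href="(https?://[^"]+)"', html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```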

Google uses different types of web crawlers to index web pages and deliver search results. Google’s common crawlers are used to build the company’s search indices, to perform other product-specific crawls, and for analysis. They always obey robots.txt rules and generally crawl from the IP ranges published in the googlebot.json object.
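
Honouring robots.txt is easy to reproduce with standard tooling. As a minimal sketch, assuming a hypothetical site at example.com, Python’s built-in urllib.robotparser reads the file and answers whether a given user agent is allowed to fetch a URL:

```python
from urllib.robotparser import RobotFileParser

# Read a site's robots.txt and check whether Googlebot may crawl a page.
# example.com and the path below are placeholders, not real rules.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))
```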

Updated documentation on Google Search Central sets out how to verify the company’s own web crawlers.

“You can verify if a web crawler accessing your server really is a Google crawler, such as Googlebot. This is useful if you’re concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.”

Special-case crawlers are used for specific types of content, such as images or news articles. These crawlers are designed to follow specific rules and protocols to index their respective content accurately. User-triggered fetchers are used when a user actively submits a request for a web page to be crawled and indexed – usually via the Google Search Console.

There are two methods for verifying Google’s crawlers:

Manually: For one-off lookups, use command line tools such as a reverse DNS lookup on the accessing IP address. This method is sufficient for most use cases (a sketch of the same check follows this list).

Automatically: For large-scale lookups, use an automated solution to match a crawler’s IP address against the list of published Googlebot IP addresses (see the second sketch below).
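
The manual route boils down to two DNS steps: a reverse lookup on the logged IP address, then a forward lookup to confirm the hostname resolves back to the same address. Google’s documentation performs these steps with the host command; the sketch below reproduces them in Python with the standard socket module. The sample address is one used in Google’s own documentation.

```python
import socket

def is_google_crawler(ip):
    """Verify an IP by reverse DNS, then confirm with a forward lookup."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)   # reverse lookup
    except OSError:
        return False
    # Genuine Google crawlers resolve to googlebot.com or google.com
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP
    try:
        return socket.gethostbyname(hostname) == ip
    except OSError:
        return False

print(is_google_crawler("66.249.66.1"))  # a documented Googlebot address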
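```

For the automated route, a minimal sketch: fetch the published googlebot.json file and test an address against each range with the standard ipaddress module. The JSON layout assumed here (a "prefixes" array with "ipv4Prefix"/"ipv6Prefix" entries) matches Google’s published file at the time of writing, but is worth re-checking.

```python
import ipaddress
import json
from urllib.request import urlopen

GOOGLEBOT_RANGES = "https://developers.google.com/search/apis/ipranges/googlebot.json"

def load_googlebot_networks():
    """Fetch Google's published Googlebot IP ranges."""
    data = json.load(urlopen(GOOGLEBOT_RANGES))
    return [
        ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
        for p in data["prefixes"]
    ]

def is_googlebot_ip(ip, networks):
    """True if the address falls inside any published Googlebot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

networks = load_googlebot_networks()
print(is_googlebot_ip("66.249.66.1", networks))  # expected: True
```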

Website owners can use this information to identify and block malicious bots or non-Google crawlers that may be scraping a site’s content or causing excessive server load. By verifying that a visitor claiming to be Googlebot is genuine, owners can ensure that their content is being indexed accurately and that their site is not being sidelined due to duplicate content or other black-hat practices.

In further news, Google introduced a fresh in-house web crawler this week, dubbed GoogleOther. The new bot will not be made available to the public and will instead serve the company’s own teams. We can assume it is based on the same principles and protocols as the standard Googlebot. Google Search Central briefly describes the new bot as a “Generic crawler that may be used by various product teams for fetching publicly accessible content from sites,” for example as part of “one-off crawls for internal research and development.”

GoogleOther also serves to ease some of the resource strain on the company’s main crawler. Googlebot requires significant storage capacity to hold the vast amounts of data it collects during the crawling process. Google has never disclosed the exact number of servers dedicated to running Googlebot, but the company is estimated to operate thousands of servers around the world for its web crawling infrastructure.

Web crawlers are essential in SEO: they are the starting point for how search engines discover and index new pages. Identifying genuine Google crawlers allows website owners to monitor traffic and optimise performance by analysing Googlebot’s behaviour on their site. By understanding how Googlebot crawls, owners can identify issues that may be preventing their site from being crawled efficiently and optimise its structure and content for better search engine rankings.
