Analyzing Googlebot traces for SEO

What is Googlebot?

Googlebot is Google's crawler, i.e. its download robot. A crawler can also be called a bot or a spider. Starting from an entry point to your site (the URL of your home page, for example), this robot is able to fetch all, or at least a large part, of your website's content (HTML pages, images...).

Crawling by Googlebot is a necessary preliminary step before being indexed by the famous Google algorithms.

Simply put, we can think of Googlebot as a web browser, just like Internet Explorer, Firefox or Chrome, that would automatically click on every link it finds on a website and save the content.

For good SEO, it is important to understand the role and behavior of the Google crawler.

View your pages as Googlebot

Yakaferci provides a tool which allows you to see the text content of your pages as Googlebot sees it.


Check how the Google spider sees your pages with our free SEO Page Analyzer:


To start a Yakaferci analysis, you simply enter the link to your page in the area above and click the Analyze button.

Why analyze the traces left by Googlebot?

Almost all websites have installed traffic analysis tools like Google Analytics, Xiti or Omniture to analyze the behavior of their visitors and optimize their website accordingly.

The same goes for analyzing Googlebot's traces. By knowing how often it comes, which pages it visits and which devices it simulates when scanning your site, you will better understand how it works and you will be able to optimize your site for better communication with Googlebot, which helps your SEO.

Better communication with Googlebot makes it possible to optimize your SEO efficiently.

By making your website easier for Googlebot to access and understand, you will improve your website's SEO visibility.

Analysis of the traces left by the Googlebot

Thanks to the techniques described at the end of this article, it is possible to recover the traces left by Googlebot when it crawls your site, which can be very instructive from an SEO point of view. Here are some tips:

Googlebot simulates multiple devices to connect to your site

Note that Googlebot downloads the same URLs 4 times, each time using a different client name (the "User-Agent" HTTP field). Here are the 4 values used by the Google crawler:

  • Mozilla/5.0: this corresponds to a Firefox browser on a desktop or a classic laptop
  • SAMSUNG-SGH-E250: this is an old cell phone from 2006; Google therefore tests an old configuration that is still in use
  • DoCoMo/2.0 N905i(c100;TB;W24H16): this was the equivalent of a Japanese WAP phone. The HTML of some websites is, or was, optimized for these devices
  • Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X): this corresponds to an iPhone

This piece of information is not secret (see the official Google page on this topic) but it is often overlooked.

In this way, Google is able to detect whether a website has been optimized for these particular devices. Do not forget Google's main purpose: presenting users with the most relevant websites for their queries. This relevance also depends on the accessibility and navigability of your site. So when you optimize your website for SEO, also think "navigation" and "consistency".

Note that, at present, Googlebot does not simulate tablet-type devices (iPad...).
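If you already have access to your HTTP logs (see the methods at the end of this article), you can check this behavior on your own site by counting Googlebot's requests per User-Agent. Here is a minimal Python sketch, assuming the common "combined" log format and a log file named www.default-access.log (the same example file name used later in this article):

import re
from collections import Counter

counts = Counter()
with open("www.default-access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        # In the combined log format, the User-Agent is the last quoted field.
        quoted = re.findall(r'"([^"]*)"', line)
        if quoted:
            counts[quoted[-1]] += 1

for user_agent, hits in counts.most_common():
    print(f"{hits:6d}  {user_agent}")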

Googlebot also downloads the CSS

The original mission of a search engine is to extract the text content of the Internet and create an index from this content, in order to make searching easier and present users with the most relevant websites. In this context, Google does not, a priori, need the CSS files, which are the style sheets of your site that control how it is displayed.

And yet, Googlebot downloads the CSS files. Why?

Only Google can precisely answer this question, but we can imagine a few good reasons:

  • CSS files can contain URLs of images that Googlebot wants to download (see the sketch after this list)
  • CSS files contain the directives that manage "Responsive Design", i.e. that visually adapt the same page content to the screen size. We know that Google tends to favor Responsive Design websites for smartphone searches.
  • some bad SEO techniques consist, for example, in writing white text on a white background, which allowed adding content aimed only at search engines. This type of technique is banned by Google, and Google needs the CSS to detect it.
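As an illustration of the first reason, here is a minimal Python sketch that lists the url(...) references found in a style sheet; the file name style.css is just a placeholder:

import re

with open("style.css") as css:
    content = css.read()

# Matches url(...) references, with or without quotes, e.g. url("img/logo.png").
pattern = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
for url in pattern.findall(content):
    print(url)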

The Google crawler downloads the robots.txt file

This is not a surprise, because any good crawler should check the rules of the robots.txt file, which defines which areas it is allowed to crawl.

For more information on this topic, check our article on the robots.txt file
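You can run the same check yourself. Here is a minimal Python sketch using the standard urllib.robotparser module, with example.com used as a placeholder domain:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # downloads and parses the robots.txt file

url = "https://www.example.com/some-page"
if parser.can_fetch("Googlebot", url):
    print("Googlebot may crawl", url)
else:
    print("Googlebot is blocked from", url)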

Googlebot optimizes its downloads

Given the difficult task of downloading all the pages of all the websites in the world, it is only natural that the Google crawler tries to optimize the size and speed of its downloads.

Here are some of the techniques it uses:

  • compression of the HTTP responses when the server of the crawled website supports it. This is requested through the line "Accept-Encoding: gzip,deflate" in the HTTP headers sent by Googlebot
  • retrieval of multiple pages over the same TCP/IP connection when the server of the crawled website supports it. This is done through the line "Connection: Keep-Alive" in the HTTP headers sent by Googlebot
  • use of the "If-Modified-Since" HTTP field to avoid downloading a file that has not changed since the last visit (see the sketch after this list). This field is, however, not sent systematically
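These optimizations are easy to observe by sending a request with the same headers yourself. Here is a minimal Python sketch using the standard http.client module, with a placeholder host and an arbitrary If-Modified-Since date:

import http.client

conn = http.client.HTTPSConnection("www.example.com")
conn.request("GET", "/", headers={
    "Accept-Encoding": "gzip,deflate",
    "Connection": "Keep-Alive",
    "If-Modified-Since": "Wed, 11 Dec 2013 11:15:31 GMT",
})
response = conn.getresponse()

# A 304 status means the page has not changed since the given date,
# so there is nothing to download again.
print(response.status, response.reason)
print("Content-Encoding:", response.getheader("Content-Encoding"))
conn.close()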

For more information on this topic, see our article on the HTTP protocol

Googlebot avoids invading your website

No doubt, given Google's technical resources, Googlebot could crawl an entire website in minutes. However, it is careful to download at a rather slow pace, so as not to overload the servers of the crawled websites and disturb their operation.

Googlebot downloads URLs that come neither from the sitemap nor from the internal links of your website

Sometimes you will be surprised by the URLs that Googlebot requests from your website. These URLs may appear neither in your internal links nor in your sitemap, and still be crawled.

The reasons for this can be varied: for example, someone may have set up an incorrect or outdated link to your website. For more information on this topic, see our articles on netlinking strategy and external links.

Googlebot also downloads images

In this case, it uses the "Googlebot-Image" User-Agent.

These images are what allow Google to build the "Google Images" search area of its search interface.
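To see which images it fetches on your own site, you can filter the log on this User-Agent. A minimal sketch, again assuming the combined log format and the same placeholder log file name:

with open("www.default-access.log") as log:
    for line in log:
        if "Googlebot-Image" in line:
            # The requested path is the second field of the quoted request,
            # e.g. "GET /images/logo.png HTTP/1.1".
            request = line.split('"')[1]
            print(request.split()[1])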

Googlebot rapidly discovers new pages

To keep up with real-time champions such as Twitter, Google has made a lot of effort in recent years to index new pages as soon as possible.

And if we analyze the logs, it is clear that new pages are often downloaded within a few hours of being put online, even if they have not yet been added to a sitemap file. For this to happen, the new pages only need to appear in the internal links of existing pages, which Google will spot when it re-checks them.

However, we should not confuse the date Googlebot visits a new page with the date that page becomes available in the search results. The delay for the second is significantly longer than for the first (except for some websites with lots of news). For example, the page you are reading was visited by Googlebot 4 hours after going online and began to appear in the search results after 24 hours.

Methods to retrieve Googlebot traces

What are the traces left by Googlebot?

What is interesting with the Google crawler is that Google's servers interact directly with the web servers that host your website. This has two consequences:

  • as with any interaction with a browser, your web server keeps a history of the traces left by Googlebot (in its HTTP log files)
  • these traces are clear and indisputable, unlike all the attempts, of varying quality, to analyze Google's indexing algorithms

For these reasons, it is interesting to look closely at these log files from time to time. The results of analyzing Googlebot's visits may help you with your SEO strategy.

Method 1: analyze the HTTP logs of your website's server

Today, website traffic is analyzed with modern, high-quality tools, the most popular of which are Google Analytics, Xiti, etc. These tools rely on JavaScript code that the visited website places in the visitor's browser; this JavaScript code sends all the required information to the analytics service. However, robots/crawlers like Googlebot do not behave like classic browsers and, in particular, do not execute the JavaScript of these tools. That is why Googlebot's visits are invisible in Google Analytics, for example.

By contrast, the HTTP logs created by the web servers hosting the websites record interactions with Googlebot just like those with any other web client.

If you do not know how to retrieve these HTTP log files, you can contact your web host. Be careful though: these files can be quite large for websites with a lot of traffic.

These files simply contain one line for each URL downloaded from your website (which can be an HTML page, a CSS file, a JavaScript file, an image...).

Like any web client interacting with a web server, Googlebot must declare its name in the HTTP field called "User-Agent". Google gives the possible values of its User-Agent here.

We see that the User-Agent used by Googlebot is:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Some tools (notably grep in a Linux environment) can extract all the lines of your log files containing the string "http://www.google.com/bot.html":

# grep "http://www.google.com/bot.html" www.default-access.log
66.249.75.104 - - [11/Dec/2013:11:15:31 +0100] "GET /balises-h1-h2 HTTP/1.1" 200 8848 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
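Once the Googlebot lines have been extracted, a short script already gives a good overview of the crawl: how often Googlebot comes and which pages it visits. Here is a minimal Python sketch, assuming the matching lines were saved to a hypothetical googlebot.log file in the combined log format:

from collections import Counter

per_day = Counter()
per_url = Counter()

with open("googlebot.log") as log:
    for line in log:
        day = line.split("[", 1)[1].split(":", 1)[0]   # e.g. 11/Dec/2013
        url = line.split('"')[1].split()[1]            # e.g. /balises-h1-h2
        per_day[day] += 1
        per_url[url] += 1

print("Googlebot visits per day:")
for day, hits in per_day.items():
    print(f"  {day}: {hits}")

print("Most crawled URLs:")
for url, hits in per_url.most_common(10):
    print(f"  {hits:5d}  {url}")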

We must also verify that the IP address of the web client really belongs to Google. This can be done with the nslookup command, for example:

# nslookup 66.249.75.104
Authoritative answers can be found from:
75.249.66.in-addr.arpa	nameserver = ns1.google.com.

This time, we are sure this is Googlebot!
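The same verification can be automated with a reverse DNS lookup followed by a forward lookup on the returned host name, which is the check Google recommends. Here is a minimal Python sketch using the IP address from the log extract above:

import socket

ip = "66.249.75.104"
hostname = socket.gethostbyaddr(ip)[0]        # e.g. crawl-66-249-75-104.googlebot.com
resolved_ip = socket.gethostbyname(hostname)  # forward lookup on that host name

if hostname.endswith((".googlebot.com", ".google.com")) and resolved_ip == ip:
    print("This is really Googlebot:", hostname)
else:
    print("This is NOT Googlebot")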

Method 2: detect requests from Googlebot by programming

This technique requires a little development: requests from Googlebot can simply be detected programmatically (for example, using the User-Agent and a reverse DNS lookup), saving the most interesting fields of the HTTP request made by Googlebot. This method can provide additional information compared to method 1 described above.
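As an illustration, here is a minimal sketch of such a detection in Python, written as a WSGI middleware; the approach and the log file name googlebot-hits.log are assumptions for the example, not a prescribed implementation, and in production the IP should also be verified as shown in method 1:

from datetime import datetime

def googlebot_logger(app):
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if "Googlebot" in user_agent:
            # Save the most interesting fields of the request: date, IP, path, User-Agent.
            with open("googlebot-hits.log", "a") as log:
                log.write("{}\t{}\t{}\t{}\n".format(
                    datetime.utcnow().isoformat(),
                    environ.get("REMOTE_ADDR", "-"),
                    environ.get("PATH_INFO", "-"),
                    user_agent,
                ))
        return app(environ, start_response)
    return middleware

# Usage: wrap your existing WSGI application, e.g. application = googlebot_logger(application)

Each Googlebot request then adds one line to the file, which you can analyze in the same way as the HTTP logs of method 1.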