Google & JavaScript: Googlebot’s Indexing Can Miss Metadata, Canonicals & Status Codes

At Google I/O yesterday, Google detailed some issues surrounding the use of JavaScript when it comes to both crawling and indexing content. One thing they revealed is that Googlebot processes JavaScript-heavy pages in two phases – an initial crawl of the HTML the server delivers, and then a second pass, potentially days later, in which the JavaScript is fully rendered. But this two-phase approach can result in cases where Google misses crucial data.

With this two-phase indexing, the second wave does not check for things such as canonicals and metadata – if those aren't present in the HTML Googlebot receives on the first pass, they can be missed completely, affecting indexing and ranking.

For example, if your site is a progressive web app you’ve built around the single page app model, then it’s likely all your URLs share some base template of resources which are then filled in with content via AJAX or fetch requests.  And if that’s the case, consider this – did the initially server-side rendered version of the page have the correct canonical URL included in it?  Because if you’re relying on that to be rendered by the client, then we’ll actually completely miss it because that second wave of indexing doesn’t check for the canonical tag at all.
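In practice, the fix is to make sure the canonical tag is part of the HTML the server sends, not something the client injects later. A minimal sketch of that idea follows – `renderHead` is a hypothetical helper, not part of any framework, and the markup shown is illustrative:

```javascript
// Hypothetical server-side helper: build the <head> markup for a route so
// the canonical URL is present in the initial HTML response, rather than
// being injected client-side (where Googlebot's first pass never sees it).
function renderHead({ title, canonicalUrl }) {
  return [
    '<head>',
    `  <title>${title}</title>`,
    `  <link rel="canonical" href="${canonicalUrl}">`,
    '</head>',
  ].join('\n');
}
```

The key point is simply that the `<link rel="canonical">` element exists in the server's response body, so the first crawl wave picks it up without waiting for JavaScript to run.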

Additionally, if the user requests a URL that doesn’t exist and you attempt to send the user a 404 page, then we’re actually going to miss that too.
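The same logic applies to status codes: if the server answers every URL with a 200 and lets the client render a "not found" view, Googlebot's first pass sees a soft 404. A sketch of the server-side decision, using a hypothetical route table (the route names are illustrative), might look like this:

```javascript
// Hypothetical route table for a single-page app; names are illustrative.
const KNOWN_ROUTES = new Set(['/', '/products', '/about']);

// Decide the HTTP status the server should send for a requested path.
// Returning a real 404 here (instead of a 200 plus a client-rendered
// "not found" view) lets crawlers see the error on the first pass.
function statusForPath(path) {
  return KNOWN_ROUTES.has(path) ? 200 : 404;
}
```

However the routing is implemented, the status code needs to be set in the HTTP response itself – a client-side redirect or rendered error page alone is not enough.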

John Mueller confirmed this as well on Twitter.

He stresses that these are not minor issues, but can be major ones.

The important thing to take away right now is that these really aren’t minor issues, these are real issues that could affect your indexability.  Metadata, canonicals, HTTP codes as I mentioned at the beginning of this talk, these are all really key to how search crawlers understand the content on your webpages.

He goes on to mention that the Google I/O website itself had these issues, and Google had to change how the page was rendered so it could be crawled and indexed properly.

John Mueller later suggested that sites can use dynamic rendering, where the site serves a fully rendered version to Googlebot and other crawlers, while serving the regular JavaScript-heavy version to users.

We have another option we’d like to introduce, we call it dynamic rendering.

In a nutshell, dynamic rendering is the principle of sending normal client-side rendered content to users and sending fully server-side rendered content to search engines and to other crawlers that need it.

He goes into much further detail in the video below.
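At its simplest, dynamic rendering hinges on recognizing crawler requests and routing them to pre-rendered HTML. The sketch below assumes a hypothetical substring match on the user-agent header – real deployments typically use a maintained crawler list or reverse-DNS verification rather than this naive check:

```javascript
// Illustrative list of crawler user-agent substrings; production setups
// should use a maintained list or verify crawlers via reverse DNS.
const CRAWLER_PATTERNS = ['Googlebot', 'Bingbot', 'Baiduspider'];

// Decide whether a request should receive the pre-rendered (server-side
// rendered) HTML instead of the normal client-side rendered app shell.
function shouldServePrerendered(userAgent) {
  return CRAWLER_PATTERNS.some((pattern) => userAgent.includes(pattern));
}
```

A request from Googlebot would then be handed the fully rendered page, while a regular browser gets the usual client-side app – the content itself staying the same in both cases.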

For those wondering whether this would be considered cloaking: done properly, it is not, since both users and Googlebot see the same content. It is simply served differently so Googlebot can index it correctly on the first pass.
