Monday, January 4, 2010

Google Now Scanning RSS, Atom Feeds, May Experiment with Real-Time Protocols in Future

According to a post on Google's Webmaster Central blog, Google is now discovering new web pages by automatically scanning RSS and Atom feeds. This process helps Google identify pages more quickly and lets users find new content in search results soon after it goes live. While not exactly "real-time," using feeds to spot updates is arguably faster than the traditional crawling techniques Google has relied on in the past. And Google may get even faster in the near future - the post also notes that the company may soon explore mechanisms like the real-time protocol PubSubHubbub to identify updated items.

The blog post doesn't say whether RSS and Atom discovery displaces traditional web crawling for feed-enabled sites, but it's likely that Google will opt for the faster method wherever one is available. As Vanessa Fox notes on the SearchEngineLand blog, since it's unknown whether Google is using the feeds in place of traditional crawling, it may make sense to publish full feeds rather than partial ones in order to get your content indexed faster by Google's search engine.
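
To make the idea concrete, here's a rough sketch (in Python, using the third-party feedparser library) of how a crawler might pull fresh URLs out of a feed - and why a full feed gives an indexer more to work with than a partial one. The feed URL below is just a placeholder, and this is obviously not Google's actual pipeline, only an illustration.

```python
# A minimal sketch (not Google's pipeline) of feed-based discovery: pull the
# latest entry URLs from an RSS/Atom feed, and note whether the feed carries
# the full article body or only an excerpt. With a full feed, an indexer can
# use the content immediately instead of scheduling a separate page fetch.
# Assumes the third-party `feedparser` library; the feed URL is hypothetical.
import feedparser

FEED_URL = "http://example.com/feed"  # placeholder feed URL

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    url = entry.get("link")
    full_body = entry.get("content")   # full feeds expose the whole post here
    excerpt = entry.get("summary")     # partial feeds offer only a short excerpt
    if full_body:
        print(f"{url}: full content available ({len(full_body[0].value)} chars)")
    else:
        print(f"{url}: partial feed - the page itself would still need a crawl")
```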

Real-Time Web Crawling in the Future?

Although it's only briefly mentioned in the post, Google hinted that it may begin looking into other mechanisms such as PubSubHubbub, an open protocol that provides near-instant notifications when a feed is updated. No further details were provided beyond that one sentence, but the announcement clearly shows that Google has seen the writing on the wall and knows that the real-time web is the future. This is one trend the company isn't planning to ignore.
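
For the curious, here's roughly what the publisher side of PubSubHubbub looks like in practice: instead of waiting to be polled, a publisher pings a hub the moment its feed changes, and the hub pushes the update out to subscribers. The sketch below uses only Python's standard library; the feed URL is a placeholder and the hub shown is the public reference hub of the time - this isn't Google's code, just an illustration of the protocol.

```python
# A rough sketch of a PubSubHubbub publisher ping: when the feed changes, the
# publisher POSTs hub.mode=publish and the feed ("topic") URL to the hub, and
# the hub takes care of notifying subscribers. Standard library only; the feed
# URL is hypothetical.
from urllib.parse import urlencode
from urllib.request import urlopen

HUB_URL = "https://pubsubhubbub.appspot.com/"   # public reference hub
FEED_URL = "http://example.com/feed"            # placeholder feed (the "topic")

def ping_hub(hub_url: str, topic_url: str) -> int:
    """Tell the hub that the topic feed has new content."""
    data = urlencode({"hub.mode": "publish", "hub.url": topic_url}).encode()
    with urlopen(hub_url, data=data) as response:  # data makes this a POST
        return response.status  # the spec expects 204 No Content on success

if __name__ == "__main__":
    print(ping_hub(HUB_URL, FEED_URL))
```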

The real-time web, heavily influenced by the speed of Twitter and other rapid-fire social networking updates, has created a desire among internet users for faster access to information. That desire has, in turn, led to the creation of new real-time protocols such as the above-mentioned PubSubHubbub and its counterpart RSSCloud. If Google began to use these technologies for scanning the web, its search results wouldn't just be updated faster - they would be updated in real time. That means information would become available in search results listings as soon as it was published to the web.

That, of course, would lead to a whole new series of challenges for the search engine - most notably, how to rank real-time results. Given that Google's search algorithm is built on top of PageRank, a way to gauge a website's importance by which other sites link to it, ranking search results so fresh that they have no inbound links yet could prove a difficult feat. However, Google is already doing this to some extent. Over time, its ranking algorithm has evolved and can now, on some occasions, reward sites with fresher, more fitting content, placing them above sites with more links. And if anyone can figure out the proper algorithm for blending real-time content in with static pages and ranking it appropriately, it's got to be Google. In fact, we'll probably soon see exactly how they plan to address this issue when they incorporate Twitter search results into their index, as announced last week.
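
To see why links alone say so little about brand-new content, consider a toy version of the PageRank idea: each page's score flows from the pages that link to it, so a just-published page with no inbound links gets little more than the baseline "random jump" share. The graph below is made up, and this is a simplification for illustration, not Google's implementation.

```python
# Toy PageRank power iteration over a hypothetical link graph. The freshly
# published "fresh-orphan" page has no inbound links, so it ends up with only
# the random-jump baseline - purely link-based scoring tells us almost nothing
# about it yet.
DAMPING = 0.85
ITERATIONS = 50

links = {                       # page -> pages it links to (made-up graph)
    "older-post": ["new-post", "homepage"],
    "homepage": ["older-post"],
    "new-post": [],             # just published, linked to by older-post
    "fresh-orphan": [],         # just published, nothing links to it
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(ITERATIONS):
    new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
    for page, outgoing in links.items():
        if outgoing:
            share = DAMPING * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        else:
            # Dangling page: spread its score evenly across the whole graph.
            for target in pages:
                new_rank[target] += DAMPING * rank[page] / len(pages)
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```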

...But Until Then, Google Delivering Faster, Fresher Results Instead

Although the PubSubHubbub mention may have been the most exciting part of the announcement, real-time search results aren't here just yet. In the meantime, we'll have to be content with sped-up results instead. The post advises website owners who are blocking Google's crawler, Googlebot, from accessing their RSS/Atom feeds to unblock it via their robots.txt file. Webmasters who are unsure can test their feed URLs with the robots.txt tester in Google Webmaster Tools, as the post recommends.
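
For those who'd rather do a quick local check before logging into Webmaster Tools, Python's standard library can parse a robots.txt file and say whether a given user agent may fetch a URL. The site and feed URLs below are placeholders, and this is no substitute for Google's own tester - just a rough sanity check.

```python
# A quick local check (not Google Webmaster Tools' robots.txt tester) of
# whether robots.txt allows Googlebot to fetch a feed URL. Standard library
# only; the site and feed URLs are hypothetical.
from urllib.robotparser import RobotFileParser

SITE = "http://example.com"
FEED_URL = SITE + "/feed"

parser = RobotFileParser()
parser.set_url(SITE + "/robots.txt")
parser.read()  # fetches and parses the live robots.txt

if parser.can_fetch("Googlebot", FEED_URL):
    print("Googlebot is allowed to crawl the feed.")
else:
    print("The feed is blocked; consider adding an 'Allow:' rule for it.")
```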

Written by Sarah Perez
