How Google May Crawl and Index the Deep Web

The deep web may no longer be hard for Google's search engine to crawl. It is very interesting how Google may crawl deep web content hidden behind HTML forms. In general terms, search engine crawlers depend on hyperlinks and XML sitemaps. Bill Slawski has shared some interesting research documents about deep web crawling. The deep web refers to content hidden behind HTML forms: to reach it, a user has to submit a form with valid input values.

Google may be able to crawl these pages, and it may also reduce the burden on crawlers. When we test an application, we use scripts that submit forms automatically; in simple terms, deep web crawling may work the same way: the crawler fills a form with candidate input values, submits it, and keeps the distinct result pages it discovers. I am both concerned and excited about the data that is available on the web but that webmasters don't want to make visible.
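Here is a minimal sketch of that idea in Python. The form endpoint, field name, and candidate values are all hypothetical placeholders for illustration, not Google's actual method:

```python
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical search form endpoint and its single input field.
FORM_ACTION = "https://example.com/search"
FIELD_NAME = "q"

# Candidate input values a crawler might try against the form.
candidate_values = ["books", "music", "travel"]

discovered_pages = {}
for value in candidate_values:
    # Build the GET URL that submitting the form would produce.
    query = urllib.parse.urlencode({FIELD_NAME: value})
    url = f"{FORM_ACTION}?{query}"
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError:
        continue  # skip values the form rejects or that fail to load
    # Keep only result pages with distinct content, so near-duplicate
    # results don't flood the index.
    if html not in discovered_pages.values():
        discovered_pages[url] = html

for url in discovered_pages:
    print("Surfaced result page:", url)
```

In practice a crawler would be far more selective about which values to try and which result pages are worth indexing, but the loop above captures the basic probe-and-collect pattern.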

Generally, we webmasters control crawling with robots.txt rules and with noindex and nofollow meta tags. We have also seen Google display URLs that are disallowed by robots.txt. Here is how the meta description appears for such a URL on search engine result pages:

example – “A description for this result is not available because of this site’s robots.txt”
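That message appears because robots.txt only blocks crawling of a page's content; the bare URL can still be indexed if other pages link to it. Here is a minimal sketch of how a well-behaved crawler checks robots.txt before fetching, using Python's standard urllib.robotparser (the domain, path, and user agent are hypothetical):

```python
from urllib import robotparser

# Load and parse the site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler asks before fetching each URL.
url = "https://example.com/private/report.html"
if rp.can_fetch("Googlebot", url):
    print("Allowed to crawl:", url)
else:
    # Disallowed: the content is never fetched, but the URL itself may
    # still show up in results -- hence the SERP message quoted above.
    print("Blocked by robots.txt:", url)
```

To keep a page out of the index entirely, the page must be crawlable and carry a noindex meta tag, since a robots.txt disallow prevents Google from ever seeing that tag.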

This research and development will be helpful for websites that hold a great deal of data behind search boxes and HTML forms. I am going to read these research papers to deepen my knowledge of the topic.

Please let me know your views on this topic. Here is the research document.
