Robots.txt Tips For Dealing With Bots

by Mike on January 12, 2010

robots.txt
The robots.txt file gives crawling instructions to web robots using the Robots Exclusion Protocol.  When a web robot visits your site, it checks this file to discover any directories or pages you want it to skip.  It is an important file for SEO because it influences which parts of your site search engines crawl.

User-agent: *
Disallow: /administrator
Disallow: /media
Disallow: /topsecret

The text above tells robots not to visit the /administrator, /media, or /topsecret directories. Robots do not have to follow your suggestions; they can ignore your “disallows”. It is important to understand that you do not control the robots, you are only making suggestions.  So do not count on robots.txt to keep that /topsecret directory secret.  This is especially true of malicious robots, which are actively looking for things like the /topsecret directory.
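To see how a well-behaved robot reads those rules, here is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com and the bot name ExampleBot are placeholders made up for illustration. It only shows what a polite crawler would decide; nothing here stops a robot that never reads the file.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator
Disallow: /media
Disallow: /topsecret
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a compliant robot may fetch the path (hypothetical host)."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("/index.html"))            # True
print(allowed("/topsecret/plans.html"))  # False
print(allowed("/administrator-notes"))   # False: rules are prefix matches
```

The last line is worth noticing: a rule without a trailing slash blocks every path that starts with those characters, not only the directory itself.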

Here is an example of a robots.txt for WordPress.  There is a lot of discussion about what is completely correct, since search engines like Google will not tell you everything about SEO.  The settings listed are typical ones that should help, but be sure to test on your own site and make modifications.  Also note that if your blog is in a directory called blog/, you will need to add that prefix to each of the lines you see below, so a line would look like this:

/blog/wp-admin

Just verify that you have the correct directories listed.

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads

User-agent: Googlebot
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense – if you are using it
User-agent: Mediapartners-Google*
Disallow:
Allow: /*

# Internet Archive Wayback Machine
User-agent: ia_archiver
Disallow: /

# digg mirror
User-agent: duggmirror
Disallow: /
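One thing worth checking in a file like this is which group a given robot actually follows. A robot that finds a group naming it uses that group and skips the * group, so Googlebot here is not bound by the /wp-admin rule unless you repeat it. The sketch below runs Python's urllib.robotparser on a shortened copy of the file to show the effect. SomeOtherBot and www.example.com are made-up names, and this standard-library parser treats * and $ in paths literally, so it only demonstrates group selection.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the WordPress example: one catch-all group
# and one group that names Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes

User-agent: Googlebot
Disallow: /*.php$
Disallow: /*.js$
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://www.example.com/wp-admin/"  # hypothetical host

# A bot with no group of its own falls back to the * group.
other_bot_allowed = parser.can_fetch("SomeOtherBot", URL)
# Googlebot has its own group, which says nothing about /wp-admin.
googlebot_allowed = parser.can_fetch("Googlebot", URL)

print(other_bot_allowed)  # False
print(googlebot_allowed)  # True
```

If you want Googlebot kept out of /wp-admin as well, repeat those Disallow lines inside the Googlebot group.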

Be sure to include your sitemap information.

Sitemap: http://www.example.com/sitemap.xml
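A quick way to confirm the Sitemap line is readable is to parse the file and ask for it back. This is a small sketch using urllib.robotparser; the site_maps() method needs Python 3.8 or later, and the short file contents are an assumed example rather than a complete robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed, shortened robots.txt with a Sitemap line at the end.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.example.com/sitemap.xml']
```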

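There is a lot of discussion about whether you can use globbing or not.  Globbing is where you use a wildcard to match similar content.  The original Robots Exclusion Protocol did not define wildcards, but major search engines such as Google honor * (any run of characters) and $ (end of the URL), so a rule like this works for them: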

Disallow: /directory/*
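For crawlers that do honor wildcards, the matching rule is simple: * stands for any run of characters, a trailing $ pins the pattern to the end of the URL, and everything else is a prefix match. Here is a rough Python sketch of that rule so you can try patterns before publishing them. The function names are my own, and real crawlers may differ in details such as percent-encoding, so treat it as an approximation.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the path. Without '$' it is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def path_matches(pattern, path):
    """Return True if the path (including any query string) matches."""
    return robots_pattern_to_regex(pattern).match(path) is not None


print(path_matches("/directory/*", "/directory/page.html"))  # True
print(path_matches("/directory/", "/directory/page.html"))   # True
print(path_matches("/*.php$", "/index.php"))                 # True
print(path_matches("/*.php$", "/index.php?x=1"))             # False
print(path_matches("/*?", "/page?id=3"))                     # True
```

Because rules are prefix matches anyway, a trailing * as in /directory/* adds nothing over /directory/. The wildcard earns its keep in the middle of a pattern or together with $.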

Here are several bots from Google and how they are used.

Googlebot: crawls pages for the Google web index
Googlebot-Mobile: crawls pages for Google mobile index
Googlebot-Image: crawls pages for Google image index
Mediapartners-Google: crawls pages to determine AdSense content.  Disallow this if you are not using AdSense.
AdsBot-Google: crawls pages to measure AdWords quality. Again, disallow if you are not using AdWords.

Here are a few examples of allowing and disallowing some search bots.

# Disable Duggmirror
User-agent: duggmirror
Disallow: /

# Allow Google Images bot
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Allow Adsense bot
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
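The same pattern covers the "disallow if you are not using it" advice for the ad bots. The sketch below is a trimmed variation of the examples above, checked with Python's urllib.robotparser. It assumes a site that uses neither AdSense nor AdWords, and www.example.com is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Disable Duggmirror
User-agent: duggmirror
Disallow: /

# Allow Google Images bot
User-agent: Googlebot-Image
Disallow:

# Assumption: not using AdSense or AdWords, so keep these bots out
User-agent: Mediapartners-Google
Disallow: /

User-agent: AdsBot-Google
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "http://www.example.com/images/photo.png"  # hypothetical URL

# Map each bot name to whether a compliant copy of it may fetch URL.
results = {
    bot: parser.can_fetch(bot, URL)
    for bot in ("duggmirror", "Googlebot-Image", "Mediapartners-Google",
                "AdsBot-Google", "Googlebot")
}
for bot, ok in results.items():
    print(bot, "allowed" if ok else "blocked")
```

Googlebot has no group in this file and there is no * group, so nothing restricts it. An empty Disallow: line, as in the Googlebot-Image group, means "nothing is disallowed".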

Joomla is a content management system, and one of its common problems is duplicate pages being listed by Google.  The robots.txt can help with this, but you will have to keep watching what Google lists for problems.  What is listed below is a starting place only.

# Joomla Example
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
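Since this list is only a starting place, it helps to test a few of your own URLs against it before going live. Here is a minimal sketch with Python's urllib.robotparser on a few of the rules above. ExampleBot, www.example.com, and the sample paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A few of the Joomla rules from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /images/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def joomla_allowed(path):
    """Return True if a compliant robot may fetch the path (hypothetical host)."""
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


print(joomla_allowed("/index.php"))                # True
print(joomla_allowed("/administrator/index.php"))  # False
print(joomla_allowed("/administrator"))            # True: no trailing slash, no match
print(joomla_allowed("/images/logo.png"))          # False
```

Two things stand out. With a trailing slash, the rule covers what is inside the directory but not the bare /administrator path. And blocking /images/ also keeps image search robots away from pictures stored there, which you may or may not want.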
