The robots.txt file provides crawling instructions to web robots using the Robots Exclusion Protocol. When a web robot visits your site, it checks this file to discover any directories or pages you want excluded from the search engine's listing. This is an important file for SEO, since it influences what search engines crawl, and it can help rankings.
The text above tells the robot not to visit the /administrator directory, the /media directory or the /topsecret directory. Robots do not have to follow your suggestions; they can ignore your “disallows”. It is important to understand that you do not really control the robots; you are only making suggestions. So do not count on keeping that /topsecret directory secret. This is especially true of malicious robots, which are actively looking for things like the /topsecret directory.
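As a rough illustration of the rules just described, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt text, the example.com host and the "ExampleBot" name are assumptions made for the example, not taken from a real site. It only shows what a well-behaved robot would conclude; nothing here stops a robot that chooses to ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the three directories discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /media/
Disallow: /topsecret/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and example.com are placeholders for illustration.
for path in ("/index.html", "/administrator/login.php", "/topsecret/plans.txt"):
    allowed = parser.can_fetch("ExampleBot", "https://www.example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```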
Here is an example of a robots.txt for WordPress. Please note there is a lot of discussion about what is completely correct, since search engines like Google will not tell you everything about SEO. The settings listed are typical ones that should help, but be sure to test on your own site and make modifications. Also note that if your blog is in a directory called blog/, you will need to add that to each of the lines you see below, so it would look like this:
Just verify that you have the correct directories listed.
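To make the blog/ prefix point concrete, here is a small Python sketch that builds Disallow lines with and without a directory prefix. The helper name prefix_rules and the two WordPress directories are assumptions chosen for illustration; substitute the directories that actually exist on your site.

```python
def prefix_rules(paths, prefix=""):
    """Return Disallow lines, optionally prefixed with a blog directory."""
    cleaned = prefix.strip("/")
    base = ("/" + cleaned) if cleaned else ""
    return ["Disallow: " + base + path for path in paths]

# Assumed WordPress directories, for illustration only.
WORDPRESS_PATHS = ["/wp-admin/", "/wp-includes/"]

# Blog installed at the site root.
print("\n".join(prefix_rules(WORDPRESS_PATHS)))

# Blog installed in a directory called blog/.
print("\n".join(prefix_rules(WORDPRESS_PATHS, "blog/")))
```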
# Google Image
# Google AdSense – if you are using it
# Internet Archiver Wayback Machine
# digg mirror
Be sure to include your sitemap information.
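As a quick check that a sitemap line is picked up, the following Python sketch parses a hypothetical robots.txt and reads back its Sitemap entries. The sitemap URL is a placeholder, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the sitemap URL is a placeholder.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /administrator/

Sitemap: https://www.example.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
print(sitemap_parser.site_maps())
```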
There is a lot of discussion about whether you can use globbing or not. Globbing is where you use a wildcard to list similar content, like this:
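Crawlers that do support wildcards, Googlebot among them, treat * as "any sequence of characters" and a trailing $ as "end of the URL". The Python sketch below imitates that matching so you can try patterns against sample paths. The function names and sample rules are assumptions for illustration, and this is not the code any search engine actually runs. Note that Python's built-in urllib.robotparser does not expand wildcards, which is why a small regex translation is used here.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule containing * and $ into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def rule_matches(rule, path):
    """Return True if the path matches the rule from its first character."""
    return rule_to_regex(rule).match(path) is not None

# Sample rules and paths, chosen only for illustration.
print(rule_matches("/*.pdf$", "/files/report.pdf"))
print(rule_matches("/*.pdf$", "/files/report.pdf?page=2"))
print(rule_matches("/wp-*", "/wp-admin/"))
```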
Here are several bots from Google and how they are used.
Googlebot: crawls pages for the Google web index
Googlebot-Mobile: crawls pages for Google mobile index
Googlebot-Image: crawls pages for Google image index
Mediapartners-Google: crawls pages to determine AdSense content. Disallow this if you are not using AdSense.
AdsBot-Google: crawls pages to measure AdWords quality. Again, disallow if you are not using AdWords.
Here are a few examples of allowing and disallowing some search bots.
# Disable Duggmirror
# Allow Google Images bot
# Allow AdSense bot
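A sketch of how per-bot rules like these behave, again using Python's urllib.robotparser. The Allow and Disallow lines are assumptions built around the comments above, and the final catch-all group and "OtherBot" name are added only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical per-bot rules; the directives are assumed, not from a real site.
PER_BOT_RULES = """\
# Disable Duggmirror
User-agent: duggmirror
Disallow: /

# Allow Google Images bot
User-agent: Googlebot-Image
Allow: /

# Allow AdSense bot
User-agent: Mediapartners-Google
Allow: /

# Everyone else (assumed rule)
User-agent: *
Disallow: /administrator/
"""

bot_parser = RobotFileParser()
bot_parser.parse(PER_BOT_RULES.splitlines())

url = "https://www.example.com/administrator/"
for agent in ("duggmirror", "Googlebot-Image", "Mediapartners-Google", "OtherBot"):
    print(agent, bot_parser.can_fetch(agent, url))
```

A bot that has its own User-agent group follows that group and skips the catch-all, which is why the two Google bots are allowed into a directory that other robots are asked to avoid.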
Joomla is a content management system, and one common problem is seeing duplicate pages listed by Google. The robots.txt file can help with this, but you will have to keep watching what Google lists for problems. What is listed below is only a starting place.
# Joomla Example