The robots.txt file is used to communicate information to search engines - most commonly to specify URLs that should not be spidered. There are various (legit!) reasons why you may not want certain parts of your site indexed by the search engines, and robots.txt is a handy way to do so. So an associate sent me an email saying that if you are blocking a URL in robots.txt and then remove that block, it will take 180 days for Google to see it. This didn't seem right to me - my guess is they were misreading what happens if you request a manual exclusion of your entire site. But as with most things, the easiest way to find out is to do a test.
I've had a "User-agent: *" and "Disallow: /bad_robots/" in the www.komar.org/robots.txt file for years. I set this up originally as a honeypot - I was kinda curious if I would see robots intentionally ignore this rule ... and several do follow the subtle/semi-hidden link on my contact page to the (blocked) URL of www.komar.org/bad_robots/. I'm happy to say that I haven't seen the big-3 - Google, Microsoft's LiveSearch, or Yahoo - do so.
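For reference, here's how that rule pair behaves - a minimal sketch using Python's stdlib urllib.robotparser (the file contents are reconstructed for illustration, not copied from the live site):

```python
from urllib.robotparser import RobotFileParser

# The relevant rules from www.komar.org/robots.txt at the time
# (reconstructed for illustration):
rules = """\
User-agent: *
Disallow: /bad_robots/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler checks before fetching:
print(rp.can_fetch("Googlebot", "http://www.komar.org/bad_robots/"))  # False
print(rp.can_fetch("Googlebot", "http://www.komar.org/"))             # True
```

The bad robots, of course, skip the check entirely and fetch the page anyway - which is exactly what the honeypot catches.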
The /bad_robots/ page basically just has a title and h1 tag saying "Hello Mr. BAD Robot - you should not be able to see this! ;-)" ... so if this page were indexed by the search engines, you should be able to find it by searching for that phrase in Google, LiveSearch, and Yahoo - there are no results - all good! Plus if I fire up Google's Webmaster Console, it shows that /bad_robots/ is "URL restricted by robots.txt - Last Calculated Feb 10, 2007" - this means Google has seen the link to this URL, but can't spider it per those instructions.
So at 1637 (all times MST) on February 18th, 2007, I changed the robots.txt file so that /bad_robots/ was no longer disallowed - how long would it take Google, LiveSearch, and Yahoo to notice the change and spider/index that URL? Please note that I am not using Sitemaps (which would probably speed up the process) ... so this is just a test of normal spidering/indexing.
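The edit itself is tiny - roughly this before/after (my reconstruction for illustration; the exact file contents aren't a copy of the original):

```
# Before (until 1637 MST, Feb 18th):
User-agent: *
Disallow: /bad_robots/

# After (an empty Disallow means nothing is blocked):
User-agent: *
Disallow:
```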
What first has to happen is that the search engines have to spider the robots.txt file to see the changes. Googlebot spiders this file once/day (like clockwork - consistent with their guidelines) so they snarfed it at 2310 that night. Interestingly enough, the next day, Webmaster Console shows the changes in the "updated robots.txt analysis" link and date/time of last download - quick! But it took until Feb 27th (T+9 days) for the /bad_robots/ restricted by robots.txt message to go away. Incidentally, they recently added the ability to see backlinks to your URLs - "No links found" is shown for /bad_robots/ - I'd tactfully suggest that they should show the known links to that page even if it is blocked by robots.txt! ;-) MSNbot and YahooSlurp (very) aggressively spider the robots.txt file, coming by at least hourly and often even more frequently. On Feb 18th, they downloaded it 59 and 96 times respectively. I was a bit surprised by this and think that is a bit much ... ;-)
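Those download counts came straight out of the Apache access logs - a quick sketch of the kind of tally I did (the log lines below are made up for illustration; only the combined-log format is assumed):

```python
from collections import Counter

# A few fake Apache combined-log lines for illustration:
log_lines = [
    '66.249.x.x - - [18/Feb/2007:23:10:00 -0700] "GET /robots.txt HTTP/1.1" 200 44 "-" "Googlebot/2.1"',
    '65.55.x.x - - [18/Feb/2007:01:02:03 -0700] "GET /robots.txt HTTP/1.1" 200 44 "-" "msnbot/1.0"',
    '65.55.x.x - - [18/Feb/2007:02:02:03 -0700] "GET /robots.txt HTTP/1.1" 200 44 "-" "msnbot/1.0"',
]

counts = Counter()
for line in log_lines:
    if '"GET /robots.txt' in line:
        agent = line.rsplit('"', 2)[1]      # last quoted field is the user agent
        counts[agent.split("/")[0]] += 1    # key on the bot name, not the version

print(counts)  # Counter({'msnbot': 2, 'Googlebot': 1})
```

Run against a real day's log (e.g. via `open("access_log")` instead of the list), this is how you end up with numbers like 59 and 96 fetches per bot per day.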
While not totally necessary to index the /bad_robots/ page, I did check the Apache web logs for when the spiders came by and grabbed my contact page which has the link to /bad_robots/ - GoogleBot snarfed it at 0030 on Feb 20th, MSNbot at 2252 on Feb 19th, and YahooSlurp at 2301 on Feb 19th. Each spider comes by again about every other day.
The next step is when the search engine spiders actually come by and grab the /bad_robots/ page. MSNbot was the first to spider it, at 1702 on Feb 20th. Googlebot came by at 1126 on Feb 25th. And YahooSlurp came by at 0025 on Feb 26th.
So when did the search results show the page?
I checked daily in the morning ... and:
Google was the first to show results, on March 1st (T+11 days), with a cache time/date of 1126 on Feb 25th - same as above.
Yahoo showed results on March 3rd (T+13 days).
And LiveSearch followed on March 7th (T+17 days).
Note: This is a single sample, so while not statistically valid, my guess is the results would not vary that much, and it certainly debunks the "you won't be seen for 180 days" myth.