Want to know more about the robots.txt file? This beginner’s guide to robots.txt will answer your questions about it.
robots.txt, robots.txt, robots.txt..
You may have come across this strange name at least once if you are an internet surfer.
robots.txt, robots.txt, robots.txt..
You have definitely seen this “.txt” more than once if you are a blogger using automated tools like WordPress or Blogspot.
robots.txt, robots.txt, robots.txt..
Now, you are undoubtedly curious about this “text file” if you are hosting and customizing your own blog/website.
Urrrmmm… so let’s try to find out about this mysterious thing together, you and me.
You: “robots.txt is a robot hiding in a file?”
Karthik: “I would say it is just a plain text file.”
You: “Hahaha.. trying to be smart, Karthik? Hah.. my 12-year-old brother knows this much!”
You: “Let’s get serious, I need to know about it!!”
Karthik: “Ok, relax and let me guide you through the world of robots.. urrmm, sorry, robots.txt”
Let’s try to define robots.txt. A robots.txt file:
– is a simple, plain text file
– is placed in the root directory of a website
– is used to control which pages, images or other files can be crawled (and so, usually, indexed) by a Web Robot (aka search engine spiders or bots)
– restricts the access of a specific robot, or of all robots, to a website or part of it
– provides instructions for robots crawling a website
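To make that definition concrete, here is a minimal Python sketch that feeds a small robots.txt to the standard library’s urllib.robotparser and asks what a bot may fetch. The rules, the /private/ path and the bot names SomeBot and BadBot are made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "https://trainerkarthik.com"

# Any bot may read the home page, but not the /private/ folder.
any_bot_home = parser.can_fetch("SomeBot", site + "/index.html")
any_bot_private = parser.can_fetch("SomeBot", site + "/private/notes.html")

# BadBot is asked to stay away from the whole site.
bad_bot_home = parser.can_fetch("BadBot", site + "/index.html")

print(any_bot_home, any_bot_private, bad_bot_home)
```

Each “User-agent” line names who the rules are for, and the “Disallow” lines under it list the paths that robot is asked to skip.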
Now on the net, whenever two or more entities interact, there’s bound to be a protocol (a set of rules), and here this protocol is called the Robots Exclusion Standard or the Robots Exclusion Protocol.
The Robots Exclusion Standard
– is a set of rules or a convention which governs the behaviour of a Web Robot with respect to a Website’s directories
A raw fact: This protocol relies on the cooperation of Web Robots.
Q => What if these bots do not cooperate?
A => Not all bots abide by this protocol, and those that ignore it are called ‘Bad Bots’ or ‘Spam Bots’. So robots.txt is not a guaranteed way to restrict access to your files and directories.
A thought:
In this era of intense competition for high rankings, most sensible, business-minded companies try to conform as closely as possible to universally accepted standards and protocols. So the population of Bad Bots may well shrink in the near future, which would boost the importance of this protocol.
Mechanism of robots.txt & a Web Robot
STEP 1: Web Spider visits a website, e.g. https://trainerkarthik.com
STEP 2: Spider checks for https://trainerkarthik.com/robots.txt
STEP 3: If robots.txt is found, analyse the instructions in the file & proceed to STEP 5
STEP 4: If not found, the web server records a “not found” error in its log file & the spider proceeds to STEP 6
STEP 5: Crawl/index website according to instructions defined in robots.txt
STEP 6: Crawl/index the website however I (the spider) want
STEP 7: Crawl that website as many times as ‘I want’ (say n times)
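The steps above can be sketched in a few lines of Python. This is only an illustration of how a well-behaved spider might act, not how any particular search engine is implemented. The helper names rules_for and crawlable, the bot name DemoBot and the /admin/ path are assumptions, and the HTTP status and body are passed in by hand so the sketch runs without touching the network.

```python
from urllib.robotparser import RobotFileParser

def rules_for(status, body=""):
    """Hypothetical helper: build rules from an already-fetched robots.txt response."""
    parser = RobotFileParser()
    if status == 200:
        # STEP 3: robots.txt found, so read its instructions.
        parser.parse(body.splitlines())
    else:
        # STEP 4: robots.txt missing, so nothing restricts the crawl.
        parser.allow_all = True
    return parser

def crawlable(parser, agent, urls):
    # STEP 5 / STEP 6: keep only the URLs this agent is allowed to fetch.
    return [url for url in urls if parser.can_fetch(agent, url)]

site = "https://trainerkarthik.com"
urls = [site + "/", site + "/admin/login"]

found = rules_for(200, "User-agent: *\nDisallow: /admin/\n")
missing = rules_for(404)

with_rules = crawlable(found, "DemoBot", urls)
without_rules = crawlable(missing, "DemoBot", urls)

print(with_rules)
print(without_rules)
```

With a robots.txt in place the spider skips /admin/login; without one, both URLs are fair game.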
Benefits of robots.txt
1) Minimize errors in log file – As you have observed in STEP 4 above, if you do not provide a robots.txt file, an error message is logged. Now if a Web Bot crawls or accesses your site 100 times a day (STEP 7), imagine the size of the error logs.
2) Save bandwidth – As you have seen in STEP 7, a bot can visit your site ‘n’ times a day, crawling your images, HTML files, etc. each time. All this can consume considerable bandwidth and add server load, especially if you are running a site with a lot of images and graphics. Using robots.txt lets you tell bots which parts of the site to skip.
3) Protect the privacy of your website or part of your files – Spiders will crawl ALL your files if you don’t instruct them otherwise. At times you might want to keep a certain image file, for instance, out of search results. Remember that this only works for bots that cooperate.
4) Be more professional – If you are investing effort in doing your best in your endeavour, you should put every chance and opportunity on your side; consider doing things the right way – hence use a robots.txt
5) Boost your site rank – Web Robots are information-greedy and reward (by increasing your PR value in their database) those who have a lot of good content. Hence, why not provide them with the right content in the right format? So try to strike a balance between so-called “machine readability” and “human readability”.
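For the bandwidth point in particular, here is a small, hedged Python sketch. It assumes a site that keeps heavy graphics under a hypothetical /images/ folder and asks bots to wait 10 seconds between requests. Note that “Crawl-delay” is a non-standard directive: some crawlers honour it and others (Googlebot, for one) ignore it, so treat it as a polite request rather than a setting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the /images/ folder and the 10-second delay are assumptions.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "https://trainerkarthik.com"

# Seconds a cooperative bot is asked to wait between two requests.
delay = parser.crawl_delay("DemoBot")

# Heavy image files are off limits, ordinary pages are not.
image_ok = parser.can_fetch("DemoBot", site + "/images/banner.png")
page_ok = parser.can_fetch("DemoBot", site + "/about.html")

print(delay, image_ok, page_ok)
```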
Usage Syntax of robots.txt file
When writing your own robots.txt, pay particular attention to the following:
1) Case sensitivity
2) Use the exact name of the existing web spider
3) Where the colons (:) are placed
4) When to use the asterisk (*)
5) Bots read instructions in the robots.txt in a “Top to Bottom” fashion.
6) The text file should be named “robots.txt” with the ‘s’ in it
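A few of these points can be tried out with a short Python sketch. The paths /docs/, /docs/public/ and /Photos/ and the bot name DemoBot are invented for the example. One caveat: Python’s built-in parser applies the first rule that matches, reading top to bottom, while some crawlers instead pick the most specific (longest) matching rule, so the order of lines may matter less to them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a colon after each field name, and * meaning "every robot".
RULES = """\
User-agent: *
Allow: /docs/public/
Disallow: /docs/
Disallow: /Photos/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(path, agent="DemoBot"):
    # Hypothetical helper: may this agent fetch this path on the example site?
    return parser.can_fetch(agent, "https://trainerkarthik.com" + path)

results = {
    # The Allow line comes before the broader Disallow line, so it is applied first.
    "/docs/public/a.html": allowed("/docs/public/a.html"),
    "/docs/secret.html": allowed("/docs/secret.html"),
    # Paths are case sensitive: /Photos/ is blocked, /photos/ is not.
    "/Photos/cat.jpg": allowed("/Photos/cat.jpg"),
    "/photos/cat.jpg": allowed("/photos/cat.jpg"),
}

for path, ok in results.items():
    print(path, ok)
```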
Remember, when you do SEO (aka Search Engine Optimization) for any website or blog, the first thing you need to check is the robots.txt file.
(Note: A text file can be created by using Notepad and saving the file with the name ‘robots‘. If you are using Linux, you can do this with an editor like KWrite, Kate or any other text editor: save the file with the name “robots” and explicitly add the extension ‘.txt‘.)