Robots.txt Advice

Discussion in 'Digital Marketing' started by WebMan, Aug 8, 2006.

  1. WebMan

    WebMan LawnSite Member
    from D/FW TX
    Messages: 11

    I have noticed a lot of self-built sites here, and for some reason many are on Yahoo. Just FYI: Yahoo & GoDaddy, while great at what they do, are well known as two of the worst hosts & site builders. Do some searching on web hosting forums or webmaster forums for those two and you will see 100s of posts to that effect; I'm not being prejudiced.

    Anyway I have noticed a lot of sites missing even the basics of titles, keywords, or correct use of keywords, description tags, classification tags (all of which are vital except keywords) and robot instruction tags or cache tags (again essential).
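
    For anyone who wants to check their own pages for the basics listed above, here is a minimal sketch in Python. It assumes you can paste a page's HTML into a string. The names HeadTagAudit and missing_basics are made up for this example, and the list of tags it looks for (title, description, robots) is just one reading of the list above, not an official checklist.

```python
from html.parser import HTMLParser


class HeadTagAudit(HTMLParser):
    """Collects the <title> text and any <meta name="..."> tags on a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attributes.get("name"):
            # Store meta tags by lower-cased name, e.g. "description", "robots".
            self.meta[attributes["name"].lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def missing_basics(html, wanted=("description", "robots")):
    """Return the names of the basic tags that are absent or empty."""
    audit = HeadTagAudit()
    audit.feed(html)
    audit.close()
    missing = []
    if not audit.title.strip():
        missing.append("title")
    missing += [name for name in wanted if not audit.meta.get(name, "").strip()]
    return missing
```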
    So it stands to reason if those basics are missing many of you probably don't know what a robots.txt file is. So here are some basics:

    Robots (sometimes called spiders) are the automated programs search engines use to look at thousands of web sites a day to see what's on them and rank them accordingly. Every engine has them, and if you have a stats program you can see when they visit. For example, Google's robot is called Googlebot and will show up as a visitor by that name.
    The robots "tag" is important because it sits in your page headers and tells the robot some things about what to do on that page. BUT by specification all robots from legitimate engines must first look in the top folder of a site for a file called "robots.txt" (no quotes). This is a plain text file and does not contain HTML code like a regular web page. It's a simple file like one made in Microsoft Notepad. They must look there first and then do what it tells them. That is the file's sole purpose: to give robots instructions.
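
    To show where a robot looks before it reads anything else, here is a small Python sketch. Whatever page it is about to visit, the robots.txt address is always the site's address plus /robots.txt. The function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address a well-behaved robot checks before fetching page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

    For example, robots_txt_url("http://www.example.com/services.html") gives "http://www.example.com/robots.txt".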

    Here is where it can really help you guys (and gals) with new sites. It can prevent a search engine from seeing your site. Now you wonder "why would I want that?"

    A typical example is an unfinished site. The last site I looked at that someone posted here had two links to pages that didn't exist yet, so you get "404 page not found" errors. That is terrible for a search engine. Almost as bad, if not just as bad, is a page that just says "coming soon" or "under construction".
    Search engine spiders will find your site if it's out there; sometimes almost immediately, sometimes it takes several weeks, but they will... (never pay for any search engine "submission" services. They are all a scam and some can hurt your rankings or get you barred).
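
    If you want to catch links to pages that don't exist yet before a robot does, something like the Python sketch below could help. It is only a rough check under some assumptions: all your pages sit in one folder, and you hand it a dictionary of file name to HTML text. The names LinkCollector and broken_local_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def broken_local_links(pages):
    """pages maps file name -> HTML text.

    Returns (page, href) pairs for links that point at a local file
    that is not in pages. Links to other sites are skipped.
    """
    broken = []
    for name, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        collector.close()
        for href in collector.links:
            is_external = "://" in href or href.startswith(("mailto:", "#", "//"))
            # Ignore any #anchor or ?query part when matching the file name.
            target = href.split("#")[0].split("?")[0].lstrip("/")
            if is_external or not target:
                continue
            if target not in pages:
                broken.append((name, href))
    return broken
```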

    If the search engine robot finds a site with "404's" or "coming soon" pages it will kill your rankings. You will be so far down the bottom looks like up.

    It's much better to get listed the way you want the first time than to get a terrible rank then try to work your way up.
    BTW I also see many photos without "alt" text (the alternate text attribute on the image tag). It is needed for three reasons. (1) Search engines can't see photos, but they can see the alternate text that describes them. (2) To be W3C compliant images must have it, and search engines like sites better the more compliant they are. (3) If a person uses a screen reader because they are visually impaired, the reader will read the alternate text aloud and tell them what the photo is. That can be the difference between a site that confuses someone with a reader and one that makes sense.
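
    Here is a minimal Python sketch for finding photos that are missing their alternate text, assuming you paste a page's HTML into a string. The names MissingAltFinder and images_missing_alt are made up for this example. It also flags images whose alt text is empty, which is sometimes done on purpose for purely decorative images, so treat the result as a list to review rather than a list of errors.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Records the src of every <img> tag with no alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Flag both a missing alt attribute and an empty one.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src") or "(no src)")


def images_missing_alt(html):
    """Return the src of each image on the page that lacks alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing
```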

    I can't do a full-blown article here on what to do with a robots.txt file, so I'll give you the single most important use, judging from what I have seen here.

    If your site isn't totally finished: Use Notepad or similar to make a file called robots.txt (do NOT use Word or a web program or other word processor, it must be in plain text thus the dot-txt ending)
    In the file place the following:

    User-agent: *
    Disallow: /

    Just like that. That will stop all robots that follow the rules from seeing your site. Save it on your computer as robots.txt, then upload the file to the same folder where your pages are (usually public_html or www). It should become part of the file list as if it were one of your web pages: not in a sub folder like images or any other sub folder, but the same place your pages are listed. Your pages will have names like services.html and you want to see robots.txt listed along with them. And don't worry, because it ends in txt it won't show up as one of your pages, although anyone who types its address into a browser can still read it.
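
    If you want to convince yourself that those two lines really shut the door, Python ships with a robots.txt reader (urllib.robotparser) that answers the same question a polite robot asks. This is only a sketch: the helper name is_allowed and the example.com addresses are made up, and a real search engine's own reader may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The two-line file from the post: every robot, whole site off limits.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Ask whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

    Calling is_allowed(BLOCK_ALL, "Googlebot", "http://www.example.com/services.html") returns False, and so does any other page address on the site.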

    Then don't forget when your site is finished to just delete that file. Within the same length of time mentioned above the search engines will find you... or read up on robots.txt and make a file that allows the robots you want, tells them what they can and can't do, and tells the rest to stay out. Just know that the file is a request, not a lock: the bad guys can ignore it. (Spammers, for example, use robots that look for e-mail addresses. There are many, many "bad" robots out there, and any kid with some computer savvy can find on-line instructions on how to make a robot and have it look for whatever you tell it to.)
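
    As one possible version of a file that allows the robots you want, the sketch below lets Googlebot in and asks every other robot to stay out, then checks it with Python's built-in urllib.robotparser. The helper name is_allowed, the robot name EmailHarvester and the example.com address are made up for illustration, and the result only describes robots that choose to honor the file.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow line means "nothing is off limits" for that robot.
SELECTIVE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Ask whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "http://www.example.com/services.html"  # hypothetical page address
googlebot_ok = is_allowed(SELECTIVE, "Googlebot", PAGE)
harvester_ok = is_allowed(SELECTIVE, "EmailHarvester", PAGE)  # made-up robot name
```

    Here googlebot_ok comes out True and harvester_ok comes out False.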

    Hope this helps, it's better to keep the search engines away until your site is the way you want them to see it :cool:
