Spiders are everywhere. You may be shocked to discover that they have even made their way onto the internet. I am not talking about articles about spiders, nor am I talking about photos of spiders, or even videos of them. No, it is much more disturbing. I am talking about the menacing little creepy crawlies that you cannot see or hear, and that you may never know you have encountered.
Spiders on the internet can be far more diabolical than those in real life. These little daemons crawl all over your web presence, taking snippets of every page, but does the debauchery stop there? Oh no, it sure doesn’t. Instead, these internet spiders go running back to every Bing, Google, and Yahoo! out there and tell them all about it. From the information gathered by the spiders, these “search engines,” as they like to be called, determine your worth. That’s right; you’re expected to be at your best when you’re being “crawled” by these spiders, even if you had no idea they were crawling all over your things!
Ack! Spiders! Eww!! Get Rid of Them!
There is no internet Orkin Man. You are on your own, and what’s worse, there is no anti-eSpider bomb or spray. Your iPhone doesn’t have an app to zap them with. However, there is still hope! Prevention lies with you.
Path to Spider-free Living
If your page or directory is password protected, crawlers cannot read it, so it should not be indexed. In most cases this includes blogs, message boards, profiles, etc. that are set to private.
There is the Robots Exclusion Standard, which involves putting a text file named “robots.txt” in the root (first-level) directory of the web server. In this file you include instructions as to what to index and what not to index. Well-behaved spiders follow these instructions, but compliance is voluntary. Here is an example of the file, taken from robotstxt.org:
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
The first “Disallow” line keeps spiders out of everything within ‘/cyberworld/map/’ on that web server, so none of it gets indexed. ‘/foo.html’ is a specific file (web page) that is disallowed. Any extension can be used (‘.xml’, ‘.htm’, ‘.asp’, etc.), and the rule will prevent that page from being indexed.
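If you want to check what a rule set like this actually blocks before you unleash it on the spiders, Python's standard library can read robots.txt rules for you. The following is only a minimal sketch: it assumes the complete robotstxt.org example (including its “User-agent: *” line and the ‘/foo.html’ rule), and the helper name is_allowed is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed full text of the example file; adjust to match your own robots.txt.
ROBOTS_TXT = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""


def is_allowed(path, user_agent="*"):
    """Return True if a rule-abiding crawler may fetch the given path.

    is_allowed is a hypothetical helper, not part of any standard.
    """
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for path in ("/index.html", "/cyberworld/map/room1", "/tmp/old.html", "/foo.html"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Keep in mind that this only tells you what a polite spider would do. Nothing in robots.txt physically stops a rude one.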
There is also the tag method, which involves placing a META tag in the page with the name set to “ROBOTS” and the content set to “NOINDEX, NOFOLLOW”. “NOINDEX” prevents the crawler from indexing the page, and “NOFOLLOW” prevents links on the page from being followed. Other values can be used in this tag as well. “INDEX” allows the page to be indexed; it exists so that you can pair it with “NOFOLLOW” (“INDEX, NOFOLLOW”) to have the page indexed without its links being followed. You can also use “NOINDEX, FOLLOW”, which allows the crawler to follow links on the page but not index the page.
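To see how a spider might read this tag, here is a rough sketch using Python's built-in HTML parser. The sample page and the names RobotsMetaParser and robots_directives are made up for illustration, and real crawlers may interpret the tag differently.

```python
from html.parser import HTMLParser

# Hypothetical sample page carrying the tag described above.
PAGE = """<html>
<head>
<title>Private page</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head>
<body><a href="/secret.html">Nothing to see here</a></body>
</html>"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The tag name and values are treated as case-insensitive here.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of lowercase robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("may index: ", "noindex" not in found)
    print("may follow:", "nofollow" not in found)
```

A page with no ROBOTS tag yields an empty set, which this sketch treats as permission to both index and follow.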
Read more about the Robots Exclusion Standard on its Wikipedia page.