Web 2.0 is very useful for passing information to and from people all around the world. But if you are putting up information that you don't want the world to see, you need to do more than just hide it in a secret directory. There are many ways that a search engine can find files that are hidden and unlinked, such as:
- Search engine spiders can fill out forms and spider the results pages.
- They can read referrer information (the HTTP Referer header) to see where a visitor has come from. This means that if someone visits your hidden page and then goes to Google, Google can get that page into its index.
- Even if you don't link to the page, if someone else does, then eventually Googlebot and other bots will find the link and spider the page.
- In the same way, even if you don't use a search engine's "add URL" page, someone else might submit your hidden page for you.
Everything That Is Not Actively Hidden Can Be Found by a Search Engine in the Web 2.0 Era
Everything that is stored on your website in public (non-password protected) directories is visible to a robot or search engine. At first blush, this might sound like a good thing. After all, isn't one of the goals of Web development to create pages and sites that are found and spidered by search engines? But you might be surprised at what search engines are now finding and including in their indexes.
Google and other search engines have tools that search the Web for specific files and file types. And they don't just search the file names. Many of these file types are indexable, meaning the search engine can read the contents and index those as well. Even text in images is soon going to be indexable by search engines.
So if you have secret or private information in any of the following file types, you should not rely on Web 2.0 security through obscurity to protect them.
- HTML
- Acrobat (.pdf)
- PostScript (.ps)
- Word Documents (.doc)
- Excel Spreadsheets (.xls)
- PowerPoint Presentations (.ppt)
- Rich Text Format (.rtf)
- Flash (.swf and .fla)
- Images (.gif, .jpg, .png, and others)
It's All Vulnerable
If you put up any files on the website that you don't want to be found, they should be in a password-protected directory. If they aren't, they are visible, and search engines can and will spider them.
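One way to see what you are exposing is to walk your public directory and list every file whose type appears in the list above. The following is only a minimal sketch: the function name `find_indexable_files` and the idea of passing in a local copy of the web root are assumptions for illustration, and the extension set simply mirrors the list in this post.

```python
from pathlib import Path

# Extensions mirror the file types listed above; extend the set as needed.
INDEXABLE_EXTENSIONS = {
    ".html", ".pdf", ".ps", ".doc", ".xls", ".ppt",
    ".rtf", ".swf", ".fla", ".gif", ".jpg", ".png",
}


def find_indexable_files(web_root):
    """Return sorted paths, relative to web_root, of files with an indexable extension."""
    root = Path(web_root)
    found = []
    for path in root.rglob("*"):
        # Compare extensions case-insensitively so "PLAN.PDF" is caught too.
        if path.is_file() and path.suffix.lower() in INDEXABLE_EXTENSIONS:
            found.append(path.relative_to(root).as_posix())
    return sorted(found)
```

Anything this reports from a directory that is not password protected should be treated as findable.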
How Can You Protect Your Files in the Web 2.0 Era?
There are several ways to protect your files:
Don't put them up on the site
This is the most secure method. If you don't want your files to be seen by people, avoid putting them on a website or even a computer with a Web server on it.
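Put them in a password-protected directory
Use a server-level password, such as HTTP authentication configured through an .htaccess file, for the most secure protection. You can also use a JavaScript password, but this is not very secure, because the check runs in the visitor's browser.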
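To make "server-level" concrete, here is a rough sketch of the kind of check a server performs for HTTP Basic authentication before it hands over a file. In practice your web server does this for you; the function name `check_basic_auth` and the plain-text expected credentials are assumptions made purely for illustration.

```python
import base64
import hmac


def check_basic_auth(header_value, expected_user, expected_password):
    """Return True if an HTTP Basic Authorization header carries the expected credentials."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        # The part after "Basic " is base64 of "user:password".
        raw = base64.b64decode(header_value[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        # Covers malformed base64 and invalid UTF-8.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok
```

The point is that the decision is made on the server, so a robot that never sends valid credentials never receives the file. A JavaScript password cannot offer that.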
Put up a robots.txt file
This will keep "law-abiding" robots from spidering the specified pages, but it does nothing to stop robots that don't follow those rules. In fact, it acts as a flag to some that those directories might contain sensitive documents and materials.
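If you want to check how a well-behaved robot would read your rules, Python's standard `urllib.robotparser` module can evaluate them. This is a small sketch: the helper name `is_allowed`, the two-line rule set, and the example.com URLs are placeholders, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

# A tiny rule set for illustration: every robot is asked to stay out of /xyz/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /xyz/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant robot with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/xyz/secret.pdf"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/index.html"))
```

Keep in mind that this only models robots that choose to obey the file. A robot that ignores robots.txt can still request the page.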
When building and maintaining a website, it's important to keep security in mind at all times. "Security through obscurity", the idea that people won't find a page if it isn't linked, is an incorrect theory. If you put up a document or file on a Web page, you should assume that someone will find and read it. If you don't want it found, think really carefully before you post it to your website.
Sample robots.txt
Here /xyz/ is a sample web directory where your .doc, .pdf, .html, and similar files are placed.
User-agent: Googlebot # Googlebot
Disallow: /xyz/
Crawl-delay: 100

User-agent: Googlebot-Image # Googlebot
Disallow: /xyz/
Crawl-delay: 100

User-agent: Mediapartners-Google* # GoogleMedia Partners
Disallow: /xyz/
Crawl-delay: 100

User-agent: AdsBot-Google # Google Adsbot
Disallow: /xyz/
Crawl-delay: 100

User-agent: Googlebot-Mobile # Google Mobile robot
Disallow: /xyz/
Crawl-delay: 100

User-agent: GurujiBot
Disallow: /xyz/
Crawl-delay: 100

User-agent: ia_archiver # Alexa
Disallow: /

User-agent: msnbot # MSN search bot
Disallow: /xyz/
Crawl-delay: 100

User-agent: Slurp # Yahoo
Disallow: /xyz/
Crawl-delay: 100

User-agent: Teoma # Ask Jeeves
Disallow: /xyz/
Crawl-delay: 100

User-agent: Scooter # Altavista
Disallow: /xyz/
Crawl-delay: 100

User-agent: Robozilla # Open directory project
Disallow: /xyz/
Crawl-delay: 100

User-agent: baiduspider # Baidu
Disallow: /xyz/abc/
Disallow: /xyz/
Crawl-delay: 100

User-agent: lycos # Lycos
Disallow: /xyz/
Disallow: /xyz/abc/
Crawl-delay: 100

User-agent: gulliver # Northern Light
Disallow: /xyz/
Crawl-delay: 100

User-agent: Arachnoidea # Euroseek
Disallow: /xyz/
Crawl-delay: 100

User-agent: Gulper # Yuntis
Disallow: /xyz/
Crawl-delay: 100

User-agent: Fluffy the spider # Search Hippo
Disallow: /xyz/
Crawl-delay: 100

User-agent: MSICrawler # MS Internet Explorer for offline viewing
Disallow: /xyz/
Crawl-delay: 100

User-agent: Echo!
Disallow: /xyz/
Crawl-delay: 100

User-agent: * # Disallow all other spiders
Disallow: /
Disallow: /xyz
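Since most of the records in the sample repeat the same rules for a different robot, you may prefer to generate the file rather than type it by hand. The sketch below assumes a hypothetical helper named `build_robots_txt`, which is not part of any library. Note also that Crawl-delay is a non-standard directive that not every crawler honors.

```python
def build_robots_txt(user_agents, disallowed_paths, crawl_delay=None):
    """Build robots.txt text with one record per user agent, all sharing the same rules."""
    records = []
    for agent in user_agents:
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in disallowed_paths]
        if crawl_delay is not None:
            lines.append(f"Crawl-delay: {crawl_delay}")
        records.append("\n".join(lines))
    # Records are separated by a blank line, as in the sample above.
    return "\n\n".join(records) + "\n"


print(build_robots_txt(["Googlebot", "Slurp", "msnbot"], ["/xyz/"], crawl_delay=100))
```

Robots that need different rules, such as the ia_archiver and catch-all records in the sample, can be generated with separate calls and joined together.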