Beyond HTML: Security Concerns With Google

Now that Google is indexing a wide range of document types beyond HTML and plain text formats, potential security concerns are cropping up, both for searchers and Webmasters.

From the searcher's point of view, the concern is that you might unwittingly expose yourself to viruses embedded in non-HTML files, such as Word macro viruses.

Until recently, search engines delivered you only to “safe” HTML or text files. Even these types of files could try to harm you, for example through JavaScript exploits. However, anyone who browses the Web is already routinely exposed to such threats and generally doesn’t have problems.

In contrast, people do not routinely open data documents such as Word or Excel files from those they do not know. Google has changed this, because its search results now contain direct links to such files from across the Web. These direct links mean that users might unwittingly open infected files.

For example, try a search for “clearcutting and fish populations in idaho.” The second result is an oddly named document called “Clearcutting in.” If you were to click on this link, the document would not load in your browser. Instead, your computer would launch Microsoft Word (assuming you have it installed).

This is because the link leads to a .doc file, a data file used by Microsoft Word. Such files can contain viruses, and if you open one without virus protection, you are exposed to any virus inside.

The safe alternative is to always view such results using the “View as HTML” link that Google provides. You’ll see this link any time Google lists a file that is not in HTML or plain text format. Following it shows you a safe HTML version of the document in your browser.
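The check a cautious searcher makes by eye, looking at where a link ends before clicking it, can be sketched in a few lines of Python. This is only an illustration: the function name `needs_caution` and the extension list are assumptions made for the example, not Google's actual list of indexed formats.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed, illustrative list of extensions that usually open in an external
# application rather than in the browser. It is not Google's list.
RISKY_EXTENSIONS = {".doc", ".xls", ".ppt", ".rtf"}


def needs_caution(url):
    """Return True if the URL's path ends in one of RISKY_EXTENSIONS."""
    # urlparse separates the path from any query string or fragment,
    # so "report.doc?id=3" is still recognized as a .doc link.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in RISKY_EXTENSIONS


if __name__ == "__main__":
    for link in ("http://example.com/reports/clearcutting.doc",
                 "http://example.com/index.html"):
        advice = "use View as HTML" if needs_caution(link) else "ordinary Web page"
        print(link, "->", advice)
```

An extension is only a hint about the file type, so a check like this supplements virus protection rather than replacing it.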

Ideally, Google would switch things around. I think by default the main link should bring up the safe HTML version while the “View as HTML” link would instead say something like “View Original File Type.” That would greatly reduce the odds of searchers getting accidentally infected by a virus. Google says it’s something it’ll consider.

“We’re going to continue to take a close look at this because, as you know, our users and their experience with Google is our number-one priority,” said spokesperson David Krane.

Krane also said that Google is noticing that when non-HTML content is offered, many users are opting to use the “View as HTML” choice. Aside from avoiding viruses, another good reason to do this is that the HTML versions are typically smaller than the actual data files, which means they load faster.

It is also important to note that while searchers could in principle pick up a virus this way, it doesn’t seem to have actually happened.

“We’ve yet to see email from any of our users complaining about computer viruses that they obtained via our search results,” Krane said.

Meanwhile, some Webmasters are reportedly shocked to discover that Word documents, Excel files, and other material they make available through public Web sites can now be found by searching at Google. There’s even the further concern that some of these documents might contain sensitive information, such as credit card numbers or password information.

The reality is that Google hasn’t created a security problem with these documents. It has simply exposed them. Any document that is made available on an Internet server (be it Web, FTP, Usenet, etc.) can be found by anyone. People can (and do) even create their own spiders to seek documents of particular types, such as email harvesters that roam the Internet in search of email addresses.

If a document is sensitive, don’t place it on the Internet, period. What if you must expose it to the Internet, so that selected individuals outside your company or organization can access it? Then establish a password protection or “authentication” system for your Web server, and make these documents available only to those who have a username and password.
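To make the idea concrete, here is a minimal Python sketch of the decision a password-protected server makes on each request, assuming the common HTTP Basic scheme. The `USERS` table and the function names are hypothetical. In practice you would normally switch on your Web server's built-in authentication rather than write this yourself, and you would serve the documents over an encrypted connection.

```python
import base64
import hmac

# Hypothetical credentials for illustration only. A real system would store
# password hashes, not plain-text passwords.
USERS = {"partner": "s3cret"}


def is_authorized(auth_header, users=USERS):
    """Return True if an HTTP Basic Authorization header names a known user."""
    if not auth_header:
        return False
    scheme, _, encoded = auth_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 as well as bytes that are not valid UTF-8.
        return False
    username, sep, password = decoded.partition(":")
    if not sep:
        return False
    expected = users.get(username)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8"))


def status_for(auth_header):
    """Return 200 (serve the document) or 401 (ask for credentials)."""
    return 200 if is_authorized(auth_header) else 401
```

A request that arrives with no credentials at all gets the 401 response, so the document itself is never sent.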

Authentication systems will stop crawler-based search engines in their tracks. They are an even better solution than a robots.txt file. Listing the sensitive files you don’t want a spider to index in robots.txt essentially hands a menu of those files to any human who reads it. An authentication system reveals nothing, and it has the added plus of keeping humans out as well.
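To see why a robots.txt file can work as a menu, consider how little effort it takes to pull the "hidden" paths out of one. The Python sketch below uses a made-up robots.txt. The file contents and the function name `disallowed_paths` are invented for illustration.

```python
from typing import List

# Hypothetical robots.txt contents; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/payroll.xls
Disallow: /internal/passwords.doc  # staff only
Allow: /public/
"""


def disallowed_paths(robots_txt: str) -> List[str]:
    """Return every path named on a Disallow line, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    # Anyone can fetch a site's robots.txt and read it the same way.
    for path in disallowed_paths(ROBOTS_TXT):
        print(path)
```

Well-behaved spiders treat these lines as places to stay out of. A curious human can treat the same lines as a list of places to look.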

Keep this in mind: None of the major search engine spiders will try to access authenticated information. However, a custom spider or a nefarious human may still try to hack in. Authentication is a barrier to them, but not absolute protection.
