Computer Counselor
Advanced Web Search Techniques That Help Your Practice

It is possible to locate valuable information that most search engines ignore

By Carole Levitt
Carole Levitt is an attorney and president of Internet For Lawyers.

It may not be widely known that much of the Internet is "invisible" to the general-purpose search engines that most people use. In fact, as many as 550 billion posted pages are not found when a searcher enters key words. These pages are invisible because they have never been located and indexed (or "crawled" in Web parlance) by a search engine. To be visible to search engines, Web pages must contain text (i.e., words, not graphics), be coded in HTML (the formatting language used in most Web pages), and be submitted for indexing to a search engine.

Submission, however, does not guarantee that a page will be indexed. Properly submitted, HTML-coded Web pages may still be invisible for numerous reasons. For example, HTML-coded pages are invisible if they are password protected, require registration, or are part of a database within a site. Some pages that could be crawled are not because they contain ephemeral data (sports scores, news, weather, and stock quotes, for example). Additionally, pages that are not coded in HTML are invisible to most search engines, so a search for information located in PDFs (Portable Document Format), sounds, or images is typically fruitless with a search engine.
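
The visibility conditions described above amount to a checklist, which can be sketched in a few lines of Python. This is an illustration only: the `Page` fields and the `is_visible` function are hypothetical names introduced here, and real search engines apply many more rules than the ones listed in this article.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical description of a Web page, for illustration only."""
    is_html: bool = True               # coded in HTML rather than PDF, sound, or image
    has_text: bool = True              # contains words, not only graphics
    submitted: bool = True             # submitted to a search engine for indexing
    password_protected: bool = False
    requires_registration: bool = False
    inside_database: bool = False      # part of a database within a site
    ephemeral: bool = False            # sports scores, news, weather, stock quotes


def is_visible(page: Page) -> bool:
    """Return True if the page is likely visible under the article's rules."""
    meets_basics = page.is_html and page.has_text and page.submitted
    blocked = (
        page.password_protected
        or page.requires_registration
        or page.inside_database
        or page.ephemeral
    )
    return meets_basics and not blocked


print(is_visible(Page()))                      # an ordinary, submitted HTML page
print(is_visible(Page(is_html=False)))         # a PDF document
print(is_visible(Page(inside_database=True)))  # a record held in a site's database
```

Even a page that passes every check may go unindexed, since submission does not guarantee indexing. The sketch only identifies pages that are almost certainly invisible.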

An attorney can benefit greatly from knowing how to search unindexed Web pages, which contain a treasure-trove of information, including hard-to-find PDF government documents and data posted on Usenet forums. To search these sources, users can access the Google search engine and click on Group Search. This grants searchers access to an archive of 650 million Usenet postings from 1995 to the present. A simple rule for searching is to try different words, phrases, and subjects in conjunction with a date restriction. For example, a search for "Firestone tires" that limits the date to 1998 generates a results list that includes a Usenet post that reads, "All I know is that I had to have all four Firestones on my 4x4 Ranger replaced THREE TIMES in 13,000 miles!! They start feeling out of balance..." The possible value of this post or a similar one to a practitioner who is handling a personal injury or product liability case is clear.
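
The rule of combining a phrase with a date restriction can be pictured as a simple filter. The Python sketch below is a minimal illustration under stated assumptions: the sample postings are invented for this example, and `search_posts` is a hypothetical helper, not a description of how Google searches its Usenet archive.

```python
from datetime import date

# Invented sample postings, for illustration only.
posts = [
    {"date": date(1998, 5, 14), "text": "Had all four Firestone tires replaced again."},
    {"date": date(2000, 9, 2), "text": "Any news on the Firestone tires recall?"},
    {"date": date(1998, 7, 30), "text": "Looking for a good mechanic in Pasadena."},
]


def search_posts(posts, phrase, year):
    """Return posts from the given year whose text contains the phrase."""
    phrase = phrase.lower()  # compare case-insensitively
    return [
        p for p in posts
        if p["date"].year == year and phrase in p["text"].lower()
    ]


for hit in search_posts(posts, "Firestone tires", 1998):
    print(hit["date"], hit["text"])
```

Changing the phrase or the year and rerunning the filter mirrors the advice to try different words, phrases, and subjects until a useful posting turns up.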

Usenet posts also contain information about people. A search on a person's name or e-mail address in Google Groups should reveal the postings made under that name or address, which can help attorneys learn a client's (or potential client's) hobbies or interests. When I ran the name of the vice president of Elite.com through Google Groups, I learned that he was a Star Trek fan. The same kind of search may help locate a missing witness. People who post messages on Google Groups often reveal, deliberately or carelessly, such hard-to-come-by information as their e-mail addresses, home or work addresses, phone numbers, and employer names.

Many sites contain both a visible and an invisible component. The State Bar of California's site, for example, can be located on a general-purpose search engine by entering "State Bar of California," but the database of individual State Bar records located on the site cannot be searched in the same manner (by entering the attorney's name and "California State Bar number," for example), because databases are invisible to search engines. If a California attorney types his or her name into any search engine, the resulting list of hits will not include the attorney's State Bar record. The record can only be retrieved by a searcher who first visits the State Bar's site (www.calbar.org), clicks on Member Records Online and then on Go Directly to Member Search, and types a name into the blanks.

Although a general-purpose search cannot reveal the contents of a database, some search engines can at least indicate its presence. Google and Alta Vista, for example, are better than others at finding invisible databases. They both bring a searcher to the Member Records Online database (at http://www.calsb.org/MM/SBMBRSHP.HTM) if a searcher types "California State Bar member" into the search engine's search box.

Database Locators
Another way for a searcher who is unfamiliar with the URL to find the State Bar database is by using a directory to invisible sites and databases. Typically, these are arranged by topic. Examples of database locators include Bright Planet's completeplanet.com and Intelliseek's Invisibleweb.com. Both are free. The sites can also be searched by key word. Completeplanet.com claims to search through 38,500 of an estimated 200,000 invisible databases, while Invisibleweb.com claims to search over 10,000 databases. These claims are a bit misleading because neither site actually searches the data within the invisible databases. Instead, like Google and Alta Vista, they point the searcher to the invisible database where the search can then be conducted.

A searcher could learn of the State Bar's member records database by browsing the topical directory at Invisibleweb.com and clicking on Find a Lawyer after clicking Legal. However, taking the topical directory route at completeplanet.com does not work as well because there is no subtopic relating to finding a lawyer after selecting Law and Politics. Conducting a search at completeplanet.com by entering the words "California State Bar member" into the Find Database search box works better.
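
The two routes into a database locator, browsing by topic and searching by key word, can be sketched as follows. This Python example is a hypothetical model, not a description of how Invisibleweb.com or completeplanet.com is built: the `DIRECTORY` entries, `browse`, and `find_database` are invented for illustration, and only the State Bar URL comes from this article. In both cases the result is a pointer to a database, not the records inside it.

```python
# Hypothetical directory entries; the second one is entirely invented.
DIRECTORY = [
    {
        "name": "California State Bar Member Records Online",
        "topics": ("Legal", "Find a Lawyer"),
        "url": "http://www.calsb.org/MM/SBMBRSHP.HTM",
    },
    {
        "name": "Example Patent Database",
        "topics": ("Legal", "Intellectual Property"),
        "url": "http://patents.example.com/search",
    },
]


def browse(topic, subtopic):
    """Topical route: return the databases filed under topic, then subtopic."""
    return [e for e in DIRECTORY if e["topics"] == (topic, subtopic)]


def find_database(query):
    """Key word route: return databases whose name contains every query word."""
    words = query.lower().split()
    return [e for e in DIRECTORY if all(w in e["name"].lower() for w in words)]


for entry in browse("Legal", "Find a Lawyer"):
    print("Browse:", entry["name"], entry["url"])

for entry in find_database("California State Bar member"):
    print("Search:", entry["name"], entry["url"])
```

When a directory has no matching subtopic, as with Law and Politics at completeplanet.com, the browse route returns nothing and the key word route is the better choice.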

Searchers who gather consumer and competitive intelligence systematically may consider a subscription to Intelliseek's Corporate Intelligence Service. CIS sifts through the invisible Web to gather gossip, rumors, trends, and opinions from Usenet forums, discussion groups, and news. A subscription to monitor information and aggregate the results can cost between $100,000 and $300,000 annually. Those who can afford a CIS subscription are primarily brand managers, public relations firms, and product researchers, but CIS's services may also attract law firms that are involved in high-stakes class actions or product liability lawsuits.

Another type of file that most general-purpose search engines cannot index is a PDF. The use of PDFs is increasing throughout the Internet, but only Google and a search engine run by Adobe (the company that created the PDF) index PDFs. Currently, Google allows users to search more than 13 million documents in this format, while Adobe's search engine indexes over 1 million PDFs. If a searcher knows in advance that the document being sought is a government document (the federal and state governments make prolific use of the PDF format), then the searcher should first try Searchpdfadobe or Google.

The Government's Document Search Site
Another choice for locating government documents is to search the federal government's portal, FirstGov.gov, which was launched in September 2000 as the official U.S. government portal to 30 million pages of government information. The site also now indexes 18 million state government pages and purports to use a powerful search engine that searches every word of every U.S. government document in a quarter of a second or less. However, both the Google and Searchpdfadobe search engines proved more powerful than FirstGov in a test search for a congressional staff report in PDF titled "Know the Rules, Use the Tools" by Senator Orrin Hatch. FirstGov's search engine is also now established at the U.S. Government Printing Office site. If a searcher does not know whether a particular document being sought is a bill, statute, or regulation, the searcher can query the entire site at once rather than piecemeal.

A variety of sites index ephemeral data, sounds, or images, all of which are typically invisible to general-purpose search engines. To access ephemeral data, such as current news, a searcher is advised to use Moreover.com. At 6:55 p.m. on July 5, 2001, I typed "Ariel Sharon and France" into the Moreover search box. The result was an article posted only eight minutes earlier. Running the same search through Google resulted in an article dated June 1, 2001. To locate sounds, go to Findsounds.com, type words or phrases to describe the sounds (e.g., "cash register"), and listen to 20 different cash register sounds. To locate images, search Alta Vista by clicking on Images or click on Advanced Search at Google and scroll down the page to locate the image search box.

According to a survey by Danny Sullivan of Searchenginewatch.com, searchers report a success rate of only 77 percent. Another survey shows that 71 percent of searchers regularly encounter frustration when they cannot find a relevant Web site, with the boiling point occurring within roughly 12 minutes. Chances are that many of these unsuccessful searches are being conducted on a general-purpose search engine when the desired information can be found on an invisible page. With just a few tools, however, searchers may find their access to useful information surging.

