Evaluating Search Engines for Chemistry - II

Part of The Alchemist's Lair Web Site
Maintained by Harry E. Pence, Professor of Chemistry, SUNY Oneonta, for the use of his students. Any opinions are totally coincidental and have no official endorsement, including the people who sign my pay checks. Comments and suggestions are welcome (pencehe@oneonta.edu).

Last Revised April 11, 2001


Return to the contents page of the Spring, 2001 issue of the Computers in Chemical Education Newsletter.


Evaluating WWW Search Engines for Chemistry, Harry E. Pence, Mariah Bernard, Aubray Cerniglia, Linda Mokay, and Lut Wong, SUNY Oneonta, Oneonta, NY


INTRODUCTION

Two years ago, a group of senior chemistry majors at SUNY Oneonta set out to update the work of Alexander Lebedev (Moscow University), who had created a web site entitled Best Search Engines for Finding Scientific Information on the Web. In order to measure the comprehensiveness of the important WWW search engines, Lebedev compared the number of hits recorded by eleven different search engines for eight different keywords important in physics and/or chemistry. Unfortunately, he had not revised his results since 1997, and the rapid rate of change on the WWW suggested that his results needed to be brought up to date. The results of this 1999 study were made available as a web page entitled, "Evaluating WWW Search Engines for Chemistry." In the past two years, the WWW has continued to change rapidly. In particular, financial problems have forced many of the dot.com companies, including those that run search engines, to cut expenses and reduce staff. There is also a powerful new search engine, called Google, that seems to be an excellent choice for scientists. A preliminary evaluation of Google has been reported, but Google has not been compared with the other commonly-used engines.

CRITERIA FOR SELECTING A SEARCH ENGINE

As noted in the previous article in this series, there are at least three important criteria that should be used to evaluate search engines: comprehensiveness, currency, and efficiency. Comprehensiveness, the measure of what fraction of the total web sites the search engine index actually includes, is particularly important for chemists, because they are looking for unusual information that may not be included in smaller search engine indices. Currency measures how often the search engine revisits sites to determine whether or not there have been any changes. This is important to all web searches, since failure to revisit sites allows dead links to be included in the index. The final important criterion is efficiency. Are the most useful sites listed early in the search results? This is probably the most difficult criterion to evaluate quantitatively.

There are many sources of information that can be consulted to evaluate a search engine. Reviews in popular computer journals can be helpful, but they are aimed at general users, rather than scientists. Steve Lawrence and C. Lee Giles of the NEC Research Institute have written two excellent articles comparing the increase in the size of the search engine indices with the size of the web and evaluating the percentage of dead links on the various engines (Science, 280, April 3, 1998, pp. 98-100 (paper #1 on the WWW) and Nature, 400, pp. 107-9, 1999 (paper #2 on the WWW)). Although these articles are two years old, they are still relevant. In addition, the references in "Following Search Engine Changes", which is found elsewhere in this web site, may also be useful. The purpose of this report is to continue Lebedev's work by providing an up-to-date review of major search engines focused on the needs of chemists.

Lebedev chose to focus mainly on comprehensiveness. At the time, he argued that only 10-20% of the total number of documents found by search engines are scientific. The increasing commercialization of the web probably indicates that the current fraction of scientific documents is even lower. A larger index should increase the probability that all of the available documents will be found, as well as the probability of finding unusual information, which would presumably include chemical terms. The previous article in this series made minor changes in both the search terms and the search engines, and this latest project has made further modifications.

The same abbreviated list of scientific terms was used in this study as in the previous report. The engines used were Altavista, Hotbot, Excite, Google, Northern Light, and MS Network. The search terms were based on Lebedev's list, namely, crystallography, catalysis, benzene, luminescence, ferroelectric, and EXAFS. The number of hits was recorded for each term on each search engine. The list of search engines used in the previous study was modified by replacing Infoseek with Google. In 1998, Infoseek was acquired by the Disney corporation and transformed into the basis for a web portal, called Go.com. More recently, Disney has switched to the GoTo.com search engine, which bases the order of the hits returned on the willingness of the sites to pay for placement. Disney has also announced that it would shut down Infoseek. Since the future of Infoseek is highly uncertain, it seemed reasonable to eliminate this engine.

Google is reported to offer several potential advantages that justify including it in the evaluation. It currently claims that it has an effective index of about 1.3 billion pages, based on direct links to over 602 million pages as well as 648 million URLs that are "partially linked." Thus Google is claiming to provide access to an unprecedented fraction of all the indexable pages on the entire web. Another advantage of Google is the use of PageRank to identify the most relevant sites. PageRank determines the "importance" of a site by the number of other sites that link to it, but links from pages that are themselves ranked as "important" are counted more heavily in determining the ranking. If it performs as advertised, the combination of comprehensiveness and relevance can be very powerful.
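The idea behind PageRank described above can be illustrated with a small sketch. This is not Google's actual implementation; it is a minimal version of the commonly published iterative formulation, and the damping value of 0.85, the function name pagerank, and the four-page toy web are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Estimate the importance of each page from the link structure.

    links maps each page to the list of pages it links to.
    The damping value of 0.85 is an assumed, commonly quoted default.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes its rank to the pages it links to, so a link
        # from an "important" page counts for more than one from a minor page.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share

        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical four-page web, for illustration only.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web, page C receives links from three pages and ends up with the highest score, while page D, which no page links to, ends up with the lowest.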

In July 2000, Danny Sullivan, one of the leading search engine experts, compared the major search engines using a set of what he called obscure terms, that is, terms where no engine produces more than 100 hits. Even though his terms weren't specific to chemistry or science, this may be a good measure of general search engine performance and may also relate to performance on chemistry searches. Sullivan found that Google was clearly the best, but the FAST index (which powers the search engine called www.alltheweb.com) also did quite well. On the other hand, AltaVista was described as a major disappointment because it did poorly compared to other search engines that claim to have smaller indices.

Lebedev had concluded his studies by identifying AltaVista as the most comprehensive search engine, and the previous article in this series agreed with his recommendation. The question was whether Lebedev's methodology would still place AltaVista at the top or would agree with Sullivan's results.

SEARCH ENGINE RESULTS

During March of 2001, students in the chemistry senior seminar at SUNY Oneonta once again surveyed the major web search engines. The main change from the previous survey was the replacement of Infoseek with Google, for the reasons described above. The results are shown in Table I (which will open in another window). The search terms used in this list are crystallography, catalysis, benzene, luminescence, ferroelectric, and EXAFS. The 1996 and 1997 results are from Lebedev's study, the 1999 results are from the previous SUNY Oneonta study, and the 2001 results are the current project.

In general, the number of hits for each search term on each engine has increased since the 1999 study, often quite substantially. This is in line with the fact that the WWW continues to expand explosively and also that most search engines are claiming that they have significantly increased their index sizes. The notable exception to this trend is Excite, which produced fewer hits in all but one case than in the 1999 study. This continues a trend that has been previously noted, namely, that Excite tends to return fewer hits with each successive survey. It would appear from these results that the Excite index is slowly but steadily decreasing in size.

Aside from Excite, all of the other search engines report more hits on each of the search terms. Google, the newest search engine, did extremely well on every search term. When searching for the term crystallography, Altavista returned more hits, but otherwise Google reported more hits than any other engine. Thus, Google appears to have one of the most comprehensive indices for science. As noted above, Danny Sullivan has reported similar results, with perhaps an even greater bias towards Google. Search engines like Hotbot and Northern Light, which have been strong performers in prior surveys, appeared to return significantly fewer hits than Google and Altavista.

COMMENTS ON THE METHOD

Those who search on one of the popular search engines naturally expect that the results will be drawn from the index associated with that engine (remember that search engines don't really search the web; they search an index of web sites that has been accumulated by automated web crawlers). In reality, search engines may use several different indices, including those associated with what appears to be a competitor. Danny Sullivan publishes a listing of search engine associations that attempts to clarify some of these relationships. For example, the MS Network search engine is currently a hybrid service, using a combination of their own directory, the Direct Hit service for the first few hits, and the Altavista search engine index. The algorithm that determines the order in which the hits from these various sources will be listed is determined by the MS Network. As a result, the ranking and number of hits for MS Network is different from that for Altavista, even though there is some overlap of index use. The average user is not concerned with these subtleties, however. The only important question is which engine, by whatever means, produces the best results.

Of potentially greater concern is the observation that the number of hits reported for a search term only crudely measures the search engine index size. As noted in a previous study, searching the same term on the same engine at two different times could give significantly different results. In a personal communication, Steve Lawrence (NEC Research Institute) suggested that this problem resulted because some engines limit the amount of search time, and when that time limit is reached, the search terminates, regardless of whether the entire index has been reviewed. The time limit is presumably determined by the number of searches in progress at a given time. Thus, the number of hits listed may vary depending on when a search is performed. On engines where this is known to be a problem, the searches in this study were repeated at different times to ensure that the results were approximately the same.
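The repeat-and-compare check described above can be sketched as a simple calculation. This is only one possible way to express "approximately the same"; the 10% tolerance, the function names, the engine labels, and the hit counts below are all hypothetical placeholders and are not values from this study.

```python
def relative_spread(counts):
    """Return (max - min) / mean for repeated hit counts of one search."""
    if not counts:
        raise ValueError("at least one hit count is required")
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    return (max(counts) - min(counts)) / mean


def is_stable(counts, tolerance=0.10):
    """Treat repeated counts as consistent if they differ by at most
    the given fraction of their mean (the 10% default is an assumption)."""
    return relative_spread(counts) <= tolerance


# Placeholder numbers for illustration only; NOT data from Table I or Table II.
repeated_counts = {
    ("engine_x", "benzene"): [41000, 40500, 41800],
    ("engine_y", "benzene"): [52000, 31000, 47500],
}

for (engine, term), counts in repeated_counts.items():
    status = "stable" if is_stable(counts) else "varies; repeat the search"
    print(f"{engine} / {term}: spread {relative_spread(counts):.1%} ({status})")
```

With these placeholder numbers, the first set of counts differs by only a few percent and would be accepted, while the second varies by nearly half of its mean and would call for further repeats.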

The suite of search terms used in this study represents a more serious problem. This concern was expressed in the earlier article, and further experience has reinforced this unease. What does it really mean when a search engine reports that it has over 100,000 hits? Even the most diligent researcher is unlikely to review more than the first few hundred hits, and most individuals will stop after a much smaller number. Indeed, many engines will limit the number of hits that you can access. Normally this is at a high enough number (perhaps a thousand) that few users encounter this limitation, but it still does suggest that the large numbers of hits recorded in the search data are not really searchable, either in theory or in fact.

Probably the preferred solution is to develop a new set of search terms that return fewer hits than those in the current list, perhaps fewer than 100 (the number used by Danny Sullivan). This should make the reported number of hits more accurate and meaningful but would lose the ability to compare the current results with those of previous studies. A preliminary survey of this type is reported in Table II (which will open in another window). The terms used were enediynes, calixarene, attosecond, oligosaccharide, and XANES. These new search terms did produce fewer hits but still failed to consistently achieve only a few hundred hits. The same list of search engines was used as that given above, but based on Sullivan's results it was decided to include the FAST search engine. It is hoped that in a future article these search terms can be further improved.

CONCLUSIONS

Changing the suite of search terms did have a significant effect. Whereas the first set of search terms (Table I) seemed to show that Google and Altavista had drawn ahead of the other engines in terms of index size, the second set of search terms (Table II) shows both Google and Northern Light to be more comprehensive than Altavista. Past problems with the Altavista counter suggest these comparisons may be more realistic than those in Table I. FAST, which had not been included in the previous survey, also did very well. Anecdotal evidence suggests that FAST is not commonly used by chemists, but these results suggest it probably should be considered as a viable option. Overall, this second set of results seems to agree with Sullivan's study that was mentioned earlier.

The strategy of choosing search terms that give fewer results does clearly produce different results, and these are thought to be a more accurate measure of index comprehensiveness. It is planned to continue to improve the accuracy of these results by refining the set of search terms as well as in other ways. At the conclusion of the previous paper in this series it was noted that the WWW is still in a state of rapid development, and even these conclusions must be considered to be tentative. That continues to be a prudent position.

