Evaluating Search Engines for Chemistry - III

Part of The Alchemist's Lair Web Site
Maintained by Harry E. Pence, Professor of Chemistry, SUNY Oneonta, for the use of his students. Any opinions are totally coincidental and have no official endorsement, including from the people who sign my pay checks. Comments and suggestions are welcome (pencehe@oneonta.edu).

Last Revised March 15, 2003


(Return to the contents page of the Spring, 2003 issue of the Computers in Chemical Education Newsletter.)


Evaluating WWW Search Engines for Chemistry III, Harry E. Pence, Diane Barber, Seth Knupp, Timothy Naples, Katherine O'Brien, Melissa Ogborn, SUNY Oneonta, Oneonta, NY


INTRODUCTION

This report continues the work begun in 1999 by a group of senior chemistry majors at SUNY Oneonta, who set out to update the 1997 work of Alexander Lebedev (Moscow University), "Best Search Engines for Finding Scientific Information on the Web." Lebedev compared the number of hits recorded by eleven different search engines for eight different keywords important in physics and/or chemistry. Since he had not revised his results since 1997, reassessing the situation seemed to be a good project. In fact, the original page is no longer available, apparently having been removed entirely in 2001. If you wish to check on this original research, a note at the end of this article suggests a web site that will accomplish this. The results of the 1999 study by SUNY Oneonta students are found on a web page entitled "Evaluating WWW Search Engines for Chemistry," and the results of the 2001 SUNY Oneonta study are found at "Evaluating WWW Search Engines for Chemistry II."

The rapid change in the WWW that was noted in previous articles has continued, probably exacerbated by the financial pressures from the bursting of the dot.com bubble. Thus, several of the engines that were included in the last survey no longer seem relevant. After going through considerable reorganization (including bankruptcy), Excite now describes itself as a "personalization portal" with metasearch capabilities. For this reason, it has been dropped from the current study, which is focused on crawler-based engines. In January 2001, NorthernLight announced that it would no longer provide free web searches for the general public. As of March, 2003, it was still possible to search NorthernLight using the new URL provided here, but the future status of this ability is questionable, and so this engine has also been dropped.

Two new search engines, Wisenut and Scirus, have been added to replace Excite and NorthernLight. Wisenut came online in 2001, and claims to have an index of 1.5 billion pages. Wisenut also claims to use a search algorithm that gives more relevant results. If these claims are true, it would suggest that Wisenut is comparable to the other engines in this study. Scirus was developed by Elsevier Science, a major scientific publishing house, and claims to be the most comprehensive science-specific search engine. It indexes not only web-based sources, but also an extensive journal database, including Elsevier's ScienceDirect, MEDLINE, Beilstein, BioMed Central and several preprint servers. It should be noted that the name Scirus is based on a character mentioned in "The Description of Greece" by Pausanias, which makes this one of the more esoteric search engine names.

CRITERIA FOR SELECTING A SEARCH ENGINE

As noted in the previous article in this series, there are at least three important criteria that should be used to evaluate search engines: comprehensiveness, currency, and efficiency. Comprehensiveness, the measure of what fraction of the total web the search engine index actually includes, is particularly important for chemists, because they are often looking for unusual information that may not be included in smaller search engine indices. Currency measures how often the search engine revisits sites to determine whether or not there have been any changes. This is important to all web searches, since failure to revisit sites allows dead links to be included in the index. The final important criterion is efficiency. Are the most useful sites listed early in the search results? This is probably the most difficult criterion to evaluate quantitatively, since what is important to an inorganic chemist may be of no interest to an organic chemist and vice versa. In fact, the needs of a single searcher may vary from day to day. Danny Sullivan has an excellent (and highly relevant) summary of both the issues and problems involved in a web site titled, "In Search Of The Relevancy Figure."

As noted in earlier studies, the total number of hits reported by an engine is taken as a measure of comprehensiveness; however, when the number of hits is very large, the number of hits shown may not be accurate. In the previous study, it was noted that the terms originally used by Lebedev returned so many hits that some engines gave only estimated results and some engines truncated the search before it was complete. To alleviate this problem, a revised set of terms that each gave only a few hundred hits was suggested in the previous article. This list was used in the current study. As before, only single-word search terms were used in order to focus on the index size rather than the way the search engine handled compound search terms. To test for freshness, the first ten hits were checked for dead links, and the total number of dead links for all of the terms on an engine was taken as a measure of how up to date the engine index was. Scirus represented a special challenge, since it accesses a large amount of material that is in Elsevier journals rather than freely available on the WWW. For those who have ScienceDirect and can access these journals online, this could be a great asset. Still, it represented something of an unfair advantage when compared with engines that only used web sources. To make the comparison more equivalent, the study of Scirus was done two ways: web sources only and all sources.

SEARCH ENGINE RESULTS

As in past studies, senior chemistry majors at SUNY Oneonta surveyed the major web search engines over several days in March. As noted earlier, Excite and NorthernLight were removed from consideration and replaced with Scirus and Wisenut. Each search was performed twice, usually on two separate days. The results are shown in Table I (which will open in another window). The new set of search terms was used: enediynes, calixarene, attosecond, dendrimer, oligosaccharide, and XANES. These terms were also part of the 2001 results and so a comparison of the current results with 2001 is provided.

For each search term, the engines were arranged in order, so that the engine with the most hits was given a one and the one with the least hits was assigned a seven. These values were then averaged for the six search terms. The results obtained from using the complete Scirus database, including ScienceDirect, were not included. The resulting rankings were Google (1.3) < Hotbot (2.5) < FAST (3.3) < Scirus (4.5) < MSNetwork (4.7) = Altavista (4.7) < Wisenut (7). Altavista ranked fifth or sixth on every term except oligosaccharide, where it was first. As noted below, one of the top ten listings for oligosaccharide was a pornographic site. The high number of hits obtained on Altavista for this term may be coincidence or it may reflect further contributions of this type. If the complete Scirus database had been included, it would have had a rank of 2.3, second highest among the engines tested.
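The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the procedure actually used by the students: the engine names and hit counts are hypothetical placeholders (the real counts are in Table I), and since the article does not say how ties were handled, the sketch simply keeps tied engines in the order in which they were entered.

```python
from statistics import mean


def average_ranks(hits):
    """Map each engine to its mean rank across all search terms.

    hits: {term: {engine: hit_count}}. For each term the engine with the
    most hits gets rank 1, the next gets rank 2, and so on.
    Ties keep input order (an assumption; the article does not specify).
    """
    ranks = {}
    for counts in hits.values():
        ordered = sorted(counts, key=counts.get, reverse=True)
        for position, engine in enumerate(ordered, start=1):
            ranks.setdefault(engine, []).append(position)
    return {engine: round(mean(r), 1) for engine, r in ranks.items()}


# Hypothetical hit counts, for illustration only (not the Table I data).
hypothetical_hits = {
    "term1": {"EngineA": 900, "EngineB": 400, "EngineC": 150},
    "term2": {"EngineA": 700, "EngineB": 800, "EngineC": 90},
    "term3": {"EngineA": 1200, "EngineB": 300, "EngineC": 310},
}
result = average_ranks(hypothetical_hits)
print(result)

# Consistency check on the Altavista figure quoted in the text: one first
# place plus five places that are each fifth or sixth must average between
# these two bounds, and the reported 4.7 does.
lowest_possible = mean([1] + [5] * 5)
highest_possible = mean([1] + [6] * 5)
print(lowest_possible, highest_possible)
```

Because only rank positions are averaged, an engine that wins one term by a wide margin gains no more than one that wins narrowly, which is why a single first place still leaves Altavista near the bottom of the list.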

As seen from the above rankings, Google continues to be the most comprehensive WWW engine, even when compared with Scirus. When limited to web sources, Scirus was not nearly as comprehensive as Google, and both Hotbot and FAST were better than Scirus. MSNetwork, Altavista, and Wisenut brought up the tail end of the pack. Altavista, which was at one time a favorite engine for many chemists, has apparently not recovered from the financial problems that prevented it from expanding its index. Despite the claims mentioned earlier, Wisenut is clearly not competitive in terms of comprehensiveness.

In contrast to the previous study, searching the same term on the same engine at two different times gave essentially the same results in almost every case. It had previously been noted that some engines limit the amount of search time, and when that time limit is reached, the search terminates, regardless of whether the entire index has been reviewed. This can cause the number of hits returned to vary depending on when a search is performed. Despite running the searches at two different times, very little variation was noted. Either this study was fortunate enough to occur at a slow time on the various engines, or else the strategy of using terms that produced relatively few hits was successful.

To measure freshness, the first ten hits were checked to make sure that the URL was still in existence. Failed links are an indication that the engine has failed to keep up with recent developments on the WWW. This check was usually performed on two different days for each engine in order to attempt to avoid cases where the page still existed, but a server simply was not available at a given time. The results are shown in Table II below, with the engines listed from least to most dead links.

Table II. Number of Dead Links on Tested Search Engines

Engine Name     Number of dead links
MSNetwork       0.5
Google          2
Hotbot          3.5
FAST            4.5
Scirus          4.5
AltaVista       5
Wisenut         7
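The freshness tally behind Table II can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure the students followed (the article does not say how the links were checked): the function names is_dead and count_dead are hypothetical, the URLs are placeholders, and the half-integer values in the table are assumed to come from averaging the totals found on the two days the check was run.

```python
import urllib.error
import urllib.request


def is_dead(url, timeout=10):
    """Return True if the URL cannot be fetched (HTTP error or no response).

    Uses a HEAD request; some servers reject HEAD, so a real survey
    might need to fall back to GET before declaring a link dead.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except (urllib.error.URLError, ValueError, OSError):
        return True


def count_dead(top_hits_by_term, check=is_dead, top_n=10):
    """Total dead links among the first top_n hits of every search term."""
    return sum(
        1
        for urls in top_hits_by_term.values()
        for url in urls[:top_n]
        if check(url)
    )


# Offline illustration with placeholder URLs and a stand-in checker, so the
# example runs without network access. One link fails on the first day only.
hypothetical_hits = {
    "term1": ["http://example.org/a", "http://example.org/gone"],
    "term2": ["http://example.org/b"],
}
dead_on_day1 = {"http://example.org/gone"}
dead_on_day2 = set()

day1 = count_dead(hypothetical_hits, check=lambda u: u in dead_on_day1)
day2 = count_dead(hypothetical_hits, check=lambda u: u in dead_on_day2)
score = (day1 + day2) / 2  # assumed averaging over the two check days
print(score)
```

Repeating the check on a second day, as described in the text, keeps a server that is only temporarily unavailable from counting fully against an engine.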

Based on these results, MSNetwork and Google provide the greatest freshness. MSNetwork is based on Looksmart, which relies on human editors more than a web crawl. Wisenut, which is also owned by Looksmart, is a particular disappointment, since, despite claims that it was going to compete with Google, it is clearly the poorest of the engines tested. Somewhat to our surprise, one of the first ten links from AltaVista for the term oligosaccharide was a pornography site. It is well known that the webmasters of porn sites are especially adept at sneaking into search results, but it was unusual to see this happen with a scientific term that would not seem to provide the target audience these individuals usually look for. Either this is an unusually erudite porn master or perhaps there is a new meaning to the term "sugar daddy." Regardless, it must be counted as a black mark against the relevance of Altavista that it included this site in the top ten hit list, and so it was counted as a dead link.

Many web sites also rank search engines, although they do not use scientific terms as a criterion. For example, Greg Notess has reported on search engine index statistics as of December, 2002. He agrees that the Google index is clearly the largest, but suggests FAST is closer behind than would be indicated in this study. Both Altavista and Wisenut do much better in his results than in the current study. Whether this difference represents very recent changes in these engine indexes or a weakness in finding scientific sites cannot be determined on the basis of the available information. Notess has also reported on index freshness, and also finds that MSNetwork and Google do the best job of keeping up to date, which agrees with the result of this study. The references in "Following Search Engine Changes", an article found elsewhere on this web site, may be useful.

COMMENTS ON THE METHOD

The strategy of using search terms that do not return a large number of hits seems to be successful, but does it still represent a realistic measure of comprehensiveness? Few, if any, researchers will go beyond the first few hundred hits, and so very high numbers of hits are basically meaningless. When faced with such a large number of hits, the best strategy is to narrow the focus of the search by combining several terms. Thus selecting search terms that give relatively few hits probably not only gives more reproducible results but also provides a more realistic measure of actual practice.

This approach is not without problems, however. The alternative list of search terms suggested in 2001 did, indeed, give a limited number of hits then; however, all of the engines report more hits on each term in 2003, and in some cases the increase has been very large. This suggests that in future surveys, it will be necessary to modify the list of search terms to maintain the goal of obtaining a relatively small number of hits from year to year, despite the continuing increase in the size of the WWW. It may well be necessary to eliminate some terms that produce so many hits that the counting becomes questionable.

CONCLUSIONS

Which is the best search engine for chemistry? There have been a number of claims that one engine or another has met or surpassed Google, but the results of this study suggest that for searches which are intended to find only web sources, Google continues to offer the best combination of comprehensiveness and freshness. Scirus does present a very attractive alternative, especially for those who have on-line access to ScienceDirect and can print out the articles. For such individuals, the best alternative may be a combined search, using Google to locate web references and Scirus to explore journals and similar sources. (Remember that the ScienceDirect database does not include many of the major chemistry journals, and so this is not a substitute for Chemical Abstracts.) Among the second tier of engines, FAST and Hotbot seem to offer a good combination of comprehensiveness and freshness. MSNetwork does have the advantage of having few dead links, but neither Altavista nor Wisenut appears to have much to recommend it.

As the previous studies in this series have demonstrated, search engines are changing so rapidly that any evaluation will remain accurate only a short time. Several recent announcements suggest that this will continue to be true for some time. In February, Google announced that it had purchased Pyra Labs, developers of a product called Blogger, which is one of the most widely used weblog publishing tools. Chris Sherman has pointed out that more rapid access to these ad-free weblogs may open some new avenues not only for Google's news and advertising operations, but also for web site evaluations since web logs often offer evaluations of web sites. If this can be used to help measure the relevance of web sites, it might be a significant new way to decide the order in which hits are returned for search terms.

Early this year, Overture, a search engine that determines the order in which sites are returned by the willingness of the site owner to pay, acquired Altavista, one of the premier crawler engines on the web. This was reasonable, since Overture has had an ongoing partnership with Altavista, and acquiring the Altavista technology would enable Overture to be more competitive with Google in selling complete packages of search products. A short time later, Overture announced that it was acquiring FAST, another major search engine. There is considerable logic in either of these purchases, but the combination of both purchases seems redundant. Perhaps Overture bought both of these engines to prevent one or the other from being purchased by a competitor. If this is the case, it seems likely that money and resources will be dedicated to only one of these two engines, and the other engine will not be kept up to date. Of course, it is too early for any of these changes to be evident in the results of the current study, but it seems likely that in two more years, when the study is repeated, even more changes will be necessary to reflect the situation that will exist then.

Addendum: The Waybackmachine

Often, as is the case with the Lebedev site that was the inspiration for this research, old web pages simply fade away and are no longer available. There is a site that has archived the WWW at various times in its development, the waybackmachine. If you go to this site and put in the original URL for Lebedev's site (www.chem.msu.su/eng/comparison.html), it will return the page as it was at various dates up until 2001, when the page was apparently removed. Aside from being of interest with respect to this particular research, the capabilities of the waybackmachine may also be useful to anyone who is doing a web search and finds himself or herself confronted with the frustrating 404 error. (Note: There will be extra credit for those who actually remember the source of the name for this site, the Way Back Machine.)
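For readers who would rather build the lookup address directly than type into the search box, the short sketch below shows one way to do it. It assumes the address pattern used by web.archive.org (a prefix, then a timestamp or an asterisk, then the original URL); the function name wayback_url is hypothetical, and the sketch only constructs the address without checking whether any capture actually exists.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, timestamp="*"):
    """Build a Wayback Machine address for an archived page.

    timestamp: "*" asks for the list of all captures; a string of digits
    in YYYYMMDDhhmmss form (or a prefix of it, such as a year) asks for
    the capture nearest that date.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


lebedev = "www.chem.msu.su/eng/comparison.html"
all_captures = wayback_url(lebedev)
near_2000 = wayback_url(lebedev, timestamp="2000")
print(all_captures)
print(near_2000)
```

The asterisk form is usually the more useful starting point, since it shows every date on which the page was archived.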


