Start Your Engines! Lap Two
Search vs. Metasearch Engines
Jason Jorski
9th Grade
Bishop McGuinness High School
Abstract
The purpose of this project was to determine whether a single Internet metasearch engine
would return more relevant to topic Uniform Resource Locators (URLs) of Web pages in a
search query on the Internet than a combination of six Internet search engines. I
hypothesized that a combination of search engines would return more relevant to topic URLs than
a single metasearch engine. A list of five topics was created. A search query was
performed on the Internet using six different search engines and six different metasearch
engines for each of the five topics. All searches were run on the same day because the
Internet changes daily from the addition, deletion, and relocation of Web pages and Web
sites. Search queries were multiple word phrases in lowercase, enclosed in quotation
marks. Search queries were not refined with advanced searching techniques. Metasearch
engines were not restricted to using the same six search engines as those selected for the
search engine group. The results of search queries were compared among the six search
engines and also among the six metasearch engines. An equal number of URLs from each
search engine's results were combined together by topic in a group equal in number to the
results for each metasearch engine. The relevant to topic URLs were counted for each group
of search engines and each metasearch engine. When equal numbers of URLs were compared, a
combination of search engines returned more relevant to topic URLs than a single
metasearch engine.
Introduction
The Internet, or World Wide Web, has grown and is still growing exponentially. It is currently
estimated to be as large as 800 million Web pages. Of these 800 million pages, 58% are not
indexed by prominent search engines [2]. Information on the Internet consists of Web pages at
Web sites which lack organization in terms of conventional indexing such as cataloging,
alphabetization, or subject heading grouping. To search the Internet requires full text
searching as opposed to the field searching by author, title, or subject heading used at
the local public library and in conventional databases. Search engines have been developed
to make searching the Internet easier through their creation of an indexed database that
can be accessed through the use of keywords or phrases to obtain a listing of Uniform
Resource Locators (URLs) of Web pages [5].
Because the Internet lacks organization or indexing, a search often returns a large number
of results, many of which are irrelevant to the search query. A single search engine will
usually return no more than 45% of the possible relevant results [6]. Many search engines
have a relevancy determination tool programmed into their software. Relevancy is often
determined by the frequency or the location of keywords in a Web site or on a Web page. An
engine may assign a relevancy percentage to each URL, or place what it considers the most
relevant URLs at the beginning of the listing [6, 8].
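The idea of scoring relevancy by keyword frequency and location can be illustrated with a
small sketch. It is not the method of any particular search engine: the function names, the
sample pages, and the choice to weight title matches three times as heavily as body matches
are all assumptions made only for illustration.

```python
import re

def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def relevancy_score(query, title, body, title_weight=3):
    """Count query-term occurrences; matches in the title count more (assumed weight)."""
    title_words = words(title)
    body_words = words(body)
    score = 0
    for term in words(query):
        score += title_weight * title_words.count(term)  # location: title
        score += body_words.count(term)                  # frequency: body text
    return score

def rank_pages(query, pages):
    """Sort (url, title, body) tuples from highest to lowest score."""
    return sorted(pages, key=lambda p: relevancy_score(query, p[1], p[2]), reverse=True)

# Hypothetical pages, invented for this example.
pages = [
    ("http://example.com/c", "Cooking", "Recipes for soup."),
    ("http://example.com/b", "Sharks", "A tiger shark is not a tiger."),
    ("http://example.com/a", "Tiger habitats", "The Indochinese tiger lives in forests."),
]
ranked = rank_pages("indochinese tiger", pages)
for url, title, body in ranked:
    print(relevancy_score("indochinese tiger", title, body), url)
```

The second page shows why such a score is only a rough guide: it contains the keyword twice
but is not about the topic.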
Searching the Internet with a search engine is actually searching not the Internet itself
but a database of information gathered by an automated program called a spider or crawler.
These programs look for new, changed, or defunct Web sites and send the information back to
update the search engine's database. Some crawlers index only the titles and some index the
entire document. Each search engine creates a database that is different from every other
search engine's database. The databases are so diverse that the results of some of the major
search engines overlap by only thirty-four percent (34%) [8]. My previous year's research
found an overlap of only thirty-seven percent (37%) among six search engines. This diversity
accounts for the wide variety of results received from different search engines even when
the exact same terms are used in the query. To obtain the most comprehensive search of the
Internet, several search engines should be searched sequentially for the same topic [1, 4, 8].
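One way to put a number on overlap is sketched below. The cited figures do not come with a
formula, so the definition used here (the fraction of distinct URLs that more than one
engine returned) is an assumption, as are the engine names and URLs in the sample data.

```python
from collections import Counter

def overlap_rate(results_by_engine):
    """Fraction of distinct URLs that were returned by more than one engine."""
    counts = Counter()
    for urls in results_by_engine.values():
        counts.update(set(urls))  # count each URL at most once per engine
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)

# Hypothetical result lists, invented for this example.
sample = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
    "engine_c": ["u3", "u5", "u2"],
}
rate = overlap_rate(sample)
print(f"{rate:.0%} of distinct URLs were returned by more than one engine")
```

A low rate under this definition means that each engine contributes many URLs the others
miss, which is the argument for searching several engines.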
Metasearch engines were developed as a more efficient alternative to searching several
search engines one after another. A metasearch engine is designed to provide a comprehensive
search of the Internet by accepting a single search query and sending it to multiple search
engines. Some metasearch engines search sequentially and some simultaneously, but all search
multiple search engines and return one set of results. Some metasearch engines search just a
few search engines and others claim to search as many as one hundred search engines.
Metasearch engines dispatch the query to selected search engines, interpret the results,
and display the results in a uniform or integrated format for the requestor [4, 6, 8].
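The dispatch, interpret, and display cycle can be sketched in a few lines. This is a
simplified illustration and not the design of any metasearch engine named in this project:
the two stand-in engines, their URLs, and the choice to search sequentially are assumptions.

```python
def metasearch(query, engines, per_engine=10):
    """Send one query to every engine and merge the answers into a single list.

    engines maps an engine name to a function that takes a query and
    returns a list of URLs (a stand-in for a real search engine).
    """
    merged = {}  # URL -> names of the engines that returned it
    for name, search in engines.items():          # dispatch, one engine at a time
        for url in search(query)[:per_engine]:    # interpret: keep the top results
            sources = merged.setdefault(url, [])
            if name not in sources:
                sources.append(name)
    # display: one uniform list, each URL listed once with its sources
    return [(url, sources) for url, sources in merged.items()]

# Hypothetical engines, invented for this example.
def engine_one(query):
    return ["http://example.com/1", "http://example.com/2"]

def engine_two(query):
    return ["http://example.com/2", "http://example.com/3"]

results = metasearch('"hubble satellite"', {"one": engine_one, "two": engine_two})
for url, sources in results:
    print(url, "found by", ", ".join(sources))
```

This sketch lists a duplicate URL once with a note of its sources. A metasearch engine
could instead list the URL once per source, which changes the total count.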
Statement of the Problem
In a search query of the Internet, will the results of a single Internet metasearch engine
return more relevant to topic Uniform Resource Locators (URLs) to Internet Web pages than
a combination of results from six Internet search engines?
Hypothesis
The combination of results from six Internet search engines will return more relevant to
topic Uniform Resource Locators (URLs) to Internet Web pages than the results of a single
Internet metasearch engine when comparing equal numbers of URLs for each group of results.
Materials
Computer with Internet connection
Modem
Telephone line
Printer
Ream of printing paper
List of five topics
Calculator
Experimental Method
1. Create a list of topics to search on the Internet.
2. Construct multiple word search queries for each topic.
3. Connect to the Internet.
4. Run all search queries on the same day.
5. Access AltaVista and search for each of the topics using the multiple
word search queries in lower case enclosed in quotation marks without further refinement.
6. Download and print the results for each query, which will contain
Uniform Resource Locators (URLs) to Internet Web pages.
7. Repeat steps 5 and 6 for Excite, HotBot, Infoseek, Lycos, and Yahoo.
8. Access MetaCrawler and search for each of the topics using the
multiple word search queries in lower case enclosed in quotation marks without further
refinement.
9. Download and print the results for each query, which will contain
Uniform Resource Locators (URLs) to Internet Web pages.
10. Repeat steps 8 and 9 for byteSearch, Mamma, ProFusion, SavvySearch, and Supercrawler.
11. Determine which AltaVista URLs are duplicates, mirrors, or non-responsive for each
topic.
12. Determine the relevant to topic URLs on AltaVista for each topic.
13. Repeat steps 11 and 12 for Excite, HotBot, Infoseek, Lycos, and
Yahoo.
14. Determine which MetaCrawler URLs are duplicates, mirrors, or non-responsive for
each topic.
15. Determine the relevant to topic URLs on MetaCrawler for each topic.
16. Repeat steps 14 and 15 for byteSearch, Mamma, ProFusion, SavvySearch, and
Supercrawler.
17. Record the information in a data log.
18. Select up to the first ten URLs from each of AltaVista, Excite, HotBot,
Infoseek, Lycos, and Yahoo for each topic, creating a set of URLs equal in number to the
results for each topic from a metasearch engine.
19. Analyze data from the Internet search engine groups and the Internet
metasearch engines.
20. Perform percentage and/or statistical analysis.
Topics selected for search were: Custer's Last Stand; Hubble Satellite; Indochinese Tiger;
Oklahoma Land Run; and Retrograde Rotation. Searches for the five topics were run on six
metasearch engines, and up to sixty URLs were collected from each. Metasearch engines were
not restricted in their choice of search engines or in the number of search engines
searched. Duplicates and mirrors of URLs were eliminated from the relevance counts but not
from the total URL counts. Non-responsive URLs were determined but were not eliminated from
the total URL counts. Metasearch engines do not handle duplicates identically. Some list a
duplicate separately for each engine, and others list it only once with a notation naming
the search engines on which it appears. Therefore, duplicates were not eliminated from the
total number of URLs but were counted only once in the relevancy counts. Relevant to topic
URLs were determined for each metasearch engine for each topic.
Searches for the five topics were run sequentially on a combination of six search engines.
Groups of results from the search engine combination were created by selecting: the first
six URLs from each of the six engines creating a group of thirty-six (36) URLs; the first
eight URLs from each engine for a group of forty-eight (48); and the first ten from each
engine for a group of sixty (60). Duplicate URLs and mirrors of URLs between engines were
eliminated from the relevance counts but not from the total URL counts. Non-responsive
URLs were determined but were not eliminated from the total number of URLs returned. The
relevant to topic URLs were determined for each of the three search engine combination
groups.
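The construction of the three combination groups, and the rule that duplicates stay in the
total but count once toward relevance, can be sketched as follows. The engine names, URLs,
and relevance judgments are placeholders invented for the example. Mirrors are not detected
here, since that required opening each page by hand.

```python
def build_group(results_by_engine, per_engine):
    """Take the first per_engine URLs from each engine; duplicates stay in the total."""
    group = []
    for urls in results_by_engine.values():
        group.extend(urls[:per_engine])
    return group

def count_relevant(group, relevant_urls):
    """Count each relevant URL once, even if several engines returned it."""
    return len(set(group) & set(relevant_urls))

# Hypothetical data: six engines, ten URLs each, with one URL shared by all six.
results = {
    f"engine{i}": ["shared"] + [f"e{i}-u{j}" for j in range(1, 10)]
    for i in range(6)
}
# Hypothetical hand-made relevance judgments.
relevant = {"shared", "e0-u1", "e3-u7"}

groups = {size: build_group(results, size // 6) for size in (36, 48, 60)}
for size, group in groups.items():
    print(size, len(group), count_relevant(group, relevant))
```

In this made-up data the group of 36 misses one relevant URL that sits seventh in its
engine's list, while the groups of 48 and 60 include it.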
Metasearch engine results were grouped by the total number of URLs returned. A total was
rounded down if it was equal to or less than the midpoint between two group sizes and
rounded up if it was greater than the midpoint. Metasearch engines returning forty-two (42)
URLs or fewer were compared to the search engine combination group of thirty-six (36).
Metasearch engines returning forty-three (43) to fifty-four (54) URLs were compared to the
search engine combination group of forty-eight (48). Those returning fifty-five (55) or
more URLs were compared to the search engine combination group of sixty (60). In this way,
each metasearch engine was compared individually to a search engine combination group with
a comparable total number of URLs.
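The matching rule above amounts to a small lookup. The sketch below uses the cutoffs stated
in the text (42 and 54); the function name is made up for the example.

```python
def comparison_group(total_urls):
    """Return the size of the search engine combination group (36, 48, or 60)
    that a metasearch result with total_urls URLs is compared against."""
    if total_urls <= 42:   # at or below the midpoint of 36 and 48
        return 36
    if total_urls <= 54:   # at or below the midpoint of 48 and 60
        return 48
    return 60

for total in (20, 42, 43, 54, 55, 60):
    print(total, "->", comparison_group(total))
```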
Relevancy was determined by opening each Web page for review, not by the software of the
search engines. A Web site was considered relevant if its first page contained at least one
fact that could be used in a student research paper.
Results
My results supported my hypothesis. The six search engine combination returned the highest
number of URLs relevant to topic. Across the five topics, the six search engine combination
returned more relevant to topic URLs than the individual metasearch engines at the
following rates: 60% of the time against byteSearch; 80% against Mamma; 60% against
MetaCrawler; 100% against ProFusion; 80% against SavvySearch; and 60% against Supercrawler.
Collectively, the search engine combination group returned more relevant to topic URLs than
the metasearch engines at a rate of 73.3%, matched the metasearch engines at a rate of
13.3%, and was outperformed by the metasearch engines at a rate of 13.3%.
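The collective figures can be checked against the per-engine figures with simple
arithmetic. The sketch below assumes each per-engine percentage is the share of the five
topics on which the combination returned more relevant URLs, giving thirty comparisons in
all. The counts of four ties and four losses are inferred from the 13.3% rates and are not
stated directly in the data.

```python
# Per-engine rates from the Results section, in percent.
win_rates = {
    "byteSearch": 60, "Mamma": 80, "MetaCrawler": 60,
    "ProFusion": 100, "SavvySearch": 80, "Supercrawler": 60,
}
topics = 5  # assumption: each rate is a share of the five topics

wins = {name: rate * topics // 100 for name, rate in win_rates.items()}
total_wins = sum(wins.values())
comparisons = topics * len(win_rates)

overall = 100 * total_wins / comparisons
remaining = comparisons - total_wins   # ties plus losses
ties = losses = remaining // 2         # inferred: the two rates are equal

print(f"wins:   {total_wins}/{comparisons} = {overall:.1f}%")
print(f"ties:   {ties}/{comparisons} = {100 * ties / comparisons:.1f}%")
print(f"losses: {losses}/{comparisons} = {100 * losses / comparisons:.1f}%")
```

Under that assumption the six rates correspond to 22 of 30 comparisons, which agrees with
the reported 73.3%.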
Of the six individual metasearch engines, Mamma returned the most relevant to topic URLs
at a rate of 27%, with byteSearch coming in second at a rate of 18%.
Conclusion
The Internet is so large that it appears no single search engine or metasearch engine can
cover it all. Therefore, selecting a search engine or a metasearch engine to use becomes a
very important factor in the volume and quality of results returned from a search query of
the Internet. Using more than one search engine or metasearch engine to search the Internet
will always provide more comprehensive coverage of the Internet but will often return a
large number of results to be reviewed. The search engine combination obtained by
sequential searching of six search engines returned the most relevant to topic URLs at a
rate of 73.3%, with no more than ten URLs reviewed per engine, for a total of sixty URLs
per topic.
Acknowledgments
I would like to thank my biology teacher, Theresa Gavula, for serving
as my adult sponsor. I would also like to thank Bryan Stanhouse, Ph.D., for his guidance in
the data analysis of my project.
Bibliography
1. Basch, Reva. (1996). Find Anything on the Web. Computer Life, 3(9): 61.
2. Click! (1999). Yahoo Internet Life, 5(9): 38.
3. Friel, Daniel. (1998). Superior Software: Metasearch Engines. Business Economics,
33(2): 70.
4. Garman, Nancy. (1999). Meta Search Engines. Online, 23(3): 74.
5. Glossbrenner, Alfred and Glossbrenner, Emily. (1998). Search Engines for the World Wide
Web. Berkeley, CA: Peachpit Press.
6. Haskin, David. (1997). IW Labs: Power Search. Internet World, 8(12): 78.
7. Notess, Greg R. (1997). Measuring the Size of Internet Databases. Database, 20: 69-71.
8. Repman, Judi and Carlson, Randal D. (1999). Surviving the Storm: Using Metasearch
Engines Effectively. Computers in Libraries, 19(5): 50.