Start Your Engines! Lap Two
Search vs. Metasearch Engines
Bishop McGuinness High School
The purpose of this project was to determine whether a single Internet metasearch engine would return more relevant to topic Uniform Resource Locators (URLs) of Web pages in a search query on the Internet than a combination of six Internet search engines. I hypothesized that a combination of search engines would return more relevant to topic URLs than a single metasearch engine. A list of five topics was created. A search query was performed on the Internet using six different search engines and six different metasearch engines for each of the five topics. All searches were run on the same day because the Internet changes daily through the addition, deletion, and relocation of Web pages and Web sites. Search queries were multiple word phrases in lowercase, enclosed in quotation marks. Search queries were not refined with advanced searching techniques. Metasearch engines were not restricted to using the same six search engines as those selected for the search engine group. The results of search queries were compared among the six search engines and also among the six metasearch engines. An equal number of URLs from each search engine's results were combined by topic into a group equal in number to the results for each metasearch engine. The relevant to topic URLs were counted for each group of search engines and each metasearch engine. When equal numbers of URLs were compared, a combination of search engines returned more relevant to topic URLs than a single metasearch engine.
The Internet or World Wide Web has grown and is growing exponentially. It is currently estimated to be as large as 800 million Web pages. Of these 800 million pages, 58% are not indexed by prominent search engines [2]. Information on the Internet consists of Web pages at Web sites which lack organization in terms of conventional indexing such as cataloging, alphabetization, or subject heading grouping. To search the Internet requires full text searching as opposed to the field searching by author, title, or subject heading used at the local public library and in conventional databases. Search engines have been developed to make searching the Internet easier through their creation of an indexed database that can be accessed through the use of keywords or phrases to obtain a listing of Uniform Resource Locators (URLs) of Web pages [5].
The lack of organization or indexing of the Internet often returns a large number of results, many of which are irrelevant to the search query. A single search engine will usually return no more than 45% of the possible relevant results [6]. Many search engines have a relevancy determination tool programmed into their software. Relevancy is often determined by the frequency or the location of keywords in a Web site or on a Web page. An engine may assign a relevancy percentage to each URL, or place what it considers the most relevant URLs at the beginning of the listing [6, 8].
Searching the Internet with a search engine is actually searching a database of information, not the Internet itself. The database is gathered by an automatic or robot program called a spider or crawler. These programs look for new, changed, or defunct Web sites and send back the information to change and update a search engine database. Some crawlers search only the titles and some search the entire document. Each search engine creates a database that is different from every other search engine's database. The databases are so diverse that the results of some of the major search engines overlap by only thirty-four percent (34%) [8]. My previous year's research reported an overlap of only thirty-seven percent (37%) among six search engines. Search engine database diversity accounts for the wide variety of results received among search engines even though the same exact terms are used in the query. To obtain the most comprehensive search of the Internet, several search engines should be searched sequentially for the same topic [1, 4, 8].
Metasearch engines were developed as a more efficient method for sequential searching of search engines. Metasearch engines are designed to provide a comprehensive search of the Internet by entering a single search query and sending it to multiple search engines. Some metasearch engines search sequentially and some simultaneously, but all search multiple search engines and return one set of results. Some metasearch engines search just a few search engines and others claim to search as many as one hundred search engines. Metasearch engines dispatch the query to selected search engines, interpret the results, and display the results in a uniform or integrated format for the requestor [4, 6, 8].
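The dispatch-and-merge behavior described above can be illustrated with a minimal sketch. The engines here are hypothetical stand-in functions that return fixed URL lists; no real search engine interface is modeled, and the merging rule (list each URL once, noting which engines returned it) is only one of the approaches metasearch engines take.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a search engine: maps a query to a ranked URL list.
Engine = Callable[[str], List[str]]


def metasearch(query: str, engines: Dict[str, Engine]) -> List[dict]:
    """Send one query to every engine and merge the results into one list.

    A URL returned by several engines is listed once, with a note naming
    every engine that returned it.
    """
    merged: Dict[str, dict] = {}
    for name, engine in engines.items():
        for url in engine(query):
            entry = merged.setdefault(url, {"url": url, "engines": []})
            entry["engines"].append(name)
    # Dictionaries keep insertion order, so URLs appear in first-seen order.
    return list(merged.values())
```

Calling `metasearch` with two stub engines whose results share one URL gives a single merged list in which the shared URL carries both engine names.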
Statement of the Problem
In a search query of the Internet, will the results of a single Internet metasearch engine return more relevant to topic Uniform Resource Locators (URLs) to Internet Web pages than a combination of results from six Internet search engines?
The combination of results from six Internet search engines will return more relevant to topic Uniform Resource Locators (URLs) to Internet Web pages than the results of a single Internet metasearch engine when comparing equal numbers of URLs for each group of results.
Computer with Internet connection
Ream of printing paper
List of five topics
1. Create a list of topics to search on the Internet.
2. Construct multiple word search queries for each topic.
3. Connect to the Internet.
4. Run all search queries on the same day.
5. Access AltaVista and search for each of the topics using the multiple word search queries in lower case enclosed in quotation marks without further refinement.
6. Download and print the results for each query, which will contain Uniform Resource Locators (URLs) to Internet Web pages.
7. Repeat steps 5 and 6 for Excite, HotBot, Infoseek, Lycos, and Yahoo.
8. Access MetaCrawler and search for each of the topics using the multiple word search queries in lower case enclosed in quotation marks without further refinement.
9. Download and print the results for each query, which will contain Uniform Resource Locators (URLs) to Internet Web pages.
10. Repeat steps 8 and 9 for byteSearch, Mamma, ProFusion, SavvySearch, and Supercrawler.
11. Determine URLs that are duplicates, mirrors, and non-responsive on AltaVista for each topic.
12. Determine the relevant to topic URLs on AltaVista for each topic.
13. Repeat steps 11 and 12 for Excite, HotBot, Infoseek, Lycos, and Yahoo.
14. Determine URLs that are duplicates, mirrors, and non-responsive on MetaCrawler for each topic.
15. Determine the relevant to topic URLs on MetaCrawler for each topic.
16. Repeat steps 14 and 15 for byteSearch, Mamma, ProFusion, SavvySearch, and Supercrawler.
17. Record information on data log.
18. Select up to the first ten URLs from AltaVista, Excite, HotBot, Infoseek, Lycos, and Yahoo for each topic creating a set of URLs equal in number to the results for each topic from a metasearch engine.
19. Analyze data from the Internet search engine groups and the Internet metasearch engines.
20. Perform percentage and/or statistical analysis.
Topics selected for search were: Custer's Last Stand; Hubble Satellite; Indochinese Tiger; Oklahoma Land Run; and Retrograde Rotation. Searches for the five topics were run on six metasearch engines and a maximum of sixty URLs was collected from each search. Metasearch engines were not restricted in their choice of search engines or the number of search engines searched. Duplicates and mirrors of URLs were eliminated from the relevance counts but not from the total URL counts. Non-responsive URLs were determined but were not eliminated from the total URL counts. Metasearch engines do not handle duplicates identically. Some list duplicates individually by engine and others list a URL only once with a notation listing the search engines that returned it. Therefore, duplicates were not eliminated from the total number of URLs but were only counted once in the relevancy counts. Relevant to topic URLs were determined for each metasearch engine for each topic.
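The counting rule above (duplicates stay in the total but count once toward relevance) can be sketched as follows. The function name and arguments are illustrative only; the set of relevant URLs is assumed to come from the manual review, and mirrors are not detected here because that was also a manual judgment.

```python
def count_results(urls, relevant):
    """Return (total, relevant_count) for one list of returned URLs.

    urls: URLs as returned by the engine, duplicates included.
    relevant: set of URLs judged relevant by manual review.
    """
    total = len(urls)  # duplicates remain in the total URL count
    relevant_count = len(set(urls) & set(relevant))  # each URL counted once
    return total, relevant_count
```

For example, a result list of four URLs in which one relevant URL appears twice still has a total of four, but that URL adds only one to the relevance count.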
Searches for the five topics were run sequentially on a combination of six search engines. Groups of results from the search engine combination were created by selecting: the first six URLs from each of the six engines, creating a group of thirty-six (36) URLs; the first eight URLs from each engine for a group of forty-eight (48); and the first ten from each engine for a group of sixty (60). Duplicate URLs and mirrors of URLs between engines were eliminated from the relevance counts but not from the total URL counts. Non-responsive URLs were determined but were not eliminated from the total number of URLs returned. The relevant to topic URLs were determined for each of the three search engine combination groups.
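As a rough illustration of how the three combination groups are formed, the sketch below takes the first six, eight, or ten URLs from each engine's ranked list. The function name and the dictionary layout are assumptions made for the example, not part of the original procedure.

```python
def combine(results_by_engine, per_engine):
    """Build one combination group for a topic.

    results_by_engine: dict mapping engine name to its ranked URL list.
    per_engine: how many top URLs to take from each engine (6, 8, or 10).
    Duplicates between engines are kept, as in the total URL counts.
    """
    combined = []
    for urls in results_by_engine.values():
        combined.extend(urls[:per_engine])
    return combined
```

With six engines that each return at least ten URLs, `per_engine` values of 6, 8, and 10 give groups of 36, 48, and 60 URLs.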
Metasearch engine results were grouped by total number of URLs returned. A total was rounded down if it was equal to or less than the midpoint between two group sizes, or rounded up if it was greater than the midpoint. Metasearch engines returning forty-two (42) URLs or fewer were compared to the search engine combination group of thirty-six (36). Metasearch engines returning forty-three (43) to fifty-four (54) URLs were compared to the search engine combination group of forty-eight (48). Those returning fifty-five (55) or more URLs were compared to the search engine combination group of sixty (60). Metasearch engines were compared individually to a search engine combination group of comparable total URL returns.
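The grouping rule can be written as a small function. This is only a sketch of the thresholds stated above; the function name is made up for the example.

```python
def comparison_group(total_urls):
    """Return the combination group size (36, 48, or 60) for a metasearch
    engine that returned total_urls URLs for a topic."""
    if total_urls <= 42:   # at or below the midpoint between 36 and 48
        return 36
    if total_urls <= 54:   # at or below the midpoint between 48 and 60
        return 48
    return 60
```

A metasearch engine returning 42 URLs is therefore compared with the group of 36, while one returning 43 is compared with the group of 48.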
Relevancy was determined by opening each Web page for review and not by the software of the search engines. A Web site was considered relevant if its first page contained at least one fact that could be used for a student research paper.
My results supported my hypothesis. The six search engine combination returned the highest number of URLs relevant to topic. The six search engine combination performed better than each individual metasearch engine at the following rates: byteSearch, 60%; Mamma, 80%; MetaCrawler, 60%; ProFusion, 100%; SavvySearch, 80%; and Supercrawler, 60%. Collectively, the search engine combination group returned more relevant to topic URLs than the metasearch engines at a rate of 73.3%, matched the metasearch engines at a rate of 13.3%, and was outperformed by the metasearch engines at a rate of 13.3%.
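One reading of the per-engine percentages above, which is an assumption and not stated in the report, is that each is the share of the five topics in which the search engine combination beat that metasearch engine. Under that reading the figures are consistent with the collective rates, as this short check of the arithmetic suggests.

```python
TOPICS = 5

# Per-engine rates as reported; interpreting them as "share of the five
# topics won by the combination" is an assumption made for this check.
win_rate_percent = {
    "byteSearch": 60,
    "Mamma": 80,
    "MetaCrawler": 60,
    "ProFusion": 100,
    "SavvySearch": 80,
    "Supercrawler": 60,
}

# Convert each percentage back to a number of topics out of five.
wins = {name: rate * TOPICS // 100 for name, rate in win_rate_percent.items()}

total_wins = sum(wins.values())                # 3 + 4 + 3 + 5 + 4 + 3
comparisons = TOPICS * len(win_rate_percent)   # 5 topics x 6 engines
overall_percent = round(100 * total_wins / comparisons, 1)
remaining = comparisons - total_wins           # comparisons tied or lost
```

Here `total_wins` is 22 out of 30 comparisons, which rounds to the reported 73.3%. The remaining 8 comparisons, split evenly between ties and losses, give 4 of 30 each, which rounds to the reported 13.3%.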
Of the six individual metasearch engines, Mamma returned the most relevant to topic URLs at a rate of 27%, with byteSearch second at a rate of 18%.
The Internet is so large that it appears no single search engine or metasearch engine can cover it all. Therefore, selecting a search engine or a metasearch engine to use becomes a very important factor in the volume and quality of results returned from a search query of the Internet. Using more than one search engine or metasearch engine to search the Internet will always provide more comprehensive coverage of the Internet but will often return a large number of results to be reviewed. The search engine combination obtained by sequential searching of six search engines returned the most relevant to topic URLs at the rate of 73.3% while reviewing no more than ten URLs per engine, for a total of sixty URLs per topic.
I would like to thank my biology teacher, Theresa Gavula, for serving as my adult sponsor. I would also like to thank Bryan Stanhouse, Ph.D., for his guidance in the data analysis of my project.
1. Basch, Reva. (1996). Find Anything on the Web. Computer Life, 3(9): 61.
2. Click! (1999). Yahoo Internet Life, 5(9): 38.
3. Friel, Daniel. (1998). Superior Software: Metasearch Engines. Business Economics, 33(2): 70.
4. Garman, Nancy. (1999). Meta Search Engines. Online, 23(3): 74.
5. Glossbrenner, Alfred and Glossbrenner, Emily. (1998). Search Engines for the World Wide Web. Berkeley, CA: Peachpit Press.
6. Haskin, David. (1997). IW Labs: Power Search. Internet World, 8(12): 78.
7. Notess, Greg R. (1997). Measuring the Size of Internet Databases. Database, 20: 69-71.
8. Repman, Judi and Carlson, Randal D. (1999). Surviving the Storm: Using Metasearch Engines Effectively. Computers in Libraries, 19(5): 50.