Strategic Librarian

Using strategy to develop the law firm library.

Retro Thoughts on Search Engines

Back in 1996, I worked with the Minnesota State Law Library to develop the court decisions archive for the Minnesota Appellate Court System.  After many years of working in libraries including 10 working in law firm libraries, I was finally getting my MLS and had managed to arrange an independent study that required me to study what other courts were doing with their decisions and what software those courts were using to make the decisions searchable.   Once I learned which search engines were being used, I needed to evaluate which one to use for this project.   To do that, I wrote a paper describing the criteria to use to evaluate this type of software along with several appendices that contained the evaluation of a number of search engines in use during that time.  

The paper was published on the web back in 1996.  In addition to this paper, I also wrote one titled State Court Decisions on the Internet that summarized what was available and what functionality each site had.  Following that, I created a web site with the same name that served as a directory of state court decision sites.  That was taken down eventually as I moved into another position and did not have time to maintain it.  I only have a print copy of the appendices, which are of little use given how long ago I wrote the paper unless someone wants to go down memory lane with WAIS or Swish or the like, but I did find an electronic version of the search engine paper.  

I do think the main part of the article still has value as it was one of the first basic web postings on the topic and some of the ideas hold true even if the language I used may not be used today to describe functionality, etc.   I will follow this posting with another in the near future that provides an updated look at search engines as we know them today.

Web Search Engines  

Nina Platt, November 1996  

Regular users of the Internet’s World Wide Web have become very familiar with the sites that have been developed that allow users to search across all or a part of the Web.  Examples of these sites include Infoseek, Altavista, Hotbot, etc.  They are great tools for finding information that is stored across the Web and can assist users in finding information that would otherwise be difficult to locate.  In addition to these tools, individual websites have begun adding searching capabilities that allow users to search all or a part of their own sites.  This paper focuses on the search engines that have been used by the sites that provide access to state appellate court decisions.  Many of the sites listed in the paper, State Court Decisions on the Internet, have provided access to decisions through a variety of search engines which will be described throughout the following pages. 

Structure of Search Engine 

The search software on the Web (more often referred to as search engines or search tools) is really a set of database management tools.  Some were developed over the years to manage data and have since been enhanced to allow users access over the Internet; others were developed specifically for the Internet.  To further complicate this explanation, third-party or individual developers have also created Web interfaces that allow Web browsers to communicate with search engines.  Additional differences that can be attributed to search engines include: 

  • They are created to manage databases with structured fields (e.g., Foxpro), databases of textual documents (e.g., Excite), or databases that have both structured fields and text (e.g., freeWAIS-sf). 
  • They are created to manage relational databases (e.g., Dbase) or flat files (e.g., Filemaker Pro). 
  • They are created to manage databases that have index files (that contain each word that has been indexed) but maintain the original documents in directories on the computer (e.g., Swish), databases that have index files and store the documents in a data file (e.g., Folio), and databases that have index files and data files and also maintain the original documents for displaying or downloading (e.g., Basis Webserver).
  • They are developed to support different operating systems (e.g., UNIX, Windows NT, Windows 95).
  • They can be accessed through a variety of interfaces including Web browsers, Telnet connections, modem connections, connections using a GUI Windows interface, etc.

This paper will deal with search engines that have been developed to manage databases or document collections that may have both full text and structured fields. 

Indexing 

Before a database can be searched it must be indexed.  Indexing generally consists of systematically going through all documents that have been designated for indexing and creating a file of terms found in the documents.  The index will also include pointers back to the original document or to a record in a data file.  This allows the user to search on a term and receive a grouping of records or documents that contain the term.  The search engines examined use a variety of capabilities to make databases searchable.  They may or may not include the ability to:

  • Specify multiple directories and files to index
  • Recursively index subdirectories (index all subdirectories without having to specify each subdirectory) 
  • Specify file extensions of files to be indexed
  • Specify stopwords (commonly used words that should not be indexed) 
  • Provide incremental indexing (when new documents are added, the entire file system does not have to be reindexed)
  • Provide dynamic indexing (documents can be indexed while users are searching)
  • Schedule indexing when site is not in use
  • Index across servers on the same network or across networks
  • Merge indexes
  • Create individual indexes for different collections of documents
  • Add structured fields that will be indexed
  • Index a variety of document formats including HTML, ASCII, PDF, wordprocessing files, etc.
  • Index HTML tags: <meta>, <head>, <body>, <title>, header (<h1> to <h6>), emphasized (<i>, <b>, <em>, <strong>), or comment tags
  • Index a protected server (one that requires user authentication to access)
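
The indexing step described above can be illustrated with a minimal Python sketch. It is not the method of any search engine covered in this paper; the stopword list, the function names, and the sample documents are all assumptions made for illustration. It builds an index file in miniature (each term points back to the documents that contain it), skips stopwords, and shows incremental indexing by adding a document without rebuilding.

```python
import re
from collections import defaultdict

# Assumed sample stopword list; a real site would choose its own.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "was"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def add_document(index, doc_id, text, stopwords=STOPWORDS):
    """Incremental indexing: add one document without rebuilding the index."""
    for term in tokenize(text):
        if term not in stopwords:
            index[term].add(doc_id)  # pointer back to the original document

def build_index(documents, stopwords=STOPWORDS):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        add_document(index, doc_id, text, stopwords)
    return index

# Hypothetical documents standing in for files in a directory.
documents = {
    "op1.txt": "The court affirmed the decision of the district court.",
    "op2.txt": "The appeal was dismissed.",
}
index = build_index(documents)
add_document(index, "op3.txt", "The court reversed and remanded.")
print(sorted(index["court"]))
```

Searching on a term is then a lookup in the index rather than a scan of every document, which is why indexing has to happen before searching.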

Searching 

Once the database has been indexed a form or script is used to provide access by searching.  The searching capabilities of a database can be different even if developers are using the same search engine.  The differences are due in part to how the database was indexed and in part to how the search interface has been set up.  Some search engines do not include a searching function.  The developers of these engines created the ability to index and left it up to third parties or individuals to create the scripts or forms that are used in searching.  As with indexing, the search capabilities of a database can be different depending on what the developer included in the search function.  They may or may not include the ability to search using:

  • Natural query language.  This allows users to enter a question or phrase that best describes the topic for which they are searching.
  • Boolean operators (AND, OR, NOT).  These are connectors that allow users to search where all terms are contained in documents (AND), where any of the terms are contained in documents (OR), or where one term but not the other is contained in documents (NOT).  The default that is used when no connector is entered is generally AND or OR.
  • Proximity operators.  These connectors allow users to search where a term is found within so many characters of another term (W/number of characters), or where a term is found ADJacent to or NEAR another term.  Another capability offered by some database managers allows the user to specify the order of the terms (e.g., database BEFORE manager). 
  • Phrase searching.  Allows users to search for an exact phrase.
  • Thesaurus.  Uses an operator that replaces terms with synonyms or provides the user with a summary of broader or narrower terms and/or synonyms. 
  • Concept searching.  Similar to thesaurus, this function will search on all variations of a term. 
  • Wildcards.  Allows users to truncate terms when they want variations of a term or insert wildcards when they are not sure of the spelling or want to specify how many characters should be replaced by the wildcards.  A single string wildcard can be used to replace one character (e.g., Anders?n for Anderson or Andersen).  Multiple characters can be replaced by using more than one single string wildcard (e.g., act??? would retrieve action or acting).  A character string wildcard can be used to search words that contain the same string of characters (e.g., dark* would retrieve darker, darkness, darkest, etc.).  Wildcards can be used for prefixes, suffixes, or characters within a word depending on how the software was developed. 
  • Exact match.  Allows users to search on the term exactly as it is entered.  This is useful if the database was set up to search for the singular and plural variation of a term. 
  • Fuzzy match.  Returns records with words that have a similar spelling to search terms. 
  • Numeric operators like equals, greater than, less than, etc.  Returns records with a specific alphanumeric value. 
  • Range operator.  Returns records within a range of values. 
  • Fielded searches.  Allows users to search on a specific field or fields in the database. 
  • Query by example.  Enables users to find other documents similar to a document in the current result set that the user finds relevant. 
  • Advisors.  Provide tips on how to construct a better query.  
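
As a rough illustration of the Boolean operators and wildcards in the list above, the following Python sketch runs queries against a small term index.  The index contents and function names are hypothetical, and the wildcard handling leans on Python's standard fnmatch module rather than on the approach of any particular search engine.

```python
from fnmatch import fnmatchcase

# Hypothetical index: term -> set of document ids that contain it.
INDEX = {
    "appeal": {1, 2},
    "action": {2},
    "acting": {3},
    "court": {1, 3},
    "darker": {1},
    "darkness": {3},
}

def lookup(term, index=INDEX):
    """Return ids of documents matching a term; ? and * act as wildcards."""
    term = term.lower()
    if "?" in term or "*" in term:
        hits = set()
        for indexed_term, doc_ids in index.items():
            # ? replaces exactly one character, * replaces any string.
            if fnmatchcase(indexed_term, term):
                hits |= doc_ids
        return hits
    return set(index.get(term, set()))

def search_and(a, b):
    """Documents that contain both terms."""
    return lookup(a) & lookup(b)

def search_or(a, b):
    """Documents that contain either term."""
    return lookup(a) | lookup(b)

def search_not(a, b):
    """Documents that contain the first term but not the second."""
    return lookup(a) - lookup(b)

print(search_and("court", "appeal"))
print(search_not("court", "appeal"))
print(lookup("act???"))
print(lookup("dark*"))
```

Because each term resolves to a set of document ids, the three Boolean connectors reduce to set intersection, union, and difference.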

Additional features that may be part of the search function are:

  • Users can select to search on one or more databases
  • Users can select the max number of records to return in a result set
  • Users can choose between a simple or advanced search form
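
Proximity operators and fuzzy matching, both listed among the search capabilities above, can be sketched in a few lines of Python.  Note the assumptions: distance here is counted in words rather than in characters, the function names are invented for this example, and the fuzzy match relies on Python's difflib module with an arbitrary similarity cutoff.  The limit parameter plays the role of a maximum number of results.

```python
import re
from difflib import get_close_matches

def positions(text, term):
    """Return the word positions at which term occurs in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [i for i, word in enumerate(words) if word == term.lower()]

def near(text, a, b, max_words, ordered=False):
    """True if a occurs within max_words words of b.

    With ordered=True, a must come BEFORE b.
    """
    for i in positions(text, a):
        for j in positions(text, b):
            gap = j - i
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= max_words:
                return True
    return False

def fuzzy_terms(term, vocabulary, limit=3, cutoff=0.8):
    """Return up to limit indexed terms spelled similarly to term."""
    return get_close_matches(term.lower(), vocabulary, n=limit, cutoff=cutoff)

text = "the database manager stores records"
print(near(text, "database", "manager", 1))
print(near(text, "manager", "database", 1, ordered=True))
print(fuzzy_terms("andersen", ["anderson", "andrews", "action"]))
```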

Results Display 

Once the database has been searched, the results must be returned to the user in a usable format.  The format should include enough of a description of the records or documents returned to allow the user to make a decision about which document he/she wishes to display.  As with the indexing and searching, the results display will be different depending on the database manager used.  They may or may not include the ability to display the following: 

  • Title of document/record
  • Author of document/record
  • Description or summary of document/record
  • Size of document/record
  • Relevance ranking of document/record
  • Number of documents/records found
  • Database from which the document was retrieved
  • Search terms used
  • Date document was created or indexed
  • Database fields as specified by database administrator or user

Additional features that may be part of the results display function include:

  • Results display can be modified by administrator or by user. 
  • Terms searched on are highlighted in the document. 
  • Users can navigate between search terms (or hits) within the retrieved documents. 
  • Users can select to display various formats of the same document (ex. HTML, wordprocessing, PDF). 
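
To make the display elements above concrete, here is a small Python sketch that ranks documents, reports the number found and the search terms used, and highlights hits.  Everything in it is an assumption for illustration: relevance is a simple count of term occurrences, size is measured in characters, and hits are marked with asterisks.  The products described in this paper may rank and display quite differently.

```python
import re

def highlight(text, terms, mark="**"):
    """Wrap each whole-word occurrence of a search term in markers."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{mark}{m.group(0)}{mark}", text)

def score(text, terms):
    """Assumed relevance measure: total occurrences of the search terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sum(words.count(t.lower()) for t in terms)

def format_results(documents, terms, max_results=10):
    """Return display lines: a summary, then title, size, score, and text."""
    scored = [(score(doc["text"], terms), doc) for doc in documents]
    hits = sorted((p for p in scored if p[0] > 0),
                  key=lambda p: p[0], reverse=True)
    lines = [f"{len(hits)} document(s) found for: {' '.join(terms)}"]
    for relevance, doc in hits[:max_results]:
        size = len(doc["text"])
        lines.append(f"{doc['title']} ({size} characters, score {relevance})")
        lines.append("  " + highlight(doc["text"], terms))
    return lines

# Hypothetical documents for the example.
docs = [
    {"title": "A", "text": "The court ruled."},
    {"title": "B", "text": "Court of appeals; the court affirmed."},
    {"title": "C", "text": "No match here."},
]
for line in format_results(docs, ["court"]):
    print(line)
```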

Descriptions of Search Engines 

As mentioned above, the creators of the websites that have searchable archives of court decisions have used a variety of search engines to index and provide searching capabilities for their users.  These search engines include: 

  • Applesearch
  • Excerpt
  • Excite
  • Folio
  • Frontpage
  • Fulcrum
  • Isearch
  • PL Web
  • Swish
  • TEAMate
  • WAIS
  • WebFind/WebIndex (WebSite) 

Appendix A includes a description of these search engines based on the criteria listed above.  It also includes descriptions of Basis Webserver and ht://Dig, two search engines which look very interesting.  The information on various search engines covered in this paper was collected through an analysis of the documentation posted at the website of the organization or individual who developed the software and/or through a survey sent to the developer.   

WAIS is a name that is used for a variety of products that are both freeware and commercial.  They include the original WAIS search engine, which was developed by Thinking Machines Corporation.  WAIS Inc. now sells a commercial version of WAIS and owns the trademarks “WAIS” and “Wide Area Information Servers”.  Variations of WAIS that have been developed include freeWAIS (freeware), freeWAIS-sf (freeware), WAIS (commercial), and NT Wais Toolkit (commercial).  For purposes of simplification, freeWAIS-sf is the only WAIS product described in the descriptions.  Swish, which is also included in the evaluations, is a simplified version of WAIS.  Swishgate and wwwwais are WWW to WAIS gateways which allow users to search and display the results of searches done on a WAIS or Swish database.  Descriptions of these products are included. 

It is assumed that all search engines discussed require an HTTP-compliant web server, with the exception of Frontpage and Website (both are Web server products with built-in searching functions).  

How to Choose a Search Engine 

The purpose of this study was to gather information to be used in determining which search engine should be used to create a searchable archive of court decisions.  Selecting a search engine is a difficult task because of the variables involved.  To make a decision one must answer the following questions: 

  • What platform will you be using?  Examine the platforms supported carefully.  The search engine generally will only run on specific platforms with specific versions of the platform's operating system.
  • How important is easy installation and low maintenance?  If a website creator finds that time is a precious commodity or doesn’t have the technical know how to undertake a complicated installation then he/she should look for a search engine that is easy to install and maintain. 
  • Do you want to maintain a collection of files stored in directories or do you want to maintain a datafile that stores the documents?  There are advantages to both.  The search engine that requires maintenance of a directory or directories of files does not require that you import each file into the database.  The search engine that requires the documents be stored in a datafile allows the administrator the luxury of only having to maintain one or more files (depending on the structure of the database).
  • What documents do you want to index and search?  If the documents that are going to be included in the database are spread across file servers or directories, then the search engine chosen must include the ability to index documents wherever they are located.
  • Do you want to maintain individual databases or be able to choose one or more databases to search?  If so, then the search engine has to have the capability to create multiple databases and the search interface must provide users with the option to select databases.
  • What search capabilities do you want to offer your users?  The searching functions must be examined carefully to see that they meet user needs.  See Appendix B. 
  • Do you want to add structured fields to your documents?  If so, the search engine must support the addition of structured fields.  The advantages to adding structured fields are many including the ability to search on specific criteria.  The disadvantages include the increased amount of time needed to maintain the database.
  • How do you want the results displayed?  Do you need the ability to modify the results display that is delivered with the search engine?  As with the searching functions, the results display function must be examined carefully to see that it meets user needs.  See Appendix C.
  • Do you want your users to be able to modify the results display?  Different users may find that they need to display different components of the documents they’ve retrieved.
  • Do you want your users to be able to view the HTML or ASCII form of the documents and be able to download the original wordprocessing file?  The various search engines handle this in different ways (some of which are more time consuming) and some do not offer the function. 
  • How much do you want to pay for the search engine?  The costs of the search engine range anywhere from free to thousands of dollars.  See Appendix D. 

These are just a few of the questions that must be considered before selecting a search engine.  To make a good selection, end users should be included in the evaluation process.  This will ensure that the database that is developed will meet user needs.  Also, there is no perfect search engine.  All of the features of the products must be examined and tradeoffs must be made depending on what is more important.  For example, a website developer may find that users must have Boolean operators but don’t need the ability to modify search results.  

Conclusion 

As mentioned earlier, the research for this paper was initiated in order to determine which search engine should be used for a searchable archive of court decisions.  And as also mentioned earlier, there is no perfect search engine that will provide all of the functions needed.  Selection is a three-step process: 

  • Determine the functionality needed
  • Compare what is needed with the functionality provided by the various products
  • Determine what tradeoffs you are willing to make
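
One way to work through these three steps is a simple weighted comparison, sketched below in Python.  The feature names, weights, and the two products ("Engine A" and "Engine B") are hypothetical placeholders, not evaluations of any search engine named in this paper.  The weights record how important each function is, and the list of missing features makes the tradeoffs explicit.

```python
def rank_engines(requirements, engines):
    """Rank engines by the total weight of the required features they support.

    requirements: feature name -> weight (importance to users)
    engines: engine name -> set of supported feature names
    Returns (name, total weight, missing features), best match first.
    """
    results = []
    for name, features in engines.items():
        total = sum(w for f, w in requirements.items() if f in features)
        missing = sorted(f for f in requirements if f not in features)
        results.append((name, total, missing))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Step 1: determine the functionality needed (hypothetical weights).
requirements = {"boolean": 3, "fielded": 2, "highlighting": 1}

# Step 2: compare against what each product provides (hypothetical products).
engines = {
    "Engine A": {"boolean", "highlighting"},
    "Engine B": {"boolean", "fielded"},
}

# Step 3: review what each choice would give up.
for name, total, missing in rank_engines(requirements, engines):
    print(name, total, "missing:", ", ".join(missing) or "none")
```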

One thought on “Retro Thoughts on Search Engines”

  1. Experienced librarians should spend more time putting up this type of content, since it is very informative for new librarians to see that we continue to struggle with the same concepts and improvements, for example, in search engines. In a way, it is a measuring stick for our profession. I wonder if there is an archive where all the work done by librarians goes? I think it would be neat to see other work done by librarians who have worked in the profession for awhile. Too much of the focus online is on the cutting edge and not enough on the past.