Yale University Library
Full Text CD-ROM databases



Guide to Searching the Electronic Text Collection

The Yale University Electronic Text Center uses a specialized version of the Library ht://dig search engine to search its Web site. This search engine allows the user to search Web pages on the ETC site, the general library server, or the library workstation.

Note: ht://dig excludes all directories in the /home file system (~ directories) on the general Library server, except for the /~lso and /~liblicense directories.

Search Options

There are five settings you can use to control your search. Use the ALL, ANY, or BOOLEAN Match settings to control how your search words relate to one another. Use the Format menu (LONG or SHORT) to control the output of your search.

1. Match = ALL

ALL is the equivalent of the Boolean term AND. It requires that all words entered in the search form appear on a Web page.

2. Match = ANY

ANY is the equivalent of the Boolean term OR. It requires that at least one of the terms entered in the search form appear on a Web page.

3. Match = BOOLEAN

Use Boolean expressions with the operators AND, OR, and NOT to describe exactly what you are searching for. You can combine your search words using these operators. Use parentheses (nesting) to further refine your search by creating sets. For example, a search for (A or B) and (C or D) finds all pages that contain at least one of A or B and also at least one of C or D.

Here are some examples of Boolean searches:

SML and hours
hours or open
(SML or CCL) and (hours or open)
SML not CCL and hours
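
The three Match settings can be illustrated with a short Python sketch. This is not ht://dig code; it assumes each Web page has already been reduced to a set of lowercase words, and the function names (match_all, match_any, boolean_example) are hypothetical.

```python
# Illustrative sketch only, not ht://dig's implementation.
# Assumption: a page is represented as a set of lowercase words.

def match_all(terms, page_words):
    """Match = ALL: every term must appear on the page (Boolean AND)."""
    return all(t.lower() in page_words for t in terms)

def match_any(terms, page_words):
    """Match = ANY: at least one term must appear on the page (Boolean OR)."""
    return any(t.lower() in page_words for t in terms)

def boolean_example(page_words):
    """Match = BOOLEAN, for the query: (SML or CCL) and (hours or open)."""
    return (match_any(["SML", "CCL"], page_words)
            and match_any(["hours", "open"], page_words))

page = {"sml", "hours", "reference"}
print(match_all(["SML", "hours"], page))   # both words are on the page
print(match_any(["CCL", "hours"], page))   # one of the two words is enough
print(boolean_example(page))               # one word from each set is present
```

Each parenthesized set in a Boolean search behaves like an ANY search, and the AND between the sets requires every set to be satisfied.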

4. Format = LONG

Search results display descriptions of Web pages.

5. Format = SHORT

Search results display page titles only.

About the search

Stopwords are ignored; you will be prompted to redo your search correctly. There is full support for the ISO-Latin-1 character set. Both SGML entities like '&agrave;' (à) and ISO-Latin-1 characters can be indexed and searched.

ht://dig uses "exact" and "fuzzy" matching. "Exact" matches will rank first, followed by "fuzzy" matches based on a database of word "endings"--fish, fisher, fishing. Additional information about the search algorithms is available at ht://dig home.
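
The ordering of "exact" before "fuzzy" matches can be sketched in Python as follows. This is a simplified illustration, not the ht://dig algorithm: the ENDINGS table stands in for the database of word endings, and its contents, the page names, and the rank function are all assumptions made for the example.

```python
# Illustrative sketch only. ENDINGS is a hypothetical stand-in for
# the ht://dig database of word endings.
ENDINGS = {"fish": ["fish", "fisher", "fishing"]}

def rank(term, pages):
    """Return page names with exact matches first, then fuzzy matches."""
    variants = [v for v in ENDINGS.get(term, []) if v != term]
    exact, fuzzy = [], []
    for name, words in pages.items():
        if term in words:
            exact.append(name)      # the page contains the word as typed
        elif any(v in words for v in variants):
            fuzzy.append(name)      # the page contains only a related ending
    return exact + fuzzy

pages = {
    "a": {"fishing", "boats"},
    "b": {"fish", "market"},
    "c": {"birds"},
}
print(rank("fish", pages))
```

In this sketch page "b" is listed before page "a" because it contains the search word itself, while page "c" is not returned at all.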

The search matches are ranked on the results screen. The rank of a match is determined by the weight of the words that caused the match and the weight of the algorithm that generated the match. Word weights are generally determined by the importance of the word in a document. For example, words in the title of a document have a much higher weight than words in the text of the document.

Explanation of ranking by the creator of ht://dig:

The scores are calculated from both stored information (gathered by htdig) and dynamic info like the matches returned from a search.

First the stored info: There are three parameters that make up the individual word weight that htdig assigns during the indexing process.

1) (w) The attributes 'heading_factor_*', 'keyword_factor', 'text_factor', and 'title_factor' are used as weight multipliers for words that htdig finds.
2) (c) The # of times the word occurs within a document
3) (l) The normalized location of the first occurrence of the word. (normalized means that it always falls between 0 and 1000, 0 being the top of the document)

Dynamic searching: After htsearch does all its magic with the fuzzy search algorithms and finds a list of matching documents, the document scores are computed. For each document, each word weight is computed with the formula:

w * c * (1000 - l)

This number is then multiplied with the weight assigned to the particular fuzzy algorithm that produced the word. All these numbers are summed up to get the final score for a match. (Note that the calculation of the scores did *not* require the actual document record to be retrieved. This is part of the reason htsearch is pretty fast.) The scores have a pretty big range and after sorting the matches on score, the highest score is assumed 100% or whatever maximum number of stars you display.
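
The calculation described above can be worked through in a few lines of Python. This is a sketch of the formula as quoted, not htsearch source code; the function names and the sample values for w, c, l, and the fuzzy algorithm weight are invented for illustration.

```python
# Illustrative sketch of the quoted formula, not htsearch code.
# All sample numbers below are made up for the example.

def word_score(w, c, l, algorithm_weight=1.0):
    """w * c * (1000 - l), multiplied by the fuzzy algorithm's weight."""
    return w * c * (1000 - l) * algorithm_weight

def document_score(matches):
    """Sum the word scores; each match is (w, c, l, algorithm_weight)."""
    return sum(word_score(*m) for m in matches)

def as_percent(scores):
    """Scale scores so that the highest-scoring document is 100%."""
    top = max(scores.values())
    if top == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: 100.0 * s / top for doc, s in scores.items()}

scores = {
    # word found 3 times, first at the very top, multiplier 2.0, exact match
    "A": document_score([(2.0, 3, 0, 1.0)]),
    # word found 2 times, first halfway down, multiplier 1.0, fuzzy weight 0.5
    "B": document_score([(1.0, 2, 500, 0.5)]),
}
print(scores)
print(as_percent(scores))
```

Because (1000 - l) shrinks as the first occurrence moves down the page, a word that first appears near the top of a document contributes more to the score than the same word appearing only near the bottom.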


© 2000 Yale University Library
E-mail questions to etc@yale.edu
Phone Reference (203) 432-1775/1780


© 2007 Yale University Library
This file last modified 07/16/01
Send comments to the SML Reference Desk
