Guide to Searching the Electronic Text Collection
The Yale University Electronic Text Center uses a specialized
version of the Library ht://dig search engine to search its Web site.
This search engine allows the user to search Web pages on the ETC
site, the general library server, or the library workstation.
Note: ht://dig excludes all directories in the /home file
system (~ directories) on the general Library server, except for the
/~lso and /~liblicense directories.
Search Options
There are five options you can use to control your search. Use
the ALL, ANY, or BOOLEAN Match setting to control how your search words
relate to one another. Use the Format menu to control the output of
your search.
1. Match = ALL
ALL is the equivalent of the Boolean term AND. It requires that all words
entered in the search form appear on a Web page.
2. Match = ANY
ANY is the equivalent of the Boolean term OR. It requires that at least one
of the terms entered in the search form appear on a Web page.
3. Match = BOOLEAN
Use Boolean expressions to describe exactly what you are searching for using
the operators AND, OR, and NOT. You can combine your search words using these
operators. Use parentheses (nesting) to further refine your search by creating
sets. For example, a search for (A or B) and (C or D) finds all pages that
contain either A or B AND either C or D.
Here are some examples of Boolean searches:
SML and hours
hours or open
(SML or CCL) and (hours or open)
SML not CCL and hours
4. Format = LONG
Search results display descriptions of Web pages.
5. Format = SHORT
Search results display page titles only.
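The three Match settings described above can be illustrated with a small Python sketch. It is only a model of the matching rules, not of how ht://dig stores or queries its index: each page is reduced to a set of lowercase words, and the page names and contents are made up for the example.

```python
# Illustrative model only: each page is a set of lowercase words.
# This is not how ht://dig represents its index.

def match_all(page_words, terms):
    """ALL: every search term must appear on the page (Boolean AND)."""
    return all(t.lower() in page_words for t in terms)


def match_any(page_words, terms):
    """ANY: at least one search term must appear on the page (Boolean OR)."""
    return any(t.lower() in page_words for t in terms)


# Hypothetical pages, invented for this example.
pages = {
    "sml-hours": {"sml", "hours", "open"},
    "ccl-hours": {"ccl", "hours"},
    "sml-maps": {"sml", "maps"},
}

# Match = ALL with the words "SML hours"
all_hits = sorted(n for n, w in pages.items() if match_all(w, ["SML", "hours"]))

# Match = ANY with the words "SML hours"
any_hits = sorted(n for n, w in pages.items() if match_any(w, ["SML", "hours"]))

# Match = BOOLEAN: (SML or CCL) and (hours or open), written out by hand.
bool_hits = sorted(
    n for n, w in pages.items()
    if ("sml" in w or "ccl" in w) and ("hours" in w or "open" in w)
)

print(all_hits)
print(any_hits)
print(bool_hits)
```

In this toy data, ALL is the narrowest setting and ANY the broadest; the BOOLEAN expression falls in between because each parenthesized set must be satisfied.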
About the search
Stopwords are ignored; you will be prompted to redo your search correctly. There
is full support for the ISO-Latin-1 character set. Both SGML entities like
'&agrave;' and ISO-Latin-1 characters can be indexed and searched.
ht://dig uses "exact" and "fuzzy" matching. "Exact" matches rank first,
followed by "fuzzy" matches based on a database of word "endings" (for example,
fish, fisher, fishing). Additional information about the search algorithms is
available at the ht://dig home page.
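As a rough illustration of exact matches ranking ahead of fuzzy ones, here is a minimal Python sketch. The endings table, the page data, and the function names are all hypothetical; ht://dig builds its own endings database, and its real lookup is more involved than this.

```python
# Hypothetical endings table: maps a word to related forms.
# ht://dig's real endings database is built differently.
ENDINGS = {"fish": ["fisher", "fishing"]}


def expand(word):
    """Return the search word itself ("exact") plus its related forms ("fuzzy")."""
    word = word.lower()
    return [(word, "exact")] + [(v, "fuzzy") for v in ENDINGS.get(word, [])]


def search(word, pages):
    """List matching pages, exact matches first, each page reported once."""
    results = []
    seen = set()
    for variant, kind in expand(word):
        for name, words in pages.items():
            if variant in words and name not in seen:
                seen.add(name)
                results.append((name, kind))
    return results


# Made-up pages, each reduced to a set of lowercase words.
sample_pages = {
    "boats": {"fishing", "boat"},
    "ponds": {"fish", "pond"},
    "trade": {"fisher", "fish"},
}

for name, kind in search("fish", sample_pages):
    print(name, kind)
```

A page that contains both the exact word and a related form is listed once, under its exact match.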
The search matches are ranked on the results screen. The rank of a match
is determined by the weight of the words that caused the match and the weight
of the algorithm that generated the match. Word weights are generally determined
by the importance of the word in a document. For example, words in the title
of a document have a much higher weight than words in the text of the document.
Explanation of ranking by the creator of ht://dig:
The scores are calculated from both stored information (gathered by htdig)
and dynamic info like the matches returned from a search.
First the stored info: There are three parameters that make up the individual
word weight that htdig assigns during the indexing process.
1) (w) The attributes 'heading_factor_*', 'keyword_factor', 'text_factor',
and 'title_factor' are used as weight multipliers for words that htdig finds.
2) (c) The number of times the word occurs within a document
3) (l) The normalized location of the first occurrence of the word. (normalized
means that it always falls between 0 and 1000, 0 being the top of the document)
Dynamic searching: After htsearch does all its magic with the fuzzy search
algorithms and finds a list of matching documents, the document scores are
computed. For each document, each word weight is computed with the formula:
w * c * (1000 - l)
This number is then multiplied by the weight assigned to the particular
fuzzy algorithm that produced the word. All these numbers are summed up to
get the final score for a match. (Note that the calculation of the scores
did *not* require the actual document record to be retrieved. This is part
of the reason htsearch is pretty fast.) The scores have a pretty big range,
and after sorting the matches by score, the highest score is taken to be 100%,
or whatever maximum number of stars you display.
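The calculation described above can be followed with a short Python sketch. The formula w * c * (1000 - l) and the multiplication by the fuzzy algorithm's weight come from the explanation; the function names, the factor values (100 for a title word, 1 for a text word), and the algorithm weights (1.0 and 0.5) are assumptions chosen only to make the arithmetic easy to check.

```python
def word_weight(w, c, l, algorithm_weight=1.0):
    """Weight of one matched word: w * c * (1000 - l), scaled by the
    weight of the fuzzy algorithm that produced the word."""
    if not 0 <= l <= 1000:
        raise ValueError("l must be normalized to the range 0..1000")
    return w * c * (1000 - l) * algorithm_weight


def document_score(matched_words):
    """Sum the weights of all matched words in one document."""
    return sum(word_weight(*m) for m in matched_words)


def as_percent(scores):
    """Scale scores so that the highest-scoring document shows as 100%."""
    top = max(scores.values())
    if top == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: 100.0 * s / top for doc, s in scores.items()}


# Hypothetical matches as (w, c, l, algorithm_weight) tuples.
scores = {
    # One exact match in the title, at the very top of the document.
    "A": document_score([(100, 1, 0, 1.0)]),
    # Two matches in the body text; the second came from a fuzzy
    # algorithm that is assumed here to carry half the weight.
    "B": document_score([(1, 5, 200, 1.0), (1, 2, 500, 0.5)]),
}

percent = as_percent(scores)
for doc in sorted(percent, key=percent.get, reverse=True):
    print(doc, scores[doc], percent[doc])
```

With these made-up numbers, document A scores 100 * 1 * 1000 = 100000 and document B scores 4000 + 500 = 4500, so A displays as 100% and B as 4.5%. This shows how a single title match can outrank several matches in the body text.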

© 2000 Yale University Library
E-mail questions to etc@yale.edu
Phone Reference (203) 432-1775/1780
© 2007 Yale University Library
This file last modified 07/16/01
Send comments to the SML Reference Desk