Freestyle is a search method that allows a search to be specified in
plain language, without the use of connectors or a specific syntax. It provides
answers in a relevance ranked list, based on the statistical similarity of each
document to the search. Freestyle allows users who are not trained in boolean
search to quickly and easily obtain reasonable search results. Freestyle can
also be a powerful search tool for the advanced boolean searcher.
Table of Contents
- Information Retrieval Theory
- Statistical Retrieval
- Term Vectors
- An Example Topic
- When To Use Freestyle
- When Boolean is Better
- Concept Searching
- Ensuring Maximum Coverage
- Statistical Tools for Boolean Searches
- Good Search Terms
- Relevance Ranking the Results
- Converting Failed Searches
- Zero Answer Searches
- More Than 1000 Answers
- More-Like-This
- Using Freestyle
- Selecting the Source
- Enter the Freestyle Search
- Phrase Recognition
- Noise Words
- Mandatory Terms
- Restrictions
- Thesaurus and Related Concepts
- Number of Documents
- Understanding Freestyle Search Results
- The WHERE Screen
- The WHY Screen
- SuperKWIC Display Mode
Information Retrieval Theory
This section describes some of the basic theory behind statistical retrieval
techniques. Because Freestyle utilizes a statistical retrieval algorithm, an
understanding of the underlying theory will allow the advanced user to better
understand Freestyle and to achieve better results.
A primary source of data on information retrieval methods and effectiveness
is the Text Retrieval Conference (TREC) sponsored by the National Institute of
Standards and Technology (NIST). The conference provides a reasonably sized
collection of documents and a set of topics, and judges the relevance of documents
to the topics. The conference attracts many of the universities and corporations
active in the area of text retrieval. The conference provides actual data on the
effectiveness of many different algorithms and theories.
TREC results have consistently shown that for text retrieval, statistical
methods provide results that are superior to natural language methods.
Boolean methods are very difficult to compare to automatic methods due to the
human involvement of formulating the boolean search, and the lack of a ranked
results list. However, the limited tests done have shown statistical retrieval
to perform as well as or better than boolean searches constructed by expert
searchers.
Statistical Retrieval
Statistical retrieval ranks the documents in the target collection based on
their similarity to the search description. The ranking utilizes the words and
phrases within the documents and the search. In its simplest form, only
individual words are considered. The ranking formula applies a term weight to
each search term, and then scores each document that contains one or more of the
search terms.
The two fundamental components to term weight are the term frequency
and the inverse document frequency. The term frequency is the number
of times a term occurs within a document. The more often the term occurs, the
more likely the document contains relevant material related to the concept
represented by the term. The inverse document frequency is the inverse of the
number of documents within the collection that contain at least one occurrence
of the term. The higher the number of documents that contain the term, the lower
the value of the term in differentiating the documents. For example, if a term
occurs in every document, the term is not a good term for ranking the documents.
The simplest scoring formula multiplies the term frequency by the
inverse document frequency for each search term and each document, and sums the
values for each document.
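The simplest formula can be sketched in a few lines of Python. This is an illustration only and not the Freestyle ranking formula: the toy documents, the whitespace tokenizer, and the function name simple_scores are all assumptions made for the example. The inverse document frequency is taken literally as 1 divided by the number of documents that contain the term.

```python
from collections import Counter

def simple_scores(query_terms, documents):
    """Score each document as the sum over query terms of tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term]:
                score += term_freq[term] * (1.0 / doc_freq[term])
        scores.append(score)
    return scores

# Hypothetical three-document collection.
docs = [
    "osteoporosis research on calcium and osteoporosis",
    "calcium in the diet",
    "research on diet",
]
scores = simple_scores(["osteoporosis", "research"], docs)
print(scores)  # [2.5, 0.0, 0.5]
```

The first document scores highest because it contains the rarer term twice; the term that appears in two documents contributes only half as much per occurrence.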
Unfortunately, the simple formula does not work well in real collections.
Longer documents will always score better because they will contain more terms,
and a higher frequency for those terms. It is therefore important to normalize
the document scores based on the document length. Although much work has been
done in this area, current formulas seem to have a bias for either shorter or
longer material.
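The length bias is easy to reproduce. The sketch below assumes the crudest possible normalization, dividing by the number of tokens in the document; it is one of many possible schemes and is not claimed to be the one Freestyle uses. The function names and the idf value are hypothetical.

```python
def raw_score(term, tokens, idf):
    """Unnormalized score: term frequency times inverse document frequency."""
    return tokens.count(term) * idf

def length_normalized_score(term, tokens, idf):
    """Same score divided by document length in tokens (one simple normalization)."""
    if not tokens:
        return 0.0
    return raw_score(term, tokens, idf) / len(tokens)

short_doc = "osteoporosis treatment study".split()
long_doc = short_doc * 4      # the same text repeated four times
idf = 0.5                     # hypothetical inverse document frequency

raw_short = raw_score("osteoporosis", short_doc, idf)
raw_long = raw_score("osteoporosis", long_doc, idf)
norm_short = length_normalized_score("osteoporosis", short_doc, idf)
norm_long = length_normalized_score("osteoporosis", long_doc, idf)
```

Without normalization the longer document scores four times higher even though it says nothing new; after dividing by length the two documents score the same.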
Single terms can be very ambiguous and have different meanings depending upon
their context. This problem may be greatly reduced by considering phrases as
well as single terms. A phrase often has a more specific meaning than a
single term.
Term Vectors
A term vector is a list of terms that are extracted from a document. The
terms may be single words or phrases, and may include proper nouns. The terms
are selected based on statistics and phrase recognition software. The theory is
that a relatively small set of important terms represent the concepts within the
document, and that these terms make excellent search terms.
One use of term vectors is for relevance feedback. Relevance feedback is a
method of improving a search results set by modifying the original search based
on the searcher's judgment of relevance of one or more of the answers. Of the two
basic approaches, search expansion and term re-weighting, search expansion
provides the most improvement. The search expansion method uses the term
vector from a relevant answer to supplement the original search terms. This
method can provide dramatic improvement in the answer precision.
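Search expansion can be sketched as follows. The function name expand_search, the cap on added terms, and the sample term vector are assumptions for illustration; the source does not say how many terms are added or how they are chosen.

```python
def expand_search(search_terms, term_vector, max_new_terms=5):
    """Append terms from a relevant document's term vector that are not
    already present in the search, up to max_new_terms additions."""
    expanded = list(search_terms)
    seen = {term.lower() for term in search_terms}
    added = 0
    for term in term_vector:
        if added >= max_new_terms:
            break
        if term.lower() not in seen:
            expanded.append(term)
            seen.add(term.lower())
            added += 1
    return expanded

# Hypothetical term vector extracted from an answer judged relevant.
vector = ["osteoporosis", "estrogen", "bone density", "calcium"]
expanded = expand_search(["osteoporosis", "research"], vector, max_new_terms=2)
print(expanded)  # ['osteoporosis', 'research', 'estrogen', 'bone density']
```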
The second term vector feature is a statistical thesaurus. The statistical
thesaurus is a very large collection of term vectors, generated in advance from
the documents that are part of the data collection. The theory is that terms
that occur together in a term vector are related to each other, and represent
the same concept. The search is then expanded using the statistical thesaurus.
This feature has been shown to be vastly superior to ordinary thesauri for
enhancing queries.
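A statistical thesaurus of this kind can be approximated by counting how often terms share a term vector. The sketch below is a simplified model under that assumption; the function names and the sample vectors are hypothetical, and a production system would use weighted statistics rather than raw counts.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_thesaurus(term_vectors):
    """Map each term to a count of the terms that share a term vector with it."""
    related = defaultdict(Counter)
    for vector in term_vectors:
        for first, second in permutations(set(vector), 2):
            related[first][second] += 1
    return related

def related_terms(thesaurus, term, limit=3):
    """Return the terms most often seen with `term`, ties broken alphabetically."""
    counts = thesaurus.get(term, {})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]

# Hypothetical term vectors, one per document.
vectors = [
    ["osteoporosis", "calcium", "estrogen"],
    ["osteoporosis", "calcium", "fracture"],
    ["diet", "calcium"],
]
thesaurus = build_thesaurus(vectors)
print(related_terms(thesaurus, "osteoporosis"))  # ['calcium', 'estrogen', 'fracture']
```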
An Example Topic
The remainder of this document will use examples to illustrate features of
the LexisNexis system. All of the examples are based on a single topic. The
following topic was used for the TREC conference. It was randomly selected as an
example topic for this paper.
What research is ongoing to reduce the effects of osteoporosis in existing
patients as well as prevent the disease occurring in those unafflicted at this
time?
Customer service personnel from LexisNexis participated in a boolean
experiment for this topic as expert boolean searchers. The search they
constructed was:
research! OR stud! OR analy! OR test! OR experiment! OR exam! OR inqu! W/25
osteoporosis
This boolean search was the result of 25 minutes of work and was the eleventh
search constructed. This search returned 32 answers from the TREC collection, 20
of which were assessed as relevant. The assessors found a total of 36 relevant
documents for the topic within the collection. The precision for the boolean
search was 63%, and recall was 56%.
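The two percentages follow directly from the counts above. A short check, using only the figures quoted in this section (the function name is an assumption):

```python
def precision_recall(retrieved, relevant_retrieved, total_relevant):
    """Return (precision, recall) for one answer set."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall

# Boolean search on the example topic: 32 answers, 20 relevant, 36 relevant overall.
precision, recall = precision_recall(retrieved=32, relevant_retrieved=20, total_relevant=36)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# precision = 62.5%, recall = 55.6%
```

Rounded to whole percentages these are the 63% and 56% reported above.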
By contrast, the Freestyle search below obtained 19 relevant documents within
the top 32, 5 of which were not in the boolean answer set. It took less than 2
minutes to construct and run the search. The Freestyle search had a total of 25
relevant documents in the top 50 answers, with 8 answers which were not in the
boolean answer set. There were a total of 32 relevant documents in the top 100
answers, with 14 answers which were not in the boolean answer set. The boolean
answer set had 2 documents which were not in the Freestyle answer set, and there
were 2 relevant documents in the collection that neither search located.
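The overlap figures are consistent with one another, as the arithmetic below shows. It assumes that "the Freestyle answer set" means the top 100 answers and that the 2 documents found only by the boolean search were relevant; the variable names are for illustration.

```python
total_relevant = 36        # relevant documents judged in the collection
boolean_relevant = 20      # relevant documents in the boolean answer set
freestyle_relevant = 32    # relevant documents in the Freestyle top 100
freestyle_only = 14        # Freestyle relevant documents missing from the boolean set

found_by_both = freestyle_relevant - freestyle_only     # 18
boolean_only = boolean_relevant - found_by_both         # 2
found_by_either = freestyle_relevant + boolean_only     # 34
found_by_neither = total_relevant - found_by_either     # 2
```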
When To Use Freestyle
Freestyle is not a replacement for boolean searching. Freestyle supplements
boolean searching, allowing the expert searcher to use another tool to locate
the desired information in an efficient manner. There are times when it is
appropriate to use either boolean or Freestyle, and times when both methods may
be used together.
When Boolean is Better
Certain retrieval problems work much better in boolean than they do in Freestyle.
Boolean allows for a more specific definition of a search. When searching
public records or other structured material, boolean allows for a more specific
definition of names and provides superior results when used by a skilled
searcher.
Boolean also allows a skilled searcher to find every mention of a specific
name.
Boolean allows the use of universal characters, and many connectors not
supported by Freestyle. When these features are needed, boolean must be used.
Concept Searching
Freestyle provides an easier method of searching for general concepts such as:
"What are the benefits of pets to the elderly?" A topic such as this
can be entered as is to Freestyle, whereas a boolean search may be difficult to
construct.
This is especially true when researching an unfamiliar topic. Freestyle can
usually bring back a relevant document ranked high in the answer set where it is
quickly found, even with a relatively weak search description. By contrast, a
weak boolean search would return a very large number of answers where the same
document is somewhere in the large set, but not necessarily near the front.
Ensuring Maximum Coverage
Many times it is not sufficient to find some of the material about a topic -
all of the material must be found. In a system with as much material as
LexisNexis, this is a nearly impossible task. One method to enhance your
overall recall is to use both multiple searches and multiple search
methods. Because Freestyle differs from boolean in its retrieval method, it
often finds relevant documents that boolean searches do not find. This is
especially true when the Freestyle search contains terminology that is not
present in the boolean searches. As shown in the example topic, the Freestyle
search retrieved many documents missed by the boolean search.
Statistical Tools for Boolean Searches
Statistical tools can help the boolean searcher. Tools are available to help
select search terms, browse answers, and find results where a boolean search has
failed.
Good Search Terms
The key to a successful search is using the right terminology. Without the
right terms, a search will not retrieve the right material, or will retrieve too
much non-relevant material. Selecting the right terms can be a difficult task,
especially in topic areas unfamiliar to the searcher.
A traditional approach is to start with a few terms, run a search, and read
some of the documents returned. Then the search is refined based on the
documents found. New terms may be added to broaden the search, or to further
restrict a search. This method is very effective, but is very time consuming,
and may miss large amounts of relevant material.
The Related Concepts feature allows the user to quickly obtain a list of
potential search terms that are related to user specified terms. The system
provided terms are related to the user provided terms because they occur within
the same documents as the user provided terms, and both terms were important
terms within the documents. This is vastly different from a standard thesaurus
that relates terms that are similar in meaning.
The Related Concepts feature is accessed by placing a single word or phrase
within parentheses after typing REL. For example, if your topic is osteoporosis,
your request would be:
REL(OSTEOPOROSIS)
and the resulting display would be:
As you can see, the terms displayed are all very relevant to the topic of
osteoporosis. After selecting the desired terms, the search entry screen is
redisplayed with the terms listed on the screen, but not within the existing
search. The searcher must insert the terms into the search using the appropriate
connectors. This differs from the thesaurus feature because the Related Concepts
terms will not necessarily be combined with an OR connector. It may be desirable
to use a restrictive connector such as AND or WITHIN.
It is possible to request Related Concepts for multiple search terms. Each
must be specified separately. For example, if the topic is how diet affects
osteoporosis, the request would be:
REL(OSTEOPOROSIS) and REL(DIET)
and the resulting display would be:
When multiple terms are requested from boolean, all of the requested terms
must occur in the same documents as the terms returned. This differs from the Freestyle
feature where only some of the terms need to be present. If a multiple
term request can find no terms, the requests may be done individually to see the
concepts for each term.
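The difference between the two behaviors can be modeled as "all requested terms" versus "any requested term". The sketch below uses term vectors as stand-ins for documents, which is an assumption; the function names and sample data are hypothetical and the real feature works from a much larger statistical thesaurus.

```python
def related_for_all(term_vectors, requested):
    """Boolean-style request: use only vectors that contain every requested term."""
    requested = set(requested)
    found = set()
    for vector in term_vectors:
        if requested <= set(vector):
            found |= set(vector) - requested
    return sorted(found)

def related_for_any(term_vectors, requested):
    """Freestyle-style request: use vectors that contain at least one requested term."""
    requested = set(requested)
    found = set()
    for vector in term_vectors:
        if requested & set(vector):
            found |= set(vector) - requested
    return sorted(found)

# Hypothetical term vectors, one per document.
vectors = [
    ["osteoporosis", "diet", "calcium"],
    ["osteoporosis", "estrogen"],
    ["diet", "fiber"],
]
print(related_for_all(vectors, ["osteoporosis", "diet"]))  # ['calcium']
print(related_for_any(vectors, ["osteoporosis", "diet"]))  # ['calcium', 'estrogen', 'fiber']
```

The stricter request returns fewer terms, and can return none at all, which is why a failed multiple term request is worth retrying one term at a time.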
The Related Concepts feature uses a large statistical thesaurus built from the
documents contained within the LexisNexis database. Not every document is
represented and the material is grouped by news and legal. Therefore, you will
see different related terms for the same search term in legal material and in
news material.
Relevance Ranking the Results
A boolean search returns a list of result documents ordered by some attribute
of the document. In news material, the documents generally appear in a reverse
chronological order - the most recent stories are first. A relevance-ranked
results list, the default order of results from a Freestyle search, ranks the
documents based on their statistical similarity to the search terms. The theory
is that the most relevant documents are first.
If a boolean results set contains fewer than 1000 documents, the documents may
be reordered based on their relevance to the boolean search terms by entering
".RANK". There are some instances where ranking is not possible due to
a lack of good terms. For example, the search "DATE IS 4/9/1997" is a
valid search, but there are no terms useful for ranking the results.
As with Freestyle, the ranking is better when more terms are available. The
ranking formula uses the same principles as the Freestyle ranking formula.
After results have been ranked by relevance, the original sort order may be
obtained by entering ".RANK" a second time.
Converting Failed Searches
Boolean searches fail when no documents are found, or the system predicts it
will find more than 1000 documents. It is possible to complete a search that
will return more than 1000 documents, but it is not usually done because it is
hard to work with such a large number of answers.
Zero Answer Searches
When a boolean search returns zero answers, the system may offer to convert
the search to a Freestyle search. It will only offer the conversion if the
system is able to obtain search terms from the original search, and it has a
reasonable probability of finding documents. If a single term boolean search
fails to find any documents, using the same single term in Freestyle will fail
also. The same holds true for multiple terms connected via the OR connector.
Additionally, Freestyle does not support universal characters, so any terms
containing a universal or super universal character cannot be converted.
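The rules in this section can be summarized as a rough eligibility check. This is only a sketch of the stated rules, not the system's actual logic: the function name, the simplified connector list (AND, OR, W/n), and the assumption that universal characters are written as "!" and "*" are all illustrative.

```python
import re

# Simplified connector list for illustration; the real system supports more.
CONNECTOR = re.compile(r"^(and|or|w/\d+)$", re.IGNORECASE)
UNIVERSAL_CHARS = ("!", "*")  # assumed notation for universal characters

def convertible_terms(boolean_search):
    """Return the terms a Freestyle conversion could use, or [] if none is offered."""
    tokens = boolean_search.split()
    connectors = [t.lower() for t in tokens if CONNECTOR.match(t)]
    terms = [t for t in tokens if not CONNECTOR.match(t)]
    # A single term, or terms joined only by OR, would fail in Freestyle as well.
    if len(terms) <= 1 or all(c == "or" for c in connectors):
        return []
    # Terms containing universal characters cannot be carried over.
    return [t for t in terms if not any(ch in t for ch in UNIVERSAL_CHARS)]

print(convertible_terms("calcium AND estrogen"))                   # ['calcium', 'estrogen']
print(convertible_terms("calcium OR estrogen"))                    # []
print(convertible_terms("research! OR stud! W/25 osteoporosis"))   # ['osteoporosis']
```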
More Than 1000 Answers
These searches will be converted to Freestyle, except where no search terms
can be found due to universal characters as described in the prior section.
Freestyle is especially useful in returning a small number of relevant
documents when a very large number of potentially relevant documents are
available.
More-Like-This
The More-Like-This feature constructs a Freestyle search from a document. The
feature is activated by entering .MORE when the desired document is
displayed. The system will generate a term vector from the document, combine
this with the original search terms, and construct a Freestyle search
description. The Freestyle SEARCH OPTIONS screen will be displayed with the
search description present. The search may then be edited, mandatory terms
added, restrictions added, etc. The search may then be run, and the resulting
answers browsed. To return to the original search, enter .EM (exit
more).
Not all documents are candidates for More-Like-This. The document must
be relevant. The document should focus on a single topic. Testing has shown that
More-Like-This provides excellent results when the seed document was very good.
However, the results are poor when the seed document was not relevant, or
contained other non-relevant topics.
Additionally, the system may not be able to produce a suitable term vector
from some documents, and a message will be displayed, indicating that
More-Like-This is unavailable from that document.
The More-Like-This feature automatically saves the original search on the
log. The user may optionally save the More-Like-This search to the log as well. This is
done by entering .KEEP after running the More search. Once the More
search has been kept to the log, it becomes a regular Freestyle search. When it
is retrieved via the .LOG command, the .EM command will no longer
be valid. Additionally, .MORE may be performed from this search, even
though .MORE cannot be performed from a More-Like-This search.
Using Freestyle
This section provides tips for using Freestyle with all examples based on the
example topic.
Selecting the Source
The LexisNexis system contains vast amounts of data that are divided into
libraries and files. Large group files exist, such as the NEWS library ALLNWS
file, which contain millions of documents from thousands of sources. Also,
single source files exist such as the NEWS library, LAT file, which contains
only newspaper stories published in the Los Angeles Times. The ALLNWS file
offers all of the data in the LAT file, as well as much more. However, searching
the ALLNWS file is more difficult for a number of reasons. A search will return
many more answers, which makes finding the relevant material more difficult.
Additionally, the cost of the ALLNWS file is far greater than the cost of the
LAT file. An additional consideration for Freestyle is that the LAT file
contains stories that are relatively uniform in length, and mostly focus on a
single topic. Statistical algorithms will work best in this type of material. By
contrast, the ALLNWS file contains newspaper stories, as well as more in depth
magazine stories, and many other very large and very small documents. This
diverse collection is much more difficult to search.
One approach to beginning a research topic is to select a relatively small
source that should have a reasonable coverage of the desired topic. Newspapers
often make an excellent starting source because of their broad topic
coverage and focused stories. The initial searches will work better, and cost
less. Once a search has been perfected, change to the larger source, and rerun
the search.
Enter the Freestyle Search
A Freestyle search may be entered in plain language, such as the way the
example topic from TREC was specified. Alternatively, the search may be simply a
set of words and phrases. For the example topic, the initial search may have
been:
research, osteoporosis, prevent, disease, patient
Each approach has advantages and disadvantages. When a complete sentence is
entered, there may be phrases present, words that provide value that the
searcher may not think of as a search term, and a natural redundancy of
important terms. Each occurrence of the same term in the search is weighted
independently. Five occurrences of the same term have the net effect of making
that term five times more important than if it had only been entered one time.
Unfortunately, a plain language search may include many terms which don't help,
and may actually hurt.
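The effect of repeating a term can be pictured as a per-term multiplier. The sketch below simply counts occurrences in a comma-separated search; the function name and the sample search are assumptions, and the real system also handles sentences and phrases.

```python
from collections import Counter

def term_multipliers(search_description):
    """Count how many times each comma-separated term appears in the search."""
    terms = [t.strip().lower() for t in search_description.split(",")]
    return Counter(t for t in terms if t)

multipliers = term_multipliers(
    "osteoporosis, osteoporosis, osteoporosis, research, prevent, disease, patient"
)
print(multipliers["osteoporosis"], multipliers["research"])  # 3 1
```

Here osteoporosis carries three times the importance it would have if entered once, while every other term keeps its base weight.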
For the example topic, the example was entered as a question. Then the term osteoporosis
was entered as a mandatory term, and also added to the search description
multiple times due to its importance. The search was then expanded using related
concepts. If the topic is entered as is, without using a mandatory term or
expanding the search description in any way, the search will find only 7
relevant documents within the top 32 positions, whereas the modified search
returns 19 relevant documents.
Phrase Recognition
Freestyle scans the entered search, and recognizes phrases based on an
internal phrase dictionary. The phrase dictionary contains a large list of
phrases built from data in the LexisNexis system. It is not an exhaustive list
of phrases, especially proper noun phrases. The searcher may specify a phrase
manually by enclosing the words in quotes. A searcher may prevent phrase
recognition by separating terms with commas. If Freestyle identifies a phrase in
a search, and the searcher does not want the words to be searched as a phrase,
the search may be edited, and the quotes removed. Freestyle only attempts to
find phrases on the initial search entry screen, and not from the search edit
screen.
Noise Words
Freestyle recognizes two different sets of noise words - words which
do not help in retrieval. One set is common with boolean search, and is not
indexed by LexisNexis. These words include such things as personal pronouns
(he, she, they) and forms of the verb to be (be, is, was). Freestyle uses a
second, larger set of terms, which do not add value to a statistical search -
either because they occur too frequently, or they do not provide much semantic
value. This second set of terms may be useful as part of a phrase, and the terms
are searchable in a phrase.
Unfortunately, it is difficult to make a list of words that is perfect for
every possible topic.
In the example topic, the words "is", "the", "of", "as", "those", and "this"
are all part of the first set of noise words and are not searchable in the
LexisNexis system. The words "at", "in", and "well" are part of the second
set of noise words, and are only searchable in Freestyle as part of a phrase.
In this example, the words "occurring" and "ongoing" will probably add very
little value, and should not have been search terms. Although Freestyle tries
to remove unnecessary terms, the searcher should check which terms were used
in the ranking process, via the WHY screen, and modify the search as needed.
Freestyle displays the following screen after a search completes. As noted on
the screen, the search terms are displayed in order of importance. The terms
shown after the "*" are the second class of noise terms discussed
above. The first class of noise terms are not displayed on the screen at all.
Mandatory Terms
Mandatory terms restrict the statistical algorithm to only those documents
which contain the specified mandatory term or terms. Normally, the statistical
algorithm ranks documents based on overall search term occurrence within the
document, and the absence of any one search term may not be enough to prevent
the document from ranking high in the list based on the other search terms.
Logically, Freestyle constructs a boolean search with the mandatory terms
connected via an AND connector, and uses this to restrict the collection for the
statistical algorithm. This means no documents will be considered which do not
contain the mandatory terms. A term which is specified as a mandatory term will
not be used in the statistical scoring of documents unless it is also entered as
a search description term. A side effect of making a term mandatory is that it
will occur in every document considered by the statistical algorithm, and will
therefore be assigned a very low term weight. This may be overcome by repeating
the term multiple times in the search description.
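The side effect can be seen with a small calculation. It assumes the simple inverse document frequency described earlier (1 divided by the number of documents containing the term) and uses the 132 documents that contain osteoporosis in the TREC collection; the other counts and the function name are hypothetical.

```python
def term_weight(doc_freq, repeats=1):
    """Weight of a search term: repeats times inverse document frequency."""
    return repeats / doc_freq

restricted_size = 132    # every document considered contains the mandatory term
mandatory_once = term_weight(restricted_size)                 # lowest possible weight
mandatory_repeated = term_weight(restricted_size, repeats=5)  # entered five times
other_term = term_weight(12)   # hypothetical term found in 12 of the 132 documents
```

Because the mandatory term occurs in all 132 documents, no term can have a lower weight. Repeating it five times raises the weight fivefold, though in this example it still does not reach the weight of the rarer term.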
Mandatory terms must be used carefully. An advantage of Freestyle is that it
finds documents that boolean searches miss. The reason is that boolean searches
often rely on the presence of a specific term, and its absence does not
mean the document is not relevant.
Terms are entered as mandatory terms individually - do not use sentences.
Phrases should be double quoted, and Freestyle will automatically recognize
phrases when mandatory terms are initially entered. When mandatory terms are
edited, the automatic phrase recognition is not performed. Multiple terms that
are entered and are not in a phrase will be connected via an AND connector.
Mandatory terms entry:
"ongoing research" osteoporosis
Resulting boolean limitation applied to the collection:
ongoing research AND osteoporosis
Freestyle's synonym syntax may be used to connect terms with an OR connector.
Synonyms are contained within parentheses immediately following the head word.
For example:
Mandatory terms entry:
osteoporosis, prevent (prevention, treatment)
Resulting boolean limitation:
osteoporosis AND ( prevent OR prevention OR treatment )
Freestyle will NOT automatically recognize phrases within the parentheses,
however the searcher may specify a phrase using double quotation marks.
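The translation from a mandatory terms entry to a boolean limitation can be sketched with a small parser. It handles only the forms shown above (single words, double-quoted phrases, and a parenthesized synonym list after a head word); the function names and the regular expression are assumptions, and automatic phrase recognition is not modeled.

```python
import re

# A head term (quoted phrase or single word), optionally followed by "(syn, syn, ...)".
ENTRY = re.compile(r'("[^"]+"|[^\s,()"]+)\s*(?:\(([^)]*)\))?')

def unquote(term):
    """Strip surrounding whitespace and double quotes from a term."""
    return term.strip().strip('"')

def mandatory_to_boolean(entry):
    """Join head terms with AND; join a head term and its synonyms with OR."""
    clauses = []
    for head, synonyms in ENTRY.findall(entry):
        if synonyms.strip():
            alternatives = [unquote(head)]
            alternatives += [unquote(s) for s in synonyms.split(",") if s.strip()]
            clauses.append("( " + " OR ".join(alternatives) + " )")
        else:
            clauses.append(unquote(head))
    return " AND ".join(clauses)

print(mandatory_to_boolean('"ongoing research" osteoporosis'))
# ongoing research AND osteoporosis
print(mandatory_to_boolean("osteoporosis, prevent (prevention, treatment)"))
# osteoporosis AND ( prevent OR prevention OR treatment )
```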
Restrictions
Freestyle allows a limited amount of segment restricted searching via the
restrictions feature. Date restrictions such as DATE AFT 1992 are possible.
Additionally, up to four other segments will be listed depending on the source
selected. For example, for the NEWS library and LAT file, the additional
segments are HLEAD, HEADLINE, BYLINE, and TERMS.
Use of restrictions is similar to the use of mandatory terms. They restrict
the collection size, affect the term weights, and are connected via an AND
connector unless synonym syntax is used. Freestyle does not automatically
recognize phrases in restrictions, and restriction terms will not appear on the
WHY or WHERE screens, while mandatory terms do appear on these screens.
HEADLINE Restriction entry:
osteoporosis, prevent (prevention, treatment)
Resulting boolean limitation:
HEADLINE(osteoporosis AND ( prevent OR prevention OR treatment ))
Thesaurus and Related Concepts
The thesaurus option provides suggested terms for expanding the search
definition. Freestyle has a single on-line thesaurus which contains data from
the Macmillan Legal Thesaurus and Webster's Collegiate Thesaurus. The following
screen is displayed for the example search when the thesaurus is first selected.
From this screen, the searcher selects the numbers of headwords to be
expanded, and may request related concepts for the entire search. For a given
headword, synonyms are grouped by sense as defined in the original thesaurus.
Additionally, term variations may be present. The entry for "disease"
is shown here.
Be careful when selecting synonyms in Freestyle. Testing has shown that
synonyms can often hurt overall search performance rather than help it.
The related concepts feature will provide a set of terms based on all the
terms in the search description. The provided terms are related because they
occur in the same documents as the terms in the search. Testing has shown that
use of related concepts can greatly improve answer quality.
Number of Documents
The current default number of answers for a Freestyle search is 25. In many
cases, this is not enough. The maximum number is 1000. In almost all cases this
is too many. Using 50 or 100 is a reasonable starting point. The SORT feature of
Freestyle is especially useful when most of the answers are relevant, which is
why 1000 is usually too large.
Understanding Freestyle Search Results
Freestyle offers three features for browsing search results that are not
available in boolean. They are the WHERE and WHY screens and the SuperKWIC
display mode. The More-Like-This feature is available for Freestyle as it is for
boolean searches.
The WHERE Screen
This screen identifies which documents contain which search terms in the
answer set. This can be very useful in selecting documents to view. The first
page of the WHERE screen for the example topic is shown below.
The WHY Screen
The WHY screen displays search term occurrence data, and an indication of the
importance assigned to each term by Freestyle. This screen provides some very
important clues for understanding why a search returned certain documents, and
potentially about what documents are available. The first page of the WHY screen
for the example topic is shown below:
Some observations from this screen are that there are a total of 132 documents
in the TREC collection that contained the term osteoporosis, and the search term
occurring is relatively highly weighted, even though it probably does not
provide much value. By selecting the Next terms option, the following screen is
eventually displayed.
This screen illustrates that the term osteoporosis, which is the primary concept
for the topic, has the lowest possible term importance. This is a side effect of
using the term as a mandatory term. It is partially overcome by entering the
term in the search description multiple times, as was done in the example
search. Additionally, other important terms such as drug and research
have relatively low weights as compared to existing and occurring.
The terms existing and occurring should be omitted from any future
search on this topic.
The WHY screen may also be used to obtain document counts. For example, in
the NEWS library, LAT file, the search O.J. Simpson will produce a WHY
screen indicating there are 6311 documents that contain O.J. and 13247
documents that contain Simpson.
SuperKWIC Display Mode
The SuperKWIC display mode will display a single KWIC window from each
document. A KWIC window may be very short or very long, depending on the density
of search terms within the document. The single window is selected based on a
formula that rewards diversity of the terms within the window far more than
the number of terms within the window.
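One way to picture such a formula is a sliding window scored mainly by the number of distinct search terms it contains. The sketch below is an assumption throughout: the fixed window size, the diversity weight of 10, and the function name are illustrative, whereas the actual SuperKWIC window varies in length and its formula is not published here.

```python
def best_window(tokens, search_terms, size=8, diversity_weight=10):
    """Return the window whose score favors distinct search terms over repeats."""
    terms = {t.lower() for t in search_terms}
    best_start, best_score = 0, -1
    for start in range(max(1, len(tokens) - size + 1)):
        window = [t.lower() for t in tokens[start:start + size]]
        hits = [t for t in window if t in terms]
        # Distinct terms count ten times as much as raw occurrences.
        score = diversity_weight * len(set(hits)) + len(hits)
        if score > best_score:
            best_start, best_score = start, score
    return tokens[best_start:best_start + size]

text = ("osteoporosis osteoporosis osteoporosis is common but "
        "research may prevent osteoporosis later").split()
window = best_window(text, ["osteoporosis", "research", "prevent"], size=4)
print(window)  # ['research', 'may', 'prevent', 'osteoporosis']
```

The opening of the text contains the same term three times, yet the window with three different search terms is chosen.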