Freestyle Searching with LexisNexis

Freestyle is a search method that allows a search to be specified in plain language, without connectors or a specific syntax. It returns answers in a relevance-ranked list, based on the statistical similarity of each document to the search. Freestyle allows users who are not trained in boolean search to obtain reasonable search results quickly and easily. Freestyle can also be a powerful search tool for the advanced boolean searcher.

Table of Contents

  1. Information Retrieval Theory
    1. Statistical Retrieval
    2. Term Vectors
  2. An Example Topic
  3. When To Use Freestyle
    1. When Boolean is Better
    2. Concept Searching
    3. Ensuring Maximum Coverage
  4. Statistical Tools for Boolean Searches
    1. Good Search Terms
    2. Relevance Ranking the Results
    3. Converting Failed Searches
    4. Zero Answer Searches
    5. More Than 1000 Answers
    6. More-Like-This
  5. Using Freestyle
    1. Selecting the Source
    2. Enter the Freestyle Search
      1. Phrase Recognition
      2. Noise Words
      3. Mandatory Terms
      4. Restrictions
      5. Thesaurus and Related Concepts
      6. Number of Documents
    3. Understanding Freestyle Search Results
      1. The WHERE Screen
      2. The WHY Screen
      3. SuperKWIC Display Mode

Information Retrieval Theory

This section describes some of the basic theory behind statistical retrieval techniques. Because Freestyle utilizes a statistical retrieval algorithm, an understanding of the underlying theory will allow the advanced user to better understand Freestyle and to achieve better results. 

A primary source of data on information retrieval methods and effectiveness is the Text Retrieval Conference (TREC) sponsored by the National Institute of Standards and Technology (NIST). The conference provides a reasonably sized collection of documents and a set of topics, and judges the relevance of documents to the topics. It attracts many of the universities and corporations active in the area of text retrieval, and provides actual data on the effectiveness of many different algorithms and theories.

TREC results have consistently shown that for text retrieval, statistical methods provide results that are superior to natural language methods. Boolean methods are very difficult to compare to automatic methods due to the human involvement of formulating the boolean search, and the lack of a ranked results list. However, the limited tests done have shown statistical retrieval to perform as well as or better than boolean searches constructed by expert searchers.

Statistical Retrieval

Statistical retrieval ranks the documents in the target collection based on their similarity to the search description. The ranking utilizes the words and phrases within the documents and the search. In its simplest form, only individual words are considered. The ranking formula applies a term weight to each search term, and then scores each document that contains one or more of the search terms.

The two fundamental components of term weight are the term frequency and the inverse document frequency. The term frequency is the number of times a term occurs within a document. The more often the term occurs, the more likely the document contains relevant material related to the concept represented by the term. The inverse document frequency is the inverse of the number of documents within the collection that contain at least one occurrence of the term. The higher the number of documents that contain the term, the lower the value of the term in differentiating the documents. For example, if a term occurs in every document, the term is not a good term for ranking the documents.

The simplest scoring formula multiplies the term frequency by the inverse document frequency for each search term in each document, and sums the values for each document.
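
The simple formula can be sketched in a few lines of Python. This is only an illustration of the description above, not the Freestyle ranking formula: the function name, the whitespace tokenization, and the sample documents are assumptions, and the inverse document frequency is taken literally as one divided by the number of documents containing the term (practical systems often use a logarithmic variant).

```python
from collections import Counter

def simple_scores(query_terms, documents):
    """Score each document as the sum over search terms of tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term]:
                # idf taken literally as 1 / (documents containing the term)
                score += term_freq[term] * (1.0 / doc_freq[term])
        scores.append(score)
    return scores

# Made-up three-document collection for illustration.
docs = [
    "osteoporosis research continues",
    "research on diet and research on exercise",
    "diet and exercise",
]
scores = simple_scores(["osteoporosis", "research"], docs)
print(scores)
```

The first document ranks highest because it contains the rarer term, even though the second document mentions "research" twice.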

Unfortunately, the simple formula does not work well in real collections. Longer documents will always score better because they will contain more terms, and a higher frequency for those terms. It is therefore important to normalize the document scores based on the document length. Although much work has been done in this area, current formulas seem to have a bias for either shorter or longer material.
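
The effect of document length can be seen with a small sketch. Dividing the term frequency by the number of words in the document is one assumed normalization chosen for illustration; it is not the formula used by Freestyle, and the function names and sample documents are hypothetical.

```python
def raw_score(term, tokens):
    """Unnormalized score: the raw term frequency."""
    return tokens.count(term)

def length_normalized_score(term, tokens):
    """Term frequency divided by document length in words (an assumed normalization)."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

# A short, focused document and a long document padded with unrelated words.
short_doc = "osteoporosis treatment osteoporosis prevention".split()
long_doc = ("osteoporosis " + "filler " * 20 + "osteoporosis osteoporosis").split()

raw = (raw_score("osteoporosis", short_doc), raw_score("osteoporosis", long_doc))
normalized = (
    length_normalized_score("osteoporosis", short_doc),
    length_normalized_score("osteoporosis", long_doc),
)
print(raw, normalized)
```

Without normalization the long document wins on raw count; with this normalization the short document wins. Plain division by length tends to favor short material, which is one instance of the bias noted above.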

Single terms can be very ambiguous and have different meanings depending upon their context. This problem may be greatly reduced by considering phrases as well as single terms. A phrase often has a more specific meaning than a single term.

Term Vectors

A term vector is a list of terms that are extracted from a document. The terms may be single words or phrases, and may include proper nouns. The terms are selected based on statistics and phrase recognition software. The theory is that a relatively small set of important terms represent the concepts within the document, and that these terms make excellent search terms.

One use of term vectors is for relevance feedback. Relevance feedback is a method of improving a search results set by modifying the original search based on the searcher's judgment of the relevance of one or more of the answers. Of the two basic approaches, search expansion and term re-weighting, search expansion provides the greater improvement. The search expansion method uses the term vector from a relevant answer to supplement the original search terms. This method can provide dramatic improvement in the answer precision.
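
Search expansion can be sketched as follows. The function name, the cap on added terms, and the sample term vector are assumptions made for illustration; the source does not describe how many terms are added or how they are chosen.

```python
def expand_search(search_terms, relevant_term_vector, max_new_terms=5):
    """Append terms from a relevant document's term vector to the search."""
    expanded = list(search_terms)
    seen = {t.lower() for t in search_terms}
    for term in relevant_term_vector:
        if len(expanded) - len(search_terms) >= max_new_terms:
            break  # cap reached (the cap itself is an assumption)
        if term.lower() not in seen:
            expanded.append(term)
            seen.add(term.lower())
    return expanded

original = ["osteoporosis", "research"]
# Hypothetical term vector extracted from a document judged relevant.
term_vector = ["osteoporosis", "bone density", "calcium", "estrogen"]
expanded = expand_search(original, term_vector, max_new_terms=2)
print(expanded)
```

Terms already in the search are skipped, so the expansion only contributes new vocabulary.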

The second term vector feature is a statistical thesaurus. The statistical thesaurus is a very large collection of term vectors, generated in advance from the documents that are part of the data collection. The theory is that terms that occur together in a term vector are related to each other, and represent the same concept. The search is then expanded using the statistical thesaurus. This feature has been shown to be vastly superior to ordinary thesauri for enhancing queries. 
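
A minimal sketch of the co-occurrence idea behind a statistical thesaurus is shown below. It assumes that relatedness is measured simply by how many term vectors two terms share; the actual construction is not described in the source, and the names and sample vectors are hypothetical.

```python
from collections import Counter

def related_terms(term, term_vectors, top_n=3):
    """Rank other terms by how many term vectors they share with `term`."""
    counts = Counter()
    for vector in term_vectors:
        if term in vector:
            counts.update(t for t in set(vector) if t != term)
    return [t for t, _ in counts.most_common(top_n)]

# Made-up term vectors, one per document.
vectors = [
    ["osteoporosis", "calcium", "bone density"],
    ["osteoporosis", "calcium", "estrogen"],
    ["diet", "calcium", "exercise"],
]
related = related_terms("osteoporosis", vectors, top_n=1)
print(related)
```

Unlike a conventional thesaurus, nothing here depends on the terms being similar in meaning; "calcium" is returned only because it appears alongside "osteoporosis" most often.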

An Example Topic

The remainder of this document will use examples to illustrate features of the LexisNexis system. All of the examples are based on a single topic. The following topic was used for the TREC conference. It was randomly selected as an example topic for this paper.

What research is ongoing to reduce the effects of osteoporosis in existing patients as well as prevent the disease occurring in those unafflicted at this time?

Customer service personnel from LexisNexis participated in a boolean experiment for this topic as expert boolean searchers. The search they constructed was:

research! OR stud! OR analy! OR test! OR experiment! OR exam! OR inqu! W/25 osteoporosis

This boolean search was the result of 25 minutes of work and was the eleventh search constructed. This search returned 32 answers from the TREC collection, 20 of which were assessed as relevant. The assessors found a total of 36 relevant documents for the topic within the collection. The precision for the boolean search was 63%, and recall was 56%.

By contrast, the Freestyle search below obtained 19 relevant documents within the top 32, 5 of which were not in the boolean answer set. It took less than 2 minutes to construct and run the search. The Freestyle search had a total of 25 relevant documents in the top 50 answers, with 8 answers which were not in the boolean answer set. There were a total of 32 relevant documents in the top 100 answers, with 14 answers which were not in the boolean answer set. The boolean answer set had 2 documents which were not in the Freestyle answer set, and there were 2 relevant documents in the collection that neither search located.
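
The figures quoted for both searches follow from the standard definitions of precision (relevant answers retrieved divided by answers retrieved) and recall (relevant answers retrieved divided by all relevant documents). The short script below reproduces the arithmetic from the counts given above. The variable names are illustrative, and the final cross-check assumes that the 2 boolean answers missing from the Freestyle answer set were relevant documents.

```python
def precision(relevant_retrieved, retrieved):
    return relevant_retrieved / retrieved

def recall(relevant_retrieved, total_relevant):
    return relevant_retrieved / total_relevant

TOTAL_RELEVANT = 36  # relevant documents found by the assessors

# Boolean search: 32 answers, 20 relevant.
boolean_precision = precision(20, 32)           # 0.625, reported as 63%
boolean_recall = recall(20, TOTAL_RELEVANT)     # about 0.556, reported as 56%

# Freestyle search: relevant documents within each cutoff of the ranked list.
freestyle_relevant = {32: 19, 50: 25, 100: 32}
freestyle = {
    cutoff: (precision(rel, cutoff), recall(rel, TOTAL_RELEVANT))
    for cutoff, rel in freestyle_relevant.items()
}

# Cross-check: 32 relevant in the Freestyle top 100, plus 2 found only by
# boolean, plus the documents neither search located, should total 36.
missed_by_both = TOTAL_RELEVANT - (32 + 2)

print(f"boolean: precision {boolean_precision:.1%}, recall {boolean_recall:.1%}")
for cutoff, (p, r) in freestyle.items():
    print(f"freestyle top {cutoff}: precision {p:.1%}, recall {r:.1%}")
print("missed by both:", missed_by_both)
```

At the same cutoff of 32 answers the two searches are close in precision, but extending the Freestyle list to 100 answers raises recall to 32 of 36.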

When To Use Freestyle

Freestyle is not a replacement for boolean searching. Freestyle supplements boolean searching, allowing the expert searcher to use another tool to locate the desired information in an efficient manner. There are times when it is appropriate to use either boolean or Freestyle, and times when both methods may be used together.

When Boolean is Better

Certain retrieval problems work much better in boolean than they do in Freestyle.

Boolean allows for a more specific definition of a search. When searching public records or other structured material, boolean allows for a more specific definition of names and provides superior results when used by a skilled searcher.

Boolean also allows a skilled searcher to find every mention of a specific name.

Boolean allows the use of universal characters, and many connectors not supported by Freestyle. When these features are needed, boolean must be used.

Concept Searching

Freestyle provides an easier method of searching for general concepts such as: "What are the benefits of pets to the elderly?" A topic such as this can be entered into Freestyle as is, whereas a boolean search may be difficult to construct.

This is especially true when researching an unfamiliar topic. Freestyle can usually bring back a relevant document ranked high in the answer set where it is quickly found, even with a relatively weak search description. By contrast, a weak boolean search would return a very large number of answers where the same document is somewhere in the large set, but not necessarily near the front.

Ensuring Maximum Coverage

Many times it is not sufficient to find some of the material about a topic - all of the material must be found. In a system with as much material as LexisNexis, this is a nearly impossible task. One way to improve overall recall is to use both multiple searches and multiple search methods. Because Freestyle differs from boolean in its retrieval method, it often finds relevant documents that boolean searches do not find. This is especially true when the Freestyle search contains terminology that is not present in the boolean searches. As shown in the example topic, the Freestyle search retrieved many documents missed by the boolean search.

Statistical Tools for Boolean Searches

Statistical tools can help the boolean searcher. Tools are available to help select search terms, browse answers, and find results where a boolean search has failed.

Good Search Terms

The key to a successful search is using the right terminology. Without the right terms, a search will not retrieve the right material, or will retrieve too much non-relevant material. Selecting the right terms can be a difficult task, especially in topic areas unfamiliar to the searcher.

A traditional approach is to start with a few terms, run a search, and read some of the documents returned. Then the search is refined based on the documents found. New terms may be added to broaden the search, or to further restrict a search. This method is very effective, but is very time consuming, and may miss large amounts of relevant material.

The Related Concepts feature allows the user to quickly obtain a list of potential search terms that are related to user-specified terms. The system-provided terms are related to the user-provided terms because they occur within the same documents, and both sets of terms were important terms within those documents. This is vastly different from a standard thesaurus, which relates terms that are similar in meaning.

The Related Concepts feature is accessed by placing a single word or phrase within parentheses after typing REL. For example, if your topic is osteoporosis, your request would be:

REL(OSTEOPOROSIS)

and the resulting display would be:

As you can see, the terms displayed are all very relevant to the topic of osteoporosis. After selecting the desired terms, the search entry screen is redisplayed with the terms listed on the screen, but not within the existing search. The searcher must insert the terms into the search using the appropriate connectors. This differs from the thesaurus feature because the Related Concepts terms will not necessarily be combined with an OR connector. It may be desirable to use a restrictive connector such as AND or WITHIN.

It is possible to request Related Concepts for multiple search terms. Each must be specified separately. For example, if the topic is how diet affects osteoporosis, the request would be:

REL(OSTEOPOROSIS) and REL(DIET)

and the resulting display would be:

When multiple terms are requested from boolean, all of the requested terms must occur in the same documents as the terms returned. This differs from the Freestyle feature where only some of the terms need to be present. If a multiple term request can find no terms, the requests may be done individually to see the concepts for each term.

The Related Concepts feature uses a large statistical thesaurus built from the documents contained within the LexisNexis database. Not every document is represented, and the material is grouped into news and legal. Therefore, you will see different related terms for the same search term in legal material and in news material.

Relevance Ranking the Results

A boolean search returns a list of result documents ordered by some attribute of the document. In news material, the documents generally appear in reverse chronological order - the most recent stories are first. A relevance-ranked results list, the default order of results from a Freestyle search, ranks the documents based on their statistical similarity to the search terms. The theory is that the most relevant documents are first.

If a boolean results set contains fewer than 1000 documents, the documents may be reordered based on their relevance to the boolean search terms by entering ".RANK". There are some instances where ranking is not possible due to a lack of good terms. For example, the search "DATE IS 4/9/1997" is a valid search, but there are no terms useful for ranking the results.

As with Freestyle, the ranking is better when more terms are available. The ranking formula uses the same principles as the Freestyle ranking formula.

After results have been ranked by relevance, the original sort order may be obtained by entering ".RANK" a second time.

Converting Failed Searches

Boolean searches fail when no documents are found, or the system predicts it will find more than 1000 documents. It is possible to complete a search that will return more than 1000 documents, but it is not usually done because it is hard to work with such a large number of answers.

Zero Answer Searches

When a boolean search returns zero answers, the system may offer to convert the search to a Freestyle search. It will only offer the conversion if the system is able to obtain search terms from the original search, and it has a reasonable probability of finding documents. If a single term boolean search fails to find any documents, using the same single term in Freestyle will also fail. The same holds true for multiple terms connected via the OR connector. Additionally, Freestyle does not support universal characters, so any terms containing a universal or super universal character cannot be converted.
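
The rules above can be approximated in a short sketch. Everything here is an assumption made for illustration: the function names, the simplified connector pattern, and the treatment of "!" and "*" as the universal characters. The sketch ignores parentheses and segment searches, and it does not model how the system judges a "reasonable probability" of finding documents.

```python
import re

# Simplified connector pattern (an assumption): AND, OR, NOT, W/n, PRE/n and similar.
CONNECTOR = re.compile(r"^(and|or|not|(w|pre)/\w+)$", re.IGNORECASE)

def is_connector(token):
    return CONNECTOR.match(token) is not None

def freestyle_conversion_terms(boolean_search):
    """Return terms usable in a converted search, or [] if conversion is pointless."""
    tokens = boolean_search.split()
    connectors = [t.lower() for t in tokens if is_connector(t)]
    terms = [t for t in tokens if not is_connector(t)]
    # Terms with universal characters cannot be carried over.
    usable = [t for t in terms if "!" not in t and "*" not in t]
    # A single term, or terms joined only by OR, would fail in Freestyle too.
    only_or = all(c == "or" for c in connectors)
    if not usable or only_or:
        return []
    return usable

print(freestyle_conversion_terms("osteoporosis AND calcium"))
print(freestyle_conversion_terms("osteoporosis OR osteopenia"))
print(freestyle_conversion_terms("research! OR stud! W/25 osteoporosis"))
```

Under these assumptions, only "osteoporosis" survives from the third search, because every other term relies on a universal character.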

More Than 1000 Answers

Searches that would return more than 1000 answers will be converted to Freestyle, except where no search terms can be found due to universal characters as described in the prior section. Freestyle is especially useful in returning a small number of relevant documents when a very large number of potentially relevant documents are available.

More-Like-This

The More-Like-This feature constructs a Freestyle search from a document. The feature is activated by entering .MORE when the desired document is displayed. The system will generate a term vector from the document, combine this with the original search terms, and construct a Freestyle search description. The Freestyle SEARCH OPTIONS screen will be displayed with the search description present. The search may then be edited, mandatory terms added, restrictions added, etc. The search may then be run, and the resulting answers browsed. To return to the original search, enter .EM for exit more.

Not all documents are candidates for More-Like-This. The document must be relevant. The document should focus on a single topic. Testing has shown that More-Like-This provides excellent results when the seed document was very good. However, the results are poor when the seed document was not relevant, or contained other non-relevant topics.

Additionally, the system may not be able to produce a suitable term vector from some documents, and a message will be displayed, indicating that More-Like-This is unavailable from that document.

The More-Like-This feature automatically saves the original search to the log. The user may optionally save the More-Like-This search to the log as well. This is done by entering .KEEP after running the More search. Once the More search has been kept to the log, it becomes a regular Freestyle search. When it is retrieved via the .LOG command, the .EM command will no longer be valid. Additionally, .MORE may be performed from this search, even though .MORE cannot be performed from a More-Like-This search.

Using Freestyle

This section provides tips for using Freestyle with all examples based on the example topic.

Selecting the Source

The LexisNexis system contains vast amounts of data that are divided into libraries and files. Large group files exist, such as the NEWS library, ALLNWS file, which contains millions of documents from thousands of sources. Single source files also exist, such as the NEWS library, LAT file, which contains only newspaper stories published in the Los Angeles Times. The ALLNWS file offers all of the data in the LAT file, as well as much more. However, searching the ALLNWS file is more difficult for a number of reasons. A search will return many more answers, which makes finding the relevant material more difficult. Additionally, the cost of the ALLNWS file is far greater than the cost of the LAT file. An additional consideration for Freestyle is that the LAT file contains stories that are relatively uniform in length, and mostly focus on a single topic. Statistical algorithms work best in this type of material. By contrast, the ALLNWS file contains newspaper stories, as well as more in-depth magazine stories, and many other very large and very small documents. This diverse collection is much more difficult to search.

One approach to beginning a research topic is to select a relatively small source that should have a reasonable coverage of the desired topic. Newspapers often make an excellent starting source because of their broad topic coverage and focused stories. The initial searches will work better, and cost less. Once a search has been perfected, change to the larger source, and rerun the search.

Enter the Freestyle Search

A Freestyle search may be entered in plain language, such as the way the example topic from TREC was specified. Alternately, the search may be simply a set of words and phrases. For the example topic, the initial search may have been:

research, osteoporosis, prevent, disease, patient

Each approach has advantages and disadvantages. When a complete sentence is entered, there may be phrases present, words that provide value that the searcher may not think of as a search term, and a natural redundancy of important terms. Each occurrence of the same term in the search is weighted independently. Five occurrences of the same term have the net effect of making that term five times more important than if it had been entered only once. Unfortunately, a plain language search may include many terms which don't help, and may actually hurt.
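
The effect of repeating a term can be illustrated with a small sketch. The base weights below are made-up numbers and the function names are hypothetical; the point is only that a term entered five times contributes five times its single-entry weight.

```python
from collections import Counter

def term_multipliers(search_description):
    """Count how many times each comma-separated term appears in the search."""
    terms = [t.strip().lower() for t in search_description.split(",")]
    return Counter(t for t in terms if t)

def effective_weights(search_description, base_weights):
    """Scale each term's base weight by its number of occurrences in the search."""
    return {
        term: base_weights.get(term, 0.0) * count
        for term, count in term_multipliers(search_description).items()
    }

base = {"osteoporosis": 0.25, "research": 0.5}  # made-up base term weights
weights = effective_weights("osteoporosis, " * 5 + "research", base)
print(weights)
```

With a single entry, "research" would outweigh "osteoporosis"; after five entries the order is reversed.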

For the example topic, the example was entered as a question. Then the term osteoporosis was entered as a mandatory term, and also added to the search description multiple times due to its importance. The search was then expanded using related concepts. If the topic is entered as is, without using a mandatory term or expanding the search description in any way, the search will find only 7 relevant documents within the top 32 positions, whereas the modified search returns 19 relevant documents.

Phrase Recognition

Freestyle scans the entered search, and recognizes phrases based on an internal phrase dictionary. The phrase dictionary contains a large list of phrases built from data in the LexisNexis system. It is not an exhaustive list of phrases, especially proper noun phrases. The searcher may specify a phrase manually by enclosing the words in quotes. A searcher may prevent phrase recognition by separating terms with commas. If Freestyle identifies a phrase in a search, and the searcher does not want the words to be searched as a phrase, the search may be edited, and the quotes removed. Freestyle only attempts to find phrases on the initial search entry screen, and not from the search edit screen.

Noise Words

Freestyle recognizes two different sets of noise words - words which do not help in retrieval. One set is common with boolean search, and is not indexed by LexisNexis. These words include such things as personal pronouns (he, she, they) and forms of the verb to be (be, is, was). Freestyle uses a second, larger set of terms, which do not add value to a statistical search - either because they occur too frequently, or they do not provide much semantic value. This second set of terms may be useful as part of a phrase, and the terms are searchable in a phrase.

Unfortunately, it is difficult to make a list of words that is perfect for every possible topic.

In the example topic, the words "is", "the", "of", "as", "those", and "this" are all part of the first set of noise words and are not searchable in the LexisNexis system. The words "at", "in", and "well" are part of the second set of noise words, and are only searchable in Freestyle as part of a phrase. In this example, the words "occurring" and "ongoing" will probably add very little value, and should not have been search terms. Although Freestyle tries to remove unnecessary terms, the searcher should check which terms were used in the ranking process, via the WHY screen, and modify the search as needed.

Freestyle displays the following screen after a search completes. As noted on the screen, the search terms are displayed in order of importance. The terms shown after the "*" are the second class of noise terms discussed above. The first class of noise terms are not displayed on the screen at all.

Mandatory Terms

Mandatory terms restrict the statistical algorithm to only those documents which contain the specified mandatory term or terms. Normally, the statistical algorithm ranks documents based on overall search term occurrence within the document, and the absence of any one search term may not be enough to prevent the document from ranking high in the list based on the other search terms.

Logically, Freestyle constructs a boolean search with the mandatory terms connected via an AND connector, and uses this to restrict the collection for the statistical algorithm. This means no documents will be considered which do not contain the mandatory terms. A term which is specified as a mandatory term will not be used in the statistical scoring of documents unless it is also entered as a search description term. A side effect of making a term mandatory is that it will occur in every document considered by the statistical algorithm, and will therefore be assigned a very low term weight. This may be overcome by repeating the term multiple times in the search description.
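
The two-step behavior described above, a boolean AND restriction followed by statistical ranking of the remaining documents, can be sketched as follows. The weighting reuses the literal inverse document frequency from earlier as an assumption; the function name and sample documents are hypothetical, and only single-word terms are handled.

```python
def rank_with_mandatory(documents, search_terms, mandatory_terms):
    """Restrict to documents containing every mandatory term, then rank them."""
    tokenized = [(doc, doc.lower().split()) for doc in documents]
    # Step 1: boolean AND restriction on the collection.
    candidates = [
        (doc, toks) for doc, toks in tokenized
        if all(m in toks for m in mandatory_terms)
    ]
    # Step 2: term weights computed over the restricted collection only.
    weights = {}
    for term in search_terms:
        doc_freq = sum(1 for _, toks in candidates if term in toks)
        weights[term] = 1.0 / doc_freq if doc_freq else 0.0

    def score(toks):
        return sum(toks.count(t) * weights[t] for t in search_terms)

    ranked = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
    return [doc for doc, _ in ranked], weights

docs = [
    "osteoporosis research on calcium",
    "osteoporosis drug trial",
    "calcium research for athletes",
]
ranked, weights = rank_with_mandatory(
    docs, ["osteoporosis", "research", "calcium"], ["osteoporosis"]
)
print(ranked)
print(weights)
```

Because every remaining document contains the mandatory term, that term ends up with the lowest weight of all the search terms, which is the side effect described above.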

Mandatory terms must be used carefully. An advantage of Freestyle is that it finds documents that boolean searches miss. The reason is that boolean searches often rely on the presence of a specific term, and its absence does not mean the document is not relevant.

Terms are entered as mandatory terms individually - do not use sentences. Phrases should be double quoted, and Freestyle will automatically recognize phrases when mandatory terms are initially entered. When mandatory terms are edited, the automatic phrase recognition is not performed. Multiple terms that are entered and are not in a phrase will be connected via an AND connector.

Mandatory terms entry:

"ongoing research" osteoporosis

Resulting boolean limitation applied to the collection:

ongoing research AND osteoporosis

Freestyle's synonym syntax may be used to connect terms with an OR connector. Synonyms are contained within parentheses immediately following the head word. For example:

Mandatory terms entry:

osteoporosis, prevent (prevention, treatment)

Resulting boolean limitation:

osteoporosis AND ( prevent OR prevention OR treatment )

Freestyle will NOT automatically recognize phrases within the parentheses; however, the searcher may specify a phrase using double quotation marks.
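
The translation from a mandatory terms entry to a boolean limitation can be sketched with a small parser. This is an illustrative reading of the two examples above, not the actual Freestyle parser: the function name and the regular expression are assumptions, and automatic phrase recognition is not modeled, so phrases must be quoted.

```python
import re

# A quoted phrase, a parenthesized synonym list, or a bare word.
TOKEN = re.compile(r'"([^"]+)"|\(([^)]+)\)|([^\s,()"]+)')

def mandatory_to_boolean(entry):
    """Convert a mandatory terms entry into the equivalent boolean limitation."""
    groups = []  # each group is a head term followed by its synonyms
    for phrase, synonyms, word in TOKEN.findall(entry):
        if synonyms:
            if not groups:
                raise ValueError("a synonym list must follow a head word")
            # Synonyms are comma separated; quotes mark phrases inside the list.
            groups[-1].extend(
                s.strip().strip('"') for s in synonyms.split(",") if s.strip()
            )
        else:
            groups.append([phrase or word])
    parts = [
        g[0] if len(g) == 1 else "( " + " OR ".join(g) + " )"
        for g in groups
    ]
    return " AND ".join(parts)

print(mandatory_to_boolean('"ongoing research" osteoporosis'))
print(mandatory_to_boolean("osteoporosis, prevent (prevention, treatment)"))
```

Both calls reproduce the boolean limitations shown in the examples above.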

Restrictions

Freestyle allows a limited amount of segment restricted searching via the restrictions feature. Date restrictions such as DATE AFT 1992 are possible. Additionally, up to four other segments will be listed depending on the source selected. For example, for the NEWS library and LAT file, the additional segments are HLEAD, HEADLINE, BYLINE, and TERMS.

Use of restrictions is similar to the use of mandatory terms. They restrict the collection size, affect the term weights, and are connected via an AND connector unless synonym syntax is used. Freestyle does not automatically recognize phrases in restrictions, and restriction terms will not appear on the WHY or WHERE screens, while mandatory terms do appear on these screens.

HEADLINE Restriction entry:

osteoporosis, prevent (prevention, treatment)

Resulting boolean limitation:

HEADLINE(osteoporosis AND ( prevent OR prevention OR treatment ))

Thesaurus and Related Concepts

The thesaurus option provides suggested terms for expanding the search definition. Freestyle has a single on-line thesaurus which contains data from the Macmillan Legal Thesaurus and Webster's Collegiate Thesaurus. The following screen is displayed for the example search when the thesaurus is first selected.

From this screen, the searcher selects the numbers of headwords to be expanded, and may request related concepts for the entire search. For a given headword, synonyms are grouped by sense as defined in the original thesaurus. Additionally, term variations may be present. The entry for "disease" is shown here.

Be careful when selecting synonyms in Freestyle. Testing has shown that synonyms can often hurt overall search performance rather than help it.

The related concepts feature will provide a set of terms based on all the terms in the search description. The provided terms are related because they occur in the same documents as the terms in the search. Testing has shown that use of related concepts can greatly improve answer quality.

Number of Documents

The current default number of answers for a Freestyle search is 25. In many cases, this is not enough. The maximum number is 1000. In almost all cases this is too many. Using 50 or 100 is a reasonable starting point. The SORT feature of Freestyle is especially useful when most of the answers are relevant, which is why 1000 is usually too large.

Understanding Freestyle Search Results

Freestyle offers three features for browsing search results that are not available in boolean. They are the WHERE and WHY screens and the SuperKWIC display mode. The More-Like-This feature is available for Freestyle as it is for boolean searches.

The WHERE Screen

This screen identifies which documents contain which search terms in the answer set. This can be very useful in selecting documents to view. The first page of the WHERE screen for the example topic is shown below.

The WHY Screen

The WHY screen displays search term occurrence data, and an indication of the importance assigned to each term by Freestyle. This screen provides some very important clues for understanding why a search returned certain documents, and potentially about what documents are available. The first page of the WHY screen for the example topic is shown below:

Some observations from this screen are that there are a total of 132 documents in the TREC collection that contained the term osteoporosis, and the search term occurring is relatively highly weighted, even though it probably does not provide much value. By selecting the Next terms option, the following screen is eventually displayed.

This screen illustrates that the term osteoporosis, which is the primary concept for the topic, has the lowest possible term importance. This is a side effect of using the term as a mandatory term. It is partially overcome by entering the term in the search description multiple times, as was done in the example search. Additionally, other important terms such as drug and research have relatively low weights as compared to existing and occurring. The terms existing and occurring should be omitted from any future search on this topic.

The WHY screen may also be used to obtain document counts. For example, in the NEWS library, LAT file, the search O.J. Simpson will produce a WHY screen indicating there are 6311 documents that contain O.J. and 13247 documents that contain Simpson.

SuperKWIC Display Mode

The SuperKWIC display mode will display a single KWIC window from each document. A KWIC window may be very short or very long, depending on the density of search terms within the document. The single window is selected based on a formula that rewards the diversity of the terms within the window far more than the number of terms within the window.
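
One way to favor diversity over raw count is to compare windows first by the number of distinct search terms and only then by the total number of occurrences. The sketch below uses that ordering with a fixed window size; both choices are assumptions made for illustration, since the actual formula is not given and real KWIC windows vary in length.

```python
def best_window(tokens, search_terms, window_size=8):
    """Return (start, end) of the window favoring distinct search terms."""
    terms = {t.lower() for t in search_terms}
    best = (0, min(window_size, len(tokens)))
    best_key = (-1, -1)
    for start in range(max(1, len(tokens) - window_size + 1)):
        window = [t.lower() for t in tokens[start:start + window_size]]
        hits = [t for t in window if t in terms]
        # Diversity first; the raw count only breaks ties.
        key = (len(set(hits)), len(hits))
        if key > best_key:
            best, best_key = (start, start + len(window)), key
    return best

text = ("osteoporosis osteoporosis osteoporosis is common but "
        "research on osteoporosis prevention continues").split()
start, end = best_window(text, ["osteoporosis", "research", "prevention"], window_size=4)
print(" ".join(text[start:end]))
```

The opening words contain three occurrences of a single term, but the selected window is the later one that contains three different search terms.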

Copyright © 2009 LexisNexis, a division of Reed Elsevier Inc. All rights reserved.