com.raritantechnologies.searchApp.formatters
Class KeywordExtractor

java.lang.Object
  extended by com.raritantechnologies.searchApp.formatters.KeywordExtractor
All Implemented Interfaces:
IConfigurable, IFieldFormatter

public class KeywordExtractor
extends java.lang.Object
implements IFieldFormatter

Extracts the words or phrases in one or more fields that are contained within a set of "match" values, and adds the extracted terms to an IResult output field.

Can also be used to extract the set of keywords matched by one or more IDocumentMatchers from a prior DocumentClassifier formatting operation.

For example, can create a "keywords" field by extracting keywords from other fields in the result.

The list of valid phrases can be statically defined in the configuration file, obtained from an ITermExtractor (an entity extractor), read from a data file, or looked up from a SearchSource.

An optional IStringFilter can be added to modify each extracted term before it is inserted into the keyword result field.
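
The matching behaviour described above can be pictured with a small standalone sketch. It does not use the Raritan classes: `KeywordSketch`, `extract`, the `Map` standing in for an IResult, and the `UnaryOperator<String>` standing in for an IStringFilter are all hypothetical, and the case-insensitive substring test is an assumption about how matching might work, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

public class KeywordSketch {

    /**
     * Collects every match phrase that occurs in any of the given input fields.
     * Matching is a case-insensitive substring test (an assumption).
     * The filter plays the role of the optional string filter: it is applied
     * to each extracted term before the term is added to the output list.
     */
    static List<String> extract(Map<String, String> result,
                                List<String> inputFields,
                                Set<String> matchPhrases,
                                UnaryOperator<String> filter) {
        Set<String> found = new LinkedHashSet<>();   // keeps first-seen order, drops duplicates
        for (String fieldName : inputFields) {
            String value = result.get(fieldName);
            if (value == null) {
                continue;                            // field not present in this result
            }
            String haystack = value.toLowerCase(Locale.ROOT);
            for (String phrase : matchPhrases) {
                if (haystack.contains(phrase.toLowerCase(Locale.ROOT))) {
                    found.add(filter.apply(phrase));
                }
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        Map<String, String> result = Map.of(
            "title", "Search engines and entity extraction",
            "body", "A thesaurus can drive keyword tagging.");
        Set<String> phrases =
            new LinkedHashSet<>(List.of("entity extraction", "thesaurus", "ontology"));
        List<String> keywords = extract(result, List.of("title", "body"), phrases,
                                        s -> s.toUpperCase(Locale.ROOT));
        System.out.println(keywords);   // prints [ENTITY EXTRACTION, THESAURUS]
    }
}
```

A plain substring test also matches inside longer words, so a real matcher would likely tokenize the field first; the sketch keeps the simpler test for brevity.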

XML Configuration Template:
   <Formatter 
      formatterClass="com.raritantechnologies.searchApp.formatters.KeywordExtractor"
      outputField="[ name of new keyword field in result ]"
      matchCase="[ UPPER | LOWER ]"
      matcherNameField="[ name of field with IDocumentMatcher names (added by previous DocumentClassifier) ]"
      documentClassifier="[ name of Document Classifier ]" 
      tokenizerString="[ optional string to use for tokenization ]"
      numDuplicates="[ optional number to boost the number of keyword duplicates ]" >

     <!-- result fields to search for keywords -->
     <InputFields>
       <Field ID="[result field name]" />
       <Field ID="[another field name]" />
       <!-- etc. -->
     </InputFields>

     <!-- can use a Term Extractor to get keywords -->
     <TermExtractor class="[ class of com.raritantechnologies.utils.tagging.ITermExtractor ]" >
       <!-- configuration parameters for TermExtractor -->

       <!-- If the Term Extractor has the ability to extract different entity types: set the mapping -->
       <!-- between extracted entity type and output result field -->
       <EntityTypeFieldMap startsWith="true">
         <!-- One or more Field tags: -->
         <Field ID="[ result field ID ]" entityType="[ type of entity ]"  />
       </EntityTypeFieldMap>

     </TermExtractor>

     <!-- Alternatively, can use Keyword or thesaurus files which should have one keyword/phrase per line. -->
     <KeywordFile fileName="[ name of keyword file ]" charSet="[ optional char set to use ]" >
        <StringFilter class="[ class of com.raritantechnologies.utils.filter.IStringFilter ]" >
           <!-- configuration parameters for String Filter -->
        </StringFilter>
     </KeywordFile>

     <!-- Can list more than one keyword file -->
     <KeywordFile fileName="[ name of second keyword file ]" >
        <StringFilter class="[ class of com.raritantechnologies.utils.filter.IStringFilter ]" >
           <!-- configuration parameters for String Filter -->
        </StringFilter>
     </KeywordFile>

     <!-- Can also add one or more RTI search sources that have keyword sets -->
     <SearchSource sourceName="[ name of search source ]" >
       <!-- query parameters to get keyword results -->
       <QueryParam param="[query param name]"    value="[query param value]" />
       <QueryParam param="[another query param]" value="[another value]" />

       <!-- Lookups can also be DYNAMIC: query value derived from another result field -->
       <QueryParam param="[ name of source query param ]" 
                      queryField="[ name of result field to get value for query ]" />
       <!-- etc... -->

       <!-- output fields to extract keywords from -->
       <OutputField ID="[result field name]" />
       <OutputField ID="[another result field name]" />
       <!-- etc... -->

       <!-- filter to apply to result fields -->
       <StringFilter class="[class of com.raritantechnologies.utils.filter.IStringFilter ]" >
           <!-- Configuration parameters for String Filter -->
       </StringFilter>

     </SearchSource>

   </Formatter>
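
The template says that keyword or thesaurus files hold one keyword or phrase per line and may name a character set. The sketch below shows one way such a file could be read. `KeywordFileSketch`, `load`, and `loadFile` are hypothetical names, and skipping blank lines and trimming whitespace are assumptions rather than documented behaviour.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

public class KeywordFileSketch {

    /** Reads one keyword or phrase per line, trimming whitespace and skipping blank lines. */
    static Set<String> load(Reader source) throws IOException {
        Set<String> phrases = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String phrase = line.trim();
                if (!phrase.isEmpty()) {
                    phrases.add(phrase);
                }
            }
        }
        return phrases;
    }

    /** Opens the file with the given character set, mirroring the charSet attribute. */
    static Set<String> loadFile(Path file, Charset charSet) throws IOException {
        return load(Files.newBufferedReader(file, charSet));
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a keyword file so the example needs no file on disk.
        Set<String> phrases = load(new StringReader("entity extraction\n\nthesaurus\n"));
        System.out.println(phrases);   // prints [entity extraction, thesaurus]
    }
}
```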
 

Developed by Raritan Technologies.

Author:
Ted Sullivan

Field Summary
 
Fields inherited from interface com.raritantechnologies.searchApp.IFieldFormatter
TEMPLATE
 
Constructor Summary
KeywordExtractor()
           
 
Method Summary
 java.lang.String formatField(java.lang.String fieldVal)
          Reformats a field value.
 java.lang.String formatField(java.lang.String sessionID, java.lang.String fieldVal)
          Reformats a field value.
 void formatResultField(IResult res)
          Formats a result field "in place".
 void formatResultField(java.lang.String sessionID, IResult res)
          Formats a result field "in place", incorporating session context.
 java.lang.String getConfigurationXML()
           
 java.lang.String getConfigurationXML(java.lang.String configurationTemplate)
           
 java.lang.String getFieldName()
          Returns the name of the result field that this formatter can reformat.
 void initialize(org.w3c.dom.Element elem)
          Initializes the formatter from a configuration XML element.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KeywordExtractor

public KeywordExtractor()
Method Detail

formatResultField

public void formatResultField(IResult res)
Description copied from interface: IFieldFormatter
Formats a result field "in place".

Specified by:
formatResultField in interface IFieldFormatter
Parameters:
res - The result object that is to be formatted.

formatResultField

public void formatResultField(java.lang.String sessionID,
                              IResult res)
Description copied from interface: IFieldFormatter
Formats a result field "in place", incorporating session context.

Specified by:
formatResultField in interface IFieldFormatter
Parameters:
sessionID - The session key needed to look up any session content stored in the session data cache.
res - The result object that is to be formatted.

getFieldName

public java.lang.String getFieldName()
Description copied from interface: IFieldFormatter
Returns the name of the result field that this formatter can reformat.

Specified by:
getFieldName in interface IFieldFormatter

formatField

public java.lang.String formatField(java.lang.String fieldVal)
Description copied from interface: IFieldFormatter
Reformats a field value.

Specified by:
formatField in interface IFieldFormatter
Parameters:
fieldVal - The field value to be reformatted.
Returns:
The reformatted field value.

formatField

public java.lang.String formatField(java.lang.String sessionID,
                                    java.lang.String fieldVal)
Description copied from interface: IFieldFormatter
Reformats a field value.

Specified by:
formatField in interface IFieldFormatter
Parameters:
sessionID - The session key needed to look up any session content stored in the session data cache.
fieldVal - The field value to be reformatted.
Returns:
The reformatted field value.

initialize

public void initialize(org.w3c.dom.Element elem)
Initializes the formatter from a configuration XML element.

Specified by:
initialize in interface IFieldFormatter
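
Because initialize takes an org.w3c.dom.Element, a caller needs to parse the Formatter configuration first. The sketch below uses only the standard JAXP parser to obtain that element and read a few attributes from it. `FormatterConfigSketch` and `parseRoot` are hypothetical names, and the attribute values ("keywords", "title", "body") are sample data. The call into KeywordExtractor is left as a comment so the example runs without the library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FormatterConfigSketch {

    /** Parses an XML string and returns its root element. */
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                      .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                      .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<Formatter formatterClass=\"com.raritantechnologies.searchApp.formatters.KeywordExtractor\""
          + " outputField=\"keywords\">"
          + "<InputFields><Field ID=\"title\"/><Field ID=\"body\"/></InputFields>"
          + "</Formatter>";

        Element elem = parseRoot(xml);

        // With the library available, the element would be handed to the formatter:
        //   KeywordExtractor extractor = new KeywordExtractor();
        //   extractor.initialize(elem);

        System.out.println(elem.getAttribute("outputField"));   // prints keywords
        NodeList fields = elem.getElementsByTagName("Field");
        for (int i = 0; i < fields.getLength(); i++) {
            // prints title, then body
            System.out.println(((Element) fields.item(i)).getAttribute("ID"));
        }
    }
}
```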

getConfigurationXML

public java.lang.String getConfigurationXML()
Specified by:
getConfigurationXML in interface IFieldFormatter

getConfigurationXML

public java.lang.String getConfigurationXML(java.lang.String configurationTemplate)
Specified by:
getConfigurationXML in interface IFieldFormatter