This example shows the use of a ResultParserFormatter and StringPatternResultParser. These will be used to create a Shakespeare glossary search source using data files obtained from the Absolute Shakespeare web site.
Since the language used by Shakespeare differs in many ways from modern English, readers often need to look up archaic words or phrases in order to understand the dialog. This is a common problem in any subject area that has its own unique vocabulary or jargon. The glossary from Absolute Shakespeare explains many of the 16th-century words and phrases used by the Bard. It lists each term and its definition in the following format:
BODKIN, sub. a dagger
FUSTILARIAN, sub. a term of reproach
LOVES, OF ALL, for the sake of everything lovely, an adjuration
We want to convert this format into a term lookup search source in which any comma-separated terms are reformatted in phrase order (for example, "LOVES, OF ALL" becomes "OF ALL LOVES"). Our technique takes advantage of the "semi-structured" nature of the data: the term comes first on each line and is in ALL CAPITALS. A term may include one or more commas, and the definition may also include commas, so we cannot use the comma as a delimiter. Instead, we will use regular expressions to parse this structure:
Input RegEx pattern for phrases: ([A-Z\-\']*), ([A-Z\s\-\']*), ([a-z]*)(.*)
Output pattern for term: $2 $1
Output pattern for definition: $3$4
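The effect of the phrase pattern can be tried outside the framework. The following is a minimal sketch that uses plain java.util.regex, on the assumption that the filter's pattern syntax and `$n` group references behave the same way and that the input pattern must match the whole line. The class name PhrasePatternDemo and the method rewrite are hypothetical and are not part of the RTI API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhrasePatternDemo {
    // Input pattern for comma-inverted phrases, copied from the text above.
    static final Pattern PHRASE =
        Pattern.compile("([A-Z\\-\\']*), ([A-Z\\s\\-\\']*), ([a-z]*)(.*)");

    /**
     * Rewrites the line with the given output pattern if the whole line
     * matches the phrase pattern; returns null otherwise.
     */
    static String rewrite(String line, String outPattern) {
        Matcher m = PHRASE.matcher(line);
        return m.matches() ? m.replaceFirst(outPattern) : null;
    }

    public static void main(String[] args) {
        String line = "LOVES, OF ALL, for the sake of everything lovely, an adjuration";
        System.out.println(rewrite(line, "$2 $1")); // term, in phrase order
        System.out.println(rewrite(line, "$3$4"));  // definition
    }
}
```

A single-word entry such as "BODKIN, sub. a dagger" does not match this pattern, because the pattern requires two capitalized groups separated by a comma. Entries of that kind need a second, simpler pattern.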
The load operation uses a FlatFileMemorySourceFactory, which loads a set of files into an InMemorySearchSource. Once the data is loaded into an RTI SearchSource, we can use it in various ways:
<!-- =========================================================================================== -->
<!-- Builds an in-memory glossary of terms used by Shakespeare. Used by tagging string filter -->
<!-- to provide a 'click-on' definition feature for annotating Shakespeare passages. The data -->
<!-- is from the Absolute Shakespeare web site (http://absoluteshakespeare.com) -->
<!-- =========================================================================================== -->
<SourceType name="ShakespeareGlossary" type="InMemorySearchSource"
displayName="Shakespeare Glossary"
sourceFactoryClass="com.raritantechnologies.searchApp.FlatFileMemorySourceFactory"
queryProcessor="com.raritantechnologies.searchApp.InMemorySearchSource"
sourceName="ShakespeareGlossary"
delimiter="|"
blankQueryReturnsAll="false" >
<Columns>
<Column ID="GlossaryTerm" />
</Columns>
<Files>
<File name="BASE_PATH/data/ShakespeareGlossary/Glossary_A-F.txt" />
<File name="BASE_PATH/data/ShakespeareGlossary/Glossary_G-K.txt" />
<File name="BASE_PATH/data/ShakespeareGlossary/Glossary_L-P.txt" />
<File name="BASE_PATH/data/ShakespeareGlossary/Glossary_Q-T.txt" />
<File name="BASE_PATH/data/ShakespeareGlossary/Glossary_U-Z.txt" />
</Files>
<!-- =================================================================== -->
<!-- The Field Formatter will be used to format the raw results obtained -->
<!-- from the flat files before inserting the metadata into the search -->
<!-- source. We use a ResultParserFormatter and regular expressions -->
<!-- to break up each line into a 'Term' and 'Definition' field. -->
<!-- (Normally, the flat file would contain a delimiter that we could -->
<!-- use to parse the line, but this data does not have a reliable -->
<!-- delimiter, so we need to improvise.) -->
<!-- =================================================================== -->
<FieldFormatter class="com.raritantechnologies.searchApp.formatters.ResultParserFormatter" >
<ResultParser class="com.raritantechnologies.searchApp.formatters.StringPatternResultParser"
fieldID="GlossaryTerm"
initialState="AT_LINE" >
<!-- String filters that can transform a text line into metadata fields -->
<FilteredField state="AT_LINE"
fieldID="Term" >
<!-- Find patterns of "ALL-CAPS, lowercase ..." or "ALL, CAPS, lowercase ..." -->
<!-- with the ALL CAPS section as the Term field. -->
<StringFilter class="com.raritantechnologies.utils.filter.RegExprStringFilter" >
<Pattern inPattern="([A-Z\-\']*), ([A-Z\s\-\']*), ([a-z]*)(.*)"
outPattern="$2 $1" />
<Pattern inPattern="([A-Z\s\-\']*), ([a-z]*)(.*)"
outPattern="$1" />
</StringFilter>
</FilteredField>
<FilteredField state="AT_LINE"
fieldID="Definition" >
<!-- Find patterns of "ALL-CAPS, lowercase ..." or "ALL, CAPS, lowercase ..." -->
<!-- with the lowercase ... section as the Definition field. -->
<StringFilter class="com.raritantechnologies.utils.filter.RegExprStringFilter" >
<Pattern inPattern="([A-Z\-\']*), ([A-Z\s\-\']*), ([a-z])(.*)"
outPattern="$3$4" />
<Pattern inPattern="([A-Z\s\-\']*,) ([a-z])(.*)"
outPattern="$2$3" />
</StringFilter>
</FilteredField>
</ResultParser>
</FieldFormatter>
</SourceType>
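To check what the two FilteredField elements above should produce for the sample glossary lines, the pattern pairs can be run through plain java.util.regex. This is only a sketch under stated assumptions: that RegExprStringFilter tries its Pattern elements in the order listed, that the first inPattern matching the whole line wins, and that a line matching no pattern is passed through unchanged. The class GlossaryLineSketch and its apply method are hypothetical stand-ins, not the framework's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GlossaryLineSketch {
    // {inPattern, outPattern} pairs copied from the Term filter above.
    static final String[][] TERM = {
        {"([A-Z\\-\\']*), ([A-Z\\s\\-\\']*), ([a-z]*)(.*)", "$2 $1"},
        {"([A-Z\\s\\-\\']*), ([a-z]*)(.*)", "$1"},
    };
    // {inPattern, outPattern} pairs copied from the Definition filter above.
    static final String[][] DEFINITION = {
        {"([A-Z\\-\\']*), ([A-Z\\s\\-\\']*), ([a-z])(.*)", "$3$4"},
        {"([A-Z\\s\\-\\']*,) ([a-z])(.*)", "$2$3"},
    };

    /** Applies the first rule whose inPattern matches the whole line (assumed semantics). */
    static String apply(String[][] rules, String line) {
        for (String[] rule : rules) {
            Matcher m = Pattern.compile(rule[0]).matcher(line);
            if (m.matches()) {
                return m.replaceFirst(rule[1]);
            }
        }
        return line; // assumption: unmatched lines pass through unchanged
    }

    public static void main(String[] args) {
        String[] lines = {
            "BODKIN, sub. a dagger",
            "FUSTILARIAN, sub. a term of reproach",
            "LOVES, OF ALL, for the sake of everything lovely, an adjuration",
        };
        for (String line : lines) {
            System.out.println(apply(TERM, line) + " | " + apply(DEFINITION, line));
        }
    }
}
```

The order of the rules matters here: the two-group phrase pattern is tried first, so that "LOVES, OF ALL" is reordered before the simpler single-term pattern gets a chance to match.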