A class that defines a regular expression over the tokens appearing in a {@link LayeredSequence} object.
For example, suppose we want to find parts of sentences that match the pattern "DT cow", where "DT" is the part-of-speech tag representing a determiner. Assume that sentences are represented as {@link LayeredSequence} objects, where the words layer has the name "word" and the part-of-speech layer has the name "pos". Then the above pattern can be constructed by calling {@code new LayeredTokenPattern("DT_pos cow_word")}. Given a test sentence {@code sent}, the {@link #matcher(LayeredSequence)} method will return a {@link LayeredTokenMatcher} object that will allow you to access the ranges and groups.
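The following is a minimal sketch of what matching "DT_pos cow_word" means, written with plain collections rather than this library's classes. The class {@code LayeredMatchSketch} and its {@code findAll} method are hypothetical names introduced for illustration; the sketch handles only literal tokens (no regular expression operators) and assumes all layers have the same length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the library.
class LayeredMatchSketch {
    /**
     * Returns every start index at which the space-separated tokens match
     * consecutive positions. Each token has the form value_layer; only the
     * named layer is compared, so the other layers may hold anything.
     */
    static List<Integer> findAll(Map<String, List<String>> layers, String pattern) {
        String[] tokens = pattern.split(" ");
        // Assumption: every layer has the same number of positions.
        int length = layers.values().iterator().next().size();
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start + tokens.length <= length; start++) {
            boolean ok = true;
            for (int k = 0; k < tokens.length && ok; k++) {
                int sep = tokens[k].lastIndexOf('_');
                String value = tokens[k].substring(0, sep);
                String layer = tokens[k].substring(sep + 1);
                ok = layers.get(layer).get(start + k).equals(value);
            }
            if (ok) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sent = new HashMap<>();
        sent.put("word", Arrays.asList("I", "saw", "the", "cow", "and", "a", "cow"));
        sent.put("pos", Arrays.asList("PRP", "VBD", "DT", "NN", "CC", "DT", "NN"));
        System.out.println(findAll(sent, "DT_pos cow_word")); // prints [2, 5]
    }
}
```

Both "the cow" and "a cow" are found because the first token constrains only the "pos" layer and the second only the "word" layer.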
The patterns are expressed using the standard {@link java.util.regex.Pattern} language, but with the following changes.
The basic unit of match is not a character, but instead a token. A token consists of two parts: a value and a layer name. A token is expressed using an underscore to separate the two. For example, {@code Foo_bar} will match when the token {@code Foo} appears on the layer with the name {@code bar}. In the example above, the token {@code DT_pos} will match the word-POS pair {@code (w, p)} when {@code p = DT}. The value of {@code w} is allowed to be anything. Currently there is no way to match the value of multiple layers at once (e.g. match all occurrences of "bank" that are nouns).
The value of a token can only have characters from this set: {@code [a-zA-Z0-9\\-.,:;?!"'`]}. The layer name can only have characters from this set: {@code [a-zA-Z0-9\\-]}.
When expressing a pattern, tokens must be space separated.
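The token syntax described above can be illustrated with a small sketch that pulls the {@code value_layer} tokens out of a pattern string using the two character sets as stated. The class {@code TokenSyntaxSketch} and its {@code tokens} method are hypothetical names, and this is not necessarily how the library parses patterns internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the library.
class TokenSyntaxSketch {
    // Group 1 is the value, group 2 is the layer name, using the
    // character sets given in the documentation.
    private static final Pattern TOKEN =
        Pattern.compile("([a-zA-Z0-9\\-.,:;?!\"'`]+)_([a-zA-Z0-9\\-]+)");

    /** Returns each token found in the pattern as a {value, layer} pair. */
    static List<String[]> tokens(String pattern) {
        List<String[]> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(pattern);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String pattern = "^(NNP_pos+) lives_word in_word (NNP_pos+) ._pos$";
        for (String[] t : tokens(pattern)) {
            System.out.println("value=" + t[0] + " layer=" + t[1]);
        }
    }
}
```

Operators such as {@code ^}, {@code (}, {@code +} and {@code $} fall outside both character sets, so they are skipped and only the five tokens are reported.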
In the following examples, {@code pos} refers to a part-of-speech layer, and {@code word} refers to a word layer.
- {@code ^John_word lives_word in_word NNP_pos+} - matches sentences that start with "John lives in", followed by at least one proper noun.
- {@code ^(NNP_pos+) lives_word in_word (NNP_pos+) ._pos$} - matches sentences that start with at least one proper noun, followed by "lives in", followed by at least one proper noun, and end with a period. Captures the two proper nouns as groups (see {@link LayeredTokenMatcher}).
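To make the behavior of the second example concrete, here is a rough emulation using only {@link java.util.regex.Pattern}. It assumes each position is encoded as a {@code word/pos} unit followed by a space, and the token pattern is translated by hand. The class {@code CaptureGroupSketch}, the encoding, and the translation are assumptions made for illustration; they are not the library's API or its internal representation, and the encoding would break on words that contain a slash or a space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the library.
class CaptureGroupSketch {
    /** Joins two parallel layers into "word/pos " units, one per position. */
    static String encode(String[] words, String[] pos) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            sb.append(words[i]).append('/').append(pos[i]).append(' ');
        }
        return sb.toString();
    }

    // Hand translation of: ^(NNP_pos+) lives_word in_word (NNP_pos+) ._pos$
    // X_pos becomes "\S+/X " (any word, tag X) and
    // X_word becomes "X/\S+ " (word X, any tag).
    static final Pattern EMULATED = Pattern.compile(
        "^((?:\\S+/NNP )+)lives/\\S+ in/\\S+ ((?:\\S+/NNP )+)\\S+/\\. $");

    public static void main(String[] args) {
        String[] words = {"John", "Smith", "lives", "in", "New", "York", "."};
        String[] pos = {"NNP", "NNP", "VBZ", "IN", "NNP", "NNP", "."};
        Matcher m = EMULATED.matcher(encode(words, pos));
        if (m.find()) {
            System.out.println(m.group(1).trim()); // prints John/NNP Smith/NNP
            System.out.println(m.group(2).trim()); // prints New/NNP York/NNP
        }
    }
}
```

The two groups correspond to the two runs of proper nouns; with the real classes, the equivalent ranges and groups are read from the {@link LayeredTokenMatcher}.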
@author afader