A SemgrexPattern is a tgrep-type pattern for matching node configurations in one of the SemanticGraph structures. Unlike tgrep but like Unix grep, there is no pre-indexing of the data to be searched. Rather, there is a linear scan through the graph where matches are sought.
SemgrexPattern instances can be matched against instances of the {@link IndexedWord} class.
A node is represented by a set of attributes and their values enclosed in curly braces: {attr1:value1;attr2:value2;...}. Therefore, {} represents any node in the graph. Attributes must be plain strings; values can be strings or regular expressions delimited by "/". Regular expressions must match the whole attribute value, so /NN/ matches "NN" only, while /NN.*/ matches "NN", "NNS", "NNP", etc.
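The whole-value matching described above corresponds to the difference between matches() and find() in java.util.regex. The following is a minimal sketch using plain Java regexes only; it is not Semgrex's own code, and the class name WholeValueMatch and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class: illustrates whole-value regex matching
// with java.util.regex, independent of Semgrex itself.
class WholeValueMatch {

    // True only if the regex matches the entire attribute value.
    static boolean matchesWhole(String regex, String value) {
        return Pattern.compile(regex).matcher(value).matches();
    }

    // True if the regex matches anywhere inside the value (grep-like).
    static boolean matchesAnywhere(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("NN", "NN"));     // true
        System.out.println(matchesWhole("NN", "NNS"));    // false
        System.out.println(matchesWhole("NN.*", "NNS"));  // true
        System.out.println(matchesAnywhere("NN", "NNS")); // true
    }
}
```

Under whole-value semantics, a pattern meant to cover a family of tags needs an explicit wildcard such as .* at the end.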
For example,
{lemma:slice;tag:/VB.*/}
represents any verb node with "slice" as its lemma. Attributes are extracted using edu.stanford.nlp.ling.AnnotationLookup.
The root of the graph can be marked by the $ sign, that is
{$}
represents the root node.
Relations are defined by a symbol representing the type of relationship and a string or regular expression representing the value of the relationship. A relationship string of % means any relationship. It is also OK simply to omit the relationship string altogether.
Currently supported node relations and their symbols:
Symbol | Meaning |
---|---|
A <reln B | A is the dependent of a relation reln with B |
A >reln B | A is the governor of a relation reln with B |
A <<reln B | A is the dependent of a relation reln in a chain to B following dep->gov paths |
A >>reln B | A is the governor of a relation reln in a chain to B following gov->dep paths |
A x,y<<reln B | A is the dependent of a relation reln in a chain to B following dep->gov paths between distances of x and y |
A x,y>>reln B | A is the governor of a relation reln in a chain to B following gov->dep paths between distances of x and y |
A == B | A and B are the same node in the same graph |
A @ B | A is aligned to B |
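To make the distance-bounded form x,y>> concrete, here is a minimal sketch over a toy graph stored as a map from governor to dependents. It is not the Semgrex implementation. It assumes that distance counts edges and that both bounds are inclusive, it ignores relation types, and the names BoundedChain and descendantsWithin are hypothetical.

```java
import java.util.*;

// Toy model of "A x,y>> B": nodes reachable from a start node by
// following governor->dependent edges for between min and max steps.
class BoundedChain {

    // Assumption: distance is the number of edges, bounds are inclusive.
    static Set<String> descendantsWithin(Map<String, List<String>> deps,
                                         String start, int min, int max) {
        Set<String> result = new TreeSet<>();
        Set<String> frontier = new HashSet<>();
        frontier.add(start);
        for (int dist = 1; dist <= max; dist++) {
            // Expand one level: every dependent of every frontier node.
            Set<String> next = new HashSet<>();
            for (String gov : frontier) {
                next.addAll(deps.getOrDefault(gov, Collections.emptyList()));
            }
            if (dist >= min) {
                result.addAll(next);
            }
            frontier = next;
        }
        return result;
    }

    // Toy graph: ate -> Bill, muffins; muffins -> blueberry.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("ate", Arrays.asList("Bill", "muffins"));
        deps.put("muffins", Arrays.asList("blueberry"));
        return deps;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = sampleGraph();
        System.out.println(descendantsWithin(g, "ate", 1, 1)); // [Bill, muffins]
        System.out.println(descendantsWithin(g, "ate", 2, 2)); // [blueberry]
    }
}
```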
In a chain of relations, all relations are relative to the first node in the chain. For example,
{} >nsubj {} >dobj {}
means "any node that is the governor of both an nsubj and a dobj relation". If instead you want a node that is the governor of an nsubj relation with a node that is itself the governor of a dobj relation, you should write:
{} >nsubj ({} >dobj {})
If a relation type is specified for the << relation, the relation type is only used for the first relation in the sequence. Therefore, if B depends on A with the relation type foo, the pattern
{} <<foo {}
will then match B and everything that depends on B.
Similarly, if a relation type is specified for the >> relation, the relation type is only used for the last relation in the sequence. Therefore, if A governs B with the relation type foo, the pattern
{} >>foo {}
will then match A and all of the nodes which have a sequence leading to A.
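The distinction between a flat chain such as {} >nsubj {} >dobj {} and a parenthesized one such as {} >nsubj ({} >dobj {}) can be modeled on a toy edge list. This sketch only mirrors the semantics described in the text; it does not use or reproduce the Semgrex matcher, and the class ChainVsNested, its Edge type, and the node names a, b, c are hypothetical.

```java
import java.util.*;

// Toy model contrasting a flat relation chain with a nested one.
class ChainVsNested {

    // A labeled dependency edge from governor to dependent.
    static final class Edge {
        final String gov, reln, dep;
        Edge(String gov, String reln, String dep) {
            this.gov = gov; this.reln = reln; this.dep = dep;
        }
    }

    // Dependents of gov reached through relation reln.
    static List<String> dependents(List<Edge> graph, String gov, String reln) {
        List<String> out = new ArrayList<>();
        for (Edge e : graph) {
            if (e.gov.equals(gov) && e.reln.equals(reln)) out.add(e.dep);
        }
        return out;
    }

    // Models "{} >nsubj {} >dobj {}": both relations hang off the first node.
    static boolean flatChain(List<Edge> graph, String node) {
        return !dependents(graph, node, "nsubj").isEmpty()
            && !dependents(graph, node, "dobj").isEmpty();
    }

    // Models "{} >nsubj ({} >dobj {})": the dobj hangs off the nsubj dependent.
    static boolean nested(List<Edge> graph, String node) {
        for (String subj : dependents(graph, node, "nsubj")) {
            if (!dependents(graph, subj, "dobj").isEmpty()) return true;
        }
        return false;
    }

    // a governs both b (nsubj) and c (dobj).
    static List<Edge> flatGraph() {
        return Arrays.asList(new Edge("a", "nsubj", "b"), new Edge("a", "dobj", "c"));
    }

    // a governs b (nsubj), and b governs c (dobj).
    static List<Edge> nestedGraph() {
        return Arrays.asList(new Edge("a", "nsubj", "b"), new Edge("b", "dobj", "c"));
    }

    public static void main(String[] args) {
        System.out.println(flatChain(flatGraph(), "a"));   // true
        System.out.println(nested(flatGraph(), "a"));      // false
        System.out.println(flatChain(nestedGraph(), "a")); // false
        System.out.println(nested(nestedGraph(), "a"));    // true
    }
}
```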
Boolean relational operators
Relations can be combined using the '&' and '|' operators, negated with the '!' operator, and made optional with the '?' operator.
Relations can be grouped using brackets '[' and ']'. So the expression
{} [<subj {} | <agent {}] & @ {}
matches a node that is the dependent of either a subj or an agent relationship and that has an alignment to some other node.
Relations can be negated with the '!' operator, in which case the expression will match only if there is no node satisfying the relation.
Relations can be made optional with the '?' operator. This way the expression will match even if the optional relation is not satisfied.
The operator ":" partitions a pattern into separate patterns, each of which must be matched. For example, the following is a pattern where the matched node must have both "foo" and "bar" as descendants:
{}=a >> {word:foo} : {}=a >> {word:bar}
This pattern could have been written
{}=a >> {word:foo} >> {word:bar}
However, for more complex examples, partitioning a pattern may make it more readable.
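One way to read the partitioned pattern is that each partition yields a set of candidates for the shared name a, and only nodes in every set survive. The sketch below models that reading on a toy graph in which each node is identified by its word. It is an illustration under that assumption, not Semgrex's matching algorithm, and the names PartitionDemo, nodesAbove, and matchBoth are hypothetical.

```java
import java.util.*;

// Toy model of "{}=a >> {word:foo} : {}=a >> {word:bar}".
class PartitionDemo {

    // All nodes reachable from start via one or more gov->dep edges.
    static Set<String> descendants(Map<String, List<String>> deps, String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack =
            new ArrayDeque<>(deps.getOrDefault(start, Collections.emptyList()));
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (seen.add(node)) {
                for (String d : deps.getOrDefault(node, Collections.emptyList())) {
                    stack.push(d);
                }
            }
        }
        return seen;
    }

    // Every node that appears as a governor or a dependent.
    static Set<String> allNodes(Map<String, List<String>> deps) {
        Set<String> nodes = new TreeSet<>(deps.keySet());
        for (List<String> ds : deps.values()) nodes.addAll(ds);
        return nodes;
    }

    // Candidates for one partition: nodes that have `word` as a descendant.
    static Set<String> nodesAbove(Map<String, List<String>> deps, String word) {
        Set<String> out = new TreeSet<>();
        for (String a : allNodes(deps)) {
            if (descendants(deps, a).contains(word)) out.add(a);
        }
        return out;
    }

    // Both partitions must hold for the same node a: intersect the candidates.
    static Set<String> matchBoth(Map<String, List<String>> deps) {
        Set<String> a = nodesAbove(deps, "foo");
        a.retainAll(nodesAbove(deps, "bar"));
        return a;
    }

    // root -> x, y; x -> foo; y -> bar.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("root", Arrays.asList("x", "y"));
        deps.put("x", Arrays.asList("foo"));
        deps.put("y", Arrays.asList("bar"));
        return deps;
    }

    public static void main(String[] args) {
        System.out.println(matchBoth(sampleGraph())); // [root]
    }
}
```

In the sample graph, x has only foo below it and y has only bar, so root is the only node that satisfies both partitions.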
Naming nodes
Nodes can be given names (a.k.a. handles) using '='. A named node will be stored in a map that maps names to nodes so that if a match is found, the node corresponding to the named node can be extracted from the map. For example
({tag:NN}=noun)
will match a singular noun node, and after a match is found, the map can be queried with the name to retrieve the matched node using {@link SemgrexMatcher#getNode(String o)} with the String argument "noun" (not "=noun"). Note that you are not allowed to name a node that is under the scope of a negation operator (the semantics would be unclear, since you can't store a node that never gets matched). Trying to do so will cause a {@link ParseException} to be thrown. Named nodes can be put within the scope of an optionality operator.
Named nodes that refer back to previously named nodes need not have a node description -- this is known as "backreferencing". In this case, the expression will match only when all instances of the same name get matched to the same node. For example: the pattern
{} >dobj ({} > {}=foo) >mod ({} > {}=foo)
will match a graph in which there are two nodes, X and Y, for which X is the grandparent of Y and there are two paths to Y, one of which goes through a dobj and one of which goes through a mod.
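The backreference constraint in the pattern above amounts to requiring that both branches end at the same node. A minimal sketch on a toy edge list follows; it models only that constraint and is not the Semgrex matcher. The class BackrefDemo, its Edge type, and the node names are hypothetical.

```java
import java.util.*;

// Toy model of "{} >dobj ({} > {}=foo) >mod ({} > {}=foo)":
// find grandchildren of X reachable through both a dobj and a mod child.
class BackrefDemo {

    // A labeled dependency edge from governor to dependent.
    static final class Edge {
        final String gov, reln, dep;
        Edge(String gov, String reln, String dep) {
            this.gov = gov; this.reln = reln; this.dep = dep;
        }
    }

    // Dependents of gov via reln; a null reln means any relation.
    static Set<String> children(List<Edge> graph, String gov, String reln) {
        Set<String> out = new TreeSet<>();
        for (Edge e : graph) {
            if (e.gov.equals(gov) && (reln == null || e.reln.equals(reln))) {
                out.add(e.dep);
            }
        }
        return out;
    }

    // Nodes that both occurrences of "foo" can bind to for a given X.
    static Set<String> sharedGrandchildren(List<Edge> graph, String x) {
        Set<String> viaDobj = new TreeSet<>();
        for (String d : children(graph, x, "dobj")) {
            viaDobj.addAll(children(graph, d, null));
        }
        Set<String> viaMod = new TreeSet<>();
        for (String m : children(graph, x, "mod")) {
            viaMod.addAll(children(graph, m, null));
        }
        viaDobj.retainAll(viaMod); // same name must mean the same node
        return viaDobj;
    }

    // X -dobj-> A, X -mod-> B, A -> Y, B -> Y, A -> Z.
    static List<Edge> sampleGraph() {
        return Arrays.asList(
            new Edge("X", "dobj", "A"),
            new Edge("X", "mod", "B"),
            new Edge("A", "dep", "Y"),
            new Edge("B", "dep", "Y"),
            new Edge("A", "dep", "Z"));
    }

    public static void main(String[] args) {
        System.out.println(sharedGrandchildren(sampleGraph(), "X")); // [Y]
    }
}
```

Z is reachable only through the dobj branch, so it cannot satisfy both occurrences of the name and is excluded.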
Naming relations
It is also possible to name relations. For example, you can write the pattern
{idx:1} >=reln {idx:2}
The name of the relation will then be stored in the matcher and can be extracted with
getRelnName("reln")
At present, though, there is no backreferencing capability such as with the named nodes; this is only useful when using the API to extract the name of the relation used when making the match.
In the case of ancestor and descendant relations, the name stored is that of the last relation in the sequence of relations.
@author Chloe Kiddon