ANNIS

A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.

Nodes and Edges

AQL is based on the concept of searching for node elements and edges between them. A search is formulated by defining each token, non-terminal node or annotation being searched for as an element. An element can be a token (simply text between quotes: "dogs" or else tok="dogs") or an attribute-value pair (such as tok="dogs", or optionally with a namespace: tiger:cat="PP"). Note that different corpora can have completely different annotation names and values - these are not specified by ANNIS. Underspecified tokens or nodes in general may be specified using tok and node respectively.

Once all elements are declared, relations between the elements (or edges) are specified which must hold between them. The elements are referred back to serially using variable numbers, and linguistic operators bind them together, e.g. #1 > #2 meaning the first element dominates the second in a tree or graph. Operators define the possible overlap and adjacency relations between annotation spans, as well as recursive hierarchical relations between nodes. Some operators also allow specific labes to be specified in addition to the operator (see the operator list below).

The following example, a query searching for German sentences with topicalized objects (i.e. with the word order object-verb-subject), illustrates these ideas in practice as a complete example:

QueryExplanation
node & pos="VVFIN" & cat="S" & node &
#3 >[tiger:func="OA"] #1 &
#3 >[tiger:func="SB"] #4 &
#3 > #2 &
#1 .* #2 &
#2 .* #4
two nodes are defined, a finite verb and a sentence node (S)
S dominates node 1 with label OA
S dominates node 4 with label SB
S dominates (>) the verb
node 1 precedes (.*) the verb
the verb precedes (.*) node 4

Try out this query in the pcc2 corpus in ANNIS3

Shortcuts

Starting in ANNIS 3.1.0, you can also use shortcuts to define the relations between query nodes. You can specify the operator that applies between two nodes directly between those nodes. For example, the following two queries are equivalent:

cat="NP" & cat="PP" & #1 > #2

or

cat="NP" > cat="PP"

Specifying the dominance operator > directly between the NP and PP nodes works the same as declaring the two nodes and then stating that the dominance relationshop holds between the first node (#1) and the second one (#2).

Naming nodes

For complex queries in which the same nodes are involved in multiple relations, it may be easier to name nodes explicitly, rather than using the #1, #2, ... number notation. Node names are given to nodes when they are defined and are placed before the '#' sing. The following query illustrates this mechanism:

QueryExplanation
NP#cat="NP" &  
PP1#cat="PP" . PP2#cat="PP" &
#NP > #PP1 &
#NP > #PP2
a nominal phrase which we will call 'NP' 
two consecutive prepositional phrases named PP1 and PP2 
the NP node dominates PP1
the same NP node also dominates PP2 

Try out this query in the GUM corpus in ANNIS3

Metadata

To specify metadata conditions which must apply to matches, add key-value pairs preceded by the reserved prefix meta::. Metadata may apply to corpora, sub-corpora, or individual documents within a corpus. For example, the following query finds sequences of two consecutive adverbs in sports documents (Genre="Sport"):

QueryExplanation
pos="ADV" & pos="ADV" &
#1 . #2 &
meta::Genre="Sport"
two adverb tags
adverb 1 precedes adverb 2 directly
the metadatum Genre must have the value "Sport"

Try out this query in the pcc2 corpus in ANNIS

Unary Operators

Two operators refer to only one matching element, instead of specifying a relationship between two elements: tokenarity and arity. The tokenarity operator specifies how many tokens should be covered by the matching element, whereas the arity operator determines the amount of directly dominated children the matched node should have. For example, the following query searches for nominal phrases that dominate exactly 4 nodes:

QueryExplanation
cat="NP" 
& #1:arity=4
a syntactic category 'NP' for 'nominal phrase'
this node should have exactly 4 children 

Try out this query in the pcc2 corpus in ANNIS

Query Builder

In order to facilitate the formulation of complex queries, a graphical query builder allows users to define their search in a graph. This reflects the nodes and edges in the query directly but gives a more intuitive view of the elements and relations being searched for.

ANNIS3 Query Builder
The Query Builder allows users to model queries as a graph

RegEx Support

ANNIS supports RegEx natively in all token, annotation, edge label and metadata searches. In the query builder simply select ~ instead of = as the comparison operator. When entering a query manually use = but replace the double quotes around annotation and token values with slashes, e.g. lemma=/d.g/ finds "dog" and "dig".

Example Queries

Beginning with ANNIS 3.0.0, the possibility to include user-defined example queries within a corpus distribution has been added. The example queries can be entered in a separate file, called example_queries.tab within the relANNIS corpus folder. For more information on how to add example queries to your corpus, see the ANNIS User Guide on the documentation page.

Operators

AQL currently includes the following operators:

Operator Description Illustration Notes
.direct precedence
A B
For non-terminal nodes, precedence is determined by the right-most and left-most terminal children. In corpora with multiple segmentations the layer on which consecutivity holds may be specified with .layer
.*indirect precedence
A x y z B
For specific sizes of precedence spans, .n,m can be used, e.g. .3,4 - between 3 and 4 token distance; the default maximum distance for .* is 50 tokens. As above, segmentation layers may be specified, e.g. .layer,3,4
^directly near
A B or B A
Same as precedence, but in either order. In corpora with multiple segmentations the layer on which consecutivity holds may be specified with ^layer
^*indirectly near
A x y z B or B x y z A
Like indirect precedence in either order. The form ^n,m can be used, e.g. ^3,4 - between 3 and 4 token distance; the default maximum distance for ^* is 50 tokens. As above, segmentation layers may be specified, e.g. ^layer,3,4
>direct dominance
A
|
B
A specific edge type may be specified, e.g. >secedge to find secondary edges. Edge labels are specified in brackets, e.g. >[func="OA"] for an edge with the function 'object, accusative'
>* indirect dominance
A
|
...
|
B
For specific distance of dominance, >n,m can be used, e.g. >3,4 - dominates with 3 to 4 edges distance
_=_identical coverage
A
B
Applies when two annotations cover the exact same span of tokens
_i_ inclusion
AAA
B
Applies when one annotation covers a span identical to or larger than another
_o_ overlap
AAA   
   BBB
For overlap only on the left or right side, use _ol_ and _or_ respectively
_l_ left aligned  
AAA
BB 
Both elements span an area beginning with the same token
_r_ right aligned
AA
BBB
  Both elements span an area ending with the same token
== value identity
A = B
The value of the annotation or token A is identical to that of B (this operator does not bind, i.e. the nodes must be connected by some other criteria too)
!= value difference
A ≠ B
The value of the annotation or token A is different from B (this operator does not bind, i.e. the nodes must be connected by some other criteria too)
->LABEL labeled pointing relation
      LABEL      
                
V  
A B
A labeled, directed relationship between two elements. Annotations can be specified with ->LABEL[annotation="VALUE"]
->LABEL * indirect pointing relation
LABEL   ...   LABEL
                        
   V  
A B
An indirect labeled relationship between two elements. The length of the chain may be specified with ->LABEL n,m for relation chains of length n to m
>@l left-most child
A
/ | \
B x y
 
>@r right-most child
A
/ | \
x y B
 
$ Common parent node
x
/ \
A B
 
$*Common ancestor node
x
|
...
/ \
A B
 
#x:arity=n Arity x
/ | \
1 ... n
Specifies the amount of directly dominated children that the searched node has
#x:tokenarity=n Tokenarity x
...
/ \
1 ... n
Specifies the length of the span of tokens covered by the node
#x:root Root ___
x
...
Specifies that the node is not dominated by any other node within its namespace