ANNIS
A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.
ANNIS is an open-source, cross-platform (Linux, Mac, Windows),
web browser-based search and visualization architecture for complex multi-layer
linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation
of Information Structure, was originally designed to provide access to the data of
the SFB 632 - “Information Structure: The Linguistic Means for Structuring Utterances,
Sentences and Texts”. It has since been extended to a large number of projects
annotating a variety of phenomena. Since complex linguistic phenomena such as
information structure interact on many levels, ANNIS addresses the need
to concurrently annotate, query and visualize data from such varied
areas as syntax, semantics, morphology, prosody, referentiality,
lexis and more. For projects working with spoken language, support
for audio / video annotations is also required.
Data is often annotated using both automatic taggers/parsers
and a growing set of manual annotation tools
(e.g. EXMARaLDA, ELAN,
annotate/Synpathy,
MMAX, RSTTool,
Arborator, WebAnno, Hexatomic).
ANNIS provides the means for visualizing and retrieving this data.
Pepper is used to import the multiple annotation formats into ANNIS.
For detailed information on the latest version of ANNIS, see the User Guide under documentation.
If you use ANNIS in your scientific work, please cite it as follows.
Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. In: Digital Scholarship in the Humanities 2016 (31).
http://dsh.oxfordjournals.org/content/31/1/118
ANNIS can be installed locally on your computer, but there are also publicly available installations that can be used without installing anything.
Diversity of primary data
Language data can be very heterogeneous and may come from
typologically diverse languages. It differs with respect to modality
(written vs. spoken language, monologue vs. dialogue) and basic unit
(sentence vs. discourse). In addition, special character sets (e.g.
for Hindi, Old High German or the African Kwa languages) mean that full
Unicode support is essential, in both visualization and search facilities.
The system also offers support for right-to-left
script languages, such as Arabic and Hebrew. This includes right-to-left
tree layout for treebanks in
these languages.
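As a rough illustration of why Unicode support matters for search, the following Python sketch matches regular expressions against tokens in several scripts. It is not ANNIS code: the function name search and the token list are hypothetical, and the normalization to NFC is an assumption made here so that precomposed and decomposed spellings of the same character match each other.

```python
import re
import unicodedata

# Hypothetical tokens in Latin, Hebrew and Arabic script (not from an ANNIS corpus).
# "Ha\u0308user" spells the umlaut as "a" plus a combining diaeresis.
tokens = ["Haus", "Ha\u0308user", "\u05e9\u05dc\u05d5\u05dd", "\u0643\u062a\u0627\u0628"]


def search(pattern, items):
    """Return the NFC-normalized items that fully match the pattern.

    Pattern and items are both normalized to NFC first, so that
    different encodings of the same character compare equal.
    """
    compiled = re.compile(unicodedata.normalize("NFC", pattern))
    normalized = (unicodedata.normalize("NFC", item) for item in items)
    return [item for item in normalized if compiled.fullmatch(item)]


# Precomposed pattern, decomposed data: still a match after normalization.
print(search("H\u00e4u.*", tokens))

# Strings are stored in logical order, so a right-to-left word
# "starts" with its first letter regardless of display direction.
print(search("\u05e9.*", tokens))
```

Without the normalization step, the first query would miss the decomposed token even though both spellings look identical on screen.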
ANNIS supports Unicode in both visualization and search,
including Regular Expressions
Right-to-left Arabic data in the KWIC view
Right-to-left layout for trees in Hebrew
Diversity of Annotation
Data is annotated on various linguistic levels: phonetics/phonology,
morpho-syntax, semantics, and information structure.
The data types of the annotation range from attribute-value pairs to
set relations (e.g. for annotating co-reference), directed
relations/pointers (e.g. for annotating anaphoric relations), trees,
and graphs (see Visualizations).
Furthermore, the annotations are created with the help of different
tools, so different tool formats have to be supported. In order to ensure
compatibility with as many formats as possible, we use the Pepper converter framework,
which maps a large number of formats via the metamodel Salt into the native format of ANNIS.
Multi-layer Annotation
A central requirement is support for visualizing and querying
annotations on multiple layers, each layer representing one type of
information, e.g. morphemic transcription, grammatical functions, pitch
accents, etc. Queries must be able to simultaneously constrain all
these layers and the relationships between them, making operators for
the description of topological tree structures as well as span overlap
necessary.
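The two kinds of operators mentioned above, tree dominance and span overlap, can be sketched in a few lines of Python. This is only an illustration of the idea, not the ANNIS query language or data model: the Node class, the layer names "syntax" and "infostruct", and the example sentence are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An annotation covering token positions start..end (end exclusive)."""
    layer: str
    label: str
    start: int
    end: int
    children: list = field(default_factory=list)


def overlaps(a, b):
    """True if the two spans share at least one token."""
    return a.start < b.end and b.start < a.end


def dominates(ancestor, node):
    """True if node is a direct or indirect child of ancestor."""
    return any(child is node or dominates(child, node)
               for child in ancestor.children)


# Hypothetical four-token sentence: "the old man slept"
noun_phrase = Node("syntax", "NP", 0, 3)
verb_phrase = Node("syntax", "VP", 3, 4)
sentence = Node("syntax", "S", 0, 4, [noun_phrase, verb_phrase])
topic = Node("infostruct", "topic", 0, 3)
focus = Node("infostruct", "focus", 3, 4)
nodes = [sentence, noun_phrase, verb_phrase, topic, focus]


def phrases_in_sentence_overlapping(nodes, phrase_label, info_label):
    """Find phrases dominated by an S node that overlap a given
    information-structure span."""
    roots = [n for n in nodes if n.layer == "syntax" and n.label == "S"]
    return [
        (phrase.label, info.label)
        for phrase in nodes
        if phrase.layer == "syntax" and phrase.label == phrase_label
        for info in nodes
        if info.layer == "infostruct" and info.label == info_label
        if overlaps(phrase, info)
        and any(dominates(root, phrase) for root in roots)
    ]


print(phrases_in_sentence_overlapping(nodes, "NP", "topic"))
```

The query constrains two layers at once: the syntax layer through dominance and the information-structure layer through overlap with the same tokens.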
The system also supports parallel corpora aligned at all levels (i.e. words, sentences, syntactic phrases etc. can be aligned), and each aligned language may have its own annotation layers.
Parallel aligned data with a separate syntax tree for each language
Accessibility
Data in the database should be easy to access and to query. Software
and hardware requirements on the client side should be limited to a
freely available browser (e.g. Mozilla Firefox). As little training as
possible should be required, making a graphical query builder as well
as corpus-specific example queries and tutorials necessary.
Queries should return results reasonably quickly, even in large datasets. To achieve this, the original data from XML and other formats is compiled and stored in the ANNIS backend within a relational database (PostgreSQL),
which offers scalability and access speed not feasible for an XML database,
as well as native regular expression support.
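To give a feel for regular expression search over annotations stored in relational tables, here is a small Python sketch. It uses an in-memory SQLite database purely as a stand-in for PostgreSQL, and the single annotation table with its columns is a hypothetical schema invented for this example, not the actual ANNIS storage layout. SQLite has no built-in REGEXP implementation, so the sketch registers one from Python.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs SQLite's 'value REGEXP pattern' operator (full match)."""
    if value is None:
        return 0
    return int(re.fullmatch(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)

# Hypothetical schema: one row per annotation value on a token.
conn.execute(
    "CREATE TABLE annotation (token_index INTEGER, layer TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        (0, "pos", "DT"),
        (1, "pos", "JJ"),
        (2, "pos", "NN"),
        (3, "pos", "VBD"),
        (2, "lemma", "man"),
    ],
)


def find_tokens(layer, pattern):
    """Return (token_index, value) pairs whose value matches the pattern."""
    cursor = conn.execute(
        "SELECT token_index, value FROM annotation "
        "WHERE layer = ? AND value REGEXP ? ORDER BY token_index",
        (layer, pattern),
    )
    return cursor.fetchall()


print(find_tokens("pos", "N.*"))
```

In a database with native regular expression support, the matching runs inside the query engine itself, so no such helper function is needed.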
For enquiries, e-mail us at: annis-support@lists.hu-berlin.de