ANNIS (corpus-tools.org)

ANNIS is an open-source, cross-platform (Linux, Mac, Windows), web browser-based search and visualization architecture for complex multi-layer linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632 - “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”. It has since been extended to a large number of projects annotating a variety of phenomena. Since complex linguistic phenomena such as information structure interact on many levels, ANNIS addresses the need to concurrently annotate, query and visualize data from such varied areas as syntax, semantics, morphology, prosody, referentiality, lexis and more. For projects working with spoken language, support for audio / video annotations is also required.

Data is often annotated using both automatic taggers/parsers and a growing set of manual annotation tools (e.g. EXMARaLDA, ELAN, annotate/Synpathy, MMAX, RSTTool, Arborator, WebAnno, Hexatomic). ANNIS provides the means for visualizing and retrieving this data. Pepper is used to import the multiple annotation formats into ANNIS.

For detailed information on the latest version of ANNIS, see the User Guide under documentation.

If you use ANNIS in your scientific work, please cite it as follows.

Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. in: Digital Scholarship in the Humanities 2016 (31). http://dsh.oxfordjournals.org/content/31/1/118

ANNIS can be installed locally on your computer, but there are also publicly available installations that can be used without installing anything.

Humboldt-Universität zu Berlin, Corpus Linguistics and Morphology has a number of mostly smaller corpora available without a login.
The Georgetown University ANNIS installation hosts some freely available corpora.

Diversity of primary data

Language data can be very heterogeneous and may come from typologically diverse languages. It differs with respect to modality (written vs. spoken language, monologue vs. dialogue) and basic unit (sentence vs. discourse). In addition, special character sets (e.g. for Hindi, Old High German or the African Kwa languages) mean that full Unicode support is essential, in both visualization and search facilities. The system also offers support for right-to-left script languages, such as Arabic and Hebrew. This includes right-to-left tree layout for treebanks in these languages.

Hindi data in ANNIS

ANNIS supports Unicode in both visualization and search, including Regular Expressions
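To make the idea of Unicode-aware regular expression search concrete, here is a minimal Python sketch. It is not ANNIS code and does not use the ANNIS query language. The token list and the `search_tokens` helper are hypothetical, and full-token matching is an assumption made for the illustration.

```python
import re

# Hypothetical token list standing in for the tokens of a Hindi corpus.
tokens = ["मैं", "किताब", "पढ़ता", "हूँ"]

# Pattern: any token that begins with the Devanagari letter "प".
pattern = re.compile(r"प.*")


def search_tokens(tokens, pattern):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if pattern.fullmatch(t)]


matches = search_tokens(tokens, pattern)
print(matches)
```

Python's `re` module works on Unicode strings by default, so no special flags are needed for non-Latin scripts. A search facility without full Unicode support could not express this kind of pattern at all.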

Right-to-left Arabic data in the KWIC view


Right-to-left layout for trees in Hebrew


Diversity of Annotation

Data is annotated on various linguistic levels: phonetics/phonology, morpho-syntax, semantics, and information structure. The data types of the annotation range from attribute-value pairs to set relations (e.g. for annotating co-reference), directed relations/pointers (e.g. for annotating anaphoric relations), trees, and graphs (see Visualizations). Furthermore, the annotations are created with the help of different tools, so different tool formats have to be supported. In order to ensure compatibility with as many formats as possible, we use the Pepper converter framework, which maps a large number of formats via the metamodel Salt into the native format of ANNIS.

Multi-layer Annotation

A very central requirement is support for visualizing and querying annotations on multiple layers, each layer representing one type of information, e.g. morphemic transcription, grammatical functions, pitch accents, etc. Queries must be able to simultaneously constrain all these layers and the relationships between them, making operators for the description of topological tree structures as well as span overlap necessary. The system also supports parallel corpora aligned at all levels (i.e. words, sentences, syntactic phrases etc. can be aligned), and each aligned language may have its own annotation layers.

Parallel aligned data with a separate syntax tree for each language
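The requirement that a query constrain several layers at once, together with the relationship between them, can be sketched in a few lines of Python. This is only an illustration of span overlap across layers, not the ANNIS data model or query engine. The `Span` class, the layer names `pos` and `infstat`, and the `query` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str  # hypothetical layer name, e.g. "pos" or "infstat"
    value: str  # annotation value on that layer
    start: int  # index of the first covered token
    end: int    # index of the last covered token (inclusive)


def overlaps(a, b):
    """True if the two spans share at least one token."""
    return a.start <= b.end and b.start <= a.end


def query(spans, layer_a, value_a, layer_b, value_b):
    """Find pairs of spans from two layers that overlap."""
    left = [s for s in spans if s.layer == layer_a and s.value == value_a]
    right = [s for s in spans if s.layer == layer_b and s.value == value_b]
    return [(a, b) for a in left for b in right if overlaps(a, b)]


# Tokens: 0 "The", 1 "dog", 2 "barked", 3 "loudly"
spans = [
    Span("pos", "DET", 0, 0),
    Span("pos", "NOUN", 1, 1),
    Span("pos", "VERB", 2, 2),
    Span("pos", "ADV", 3, 3),
    Span("infstat", "topic", 0, 1),
    Span("infstat", "focus", 2, 3),
]

# Nouns that fall inside a span annotated as topic.
hits = query(spans, "pos", "NOUN", "infstat", "topic")
print(hits)
```

Each layer is stored independently and refers only to token positions, so new layers can be added without touching existing ones. The query then relates layers through those shared positions.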

Accessibility

Data in the database should be easy to access and to query. Software and hardware requirements on the client side should be limited to a freely available browser (e.g. Mozilla Firefox). As little training as possible should be required, making a graphical query builder as well as corpus-specific example queries and tutorials necessary.

Performance and Scalability

Queries should return results reasonably quickly, even in large datasets. To achieve this, the original data from XML and other formats is compiled and stored in the ANNIS backend within a relational database (PostgreSQL), which offers scalability and access speed not feasible for an XML database, as well as native regular expression support.
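As a rough illustration of why a relational backend suits this kind of search, the following Python sketch stores annotations in an indexed table and filters one layer with a regular expression. It deliberately substitutes Python's built-in SQLite for PostgreSQL so that it runs without a database server. The table and column names are hypothetical and are not the ANNIS schema. PostgreSQL matches regular expressions natively (for example with its `~` operator), whereas SQLite needs the user-defined `REGEXP` function registered below.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs SQLite's REGEXP operator: 'X REGEXP Y' calls regexp(Y, X)."""
    if value is None:
        return 0
    return int(re.fullmatch(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical schema: one row per annotation on a token.
conn.execute(
    "CREATE TABLE annotation (token_index INTEGER, layer TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        (0, "tok", "The"), (0, "pos", "DT"),
        (1, "tok", "dogs"), (1, "pos", "NNS"),
        (2, "tok", "bark"), (2, "pos", "VBP"),
    ],
)
# An index on (layer, value) lets the database narrow candidates quickly.
conn.execute("CREATE INDEX idx_layer_value ON annotation (layer, value)")

# Token text for every token whose part-of-speech tag starts with "NN".
rows = conn.execute(
    "SELECT t.token_index, t.value "
    "FROM annotation AS t "
    "JOIN annotation AS p ON p.token_index = t.token_index "
    "WHERE t.layer = 'tok' AND p.layer = 'pos' AND p.value REGEXP ? "
    "ORDER BY t.token_index",
    ("NN.*",),
).fetchall()
print(rows)
```

The join relates two annotation layers through the shared token position, and the index is what allows such lookups to stay fast as the table grows.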

For enquiries, e-mail us at: annis-support@lists.hu-berlin.de

Current team members: