ANNIS

ANNIS is a web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.

All corpora are available in the ANNIS database format (relANNIS), which can be imported directly into ANNIS, and in PAULA XML, which is more readable and editable but must be converted to relANNIS with the Pepper converter framework before import. Some corpora are also offered in other source formats, such as TreeTagger SGML (a.k.a. CWB format) or, especially for multimodal corpora, EXMARaLDA XML; these can likewise be converted to relANNIS or PAULA.
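The format choices described above can be summarized as a small lookup. The following Python sketch is only an illustration: the function names and step descriptions are hypothetical, it does not call ANNIS or Pepper, and it assumes the bracketed download labels in the table below (PAULA, ANNIS, TreeTagger, EXMARaLDA) as format names.

```python
# Illustrative only: these names are hypothetical and not part of ANNIS or Pepper.
# The format labels follow the download links listed for each corpus.

DIRECT_IMPORT = {"ANNIS"}  # relANNIS packages can be imported as-is
OTHER_SOURCES = {"TreeTagger", "EXMARaLDA"}


def import_steps(fmt):
    """Return the workflow steps the text describes for one download format."""
    if fmt in DIRECT_IMPORT:
        return ["import into ANNIS"]
    if fmt == "PAULA":
        # The text names Pepper explicitly for the PAULA route.
        return ["convert to relANNIS with Pepper", "import into ANNIS"]
    if fmt in OTHER_SOURCES:
        # The text only says these may be converted to relANNIS or PAULA;
        # it does not name the converter for this route.
        return ["convert to relANNIS or PAULA", "import into ANNIS"]
    raise ValueError(f"unknown format: {fmt!r}")


def pick_download(formats):
    """Prefer a directly importable format, else fall back to the first one listed."""
    for fmt in formats:
        if fmt in DIRECT_IMPORT:
            return fmt
    return formats[0]


if __name__ == "__main__":
    offered = ["PAULA", "ANNIS"]
    choice = pick_download(offered)
    print(choice, "->", import_steps(choice))
```

Choosing the ANNIS download avoids a conversion step; the PAULA download is the better starting point when the corpus is to be read or edited before import.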

| Corpus (Download) | Full Name | Language | Texts / Tokens | Annotations | Source |
|---|---|---|---|---|---|
| GUM [PAULA, ANNIS] | Georgetown University Multilayer Corpus | English | 25 / 22656 | Multiple POS, lemma, constituent trees, dependency trees, coreference, entity types, rhetorical structure, information status, TEI structure | Georgetown Linguistics |
| pcc2 [PAULA, ANNIS] | Potsdam Commentary Corpus 2 | German | 2 / 399 | POS, lemma, morphology, constituent trees, dependency trees, coreference, rhetorical structure, information structure | SFB632/D1 |
| subtok.demo [PAULA, ANNIS] | Subtoken Demo | English | 1 / 11 | Diplomatic transcription, lemma, line, page, norm, pos, rend | SFB632/D1 |
| dialog.demo [EXMARaLDA, ANNIS] | Dialog Demo (Sample from BeMaTaC) | German | 1 / 102 | Spoken word forms, normalization and utterances for two speakers, time-aligned audio | SFB632/D1 |
| parallel.sample [PAULA, ANNIS] | Sample parallel corpus | English, German | 2 / 10 | POS, lemma, alignment, alignment type (good/fuzzy) | SFB632/D1 |
| a5.hausa.umarnin.uwa [TreeTagger, ANNIS] | Umarnin Uwa film transcript | Hausa | 47 / 10194 | Automatic POS, speakers, extralinguistic info, foreign words/code-switching | SFB632/A5 and D1 |
| b4.tatian2.0 [PAULA, ANNIS] | Tatian Corpus of Deviating Examples (T-CODEX) 2.0 | Old High German, Latin | 2031 / 11295 | POS, chunks, grammatical function, information structure | SFB632/B4; edition text courtesy of Vandenhoeck & Ruprecht |
| b7.wolof.web.V2 [TreeTagger, ANNIS] | Wolof Sample Web Corpus 2.0 | Wolof | 4 / 14676 | POS, sentence segmentation | SFB632/B7 and D1 |
| b7.wolof.wiki.V4 [TreeTagger, ANNIS] | Wolof Wikipedia Corpus 4.0 | Wolof | 14 / 12738 | POS, sentence segmentation, English translations | SFB632/B7 and D1 |
| d2.20samplesDEU [PAULA, ANNIS] | D2 20 Samples (QUIS data) | German | 22 / 382 | POS, aligned audio, accent, tones, information structure, grammatical function, morphology | SFB632/D2 |
| Aeschylus.Persae.L1-18 [PAULA, ANNIS] | Aeschylus, Persae: lines 1-18 | Classical Greek (Polytonic) | 1 / 87 | POS, lemma, grammatical function, labeled syntax trees | Francesco Mambrini / Perseus Project, Tufts University |
| fuerstinnenkorrespondenz [EXMARaLDA, ANNIS] | Early modern correspondence of German princesses and nobility | Early New High German | 600 / 262468 | POS, lemma, clauses, grammatical function, normalization, orthography, politeness | Courtesy of the Lehrstuhl für Indogermanistik, Universität Jena [website] |
| Nestorchronik.sample [PAULA, ANNIS] | Nestor Chronicle - 181,18 - 182,20 | Old Russian | 1 / 273 | POS/morphology, clause-level syntax trees | Courtesy of Roland Meyer / Institut für Slawistik, Humboldt-Universität zu Berlin |
| SMULTRON_Banana [PAULA, ANNIS] | SMULTRON Parallel Treebank Sample | German & English | 2 / 3782 | POS, syntax trees, word and phrase level alignment with alignment quality; for German: lemma, morphology, entities | Courtesy of the Institut für Computerlinguistik, Universität Zürich |
| ridges.herbology [PAULA, ANNIS] | RIDGES Herbology | (Early) Modern German | 14 / 63734 | see RIDGES documentation | RIDGES Project |
| Align1_992 [PAULA, ANNIS] | Roman de Flamenca Parallel Corpus | Old Occitan, English | 2 / 14166 | see Flamenca documentation | Olga Scrivner, Indiana University |
| abraham.our.father [PAULA, ANNIS] | Abraham our Father (Shenoute) | Sahidic Coptic | 7 / 7705 | see documentation at http://coptic.pacific.edu/ | Coptic SCRIPTORIUM |
| apophthegmata.patrum.5 [PAULA, ANNIS] | Apophthegmata Patrum | Sahidic Coptic | 5 / 700 | see documentation at http://coptic.pacific.edu/ | Coptic SCRIPTORIUM |