ANNIS
A web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.
Demo Corpora for download
All corpora are available in the ANNIS database format (relANNIS), which can be imported directly into ANNIS. Most are also available in PAULA XML, which is more readable and editable but must be converted to relANNIS with the Pepper converter framework before import. Some corpora are offered in other source formats, such as TreeTagger SGML (a.k.a. CWB format) or, especially for multimodal corpora, EXMARaLDA XML; these may also be converted to relANNIS or PAULA.
Corpus (Download) | Full Name | Language | Texts / Tokens | Annotations | Source |
---|---|---|---|---|---|
GUM [PAULA, ANNIS] | Georgetown University Multilayer Corpus | English | 25 / 22656 | Multiple POS, lemma, constituent trees, dependency trees, coreference, entity types, rhetorical structure, information status, TEI structure | Georgetown Linguistics |
pcc2 [PAULA, ANNIS, GraphML] | Potsdam Commentary Corpus 2 | German | 2 / 399 | POS, lemma, morphology, constituent trees, dependency trees, coreference, rhetorical structure, information structure | SFB632/D1 |
subtok.demo [PAULA, ANNIS] | Subtoken Demo | English | 1 / 11 | Diplomatic transcription, lemma, line, page, norm, pos, rend | SFB632/D1 |
dialog.demo [EXMARaLDA, ANNIS] | Dialog Demo (Sample from BeMaTaC) | German | 1 / 102 | Spoken word forms, normalization and utterances for two speakers, time-aligned audio | SFB632/D1 |
parallel.sample [PAULA, ANNIS] | Sample parallel corpus | English, German | 2 / 10 | POS, lemma, alignment, alignment type (good/fuzzy) | SFB632/D1 |
a5.hausa.umarnin.uwa [TreeTagger, ANNIS] | Umarnin Uwa film transcript | Hausa | 47 / 10194 | Automatic POS, speakers, extralinguistic info, foreign words/code-switching | SFB632/A5 and D1 |
b4.tatian2.0 [PAULA, ANNIS] | Tatian Corpus of Deviating Examples (T-CODEX) 2.0 | Old High German, Latin | 2031 / 11295 | POS, chunks, grammatical function, information structure | SFB632/B4; edition text courtesy of Vandenhoeck & Ruprecht |
b7.wolof.web.V2 [TreeTagger, ANNIS] | Wolof Sample Web Corpus 2.0 | Wolof | 4 / 14676 | POS, sentence segmentation | SFB632/B7 and D1 |
b7.wolof.wiki.V4 [TreeTagger, ANNIS] | Wolof Wikipedia Corpus 4.0 | Wolof | 14 / 12738 | POS, sentence segmentation, English translations | SFB632/B7 and D1 |
d2.20samplesDEU [PAULA, ANNIS] | D2 20 Samples (QUIS data) | German | 22 / 382 | POS, aligned audio, accent, tones, information structure, grammatical function, morphology | SFB632/D2 |
Aeschylus.Persae.L1-18 [PAULA, ANNIS] | Aeschylus, Persae: lines 1-18 | Classical Greek (Polytonic) | 1 / 87 | POS, lemma, grammatical function, labeled syntax trees | Francesco Mambrini / Perseus Project, Tufts University |
fuerstinnenkorrespondenz [EXMARaLDA, ANNIS] | Early modern correspondence of German princesses and nobility | Early New High German | 600 / 262,468 | POS, lemma, clauses, grammatical function, normalization, orthography, politeness | Courtesy of the Lehrstuhl für Indogermanistik, Universität Jena [website] |
Nestorchronik.sample [PAULA, ANNIS] | Nestor Chronicle - 181,18 - 182,20 | Old Russian | 1 / 273 | POS/morphology, clause-level syntax trees | Courtesy of Roland Meyer / Institut für Slawistik, Humboldt-Universität zu Berlin |
SMULTRON_Banana [PAULA, ANNIS] | SMULTRON Parallel Treebank Sample | German & English | 2 / 3782 | POS, syntax trees, word and phrase level alignment with alignment quality; for German: lemma, morphology, entities | Courtesy of the Institut für Computerlinguistik, Universität Zürich |
ridges.herbology [PAULA, ANNIS] | RIDGES Herbology | (Early) Modern German | 14 / 63734 | see RIDGES documentation | RIDGES Project |
Align1_992 [PAULA, ANNIS] | Roman de Flamenca Parallel Corpus | Old Occitan, English | 2 / 14166 | see Flamenca documentation | Olga Scrivner, Indiana University |