Demo corpora

All corpora are available in the ANNIS database format (relANNIS), which can be imported directly into ANNIS, and in PAULA XML, which is more readable and editable but must be converted to relANNIS with the legacy Pepper converter framework before import. Some corpora are also offered in other source formats, such as TreeTagger SGML (a.k.a. CWB format) or, especially for multimodal corpora, EXMARaLDA XML, which may also be converted to relANNIS using Annatto.
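As a rough illustration of the paragraph above, the following sketch picks the easiest download to import from a corpus's list of format labels. The function name `import_route` and the two sets are hypothetical helpers, not part of ANNIS, Pepper, or Annatto. The label `ANNIS` is assumed to stand for the relANNIS download, and formats that the paragraph does not describe (for example GraphML) are deliberately left unclassified.

```python
# Format labels as they appear in the download links of the corpus list.
# Assumption: the label "ANNIS" denotes the relANNIS download.
DIRECT_IMPORT = {"ANNIS"}
NEEDS_CONVERSION = {"PAULA", "TreeTagger", "EXMARaLDA"}


def import_route(formats):
    """Return (format, needs_conversion) for the easiest download to import.

    A directly importable format is preferred; otherwise the first
    format that requires conversion is returned.
    """
    for fmt in formats:
        if fmt in DIRECT_IMPORT:
            return fmt, False
    for fmt in formats:
        if fmt in NEEDS_CONVERSION:
            return fmt, True
    # Formats not covered by the paragraph above are not classified here.
    raise ValueError(f"no known import route for formats: {formats}")


if __name__ == "__main__":
    print(import_route(["PAULA", "ANNIS"]))
    print(import_route(["EXMARaLDA"]))
```

The sketch only encodes what the text states: relANNIS needs no conversion, while the other named source formats do.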

| Corpus (Download) | Full Name | Language | Texts / Tokens | Annotations | Source |
|---|---|---|---|---|---|
| GUM [PAULA, ANNIS] | Georgetown University Multilayer Corpus | English | 25 / 22656 | Multiple POS, lemma, constituent trees, dependency trees, coreference, entity types, rhetorical structure, information status, TEI structure | Georgetown Linguistics |
| pcc2 [PAULA, ANNIS, GraphML] | Potsdam Commentary Corpus 2 (two example documents) | German | 2 / 399 | POS, lemma, morphology, constituent trees, dependency trees, coreference, rhetorical structure, information structure | SFB632/D1 |
| subtok.demo [PAULA, ANNIS] | Subtoken Demo | English | 1 / 11 | Diplomatic transcription, lemma, line, page, norm, pos, rend | SFB632/D1 |
| dialog.demo [EXMARaLDA, ANNIS] | Dialog Demo (Sample from BeMaTaC) | German | 1 / 102 | Spoken word forms, normalization and utterances for two speakers, time-aligned audio | SFB632/D1 |
| parallel.sample [PAULA, ANNIS] | Sample parallel corpus | English, German | 2 / 10 | POS, lemma, alignment, alignment type (good/fuzzy) | SFB632/D1 |
| a5.hausa.umarnin.uwa [TreeTagger, ANNIS] | Umarnin Uwa film transcript | Hausa | 47 / 10194 | Automatic POS, speakers, extralinguistic info, foreign words/code-switching | SFB632/A5 and D1 |
| b4.tatian2.0 [PAULA, ANNIS] | Tatian Corpus of Deviating Examples (T-CODEX) 2.0 | Old High German, Latin | 2031 / 11295 | POS, chunks, grammatical function, information structure | SFB632/B4; edition text courtesy of Vandenhoeck & Ruprecht |
| b7.wolof.web.V2 [TreeTagger, ANNIS] | Wolof Sample Web Corpus 2.0 | Wolof | 4 / 14676 | POS, sentence segmentation | SFB632/B7 and D1 |
| b7.wolof.wiki.V4 [TreeTagger, ANNIS] | Wolof Wikipedia Corpus 4.0 | Wolof | 14 / 12738 | POS, sentence segmentation, English translations | SFB632/B7 and D1 |
| d2.20samplesDEU [PAULA, ANNIS] | D2 20 Samples (QUIS data) | German | 22 / 382 | POS, aligned audio, accent, tones, information structure, grammatical function, morphology | SFB632/D2 |
| Aeschylus.Persae.L1-18 [PAULA, ANNIS] | Aeschylus, Persae: lines 1-18 | Classical Greek (Polytonic) | 1 / 87 | POS, lemma, grammatical function, labeled syntax trees | Francesco Mambrini / Perseus Project, Tufts University |
| fuerstinnenkorrespondenz [EXMARaLDA, ANNIS] | Early modern correspondence of German princesses and nobility | Early New High German | 600 / 262,468 | POS, lemma, clauses, grammatical function, normalization, orthography, politeness | Courtesy of the Lehrstuhl für Indogermanistik, Universität Jena |
| Nestorchronik.sample [PAULA, ANNIS] | Nestor Chronicle - 181,18 - 182,20 | Old Russian | 1 / 273 | POS/morphology, clause-level syntax trees | Courtesy of Roland Meyer / Institut für Slawistik, Humboldt-Universität zu Berlin |
| SMULTRON_Banana [PAULA, ANNIS] | SMULTRON Parallel Treebank Sample | German & English | 2 / 3782 | POS, syntax trees, word and phrase level alignment with alignment quality; for German: lemma, morphology, entities | Courtesy of the Institut für Computerlinguistik, Universität Zürich |
| ridges.herbology [PAULA, ANNIS] | RIDGES Herbology | (Early) Modern German | 14 / 63734 | see RIDGES documentation | RIDGES Project |
| Align1_992 [PAULA, ANNIS] | Roman de Flamenca Parallel Corpus | Old Occitan, English | 2 / 14166 | see Flamenca documentation | Olga Scrivner, Indiana University |