corpus-tools.org

With corpus-tools.org we provide an infrastructure to annotate, migrate, and analyze linguistic data.

ANNIS

ANNIS is an open-source, cross-platform (Linux, Mac, Windows), browser-based search and visualization architecture for complex multi-layer linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632 - “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”. It has since been extended and is now used by a large number of projects annotating a variety of phenomena. Since complex linguistic phenomena, such as information structure, interact on many levels, ANNIS addresses the need to concurrently annotate, query and visualize data from varied areas such as syntax, semantics, morphology, prosody, referentiality, lexis and more. For projects working with spoken language, support for audio and video annotations is also required.

Hexatomic

Hexatomic is an extensible, OS-independent platform for deep multi-layer linguistic annotation of corpora. It is being developed with sustainability in mind, to support the reuse of research software rather than the development of new software for each new research project. Using Hexatomic, linguistic research projects can implement what they need on top of an existing platform that is highly compatible with other tools and pipelines. Hexatomic is funded by the Deutsche Forschungsgemeinschaft (DFG) under grant number 391160252. Development is based at the Department of English Studies (Friedrich Schiller University Jena) and the Department for German Studies and Linguistics (Humboldt-Universität zu Berlin).

Pepper

If you need to convert corpora from one linguistic format into another, Pepper is your Swiss Army knife. When your annotation tool produces a different data format from the one your analysis tool can read, Pepper comes to the rescue.

  • Pepper can convert documents in a variety of linguistic formats, such as EXMARaLDA, Tiger XML, MMAX2, RST, TCF, TreeTagger format, TEI (subset), ANNIS format, PAULA and many more.
  • Pepper comes with a plug-in mechanism that makes it easy to extend with further formats and data manipulations.
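
To illustrate the general idea of a plug-in mechanism for format conversion, here is a minimal Python sketch. It is not Pepper's API: the registries `IMPORTERS` and `EXPORTERS`, the decorators, the toy "tab-separated" format and the `convert` function are all hypothetical names chosen for this example. The sketch also assumes that every conversion passes through one shared in-memory representation (here, a plain list of token dictionaries).

```python
import json

# Hypothetical plug-in registries; illustrative only, not Pepper's API.
IMPORTERS = {}
EXPORTERS = {}


def importer(format_name):
    """Register a function that parses text in `format_name` into token dicts."""
    def register(func):
        IMPORTERS[format_name] = func
        return func
    return register


def exporter(format_name):
    """Register a function that serializes token dicts into `format_name`."""
    def register(func):
        EXPORTERS[format_name] = func
        return func
    return register


@importer("tab-separated")
def read_tsv(text):
    # Toy input format: one token per line, word<TAB>part-of-speech tag.
    tokens = []
    for line in text.splitlines():
        if line.strip():
            word, pos = line.split("\t")
            tokens.append({"text": word, "pos": pos})
    return tokens


@exporter("json")
def write_json(tokens):
    # Toy output format: the token list as a JSON array.
    return json.dumps(tokens)


def convert(text, source_format, target_format):
    """Convert by importing into the shared representation, then exporting."""
    tokens = IMPORTERS[source_format](text)
    return EXPORTERS[target_format](tokens)
```

In this sketch, supporting a further format only means registering one more importer or exporter function; `convert` itself does not change.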

Salt

With Salt we provide an easily understandable meta model for linguistic data as well as an open-source API to store, manipulate and represent data. Salt is an abstract model, poor in linguistic semantics. As a result, it is independent of any particular linguistic school or theory. The core model is graph-based, which keeps structural restrictions very low and allows for a wide range of linguistic annotations, such as syntactic, morphological, and coreferential annotations, and many more. You can even model your own very personal annotation as long as it fits into a graph structure (and so far we have not seen a linguistic annotation which does not). Furthermore, Salt does not depend on a specific linguistic tagset, which allows you to use any tagset you like.
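
To make the idea of a graph-based annotation model more concrete, here is a minimal Python sketch. It is not Salt's API: the classes `Node`, `Edge` and `AnnotationGraph`, the edge types "dominance" and "dependency", and the example tags are hypothetical and chosen only for illustration. The sketch assumes that annotations are free-form key-value pairs attached to nodes and edges, so no particular tagset or theory is built in.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node, e.g. a token or a phrase (hypothetical class)."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed edge between two nodes (hypothetical class)."""
    source: str
    target: str
    edge_type: str
    annotations: dict = field(default_factory=dict)


@dataclass
class AnnotationGraph:
    """A tiny annotation graph: nodes by id plus a list of edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, **annotations):
        node = Node(node_id, dict(annotations))
        self.nodes[node_id] = node
        return node

    def add_edge(self, source, target, edge_type, **annotations):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, edge_type, dict(annotations))
        self.edges.append(edge)
        return edge

    def children(self, node_id, edge_type=None):
        """Return target ids of edges leaving `node_id`, optionally by type."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id
            and (edge_type is None or e.edge_type == edge_type)
        ]


def build_example():
    """Two tokens carrying a constituency layer and a dependency layer."""
    g = AnnotationGraph()
    g.add_node("t1", text="Salt", pos="NN")
    g.add_node("t2", text="works", pos="VBZ")
    g.add_node("np", cat="NP")
    g.add_node("s", cat="S")
    g.add_edge("np", "t1", "dominance")
    g.add_edge("s", "np", "dominance")
    g.add_edge("s", "t2", "dominance")
    g.add_edge("t2", "t1", "dependency", func="nsubj")
    return g
```

Because both annotation layers live in the same graph and differ only in edge type, a call such as `build_example().children("s", "dominance")` follows the constituency layer alone, while the dependency edge between the two tokens stays untouched.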

When referring to the whole corpus-tools.org toolchain, please cite the following paper:

Druskat, Stephan & Gast, Volker & Krause, Thomas et al. (2016): corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). http://www.lrec-conf.org/proceedings/lrec2016/summaries/918.html

To cite a specific tool, use the references given on that tool's sub-page.