corpus-tools.org

With corpus-tools.org we provide an infrastructure to annotate, migrate, and analyze linguistic data.

ANNIS

ANNIS is an open source, cross-platform (Linux, Mac, Windows), browser-based search and visualization architecture for complex multi-layer linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632 - “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”. It has since been extended and is now used by a large number of projects annotating a variety of phenomena. Since complex linguistic phenomena, such as information structure, interact on many levels, ANNIS addresses the need to concurrently annotate, query and visualize data from varied areas such as syntax, semantics, morphology, prosody, referentiality, lexis and more. Projects working with spoken language additionally require support for audio and video annotations.

Atomic

Atomic is an open source multi-layer corpus annotation tool – and platform – for the desktop. It runs on all major operating systems. Atomic is easily extensible through its plugin system, and supports a multitude of different linguistic formats.

Atomic was originally developed within the LinkType research project at the universities of Jena and Zurich. It works on Salt data models and is not limited to any specific annotation type, which makes it a true multi-layer annotation tool.

Specific annotation types or workflows may demand specific tooling, and Atomic provides an extensible infrastructure to include such tools, e.g., specific editors, data views, or NLP components.

Pepper

If you need to convert corpora from one linguistic format into another, Pepper is your Swiss Army knife. When your annotation tool produces a data format that your analysis tool cannot read, Pepper comes to the rescue.

  • Pepper can convert documents in a variety of linguistic formats, such as EXMARaLDA, Tiger XML, MMAX2, RST, TCF, TreeTagger format, TEI (subset), ANNIS format, PAULA and many more.
  • Pepper comes with a plug-in mechanism which makes it easy to extend it for further formats and data manipulations.
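
To illustrate the general idea of a plug-in based converter, here is a minimal sketch in Python. It is not Pepper's actual API: the registry, the decorator names, and the two toy formats ("toy-tsv" and "toy-inline") are assumptions made up for this example. It only shows how importers and exporters registered under a format name could be combined into a conversion.

```python
# Illustrative sketch only; none of these names belong to Pepper.
IMPORTERS = {}  # format name -> function parsing text into (token, tag) pairs
EXPORTERS = {}  # format name -> function serializing (token, tag) pairs


def importer(fmt):
    """Register a function as the importer for a (hypothetical) format."""
    def register(func):
        IMPORTERS[fmt] = func
        return func
    return register


def exporter(fmt):
    """Register a function as the exporter for a (hypothetical) format."""
    def register(func):
        EXPORTERS[fmt] = func
        return func
    return register


@importer("toy-tsv")
def import_tsv(text):
    # One "token<TAB>tag" entry per line; blank lines are skipped.
    return [tuple(line.split("\t", 1)) for line in text.splitlines() if line]


@exporter("toy-inline")
def export_inline(tokens):
    # Serialize as "token/tag" items separated by spaces.
    return " ".join(f"{tok}/{tag}" for tok, tag in tokens)


def convert(text, source_format, target_format):
    """Convert text by chaining a registered importer and exporter."""
    if source_format not in IMPORTERS:
        raise ValueError(f"no importer registered for {source_format!r}")
    if target_format not in EXPORTERS:
        raise ValueError(f"no exporter registered for {target_format!r}")
    return EXPORTERS[target_format](IMPORTERS[source_format](text))


print(convert("She\tPRON\nsleeps\tVERB\n", "toy-tsv", "toy-inline"))
```

In this sketch, supporting a further format means registering one more importer or exporter function; the convert function itself stays unchanged.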

Salt

With Salt we provide an easily understandable meta model for linguistic data as well as an open source API to store, manipulate and represent data. Salt is an abstract model, poor in linguistic semantics. As a result, it is independent of any linguistic school or theory. The core model is graph-based, which keeps structural restrictions very low and allows for a wide range of linguistic annotations, such as syntactic, morphological, and coreferential annotations, and many more. You can even model your own very personal annotation as long as it fits into a graph structure (and so far we have not seen a linguistic annotation which does not). Furthermore, Salt does not depend on a specific linguistic tagset, so you can use any tagset you like.
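
The following Python sketch shows what a graph-based, tagset-independent annotation model can look like in principle. It is not the Salt API: all class names, method names, edge types and tags here are assumptions chosen for this illustration. Nodes and edges carry free-form key/value annotations, so syntactic and coreferential annotations can share one graph.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node with free-form key/value annotations (no fixed tagset)."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed edge between two nodes, with its own annotations."""
    source: str
    target: str
    edge_type: str
    annotations: dict = field(default_factory=dict)


class AnnotationGraph:
    """Hypothetical container; not part of Salt."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, **annotations):
        self.nodes[node_id] = Node(node_id, dict(annotations))
        return self.nodes[node_id]

    def add_edge(self, source, target, edge_type, **annotations):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, edge_type, dict(annotations))
        self.edges.append(edge)
        return edge

    def targets(self, source, edge_type):
        """Return the ids of nodes reached from source via edges of one type."""
        return [e.target for e in self.edges
                if e.source == source and e.edge_type == edge_type]


def build_example():
    g = AnnotationGraph()
    # Tokens with morphological annotations (the tag names are arbitrary).
    g.add_node("tok1", text="Anna", pos="PROPN")
    g.add_node("tok2", text="She", pos="PRON")
    g.add_node("tok3", text="sleeps", pos="VERB")
    # A syntactic node dominating two tokens.
    g.add_node("s1", cat="S")
    g.add_edge("s1", "tok2", "dominance", func="subj")
    g.add_edge("s1", "tok3", "dominance", func="head")
    # A coreference link in the same graph, on a different layer.
    g.add_edge("tok2", "tok1", "coreference")
    return g


graph = build_example()
print(graph.targets("s1", "dominance"))
```

Because annotations are plain key/value pairs and edge types are just labels, nothing in this sketch depends on a particular linguistic theory or tagset.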

When referring to the whole corpus-tools.org toolchain, please cite the following paper:

Druskat, Stephan & Gast, Volker & Krause, Thomas et al. (2016): corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). http://www.lrec-conf.org/proceedings/lrec2016/summaries/918.html

To cite a specific tool, use the references given on the sub-page of that tool.