Salt

A powerful, tagset-independent and theory-neutral meta model and API for storing, manipulating, and representing nearly all types of linguistic data.

language-independent

Salt supports a wide range of languages and scripts. Any language that can be encoded in UTF-8 is supported by Salt.

theory-neutral

Salt is open to any linguistic school or theory, not limited to a specific one.

tagset-independent

Salt is not bound to a tagset. Annotations are represented as attribute-value pairs and can be chosen freely.

open source

Salt is licensed under the Apache License, Version 2.0 and published as a Java library on GitHub.

multimedia support

Salt is a text-based model, but also supports the modeling of audio and video corpora.

multi-layer

Salt is not limited to a specific set of annotation layers. Since Salt is a graph-based model, you can model many different structures, such as tree structures, span annotations, coreference chains and so on.

With Salt we provide an easily understandable meta model for linguistic data and an open source Java API for storing, manipulating and representing data. Salt is an abstract model, poor in linguistic semantics. As a result, it is independent of any linguistic school or theory. The core model is graph-based, which keeps the structural restrictions very low and allows for a wide range of linguistic annotations, such as syntactic, morphological and coreferential annotations, and many more. You can even model your own very personal annotation as long as it fits into a graph structure (and so far we have not seen a linguistic annotation which does not). Furthermore, Salt does not depend on a specific linguistic tagset, which allows you to use any tagset you like.

Salt serves as the underlying model for the tools ANNIS (a multi-layer corpus search and visualization tool), Pepper (which uses Salt to map from and to various annotation data formats) and Atomic (a prototype of a multi-layer annotation editor).

If you use Salt in your scientific work, please cite it as follows.

F. Zipser & L. Romary. 2010. A model oriented approach to the mapping of annotation formats using standards.
In Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC 2010. Malta. URL: http://hal.archives-ouvertes.fr/inria-00527799/en/

Linguistic corpora show a nearly unlimited set of different kinds of annotations to describe linguistic phenomena. That means that not only is the number of tagsets unbounded, there is also a huge set of different structures to describe these phenomena. For instance, part-of-speech tags or lemmas can be added to a token, and tokens can be grouped together to annotate them as one entity, such as names in named entity recognition. Furthermore, the annotation of constituents or of Rhetorical Structure Theory builds an entire tree-like structure above the tokens. In dialogue data we need to model multiple texts belonging to different speakers, as well as a link to an audio stream. Coreferences or anaphoric chains need to interlink tokens, which may be spread across the entire text.

Many tools and models (e.g., formats) address these annotations one by one: there are formats for syntactic data (like Tiger XML or the Penn Treebank format), for coreferences (like the MMAX2 format), for dialogue data (like the EXMARaLDA basic transcription or the ELAN format), and many more. But developments in recent years have shown that a lot of linguistic phenomena are spread over different kinds of annotations. So more and more corpora have been created which are annotated on multiple annotation layers, like the PCC or the TUEBA-D/Z corpus.

The aim of Salt is to consolidate all kinds of annotations within a single model. To do so, we need a powerful base structure which can cover all the different necessities at once. A very well-known and powerful structure in mathematics and computer science is the graph, which is widely used for modeling very different kinds of data. The graph structure has the further benefit that it helps to keep the model simple, with its small set of different model elements. Our graph structure is rather simple: it contains only four model elements: node, relation, label and layer.
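To make the four model elements more concrete, here is a minimal sketch in plain Java. It is not the Salt API: the class names Node, Relation, Label and Layer, the method buildExample and the sample labels ("cat", "pos", "func") are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the four model elements; NOT the Salt API. */
final class FourElementsSketch {

    /** A label is a plain attribute-value pair. */
    record Label(String name, String value) {}

    /** A node carries an id and any number of labels. */
    static final class Node {
        final String id;
        final List<Label> labels = new ArrayList<>();
        Node(String id) { this.id = id; }
    }

    /** A relation is a directed connection between two nodes and can be labeled too. */
    static final class Relation {
        final Node source;
        final Node target;
        final List<Label> labels = new ArrayList<>();
        Relation(Node source, Node target) {
            this.source = source;
            this.target = target;
        }
    }

    /** A layer groups nodes and relations that belong together. */
    static final class Layer {
        final String name;
        final List<Node> nodes = new ArrayList<>();
        final List<Relation> relations = new ArrayList<>();
        Layer(String name) { this.name = name; }
    }

    /** Builds a tiny "syntax" layer: a phrase node pointing to a word node. */
    static Layer buildExample() {
        Node phrase = new Node("n1");
        phrase.labels.add(new Label("cat", "NP"));
        Node word = new Node("n2");
        word.labels.add(new Label("pos", "NN"));

        Relation relation = new Relation(phrase, word);
        relation.labels.add(new Label("func", "head"));

        Layer syntax = new Layer("syntax");
        syntax.nodes.add(phrase);
        syntax.nodes.add(word);
        syntax.relations.add(relation);
        return syntax;
    }

    public static void main(String[] args) {
        Layer layer = buildExample();
        System.out.println(layer.name + ": " + layer.nodes.size() + " nodes, "
                + layer.relations.size() + " relation(s)");
    }
}
```

Because labels are free attribute-value pairs, nothing in this sketch depends on a particular tagset. Swapping "NP" or "NN" for other tags requires no change to the structure.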

Salt graph (meta-) model

Graphs are very flexible and abstract structures, but not very specialized to a linguistic purpose. So we need to abstract over linguistic data to map them into such a structure.

To give a simple explanation of what a graph is, let us forget linguistics for a moment and think about humans and their relationships. Imagine a set of humans, for instance your family or friends. In a graph, each of these humans is represented by one node. A relationship between exactly two humans is then defined as a relation. In other words, a relation (also called an edge) connects two nodes. Now, the relations between humans can be very different: for instance, the relation between a couple can be described as a love relation, whereas the relation between an employee and her/his boss could be described as a work relation. A relation can also have a direction: imagine, for instance, that a person and a car are modeled as nodes, linked by a relation with the semantics "drive". This way, a person can drive a car, but not the other way round. These examples show that relations between nodes can be very different, similar to human relations. To differentiate between the types of relations, they can be labeled. The same goes for nodes: they can also be labeled, for instance with the name of the human that the node represents.
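The person-and-car example can be sketched as a small labeled, directed graph. This is a toy illustration in plain Java, not the Salt API. The names "Alice" and "Bob", the class RelationshipGraphSketch and its methods relate, targets and example are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy labeled, directed graph of people and things; NOT the Salt API. */
final class RelationshipGraphSketch {

    /** A node labeled with a name, e.g. the person it represents. */
    record Node(String name) {}

    /** A directed relation from source to target, labeled with its type. */
    record Relation(Node source, Node target, String type) {}

    private final List<Relation> relations = new ArrayList<>();

    /** Adds a directed relation of the given type from source to target. */
    void relate(Node source, Node target, String type) {
        relations.add(new Relation(source, target, type));
    }

    /** Returns all nodes that the source reaches via relations of the given type. */
    List<Node> targets(Node source, String type) {
        List<Node> result = new ArrayList<>();
        for (Relation relation : relations) {
            if (relation.source().equals(source) && relation.type().equals(type)) {
                result.add(relation.target());
            }
        }
        return result;
    }

    /** Builds the example: an employee, her boss and a car (all names are made up). */
    static RelationshipGraphSketch example() {
        Node alice = new Node("Alice");
        Node bob = new Node("Bob");
        Node car = new Node("car");

        RelationshipGraphSketch graph = new RelationshipGraphSketch();
        graph.relate(alice, bob, "work");   // Alice works for Bob
        graph.relate(alice, car, "drive");  // Alice drives the car, not vice versa
        return graph;
    }

    public static void main(String[] args) {
        RelationshipGraphSketch graph = example();
        System.out.println("Alice drives: " + graph.targets(new Node("Alice"), "drive"));
        System.out.println("The car drives: " + graph.targets(new Node("car"), "drive"));
    }
}
```

Since the relation is stored with a source and a target, asking what the car drives yields an empty list, which mirrors the direction described above.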

Re-applied to linguistics, this means that if we can model humans and their relationships as a graph, we can also model linguistic artifacts as a graph. For example, we can model texts, tokens, etc. as nodes, linguistic categories as labels, and the connections between them as relations.

Let's look at some examples of how linguistic data can be modeled in a graph-based world. A fundamental concept in Salt is the token. A token in Salt is the smallest annotatable unit. For instance, a token bundles a range of characters into a word, a syllable, a sentence, etc.
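The idea of a token as a range of characters can be sketched as follows. This is an illustration in plain Java, not the Salt API: the types Text and Token, the method tokenize, the whitespace-based splitting and the sample sentence are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of tokens as character ranges over a text; NOT the Salt API. */
final class TokenSketch {

    /** The primary text that tokens refer to. */
    record Text(String content) {}

    /** A token covers the characters [start, end) of a text and carries annotations. */
    record Token(Text text, int start, int end, Map<String, String> annotations) {
        String coveredText() {
            return text.content().substring(start, end);
        }
    }

    /** Splits on whitespace (an assumption of this sketch) and records the offsets. */
    static List<Token> tokenize(Text text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\S+").matcher(text.content());
        while (matcher.find()) {
            tokens.add(new Token(text, matcher.start(), matcher.end(), new HashMap<>()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize(new Text("Salt models tokens"));
        // Annotations are free attribute-value pairs attached to the token.
        tokens.get(1).annotations().put("pos", "VERB");
        for (Token token : tokens) {
            System.out.println(token.start() + "-" + token.end() + " "
                    + token.coveredText() + " " + token.annotations());
        }
    }
}
```

The token stores only offsets into the text rather than a copy of the characters, so several tokenizations (words, syllables, sentences) can refer to the same text side by side.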

[Figure: KWIC sample and the corresponding Salt graph]
[Figure: span (grid) sample and the corresponding Salt graph]
[Figure: constituent sample and the corresponding Salt graph]
[Figure: dependency sample and the corresponding Salt graph]
[Figure: coreference sample and the corresponding Salt graph]
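The span, constituent and coreference samples listed above can all be expressed with the same two ingredients, nodes and relations. The following sketch shows this in plain Java. It is not the Salt API: the node kinds ("token", "span", "constituent"), the relation kinds ("spanning", "dominance", "pointing") and all ids are names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of spans, constituents and coreference as one graph; NOT the Salt API. */
final class StructuresSketch {

    record Node(String id, String kind) {}

    record Relation(Node source, Node target, String kind) {}

    /** Three tokens, one span, a small constituent tree and one coreference link. */
    static List<Relation> example() {
        Node t1 = new Node("t1", "token");
        Node t2 = new Node("t2", "token");
        Node t3 = new Node("t3", "token");
        Node entity = new Node("ne", "span");             // e.g. a named entity
        Node phrase = new Node("np", "constituent");
        Node sentence = new Node("sent", "constituent");

        List<Relation> relations = new ArrayList<>();
        relations.add(new Relation(entity, t1, "spanning"));
        relations.add(new Relation(entity, t2, "spanning"));
        relations.add(new Relation(phrase, t1, "dominance"));
        relations.add(new Relation(phrase, t2, "dominance"));
        relations.add(new Relation(sentence, phrase, "dominance"));
        relations.add(new Relation(sentence, t3, "dominance"));
        relations.add(new Relation(t3, entity, "pointing")); // coreference-style link
        return relations;
    }

    /** Collects the ids of all tokens below a node, ignoring pointing relations. */
    static List<String> coveredTokens(Node node, List<Relation> relations) {
        List<String> result = new ArrayList<>();
        if (node.kind().equals("token")) {
            result.add(node.id());
            return result;
        }
        for (Relation relation : relations) {
            if (relation.source().equals(node) && !relation.kind().equals("pointing")) {
                result.addAll(coveredTokens(relation.target(), relations));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Relation> relations = example();
        System.out.println(coveredTokens(new Node("sent", "constituent"), relations));
        System.out.println(coveredTokens(new Node("ne", "span"), relations));
    }
}
```

The span and the constituent tree live in the same graph and share the same tokens, which is what allows several annotation layers to coexist over one text.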

Salt is published in the Maven Central repository. To use Salt in a Java-based application via Maven, just add the following lines to the dependencies section of your pom.xml:

<dependency>
  <groupId>org.corpus-tools</groupId>
  <artifactId>salt-api</artifactId>
  <version>VERSION</version>
</dependency>

Replace VERSION with a version of your choice. Available versions can be found in the Maven Central repository.

A sample project to demonstrate the power of Salt can be downloaded from https://github.com/korpling/salt-demo.

The Salt source code can be downloaded from GitHub at https://github.com/korpling/salt.

Salt is published under the open source Apache License, Version 2.0. We want to enable everyone to use the software without restrictions, and also enable the community to take part in its development.

Found a bug or have a feature request?

Please let us know what you have found, or which ideas for enhancements you have. Open an issue on GitHub at https://github.com/korpling/salt or write us an e-mail at saltnpepper@lists.hu-berlin.de.

Want to contribute to the project?

We have published the Salt source code on GitHub at https://github.com/korpling/salt. If you are interested in contributing to the project, please feel free to fork or clone it. We are happy about any suggestions, bug reports, bug fixes, and so on. It would be nice if you kept us informed about your ideas and enhancements: please write us an e-mail at saltnpepper@lists.hu-berlin.de.