ANNIS Developer Guide

ANNIS is a web-based search and visualization architecture for multi-layer corpora. ANNIS consists of two major components: a backend service and a web front-end. There is also a local version, ANNIS Kickstarter, which is a simple starting point for new users who want to try out the system without installing a full server.

The service runs on a web server such as Tomcat or Jetty and communicates with a relational database, the open source database PostgreSQL. PostgreSQL (version 9.4 or later) must be installed for ANNIS to work. For more information on installing and managing the backend service, see the Administration Guide in the documentation.

The ANNIS front-end is a web application implemented in Java with the Vaadin framework and runs in a normal browser (we recommend Mozilla Firefox). The server running the web application communicates with the backend service via a REST interface.

ANNIS Kickstarter is a cross-platform local version which requires nothing but a PostgreSQL installation to run. It runs under Linux, Windows and Mac. For a quick tutorial to get started with Kickstarter, see the ANNIS User Guide.

ANNIS uses Maven 3 as its build tool. Maven itself is based on Java and should run on every major operating system. You have to download and install the appropriate version for your operating system from http://maven.apache.org/download.html before you can build ANNIS. Maven downloads all needed dependencies from central servers on the first build, so you need a working internet connection. The dependencies are cached locally once they are downloaded.

Once you have downloaded or checked out the ANNIS source code, the top-level directory is the parent project for all ANNIS sub-projects. To build every project that is part of ANNIS, execute:

cd <annis-sources>/
mvn install

This might take a while on the first execution. mvn clean will remove all compiled code if necessary.

If you only want to compile a sub-project, execute mvn install in the corresponding sub-directory. Every folder with a sub-project has a pom.xml file. These files configure the whole build process. The Maven documentation contains detailed explanations of the structure and possible content of the configuration files.

Some sub-projects don't provide a library but produce a zip or tar.gz file when they are compiled. These assembly steps (see the Maven Assembly documentation) are invoked automatically on mvn install.

While you can use any text editor to change ANNIS and compile it completely on the command line using Maven, a proper IDE will be a huge help. You can use any IDE that has good support for Maven. The ANNIS main developers currently recommend Eclipse Photon or the NetBeans 9 IDE for development.

Running the front-end in an embedded Jetty instance is very useful if you want to access ANNIS on your local machine as a single user.

With this method you don't need to install Jetty or Tomcat yourself.

cd <unzipped source>/annis-gui/
mvn jetty:run

Now you can access the site at http://localhost:8080/annis-gui/. The Jetty server can be stopped by pressing "CTRL-C".

The release process, including all necessary tests, might take several days and includes fixing bugs that are only discovered during release testing. Never add new features during the release process; use the separate "develop" branch for this purpose.

You must have mdBook installed to make a release. Otherwise the documentation can't be created.

Start the release process by executing mvn gitflow:release-start for a regular release (branched from develop) or mvn gitflow:hotfix-start for a hotfix (branched from master). The command will ask you for the new version number; use semantic versioning.
Add a new changelog entry. If some important information is missing, create an enhancement or bugfix issue in GitHub and repeat:
- Get the GitHub milestone ID associated with the release (it is visible in the URL when you view the issues of the release tracking milestone).
- Execute the script Misc/changelog.py <milestone-id>.
- Add the output to the beginning of the CHANGELOG file.
Update and commit the license information:

mvn license:add-third-party license:download-licenses

Build the complete project with tests.

mvn clean
mvn -DskipTests=true install
mvn test

Do manual tests. If you have to fix a bug, document it in the issue tracker, update the changelog, and start over at step 1. If no known bugs are left to fix, go to the next section.

Finish the release by executing either mvn gitflow:release-finish for regular releases or mvn gitflow:hotfix-finish for hotfixes.
Release the staging repository to Maven Central with the Nexus interface: https://oss.sonatype.org/
Create a new release on GitHub including the changelog. Upload the binaries from the Maven repository to the GitHub release as well.

The ANNIS import format is inspired by the Salt meta-data model, and ANNIS uses Salt internally to represent a matched graph from the database. However, ANNIS has some restrictions that Salt does not:

- node names must be unique per document
- document names must be unique per top-level corpus
- an ANNIS corpus contains only one top-level corpus
- there are no meta-data for nodes
- string identifiers such as annotation or layer names have a limited number of allowed characters and should match the regular expression [a-zA-Z_][a-zA-Z0-9_-]* in order to be searchable with AQL
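
The identifier restriction can be checked before an import. The following is a minimal sketch that applies the regular expression quoted above; the function name is hypothetical and not part of ANNIS.

```python
import re

# Pattern quoted in the restriction above; fullmatch() anchors it at both ends.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")


def is_searchable_identifier(name: str) -> bool:
    """Return True if the whole name matches the identifier pattern."""
    return IDENTIFIER.fullmatch(name) is not None


if __name__ == "__main__":
    # Illustrative candidates only; none of them come from a real corpus.
    for candidate in ["pos", "_lemma", "my-layer", "2nd", "part of speech"]:
        print(candidate, is_searchable_identifier(candidate))
```

Names that start with a digit or contain spaces are rejected by this check, so they would have to be renamed before they can be queried with AQL.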

An ANNIS corpus can be either a zip file or a directory which includes the following files:

`annis.version`

The first line is exactly "3.3"; the following lines can contain human-readable text.

Pure UTF-8 encoded text file

`corpus.annis`

Contains structural information about the corpus and its documents.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	integer	X	X	primary key
name	text	X	X	unique name (per corpus)
type	text		X	CORPUS, DOCUMENT
version	text			version number (currently not used)
pre	integer		X	pre order of the corpus tree
post	integer		X	post order of the corpus tree
top_level	boolean		X	true for the toplevel corpus
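
The pre and post columns can be derived from a depth-first traversal of the corpus tree. The sketch below assumes that both values are drawn from one shared counter; this guide does not state the exact numbering scheme, so treat that as an assumption. CorpusNode and assign_pre_post are hypothetical names, and the document name "doc2" is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorpusNode:
    name: str
    type: str  # "CORPUS" or "DOCUMENT"
    children: List["CorpusNode"] = field(default_factory=list)
    pre: int = -1
    post: int = -1


def assign_pre_post(root: CorpusNode) -> None:
    """Number the tree depth-first using one shared counter (assumption)."""
    counter = 0

    def visit(node: CorpusNode) -> None:
        nonlocal counter
        node.pre = counter  # assigned when the node is entered
        counter += 1
        for child in node.children:
            visit(child)
        node.post = counter  # assigned when the node is left
        counter += 1

    visit(root)


top = CorpusNode("pcc2", "CORPUS", [
    CorpusNode("11299", "DOCUMENT"),
    CorpusNode("doc2", "DOCUMENT"),  # hypothetical second document
])
assign_pre_post(top)
for node in [top, *top.children]:
    print(node.name, node.type, node.pre, node.post)
```

With this numbering, a corpus is an ancestor of a document exactly when its pre value is smaller and its post value is larger than those of the document.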

`corpus_annotation.annis`

Contains meta-data on the corpus and the documents.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	unique	description
corpus_ref	integer	X	foreign key to corpus.id
namespace	text
name	text
value	text

`text.annis`

Describes all texts that are included in the corpus.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
corpus_ref	integer	X	foreign key to corpus.id. The corpus id should be the id of the document in the corpus table
id	integer	X	restart from 0 for every corpus_ref
name	text		name of the text
text	text		content of the text

primary key: corpus_ref, id
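
Because the text column can contain tabs and line breaks, values have to be escaped when a row is written in the PostgreSQL text format. The sketch below escapes backslash, tab, newline and carriage return and writes missing values with a configurable NULL marker. PostgreSQL's default marker is \N; whether the ANNIS importer expects this marker is not stated here, so it is kept as a parameter. The function names and the sample row are hypothetical.

```python
from typing import Optional, Sequence

# Characters that must be backslash-escaped inside a column value.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def escape_field(value: Optional[str], null_marker: str = "\\N") -> str:
    """Escape one column value; None becomes the NULL marker."""
    if value is None:
        return null_marker
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def format_row(fields: Sequence[object], null_marker: str = "\\N") -> str:
    """Join the escaped column values with TAB characters."""
    return "\t".join(
        escape_field(None if f is None else str(f), null_marker) for f in fields
    )


if __name__ == "__main__":
    # corpus_ref, id, name, text (illustrative values)
    print(format_row([11, 0, "sText1", "Line one\nLine\ttwo"]))
```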

`node.annis`

Every node in the corpus will have exactly one entry in this table.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	bigint	X	X	primary key
text_ref	integer		X	foreign key to text.id
corpus_ref	integer		X	foreign key to corpus.id
layer	text
name	text			A human-readable identifier of the node. Must be unique for each document.
left	integer		X	position of first covered character
right	integer		X	position of the character after the last covered character
token_index	integer			index of this token (if it is a token, otherwise NULL)
left_token	integer		X	index of first covered token, for token, this value is the token_index
right_token	integer		X	index of last covered token, for token, this value is the token_index
seg_index	integer			index of this segment (if it is a segment, i.e. there is some SOrderingRelation connected to this node)
seg_name	text			name of the segment path this segment belongs to
span	text			for tokens or node with a segmentation index: substring of the covered original text
root	boolean		X	True if this node has no parents in all components
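
To illustrate the left, right, token_index and span columns for tokens, here is a minimal sketch that splits a text on whitespace. Real corpora are tokenized by other tools, so the tokenizer is only an assumption made for the example, and TokenRow and whitespace_tokens are hypothetical names. The right value is exclusive, following the description "position of the character after the last covered character".

```python
from typing import List, NamedTuple


class TokenRow(NamedTuple):
    token_index: int
    left: int   # position of the first covered character
    right: int  # position after the last covered character
    span: str   # covered substring of the original text


def whitespace_tokens(text: str) -> List[TokenRow]:
    """Compute character offsets for whitespace-separated tokens."""
    rows = []
    pos = 0
    for index, word in enumerate(text.split()):
        left = text.index(word, pos)  # search from the end of the previous token
        right = left + len(word)
        rows.append(TokenRow(index, left, right, text[left:right]))
        pos = right
    return rows


rows = whitespace_tokens("This is an example")
for row in rows:
    print(row)
```

For a token, left_token and right_token would both be set to its token_index, as described in the table above.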

`component.annis`

Lists the components (connected sub-graphs) of the graph.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	bigint	X	X	primary key
type	char(1)			edge type: c (coverage), d (dominance), p (pointing relation)
layer	text		X	Could be set to e.g. "default_layer" if not in any Salt layer
name	text			The sType of the component, e.g. anaphoric for some kind of pointing relation component

`rank.annis`

A rank entry describes one of the positions of a node in a component tree. There is one rank entry for each edge. Furthermore, every component has a virtual relation and thus an additional rank entry where the parent attribute is NULL and the level is 0.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	bigint	X	X	primary key
pre	integer		X	the pre-order of the target node. The root of the component tree should always have a pre-order of 0
post	integer		X	the post-order of the target node
node_ref	bigint		X	the node.id of the target node
component_ref	bigint		X
parent	bigint			id of the parent rank entry
level	integer			level of this rank entry (not node!) in the component tree

Rank entries with the type 'c' (coverage spans) must be omitted if the referenced node is a token and the parent coverage span is continuous. Continuous means that the range of covered tokens has no gaps, so it includes all tokens between the first and the last covered token. The idea is that, for a continuous span, the needed information can be recovered from the "left_token" and "right_token" of the span together with the "token_index" of the tokens (all in the node.annis table).
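
The continuity condition can be expressed as a small check on the token indexes a span covers. This is a sketch under the definition given above; is_continuous is a hypothetical helper, and treating an empty span as not continuous is an arbitrary choice made for the example.

```python
from typing import Iterable


def is_continuous(covered_token_indexes: Iterable[int]) -> bool:
    """Return True if the covered token indexes form a range without gaps."""
    indexes = set(covered_token_indexes)
    if not indexes:
        return False  # arbitrary choice: a span that covers nothing
    # A gap-free range has exactly (last - first + 1) distinct members.
    return len(indexes) == max(indexes) - min(indexes) + 1


if __name__ == "__main__":
    print(is_continuous([3, 4, 5]))  # tokens 3 to 5 without a gap
    print(is_continuous([3, 5]))     # token 4 is missing
```

Only for spans where this check fails do the coverage rank entries to the individual tokens need to be written.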

`node_annotation.annis`

Contains all annotations per node.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
node_ref	bigint	X	foreign key to node.id
namespace	text
name	text	X
value	text

unique(node_ref, namespace, name)

`edge_annotation.annis`

Contains all annotations per edge (which is represented by a rank entry).

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
rank_ref	bigint	X	foreign key to rank.id
namespace	text
name	text	X
value	text

`resolver_vis_map.annis`

Describes which visualizers to trigger depending on the namespace of a node or edge occurring in the search results.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
corpus	text		the name of the supercorpus
version	text		the version of the corpus
namespace	text		the several layers of the corpus
element	text		the type of the entry: "node" or "edge"
vis_type	text	X	the abstract type of visualization: "tree", "discourse", "grid", ...
display_name	text	X	the name of the layer which shall be shown for display
visibility	text		either "permanent", "visible", "hidden", "removed" or "preloaded", default is "hidden"
order	bigint		the order of the layers, in which they shall be shown
mappings	text

`ExtData` folder

Contains the media files that are connected with this corpus including their binary content.

Each file directly inside the ExtData folder belongs to the toplevel corpus. A sub-folder corresponds to a document and each file inside a sub-folder belongs to the document with the same name.
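
The ownership rule for files in the ExtData folder can be sketched as follows. owner_of is a hypothetical helper, the file names are made up, and deeper nesting is rejected because it is not described above.

```python
from pathlib import PurePosixPath
from typing import Tuple


def owner_of(relative_path: str, toplevel: str) -> Tuple[str, str]:
    """Return (owner name, file name) for a path relative to ExtData."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) == 1:
        # File directly inside ExtData: belongs to the toplevel corpus.
        return toplevel, parts[0]
    if len(parts) == 2:
        # File inside a sub-folder: belongs to the document of that name.
        return parts[0], parts[1]
    raise ValueError(f"unexpected nesting in ExtData: {relative_path}")


if __name__ == "__main__":
    print(owner_of("intro.mp3", "pcc2"))
    print(owner_of("11299/audio.ogg", "pcc2"))
```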

A query builder is a class that

- implements the QueryBuilderPlugin interface
- has the @PluginImplementation annotation

When implementing the interface you have to provide a short name, a caption and a callback function that creates new Vaadin components. You get an object of the type QueryController which you can use to set new queries from your component.

A query builder plugin must be either registered in the addCustomUIPlugins() function of the SearchUI (if the plugin is part of the annis-gui project) or must be added to a JAR-file that is located at one of the following locations:

- the plugins folder inside the deployed web application
- the path defined in the ANNIS_PLUGINS environment variable

Please also note that it is possible to configure a default query builder for an instance. Further information can be found in the User Guide.

The ANNIS service has a public REST API that can be accessed by third-party applications.

Some of the API calls are protected resources. HTTP user/password authentication is used in this case. The users of the REST service can be configured as described in the User Guide. Whenever a user is not permitted to perform a certain action, a 403 Forbidden HTTP response is sent. For some actions (like "list all corpora") only the resources that are available to the user are shown.

You can omit any authentication data. In this case you have the same rights as the "anonymous" user.
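
A client could attach the credentials as sketched below. The sketch assumes that "HTTP user/password authentication" means the HTTP Basic scheme, which is not stated explicitly here; the host name, user name and password are placeholders, and build_request is a hypothetical helper. The request is only constructed, not sent.

```python
import base64
import urllib.request
from typing import Optional


def build_request(url: str, user: Optional[str] = None,
                  password: Optional[str] = None) -> urllib.request.Request:
    """Create a request, adding a Basic Authorization header if credentials are given."""
    request = urllib.request.Request(url)
    if user is not None and password is not None:
        token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    # Without credentials the service treats the caller as the "anonymous" user.
    return request


if __name__ == "__main__":
    req = build_request("http://example.org/annis/query/search/count", "alice", "secret")
    print(req.get_header("Authorization"))
```

When such a request is sent with urllib.request.urlopen, a 403 response surfaces as urllib.error.HTTPError with code 403.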

The following APIs are currently available:

- Corpus queries
- Administrative tasks

ANNIS uses the semantic versioning scheme for its REST API. Minor version updates of ANNIS will be backwards-compatible with the APIs described in this documentation. There might be further unofficial API calls that can change without notice.

Interface defining the REST API calls that ANNIS provides for querying the data.

All paths for this part of the service start with the "annis/query/" prefix.

GET annis/query/search/count

q - The query in the ANNIS Query Language (AQL)
corpora - A comma separated list of corpus names

Produces an XML representation of the total matches and the number of documents that contain matches (application/xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<matchAndDocumentCount>
  <!-- the number of documents that contain matches -->
  <documentCount>2</documentCount>
  <!-- total number of matches -->
  <matchCount>399</matchCount>
</matchAndDocumentCount>
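
As an illustration of this call, the sketch below builds the request URL and parses a response in the format shown above. The base URL and the AQL query are placeholders, and count_url and parse_count are hypothetical helper names; no request is actually sent.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def count_url(base_url: str, aql: str, corpora: Iterable[str]) -> str:
    """Build the URL for the count call with URL-encoded parameters."""
    query = urllib.parse.urlencode({"q": aql, "corpora": ",".join(corpora)})
    return f"{base_url.rstrip('/')}/annis/query/search/count?{query}"


def parse_count(xml_bytes: bytes) -> Tuple[int, int]:
    """Return (matchCount, documentCount) from the XML response body."""
    root = ET.fromstring(xml_bytes)
    return int(root.findtext("matchCount")), int(root.findtext("documentCount"))


# Response body copied from the example above (comments removed).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<matchAndDocumentCount>
  <documentCount>2</documentCount>
  <matchCount>399</matchCount>
</matchAndDocumentCount>"""

if __name__ == "__main__":
    print(count_url("http://example.org/", 'pos="NN"', ["pcc2"]))
    print(parse_count(SAMPLE))
```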

GET annis/query/search/find

q - The query in the ANNIS Query Language (AQL)
corpora - A comma separated list of corpus names
offset - Optional offset from where to start the matches. Default is 0.
limit - Optional limit of the number of returned matches. Set to -1 if unlimited. Default is -1.
order - Optional order in which the results should be sorted. Can be either "normal", "random" or "inverted". "normal" is the default ordering, "inverted" reverses the default ordering and "random" is a non-stable random ordering (thus you will get different results for the same offset and limit).

A list of the match identifiers for the query.

Can produce the MIME type application/xml in the following format

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<match-group>
  <!-- each match is enclosed in a match tag -->
  <match>
    <!-- the first matched node of match 1 did not match an annotation -->
    <anno></anno>
    <!-- the second matched node of match 1 was a match on the 'tiger::pos' annotation -->
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_1</id>
    <!-- ID of second matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_2</id>
  </match>
  <match>
    <anno></anno>
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_2</id>
    <!-- ID of second matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_3</id>
  </match>
  <!-- and so on -->
</match-group>

or the MIME type text/plain

salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2
salt:/pcc2/11299/#tok_2 tiger::pos::salt:/pcc2/11299/#tok_3
salt:/pcc2/11299/#tok_3 tiger::pos::salt:/pcc2/11299/#tok_4

In this format, there is one line per match and each ID is separated by space. An ID can be prefixed by the fully qualified annotation name (which is separated with '::' from the ID).
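
A line in the text/plain format can be split back into node IDs and annotation names. The sketch assumes that node IDs contain neither spaces nor the '::' separator, which holds for the examples above but is not guaranteed by this guide; MatchedNode and parse_match_line are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class MatchedNode(NamedTuple):
    node_id: str
    annotation: Optional[str]  # fully qualified annotation name, if any


def parse_match_line(line: str) -> List[MatchedNode]:
    """Parse one match line into its matched nodes."""
    nodes = []
    for entry in line.split():
        # The last '::' separates the optional annotation name from the ID.
        annotation, separator, node_id = entry.rpartition("::")
        nodes.append(MatchedNode(node_id, annotation if separator else None))
    return nodes


if __name__ == "__main__":
    line = "salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2"
    for node in parse_match_line(line):
        print(node)
```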

POST annis/query/search/subgraph

Request body

Consumes application/xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<match-group>
  <!-- each match is enclosed in a match tag -->
  <match>
    <!-- the first matched node of match 1 did not match an annotation -->
    <anno></anno>
    <!-- the second matched node of match 1 was a match on the 'tiger::pos' annotation -->
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_1</id>
    <!-- ID of second matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_2</id>
  </match>
  <match>
    <anno></anno>
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_2</id>
    <!-- ID of second matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_3</id>
  </match>
  <!-- and so on -->
</match-group>

or consumes text/plain:

salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2
salt:/pcc2/11299/#tok_2 tiger::pos::salt:/pcc2/11299/#tok_3
salt:/pcc2/11299/#tok_3 tiger::pos::salt:/pcc2/11299/#tok_4

One line per match, each ID is separated by space. An ID can be prepended by the fully qualified annotation name (which is separated with '::' from the ID).

segmentation - Optional parameter for segmentation layer on which the context is applied. Leave empty for token layer (which is default).
left - Optional parameter for the left context size, default is 0.
right - Optional parameter for the right context size, default is 0.
filter - Optional parameter with value "all" or "token". If "token" only token will be fetched. Default is "all".

Returns a representation of the Salt annotation graph in the EMF XMI format and with MIME type application/xml or application/xmi+xml.
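
A subgraph request with a text/plain body could be assembled as in the following sketch. It assumes that left and right are passed as URL query parameters, which this guide does not spell out; the base URL is a placeholder and subgraph_request is a hypothetical helper. The request object is only built, not sent.

```python
import urllib.parse
import urllib.request
from typing import Iterable


def subgraph_request(base_url: str, match_lines: Iterable[str],
                     left: int = 0, right: int = 0) -> urllib.request.Request:
    """Build a POST request whose body contains one match per line."""
    params = urllib.parse.urlencode({"left": left, "right": right})
    body = "\n".join(match_lines).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/annis/query/search/subgraph?{params}",
        data=body,
        headers={"Content-Type": "text/plain", "Accept": "application/xml"},
        method="POST",
    )


if __name__ == "__main__":
    req = subgraph_request(
        "http://example.org/",
        ["salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2"],
        left=5,
        right=5,
    )
    print(req.get_method(), req.full_url)
```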

GET annis/query/graph/{top}/{doc}

{top} is the toplevel corpus name of the document and {doc} the document name.

filternodeanno - A comma separated list of node annotations which are used as a filter for the graph. Only nodes having one of the annotations are included in the result.

Returns a representation of the Salt annotation graph in the EMF XMI format and with MIME type application/xml or application/xmi+xml.

Get the content of a binary object for a specific document

GET annis/query/corpora/{top}/{document}/binary
GET annis/query/corpora/{top}/{document}/binary/{offset}/{length}
GET annis/query/corpora/{top}/{document}/binary/{file}
GET annis/query/corpora/{top}/{document}/binary/{file}/{offset}/{length}

Accepts any MIME type. The MIME type is used as implicit argument to filter the files that match a given query.

There are several ways of selecting the binary data you want to receive. You can select the file only by document, using the {top} and {document} arguments (paths 1 and 2). This will return the first file that also matches the requested accepted MIME types. Alternatively, the name of the file itself can be given as path argument {file} (paths 3 and 4). You can also choose to either get the complete file (paths 1 and 3) or chunks containing only a subset of the binary data (paths 2 and 4). In the latter case, you can specify the {offset} and the {length} of the chunk (both in bytes).

{top} - The toplevel corpus name.
{document} - The name of the document that has the file. If you want the files for the toplevel corpus itself, use the name of the toplevel corpus as document name.
{file} - File name/title to select.
{offset} - Defines the offset at which the binary chunk starts (in bytes).
{length} - Defines the length of the binary chunk (in bytes).

A binary stream that contains the file content. If path variant 2 or 4 is used, only a subset of the file is returned. Path variants 1 and 3 always return the complete file.
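
The four path variants can be generated by one helper, as in this sketch. binary_path is a hypothetical name, the file name is made up, and percent-encoding each path segment is an assumption about what the service expects.

```python
import urllib.parse
from typing import Optional


def binary_path(top: str, document: str, file: Optional[str] = None,
                offset: Optional[int] = None, length: Optional[int] = None) -> str:
    """Build one of the four binary paths, relative to the service root."""
    if (offset is None) != (length is None):
        raise ValueError("offset and length must be given together")
    segments = ["annis", "query", "corpora", top, document, "binary"]
    if file is not None:
        segments.append(file)  # path variants 3 and 4
    if offset is not None:
        segments += [str(offset), str(length)]  # path variants 2 and 4
    return "/".join(urllib.parse.quote(s, safe="") for s in segments)


if __name__ == "__main__":
    print(binary_path("pcc2", "11299"))
    print(binary_path("pcc2", "11299", offset=0, length=1024))
    print(binary_path("pcc2", "11299", file="audio.ogg"))
    print(binary_path("pcc2", "11299", file="audio.ogg", offset=0, length=1024))
```

The MIME type filter is not part of the path; it would be sent in the Accept header of the request.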

Interface defining the REST API calls that ANNIS provides for administrative tasks.

Currently it is possible to import corpora, monitor the import status, and manage user accounts with this interface. All paths for this part of the service start with "annis/admin/".

POST annis/admin/import

Request body

A ZIP file which contains one or more corpora in separate sub-folders. Consumes MIME type application/zip.

overwrite - Set to "true" if an existing corpus should be overwritten.
statusMail - An e-mail address to which status reports are sent.
alias - An internal alias name of the corpus which can be used instead of the actual corpus name when referring to it in the URL. Corpora can share the same alias.

Code 202: Import has been accepted and its status can be queried by the URL given in the Location header.

Code 400: Bad request, e.g. if the corpus already exists and the overwrite parameter was not set.

GET annis/admin/import/status/

The response lists all currently running import jobs. It has the MIME type application/xml and the following format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJobs>
  <!-- an importJob tag for each running import -->
  <importJob>
    <!-- visible caption, e.g. corpus name  -->
    <caption>MyNewCorpus</caption>
    <!-- A list of output messages from the import process-->
    <messages>
      <m>first message</m>
      <m>second message</m>
      <m>just another message</m>
    </messages>
    <!-- true if the corpus will be overwritten -->
    <overwrite>true</overwrite>
    <!-- current status, can be WAITING, RUNNING, SUCCESS or ERROR -->
    <status>RUNNING</status>
    <!-- a unique identifier for this import job -->
    <uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
    <!-- a mail address to which status reports should be sent -->
    <statusMail>mail@example.com</statusMail>
    <!-- alias name of the corpus as defined by the import request -->
    <alias>CorpusAlias</alias>
  </importJob>
</importJobs>

The root element has the name importJobs and there is an importJob element for each element of the list.
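
A client could turn this response into simple objects as sketched below. ImportJob and parse_import_jobs are hypothetical names, and the sample is a shortened copy of the response shown above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class ImportJob:
    caption: str
    status: str
    uuid: str
    overwrite: bool
    messages: List[str]


def parse_import_jobs(xml_bytes: bytes) -> List[ImportJob]:
    """Parse the importJobs document into a list of ImportJob objects."""
    root = ET.fromstring(xml_bytes)
    return [
        ImportJob(
            caption=job.findtext("caption", default=""),
            status=job.findtext("status", default=""),
            uuid=job.findtext("uuid", default=""),
            overwrite=job.findtext("overwrite", default="false") == "true",
            messages=[m.text or "" for m in job.findall("messages/m")],
        )
        for job in root.findall("importJob")
    ]


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJobs>
  <importJob>
    <caption>MyNewCorpus</caption>
    <messages>
      <m>first message</m>
      <m>second message</m>
    </messages>
    <overwrite>true</overwrite>
    <status>RUNNING</status>
    <uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
  </importJob>
</importJobs>"""

if __name__ == "__main__":
    for job in parse_import_jobs(SAMPLE):
        print(job.caption, job.status, len(job.messages))
```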

GET annis/admin/import/status/finished/{uuid}

The {uuid} is the unique identifier of the import job, which was returned by the import function.

If the import finished, a 200 HTTP status code is sent and a proper description of the import job is returned. After this resource has been successfully accessed once, a 404 HTTP status code will be sent on subsequent requests.

The response has the MIME type application/xml and the following format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJob>
  <!-- visible caption, e.g. corpus name  -->
  <caption>MyNewCorpus</caption>
  <!-- A list of output messages from the import process-->
  <messages>
    <m>first message</m>
    <m>second message</m>
    <m>just another message</m>
  </messages>
  <!-- true if the corpus will be overwritten -->
  <overwrite>true</overwrite>
  <!-- current status, can be WAITING, RUNNING, SUCCESS or ERROR -->
  <status>RUNNING</status>
  <!-- a unique identifier for this import job -->
  <uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
  <!-- a mail address to which status reports should be sent -->
  <statusMail>mail@example.com</statusMail>
  <!-- alias name of the corpus as defined by the import request -->
  <alias>CorpusAlias</alias>
</importJob>

When the import is not finished yet, or if it was finished and its status has already been queried, a 404 HTTP status code is sent.
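
Because the finished status is delivered only once and a 404 can mean either "not finished yet" or "already queried", a client has to keep the first 200 response. The following polling loop is a sketch of that idea; wait_for_import is a hypothetical helper, and the HTTP call is injected as a function so that the loop stays independent of any particular HTTP library.

```python
import time
from typing import Callable, Optional, Tuple

# A fetcher takes the job UUID and returns (HTTP status code, response body).
Fetcher = Callable[[str], Tuple[int, bytes]]


def wait_for_import(uuid: str, fetch: Fetcher, attempts: int = 10,
                    delay_seconds: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[bytes]:
    """Poll the finished-status resource; return the body of the first 200 response."""
    for attempt in range(attempts):
        status, body = fetch(uuid)
        if status == 200:
            return body  # keep it: later requests will answer 404
        if status != 404:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None  # still unfinished after all attempts, or already queried elsewhere
```

In a real client, fetch would issue GET annis/admin/import/status/finished/{uuid} and translate the HTTP response into the (status, body) tuple.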

GET annis/admin/users/{userName}
PUT annis/admin/users/{userName}

Gets information about an existing user with the name {userName} (1) or updates/creates a new user with this name (2).

Request body

The PUT action accepts the user information in XML (MIME type application/xml). The fields correspond to the fields of the single user configuration file. Please have a look at the general user configuration information in the ANNIS User Guide for a more detailed explanation.

<user>
  <!-- User name (must be the same as the "userName" parameter) -->
  <name>myusername</name>
  <!-- hashed password in the Shiro1CryptFormat -->
  <passwordHash>$shiro1$SHA-256$1$tQNwU[...]</passwordHash>
  <!-- A list of groups the user should belong to. -->
  <group>group1</group>
  <group>group2</group>
  <group>group3</group>
  <!-- A list of explicit permissions the user should have. -->
  <permission>admin:*</permission>
  <permission>query:*</permission>
  <!-- Optional expiration date encoded in the ISO-8601 standard -->
  <expires>2015-02-12T00:00:00.000+01:00</expires>
</user>
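
The request body for the PUT action could be generated as in the sketch below. user_xml is a hypothetical helper, and the password hash is passed in as an opaque string because computing a hash in the Shiro1CryptFormat is outside the scope of this example.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional


def user_xml(name: str, password_hash: str, groups: Iterable[str] = (),
             permissions: Iterable[str] = (), expires: Optional[str] = None) -> bytes:
    """Serialize a user description in the XML layout shown above."""
    user = ET.Element("user")
    ET.SubElement(user, "name").text = name
    ET.SubElement(user, "passwordHash").text = password_hash
    for group in groups:
        ET.SubElement(user, "group").text = group
    for permission in permissions:
        ET.SubElement(user, "permission").text = permission
    if expires is not None:
        ET.SubElement(user, "expires").text = expires  # ISO-8601 date string
    return ET.tostring(user, encoding="utf-8")


if __name__ == "__main__":
    body = user_xml(
        "myusername",
        "SHIRO1_HASH_PLACEHOLDER",  # placeholder, not a real hash
        groups=["group1", "group2"],
        permissions=["query:*"],
        expires="2015-02-12T00:00:00.000+01:00",
    )
    print(body.decode("utf-8"))
```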

If the GET variant is used and the queried user exists, the user information is returned with the MIME type application/xml. The fields correspond to the fields of the single user configuration file. Please have a look at the general user configuration information in the ANNIS User Guide for a more detailed explanation.

<user>
  <!-- User name (must be the same as the "userName" parameter) -->
  <name>myusername</name>
  <!-- hashed password in the Shiro1CryptFormat -->
  <passwordHash>$shiro1$SHA-256$1$tQNwU[...]</passwordHash>
  <!-- A list of groups the user should belong to. -->
  <group>group1</group>
  <group>group2</group>
  <group>group3</group>
  <!-- A list of explicit permissions the user should have. -->
  <permission>admin:*</permission>
  <permission>query:*</permission>
  <!-- Optional expiration date encoded in the ISO-8601 standard -->
  <expires>2015-02-12T00:00:00.000+01:00</expires>
</user>

Architecture

Backend Service

Web Front-end

ANNIS Kickstarter

Building

Using an IDE

Running an embedded Jetty instance for local access

Making a new ANNIS release

Introduction

Release Process

Initialization phase

Testing cycle

Finish phase

ANNIS import format version 3.3

annis.version

corpus.annis

corpus_annotation.annis

text.annis

node.annis

component.annis

rank.annis

node_annotation.annis

edge_annotation.annis

resolver_vis_map.annis

ExtData folder

Create new query builder

Public REST API

Authentication

Available APIs

Corpus queries

Count matches of a query

Path(s)

Parameters

Responses

Code 200

Find matches for a given query

Path(s)

Parameters

Responses

Code 200

Get a subgraph from a set of (matched) Salt IDs

Path(s)

Request body

Parameters

Responses

Code 200

Get the annotation graph of a complete document

Path(s)

Parameters

Responses

Code 200

Get the content of a binary object for a specific document

Path(s)

Responses

Code 200

Administrative tasks

Import one or more corpora

Path(s)

Request body

Parameters

Responses

Code 202

Code 400

Status of all running import jobs

Path(s)

Responses

Code 200

Show import job information after it was finished

Path(s)

Responses

Code 200

Code 404

User management

Path(s)

Request body

Responses

Code 200
