Architecture

ANNIS is a web-based search and visualization architecture for multi-layer corpora. ANNIS consists of two major components: a backend service and a web front-end. There is also a local version, ANNIS Kickstarter, which is a simple starting point for new users who want to try out the system without installing a full server.

Backend Service

The service runs on a web server such as Tomcat or Jetty and communicates with a relational database, the open source database PostgreSQL. PostgreSQL (version 9.4 or later) must be installed for ANNIS to work. For more information on installing and managing the backend service, see the Administration Guide in the documentation.

Web Front-end

The ANNIS front-end is a web application implemented in Java with the Vaadin framework. It runs in a normal browser (we recommend Mozilla Firefox). The server running the web application communicates with the backend service via a REST interface.

ANNIS Kickstarter

ANNIS Kickstarter is a cross-platform local version which requires nothing but a PostgreSQL installation to run. It runs on Linux, Windows and Mac. For a quick tutorial to get started with Kickstarter, see the ANNIS User Guide.

Building

ANNIS uses Maven 3 as its build tool. Maven itself is based on Java and should run on every major operating system. Before you can build ANNIS, download and install the appropriate version for your operating system from http://maven.apache.org/download.html. Maven downloads all needed dependencies from central servers on the first build, so you will need a working internet connection. The dependencies are cached locally once they are downloaded.

Once you have downloaded or checked out the ANNIS source, the top-level directory of the source code is the parent project for all ANNIS sub-projects. If you want to build every project that is part of ANNIS, just execute

cd <annis-sources>/
mvn install

This might take a while on the first execution. mvn clean will remove all compiled code if necessary.

If you only want to compile a sub-project, execute mvn install in the corresponding sub-directory. Every folder with a sub-project has a pom.xml file. These files configure the whole build process. The Maven documentation contains detailed explanations of the structure and possible content of the configuration files.

Some sub-projects don't provide a library but produce a zip or tar/gz file when they are compiled. These assembly steps (see Maven Assembly documentation) are automatically invoked on mvn install.

Using an IDE

While you can use any text editor of your choice to change ANNIS and compile it completely on the command line using Maven, a proper IDE will be a huge help. You can use any IDE which has good support for Maven. The ANNIS main developers currently recommend Eclipse Photon or the NetBeans 9 IDE for development.

Running an embedded Jetty instance for local access

This way of running the front-end is very useful if you want to access ANNIS on your local machine as a single user.

With this method you don't need to install Jetty or Tomcat yourself.

cd <unzipped source>/annis-gui/
mvn jetty:run

Now you can access the site under http://localhost:8080/annis-gui/. The Jetty server can be stopped by pressing "CTRL-C".

Making a new ANNIS release

Introduction

The release process, including all necessary tests, might take several days and includes fixing bugs that are only discovered during release testing. Never add new features during the release process; use the separate "develop" branch for this purpose.

You must have mdBook installed to make a release. Otherwise the documentation can't be created.

Release Process

Initialization phase

  1. Start the release process by executing mvn gitflow:release-start for a regular release (branched from develop) or mvn gitflow:hotfix-start for a hotfix that is branched from master. The command will ask you for the new version number; use semantic versioning.
  2. Add a new changelog entry. If some important information is missing, create an enhancement or bugfix issue in GitHub and repeat:
    • Get the GitHub milestone id associated with the release (it is visible in the URL if you view the issues of the release tracking milestone).
    • Execute the script Misc/changelog.py <milestone-id>
    • Add the output to the beginning of the CHANGELOG file
  3. Update and commit license information
mvn license:add-third-party license:download-licenses

Testing cycle

  1. Build the complete project with tests.
mvn clean
mvn -DskipTests=true install
mvn test
  2. Do manual tests. If you have to fix any bug, document it in the issue tracker, update the changelog and start over at step 1. If no known bugs are left to fix, go to the next section.

Finish phase

  1. Finish the release by executing either mvn gitflow:release-finish for regular releases or mvn gitflow:hotfix-finish for hotfixes.
  2. Release the staging repository to Maven Central with the Nexus interface: https://oss.sonatype.org/
  3. Create a new release on GitHub including the changelog. Upload the binaries from the Maven repository to the GitHub release as well.

ANNIS import format version 3.3

The ANNIS import format is inspired by the Salt meta-data model, and ANNIS uses Salt internally to represent a matched graph from the database. However, ANNIS has some restrictions that Salt does not have.

  • node names must be unique per document
  • document names must be unique per top-level corpus
  • an ANNIS corpus contains only one top-level corpus
  • there are no meta-data for nodes
  • string identifiers such as annotation or layer names have a limited number of allowed characters and should match the regular expression [a-zA-Z_][a-zA-Z0-9_-]* in order to be searchable with AQL
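
The naming restriction in the last item can be checked before a corpus is imported. The following Python sketch assumes that the pattern has to match the whole identifier; the helper name is_searchable_name is hypothetical and not part of ANNIS.

```python
import re

# Pattern quoted from the restriction list above.
NAME_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")


def is_searchable_name(name: str) -> bool:
    """Return True if the whole identifier matches the pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_searchable_name("pos"))         # True
print(is_searchable_name("my-layer_2"))  # True
print(is_searchable_name("2nd_layer"))   # False, starts with a digit
```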

An ANNIS corpus can be either a zip file or a directory which includes the following files

annis.version

First line is exactly "3.3", the next lines can contain human readable text.

Pure UTF-8 encoded text file

corpus.annis

Contains structural information about the corpus and its documents.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
id integer X X primary key
name text X X unique name (per corpus)
type text X CORPUS, DOCUMENT
version text version number (currently not used)
pre integer X pre order of the corpus tree
post integer X post order of the corpus tree
top_level boolean X true for the toplevel corpus
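
To illustrate the TAB-separated text format, here is a minimal Python sketch that decodes one line of such a file. It relies on the PostgreSQL text format conventions (TAB as delimiter, \N for NULL, backslash escapes) but only handles the single-character escapes; octal and hexadecimal escapes are left out. The function names and the sample row are hypothetical.

```python
from typing import List, Optional

# Single-character backslash escapes of the PostgreSQL text format.
# Octal (\123) and hexadecimal (\x41) escapes are not handled here.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
            "t": "\t", "v": "\v", "\\": "\\"}


def parse_field(raw: str) -> Optional[str]:
    """Decode one field; a field consisting only of \\N stands for NULL."""
    if raw == "\\N":
        return None
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Any other escaped character represents itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def parse_line(line: str) -> List[Optional[str]]:
    """Split one line at TAB characters and decode every field."""
    return [parse_field(field) for field in line.rstrip("\n").split("\t")]


# Hypothetical corpus.annis row: id, name, type, version, pre, post, top_level
row = parse_line("0\tmycorpus\tCORPUS\t\\N\t0\t1\ttrue\n")
print(row)  # ['0', 'mycorpus', 'CORPUS', None, '0', '1', 'true']
```

All values are returned as strings (or None); converting them to integers or booleans according to the column types above is left to the caller.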

corpus_annotation.annis

Contains meta-data on the corpus and the documents.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
corpus_ref integer X foreign key to corpus.id
namespace text
name text
value text

text.annis

Describes all texts that are included in the corpus.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
corpus_ref integer X foreign key to corpus.id. The corpus id should be the id of the document in the corpus table
id integer X restart from 0 for every corpus_ref
name text name of the text
text text content of the text

primary key: corpus_ref, id

node.annis

Every node in the corpus will have exactly one entry in this table.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
id bigint X X primary key
text_ref integer X foreign key to text.id
corpus_ref integer X foreign key to corpus.id
layer text
name text A human readable identifier of the node. Must be unique for each document.
left integer X position of first covered character
right integer X position of the character after the last covered character
token_index integer index of this token (if it is a token, otherwise NULL)
left_token integer X index of first covered token; for a token, this value is the token_index
right_token integer X index of last covered token; for a token, this value is the token_index
seg_index integer index of this segment (if it is a segment, i.e. there is some SOrderingRelation connected to this node)
seg_name text name of the segment path this segment belongs to
span text for tokens or node with a segmentation index: substring of the covered original text
root boolean X True if this node has no parents in all components
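
The left and right columns describe a half-open character range: right points at the character after the last covered one. The following Python sketch shows how the covered substring could be recovered from the text content. It assumes that positions are counted from 0, which the table does not state; the function name is hypothetical.

```python
def covered_text(text: str, left: int, right: int) -> str:
    """Return the substring covered by a node.

    `left` is the position of the first covered character and `right`
    the position of the character after the last covered character,
    which is the same convention Python slices use.
    """
    return text[left:right]


text = "This is an example"
print(covered_text(text, 5, 7))  # is
```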

component.annis

Lists the components (connected sub-graphs) of the graph.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
id bigint X X primary key
type char(1) edge type: c, d, p
layer text X Could be set to e.g. "default_layer" if not in any Salt layer
name text The sType of the component, e.g. anaphoric for some kind of pointing relation component

rank.annis

A rank entry describes one of the positions of a node in a component tree. There is one rank entry for each edge. Furthermore, every component has a virtual relation and thus an additional rank entry where the parent attribute is NULL and the level is 0.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
id bigint X X primary key
pre integer X the pre-order of the target node. The root of the component tree should always have a pre-order of 0
post integer X the post-order of the target node
node_ref bigint X the node.id of the target node
component_ref bigint X
parent bigint id of the parent rank entry
level integer level of this rank entry (not node!) in the component tree

Rank entries with the type 'c' (coverage spans) must be omitted if the referenced node is a token and the parent coverage span is continuous. Continuous means that the range of covered tokens has no gaps, i.e. it includes all tokens between the first and the last covered token. The idea behind this is that, for a continuous span, the needed information can be recovered from the "left_token" and "right_token" values of the span together with the "token_index" values (all in the node.annis table).
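
Whether a span counts as continuous in this sense can be decided from the token indexes it covers. A minimal Python sketch follows; the function name is hypothetical, and treating an empty span as not continuous is an assumption made here.

```python
from typing import Iterable


def is_continuous(covered_token_indexes: Iterable[int]) -> bool:
    """True if the covered tokens form a range without gaps."""
    indexes = sorted(set(covered_token_indexes))
    if not indexes:
        # Assumption: a span that covers no token is not continuous.
        return False
    # Without gaps, the range size equals the number of covered tokens.
    return indexes[-1] - indexes[0] + 1 == len(indexes)


print(is_continuous([3, 4, 5]))  # True
print(is_continuous([3, 5]))     # False, token 4 is missing
```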

node_annotation.annis

Contains all annotations per node.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
node_ref bigint X foreign key to node.id
namespace text
name text X
value text

unique(node_ref, namespace, name)

edge_annotation.annis

Contains all annotations per edge (which is represented by a rank entry)

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
rank_ref bigint X foreign key to rank.id
namespace text
name text X
value text

resolver_vis_map.annis

Describes which visualizers to trigger depending on the namespace of a node or edge occurring in the search results.

TAB-separated file as described in "Text Format" section of the PostgreSQL documentation

column type unique not NULL description
corpus text the name of the supercorpus
version text the version of the corpus
namespace text the several layers of the corpus
element text the type of the entry: "node" or "edge"
vis_type text X the abstract type of visualization: "tree", "discourse", "grid", ...
display_name text X the name of the layer which shall be shown for display
visibility text either "permanent", "visible", "hidden", "removed" or "preloaded", default is "hidden"
order bigint the order of the layers, in which they shall be shown
mappings text

ExtData folder


Contains the media files that are connected with this corpus including their binary content.

Each file directly inside the ExtData folder belongs to the toplevel corpus. A sub-folder corresponds to a document and each file inside a sub-folder belongs to the document with the same name.
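
The folder convention can be expressed as a small lookup. The Python sketch below maps a path relative to the ExtData folder to the name of the corpus or document that owns the file. The function name is hypothetical, and files nested more than one level deep are simply attributed to the first sub-folder, which the description above does not cover.

```python
from pathlib import PurePosixPath


def owner_of(relative_path: str, toplevel_corpus: str) -> str:
    """Map a path relative to ExtData to the owning corpus or document."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) == 1:
        # File directly inside ExtData: belongs to the toplevel corpus.
        return toplevel_corpus
    # File inside a sub-folder: belongs to the document with that name.
    return parts[0]


print(owner_of("intro.mp3", "pcc2"))        # pcc2
print(owner_of("11299/audio.ogg", "pcc2"))  # 11299
```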

Create new query builder

A query builder is a class that

  1. implements the QueryBuilderPlugin interface
  2. has the @PluginImplementation annotation

When implementing the interface you have to provide a short name, a caption and a callback function that creates new Vaadin components. You get an object of the type QueryController which you can use to set new queries from your component.

A query builder plugin must be either registered in the addCustomUIPlugins() function of the SearchUI (if the plugin is part of the annis-gui project) or must be added to a JAR-file that is located at one of the following locations:

  • the plugins folder inside the deployed web application
  • the path defined in the ANNIS_PLUGINS environment variable

Please also note that it is possible to configure a default query builder for an instance. Further information can be found in the User Guide.

Public REST API

The ANNIS service has a public REST API that can be accessed by third-party applications.

Authentication

Some of the API calls are protected resources. HTTP user/password authentication is used in this case. The users of the REST service can be configured as described in the User Guide. Whenever a user is not permitted to perform a certain action, a 403 Forbidden HTTP response is sent. For some actions (like "list all corpora") only the resources that are available to the user are shown.

You can omit any authentication data. In this case you have the same rights as the "anonymous" user.
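
As an illustration, a client could attach the credentials as follows. This Python sketch assumes that "HTTP user/password authentication" means the standard HTTP Basic scheme; the function name and the base URL in the usage line are hypothetical. The request object is only built, not sent.

```python
import base64
import urllib.request
from typing import Optional


def make_request(url: str, user: Optional[str] = None,
                 password: Optional[str] = None) -> urllib.request.Request:
    """Build a request; without credentials it is sent as "anonymous"."""
    request = urllib.request.Request(url)
    if user is not None and password is not None:
        # Assumption: credentials are passed with the HTTP Basic scheme.
        raw = f"{user}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical service location.
request = make_request("http://localhost:5711/annis/query/corpora",
                       "user", "secret")
print(request.get_header("Authorization"))  # Basic dXNlcjpzZWNyZXQ=
```

A 403 response to such a request would surface as urllib.error.HTTPError with code 403 when the request is opened with urllib.request.urlopen.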

Available APIs

The following APIs are currently available: corpus queries and administrative tasks.

ANNIS uses the semantic versioning scheme for its REST API. Minor version updates of ANNIS will be backwards-compatible with the APIs described in this documentation. There might be further unofficial API calls that can change without notice.

Corpus queries

Interface defining the REST API calls that ANNIS provides for querying the data.

All paths for this part of the service start with the "annis/query/" prefix.

Count matches of a query

Path(s)

  1. GET annis/query/search/count

Parameters

  • q - The query in the ANNIS Query Language (AQL)
  • corpora - A comma separated list of corpus names

Responses

Code 200

Produces an XML representation of the total matches and the number of documents that contain matches (application/xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<matchAndDocumentCount>
  <!-- the number of documents that contain matches -->
  <documentCount>2</documentCount>
  <!-- total number of matches -->
  <matchCount>399</matchCount>
</matchAndDocumentCount>
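
A client for this call needs to build the request URL and read the two numbers from the response. The Python sketch below shows both steps without performing the HTTP request itself. It assumes that q and corpora are passed as URL query parameters; the function names and the base URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple
from urllib.parse import urlencode


def count_url(base_url: str, query: str, corpora: Iterable[str]) -> str:
    """Build the URL for the count call (parameters percent-encoded)."""
    params = urlencode({"q": query, "corpora": ",".join(corpora)})
    return f"{base_url.rstrip('/')}/annis/query/search/count?{params}"


def parse_count(xml_text: str) -> Tuple[int, int]:
    """Return (match count, document count) from the XML response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return (int(root.findtext("matchCount")),
            int(root.findtext("documentCount")))


# Hypothetical service location.
print(count_url("http://localhost:5711", "tok", ["pcc2"]))
# http://localhost:5711/annis/query/search/count?q=tok&corpora=pcc2
```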

Find matches for a given query

Path(s)

  1. GET annis/query/search/find

Parameters

  • q - The query in the ANNIS Query Language (AQL)
  • corpora - A comma separated list of corpus names
  • offset - Optional offset from where to start the matches. Default is 0.
  • limit - Optional limit of the number of returned matches. Set to -1 if unlimited. Default is -1.
  • order - Optional order in which the results should be sorted. Can be either "normal", "random" or "inverted". "normal" is the default ordering, "inverted" inverts the default ordering and "random" is a non-stable random ordering (you will get different results for the same offset and limit).
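
The parameters above can be combined into a request URL as in the following Python sketch. It assumes that all parameters are passed in the URL query string; the function name and the base URL are hypothetical, and the defaults mirror the ones listed above.

```python
from typing import Iterable
from urllib.parse import urlencode

ALLOWED_ORDERS = ("normal", "random", "inverted")


def find_url(base_url: str, query: str, corpora: Iterable[str],
             offset: int = 0, limit: int = -1,
             order: str = "normal") -> str:
    """Build the URL for the find call; limit=-1 means unlimited."""
    if order not in ALLOWED_ORDERS:
        raise ValueError(f"order must be one of {ALLOWED_ORDERS}")
    params = urlencode({
        "q": query,
        "corpora": ",".join(corpora),
        "offset": offset,
        "limit": limit,
        "order": order,
    })
    return f"{base_url.rstrip('/')}/annis/query/search/find?{params}"


# Third page of ten matches each, from a hypothetical service location.
print(find_url("http://localhost:5711", "tok", ["pcc2"],
               offset=20, limit=10))
```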

Responses

Code 200

A list of the match identifiers for the query.

Can produce the MIME type application/xml in the following format

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<match-group>
  <!-- each match is enclosed in a match tag -->
  <match>
    <!-- the first matched node of match 1 did not match an annotation -->
    <anno></anno>
    <!-- the second matched node of match 1 was a match on the 'tiger::pos' annotation -->
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_1</id>
    <!-- ID of second matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_2</id>
  </match>
  <match>
    <anno></anno>
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_2</id>
    <!-- ID of second matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_3</id>
  </match>
  <!-- and so on -->
</match-group>

or the MIME type text/plain

salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2
salt:/pcc2/11299/#tok_2 tiger::pos::salt:/pcc2/11299/#tok_3
salt:/pcc2/11299/#tok_3 tiger::pos::salt:/pcc2/11299/#tok_4

In this format, there is one line per match and each ID is separated by space. An ID can be prefixed by the fully qualified annotation name (which is separated with '::' from the ID).
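
The plain text format is easy to process line by line. The following Python sketch splits one line into pairs of annotation name and node ID. It assumes that node IDs never contain the sequence '::', so the last '::' of an entry separates the annotation name from the ID; the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def parse_match_line(line: str) -> List[Tuple[Optional[str], str]]:
    """Split one match line into (annotation name or None, node ID) pairs."""
    result = []
    for entry in line.split():
        # The qualified annotation name itself contains '::',
        # therefore the entry is split at the last occurrence.
        anno, separator, node_id = entry.rpartition("::")
        result.append((anno if separator else None, node_id))
    return result


line = "salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2"
for anno, node_id in parse_match_line(line):
    print(anno, node_id)
# None salt:/pcc2/11299/#tok_1
# tiger::pos salt:/pcc2/11299/#tok_2
```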

Get a subgraph from a set of (matched) Salt IDs

Path(s)

  1. POST annis/query/search/subgraph

Request body

Consumes application/xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<match-group>
  <!-- each match is enclosed in a match tag -->
  <match>
    <!-- the first matched node of match 1 did not match an annotation -->
    <anno></anno>
    <!-- the second matched node of match 1 was a match on the 'tiger::pos' annotation -->
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_1</id>
    <!-- ID of second matched node of match 1 -->
    <id>salt:/pcc2/11299/#tok_2</id>
  </match>
  <match>
    <anno></anno>
    <anno>tiger::pos</anno>
    <!-- ID of first matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_2</id>
    <!-- ID of second matched node of match 2 -->
    <id>salt:/pcc2/11299/#tok_3</id>
  </match>
  <!-- and so on -->
</match-group>

or consumes text/plain:

salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2
salt:/pcc2/11299/#tok_2 tiger::pos::salt:/pcc2/11299/#tok_3
salt:/pcc2/11299/#tok_3 tiger::pos::salt:/pcc2/11299/#tok_4

One line per match, each ID is separated by space. An ID can be prepended by the fully qualified annotation name (which is separated with '::' from the ID).

Parameters

  • segmentation - Optional parameter for segmentation layer on which the context is applied. Leave empty for token layer (which is default).
  • left - Optional parameter for the left context size, default is 0.
  • right - Optional parameter for the right context size, default is 0.
  • filter - Optional parameter with value "all" or "token". If "token", only tokens will be fetched. Default is "all".

Responses

Code 200

Returns a representation of the Salt annotation graph in the EMF XMI format and with MIME type application/xml or application/xmi+xml.

Get the annotation graph of a complete document

Path(s)

  1. GET annis/query/graph/{top}/{doc}

{top} is the toplevel corpus name of the document and {doc} the document name.

Parameters

  • filternodeanno - A comma separated list of node annotations which are used as a filter for the graph. Only nodes having one of the annotations are included in the result.

Responses

Code 200

Returns a representation of the Salt annotation graph in the EMF XMI format and with MIME type application/xml or application/xmi+xml.

Get the content of a binary object for a specific document

Path(s)

  1. GET annis/query/corpora/{top}/{document}/binary
  2. GET annis/query/corpora/{top}/{document}/binary/{offset}/{length}
  3. GET annis/query/corpora/{top}/{document}/binary/{file}
  4. GET annis/query/corpora/{top}/{document}/binary/{file}/{offset}/{length}

Accepts any MIME type. The MIME type is used as implicit argument to filter the files that match a given query.

There are several ways of selecting the binary data you want to receive. You can choose to select the file only by giving a document name given by the {top} and {document} arguments (paths 1 and 2). This will return the first file that also matches the requested accepted mime types. Alternatively the name of the file itself can be given as path argument {file} (paths 3 and 4). You can also choose to either get the complete file (paths 1 and 3) or chunks containing only a subset of the binary data (paths 2 and 4). In the latter case, you can specify the {offset} and the {length} of the chunk (both in bytes).

  • {top} - The toplevel corpus name.
  • {document} - The name of the document that has the file. If you want the files for the toplevel corpus itself, use the name of the toplevel corpus as document name.
  • {file} - File name/title to select.
  • {offset} - Defines the offset at which the binary chunk starts (in bytes).
  • {length} - Defines the length of the binary chunk (in bytes).
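
The four path variants differ only in which optional arguments are present. A minimal Python sketch that builds the matching path is shown below. The function name is hypothetical, and percent-encoding of the path arguments is an assumption of this sketch.

```python
from typing import Optional
from urllib.parse import quote


def binary_path(top: str, document: str, file: Optional[str] = None,
                offset: Optional[int] = None,
                length: Optional[int] = None) -> str:
    """Build one of the four path variants for binary data."""
    if (offset is None) != (length is None):
        raise ValueError("offset and length must be given together")
    parts = ["annis", "query", "corpora",
             quote(top, safe=""), quote(document, safe=""), "binary"]
    if file is not None:
        # Path variants 3 and 4: select the file by its name.
        parts.append(quote(file, safe=""))
    if offset is not None:
        # Path variants 2 and 4: request only a chunk of the data.
        parts += [str(offset), str(length)]
    return "/".join(parts)


print(binary_path("pcc2", "11299"))
# annis/query/corpora/pcc2/11299/binary
print(binary_path("pcc2", "11299", file="audio.ogg", offset=0, length=1024))
# annis/query/corpora/pcc2/11299/binary/audio.ogg/0/1024
```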

Responses

Code 200

A binary stream that contains the file content. If path variant 2 or 4 is used, only a subset of the file is returned. Path variants 1 and 3 always return the complete file.

Administrative tasks

Interface defining the REST API calls that ANNIS provides for administrative tasks.

Currently it is possible to import corpora, to monitor the import status and to manage user accounts with this interface. All paths for this part of the service start with "annis/admin/".

Import one or more corpora

Path(s)

  1. POST annis/admin/import

Request body

A ZIP file which contains one or more corpora in separate sub-folders. Consumes MIME type application/zip.

Parameters

  • overwrite - Set to "true" if an existing corpus should be overwritten.
  • statusMail - An e-mail address to which status reports are sent.
  • alias - An internal alias name of the corpus which can be used instead of the actual corpus name when referring to it in the URL. Corpora can share the same alias.

Responses

Code 202

Import has been accepted and its status can be queried by the URL given in the Location header.

Code 400

Bad request, e.g. if the corpus already exists and overwrite parameter was not set.
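
An import request could be prepared as in the Python sketch below. It assumes that the parameters are passed in the URL query string, which the description above does not specify; the function name and the base URL are hypothetical. The request is only built, not sent.

```python
import urllib.request
from typing import Optional
from urllib.parse import urlencode


def import_request(base_url: str, zip_bytes: bytes, overwrite: bool = False,
                   status_mail: Optional[str] = None,
                   alias: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request that uploads a ZIP file with corpora."""
    params = {}
    if overwrite:
        params["overwrite"] = "true"
    if status_mail is not None:
        params["statusMail"] = status_mail
    if alias is not None:
        params["alias"] = alias
    url = f"{base_url.rstrip('/')}/annis/admin/import"
    if params:
        # Assumption: parameters are sent as URL query parameters.
        url += "?" + urlencode(params)
    return urllib.request.Request(
        url, data=zip_bytes, method="POST",
        headers={"Content-Type": "application/zip"})


# Hypothetical service location and placeholder content.
request = import_request("http://localhost:5711", b"...",
                         overwrite=True, alias="demo")
print(request.get_method(), request.full_url)
```

When the request is accepted (code 202), the Location header of the response contains the URL under which the status can be queried.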

Status of all running import jobs

Path(s)

  1. GET annis/admin/import/status/

Responses

Code 200

The response lists all currently running import jobs. It has the MIME type application/xml and the following format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJobs>
  <!-- an importJob tag for each running import -->
  <importJob>
    <!-- visible caption, e.g. corpus name  -->
    <caption>MyNewCorpus</caption>
    <!-- A list of output messages from the import process-->
    <messages>
      <m>first message</m>
      <m>second message</m>
      <m>just another message</m>
    </messages>
    <!-- true if the corpus will be overwritten -->
    <overwrite>true</overwrite>
    <!-- current status, can be WAITING, RUNNING, SUCCESS or ERROR -->
    <status>RUNNING</status>
    <!-- a unique identifier for this import job -->
    <uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
    <!-- a mail address to which status reports should be sent -->
    <statusMail>mail@example.com</statusMail>
    <!-- alias name of the corpus as defined by the import request -->
    <alias>CorpusAlias</alias>
  </importJob>
</importJobs>

The root element has the name importJobs and there is an importJob element for each element of the list.
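
Reading this response on the client side could look like the following Python sketch, which extracts a few of the fields from each importJob element. The function name and the choice of fields are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_import_jobs(xml_text: str) -> List[Dict[str, object]]:
    """Return one dictionary per importJob element of the response."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    jobs = []
    for job in root.findall("importJob"):
        jobs.append({
            "caption": job.findtext("caption"),
            "status": job.findtext("status"),
            "uuid": job.findtext("uuid"),
            # Output messages of the import process, in document order.
            "messages": [m.text for m in job.findall("messages/m")],
        })
    return jobs
```

For the example response above, the result would contain a single job with the caption MyNewCorpus, the status RUNNING and three messages.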

Show import job information after it was finished

Path(s)

  1. GET annis/admin/import/status/finished/{uuid}

The {uuid} is the unique identifier of the import job, which was returned by the import function.

Responses

Code 200

If the import finished, a 200 HTTP status code is sent and a proper description of the import job is returned. After this resource has been successfully accessed once, a 404 HTTP status code will be sent on subsequent requests.

The response has the MIME type application/xml and the following format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJob>
  <!-- visible caption, e.g. corpus name  -->
  <caption>MyNewCorpus</caption>
  <!-- A list of output messages from the import process-->
  <messages>
    <m>first message</m>
    <m>second message</m>
    <m>just another message</m>
  </messages>
  <!-- true if the corpus will be overwritten -->
  <overwrite>true</overwrite>
  <!-- current status, can be WAITING, RUNNING, SUCCESS or ERROR -->
  <status>RUNNING</status>
  <!-- a unique identifier for this import job -->
  <uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
  <!-- a mail address to which status reports should be sent -->
  <statusMail>mail@example.com</statusMail>
  <!-- alias name of the corpus as defined by the import request -->
  <alias>CorpusAlias</alias>
</importJob>

Code 404

If the import has not finished yet, or if it finished and its status has already been queried, a 404 HTTP status code is sent.

User management

Path(s)

  1. GET annis/admin/users/{userName}
  2. PUT annis/admin/users/{userName}

Gets information about an existing user with the name {userName} (1) or updates/creates a new user with this name (2).

Request body

The PUT action accepts the user information in XML (MIME type application/xml). The fields correspond to the fields of the single user configuration file. Please have a look at the general user configuration information in the ANNIS User Guide for a more detailed explanation.

<user>
  <!-- User name (must be the same as the "userName" parameter) -->
  <name>myusername</name>
  <!-- hashed password in the Shiro1CryptFormat -->
  <passwordHash>$shiro1$SHA-256$1$tQNwU[...]</passwordHash>
  <!-- A list of groups the users should belong to. -->
  <group>group1</group>
  <group>group2</group>
  <group>group3</group>
  <!-- A list of explicit permission the users should have. -->
  <permission>admin:*</permission>
  <permission>query:*</permission>
  <!-- Optional expiration date encoded in the ISO-8601 standard -->
  <expires>2015-02-12T00:00:00.000+01:00</expires>
</user>
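
A request body of this shape can be generated instead of written by hand. The following Python sketch builds the user element with the fields shown above; the function name is hypothetical and the password hash has to be computed elsewhere.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional


def user_xml(name: str, password_hash: str, groups: Iterable[str] = (),
             permissions: Iterable[str] = (),
             expires: Optional[str] = None) -> str:
    """Serialize the user information as the body of the PUT request."""
    user = ET.Element("user")
    ET.SubElement(user, "name").text = name
    ET.SubElement(user, "passwordHash").text = password_hash
    for group in groups:
        ET.SubElement(user, "group").text = group
    for permission in permissions:
        ET.SubElement(user, "permission").text = permission
    if expires is not None:
        # Expiration date as an ISO-8601 string.
        ET.SubElement(user, "expires").text = expires
    return ET.tostring(user, encoding="unicode")


print(user_xml("myusername", "hash", ["group1"], ["query:*"]))
```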

Responses

Code 200

If the GET variant is used and the queried user exists, return the user information with the MIME type application/xml. The fields correspond to the fields of the single user configuration file. Please have a look at the general user configuration information in the ANNIS user guide for a more detailed explanation.

<user>
  <!-- User name (must be the same as the "userName" parameter) -->
  <name>myusername</name>
  <!-- hashed password in the Shiro1CryptFormat -->
  <passwordHash>$shiro1$SHA-256$1$tQNwU[...]</passwordHash>
  <!-- A list of groups the users should belong to. -->
  <group>group1</group>
  <group>group2</group>
  <group>group3</group>
  <!-- A list of explicit permission the users should have. -->
  <permission>admin:*</permission>
  <permission>query:*</permission>
  <!-- Optional expiration date encoded in the ISO-8601 standard -->
  <expires>2015-02-12T00:00:00.000+01:00</expires>
</user>