Architecture
ANNIS is a web-based search and visualization architecture for multi-layer corpora. ANNIS consists of two major components: a backend service and a web front-end. There is also a local version, ANNIS Kickstarter, which is a simple starting point for new users who want to try out the system without installing a full server.
Backend Service
The service runs on a web server such as Tomcat or Jetty and communicates with a relational database, the open-source PostgreSQL. PostgreSQL (version 9.4 or later) must be installed for ANNIS to work. For more information on installing and managing the backend service, see the Administration Guide in the documentation.
Web Front-end
The ANNIS front-end is a web application implemented in Java with the Vaadin framework. It runs in a normal browser (we recommend Mozilla Firefox). The server running the web application communicates with the backend service via a REST interface.
ANNIS Kickstarter
ANNIS Kickstarter is a cross-platform local version which requires nothing but a PostgreSQL installation to run. It runs under Linux, Windows and Mac. For a quick tutorial to get started with Kickstarter, see the ANNIS User Guide.
Building
ANNIS uses Maven 3 as its build tool. Maven itself is based on Java and should run on every major operating system. You have to download and install the appropriate version for your operating system from http://maven.apache.org/download.html before you can build ANNIS. Maven will download all needed dependencies from central servers on the first build, so you will need a working internet connection. The dependencies are cached locally once they are downloaded.
When you have downloaded or checked out the source of ANNIS, the top-level directory of the source code is the parent project for all ANNIS sub-projects. If you want to build every project that is part of ANNIS, just execute
cd <annis-sources>/
mvn install
This might take a while on the first execution. mvn clean will remove all compiled code if necessary.
If you only want to compile a sub-project, execute mvn install in the corresponding sub-directory. Every folder with a sub-project has a pom.xml file. These files configure the whole build process. The Maven documentation contains detailed explanations of the structure and possible content of the configuration files.
Some sub-projects don't provide a library but produce a zip or tar/gz file when they are compiled. These assembly steps (see Maven Assembly documentation) are automatically invoked on mvn install.
Using an IDE
While you can use any text editor of your choice to change ANNIS and compile it completely on the command line using Maven, a proper IDE will be a huge help. You can use any IDE which has good support for Maven. The ANNIS main developers currently recommend Eclipse Proton or the Netbeans 9 IDE for development.
Running an embedded Jetty instance for local access
This way of running the front-end is very useful if you want to access ANNIS on your local machine as a single user.
You don't need to install Jetty or Tomcat by yourself using this method.
cd <unzipped source>/annis-gui/
mvn jetty:run
Now you can access the site under http://localhost:8080/annis-gui/. The Jetty server can be stopped by pressing "CTRL-C".
Making a new ANNIS release
Introduction
The release process, including all necessary tests, might take several days and includes fixing bugs that are only discovered in the release testing process. Never add new features during the release process; use the separate "develop" branch for this purpose.
You must have mdBook installed to make a release. Otherwise the documentation can't be created.
Release Process
Initialization phase
- Start the release process by executing mvn gitflow:release-start for a regular release (branched from develop) or mvn gitflow:hotfix-start for a hotfix that is branched from master. The command will ask you for the new version number; use semantic versioning.
- Add a new changelog entry. If some important information is missing, create an enhancement or bugfix issue in GitHub and repeat.
- Get the GitHub milestone id associated with the release (it is visible in the URL if you view the issues of the release tracking milestone).
- execute this script
Misc/changelog.py <milestone-id>
- add the output to the beginning of the CHANGELOG file
- Update and commit license information
mvn license:add-third-party license:download-licenses
Testing cycle
- Build the complete project with tests.
mvn clean
mvn -DskipTests=true install
mvn test
- Do manual tests. If you have to fix a bug, document it in the issue tracker, update the changelog and start over at step 1. If no known bugs are left to fix, go to the next section.
Finish phase
- Finish the release by executing either mvn gitflow:release-finish for regular releases or mvn gitflow:hotfix-finish for hotfixes.
- Release the staging repository to Maven Central with the Nexus interface: https://oss.sonatype.org/
- Create a new release on GitHub including the changelog. Upload the binaries from Maven repository to GitHub release as well.
ANNIS import format version 3.3
The ANNIS import format is inspired by the Salt meta-data model, and ANNIS uses Salt internally to represent a matched graph from the database. However, there are some restrictions which ANNIS has but Salt doesn't.
- node names must be unique per document
- document names must be unique per top-level corpus
- an ANNIS corpus contains only one top-level corpus
- there are no meta-data for nodes
- string identifiers such as annotation or layer names have a limited number of allowed characters and should match the regular expression [a-zA-Z_][a-zA-Z0-9_-]* in order to be searchable with AQL
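The identifier restriction in the list above can be checked before a corpus is imported. The following is a minimal Python sketch that uses only the regular expression quoted in the list; the function name is hypothetical and not part of ANNIS.

```python
import re

# Pattern quoted from the restriction list above.
IDENTIFIER_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")


def is_searchable_identifier(name: str) -> bool:
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER_PATTERN.fullmatch(name) is not None


for candidate in ["pos", "my-layer_2", "2nd_layer", "has space"]:
    print(candidate, is_searchable_identifier(candidate))
```

The sketch uses fullmatch so that a valid prefix followed by a forbidden character is rejected.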
An ANNIS corpus can be either a zip file or a directory which includes the following files
annis.version
First line is exactly "3.3", the next lines can contain human readable text.
Pure UTF-8 encoded text file
corpus.annis
Contains structural information about the corpus and its documents.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | integer | X | X | primary key |
name | text | X | X | unique name (per corpus) |
type | text | | X | CORPUS, DOCUMENT |
version | text | | | version number (currently not used) |
pre | integer | | X | pre order of the corpus tree |
post | integer | | X | post order of the corpus tree |
top_level | boolean | | X | true for the toplevel corpus |
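All table files of the import format share the same TAB-separated layout, so reading one line can be sketched once. The Python sketch below decodes a line according to the PostgreSQL text format, in which \N marks NULL and a backslash introduces an escape. It handles only a subset of the escapes (\t, \n, \r and \\); the function names and the sample row, including the spelling of the boolean value, are assumptions made for illustration.

```python
from typing import List, Optional

# Subset of the backslash escapes defined by the PostgreSQL text format.
_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}


def parse_field(raw: str) -> Optional[str]:
    """Decode one field; the marker \\N stands for NULL."""
    if raw == r"\N":
        return None
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Escapes outside the subset are taken as the character itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def parse_line(line: str) -> List[Optional[str]]:
    """Split one line at TAB characters and decode every field."""
    return [parse_field(field) for field in line.rstrip("\n").split("\t")]


# Hypothetical corpus.annis row: id, name, type, version, pre, post, top_level
print(parse_line("0\tpcc2\tCORPUS\t\\N\t0\t1\tTRUE\n"))
```

Because a literal TAB inside a value is written as the escape \t, splitting at raw TAB characters before decoding is safe.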
corpus_annotation.annis
Contains meta-data on the corpus and the documents.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
corpus_ref | integer | | X | foreign key to corpus.id |
namespace | text | | | |
name | text | | | |
value | text | | | |
text.annis
Describes all texts that are included in the corpus.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
corpus_ref | integer | | X | foreign key to corpus.id. The corpus id should be the id of the document in the corpus table |
id | integer | | X | restart from 0 for every corpus_ref |
name | text | | | name of the text |
text | text | | | content of the text |
primary key: corpus_ref, id
node.annis
Every node in the corpus will have exactly one entry in this table.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | bigint | X | X | primary key |
text_ref | integer | | X | foreign key to text.id |
corpus_ref | integer | | X | foreign key to corpus.id |
layer | text | | | |
name | text | | | A human readable identifier of the node. Must be unique for each document. |
left | integer | | X | position of first covered character |
right | integer | | X | position of the character after the last covered character |
token_index | integer | | | index of this token (if it is a token, otherwise NULL) |
left_token | integer | | X | index of first covered token; for a token, this value is the token_index |
right_token | integer | | X | index of last covered token; for a token, this value is the token_index |
seg_index | integer | | | index of this segment (if it is a segment, i.e. there is some SOrderingRelation connected to this node) |
seg_name | text | | | name of the segment path this segment belongs to |
span | text | | | for tokens or nodes with a segmentation index: substring of the covered original text |
root | boolean | | X | True if this node has no parents in all components |
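Since right is the position after the last covered character, the left and right columns behave like a half-open range. The small Python sketch below illustrates this; it assumes that character positions are counted from 0, which the table does not state, and the function name is hypothetical.

```python
def covered_text(text: str, left: int, right: int) -> str:
    """Return the substring covered by a node.

    Assumes 0-based positions; 'right' points at the character after the
    last covered one, so it maps directly to a Python slice end.
    """
    return text[left:right]


document_text = "This is an example"
print(covered_text(document_text, 5, 7))
```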
component.annis
Lists the components (connected sub-graphs) of the graph.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | bigint | X | X | primary key |
type | char(1) | | | edge type: c, d, p |
layer | text | | X | Could be set to e.g. "default_layer" if not in any Salt layer |
name | text | | | The sType of the component, e.g. anaphoric for some kind of pointing relation component |
rank.annis
A rank entry describes one of the positions of a node in a component tree. There is one rank entry for each edge. Furthermore, every component has a virtual relation and thus an additional rank entry where the parent attribute is NULL and the level is 0.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | bigint | X | X | primary key |
pre | integer | | X | the pre-order of the target node. The root of the component tree should always have a pre-order of 0 |
post | integer | | X | the post-order of the target node |
node_ref | bigint | | X | the node.id of the target node |
component_ref | bigint | | X | |
parent | bigint | | | id of the parent rank entry |
level | integer | | | level of this rank entry (not node!) in the component tree |
Rank entries with the type 'c' (coverage spans) must be omitted if the referenced node is a token and the parent coverage span is continuous. Continuous means that the range of covered tokens has no gaps, i.e. it includes all tokens between the first and the last covered token. The idea behind this is that, if the span is continuous, you can recover the needed information using the "left_token" and "right_token" of the span together with the "token_index" (all in the node.annis table).
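The continuity condition can be expressed as a small check that an exporter might run before deciding whether to write coverage rank entries for tokens. This is a Python sketch under the definition given above; the function name and the way the covered token indices are collected are assumptions.

```python
from typing import Iterable


def is_continuous(left_token: int, right_token: int, covered: Iterable[int]) -> bool:
    """True if the covered token indices fill left_token..right_token without gaps.

    right_token is the index of the last covered token, so it is inclusive.
    """
    return set(covered) == set(range(left_token, right_token + 1))


# A span over tokens 2, 3 and 4 is continuous: its token rank entries are omitted.
print(is_continuous(2, 4, [2, 3, 4]))
# A span over tokens 2, 3 and 5 has a gap: its rank entries must be written.
print(is_continuous(2, 5, [2, 3, 5]))
```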
node_annotation.annis
Contains all annotations per node.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
node_ref | bigint | | X | foreign key to node.id |
namespace | text | | | |
name | text | | X | |
value | text | | | |
unique(node_ref, namespace, name)
edge_annotation.annis
Contains all annotations per edge (which is represented by a rank entry)
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
rank_ref | bigint | | X | foreign key to rank.id |
namespace | text | | | |
name | text | | X | |
value | text | | | |
resolver_vis_map.annis
Describes which visualizers to trigger depending on the namespace of a node or edge occurring in the search results.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
corpus | text | | | the name of the supercorpus |
version | text | | | the version of the corpus |
namespace | text | | | the several layers of the corpus |
element | text | | | the type of the entry: "node" or "edge" |
vis_type | text | | X | the abstract type of visualization: "tree", "discourse", "grid", ... |
display_name | text | | X | the name of the layer which shall be shown for display |
visibility | text | | | either "permanent", "visible", "hidden", "removed" or "preloaded", default is "hidden" |
order | bigint | | | the order of the layers, in which they shall be shown |
mappings | text | | | |
ExtData folder
Contains the media files that are connected with this corpus including their binary content.
Each file directly inside the ExtData folder belongs to the toplevel corpus. A sub-folder corresponds to a document and each file inside a sub-folder belongs to the document with the same name.
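The folder convention can be summarised in a few lines of code. The Python sketch below maps a file path relative to the ExtData folder to the corpus or document it belongs to; the function name is hypothetical, and treating deeper nesting as belonging to the first sub-folder is an assumption, since the text only describes one folder level.

```python
from pathlib import PurePosixPath


def owner_of(relative_path: str, toplevel_corpus: str) -> str:
    """Return the name of the corpus or document that owns a media file.

    relative_path is relative to the ExtData folder, e.g. "11299/audio.ogg".
    """
    parts = PurePosixPath(relative_path).parts
    if len(parts) == 1:
        # File directly inside ExtData: belongs to the toplevel corpus.
        return toplevel_corpus
    # File inside a sub-folder: the folder name is the document name.
    return parts[0]


print(owner_of("overview.pdf", "pcc2"))
print(owner_of("11299/audio.ogg", "pcc2"))
```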
Create new query builder
A query builder is a class that
- implements the QueryBuilderPlugin interface
- has the @PluginImplementation annotation
When implementing the interface you have to provide a short name, a caption and a callback function that creates new Vaadin components. You get an object of the type QueryController which you can use to set new queries from your component.
A query builder plugin must either be registered in the addCustomUIPlugins() function of the SearchUI (if the plugin is part of the annis-gui project) or be added to a JAR file that is located at one of the following locations:
- the plugins folder inside the deployed web application
- the path defined in the ANNIS_PLUGINS environment variable
Please also note that it is possible to configure a default query builder for an instance. Further information can be found in the User Guide.
Public REST API
The ANNIS service has a public REST API that can be accessed by third-party applications.
Authentication
Some of the API calls are protected resources. HTTP user/password authentication is used in this case.
The users of the REST service can be configured as described in the User Guide.
Whenever a user is not permitted to perform a certain action, a 403 Forbidden HTTP response is sent.
On some actions (like "list all corpora") only the resources that are available to the user are shown.
You can omit any authentication data. In this case you have the same rights as the "anonymous" user.
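A client can prepare its request headers as in the following Python sketch. It assumes that "HTTP user/password authentication" means the standard HTTP Basic scheme, which the text does not state explicitly; the function name is hypothetical. Omitting the credentials yields no Authorization header, which corresponds to the anonymous case described above.

```python
import base64
from typing import Dict, Optional


def auth_headers(user: Optional[str] = None, password: Optional[str] = None) -> Dict[str, str]:
    """Build the request headers for an optionally authenticated call.

    Assumption: the service accepts HTTP Basic credentials.
    """
    if user is None:
        # No credentials: the request is handled with anonymous rights.
        return {}
    raw = f"{user}:{password or ''}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(auth_headers())
print(auth_headers("user", "pass"))
```

A client should be prepared for a 403 response even with valid credentials, since permissions are checked per action.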
Available APIs
The following APIs are currently available:
- Corpus queries
- Administrative tasks
ANNIS uses the semantic versioning scheme for its REST-API. Minor version updates of ANNIS will be backwards-compatible to the APIs described by this documentation. There might be further un-official API calls that might change without any notice.
Corpus queries
Interface defining the REST API calls that ANNIS provides for querying the data.
All paths for this part of the service start with the "annis/query/" prefix.
Count matches of a query
Path(s)
GET annis/query/search/count
Parameters
- q - The query in the ANNIS Query Language (AQL)
- corpora - A comma separated list of corpus names
Responses
Code 200
Produces an XML representation of the total matches and the number of documents that contain matches (application/xml):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<matchAndDocumentCount>
<!-- the number of documents that contain matches -->
<documentCount>2</documentCount>
<!-- total number of matches -->
<matchCount>399</matchCount>
</matchAndDocumentCount>
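A client for the count call needs to build the query string and read two numbers from the response. The Python sketch below shows both steps without contacting a server; the base URL http://example.org/ is a placeholder, the function names are hypothetical, and the sample response repeats the XML shown above.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import List, Tuple


def count_url(base_url: str, aql: str, corpora: List[str]) -> str:
    """Build the URL of the count call with the 'q' and 'corpora' parameters."""
    query = urllib.parse.urlencode({"q": aql, "corpora": ",".join(corpora)})
    return base_url.rstrip("/") + "/annis/query/search/count?" + query


def parse_count(xml_bytes: bytes) -> Tuple[int, int]:
    """Return (matchCount, documentCount) from the XML response body."""
    root = ET.fromstring(xml_bytes)
    return int(root.findtext("matchCount")), int(root.findtext("documentCount"))


SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<matchAndDocumentCount>
  <documentCount>2</documentCount>
  <matchCount>399</matchCount>
</matchAndDocumentCount>"""

print(count_url("http://example.org/", "tok", ["pcc2"]))
print(parse_count(SAMPLE_RESPONSE))
```

The URL could then be fetched with urllib.request.urlopen and the response body passed to parse_count.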
Find matches for a given query
Path(s)
GET annis/query/search/find
Parameters
- q - The query in the ANNIS Query Language (AQL)
- corpora - A comma separated list of corpus names
- offset - Optional offset from where to start the matches. Default is 0.
- limit - Optional limit of the number of returned matches. Set to -1 if unlimited. Default is -1.
- order - Optional order in which the results should be sorted. Can be either "normal", "random" or "inverted". "normal" is the default ordering, "inverted" inverts the default ordering and "random" is a non-stable random ordering (thus you will get different results for the same offset and limit).
Responses
Code 200
A list of the match identifiers for the query.
Can produce the MIME type application/xml in the following format
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<match-group>
<!-- each match is enclosed in a match tag -->
<match>
<!-- the first matched node of match 1 did not match an annotation -->
<anno></anno>
<!-- the second matched node of match 1 was a match on the 'tiger::pos' annotation-->
<anno>tiger::pos</anno>
<!-- ID of first matched node of match 1 -->
<id>salt:/pcc2/11299/#tok_1</id>
<!-- ID of second matched node of match 1 -->
<id>salt:/pcc2/11299/#tok_2</id>
</match>
<match>
<anno></anno>
<anno>tiger::pos</anno>
<!-- ID of first matched node of match 2 -->
<id>salt:/pcc2/11299/#tok_2</id>
<!-- ID of second matched node of match 2 -->
<id>salt:/pcc2/11299/#tok_3</id>
</match>
<!-- and so on -->
</match-group>
or the MIME type text/plain
salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2
salt:/pcc2/11299/#tok_2 tiger::pos::salt:/pcc2/11299/#tok_3
salt:/pcc2/11299/#tok_3 tiger::pos::salt:/pcc2/11299/#tok_4
In this format, there is one line per match and each ID is separated by space. An ID can be prefixed by the fully qualified annotation name (which is separated with '::' from the ID).
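The text/plain variant is easy to process line by line. The following Python sketch splits one line into its matched nodes; it assumes that Salt IDs contain neither spaces nor the sequence "::", so the last "::" in an entry separates the annotation name from the ID. The function name is hypothetical.

```python
from typing import List, Optional, Tuple

# (fully qualified annotation name or None, Salt ID)
MatchedNode = Tuple[Optional[str], str]


def parse_match_line(line: str) -> List[MatchedNode]:
    """Parse one line of the text/plain match list."""
    nodes: List[MatchedNode] = []
    for entry in line.split():
        if "::" in entry:
            # The annotation name itself may contain "::" (e.g. tiger::pos),
            # therefore split at the last occurrence only.
            anno, node_id = entry.rsplit("::", 1)
            nodes.append((anno, node_id))
        else:
            nodes.append((None, entry))
    return nodes


print(parse_match_line("salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2"))
```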
Get a subgraph from a set of (matched) Salt IDs
Path(s)
POST annis/query/search/subgraph
Request body
Consumes application/xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<match-group>
<!-- each match is enclosed in a match tag -->
<match>
<!-- the first matched node of match 1 did not match an annotation -->
<anno></anno>
<!-- the second matched node of match 1 was a match on the 'tiger::pos' annotation-->
<anno>tiger::pos</anno>
<!-- ID of first matched node of match 1 -->
<id>salt:/pcc2/11299/#tok_1</id>
<!-- ID of second matched node of match 1 -->
<id>salt:/pcc2/11299/#tok_2</id>
</match>
<match>
<anno></anno>
<anno>tiger::pos</anno>
<!-- ID of first matched node of match 2 -->
<id>salt:/pcc2/11299/#tok_2</id>
<!-- ID of second matched node of match 2 -->
<id>salt:/pcc2/11299/#tok_3</id>
</match>
<!-- and so on -->
</match-group>
or consumes text/plain:
salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2
salt:/pcc2/11299/#tok_2 tiger::pos::salt:/pcc2/11299/#tok_3
salt:/pcc2/11299/#tok_3 tiger::pos::salt:/pcc2/11299/#tok_4
One line per match, each ID is separated by space. An ID can be prepended by the fully qualified annotation name (which is separated with '::' from the ID).
Parameters
- segmentation - Optional parameter for the segmentation layer on which the context is applied. Leave empty for the token layer (which is the default).
- left - Optional parameter for the left context size, default is 0.
- right - Optional parameter for the right context size, default is 0.
- filter - Optional parameter with value "all" or "token". If "token", only tokens will be fetched. Default is "all".
Responses
Code 200
Returns a representation of the Salt annotation graph in the EMF XMI format and with MIME type application/xml or application/xmi+xml.
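The subgraph call combines a request body with parameters. The Python sketch below prepares such a request with the standard library without sending it. It assumes that left and right are passed as URL query parameters, which the description does not spell out; the base URL is a placeholder and the function name is hypothetical.

```python
import urllib.parse
import urllib.request
from typing import List


def subgraph_request(base_url: str, match_lines: List[str],
                     left: int = 0, right: int = 0) -> urllib.request.Request:
    """Prepare a POST request that asks for the subgraphs of the given matches.

    match_lines uses the text/plain format: one match per line.
    Assumption: the context sizes are URL query parameters.
    """
    params = urllib.parse.urlencode({"left": left, "right": right})
    url = base_url.rstrip("/") + "/annis/query/search/subgraph?" + params
    body = "\n".join(match_lines).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/plain", "Accept": "application/xml"},
        method="POST",
    )


request = subgraph_request(
    "http://example.org/",
    ["salt:/pcc2/11299/#tok_1 tiger::pos::salt:/pcc2/11299/#tok_2"],
    left=5,
    right=5,
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it; the response body is the XMI document described above.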
Get the annotation graph of a complete document
Path(s)
GET annis/query/graph/{top}/{doc}
{top} is the toplevel corpus name of the document and {doc} the document name.
Parameters
- filternodeanno - A comma separated list of node annotations which are used as a filter for the graph. Only nodes having one of the annotations are included in the result.
Responses
Code 200
Returns a representation of the Salt annotation graph in the EMF XMI format and with MIME type application/xml or application/xmi+xml.
Get the content of a binary object for a specific document
Path(s)
GET annis/query/corpora/{top}/{document}/binary
GET annis/query/corpora/{top}/{document}/binary/{offset}/{length}
GET annis/query/corpora/{top}/{document}/binary/{file}
GET annis/query/corpora/{top}/{document}/binary/{file}/{offset}/{length}
Accepts any MIME type. The MIME type is used as implicit argument to filter the files that match a given query.
There are several ways of selecting the binary data you want to receive. You can choose to select the file only by giving a document name given by the {top} and {document} arguments (paths 1 and 2). This will return the first file that also matches the requested accepted mime types. Alternatively the name of the file itself can be given as path argument {file} (paths 3 and 4). You can also choose to either get the complete file (paths 1 and 3) or chunks containing only a subset of the binary data (paths 2 and 4). In the latter case, you can specify the {offset} and the {length} of the chunk (both in bytes).
- {top} - The toplevel corpus name.
- {document} - The name of the document that has the file. If you want the files for the toplevel corpus itself, use the name of the toplevel corpus as document name.
- {file} - File name/title to select.
- {offset} - Defines the offset at which the binary chunk starts (in bytes).
- {length} - Defines the length of the binary chunk (in bytes).
Responses
Code 200
A binary stream that contains the file content. If path variant 2 or 4 is used, only a subset of the file is returned. Path variants 1 and 3 always return the complete file.
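The four path variants differ only in whether a file name and a chunk range are present. The Python sketch below builds the matching path from optional arguments; the function name is hypothetical, and percent-encoding the path segments is an assumption about how the service decodes them.

```python
from typing import Optional
from urllib.parse import quote


def binary_path(top: str, document: str, file: Optional[str] = None,
                offset: Optional[int] = None, length: Optional[int] = None) -> str:
    """Build one of the four path variants for fetching binary data."""
    if (offset is None) != (length is None):
        raise ValueError("offset and length must be given together")
    segments = ["annis/query/corpora", quote(top, safe=""),
                quote(document, safe=""), "binary"]
    if file is not None:
        # Variants 3 and 4: select the file by name.
        segments.append(quote(file, safe=""))
    if offset is not None:
        # Variants 2 and 4: request only a chunk of the data.
        segments += [str(offset), str(length)]
    return "/".join(segments)


print(binary_path("pcc2", "11299"))
print(binary_path("pcc2", "11299", file="audio.ogg", offset=0, length=1024))
```

The desired MIME type is not part of the path; according to the description above it is taken from the types the request accepts.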
Administrative tasks
Interface defining the REST API calls that ANNIS provides for administrative tasks.
Currently it is possible to import corpora, monitor the import status with this interface and to manage user accounts. All paths for this part of the service start with "annis/admin/".
Import one or more corpora
Path(s)
POST annis/admin/import
Request body
A ZIP file which contains one or more corpora in separate sub-folders.
Consumes MIME type application/zip.
Parameters
- overwrite - Set to "true" if an existing corpus should be overwritten.
- statusMail - An e-mail address to which status reports are sent.
- alias - An internal alias name of the corpus which can be used instead of the actual corpus name when referring to it in the URL. Corpora can share the same alias.
Responses
Code 202
Import has been accepted and its status can be queried by the URL given in the Location header.
Code 400
Bad request, e.g. if the corpus already exists and the overwrite parameter was not set.
Status of all running import jobs
Path(s)
GET annis/admin/import/status/
Responses
Code 200
The response lists all currently running import jobs. It has the MIME type application/xml and the following format:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJobs>
<!-- an importJob tag for each running import -->
<importJob>
<!-- visible caption, e.g. corpus name -->
<caption>MyNewCorpus</caption>
<!-- A list of output messages from the import process-->
<messages>
<m>first message</m>
<m>second message</m>
<m>just another message</m>
</messages>
<!-- true if the corpus will be overwritten -->
<overwrite>true</overwrite>
<!-- current status, can be WAITING, RUNNING, SUCCESS or ERROR -->
<status>RUNNING</status>
<!-- a unique identifier for this import job -->
<uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
<!-- a mail address to which status reports should be sent -->
<statusMail>mail@example.com</statusMail>
<!-- alias name of the corpus as defined by the import request -->
<alias>CorpusAlias</alias>
</importJob>
</importJobs>
The root element has the name importJobs and there is an importJob element for each element of the list.
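A monitoring client only needs a few fields of each job. The Python sketch below reads the caption, status, UUID and messages of every importJob element with the standard library; the function name is hypothetical and the sample document is a shortened version of the format shown above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_import_jobs(xml_bytes: bytes) -> List[Dict[str, object]]:
    """Extract the main fields of every importJob in the status response."""
    jobs: List[Dict[str, object]] = []
    for job in ET.fromstring(xml_bytes).findall("importJob"):
        jobs.append({
            "caption": job.findtext("caption"),
            "status": job.findtext("status"),
            "uuid": job.findtext("uuid"),
            "messages": [m.text for m in job.findall("messages/m")],
        })
    return jobs


SAMPLE_STATUS = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJobs>
  <importJob>
    <caption>MyNewCorpus</caption>
    <messages>
      <m>first message</m>
      <m>second message</m>
    </messages>
    <status>RUNNING</status>
    <uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
  </importJob>
</importJobs>"""

for job in parse_import_jobs(SAMPLE_STATUS):
    print(job["caption"], job["status"], job["uuid"])
```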
Show import job information after it was finished
Path(s)
GET annis/admin/import/status/finished/{uuid}
The {uuid} is the unique identifier of the import job, which was returned by the import function.
Responses
Code 200
If the import finished, a 200 HTTP status code is sent and a proper description of the import job is returned. After this resource has been successfully accessed once, a 404 HTTP status code will be sent on subsequent requests.
The response has the MIME type application/xml and the following format:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importJob>
<!-- visible caption, e.g. corpus name -->
<caption>MyNewCorpus</caption>
<!-- A list of output messages from the import process-->
<messages>
<m>first message</m>
<m>second message</m>
<m>just another message</m>
</messages>
<!-- true if the corpus will be overwritten -->
<overwrite>true</overwrite>
<!-- current status, can be WAITING, RUNNING, SUCCESS or ERROR -->
<status>RUNNING</status>
<!-- a unique identifier for this import job -->
<uuid>7799322d-83ec-4900-83b0-c542e2ca2137</uuid>
<!-- a mail address to which status reports should be sent -->
<statusMail>mail@example.com</statusMail>
<!-- alias name of the corpus as defined by the import request -->
<alias>CorpusAlias</alias>
</importJob>
Code 404
When the import is not finished yet, or if it was finished and its status has already been queried, a 404 HTTP status code will be sent.
User management
Path(s)
GET annis/admin/users/{userName}
PUT annis/admin/users/{userName}
Gets information about an existing user with the name {userName} (1) or updates/creates a new user with this name (2).
Request body
The PUT action accepts the user information in XML (MIME type application/xml).
The fields correspond to the fields of the single user configuration file.
Please have a look at the general user configuration information in the ANNIS user guide for a more detailed explanation.
<user>
<!-- User name (must be the same as the "userName" parameter) -->
<name>myusername</name>
<!-- hashed password in the Shiro1CryptFormat -->
<passwordHash>$shiro1$SHA-256$1$tQNwU[...]</passwordHash>
<!-- A list of groups the users should belong to. -->
<group>group1</group>
<group>group2</group>
<group>group3</group>
<!-- A list of explicit permission the users should have. -->
<permission>admin:*</permission>
<permission>query:*</permission>
<!-- Optional expiration date encoded in the ISO-8601 standard -->
<expires>2015-02-12T00:00:00.000+01:00</expires>
</user>
Responses
Code 200
If the GET variant is used and the queried user exists, return the user information with the MIME type application/xml.
The fields correspond to the fields of the single user configuration file.
Please have a look at the general user configuration information in the ANNIS user guide for a more detailed explanation.
<user>
<!-- User name (must be the same as the "userName" parameter) -->
<name>myusername</name>
<!-- hashed password in the Shiro1CryptFormat -->
<passwordHash>$shiro1$SHA-256$1$tQNwU[...]</passwordHash>
<!-- A list of groups the users should belong to. -->
<group>group1</group>
<group>group2</group>
<group>group3</group>
<!-- A list of explicit permission the users should have. -->
<permission>admin:*</permission>
<permission>query:*</permission>
<!-- Optional expiration date encoded in the ISO-8601 standard -->
<expires>2015-02-12T00:00:00.000+01:00</expires>
</user>
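The user document for the PUT variant can be generated instead of written by hand. The Python sketch below builds it with the standard library, using the element names from the example above; the function name is hypothetical, and the password hash is the elided placeholder from the example, not a usable value.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional


def user_xml(name: str, password_hash: str, groups: Iterable[str] = (),
             permissions: Iterable[str] = (), expires: Optional[str] = None) -> bytes:
    """Serialize a user description in the XML layout shown above."""
    user = ET.Element("user")
    ET.SubElement(user, "name").text = name
    ET.SubElement(user, "passwordHash").text = password_hash
    for group in groups:
        ET.SubElement(user, "group").text = group
    for permission in permissions:
        ET.SubElement(user, "permission").text = permission
    if expires is not None:
        # ISO-8601 date string, e.g. 2015-02-12T00:00:00.000+01:00
        ET.SubElement(user, "expires").text = expires
    return ET.tostring(user, encoding="utf-8")


body = user_xml(
    "myusername",
    "$shiro1$SHA-256$1$tQNwU[...]",  # placeholder copied from the example
    groups=["group1", "group2"],
    permissions=["query:*"],
)
print(body.decode("utf-8"))
```

The resulting bytes could be sent as the body of a PUT request to annis/admin/users/myusername with the Content-Type application/xml; the name element must match the {userName} in the path.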