ANNIS import format version 3.3

The ANNIS import format is inspired by the Salt meta-data model, and ANNIS uses Salt internally to represent a matched graph from the database. However, ANNIS has some restrictions that Salt does not have:

node names must be unique per document
document names must be unique per top-level corpus
an ANNIS corpus contains only one top-level corpus
there is no meta-data for nodes
string identifiers such as annotation or layer names have a limited number of allowed characters and should match the regular expression `[a-zA-Z_][a-zA-Z0-9_-]*` in order to be searchable with AQL
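
The identifier restriction can be checked before export. The following is a minimal sketch in Python; the function name `is_searchable_identifier` is hypothetical and not part of ANNIS, and only the regular expression itself is taken from the text above.

```python
import re

# Pattern quoted from the restriction on string identifiers.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")


def is_searchable_identifier(name: str) -> bool:
    """Return True if the whole name matches the identifier pattern."""
    return IDENTIFIER.fullmatch(name) is not None


for candidate in ["pos", "_lemma", "my-layer2", "2nd_layer", "part of speech", ""]:
    print(repr(candidate), is_searchable_identifier(candidate))
```

`fullmatch` is used so that a name with a valid prefix but invalid trailing characters is rejected as a whole.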

An ANNIS corpus is either a zip file or a directory that contains the following files:

`annis.version`

The first line is exactly "3.3"; the following lines can contain human-readable text.

Pure UTF-8 encoded text file
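
A version check along these lines could be sketched as follows. The helper name `has_expected_version` and the demo directory are assumptions made for illustration; the sketch only strips the line terminator, so any other extra character on the first line counts as a mismatch.

```python
import tempfile
from pathlib import Path

EXPECTED_VERSION = "3.3"


def has_expected_version(corpus_dir: Path) -> bool:
    """Check that the first line of annis.version is exactly the expected version."""
    version_file = corpus_dir / "annis.version"
    with version_file.open(encoding="utf-8") as handle:
        first_line = handle.readline()
    # Remove only the line terminator; everything else must match exactly.
    return first_line.rstrip("\r\n") == EXPECTED_VERSION


# Demo with a throw-away directory standing in for an unpacked corpus.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "annis.version").write_text("3.3\nexported for testing\n", encoding="utf-8")
    print(has_expected_version(demo_dir))
```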

`corpus.annis`

Contains structural information about the corpus and its documents.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	integer	X	X	primary key
name	text	X	X	unique name (per corpus)
type	text		X	CORPUS, DOCUMENT
version	text			version number (currently not used)
pre	integer		X	pre order of the corpus tree
post	integer		X	post order of the corpus tree
top_level	boolean		X	true for the toplevel corpus
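
To illustrate the TAB-separated layout, here is a simplified reader for one line in the PostgreSQL text format. It is a sketch under stated assumptions: it handles `\N` as NULL and the single-character backslash escapes, but not the octal and hexadecimal escapes that PostgreSQL also accepts. The function names and the sample row are made up for this example.

```python
from typing import List, Optional

# Single-character backslash escapes of the PostgreSQL text format.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}


def decode_field(raw: str) -> Optional[str]:
    """Decode one field; the marker \\N stands for NULL and becomes None."""
    if raw == r"\N":
        return None
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Unknown escapes stand for the escaped character itself.
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def parse_line(line: str) -> List[Optional[str]]:
    """Split a line on literal TABs (TABs inside values are escaped as \\t)."""
    return [decode_field(field) for field in line.rstrip("\r\n").split("\t")]


# Made-up corpus.annis row: id, name, type, version (NULL), pre, post, top_level.
print(parse_line("0\tmy_corpus\tCORPUS\t\\N\t0\t1\tTRUE\n"))
```

Checking for `\N` before unescaping matters: a value that literally contains a backslash followed by `N` is written with a doubled backslash and must not be mistaken for NULL.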

`corpus_annotation.annis`

Contains meta-data on the corpus and the documents.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	unique	description
corpus_ref	integer	X	foreign key to corpus.id
namespace	text
name	text
value	text

`text.annis`

Describes all texts that are included in the corpus.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
corpus_ref	integer	X	foreign key to corpus.id. The corpus id should be the id of the document in the corpus table
id	integer	X	restart from 0 for every corpus_ref
name	text		name of the text
text	text		content of the text

primary key: corpus_ref, id

`node.annis`

Every node in the corpus will have exactly one entry in this table.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	bigint	X	X	primary key
text_ref	integer		X	foreign key to text.id
corpus_ref	integer		X	foreign key to corpus.id
layer	text
name	text			A human readable identifier of the node. Must be unique for each document.
left	integer		X	position of first covered character
right	integer		X	position of the character after the last covered character
token_index	integer			index of this token (if it is a token, otherwise NULL)
left_token	integer		X	index of the first covered token; for a token, this value is its token_index
right_token	integer		X	index of the last covered token; for a token, this value is its token_index
seg_index	integer			index of this segment (if it is a segment, i.e. there is some SOrderingRelation connected to this node)
seg_name	text			name of the segment path this segment belongs to
span	text			for tokens or node with a segmentation index: substring of the covered original text
root	boolean		X	True if this node has no parents in all components
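
The character and token columns can be illustrated with a small sketch that derives them for the tokens of one text. It assumes a plain whitespace tokenizer and zero-based character positions, neither of which is prescribed by the table above; `token_rows` is a hypothetical helper that produces only the position-related columns.

```python
import re
from typing import Dict, List


def token_rows(text: str) -> List[Dict[str, object]]:
    """Derive the position columns for whitespace-separated tokens of a text."""
    rows = []
    for index, match in enumerate(re.finditer(r"\S+", text)):
        rows.append({
            "left": match.start(),   # position of the first covered character
            "right": match.end(),    # position after the last covered character
            "token_index": index,
            "left_token": index,     # for a token, both equal its token_index
            "right_token": index,
            "span": match.group(),   # substring of the covered original text
        })
    return rows


text = "Tiny example text"
for row in token_rows(text):
    # With an exclusive right boundary the span is a plain slice of the text.
    assert text[row["left"]:row["right"]] == row["span"]
    print(row)
```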

`component.annis`

Lists the components (connected sub-graphs) of the graph.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	bigint	X	X	primary key
type	char(1)			edge type: c (coverage), d (dominance), p (pointing relation)
layer	text		X	Could be set to e.g. "default_layer" if not in any Salt layer
name	text			The sType of the component, e.g. anaphoric for some kind of pointing relation component

`rank.annis`

A rank entry describes one of the positions of a node in a component tree. There is one rank entry for each edge. Furthermore, every component has a virtual relation and thus an additional rank entry where the parent attribute is NULL and the level is 0.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	unique	not NULL	description
id	bigint	X	X	primary key
pre	integer		X	the pre-order of the target node. The root of the component tree should always have a pre-order of 0
post	integer		X	the post-order of the target node
node_ref	bigint		X	the node.id of the target node
component_ref	bigint		X	foreign key to component.id
parent	bigint			id of the parent rank entry
level	integer			level of this rank entry (not node!) in the component tree
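
As a rough illustration, rank entries for one component tree could be generated with a depth-first traversal. This sketch assumes that pre and post values are drawn from one shared counter that is incremented when a node is entered and again when it is left; the table above only fixes the root's pre-order at 0, so that numbering convention is an assumption. The function name, the sequential rank ids, and the node ids in the demo are hypothetical.

```python
from typing import Dict, List, Optional


def rank_entries(root: int, children: Dict[int, List[int]]) -> List[dict]:
    """Create one rank entry per visited position, starting at the root node."""
    entries: List[dict] = []
    counter = 0  # assumed: shared by pre and post values

    def visit(node: int, parent_rank: Optional[int], level: int) -> None:
        nonlocal counter
        entry = {"id": len(entries), "pre": counter, "post": None,
                 "node_ref": node, "parent": parent_rank, "level": level}
        counter += 1
        entries.append(entry)
        for child in children.get(node, []):
            visit(child, entry["id"], level + 1)
        entry["post"] = counter
        counter += 1

    # The root entry plays the role of the virtual relation: parent NULL, level 0.
    visit(root, None, 0)
    return entries


# Demo tree: node 10 has children 11 and 12, node 11 has child 13.
for entry in rank_entries(10, {10: [11, 12], 11: [13]}):
    print(entry)
```

Under this convention, an entry is an ancestor of another exactly when its pre value is smaller and its post value is larger, which is what makes the pre/post encoding convenient for tree queries.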

Rank entries in components of type 'c' (coverage spans) must be omitted if the referenced node is a token and the parent coverage span is continuous. Continuous means that the range of covered tokens has no gaps, so it includes all tokens between the first and the last covered token. The idea is that, for a continuous span, the needed information can be recovered from the span's "left_token" and "right_token" together with the tokens' "token_index" (all in the node.annis table).
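
The continuity rule can be expressed in a few lines. The sketch below uses hypothetical helper names and assumes the covered tokens of a span are available as a collection of token_index values; a span without any covered token is rejected because the rule does not address that case.

```python
from typing import Iterable, List


def is_continuous(covered_token_indexes: Iterable[int]) -> bool:
    """True if the covered token indexes form a range without gaps."""
    indexes = sorted(set(covered_token_indexes))
    if not indexes:
        raise ValueError("a coverage span must cover at least one token")
    return indexes[-1] - indexes[0] + 1 == len(indexes)


def covered_tokens(left_token: int, right_token: int) -> List[int]:
    """Recover the token indexes of a continuous span from its node columns."""
    # right_token is the index of the last covered token, so it is inclusive.
    return list(range(left_token, right_token + 1))


print(is_continuous([3, 4, 5]))  # no gap: rank entries to the tokens are omitted
print(is_continuous([3, 5]))     # token 4 is missing: rank entries are needed
print(covered_tokens(3, 5))
```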

`node_annotation.annis`

Contains all annotations per node.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
node_ref	bigint	X	foreign key to node.id
namespace	text
name	text	X
value	text

unique(node_ref, namespace, name)
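
A check for the uniqueness condition might look like the following sketch. The row layout follows the column order of the table above; the helper name `duplicate_keys` and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# node_ref, namespace, name, value
Row = Tuple[int, Optional[str], str, Optional[str]]
Key = Tuple[int, Optional[str], str]


def duplicate_keys(rows: Iterable[Row]) -> List[Key]:
    """Return every (node_ref, namespace, name) key that occurs more than once."""
    # Note: a missing namespace (None) is compared like any other value here.
    counts = Counter((node_ref, namespace, name) for node_ref, namespace, name, _ in rows)
    return [key for key, count in counts.items() if count > 1]


rows = [
    (1, "tiger", "pos", "NN"),
    (1, "tiger", "lemma", "Haus"),
    (1, "tiger", "pos", "NE"),   # same key as the first row
    (2, "tiger", "pos", "VVFIN"),
]
print(duplicate_keys(rows))
```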

`edge_annotation.annis`

Contains all annotations per edge (which is represented by a rank entry).

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
rank_ref	bigint	X	foreign key to rank.id
namespace	text
name	text	X
value	text

`resolver_vis_map.annis`

Describes which visualizers to trigger depending on the namespace of a node or edge occurring in the search results.

TAB-separated file as described in the "Text Format" section of the PostgreSQL documentation

column	type	not NULL	description
corpus	text		the name of the supercorpus
version	text		the version of the corpus
namespace	text		the several layers of the corpus
element	text		the type of the entry: "node" or "edge"
vis_type	text	X	the abstract type of visualization: "tree", "discourse", "grid", ...
display_name	text	X	the name of the layer which shall be shown for display
visibility	text		either "permanent", "visible", "hidden", "removed" or "preloaded", default is "hidden"
order	bigint		the order of the layers, in which they shall be shown
mappings	text

`ExtData` folder

Contains the media files that are connected with this corpus including their binary content.

Each file directly inside the ExtData folder belongs to the toplevel corpus. A sub-folder corresponds to a document and each file inside a sub-folder belongs to the document with the same name.
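
The folder convention can be sketched as a small directory scan. The function name `media_files_by_owner` and the demo file names are hypothetical. The sketch uses `None` as the key for the toplevel corpus and ignores anything nested deeper than one sub-folder, which is an assumption because the text does not say how deeper nesting is treated.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def media_files_by_owner(ext_data: Path) -> Dict[Optional[str], List[str]]:
    """Map None (toplevel corpus) or a document name to its media file names."""
    owners: Dict[Optional[str], List[str]] = {}
    for entry in sorted(ext_data.iterdir()):
        if entry.is_file():
            # Files directly inside ExtData belong to the toplevel corpus.
            owners.setdefault(None, []).append(entry.name)
        elif entry.is_dir():
            # A sub-folder is named after the document that owns its files.
            owners[entry.name] = sorted(p.name for p in entry.iterdir() if p.is_file())
    return owners


# Demo with a throw-away folder layout.
with tempfile.TemporaryDirectory() as tmp:
    ext_data = Path(tmp) / "ExtData"
    (ext_data / "doc1").mkdir(parents=True)
    (ext_data / "intro.mp3").write_bytes(b"")
    (ext_data / "doc1" / "recording.wav").write_bytes(b"")
    print(media_files_by_owner(ext_data))
```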
