ANNIS import format version 3.3
The ANNIS import format is inspired by the Salt meta-data model an ANNIS uses Salt internally to represent a matched graph from the database. However there are some restrictions which ANNIS has but Salt doesn't.
- node names must be unique per document
- document names must be unique per top-level corpus
- a ANNIS corpus contains only one top-level corpus
- there are no meta-data for nodes
- string identifiers such as annotation or layer names have a limited number of allowed characters and should match the regular expression \verbatim [a-zA-Z_][a-zA-Z0-9_-]* \endverbatim in order to be searchable with AQL
An ANNIS corpus can be either a zip file or a directory which includes the following files
annis.version
First line is exactly "3.3", the next lines can contain human readable text.
Pure UTF-8 encoded text file
corpus.annis
Contains structural information about the corpus and its documents.
*TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | integer | X | X | primary key |
name | text | X | X | unique name (per corpus) |
type | text | X | CORPUS, DOCUMENT | |
version | text | version number (currently not used) | ||
pre | integer | X | pre order of the corpus tree | |
post | integer | X | post order of the corpus tree | |
top_level | boolean | X | true for the toplevel corpus |
corpus_annotation.annis
Contains meta-data on the corpus and the documents.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
corpus_ref | integer | X | foreign key to corpus.id | |
namespace | text | |||
name | text | |||
value | text |
text.annis
Describes all texts that are included in the corpus.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
corpus_ref | integer | X | foreign key to corpus.id. The corpus id should be the id of the document in the corpus table | |
id | integer | X | restart from 0 for every corpus_ref | |
name | text | name of the text | ||
text | text | content of the text |
primary key: corpus_ref, id
node.annis
Every node in the corpus will have exactly one entry in this table.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | bigint | X | X | primary key |
text_ref | integer | X | foreign key to text.id | |
corpus_ref | integer | X | foreign key to corpus.id | |
layer | text | |||
name | text | A human readable identfier of the node. Must be unique for each document. | ||
left | integer | X | position of first covered character | |
right | integer | X | position of the character after the last covered character | |
token_index | integer | index of this token (if it is a token, otherwise NULL) | ||
left_token | integer | X | index of first covered token, for token, this value is the token_index | |
right_token | integer | X | index of last covered token, for token, this value is the token_index | |
seg_index | integer | index of this segment (if it is a segment, i.e. there is some SOrderingRelation connected to this node) | ||
seg_name | text | name of the segment path this segment belongs to | ||
span | text | for tokens or node with a segmentation index: substring of the covered original text | ||
root | boolean | X | True if this node has no parents in all components |
component.annis
Lists the components (connected sub-graphs) of the graph.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | bigint | X | X | primary key |
type | char(1) | edge type: c, d, p | ||
layer | text | X | Could be set to e.g. "default_layer" if not in any Salt layer | |
name | text | The sType of the component, e.g anaphoric for a some kind of pointing relation component |
rank.annis
A rank entry describes one of the positions of a node in a component tree. There is one rank entry for each edge. Furthermore, every component has a virtual relation and thus an additional rank entry where the parent attribute is NULL and the level is 0.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
id | bigint | X | X | primary key |
pre | integer | X | the preorder of the target node. the root of the component tree should always have a pre-order of 0 | |
post | integer | X | the post-order or the target node | |
node_ref | bigint | X | the node.id of the target node | |
component_ref | bigint | X | ||
parent | bigint | id of the parent rank entry | ||
level | integer | level of this rank entry (not node!) in the component tree |
Rank entries with the type 'c' (coverage spans) must be ommitted, if the referenced node is a token and if the parent coverage span is continuous. Continuous means the range of covered token has no gaps, thus it includes all token between the first and the last covered token. The idea behind this is, that you can recover the needed information using the "left_token" and "right_token" from the span together with the "token_index" (all in the node.annis table) if the span is continuous.
node_annotation.annis
Contains all annotations per node.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
node_ref | bigint | X | foreign key to _node.id | |
namespace | text | |||
name | text | X | ||
value | text |
unique(node_ref, namespace, name)
edge_annotation.annis
Contains all annotations per edge (which is represented by a rank entry)
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
rank_ref | bigint | X | foreign key to rank.id | |
namespace | text | |||
name | text | X | ||
value | text |
resolver_vis_map.annis
Describes which visualizers to trigger depending of the namespace of a node or edge occuring in the search results.
TAB-separated file as described in "Text Format" section of the PostgreSQL documentation
column | type | unique | not NULL | description |
---|---|---|---|---|
corpus | text | the name of the supercorpus | ||
version | text | the version of the corpus | ||
namespace | text | the several layers of the corpus | ||
element | text | the type of the entry: "node" or "edge" | ||
vis_type | text | X | the abstract type of visualization: "tree", "discourse", "grid", ... | |
display_name | text | X | the name of the layer which shall be shown for display | |
visibility | text | either "permanent", "visible", "hidden", "removed" or "preloaded", default is "hidden" | ||
order | bigint | the order of the layers, in which they shall be shown | ||
mappings | text |
ExtData
folder
Contains the media files that are connected with this corpus including their binary content.
Each file directly inside the ExtData folder belongs to the toplevel corpus. A sub-folder corresponds to a document and each file inside a sub-folder belongs to the document with the same name.