Datamodel overview
PAULA projects are graphs dominated by a top level node refered to as a
corpus
. Corpus objects comprise graphs of one or more
annotated document
objects, optionally organized within a
tree of subcorpus
objects. The tree of corpus, subcorpora
and documents corresponds to a file system folder tree. Corpora,
subcorpora and documents can all receive metadata
annotations.
All documents must contain at least one source of primary text data
, possibly more in cases of parallel
corpora or dialogue data, and at
least one tokenization
of this data. Tokenized data may be annotated directly using features called feat
, such as parts-of-speech, lemmatization, etc. Further hierarchical structures can
be built on top of tokens using flat span objects called mark
(i.e. markables) or hierarchically nestable objects called
struct
(i.e. structures), which may also be annotated with
feat
objects. The type of node or annotation (part-of-speech,
phrase-category etc.) is given by the type attribute of each set of
nodes or annotations.
Beyond the edges resulting from the construction of hierarchies through
structs, further non-hierarchical edges may be defined between any two
nodes in a document using pointing relations. Both edges connecting
structs to tokens or other structs and pointing relations may be
annotated using feats and given a type. All objects and annotations
below the document level may carry a PAULA namespace
bundling relevant annotation layers which belong together under a common
identifier (note that these are not identical with XML namespaces). The
following two figures give an overview of this general data model for
the corpus/document structure and the structure of objects within them.
For details and examples of the individual model elements and their
specific XML serialization see the next chapters.