Spans and markables
Introduction to spans and markables
In PAULA it is possible to define spans of data for further annotation.
Spans are defined using the mark element, which stands for markable and
has two primary functions: defining a tokenization for primary text data,
and defining a non-terminal annotation span node above the token level.
Tokenizations and token markables
A tokenization forms a minimal level of analysis that segments a primary
text data file into units that can be annotated further. It is not
possible to directly annotate text that is not tokenized, and every PAULA
document must contain at least one tokenization. It is possible to include
whitespace characters within the primary data and then ignore these
characters while tokenizing, so that adjacent tokens are not interrupted by
any characters on the tokenized level.
The following example illustrates this principle.
Tokenization of the primary text data
"This is an example."
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_mark.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1_tok"/>
<markList xmlns:xlink="http://www.w3.org/1999/xlink" type="tok"
xml:base="mycorpus.doc1.text.xml">
<mark id="tok_1"
xlink:href="#xpointer(string-range(//body,'',1,4))"/><!-- This -->
<mark id="tok_2"
xlink:href="#xpointer(string-range(//body,'',6,2))"/><!-- is -->
<mark id="tok_3"
xlink:href="#xpointer(string-range(//body,'',9,2))"/><!-- an -->
<mark id="tok_4"
xlink:href="#xpointer(string-range(//body,'',12,7))"/><!--example-->
<mark id="tok_5"
xlink:href="#xpointer(string-range(//body,'',19,1))"/><!-- . -->
</markList>
</paula>
The first token element, with the id "tok_1", begins at the first character of the text (the letter "T") and covers a total of 4 characters: "This". Character 5 is a space, which has not been tokenized. The next token, "tok_2", begins at character 6 and covers 2 characters: "is".
It is also possible to define tokens with no textual extension, i.e. empty tokens. Such tokens have a string range spanning zero characters. However, they must still have an anchor position within the text. The following example illustrates an empty token in the sentence "he takes people out to fish", where the unrealized subject of "to fish" is tokenized between "out" and "to" with a span of zero characters.
Tokenization of the primary data "he takes people out to fish"
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_mark.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc2_tok"/>
<markList xmlns:xlink="http://www.w3.org/1999/xlink" type="tok"
xml:base="mycorpus.doc2.text.xml">
<mark id="tok_1"
xlink:href="#xpointer(string-range(//body,'',1,2))"/><!-- he -->
<mark id="tok_2"
xlink:href="#xpointer(string-range(//body,'',4,5))"/><!-- takes -->
<mark id="tok_3"
xlink:href="#xpointer(string-range(//body,'',10,6))"/><!--people-->
<mark id="tok_4"
xlink:href="#xpointer(string-range(//body,'',17,3))"/><!-- out -->
<mark id="tok_5"
xlink:href="#xpointer(string-range(//body,'',21,0))"/><!-- -->
<mark id="tok_6"
xlink:href="#xpointer(string-range(//body,'',22,2))"/><!-- to -->
<mark id="tok_7"
xlink:href="#xpointer(string-range(//body,'',25,4))"/><!--fish-->
</markList>
</paula>
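The relationship between primary text and string-range offsets can be sketched in a few lines of Python. This is a minimal illustration rather than part of PAULA itself: the tokenization rule (runs of word characters, or single punctuation marks) and the helper names token_ranges, string_range_href and covered_text are assumptions made for this example. Start offsets are 1-based, as in the listings above, and a length of zero yields an empty token.

```python
import re

# Assumption: tokens are runs of word characters or single punctuation marks.
# PAULA does not prescribe a tokenizer; this rule is only for illustration.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def token_ranges(text):
    """Return (start, length) pairs with 1-based start offsets."""
    return [(m.start() + 1, len(m.group())) for m in TOKEN_PATTERN.finditer(text)]


def string_range_href(start, length):
    """Format an href in the same shape as the mark elements above."""
    return "#xpointer(string-range(//body,'',%d,%d))" % (start, length)


def covered_text(text, start, length):
    """Return the characters covered by a range; a length of 0 yields ''."""
    return text[start - 1:start - 1 + length]


text = "This is an example."
for number, (start, length) in enumerate(token_ranges(text), 1):
    print("tok_%d" % number, string_range_href(start, length),
          repr(covered_text(text, start, length)))
```

Whitespace never appears in any range, which is why adjacent tokens are uninterrupted on the tokenized level even though the primary text contains spaces.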
Although a PAULA tokenization file is defined with reference to the
general markable DTD paula_mark.dtd, it is distinguished from other types
of markables, specifically annotation markables, in two ways. First, the
@type attribute of the markList element must be set to the value tok.
Second, a tokenization can only refer to a primary text data file. It is
not possible to define a token pointing to a more complex structure (e.g.
another markable or token).
As of PAULA version 1.1 it is possible to have multiple primary text data
files, each of which must then be tokenized. Multiple tokenizations of the
same primary text data are not possible in PAULA 1.1, but are planned as
part of a future version of PAULA XML.
Annotation span markables
The element mark may be used to group together a set of tokens for further
annotation. This is usually done in order to annotate a certain
feature-value pair which applies to these tokens. Span annotations
therefore have the semantics of attribution within the graph structure,
i.e. stating that an area of the data has a certain property or attribute.
These attributes are realized in PAULA using feat annotation files, one or
more of which can apply to any span defined by a markable. Span markables
are defined with reference to the DTD paula_mark.dtd. The type of markable
being annotated (e.g. a referent or referring expression in a discourse, a
chunk for chunking annotation, etc.) is given by the @type attribute of the
markList element, and may be any string value other than "tok", which is
reserved for tokenizations. Other values are not ruled out by the format,
but it is recommended to use types that follow XML element naming
conventions, i.e. strings that contain only alphanumeric ASCII characters
with no spaces and beginning with an alphabetic character.
Markables may be continuous or discontinuous, i.e. they may apply to a set of consecutive tokens or to non-consecutive tokens. The following example illustrates both types of markables in a single file with the type "chunk".
Markables of the type "chunk" above a set of six tokens: "I", "'ve", "picked", "the", "kids", "up"
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_mark.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1_chunk_seg"/>
<markList xmlns:xlink="http://www.w3.org/1999/xlink" type="chunk"
xml:base="mycorpus.doc1.tok.xml">
<!-- I -->
<mark id="chunk_1" xlink:href="#tok_1"/>
<!-- 've picked...up -->
<mark id="chunk_2"
xlink:href="(#xpointer(id('tok_2')/range-to(id('tok_3'))),#tok_6)"/>
<!-- the kids -->
<mark id="chunk_3"
xlink:href="#xpointer(id('tok_4')/range-to(id('tok_5')))"/>
</markList>
</paula>
In the example, three markables have been defined which refer to six
tokens in the token file mycorpus.doc1.tok.xml, as entered in the markList
element's @xml:base attribute. The first markable, "chunk_1", points to
"#tok_1" in the token file, which covers the string "I". The third
markable, "chunk_3", points to a range of consecutive tokens, from "tok_4"
to "tok_5", which covers the words "the kids". The chunk in the middle,
"chunk_2", points to a discontinuous set of tokens, namely the range
"tok_2" to "tok_3" and a further individual token "tok_6", corresponding
to the tokens "'ve picked" and the later token "up". These markables cannot
be annotated further within this file (e.g. with the type of chunk as
nominal, verbal, etc.). Further annotation of the markables beyond the
markable list @type must be added in separate files as feat annotations.
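The three href shapes used in the listing above (a single token reference, a range-to expression, and a parenthesized comma-separated list) can be resolved to token ids roughly as follows. This is a sketch under stated assumptions: the function name resolve_href is hypothetical, the regular expression only covers the shapes shown in this section, and the document order of tokens is passed in as a plain list rather than read from the token file.

```python
import re

# Matches either an id-based range-to expression or a plain "#id" reference.
# Only the href shapes shown in this section are covered (an assumption).
_PART = re.compile(
    r"#xpointer\(id\('([^']+)'\)/range-to\(id\('([^']+)'\)\)\)"
    r"|#([A-Za-z_][\w.-]*)"
)


def resolve_href(href, token_order):
    """Return the token ids covered by a markable href, in the order listed."""
    covered = []
    for start, end, single in _PART.findall(href):
        if single:
            covered.append(single)
        else:
            first = token_order.index(start)
            last = token_order.index(end)
            covered.extend(token_order[first:last + 1])
    return covered


# Token ids in document order, as they would appear in the token file.
tokens = ["tok_1", "tok_2", "tok_3", "tok_4", "tok_5", "tok_6"]

print(resolve_href("#tok_1", tokens))
print(resolve_href(
    "(#xpointer(id('tok_2')/range-to(id('tok_3'))),#tok_6)", tokens))
```

A range can only be expanded with knowledge of the token order, so a reader of a markable file always needs the file named in @xml:base as well.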
Note that the markable type is set once in the markList element for all
markables in the file. To define markables of a different type, a separate
markable file must be generated. Separate files are not required to have
the same segmentations and constitute independent layers of annotation.
Feats
The element feat and corresponding feat files represent arbitrary
key-value feature annotations, such as parts of speech or syntactic
categories, but also metadata, which may be applied to a variety of
elements. They can be applied to mark elements to annotate spans of tokens
or even tokens directly, but also to struct elements as part of
hierarchical annotations, or to annoSet elements as metadata annotation.
The following two examples illustrate feature annotation of spans and
tokens. For other uses see metadata and annotating structs.
In the next example a featList with the @type "pos" contains six feat
elements, each annotating a single token with its part of speech in the
@value attribute.
Annotating tokens with feat annotations for part of speech
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1_pos"/>
<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="pos"
xml:base="mycorpus.doc1.tok.xml">
<feat xlink:href="#tok_1" value="PP"/><!-- I -->
<feat xlink:href="#tok_2" value="VBP"/><!-- 've -->
<feat xlink:href="#tok_3" value="VBN"/><!-- picked -->
<feat xlink:href="#tok_4" value="DT"/><!-- the -->
<feat xlink:href="#tok_5" value="NNS"/><!-- kids -->
<feat xlink:href="#tok_6" value="RP"/><!-- up -->
</featList>
</paula>
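Reading such a file back amounts to collecting one value per referenced element. The following sketch uses Python's standard xml.etree.ElementTree module; the function name read_feats is hypothetical, and the embedded document is an abridged copy of the listing above, with the XML declaration and DOCTYPE left out so that the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# Namespaced attribute names as ElementTree reports them.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Abridged copy of the listing above (declaration and DOCTYPE omitted).
FEAT_XML = """<paula version="1.1">
<header paula_id="mycorpus.doc1_pos"/>
<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="pos"
          xml:base="mycorpus.doc1.tok.xml">
<feat xlink:href="#tok_1" value="PP"/>
<feat xlink:href="#tok_2" value="VBP"/>
</featList>
</paula>"""


def read_feats(xml_text):
    """Return (annotation type, base file, {target id: value})."""
    root = ET.fromstring(xml_text)
    feat_list = root.find("featList")
    values = {
        feat.get(XLINK_HREF).lstrip("#"): feat.get("value")
        for feat in feat_list.iter("feat")
    }
    return feat_list.get("type"), feat_list.get(XML_BASE), values


annotation_type, base_file, values = read_feats(FEAT_XML)
print(annotation_type, base_file, values)
```

The annotation name comes from the @type of the featList and the target file from @xml:base, so each feat element only needs to carry a reference and a value.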
It is also possible to annotate more than one token at a time by using annotation span markables, which cover one or more tokens each. In this case the features do not refer to a token file, but to a markable file which in turn refers to tokens. The following example illustrates the annotation of such spans, which works in much the same way as the annotation of tokens.
Annotating spans from a markable file with feat annotations for chunk type
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1_chunk_seg_chunk_type"/>
<featList xmlns:xlink="http://www.w3.org/1999/xlink"
type="chunk_type" xml:base="mycorpus.doc1.chunk_seg.xml">
<feat xlink:href="#chunk_1" value="N"/><!-- I -->
<feat xlink:href="#chunk_2" value="V"/><!-- 've picked _ up -->
<feat xlink:href="#chunk_3" value="N"/><!-- the kids -->
</featList>
</paula>
In this case, three features of the type "chunk_type" have been assigned
to three markables in the file mycorpus.doc1.chunk_seg.xml. The
"chunk_type" of the first markable is given the value "N". The second
markable receives the "chunk_type" "V" and the third is "N" again. Note
that the tokens covered by the respective markables are not defined here,
though comments to the right of each element can help keep track of the
text covered by each annotation. The actual tokens covered by each markable
are defined in the separate file mycorpus.doc1.chunk_seg.xml. There is also
no necessary connection between the type of feature and the type of
markable, though in many cases it makes sense to give them similar names,
e.g. markables called "chunk" and an annotation "chunk_type" (see also
naming conventions).
Multifeats
In cases where multiple annotations always apply to the same nodes, it may
be more economical to specify multiple, usually related annotations in the
same file. This is made possible by the use of multiFeat files, together
with the associated paula_multiFeat.dtd. Each multiFeat contains multiple
feat annotations applying to the element specified in the @xlink:href
attribute of the multiFeat element. Since the multiFeat itself is not an
actual annotation, but a container for other annotations, the
multiFeatList element is conventionally given the type "multiFeat". The
example below illustrates the use of multiFeat annotations.
Specifying multiple annotations using multiFeat elements
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_multiFeat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1.tok_multiFeat"/>
<multiFeatList xmlns:xlink="http://www.w3.org/1999/xlink"
type="multiFeat" xml:base="mycorpus.doc1.tok.xml">
<multiFeat xlink:href="#tok_1"> <!-- I -->
<feat name="pos" value="PPER"/>
<feat name="lemma" value="I"/>
</multiFeat>
<multiFeat xlink:href="#tok_2"> <!-- 've -->
<feat name="pos" value="VBP"/>
<feat name="lemma" value="have"/>
</multiFeat>
<!-- ... -->
</multiFeatList>
</paula>
Note that there is no difference from the data model point of view between
the use of multiple feat files or one multiFeat file specifying the same
annotation types. Note also that when using namespaces, all annotations in
a multiFeat have the same namespace, determined by the multiFeat file name.
While it is possible to have different annotations in different multiFeat
elements in the same file, it is recommended to avoid this, as it can
quickly become confusing. The use of multiFeat annotations can also make it
difficult to add, remove and edit annotations after the fact, since
separate annotation layers are mixed in one XML file.
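The equivalence between one multiFeat file and several feat files can be made concrete by splitting a multiFeat document into one mapping per annotation name. This is an illustrative sketch using Python's standard xml.etree.ElementTree module: the function name split_multifeats is hypothetical, and the embedded document is a copy of the multiFeat listing in this section without the XML declaration, DOCTYPE and comments.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Copy of the multiFeat listing (declaration, DOCTYPE and comments omitted).
MULTIFEAT_XML = """<paula version="1.1">
<header paula_id="mycorpus.doc1.tok_multiFeat"/>
<multiFeatList xmlns:xlink="http://www.w3.org/1999/xlink"
               type="multiFeat" xml:base="mycorpus.doc1.tok.xml">
<multiFeat xlink:href="#tok_1">
<feat name="pos" value="PPER"/>
<feat name="lemma" value="I"/>
</multiFeat>
<multiFeat xlink:href="#tok_2">
<feat name="pos" value="VBP"/>
<feat name="lemma" value="have"/>
</multiFeat>
</multiFeatList>
</paula>"""


def split_multifeats(xml_text):
    """Return {annotation name: {target id: value}}, one entry per layer."""
    layers = defaultdict(dict)
    root = ET.fromstring(xml_text)
    for multi in root.iter("multiFeat"):
        target = multi.get(XLINK_HREF).lstrip("#")
        for feat in multi.findall("feat"):
            layers[feat.get("name")][target] = feat.get("value")
    return dict(layers)


for name, values in split_multifeats(MULTIFEAT_XML).items():
    print(name, values)
```

Each resulting mapping holds the same information that a separate feat file of that type would contain, which is the sense in which the two representations do not differ in the data model.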