Hierarchical structures

Hierarchical structures are used in PAULA for two different purposes: for the creation of hierarchically nested annotation graphs (e.g. syntax trees, rhetorical structure annotation, hierarchical topological fields) and for the definition of structured annoSet objects (see annoSets). Hierarchical structures express the graph semantic property that a parent node consists of its children, or in reverse, that children nodes constitute their parent nodes. The semantics of hierarchical edges is also called dominance (a parent node dominates a child node), and they are consequently known as dominance edges as well. This chapter describes hierarchical annotation graphs. For non-hierarchical annotations see also spans and markables.

Structs

To form hierarchically nested (i.e. recursive) non-terminal nodes above the token level, the struct element should be used. Directed acyclic graphs (DAGs) of struct elements may be defined in struct files according to paula_struct.dtd. The struct element is embedded within a structList which determines the @type for all structs in the file. It has only one attribute, an @id which allows it to become the target of incoming edges. Outgoing edges are annotated using the child element rel, which has its own @type (the type of edge) and an attribute @xlink:href determining the target's id, as well as its own @id attribute for further annotation (see annotating structs and rels). The following example illustrates a simple syntax tree for the sentence "he ". The correpsonding syntax tree is also visualized in the next figure.

Constructing a hierarchical syntax tree with struct elements type

<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_struct.dtd">

<paula version="1.1">
<header paula_id="mycorpus.doc2_phrase"/>

<structList xmlns:xlink="http://www.w3.org/1999/xlink" 
type="phrase">
<struct id="phrase_1"> <!-- NP -->
    <!-- he -->
    <rel id="rel_1" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_1"/>
</struct>
<struct id="phrase_2"> <!-- VP -->
    <!-- takes -->
    <rel id="rel_2" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_2"/>
    <rel id="rel_3" type="edge" xlink:href="#phrase_3"/>
    <rel id="rel_4" type="edge" xlink:href="#phrase_4"/>
    <rel id="rel_5" type="edge" xlink:href="#phrase_5"/>
</struct>
<struct id="phrase_3"> <!-- NP -->
    <!-- people -->
    <rel id="rel_6" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_3"/>
    <!-- _ -->
    <rel id="rel_7" type="secedge" xlink:href="mycorpus.doc2.tok.xml#tok_5"/>
</struct>
<struct id="phrase_4"> <!-- PRT -->
    <!-- out -->
    <rel id="rel_8" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_4"/>
</struct>
<struct id="phrase_5"> <!-- S -->
    <rel id="rel_9" type="edge" xlink:href="#phrase_6"/>
    <rel id="rel_10" type="edge" xlink:href="#phrase_7"/>
</struct>
<struct id="phrase_6"> <!-- NP -->
    <!-- _ -->
    <rel id="rel_11" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_5"/>
</struct>
<struct id="phrase_7"> <!-- VP -->
    <!-- to -->
    <rel id="rel_12" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_6"/>
    <rel id="rel_13" type="edge" xlink:href="#phrase_8"/>
</struct>
<struct id="phrase_8"> <!-- VP -->
    <!-- fish -->
    <rel id="rel_14" type="edge" xlink:href="mycorpus.doc2.tok.xml#tok_7"/>
</struct>
<struct id="phrase_9"> <!-- S -->
    <rel id="rel_15" type="edge" xlink:href="#phrase_1"/> 
    <rel id="rel_16" type="edge" xlink:href="#phrase_2"/> 
</struct>
<struct id="phrase_10"> <!-- TOP -->
    <rel id="rel_17" type="edge" xlink:href="#phrase_9"/> 
</struct>
</structList>

</paula>

In this example, the individual nodes in the tree from the figure above are represented by struct elements. Each struct element contains rel elements which define edge leading to its children. Thus "phrase_1" directly dominates a token "tok_1", corresponding to the word "he". Note that, since the tokens are in a separate file, references to the tokens give a full href attribute with the token file name: mycorpus.doc2.tok.xml#tok_1. Phrase nodes dominating other phrase nodes within the same file do not require any prefix: "phrase_9" dominates "#phrase_5" directly. Most edges in the tree have been given the edge @type "edge", but one edge, by which the NP above "people" (marked in red in the figure above) indirectly dominates an empty token between "out" and "to" (marked in green) with a different @type: "secedge" (a 'secondary' edge). There is no limit to the amount of edge types used in a document, but XML naming conventions should be followed in giving type names that are ascii alphanumeric, without spaces and beginning with an alphabetic character (see naming conventions). The node labels ("NP", "VP") and the edge labels ("SBJ", "PRP") are not defined within the struct file, but are given as separate annotation files: see annotating structs and rels.

Annotating structs and rels

Hierarchical graphs made of struct and rel elements may be further annotated using feat elements, much like annotation spans. To annotate struct nodes, use a feat file pointing to the nodes and give the annotation name in the @type attribute. The following example illustrates the phrase annotations for the tree in the previous section.

Annotating nodes from a struct file with feat annotations for phrase category: "cat"

<?xml version="1.0" standalone="no"?>

<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">

<header paula_id="mycorpus.doc2_phrase_cat"/>

<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="cat" 
xml:base="mycorpus.doc2.phrase.xml">
    <feat xlink:href="#phrase_1" value="NP"/><!-- he -->
    <feat xlink:href="#phrase_2" value="VP"/><!-- takes -->
    <feat xlink:href="#phrase_3" value="NP"/><!-- people _ -->
    <feat xlink:href="#phrase_4" value="PRT"/><!-- out -->
    <feat xlink:href="#phrase_5" value="S"/><!-- _ to fish -->
    <feat xlink:href="#phrase_6" value="NP"/><!-- _ -->
    <feat xlink:href="#phrase_7" value="VP"/><!-- to fish -->
    <feat xlink:href="#phrase_8" value="VP"/><!-- fish -->
    <!-- he takes people out _ to fish -->
    <feat xlink:href="#phrase_9" value="S"/>
    <!-- he takes people out _ to fish -->
    <feat xlink:href="#phrase_10" value="TOP"/>
</featList>

</paula>

The annotation name is set as "cat" and it applies to the elements "phrase_1" to "phrase_10" in the xml:base file, which contains the phrase nodes. For conventions how to name the @paula_id and XML files, see naming conventions.

Annotating edges works in a similar way, except that rel elements are references instead of struct elements. It is possible to annotate edges of multiple types in the same XML file, as long as the name of the annotation being applied to them is identical. The following example illustrates this using the edges from the example tree in the previous section (note that "rel_7" had the type "secedge" while the others had "edge", and also that not all edges have been annotated, which is fine).

Annotating edges from a struct file with feat annotations for phrase function: "func"

<?xml version="1.0" standalone="no"?>

<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">

<header paula_id="mycorpus.doc2_phrase_func"/>

<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="func" 
xml:base="mycorpus.doc2.phrase.xml">
    <feat xlink:href="#rel_5" value="PRP"/><!-- _ to fish -->
    <feat xlink:href="#rel_9" value="SBJ"/><!-- _ -->
    <feat xlink:href="#rel_11" value="NONE"/><!-- _ -->
    <feat xlink:href="#rel_15" value="SBJ"/><!-- he -->
</featList>

</paula>

Just as with markables, it is also possible to specify multiple annotations for the same nodes in one XML document using multiFeat files (see multiFeats for details).