Metadata
Metadata encompasses annotations that apply to an entire object in the
corpus structure, i.e. to a corpus, subcorpus or document. The metadata
does not annotate specific elements within a text, but rather
characterizes the entire container object. In PAULA XML metadata is
realized in lists of feat
elements (features), which refer to the
annoSet
of the relevant object (see annoSets). It is also
possible for metadata annotations to carry a namespace,
just like any other form of annotation.
Corpus and subcorpus metadata
Corpus and subcorpus level metadata can optionally be added to any
corpus or subfolder containing an annoSet
. It is not possible to add
metadata to a folder not containing an annoSet
. The following example
illustrates a metadata annotation for the corpus mycorpus
.
Metadata for the corpus mycorpus
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.meta_lang"/>
<featList xmlns:xlink="http://www.w3.org/1999/xlink"
type="lang" xml:base="mycorpus.anno.xml">
<feat xlink:href="#anno_1" value="eng"/><!-- English -->
</featList>
</paula>
Since the name of the metadata attribute is determined in the the
@type
attribute of the featList
element, it is necessary to define a
separate feat
file for each metadata annotation, unless
multiFeat metadata files are used. Note also that in
this example the feat is only pointing at the struct
element "anno_1"
from the annoSet
file mycorpus.anno.xml
. It is also possible to have
multiple feat
elements, pointing to each one of the struct
elements
in the annoSet
. In the current version of PAULA this makes no
difference: once a metadata annotation has been applied to any struct
element in the annoSet
, it applies to the entire object described by
the annoSet
.
Document metadata
Document metadata works exactly like corpus metadata: it is defined
within a feat
file which has the annotation name in the featList
@type
attribute and the value in the feat
@value
attribute. The
feat
element should point at a struct
element from the document's
annoSet
. It is possible but not necessary to annotate all struct
elements in the annoSet
. The following example demonstrates this.
Metadata for the document mycorpus/doc1
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1.meta_year"/>
<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="year"
xml:base="mycorpus.doc1.anno.xml">
<feat xlink:href="#anno_1" value="1999"/><!-- year 1999 -->
</featList>
</paula>
If the annoSet
of doc1 contains several structs names "anno_1",
"anno_2" etc., it is possible to annotate them all using multiple
feat
elements. This is identical to annotating just one of the
elements, as in the example above: the metadata annotation "year" has
been applied to the document and given the value "1999".
Using multifeats in metadata
When using a large number of metadata annotations, it is sometimes more
convenient to use just one XML document to define all meta annotations.
This is made possible by using multiFeat
files. The following example
illustrates the use of multiFeat
annotations to define metadata. For
more detailed information on multiFeat
annotations see also multiFeat
annotations.
Multiple metadata annotations in one file using multiFeat
elements.
<?xml version="1.0" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_multiFeat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1.meta_multiFeat"/>
<multiFeatList xmlns:xlink="http://www.w3.org/1999/xlink"
type="multiFeat" xml:base="mycorpus.doc1.anno.xml">
<multiFeat xlink:href="#anno_1">
<feat name="year" value="2012"/>
<feat name="language" value="English"/>
<feat name="source_format" value="PAULA XML"/>
<!-- ... -->
</multiFeat>
</multiFeatList>
</paula>
AnnoFeats
Each PAULA document may optionally contain an annoFeat
file listing
the types of all annotation files including mark
, feat
, struct
and
rel
files, for validation purposes. Not including an annofeat
file
means that the annotation layers available within the files specified in
the annoSet
cannot be validated, though it may make it easier to
update annotation layers dynamically. The following example illustrates
the use of the annoFeat
file in reference to the first example in the previous section.
An annoFeat
file for doc1 in mycorpus
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_feat.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1.annoFeat" />
<featList type="annoFeat" xml:base="mycorpus.doc1.anno.xml"
xmlns:xlink="http://www.w3.org/1999/xlink">
<feat xlink:href="#rel_1" value="annoFeat" />
<feat xlink:href="#rel_2" value="text" />
<feat xlink:href="#rel_3" value="tok" />
<feat xlink:href="#rel_4" value="pos" />
<feat xlink:href="#rel_5" value="lemma" />
<feat xlink:href="#rel_6" value="phrase" />
<feat xlink:href="#rel_7" value="cat" />
<feat xlink:href="#rel_8" value="func" />
</featList>
</paula>
Note that since the value of the feat
is a string and not an ID, it is
possible for multiple rels to refer to the same annotation type name. In
order to disambiguate in such cases, it is possible to use
namespaces, provided that these have been used in the
corresponding annotation files. The value then takes the form
"namespace:anno_name", e.g. "stts:pos".
The annoFeat
file cannot be used in corpus and subcorpus directories.