Corpus structure
Corpus and subcorpus
In PAULA a corpus structure is defined by means of a file system folder structure. The name of the corpus is determined by the name of the top level directory of the folder structure. The top level directory may contain further directories. If these directories contain subdirectories themselves, then they are considered to be subcorpora. Subcorpora are generally used to provide meaningful subdivisions of a corpus, e.g. based on genre, period, language etc. These may be accompanied by appropriate metadata.
Each subcorpus carries the name of its directory. It is possible, but
not recommended, to repeat subcorpus names at different levels of
nesting. A directory cannot contain two identically named
subdirectories, and therefore it is impossible for two sibling
subcorpora to have the same name. Under *NIX systems it is possible to
have directories with identical names except for capitalization. This is
not recommended for compatibility with other operating systems. In
addition to directories, a top level corpus or a subcorpus may contain
an annoSet
file, which lists the set of subfolders in the same
directory (see annoSets). This is not required unless the
corpus or subcorpus should receive metadata annotations (see
metadata).
Directory structure for a PAULA corpus
+-- mycorpus/
| +-- subcorpus1/
| | +-- doc1/
| | +-- doc2/
| | +-- doc3/
| +-- subcorpus2/
| | +-- doc4/
| | +-- doc5/
| | +-- ...
| +-- subcorpus3/
... ...
A subdirectory which contains no further directories is a document. Every corpus and subcorpus must contain at least one document (possibly nested within a lower level folder), empty corpora or subcorpora are not allowed. The minimal structure for a PAULA corpus is therefore a corpus folder containing a document folder, which must contain the minimal document structure described under documents.
Documents
A PAULA document
is a terminal directory within the directoy structure
of the PAULA corpus
, i.e. it is a folder that contains no subfolders.
Usually documents corresponds to coherent texts (e.g. an article), but
in some contexts other divisions may be sensible (e.g. chapters of a
book as individual documents). The primary consideration is whether or
not annotations need to cross boundaries between segments of the
annotated texts, since annotation nodes and edges can only exist within
a document. It is not possible for an element in one document to refer
to or include an element from another document.
The name of the document is determined by the name of the folder
representing it. A document must contain at least a primary text data
file, a tokenization
, an annoSet
file and the
relevant DTDs used in the document, unless these are stored in a
separate folder and refered to with appropriate relative paths. If the
document contains no tokenization or other annotations,
then these will be paula_text.dtd
, paula_struct.dtd
and
paula_header.dtd
. Typically, however, a document almost always
contains a tokenization of the primary text data and some annotations,
meaning at least paula_mark.dtd
and paula_feat.dtd
(see DTDs
for more information). It is generally advisable to contain all DTDs
used in a corpus in every document, as redundant DTDs do not disrupt
processing or validation.
By convention, all XML files within a document (i.e. all files except DTDs) share the document name as part of the file name, which appears first except for possible namespaces, and is followed by annotation layer-specific elements. For more information about recommended naming practices see naming conventions.
AnnoSets
Each PAULA document
must contain an annoSet
file which describes the
set of annotations contained in the document. The annoSet
conforms
with the DTD) paula_struct.dtd
and contains a structList
element which contains one or more struct
elements, each of which
contains one or more rel
elements (these are the same elements used
for the description of hierarchical annotations as well).
Every XML file within the document directory (but not DTDs and not the
annoSet
file itself) must be the @xlink:href
attribute of some rel
in the annoSet
, including the special annoFeat
file if it has been
included (see Annofeats). There are therefore as many rel
elements in the annoSet
as there are XML files in the directory, minus
one (since the annoSet
itself is not referenced). Different structs
can be used to group together files belonging to one logical annotation
layer, such as the primary text data
and its tokenization
, or related annotations such as part of
speech and lemma. The following example shows some typical groupings
following the PAULA naming conventions.
An annoSet
file for doc1 in mycorpus
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_struct.dtd">
<paula version="1.1">
<header paula_id="mycorpus.doc1.anno" />
<structList xmlns:xlink="http://www.w3.org/1999/xlink"
type="annoSet">
<struct id="anno_1">
<rel id="rel_1" xlink:href="mycorpus.doc1.anno_feat.xml" />
</struct>
<struct id="anno_2">
<rel id="rel_2" xlink:href="mycorpus.doc1.text.xml" />
<rel id="rel_3" xlink:href="mycorpus.doc1.tok.xml" />
</struct>
<struct id="anno_3">
<rel id="rel_4" xlink:href="mycorpus.doc1.tok_pos.xml" />
<rel id="rel_5" xlink:href="mycorpus.doc1.tok_lemma.xml" />
</struct>
<struct id="anno_4">
<rel id="rel_6" xlink:href="mycorpus.doc1.phrase.xml" />
<rel id="rel_7" xlink:href="mycorpus.doc1.phrase_cat.xml" />
<rel id="rel_8" xlink:href="mycorpus.doc1.phrase_func.xml" />
</struct>
</structList>
</paula>
Annotation layers within the same struct are often interdependent, such
that removing one of the files from the document may disrupt the
annotation graph shared with the others. Also note that since
namespaces are also used to group related annotation
layers together, often (but not necessarily always) layers with the same
namespace will also be in the same struct
in the annoSet
.
A second function of annoSets is to list the contents of corpora or
subcorpora. AnnoSets within subcorpus or corpus folders are optional,
though if they are missing, the contents of the folder cannot be
validated against a list. AnnoSets in corpora or subcorpora are only
required if the corpus or subcorpus should receive metadata annotations,
in which case an annoSet
to which the metadata features must point is
required (see metadata for more information). An annoSet
for a subcorpus or corpus can look like the following example.
An annoSet
file for the corpus mycorpus with three documents
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE paula SYSTEM "paula_struct.dtd">
<paula version="1.1">
<header paula_id="mycorpus.anno" />
<structList xmlns:xlink="http://www.w3.org/1999/xlink"
type="annoSet">
<struct id="anno_1">
<rel id="rel_1" xlink:href="doc1/" />
<rel id="rel_2" xlink:href="doc2/" />
<rel id="rel_3" xlink:href="doc3/" />
</struct>
</structList>
</paula>
Corpus or subcorpus annoSets generally place all child subcorpora or
documents within one struct
element as in the example above, though it
is not prohibited to group some items into different struct
elements.
It is also possible to mix subcorpora and documents within the same
corpus or subcorpus level folder. There is no difference in notation and
all immediate subfolders in the file system are simply listed:
subcorpus1/
, doc1/
etc.