Primary text data

The primary text data forms the lowest level of resource representation, corresponding to the minimally analyzed linguistic data: a strech of untokenized plain text. The presence of at least one such file is obligatory in every PAULA document. Even if the resource to be annotated originates in spoken data for which a primary recording exists, its textual transcription forms the primary data. A segment of a recording is therefore seen to 'take place' in correspondences with a certain stretch of text (see Aligned audio/video files for details). The primary data follows the schema definition in paula_text.dtd, which must be present. The type of the file is "text", and by convention the file name ends with the extension *.text.xml and its paula_id is the same as the file name prefix, ending in _text instead of the file extension *.text.xml. The following example illustrates a primary text data file called mycorpus.doc1.text.xml.

A primary text data file

<?xml version="1.0" standalone="no"?>

<!DOCTYPE paula SYSTEM "paula_text.dtd">

<paula version="1.1">
<header paula_id="mycorpus.doc1_text" type="text"/>

<body>This is an example.</body>

</paula>

A PAULA document can also contain more than one primary text data file. There are at least two scenarios where this is recommended, for which the respective sections should be consulted: parallel corpora with aligned texts in multiple languages and dialogue data with multiple simultaneous speakers.

As with other PAULA XML files, the first segment of text before a period within the filename of the primary text data file can be interpreted as a PAULA namespace. In documents with only one such file, this is usually not important, but it is possible to use namespaces to group together text from different languages or speakers in parallel corpora or dialogue data respectively.

PAULA XML Documentation

Primary text data