Required files and DTDs

Minimal document structure

Every document within a PAULA corpus requires at least one instance of each of the following three XML file types: a primary text data file, a tokenization, and an annoSet file. These respectively define the raw data, a basic segmentation of the data into minimal units, and a list of the files in the directory (see the documentation of the individual file types for details).

The DTDs that define these file types must also be present. At a minimum, the required files above need the following DTDs:

  • paula_header.dtd
  • paula_struct.dtd
  • paula_mark.dtd
  • paula_text.dtd
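
As a rough illustration, the presence of these four DTDs in a document folder could be checked with a short script. This is only a sketch: the function name missing_dtds and the assumption that the DTDs sit directly inside the document folder (rather than being referenced by a relative path) are illustrative and not part of PAULA itself.

```python
from pathlib import Path

# The four DTDs named above as the minimum for the required file types.
MINIMAL_DTDS = (
    "paula_header.dtd",
    "paula_struct.dtd",
    "paula_mark.dtd",
    "paula_text.dtd",
)


def missing_dtds(doc_dir, required=MINIMAL_DTDS):
    """Return the names in `required` that are not files directly inside doc_dir.

    Hypothetical helper: it assumes the DTDs are copied into the document
    folder itself and does not follow relative DTD references.
    """
    folder = Path(doc_dir)
    return [name for name in required if not (folder / name).is_file()]
```

An empty list means that all four files were found in the folder.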

The DTDs may be repeated in each document folder, which simplifies moving and adding documents at any point in the corpus structure (as in the examples in this documentation). Alternatively, the DTDs can be saved in a single folder (e.g. the corpus root) and referred to from each document using a relative path.
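
The relative-path option can be sketched as follows. The helper below builds a document type declaration that points from a document folder to a DTD stored elsewhere. The function name doctype_for and the folder names in the usage note are hypothetical, and the root element name "paula" is an assumption that should be checked against the DTDs actually in use.

```python
import os


def doctype_for(doc_dir, dtd_dir, dtd_name, root_element="paula"):
    """Build a DOCTYPE line whose system identifier is relative to doc_dir.

    root_element="paula" is an assumption made for illustration.
    """
    rel = os.path.relpath(os.path.join(dtd_dir, dtd_name), start=doc_dir)
    # Use forward slashes in the system identifier regardless of platform.
    rel = rel.replace(os.sep, "/")
    return f'<!DOCTYPE {root_element} SYSTEM "{rel}">'
```

For a document in the hypothetical folder corpus/sub/doc1 and DTDs kept in corpus, doctype_for("corpus/sub/doc1", "corpus", "paula_text.dtd") yields a declaration that refers to ../../paula_text.dtd. If the DTD is in the document folder itself, the identifier is just the file name.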

Additional DTDs

Beyond the DTDs in the previous section, a document that contains any feat annotations or an annoFeat file also requires paula_feat.dtd, and a document that contains pointing relations using the rel element also requires paula_rel.dtd. A further DTD, paula_multiFeat.dtd, is needed if multiple feat annotations are to be defined in one XML file (see multifeats).
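
The conditions above can be summarized in a small sketch that assembles the DTD list for one document. The function name required_dtds and its flags are hypothetical. How the flags are determined (for example by inspecting the document's XML files) is left to the caller.

```python
# Minimum set from the previous section.
MINIMAL_DTDS = [
    "paula_header.dtd",
    "paula_struct.dtd",
    "paula_mark.dtd",
    "paula_text.dtd",
]


def required_dtds(has_feat=False, has_rel=False, has_multifeat=False):
    """Return the DTD file names a document needs, given what it contains.

    has_feat:      the document has feat annotations or an annoFeat file
    has_rel:       the document has pointing relations (rel elements)
    has_multifeat: several feat annotations are defined in one XML file
    """
    needed = list(MINIMAL_DTDS)
    if has_feat:
        needed.append("paula_feat.dtd")
    if has_rel:
        needed.append("paula_rel.dtd")
    if has_multifeat:
        needed.append("paula_multiFeat.dtd")
    return needed
```

With no flags set the function returns only the minimum set. With all three flags set it returns all seven DTDs.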

Usually the necessary DTDs are included in every document folder for validation purposes, though it is possible to keep them in only one folder and refer to them from each document using a relative path (cf. the previous section). It is not necessary to include paula_rel.dtd or paula_feat.dtd for corpora or documents that do not contain pointing relations or feature annotations, respectively, even if some other documents in the corpus do. It may nevertheless be advisable to have the same DTDs or DTD references in all folders, in case pointing relations or feature annotations are added to further corpus documents later on. The following full list of DTDs may therefore be included in every document:

  • paula_header.dtd
  • paula_struct.dtd
  • paula_mark.dtd
  • paula_text.dtd
  • paula_feat.dtd
  • paula_rel.dtd
  • paula_multiFeat.dtd