Pepper 3.7.0
A highly extensible platform for conversion and
Salt differentiates between the corpus structure and the document structure. The document structure contains the primary data (data sources) and the linguistic annotations. A set of such information is grouped into a document (SDocument in Salt). The corpus structure is a mechanism for grouping several documents into a corpus or subcorpus (SCorpus in Salt). Mapping the document structure and the corpus structure is therefore the main task of a Pepper module. The conceptual mapping of elements between a model or format X and Salt is usually the trickiest part, not necessarily in a technical sense, but in a semantic one. To get an idea of how the mapping can be realized technically, we strongly recommend reading the Salt model guide and the quick user guide at http://corpus-tools.org/salt/#documentation.
Two aspects have a big impact on the inner architecture of a Pepper module: convention over configuration, and the parallelization of a mapping job. Together they result in a relatively long stack of function calls, which lets you intervene at several points. We come back to this later. If you are happy with the defaults, however, implementing your module is rather simple. Again, there is a single instance of org.corpus_tools.pepper.modules.PepperModule for each Pepper step, whereas there is one instance of org.corpus_tools.pepper.modules.PepperMapper per SDocument and SCorpus object in the workflow.
Have a look at the following snippet, which is part of each org.corpus_tools.pepper.modules.PepperModule:
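The following is a minimal, self-contained sketch of such a method, not the original snippet. All types in it (Identifier, SCorpus, SDocument, SampleMapper, SampleModule) are simplified stand-ins defined in the sketch itself, and the resources map and the target field are hypothetical helpers added for illustration. In a real module you would extend org.corpus_tools.pepper.impl.PepperModuleImpl and use the Pepper and Salt classes instead. The comments 1 and 2 mark the two positions discussed in the text.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch: every type below is a simplified stand-in,
// not the real Pepper or Salt class of the same name.
final class CreateMapperSketch {
    interface SCorpus {}
    interface SDocument {}

    /** Stand-in for the identifier handed over for each corpus or document. */
    static final class Identifier {
        private final Object element;
        Identifier(Object element) { this.element = element; }
        Object getIdentifiableElement() { return element; }
    }

    /** Stand-in mapper that only stores what the module configures. */
    static final class SampleMapper {
        private URI resourceURI;
        String target = "unknown"; // hypothetical field, for illustration only

        void setResourceURI(URI uri) { this.resourceURI = uri; }
        URI getResourceURI() { return resourceURI; }
    }

    /** Stand-in module; a real one would extend PepperModuleImpl. */
    static final class SampleModule {
        // Hypothetical lookup table from identifiers to files or folders.
        final Map<Identifier, URI> resources = new HashMap<>();

        SampleMapper createPepperMapper(Identifier id) {
            SampleMapper mapper = new SampleMapper();
            // 1: pass the physical location to the mapper
            mapper.setResourceURI(resources.get(id));
            // 2: differentiate between a corpus and a document
            if (id.getIdentifiableElement() instanceof SCorpus) {
                mapper.target = "corpus";
            } else if (id.getIdentifiableElement() instanceof SDocument) {
                mapper.target = "document";
            }
            return mapper;
        }
    }
}
```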
This method is supposed to create a new instance of a specialized org.corpus_tools.pepper.modules.PepperMapper. The main initializations necessary for the workflow (e.g. passing the customization properties, see Customizing the mapping) are done by Pepper in the background, so this method is the place for configurations that are specific to your implementation. If your module is an importer or exporter, it might be necessary to pass the physical location of the file or folder that the Salt model is supposed to be imported from or exported to (see position 1 in the code). Sometimes it might also be necessary to differentiate by the type of object that is supposed to be mapped (either an SCorpus or an SDocument object). This is shown in the snippet at position 2.
That is all we have to do in org.corpus_tools.pepper.impl.PepperModuleImpl for the mapping task. Now we come to org.corpus_tools.pepper.impl.PepperMapperImpl. Here you find three methods that are meant to be overridden, as shown in the following snippet.
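The following self-contained sketch illustrates the three methods; it is a reconstruction for illustration, not the original snippet. All types are stand-ins defined in the sketch itself: MapperBase plays the role of org.corpus_tools.pepper.impl.PepperMapperImpl, and Annotatable, SCorpus, SDocument and SDocumentGraph only imitate the parts of Salt needed here. The accessor names follow the wording of this guide and are assumptions about the real API. The comments 1 to 8 mark the positions discussed in the text.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Self-contained sketch: all types are simplified stand-ins, not the real
// Pepper or Salt classes of the same name.
final class MapperSketch {
    enum DOCUMENT_STATUS { COMPLETED, FAILED }

    /** Stand-in for anything in Salt that can carry meta-annotations. */
    static class Annotatable {
        final Map<String, Object> metaAnnotations = new LinkedHashMap<>();

        void createMetaAnnotation(String namespace, String name, Object value) {
            // A null namespace means that no namespace is used.
            String key = (namespace == null) ? name : namespace + "::" + name;
            metaAnnotations.put(key, value);
        }
    }

    static final class SCorpus extends Annotatable {}

    static final class SDocumentGraph {
        final List<String> texts = new ArrayList<>();
        void createTextualDS(String text) { texts.add(text); }
    }

    static final class SDocument extends Annotatable {
        private SDocumentGraph graph;
        void setDocumentGraph(SDocumentGraph graph) { this.graph = graph; }
        SDocumentGraph getDocumentGraph() { return graph; }
    }

    /** Stand-in for PepperMapperImpl: holds exactly one corpus or document. */
    static class MapperBase {
        private URI resourceURI;
        private SCorpus corpus;
        private SDocument document;

        MapperBase() { initialize(); } // the constructor invokes initialize()
        protected void initialize() {}
        DOCUMENT_STATUS mapSCorpus() { return DOCUMENT_STATUS.COMPLETED; }
        DOCUMENT_STATUS mapSDocument() { return DOCUMENT_STATUS.COMPLETED; }

        void setResourceURI(URI uri) { resourceURI = uri; }
        URI getResourceURI() { return resourceURI; }
        void setSCorpus(SCorpus corpus) { this.corpus = corpus; }
        SCorpus getSCorpus() { return corpus; }
        void setSDocument(SDocument document) { this.document = document; }
        SDocument getSDocument() { return document; }
    }

    static final class SampleMapper extends MapperBase {
        // No field initializer on purpose: initialize() runs inside the
        // super constructor, before this class's initializers would run.
        boolean initialized;

        @Override
        protected void initialize() { initialized = true; }

        @Override
        DOCUMENT_STATUS mapSCorpus() {
            URI location = getResourceURI();                                     // 1
            getSCorpus().createMetaAnnotation(null, "author", "Bart Simpson");   // 2
            return DOCUMENT_STATUS.COMPLETED;                                    // 3
        }

        @Override
        DOCUMENT_STATUS mapSDocument() {
            URI location = getResourceURI();                                     // 4
            getSDocument().setDocumentGraph(new SDocumentGraph());               // 5
            getSDocument().getDocumentGraph()
                    .createTextualDS("This is a primary text");                  // 6
            getSDocument().createMetaAnnotation(null, "author", "Bart Simpson"); // 7
            return DOCUMENT_STATUS.COMPLETED;                                    // 8
        }
    }
}
```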
Not surprisingly, the method 'initialize()' is invoked by the constructor and should perform any initialization that is necessary. The methods org.corpus_tools.pepper.modules.PepperMapper.mapSCorpus() and org.corpus_tools.pepper.modules.PepperMapper.mapSDocument() are the more interesting ones. This is the place to implement the mapping of the corpus structure or the document structure. Note that one instance of the mapper always processes exactly one object, so either an SCorpus or an SDocument object. If you set the physical location at position 1 in the method org.corpus_tools.pepper.modules.PepperModule.createPepperMapper(), you can now retrieve that location by calling org.corpus_tools.pepper.modules.PepperMapper.getResourceURI(), as shown at positions 1 and 4. This method returns a URI pointing to the physical location.
Note
If your module is an exporter, that location does not physically exist yet and you have to create it yourself.
Position 2 shows how to access the current SCorpus object and how to annotate it, for instance with a meta-annotation (in the snippet, the meta-annotation is about an author with the name 'Bart Simpson'; the null value means that no namespace is used). In the method org.corpus_tools.pepper.modules.PepperMapper.mapSDocument(), at position 5, you can access the current object (here of type SDocument) with 'getSDocument()'. If your module is an importer, you need to create a container for the document structure, an SDocumentGraph object. The snippet further shows the creation of a primary text at position 6. In Salt, every object can be annotated or meta-annotated, and so can SDocument objects, as shown at position 7.
Last but not least, both methods have to return a value describing whether the mapping was successful or not (see positions 3 and 8). The possible results are described in org.corpus_tools.pepper.common.DOCUMENT_STATUS. Usually you only need to return org.corpus_tools.pepper.common.DOCUMENT_STATUS.COMPLETED when everything went well. If an error occurs and an exception is thrown, Pepper automatically sets the status org.corpus_tools.pepper.common.DOCUMENT_STATUS.FAILED, which marks the document or corpus as not to be processed any further.
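The relation between a thrown exception and the FAILED status can be pictured with the following self-contained sketch. It is not Pepper's actual code: the runMapping helper is hypothetical, the enum is a reduced stand-in for org.corpus_tools.pepper.common.DOCUMENT_STATUS, and restricting the catch to RuntimeException is an assumption. The point is only that the mapper returns COMPLETED itself, while a failure is signalled by throwing.

```java
import java.util.function.Supplier;

// Self-contained sketch of the idea, not Pepper's implementation.
final class StatusSketch {
    /** Reduced stand-in for org.corpus_tools.pepper.common.DOCUMENT_STATUS. */
    enum DOCUMENT_STATUS { COMPLETED, FAILED }

    /**
     * Hypothetical framework-side helper: runs one mapping and turns any
     * thrown runtime exception into the FAILED status.
     */
    static DOCUMENT_STATUS runMapping(Supplier<DOCUMENT_STATUS> mapping) {
        try {
            // The mapper decides on the status when nothing goes wrong.
            return mapping.get();
        } catch (RuntimeException e) {
            // The mapper never returns FAILED itself in this sketch;
            // throwing is enough to mark the object as failed.
            return DOCUMENT_STATUS.FAILED;
        }
    }
}
```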
During the mapping, it is very helpful for the user to receive a progress status from time to time. Especially when a mapping takes a long time, this spares the user the frustrating experience of an unresponsive tool. More information on that can be found in Monitoring the progress.
In a few cases, a format does not allow parallel processing, or allows it only with difficulty. In that case you can switch off the parallelization in your constructor with org.corpus_tools.pepper.modules.PepperModule.setIsMultithreaded(). The Pepper framework does not call the method org.corpus_tools.pepper.modules.PepperModule.createPepperMapper() directly. If you need to intervene at an earlier point, you can do so at any point in the stack, as shown in the following excerpt.
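The following self-contained sketch illustrates both points; it is not the original excerpt. It assumes that the stack consists of start(), start(Identifier) and createPepperMapper(Identifier), called in that order. ModuleBase, Identifier and Mapper are simplified stand-ins, and the pending and trace lists are hypothetical helpers that only record the order of calls. The constructor of SequentialModule switches off parallelization, and the overridden start(Identifier) shows an intervention one level above createPepperMapper().

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch: the classes are simplified stand-ins and the
// assumed stack is start() -> start(Identifier) -> createPepperMapper(Identifier).
final class CallStackSketch {
    static final class Identifier {
        final String id;
        Identifier(String id) { this.id = id; }
    }

    static class Mapper {
        void map() { /* the actual mapping would happen here */ }
    }

    /** Stand-in for the module base class. */
    static class ModuleBase {
        private boolean multithreaded = true;
        final List<Identifier> pending = new ArrayList<>(); // hypothetical work list
        final List<String> trace = new ArrayList<>();       // records the call order

        void setIsMultithreaded(boolean value) { multithreaded = value; }
        boolean isMultithreaded() { return multithreaded; }

        void start() {
            trace.add("start()");
            for (Identifier id : pending) {
                start(id);
            }
        }

        void start(Identifier id) {
            trace.add("start(" + id.id + ")");
            createPepperMapper(id).map();
        }

        Mapper createPepperMapper(Identifier id) {
            trace.add("createPepperMapper(" + id.id + ")");
            return new Mapper();
        }
    }

    static final class SequentialModule extends ModuleBase {
        SequentialModule() {
            // The format cannot be processed in parallel.
            setIsMultithreaded(false);
        }

        @Override
        void start(Identifier id) {
            // Intervene before the default behavior, then delegate to it.
            trace.add("custom hook for " + id.id);
            super.start(id);
        }
    }
}
```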
Even the first two methods can be overridden by your module to adapt their functionality at different levels.
Note
Take care when overriding one of them, since they handle more functionality than is explained in this guide. To get an idea of what happens there, please take a look at the source code at https://github.com/korpling/pepper. It is hopefully well documented and understandable. If questions arise, please send a mail to saltnpepper@lists.hu-berlin.de.