Pepper
3.6.0
A highly extensible platform for conversion and manipulation of linguistic data.
Inherits org.corpus_tools.pepper.modules.PepperModule.
Inherited by org.corpus_tools.pepper.impl.PepperImporterImpl, org.corpus_tools.pepper.modules.coreModules.DoNothingImporter, org.corpus_tools.pepper.modules.coreModules.SaltXMLImporter, and org.corpus_tools.pepper.modules.coreModules.TextImporter.
Public Member Functions | |
List< FormatDesc > | getSupportedFormats () |
Returns a list of formats, which are importable by this PepperImporter object. More... | |
CorpusDesc | getCorpusDesc () |
TODO docu. More... | |
void | setCorpusDesc (CorpusDesc corpusDesc) |
TODO docu. More... | |
Map< Identifier, URI > | getIdentifier2ResourceTable () |
Stores Identifier objects corresponding to either a SDocument or a SCorpus object, which has been created during the run of importCorpusStructure(SCorpusGraph). More... | |
Collection< String > | getDocumentEndings () |
Returns list containing all format endings for files, which are importable and could be mapped to SDocument or SDocumentGraph objects by this Pepper module. More... | |
Collection< String > | getCorpusEndings () |
Returns a collection of all file endings for a SCorpus object. More... | |
Collection< String > | getIgnoreEndings () |
Returns a collection of filenames, not to be imported. More... | |
SALT_TYPE | setTypeOfResource (URI resource) |
This method is a callback and can be overridden by derived importers. More... | |
void | importCorpusStructure (SCorpusGraph corpusGraph) throws PepperModuleException |
This method is called by Pepper at the start of a conversion process to create the corpus-structure. More... | |
FormatDesc | addSupportedFormat (String formatName, String formatVersion, URI formatReference) |
See PepperModuleDesc::addSupportedFormat(String, String, URI). | |
Double | isImportable (URI corpusPath) |
This method is called by Pepper and returns if a corpus located at the given URI is importable by this importer. More... | |
Public Member Functions inherited from org.corpus_tools.pepper.modules.PepperModule | |
PepperModuleDesc | getFingerprint () |
Returns a PepperModuleDesc object, which is a kind of a fingerprint of this PepperModule. More... | |
MODULE_TYPE | getModuleType () |
Returns the type of this module. More... | |
ComponentContext | getComponentContext () |
Returns the ComponentContext of the OSGi environment the bundle was started in. More... | |
String | getName () |
Returns the name of this module. More... | |
String | getVersion () |
Returns the version of this module. More... | |
void | setVersion (String value) |
Sets the version of this module. More... | |
String | getDesc () |
Returns a short description of this module. More... | |
void | setDesc (String desc) |
Sets a short description of this module. More... | |
URI | getSupplierContact () |
Returns a URI pointing to more information about this module and to contact information for its supplier. More... | |
void | setSupplierContact (URI eMail) |
Sets a URI pointing to more information about this module and to contact information for its supplier. More... | |
URI | getSupplierHomepage () |
Returns the URI to the homepage describing the functionality of the module. More... | |
void | setSupplierHomepage (URI hp) |
Sets the URI to the homepage describing the functionality of the module. More... | |
PepperModuleProperties | getProperties () |
Returns a PepperModuleProperties object containing properties to customize the behavior of this PepperModule. More... | |
void | setProperties (PepperModuleProperties properties) |
Sets the PepperModuleProperties object containing properties to customize the behavior of this PepperModule. More... | |
ModuleController | getModuleController () |
Returns the container and controller object for the current module. More... | |
void | setPepperModuleController (ModuleController value) |
Sets the container and controller object for the current module. More... | |
void | setPepperModuleController_basic (ModuleController value) |
Sets the container and controller object for the current module. More... | |
SaltProject | getSaltProject () |
Returns the SaltProject object, which is filled, manipulated or exported by the current module. More... | |
void | setSaltProject (SaltProject value) |
Sets the SaltProject object, which is filled, manipulated or exported by the current module. More... | |
SCorpusGraph | getCorpusGraph () |
Returns the SCorpusGraph object which is filled, manipulated or exported by the current module. More... | |
void | setCorpusGraph (SCorpusGraph value) |
Sets the SCorpusGraph object which is filled, manipulated or exported by the current module. More... | |
URI | getResources () |
Returns the path of the folder which might contain resources for a Pepper module. More... | |
void | setResources (URI value) |
Sets the resource folder used by getResources(). More... | |
URI | getTemproraries () |
TODO make docu. | |
void | setTemproraries (URI value) |
TODO make docu. | |
String | getSymbolicName () |
Returns the symbolic name of this OSGi bundle. More... | |
void | setSymbolicName (String value) |
Sets the symbolic name of this OSGi bundle. More... | |
Collection< String > | getStartProblems () |
If isReadyToStart() has returned false, this method returns a list of reasons why this module is not ready to start. More... | |
boolean | isReadyToStart () throws PepperModuleNotReadyException |
This method is called by the pepper framework after initializing this object and directly before start processing. More... | |
void | setIsMultithreaded (boolean isMultithreaded) |
Sets whether this PepperModule is able to run multithreaded. More... | |
boolean | isMultithreaded () |
Returns whether this PepperModule is able to run multithreaded. More... | |
void | start () throws PepperModuleException |
Starts the conversion process. More... | |
void | start (Identifier sElementId) throws PepperModuleException |
This method is called by the method start(). More... | |
PepperMapper | createPepperMapper (Identifier sElementId) |
OVERRIDE THIS METHOD FOR CUSTOMIZED MAPPING. More... | |
List< Identifier > | proposeImportOrder (SCorpusGraph sCorpusGraph) |
This method could be overridden, to make a proposal for the import order of SDocument objects. More... | |
Double | getProgress (String globalId) |
This method is invoked by the Pepper framework, to get the current progress concerning the SDocument object corresponding to the given Identifier in percent. More... | |
Double | getProgress () |
This method is invoked by the Pepper framework, to get the current total progress of all SDocument objects being processed by this module. More... | |
void | end () throws PepperModuleException |
This method is called by the pepper framework at the end of a conversion process. More... | |
void | done (PepperMapperController controller) |
This method is called by a PepperMapperController object to notify the PepperModule object, that the mapping is done. More... | |
void | done (Identifier identifier, DOCUMENT_STATUS result) |
This method is called by a PepperMapperController object to notify the PepperModule object, that the mapping for this object is done. More... | |
SelfTestDesc | getSelfTestDesc () |
This method is called by the Pepper framework to run an integration test for this module. More... | |
Static Public Attributes | |
static final String | NEGATIVE_FILE_EXTENSION_MARKER = "-" |
A character or character sequence to mark a file extension as not to be one of the imported ones. | |
Static Public Attributes inherited from org.corpus_tools.pepper.modules.PepperModule | |
static final String | ENDING_FOLDER = "FOLDER" |
A string specifying a value for a folder as ending. More... | |
static final String | ENDING_LEAF_FOLDER = "LEAF_FOLDER" |
A string specifying a value for a leaf folder as ending. More... | |
static final String | ENDING_XML = "xml" |
Ending for an xml file. More... | |
static final String | ENDING_TXT = "txt" |
Ending for a txt file. More... | |
static final String | ENDING_TAB = "tab" |
Ending for a tab file. More... | |
static final String | ENDING_ALL_FILES = "ALL_FILES" |
All kinds of file endings. | |
A mapping task in the Pepper workflow is not a monolithic block. It consists of several smaller steps.
The following describes the individual steps in brief. For a more detailed explanation, take a look at the documentation found at http://u.hu-berlin.de/saltnpepper.
Initialize the module and set the module's name, its description and the format description of data which are importable. This is part of the constructor:
    public MyModule() {
        super("Name of the module");
        setSupplierContact(URI.createURI("Contact address of the module's supplier"));
        setSupplierHomepage(URI.createURI("homepage of the module"));
        setDesc("A short description of what is the intention of this module, for instance which formats are importable. ");
        this.addSupportedFormat("The name of a format which is importable e.g. txt", "The version corresponding to the format name", null);
    }
This method is invoked by the Pepper framework before the mapping process is started. This method must return true; otherwise, this Pepper module cannot be used in a Pepper workflow. At this point you can report to the user any problems which prevent the module from being used, for instance that a database connection could not be established.
public boolean isReadyToStart() { return (true); }
Depending on the formats you want to support with your importer, the detection can be very different. In the simplest case, it is only necessary to search through the files at the given location (or to recursively traverse through directories, in case the location points to a directory), and to read their header section. This works, for instance, for XML formats like PAULA (see: http://www.sfb632.uni-potsdam.de/en/paula.html) or TEI (see: http://www.tei-c.org/Guidelines/P5/). The method should return a value between 0 and 1, where 0 means not importable and 1 means definitely importable. If null is returned, Pepper interprets this as unknown and will never suggest this module to the user.
public Double isImportable(URI corpusPath) { return null; }
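The header-based detection described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Pepper's own code: the class HeaderDetectionSketch, the headerMarker parameter, the HEADER_LINES constant and the use of java.nio.file.Path instead of Pepper's URI type are all assumptions made for the example, as is the choice to use the fraction of matching files as the score.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of Pepper): scores a location by inspecting file headers. */
class HeaderDetectionSketch {

    /** Number of leading lines treated as the "header" of a file (an arbitrary choice). */
    static final int HEADER_LINES = 5;

    /**
     * Returns the fraction of regular files at or below corpusPath whose header
     * contains headerMarker, or null ("unknown") when there is no file to judge.
     */
    static Double isImportable(Path corpusPath, String headerMarker) throws IOException {
        List<Path> files;
        // Files.walk traverses directories recursively; for a single file it yields just that file.
        try (Stream<Path> walk = Files.walk(corpusPath)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        if (files.isEmpty()) {
            return null;
        }
        long matching = 0;
        for (Path file : files) {
            if (headerContains(file, headerMarker)) {
                matching++;
            }
        }
        return (double) matching / files.size();
    }

    /** Reads at most HEADER_LINES lines and checks whether one of them contains the marker. */
    private static boolean headerContains(Path file, String marker) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            for (int i = 0; i < HEADER_LINES; i++) {
                String line = reader.readLine();
                if (line == null) {
                    return false;
                }
                if (line.contains(marker)) {
                    return true;
                }
            }
            return false;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8 text: treat the file as non-matching instead of failing.
            return false;
        }
    }
}
```

In a real importer the same logic would live inside isImportable(URI corpusPath), with the marker chosen per format (for example an XML root element name).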
The classes PepperImporterImpl and PepperExporterImpl provide an automatic mechanism to im- or export the corpus-structure. This mechanism is adaptable step by step, according to your specific purpose. Since many formats do not care about the corpus-structure and only encode the document-structure, the corpus-structure corresponds to the file structure of a corpus. Pepper's default mapping maps the root-folder to a root-corpus (SCorpus object). A sub-folder then corresponds to a sub-corpus (SCorpus object). The relation between super- and sub-corpus is represented as a SCorpusRelation object. Following the assumption that files contain the document-structure, there is one SDocument corresponding to each file in a sub-folder. The SCorpus and the SDocument objects are linked with a SCorpusDocumentRelation.
To keep the correspondence between the corpus-structure and the file structure, both the im- and the exporter make use of a map, which can be accessed via getIdentifier2ResourceTable().
To adapt the behavior, you can set the file endings in the constructor as follows:
this.getDocumentEndings().add("file ending");
You can also add the value PepperModule#ENDING_LEAF_FOLDER to import not files but leaf folders as SDocument objects. Another possibility is to add the value PepperModule#ENDING_ALL_FILES to import all files no matter their ending.
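How such ending lists might drive the decision for a single resource can be illustrated with a small stand-alone sketch. It is not the PepperImporterImpl implementation: the class EndingClassifierSketch, its ResourceType enum and the classify method are hypothetical, and the precedence of the rules (leaf folder before folder, ignore list before document endings) is an assumption. Only the three constant values are taken from the PepperModule constants listed on this page.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;

/** Hypothetical sketch (not Pepper's implementation): classifies a resource by its ending. */
class EndingClassifierSketch {

    enum ResourceType { CORPUS, DOCUMENT }

    // Values as documented for the corresponding PepperModule constants.
    static final String ENDING_FOLDER = "FOLDER";
    static final String ENDING_LEAF_FOLDER = "LEAF_FOLDER";
    static final String ENDING_ALL_FILES = "ALL_FILES";

    final Collection<String> documentEndings = new HashSet<>();
    final Collection<String> corpusEndings = new HashSet<>(Arrays.asList(ENDING_FOLDER));
    final Collection<String> ignoreEndings = new HashSet<>();

    /**
     * Decides what a resource represents.
     *
     * @param name          name of the file or folder
     * @param isFolder      whether the resource is a folder
     * @param hasSubFolders whether a folder contains further folders
     * @return the type, or null when the resource should be ignored
     */
    ResourceType classify(String name, boolean isFolder, boolean hasSubFolders) {
        if (isFolder) {
            // A leaf folder becomes a document only if LEAF_FOLDER was added explicitly.
            if (!hasSubFolders && documentEndings.contains(ENDING_LEAF_FOLDER)) {
                return ResourceType.DOCUMENT;
            }
            return corpusEndings.contains(ENDING_FOLDER) ? ResourceType.CORPUS : null;
        }
        String ending = endingOf(name);
        if (ignoreEndings.contains(ending)) {
            return null;
        }
        if (documentEndings.contains(ENDING_ALL_FILES) || documentEndings.contains(ending)) {
            return ResourceType.DOCUMENT;
        }
        return null;
    }

    /** Returns the text after the last dot, or an empty string if there is no dot. */
    static String endingOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }
}
```

With only "xml" registered as a document ending, doc1.xml is classified as a document, notes.txt is ignored and every folder becomes a corpus; adding ENDING_LEAF_FOLDER turns folders without sub-folders into documents.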
In the method createPepperMapper(Identifier) a PepperMapper object needs to be initialized and returned. The PepperMapper is the major part doing the mapping. It provides the methods PepperMapper#mapSCorpus() to handle the mapping of a single SCorpus object and PepperMapper#mapSDocument() to handle a single SDocument object. Both methods are invoked by the Pepper framework. The value returned by PepperMapper#getResourceURI(), which offers the mapper the file or folder of the current SCorpus or SDocument object, needs to be set in the createPepperMapper(Identifier) method. The following snippet shows a dummy of that method:
    public PepperMapper createPepperMapper(Identifier sElementId) {
        PepperMapper mapper = new PepperMapperImpl() {
            @Override
            public DOCUMENT_STATUS mapSCorpus() {
                // handling the mapping of a single corpus

                // accessing the current file or folder
                getResourceURI();

                // returning, that the corpus was mapped successfully
                return (DOCUMENT_STATUS.COMPLETED);
            }

            @Override
            public DOCUMENT_STATUS mapSDocument() {
                // handling the mapping of a single document

                // accessing the current file or folder
                getResourceURI();

                // returning, that the document was mapped successfully
                return (DOCUMENT_STATUS.COMPLETED);
            }
        };
        // pass current file or folder to mapper. When using
        // PepperImporter.importCorpusStructure or
        // PepperExporter.exportCorpusStructure, the mapping between file or
        // folder and SCorpus or SDocument was stored here
        mapper.setResourceURI(getIdentifier2ResourceTable().get(sElementId));
        return (mapper);
    }
Sometimes it might be necessary to clean up after the module did the job. For instance when writing an im- or an exporter it might be necessary to close file streams, a db connection etc. Therefore, after the processing is done, the Pepper framework calls the method described in the following snippet:
    public void end() {
        super.end();
        // do some clean up like closing of streams etc.
    }
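One way to organize such clean-up is to collect everything that needs closing while the module runs and to release it in end(). The sketch below only illustrates that idea with a hypothetical stand-in class; CleanupSketch and its register method are not part of the Pepper API, and a real module would call super.end() first, as shown above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a module; only the clean-up idea is shown. */
class CleanupSketch {

    /** Resources opened during processing, most recently opened first. */
    private final Deque<AutoCloseable> openResources = new ArrayDeque<>();

    /** Remembers a resource (stream, database connection, ...) for later clean-up. */
    void register(AutoCloseable resource) {
        openResources.push(resource);
    }

    /**
     * Closes all registered resources in reverse order of registration.
     * A failing close() does not prevent the remaining resources from being closed.
     *
     * @return the number of resources whose close() threw an exception
     */
    int end() {
        int failures = 0;
        while (!openResources.isEmpty()) {
            try {
                openResources.pop().close();
            } catch (Exception e) {
                failures++;
            }
        }
        return failures;
    }
}
```

Closing in reverse order is a common convention, because resources opened later often depend on those opened earlier.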
CorpusDesc org.corpus_tools.pepper.modules.PepperImporter.getCorpusDesc | ( | ) |
Collection<String> org.corpus_tools.pepper.modules.PepperImporter.getCorpusEndings | ( | ) |
Returns a collection of all file endings for a SCorpus object.
By default, this collection contains the value PepperModule#ENDING_FOLDER ("FOLDER"). To remove the default value, call Collection#remove(Object) on getCorpusEndings(). To add endings to the collection, call Collection#add(Object); to remove endings from the collection, call Collection#remove(Object).
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl.
Collection<String> org.corpus_tools.pepper.modules.PepperImporter.getDocumentEndings | ( | ) |
Returns list containing all format endings for files, which are importable and could be mapped to SDocument or SDocumentGraph objects by this Pepper module.
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl.
Map<Identifier, URI> org.corpus_tools.pepper.modules.PepperImporter.getIdentifier2ResourceTable | ( | ) |
Stores Identifier objects corresponding to either a SDocument or a SCorpus object, which has been created during the run of importCorpusStructure(SCorpusGraph).
Corresponding to the Identifier object this table stores the resource from where the element shall be imported.
For instance:
corpus_1 | /home/me/corpora/myCorpus |
corpus_2 | /home/me/corpora/myCorpus/subcorpus |
doc_1 | /home/me/corpora/myCorpus/subcorpus/document1.xml |
doc_2 | /home/me/corpora/myCorpus/subcorpus/document2.xml |
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl.
Collection<String> org.corpus_tools.pepper.modules.PepperImporter.getIgnoreEndings | ( | ) |
Returns a collection of filenames, not to be imported.
To add endings to the collection, call Collection#add(Object); to remove endings from the collection, call Collection#remove(Object).
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl.
List<FormatDesc> org.corpus_tools.pepper.modules.PepperImporter.getSupportedFormats | ( | ) |
Returns a list of formats, which are importable by this PepperImporter object.
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl.
void org.corpus_tools.pepper.modules.PepperImporter.importCorpusStructure | ( | SCorpusGraph | corpusGraph | ) | throws PepperModuleException |
This method is called by Pepper at the start of a conversion process to create the corpus-structure.
A corpus-structure consists of corpora (represented via the Salt element SCorpus), documents (represented via the Salt element SDocument) and links between corpora and between a corpus and a document (represented via the Salt elements SCorpusRelation and SCorpusDocumentRelation). Each corpus can contain 0..* subcorpora and 0..* documents, but a corpus cannot contain both documents and corpora.
In many cases the creation of the corpus-structure can be done automatically; therefore, just adapt the two lists #gets
This method creates the corpus-structure via a top-down traversal of the file structure. For each resource found (file or folder), the method setTypeOfResource(URI) is called to determine the type of the resource. If the type is SALT_TYPE#SDOCUMENT, a SDocument object is created for the resource; if the type is SALT_TYPE#SCORPUS, a SCorpus object is created; if the type is null, the resource is ignored.
corpusGraph | an empty graph given by Pepper, which shall contain the corpus-structure |
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl, org.corpus_tools.pepper.modules.coreModules.SaltXMLImporter, and org.corpus_tools.pepper.modules.coreModules.DoNothingImporter.
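The top-down traversal and the resulting identifier-to-resource table can be pictured with the following stand-alone sketch. It deliberately avoids the Salt and Pepper types: CorpusStructureSketch, its Type enum, the TypeOfResource callback (standing in for setTypeOfResource(URI)) and the string identifiers built from folder names are all assumptions for illustration. Skipping the contents of an ignored folder is likewise a choice made here, not documented Pepper behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of a top-down corpus-structure import over a file structure. */
class CorpusStructureSketch {

    enum Type { CORPUS, DOCUMENT }

    /** Stand-in for the setTypeOfResource callback; returning null means "ignore". */
    interface TypeOfResource {
        Type of(Path resource);
    }

    /** Walks the file structure below root and returns a table identifier -> resource. */
    static Map<String, Path> importCorpusStructure(Path root, TypeOfResource callback) throws IOException {
        Map<String, Path> table = new LinkedHashMap<>();
        visit(root, "", callback, table);
        return table;
    }

    private static void visit(Path resource, String parentId, TypeOfResource callback,
            Map<String, Path> table) throws IOException {
        Type type = callback.of(resource);
        if (type == null) {
            return; // ignored; in this sketch nothing below an ignored folder is visited
        }
        // Assumes the resource has a file name, i.e. is not a file system root.
        String id = parentId + "/" + resource.getFileName();
        table.put(id, resource);
        if (type == Type.CORPUS && Files.isDirectory(resource)) {
            List<Path> children;
            try (Stream<Path> list = Files.list(resource)) {
                children = list.sorted().collect(Collectors.toList());
            }
            for (Path child : children) {
                visit(child, id, callback, table);
            }
        }
    }
}
```

A mapper created later can then look up its file or folder in the table by identifier, which is the role getIdentifier2ResourceTable() plays in the snippet for createPepperMapper(Identifier).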
Double org.corpus_tools.pepper.modules.PepperImporter.isImportable | ( | URI | corpusPath | ) |
This method is called by Pepper and returns if a corpus located at the given URI is importable by this importer.
If yes, 1 must be returned; if no, 0 must be returned. If it is uncertain whether the given corpus is importable by this importer, any value between 0 and 1 can be returned. If this method is not overridden, null is returned.
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl, org.corpus_tools.pepper.modules.coreModules.SaltXMLImporter, org.corpus_tools.pepper.modules.coreModules.TextImporter, and org.corpus_tools.pepper.modules.coreModules.DoNothingImporter.
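To make the meaning of the return values concrete, the sketch below shows how a caller could pick a suggestion from the scores of several importers. How Pepper actually ranks importers is not described here, so the selection rule (highest score wins, null and 0 are never suggested) and the names ImporterRankingSketch and suggest are assumptions for illustration only.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: choosing an importer from isImportable-style scores. */
class ImporterRankingSketch {

    /**
     * Returns the name of the importer with the highest score in (0, 1].
     * Importers reporting null (unknown), 0 (not importable) or a value outside
     * the range are never suggested.
     */
    static Optional<String> suggest(Map<String, Double> scoresByImporter) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> entry : scoresByImporter.entrySet()) {
            Double score = entry.getValue();
            if (score == null || score <= 0.0 || score > 1.0) {
                continue;
            }
            if (score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

An importer that does not override isImportable(URI) returns null and would therefore never be chosen by a rule like this.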
void org.corpus_tools.pepper.modules.PepperImporter.setCorpusDesc | ( | CorpusDesc | corpusDesc | ) |
SALT_TYPE org.corpus_tools.pepper.modules.PepperImporter.setTypeOfResource | ( | URI | resource | ) |
This method is a callback and can be overridden by derived importers.
This method is called during the import of the corpus-structure (importCorpusStructure(SCorpusGraph)). While traversing the file structure, importCorpusStructure(SCorpusGraph) calls this method for each resource to determine whether the resource represents a SCorpus object, represents a SDocument object, or shall be ignored.
If this method is not overridden, the default behavior is:
resource | URI resource to be specified |
Implemented in org.corpus_tools.pepper.impl.PepperImporterImpl.