The general task of this class is to tokenize a given text in the same order as the tool TreeTagger will do.
More...
|
void | setsDocumentGraph (SDocumentGraph sDocumentGraph) |
|
SDocumentGraph | getDocumentGraph () |
|
| Tokenizer () |
| Initializes a new Tokenizer object.
|
|
List< SToken > | tokenize (STextualDS sTextualDSs) |
| Tokenizes the given STextualDS and returns the created tokens. More...
|
|
List< SToken > | tokenize (STextualDS sTextualDSs, LanguageCode language) |
| Tokenizes the given STextualDS in the given language and returns the created tokens. More...
|
|
List< SToken > | tokenize (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos) |
| Tokenizes the given STextualDS in the given language, restricted to the given start and end positions. More...
|
|
void | addAbbreviation (LanguageCode language, HashSet< String > abbreviations) |
| Adds the given set of abbreviations to the internal map for the given language. More...
|
|
void | addAbbreviation (LanguageCode language, File abbreviationFile) |
| Adds the content of the given file as a set of abbreviations to the internal map for the given language. More...
|
|
HashSet< String > | getAbbreviations (LanguageCode language) |
| Returns the set of abbreviations corresponding to the given language. More...
|
|
void | addClitics (LanguageCode language, Clitics clitics) |
| Adds the given clitics to the internal map for the given language. More...
|
|
void | addClitics (LanguageCode language, File cliticsFile) |
| Adds the content of the given file as a set of clitics to the internal map for the given language. More...
|
|
Clitics | getClitics (LanguageCode language) |
| Returns the clitics corresponding to the given language. More...
|
|
List< SToken > | tokenizeToToken (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos) |
| The general task of this class is to tokenize a given text in the same order as the tool TreeTagger will do. More...
|
|
List< String > | tokenizeToString (String strInput, LanguageCode language) |
| The general task of this class is to tokenize a given text in the same order as the tool TreeTagger will do. More...
|
|
The general task of this class is to tokenize a given text in the same order as the tool TreeTagger will do.
A list of tokenized text is returned together with the text anchors (start and end positions) in the original text. Reimplemented in Java, with permission, from the original TreeTagger tokenizer written in Perl by Helmut Schmid (see http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). This implementation uses sets of abbreviations to detect tokens that are abbreviations in a specific language. You can therefore provide a file containing abbreviations to use instead of the default ones. Because abbreviations are language-dependent, you can set a language so that only the corresponding set of abbreviations is used. The current version of the Tokenizer supports abbreviations for the English, French, Italian and German languages. If no language is set, all available abbreviations are used.
- Author
- Amir Zeldes
-
Florian Zipser
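The abbreviation mechanism described above can be illustrated with a minimal, self-contained sketch. This is plain Java, not the Salt implementation: the class name, the tokenization strategy (whitespace splitting) and the tiny abbreviation set are invented for illustration only. The idea it demonstrates is the one stated in the description: a trailing period is split off as its own token unless the word is a known abbreviation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Salt Tokenizer: splits on whitespace, then
// detaches a trailing period unless the word is a known abbreviation.
public class AbbreviationSketch {

    // Invented sample set; the real Tokenizer loads language-specific sets.
    static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("etc.", "Dr.", "e.g."));

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (word.endsWith(".") && !ABBREVIATIONS.contains(word)) {
                // sentence-final period becomes its own token
                tokens.add(word.substring(0, word.length() - 1));
                tokens.add(".");
            } else {
                // abbreviation (or plain word) kept intact
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Ask, Dr., Smith, etc., today, .]
        System.out.println(tokenize("Ask Dr. Smith etc. today."));
    }
}
```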
void org.corpus_tools.salt.common.tokenizer.Tokenizer.addClitics (LanguageCode language, File cliticsFile)
Adds the content of the given file as a set of clitics to the internal map for the given language.
The file must be structured so that the first line contains the regex for proclitics, and the second line the regex for enclitics, e.g.:
([dcjlmnstDCJLNMST]'|[Qq]u'|[Jj]usqu'|[Ll]orsqu') (-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-mêmes?|-m'|-moi|-nous|-on|-toi|-tu|-t'|-vous|-en|-y|-ci|-là)
- Parameters
- language: the language the clitics belong to
- cliticsFile: the file containing the proclitic and enclitic regexes, one per line as described above
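A minimal, self-contained sketch of how the two regex lines of such a clitics file could be applied to split a single word form. This is plain Java for illustration, not the Salt implementation; the class and method names are invented, the proclitic pattern is the French example from the docs above, and the enclitic pattern is a shortened variant of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: applies a proclitic regex (first line of a clitics
// file) and an enclitic regex (second line) to split one word form.
public class CliticsSketch {

    // proclitics, anchored at the start of the word (French example)
    static final Pattern PROCLITIC = Pattern.compile(
            "^([dcjlmnstDCJLNMST]'|[Qq]u'|[Jj]usqu'|[Ll]orsqu')");

    // enclitics, anchored at the end of the word (shortened for illustration)
    static final Pattern ENCLITIC = Pattern.compile(
            "(-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?"
            + "|-leur|-lui|-moi|-nous|-on|-toi|-tu|-vous|-en|-y|-ci|-là)$");

    public static List<String> splitClitics(String word) {
        List<String> parts = new ArrayList<>();
        Matcher pro = PROCLITIC.matcher(word);
        if (pro.find()) {
            parts.add(pro.group());            // detach the proclitic
            word = word.substring(pro.end());
        }
        Matcher en = ENCLITIC.matcher(word);
        if (en.find()) {
            parts.add(word.substring(0, en.start()));
            parts.add(en.group());             // detach the enclitic
        } else {
            parts.add(word);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitClitics("l'homme"));  // prints [l', homme]
        System.out.println(splitClitics("va-t-il"));  // prints [va, -t-il]
    }
}
```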
List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenizeToToken (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)
The general task of this class is to tokenize a given text in the same order as the tool TreeTagger will do.
A list of tokenized text is returned together with the text anchors (start and end positions) in the original text. If the SDocumentGraph already contains tokens, those tokens are preserved when they overlap the same textual range as the new ones. Otherwise, an SSpan is created corresponding to the existing token; the span then overlaps all new tokens and carries all annotations the old token had. If the span would overlap exactly the same textual range as the old token, no span is created.
- Parameters
- sTextualDS: the STextualDS whose text is to be tokenized
- language: the language of the text
- startPos: start position of the range to tokenize
- endPos: end position of the range to tokenize
- Returns
- tokenized text fragments and their position in the original text
Implementation note: checks whether an old token overlaps the same or a bigger span than the currently created one; if so, the old token is removed and a span overlapping the new one is created.
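The overlap rule above can be sketched with plain integer offsets. This is an invented illustration, not the Salt code: it only classifies how an existing token's range relates to a newly created token's range, mirroring the cases described in the documentation.

```java
// Sketch of the overlap rule: an existing token whose range is identical to
// a new token is preserved; one that covers a larger range would be replaced
// by a span over the new tokens. Class and method names are invented.
public class OverlapSketch {

    public static String resolve(int oldStart, int oldEnd,
                                 int newStart, int newEnd) {
        if (oldStart == newStart && oldEnd == newEnd) {
            // identical textual range: the old token is preserved as-is
            return "keep existing token";
        } else if (oldStart <= newStart && oldEnd >= newEnd) {
            // old token covers a bigger range: replaced by a span
            return "replace with span";
        }
        return "no conflict";
    }

    public static void main(String[] args) {
        System.out.println(resolve(0, 5, 0, 5));   // prints keep existing token
        System.out.println(resolve(0, 10, 0, 5));  // prints replace with span
    }
}
```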