Salt 3.4.2
A powerful, tagset-independent and theory-neutral meta model and API for storing, manipulating, and representing nearly all types of linguistic data.
org.corpus_tools.salt.common.tokenizer.Tokenizer Class Reference

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does. More...

Public Member Functions

void setsDocumentGraph (SDocumentGraph sDocumentGraph)
 
SDocumentGraph getDocumentGraph ()
 
 Tokenizer ()
 Initializes a new Tokenizer object.
 
List< SToken > tokenize (STextualDS sTextualDSs)
 Sets the STextualDS to be tokenized. More...
 
List< SToken > tokenize (STextualDS sTextualDSs, LanguageCode language)
 Sets the STextualDS to be tokenized and the language of the text. More...
 
List< SToken > tokenize (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)
 Sets the STextualDS to be tokenized and the language of the text. More...
 
void addAbbreviation (LanguageCode language, HashSet< String > abbreviations)
 Adds the given set of abbreviations to the internal map corresponding to the given language. More...
 
void addAbbreviation (LanguageCode language, File abbreviationFile)
 Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language. More...
 
HashSet< String > getAbbreviations (LanguageCode language)
 Returns the set of abbreviations corresponding to the given language. More...
 
void addClitics (LanguageCode language, Clitics clitics)
 Adds the given clitics to the internal map corresponding to the given language. More...
 
void addClitics (LanguageCode language, File cliticsFile)
 Adds the content of the given file as a set of clitics to the internal map corresponding to the given language. More...
 
Clitics getClitics (LanguageCode language)
 Returns the clitics corresponding to the given language. More...
 
List< SToken > tokenizeToToken (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)
 The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does. More...
 
List< String > tokenizeToString (String strInput, LanguageCode language)
 The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does. More...
 

Static Public Member Functions

static LanguageCode checkLanguage (String text)
 Tries to detect the language of the given text and returns the ISO 639-2 language code. More...
 
static LanguageCode mapISOLanguageCode (String language)
 Maps the knallgrau TextCategorizer language description codes to ISO 639 codes. More...
 

Static Protected Attributes

static final String P_CHAR = "\\[\\{\\(´`\"»«‚„†‡‹‘’“”•–—›"
 
static final String F_CHAR = "\\]\\}'`\"\\),;:!\\?%»«‚„…†‡‰‹‘’“”•–—›"
 

Detailed Description

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does.

A list of tokenized text is returned with the text anchor (start and end position) in the original text. Reimplemented in Java with permission from the original TreeTagger tokenizer in Perl by Helmut Schmid (see http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). This implementation uses sets of abbreviations to detect tokens that are abbreviations in a specific language. You can therefore set a file containing abbreviations to use abbreviations other than the default ones. Because abbreviations are language-dependent, you can set a language so that only that language's set of abbreviations is used. The current version of the Tokenizer supports abbreviations for English, French, Italian and German. If no language is set, all available abbreviations will be used.
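For orientation, a minimal usage sketch: it builds a document graph, attaches a primary text and tokenizes it with automatic language detection. The sample text and variable names are illustrative.

import java.util.List;

import org.corpus_tools.salt.SaltFactory;
import org.corpus_tools.salt.common.SDocumentGraph;
import org.corpus_tools.salt.common.STextualDS;
import org.corpus_tools.salt.common.SToken;
import org.corpus_tools.salt.common.tokenizer.Tokenizer;

public class TokenizerUsage {
    public static void main(String[] args) {
        // Create a document graph and attach the primary text to be tokenized.
        SDocumentGraph graph = SaltFactory.createSDocumentGraph();
        STextualDS text = graph.createTextualDS("Prof. Smith arrived on Aug. 3, didn't he?");

        // Tokenize the whole text; the language is detected automatically.
        Tokenizer tokenizer = new Tokenizer();
        List<SToken> tokens = tokenizer.tokenize(text);

        System.out.println("created " + tokens.size() + " tokens");
    }
}

The created tokens are added to the document graph and anchored in the primary text by their start and end positions.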

Author
Amir Zeldes
Florian Zipser

Member Function Documentation

◆ addAbbreviation() [1/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addAbbreviation (LanguageCode language, File abbreviationFile)

Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language.

Form of the file (one abbreviation per line):
Adm.
Ala.
Ariz.
Ark.
Aug.
Ave.
Bancorp.

Parameters
language    language the abbreviations belong to
abbreviationFile    file containing the abbreviations, one per line
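A short sketch of registering such a file, continuing the usage example above; the file name is hypothetical and the LanguageCode constant name is an assumption:

// english-abbreviations.txt is a hypothetical file in the form shown above,
// one abbreviation per line; requires java.io.File and the LanguageCode enum
// used by this class.
tokenizer.addAbbreviation(LanguageCode.en, new File("english-abbreviations.txt"));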

◆ addAbbreviation() [2/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addAbbreviation (LanguageCode language, HashSet< String > abbreviations)

Adds the given set of abbreviations to the internal map corresponding to the given language.

Parameters
language
abbreviations
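Abbreviations can also be passed directly as a set, continuing the sketch above; the abbreviation values and the LanguageCode constant name are illustrative:

// Register a few additional English abbreviations in memory;
// requires java.util.Arrays and java.util.HashSet.
HashSet<String> abbreviations = new HashSet<>(Arrays.asList("Prof.", "Dr.", "Univ."));
tokenizer.addAbbreviation(LanguageCode.en, abbreviations);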

◆ addClitics() [1/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addClitics (LanguageCode language, Clitics clitics)

Adds the given clitics to the internal map corresponding to the given language.

Parameters
language
clitics

◆ addClitics() [2/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addClitics (LanguageCode language, File cliticsFile)

Adds the content of the given file as a set of clitics to the internal map corresponding to the given language.

The file must be structured so that the first line contains the regex for proclitics and the second line the regex for enclitics, e.g.:

([dcjlmnstDCJLNMST]'|[Qq]u'|[Jj]usqu'|[Ll]orsqu')
(-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-mêmes?|-m'|-moi|-nous|-on|-toi|-tu|-t'|-vous|-en|-y|-ci|-là)

Parameters
language
cliticsFile
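A short sketch of registering such a file, continuing the usage example above; the file name is hypothetical and LanguageCode.fr is an assumed constant name:

// french-clitics.txt is a hypothetical two-line file: the first line holds the
// proclitics regex and the second line the enclitics regex, as shown above.
tokenizer.addClitics(LanguageCode.fr, new File("french-clitics.txt"));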

◆ checkLanguage()

static LanguageCode org.corpus_tools.salt.common.tokenizer.Tokenizer.checkLanguage (String text)

Tries to detect the language of the given text and returns the ISO 639-2 language code.

Parameters
text    the text whose language should be detected
Returns
the LanguageCode (ISO 639-2) of the detected language
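A short sketch of the call; the sample sentence is illustrative:

// Detect the language of a plain string.
LanguageCode detected = Tokenizer.checkLanguage("This is almost certainly an English sentence.");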

◆ getAbbreviations()

HashSet<String> org.corpus_tools.salt.common.tokenizer.Tokenizer.getAbbreviations ( LanguageCode  language)

Returns the set of abbreviations corresponding to the given language.

Parameters
language    language whose abbreviations are requested
Returns
the set of abbreviations registered for the given language

◆ getClitics()

Clitics org.corpus_tools.salt.common.tokenizer.Tokenizer.getClitics ( LanguageCode  language)

Returns the clitics corresponding to the given language.

Parameters
language    language whose clitics are requested
Returns
the clitics registered for the given language

◆ mapISOLanguageCode()

static LanguageCode org.corpus_tools.salt.common.tokenizer.Tokenizer.mapISOLanguageCode (String language)

Maps the knallgrau TextCategorizer language description codes to ISO 639 codes.

Parameters
language    language description code as used by the knallgrau TextCategorizer
Returns
the corresponding ISO 639 LanguageCode
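A short sketch of the mapping; that the TextCategorizer reports plain language names such as "english" is an assumption:

// Map a knallgrau-style language description to a LanguageCode.
LanguageCode code = Tokenizer.mapISOLanguageCode("english");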

◆ tokenize() [1/3]

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenize (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)

Sets the STextualDS to be tokenized and the language of the text.

If language is null, it will be detected automatically if possible.

Parameters
sTextualDS    STextualDS object containing the text to be tokenized
language    language of the text; if null, the language will be detected automatically
startPos    start position, if the text to be tokenized is a subset (0 assumed if set to null)
endPos    end position, if the text to be tokenized is a subset (length of the text assumed if set to null)
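A short sketch of tokenizing only a slice of the primary text, continuing the usage example above; the offsets and the LanguageCode constant name are illustrative:

// Tokenize only the first eleven characters ("Prof. Smith") using English abbreviations.
List<SToken> firstTokens = tokenizer.tokenize(text, LanguageCode.en, 0, 11);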

◆ tokenize() [2/3]

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenize ( STextualDS  sTextualDSs)

Sets the STextualDS to be tokenized.

Its language will be detected automatically if possible.

Parameters
sTextualDSs    STextualDS object containing the text to be tokenized

◆ tokenize() [3/3]

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenize (STextualDS sTextualDSs, LanguageCode language)

Sets the STextualDS to be tokenized and the language of the text.

If language is null, it will be detected automatically if possible.

Parameters
sTextualDSs    STextualDS object containing the text to be tokenized
language    language of the text; if null, the language will be detected automatically

◆ tokenizeToString()

List<String> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenizeToString (String strInput, LanguageCode language)

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does.

Returns a list of tokenized text.

Parameters
strInput    original text
language    language of the text
Returns
tokenized text fragments
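A short sketch of tokenizing a plain string without a document graph; the sample text is illustrative, and passing null to trigger automatic language detection is an assumption carried over from tokenize():

// Tokenize a raw string; only the token strings are returned, no SToken objects.
List<String> fragments = new Tokenizer().tokenizeToString("Prof. Smith arrived on Aug. 3.", null);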

◆ tokenizeToToken()

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenizeToToken (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does.

A list of tokenized text is returned with the text anchor (start and end position) in the original text. If the SDocumentGraph already contains tokens, those tokens are preserved as long as they overlap the same textual range as the new ones. Otherwise an SSpan corresponding to the existing token is created; the span then overlaps all new tokens and carries all annotations the old token had. If the span would overlap the same textual range as the old token did, no span is created.

Parameters
sTextualDS    STextualDS object containing the text to be tokenized
language    language of the text
startPos    start position, if the text to be tokenized is a subset
endPos    end position, if the text to be tokenized is a subset
Returns
tokenized text fragments and their position in the original text

Note: if there is an old token overlapping the same or a bigger span than a currently created one, the old token is removed and a span overlapping the new one is created.
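A short sketch of re-tokenizing a text whose graph already contains tokens, continuing the usage example above; passing null for the language and for both offsets (to cover the whole text, as documented for tokenize()) is an assumption:

// Existing tokens that cover the same ranges are kept; coarser old tokens are
// replaced by spans that carry their annotations over the new, finer tokens.
List<SToken> refined = tokenizer.tokenizeToToken(text, null, null, null);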