Salt 3.4.2
A powerful, tagset-independent and theory-neutral meta model and API for storing, manipulating, and representing nearly all types of linguistic data.
org.corpus_tools.salt.common.tokenizer.Tokenizer Class Reference

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does. More...

Public Member Functions

void setsDocumentGraph (SDocumentGraph sDocumentGraph)
 
SDocumentGraph getDocumentGraph ()
 
 Tokenizer ()
 Initializes a new Tokenizer object.
 
List< SToken > tokenize (STextualDS sTextualDSs)
 Sets the STextualDS to be tokenized. More...
 
List< SToken > tokenize (STextualDS sTextualDSs, LanguageCode language)
 Sets the STextualDS to be tokenized and the language of the text. More...
 
List< SToken > tokenize (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)
 Sets the STextualDS to be tokenized and the language of the text. More...
 
void addAbbreviation (LanguageCode language, HashSet< String > abbreviations)
 Adds the given set of abbreviations to the internal map corresponding to the given language. More...
 
void addAbbreviation (LanguageCode language, File abbreviationFile)
 Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language. More...
 
HashSet< String > getAbbreviations (LanguageCode language)
 Returns the set of abbreviations corresponding to the given language. More...
 
void addClitics (LanguageCode language, Clitics clitics)
 Adds the given clitics to the internal map corresponding to the given language. More...
 
void addClitics (LanguageCode language, File cliticsFile)
 Adds the content of the given file as a set of clitics to the internal map corresponding to the given language. More...
 
Clitics getClitics (LanguageCode language)
 Returns the clitics corresponding to the given language. More...
 
List< SToken > tokenizeToToken (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)
 The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does. More...
 
List< String > tokenizeToString (String strInput, LanguageCode language)
 The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does. More...
 

Static Public Member Functions

static LanguageCode checkLanguage (String text)
 Tries to detect the language of the given text and returns the ISO 639-2 language code. More...
 
static LanguageCode mapISOLanguageCode (String language)
 Maps the knallgrau TextCategorizer language description codes to ISO 639 codes. More...
 

Static Protected Attributes

static final String P_CHAR = "\\[\\{\\(´`\"»«‚„†‡‹‘’“”•–—›"
 
static final String F_CHAR = "\\]\\}'`\"\\),;:!\\?%»«‚„…†‡‰‹‘’“”•–—›"
 

Detailed Description

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does.

A list of tokenized text is returned with the text anchor (start and end position) in the original text. Reimplemented in Java with permission from the original TreeTagger tokenizer in Perl by Helmut Schmid (see http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/). This implementation uses sets of abbreviations to detect tokens that are abbreviations in a specific language. You can therefore set a file containing abbreviations to use abbreviations other than the default ones. Because abbreviations are language-dependent, you can set a language so that only that language's set of abbreviations is used. The current version of the Tokenizer supports abbreviations for English, French, Italian and German. If no language is set, all available abbreviations will be used.
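For orientation, a minimal usage sketch: it builds a document graph, attaches a primary text and tokenizes it with automatic language detection. The sample text and variable names are illustrative.

import java.util.List;

import org.corpus_tools.salt.SaltFactory;
import org.corpus_tools.salt.common.SDocumentGraph;
import org.corpus_tools.salt.common.STextualDS;
import org.corpus_tools.salt.common.SToken;
import org.corpus_tools.salt.common.tokenizer.Tokenizer;

public class TokenizerUsage {
    public static void main(String[] args) {
        // Create a document graph and attach the primary text to be tokenized.
        SDocumentGraph graph = SaltFactory.createSDocumentGraph();
        STextualDS text = graph.createTextualDS("Prof. Smith arrived on Aug. 3, didn't he?");

        // Tokenize the whole text; the language is detected automatically.
        Tokenizer tokenizer = new Tokenizer();
        List<SToken> tokens = tokenizer.tokenize(text);

        System.out.println("created " + tokens.size() + " tokens");
    }
}

The created tokens are added to the document graph and anchored in the primary text by their start and end positions.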

Author
Amir Zeldes
Florian Zipser

Member Function Documentation

◆ addAbbreviation() [1/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addAbbreviation (LanguageCode language, File abbreviationFile)

Adds the content of the given file as a list of abbreviations to the internal map corresponding to the given language.

Form of the file (one abbreviation per line):
Adm.
Ala.
Ariz.
Ark.
Aug.
Ave.
Bancorp.

Parameters
language    language the abbreviations belong to
abbreviationFile    file containing the abbreviations, one per line
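A short sketch of registering such a file, continuing the usage example above; the file name is hypothetical and the LanguageCode constant name is an assumption:

// english-abbreviations.txt is a hypothetical file in the form shown above,
// one abbreviation per line; requires java.io.File and the LanguageCode enum
// used by this class.
tokenizer.addAbbreviation(LanguageCode.en, new File("english-abbreviations.txt"));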

◆ addAbbreviation() [2/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addAbbreviation (LanguageCode language, HashSet< String > abbreviations)

Adds the given set of abbreviations to the internal map corresponding to the given language.

Parameters
language
abbreviations
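Abbreviations can also be passed directly as a set, continuing the sketch above; the abbreviation values and the LanguageCode constant name are illustrative:

// Register a few additional English abbreviations in memory;
// requires java.util.Arrays and java.util.HashSet.
HashSet<String> abbreviations = new HashSet<>(Arrays.asList("Prof.", "Dr.", "Univ."));
tokenizer.addAbbreviation(LanguageCode.en, abbreviations);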

◆ addClitics() [1/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addClitics (LanguageCode language, Clitics clitics)

Adds the given clitics to the internal map corresponding to the given language.

Parameters
language
clitics

◆ addClitics() [2/2]

void org.corpus_tools.salt.common.tokenizer.Tokenizer.addClitics (LanguageCode language, File cliticsFile)

Adds the content of the given file as a set of clitics to the internal map corresponding to the given language.

The file must be structured so that the first line contains the regex for proclitics and the second line the regex for enclitics, e.g.:

([dcjlmnstDCJLNMST]'|[Qq]u'|[Jj]usqu'|[Ll]orsqu')
(-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-mêmes?|-m'|-moi|-nous|-on|-toi|-tu|-t'|-vous|-en|-y|-ci|-là)

Parameters
language
cliticsFile
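A short sketch of registering such a file, continuing the usage example above; the file name is hypothetical and LanguageCode.fr is an assumed constant name:

// french-clitics.txt is a hypothetical two-line file: the first line holds the
// proclitics regex and the second line the enclitics regex, as shown above.
tokenizer.addClitics(LanguageCode.fr, new File("french-clitics.txt"));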

◆ checkLanguage()

static LanguageCode org.corpus_tools.salt.common.tokenizer.Tokenizer.checkLanguage (String text)

Tries to detect the language of the given text and returns the ISO 639-2 language code.

Parameters
text    the text whose language should be detected
Returns
the LanguageCode (ISO 639-2) of the detected language
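A short sketch of the call; the sample sentence is illustrative:

// Detect the language of a plain string.
LanguageCode detected = Tokenizer.checkLanguage("This is almost certainly an English sentence.");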

◆ getAbbreviations()

HashSet<String> org.corpus_tools.salt.common.tokenizer.Tokenizer.getAbbreviations ( LanguageCode  language)

Returns the set of abbreviations corresponding to the given language.

Parameters
language    language whose abbreviations are requested
Returns
the set of abbreviations registered for the given language

◆ getClitics()

Clitics org.corpus_tools.salt.common.tokenizer.Tokenizer.getClitics ( LanguageCode  language)

Returns the clitics corresponding to the given language.

Parameters
language    language whose clitics are requested
Returns
the clitics registered for the given language

◆ mapISOLanguageCode()

static LanguageCode org.corpus_tools.salt.common.tokenizer.Tokenizer.mapISOLanguageCode (String language)

Maps the knallgrau TextCategorizer language description codes to ISO 639 codes.

Parameters
language    language description code as used by the knallgrau TextCategorizer
Returns
the corresponding ISO 639 LanguageCode
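A short sketch of the mapping; that the TextCategorizer reports plain language names such as "english" is an assumption:

// Map a knallgrau-style language description to a LanguageCode.
LanguageCode code = Tokenizer.mapISOLanguageCode("english");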

◆ tokenize() [1/3]

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenize (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)

Sets the STextualDS to be tokenized and the language of the text.

If language is null, it will be detected automatically if possible.

Parameters
sTextualDS    STextualDS object containing the text to be tokenized
language    language of the text; if null, the language will be detected automatically
startPos    start position, if the text to be tokenized is a subset (0 assumed if set to null)
endPos    end position, if the text to be tokenized is a subset (length of the text assumed if set to null)
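A short sketch of tokenizing only a slice of the primary text, continuing the usage example above; the offsets and the LanguageCode constant name are illustrative:

// Tokenize only the first eleven characters ("Prof. Smith") using English abbreviations.
List<SToken> firstTokens = tokenizer.tokenize(text, LanguageCode.en, 0, 11);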

◆ tokenize() [2/3]

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenize ( STextualDS  sTextualDSs)

Sets the STextualDS to be tokenized.

Its language will be detected automatically if possible.

Parameters
sTextualDSs    STextualDS object containing the text to be tokenized

◆ tokenize() [3/3]

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenize (STextualDS sTextualDSs, LanguageCode language)

Sets the STextualDS to be tokenized and the language of the text.

If language is null, it will be detected automatically if possible.

Parameters
sTextualDSs    STextualDS object containing the text to be tokenized
language    language of the text; if null, the language will be detected automatically

◆ tokenizeToString()

List<String> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenizeToString (String strInput, LanguageCode language)

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does.

Returns a list of tokenized text.

Parameters
strInput    original text
language    language of the text
Returns
tokenized text fragments
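A short sketch of tokenizing a plain string without a document graph; the sample text is illustrative, and passing null to trigger automatic language detection is an assumption carried over from tokenize():

// Tokenize a raw string; only the token strings are returned, no SToken objects.
List<String> fragments = new Tokenizer().tokenizeToString("Prof. Smith arrived on Aug. 3.", null);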

◆ tokenizeToToken()

List<SToken> org.corpus_tools.salt.common.tokenizer.Tokenizer.tokenizeToToken (STextualDS sTextualDS, LanguageCode language, Integer startPos, Integer endPos)

The general task of this class is to tokenize a given text in the same way as the TreeTagger tool does.

A list of tokenized text is returned with the text anchor (start and end position) in the original text. If the SDocumentGraph already contains tokens, those tokens are preserved as long as they overlap the same textual range as the new ones. Otherwise an SSpan corresponding to the existing token is created; the span then overlaps all new tokens and carries all annotations the old token had. If the span would overlap the same textual range as the old token did, no span is created.

Parameters
sTextualDS    STextualDS object containing the text to be tokenized
language    language of the text
startPos    start position, if the text to be tokenized is a subset
endPos    end position, if the text to be tokenized is a subset
Returns
tokenized text fragments and their position in the original text

Note: if there is an old token overlapping the same or a bigger span than a currently created one, the old token is removed and a span overlapping the new one is created.
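A short sketch of re-tokenizing a text whose graph already contains tokens, continuing the usage example above; passing null for the language and for both offsets (to cover the whole text, as documented for tokenize()) is an assumption:

// Existing tokens that cover the same ranges are kept; coarser old tokens are
// replaced by spans that carry their annotations over the new, finer tokens.
List<SToken> refined = tokenizer.tokenizeToToken(text, null, null, null);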