Preprocess

Preprocess is the first step in the MOTS pipeline. This step loads documents from files into memory and applies the preprocessing needed for the later generation of abstracts.

For this task, MOTS uses the AbstractPreProcess class. Its subclasses are run during the Preprocess step, and each should represent one atomic preprocess:

GenerateTextModel <MANDATORY> : Loads the documents into memory and writes them to <OUTPUT>/temp/<CORPUS_NAME> after preprocessing.
- <OPTION NAME="StopWordListFile"> <OPTIONAL> Path to the stopword list to load. If absent, stopwords are not filtered from the documents.
StanfordNLPPreProcess <OPTIONAL> : Applies StanfordNLP annotators such as tokenization, sentence splitting, POS tagging, and lemmatization (more to come).
- <OPTION NAME="PropStanfordNLP"> <OPTIONAL> Specifies the StanfordNLP annotators to run. Language model paths are specified in stanfordNLP.StanfordNLPProperties. Default : 'tokenize, ssplit, pos, lemma'
WordSplitter <OPTIONAL> : Greedy tokenizer.
SentenceSplitter <OPTIONAL> : Greedy sentence splitter.

TextStemming <OPTIONAL> : Stemming based on SnowballStemmer (16 languages supported).
Lemmatization <OPTIONAL> : Lemmatization based on Ahmet Aker's work. French, German, Italian, and English are supported. Requires POS tagging.

Example

config_preprocess.xml for English preprocessing (lemmatization based)

<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<LANGUAGE>english</LANGUAGE>
		<OUTPUT_PATH>output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>		
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StopWordListFile">docs/StopWords/englishStopWords.txt</OPTION>
		</PREPROCESS>
		<PREPROCESS NAME="StanfordNLPPreProcess">
		</PREPROCESS>	
	</TASK>
</CONFIG>

Here, the Preprocess step is composed of two atomic preprocesses:

GenerateTextModel, which loads the documents and the stopword list into memory
StanfordNLPPreProcess, with its default behavior (i.e. tokenize, ssplit, pos, lemma)
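
Since the documented default annotator list is 'tokenize, ssplit, pos, lemma', the empty StanfordNLPPreProcess entry above should be equivalent to the explicit form below. This is a sketch derived only from that documented default, not a tested configuration:

```xml
<PREPROCESS NAME="StanfordNLPPreProcess">
	<!-- Assumption: spelling out the documented default gives the same behavior as omitting the option -->
	<OPTION NAME="PropStanfordNLP">tokenize, ssplit, pos, lemma</OPTION>
</PREPROCESS>
```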

config_preprocessFrench.xml for French preprocessing (lemmatization based)

<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<LANGUAGE>french</LANGUAGE>
		<OUTPUT_PATH>output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>		
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StopWordListFile">docs/StopWords/frenchStopWords.txt</OPTION>
		</PREPROCESS>
		<PREPROCESS NAME="StanfordNLPPreProcess">
			<OPTION NAME="PropStanfordNLP">tokenize, ssplit, pos</OPTION> <!-- lemma is removed because StanfordNLP does not handle French lemmatization -->
		</PREPROCESS>
		<PREPROCESS NAME="Lemmatization">
		</PREPROCESS>	
	</TASK>
</CONFIG>

Here, the Preprocess step is composed of three atomic preprocesses:

GenerateTextModel, which loads the documents and the stopword list into memory
StanfordNLPPreProcess, with specific behavior (tokenize, ssplit, pos), since StanfordNLP can't handle French lemmatization
Lemmatization, which performs the lemmatization instead. Notice the 'french' in the <LANGUAGE> tag.
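
A stemming-based variant can presumably be assembled from the same building blocks. The sketch below is an untested assumption: it supposes that SentenceSplitter, WordSplitter, and TextStemming take no options, that together they can stand in for StanfordNLPPreProcess, and that they should run in the order shown. Only the preprocess names and the stopword file path come from this page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<LANGUAGE>english</LANGUAGE>
		<OUTPUT_PATH>output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StopWordListFile">docs/StopWords/englishStopWords.txt</OPTION>
		</PREPROCESS>
		<!-- Assumption: the greedy splitters run before stemming, sentences first -->
		<PREPROCESS NAME="SentenceSplitter">
		</PREPROCESS>
		<PREPROCESS NAME="WordSplitter">
		</PREPROCESS>
		<!-- Assumption: TextStemming picks its Snowball stemmer from the LANGUAGE tag -->
		<PREPROCESS NAME="TextStemming">
		</PREPROCESS>
	</TASK>
</CONFIG>
```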
