Preprocess

ValNyz edited this page Jun 27, 2018 · 6 revisions

Preprocess is the first step in the MOTS pipeline. This step loads the documents from files into memory and applies the preprocessing needed for the later generation of abstracts.

For this task, MOTS uses the AbstractPreProcess class. Each subclass is executed during the Preprocess step and should represent one atomic preprocessing operation:

  • GenerateTextModel <MANDATORY>: Loads the documents into memory and writes them to <OUTPUT>/temp/<CORPUS_NAME> after preprocessing.

    • <OPTION NAME="StopWordListFile"> <OPTIONAL>: Path to the stop word list to load. If absent, stop words are not filtered out of the documents.
  • StanfordNLPPreProcess <OPTIONAL>: Applies StanfordNLP annotators such as tokenization, sentence splitting, POS tagging, and lemmatization (more to come).

    • <OPTION NAME="PropStanfordNLP"> <OPTIONAL>: Specifies the StanfordNLP annotators to run. Language model paths are specified in stanfordNLP.StanfordNLPProperties. Default: 'tokenize, ssplit, pos, lemma'.
  • WordSplitter <OPTIONAL>: Greedy tokenizer.

  • SentenceSplitter <OPTIONAL>: Greedy sentence splitter.

  • TextStemming <OPTIONAL>: Stemming based on SnowballStemmer (16 languages supported).
  • Lemmatization <OPTIONAL>: Lemmatization based on Ahmet Aker's work. French, German, Italian, and English are supported. Requires POS tagging.
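
As a sketch of how an option attaches to its preprocess, the fragment below writes out the documented default annotator list explicitly. Assuming the default stated above is accurate, it should behave the same as an empty StanfordNLPPreProcess element.

```xml
<PREPROCESS NAME="StanfordNLPPreProcess">
	<!-- Same annotators as the documented default, written out explicitly -->
	<OPTION NAME="PropStanfordNLP">tokenize, ssplit, pos, lemma</OPTION>
</PREPROCESS>
```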

Example

config_preprocess.xml for English preprocessing (lemmatization based)

<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<LANGUAGE>english</LANGUAGE>
		<OUTPUT_PATH>output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>		
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StopWordListFile">docs/StopWords/englishStopWords.txt</OPTION>
		</PREPROCESS>
		<PREPROCESS NAME="StanfordNLPPreProcess">
		</PREPROCESS>	
	</TASK>
</CONFIG>

Here, the Preprocess step is composed of two atomic preprocesses:

  • GenerateTextModel, which loads the documents and the stop word list into memory
  • StanfordNLPPreProcess with its default behavior (i.e. tokenize, ssplit, pos, lemma)

config_preprocessFrench.xml for French preprocessing (lemmatization based)

<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<LANGUAGE>french</LANGUAGE>
		<OUTPUT_PATH>output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>		
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StopWordListFile">docs/StopWords/frenchStopWords.txt</OPTION>
		</PREPROCESS>
		<PREPROCESS NAME="StanfordNLPPreProcess">
			<OPTION NAME="PropStanfordNLP">tokenize, ssplit, pos</OPTION> <!-- lemma is removed because StanfordNLP does not handle French lemmatization -->
		</PREPROCESS>
		<PREPROCESS NAME="Lemmatization">
		</PREPROCESS>	
	</TASK>
</CONFIG>

Here, the Preprocess step is composed of three atomic preprocesses:

  • GenerateTextModel, which loads the documents and the stop word list into memory
  • StanfordNLPPreProcess with a specific annotator list (tokenize, ssplit, pos), since StanfordNLP can't handle French lemmatization
  • Lemmatization, which takes over the lemmatization and needs the POS tags produced by the previous step. Note the 'french' value in the <LANGUAGE> tag.
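
A stemming-based pipeline could presumably be configured the same way. The sketch below is an assumption, not a configuration taken from the MOTS repository: it reuses the class names from the list of preprocesses as PREPROCESS names, and both the order of the steps and the absence of options are guesses that should be checked against the source code.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<LANGUAGE>english</LANGUAGE>
		<OUTPUT_PATH>output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StopWordListFile">docs/StopWords/englishStopWords.txt</OPTION>
		</PREPROCESS>
		<!-- Assumed order: split into sentences, then into words, then stem -->
		<PREPROCESS NAME="SentenceSplitter">
		</PREPROCESS>
		<PREPROCESS NAME="WordSplitter">
		</PREPROCESS>
		<PREPROCESS NAME="TextStemming">
		</PREPROCESS>
	</TASK>
</CONFIG>
```

Since TextStemming does not rely on StanfordNLP, this variant would avoid loading any language model, at the cost of producing stems rather than lemmas.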