Preprocessing textual data for text classification - a good idea? #7154

oliverbj · 2021-02-21T21:28:37Z

I have a somewhat basic question. I am building my own text classification model. I have a large dataset that I am splitting into train, test, and dev sets.

My question is: should I preprocess my entire dataset (remove stop words, special characters, punctuation, etc.) before:

- Splitting it into train, test, and dev sets
- Training my custom model on the data

SandeepNaidu · 2021-02-22T02:00:10Z

Since distributed representations (a.k.a. word vector embeddings) came into widespread use, this kind of preprocessing has become largely unnecessary: it usually makes little difference to classification and other language tasks if you are using methods that learn their own features, such as neural networks and deep learning.
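
One reason to leave the text alone when the model learns its own features is that aggressive cleaning can remove the very words that carry the label. The sketch below is only an illustration: `STOP_WORDS` is a tiny hypothetical list and `preprocess` is a made-up helper, though many common English stop word lists do include negations such as "not".

```python
import string

# Hypothetical, deliberately tiny stop word list, for illustration only.
STOP_WORDS = {"the", "is", "was", "a", "not", "this"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


positive = preprocess("This movie was good!")
negative = preprocess("This movie was not good.")

print(positive)
print(negative)
print(positive == negative)
```

Both reviews reduce to the same token list, so a classifier trained on the cleaned text cannot tell them apart.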

Even with classical machine learning techniques, methods like TF-IDF and its variants reduce the weight of very frequent words, which add little semantic value to a document and mostly serve syntactic completeness.
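
To make the TF-IDF point concrete, here is a minimal sketch using the plain, unsmoothed variant `tf * log(N / df)` on whitespace tokens. The `tfidf` function and the three sample documents are assumptions made for illustration; library implementations typically use smoothed variants and will give different numbers.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the movie was great",
    "the movie was terrible",
    "the service was great",
]
weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Words that occur in every document ("the", "was") get a weight of zero without any explicit stop word removal, while a word unique to one document ("terrible") gets the highest weight in that document.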

With spaCy, preprocessing is not needed for English and most Latin languages. I have not worked extensively with other languages in spaCy, so I can't comment on those.
