Preprocessing textual data for text classification - a good idea? #7154
Replies: 1 comment
Since distributed representations (word embeddings) came into widespread use, the need for preprocessing has largely gone away: it makes little difference to classification and other language tasks when you use methods that learn features automatically, such as neural networks and other deep learning models. Even with classical machine learning techniques, TF-IDF and its variants reduce the weight of frequently used words that add little semantic value to a document and mostly serve syntactic completeness. With spaCy, preprocessing is not needed for English and most Latin languages. I have not worked extensively with other languages in spaCy, so I can't comment on those.
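To illustrate the TF-IDF point, here is a minimal pure-Python sketch of the plain, unsmoothed formula (term frequency times `log(N / df)`). The function name `tfidf_weights` and the three toy documents are made up for illustration. Library implementations typically use smoothed variants (scikit-learn's default, for example), so their weights for very common words are small rather than exactly zero.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Uses plain TF-IDF: (count / doc_length) * log(n_docs / doc_freq).
    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus; no stopwords were removed beforehand.
docs = [
    "the movie was great",
    "the plot was dull",
    "the acting saved the movie",
]
weights = tfidf_weights(docs)

# Print the terms of the first document, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}  {weight:.3f}")
```

In this toy corpus "the" occurs in every document, so its IDF is `log(3/3) = 0` and it contributes nothing to any document vector, while a word unique to one document ("great") gets the largest weight there.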
I have a somewhat basic question. I am building my own text classification model. I currently have a large dataset that I am splitting into train, test, and dev sets.
My question is: Should I preprocess (remove stopwords, special characters, punctuation, etc.) my entire dataset before: