This paper presents a Natural Language Processing (NLP) approach to detecting spoilers in IMDb reviews. Many of these reviews reveal information about a movie's plot. Manually labeling them is infeasible given the volume of content, so an automated approach that filters out such spoilers is desirable. To identify reviews containing spoilers, we propose supervised machine learning models and explore Bi-LSTM, XGBoost, Random Forest, and Naive Bayes to improve text classification accuracy. In addition, we use pretrained word embeddings (word2vec and GloVe), cosine similarity, and Term Frequency-Inverse Document Frequency (TF-IDF) to produce and process the text vectors. Quantitative and qualitative results demonstrate that the proposed method substantially outperforms the baseline model.
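As a rough illustration of how TF-IDF vectors and cosine similarity can be combined to compare a review against a plot description, here is a minimal standard-library sketch. It is not the code from the notebooks: the function names, the whitespace tokenizer, the smoothed IDF formula, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase/whitespace tokenizer (an assumption; real
    # preprocessing would also handle punctuation and stop words).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (n / len(tokens)) * idf[t] for t, n in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical texts, not taken from the dataset.
docs = [
    "the detective discovers the butler is the killer",  # plot description
    "shocking ending where the butler is the killer",    # spoiler-like review
    "great acting and beautiful cinematography",         # spoiler-free review
]
plot_vec, spoiler_vec, clean_vec = tfidf_vectors(docs)
spoiler_score = cosine_similarity(plot_vec, spoiler_vec)
clean_score = cosine_similarity(plot_vec, clean_vec)
```

In this toy example the spoiler-like review shares plot terms with the description and so scores higher than the spoiler-free one. A score of this kind could serve as one input feature to a classifier; how the notebooks actually use it may differ.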
- 'IMDB-NB & XGBoost .ipynb' Implements feature engineering and modelling with Naive Bayes, XGBoost, and the semantic similarity method.
- 'IMDB-word2vec-Bi-lstm.ipynb' Implements the pretrained word2vec embedding with a Bi-LSTM model.
- 'IMDB-GloVe-Random Forest.ipynb' Implements the pretrained GloVe embedding to convert sentences to vectors, which are then classified with Random Forest.
- 'IMDB-GloVe-Bi-LSTM.ipynb' Implements the pretrained GloVe embedding with a Bi-LSTM model.
- imdb-spoiler-dataset Dataset obtained from Kaggle, by Rishabh Misra.
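
One common way to turn a sentence into a fixed-length vector from pretrained word embeddings, as needed before a model such as Random Forest, is to average the vectors of its words. The sketch below assumes that mean-pooling strategy; the notebooks may pool differently. The function name `sentence_vector` and the tiny two-dimensional embedding table are hypothetical stand-ins for real GloVe vectors.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; unknown tokens are skipped.

    Returns a zero vector of length `dim` when no token is in the vocabulary.
    """
    total = [0.0] * dim
    found = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # out-of-vocabulary word
        for i, value in enumerate(vec):
            total[i] += value
        found += 1
    if found == 0:
        return total
    return [x / found for x in total]


# Toy embedding table (hypothetical values, not real GloVe vectors).
toy_embeddings = {
    "killer": [1.0, 0.0],
    "dies": [0.0, 1.0],
}

features = sentence_vector(["killer", "dies", "unknown"], toy_embeddings, dim=2)
```

Each review then becomes a single feature row, so a collection of reviews forms a matrix that a tree-based classifier can consume directly.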