This repository contains everything important I have learned, implemented, and found useful for NLP. I want to make this repo a go-to place for all the NLP material I need, so it will always be a work in progress, as learning never stops.
The datasets used here are:
- StackOverFlow Tag Prediction (in the problem/multilabel folder)
- Toxicity Classification
- IMDB Movie Review
The structure of this repo:
There are five main folders.
This folder is meant to contain everything required to visualize text data. Since text data is different from tabular data, we cannot use the same visualization methods to get insights from it. Aside from visualization, this also includes extracting meta-features like word length, number of tokens, etc., which may or may not be useful for modelling but give us insights about the data.
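For example, a handful of such meta-features can be computed with plain pandas. This is a minimal sketch; the DataFrame and the `text` column name are placeholders for whichever dataset is loaded.

```python
import pandas as pd

# Toy data standing in for any of the datasets above.
df = pd.DataFrame({"text": ["This movie was great !!", "worst film i have ever seen"]})

# Simple meta-features: character count, token count, average word length.
df["char_len"] = df["text"].str.len()
df["num_tokens"] = df["text"].str.split().str.len()
df["avg_word_len"] = df["char_len"] / df["num_tokens"]

print(df[["char_len", "num_tokens", "avg_word_len"]])
```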
Visualizing with a word cloud is one of the most effective ways to spot anomalies in text data, since unexpectedly frequent tokens (leftover markup, repeated boilerplate, etc.) jump out immediately.
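A minimal sketch using the `wordcloud` package (the list of documents is a placeholder):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

docs = ["This movie was great", "worst film i have ever seen", "great great acting"]
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(docs))

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```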
Visualizing n-gram frequencies helps select a good value of n, reducing the trial and error.
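A small sketch of how the n-gram counts behind such a plot can be computed with scikit-learn (the documents and the helper name `top_ngrams` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n=2, k=10):
    """Return the k most frequent n-grams across the documents."""
    vec = CountVectorizer(ngram_range=(n, n)).fit(texts)
    counts = vec.transform(texts).sum(axis=0)          # total count per n-gram
    freqs = [(term, counts[0, idx]) for term, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:k]

docs = ["the movie was great", "the movie was terrible", "great acting in the movie"]
print(top_ngrams(docs, n=2, k=5))   # compare n = 1, 2, 3 ... and plot the counts
```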
As the name suggests, this folder contains the different preprocessing steps. It also includes the preprocessing needed to increase word coverage when using pre-trained embeddings: things like replacing the digits of numbers greater than 9 with '#', handling contractions, etc. The steps that hurt word coverage the most are stemming and lemmatization, but there are still ways to use them; see the code snippet below.
I am also working on a more general notebook which contains all the preprocessing steps regardless of the dataset being used.
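To make the idea concrete, here is a rough sketch of the two cleanup steps mentioned above. The contraction mapping is only a tiny illustrative subset, and the '#' replacement pattern is one common way of matching how pre-trained vocabularies store multi-digit numbers.

```python
import re

# A few contractions for illustration; the actual notebooks use a much larger mapping.
contractions = {"can't": "can not", "won't": "will not", "it's": "it is"}

def clean_text(text):
    # Expand contractions so both words can be looked up in the embedding vocabulary.
    for short, full in contractions.items():
        text = text.replace(short, full)
    # Replace the digits of numbers greater than 9 with '#' — pre-trained vocabularies
    # such as GloVe tend to contain tokens like "##" and "####" rather than every number.
    text = re.sub(r"[0-9]{5,}", "#####", text)
    text = re.sub(r"[0-9]{4}", "####", text)
    text = re.sub(r"[0-9]{3}", "###", text)
    text = re.sub(r"[0-9]{2}", "##", text)
    return text

print(clean_text("I can't believe it's 2019 and 5 stars!"))  # -> "... #### and 5 stars!"
```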
Initial Coverage
Increased The Coverage
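The before/after coverage shown above can be computed with a small helper along these lines. This is a sketch: `vocab` is assumed to be a word→count dict built from the corpus, and `embeddings_index` a word→vector dict loaded from the pre-trained embedding file.

```python
def check_coverage(vocab, embeddings_index):
    """vocab: {word: count} from the corpus; embeddings_index: {word: vector}."""
    known, unknown = {}, {}
    known_count = unknown_count = 0
    for word, count in vocab.items():
        if word in embeddings_index:
            known[word] = count
            known_count += count
        else:
            unknown[word] = count
            unknown_count += count
    print(f"Found embeddings for {len(known) / len(vocab):.2%} of the vocabulary")
    print(f"Found embeddings for {known_count / (known_count + unknown_count):.2%} of all text")
    # The most frequent out-of-vocabulary words tell you which cleaning step to add next.
    return sorted(unknown.items(), key=lambda x: x[1], reverse=True)

# oov = check_coverage(vocab, embeddings_index)
```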
Below is a code snippet demonstrating the right way to use stemming and lemmatization when dealing with pre-trained word embeddings.
```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer

# requires nltk.download('wordnet') for the lemmatizer
ps = PorterStemmer()
lc = LancasterStemmer()
sb = SnowballStemmer("english")
wnl = WordNetLemmatizer()

stem, lemma = [], []
known_words = 0

# word_dict: {word: index} built from the corpus; embeddings_index: {word: vector}
for key in word_dict:
    # 1) Porter stem
    word = ps.stem(key)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # embedding_matrix[word_dict[key]] = embedding_vector
        stem.append(word)
        known_words += 1
        continue
    # 2) Lancaster stem
    word = lc.stem(key)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # embedding_matrix[word_dict[key]] = embedding_vector
        stem.append(word)
        known_words += 1
        continue
    # 3) Snowball stem
    word = sb.stem(key)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # embedding_matrix[word_dict[key]] = embedding_vector
        stem.append(word)
        known_words += 1
        continue
    # 4) WordNet lemma (verb form)
    word = wnl.lemmatize(key, 'v')
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # embedding_matrix[word_dict[key]] = embedding_vector
        lemma.append(key)
        known_words += 1
        continue
```
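The point of the cascade is that a stemmed or lemmatized form is only substituted when that form actually exists in the pre-trained embedding index, so the replacement can only add coverage; if none of the three stemmers or the lemmatizer produces a known token, the word simply remains out of vocabulary.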
This folder covers meta-embeddings, where we try to combine more than one embedding, e.g. concatenating word2vec and GloVe vectors. The different methods covered here are meta-embedding by 1. taking the average, 2. concatenation, 3. DME, 4. PME (a small NumPy sketch of the first two is shown below the PME image).
This folder also includes research papers on different Embedding Methods.
Image of PME block
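As referenced above, here is a minimal NumPy sketch of averaging and concatenation. The matrices are random placeholders for aligned GloVe and word2vec lookup tables (row i = vector for word i); DME and PME learn the combination inside the model, so see the notebooks and papers in the folder for those.

```python
import numpy as np

# Random placeholders for two aligned embedding lookup tables.
vocab_size, dim = 1000, 300
glove_matrix = np.random.rand(vocab_size, dim)
w2v_matrix = np.random.rand(vocab_size, dim)

# 1. Averaging: keeps the original dimensionality.
avg_embedding = (glove_matrix + w2v_matrix) / 2.0                        # (1000, 300)

# 2. Concatenation: keeps all the information at the cost of a wider vector.
concat_embedding = np.concatenate([glove_matrix, w2v_matrix], axis=1)    # (1000, 600)
```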
This folder is made up of two separate folders: BinaryProblems and Multi-Label Problems.
BinaryProblems: Includes notebooks for handling sentiment analysis problems (the IMDB movie review dataset). It also includes the different model architectures used to solve the problem.
Multi-Label Problems: Includes notebooks for the following problems:
1. StackOverflow tag prediction: the input is the question and its title, and the model needs to predict the tags.
2. Toxicity classification: each comment can belong to one or more of the 6 classes.
BackStory (no need to read this): In this section I have used ML models like logistic regression wrapped in OneVsRestClassifier, since these models can handle multi-label data. The reason is that my final-year project was based on a similar multi-label problem, where I had to develop a model that could be deployed on a server; both the compute power and the memory available for re-training were limited, hence the use of classical ML models instead of resource-intensive LSTMs. The dataset was also small, as this was a pilot project for the company that gave us the project.
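A minimal scikit-learn sketch of that setup, with TF-IDF features and hypothetical toy data (two documents, two tags):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Toy multi-label data: y is a binary indicator matrix, one column per tag.
X = ["how to merge two dicts in python", "segfault in c++ pointer arithmetic"]
y = [[1, 0], [0, 1]]                      # e.g. columns = ["python", "c++"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50000)),
    ("ovr", OneVsRestClassifier(LogisticRegression(C=1.0, solver="liblinear"))),
])
clf.fit(X, y)
print(clf.predict(["python dict comprehension"]))   # one 0/1 prediction per tag
```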
Attention has become a very important part of NLP; it is behind most of the recent advancements in the field (NLP's "ImageNet moment"). The purpose of this folder is to house all the different attention architectures. The attention mechanisms included in the folder are:
1. Simple Attention
2. Self Attention
3. Attention with Context (I like this one.)
4. Additive Attention
5. Scaled Dot-Product Attention (a NumPy sketch is shown after this list)
6. Multi-Head Attention
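As referenced above, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ / √d_k) · V; the shapes and toy data are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (seq_len_q, seq_len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V, weights

# Toy check: 4 tokens, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # (4, 8) (4, 4)
```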
Research papers are included which help in understanding why these mechanisms work.
A new notebook comparing the different attention mechanisms is in progress and will be added shortly.