
apaladugu3/594-Humor


Go to the BERT repository and download the model you want to use to generate embeddings. There is a choice of hidden sizes: 768 (BERT-Base) or 1024 (BERT-Large). Depending on which you pick, you may have to change the embedding dimensions in the CNN to match.
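A minimal sketch of the point above: the CNN's embedding dimension must equal the hidden size of the BERT model you downloaded. The variable name `embedding_dim` is an assumption; adjust whatever the corresponding hyperparameter is actually called in the CNN code.

```python
# Hidden size of the downloaded BERT model:
# 768 for BERT-Base, 1024 for BERT-Large.
BERT_HIDDEN_SIZE = 768

# Hypothetical name for the CNN's embedding-dimension hyperparameter;
# it must match the BERT hidden size or the tensor shapes will not line up.
embedding_dim = BERT_HIDDEN_SIZE

assert embedding_dim in (768, 1024)
```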

Many file paths are hard-coded across these files, so after downloading everything, update the paths first to make the rest of the workflow easier. You may have missed downloading a vocab file, which is also available online, so make sure you have one.
After downloading BERT, use the filteredinput and filteredinputn files to generate BERT vectors by running the following command:

python3 extract_features.py \
  --input_file=filteredinput.txt \
  --output_file=output.json \
  --vocab_file=vocab.txt \
  --bert_config_file=bert_config.json \
  --init_checkpoint=bert_model.ckpt \
  --layers=-1 \
  --max_seq_length=128 \
  --batch_size=8

Make sure that --layers is always set to -1 in the command above, so that only the final layer's embeddings are extracted.
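For reference, extract_features.py writes one JSON object per input line, where each token carries one layer entry (index -1) holding its final-layer vector. The sketch below parses a tiny synthetic record in that shape (real vectors have 768 or 1024 values, not 3):

```python
import json

# Synthetic example of one line of output.json from extract_features.py,
# run with --layers=-1 (one layer entry per token).
record = json.loads(
    '{"linex_index": 0, "features": ['
    '{"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, -0.2, 0.3]}]},'
    '{"token": "hello", "layers": [{"index": -1, "values": [0.4, 0.5, -0.6]}]}'
    ']}'
)

# Collect the final-layer vector for every token in the sentence.
vectors = {f["token"]: f["layers"][0]["values"] for f in record["features"]}
print(vectors["hello"])  # -> [0.4, 0.5, -0.6]
```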

Create a new folder to store the data, matching the paths you chose earlier.

Run final_cleansing.py and final_positive.py, specifying as inputs whether the data is positive or negative, followed by the name of the file containing the BERT vector embeddings.

Remove positiveid, negativeid, and vocabcnn, then run vocab_generator.py to regenerate them.
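The repository does not document what vocab_generator.py does internally, but a vocabulary generator typically assigns each distinct token an integer id. The sketch below is only an illustration of that general idea, not the actual script:

```python
from collections import Counter

# Toy corpus standing in for the filtered input sentences.
sentences = ["this joke is funny", "this joke is not funny"]

# Count tokens so frequent words get the smallest ids.
counts = Counter(tok for s in sentences for tok in s.split())

# Ids 0 and 1 are commonly reserved for padding and unknown tokens.
vocab = {"<pad>": 0, "<unk>": 1}
for tok, _ in counts.most_common():
    vocab[tok] = len(vocab)

# Map a new sentence to ids; unseen words fall back to <unk>.
ids = [vocab.get(t, vocab["<unk>"]) for t in "this joke is great".split()]
```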

Then run train.py to train the CNN on the positive id and negative id files. The CNN needs the vocabulary file generated by vocab_generator.py.
train.py depends on text_cnn.py and datahelpers.py to run.

If you want some backtracking capability, such as converting a tensor back into sentences, look at Bert_tokens.py.
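A hedged sketch of that kind of backtracking, under the assumption that Bert_tokens.py inverts the vocab and rejoins WordPiece pieces (BERT marks continuation pieces with a "##" prefix); the tiny vocab below is illustrative, not the real vocab.txt:

```python
# Toy id->token mapping; a real vocab.txt has ~30k entries.
vocab = {"[CLS]": 101, "hum": 1, "##or": 2, "is": 3, "fun": 4, "[SEP]": 102}
inv = {i: t for t, i in vocab.items()}

def ids_to_sentence(ids):
    """Convert token ids back to text, merging WordPiece continuations."""
    words = []
    for tok in (inv[i] for i in ids):
        if tok in ("[CLS]", "[SEP]"):
            continue  # drop BERT's special markers
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # attach continuation to previous piece
        else:
            words.append(tok)
    return " ".join(words)

print(ids_to_sentence([101, 1, 2, 3, 4, 102]))  # -> "humor is fun"
```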

For any other questions, please contact me at [email protected]
