Corpus_Analysis_and_Sentence_Embeddings

The data for part 1 can be found here:

Use the Atticus dataset of legal contracts (CUAD): https://zenodo.org/record/4595826#.YyXT6HbMI2w

Download the file CUAD_v1.zip, unzip it, and open the folder full_contract_txt/

It contains 510 plain-text files, one per contract, holding the full text of the underlying contracts. Each file is named “[document name].txt”. The contracts are not labeled. You will need to concatenate all the text files to form a corpus.
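The concatenation step could be sketched as follows. This is a minimal illustration and not necessarily what question1.py does: the function name `build_corpus`, the UTF-8 encoding, the newline separator, and the case-insensitive `.txt` match are all assumptions.

```python
from pathlib import Path


def build_corpus(folder, encoding="utf-8"):
    """Concatenate every .txt file in `folder` into one string (hypothetical helper)."""
    parts = []
    # Sort the paths so the corpus is assembled in a reproducible order.
    for path in sorted(Path(folder).iterdir()):
        # Assumption: match the extension case-insensitively in case some files end in .TXT.
        if path.is_file() and path.suffix.lower() == ".txt":
            # Assumption: replace undecodable bytes instead of stopping on them.
            parts.append(path.read_text(encoding=encoding, errors="replace"))
    # Assumption: a newline between files keeps the last word of one contract
    # from merging with the first word of the next.
    return "\n".join(parts)
```

A call such as `corpus = build_corpus("CUAD_v1/full_contract_txt")` would then return the whole corpus as one string, assuming the archive was unzipped into the working directory.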

The data for part 2 is in a zipped file, and can be found here:

Use the dataset from SemEval-2016 Task 1: Semantic Textual Similarity (STS).

Use the STS Core (English Monolingual subtask) test data with gold labels. Do not use the training data. Read more about the task at https://alt.qcri.org/semeval2016/task1/#
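Reading the test pairs together with their gold labels might look like the sketch below. It assumes the usual STS layout: an input file with one tab-separated sentence pair per line, and a gold-standard file with one score per line, where a blank line marks an unscored pair. Check the unzipped files before relying on this. The helper names `load_sts_pairs` and `pearson` are hypothetical. Pearson correlation is the metric normally reported for STS, but this README does not state which metric to use.

```python
import math
from pathlib import Path


def load_sts_pairs(input_path, gold_path):
    """Return (sentence1, sentence2, gold_score) for every scored pair."""
    # Split on "\n" only, so both files stay aligned line by line.
    inputs = Path(input_path).read_text(encoding="utf-8").split("\n")
    golds = Path(gold_path).read_text(encoding="utf-8").split("\n")
    pairs = []
    for line, gold in zip(inputs, golds):
        fields = line.rstrip("\r").split("\t")
        gold = gold.strip()
        # Assumption: a blank gold line means the pair has no label, so skip it.
        if len(fields) < 2 or not gold:
            continue
        pairs.append((fields[0], fields[1], float(gold)))
    return pairs


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

The gold scores from `load_sts_pairs` can be passed to `pearson` along with a list of predicted similarities, for example cosine similarities between sentence embeddings.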

Repository contents:

output
README.md
README.txt
output.txt
question1.py
question2_part1.py
question2_part2.py
sts2016-english-with-gs-v1.0.zip
tokens.txt

Juliane2210/Corpus_Analysis_and_Sentence_Embeddings

Folders and files

Latest commit

History

Repository files navigation

Corpus_Analysis_and_Sentence_Embeddings

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages