The 4th of 4 NLP projects: another clustering exercise, but with larger documents that require cleaning the text prior to vectorization with sklearn.

PROJECT 3 README

Henry Thomas -- CS5293sp23-Project3 FINAL PROJECT

1. HOW TO INSTALL

The project can be installed by running the following command:

pipenv install project3

2. HOW TO RUN

TO RUN FROM THE ROOT DIRECTORY, USE THE FOLLOWING:

pipenv run python project3.py --document testcity.pdf

NOTE: I DEVELOPED AND RAN THIS USING A GOOGLE CLOUD e2-standard-8 INSTANCE

--document is the only implemented flag. A sample PDF file (testcity.pdf) is saved in the root directory for testing the functionality of the program.
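
A minimal sketch of how that flag could be parsed, assuming argparse (the actual argument handling in project3.py may differ):

    import argparse

    # hypothetical parser mirroring the documented interface
    parser = argparse.ArgumentParser(description="Assign a smart-city PDF to a cluster")
    parser.add_argument("--document", required=True, help="path to the PDF to cluster")
    args = parser.parse_args()
    print("clustering:", args.document)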

Visit this link for a video example of how to run: youtube.com

3. DESIGN CONSIDERATIONS

project3.py is the source file. It extracts the text from the provided PDF file, vectorizes it, and assigns it to a cluster using two models saved with pickle in the root directory: vectorizer.pkl and model.pkl.

vectorizer.pkl is a Pipeline object that was trained on all the provided PDF documents. It runs the text through TfidfVectorizer, TruncatedSVD, and Normalizer, as recommended in the sklearn documentation and as I used in the Project 2 clustering assignment. The pipeline was saved with pickle so that future documents to be assigned to a cluster can be vectorized the same way as the originals.
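
A minimal sketch of what that pipeline could look like; the hyperparameters (the df cutoff, n_components) and the cleaned_documents variable are illustrative assumptions, not the repo's actual values:

    import pickle

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import Normalizer
    from sklearn.pipeline import make_pipeline

    # cleaned_documents: list[str] of pre-cleaned PDF texts (assumed to exist)
    vectorizer = make_pipeline(
        TfidfVectorizer(max_df=0.975),   # drop near-universal terms (see below)
        TruncatedSVD(n_components=100),  # LSA-style dimensionality reduction
        Normalizer(copy=False),          # unit-length rows so distances act like cosine
    )
    X = vectorizer.fit_transform(cleaned_documents)

    # save for reuse on future documents
    with open("vectorizer.pkl", "wb") as f:
        pickle.dump(vectorizer, f)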

The second pickled model, model.pkl, is the sklearn AgglomerativeClustering model with k=2. We use this to actually cluster new documents.
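
A minimal sketch of how a new document might be assigned using the two pickled files. AgglomerativeClustering has no predict() method, so this sketch re-fits on the original vectors plus the new one and reads off the last label; original_vectors and new_document_text are assumed names, and project3.py may do this differently:

    import pickle

    import numpy as np

    with open("vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)  # AgglomerativeClustering(n_clusters=2)

    new_vec = vectorizer.transform([new_document_text])  # text extracted from the new PDF
    all_vecs = np.vstack([original_vectors, new_vec])    # training vectors + new row
    labels = model.fit_predict(all_vecs)
    print("cluster for the new document:", labels[-1])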

I spent a lot of time trying to find the best k and figuring out which words to exclude. My main strategy for exclusion was to use a TfidfVectorizer to identify words that appear in 97.5% or more of the documents; a word that common is likely one like 'smart' or 'city', which does little to discern between a pile of documents that are all about smart cities. Since TfidfVectorizer can automatically remove words that appear in more than X% of the documents, as well as words that appear in fewer than X% (helpful for eliminating words that are highly specific to one document but don't help with meaning, e.g. the name of a particular city), I was able to do this with just the arguments of the TfidfVectorizer constructor.

That said, even after doing that (and adding a method to the text cleaner that removed words not in the medium-sized spaCy English model), a few non-words such as 'st' and 'zp' still appeared, so I went back and made a special list to remove them. A lot of time was spent looking for that perfect upper document-frequency cutoff to cut out words that occur in too many documents.
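
A minimal sketch of that exclusion strategy; the 0.975 upper cutoff comes from the text above, while the lower cutoff, the spaCy vocabulary check, and raw_documents are illustrative assumptions:

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_md")  # the medium-sized spaCy English model
    NON_WORDS = {"st", "zp"}            # leftovers found by manual inspection

    def clean(text: str) -> str:
        # keep only tokens the spaCy vocabulary recognizes, minus the manual list
        return " ".join(
            tok for tok in text.lower().split()
            if tok not in NON_WORDS and nlp.vocab.has_vector(tok)
        )

    tfidf = TfidfVectorizer(
        max_df=0.975,  # drop words appearing in more than 97.5% of documents
        min_df=0.05,   # assumed lower cutoff: drop words specific to one document
    )
    X = tfidf.fit_transform(clean(doc) for doc in raw_documents)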

After doing all this I was able to run the model and get a pretty high silhouette score. The one thing I was not happy with was the imbalance of the cluster labels: with k=2 I was hoping for something close to a 50/50 split between the two clusters, but it came out closer to 15/85.
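
A minimal sketch of how the score and the split could be checked, assuming X and labels come from the steps above:

    import numpy as np
    from sklearn.metrics import silhouette_score

    print("silhouette:", silhouette_score(X, labels))  # range [-1, 1], higher is better
    print("label split:", np.bincount(labels))         # e.g. roughly 15/85 here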

4. TESTS

Tests were implemented for each group of models to verify basic operability.
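
A hypothetical pytest-style smoke test of the kind described; the test body and file path are assumptions, not the repo's actual tests:

    import pickle

    def test_vectorizer_transforms_new_text():
        # the pickled pipeline should turn one raw string into one vector
        with open("vectorizer.pkl", "rb") as f:
            vectorizer = pickle.load(f)
        X = vectorizer.transform(["a smart city transportation plan"])
        assert X.shape[0] == 1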

5. OTHER NOTES

This is the final project, so I'd like to say that I really enjoyed this class. It gave me a lot to think about and laid the groundwork for continued learning in this area. I hadn't had much exposure to sklearn before this class; it is a great tool with excellent documentation, and I look forward to continuing to explore it as I further my career in this field.

Note to self for future use: COMMAND TO RUN jupyter notebook: pipenv run jupyter-lab --no-browser --ip 0.0.0.0 --ServerApp.iopub_data_rate_limit=1.0e10
