Complete text mining #1

Open
wants to merge 14 commits into
base: master

@vivienyuwenchen vivienyuwenchen commented Oct 9, 2017

Revised. Fixed syntax. Imported functions from text_mining instead of repeating them in text_mining_tfidf. Removed the redundant stop-word removal from text_mining_tfidf; this changes the TF-IDF score of each word, because stop words now count toward the score.
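
To illustrate why dropping the stop-word filter changes every score, here is a minimal TF-IDF sketch. It is not the code in text_mining_tfidf; the function names, the `1 +` smoothing in `idf`, and the toy documents are all assumptions made for the example. The point is only that term frequency divides by the document length, so keeping stop words in the list lowers the TF of every other word.

```python
import math


def tf(word, words):
    """Term frequency: share of the document's tokens equal to `word`."""
    return words.count(word) / len(words)


def idf(word, documents):
    """Inverse document frequency with +1 smoothing (an assumed variant)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + containing))


def tfidf(word, words, documents):
    """TF-IDF score of `word` in one document relative to all documents."""
    return tf(word, words) * idf(word, documents)


# Hypothetical toy data, not taken from the PR.
stop_words = {"the", "of"}
doc = ["the", "fall", "of", "man"]
filtered = [w for w in doc if w not in stop_words]

print(tf("man", doc))       # 0.25 (stop words counted in the length)
print(tf("man", filtered))  # 0.5  (stop words removed first)
```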


Top 50 Words in Paradise_Lost:

- ['thir', 'thy', 'thou', 'thee', "heav'n", 'shall', 'th', 'god', 'earth', 'man', 'high', 'great', 'death', 'till', 'hath', 'hell', 'stood', 'day', 'good', 'like', 'things', 'night', 'light', 'farr', 'love', 'eve', 'o', 'world', 'adam', 'soon', 'let', 'hee', 'son', 'life', 'know', 'place', 'long', 'forth', 'self', 'mee', 'ye', 'way', 'power', 'hand', 'new', 'deep', 'end', 'fair', 'men', 'satan']

Hmm. I wonder how much of the difference could be due to unrecognized words, e.g., archaic spellings or contractions that the sentiment analyzer doesn't recognize and therefore scores as "neutral".
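
One rough way to check this would be to measure what fraction of the words fall outside the analyzer's vocabulary. The sketch below assumes the vocabulary is available as a plain set; `toy_lexicon` is a made-up stand-in, and how to get the real word list out of the sentiment library is left open.

```python
def unrecognized_fraction(words, lexicon):
    """Return the share of `words` that do not appear in `lexicon`."""
    if not words:
        return 0.0
    unknown = [w for w in words if w.lower() not in lexicon]
    return len(unknown) / len(words)


# Hypothetical lexicon; a real analyzer's vocabulary would be much larger.
toy_lexicon = {"good", "great", "death", "love"}
top_words = ["thir", "thy", "heav'n", "good", "love"]

print(unrecognized_fraction(top_words, toy_lexicon))  # 0.6
```

A high fraction for the top-n list would suggest that much of the "neutral" result comes from words the analyzer simply doesn't know.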

text_mining.py Outdated
Returns:
text from url
"""
if exists(file_name) == False:

Small style thing: rather than comparing a boolean to False, we can just write "if not exists(file_name):".

Side note, I like the structure of reading a file unless the file doesn't exist, then grabbing from the URL instead. It makes the program nice and portable!
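
For reference, the read-from-file-else-download pattern with the `not exists(...)` check could look roughly like this. It is a sketch, not the function in this PR: it assumes the cache is a plain UTF-8 text file and that `urllib.request` is used for the download.

```python
from os.path import exists
from urllib.request import urlopen


def get_cache(url, file_name):
    """Return the text stored in `file_name`, downloading it from `url` first if needed."""
    if not exists(file_name):
        # No local copy yet: fetch the text once and save it for next time.
        with urlopen(url) as response:
            text = response.read().decode("utf-8")
        with open(file_name, "w", encoding="utf-8") as f:
            f.write(text)
    # At this point the file exists, so every call reads from disk.
    with open(file_name, encoding="utf-8") as f:
        return f.read()
```

After the first run the URL is never touched again, which is what makes the script usable offline.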

word_list[i] = word_list[i].strip(string.punctuation)

stop_words = get_stop_words('en')
stop_words_2 = ["a", "about", "above", "across", "after", "afterwards", "again", "against", "all",

Out of curiosity, what happens if you only use the stop_words list from get_stop_words('en'), rather than also using your manually assembled one?

Author

stop_words = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']

Author

It worked pretty well, but there were a few common words that weren't included in stop_words (I can only remember 'one' being a very common word off the top of my head), so I just googled another set of stop words and copied it over.
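
A sketch of how the two lists might be combined, assuming both are plain lists of lowercase strings. The helper name `remove_stop_words` and the short sample lists are made up for illustration; merging into a set removes duplicates between the lists and makes each lookup fast.

```python
def remove_stop_words(words, *stop_lists):
    """Return `words` without any entry found in one of the stop-word lists."""
    stop_set = set().union(*stop_lists)  # merge all lists, dropping duplicates
    return [w for w in words if w.lower() not in stop_set]


# Hypothetical short lists standing in for the two real ones.
library_stop_words = ["a", "the", "of", "and"]
extra_stop_words = ["one", "the", "across"]

words = ["the", "one", "god", "of", "earth"]
print(remove_stop_words(words, library_stop_words, extra_stop_words))
# ['god', 'earth']
```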

text_mining.py Outdated

ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)

return ordered_by_frequency[0:n]

FWIW, you can shorten [0:n] to [:n]; the 0 is implied. Your version still works, though. It's just a personal preference, really.
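
Here is the same idea as a small self-contained sketch. The function name `top_n_words` and the use of `collections.Counter` to build the counts are assumptions; only the `sorted(...)` line and the slice mirror the snippet under review.

```python
from collections import Counter


def top_n_words(word_list, n):
    """Return the n most frequent words, most frequent first."""
    word_counts = Counter(word_list)
    ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
    return ordered_by_frequency[:n]  # same result as [0:n]


print(top_n_words(["thy", "thou", "thy", "god", "thy", "thou"], 2))
# ['thy', 'thou']
```

`Counter(word_list).most_common(n)` would be another option, though it returns (word, count) pairs rather than just the words.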

# print the sentiment of the top n words
print('Sentiment of Top %d Words in %s:' % (n, title))
print(sentiment_analyzer(top_n_words), '\n')
# print the sentiment of the whole text

This might be a little too much commenting - print statements like these mostly stand on their own.

Though, a bit too much documentation is better than a bit too little!

from textblob import TextBlob as tb # pip install textblob


def get_cache(url, file_name):

One thing to look into: you can import your own functions in Python (e.g., "from text_mining import get_cache"), which would save you some repeated code!


2 participants