Feat: Spam Detection Feature #693

aimura09 · 2024-01-12T01:23:32Z

Attempts to close https://github.com/comses/planning/issues/113

Squashed commits and resolved merge conflicts.

Summary

Management commands for Machine Learning spam detection.

Features

Before running the commands, make sure spam_dataset.csv is located in the curator folder.
spam_dataset.csv consists of user_id and is_spam columns. The is_spam column contains 1 (spam) or 0 (ham).
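
As a rough illustration of the dataset format described above, the following sketch reads a CSV with user_id and is_spam columns and rejects any label other than 0 or 1. It uses only the standard library and an in-memory sample. The function name read_spam_labels and the sample rows are hypothetical and are not part of this PR.

```python
import csv
import io

# Hypothetical sample rows; the real spam_dataset.csv lives in the curator folder.
SAMPLE_CSV = "user_id,is_spam\n101,1\n102,0\n103,1\n"


def read_spam_labels(csv_file):
    """Return {user_id: is_spam} after checking the columns and label values."""
    reader = csv.DictReader(csv_file)
    required = {"user_id", "is_spam"}
    if reader.fieldnames is None or not required.issubset(reader.fieldnames):
        raise ValueError(f"expected columns {sorted(required)}, got {reader.fieldnames}")
    labels = {}
    for row in reader:
        is_spam = int(row["is_spam"])
        if is_spam not in (0, 1):
            raise ValueError(f"is_spam must be 0 or 1, got {is_spam!r}")
        labels[int(row["user_id"])] = bool(is_spam)
    return labels


labels = read_spam_labels(io.StringIO(SAMPLE_CSV))
print(labels)  # {101: True, 102: False, 103: True}
```

Validating the labels up front keeps a malformed row from silently ending up in the training data.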

XGBoostClassifier() ... Uses XGBoost as a classifier. Takes a data frame with the columns "user_id" and "input_data". The "input_data" column is a numerical vector in which the selected fields are encoded by an encoder.
CountVectEncoder() ... Uses CountVectorizer as an encoder. Takes fields selected from "user_id", "labelled_by_curator", "first_name", "last_name", "is_active", "email", "affiliations", "bio", and "research_interests" of the MemberProfiles as input.
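
To make the "input_data" vector more concrete, here is a minimal pure-Python sketch of count-based encoding. It is a simplified stand-in for CountVectorizer, not the encoder implemented in this PR: the tokenizer, the helper names (tokenize, fit_vocabulary, encode), and the sample profiles are all assumptions for illustration.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(documents):
    """Map every distinct token seen in the documents to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(document, vocabulary):
    """Return per-token counts as a vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(document)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Hypothetical profiles; only two of the selectable fields are used here.
profiles = [
    {"user_id": 1, "bio": "Ecologist modeling land use", "research_interests": "land use models"},
    {"user_id": 2, "bio": "Buy cheap watches", "research_interests": "cheap cheap deals"},
]
selected_fields = ["bio", "research_interests"]

# Join the selected fields into one document per user, then encode each document.
documents = [" ".join(p[field] for field in selected_fields) for p in profiles]
vocabulary = fit_vocabulary(documents)
rows = [
    {"user_id": p["user_id"], "input_data": encode(doc, vocabulary)}
    for p, doc in zip(profiles, documents)
]
for row in rows:
    print(row)
```

Each row pairs a user_id with a fixed-length count vector, which is the shape of input the classifier description above expects.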

Run the following command to get a list of spam users:
./manage.py curator_spam_detection --predict
Other options:
./manage.py curator_spam_detection --fit
./manage.py curator_spam_detection --get_model_metrics
./manage.py curator_spam_detection --load_labels
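
The flags above could be wired up roughly as follows. Django management commands receive an argparse-based parser in add_arguments(), so this sketch uses plain argparse as a stand-in. It assumes all four options are boolean switches and that several may be combined in one call; the help texts, the execution order, and the selected_actions helper are hypothetical and not taken from the PR.

```python
import argparse


def build_parser():
    """Build a parser with the four flags as boolean switches (an assumption)."""
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    parser.add_argument("--load_labels", action="store_true",
                        help="load curator labels from the dataset file")
    parser.add_argument("--fit", action="store_true",
                        help="train the classifier")
    parser.add_argument("--get_model_metrics", action="store_true",
                        help="report metrics for the trained model")
    parser.add_argument("--predict", action="store_true",
                        help="list users predicted to be spam")
    return parser


def selected_actions(argv):
    """Return the requested actions in a fixed, hypothetical execution order."""
    args = build_parser().parse_args(argv)
    order = ["load_labels", "fit", "get_model_metrics", "predict"]
    return [name for name in order if getattr(args, name)]


print(selected_actions(["--predict", "--fit"]))  # ['fit', 'predict']
```

Running the actions in a fixed order would let a single invocation load labels, train, and predict, but whether the real command allows combining flags is not stated here.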

Tests

Wrote 16 unit tests using the Django test framework.

- use 'first_name', 'last_name', 'is_active', 'email', 'affiliations', and 'bio' from MemberProfile
- update user pipeline to fix latency issues
- small fix in all_users_df()
- convert df.value from markup to string
- fix name of df.columns

fix: modify the partial_train to use the correct tokenizer

feat:
- save to database from df
- save recommendations
- add load_labels() function to curator/spam_detect.py
- load_labels() takes the filepath of the dataset and loads a dataframe consisting of user__id and is_spam columns, then initializes the SpamRecommendation table
- get all unlabelled users in dataframe

chore:
- migrations for altering SpamRecommendation

fix: fixed SpamRecommendation __str__ function

fix:
- using None instead of an extra column in SpamRecommendation
- Aiko had the idea of using None instead of an extra field that specified whether a model has been labelled before, so we are going to switch to that.

refactor: Move BioSpamClassifier to spam_detection_model.py and change function names in SpamClassifier

chore: removed print statements

feat: add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels. Add 'TODO' comments on the parts to be fixed.

feat: added model validation in TextClassifier

fix: fixed positional argument bug
- fixed positional argument bug
- dataset.csv replaced and create a directory in shared folder for spam detection related files

fix: fix typing issue in df[labelled_by_curator] column
- manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.

fix:
- move SPAM_DIR_PATH into settings
- set SPAM_DIR_PATH as a pathlib.Path
- remove last reference to update_labelled_by_curator
- adjust test curator labelling references
- use assertCountEqual for order independent comparison
- could also convert to sets because there shouldn't be any duplicates

refactor: restructure code and tests
- tests should use SpamDetector entrypoint instead of instantiating individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager

Co-Authored-By: Allen Lee <[email protected]>
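
The difference between assertCountEqual and a set comparison, both mentioned in the commit message above, can be seen in a small standalone unittest sketch. The test class and the ID values are hypothetical and are not the PR's actual test cases.

```python
import unittest


class OrderIndependentComparisonTest(unittest.TestCase):
    def test_same_ids_in_different_order(self):
        expected_ids = [3, 1, 2]
        predicted_ids = [1, 2, 3]
        # Passes: same elements with the same multiplicities, order ignored.
        self.assertCountEqual(predicted_ids, expected_ids)
        # Equivalent here because neither list contains duplicates.
        self.assertEqual(set(predicted_ids), set(expected_ids))

    def test_duplicates_are_detected(self):
        # assertCountEqual compares element counts, so these lists differ...
        with self.assertRaises(AssertionError):
            self.assertCountEqual([1, 1, 2], [1, 2, 2])
        # ...while converting to sets would hide the difference.
        self.assertEqual(set([1, 1, 2]), set([1, 2, 2]))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderIndependentComparisonTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If user IDs are guaranteed to be unique, both approaches behave the same; assertCountEqual is the stricter choice when that guarantee might not hold.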

…d_by_curator=None
- using get_all_users_df() instead of get_unlabelled_by_curator_df() to obtain the dataframe, because a user previously labelled as ham may turn into spam.
- improved the management command messages.
- added exception handling for file operations.
- replaced MultinomialNB with XGBoost.

aimura09 force-pushed the feat_spam_detection branch from 100b49a to a2c8118 Compare January 16, 2024 22:14

aimura09 force-pushed the feat_spam_detection branch 2 times, most recently from a2c8118 to 2215d7e Compare January 25, 2024 17:02

aimura09 force-pushed the feat_spam_detection branch from e5c16cd to e3f2f52 Compare May 6, 2024 05:10

sgfost mentioned this pull request May 13, 2024

basic spam detection #719

Merged

3 tasks

alee force-pushed the feat_spam_detection branch from 590a4c4 to 3623378 Compare June 17, 2024 23:30

CharlesSheelam and others added 23 commits June 17, 2024 16:50

feat: create user pipeline for spam detection

d22cffa

feat: rewrite UserPipeline to include user id
feat: correct user pipeline for user id
feat: fix user id column in dataframes

feat:created a model for storing spam recommendations

acd73e1

feat: added extra field in SpamRecommendation for user classifier

f105aa5

chore: todo noel
chore: organized imports
fix: fix the issue that data in database is not updated

feat: initializing a new SpamRecommendation whenever a new MemberProf…

1c072a1

…ile is created feat: - fit text spam classifier - prediction function in classifiers

fix/refactor: modifies UserPipeline functions and added an abstract c…

18ce5f3

…lass SpamClassifier refactor: refactored UserMetadataSpamClassifier, integrate TextSpamClassifier, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus

feat: unit tests added, comments added, SpamDetection class moved fro…

912605b

…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/

fix/refactor: TextSpamClassifier and UserSpamStatusProcessor

f4f309a

fix: fixed KeyError bug in TextSpamClassifier
refactor:
- create new file 'spam_processor.py' for UserSpamStatusProcessor.
- change name from dataset.csv to spam_dataset.csv

fix: replace Tensorflow Tokenizer with CountVectorizer. Deleted parti…

32d1f01

…al_fit() because CountVectorizer requires a model to fit the entire training dataset. Fix the management commands and tests accordingly. Also re-generated migration files

fix: create UserSpamStatus with MemberProfiles

8b849db

also clean up duplicate / dead imports

fix: style with black and fix the timing to create the dir to /shared…

f3efa17

…/curator/spam

refactor: Cleaning and adding more comments for the functions related…

b777fad

… to the spam feature.
- adding headline comments for the functions.
- Cleaning up the management command code and clarifying code responsibilities.
- Bettering execution messages.

fix: replace assertListEqual with assertTrue for sets in the spam det…

6f3ea68

…ection testcases

fix: fix variable name inconsistency.

2676bae

wip: initial implementation of the spam detection system with the upd…

5a49cb3

…ated architecture.

feat, refactor: add unit tests and comments.

a113516

fix: dependency conflicts + apply black

7744438

chore: typos

0074230

alee force-pushed the feat_spam_detection branch from 3623378 to 0074230 Compare June 17, 2024 23:52

fix: remove tensorflow

863f3c0
