Feat: Spam Detection Feature #693

Open
wants to merge 24 commits into
base: main
from

Conversation

aimura09
Contributor

@aimura09 aimura09 commented Jan 12, 2024

Attempts to close https://github.com/comses/planning/issues/113

Squashed commits and solved merge conflicts.

Summary

Management commands for Machine Learning spam detection.

Features

Before running the commands, make sure spam_dataset.csv is located in the curator folder.
spam_dataset.csv consists of the user_id and is_spam columns. The is_spam column contains 1 (spam) or 0 (ham).
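
For reviewers who want to sanity-check a dataset before running the commands, here is a minimal sketch of a validator for the two-column layout described above. It uses only the standard library and is not the loader in this PR. The helper name `read_labels`, the assumption that user_id is an integer, and the sample rows are all hypothetical.

```python
import csv
import io


def read_labels(csv_file):
    """Return {user_id: is_spam} from a CSV with user_id and is_spam columns."""
    reader = csv.DictReader(csv_file)
    missing = {"user_id", "is_spam"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    labels = {}
    for row in reader:
        value = row["is_spam"].strip()
        # The description allows only 1 (spam) or 0 (ham).
        if value not in ("0", "1"):
            raise ValueError(f"is_spam must be 0 or 1, got {value!r}")
        # Assumption: user_id is an integer primary key.
        labels[int(row["user_id"])] = value == "1"
    return labels


# Made-up rows for illustration only, not taken from the real dataset.
sample = io.StringIO("user_id,is_spam\n101,1\n102,0\n")
labels = read_labels(sample)
```

Failing early on a missing column or an unexpected label value keeps a malformed file from silently producing wrong training labels.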

  • XGBoostClassifier() ... Uses XGBoost as a classifier. Takes a data frame that has the columns "user_id" and "input_data". The "input_data" column is a numerical vector in which the selected fields are encoded by an encoder.

  • CountVectEncoder() ... Uses CountVectorizer as an encoder. Takes selected fields from "user_id", "labelled_by_curator", "first_name", "last_name", "is_active", "email", "affiliations", "bio", "research_interests" of the MemberProfiles as input.
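
To make the "input_data" vector concrete, here is a rough sketch of bag-of-words count encoding in pure Python. It is a simplified stand-in for what a count-based encoder produces, not scikit-learn's CountVectorizer and not the CountVectEncoder in this PR. The function names and the sample profile texts are made up for illustration.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, vocabulary):
    """Return a count vector for text; tokens outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical profile text keyed by user id.
profiles = {
    1: "buy cheap pills buy now",
    2: "agent based modeling researcher",
}
vocabulary = fit_vocabulary(profiles.values())
rows = [
    {"user_id": uid, "input_data": encode(text, vocabulary)}
    for uid, text in profiles.items()
]
```

In this sketch every vector has one slot per token seen at fit time, so the vocabulary has to be built from the whole training set before any row is encoded.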

  1. Run the following command to get a list of spam users.
    ./manage.py curator_spam_detection --predict

  2. Other options:
    ./manage.py curator_spam_detection --fit
    ./manage.py curator_spam_detection --get_model_metrics
    ./manage.py curator_spam_detection --load_labels
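
As a rough illustration of how the four flags could be wired up, here is a standalone argparse sketch. Django management commands receive an argparse parser in `add_arguments`, so the same `add_argument` calls would apply there. Treating the flags as mutually exclusive, the help strings, and the helper names `build_parser` and `selected_action` are assumptions, not taken from the PR's code.

```python
import argparse

ACTIONS = {
    # Help text is inferred from the flag names and the PR description.
    "predict": "list users predicted to be spam",
    "fit": "train the classifier",
    "get_model_metrics": "report metrics for the trained model",
    "load_labels": "load labels from the dataset file",
}


def build_parser():
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    # Assumption: exactly one action per invocation.
    group = parser.add_mutually_exclusive_group(required=True)
    for name, help_text in ACTIONS.items():
        group.add_argument(f"--{name}", action="store_true", help=help_text)
    return parser


def selected_action(argv):
    """Return the name of the single action flag present in argv."""
    args = build_parser().parse_args(argv)
    return next(name for name in ACTIONS if getattr(args, name))
```

For example, `selected_action(["--fit"])` returns `"fit"`, and passing two flags at once makes argparse exit with a usage error.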

Tests

Wrote 16 unit tests using the Django test framework.

@aimura09 aimura09 force-pushed the feat_spam_detection branch 2 times, most recently from a2c8118 to 2215d7e Compare January 25, 2024 17:02
@sgfost sgfost mentioned this pull request May 13, 2024
3 tasks
CharlesSheelam and others added 23 commits June 17, 2024 16:50
feat: rewrite UserPipeline to include user id

feat: correct user pipeline for user id

feat: fix user id column in dataframes
- use 'first_name', 'last_name', 'is_active', 'email', 'affiliations', and 'bio' from MemberProfile
- update user pipeline to fix latency issues
- small fix in all_users_df().
- Convert df.value from markup to string
- Fix name of df.columns

fix: modify the partial_train to use the correct tokenizer
feat:
 - save to database from df
 - save recommendations
 - add load_labels() function to curator/spam_detect.py
 - load_labels() will take the filepath of the dataset and load a dataframe consisting of user__id and is_spam columns. Then it will initialize the SpamRecommendation table.
 - get all unlabelled users in dataframe

chore:
 - migrations for altering SpamRecommendation
fix: fixed SpamRecommendation __str__ function

fix:
  - using None instead of an extra column in SpamRecommendation
  - Aiko had the idea of using None instead of an extra field which specified if a model has been labelled before. So, we are going to switch to that.
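
A minimal sketch of the None-as-unlabelled idea, using a plain dataclass rather than the Django model. The class name `SpamStatus`, the helper `describe`, and the reading that True means spam and False means ham are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpamStatus:
    # Hypothetical stand-in for the model discussed here, not the PR's class.
    user_id: int
    # None means no curator has reviewed this user yet.
    labelled_by_curator: Optional[bool] = None


def describe(status):
    """Distinguish 'never labelled' from an explicit ham label."""
    if status.labelled_by_curator is None:
        return "unlabelled"
    # Assumption: True marks spam and False marks ham.
    return "spam" if status.labelled_by_curator else "ham"
```

The check has to be `is None` rather than a truthiness test, because an explicit ham label (False) and a missing label (None) are both falsy.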
refactor: Move BioSpamClassifier to spam_detection_model.py and change functions names in SpamClassifier

chore: removed print statements

feat: add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels. Add 'TODO' comments on the parts to be fixed.
chore: todo noel
chore: organized imports
fix: fix the issue that data in database is not updated
…ile is created

feat:
 - fit text spam classifier
 - prediction function in classifiers
…lass SpamClassifier

refactor: refactored UserMetadataSpamClassifier, integrate TextSpamClassifier, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus
feat: added model validation in TextClassifier
fix: fixed positional argument bug
- fixed positional argument bug
- dataset.csv replaced and create a directory in shared folder for spam detection related files

fix: fix typing issue in df[labelled_by_curator] column
- manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.
…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/
fix: fixed KeyError bug in TextSpamClassifier
refactor:
- create new file 'spam_processor.py' for UserSpamStatusProcessor.
- change name from dataset.csv to spam_dataset.csv
fix:
- move SPAM_DIR_PATH into settings
- set SPAM_DIR_PATH as a pathlib.Path
- remove last reference to update_labelled_by_curator
- adjust test curator labelling references
- use assertCountEqual for order independent comparison
- could also convert to sets because there shouldn't be any duplicates
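
The difference between the two comparison styles mentioned above can be sketched with plain unittest (Django's TestCase inherits these assertions). The test class name and the id values are made up for illustration.

```python
import unittest


class OrderIndependentComparison(unittest.TestCase):
    def test_same_ids_in_any_order(self):
        expected = [3, 1, 2]
        actual = [1, 2, 3]
        # Both checks pass when there are no duplicates.
        self.assertCountEqual(actual, expected)
        self.assertEqual(set(actual), set(expected))

    def test_duplicates_are_only_caught_by_assert_count_equal(self):
        # assertCountEqual compares element counts, so this mismatch fails.
        with self.assertRaises(AssertionError):
            self.assertCountEqual([1, 1, 2], [1, 2, 2])
        # Converting to sets hides the duplicate mismatch.
        self.assertEqual(set([1, 1, 2]), set([1, 2, 2]))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderIndependentComparison)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Set comparison is equivalent only while the lists are guaranteed to be free of duplicates; assertCountEqual would also catch an accidental duplicate id.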

refactor: restructure code and tests
- tests should use SpamDetector entrypoint instead of instantiating
- individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager

Co-Authored-By: Allen Lee <[email protected]>
…al_fit() because CountVectorizer requires a model to fit the entire training dataset. Fix the management commands and tests accordingly.

Also re-generated migration files
also clean up duplicate / dead imports
… to the spam feature.

 - adding headline comments for the functions.

 - Cleaning up the management command code and clarifying code responsibilities.

 - Improving execution messages.
…d_by_curator=None

 - using get_all_users_df() instead of get_unlabelled_by_curator_df() to obtain dataframe, because a user previously labelled as ham may turn into spam.

 - improved the management command messages.

 - added exception handling for file operations.

 - replaced MultinomialNB with XGBoost.
4 participants