"Intelligent Filtering: Detecting Nuance and Context with Machine Learning."
Moving beyond keyword matching, this project introduces a machine learning-powered profanity filter. By analyzing context and linguistic patterns, it aims to identify and filter out offensive language more accurately and intelligently, even when subtle variations or creative spellings are used.
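To illustrate the gap a context-aware classifier is meant to close, here is a minimal sketch of the keyword-matching baseline failing on a creative respelling (the banned-word list and example strings are hypothetical):

```python
def keyword_filter(text, banned=("badword",)):
    """Naive substring matching: flags text only on exact matches."""
    lowered = text.lower()
    return any(word in lowered for word in banned)

# An exact match is caught...
print(keyword_filter("that is a badword"))   # True
# ...but a creative respelling slips straight through.
print(keyword_filter("that is a b4dw0rd"))   # False
```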
Korcen: original before innovation.
Korcen-13M-EXAONE: another failure, but a better one.
Total samples: 2,000,000
Training samples: 1,800,000
Validation samples: 200,000
Tokenizer: SKT-AI/KoGPT2
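The counts above correspond to a 90/10 train/validation split. A minimal sketch of reproducing such a split deterministically (the fixed seed and list-based corpus are assumptions, not the project's actual pipeline):

```python
import random

def train_val_split(samples, val_ratio=0.1, seed=42):
    """Shuffle once with a fixed seed, then carve off a validation slice."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]

corpus = list(range(2_000_000))  # stand-in for the 2M labeled samples
train, val = train_val_split(corpus)
print(len(train), len(val))  # 1800000 200000
```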
| Model | korean-malicious-comments-dataset | Curse-detection-data | kmhas_korean_hate_speech | Korean Extremist Website Womad Hate Speech Data | LGBT-targeted HateSpeech Comments Dataset (Korean) |
|---|---|---|---|---|---|
| korcen | 0.7121 | 0.8415 | 0.6800 | 0.6305 | 0.4479 |
| TF VDCNN_KOGPT2 (23.06.15) | 0.7545 | 0.7824 | 0.7055 | 0.6875 | |
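The table reports one score per benchmark dataset (the exact metric is not stated here). Assuming accuracy-style scoring, a minimal sketch of how such per-dataset numbers are computed (labels and predictions below are placeholders, not real benchmark data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Placeholder labels/predictions; a real evaluation would run the model
# over each benchmark dataset listed in the table.
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 0]
print(round(accuracy(gold, pred), 4))  # 0.8
```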
```python
# py: 3.10, tf: 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 1000
model_path = "vdcnn_model.h5"
tokenizer_path = "tokenizer.pickle"

# Load the trained VDCNN model and the KoGPT2 tokenizer saved at training time.
model = tf.keras.models.load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)

def preprocess_text(text):
    """Lowercase the input to match the training-time normalization."""
    return text.lower()

def predict_text(text):
    """Return the model's abuse probability for a single sentence."""
    sentence = preprocess_text(text)
    encoded_sentence = tokenizer.encode_plus(
        sentence,
        max_length=maxlen,
        padding="max_length",
        truncation=True,
    )["input_ids"]
    # Pad/truncate defensively in case the tokenizer output length differs.
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    prediction = model.predict(sentence_seq)[0][0]
    return prediction

while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")
```
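The hard-coded 0.5 cutoff in the loop above can be factored into a small helper so the decision threshold is tunable (the alternative threshold value below is illustrative only):

```python
def label_from_score(score, threshold=0.5):
    """Map the model's sigmoid output to a human-readable label."""
    return "abusive" if score >= threshold else "normal"

print(label_from_score(0.73))                 # abusive
# A stricter threshold trades recall for precision.
print(label_from_score(0.73, threshold=0.8))  # normal
```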