Tanat05/korcen-kogpt2

Korcen-kogpt2

This failure is the seed of innovation.

"Intelligent Filtering: Detecting Nuance and Context with Machine Learning."

Moving beyond keyword matching, this project introduces a machine learning-powered profanity filter. By analyzing context and linguistic patterns, it aims to identify and filter out offensive language more accurately and intelligently, even when subtle variations or creative spellings are used.

Korcen: the original, before the innovation.

Korcen-13M-EXAONE: another failure, but a better one.

Model Overview

Total samples: 2,000,000
Training samples: 1,800,000
Validation samples: 200,000
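
The counts above correspond to a 90/10 train/validation split. The README does not say how the split was made, so the following is only a minimal sketch: the helper names `split_counts` and `split_indices`, the shuffling step, and the seed are all assumptions for illustration.

```python
import random
from typing import List, Tuple


def split_counts(total: int, val_percent: int) -> Tuple[int, int]:
    """Return (train_count, val_count) using integer arithmetic."""
    val = total * val_percent // 100
    return total - val, val


def split_indices(n: int, val_percent: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 and cut them into train and validation parts.

    Shuffling with a fixed seed is an assumption; the original split
    procedure is not documented.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_n, _ = split_counts(n, val_percent)
    return indices[:train_n], indices[train_n:]


# The sizes listed above: 2,000,000 samples with 10% held out for validation.
train_n, val_n = split_counts(2_000_000, 10)
print(train_n, val_n)  # 1800000 200000
```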

Tokenizer: SKT-AI/KoGPT2

Verification

Example

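# Python 3.10, TensorFlow 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Maximum sequence length (in tokens) fed to the model.
maxlen = 1000

model_path = 'vdcnn_model.h5'
tokenizer_path = "tokenizer.pickle"

model = tf.keras.models.load_model(model_path)

# The pickled tokenizer exposes encode_plus, so it is presumably a Hugging Face
# tokenizer; the transformers package must then be importable to unpickle it.
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)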

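def preprocess_text(text):
    """Lowercase the input text."""
    return text.lower()

def predict_text(text):
    """Return the model's score for a single sentence."""
    sentence = preprocess_text(text)
    # Tokenize, then pad or truncate to maxlen token ids.
    encoded_sentence = tokenizer.encode_plus(
        sentence,
        max_length=maxlen,
        padding="max_length",
        truncation=True
    )['input_ids']

    # The ids should already be maxlen long, so pad_sequences mainly converts
    # them into a (1, maxlen) integer array for the model.
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    prediction = model.predict(sentence_seq)[0][0]
    return prediction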

# Interactive loop; press Ctrl+C to stop.
while True:
    text = input("Enter the sentence you want to test: ")
    result = predict_text(text)
    # Scores of 0.5 or higher are treated as abusive.
    if result >= 0.5:
        print("This sentence contains abusive language.")
    else:
        print("It's a normal sentence.")
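
The decision step in the example (a score of 0.5 or higher means abusive) can also be separated from the model so that it works on a batch of sentences without the interactive loop. The sketch below is illustrative only: `label_from_score`, `classify_batch`, and the stand-in scorer `fake_score` are hypothetical names, not part of this repository. In real use, `predict_text` from the example would be passed in place of `fake_score`.

```python
from typing import Callable, Iterable, List, Tuple

ABUSIVE = "abusive"
NORMAL = "normal"


def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Map a score to a label with the same >= rule as the example."""
    return ABUSIVE if score >= threshold else NORMAL


def classify_batch(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Score each text with score_fn and pair it with its label."""
    return [(t, label_from_score(float(score_fn(t)), threshold)) for t in texts]


def fake_score(text: str) -> float:
    """Stand-in scorer (an assumption) so the sketch runs without the model."""
    return 0.9 if "bad" in text.lower() else 0.1


for text, label in classify_batch(["a BAD example", "hello"], fake_score):
    print(f"{text!r}: {label}")
```

Keeping the threshold as a parameter makes it easy to trade precision against recall without touching the model code.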

About

Deep-learning-based Korean profanity and slang detection
