Commit 42239ec

fix: bug in ngram filter for Chinese (#147)
1 parent 2a6236b commit 42239ec

4 files changed: +225 / -121 lines changed

docs/en/notes/api/operators/general_text/eval/NgramSampleEvaluator.md

Lines changed: 52 additions & 33 deletions
@@ -6,29 +6,35 @@ permalink: /en/api/operators/general_text/eval/ngramsampleevaluator/
66

77
## 📘 Overview
88

9-
`NgramSampleEvaluator` is an operator for evaluating text originality by analyzing the n-gram repetition rate in text. It helps detect content repetition and assess text diversity by calculating the proportion of unique n-grams.
9+
`NgramSampleEvaluator` is an operator for evaluating text redundancy and originality. It calculates the repetition ratio of n-grams to assess how diverse or repetitive a text is. It supports both **English (word-level)** and **Chinese (character-level)** modes.
1010

11-
## __init__
11+
**Scoring Principle**: The score is computed as the ratio between the number of unique n-grams and the total number of n-grams. A higher score (closer to 1.0) indicates lower repetition and better originality.
12+
13+
## `__init__`
1214

1315
```python
14-
def __init__(self, n=3)
16+
def __init__(self, ngrams=5, language: str = 'en')
1517
```
1618

19+
### Init Parameters
20+
1721
| Parameter | Type | Default | Description |
1822
| :--- | :--- | :--- | :--- |
19-
| **n** | int | 3 | N-gram length (number of consecutive words). Default is trigrams (3-grams). |
23+
| **ngrams** | int | 5 | Length of n-grams. |
24+
| **language** | str | 'en' | Processing language; `'zh'` uses character-level splitting for Chinese, other values use whitespace tokenization for English. |
2025

21-
## run
26+
## `run`
2227

2328
```python
24-
def run(self, storage: DataFlowStorage, input_key: str)
29+
def run(self, storage: DataFlowStorage, input_key: str, output_key: str='NgramScore')
2530
```
2631

2732
#### Parameters
2833
| Name | Type | Default | Description |
2934
| :------------- | :---------------- | :---------------- | :----------------- |
3035
| **storage** | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
3136
| **input_key** | str | Required | Input column name specifying the column containing the text to be evaluated. |
37+
| **output_key** | str | 'NgramScore' | Output column name for the generated n-gram score. |
3238

3339
## 🧠 Example Usage
3440
```python
@@ -43,13 +49,25 @@ class NgramSampleEvaluatorTest():
4349
file_name_prefix="dataflow_cache_step",
4450
cache_type="jsonl",
4551
)
46-
47-
self.evaluator = NgramSampleEvaluator(n=3)
48-
52+
53+
# English mode (default)
54+
self.evaluator_en = NgramSampleEvaluator(ngrams=5, language='en')
55+
# Chinese mode
56+
self.evaluator_zh = NgramSampleEvaluator(ngrams=5, language='zh')
57+
4958
def forward(self):
50-
self.evaluator.run(
59+
# English samples
60+
self.evaluator_en.run(
5161
storage=self.storage.step(),
52-
input_key='text'
62+
input_key='text_en',
63+
output_key='NgramScore_en'
64+
)
65+
66+
# Chinese samples
67+
self.evaluator_zh.run(
68+
storage=self.storage.step(),
69+
input_key='text_zh',
70+
output_key='NgramScore_zh'
5371
)
5472

5573
if __name__ == "__main__":
@@ -60,37 +78,38 @@ if __name__ == "__main__":
6078
#### 🧾 Default Output Format
6179
| Field | Type | Description |
6280
| :--- | :---- | :---------- |
63-
| text | str | The original input text |
64-
| NgramScore | float | N-gram originality score (0-1, higher indicates less repetition and better originality) |
81+
| [input_key] | str | Original input text. |
82+
| NgramScore | float | Computed n-gram score (0-1, higher means less repetition and better originality). |
6583

6684
### 📋 Example Input
6785
```json
68-
{"text": "The quick brown fox jumps over the lazy dog. The sun is shining brightly in the clear blue sky. Birds are singing melodiously in the tall green trees. Children are playing happily in the beautiful park. Flowers are blooming magnificently everywhere you look. Nature displays its wonder through colorful butterflies dancing among fragrant roses. People enjoy peaceful walks along winding pathways surrounded by lush vegetation."}
69-
{"text": "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
70-
{"text": "In contemporary discourse surrounding technological advancement, one must acknowledge the multifaceted ramifications of artificial intelligence implementation. The epistemological considerations necessitate comprehensive analysis of socioeconomic implications. Furthermore, the paradigmatic shift toward automation requires meticulous examination of ethical frameworks governing algorithmic decision-making processes. Subsequently, organizational infrastructures must accommodate transformative methodologies while simultaneously addressing unprecedented complexities inherent within technological ecosystems."}
86+
{"id":1,"type":"zh_normal","text_zh":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。"}
87+
{"id":2,"type":"zh_repeat_phrase","text_zh":"重要的事情说三遍:不要过拟合!不要过拟合!不要过拟合!这就叫重要的事情说三遍。"}
88+
{"id":3,"type":"zh_garbage","text_zh":"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈"}
89+
90+
{"id":1,"type":"en_normal","text_en":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence."}
91+
{"id":2,"type":"en_repeat_phrase","text_en":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
92+
{"id":3,"type":"en_garbage","text_en":"test test test test test test test test test test test test test test test test test test"}
7193
```
7294

7395
### 📤 Example Output
7496
```json
75-
{"text": "The quick brown fox...", "NgramScore": 1.0}
76-
{"text": "The cat sat on the mat...", "NgramScore": 0.075}
77-
{"text": "In contemporary discourse...", "NgramScore": 1.0}
97+
{"id":1,"type":"zh_normal","text_zh":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。","NgramScore_zh":1.0}
98+
{"id":2,"type":"zh_repeat_phrase","text_zh":"重要的事情说三遍:不要过拟合!不要过拟合!不要过拟合!这就叫重要的事情说三遍。","NgramScore_zh":0.6666666667}
99+
{"id":3,"type":"zh_garbage","text_zh":"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈","NgramScore_zh":0.03125}
100+
101+
{"id":1,"type":"en_normal","text_en":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.","NgramScore_en":1.0}
102+
{"id":2,"type":"en_repeat_phrase","text_en":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat.","NgramScore_en":0.3}
103+
{"id":3,"type":"en_garbage","text_en":"test test test test test test test test test test test test test test test test test test","NgramScore_en":0.0714285714}
78104
```
79105

80106
### 📊 Result Analysis
81107

82-
**Sample 1 (Normal Descriptive Text):**
83-
- NgramScore: 1.0 (perfect score, all 3-grams are unique)
84-
- No repetition, high text quality and originality
85-
86-
**Sample 2 (Highly Repetitive Text):**
87-
- NgramScore: 0.075 (very low score, severe repetition)
88-
- The sentence "The cat sat on the mat" repeats 14 times
89-
- Only a few unique 3-grams, very low text quality
90-
91-
**Sample 3 (Complex Academic Text):**
92-
- NgramScore: 1.0 (perfect score, all 3-grams are unique)
93-
- Despite complex vocabulary, no repetitive patterns
94-
- Demonstrates high originality and diversity
108+
- **Chinese normal text (zh_normal)**: `NgramScore_zh = 1.0`. Every 5-gram is unique, so the text shows no repetition and minimal redundancy.
109+
- **Chinese repeated phrase (zh_repeat_phrase)**: `NgramScore_zh ≈ 0.67`. The phrase "不要过拟合!" appears three times, so some n-grams are repeated and the score is medium-low.
110+
- **Chinese garbage text (zh_garbage)**: `NgramScore_zh = 0.03125`, almost all n-grams are composed of repeated "哈", very few unique n-grams, extremely high redundancy.
111+
- **English normal text (en_normal)**: `NgramScore_en = 1.0`. Every 5-gram word sequence is unique, indicating no repetition.
112+
- **English repeated phrase (en_repeat_phrase)**: `NgramScore_en = 0.3`. The sentence "The cat sat on the mat." is repeated four times, causing many repeated n-grams and a significantly lower score.
113+
- **English garbage text (en_garbage)**: `NgramScore_en ≈ 0.071`, dominated by repeated `test` tokens, highly template-like and low-quality text.
95114

96-
**Application Value**: This operator is particularly effective at detecting low-quality generated text (such as model-generated repetitive content) and can be used for automated text quality filtering.
115+
Scores closer to 1.0 indicate lower repetition and higher originality, while scores closer to 0.0 indicate heavy repetition.
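
The scoring principle described in this file can be illustrated with a minimal sketch. This is not the operator's actual implementation: `tokenize` and `ngram_score` are hypothetical helper names, and the handling of texts with no complete n-gram (returning `0.0`) is an assumption. Under these assumptions, the sketch gives values consistent with the documented scores for the repeated-phrase and garbage samples.

```python
from typing import List


def tokenize(text: str, language: str = "en") -> List[str]:
    """Split text into units: characters for 'zh', whitespace tokens otherwise."""
    if language == "zh":
        return list(text)
    return text.split()


def ngram_score(text: str, ngrams: int = 5, language: str = "en") -> float:
    """Return unique n-grams / total n-grams."""
    units = tokenize(text, language)
    windows = [tuple(units[i:i + ngrams]) for i in range(len(units) - ngrams + 1)]
    if not windows:
        # Assumption: the real operator may treat too-short text differently.
        return 0.0
    return len(set(windows)) / len(windows)


# "The cat sat on the mat." four times: 24 tokens, 20 windows, 6 unique.
print(ngram_score("The cat sat on the mat. " * 4))      # 0.3
# 18 identical tokens: 14 windows, 1 unique.
print(ngram_score(" ".join(["test"] * 18)))             # 0.07142857142857142
# 36 identical characters in character-level mode: 32 windows, 1 unique.
print(ngram_score("哈" * 36, language="zh"))             # 0.03125
```

Because the window count is `len(units) - ngrams + 1`, the same text scored in the wrong mode (for example, Chinese text split on whitespace) would usually yield zero or one window, which is the kind of mismatch the `language` parameter is meant to avoid.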

docs/en/notes/api/operators/general_text/filter/NgramFilter.md

Lines changed: 68 additions & 37 deletions
@@ -6,12 +6,12 @@ permalink: /en/api/operators/general_text/filter/ngramfilter/
66

77
## 📘 Overview
88

9-
`NgramFilter` is an N-gram score-based text filtering operator. It evaluates text redundancy by calculating the proportion of repeated n-grams and filters out text with excessively high or low redundancy based on set score thresholds. Higher scores indicate lower repetition proportions and generally higher text quality.
9+
`NgramFilter` is an N-gram score-based text filtering operator. It evaluates text redundancy by calculating the proportion of repeated n-grams and filters out text with excessively high or low redundancy based on set score thresholds. Higher scores indicate lower repetition proportions and generally higher text quality. It supports both **Chinese (character-level)** and **English (word-level)** modes.
1010

1111
## `__init__` Function
1212

1313
```python
14-
__init__(self, min_score=0.8, max_score=1, ngrams=5)
14+
__init__(self, min_score=0.8, max_score=1, ngrams=5, language: str = 'en')
1515
```
1616

1717
### Init Parameters
@@ -21,6 +21,7 @@ __init__(self, min_score=0.8, max_score=1, ngrams=5)
2121
| **min_score** | float | 0.8 | Minimum n-gram score threshold; text below this value will be filtered. |
2222
| **max_score** | int | 1 | Maximum n-gram score threshold; text above this value will be filtered. |
2323
| **ngrams** | int | 5 | N-gram size for calculating repetition rate. |
24+
| **language** | str | 'en' | Processing language; `'zh'` uses character-level splitting for Chinese, other values use word-level tokenization for English. |
2425

2526
## `run` Function
2627

@@ -50,11 +51,13 @@ class NgramFilterTest():
5051
file_name_prefix="dataflow_cache_step",
5152
cache_type="jsonl",
5253
)
53-
54+
55+
# Example: Chinese filtering
5456
self.filter = NgramFilter(
5557
min_score=0.8,
5658
max_score=1.0,
57-
ngrams=5
59+
ngrams=5,
60+
language='zh'
5861
)
5962

6063
def forward(self):
@@ -80,52 +83,80 @@ The operator adds an `output_key` field (default `NgramScore`) to the data and r
8083
### 📋 Example Input
8184

8285
```json
83-
{"text": "今天天气真不错,阳光明媚,万里无云,适合出门散步。"}
84-
{"text": "好好好好好好好好好好好好好好好好好好好好好好好好好好"}
85-
{"text": "The fascinating world of natural language processing encompasses various sophisticated algorithms."}
86+
{"id":1,"type":"zh_normal","text":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。"}
87+
{"id":2,"type":"zh_repeat_phrase","text":"重要的事情说三遍:不要过拟合!不要过拟合!不要过拟合!这就叫重要的事情说三遍。"}
88+
{"id":3,"type":"zh_garbage","text":"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈"}
89+
90+
{"id":1,"type":"en_normal","text":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence."}
91+
{"id":2,"type":"en_repeat_phrase","text":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
92+
{"id":3,"type":"en_garbage","text":"test test test test test test test test test test test test test test test test test test"}
8693
```
8794

8895
### 📤 Example Output
8996

9097
```json
91-
{"text": "The fascinating world of natural language processing encompasses various sophisticated algorithms.", "NgramScore": 1.0}
98+
// Assume we run the filter separately for Chinese and English.
99+
100+
// Chinese mode (language='zh'), threshold [0.8, 1.0]
101+
{"id":1,"type":"zh_normal","text":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。","NgramScore":1.0}
102+
103+
// English mode (language='en'), threshold [0.8, 1.0]
104+
{"id":1,"type":"en_normal","text":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.","NgramScore":1.0}
92105
```
93106

94107
### 📊 Result Analysis
95108

96-
**Sample 1 ("今天天气真不错,阳光明媚,万里无云,适合出门散步。")**:
97-
- Text length: 26 characters
98-
- Total 5-grams: 22
99-
- Unique 5-gram count: Approximately 20
100-
- N-gram score: 20 / 22 ≈ 0.91
101-
- Score range: [0.8, 1.0]
102-
- **Passes filter** (but not shown in output, possibly only showing passed samples)
103-
104-
**Sample 2 ("好好好好好好好好好好好好好好好好好好好好好好好好好好")**:
105-
- Text length: 26 characters
106-
- Total 5-grams: 22
107-
- Unique 5-gram count: Approximately 1-2 (repeated "好好好好好")
108-
- N-gram score: 1 / 22 ≈ 0.045
109-
- Score range: [0.8, 1.0]
110-
- **Filtered out** (0.045 < 0.8, excessively high repetition rate)
111-
112-
**Sample 3 ("The fascinating world of natural language processing...")**:
113-
- Text length: 95 characters
114-
- Total 5-grams: 91
115-
- Unique 5-gram count: 91 (no repetition)
116-
- N-gram score: 91 / 91 = 1.0
117-
- Score range: [0.8, 1.0]
118-
- **Passes filter** (1.0 within range, text has no redundancy)
109+
**Chinese samples (language='zh')**:
110+
111+
- **Sample zh_normal**
112+
- N-gram score: `NgramScore = 1.0`
113+
- Meaning: rich content with very little repetition and minimal redundancy.
114+
- Threshold range: [0.8, 1.0]
115+
- **Passes filter** (1.0 within threshold range)
116+
117+
- **Sample zh_repeat_phrase**
118+
- N-gram score: `NgramScore ≈ 0.6667`
119+
- Meaning: contains a repeated "three times" slogan pattern, so many n-grams are repeated.
120+
- Threshold range: [0.8, 1.0]
121+
- **Filtered out** (0.6667 < 0.8)
122+
123+
- **Sample zh_garbage**
124+
- N-gram score: `NgramScore = 0.03125`
125+
- Meaning: almost entirely made of repeated "哈", highly garbage/template-like text.
126+
- Threshold range: [0.8, 1.0]
127+
- **Filtered out** (severe repetition)
128+
129+
**English samples (language='en')**:
130+
131+
- **Sample en_normal**
132+
- N-gram score: `NgramScore = 1.0`
133+
- Meaning: high information density, all 5-gram combinations are unique.
134+
- Threshold range: [0.8, 1.0]
135+
- **Passes filter**
136+
137+
- **Sample en_repeat_phrase**
138+
- N-gram score: `NgramScore = 0.3`
139+
- Meaning: the sentence "The cat sat on the mat." is repeated four times, leading to many repeated n-grams.
140+
- Threshold range: [0.8, 1.0]
141+
- **Filtered out**
142+
143+
- **Sample en_garbage**
144+
- N-gram score: `NgramScore ≈ 0.0714`
145+
- Meaning: dominated by repeated `test` tokens, typical low-quality garbage text.
146+
- Threshold range: [0.8, 1.0]
147+
- **Filtered out**
148+
149+
> In summary, when we set the threshold `[min_score=0.8, max_score=1.0]`, **only high-quality, low-redundancy texts (such as zh_normal / en_normal) are retained**, while repetitive slogans, spammy or garbage-like texts are filtered out.
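
The threshold behavior summarized above can be sketched on precomputed scores. This is only an illustration: `filter_by_score` is a hypothetical helper, not part of `NgramFilter`, and treating both bounds as inclusive is an assumption based on the parameter descriptions ("below" and "above" the thresholds are filtered). The scores are copied from the documented example output.

```python
from typing import Dict, List


def filter_by_score(rows: List[Dict], min_score: float = 0.8,
                    max_score: float = 1.0, key: str = "NgramScore") -> List[Dict]:
    """Keep rows whose score lies in the closed interval [min_score, max_score]."""
    return [row for row in rows if min_score <= row[key] <= max_score]


# Scores taken from the example output in this document.
rows = [
    {"type": "zh_normal", "NgramScore": 1.0},
    {"type": "zh_repeat_phrase", "NgramScore": 0.6666666667},
    {"type": "zh_garbage", "NgramScore": 0.03125},
    {"type": "en_normal", "NgramScore": 1.0},
    {"type": "en_repeat_phrase", "NgramScore": 0.3},
    {"type": "en_garbage", "NgramScore": 0.0714285714},
]

kept = [row["type"] for row in filter_by_score(rows)]
print(kept)  # ['zh_normal', 'en_normal']
```

Lowering `min_score` to, say, 0.6 would also retain `zh_repeat_phrase`, which is one way to tolerate intentional repetition.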
119150
120151
**Calculation Formula**:
121152
```
122153
N-gram Score = Unique n-gram count / Total n-gram count
123154
```
124155

125156
**Score Meaning**:
126-
- **1.0**: Text has no repetition, highest quality
127-
- **0.8-0.99**: Text has minor repetition, good quality
128-
- **< 0.8**: Text has high repetition rate, poor quality
157+
- **1.0**: No repetition, highest quality
158+
- **0.8-0.99**: Minor repetition, good quality
159+
- **< 0.8**: High repetition rate, low quality
129160

130161
**Use Cases**:
131162
- Filter low-quality, highly repetitive text
@@ -134,8 +165,8 @@ N-gram Score = Unique n-gram count / Total n-gram count
134165
- Filter template text from scraped data
135166

136167
**Notes**:
137-
- Uses character-level n-gram (default n=5), suitable for mixed Chinese-English
168+
- Uses character-level n-gram for Chinese and word-level n-gram for English (default n=5)
138169
- Higher scores indicate better text diversity
139-
- Short text (< n characters) may have abnormal scores
170+
- Short text (< n characters/tokens) may have unstable scores
140171
- Different languages and scenarios may require adjusted `min_score` thresholds
141-
- Special text types like poetry and lyrics may score low due to repetition
172+
- Special text types such as poetry and lyrics may score low due to intentional repetition
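
The note on short text follows from how few n-grams such text contains. A small sketch, assuming the usual sliding-window count (`count_windows` is a hypothetical helper, not part of the operator):

```python
def count_windows(num_units: int, ngrams: int = 5) -> int:
    """Number of n-grams in a sequence of num_units characters or tokens."""
    return max(num_units - ngrams + 1, 0)


print(count_windows(3))  # 0: shorter than n, so the ratio is undefined
print(count_windows(5))  # 1: a single n-gram, so the score is trivially 1.0
print(count_windows(7))  # 3: one duplicated n-gram changes the score by 1/3
```

With so few windows, a single repeated n-gram moves the score by a large step, so a fixed `min_score` such as 0.8 may behave erratically on very short inputs.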
