-`NgramSampleEvaluator` is an operator for evaluating text originality by analyzing the n-gram repetition rate in text. It helps detect content repetition and assess text diversity by calculating the proportion of unique n-grams.
+`NgramSampleEvaluator` is an operator for evaluating text redundancy and originality. It calculates the repetition ratio of n-grams to assess how diverse or repetitive a text is. It supports both **English (word-level)** and **Chinese (character-level)** modes.

-## __init__
+**Scoring Principle**: The score is computed as the ratio between the number of unique n-grams and the total number of n-grams. A higher score (closer to 1.0) indicates lower repetition and better originality.
+
+## `__init__`

 ```python
-def __init__(self, n=3)
+def __init__(self, ngrams=5, language: str = 'en')
 ```

+### Init Parameters
+
 | Parameter | Type | Default | Description |
 | :--- | :--- | :--- | :--- |
-| **n** | int | 3 | N-gram length (number of consecutive words). Default is trigrams (3-grams). |
+| **ngrams** | int | 5 | Length of n-grams. |
+| **language** | str | 'en' | Processing language. `'zh'` uses character-level splitting for Chinese; any other value uses whitespace tokenization for English. |
-| NgramScore | float | N-gram originality score (0-1, higher indicates less repetition and better originality) |
+| [input_key] | str | Original input text. |
+| NgramScore | float | Computed n-gram score (0–1, higher means less repetition and better originality). |

 ### 📋 Example Input
 ```json
-{"text": "The quick brown fox jumps over the lazy dog. The sun is shining brightly in the clear blue sky. Birds are singing melodiously in the tall green trees. Children are playing happily in the beautiful park. Flowers are blooming magnificently everywhere you look. Nature displays its wonder through colorful butterflies dancing among fragrant roses. People enjoy peaceful walks along winding pathways surrounded by lush vegetation."}
-{"text": "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
-{"text": "In contemporary discourse surrounding technological advancement, one must acknowledge the multifaceted ramifications of artificial intelligence implementation. The epistemological considerations necessitate comprehensive analysis of socioeconomic implications. Furthermore, the paradigmatic shift toward automation requires meticulous examination of ethical frameworks governing algorithmic decision-making processes. Subsequently, organizational infrastructures must accommodate transformative methodologies while simultaneously addressing unprecedented complexities inherent within technological ecosystems."}
+{"id":1,"type":"en_normal","text_en":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.","NgramScore_en":1.0}
+{"id":2,"type":"en_repeat_phrase","text_en":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat.","NgramScore_en":0.3}
+{"id":3,"type":"en_garbage","text_en":"test test test test test test test test test test test test test test test test test test","NgramScore_en":0.0714285714}
 ```

 ### 📊 Result Analysis

-**Sample 1 (Normal Descriptive Text):**
-- NgramScore: 1.0 (perfect score, all 3-grams are unique)
-- No repetition, high text quality and originality
-
-**Sample 2 (Highly Repetitive Text):**
-- NgramScore: 0.075 (very low score, severe repetition)
-- The sentence "The cat sat on the mat" repeats 14 times
-- Only a few unique 3-grams, very low text quality
-
-**Sample 3 (Complex Academic Text):**
-- NgramScore: 1.0 (perfect score, all 3-grams are unique)
-- Despite complex vocabulary, no repetitive patterns
-- Demonstrates high originality and diversity
+- **Chinese normal text (zh_normal)**: `NgramScore_zh = 1.0` (no repetition). The content is rich and diverse, all 5-grams are unique, and redundancy is minimal.
+- **Chinese repeated phrase (zh_repeat_phrase)**: `NgramScore_zh ≈ 0.67`. The sentence pattern "不要过拟合!" appears three times, so some n-grams repeat and the score is medium-low.
+- **Chinese garbage text (zh_garbage)**: `NgramScore_zh = 0.03125`. The text consists almost entirely of repeated "哈", so there are very few unique n-grams and redundancy is extremely high.
+- **English normal text (en_normal)**: `NgramScore_en = 1.0`. All 5-gram word sequences are unique, indicating rich information and no repetition.
+- **English repeated phrase (en_repeat_phrase)**: `NgramScore_en = 0.3`. The sentence "The cat sat on the mat." is repeated multiple times, so many n-grams repeat and the score drops significantly.
+- **English garbage text (en_garbage)**: `NgramScore_en ≈ 0.071`. The text is dominated by repeated `test` tokens, which marks it as highly template-like and low quality.

-**Application Value**: This operator is particularly effective at detecting low-quality generated text (such as model-generated repetitive content) and can be used for automated text quality filtering.
+Scores closer to 1.0 indicate lower repetition and higher originality, while scores closer to 0.0 indicate heavy repetition.
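
The scoring principle can be illustrated with a minimal sketch. This is not the operator's actual source: `ngram_score` is a hypothetical helper, and it assumes plain `str.split()` tokenization for English, `list(text)` for Chinese, and a score of 0.0 for texts shorter than one n-gram. Under those assumptions it reproduces the English sample scores shown above.

```python
def ngram_score(text: str, ngrams: int = 5, language: str = "en") -> float:
    """Return unique n-grams / total n-grams (hypothetical helper)."""
    # Assumption: 'zh' splits into characters, anything else on whitespace.
    tokens = list(text) if language == "zh" else text.split()
    total = len(tokens) - ngrams + 1
    if total <= 0:
        # Assumption: a text too short to form one n-gram scores 0.0.
        return 0.0
    unique = {tuple(tokens[i:i + ngrams]) for i in range(total)}
    return len(unique) / total


if __name__ == "__main__":
    print(ngram_score("The cat sat on the mat. " * 4))  # en_repeat_phrase text
    print(ngram_score("test " * 18))                    # en_garbage text
```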
-`NgramFilter` is an N-gram score-based text filtering operator. It evaluates text redundancy by calculating the proportion of repeated n-grams and filters out text with excessively high or low redundancy based on set score thresholds. Higher scores indicate lower repetition proportions and generally higher text quality.
+`NgramFilter` is an n-gram score-based text filtering operator. It measures text redundancy as the ratio of unique n-grams to total n-grams and keeps only text whose score falls within the configured threshold range. Higher scores indicate less repetition and generally higher text quality. It supports both **Chinese (character-level)** and **English (word-level)** modes.
+// English mode (language='en'), threshold [0.8, 1.0]
+{"id":1,"type":"en_normal","text":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.","NgramScore":1.0}
 ```

 ### 📊 Result Analysis

-**Sample 1 ("今天天气真不错,阳光明媚,万里无云,适合出门散步。")**:
-- Text length: 26 characters
-- Total 5-grams: 22
-- Unique 5-gram count: Approximately 20
-- N-gram score: 20 / 22 ≈ 0.91
-- Score range: [0.8, 1.0]
-- **Passes filter** (but not shown in output, possibly only showing passed samples)
-
-**Sample 2 ("好好好好好好好好好好好好好好好好好好好好好好好好好好")**:
-- Text length: 26 characters
-- Total 5-grams: 22
-- Unique 5-gram count: Approximately 1-2 (repeated "好好好好好")
-- N-gram score: 1 / 22 ≈ 0.045
-- Score range: [0.8, 1.0]
-- **Filtered out** (0.045 < 0.8, excessively high repetition rate)
-
-**Sample 3 ("The fascinating world of natural language processing...")**:
-- Text length: 95 characters
-- Total 5-grams: 91
-- Unique 5-gram count: 91 (no repetition)
-- N-gram score: 91 / 91 = 1.0
-- Score range: [0.8, 1.0]
-- **Passes filter** (1.0 within range, text has no redundancy)
+**Chinese samples (language='zh')**:
+
+- **Sample zh_normal**
+  - N-gram score: `NgramScore = 1.0`
+  - Meaning: rich content with no repeated n-grams and minimal redundancy.
+  - Threshold range: [0.8, 1.0]
+  - **Passes filter** (1.0 within threshold range)
+
+- **Sample zh_repeat_phrase**
+  - N-gram score: `NgramScore ≈ 0.6667`
+  - Meaning: contains a slogan pattern repeated three times, so many n-grams are repeated.
+  - Threshold range: [0.8, 1.0]
+  - **Filtered out** (0.6667 < 0.8)
+
+- **Sample zh_garbage**
+  - N-gram score: `NgramScore = 0.03125`
+  - Meaning: almost entirely made of repeated "哈", highly garbage/template-like text.
+  - Threshold range: [0.8, 1.0]
+  - **Filtered out** (severe repetition)
+
+**English samples (language='en')**:
+
+- **Sample en_normal**
+  - N-gram score: `NgramScore = 1.0`
+  - Meaning: high information density, all 5-gram combinations are unique.
+  - Threshold range: [0.8, 1.0]
+  - **Passes filter**
+
+- **Sample en_repeat_phrase**
+  - N-gram score: `NgramScore = 0.3`
+  - Meaning: the sentence "The cat sat on the mat." is repeated many times, leading to many repeated n-grams.
+> In summary, when we set the threshold `[min_score=0.8, max_score=1.0]`, **only high-quality, low-redundancy texts (such as zh_normal / en_normal) are retained**, while repetitive slogans, spammy or garbage-like texts are filtered out.
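
The threshold behaviour summarised above can be sketched as follows. This is an illustration only, not the `NgramFilter` implementation: `ngram_score` and `passes_filter` are hypothetical names, the parameter names `min_score`, `max_score`, `ngrams` and `language` are taken from this document, and the inclusive comparison at both bounds is an assumption.

```python
def ngram_score(text: str, ngrams: int = 5, language: str = "en") -> float:
    """Unique n-grams / total n-grams (hypothetical helper)."""
    tokens = list(text) if language == "zh" else text.split()
    total = len(tokens) - ngrams + 1
    if total <= 0:
        return 0.0  # Assumption: too-short texts score 0.0.
    return len({tuple(tokens[i:i + ngrams]) for i in range(total)}) / total


def passes_filter(text: str, min_score: float = 0.8, max_score: float = 1.0,
                  ngrams: int = 5, language: str = "en") -> bool:
    """Keep a text only if its score lies in [min_score, max_score]."""
    # Assumption: both bounds are inclusive.
    return min_score <= ngram_score(text, ngrams, language) <= max_score


samples = [
    "Natural language processing is a subfield of linguistics, "
    "computer science, and artificial intelligence.",
    "The cat sat on the mat. " * 4,
    "test " * 18,
]
kept = [s for s in samples if passes_filter(s)]

if __name__ == "__main__":
    print(len(kept))  # number of samples that survive the [0.8, 1.0] range
```

With the default range only the first sample is retained, which matches the pass/filter outcomes listed for `en_normal`, `en_repeat_phrase` and the repeated-`test` text.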

 **Calculation Formula**:
 ```
 N-gram Score = Unique n-gram count / Total n-gram count
 ```
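
As a worked check of the formula, consider two English texts under the assumption of whitespace tokenization and 5-grams, so that a text of $T$ tokens yields $T - 4$ n-grams. "The cat sat on the mat." repeated four times has 24 tokens and cycles with period 6, giving only 6 distinct 5-grams; a run of 18 `test` tokens has a single distinct 5-gram.

```latex
\text{score} = \frac{\#\,\text{unique } n\text{-grams}}{T - n + 1},
\qquad
\text{en\_repeat\_phrase: } \frac{6}{24 - 5 + 1} = \frac{6}{20} = 0.3,
\qquad
\text{repeated \texttt{test}: } \frac{1}{18 - 5 + 1} = \frac{1}{14} \approx 0.0714
```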

 **Score Meaning**:
-- **1.0**: Text has no repetition, highest quality
-- **0.8-0.99**: Text has minor repetition, good quality
-- **< 0.8**: Text has high repetition rate, poor quality