@@ -6,29 +6,35 @@ permalink: /en/api/operators/general_text/eval/ngramsampleevaluator/

## 📘 Overview

`NgramSampleEvaluator` is an operator for evaluating text originality by analyzing the n-gram repetition rate in text. It helps detect content repetition and assess text diversity by calculating the proportion of unique n-grams.
`NgramSampleEvaluator` is an operator for evaluating text redundancy and originality. It calculates the repetition ratio of n-grams to assess how diverse or repetitive a text is. It supports both **English (word-level)** and **Chinese (character-level)** modes.

## __init__
**Scoring Principle**: The score is computed as the ratio between the number of unique n-grams and the total number of n-grams. A higher score (closer to 1.0) indicates lower repetition and better originality.
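
The scoring principle above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual implementation: the function name `ngram_score` is hypothetical, and the tokenization (whitespace split for English, one token per non-whitespace character for `'zh'`) is an assumption based on the parameter descriptions. The real operator may preprocess text differently (for example, in how it treats punctuation), so this sketch is not guaranteed to reproduce every documented score.

```python
def ngram_score(text: str, ngrams: int = 5, language: str = "en") -> float:
    """Ratio of unique n-grams to total n-grams (1.0 means no repetition)."""
    # Assumption: 'zh' works on characters, any other value on whitespace tokens.
    if language == "zh":
        tokens = [ch for ch in text if not ch.isspace()]
    else:
        tokens = text.split()

    # Slide a window of length `ngrams` over the token sequence.
    grams = [tuple(tokens[i:i + ngrams]) for i in range(len(tokens) - ngrams + 1)]
    if not grams:
        # Fewer than `ngrams` tokens: no n-gram exists. Returning 0.0 is a
        # choice made for this sketch, not documented operator behavior.
        return 0.0
    return len(set(grams)) / len(grams)


print(ngram_score("The cat sat on the mat. " * 4))   # 0.3
print(round(ngram_score("test " * 18), 4))           # 0.0714
print(ngram_score("哈" * 36, language="zh"))          # 0.03125
```

The three calls correspond to 6 unique 5-grams out of 20, 1 out of 14, and 1 out of 32 respectively.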

## `__init__`

```python
def __init__(self, n=3)
def __init__(self, ngrams=5, language: str = 'en')
```

### Init Parameters

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **n** | int | 3 | N-gram length (number of consecutive words). Default is trigrams (3-grams). |
| **ngrams** | int | 5 | Length of n-grams. |
| **language** | str | 'en' | Processing language; `'zh'` uses character-level splitting for Chinese, other values use whitespace tokenization for English. |

## run
## `run`

```python
def run(self, storage: DataFlowStorage, input_key: str)
def run(self, storage: DataFlowStorage, input_key: str, output_key: str='NgramScore')
```

#### Parameters
| Name | Type | Default | Description |
| :------------- | :---------------- | :---------------- | :----------------- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| **input_key** | str | Required | Input column name specifying the column containing the text to be evaluated. |
| **output_key** | str | 'NgramScore' | Output column name for the generated n-gram score. |

## 🧠 Example Usage
```python
@@ -43,13 +49,25 @@ class NgramSampleEvaluatorTest():
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )

        self.evaluator = NgramSampleEvaluator(n=3)


        # English mode (default)
        self.evaluator_en = NgramSampleEvaluator(ngrams=5, language='en')
        # Chinese mode
        self.evaluator_zh = NgramSampleEvaluator(ngrams=5, language='zh')

    def forward(self):
        self.evaluator.run(
        # English samples
        self.evaluator_en.run(
            storage=self.storage.step(),
            input_key='text'
            input_key='text_en',
            output_key='NgramScore_en'
        )

        # Chinese samples
        self.evaluator_zh.run(
            storage=self.storage.step(),
            input_key='text_zh',
            output_key='NgramScore_zh'
        )

if __name__ == "__main__":
@@ -60,37 +78,38 @@ if __name__ == "__main__":
#### 🧾 Default Output Format
| Field | Type | Description |
| :--- | :---- | :---------- |
| text | str | The original input text |
| NgramScore | float | N-gram originality score (0-1, higher indicates less repetition and better originality) |
| [input_key] | str | Original input text. |
| NgramScore | float | Computed n-gram score (0-1, higher means less repetition and better originality). |

### 📋 Example Input
```json
{"text": "The quick brown fox jumps over the lazy dog. The sun is shining brightly in the clear blue sky. Birds are singing melodiously in the tall green trees. Children are playing happily in the beautiful park. Flowers are blooming magnificently everywhere you look. Nature displays its wonder through colorful butterflies dancing among fragrant roses. People enjoy peaceful walks along winding pathways surrounded by lush vegetation."}
{"text": "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
{"text": "In contemporary discourse surrounding technological advancement, one must acknowledge the multifaceted ramifications of artificial intelligence implementation. The epistemological considerations necessitate comprehensive analysis of socioeconomic implications. Furthermore, the paradigmatic shift toward automation requires meticulous examination of ethical frameworks governing algorithmic decision-making processes. Subsequently, organizational infrastructures must accommodate transformative methodologies while simultaneously addressing unprecedented complexities inherent within technological ecosystems."}
{"id":1,"type":"zh_normal","text_zh":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。"}
{"id":2,"type":"zh_repeat_phrase","text_zh":"重要的事情说三遍:不要过拟合!不要过拟合!不要过拟合!这就叫重要的事情说三遍。"}
{"id":3,"type":"zh_garbage","text_zh":"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈"}

{"id":1,"type":"en_normal","text_en":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence."}
{"id":2,"type":"en_repeat_phrase","text_en":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
{"id":3,"type":"en_garbage","text_en":"test test test test test test test test test test test test test test test test test test"}
```

### 📤 Example Output
```json
{"text": "The quick brown fox...", "NgramScore": 1.0}
{"text": "The cat sat on the mat...", "NgramScore": 0.075}
{"text": "In contemporary discourse...", "NgramScore": 1.0}
{"id":1,"type":"zh_normal","text_zh":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。","NgramScore_zh":1.0}
{"id":2,"type":"zh_repeat_phrase","text_zh":"重要的事情说三遍:不要过拟合!不要过拟合!不要过拟合!这就叫重要的事情说三遍。","NgramScore_zh":0.6666666667}
{"id":3,"type":"zh_garbage","text_zh":"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈","NgramScore_zh":0.03125}

{"id":1,"type":"en_normal","text_en":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.","NgramScore_en":1.0}
{"id":2,"type":"en_repeat_phrase","text_en":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat.","NgramScore_en":0.3}
{"id":3,"type":"en_garbage","text_en":"test test test test test test test test test test test test test test test test test test","NgramScore_en":0.0714285714}
```

### 📊 Result Analysis

**Sample 1 (Normal Descriptive Text):**
- NgramScore: 1.0 (perfect score, all 3-grams are unique)
- No repetition, high text quality and originality

**Sample 2 (Highly Repetitive Text):**
- NgramScore: 0.075 (very low score, severe repetition)
- The sentence "The cat sat on the mat" repeats 14 times
- Only a few unique 3-grams, very low text quality

**Sample 3 (Complex Academic Text):**
- NgramScore: 1.0 (perfect score, all 3-grams are unique)
- Despite complex vocabulary, no repetitive patterns
- Demonstrates high originality and diversity
- **Chinese normal text (zh_normal)**: `NgramScore_zh = 1.0`. Every 5-gram is unique, so the text shows no measurable repetition.
- **Chinese repeated phrase (zh_repeat_phrase)**: `NgramScore_zh ≈ 0.67`. The phrase "不要过拟合!" appears three times and "重要的事情说三遍" appears twice, so about a third of the 5-grams are duplicates and the score is moderate.
- **Chinese garbage text (zh_garbage)**: `NgramScore_zh = 0.03125`. The text is the single character "哈" repeated throughout, so there is only one unique 5-gram (1/32) and redundancy is extreme.
- **English normal text (en_normal)**: `NgramScore_en = 1.0`. Every 5-gram word sequence is unique, indicating no repetition.
- **English repeated phrase (en_repeat_phrase)**: `NgramScore_en = 0.3`. The sentence "The cat sat on the mat." is repeated four times, leaving only 6 unique 5-grams out of 20.
- **English garbage text (en_garbage)**: `NgramScore_en ≈ 0.071`. The text is the token `test` repeated 18 times, giving 1 unique 5-gram out of 14, which is typical of low-quality, template-like text.

**Application Value**: This operator is particularly effective at detecting low-quality generated text (such as model-generated repetitive content) and can be used for automated text quality filtering.
Scores closer to 1.0 indicate lower repetition and higher originality, while scores closer to 0.0 indicate heavy repetition.
105 changes: 68 additions & 37 deletions docs/en/notes/api/operators/general_text/filter/NgramFilter.md
@@ -6,12 +6,12 @@ permalink: /en/api/operators/general_text/filter/ngramfilter/

## 📘 Overview

`NgramFilter` is an N-gram score-based text filtering operator. It evaluates text redundancy by calculating the proportion of repeated n-grams and filters out text with excessively high or low redundancy based on set score thresholds. Higher scores indicate lower repetition proportions and generally higher text quality.
`NgramFilter` is an N-gram score-based text filtering operator. It evaluates text redundancy by calculating the proportion of repeated n-grams and filters out text with excessively high or low redundancy based on set score thresholds. Higher scores indicate lower repetition proportions and generally higher text quality. It supports both **Chinese (character-level)** and **English (word-level)** modes.

## `__init__` Function

```python
__init__(self, min_score=0.8, max_score=1, ngrams=5)
__init__(self, min_score=0.8, max_score=1, ngrams=5, language: str = 'en')
```

### Init Parameters
@@ -21,6 +21,7 @@ __init__(self, min_score=0.8, max_score=1, ngrams=5)
| **min_score** | float | 0.8 | Minimum n-gram score threshold; text below this value will be filtered. |
| **max_score** | int | 1 | Maximum n-gram score threshold; text above this value will be filtered. |
| **ngrams** | int | 5 | N-gram size for calculating repetition rate. |
| **language** | str | 'en' | Processing language; `'zh'` uses character-level splitting for Chinese, other values use word-level tokenization for English. |

## `run` Function

@@ -50,11 +51,13 @@ class NgramFilterTest():
            file_name_prefix="dataflow_cache_step",
            cache_type="jsonl",
        )


        # Example: Chinese filtering
        self.filter = NgramFilter(
            min_score=0.8,
            max_score=1.0,
            ngrams=5
            ngrams=5,
            language='zh'
        )

    def forward(self):
@@ -80,52 +83,80 @@ The operator adds an `output_key` field (default `NgramScore`) to the data and r
### 📋 Example Input

```json
{"text": "今天天气真不错,阳光明媚,万里无云,适合出门散步。"}
{"text": "好好好好好好好好好好好好好好好好好好好好好好好好好好"}
{"text": "The fascinating world of natural language processing encompasses various sophisticated algorithms."}
{"id":1,"type":"zh_normal","text":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。"}
{"id":2,"type":"zh_repeat_phrase","text":"重要的事情说三遍:不要过拟合!不要过拟合!不要过拟合!这就叫重要的事情说三遍。"}
{"id":3,"type":"zh_garbage","text":"哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈"}

{"id":1,"type":"en_normal","text":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence."}
{"id":2,"type":"en_repeat_phrase","text":"The cat sat on the mat. The cat sat on the mat. The cat sat on the mat. The cat sat on the mat."}
{"id":3,"type":"en_garbage","text":"test test test test test test test test test test test test test test test test test test"}
```

### 📤 Example Output

```json
{"text": "The fascinating world of natural language processing encompasses various sophisticated algorithms.", "NgramScore": 1.0}
// Assume we run the filter separately for Chinese and English.

// Chinese mode (language='zh'), threshold [0.8, 1.0]
{"id":1,"type":"zh_normal","text":"人工智能在大模型领域的应用已经非常广泛,从文本生成到逻辑推理都有显著进步,未来可期。","NgramScore":1.0}

// English mode (language='en'), threshold [0.8, 1.0]
{"id":1,"type":"en_normal","text":"Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.","NgramScore":1.0}
```

### 📊 Result Analysis

**Sample 1 ("今天天气真不错,阳光明媚,万里无云,适合出门散步。")**:
- Text length: 26 characters
- Total 5-grams: 22
- Unique 5-gram count: Approximately 20
- N-gram score: 20 / 22 ≈ 0.91
- Score range: [0.8, 1.0]
- **Passes filter** (but not shown in output, possibly only showing passed samples)

**Sample 2 ("好好好好好好好好好好好好好好好好好好好好好好好好好好")**:
- Text length: 26 characters
- Total 5-grams: 22
- Unique 5-gram count: Approximately 1-2 (repeated "好好好好好")
- N-gram score: 1 / 22 ≈ 0.045
- Score range: [0.8, 1.0]
- **Filtered out** (0.045 < 0.8, excessively high repetition rate)

**Sample 3 ("The fascinating world of natural language processing...")**:
- Text length: 95 characters
- Total 5-grams: 91
- Unique 5-gram count: 91 (no repetition)
- N-gram score: 91 / 91 = 1.0
- Score range: [0.8, 1.0]
- **Passes filter** (1.0 within range, text has no redundancy)
**Chinese samples (language='zh')**:

- **Sample zh_normal**
  - N-gram score: `NgramScore = 1.0`
  - Meaning: every 5-gram is unique, so the text has no measurable redundancy.
  - Threshold range: [0.8, 1.0]
  - **Passes filter** (1.0 is within the threshold range)

- **Sample zh_repeat_phrase**
  - N-gram score: `NgramScore ≈ 0.6667`
  - Meaning: the slogan "不要过拟合!" is repeated three times, so about a third of the 5-grams are duplicates.
  - Threshold range: [0.8, 1.0]
  - **Filtered out** (0.6667 < 0.8)

- **Sample zh_garbage**
  - N-gram score: `NgramScore = 0.03125`
  - Meaning: the text consists entirely of the repeated character "哈", which is typical garbage or template-like text.
  - Threshold range: [0.8, 1.0]
  - **Filtered out** (0.03125 < 0.8, severe repetition)

**English samples (language='en')**:

- **Sample en_normal**
  - N-gram score: `NgramScore = 1.0`
  - Meaning: every 5-gram word sequence is unique.
  - Threshold range: [0.8, 1.0]
  - **Passes filter** (1.0 is within the threshold range)

- **Sample en_repeat_phrase**
  - N-gram score: `NgramScore = 0.3`
  - Meaning: the sentence "The cat sat on the mat." is repeated four times, so most 5-grams are duplicates.
  - Threshold range: [0.8, 1.0]
  - **Filtered out** (0.3 < 0.8)

- **Sample en_garbage**
  - N-gram score: `NgramScore ≈ 0.0714`
  - Meaning: the text is the token `test` repeated 18 times, which is typical low-quality garbage text.
  - Threshold range: [0.8, 1.0]
  - **Filtered out** (0.0714 < 0.8)

> In summary, when we set the threshold `[min_score=0.8, max_score=1.0]`, **only high-quality, low-redundancy texts (such as zh_normal / en_normal) are retained**, while repetitive slogans, spammy or garbage-like texts are filtered out.
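
The threshold decision summarized above can be illustrated with a short sketch. The helper name `passes_filter` is hypothetical, and the inclusive bounds are an assumption inferred from the parameter descriptions (text *below* `min_score` or *above* `max_score` is filtered). The scores are copied from the documented example output rather than recomputed.

```python
def passes_filter(score: float, min_score: float = 0.8, max_score: float = 1.0) -> bool:
    # Assumption: both bounds are inclusive, so a score of exactly 1.0 is kept.
    return min_score <= score <= max_score


# Scores taken from the example output in this document.
scores = {
    "zh_normal": 1.0,
    "zh_repeat_phrase": 0.6667,
    "zh_garbage": 0.03125,
    "en_normal": 1.0,
    "en_repeat_phrase": 0.3,
    "en_garbage": 0.0714,
}

kept = [name for name, score in scores.items() if passes_filter(score)]
print(kept)  # ['zh_normal', 'en_normal']
```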

**Calculation Formula**:
```
N-gram Score = Unique n-gram count / Total n-gram count
```
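
As a worked illustration of this formula (the notation here is introduced for the example and is not part of the original docs): a text with $N$ tokens contains $N - n + 1$ n-grams of length $n$, so

$$
\text{score} = \frac{\#\{\text{distinct } n\text{-grams}\}}{N - n + 1}.
$$

For the `en_garbage` sample, $N = 18$ and $n = 5$ give $18 - 5 + 1 = 14$ n-grams, all identical, so

$$
\text{score} = \frac{1}{14} \approx 0.0714 .
$$

For the `en_repeat_phrase` sample, $N = 24$ gives $20$ n-grams. The token sequence repeats with period 6, so a 5-gram is determined by its start position modulo 6, leaving 6 distinct n-grams:

$$
\text{score} = \frac{6}{20} = 0.3 .
$$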

**Score Meaning**:
- **1.0**: Text has no repetition, highest quality
- **0.8-0.99**: Text has minor repetition, good quality
- **< 0.8**: Text has high repetition rate, poor quality
- **1.0**: No repetition, highest quality
- **0.8-0.99**: Minor repetition, good quality
- **< 0.8**: High repetition rate, low quality

**Use Cases**:
- Filter low-quality, highly repetitive text
@@ -134,8 +165,8 @@ N-gram Score = Unique n-gram count / Total n-gram count
- Filter template text from scraped data

**Notes**:
- Uses character-level n-gram (default n=5), suitable for mixed Chinese-English
- Uses character-level n-gram for Chinese and word-level n-gram for English (default n=5)
- Higher scores indicate better text diversity
- Short text (< n characters) may have abnormal scores
- Short text (< n characters/tokens) may have unstable scores
- Different languages and scenarios may require adjusted `min_score` thresholds
- Special text types like poetry and lyrics may score low due to repetition
- Special text types such as poetry and lyrics may score low due to intentional repetition
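
The note about short text can be made concrete with a small sketch. It assumes whitespace tokenization, and the helper `ngram_counts` is hypothetical. It only counts n-grams; how the real operator scores a text with no n-grams at all is not specified here.

```python
def ngram_counts(text: str, n: int = 5) -> tuple[int, int]:
    """Return (total n-grams, unique n-grams) for whitespace-tokenized text."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(grams), len(set(grams))


for sample in ["too short", "test test test test test", "test " * 18]:
    total, unique = ngram_counts(sample)
    print(f"{len(sample.split())} tokens -> {total} n-grams, {unique} unique")
# 2 tokens -> 0 n-grams, 0 unique
# 5 tokens -> 1 n-grams, 1 unique
# 18 tokens -> 14 n-grams, 1 unique
```

With fewer than `n` tokens the ratio is undefined (0/0). With exactly `n` tokens there is a single n-gram, so the ratio is 1.0 even for a fully repetitive string such as the second sample. This is one reason to treat scores on very short inputs with caution.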