Label Studio User Guide - Text Information Extraction

1. Installation

Environment configuration used in the annotation examples below:

  • Python 3.8+
  • label-studio == 1.6.0
  • paddleocr >= 2.6.0.1

Use pip to install label-studio in the terminal:

pip install label-studio==1.6.0

Once the installation is complete, run the following command:

label-studio start

Open http://localhost:8080/ in your browser, log in with your username and password, and start labeling with label-studio.

2. Text extraction task annotation

2.1 Project Creation

Click Create to start creating a new project, then configure it as follows:

  • Fill in the project name and description.
  • For Named Entity Recognition, Relation Extraction, Event Extraction, and Opinion Extraction tasks, please select `Relation Extraction`.
  • For text classification and sentence-level sentiment classification tasks, please select `Text Classification`.
  • Define labels

The figure shows the construction of entity-type labels; labels for the other task types can be built similarly, see 2.3 Label construction.

2.2 Data upload

Upload a local .txt file, select List of tasks, and then import it into this project.

2.3 Label construction

  • Entity label
  • Relation label

Relation XML template:

   <Relations>
     <Relation value="Singer"/>
     <Relation value="Published"/>
     <Relation value="Album"/>
   </Relations>
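An entity Labels template can be written in the same way. The following is an illustrative sketch using Label Studio's `<Labels>` configuration syntax; substitute your own label values:

```xml
<Labels name="label" toName="text">
  <Label value="时间"/>
  <Label value="选手"/>
  <Label value="赛事名称"/>
  <Label value="得分"/>
</Labels>
```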
  • Classification label

2.4 Task annotation

  • Entity extraction

Annotation example:

The schema corresponding to this annotation example is:

schema = [
    '时间',      # time
    '选手',      # player
    '赛事名称',  # event name
    '得分'       # score
]
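A flat schema like this is applied one entry at a time: each entity type is paired with the input text as a separate extraction query. A minimal sketch of that pairing (the helper name is illustrative, not the actual UIE internals):

```python
def build_entity_prompts(schema, text):
    """Pair each entity type in a flat schema with the input text.

    Each (prompt, text) pair corresponds to one extraction query.
    """
    return [(entity_type, text) for entity_type in schema]

schema = ['时间', '选手', '赛事名称', '得分']  # time, player, event name, score
pairs = build_entity_prompts(schema, '谷爱凌以188.25分获得金牌')
```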
  • Relation extraction

For relation extraction, choosing the predicate type P well is very important, and the following principle should be followed:

"The {P} of {S} is {O}" must form a semantically reasonable phrase. For example, for a triple (S, father-and-son, O), the relation category father-and-son is not wrong in itself, but under the current structure of the UIE relation-type prompt, "the father-and-son of S is O" reads awkwardly, so it is better to change P to child, i.e. "the child of S is O". A well-chosen P type significantly improves zero-shot performance.

The schema corresponding to this annotation example is:

schema = {
    '作品名': [      # work title
        '歌手',      # singer
        '发行时间',  # release date
        '所属专辑'   # album
    ]
}
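A nested schema like this is evaluated in two stages: first the subject type (作品名) is extracted, then each extracted subject span is combined with every predicate into a second-stage prompt of the form "{S}的{P}". A rough sketch of that composition (the subject span '告白气球' and the helper name are illustrative assumptions, not output of the actual pipeline):

```python
def build_relation_prompts(schema, extracted):
    """Compose second-stage relation prompts '{subject}的{predicate}'
    from a nested schema and first-stage extraction results (sketch)."""
    prompts = []
    for subject_type, predicates in schema.items():
        for subject in extracted.get(subject_type, []):
            for predicate in predicates:
                prompts.append(f'{subject}的{predicate}')
    return prompts

schema = {'作品名': ['歌手', '发行时间', '所属专辑']}
prompts = build_relation_prompts(schema, {'作品名': ['告白气球']})
```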
  • Event extraction

The schema corresponding to this annotation example is:

schema = {
    '地震触发词': [  # earthquake trigger word
        '时间',      # time
        '震级'       # magnitude
    ]
}
  • Sentence level classification

The schema corresponding to this annotation example is:

schema = '情感倾向[正向，负向]'  # sentiment tendency [positive, negative]
  • Opinion Extraction

The schema corresponding to this annotation example is:

schema = {
    '评价维度': [               # evaluation aspect
        '观点词',               # opinion word
        '情感倾向[正向，负向]'  # sentiment tendency [positive, negative]
    ]
}

2.5 Data Export

Tick the IDs of the annotated texts, select JSON as the export file type, and export the data:

2.6 Data conversion

Rename the exported file to label_studio.json and place it in the ./data directory. Then run the label_studio.py script to convert it into the UIE data format.

  • Extraction task
python label_studio.py \
     --label_studio_file ./data/label_studio.json \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
     --task_type ext
  • Sentence-level classification tasks

During data conversion, prompt information for model training is constructed automatically. For sentence-level sentiment classification, for example, the prompt is Sentiment Classification [positive, negative], which can be configured through the prompt_prefix and options parameters.

python label_studio.py \
     --label_studio_file ./data/label_studio.json \
     --task_type cls \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
     --prompt_prefix "Sentiment Classification" \
     --options "positive" "negative"
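The resulting classification prompt is essentially the prefix followed by the bracketed option list. A sketch of that composition (the helper is illustrative, not the actual label_studio.py code, and the exact spacing may differ):

```python
def build_cls_prompt(prompt_prefix, options):
    """Compose a classification prompt such as
    'Sentiment Classification[positive,negative]' (sketch)."""
    return f"{prompt_prefix}[{','.join(options)}]"

prompt = build_cls_prompt('Sentiment Classification', ['positive', 'negative'])
```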
  • Opinion Extraction

During data conversion, prompt information for model training is constructed automatically. For the sentiment classification of an evaluation aspect, for example, the prompt is Sentiment Classification of xxx [positive, negative], which can be declared through the prompt_prefix and options parameters.

python label_studio.py \
     --label_studio_file ./data/label_studio.json \
     --task_type ext \
     --save_dir ./data \
     --splits 0.8 0.1 0.1 \
     --prompt_prefix "Sentiment Classification" \
     --options "positive" "negative" \
     --separator "##"
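Under this setup, the separator joins the evaluation aspect and its sentiment label inside a single annotation label, and the conversion step splits them back apart. A sketch under that assumption (the example label '性价比##正向' and the helper name are illustrative):

```python
def split_aspect_label(label, separator='##'):
    """Split a combined label like '性价比##正向' into (aspect, sentiment)."""
    aspect, sentiment = label.split(separator, 1)
    return aspect, sentiment

aspect, sentiment = split_aspect_label('性价比##正向')
```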

2.7 More Configuration

  • label_studio_file: Data labeling file exported from label studio.
  • save_dir: The storage directory of the training data, which is stored in the data directory by default.
  • negative_ratio: The maximum negative-example ratio. This parameter is only valid for extraction tasks, and only for the training set. Properly constructed negative examples can improve model performance; the number of negatives depends on the actual number of labels, with maximum negatives = negative_ratio * number of positive examples. Defaults to 5. To keep the evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
  • splits: The proportions used to divide the dataset into training, validation, and test sets. Defaults to [0.8, 0.1, 0.1], i.e. an 8:1:1 split.
  • task_type: The task type; extraction (ext) and classification (cls) are supported.
  • options: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"].
  • prompt_prefix: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency".
  • is_shuffle: Whether to randomly shuffle the data set, the default is True.
  • seed: random seed, default is 1000.
  • schema_lang: Select the language of the schema, which will be the construction method of the training data prompt, optional ch and en. Defaults to ch.
  • separator: The separator between the entity category/evaluation aspect and the classification label. This parameter is only valid for entity/evaluation-aspect classification tasks. Defaults to "##".
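The splits, is_shuffle, and seed parameters together determine a deterministic partition of the data. A sketch of the behaviour they describe (mirroring the documented defaults; not the actual label_studio.py implementation):

```python
import random

def split_dataset(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    """Shuffle with a fixed seed, then cut into train/dev/test by ratio."""
    data = list(examples)
    if is_shuffle:
        # A fixed seed makes the shuffle, and hence the split, reproducible.
        random.Random(seed).shuffle(data)
    n_train = int(len(data) * splits[0])
    n_dev = int(len(data) * splits[1])
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

train, dev, test = split_dataset(range(100))
```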

Note:

  • By default the label_studio.py script will divide the data proportionally into train/dev/test datasets
  • Each time the label_studio.py script is executed, the existing data file with the same name will be overwritten
  • In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by negative_ratio; the number of negative samples = negative_ratio * the number of positive samples.
  • The script assumes by default that every entry in the file exported from label_studio has been correctly annotated manually.
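The negative-sampling note above amounts to a simple cap on the training set; a one-line sketch (illustrative):

```python
def max_negatives(num_positives, negative_ratio=5):
    """Upper bound on auto-constructed negatives for the training set."""
    return negative_ratio * num_positives

cap = max_negatives(200)  # with the default ratio of 5
```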

References