Environment configuration used in the annotation examples below:
- Python 3.8+
- label-studio == 1.6.0
- paddleocr >= 2.6.0.1
Use pip to install label-studio in the terminal:
pip install label-studio==1.6.0
Once the installation is complete, run the following command line:
label-studio start
Open http://localhost:8080/ in the browser, enter the user name and password to log in, and start using label-studio for labeling.
Click Create to start creating a new project:
- Fill in the project name and description.
- For Named Entity Recognition, Relation Extraction, Event Extraction, and Opinion Extraction tasks, select `Relation Extraction`.
- For Text Classification and Sentence-level Sentiment Classification tasks, select `Text Classification`.
- Define labels
The figure shows how entity-type labels are constructed; labels for the other task types can be built by referring to 2.3 Label Construction.
First upload the local .txt file, select `List of tasks`, and then import it into this project.
- Entity label
- Relation label
Relation XML template:
<Relations>
  <Relation value="Singer"/>
  <Relation value="Published"/>
  <Relation value="Album"/>
</Relations>
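When the relation set grows, the template above can also be generated programmatically. A minimal sketch using only the Python standard library (the relation names are the ones from the template; the helper itself is illustrative, not part of Label Studio):

```python
import xml.etree.ElementTree as ET

# Build the <Relations> block used in the Label Studio labeling config.
relations = ET.Element("Relations")
for value in ["Singer", "Published", "Album"]:
    ET.SubElement(relations, "Relation", value=value)

xml_snippet = ET.tostring(relations, encoding="unicode")
print(xml_snippet)
```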
- Classification label
- Entity extraction
Annotation example:
The schema corresponding to this annotation example is:
schema = [
    '时间',      # time
    '选手',      # player
    '赛事名称',  # competition name
    '得分'       # score
]
- Relation extraction
For relation extraction, choosing the type of P is very important, and the following principle should be followed:
"{P} of {S} is {O}" must form a semantically reasonable phrase. For example, for a triple (S, father and son, O), "father and son" is a valid relation category in itself. However, given the current structure of the UIE relation-type prompt, the expression "the father and son of S is O" does not read fluently, so it is better to change P to "child", i.e. "the child of S is O". A well-chosen P type significantly improves zero-shot performance.
The schema corresponding to this annotation example is:
schema = {
    '作品名': [      # work title
        '歌手',      # singer
        '发行时间',  # release time
        '所属专辑'   # album
    ]
}
- Event extraction
The schema corresponding to this annotation example is:
schema = {
    '地震触发词': [  # earthquake trigger word
        '时间',      # time
        '震级'       # magnitude
    ]
}
- Sentence level classification
The schema corresponding to this annotation example is:
schema = '情感倾向[正向，负向]'  # sentiment tendency [positive, negative]
- Opinion Extraction
The schema corresponding to this annotation example is:
schema = {
    '评价维度': [               # evaluation dimension
        '观点词',               # opinion word
        '情感倾向[正向，负向]'  # sentiment tendency [positive, negative]
    ]
}
Tick the IDs of the annotated texts, select JSON as the export file type, and export the data.
Rename the exported file to label_studio.json and put it in the ./data directory. It can then be converted into UIE's data format with the label_studio.py script.
- Extraction task
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--task_type ext
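After conversion, each line of the output files is one JSON example pairing a schema prompt with its answer spans. The record below is a hand-written illustration of that layout (the sentence and offsets are chosen for this sketch, not taken from a real export):

```python
import json

# One converted extraction example: the prompt is a schema entry
# ("时间", i.e. time), and result_list holds the answer span with
# character offsets into content.
example = {
    "content": "2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌夺金",
    "result_list": [{"text": "2月8日上午", "start": 0, "end": 6}],
    "prompt": "时间",
}

line = json.dumps(example, ensure_ascii=False)
# The span recorded in result_list must match the slice of content it points at.
span = example["result_list"][0]
assert example["content"][span["start"]:span["end"]] == span["text"]
print(line)
```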
- Sentence-level classification tasks
In the data conversion stage, prompt information for model training is constructed automatically. For example, in sentence-level sentiment classification the prompt is Sentiment Classification [positive, negative], which can be configured through the prompt_prefix and options parameters.
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--task_type cls \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--prompt_prefix "Sentiment Classification" \
--options "positive" "negative"
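The prompt described above is essentially the prefix concatenated with the bracketed option list. The helper below is an illustrative reimplementation of that assembly, not the script's actual code:

```python
# Sketch of how the classification prompt is assembled from the
# --prompt_prefix and --options arguments shown above (illustrative).
def build_cls_prompt(prompt_prefix: str, options: list) -> str:
    return prompt_prefix + "[" + ",".join(options) + "]"

prompt = build_cls_prompt("Sentiment Classification", ["positive", "negative"])
print(prompt)  # Sentiment Classification[positive,negative]
```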
- Opinion Extraction
In the data conversion stage, prompt information for model training is constructed automatically. For example, for sentiment classification of an evaluation dimension the prompt is Sentiment Classification of xxx [positive, negative], which can be declared through the prompt_prefix and options parameters.
python label_studio.py \
--label_studio_file ./data/label_studio.json \
--task_type ext \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--prompt_prefix "Sentiment Classification" \
--options "positive" "negative" \
--separator "##"
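The separator joins the evaluation dimension with the classification label into one combined tag. A minimal sketch of splitting such a tag back apart (the label "外观##正向", appearance##positive, is a made-up example):

```python
# Hypothetical combined label: evaluation dimension + separator + polarity.
label = "外观##正向"  # appearance ## positive (illustrative)
separator = "##"

aspect, polarity = label.split(separator)
# The aspect then parameterizes the classification prompt, e.g.
# "外观的情感倾向[正向，负向]" (sentiment of appearance [positive, negative]).
prompt = f"{aspect}的情感倾向[正向，负向]"
print(aspect, polarity, prompt)
```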
- label_studio_file: data annotation file exported from Label Studio.
- save_dir: storage directory for the training data; defaults to the data directory.
- negative_ratio: maximum negative-example ratio; only valid for extraction tasks. Properly constructed negative examples can improve model performance. The number of negatives is related to the actual number of labels: maximum negatives = negative_ratio * number of positives. This parameter only applies to the training set; the default is 5. To keep evaluation metrics accurate, the dev and test sets are built with all negative examples by default.
- splits: proportions for dividing the dataset into training, development, and test sets. The default is [0.8, 0.1, 0.1], i.e. an 8:1:1 split.
- task_type: task type; there are two types of tasks, extraction and classification.
- options: category labels of the classification task; only valid for classification tasks. Defaults to ["positive", "negative"].
- prompt_prefix: prompt prefix of the classification task; only valid for classification tasks. Defaults to "Sentimental Tendency".
- is_shuffle: whether to randomly shuffle the dataset; defaults to True.
- seed: random seed; defaults to 1000.
- schema_lang: language of the schema, which determines how training prompts are constructed; either ch or en. Defaults to ch.
- separator: separator between the entity category/evaluation dimension and the classification label; only valid for entity/evaluation-dimension classification tasks. Defaults to "##".
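The interaction of splits, is_shuffle, seed, and negative_ratio can be sketched as follows (an illustration of the documented behavior, not the script itself):

```python
import random

def split_dataset(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    """Divide examples into train/dev/test by the given proportions."""
    examples = list(examples)
    if is_shuffle:
        random.Random(seed).shuffle(examples)  # seeded, reproducible shuffle
    n = len(examples)
    i1 = int(n * splits[0])
    i2 = i1 + int(n * splits[1])
    return examples[:i1], examples[i1:i2], examples[i2:]

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 80 10 10

# Cap on automatically constructed negatives for the training set:
negative_ratio, num_positives = 5, len(train)
max_negatives = negative_ratio * num_positives  # 400
```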
Note:
- By default, the label_studio.py script divides the data proportionally into train/dev/test sets.
- Each time the label_studio.py script is executed, existing data files with the same name are overwritten.
- In the model training phase we recommend constructing negative examples to improve model performance; this function is built into the data conversion phase. The proportion of automatically constructed negatives can be controlled by negative_ratio; number of negatives = negative_ratio * number of positives.
- For files exported from Label Studio, each record in the file is assumed to have been correctly labeled manually.