SIGARRA News Corpus

André Pires edited this page Jun 16, 2017 · 10 revisions

I manually annotated a subset of SIGARRA news, from its different domains, using the brat tool.

Description

SIGARRA is the information system of the University of Porto (UP), where every organic unit has its own domain. SIGARRA has a news section, so I manually annotated some of its news items using the brat tool. First, the news items were gathered from the information system and saved to a CSV file with the following attributes: news id, title, subtitle, source url, content, and published date. The gathered news items were published between 2016-12-14 and 2017-03-01.
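As a rough illustration of working with such a file, the sketch below reads a CSV with those six attributes and keeps the rows inside the publication window. The column names (`news_id`, `published_date`, and so on), the ISO date format, and the sample rows are assumptions made for the example, not the actual header or contents of the corpus file.

```python
import csv
import io
from datetime import date


def load_news(handle, start=date(2016, 12, 14), end=date(2017, 3, 1)):
    """Return the rows whose published date falls within [start, end]."""
    rows = []
    for row in csv.DictReader(handle):
        # Assumption: dates are ISO formatted (YYYY-MM-DD), optionally
        # followed by a time, so the first 10 characters are the date.
        published = date.fromisoformat(row["published_date"][:10])
        if start <= published <= end:
            rows.append(row)
    return rows


# Invented sample data with hypothetical column names.
sample = io.StringIO(
    "news_id,title,subtitle,source_url,content,published_date\n"
    "1,Title A,Subtitle A,https://example.org/a,Body A,2017-01-15\n"
    "2,Title B,Subtitle B,https://example.org/b,Body B,2017-04-02\n"
)
news = load_news(sample)
print([row["news_id"] for row in news])
```

With the real file, `open(path, newline="", encoding="utf-8")` would take the place of the in-memory sample.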

Apart from the HAREM collection, this is, to my knowledge, the only publicly available annotated corpus in Portuguese (from Portugal) to date. The corpus is roughly twice the size of the HAREM collection (approximately 86k tokens for HAREM versus 185k tokens for SIGARRA) and has about 1.7 times as many entity annotations (7255 for HAREM versus 12644 for SIGARRA).

Entity classes: Hora (Hour), Evento (Event), Organizacao (Organization), Curso (Course), Pessoa (Person), Localizacao (Location), Data (Date) and UnidadeOrganica (Organic Unit).
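brat stores annotations in a standoff format, where each entity is a tab-separated line holding an ID, the class with its character offsets, and the annotated text. The following is a minimal sketch of reading such lines and counting entities per class. The sample annotations are invented for illustration, and how the corpus files are actually organized is not covered here.

```python
from collections import Counter


def parse_entities(ann_text):
    """Parse text-bound (T) annotations from brat standoff content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, and other types
        ann_id, type_and_span, surface = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans are separated by ';'. Keep the first start
        # offset and the last end offset.
        fragments = [fragment.split() for fragment in span.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append((ann_id, label, start, end, surface))
    return entities


# Invented sample in brat standoff format (not taken from the corpus).
sample_ann = (
    "T1\tPessoa 0 11\tMaria Silva\n"
    "T2\tUnidadeOrganica 25 29\tFEUP\n"
    "T3\tData 33 43\t2017-03-01\n"
    "T4\tPessoa 50 60\tJoão Costa\n"
)
entities = parse_entities(sample_ann)
counts = Counter(label for _, label, *_ in entities)
print(counts.most_common())
```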

Distribution of entity annotations in SIGARRA news

| Entity tag | Number of annotations | % |
| --- | --- | --- |
| Data | 2811 | 22.23% |
| Organizacao | 2320 | 18.35% |
| Pessoa | 2159 | 17.08% |
| UnidadeOrganica | 1814 | 14.35% |
| Localizacao | 1593 | 12.60% |
| Hora | 1015 | 8.03% |
| Curso | 521 | 4.12% |
| Evento | 411 | 3.25% |
| Total | 12644 | 100% |
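The percentages follow directly from the counts. As a quick check, this short sketch recomputes each class's share of the total from the per-class annotation counts listed above (the variable names are only illustrative):

```python
# Per-class annotation counts from the distribution above.
counts = {
    "Data": 2811,
    "Organizacao": 2320,
    "Pessoa": 2159,
    "UnidadeOrganica": 1814,
    "Localizacao": 1593,
    "Hora": 1015,
    "Curso": 521,
    "Evento": 411,
}

total = sum(counts.values())
# Share of each class as a percentage, rounded to two decimals.
shares = {tag: round(100 * n / total, 2) for tag, n in counts.items()}

for tag, n in counts.items():
    print(f"{tag:<16} {n:>6} {shares[tag]:>6.2f}%")
print(f"{'Total':<16} {total:>6}")
```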

Entities by SIGARRA domain

Number of characters in news by SIGARRA domain

Number of news by SIGARRA domain
