Important: I will pause this issue and work on #75 in the meantime. Fetching the negative-labelled articles might take at least one week.
Problem
The NGO's tagged data only contains positive labels (e.g., this tweet IS a public service report). At this point, we haven't included negative labels (e.g., this tweet is NOT a public service report).
Proposed Solution
Add negative labels from the 2020 tagged data.
Tasks
Negative labels web scraping strategy
- Loop over `elpitazo.net/category/<LOCATION>/page/<N>` to get all the links from `PSCDD` positive labels dataset.
- Select the links that aren't in `PSCDD` positive labels dataset.
- Web scrape these links with the `PSCDD` elpitazo web scraper.
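
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the project's scraper: the `https` scheme, every function name, and the URL normalization rule are assumptions. The sketch also collects every `<a href>` on a page, so a real listing page would likely need a narrower selector to skip navigation links. Fetching the pages and running the `PSCDD` elpitazo web scraper are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

BASE_URL = "https://elpitazo.net"  # assumption: the site is served over https


def category_page_url(location: str, page: int) -> str:
    """Build elpitazo.net/category/<LOCATION>/page/<N> for one listing page."""
    return f"{BASE_URL}/category/{location}/page/{page}"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str, page_url: str) -> list[str]:
    """Return absolute URLs for all links found in one listing page."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.links]


def normalize(url: str) -> str:
    """Hypothetical comparison key: host without 'www.' plus path without trailing slash."""
    parts = urlparse(url)
    return parts.netloc.lower().removeprefix("www.") + parts.path.rstrip("/")


def select_negative_candidates(scraped_links, positive_links):
    """Keep scraped links that are absent from the positive labels dataset."""
    positive = {normalize(url) for url in positive_links}
    seen = set()
    candidates = []
    for url in scraped_links:
        key = normalize(url)
        if key not in positive and key not in seen:
            seen.add(key)
            candidates.append(url)
    return candidates
```

Comparing normalized keys rather than raw strings avoids treating the same article as new just because one copy of the URL has a trailing slash or a `www.` prefix.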
Notes
count 2401
unique 397
top 2020-06-10 00:00:00
freq 27
first 2019-05-02 00:00:00
last 2020-10-30 00:00:00
News articles per location
| location | count |
| --- | --- |
| occidente | 519 |
| gran-caracas | 403 |
| oriente | 396 |
| los-andes | 287 |
| los-llanos | 284 |
| centro | 196 |
| guayana | 93 |
| pitazo-en-la-calle | 88 |
| regiones | 64 |
| economia | 21 |
| infociudadanos | 16 |
| tecnologia | 10 |
| vista_2 | 8 |
| reportajes | 4 |
| radio | 3 |
| alianzas | 2 |
| sucesos | 2 |
| salud | 2 |
| sin-categoria | 1 |
| fotogalerias | 1 |
| cronicas | 1 |
Tasks
Use the data annotated last year in C4V for negative labels (and positive labels, if quick).
The labels are not compatible with Add Positive labels [PSCDD] #56. At this point, it would take more effort to unify the schemas than to web scrape from scratch.
If it's not possible to use last year's negative labels:
- [ ] Web scrape El Pitazo articles whose URLs are not in the Add Positive labels [PSCDD] - elpitazo #48 dataset. This will give us articles that are not public service problems.
- [ ] Concatenate these articles with the Add Positive labels [PSCDD] - elpitazo #48 dataset.
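
The concatenation step could look something like the sketch below. It assumes each article is a dict with a `url` key and uses `label` values of 1 and 0; these field names and the function name are hypothetical, since the actual schema of the #48 dataset is not described here.

```python
def build_labelled_dataset(positive_articles, negative_articles):
    """Concatenate both sets of articles, adding a binary 'label' field.

    Assumption: every article is a dict that has at least a 'url' key.
    """
    positive_urls = {article["url"] for article in positive_articles}
    overlap = [a["url"] for a in negative_articles if a["url"] in positive_urls]
    if overlap:
        # A URL in both sets would give the same article two conflicting labels.
        raise ValueError(f"URLs present in both sets: {overlap}")

    labelled = [{**article, "label": 1} for article in positive_articles]
    labelled += [{**article, "label": 0} for article in negative_articles]
    return labelled
```

Raising on overlapping URLs is a design choice for this sketch: it surfaces a filtering mistake early instead of silently training on contradictory labels.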