📊 war: add brecke dataset #1367

lucasrodes · 2023-07-20T21:02:00Z

parent issue: https://github.com/owid/owid-issues/issues/446

👀 dataset preview

📝 Notes

Each row in the original dataset reports data on a particular conflict.
- That is, the conflict data is not disaggregated by year.
- Therefore, we need to estimate the number of deaths per year by distributing each conflict's deaths uniformly over the years of the conflict.
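
A minimal sketch of the uniform distribution described above, in plain Python. The function name `spread_deaths` and its arguments are hypothetical and are not taken from the PR's code; the sketch assumes start and end years are inclusive.

```python
from typing import Optional


def spread_deaths(
    start_year: int, end_year: int, deaths: Optional[float]
) -> dict[int, Optional[float]]:
    """Split a conflict's total deaths evenly over its years (inclusive range)."""
    years = range(start_year, end_year + 1)
    n_years = len(years)
    if n_years == 0:
        raise ValueError("end_year must not be earlier than start_year")
    # A missing total stays missing for every year; it is not treated as zero.
    per_year = None if deaths is None else deaths / n_years
    return {year: per_year for year in years}


# Illustrative numbers only: a two-year conflict with 3000 deaths in total.
yearly = spread_deaths(1539, 1540, 3000)
```

Each year of the conflict receives the same share, and the shares add up to the reported total.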
Missing values for the number of fatalities:
- In the notes from the source, the author mentions that these fields are filled "when possible".
- There are 3,708 rows (i.e. conflicts reported). Of these, only 32% have a value for the number of fatalities.
- Similarly, only 11% of all conflicts have a value for the number of military fatalities.
- Therefore, I think that missing values should not be assumed to be zero.
- Consequently, I have estimated the total number of fatalities in a region for a year only if there are no missing values for that year.
  - An example of this is "World" for "all conflicts". We only have reports for World when conflict_type="intrastate". For example, in the year 1539 there are three conflicts: _Spain-Yucatan, 1539_ (inter), _Spain (Ghent), 1539-40_ (intra) and _Spain-Florida, 1539_ (inter). The field total_fatalities is only filled for the intrastate conflict; for the two interstate conflicts it is null (and I think assuming zero would be wrong?).
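
The aggregation rule above (report a total only when no value is missing) can be sketched as follows. The helper name `total_or_missing` is hypothetical, and the fatality figure is a placeholder rather than a value from the dataset.

```python
from typing import Iterable, Optional


def total_or_missing(values: Iterable[Optional[float]]) -> Optional[float]:
    """Sum the values, but return None as soon as any value is missing."""
    values = list(values)
    if not values or any(v is None for v in values):
        return None
    return sum(values)


# Year 1539, as in the example above. 1000.0 is a placeholder, not Brecke's figure.
intrastate_1539 = total_or_missing([1000.0])
all_conflicts_1539 = total_or_missing([None, 1000.0, None])
```

With these inputs the intrastate total is reported, while the "all conflicts" total stays missing because two of the three conflicts have no estimate.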
Conflict "Japan, Germany-US,USSR,Britain, China,others, 1937-45" (row 3233) has region '-9' assigned, for which there is no mapping in the source documentation. I have labelled this region as "Unknown" for now. It looks as if it stands for the "World" region. If so, we could split the number of deaths across all regions, and count it as an ongoing and new conflict in every region.

bastianherre · 2023-07-25T15:08:08Z

I agree with distributing deaths evenly across years for conflicts that last several years. I also agree that we should code conflict #3233 as region 'World' and distribute the deaths accordingly.

I worry that the approach you propose for dealing with the many conflicts for which death estimates are missing will make it difficult to visualize the data, and may confuse our users. We will have missing values for a fair (if not large) share of years. This will distort line charts, as the many years with missing data (which most likely have few deaths) will be skipped, and the lines drawn across them. This will make the (implicit) area under the line incorrectly large. Bar charts are not a good alternative: because the dataset covers so many years, users will struggle to see that years are skipped, or which ones are missing.

At the same time, I agree that entirely ignoring these conflicts and setting their deaths to zero (which is what our previous work with the data did) also seems wrong.

I therefore propose another approach: Brecke writes that he only includes major violent conflicts. Among other characteristics, this means for him that there were at least 32 deaths per year. So for conflicts with missing death estimates, we could create a lower-bound (possibly very low) estimate of 32 deaths for conflicts that lasted one year, 64 for those lasting two years, and so on. This would take the source seriously, allow us to calculate aggregates while still including these conflicts, and let us use line charts to visualize the data.
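
A minimal sketch of this lower-bound rule, assuming inclusive start and end years. The names `MIN_DEATHS_PER_YEAR` and `lower_bound_deaths` are hypothetical and not taken from the PR's code.

```python
from typing import Optional

# Threshold of 32 deaths per year, as quoted from Brecke's definition above.
MIN_DEATHS_PER_YEAR = 32


def lower_bound_deaths(
    start_year: int, end_year: int, deaths: Optional[float]
) -> float:
    """Keep a reported estimate; otherwise fall back to 32 deaths per conflict year."""
    if deaths is not None:
        return deaths
    n_years = end_year - start_year + 1  # inclusive range
    return MIN_DEATHS_PER_YEAR * n_years


one_year = lower_bound_deaths(1539, 1539, None)   # 32
two_years = lower_bound_deaths(1539, 1540, None)  # 64
```

Reported estimates pass through unchanged, so the fallback only affects conflicts whose fatalities are missing.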

I would definitely make that clear in the indicator description, and probably even add a disclaimer to each chart using the data.

This also means that, for now, we focus entirely on total fatalities and set military fatalities aside, because we cannot make an analogous inference for the latter.

What do you think about this approach?

lucasrodes · 2023-07-25T22:27:38Z

@bastianherre
Thanks for the careful explanation. Given the context, I think your proposal makes sense and is a good trade-off. I'll implement this, address the issues raised in the spreadsheet, and get back to you.

lucasrodes · 2023-07-26T12:02:33Z

Hi @bastianherre, in Brecke's dataset, there are 26 conflicts with unspecified end years.

| conflict_code | name | startyear |
|---|---|---|
| 3306 | Israel-Palestinians, 1948- | 1948 |
| 3587 | Sudan (south), 1982- | 1982 |
| 3592 | Sri Lanka (Tamils), 1983- | 1983 |
| 3593 | India (Assam), 1983- | 1983 |
| 3594 | Sudan, 1983- G | 1983 |
| 3595 | Turkey (Kurds), 1984- | 1984 |
| 3598 | Colombia, 1984- | 1984 |
| 3599 | India (Sikhs), 1984- | 1984 |
| 3641 | India (Jammu and Kashmir), 1990- | 1990 |
| 3654 | Cambodia, 1991- | 1991 |
| 3664 | Egypt, 1992- | 1992 |
| 3670 | Somalia, 1993- | 1993 |
| 3677 | Burundi, 1995- | 1995 |
| 3680 | Nepal (Maoist rebellion), 1996- | 1996 |
| 3683 | Uganda (near Sudan), 1996- | 1996 |
| 3685 | Liberia, 1997- | 1997 |
| 3689 | Yemen (tribal uprising), 1998 | 1998 |
| 3693 | Sierra Leone, 1998- | 1998 |
| 3694 | Congo (Brazzaville), 1998- | 1998 |
| 3697 | Angola, 1998- | 1998 |
| 3699 | India (Jammu and Kashmir), 1998- | 1998 |
| 3700 | Indonesia (Ambon), 1999- | 1999 |
| 3702 | Russia (Dagestan), 1999- | 1999 |
| 3705 | Indonesia (Celebes, Christians vs Muslims), 1999 | 1999 |
| 3706 | Nigeria (Muslim vs Christian, Hausa vs Ibo), 2000 | 2000 |
| 3707 | Israel-Palestine, 2000 | 2000 |

My first intuition is that while these conflicts may have end years later than 2000, we should assume that the dataset only considers data until 2000. This means that:

- The number of deaths for a conflict with an unknown end year should be uniformly distributed from the start year until 2000. E.g., for conflict 3306 (Israel-Palestinians, 1948-), deaths should be uniformly distributed between 1948 and 2000.
- The number of ongoing and new conflicts should only be estimated until the year 2000.
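
The end-year assumption above can be sketched as follows. `LAST_YEAR` and `effective_end_year` are hypothetical names, and a missing end year is represented here as `None`.

```python
from typing import Optional

# Assumed last year covered by the dataset, per the reasoning above.
LAST_YEAR = 2000


def effective_end_year(end_year: Optional[int]) -> int:
    """Treat an unspecified end year as the last year covered by the dataset."""
    return LAST_YEAR if end_year is None else end_year


# Conflict 3306 (Israel-Palestinians, 1948-) has no end year in the source.
end = effective_end_year(None)
n_years = end - 1948 + 1  # inclusive count of years over which deaths are spread
```

For conflict 3306 this gives 53 conflict years (1948 to 2000 inclusive), so its deaths would be divided by 53.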

I haven't found much in Brecke's documentation on missing values for end-years.

bastianherre · 2023-07-26T12:06:57Z

Hi @lucasrodes! Thanks for checking in. Yes, let's do it as you say!

spoonerf

Hey @lucasrodes

The code generally looks good, but I think there are some issues with the World aggregate.

For some of the variables, the World value is less than the sum of all the other regions between 1937-1945 only. I guess this is because of the 'Unknown' flag around WW2? It just looks a bit wrong if you do a stacked area chart e.g. here.

The affected variables are:

For some other variables, the value for the World is greater than the sum of all the other regions, also just between 1937 and 1945. The affected variables are:

And for the soldier deaths, the World value is only equal to the sum of the other regions for the years 1937-1945, and for all other years, it is 0. These variables:

lucasrodes · 2023-07-27T15:56:30Z

Thanks for reviewing, Fiona; it helps a lot, really <3!

> For some of the variables, the World value is less than the sum of all the other regions between 1937-1945 only. I guess this is because of the 'Unknown' flag around WW2? It just looks a bit wrong if you do a stacked area chart e.g. here.

The number of ongoing conflicts in the World may not be the sum of all conflicts in all regions. This is because the same conflict may occur in multiple regions (e.g. WWII) but should only be counted as +1 globally. The same happens with the number of new conflicts.

In this particular period, we have +1 conflict in all regions (we consider WWII as an 'ongoing conflict' in all regions). So, if you add the numbers for all regions, you'd count this conflict several times.
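
To illustrate why the World count can be lower than the sum over regions, here is a small sketch that counts distinct conflicts. The records, region names, and function names are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical (conflict, region, year) records; names are illustrative only.
rows = [
    ("WWII", "Europe", 1940),
    ("WWII", "Asia", 1940),
    ("Other conflict", "Europe", 1940),
]


def ongoing_by_region(records, year):
    """Count distinct conflicts per region in a given year."""
    conflicts = defaultdict(set)
    for conflict, region, rec_year in records:
        if rec_year == year:
            conflicts[region].add(conflict)
    return {region: len(ids) for region, ids in conflicts.items()}


def ongoing_world(records, year):
    """Count distinct conflicts globally, so multi-region conflicts count once."""
    return len({conflict for conflict, _, rec_year in records if rec_year == year})


regional = ongoing_by_region(rows, 1940)
world = ongoing_world(rows, 1940)
```

Here the regional counts sum to 3, while the World count is 2, because the conflict spanning two regions is counted once globally.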

I have added a clarification to the description of the indicator number_ongoing_conflicts.

> For some other variables, the value for the World is greater than the sum of all the other regions, also just between 1937 and 1945. The affected variables are:

Good catch! I just found a critical bug in the code. It should be fixed now.

> And for the soldier deaths, the World value is only equal to the sum of the other regions for the years 1937-1945, and for all other years, it is 0. These variables:

I have just removed these metrics for now. They are not needed.

📊 war: add brecke dataset

db2f092

github-actions bot assigned lucasrodes Jul 20, 2023

lucasrodes added 3 commits July 20, 2023 23:03

fix source year

618f9a7

enhance

9a2bab3

Merge branch 'master' into data/brecke

79028a8

intrastate -> internal

ed13039

remove region -9

6121c37

lucasrodes marked this pull request as ready for review July 26, 2023 22:40

lucasrodes requested a review from spoonerf July 26, 2023 22:41

spoonerf reviewed Jul 27, 2023

minor fixes

60dd199

fix

7cc105f

lucasrodes merged commit 758f3f7 into master Jul 28, 2023

lucasrodes deleted the data/brecke branch July 28, 2023 09:58
