📊 war: add brecke dataset #1367

Merged
merged 8 commits into from
Jul 28, 2023
lucasrodes
Member

@lucasrodes lucasrodes commented Jul 20, 2023

parent issue: https://github.com/owid/owid-issues/issues/446

👀 dataset preview

📝 Notes

(mostly for @bastianherre)

  • Each row in the original dataset reports data on a particular conflict.
    • That is, the data are not disaggregated by year.
    • Therefore, we need to estimate the number of deaths per year by distributing each conflict's deaths uniformly over the years of the conflict.
  • Missing values for the number of fatalities:
    • In the notes from the source, the author mentions that these fields are filled "when possible".
    • There are 3,708 rows (i.e. conflicts reported). Of these, only 32% have a value for the number of fatalities.
    • Similarly, only 11% of all conflicts have values on the number of military fatalities.
    • Therefore, I think that missing values should not be assumed to be zero.
    • Consequently, I have estimated the total number of fatalities in a region for a year only if there are no missing values for that year.
      • An example of this is "World" for "all conflicts". We only have values for World when conflict_type="intrastate". Take the year 1539, for example, which has three conflicts: _Spain-Yucatan, 1539_ (inter), _Spain (Ghent), 1539-40_ (intra) and _Spain-Florida, 1539_ (inter). The field total_fatalities is only filled for the intra conflict; for the two inter conflicts it is null (and I think assuming zero would be wrong?).
  • Conflict "Japan, Germany-US,USSR,Britain, China,others, 1937-45" (row 3233) has region '-9' assigned, for which there is no mapping in the source documentation. I have labelled this region as "Unknown" for now. It looks as if it stands for the "World" region. If so, we could split the number of deaths across all regions, and count this conflict among the ongoing and new conflicts in every region.
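
The two estimation rules in the notes above (uniform split over the years of a conflict, and no yearly total when any value is missing) can be sketched as follows. This is only an illustration with made-up rows; the column names and the helper `spread_over_years` are hypothetical and are not taken from the actual ETL step.

```python
import pandas as pd

# Toy rows (made up). Column names are assumptions, not the source's schema.
conflicts = pd.DataFrame(
    {
        "conflict": ["A", "B", "C"],
        "start_year": [1539, 1539, 1540],
        "end_year": [1539, 1540, 1541],
        "total_fatalities": [float("nan"), 1000.0, 300.0],
    }
)


def spread_over_years(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per conflict-year, splitting deaths evenly across years."""
    df = df.copy()
    df["n_years"] = df["end_year"] - df["start_year"] + 1
    # Build the list of years covered by each conflict, then one row per year.
    df["year"] = [
        list(range(start, end + 1))
        for start, end in zip(df["start_year"], df["end_year"])
    ]
    df = df.explode("year")
    df["year"] = df["year"].astype(int)
    # A missing total stays missing in every year of the conflict.
    df["deaths"] = df["total_fatalities"] / df["n_years"]
    return df[["conflict", "year", "deaths"]].reset_index(drop=True)


yearly = spread_over_years(conflicts)

# skipna=False makes a year's total NaN as soon as one conflict has no estimate,
# instead of silently treating the missing value as zero.
totals = yearly.groupby("year")["deaths"].agg(lambda s: s.sum(skipna=False))
print(totals)
```

With these toy rows, 1539 gets no total because conflict A has no estimate, while 1540 sums the yearly shares of B and C.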

@bastianherre
Collaborator

Hi @lucasrodes!

I agree with distributing deaths evenly across years for conflicts that last several years. I also agree that we should code conflict #3233 as region 'World' and distribute the deaths accordingly.

I worry that the approach you propose for dealing with the many conflicts for which death estimates are missing will make it difficult to visualize the data, and may confuse our users. We will have missing values for a fair (if not large) share of years. This will distort line charts, as the many years with missing data (which most likely have few deaths) will be skipped, and the lines charted across them. This will make the (implicit) area under the line incorrectly large. Bar charts are not a good alternative, because the dataset covers many years, so users will struggle to see that many years are skipped, or which are not included.

At the same time, I agree that entirely ignoring these conflicts and setting their deaths to zero (which is what our previous work with the data did) also seems wrong.

I therefore propose another approach: Brecke writes that he only includes major violent conflicts. Among other characteristics, this means for him that there were at least 32 deaths per year. So what we could do for conflicts with missing death estimates is to create a (possibly very) lower-bound estimate of 32 deaths for conflicts that lasted one year, 64 for those lasting two years, and so on. This would take the source seriously, allow us to calculate aggregates while still including these conflicts, and we could use line charts to visualize the data.
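
A minimal sketch of this lower-bound fill, assuming a table with one row per conflict. The column names, the flag column and the toy values are hypothetical; only the threshold of 32 deaths per year comes from the text above.

```python
import pandas as pd

MIN_DEATHS_PER_YEAR = 32  # inclusion threshold quoted above

# Toy rows (made up); column names are assumptions.
conflicts = pd.DataFrame(
    {
        "start_year": [1539, 1539, 1600],
        "end_year": [1539, 1540, 1602],
        "total_fatalities": [float("nan"), 1000.0, float("nan")],
    }
)

n_years = conflicts["end_year"] - conflicts["start_year"] + 1
lower_bound = MIN_DEATHS_PER_YEAR * n_years

# Keep track of which values are imputed so they can be flagged in descriptions.
conflicts["fatalities_is_lower_bound"] = conflicts["total_fatalities"].isna()
# Only missing estimates are replaced; reported values are left untouched.
conflicts["total_fatalities"] = conflicts["total_fatalities"].fillna(lower_bound)
print(conflicts)
```

Keeping a boolean flag next to the filled values is one way to support the disclaimer mentioned below, since imputed and reported estimates remain distinguishable.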

I would definitely make that clear in the indicator description, and probably even add a disclaimer to each chart using the data.

This also means that, for now, we focus entirely on all fatalities and set military fatalities aside, because we cannot make any analogous inference about the latter.

What do you think about this approach?

@lucasrodes
Member Author

lucasrodes commented Jul 25, 2023

@bastianherre
Thanks for the careful explanation. Given the context, I think your proposal makes sense and is a good trade-off. I'll implement this and the issues raised in the spreadsheet and get back to you.

@lucasrodes
Member Author

lucasrodes commented Jul 26, 2023

Hi @bastianherre, in Brecke's dataset, there are 26 conflicts with unspecified end years.

| conflict_code | name | startyear | endyear |
|---|---|---|---|
| 3306 | Israel-Palestinians, 1948- | 1948 | |
| 3587 | Sudan (south), 1982- | 1982 | |
| 3592 | Sri Lanka (Tamils), 1983- | 1983 | |
| 3593 | India (Assam), 1983- | 1983 | |
| 3594 | Sudan, 1983- G | 1983 | |
| 3595 | Turkey (Kurds), 1984- | 1984 | |
| 3598 | Colombia, 1984- | 1984 | |
| 3599 | India (Sikhs), 1984- | 1984 | |
| 3641 | India (Jammu and Kashmir), 1990- | 1990 | |
| 3654 | Cambodia, 1991- | 1991 | |
| 3664 | Egypt, 1992- | 1992 | |
| 3670 | Somalia, 1993- | 1993 | |
| 3677 | Burundi, 1995- | 1995 | |
| 3680 | Nepal (Maoist rebellion), 1996- | 1996 | |
| 3683 | Uganda (near Sudan), 1996- | 1996 | |
| 3685 | Liberia, 1997- | 1997 | |
| 3689 | Yemen (tribal uprising), 1998 | 1998 | |
| 3693 | Sierra Leone, 1998- | 1998 | |
| 3694 | Congo (Brazzaville), 1998- | 1998 | |
| 3697 | Angola, 1998- | 1998 | |
| 3699 | India (Jammu and Kashmir), 1998- | 1998 | |
| 3700 | Indonesia (Ambon), 1999- | 1999 | |
| 3702 | Russia (Dagestan), 1999- | 1999 | |
| 3705 | Indonesia (Celebes, Christians vs Muslims), 1999 | 1999 | |
| 3706 | Nigeria (Muslim vs Christian, Hausa vs Ibo), 2000 | 2000 | |
| 3707 | Israel-Palestine, 2000 | 2000 | |

My first intuition is that while these conflicts may have end years later than 2000, we should assume that the dataset only considers data until 2000. This means that:

  • The number of deaths for a conflict with an unknown end year should be uniformly distributed from the start year until 2000. E.g., for conflict 3306 (Israel-Palestinians, 1948-), we should distribute the deaths uniformly between 1948 and 2000.
  • The number of ongoing and new conflicts should only be estimated until the year 2000.
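
As a rough sketch of the first point, missing end years could be capped at 2000 before computing the per-year share. The fatality figures below are made up for illustration, and `LAST_YEAR`, `n_years` and `deaths_per_year` are hypothetical names.

```python
import pandas as pd

LAST_YEAR = 2000  # assumed last year covered by the dataset

# conflict_code, startyear and endyear follow the table above;
# total_fatalities values are made up for illustration only.
conflicts = pd.DataFrame(
    {
        "conflict_code": [3306, 3707, 3233],
        "startyear": [1948, 2000, 1937],
        "endyear": [float("nan"), float("nan"), 1945],
        "total_fatalities": [5300.0, 100.0, 900.0],
    }
)

# Treat an unspecified end year as "still ongoing at the end of the dataset".
conflicts["endyear"] = conflicts["endyear"].fillna(LAST_YEAR).astype(int)
conflicts["n_years"] = conflicts["endyear"] - conflicts["startyear"] + 1
conflicts["deaths_per_year"] = conflicts["total_fatalities"] / conflicts["n_years"]
print(conflicts)
```

Conflict 3306 then spans 53 years (1948 to 2000 inclusive), and a conflict that starts in 2000 spans a single year.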

I haven't found much in Brecke's documentation on missing values for end-years.

@bastianherre
Collaborator

Hi @lucasrodes! Thanks for checking in. Yes, let's do it as you say!

@lucasrodes lucasrodes marked this pull request as ready for review July 26, 2023 22:40
@lucasrodes lucasrodes requested a review from spoonerf July 26, 2023 22:41
Contributor

@spoonerf spoonerf left a comment

Hey @lucasrodes

The code generally looks good, but I think there are some issues with the World aggregate.

For some of the variables, the World value is less than the sum of all the other regions between 1937-1945 only. I guess this is because of the 'Unknown' flag around WW2? It just looks a bit wrong if you do a stacked area chart e.g. here.

The affected variables are:

For some other variables, the value for the World is greater than the sum of all the other regions, also just between 1937 and 1945. The affected variables are:

And for the soldier deaths, the World value is only equal to the sum of the other regions for the years 1937-1945, and for all other years, it is 0. These variables:

@lucasrodes
Member Author

lucasrodes commented Jul 27, 2023

Thanks for reviewing, Fiona; it helps a lot, really <3!

> For some of the variables, the World value is less than the sum of all the other regions between 1937-1945 only. I guess this is because of the 'Unknown' flag around WW2? It just looks a bit wrong if you do a stacked area chart e.g. here.

The number of ongoing conflicts in the World may not be the sum of all conflicts in all regions. This is because the same conflict may occur in multiple regions (e.g. WWII) but should only be counted as +1 globally. The same happens with the number of new conflicts.

In this particular period, we have +1 conflict in all regions (we consider WWII as an 'ongoing conflict' in all regions). So, if you add the numbers for all regions, you'd count this conflict several times.
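
To illustrate why the regional counts do not add up to the World count, here is a small sketch that counts distinct conflicts per group. The rows, the region names and the conflict labels are made up, and this is not necessarily how the step computes the indicator.

```python
import pandas as pd

# One row per conflict, region and year (toy data; names are hypothetical).
rows = pd.DataFrame(
    {
        "conflict": ["WWII", "WWII", "WWII", "X"],
        "region": ["Europe", "Asia", "Africa", "Europe"],
        "year": [1940, 1940, 1940, 1940],
    }
)

# Regional counts: a multi-region conflict counts once in each of its regions.
per_region = rows.groupby(["year", "region"])["conflict"].nunique()

# World count: each conflict counts once, however many regions it spans.
world = rows.groupby("year")["conflict"].nunique()

print(per_region)
print(world)
```

Here the regional counts sum to 4 for 1940, while the World count is 2.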

I have added a clarification to the description of the indicator number_ongoing_conflicts.

> For some other variables, the value for the World is greater than the sum of all the other regions, also just between 1937 and 1945. The affected variables are:

Good catch! I just found a critical bug in the code. It should be fixed now.

> And for the soldier deaths, the World value is only equal to the sum of the other regions for the years 1937-1945, and for all other years, it is 0. These variables:

I have just removed these metrics for now. They are not needed.

@lucasrodes lucasrodes merged commit 758f3f7 into master Jul 28, 2023
@lucasrodes lucasrodes deleted the data/brecke branch July 28, 2023 09:58