[Fix] recovered cases in China - _subset_by_area() records selection #484

Closed
Inglezos opened this issue Jan 3, 2021 · 18 comments
Labels
bug Something isn't working

Comments

@Inglezos
Collaborator

Inglezos commented Jan 3, 2021

Summary

In the covid19dh.csv file, recovered cases for China exist only in the province-level records; they are not accumulated in the country-level "China, -" records. The _subset_by_area() method selects only the "China, -" records when no province is specified. As a result, the recovered cases for China appear to be zero and full complement is applied, even though the province records do hold the recovered-case information.

Code and outputs:

import covsirphy as cs
# Dataset preparation
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
# Scenario analysis
chn_scenario = cs.Scenario(jhu_data, population_data, "China")
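
The selection behaviour described in the summary can be illustrated on toy data. This is a minimal sketch, not the library's `_subset_by_area()` code: the numbers are made up, and only the column names and the "-" province marker follow the cleaned dataset discussed in this issue.

```python
import pandas as pd

# Toy records (made-up values). Province "-" marks the country-level row.
records = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-01-01"] * 3),
        "Country": ["China"] * 3,
        "Province": ["-", "Hubei", "Beijing"],
        "Confirmed": [100, 70, 30],
        "Recovered": [0, 60, 25],
    }
)

is_china = records["Country"] == "China"
# What a country-level selection sees: only the "-" row.
country_level = records.loc[is_china & (records["Province"] == "-")]
# What the provinces report when summed per date.
province_total = (
    records.loc[is_china & (records["Province"] != "-")]
    .groupby("Date")[["Confirmed", "Recovered"]]
    .sum()
)

print(country_level["Recovered"].tolist())   # [0]
print(province_total["Recovered"].tolist())  # [85]
```

The country-level row reports zero recovered cases while the province rows sum to a non-zero value, which is the mismatch reported here.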

Environment

  • CovsirPhy version: 2.13.3-iota
  • Python version: 3.8
  • Installation: Anaconda/pipenv
  • System: Windows
@Inglezos Inglezos added the bug Something isn't working label Jan 3, 2021
@Inglezos Inglezos changed the title from "[Fix] _subset_by_area() ignores all province recovered records if no province specified" to "[Fix] recovered cases in China - _subset_by_area() records selection" Jan 3, 2021
@lisphilar lisphilar added this to the Release v2.15.0 milestone Jan 3, 2021
@lisphilar
Owner

I tried the following.

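df = jhu_data.cleaned()
# Sum the province-level records of China per date (province "-" is the country-level row)
is_province = (df["Country"] == "China") & (df["Province"] != "-")
sum_df = df.loc[is_province].groupby("Date").sum(numeric_only=True)
sum_df.tail()
cs.line_plot(sum_df, title="Total value of provinces in China", y_integer=True)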
Date        Confirmed  Infected  Fatal  Recovered
2020/12/30      95876      1282   4781      89813
2020/12/31      95963      1258   4782      89923
2021/1/1        96023      1210   4782      90031
2021/1/2        96086      1203   4784      90099
2021/1/3        96086      1203   4784      90099

[Figure: line plot "Total value of provinces in China"]

chn_scenario = cs.Scenario(jhu_data, population_data, "China")
chn_scenario.records(variables=["Confirmed", "Infected", "Fatal", "Recovered"]).tail()
Date        Confirmed  Infected  Fatal  Recovered
2020/12/30      96592      1614   4784      90194
2020/12/31      96673      1579   4788      90306
2021/1/1        96762      1567   4789      90406
2021/1/2        96829      1524   4790      90515
2021/1/3        96829      1428   4790      90611

Figure_1

@lisphilar
Owner

With the results above, I think we can use the total of the province values in China for the recovered data in JHUData._cleaning().
Because the confirmed/fatal values are not identical between the first table and the second table, I recommend applying the values of the first table (sum of provinces) as the country-level data for China.
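
The proposal above can be sketched on a toy frame. The internals of `JHUData._cleaning()` are not shown in this thread, so this is only an illustration under stated assumptions: `use_province_totals` is a hypothetical helper (not part of CovsirPhy), the values are made up, and the column names follow the tables in this issue.

```python
import pandas as pd


def use_province_totals(df, country, value_cols):
    """Rebuild the country-level rows (Province == "-") from province sums.

    Hypothetical helper for illustration; not part of CovsirPhy.
    """
    in_country = df["Country"] == country
    provinces = df.loc[in_country & (df["Province"] != "-")]
    if provinces.empty:
        # No province rows to aggregate, so leave the data unchanged.
        return df.copy()
    totals = provinces.groupby("Date", as_index=False)[value_cols].sum()
    totals["Country"] = country
    totals["Province"] = "-"
    # Drop the old country-level rows and append the recomputed ones.
    kept = df.loc[~(in_country & (df["Province"] == "-"))]
    return pd.concat([kept, totals.reindex(columns=df.columns)], ignore_index=True)


# Made-up values: the country-level row for China lacks recovered cases.
records = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-01-01"] * 4),
        "Country": ["China", "China", "China", "Japan"],
        "Province": ["-", "Hubei", "Beijing", "-"],
        "Confirmed": [101, 70, 30, 50],
        "Fatal": [5, 4, 1, 2],
        "Recovered": [0, 60, 25, 40],
    }
)
fixed = use_province_totals(records, "China", ["Confirmed", "Fatal", "Recovered"])
china = fixed.loc[(fixed["Country"] == "China") & (fixed["Province"] == "-")]
print(china)
```

Rows for other countries and the province rows themselves are left untouched; only the "-" rows of the chosen country are recomputed.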

@Inglezos
Collaborator Author

Inglezos commented Jan 3, 2021

Yes, I agree. The province data seem more accurate and hold all the recovered-case information we need.

@lisphilar
Owner

lisphilar commented Jan 4, 2021

I created pull request #491. Please review it.
However, full complement of the recovery data is still performed for the China dataset.
This may be a separate issue, but we may need to investigate it. (Could we divide up this work?)

Full complement is performed for many countries, as follows.

import covsirphy as cs
data_loader = cs.DataLoader()
jhu_data = data_loader.jhu()
# List the countries for which full complement of recovered data was applied
df = jhu_data.show_complement()
print(df.loc[df["Full_recovered"]].Country.tolist())

['Andorra', 'United Arab Emirates', 'American Samoa', 'Antigua and Barbuda', 'Burundi', 'Benin', 'Bahrain', 'Belarus', 'Bermuda', 'Barbados', 'Brunei', 'Bhutan', 'Chile', "Cote d'Ivoire", 'Cameroon', 'Democratic Republic of the Congo', 'Colombia', 'Comoros', 'Cape Verde', 'Cuba', 'Germany', 'Djibouti', 'Dominica', 'Ecuador', 'Egypt', 'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia', 'Ghana', 'Gambia', 'Guinea-Bissau', 'Equatorial Guinea', 'Grand Princess', 'Grenada', 'Guam', 'Croatia', 'Iran', 'Iceland', 'Jordan', 'Kyrgyzstan', 'Cambodia', 'Saint Kitts and Nevis', 'Laos', 'Liechtenstein', 'Madagascar', 'Marshall Islands', 'Malta', 'Montenegro', 'Northern Mariana Islands', 'Mauritania', 'MS Zaandam', 'Mauritius', 'Malaysia', 'Namibia', 'Niger', 'Nicaragua', 'Netherlands', 'Norway', 'New Zealand', 'Pakistan', 'Peru', 'Papua New Guinea', 'Puerto Rico', 'Qatar', 'Saudi Arabia', 'Senegal', 'Singapore', 'Solomon Islands', 'San Marino', 'Serbia', 'South Sudan', 'Sao Tome and Principe', 'Suriname', 'Slovenia', 'Sweden', 'Swaziland', 'Seychelles', 'Chad', 'Togo', 'Thailand', 'Timor-Leste', 'Turkey', 'Taiwan', 'Uzbekistan', 'Holy See', 'Saint Vincent and the Grenadines', 'Virgin Islands, U.S.', 'Vanuatu', 'Samoa', 'Yemen', 'Zambia', 'Zimbabwe', 'China']

@Inglezos
Collaborator Author

Inglezos commented Jan 4, 2021

Sure, I will look into this too. I don't think France has full complement though, only partial (it just caught my eye).

@lisphilar
Owner

Do you have the "COVID-19 Data Hub" dataset as of 31Dec2020 (or earlier)?
This appears to be caused by irregular records in the raw dataset from Jan2021, and I found a related issue:
covid19datahub/COVID19#145

@Inglezos
Collaborator Author

Inglezos commented Jan 4, 2021

Yes, I just noticed the same problem with the actual dataset.

@lisphilar
Owner

lisphilar commented Jan 5, 2021

I confirmed the issue for France has been solved thanks to "COVID-19 Data Hub" with the latest data.
(We need not create a GitHub issue for this problem.)

@lisphilar
Owner

I do not think the recovered data for Singapore needs full complement. What do you think?
Can we create a new issue for this problem? (Singapore, China)

country = "Singapore"
cs.line_plot(jhu_data.subset(country).set_index("Date"), f"Subset for {country} without complement")

[Figure: line plot "Subset for Singapore without complement"]

@Inglezos
Collaborator Author

Inglezos commented Jan 5, 2021

No, we need to revise the conditions. The problem is the 99% threshold and identifying when it stops applying.

@Inglezos
Collaborator Author

Inglezos commented Jan 8, 2021

> I confirmed the issue for France has been solved thanks to "COVID-19 Data Hub" with the latest data.
> (We need not create a GitHub issue for this problem.)

The France issue unfortunately remains:
image

@Inglezos
Collaborator Author

Inglezos commented Jan 8, 2021

I notified the covid19datahub team about this in covid19datahub/COVID19#145.

@lisphilar
Owner

lisphilar commented Jan 8, 2021

Thank you for notifying the team.
This is also discussed in the original dataset repository: opencovid19-fr/data#564

@Inglezos
Collaborator Author

Inglezos commented Jan 8, 2021

Yes, it seems to depend on when we download the dataset. If the covid19datahub team has already applied its preprocessing, we are okay. Preferably, this should be handled by the original source, opencovid19-fr.

@lisphilar
Owner

Shall we create a new issue for the threshold of full complement?
When debugging with the China data, it was difficult to select a specific value as the threshold. Around June, Recovered is close to Confirmed - Fatal because, according to the dataset, the outbreak ended very quickly.
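
To make the quantity under discussion concrete, the ratio Recovered / (Confirmed - Fatal) can be computed directly from the 2021/1/3 rows of the two tables earlier in this issue. This is a rough sketch only: how the library actually applies its 99% threshold is not shown in this thread, so `recovered_ratio` and `THRESHOLD` are hypothetical names and the comparison is an assumed reading of that threshold.

```python
def recovered_ratio(confirmed, fatal, recovered):
    """Return Recovered / (Confirmed - Fatal), or None if the denominator is not positive."""
    non_fatal = confirmed - fatal
    if non_fatal <= 0:
        return None
    return recovered / non_fatal


THRESHOLD = 0.99  # assumption: one possible reading of the "99% threshold"

# (Confirmed, Fatal, Recovered) on 2021/1/3, taken from the tables in this issue.
rows = {
    "sum of provinces": (96086, 4784, 90099),
    "Scenario.records()": (96829, 4790, 90611),
}
ratios = {label: recovered_ratio(*values) for label, values in rows.items()}
for label, ratio in ratios.items():
    print(f"{label}: ratio={ratio:.4f}, at_or_above_threshold={ratio >= THRESHOLD}")
```

With these values both ratios come out slightly below 0.99.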

@Inglezos
Collaborator Author

Inglezos commented Jan 8, 2021

Yes, we should. If you have some time, please create a new issue; otherwise I will do that later.

@lisphilar
Owner

@Inglezos
Collaborator Author

Inglezos commented Jan 9, 2021

Yes, we will continue in #514. I will close this issue.

@Inglezos Inglezos closed this as completed Jan 9, 2021