Missing diagnosis labels in episode*.csv generated by extract_episode_from_subjects #101

Open
mistycheney opened this issue Sep 3, 2020 · 1 comment

mistycheney commented Sep 3, 2020

This bug can be seen in the two episode*.csv files generated for patient 49037. In both files, no diagnosis column has the label 1, which cannot be right.

The cause is in preprocessing.py. In the function extract_diagnosis_labels, the ICD9_CODE column of the input dataframe diagnoses has a numerical dtype, so the columns of labels are also numerical. However, the match condition on Line 82 is against the hardcoded list diagnosis_labels, which contains strings. As a result, the condition on Line 82 is never true, and no diagnosis value is ever set to 1.

This bug affects every episode whose diagnosis ICD codes are all numerical (i.e. none are alphanumerical codes like V28492). In these cases pandas automatically infers the dtype to be int64 rather than object/str, which causes the bug.
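
To illustrate the dtype inference described above, here is a minimal sketch. The two tiny CSVs and the codes in them are made up for the demonstration and are not taken from the dataset; only the column name ICD9_CODE and the list name diagnosis_labels come from this issue.

```python
import io

import pandas as pd

# Hypothetical miniature inputs: one with only numerical codes,
# one that also contains an alphanumerical code.
numeric_only_csv = io.StringIO("ICD9_CODE\n4019\n25000\n")
mixed_csv = io.StringIO("ICD9_CODE\n4019\nV1582\n")

numeric_df = pd.read_csv(numeric_only_csv)
mixed_df = pd.read_csv(mixed_csv)

# With only numerical codes, pandas infers an integer dtype.
numeric_is_int = pd.api.types.is_integer_dtype(numeric_df["ICD9_CODE"])
# With at least one alphanumerical code, the column stays string-like.
mixed_is_int = pd.api.types.is_integer_dtype(mixed_df["ICD9_CODE"])
print(numeric_is_int, mixed_is_int)

# A string label never equals an integer code, so nothing matches.
diagnosis_labels = ["4019", "25000"]  # stand-in for the hardcoded list
int_codes = numeric_df["ICD9_CODE"].tolist()
matches = [label for label in diagnosis_labels if label in int_codes]
print(matches)
```

Note that once a column has been inferred as an integer, any leading zeros in the original text are already gone, so converting back to str afterwards would not restore them.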

However, this bug does not seem to affect the labels in the task-specific datasets, which still look correct.

A fix is to add this line
diagnoses['ICD9_CODE'] = diagnoses['ICD9_CODE'].astype(str)
before diagnoses['VALUE'] = 1.
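
A rough sketch of the effect of this fix follows. It is not the code from preprocessing.py: the helper label_matrix, the ICUSTAY_ID column, the sample rows and the pivot/reindex steps are assumptions that stand in for the real function and its Line 82 match against diagnosis_labels.

```python
import pandas as pd

# Hypothetical stand-ins for the real inputs.
diagnosis_labels = ["4019", "25000", "V1582"]
diagnoses = pd.DataFrame(
    {"ICUSTAY_ID": [1, 1, 2], "ICD9_CODE": [4019, 25000, 4019]}
)


def label_matrix(diagnoses, cast_to_str):
    """Build a 0/1 label table, optionally applying the proposed cast."""
    diagnoses = diagnoses.copy()
    if cast_to_str:
        # The fix proposed in this issue.
        diagnoses["ICD9_CODE"] = diagnoses["ICD9_CODE"].astype(str)
    diagnoses["VALUE"] = 1
    labels = (
        diagnoses.pivot(index="ICUSTAY_ID", columns="ICD9_CODE", values="VALUE")
        .fillna(0)
        .astype(int)
    )
    # Keep only the expected string labels; columns that do not match
    # any existing column are filled with 0.
    return labels.reindex(columns=diagnosis_labels, fill_value=0)


broken = label_matrix(diagnoses, cast_to_str=False)
fixed = label_matrix(diagnoses, cast_to_str=True)
print(int(broken.to_numpy().sum()))  # no label is set without the cast
print(int(fixed.to_numpy().sum()))   # labels are set once codes are strings
```

Without the cast, the integer column names never line up with the string labels, so every entry stays 0; with the cast, the three diagnoses in the sample rows are marked.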

@KimballCai

I ran into this problem too, and it occurs in many episodes.
