ihm.pkl files differ and lot of missing values in an episode #139

Open
sivakumarlakkoju opened this issue Feb 3, 2023 · 6 comments

@sivakumarlakkoju

After creating the benchmark dataset for in-hospital mortality risk, the ihm.pkl files differ when I run the verification test.
Also, the CSVs for each episode have a lot of missing values at each timestamp. For example, capillary refill rate never has a value. Is this normal?

Is there something I'm doing wrong while building the dataset? Please let me know. Thank you.

PS: Is there any possibility of getting the updated library with 50+ variables as mentioned previously?

@hrayrhar
Member

hrayrhar commented Apr 13, 2023

Hi Siva,

Since the code hasn't been updated for a while, some things might not work as expected with newer versions of the libraries. Have you tried using the exact library versions specified in the requirements.txt file?

> PS: Is there any possibility of getting the updated library with 50+ variables as mentioned previously?

Unfortunately, we wrote code only for 17 variables.

@sivakumarlakkoju
Author

sivakumarlakkoju commented Apr 13, 2023

Hey Hrayr, thanks for replying.
Yes, I tried with the exact versions and ran the process multiple times, but the end result is always the same.

I get the following warning when running the validate_events script:
DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
events_df = pd.read_csv(os.path.join(args.subjects_root_path, subject, 'events.csv'), index_col=False,

> Unfortunately, we wrote code only for 17 variables.

Okay.

Also, I'd like to know the rationale behind the choice of imputation values listed in Table 3 of the paper.

@hrayrhar
Member

Hi Siva,

Unfortunately, the tests I wrote before are too rigid and detect even insignificant differences. The current version of the code does not pass those tests, but I have verified manually that all the produced csv files match with those generated by older and tested versions of the code. I am currently trying to write better tests.
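To illustrate what a less rigid check could look like, here is a minimal sketch that compares two tables up to a small numeric tolerance instead of requiring bit-for-bit equality. It is not the benchmark's actual test code; the helper `frames_match`, the tolerance, and the sample column names and values are all assumptions made up for the example.

```python
import io

import numpy as np
import pandas as pd


def frames_match(left: pd.DataFrame, right: pd.DataFrame, rtol: float = 1e-6) -> bool:
    """Hypothetical helper: True if two tables agree up to a small tolerance."""
    if list(left.columns) != list(right.columns) or len(left) != len(right):
        return False
    for col in left.columns:
        a, b = left[col], right[col]
        if pd.api.types.is_numeric_dtype(a) and pd.api.types.is_numeric_dtype(b):
            # Treat NaN == NaN as equal and allow tiny floating-point drift.
            if not np.allclose(a.to_numpy(dtype=float), b.to_numpy(dtype=float),
                               rtol=rtol, equal_nan=True):
                return False
        else:
            # Non-numeric columns must match exactly (missing values count as equal).
            if not a.fillna("").astype(str).equals(b.fillna("").astype(str)):
                return False
    return True


# Made-up data: the two versions differ only by floating-point noise.
old_csv = "Hours,Heart Rate,Glucose\n0.5,80.0,\n1.5,82.1,110.0\n"
new_csv = "Hours,Heart Rate,Glucose\n0.5,80.0000001,\n1.5,82.1,110.0\n"

old_df = pd.read_csv(io.StringIO(old_csv))
new_df = pd.read_csv(io.StringIO(new_csv))

exact_match = old_df.equals(new_df)          # strict comparison
tolerant_match = frames_match(old_df, new_df)  # tolerance-based comparison
print(exact_match, tolerant_match)
```

A strict comparison flags the two tables as different, while the tolerant one accepts them, which is the kind of insignificant difference described above.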

> Also, the CSVs for each episode have a lot of missing values at each timestamp. For example, capillary refill rate never has a value. Is this normal?

Most episodes have a lot of missing data. But if you suspect that any particular csv file is incorrect, please paste it here and I will verify it against my local version.
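One way to see how sparse an episode is before pasting it is to compute the fraction of missing values per column. The sketch below uses a small made-up table in place of a real episode file; the column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up stand-in for one episode's time series; a real file would be
# loaded with pd.read_csv(path) instead.
episode_csv = """Hours,Capillary refill rate,Heart Rate,Glucose
0.25,,88.0,
1.25,,90.0,
2.25,,,105.0
3.25,,85.0,
"""

episode = pd.read_csv(io.StringIO(episode_csv))

# Fraction of rows with no recorded value, per column.
missing_fraction = episode.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns that are never recorded in this episode.
always_missing = missing_fraction[missing_fraction == 1.0].index.tolist()
print(always_missing)
```

A column that is entirely empty in one episode is not by itself evidence of a build problem, since not every variable is charted during every stay.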

> I get the following warning when running the validate_events script

I get that warning too. It has no effect, don't worry about it.
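For anyone who would still like to silence the warning, the message itself names two options: pass a `dtype` for the affected column or set `low_memory=False`. The sketch below shows the first option on a tiny in-memory table. It assumes the mixed-type column is a value column called `VALUE`; the item IDs and values are made up, and this is not a change the benchmark code requires.

```python
import io

import pandas as pd

# Made-up events table with a column that mixes numbers and text.
events_csv = """SUBJECT_ID,ITEMID,VALUE
1,220045,88
1,223951,Normal <3 secs
1,220045,90.5
"""

# Reading the mixed column explicitly as text avoids per-chunk type guessing,
# which is what triggers DtypeWarning on large files.
events = pd.read_csv(io.StringIO(events_csv), index_col=False, dtype={"VALUE": str})

# Numeric entries can still be recovered later; non-numeric ones become NaN.
numeric_value = pd.to_numeric(events["VALUE"], errors="coerce")
print(events["VALUE"].tolist())
print(numeric_value.tolist())
```

On a table this small the warning would not appear either way; the point is only to show where the `dtype` argument goes.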

@sivakumarlakkoju
Author

> Unfortunately, the tests I wrote before are too rigid and detect even insignificant differences. The current version of the code does not pass those tests, but I have verified manually that all the produced csv files match with those generated by older and tested versions of the code. I am currently trying to write better tests.

I'll wait for the updated tests, thank you.

> Most episodes have a lot of missing data. But if you suspect that any particular csv file is incorrect, please paste here, I will verify with the local version.

Will paste one soon, just to be sure.

@hrayrhar
Member

Hi Siva,

I have updated the tests. They are still not ideal, but you should get the same results if you follow the exact installation and benchmark-building instructions in README.md. You can find updated information about the tests in mimic3benchmark/tests/README.md.

@sivakumarlakkoju
Author

sivakumarlakkoju commented Apr 17, 2023

Hi Hrayr,
Thank you for updating the tests. I've rerun the benchmark creation process and tested with the updates you made and it worked for me.

Can you please comment on this?
"I'd like to know the rationale behind choosing the impute values, as mentioned in table 3 of the paper."

Thanks in advance.
