Synthetic data != synthetic patients ...? #1

Open
jklann opened this issue Nov 20, 2015 · 3 comments

jklann commented Nov 20, 2015

Not sure where to post this so please move if appropriate.

I'm interested in this, but while the proposed approach will create synthetic data, it will not create synthetic patients: the associations in the data will not be preserved. So we can expect pregnant males and married six-year-olds and all the other possible data weirdness. Any thoughts on this? It'd be nice to create a real set of synthetic patients (larger and more appropriate for PCORnet CDM than i2b2's 133).
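
To make the concern concrete, here is a minimal sketch of the difference between sampling each column from its own counts and sampling whole rows from joint counts. The toy records and the function names `sample_independent` and `sample_joint` are hypothetical, chosen only for illustration; this is not the project's actual generator.

```python
import random
from collections import Counter

# Hypothetical toy "real" records: (sex, pregnant) pairs.
real = [("F", True)] * 10 + [("F", False)] * 40 + [("M", False)] * 50

def sample_independent(records, n, rng):
    """Sample each column from its own marginal counts, ignoring associations."""
    sexes = [r[0] for r in records]
    pregnant = [r[1] for r in records]
    return [(rng.choice(sexes), rng.choice(pregnant)) for _ in range(n)]

def sample_joint(records, n, rng):
    """Sample whole rows from the joint counts, preserving associations."""
    joint_counts = Counter(records)
    combos = list(joint_counts)
    weights = [joint_counts[c] for c in combos]
    return rng.choices(combos, weights=weights, k=n)

rng = random.Random(0)
independent = sample_independent(real, 1000, rng)
joint = sample_joint(real, 1000, rng)

# Count the impossible combination in each synthetic set.
pregnant_males_independent = sum(1 for s, p in independent if s == "M" and p)
pregnant_males_joint = sum(1 for s, p in joint if s == "M" and p)
```

The joint sampler can only emit combinations that occur in the source records, so it never produces a pregnant male here. The independent sampler produces them at roughly the product of the two marginal rates.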

Also, it occurs to me that using counts does not tell you the value distribution for, e.g., lab values. You could sample from a distribution within the normal range for that, I suppose.
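
One way that could look, as a rough sketch: draw from a normal distribution centered on the range and redraw anything that falls outside it. The range used below (3.5 to 5.0) is a made-up placeholder, not a clinical reference interval, and the assumption that the range spans about two standard deviations on each side is mine.

```python
import random

def sample_lab_value(low, high, rng):
    """Draw from a normal centered on [low, high], redrawing until the value is in range."""
    mean = (low + high) / 2
    sd = (high - low) / 4  # assumption: the range covers roughly +/- 2 standard deviations
    while True:
        value = rng.gauss(mean, sd)
        if low <= value <= high:
            return value

rng = random.Random(42)
# Hypothetical normal range for an unnamed lab test.
values = [sample_lab_value(3.5, 5.0, rng) for _ in range(500)]
```

This only yields in-range results, so it would not cover abnormal values; those would need an observed distribution rather than counts alone.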

Thoughts?

Thanks,
Jeff Klann

dckc (Member) commented Nov 20, 2015

Indeed, "ugly DECOY" may well exhibit pregnant males and such. The hope is that it's still useful as a framework for test-driven development. In fact, it could serve as test data for a tool that would point out pregnant males as an anomaly.
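
A minimal sketch of what such an anomaly checker might look like, assuming patients are plain dictionaries. The field names (`id`, `sex`, `age`, `pregnant`, `marital_status`) and the `min_married_age` threshold are hypothetical and not taken from the PCORnet CDM or from this project.

```python
def find_anomalies(patients, min_married_age=16):
    """Return (patient id, reason) pairs for records that break simple plausibility rules."""
    anomalies = []
    for p in patients:
        if p["sex"] == "M" and p["pregnant"]:
            anomalies.append((p["id"], "pregnant male"))
        if p["marital_status"] == "married" and p["age"] < min_married_age:
            anomalies.append((p["id"], "married child"))
    return anomalies

# Hypothetical test records of the kind "ugly DECOY" data might contain.
patients = [
    {"id": 1, "sex": "F", "age": 30, "pregnant": True, "marital_status": "married"},
    {"id": 2, "sex": "M", "age": 45, "pregnant": True, "marital_status": "single"},
    {"id": 3, "sex": "F", "age": 6, "pregnant": False, "marital_status": "married"},
]
found = find_anomalies(patients)
```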

p.s. This is the right place! In fact, you get a bonus point for being the first to raise an issue here. Unfortunately, you conflated two issues into one, so we'll have to take that point back ;-) That is: please raise a separate issue for the "Also..." bit.

dckc (Member) commented Nov 20, 2015

On numeric distributions and starting from more than just aggregate counts, we've started some related work, doing basic stats on tumor registry data and synthesizing data based on those stats. (The code isn't public yet. IOU.)
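
The general shape of "characterize, then simulate" can be sketched as below. This is only an illustration of the idea under the assumption of a single normally distributed numeric column; it is not the tumor registry code, and the names `characterize` and `synthesize` and the sample values are hypothetical.

```python
import random
import statistics

def characterize(values):
    """Reduce a numeric column to basic summary statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

def synthesize(stats, rng):
    """Generate as many values as were observed, drawn from a normal with the same mean and sd."""
    return [rng.gauss(stats["mean"], stats["sd"]) for _ in range(stats["n"])]

# Hypothetical observed values for one numeric field.
observed = [42.0, 55.0, 61.0, 48.0, 70.0, 66.0]
stats = characterize(observed)
synthetic = synthesize(stats, random.Random(1))
```

Only the summary statistics feed the generator, so no individual source value is copied into the synthetic set.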

p.s. maybe this is one issue after all.

dckc (Member) commented Nov 20, 2015

Code for stats on tumor registry data and synthesizing data: data_char_sim.sql

rev 8c2443dbb7c7 Oct 14

dckc added the question label Nov 20, 2015