[framework update] multimodal draft #304
base: master
Can we make the bandpass filter optional here? According to jathurshan:
- IIIC data may not explicitly use a bandpass filter
- ECoG data may use the raw signal or band-stop filtering
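One way to make the filter optional is to accept it as an argument that defaults to None, so IIIC-style pipelines can pass nothing and ECoG-style pipelines can pass a band-stop filter instead. This is only a rough sketch: `FilterOptionalFeaturizer`, `signal_filter`, and `remove_mean` are hypothetical names, not the class or arguments in this PR, and `remove_mean` is a toy stand-in for a real band-pass or band-stop filter.

```python
from typing import Callable, List, Optional, Sequence

SignalFilter = Callable[[Sequence[float]], List[float]]


def remove_mean(signal: Sequence[float]) -> List[float]:
    """Toy stand-in for a real band-pass or band-stop filter."""
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


class FilterOptionalFeaturizer:
    """Hypothetical featurizer; not the class under review in this PR."""

    def __init__(self, signal_filter: Optional[SignalFilter] = None):
        # None means "use the raw signal without any filtering".
        self.signal_filter = signal_filter

    def encode(self, signal: Sequence[float]) -> List[float]:
        samples = [float(x) for x in signal]
        if self.signal_filter is not None:
            samples = self.signal_filter(samples)
        return samples
```

With this shape, `FilterOptionalFeaturizer()` leaves the signal untouched, and any callable that maps samples to samples can be plugged in when filtering is wanted.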
I just want to make sure the TextFeaturizer and the BioSignal Featurizer are not too heavy or restrictive.
I also added some suggested table processing for CXR and notes using MIMIC4. But I think we need to figure out how we want to initialize the dataset and its paths, because CXR and notes will live under different paths. We can either ask the user to put everything in one directory, or add optional file path variables and only call parse_{table} when the matching path is not None.
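The second option (optional path variables, skipping a table when its path is None) could look roughly like the sketch below. Everything here is hypothetical: `MultimodalDatasetSketch`, `discharge_root`, `cxr_root`, and `parse_tables` are illustrative names, and the two parsers are stubs that only record that they ran. The only detail taken from the suggested code is that `tables_dir` maps a table name to its directory.

```python
from typing import Dict, List, Optional


class MultimodalDatasetSketch:
    """Hypothetical dataset shell showing only the path handling."""

    def __init__(
        self,
        root: str,
        discharge_root: Optional[str] = None,
        cxr_root: Optional[str] = None,
    ):
        self.root = root
        # A table whose path is None is simply not parsed.
        self.tables_dir: Dict[str, Optional[str]] = {
            "discharge": discharge_root,
            "cxr": cxr_root,
        }
        self.parsed: List[str] = []

    def parse_discharge(self, patients: dict) -> dict:
        self.parsed.append("discharge")  # stub for the real parser
        return patients

    def parse_cxr(self, patients: dict) -> dict:
        self.parsed.append("cxr")  # stub for the real parser
        return patients

    def parse_tables(self, patients: dict) -> dict:
        for table, path in self.tables_dir.items():
            if path is None:
                continue
            patients = getattr(self, f"parse_{table}")(patients)
        return patients
```

A user with only CXR data would then pass `cxr_root` alone, and the discharge parser would never be called.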
I think this is mostly fine. It might be less heavy to give the user an option to build their own AutoTokenizer and AutoModel outside the featurizer and pass them in, because people sometimes want to plug in an existing fine-tuned model.
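As a rough illustration of passing the tokenizer and model in from outside, here is a sketch in which the featurizer never loads anything itself. `InjectedTextFeaturizer` is a hypothetical name, and the two toy functions stand in for objects a user would normally create with `AutoTokenizer.from_pretrained(...)` and `AutoModel.from_pretrained(...)`. A real Hugging Face model is usually called with keyword-unpacked tokenizer output, so the `encode` body would need adjusting for that case.

```python
from typing import Any, Callable, List


class InjectedTextFeaturizer:
    """Hypothetical featurizer that receives its tokenizer and model ready-made."""

    def __init__(self, tokenizer: Callable[[str], Any], model: Callable[[Any], Any]):
        self.tokenizer = tokenizer
        self.model = model

    def encode(self, text: str) -> Any:
        return self.model(self.tokenizer(text))


# Toy stand-ins so the sketch runs without downloading any weights.
def whitespace_tokenizer(text: str) -> List[str]:
    return text.lower().split()


def token_lengths(tokens: List[str]) -> List[int]:
    return [len(token) for token in tokens]


featurizer = InjectedTextFeaturizer(whitespace_tokenizer, token_lengths)
embedding = featurizer.encode("Chest pain resolved")
```

The design choice is plain dependency injection: the featurizer stays light, and a fine-tuned checkpoint is just another object handed to the constructor.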
Looks good to me
Hey Zhenbang, did we forget about MIMIC Note and CXR here?
import os
from typing import Dict

import pandas as pd

# Assumed import locations within PyHealth; adjust to wherever this draft keeps them.
from pyhealth.data import Event, Patient
from pyhealth.datasets.utils import strptime

# parallel_apply comes from pandarallel and requires pandarallel.initialize() beforehand.


def parse_discharge(self, patients: Dict[str, Patient]) -> Dict[str, Patient]:
    # Hardcoded; again, the user might need to explicitly download the files
    # into the same directory.
    table = "discharge"
    df = pd.read_csv(
        os.path.join(self.tables_dir[table], f"{table}.csv"),
        dtype={"subject_id": str, "hadm_id": str},
    )
    df = df.dropna(subset=["subject_id", "hadm_id", "text", "charttime"])
    df = df.sort_values(["subject_id", "hadm_id"], ascending=True)
    group_df = df.groupby("subject_id")

    def discharge_unit(p_id, p_info):
        # Build one Event per discharge note, grouped by admission.
        events = []
        for v_id, v_info in p_info.groupby("hadm_id"):
            for text in v_info["text"]:
                attr_dict = {
                    "text": text,
                    "vocabulary": "text",
                    "visit_id": v_id,
                    "patient_id": p_id,
                }
                event = Event(
                    attr_dict=attr_dict,
                    # Every note in a visit gets the visit's first charttime.
                    timestamp=strptime(v_info["charttime"].values[0]),
                )
                events.append(event)
        return events

    group_df = group_df.parallel_apply(
        lambda x: discharge_unit(x.subject_id.unique()[0], x)
    )
    patients = self._add_events_to_patient_dict(patients, group_df)
    return patients
def parse_cxr(self, patients: Dict[str, Patient]) -> Dict[str, Patient]:
    table = "cxr"
    # Hardcoded; might need to explicitly have a CXR path in __init__.
    cxr_file = "mimic-cxr-2.0.0-metadata"
    df = pd.read_csv(
        os.path.join(self.tables_dir[table], f"{cxr_file}.csv"),
        dtype={"subject_id": str, "hadm_id": str},
    )
    df = df.dropna(subset=["subject_id", "study_id", "dicom_id"])
    # Combine date and time to create a timestamp.
    df["StudyDate"] = df["StudyDate"].astype(str)
    df["StudyTime"] = df["StudyTime"].astype(str)
    df["StudyDateTime"] = df.apply(
        lambda row: self.transform_study_datetime(row["StudyDate"], row["StudyTime"]),
        axis=1,
    )
    df = df.sort_values(["subject_id", "study_id"], ascending=True)
    group_df = df.groupby("subject_id")

    def cxr_unit(p_id, p_info):
        # Build one Event per DICOM image, grouped by study.
        events = []
        for v_id, v_info in p_info.groupby("study_id"):
            for dicom_id, timestamp in zip(v_info["dicom_id"], v_info["StudyDateTime"]):
                attr_dict = {
                    "dicom_id": dicom_id,  # used to build the DICOM file path
                    "vocabulary": "cxr",
                    "visit_id": v_id,
                    "patient_id": p_id,
                }
                # Pass attr_dict the same way parse_discharge does.
                event = Event(
                    attr_dict=attr_dict,
                    timestamp=strptime(timestamp),
                )
                events.append(event)
        return events

    group_df = group_df.parallel_apply(
        lambda x: cxr_unit(x.subject_id.unique()[0], x)
    )
    patients = self._add_events_to_patient_dict(patients, group_df)
    return patients
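The suggestion above calls `self.transform_study_datetime` without showing it. A possible version is sketched below as a standalone function, assuming StudyDate is formatted as YYYYMMDD and StudyTime as HHMMSS with optional fractional seconds, and that leading zeros in the time may have been lost when the column was read as a number. The output format is also an assumption; it just needs to be something the timestamp parser accepts.

```python
from datetime import datetime


def transform_study_datetime(study_date: str, study_time: str) -> str:
    """Hypothetical helper: merge StudyDate and StudyTime into one timestamp string."""
    # Drop fractional seconds, then left-pad to six digits (HHMMSS).
    whole_seconds = study_time.split(".")[0].zfill(6)
    parsed = datetime.strptime(study_date + whole_seconds, "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")
```

For example, a StudyTime of "93014.5" is padded to "093014" before parsing, so it is read as 09:30:14 rather than failing or being misread.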
No description provided.