[framework update] multimodal draft #304

Open
wants to merge 2 commits into
base: master

Collaborator

@zzachw zzachw commented Oct 28, 2024

No description provided.

Collaborator

Can we make the bandpass filter optional here? According to jathurshan:

  • IIIC data may not explicitly use a bandpass filter
  • ECoG data may use the raw signal or band-stop filtering
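One way to read the request above is to let the caller inject the filter, with `None` meaning "use the raw signal". The sketch below is only an illustration: `SignalPreprocessor`, `filter_fn`, and `remove_mean` are hypothetical names and not part of this PR, and the toy mean-removal function stands in for a real bandpass or band-stop filter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# A filter is any callable that maps a signal to a filtered signal.
SignalFilter = Callable[[Sequence[float]], List[float]]


def remove_mean(signal: Sequence[float]) -> List[float]:
    """Toy stand-in for a real filter: subtracts the mean (DC offset)."""
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


@dataclass
class SignalPreprocessor:
    """Hypothetical preprocessing step; filter_fn=None keeps the raw signal."""

    filter_fn: Optional[SignalFilter] = None

    def __call__(self, signal: Sequence[float]) -> List[float]:
        if self.filter_fn is None:
            # No filter configured: pass the raw signal through unchanged.
            return list(signal)
        return list(self.filter_fn(signal))


raw = SignalPreprocessor()                          # e.g. datasets that use raw signals
filtered = SignalPreprocessor(filter_fn=remove_mean)  # any user-chosen filter
```

With this shape, a dataset that needs a bandpass or band-stop filter would pass its own callable, and datasets that do not filter would simply omit the argument.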

Collaborator

@jhnwu3 jhnwu3 left a comment

Just want to make sure TextFeaturizer and BioSignal Featurizer are not too heavy or restrictive.

I also added some suggestions for table processing for CXR and notes using MIMIC4. But I think we need to figure out how we want to initialize the dataset and its paths, because CXR and Notes live under different paths. We can either ask the user to put everything in one directory, or add optional file path arguments and call parse_{table} only when the corresponding path is not None.
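The second option (optional path arguments, parse only what is provided) could look roughly like the sketch below. Everything here is an assumption for illustration: the class name `MultimodalDatasetSketch`, the `note_root` and `cxr_root` arguments, and the stub parsers are hypothetical and not the PR's actual API.

```python
from typing import Dict, List, Optional


class MultimodalDatasetSketch:
    """Hypothetical dataset init with one optional root path per extra table."""

    def __init__(
        self,
        root: str,
        note_root: Optional[str] = None,
        cxr_root: Optional[str] = None,
    ) -> None:
        self.root = root
        # Table name -> directory; None means the user did not supply that table.
        self.tables_dir: Dict[str, Optional[str]] = {
            "discharge": note_root,
            "cxr": cxr_root,
        }
        self.parsed: List[str] = []

    # Stub parsers: real ones would read the CSV files and build events.
    def parse_discharge(self) -> None:
        self.parsed.append("discharge")

    def parse_cxr(self) -> None:
        self.parsed.append("cxr")

    def parse_tables(self) -> List[str]:
        for table, path in self.tables_dir.items():
            if path is None:
                continue  # skip tables whose path was not provided
            getattr(self, f"parse_{table}")()  # dispatch to parse_{table}
        return self.parsed
```

The single-directory alternative would be simpler for users but forces them to rearrange the separately downloaded note and CXR files.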

Collaborator

I think this is mostly fine. Maybe an option for users to define their own AutoTokenizer and AutoModel outside and pass them in would be less heavy, because people sometimes want to plug in their own existing fine-tuned model.
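A minimal sketch of that suggestion, assuming a featurizer that accepts an already-constructed tokenizer and model and only falls back to loading defaults when none are given. `TextFeaturizerSketch`, its arguments, and the default checkpoint name are assumptions for illustration, not the class in this PR.

```python
from typing import Any, Optional


class TextFeaturizerSketch:
    """Hypothetical featurizer that accepts a user-supplied tokenizer and model."""

    def __init__(
        self,
        tokenizer: Optional[Any] = None,
        model: Optional[Any] = None,
        model_name: str = "bert-base-uncased",  # arbitrary default, an assumption
    ) -> None:
        if (tokenizer is None) != (model is None):
            raise ValueError("Pass both tokenizer and model, or neither.")
        if tokenizer is None:
            # Lazy import so users who bring their own model avoid the default load.
            from transformers import AutoModel, AutoTokenizer

            tokenizer = AutoTokenizer.from_pretrained(model_name)
            model = AutoModel.from_pretrained(model_name)
        self.tokenizer = tokenizer
        self.model = model

    def encode(self, text: str) -> Any:
        # Tokenize, run the model, and return the token-level hidden states.
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        outputs = self.model(**inputs)
        return outputs.last_hidden_state


# Usage with a user's own fine-tuned checkpoint (path is hypothetical):
#   tok = AutoTokenizer.from_pretrained("path/to/finetuned")
#   mdl = AutoModel.from_pretrained("path/to/finetuned")
#   featurizer = TextFeaturizerSketch(tokenizer=tok, model=mdl)
```

Keeping the default load behind a lazy import means the featurizer stays light for users who supply their own objects.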

Collaborator

Looks good to me

Collaborator

Hey Zhenbang, did we forget about MIMIC Note and CXR here?

Collaborator

def parse_discharge(self, patients: Dict[str, Patient]) -> Dict[str, Patient]:
    # Table name is hardcoded; users may need to download the files into the same directory.
    table = "discharge"
    df = pd.read_csv(
        os.path.join(self.tables_dir[table], f"{table}.csv"),
        dtype={"subject_id": str, "hadm_id": str},
    )
    # Drop rows missing any field needed to build an event.
    df = df.dropna(subset=["subject_id", "hadm_id", "text", "charttime"])
    df = df.sort_values(["subject_id", "hadm_id"], ascending=True)
    group_df = df.groupby("subject_id")

    def discharge_unit(p_id, p_info):
        # Build one event per discharge note, grouped by admission.
        events = []
        for v_id, v_info in p_info.groupby("hadm_id"):
            for text in v_info["text"]:
                attr_dict = {
                    "text": text,
                    "vocabulary": "text",
                    "visit_id": v_id,
                    "patient_id": p_id,
                }
                event = Event(
                    attr_dict=attr_dict,
                    # Every note in a visit gets the visit's first charttime.
                    timestamp=strptime(v_info["charttime"].values[0]),
                )
                events.append(event)
        return events

    group_df = group_df.parallel_apply(
        lambda x: discharge_unit(x.subject_id.unique()[0], x)
    )
    patients = self._add_events_to_patient_dict(patients, group_df)
    return patients

def parse_cxr(self, patients: Dict[str, Patient]) -> Dict[str, Patient]:
    table = "cxr"
    # File name is hardcoded; we might need an explicit CXR path in __init__.
    cxr_file = "mimic-cxr-2.0.0-metadata"
    df = pd.read_csv(
        os.path.join(self.tables_dir[table], f"{cxr_file}.csv"),
        # study_id (not hadm_id) is the grouping key used below.
        dtype={"subject_id": str, "study_id": str},
    )

    df = df.dropna(subset=["subject_id", "study_id", "dicom_id"])
    # Combine date and time to create a timestamp.
    df["StudyDate"] = df["StudyDate"].astype(str)
    df["StudyTime"] = df["StudyTime"].astype(str)
    df["StudyDateTime"] = df.apply(
        lambda row: self.transform_study_datetime(row["StudyDate"], row["StudyTime"]),
        axis=1,
    )
    df = df.sort_values(["subject_id", "study_id"], ascending=True)

    group_df = df.groupby("subject_id")

    def cxr_unit(p_id, p_info):
        # Build one event per DICOM image, grouped by study.
        events = []
        for v_id, v_info in p_info.groupby("study_id"):
            for dicom_id, timestamp in zip(v_info["dicom_id"], v_info["StudyDateTime"]):
                attr_dict = {
                    "dicom_id": dicom_id,  # used to build the DICOM file path
                    "vocabulary": "cxr",
                }
                event = Event(
                    attr_dict=attr_dict,  # previously built but never passed
                    visit_id=v_id,
                    patient_id=p_id,
                    timestamp=strptime(timestamp),
                )
                events.append(event)
        return events

    group_df = group_df.parallel_apply(lambda x: cxr_unit(x.subject_id.unique()[0], x))
    patients = self._add_events_to_patient_dict(patients, group_df)
    return patients
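The snippet above calls `self.transform_study_datetime`, which is not shown in this conversation. The following is only a guess at what such a helper might do, written as a standalone function. It assumes `StudyDate` looks like `YYYYMMDD`, that `StudyTime` looks like `HHMMSS` with an optional fractional part and possibly missing leading zeros, and that the downstream `strptime` accepts a `YYYY-MM-DD HH:MM:SS` string. All three are assumptions.

```python
from datetime import datetime


def transform_study_datetime(study_date: str, study_time: str) -> str:
    """Hypothetical helper: merge a date and a time string into one timestamp string.

    Assumes study_date is YYYYMMDD and study_time is HHMMSS[.fraction].
    """
    whole, _, frac = study_time.partition(".")
    whole = whole.zfill(6)            # restore leading zeros lost in a numeric parse
    frac = (frac + "000000")[:6]      # pad or truncate the fraction to microseconds
    dt = datetime.strptime(f"{study_date}{whole}{frac}", "%Y%m%d%H%M%S%f")
    return dt.strftime("%Y-%m-%d %H:%M:%S")
```

If the helper lives on the dataset class as in the snippet, the same body would work as a method with `self` added as the first parameter.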


2 participants