Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality
Link to the article: https://www.nature.com/articles/s44184-023-00046-7
This repository contains the computer code that was executed to generate the results of the article:
Bey, Romain, Ariel Cohen, Vincent Trebossen, Basile Dura, Pierre-Alexis Geoffroy, Charline Jean, Benjamin Landman, et al.
"Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality."
npj Mental Health Research 3, no. 1 (14 February 2024): 6.
https://doi.org/10.1038/s44184-023-00046-7.
The code was executed on the database of the Greater Paris University Hospitals.
- IRB number: CSE210013
- Code of the published article.
Run the file set_environment.py to create a conda environment and an associated Jupyter kernel:
python set_environment.py -n env_cse_210013
conda activate env_cse_210013
pip install --upgrade pip
cd cse_210013
poetry install
You can run all the analysis pipelines with the ./bash/run_analysis.sh command:
bash bash/run_analysis.sh <conf_name>
Example:
bash bash/run_analysis.sh conf_article
This requires that the machine learning model for suicide attempt (SA) detection has been trained or imported beforehand.
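If the pipelines need to be launched from Python (for example from a notebook), the command above can be wrapped as in the minimal sketch below. The helper names `build_command` and `run_analysis` are hypothetical and not part of the repository; the sketch only assumes that it is called from the repository root, where bash/run_analysis.sh lives.

```python
import subprocess


def build_command(conf_name, script="bash/run_analysis.sh"):
    """Build the argument list for the analysis script (hypothetical helper)."""
    if not conf_name:
        raise ValueError("conf_name must be a non-empty configuration name")
    return ["bash", script, conf_name]


def run_analysis(conf_name):
    """Run all analysis pipelines for one configuration (hypothetical helper).

    check=True makes subprocess.run raise CalledProcessError when the
    script exits with a non-zero status.
    """
    return subprocess.run(build_command(conf_name), check=True)
```

Calling `run_analysis("conf_article")` would then be equivalent to the example command shown above.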
- bash: Bash files to execute the pipelines and tests
- conf: Configuration files
- data: Intermediate data and export results
- figures: Figures and their associated tables
- notebooks: Tutorials and examples
- suicide_attempt: Source code (functions and pipelines)
- Stay & Document selection: Retrieve documents that mention a lexical variant of Suicide Attempt for the stays that fulfill the inclusion criteria
- Rule-based entity classification
- Machine learning (ML) entity classification
- Stay classification using text data
- Stay classification using claim data
- Retrieve documents with a risk factor (RF) mention for the visits previously classified as SA (text data).
- Rule-based entity classification for the RF
- Make plots
- Evaluate configuration & data description
- Train ML model
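The first step of the list above (document selection) can be illustrated with a minimal sketch. It is not the repository's implementation: the function name `select_documents`, the document layout (dictionaries with `visit_id` and `text` keys) and, above all, the list of lexical variants are assumptions made for illustration only.

```python
import re

# Hypothetical lexical variants, for illustration only; the terminology
# actually used by the pipelines is not reproduced here.
SA_VARIANTS = ["suicide attempt", "attempted suicide"]

# One case-insensitive pattern matching any of the variants literally.
SA_PATTERN = re.compile(
    "|".join(re.escape(variant) for variant in SA_VARIANTS),
    re.IGNORECASE,
)


def select_documents(documents):
    """Keep the documents whose text mentions at least one lexical variant."""
    return [doc for doc in documents if SA_PATTERN.search(doc["text"])]


documents = [
    {"visit_id": 1, "text": "Admitted after a Suicide Attempt."},
    {"visit_id": 2, "text": "Routine follow-up visit."},
]
selected = select_documents(documents)
```

In this sketch only the first document is retained; the entity classification steps would then decide whether each retained mention is a positive instance.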
- debug: (Boolean) If set to True, the pipelines will be executed using only a sample of data. Useful for debugging.
- schema: Name of the schema to query.
- admission_mode: Admission modes to keep. For example: [2-URG] for admission through the emergency department. If None, no criterion is applied.
- type_of_visit: Types of visit to keep. For example: [I, U] for hospitalizations and emergency visits, respectively. If None, all visits will be considered.
- only_cat_docs: List of text document categories to use exclusively. If None, no action is applied.
- rule_select_docs: Method used to select one document per visit. If None, no selection is applied.
- text_classification_method: Name of the method used to classify an identified SA entity as positive (is_true_instance variable).
- rule_icd10: Name of the rule used to classify a visit as positive for SA using claim data.
- icd10_type: Source database that is considered for claim data (either ORBIS or AREM).
- threshold_positive_instances: Minimum number of positive suicide attempt instances found in text to classify the visit as positive.
- delta_min_visits: timedelta used to tag recurrent visits related to the same SA event (string in a format accepted by pd.to_timedelta). If None, no action is applied.
- delta_history: timedelta used to discard SAs detected by NLP algorithms that relate to a patient's history. If the algorithm detects the date of an SA and that date is before the admission date minus delta_history, the visit is not tagged as an SA-caused visit. If None, no action is applied.
- date_from: Consider only visits fulfilling start_date >= date_from.
- date_upper_limit: Date up to which the analysis is carried out. Only visits that start strictly before date_upper_limit are considered. Also used to fill values of visits with no visit_end_date for the Kaplan-Meier estimator.
- hospitals_train: List of hospitals considered in the training set (trigrams).
- hospitals_test: List of hospitals considered in the testing set (trigrams). If None, no action is applied.
- ehr_deployement_file: Name of the file containing information on the deployment dates of the electronic health record used for data collection.
- encounter_subset: List of encounter numbers to consider exclusively. If None, no action is applied.
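The semantics of date_from, date_upper_limit, delta_history and threshold_positive_instances can be made concrete with a minimal sketch. The function names below are hypothetical and the code uses the standard library rather than the repository's own data structures; it only restates the rules given in the parameter descriptions above.

```python
from datetime import datetime, timedelta


def keep_visit(start_date, date_from, date_upper_limit):
    """Date filter: date_from <= start_date < date_upper_limit."""
    return date_from <= start_date < date_upper_limit


def is_history(sa_date, admission_date, delta_history):
    """True when a detected SA date is before admission_date - delta_history.

    When no SA date was detected, or delta_history is None, nothing is
    discarded and the function returns False.
    """
    if sa_date is None or delta_history is None:
        return False
    return sa_date < admission_date - delta_history


def is_positive_visit(n_positive_instances, threshold_positive_instances):
    """A visit is positive when it has at least the minimum number of instances."""
    return n_positive_instances >= threshold_positive_instances


admission = datetime(2020, 1, 1)
delta = timedelta(days=30)
old_sa = is_history(datetime(2019, 1, 1), admission, delta)
recent_sa = is_history(datetime(2019, 12, 20), admission, delta)
```

Here `old_sa` is True (the SA dates back a year before admission, so it is treated as patient history) and `recent_sa` is False (the SA falls within the 30 days preceding admission).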
We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.