GitHub - GMendiratta/ROSETTA-for-Cancer-Mutations: Mendiratta et al Nature Communications 2021

===File Descriptions===
1. Results/ folder contains the figures and all the data tables presented in the manuscript. The final figures shown in the manuscript were re-generated and formatted using GraphPad software and, in some cases, Adobe Illustrator. Their content is identical to the plots output in the Results/ folder.

2. input_matrices/ folder contains input data files for our analysis.

3. Analysis_Source_Code.ipynb is a Jupyter notebook containing the Python code for our pipeline. It is laid out to show the pipeline visually and logically, and it can generate all of our results.

4. Analysis_Source_Code.pdf is a PDF copy of the Jupyter notebook above, so the algorithm can be viewed without installing Python and Jupyter.

5. SEER_Analysis/ folder contains epidemiological data from SEER and our Python code, which implements the ROSETTA histology map and reformats the SEER data for input into the convoluting code in the notebook above.

6. Genomics_Analysis/ folder contains the code to process genomics data, curated files containing the ROSETTA encoding of each counted sample among the genomics studies, and the code to download genomics studies from cbioportal and unpack them.

===Installation Guide===

A. Running source code - Analysis_Source_Code.ipynb:

The source code was written in Python 3.8.5 using the Jupyter notebook format.
The authors recommend installing Anaconda or a similar distribution, which conveniently provides the required libraries: https://www.anaconda.com/

Jupyter notebook is required to open the source code. This package is installed with Anaconda and can be started using Anaconda Navigator.

If not using Anaconda, Jupyter can be installed independently from https://jupyter.org

The source code is in the file Analysis_Source_Code.ipynb. To open the file, start Jupyter notebook from Anaconda Navigator.
The software opens in the default browser application. Now, change the current working directory to the folder where the zipped file has been unzipped.
It is important that the whole zipped file is unzipped in the same location, preserving the sub-folder structure and files.
Once within the folder, clicking on the source code Analysis_Source_Code.ipynb opens the main processing notebook.
The commands are divided into blocks, which can be run independently by pressing the play button.
The whole notebook can be run from the notebook menu by clicking 'Cell' and then 'Run All'.
This file generates the data and plots corresponding to the figures and tables in the manuscript and saves them in the Results/ folder.
Running time is under 20 minutes on a processor with a >2 GHz clock and >=8 GB RAM.

A copy of the fully run code is provided as 'Analysis_Source_Code.pdf' in the package home folder and can be viewed using a PDF viewer.
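
Because the notebook depends on the sub-folder structure being preserved, it may help to check the unzipped package before opening it. The following is a minimal sketch, not part of the package: the function name missing_entries and the list EXPECTED_ENTRIES are hypothetical, and the list only covers the files and folders named in this README.

```python
from pathlib import Path

# Entries named in this README that should sit side by side in the package folder.
EXPECTED_ENTRIES = [
    "Analysis_Source_Code.ipynb",
    "input_matrices",
    "SEER_Analysis",
    "Genomics_Analysis",
]


def missing_entries(package_root):
    """Return the expected files/folders that are absent under package_root."""
    root = Path(package_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Assumes the script is run from inside the unzipped package folder.
    missing = missing_entries(".")
    if missing:
        print("Missing from package folder:", ", ".join(missing))
    else:
        print("Folder structure looks complete.")
```

An empty result only indicates that the top-level entries are present; it does not verify the contents of each folder.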

B. Dependencies Required to run the source code:

The source code was developed and run with the following library versions.
1. NumPy 1.19.2
2. pandas 1.1.3
3. matplotlib 3.3.2

The libraries may be installed or updated using the pip command, the conda command, or the Anaconda Navigator GUI. Anaconda by default installs the latest version of these three libraries.
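
Since Anaconda may install newer versions than those listed above, the installed versions can be compared against the listed ones. This is a minimal sketch using the standard-library importlib.metadata module (available in Python 3.8 and later); the helper names installed_version and compare_versions are hypothetical and not part of the package.

```python
from importlib import metadata  # standard library in Python 3.8+

# Library versions listed in this README.
TESTED_VERSIONS = {"numpy": "1.19.2", "pandas": "1.1.3", "matplotlib": "3.3.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(lookup=installed_version):
    """Map each library to (listed version, installed version or None)."""
    return {name: (tested, lookup(name)) for name, tested in TESTED_VERSIONS.items()}


if __name__ == "__main__":
    for name, (tested, found) in compare_versions().items():
        status = "same" if found == tested else "differs"
        print(f"{name}: listed {tested}, installed {found} ({status})")
```

A differing version does not necessarily mean the notebook will fail; it only flags where the environment departs from the one described here.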

C. Updating Epidemiological Data (not needed in 2021):
The epidemiological data is up to date as of early 2021, and the required files are already present in the input_matrices/ folder.
The data may be updated in the future by downloading the required incidence data from the SEER database, replacing the input.txt and input.dic files in the SEER_Analysis/ folder, and running the file SEER_python_code_RUNME.py.
This file can be run from the Anaconda command-line prompt with the following command:

python SEER_python_code_RUNME.py

Instructions for setting up the SEER*Stat software and for downloading the SEER data are available at https://seer.cancer.gov/seerstat/. The resulting Output_SEER.txt file is then copied back into the input_matrices/ folder, replacing the file already present. The source code will then use the updated epidemiological data to construct the results.
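
The copy-back step can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes SEER_python_code_RUNME.py writes Output_SEER.txt into the SEER_Analysis/ folder, and the function name install_seer_output and the ".bak" backup name are hypothetical.

```python
import shutil
from pathlib import Path


def install_seer_output(package_root, output_name="Output_SEER.txt"):
    """Copy the regenerated SEER output into input_matrices/, backing up the old file.

    Assumption: SEER_python_code_RUNME.py writes its output into SEER_Analysis/.
    """
    root = Path(package_root)
    source = root / "SEER_Analysis" / output_name
    target = root / "input_matrices" / output_name
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; run SEER_python_code_RUNME.py first")
    if target.exists():
        # Keep the previous data so the update can be undone (hypothetical .bak name).
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(source, target)
    return target
```

Calling install_seer_output(".") from the package home folder would replace the existing file while leaving the earlier copy alongside it as Output_SEER.txt.bak.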

D. Genomic Analysis:
The genomic analysis output file 'input_matrices/Genomics_Output_Processed.txt' can be generated by the code in the folder Genomics_Analysis/.
The genomic data can be downloaded and unpacked by running the Python code 'Raw_Studies_Downloader.py' in the folder 'Genomics_Analysis/cbioportal_raw_EXOME139/'. Subsequently, an updated Genomics_Output_Processed.txt file can be generated using the code 'Genomics_Clinical_Mutations_Counter.ipynb'.
*Note that while the location of the package within a computer is not important, the relative folder structure of its contents is important, and the contents must be extracted as-is from the provided zipped file.