RPIntDB

(Update) Union of S/W Universities in Korea, 2020 National Talent Festival (prize-winning)

Team Name : GIGA Mak-hi-jan-ni
Date of Festival : 12/03/20

2020 Graduation Project (prize-winning project)

Subject : Developing a model that outperforms previous research on the RNA Protein Interaction (RPI) Database datasets.
Team No. : 6
Team Name : Could we graduate ..?
Team Members : 201724545 이지현, 201729127 박성아, 201724493 Woojung Son
Visualization Server : http://52.79.184.82:5601/app/kibana#/dashboard/4d687850-fe37-11ea-9a80-5b5ed9a16699 -> Dashboard (valid until end of Oct. 2020)

RNA Protein Interaction (RPI) Database

RPIntDB

RPIntDB is a database for checking whether a given RNA molecule and a given protein molecule can interact. It can be used to develop machine learning models that classify RNA-protein pairs as interacting or non-interacting.

About our project

We use an ensemble model as the classifier. It combines RandomForest, Support Vector Machine, and several other classifiers with soft voting.
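
To make the idea of soft voting concrete, here is a minimal sketch in plain Python. It is not the code from main.ipynb: the function name `soft_vote` and the probability values are hypothetical, chosen only for illustration. Soft voting averages the class probabilities predicted by each classifier and picks the class with the highest average; scikit-learn's `VotingClassifier` with `voting="soft"` applies the same rule.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities from several classifiers for one sample.

    probabilities: one list per classifier, e.g. [P(class 0), P(class 1)].
    weights: optional per-classifier weights (equal weights if omitted).
    Returns (predicted class index, averaged probabilities).
    """
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models
    n_classes = len(probabilities[0])
    total = sum(weights)
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Hypothetical outputs [P(no interaction), P(interaction)] for one RNA-protein pair
votes = [
    [0.60, 0.40],  # e.g. a random forest
    [0.55, 0.45],  # e.g. an SVM with probability estimates
    [0.10, 0.90],  # e.g. a third classifier
]
label, avg = soft_vote(votes)
print(label, avg)
```

In this toy case two of the three classifiers lean toward class 0, so a majority (hard) vote would pick 0, while soft voting picks class 1 because the third classifier is much more confident.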

We evaluate our model using accuracy (Acc), sensitivity (Sn), specificity (Sp), precision (Pre), Matthews correlation coefficient (MCC), and AUC (the area under the receiver operating characteristic (ROC) curve).
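
For reference, the six metrics can be computed from the confusion matrix and the predicted scores as in the sketch below. It uses the standard textbook definitions in plain Python and is not taken from the project code; the helper names `binary_metrics` and `auc`, the 0.5 decision threshold, and the toy labels and scores are assumptions made for the example.

```python
import math


def binary_metrics(y_true, y_pred):
    """Acc, Sn, Sp, Pre and MCC from binary labels (1 = interacting pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero for degenerate cases
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, tp + tn + fp + fn),
        "Sn": ratio(tp, tp + fn),   # sensitivity (recall)
        "Sp": ratio(tn, tn + fp),   # specificity
        "Pre": ratio(tp, tp + fp),  # precision
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Assumes both classes are present in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not project results
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = binary_metrics(y_true, y_pred)
metrics["AUC"] = auc(y_true, scores)
print(metrics)
```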

The best performance of our project is recorded in best_output.json: save_best_output.py tracks the best result and updates it whenever a dataset reaches a new highest accuracy. This file is then used to visualize the performance with AWS, ElasticSearch, and Kibana.
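
The "keep the best result per dataset" logic could look roughly like the sketch below. This is an illustration only and not the contents of save_best_output.py: the function name `update_best`, the JSON layout (one entry per dataset keyed by name, with an "Acc" field), and the dataset name "DEMO" with its scores are all assumptions.

```python
import json
import os
import tempfile


def update_best(path, dataset, scores):
    """Store `scores` for `dataset` if its accuracy beats the saved one.

    Returns True when the file was updated, False otherwise.
    """
    best = {}
    if os.path.exists(path):
        with open(path) as f:
            best = json.load(f)

    previous = best.get(dataset)
    if previous is None or scores["Acc"] > previous["Acc"]:
        best[dataset] = scores
        with open(path, "w") as f:
            json.dump(best, f, indent=2)
        return True
    return False


# Demo in a temporary directory; the numbers are made up, not project results
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_output.json")
    first = update_best(path, "DEMO", {"Acc": 0.88, "MCC": 0.76})
    worse = update_best(path, "DEMO", {"Acc": 0.85, "MCC": 0.71})
    better = update_best(path, "DEMO", {"Acc": 0.90, "MCC": 0.80})
    with open(path) as f:
        stored = json.load(f)

print(first, worse, better, stored)
```

Keeping one entry per dataset means a reader of the file, such as the visualization server, always sees only the current best scores.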

We visualize the results with an AWS EC2 server, ElasticSearch, and Kibana. The server reads the best_output.json file and visualizes the performance scores. The Kibana dashboard contains vertical bar graphs comparing our accuracy scores with those of models from other research labs, and line graphs comparing the other scores in the same way.

About the Raw Data

Dataset	#Positive pairs	#Negative pairs	RNAs	Proteins
RPI369	369	0	332	338
RPI488	243	245	25	247
RPI1807	1807	1436	1078	3131
RPI2241	2241	0	841	2042
NPInter	10412	0	4636	449

In the data/ folder, files whose names end in _pairs contain pairs of RNA and protein IDs together with their labels. Files whose names end in _pos_pairs contain only interacting pairs.
In the data/sequence folder, there are files that contain the sequence information of RNAs and proteins.
In the data/struct folder, there are files that contain the structure information of RNAs and proteins, that is, the two-dimensional structures of the molecules.

Feature Preprocessing

We use improved CTF (Conjoint Triad Feature) to preprocess both the sequence and structure data of RNAs and proteins.

CTF (Conjoint Triad Feature) is a method for preprocessing DNA-like sequence data that is commonly used in bioinformatics. It enumerates patterns of up to 3 symbols built from the RNA alphabet and the reduced protein alphabet, and uses them as features. Improved CTF uses one more symbol to express the patterns.
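
The pattern enumeration described above can be sketched as follows. This is a simplified illustration, not the project's features.py: the function name `kmer_features` is hypothetical, and turning pattern counts into normalized frequencies per pattern length is an assumption about how the patterns become numeric features. Only an RNA sequence over the alphabet ACGU is shown; a protein sequence mapped to its reduced alphabet could be passed in the same way.

```python
from itertools import product


def kmer_features(seq, alphabet, max_k):
    """Normalized frequencies of every pattern of length 1..max_k in `seq`."""
    features = []
    for k in range(1, max_k + 1):
        # All possible patterns of length k, in a fixed order
        patterns = ["".join(p) for p in product(alphabet, repeat=k)]
        counts = dict.fromkeys(patterns, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows containing unknown symbols
                counts[kmer] += 1
        total = sum(counts.values())
        features.extend(counts[p] / total if total else 0.0 for p in patterns)
    return features


rna = "ACGUACGU"  # toy sequence
ctf = kmer_features(rna, "ACGU", max_k=3)       # patterns of up to 3 symbols
improved = kmer_features(rna, "ACGU", max_k=4)  # one more symbol
print(len(ctf), len(improved))
```

With a 4-letter alphabet, patterns of up to 3 symbols give 4 + 16 + 64 = 84 features, and allowing one more symbol adds 256 for a total of 340.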

How to make the preprocessed files

The make_preprocessed_file.ipynb notebook preprocesses the sequence and structure data and produces one processed file per dataset with the .npz extension.
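
As a rough picture of what a .npz round trip looks like, the sketch below saves a feature matrix and a label vector with NumPy and loads them back. The array names `X` and `y`, the file name DEMO.npz, and the random placeholder data are assumptions; the actual keys written by the notebook may differ.

```python
import os
import tempfile

import numpy as np

# Placeholder data: 6 samples with 5 features each, plus binary labels
rng = np.random.default_rng(0)
X = rng.random((6, 5))
y = np.array([1, 0, 1, 1, 0, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "DEMO.npz")
    # Several named arrays are stored together in one compressed file
    np.savez_compressed(path, X=X, y=y)

    # np.load returns a dict-like object; arrays are read by their names
    with np.load(path) as data:
        X_loaded = data["X"]
        y_loaded = data["y"]

print(X_loaded.shape, y_loaded)
```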

How to run our model

The main.ipynb notebook reads the preprocessed files stored in the npz/ folder. It builds the binary classifier, splits the whole dataset into training and test sets, and reports the model's performance on the six metrics.

References

[1] https://github.com/Pengeace/RPITER
[2] Pedregosa, Fabian, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011, 12.Oct: 2825-2830.

