(Update) Union of S/W Universities in Korea, 2020 National Talent Festival (prize-winning)
- Team Name : GIGA Mak-hi-jan-ni
- Date of Festival : 12/03/20
2020 Graduation Project (prize-winning project)
- Subject : Developing a model that outperforms previous research on the RNA Protein Interaction (RPI) Database datasets.
- Team No. : 6
- Team Name : Could we graduate ..?
- Team Member : 201724545 이지현, 201729127 박성아, 201724493 Woojung Son
- Visualization Server : http://52.79.184.82:5601/app/kibana#/dashboard/4d687850-fe37-11ea-9a80-5b5ed9a16699 -> Dashboard (valid until end of Oct. 2020)
RNA Protein Interaction(RPI) Database
A database for checking whether a given RNA molecule and a given protein molecule can interact. It can be used to develop machine learning models that classify RNA-protein pairs as interacting or non-interacting.
We use an ensemble classifier that combines RandomForest, Support Vector Machine, and several other classifiers with soft voting.
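Soft voting averages the class probabilities predicted by each member classifier and picks the class with the highest mean. The sketch below illustrates only that combination step in plain Python; the function name `soft_vote` and the probability values are hypothetical, and the project itself may rely on a library implementation such as scikit-learn's `VotingClassifier(voting="soft")` instead.

```python
from typing import List, Sequence


def soft_vote(probas: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average per-class probabilities across classifiers and take the argmax.

    probas[k][i][c] is classifier k's probability that sample i is in class c.
    """
    n_classifiers = len(probas)
    n_samples = len(probas[0])
    n_classes = len(probas[0][0])
    labels = []
    for i in range(n_samples):
        mean = [
            sum(probas[k][i][c] for k in range(n_classifiers)) / n_classifiers
            for c in range(n_classes)
        ]
        # Index of the class with the highest averaged probability.
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


# Made-up outputs of three classifiers for two RNA-protein pairs.
# Columns: P(non-interacting), P(interacting).
rf = [[0.40, 0.60], [0.80, 0.20]]
svm = [[0.55, 0.45], [0.70, 0.30]]
other = [[0.30, 0.70], [0.45, 0.55]]

predictions = soft_vote([rf, svm, other])
print(predictions)  # [1, 0]
```

Unlike hard (majority) voting, soft voting lets a confident classifier outweigh two uncertain ones, which is why it requires every member to output probabilities.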
We evaluate our model using accuracy (Acc), sensitivity (Sn), specificity (Sp), precision (Pre), Matthews correlation coefficient (MCC), and AUC (the area under the receiver operating characteristic (ROC) curve).
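For reference, the six criteria can be computed from true labels, predicted labels, and predicted scores as in the following sketch. It uses the standard textbook definitions and only the Python standard library; the helper names `binary_metrics` and `auc` and the example values are hypothetical, and the notebooks may compute the same quantities with scikit-learn.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute Acc, Sn, Sp, Pre and MCC from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, len(pairs)),
        "Sn": ratio(tp, tp + fn),   # sensitivity (recall)
        "Sp": ratio(tn, tn + fp),   # specificity
        "Pre": ratio(tp, tp + fp),  # precision
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: three interacting and three non-interacting pairs.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]

metrics = binary_metrics(y_true, y_pred)
metrics["AUC"] = auc(y_true, scores)
```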
The best performance of our project is recorded in `best_output.json`. The `save_best_output.py` file tracks the best result for each dataset, updating it whenever a run reaches a new highest accuracy. The JSON file is then used to visualize the performance with AWS, Elasticsearch, and Kibana.
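The update rule described above (keep a result only when it beats the stored accuracy for that dataset) can be sketched as follows. This is not the contents of `save_best_output.py`: the function `update_best`, the JSON layout (one entry per dataset keyed by name), and the score values are all assumptions made for illustration.

```python
import json
import os
import tempfile


def update_best(path, dataset, scores):
    """Store `scores` for `dataset` if its "Acc" beats the stored one.

    Returns True when the file was updated, False otherwise.
    """
    best = {}
    if os.path.exists(path):
        with open(path) as f:
            best = json.load(f)

    current = best.get(dataset)
    if current is None or scores["Acc"] > current["Acc"]:
        best[dataset] = scores
        with open(path, "w") as f:
            json.dump(best, f, indent=2)
        return True
    return False


# Demonstration in a temporary directory with made-up scores.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_output.json")
    first = update_best(path, "RPI488", {"Acc": 0.88, "MCC": 0.76})
    worse = update_best(path, "RPI488", {"Acc": 0.85, "MCC": 0.71})
    with open(path) as f:
        stored = json.load(f)
```

Keeping all metrics of the best run together, rather than the best value of each metric separately, means every stored record corresponds to one real run.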
We built the visualization on an AWS EC2 server running Elasticsearch and Kibana. The server reads the `best_output.json` file and visualizes the performance scores. The Kibana dashboard contains vertical bar graphs comparing our accuracy scores with those of models from other research labs, and line graphs comparing the other scores with them as well.
Dataset | #Positive pairs | #Negative pairs | RNAs | Proteins |
---|---|---|---|---|
RPI369 | 369 | 0 | 332 | 338 |
RPI488 | 243 | 245 | 25 | 247 |
RPI1807 | 1807 | 1436 | 1078 | 3131 |
RPI2241 | 2241 | 0 | 841 | 2042 |
NPInter | 10412 | 0 | 4636 | 449 |
- In the `data/` folder, files whose names end in `_pairs` contain pairs of RNA and protein IDs together with their labels. Files whose names end in `_pos_pairs` contain only interacting pairs.
- In the `data/sequence` folder, there are files that contain the sequence information of RNAs and proteins.
- In the `data/struct` folder, there are files that contain the structure information of RNAs and proteins, that is, the two-dimensional structures of the molecules.
We use an improved CTF (Conjoint Triad Feature) to preprocess both the sequence and structure data of RNAs and proteins.
CTF (Conjoint Triad Feature) is a way of preprocessing DNA-like sequence data that is commonly used in bioinformatics. It enumerates patterns of up to 3 characters over the RNA alphabet and a reduced protein alphabet and uses them as features. The improved CTF uses one more character to express patterns.
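To make the idea of pattern-based features concrete, here is a minimal sketch that counts every pattern of length 1 up to `max_k` in an RNA sequence and normalizes the counts. The function `kmer_features`, the normalization by the number of windows, and the example sequence are assumptions; the exact encoding in `make_preprocessed_file.ipynb` may differ. For proteins, the same function would be applied after mapping amino acids to a reduced alphabet.

```python
from itertools import product


def kmer_features(seq, alphabet="ACGU", max_k=3):
    """Return normalized frequencies of all patterns of length 1..max_k."""
    features = []
    for k in range(1, max_k + 1):
        # Every possible pattern of length k, in a fixed order.
        kmers = ["".join(p) for p in product(alphabet, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = len(seq) - k + 1
        for i in range(max(windows, 0)):
            sub = seq[i:i + k]
            if sub in counts:  # skip windows with unknown characters
                counts[sub] += 1
        total = max(windows, 1)  # avoid division by zero on short sequences
        features.extend(counts[m] / total for m in kmers)
    return features


vec = kmer_features("ACGUACGU", max_k=3)
print(len(vec))  # 84 = 4 + 16 + 64 patterns

longer = kmer_features("ACGUACGU", max_k=4)
print(len(longer))  # 340 = 84 + 256 patterns
```

Raising `max_k` by one grows the feature vector by a factor close to the alphabet size, so longer patterns trade more context for much sparser features.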
The `make_preprocessed_file.ipynb` file preprocesses the sequence and structure data and produces a processed file with the `.npz` extension for each dataset.
The `main.ipynb` file reads the preprocessed files stored in the `npz/` folder. It builds a binary classifier, splits the whole dataset into training and test sets, and reports the model's performance scores on the six criteria.
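The split step can be illustrated with a small shuffle-based helper. The README does not state the split ratio or whether the split is stratified, so the 80/20 ratio, the fixed seed, and the function name `split_train_test` below are assumptions for illustration only.

```python
import random


def split_train_test(samples, labels, test_ratio=0.2, seed=0):
    """Shuffle indices once, then cut them into a test part and a train part."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, values):
        return [values[i] for i in idx]

    return (
        pick(train_idx, samples),
        pick(test_idx, samples),
        pick(train_idx, labels),
        pick(test_idx, labels),
    )


# Ten placeholder RNA-protein pairs with alternating labels.
samples = [f"pair{i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = split_train_test(samples, labels)
```

Shuffling indices rather than the data itself keeps each sample aligned with its label in both parts of the split.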
[1] https://github.com/Pengeace/RPITER
[2] Pedregosa, Fabian, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011, 12: 2825-2830.