Our machine learning project is a tool that improves the experience of Data Science & Machine Learning experts and learners when writing and organizing notebooks. The features include: classification of notebooks by the ML domain they belong to (NLP, computer vision, and reinforcement learning) and the techniques they use (classification, clustering, and regression), as well as automatic generation of documentation for individual code cells in a notebook. For the classification task we applied the pre-trained CodeBERT model alongside a classifier (built locally and with Oracle AutoML), and for the documentation generation task a PLBART model. All these features are showcased in a Streamlit web application where users can test and explore them.
- Classification of notebooks by ML domain: NLP, Computer Vision, and Reinforcement Learning.
- Classification of notebooks by ML technique: Regression, Classification, and Clustering.
- Automatic generation of documentation for individual code cells in a notebook.
Data Analysis: pandas, NumPy, regex, Plotly, Matplotlib, seaborn, NLTK, tokenize...
Modeling: torch, scikit-learn
Deployment: Streamlit
01-NotebookDocGen_Demo_NoUpload.webm
- Build a list of keywords for each domain and technique: List_Keywords.
- Collect about 10,463 notebooks from Kaggle by searching for those keywords.
- Remove non-English, non-Python, and duplicate notebooks.
- Parse the JSON notebooks into a DataFrame containing each notebook's content and tag.
- Delete raw cells.
- Clean the data by removing punctuation, special characters, non-Python code, and emojis, and by running a spell checker.
- Embed 332,605 markdown and code cells using CodeBERT.
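The cleaning step above can be sketched with a few regular expressions. The patterns below are illustrative, not the project's exact rules, and `clean_cell` is a hypothetical helper name:

```python
import re

def clean_cell(text: str) -> str:
    """Clean a markdown/code cell: strip emojis and special characters,
    then collapse whitespace. Illustrative rules only."""
    # Drop emojis and other non-ASCII symbols
    text = text.encode("ascii", errors="ignore").decode()
    # Replace special characters, keeping word characters, spaces, and dots
    text = re.sub(r"[^\w\s.]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_cell("Train the model 🚀!!  (see   notes)"))
```

Spell checking (e.g. with NLTK, which is in the stack above) would run as a separate pass after this normalization.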
- Ran local experiments with different classification models.
- Used Oracle AutoML, an Oracle Machine Learning interface that provides no-code automated machine learning.
- The best model achieves about 86% accuracy.
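The local experiments can be sketched as fitting a standard classifier on top of the cell embeddings. Here random vectors stand in for the real CodeBERT embeddings (which are 768-dimensional), and the model choice (scikit-learn's `LogisticRegression`) is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
domains = ["NLP", "Computer Vision", "Reinforcement Learning"]

# Synthetic stand-ins for CodeBERT embeddings, one label per notebook
X = rng.normal(size=(300, 32))
y = rng.integers(0, 3, size=300)
X += y[:, None] * 2.0  # shift classes apart so the toy task is learnable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc = clf.score(X_te, y_te)
print(f"toy accuracy: {acc:.2f}")
print("predicted domain:", domains[clf.predict(X_te[:1])[0]])
```

In the real pipeline the same train/evaluate loop would be repeated over several model families before comparing against the Oracle AutoML result.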
- Collect suitable notebooks from Kaggle courses.
- Collect the top-voted notebooks from Kaggle's Getting Started competitions.
- Remove non-English and non-Python notebooks.
- Parse the JSON notebooks into a DataFrame containing each notebook's content and tag.
- Delete raw cells.
- Clean the data by removing special characters, non-Python code, and emojis, and by running a spell checker, while being careful to keep punctuation.
- Create markdown and code pairs.
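The parsing step above (notebook JSON into a DataFrame, dropping raw cells) can be sketched as follows; the inline notebook, the column names, and the `tag` value are illustrative:

```python
import json
import pandas as pd

# Minimal .ipynb-style JSON; real notebooks come from Kaggle's API/export
raw = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Load the data"]},
        {"cell_type": "code", "source": ["import pandas as pd\n",
                                         "df = pd.read_csv('train.csv')"]},
        {"cell_type": "raw", "source": ["ignored"]},
    ]
})

nb = json.loads(raw)
rows = [
    {"cell_type": c["cell_type"], "source": "".join(c["source"]), "tag": "NLP"}
    for c in nb["cells"]
    if c["cell_type"] != "raw"   # delete raw cells
]
df = pd.DataFrame(rows)
print(df)
```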
We have about 3,000 code cells that include comments, for a total of 9,633 comments. These comments can document the code more precisely and accurately than markdown cells, which tend to be more general. So we extracted the comments from code cells and marked them as markdown while keeping the order of the notebook's cells.
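This extraction can be sketched with Python's `tokenize` module (listed in the stack above), which correctly skips `#` characters inside string literals. `comments_to_markdown` is a hypothetical helper, not the project's actual code:

```python
import io
import tokenize

def comments_to_markdown(cells):
    """Pull '#' comments out of code cells and emit them as markdown
    cells just before the code, preserving notebook order."""
    out = []
    for kind, src in cells:
        if kind != "code":
            out.append((kind, src))
            continue
        comments = [
            tok.string.lstrip("#").strip()
            for tok in tokenize.generate_tokens(io.StringIO(src + "\n").readline)
            if tok.type == tokenize.COMMENT
        ]
        out.extend(("markdown", c) for c in comments)
        out.append(("code", src))
    return out

cells = [
    ("markdown", "## Training"),
    ("code", "# fit the model on the training set\nmodel.fit(X, y)"),
]
for kind, src in comments_to_markdown(cells):
    print(kind, "|", src.splitlines()[0])
```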
- The first strategy groups all consecutive markdown cells into one markdown cell, and likewise for code cells, then builds pairs from each merged markdown cell and the merged code cell that follows it.
- The second strategy pairs only a single markdown cell with the single code cell that immediately follows it.
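The two pairing strategies can be sketched as follows; the function names are hypothetical, and the cell tuples stand in for parsed notebook cells:

```python
from itertools import groupby

def pairs_strategy_one(cells):
    """Strategy 1: merge runs of consecutive same-type cells, then pair
    each merged markdown block with the merged code block after it."""
    merged = [
        (kind, "\n".join(src for _, src in run))
        for kind, run in groupby(cells, key=lambda c: c[0])
    ]
    return [
        (s1, s2)
        for (k1, s1), (k2, s2) in zip(merged, merged[1:])
        if k1 == "markdown" and k2 == "code"
    ]

def pairs_strategy_two(cells):
    """Strategy 2: pair only a markdown cell with the single code cell
    immediately following it."""
    return [
        (s1, s2)
        for (k1, s1), (k2, s2) in zip(cells, cells[1:])
        if k1 == "markdown" and k2 == "code"
    ]

cells = [
    ("markdown", "Load data"),
    ("markdown", "CSV from Kaggle"),
    ("code", "df = pd.read_csv('train.csv')"),
    ("code", "df.head()"),
]
print(pairs_strategy_one(cells))
print(pairs_strategy_two(cells))
```

Strategy 1 keeps all surrounding context in one pair, while strategy 2 yields tighter, more local pairs; which is better depends on how the generation model is trained.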
Using the PLBART model without fine-tuning, we obtained an overall mean BERTScore of 0.74.