Our machine learning project is a tool that improves the experience of Data Science & Machine Learning experts and learners when writing and organizing notebooks. The features include: classification of notebooks by the ML domain they belong to (NLP, computer vision, and reinforcement learning) and the techniques they use (classification, clustering, and regression), as well as automatic generation of documentation for individual code cells in a notebook. For the classification task we applied the pre-trained CodeBERT model alongside a classifier (built locally and with Oracle AutoML), and for the documentation generation task a PLBART model. All these features are showcased in a Streamlit web application where users can test and explore them.
- Classification of notebooks by ML domain: NLP, Computer Vision, and Reinforcement Learning.
- Classification of notebooks by ML technique: Regression, Classification, and Clustering.
- Automatic generation of documentation for individual code cells in a notebook.
Data Analysis: pandas, NumPy, regex, Plotly, Matplotlib, seaborn, NLTK, tokenize...
Modeling: torch, scikit-learn
Deployment: Streamlit
01-NotebookDocGen_Demo_NoUpload.webm
- Build a list of keywords for each domain and technique: List_Keywords.
- Collect about 10,463 notebooks from Kaggle by searching for those keywords.
- Remove non-English, non-Python, and duplicate notebooks.
- Parse the JSON notebooks into a DataFrame containing each notebook's content and tag.
- Delete raw cells.
- Clean the data by removing punctuation, special characters, non-Python code, and emojis, and by running a spell checker.
- Embed 332,605 markdown and code cells using CodeBERT.
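The cleaning step above can be sketched with a few regular expressions. The patterns below are illustrative, not the project's exact rules, and `clean_cell` is a hypothetical helper name:

```python
import re

def clean_cell(text: str) -> str:
    """Clean a markdown/code cell: strip emojis and special characters,
    then collapse whitespace. Illustrative rules only."""
    # Drop emojis and other non-ASCII symbols
    text = text.encode("ascii", errors="ignore").decode()
    # Replace special characters, keeping word characters, spaces, and dots
    text = re.sub(r"[^\w\s.]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_cell("Train the model 🚀!!  (see   notes)"))
```

Spell checking (e.g. with NLTK, which is in the stack above) would run as a separate pass after this normalization.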
- Ran local experiments with different classification models.
- Used Oracle AutoML, an Oracle Machine Learning interface that provides no-code automated machine learning.
- The best model achieves about 86% accuracy.
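The local experiments can be sketched as fitting a standard classifier on top of the cell embeddings. Here random vectors stand in for the real CodeBERT embeddings (which are 768-dimensional), and the model choice (scikit-learn's `LogisticRegression`) is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
domains = ["NLP", "Computer Vision", "Reinforcement Learning"]

# Synthetic stand-ins for CodeBERT embeddings, one label per notebook
X = rng.normal(size=(300, 32))
y = rng.integers(0, 3, size=300)
X += y[:, None] * 2.0  # shift classes apart so the toy task is learnable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc = clf.score(X_te, y_te)
print(f"toy accuracy: {acc:.2f}")
print("predicted domain:", domains[clf.predict(X_te[:1])[0]])
```

In the real pipeline the same train/evaluate loop would be repeated over several model families before comparing against the Oracle AutoML result.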
- Collect suitable notebooks from Kaggle courses.
- Collect the top-voted notebooks from Kaggle's Getting Started competitions.
- Remove non-English and non-Python notebooks.
- Parse the JSON notebooks into a DataFrame containing each notebook's content and tag.
- Delete raw cells.
- Clean the data by removing special characters, non-Python code, and emojis, and by running a spell checker, while being careful to keep punctuation.
- Create markdown and code pairs.
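The parsing step above (notebook JSON into a DataFrame, dropping raw cells) can be sketched as follows; the inline notebook, the column names, and the `tag` value are illustrative:

```python
import json
import pandas as pd

# Minimal .ipynb-style JSON; real notebooks come from Kaggle's API/export
raw = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Load the data"]},
        {"cell_type": "code", "source": ["import pandas as pd\n",
                                         "df = pd.read_csv('train.csv')"]},
        {"cell_type": "raw", "source": ["ignored"]},
    ]
})

nb = json.loads(raw)
rows = [
    {"cell_type": c["cell_type"], "source": "".join(c["source"]), "tag": "NLP"}
    for c in nb["cells"]
    if c["cell_type"] != "raw"   # delete raw cells
]
df = pd.DataFrame(rows)
print(df)
```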
We have about 3,000 code cells that include comments, for a total of 9,633 comments. These comments can document the code more precisely and accurately than markdown cells, which tend to be more general. So we extracted the comments from code cells and marked them as markdown while keeping the order of the notebook's cells.
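This extraction can be sketched with Python's `tokenize` module (listed in the stack above), which correctly skips `#` characters inside string literals. `comments_to_markdown` is a hypothetical helper, not the project's actual code:

```python
import io
import tokenize

def comments_to_markdown(cells):
    """Pull '#' comments out of code cells and emit them as markdown
    cells just before the code, preserving notebook order."""
    out = []
    for kind, src in cells:
        if kind != "code":
            out.append((kind, src))
            continue
        comments = [
            tok.string.lstrip("#").strip()
            for tok in tokenize.generate_tokens(io.StringIO(src + "\n").readline)
            if tok.type == tokenize.COMMENT
        ]
        out.extend(("markdown", c) for c in comments)
        out.append(("code", src))
    return out

cells = [
    ("markdown", "## Training"),
    ("code", "# fit the model on the training set\nmodel.fit(X, y)"),
]
for kind, src in comments_to_markdown(cells):
    print(kind, "|", src.splitlines()[0])
```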
- The first strategy groups all consecutive markdown cells into one markdown cell, and likewise for code cells, then builds pairs from each merged markdown cell and the merged code cell that follows it.
- The second strategy pairs only a single markdown cell with the single code cell that immediately follows it.
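The two pairing strategies can be sketched as follows; the function names are hypothetical, and the cell tuples stand in for parsed notebook cells:

```python
from itertools import groupby

def pairs_strategy_one(cells):
    """Strategy 1: merge runs of consecutive same-type cells, then pair
    each merged markdown block with the merged code block after it."""
    merged = [
        (kind, "\n".join(src for _, src in run))
        for kind, run in groupby(cells, key=lambda c: c[0])
    ]
    return [
        (s1, s2)
        for (k1, s1), (k2, s2) in zip(merged, merged[1:])
        if k1 == "markdown" and k2 == "code"
    ]

def pairs_strategy_two(cells):
    """Strategy 2: pair only a markdown cell with the single code cell
    immediately following it."""
    return [
        (s1, s2)
        for (k1, s1), (k2, s2) in zip(cells, cells[1:])
        if k1 == "markdown" and k2 == "code"
    ]

cells = [
    ("markdown", "Load data"),
    ("markdown", "CSV from Kaggle"),
    ("code", "df = pd.read_csv('train.csv')"),
    ("code", "df.head()"),
]
print(pairs_strategy_one(cells))
print(pairs_strategy_two(cells))
```

Strategy 1 keeps all surrounding context in one pair, while strategy 2 yields tighter, more local pairs; which is better depends on how the generation model is trained.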
Using the PLBART model without fine-tuning, we obtained an overall mean BERTScore of 0.74.