
NotebookDocGen


Table of Contents
  1. Description
  2. Contributors
  3. Features
  4. Tech Stack
  5. Demo
  6. Classification Task
  7. Documentation Generation Task

Description

Our machine learning project is a tool that improves how Data Science and Machine Learning practitioners and learners write and organize notebooks. Its features include classifying notebooks by the ML domain they belong to (NLP, computer vision, or reinforcement learning) and by the technique they use (classification, clustering, or regression), as well as automatically generating documentation for individual code cells in a notebook. For the classification task we used the pre-trained CodeBERT model together with a classifier (built locally and with Oracle AutoML); for the documentation generation task we used a PLBART model. All of these features are showcased in a Streamlit web application where users can test and explore them.

Contributors

Features

  • Classification of notebooks by ML domain: NLP, Computer Vision, and Reinforcement Learning.
  • Classification of notebooks by ML technique: Regression, Classification, and Clustering.
  • Automatic generation of documentation for individual code cells in a notebook.

Tech Stack

Data Analysis: pandas, NumPy, regex, Plotly, Matplotlib, seaborn, NLTK, tokenize...

Modeling: PyTorch, scikit-learn

Deployment: Streamlit

Demo

Using the UI

01-NotebookDocGen_Demo_NoUpload.webm

Domain classification + PLBART DocGen

domain

Technique classification + PLBART DocGen

technique

Domain & Technique classification

dom_tech

Classification Task

Data Collection

  • Build a list of keywords for each domain and technique: List_Keywords.

  • Collect about 10,463 notebooks from Kaggle by searching for those keywords.
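The keyword-based tagging behind this collection step can be sketched as follows; the keyword lists here are purely illustrative placeholders (the real lists live in the repo's List_Keywords resource):

```python
# Hypothetical sketch of keyword-based domain tagging; the keyword
# lists are illustrative, not the project's actual List_Keywords.
DOMAIN_KEYWORDS = {
    "nlp": ["tokenizer", "lemmatization", "word embedding"],
    "computer_vision": ["convolution", "image augmentation", "opencv"],
    "reinforcement_learning": ["q-learning", "reward", "agent"],
}

def tag_domains(notebook_text: str) -> list[str]:
    """Return every domain whose keywords appear in the notebook text."""
    text = notebook_text.lower()
    return [
        domain
        for domain, keywords in DOMAIN_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]
```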

Data Cleaning and preprocessing

  • Remove non-English, non-Python, and duplicate notebooks.
  • Parse the JSON notebooks into a DataFrame containing each notebook's content and tag.
  • Delete raw cells.
  • Clean the data by removing punctuation, special characters, emojis, and non-Python code, and by running a spell checker...
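A minimal sketch of this cleaning pass, covering only the punctuation, special-character, and emoji steps (the actual pipeline also applied spell checking and non-Python-code filtering):

```python
import re

# Illustrative cleaning of a markdown cell: drop emojis, strip
# punctuation/special characters, and collapse whitespace.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_RE = re.compile(r"[^\w\s]")

def clean_markdown(text: str) -> str:
    text = EMOJI_RE.sub("", text)              # drop emojis
    text = PUNCT_RE.sub(" ", text)             # strip punctuation / special chars
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```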

Embedding using CodeBERT

Embed 332,605 markdown and code cells using CodeBERT.
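The embedding itself comes from the pre-trained CodeBERT model (e.g. `microsoft/codebert-base` via HuggingFace Transformers, an assumption about the exact checkpoint); shown here is only the pooling step that turns per-token vectors into one fixed-size cell embedding, with plain Python lists standing in for tensors:

```python
# Sketch of mean pooling: the per-token embeddings CodeBERT produces
# for a cell are averaged into a single fixed-size cell embedding.
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token embeddings into one cell embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]
```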

Classification experiments

  • Local experiments with different classification models.
  • Used Oracle AutoML: an Oracle Machine Learning interface that provides no-code automated machine learning.
  • The best model reaches about 86% accuracy.

Link to our dataset

Documentation Generation Task

Data Collection

  • Collect suitable notebooks from Kaggle courses.
  • Collect the top-voted notebooks from Kaggle's Getting Started competitions.

Data Cleaning and preprocessing

  • Remove non-English and non-Python notebooks.
  • Parse the JSON notebooks into a DataFrame containing each notebook's content and tag.
  • Delete raw cells.
  • Clean the data by removing special characters, emojis, and non-Python code, and by running a spell checker..., while being careful to keep punctuation.
  • Create markdown and code pairs.
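Since `.ipynb` notebooks are JSON, the parsing step above can be sketched with the standard library alone; the real pipeline loads the result into a pandas DataFrame, which is omitted here:

```python
import json

# Minimal sketch of parsing a .ipynb file (plain JSON) into flat
# (cell_type, source) records, skipping raw cells as described above.
def parse_notebook(raw_json: str) -> list[dict]:
    nb = json.loads(raw_json)
    return [
        {"cell_type": c["cell_type"], "source": "".join(c["source"])}
        for c in nb.get("cells", [])
        if c["cell_type"] in ("markdown", "code")  # drop raw cells
    ]
```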

Data Transformation

We have roughly 3,000 code cells that include comments, with 9,633 comments in total. These comments can document code even more precisely and accurately than markdown cells, which tend to be more general. So we extracted the comments from the code cells and marked them as markdown, while keeping the order of the notebook's cells.
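This comment-promotion step can be sketched as follows; for simplicity the sketch only lifts full-line `#` comments (an assumption, since the source does not specify how inline comments were handled):

```python
# Hypothetical sketch: full-line `#` comments are lifted out of a code
# cell and emitted as markdown cells, preserving their position
# relative to the remaining code.
def promote_comments(code: str) -> list[tuple[str, str]]:
    """Split a code cell into ordered (cell_type, text) pieces."""
    cells, code_lines = [], []
    for line in code.splitlines():
        if line.lstrip().startswith("#"):
            if code_lines:                       # flush pending code first
                cells.append(("code", "\n".join(code_lines)))
                code_lines = []
            cells.append(("markdown", line.lstrip("# ").rstrip()))
        else:
            code_lines.append(line)
    if code_lines:
        cells.append(("code", "\n".join(code_lines)))
    return cells
```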

Pairs construction

  • The first strategy groups all consecutive markdown cells into one markdown cell, and does the same for code cells; pairs are then built from each merged markdown cell and the merged code cell that follows it.

  • The second strategy groups only a markdown cell and the code cell that directly follows it into one pair.
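The first strategy can be sketched as below, with cells represented as simple `(cell_type, text)` tuples (a representation chosen here for illustration):

```python
# Sketch of the first pairing strategy: merge runs of same-type cells,
# then pair each merged markdown block with the code block after it.
def build_pairs(cells: list[tuple[str, str]]) -> list[tuple[str, str]]:
    merged: list[tuple[str, str]] = []
    for ctype, text in cells:
        if merged and merged[-1][0] == ctype:    # same type as previous run
            merged[-1] = (ctype, merged[-1][1] + "\n" + text)
        else:
            merged.append((ctype, text))
    return [
        (merged[i][1], merged[i + 1][1])
        for i in range(len(merged) - 1)
        if merged[i][0] == "markdown" and merged[i + 1][0] == "code"
    ]
```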

PLBART

Using the PLBART model without fine-tuning, we obtained an overall mean BERTScore of 0.74.

Link to our dataset
