Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for improved handling of jupyter notebooks #105

Conversation

filipchristiansen
Copy link
Collaborator

This PR introduces the process_notebook function to process .ipynb files and return them as Python scripts, converting markdown and raw cells into multi-line string literals. It also refactors the function name ingest_from_query to run_ingest_query in ingest_from_query.py to avoid naming conflicts with the module, ensuring clearer code organization.

Changes include:

  • Adding the process_notebook function to handle Jupyter notebooks.
  • Renaming the ingest_from_query function to run_ingest_query to avoid naming conflicts with the module.
  • Updating _read_file_content to invoke process_notebook for .ipynb files.
  • Adding unit tests in test_notebook_utils.py for the notebook processing logic.
  • Adding integration tests in test_ingest.py to verify that .ipynb files trigger process_notebook.

These changes integrate Jupyter notebook processing into the file ingestion workflow, while also improving code clarity and test coverage.

@filipchristiansen filipchristiansen changed the title Refactor Function Name, Add Jupyter Notebook Processing, and Expand Test Coverage feat: Refactor Function Name, Add Jupyter Notebook Processing, and Expand Test Coverage Jan 6, 2025
@filipchristiansen filipchristiansen changed the title feat: Refactor Function Name, Add Jupyter Notebook Processing, and Expand Test Coverage feat: add support for improved handling of jupyter notebooks Jan 6, 2025
Copy link
Owner

@cyclotruc cyclotruc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I tested this with:
https://raw.githubusercontent.com/cyclotruc/test/refs/heads/main/Exploration%20of%20Airline%20On-Time%20Performance.ipynb and got:

 File "/workspaces/gitingest/src/gitingest/notebook_utils.py", line 30, in process_notebook
    for cell in notebook["cells"]:
                ~~~~~~~~^^^^^^^^^
KeyError: 'cells'

It seems to fail because the "cells" field is nested
image

@filipchristiansen
Copy link
Collaborator Author

filipchristiansen commented Jan 7, 2025

@cyclotruc

The use of worksheets is deprecated. (However, I have now still added support for it as requested.)

The worksheets field is a list, but we have no UI to support multiple worksheets. Our design has since shifted to heading-cell based structure, so we never intend to support the multiple worksheet model. The worksheets list of lists shall be replacedwith a single list, called cells."
https://github.com/ipython/ipython/wiki/IPEP-17:-Notebook-Format-4#remove-multiple-worksheets

@filipchristiansen filipchristiansen force-pushed the feature/process-notebook-integration branch from 7e61807 to 3a7e4b8 Compare January 7, 2025 10:42
@filipchristiansen
Copy link
Collaborator Author

@cyclotruc Tests added for the notebook_utils module – ready for review.

@filipchristiansen filipchristiansen force-pushed the feature/process-notebook-integration branch from f505f23 to f6c3a0b Compare January 8, 2025 16:30
@cyclotruc cyclotruc merged commit d2825ea into cyclotruc:main Jan 8, 2025
8 checks passed
@filipchristiansen filipchristiansen deleted the feature/process-notebook-integration branch January 8, 2025 22:50
filipchristiansen added a commit to filipchristiansen/gitingest that referenced this pull request Jan 8, 2025
@IsNoobgrammer
Copy link

IsNoobgrammer commented Jan 10, 2025

@filipchristiansen liked your PR , can you create a new PR such that ,cell number, cell type are commented above the source and if cells[output][-1]['text'] are commented below , also we can make such that that results always init_s with "### Jupyter-Notebook"
or some filter to identify it as a notebook , thanks!! for this PR , too

If you are busy , do mention , I would create a PR in that case

@filipchristiansen
Copy link
Collaborator Author

filipchristiansen commented Jan 10, 2025

@filipchristiansen liked your PR , can you create a new PR such that ,cell number, cell type are commented above the source and if cells[output][-1]['text'] are commented below , also we can make such that that results always init_s with "### Jupyter-Notebook" or some filter to identify it as a notebook , thanks!! for this PR , too

If you are busy , do mention , I would create a PR in that case

@IsNoobgrammer

What do you mean by output here? cells is a list, so I assume output is an int? but then cells[output][-1] would be the last character in the cell, and cells[output][-1]['text'] would be trying to index the last character with a ['text'] key. Maybe I am misunderstanding something obvious here. Please let me know.

For the second point, may I ask your use case for this? You would still identify that it is a notebook based on the .ipynb file extension.

@filipchristiansen
Copy link
Collaborator Author

filipchristiansen commented Jan 10, 2025

@filipchristiansen liked your PR , can you create a new PR such that ,cell number, cell type are commented above the source and if cells[output][-1]['text'] are commented below , also we can make such that that results always init_s with "### Jupyter-Notebook" or some filter to identify it as a notebook , thanks!! for this PR , too

If you are busy , do mention , I would create a PR in that case

What do you (@cyclotruc) say about the suggestion to start each notebook with ### Jupyter-Notebook directly after the following, but before its content?

================================================
File: /notebooks/foo/bar.ipynb
================================================

@IsNoobgrammer
Copy link

IsNoobgrammer commented Jan 10, 2025

@filipchristiansen liked your PR , can you create a new PR such that ,cell number, cell type are commented above the source and if cells[output][-1]['text'] are commented below , also we can make such that that results always init_s with "### Jupyter-Notebook" or some filter to identify it as a notebook , thanks!! for this PR , too
If you are busy , do mention , I would create a PR in that case

@IsNoobgrammer

What do you mean by output here? cells is a list, so I assume output is an int? but then cells[output][-1] would be the last character in the cell, and cells[output][-1]['text'] would be trying to index the last character with a ['text'] key. Maybe I am misunderstanding something obvious here. Please let me know.

For the second point, may I ask your use case for this? You would still identify that it is a notebook based on the .ipynb file extension.

1st part
meant for each cell
so for each code-type cell the outputs is [] if no output ;

else-if it ran and has output then

"outputs":[{"name":"stderr","output_type":"stream","text":"/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n\n  from .autonotebook import tqdm as notebook_tqdm\n"}]

for nth-cell in cell:
, so in my part

outputs=nth-cell.get("outputs","") ; if outputs: output = outputs[-1]["text"] ; del outputs # [-1] always gets last ran output

2nd part
Oh! My bad , its redundant
However
data: List[str] =ingested_data.split("================================================")
so we can pass that whole document to llm or other source , without mentioning it is a python code file or jupyter code snippets and it might also help in sending data without specifying any private metadata

@filipchristiansen
Copy link
Collaborator Author

filipchristiansen commented Jan 10, 2025

@filipchristiansen liked your PR , can you create a new PR such that ,cell number, cell type are commented above the source and if cells[output][-1]['text'] are commented below , also we can make such that that results always init_s with "### Jupyter-Notebook" or some filter to identify it as a notebook , thanks!! for this PR , too
If you are busy , do mention , I would create a PR in that case

@IsNoobgrammer
What do you mean by output here? cells is a list, so I assume output is an int? but then cells[output][-1] would be the last character in the cell, and cells[output][-1]['text'] would be trying to index the last character with a ['text'] key. Maybe I am misunderstanding something obvious here. Please let me know.
For the second point, may I ask your use case for this? You would still identify that it is a notebook based on the .ipynb file extension.

1st part meant for each cell so for each code-type cell the outputs is [] if no output ;

else-if it ran and has output then

"outputs":[{"name":"stderr","output_type":"stream","text":"/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n\n  from .autonotebook import tqdm as notebook_tqdm\n"}]

for nth-cell in cell: , so in my part

outputs=nth-cell.get("outputs","") ; if outputs: output = outputs[-1]["text"] ; del outputs # [-1] always gets last ran output

2nd part Oh! My bad , its redundant However data: List[str] =ingested_data.split("================================================") so we can pass that whole document to llm or other source , without mentioning it is a python code file or jupyter code snippets and it might also help in sending data without specifying any private metadata

  1. Regarding Including Cell Outputs:

    I think I now understand what you meant. For each code-type cell, if outputs is an empty list (i.e., no output), we would do nothing. However, if the cell has output, you want to comment below the source code what the last output was. I'm not sure if we should include the output by default.

    We could consider adding a command-line flag for the terminal client that alters this behavior to include outputs when desired. Similarly, for the web client, this might be an option under "Advanced Settings" or a similar section.

    I created Issue #119 for this feature request. If you believe such a feature would be useful, please comment on that issue. Your feedback will help us consider and prioritize this request formally.

  2. Regarding Identification of Notebook Content:

    I'm not entirely sure what you mean here. Could you please elaborate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants