Kolibri is an open-source educational platform for distributing content to areas with little or no internet connectivity. Educational content is created and edited on Kolibri Studio, which is a platform for organizing content for import into Kolibri applications. The purpose of this project is to create a chef: a program that scrapes a content source and puts it into a format that can be imported into Kolibri Studio.
- Install Python 3 if you don't have it already.
- Install pip if you don't have it already.
- Create a Python virtual environment for this project (optional, but recommended):
  - Install the virtualenv package: `pip install virtualenv`
  - The next steps depend on whether you're using UNIX (Mac/Linux) or Windows:
    - For UNIX systems:
      - Create a virtual env called `venv` in the current directory using the following command: `virtualenv -p python3 venv`
      - Activate the virtualenv called `venv` by running: `source venv/bin/activate`. Your command prompt will change to indicate you're working inside `venv`.
    - For Windows systems:
      - Create a virtual env called `venv` in the current directory using the following command: `virtualenv -p C:/Python36/python.exe venv`. You may need to adjust the `-p` argument depending on where your version of Python is located.
      - Activate the virtualenv called `venv` by running: `.\venv\Scripts\activate`
- Run `pip install -r requirements.txt` to install the required Python libraries.
TODO: Explain how to run the CREE chef
```
export SOMEVAR=someval
./script.py -v --option2 --kwarg="val"
```
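A typical invocation of a ricecooker-based chef looks something like the following (the `--token` option is ricecooker's standard flag for your Kolibri Studio API token; the exact options for the CREE chef may differ):

```
./sushichef.py -v --token=<your-studio-token>
```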
A sushi chef script is responsible for importing content into Kolibri Studio. The `ricecooker` library provides all the necessary methods for uploading the channel content to Kolibri Studio, as well as helper functions and utilities.

A sushi chef script has been started for you in `sushichef.py`.

Sushi chef docs can be found here.

For more sushi chef examples, see `examples/openstax_sushichef.py` (JSON) and `examples/wikipedia_sushichef.py` (HTML), and also the `examples/` dir inside the `ricecooker` repo.
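For orientation, a minimal chef using the `ricecooker` API looks roughly like this (a sketch with made-up channel metadata, not the CREE chef itself):

```python
#!/usr/bin/env python
from ricecooker.chefs import SushiChef
from ricecooker.classes.nodes import TopicNode


class SampleChef(SushiChef):
    # All channel metadata below is hypothetical placeholder data
    channel_info = {
        'CHANNEL_SOURCE_DOMAIN': 'example.org',
        'CHANNEL_SOURCE_ID': 'sample-channel',
        'CHANNEL_TITLE': 'Sample Channel',
        'CHANNEL_LANGUAGE': 'en',
    }

    def construct_channel(self, *args, **kwargs):
        # Build the ChannelNode from channel_info, then attach content nodes
        channel = self.get_channel(*args, **kwargs)
        channel.add_child(TopicNode(source_id='topic-1', title='A Topic'))
        return channel


if __name__ == '__main__':
    SampleChef().main()
```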
The CREE channel includes content derived from PDFs. Given the unstructured nature of PDFs, there are a couple of scripts to run before running the full sushichef.
Before running any scripts, you will need to adjust the `FOLDER` variable in the `config.py` file. This should be the folder you would like to parse for PDFs. For example:

```
FOLDER = "C:/Users/username/mypdfs"
```
You will need to generate an index for the PDF-splitting code. To do this, run:

```
python scripts/generateindex.py
```
This will parse the directory (see the previous step to set this) and generate a `<pdf filename>-index.json` file for every PDF file found under that directory. For instance, a directory might look like this after running this script:

```
Some Directory
| - MyPdf.pdf
| - MyPdf-index.json
| - AnotherPdf.pdf
| - AnotherPdf-index.json
```
Note: If you add more PDFs to the directory, you can run this command again without overwriting any work you've previously done.
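For illustration, the safe re-run behavior could be implemented along these lines (a hypothetical sketch, not the actual script; the real index generation infers chapters from the PDF itself):

```python
import json
from pathlib import Path

from config import FOLDER  # the directory you set in config.py


def generate_missing_indices():
    """Create a stub <name>-index.json next to every PDF that lacks one.

    Existing index files are skipped, so re-running never overwrites
    any hand-edited work.
    """
    for pdf_path in Path(FOLDER).rglob("*.pdf"):
        index_path = pdf_path.with_name(pdf_path.stem + "-index.json")
        if index_path.exists():
            continue  # keep previously edited indices intact
        stub = {"offset": 0, "chapters": {}}  # placeholder structure
        index_path.write_text(json.dumps(stub, indent=2))
```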
There may be some issues with the auto-generated indices, so you can edit these `-index.json` files in order to structure the channel correctly. You may also need to adjust the `offset` field to match where the first page actually starts (open the PDF and check the page number). Here is a sample of a valid index file:
```json
{
  "offset": 2,
  "chapters": {
    "Section Name": {
      "Chapter 1": 5,
      "Chapter 2": 10
    },
    "Appendix": 15
  }
}
```
Here, Chapter 1 is listed as starting on page 5 according to the PDF's index page. However, the `offset` is set to 2, so Chapter 1 will be split at page 7, Chapter 2 at page 12, and so on.
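In other words, the physical page equals the listed page plus the offset. A minimal sketch of a page-range split using that arithmetic, assuming PyPDF2 (the project's actual splitting code may differ):

```python
from PyPDF2 import PdfReader, PdfWriter


def split_chapter(src_path, dest_path, start_page, end_page, offset):
    """Write the chapter spanning listed pages [start_page, end_page)
    of src_path into its own PDF at dest_path."""
    reader = PdfReader(src_path)
    writer = PdfWriter()
    # Listed page N lives at physical page N + offset, i.e. index N + offset - 1
    for i in range(start_page + offset - 1, end_page + offset - 1):
        writer.add_page(reader.pages[i])
    with open(dest_path, "wb") as f:
        writer.write(f)
```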
Now that the index files are available, you can generate the smaller PDFs and the associated exercise data by running:

```
python scripts/generatedata.py
```
This command will read the `-index.json` files from the previous script and split the PDFs based on the page numbers listed there. It will also read the PDFs and attempt to find any questions from the text. All of this data will be written to a `<pdf filename>-data.json` file. The directory will now look like this:
```
Some Directory
| - MyPdf.pdf
| - MyPdf-index.json
| - MyPdf-data.json
| - AnotherPdf.pdf
| - AnotherPdf-index.json
| - AnotherPdf-data.json
```
Note: If you add more PDFs to the directory, you can run this command again without overwriting any work you've previously done.
Again, there may be some manual work needed to address any issues with the autogenerated `-data.json` file. While the `header` and `chapters` fields are based on the extracted `-index.json` file, you may want to edit the `exercises` field. The `questions` field is a list of all questions associated with an exercise. Each item in this list comprises the following fields:

- `question`: the text of the question
- `type`: the type of question. You may set it to any of the following question types:
  - `single_selection`: only one answer is correct
  - `multiple_selection`: select all that apply
  - `input_question`: numeric answer
- `answers`: potential answers for the question. To mark an answer as correct, set its value to `true`
Here is an example of a valid `-data.json` file:

```json
{
  "header": "Section Name",
  "chapters": [
    {
      "chapter": "Chapter 1",
      "path": "path/to/splitfile.pdf",
      "exercises": [
        {
          "description": "Some description",
          "questions": [
            {
              "question": "Which of the following are fruits?",
              "type": "multiple_selection",
              "answers": {
                "Apples": true,
                "Oranges": true,
                "Potatoes": false
              }
            },
            {
              "question": "Can birds fly?",
              "type": "single_selection",
              "answers": {
                "Yes": true,
                "No": false
              }
            }
          ]
        }
      ]
    }
  ]
}
```
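When the chef ingests these files, each question dict maps naturally onto one of ricecooker's exercise question classes. Here is a rough sketch of that conversion (the helper name and id handling are made up for illustration, and only the two selection types are shown; `input_question` would use ricecooker's `InputQuestion` class):

```python
from ricecooker.classes.questions import (
    MultipleSelectQuestion,
    SingleSelectQuestion,
)


def make_question(q, question_id):
    """Convert one question dict from a -data.json file into a
    ricecooker question object."""
    all_answers = list(q["answers"].keys())
    correct = [ans for ans, is_right in q["answers"].items() if is_right]
    if q["type"] == "single_selection":
        return SingleSelectQuestion(
            id=question_id,
            question=q["question"],
            correct_answer=correct[0],
            all_answers=all_answers,
        )
    if q["type"] == "multiple_selection":
        return MultipleSelectQuestion(
            id=question_id,
            question=q["question"],
            correct_answers=correct,
            all_answers=all_answers,
        )
    raise ValueError("unsupported question type: {}".format(q["type"]))
```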
If you find an issue with the `-index.json` file (e.g. the `offset` was set incorrectly, typos, etc.), you will need to rename or delete the `-data.json` file before proceeding. If you have edited the exercise data, please rename your `-data.json` file, run the `generatedata.py` command again, and copy your work into the newly created `-data.json` file.
Now that all of the pre-work has been done, it's time to run your chef!
- JSON Validator: if you run into issues with invalid JSON files, this can help with fixing those issues
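You can also do a quick local validity check with nothing but the standard library (a small standalone sketch, not part of this repo):

```python
import json
import sys

# Usage: python checkjson.py path/to/some-data.json
with open(sys.argv[1]) as f:
    try:
        json.load(f)
        print("valid JSON")
    except json.JSONDecodeError as err:
        print("invalid JSON: {}".format(err))
```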
Please make sure your final chef matches the following standards.
- Does the code work (no infinite loops, exceptions thrown, etc.)?
- Are the `source_id`s determined consistently (based on foreign database identifiers or permanent url paths)?
- Is there documentation on how to run the script (including command line parameters to use)?
- Are there no obvious runtime or memory inefficiencies in the code?
- Are the functions succinct?
- Are clarifying comments provided where needed?
- Are the git commits easy to understand?
- Are there no unnecessary nested `if` or `for` loops?
- Are variables named descriptively (e.g. `path` vs. `p`)?
- Is the code compatible with Python 3?
- Does the code use common standard library functions where needed?
- Does the code use common Python idioms where needed (`with`/`open`, `try`/`except`, etc.)?