Skip to content

Commit

Permalink
Add resources2pdf to gh pages
Browse files Browse the repository at this point in the history
  • Loading branch information
beefoo committed May 31, 2024
1 parent 7db35d2 commit 13dd164
Show file tree
Hide file tree
Showing 8 changed files with 1,005 additions and 1 deletion.
284 changes: 284 additions & 0 deletions _sources/loc.gov JSON API/resources2pdf.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,284 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Downloading an item with multiple pages to PDF\n",
"\n",
"The loc.gov API provides structured data about Library of Congress collections in JSON and YAML formats. This notebook shows how you can take use the API to access image resources, belonging to an LoC Item, and aggregate them into a single PDF file. \n",
"\n",
"## Understanding API Responses Review:\n",
"\n",
"**JSON Response Objects**\n",
"Each of the endpoint types has a distinct response format, but they can be broadly grouped into two categories:\n",
"- responses to queries for a list of items, or Search Results Responses \n",
"- responses to queries for a **single item**, or Item and Resource Responses\n",
"\n",
"Furthermore, this notebook will focus on the **JSON Response Object** for a **single item** and formatting its corresponding Resources (files that make-up an item, e.g. pictures of book) into a .pdf file.\n",
"\n",
"## Prerequisites\n",
"\n",
"There are no prequisites in order to run this notebook, besides the installation of libraries listed in the imports section.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## I. Imports\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from PIL import Image\n",
"import os\n",
"from io import BytesIO\n",
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## II. Create a request URL\n",
"\n",
"First, we will start by ensuring we have a link to an item of interest. In this instance we will look at the [Benjamin Harrison Papers: Series 13, Venezuela Boundary Dispute, 1895-1899; Part 2, 1895-1899](https://www.loc.gov/item/mss250640164/) as an example.\n",
"\n",
"Notice the format of the link to this item: `https://www.loc.gov/item/mss250640164/`\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Item API Request URL: https://www.loc.gov/item/mss250640164/?fo=json\n"
]
}
],
"source": [
"item_link=\"https://www.loc.gov/item/mss250640164/\"\n",
"request_url = item_link + \"?fo=json\"\n",
"\n",
"# Note: The addition of the \"fo=json\" string ensures that the item request is in JSON format\n",
"\n",
"print(f'Item API Request URL: {request_url}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also set a start and end page that we want to download and compile."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"start_page = 1 # starting a 1\n",
"end_page = 10 # up to and including this page, make this -1 to retrieve all pages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## III. Request Data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top-level data structure:\n",
"articles_and_essays, cite_this, item, more_like_this, options, related_items, resources, timestamp, type\n"
]
}
],
"source": [
"# Generates request from LOC API to extract data in JSON format\n",
"r = requests.get(request_url)\n",
"data = r.json()\n",
"# print(data)\n",
"\n",
"# Here is a quick way at looking at the structure of the data\n",
"print(\"Top-level data structure:\\n\" + \", \".join(value for value in data.keys()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IV. Resource Data and Extracting Resource URLS\n",
"\n",
"In the previous code cell, you can see that the content itself has a lot of Metadata to that can be explored. However, in this notebook we will focus on access to information about the resources.\n",
"\n",
"As opposed to looking at item with `data['item']` we will look at the resources through `data['resources]`. Furthermore, we will be creating a list of all of the resources image urls with the best resolution (based on the largest height).\n",
"\n",
"First, let's just retrieve a list of resources/files"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total # of Resources: 1,143\n",
"Resource Data: caption, files, image, url\n",
"Selected 10 files\n"
]
}
],
"source": [
"resources = data['resources'][0]\n",
"files = resources['files']\n",
"num_resources = len(files)\n",
"print(f'Total # of Resources: {num_resources:,}')\n",
"print('Resource Data: ' + \", \".join(key for key in resources.keys()))\n",
"\n",
"# And select a subset of these files\n",
"files = files[(start_page-1):end_page]\n",
"print(f'Selected {len(files)} files')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we will select the highest resolution .jpg image for each file"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 10 .jpg file URLs\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0018/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0019/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0020/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0021/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0022/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0023/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0024/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0025/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0026/full/pct:100/0/default.jpg\n",
"https://tile.loc.gov/image-services/iiif/service:mss:mss25064:mss25064-141:0027/full/pct:100/0/default.jpg\n"
]
}
],
"source": [
"urls = []\n",
"for i, file_sizes in enumerate(files):\n",
" # only select files that are .jpg and have a height\n",
" jpgs = [f for f in file_sizes if 'url' in f and f['url'].endswith('.jpg') and 'height' in f]\n",
"\n",
" # Check to see if we have at least one .jpg image\n",
" if len(jpgs) < 1:\n",
" print(f'No .jpgs found in file #{i+1}. Skipping.')\n",
" continue\n",
"\n",
" # sort the jpgs by height, descending\n",
" jpgs = sorted(jpgs, key=lambda f: -f['height'])\n",
"\n",
" # choose the largest one\n",
" urls.append(jpgs[0]['url'])\n",
"\n",
"print(f\"Found {len(urls)} .jpg file URLs\")\n",
"print(\"\\n\".join(urls))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## V. Downloading images into a PDF file\n",
"\n",
"Finally we will download each image to memory, convert them to images, then compile them into a single PDF file. You can change the resolution and file name in the code below.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LOC Item Resources have been saved as pdf: output/sample.pdf\n"
]
}
],
"source": [
"# Function that facilitates the download of the image using it's url\n",
"def download_image(url):\n",
" response = requests.get(url)\n",
" return Image.open(BytesIO(response.content))\n",
"\n",
"# Leveraging the list of urls, this function allows you to create the final .pdf file with the aggregate resources.\n",
"def create_pdf(image_urls, pdf_name):\n",
" images = []\n",
" for url in image_urls:\n",
" image = download_image(url)\n",
" images.append(image)\n",
"\n",
" images[0].save(\n",
" pdf_name, \"PDF\", resolution=100.0, save_all=True, append_images=images[1:]\n",
" )\n",
"\n",
" print(\"LOC Item Resources have been saved as pdf: \"+ pdf_name)\n",
"\n",
"# creating the PDF\n",
"pdf_name = 'output/sample.pdf'\n",
"create_pdf(urls, pdf_name)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
1 change: 1 addition & 0 deletions genindex.html
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,7 @@
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Loc.gov JSON API notebooks</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/LOC.gov%20JSON%20API.html">LOC.gov JSON API Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/resources2pdf.html">Downloading an item with multiple pages to PDF</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Accessing%20images%20for%20analysis.html">Accessing images from the loc.gov JSON API for image analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Dominant%20colors.html">Color clusters in images in Library of Congress collections</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Downloading_Monographs_as_Images_in_Rosenwald_Collection/Downloading%20Monographs%20as%20Images%20in%20Rosenwald%20Collection.html">Connecting the image file to the metadata</a></li>
Expand Down
1 change: 1 addition & 0 deletions intro.html
Original file line number Diff line number Diff line change
Expand Up @@ -164,6 +164,7 @@
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Loc.gov JSON API notebooks</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/LOC.gov%20JSON%20API.html">LOC.gov JSON API Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/resources2pdf.html">Downloading an item with multiple pages to PDF</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Accessing%20images%20for%20analysis.html">Accessing images from the loc.gov JSON API for image analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Dominant%20colors.html">Color clusters in images in Library of Congress collections</a></li>
<li class="toctree-l1"><a class="reference internal" href="loc.gov%20JSON%20API/Downloading_Monographs_as_Images_in_Rosenwald_Collection/Downloading%20Monographs%20as%20Images%20in%20Rosenwald%20Collection.html">Connecting the image file to the metadata</a></li>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,7 @@
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Loc.gov JSON API notebooks</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../LOC.gov%20JSON%20API.html">LOC.gov JSON API Guide</a></li>
<li class="toctree-l1"><a class="reference internal" href="../resources2pdf.html">Downloading an item with multiple pages to PDF</a></li>
<li class="toctree-l1"><a class="reference internal" href="../Accessing%20images%20for%20analysis.html">Accessing images from the loc.gov JSON API for image analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="../Dominant%20colors.html">Color clusters in images in Library of Congress collections</a></li>
<li class="toctree-l1"><a class="reference internal" href="../Downloading_Monographs_as_Images_in_Rosenwald_Collection/Downloading%20Monographs%20as%20Images%20in%20Rosenwald%20Collection.html">Connecting the image file to the metadata</a></li>
Expand Down
Loading

0 comments on commit 13dd164

Please sign in to comment.