Commit 1193982

docs: add data source docs and contribution guide
1 parent 6c5dd7e commit 1193982

6 files changed, +120 -0 lines changed


CONTRIBUTING.md

100644 → 100755
+8
@@ -36,6 +36,14 @@ libquery/ - the Python package
 
 Use [Black](https://github.com/psf/black) to detect code style issues and fix the issues before committing.
 
+## Publishing the Package
+
+The steps to publish the package:
+
+1. Bump the version in `pyproject.toml`.
+2. Create a new tag for the version with `git tag v*.*.*`.
+3. Push the tag to GitHub with `git push origin v*.*.*`, which triggers the GitHub workflow at `.github/workflows/publish-to-pypi.yml` to automatically publish the package to PyPI.
+
 ## Thanks
 
 Thank you again for being interested in this project! You are awesome!
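
The tagging and pushing steps from "Publishing the Package" can be sketched in Python. This is a minimal sketch, assuming the version is a plain `MAJOR.MINOR.PATCH` string. The helper names `release_commands` and `publish` are hypothetical and not part of the repository, and the sketch does not edit `pyproject.toml` (step 1 stays manual).

```python
import re
import subprocess
from typing import List


def release_commands(version: str) -> List[List[str]]:
    """Build the git commands for steps 2 and 3 (tag, then push the tag)."""
    # Assumption: versions look like 1.2.3, matching the `v*.*.*` tag pattern.
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    tag = f"v{version}"
    return [["git", "tag", tag], ["git", "push", "origin", tag]]


def publish(version: str) -> None:
    """Run the commands in order and stop at the first failure."""
    for command in release_commands(version):
        subprocess.run(command, check=True)
```

Calling `release_commands("0.1.2")` only builds the commands, so they can be reviewed before `publish("0.1.2")` actually runs them.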
@@ -0,0 +1,33 @@
+# David Rumsey Map Collection
+
+**URL:** <https://www.davidrumsey.com>
+
+**Corresponding Library:** [David Rumsey Map Center](https://library.stanford.edu/rumsey)
+
+**Collector:** [David Rumsey](https://en.wikipedia.org/wiki/David_Rumsey)
+
+**Data Source Content:**
+> The collection focuses on rare 16th through 21st century maps of North and South America, as well as maps of the World, Asia, Africa, Europe, and Oceania. The collection includes atlases, globes, wall maps, school geographies, pocket maps, books of exploration, maritime charts, and a variety of cartographic materials including pocket, wall, children's, and manuscript maps. Items range in date from around 1550 to the present. ([Source](https://www.davidrumsey.com/about), last accessed on 2023/10/19)
+
+**Data Source Size:**
+> The David Rumsey Map Collection was started over 35 years ago and contains more than 200,000 maps. (...) Digitization of the collection began in 1996 and there are now over 126,000 items online, with new additions added regularly. ([Source](https://www.davidrumsey.com/about), last accessed on 2023/10/19)
+
+**Official API:** [LUNA API Documentation](https://doc.lunaimaging.com/display/V74D/LUNA+API+Documentation)
+
+**Image Resolution:**
+> The digitized maps are very high resolution images scanned at 300 to 800 pixels per inch, as measured against the original map's dimensions. ([Source](https://www.davidrumsey.com/about))
+
+**Image File Size:**
+> The larger maps generate files frequently approaching 5 gigabytes in size; the average file size of images in the collection is 250 megabytes. ([Source](https://www.davidrumsey.com/about))
+
+**Terms of Use:** CC BY-NC-SA 3.0 ([Source](https://www.davidrumsey.com/about))
+
+## Additional Notes
+
+- [Georeferencing](https://www.davidrumsey.com/view/georeferenced-maps) is available for the maps.
+  - Available in formats such as WMTS, TileJSON, XYZ, and GeoTIFF.
+  - Example: <https://davidrumsey.georeferencer.com/maps/929406225627/>
+- The David Rumsey Map Collection is actively maintained, and its collection grows over time.
+- On the reproducibility of search URLs of an entry:
+  - The result of a *keyword* query (e.g., [q=subject=data+visualization](https://www.davidrumsey.com/luna/servlet/as/search?q=subject=data+visualization)) is not reproducible, because the data source is updated over time.
+  - The result of a *mid* query (e.g., [mid=RUMSEY~8~1~286452~90058976](https://www.davidrumsey.com/luna/servlet/as/search?mid=RUMSEY~8~1~286452~90058976)) is reproducible.
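
The two kinds of search URL discussed in the reproducibility note can be built as follows. This is a minimal sketch based only on the two example URLs in the README. The function names are hypothetical, and the `field=keyword` shape of the `q` parameter is inferred from the single `subject=data+visualization` example rather than from the LUNA API documentation.

```python
from urllib.parse import quote, quote_plus

SEARCH_URL = "https://www.davidrumsey.com/luna/servlet/as/search"


def keyword_query_url(field: str, keyword: str) -> str:
    """Build a keyword query URL; its results may change as the collection grows."""
    # Keep "=" literal so the query reads q=subject=..., as in the README example.
    term = quote_plus(f"{field}={keyword}", safe="=")
    return f"{SEARCH_URL}?q={term}"


def mid_query_url(mid: str) -> str:
    """Build a mid query URL, which points at one fixed entry."""
    return f"{SEARCH_URL}?mid={quote(mid, safe='~')}"
```

Storing the `mid` URL alongside each fetched entry is one way to keep a reference that still resolves to the same record after the collection is updated.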

libquery/gallica/README.md

+39
@@ -0,0 +1,39 @@
+# Gallica
+
+**URL:** <https://gallica.bnf.fr/>
+
+**Corresponding Library:** [Bibliothèque nationale de France](https://en.wikipedia.org/wiki/Biblioth%C3%A8que_nationale_de_France)
+
+**Official API:**
+- [Search API](https://api.bnf.fr/api-gallica-de-recherche): for searching the digital holdings.
+  - Note that when using multiple filtering conditions in a URL, only French search terms function as expected.
+    - For example, <https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&maximumRecords=10&startRecord=1&query=dc.title%20all%20%22cartes%20figurative%22%20and%20dc.type%20any%20image> only returns two records.
+- [Document API](https://api.bnf.fr/api-document-de-gallica): for retrieving the metadata.
+- [Image API](https://api.bnf.fr/api-iiif-de-recuperation-des-images-de-gallica): for retrieving and manipulating images with the [IIIF](https://iiif.io/) standard.
+
+**Terms of Use:**
+> The non-commercial reuse of these contents is free and free in compliance with the legislation in force and in particular the maintenance of the source mention of the contents as specified below: "Source gallica.bnf.fr / Bibliothèque nationale de France" or "Source gallica.bnf.fr / BnF"... The metadata are subject to the EtaLab license, which authorizes free and open use provided the source is mentioned (BnF / Gallica). ([Source](https://gallica.bnf.fr/edit/conditions-dutilisation-des-contenus-de-gallica))
+
+## Additional Notes
+
+### Structure of a Query URL
+
+Example query URL to be used as an entry in `queries` of `querier.fetch_metadata(queries=queries)`: `https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&maximumRecords=10&startRecord=1&query=+dc.title+all+%22{keyword}%22`.
+
+- The parameter `{keyword}` is to be replaced with the keyword to be searched.
+- The URL parameter `maximumRecords` denotes the maximum number of records to be returned. When `maximumRecords` is not explicitly specified, its default value is 10.
+- The URL parameter `startRecord` denotes the index of the start record (starting from 1). For example, `startRecord=11` with `maximumRecords=10` requests the 11th to 20th records.
+
+Additional query URL example: <https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&maximumRecords=10&startRecord=11&query=dc.title+all+%22cartes+figurative%22>.
+This URL returns the 11th to 20th records whose titles include the words "cartes figurative".
+
+Each query URL returns a list of XML records (with the identifier stored in `dc:identifier`).
+An example record is: <https://gallica.bnf.fr/ark:/12148/btv1b525109845>.
+
+### Obtaining Image Information
+
+Each record returned by a query URL may contain one or multiple images.
+For example, the record <https://gallica.bnf.fr/ark:/12148/btv1b525109845> has 16 images.
+
+The image information for the record can be obtained with the URL `https://gallica.bnf.fr/services/Pagination?ark={ark_identifier}`.
+For example, the `ark_identifier` of <https://gallica.bnf.fr/ark:/12148/btv1b525109845> is `btv1b525109845`. We can retrieve its corresponding image information at <https://gallica.bnf.fr/services/Pagination?ark=btv1b525109845>.
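
The query URL structure and the pagination URL described in the Gallica README can be assembled with a few small helpers. This is a sketch under stated assumptions: the function names are hypothetical and are not part of `libquery`, and the query string follows the form of the "additional query URL example" (without the leading `+` that appears in the template).

```python
from typing import Iterator
from urllib.parse import quote_plus

SRU_URL = "https://gallica.bnf.fr/SRU"
PAGINATION_URL = "https://gallica.bnf.fr/services/Pagination"


def title_query_url(keyword: str, start_record: int = 1, maximum_records: int = 10) -> str:
    """Build an SRU query URL that searches record titles for `keyword`."""
    return (
        f"{SRU_URL}?operation=searchRetrieve&version=1.2"
        f"&maximumRecords={maximum_records}&startRecord={start_record}"
        f"&query=dc.title+all+%22{quote_plus(keyword)}%22"
    )


def start_records(total: int, page_size: int = 10) -> Iterator[int]:
    """Yield the 1-based `startRecord` value of each page of results."""
    yield from range(1, total + 1, page_size)


def pagination_url(ark_identifier: str) -> str:
    """Build the URL that lists the images of one record."""
    return f"{PAGINATION_URL}?ark={quote_plus(ark_identifier)}"
```

For a result set of 25 records, `start_records(25)` yields 1, 11, and 21, so three query URLs cover all pages at the page size of 10 used in the examples.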

libquery/gallica/_fetch_metadata.py

+11
@@ -126,6 +126,17 @@ def _fetch_pagination(ark: str) -> List[Page]:
     For the identifier of an image collection,
     the pagination information can be used to fetch the images.
     Example ARK identifier: 'ark:/12148/cb32798952c'.
+
+    Args
+    ----
+    ark : string
+        The ARK identifier of a collection.
+
+    Notes
+    -----
+    Obtaining image information from the BnF service with an ARK identifier may fail in the following cases:
+    - The ARK identifier corresponds to a record outside Gallica, e.g., <https://bibliotheques-specialisees.paris.fr/ark:/73873/pf0000855747>.
+    - In rare cases, the image information of a record in Gallica cannot be retrieved through the API, e.g., <https://gallica.bnf.fr/ark:/12148/bc6p06xk9kk>.
     """
 
     # Get image page information (i.e., image list) for a collection
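
The first failure case in the docstring (a record hosted outside Gallica) can be detected before any request is made. The helper below is a hypothetical sketch, not code from `_fetch_metadata.py`: it assumes a record URL has the shape `https://gallica.bnf.fr/ark:/<naan>/<name>` and treats any other host as "outside Gallica". The second failure case still has to be handled when the request itself fails.

```python
from typing import Optional
from urllib.parse import urlparse

GALLICA_HOST = "gallica.bnf.fr"


def gallica_ark_name(url: str) -> Optional[str]:
    """Return the ARK name of a Gallica record URL, or None for other URLs."""
    parsed = urlparse(url)
    if parsed.hostname != GALLICA_HOST:
        return None
    # Expected path shape: /ark:/<naan>/<name>
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "ark:":
        return None
    return parts[2]
```

A caller could skip records for which this returns `None` instead of sending a pagination request that is expected to fail.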

libquery/internet_archive/README.md

+19
@@ -0,0 +1,19 @@
+# Internet Archive
+
+**URL:** <https://archive.org/>
+
+**Data Source Size:**
+- 4 million images ([Source](https://archive.org/details/image))
+- 6 million books ([Source](https://archive.org/details/books))
+
+**Official API:** [The Internet Archive Python Library](https://archive.org/developers/internetarchive/index.html) ([GitHub repository](https://github.com/jjjake/internetarchive))
+
+**Terms of Use:** [link](https://archive.org/about/terms.php)
+
+## Additional Notes
+
+- Search Functions
+  - [image search](https://archive.org/details/image)
+  - [book search](https://archive.org/details/books)
+  - [advanced search](https://archive.org/advancedsearch.php)
+- The Internet Archive is actively maintained, and its collection grows over time.
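
As an illustration of the advanced search entry point listed above, the sketch below builds a search URL with the standard library only. It is not taken from this repository: the function name is hypothetical, and the parameter set (`q`, `fl[]`, `rows`, `page`, `output`) is an assumption about the public `advancedsearch.php` endpoint that should be checked against the Internet Archive documentation before use.

```python
from urllib.parse import urlencode

ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"


def advanced_search_url(query: str, rows: int = 10, page: int = 1) -> str:
    """Build an advanced search URL that asks for item identifiers as JSON."""
    params = [
        ("q", query),            # search expression, e.g. "mediatype:image"
        ("fl[]", "identifier"),  # field to return for each hit
        ("rows", rows),          # number of hits per page
        ("page", page),          # 1-based page index
        ("output", "json"),
    ]
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"
```

Because the collection grows over time, the same URL may return different hits on different days.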
+10
@@ -0,0 +1,10 @@
+# Library of Congress
+
+**URL:** <https://www.loc.gov/>
+
+**Official API:** [APIs for LoC.gov](https://www.loc.gov/apis/)
+- [JSON/YAML API](https://www.loc.gov/apis/json-and-yaml/)
+- [sitemaps](https://www.loc.gov/apis/sitemaps/)
+  - A sitemap provides information on the relationships between the pages, videos, images and other resources on a website.
+  - Not useful for collecting metadata about collection items.
+- [microservices](https://www.loc.gov/apis/micro-services/)
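
For the JSON/YAML API listed above, a request URL can be sketched as follows. This is an assumption-laden example rather than code from this repository: the function name is hypothetical, and the parameters `fo=json` (response format), `c` (results per page), and `sp` (page number) follow the public loc.gov API documentation as I understand it and should be verified there.

```python
from urllib.parse import urlencode


def loc_json_url(path: str, query: str, count: int = 25, page: int = 1) -> str:
    """Build a loc.gov search URL that requests a JSON response."""
    # Assumed parameters: q = search terms, fo = format, c = page size, sp = page.
    params = {"q": query, "fo": "json", "c": count, "sp": page}
    return f"https://www.loc.gov/{path.strip('/')}/?{urlencode(params)}"
```

For instance, `loc_json_url("maps", "data visualization")` targets the maps section of the site.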
