docs: add data source docs and contribution guide

zhangyu94 · zhangyu94 · commit 1193982ada65 · 2025-01-31T21:13:39.000+08:00
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -36,6 +36,14 @@ libquery/         - the Python package
 
 Use [Black](https://github.com/psf/black) to detect code style issues and fix the issues before committing.
 
+## Publishing the Package
+
+The steps to publish the package:
+
+1. Bump the version in `pyproject.toml`.
+2. Create a new tag for the version with `git tag v*.*.*`.
+3. Push the tag to GitHub with `git push origin v*.*.*`, which triggers the GitHub workflow at `.github/workflows/publish-to-pypi.yml` to publish the package to PyPI automatically.
+
 ## Thanks
 
 Thank you again for being interested in this project! You are awesome!
diff --git a/libquery/david_rumsey_map_collection/README.md b/libquery/david_rumsey_map_collection/README.md
@@ -0,0 +1,33 @@
+# David Rumsey Map Collection
+
+**URL:** <https://www.davidrumsey.com>
+
+**Corresponding Library:** [David Rumsey Map Center](https://library.stanford.edu/rumsey)
+
+**Collector:** [David Rumsey](https://en.wikipedia.org/wiki/David_Rumsey)
+
+**Data Source Content:**
+> The collection focuses on rare 16th through 21st century maps of North and South America, as well as maps of the World, Asia, Africa, Europe, and Oceania. The collection includes atlases, globes, wall maps, school geographies, pocket maps, books of exploration, maritime charts, and a variety of cartographic materials including pocket, wall, children's, and manuscript maps. Items range in date from around 1550 to the present. ([Source](https://www.davidrumsey.com/about) last accessed on 2023/10/19)
+
+**Data Source Size:**
+> The David Rumsey Map Collection was started over 35 years ago and contains more than 200,000 maps. (...) Digitization of the collection began in 1996 and there are now over 126,000 items online, with new additions added regularly. ([Source](https://www.davidrumsey.com/about) last accessed on 2023/10/19)
+
+**Official API:** [LUNA API Documentation](https://doc.lunaimaging.com/display/V74D/LUNA+API+Documentation)
+
+**Image Resolution:**
+> The digitized maps are very high resolution images scanned at 300 to 800 pixels per inch, as measured against the original map's dimensions. ([Source](https://www.davidrumsey.com/about))
+
+**Image File Size:**
+> The larger maps generate files frequently approaching 5 gigabytes in size; the average file size of images in the collection is 250 megabytes. ([Source](https://www.davidrumsey.com/about))
+
+**Terms of Use:** CC BY-NC-SA 3.0 ([Source](https://www.davidrumsey.com/about))
+
+## Additional Notes
+
+- [Georeferencing](https://www.davidrumsey.com/view/georeferenced-maps) is available for the maps.
+    - Available in formats such as WMTS, TileJSON, XYZ, and GeoTIFF
+    - Example: <https://davidrumsey.georeferencer.com/maps/929406225627/>
+- The David Rumsey Map Collection is actively maintained, and its collection grows over time.
+- On the reproducibility of search URLs of an entry:
+    - The results of a *keyword* query (e.g., [q=subject=data+visualization](https://www.davidrumsey.com/luna/servlet/as/search?q=subject=data+visualization)) are not reproducible because the data source is updated over time.
+    - The result of a *mid* query (e.g., [mid=RUMSEY~8~1~286452~90058976](https://www.davidrumsey.com/luna/servlet/as/search?mid=RUMSEY~8~1~286452~90058976)) is reproducible.
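The two query forms in the notes above differ only in the query parameter they pass to the search servlet. Below is a minimal sketch of building both kinds of URL, assuming the URL layout shown in the example links. The helper names `keyword_search_url` and `mid_search_url` are hypothetical and are not part of `libquery`.

```python
from urllib.parse import urlencode

# Search endpoint taken from the example links in the README notes.
SEARCH_ENDPOINT = "https://www.davidrumsey.com/luna/servlet/as/search"


def keyword_search_url(field: str, keyword: str) -> str:
    """Build a keyword query URL (hypothetical helper).

    The result set of such a URL can change as the collection grows.
    """
    # Keep "=" unescaped so the URL matches the form used in the README.
    return f"{SEARCH_ENDPOINT}?" + urlencode({"q": f"{field}={keyword}"}, safe="=")


def mid_search_url(mid: str) -> str:
    """Build a mid query URL (hypothetical helper).

    A mid identifies a single entry, so the result is stable.
    """
    return f"{SEARCH_ENDPOINT}?" + urlencode({"mid": mid}, safe="~")


print(keyword_search_url("subject", "data visualization"))
print(mid_search_url("RUMSEY~8~1~286452~90058976"))
```

Storing the mid URL alongside each downloaded entry is one way to keep a reference that can be re-resolved later, since the keyword URL may return different results over time.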
diff --git a/libquery/gallica/README.md b/libquery/gallica/README.md
@@ -0,0 +1,39 @@
+# Gallica
+
+**URL:** <https://gallica.bnf.fr/>
+
+**Corresponding Library:** [Bibliothèque nationale de France](https://en.wikipedia.org/wiki/Biblioth%C3%A8que_nationale_de_France)
+
+**Official API:**
+- [Search API](https://api.bnf.fr/api-gallica-de-recherche): for searching the digital holdings.
+    - Note that when using multiple filtering conditions in a URL, only French search terms function as expected.
+        - For example, <https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&maximumRecords=10&startRecord=1&query=dc.title%20all%20%22cartes%20figurative%22%20and%20dc.type%20any%20image> only returns two records
+- [Document API](https://api.bnf.fr/api-document-de-gallica): for retrieving the metadata.
+- [Image API](https://api.bnf.fr/api-iiif-de-recuperation-des-images-de-gallica): for retrieving and manipulating images with the [IIIF](https://iiif.io/) standard.
+
+**Terms of Use:**
+> The non-commercial reuse of these contents is free and free in compliance with the legislation in force and in particular the maintenance of the source mention of the contents as specified below: "Source gallica.bnf.fr / Bibliothèque nationale de France" or "Source gallica.bnf.fr / BnF"... The metadata are subject to the EtaLab license, which authorizes free and open use provided the source is mentioned (BnF / Gallica). ([Source](https://gallica.bnf.fr/edit/conditions-dutilisation-des-contenus-de-gallica))
+
+## Additional Notes
+
+### Structure of a Query URL
+
+Example query URL to be used as an entry in `queries` of `querier.fetch_metadata(queries=queries)`: `https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&maximumRecords=10&startRecord=1&query=+dc.title+all+%22{keyword}%22`.
+
+- The parameter `{keyword}` is to be replaced with the keyword to be searched.
+- The URL parameter `maximumRecords` denotes the maximum number of records to be returned. When `maximumRecords` is not explicitly specified, its default value is 10.
+- The URL parameter `startRecord` denotes the index of the first record to return (starting from 1).
+
+Additional query URL example: <https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&maximumRecords=10&startRecord=11&query=dc.title+all+%22cartes+figurative%22>.
+This URL returns the 11th to 20th records whose titles contain the words "cartes figurative".
+
+Each query URL returns a list of XML records (with identifier stored in `dc:identifier`).
+An example record is: <https://gallica.bnf.fr/ark:/12148/btv1b525109845>.
+
+### Obtaining Image Information
+
+Each record returned by a query URL may contain one or more images.
+For example, the record <https://gallica.bnf.fr/ark:/12148/btv1b525109845> has 16 images.
+
+The image information for the record can be obtained with the URL `https://gallica.bnf.fr/services/Pagination?ark={ark_identifier}`.
+For example, the `ark_identifier` of <https://gallica.bnf.fr/ark:/12148/btv1b525109845> is `btv1b525109845`. We can retrieve its corresponding image information at <https://gallica.bnf.fr/services/Pagination?ark=btv1b525109845>.
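The query and pagination URLs described in the Gallica README above can be assembled programmatically. Below is a minimal sketch that uses only the URL patterns quoted in the README. The helper names `build_query_url` and `build_pagination_url` are hypothetical and are not part of `libquery`, and the sketch only builds URLs without sending any request.

```python
from urllib.parse import quote_plus

# Endpoints copied from the README examples.
SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"
PAGINATION_ENDPOINT = "https://gallica.bnf.fr/services/Pagination"


def build_query_url(keyword: str, start_record: int = 1, maximum_records: int = 10) -> str:
    """Build an SRU URL that searches record titles for `keyword` (hypothetical helper)."""
    if start_record < 1:
        raise ValueError("startRecord is 1-based")
    # Same query form as the README example: dc.title all "<keyword>"
    query = f'dc.title all "{keyword}"'
    return (
        f"{SRU_ENDPOINT}?operation=searchRetrieve&version=1.2"
        f"&maximumRecords={maximum_records}&startRecord={start_record}"
        f"&query={quote_plus(query)}"
    )


def build_pagination_url(record_url: str) -> str:
    """Build the Pagination URL for a Gallica record URL (hypothetical helper)."""
    # The ark_identifier is the last path segment of the record URL.
    ark_identifier = record_url.rstrip("/").rsplit("/", 1)[-1]
    return f"{PAGINATION_ENDPOINT}?ark={ark_identifier}"


# Second page of results (records 11 to 20) for the README's example keyword.
print(build_query_url("cartes figurative", start_record=11))
print(build_pagination_url("https://gallica.bnf.fr/ark:/12148/btv1b525109845"))
```

A list of URLs produced by `build_query_url` with increasing `start_record` values could then serve as the `queries` argument mentioned in the README.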
diff --git a/libquery/gallica/_fetch_metadata.py b/libquery/gallica/_fetch_metadata.py
@@ -126,6 +126,17 @@ def _fetch_pagination(ark: str) -> List[Page]:
     For the identifier of an image collection,
     the pagination information can be used to fetch the images.
     Example ARK identifier: 'ark:/12148/cb32798952c'.
+
+    Args
+    ----
+    ark : string
+        The ARK identifier of a collection.
+
+    Notes
+    -----
+    Fetching image information for an ARK identifier through the BnF service may fail in the following cases:
+    - The ARK identifier corresponds to a record outside Gallica, e.g., <https://bibliotheques-specialisees.paris.fr/ark:/73873/pf0000855747>.
+    - In rare cases, the image information of a record in Gallica cannot be retrieved through the API, e.g., <https://gallica.bnf.fr/ark:/12148/bc6p06xk9kk>.
     """
 
     # Get image page information (i.e., image list) for a collection
diff --git a/libquery/internet_archive/README.md b/libquery/internet_archive/README.md
@@ -0,0 +1,19 @@
+# Internet Archive
+
+**URL:** <https://archive.org/>
+
+**Data Source Size:**
+- 4 million images ([Source](https://archive.org/details/image))
+- 6 million books ([Source](https://archive.org/details/books))
+
+**Official API:** [The Internet Archive Python Library](https://archive.org/developers/internetarchive/index.html) ([GitHub repository](https://github.com/jjjake/internetarchive))
+
+**Terms of Use:** [link](https://archive.org/about/terms.php)
+
+## Additional Notes
+
+- Search Functions
+    - [image search](https://archive.org/details/image)
+    - [book search](https://archive.org/details/books)
+    - [advanced search](https://archive.org/advancedsearch.php)
+- Internet Archive is actively maintained and its collection is growing over time.
diff --git a/libquery/library_of_congress/README.md b/libquery/library_of_congress/README.md
@@ -0,0 +1,10 @@
+# Library of Congress
+
+**URL:** <https://www.loc.gov/>
+
+**Official API:** [APIs for LoC.gov](https://www.loc.gov/apis/)
+- [JSON/YAML API](https://www.loc.gov/apis/json-and-yaml/)
+- [sitemaps](https://www.loc.gov/apis/sitemaps/)
+    - A sitemap provides information on the relationships between the pages, videos, images and other resources on a website.
+    - Not useful for collecting metadata about collection items
+- [microservices](https://www.loc.gov/apis/micro-services/)