Dev #128 (Merged)

Merged 41 commits on Sep 24, 2024

Commits
5897068
init message
PalmPalm7 Feb 8, 2024
4c88ded
Merge pull request #120 from BU-Spark/230208_init_commit
WilliamLee101 Feb 14, 2024
397df10
Update Technical Project Document Format
PalmPalm7 Feb 15, 2024
e348e5b
Update and rename Technical Project Document to Technical Project Doc…
PalmPalm7 Feb 16, 2024
4c47fa1
Merge pull request #121 from BU-Spark/PalmPalm7-patch-1
WilliamLee101 Feb 21, 2024
41aaffb
add Spring2024 research.md
Aeronyx Feb 25, 2024
f4996f2
Merge pull request #122 from BU-Spark/spring2024-researchmd
WilliamLee101 Mar 1, 2024
7105bac
EDA notebook; finished EDA on GBIF Dataset Scraping and CVH Dataset S…
PalmPalm7 Mar 9, 2024
fa4cda8
Merge pull request #123 from BU-Spark/eda_2024_03_08
WilliamLee101 Mar 9, 2024
f930342
Update EDA_spring2024.ipynb
PalmPalm7 Mar 20, 2024
07ec6ef
Added data files
PalmPalm7 Mar 20, 2024
582fa77
Added documentations and refractored the web scraper ETL from ipynb n…
PalmPalm7 Mar 31, 2024
a20a65e
completed PoC pipeline
PalmPalm7 Apr 1, 2024
f944598
added pdf
PalmPalm7 Apr 1, 2024
a700f68
Update poc.md
PalmPalm7 Apr 1, 2024
cd893ec
Update poc.md
PalmPalm7 Apr 1, 2024
1b995bc
Update poc.md
PalmPalm7 Apr 1, 2024
a75fc78
Update poc.md
PalmPalm7 Apr 1, 2024
638fd7c
Merge pull request #124 from BU-Spark/web_scraper_0320
WilliamLee101 Apr 2, 2024
e90bc34
Merge pull request #125 from BU-Spark/poc_2024_03_30
WilliamLee101 Apr 2, 2024
41ba10e
checkout git repo on SCC
Apr 26, 2024
077d2db
checkout git repo on SCC
Apr 26, 2024
9117830
fixed directory errors
PalmPalm7 Apr 26, 2024
a87d620
Fixing duplicates
PalmPalm7 Apr 26, 2024
73556e2
1000 randomly selected samples for further benchmark testings
PalmPalm7 May 7, 2024
66dad52
Merge branch 'benchmark_2024_04_26' of github.com:BU-Spark/ml-herbari…
PalmPalm7 May 7, 2024
555be2e
added datasets samples
PalmPalm7 May 7, 2024
f738a2e
temp commit, unfinished
PalmPalm7 May 7, 2024
aa1bf54
demo folder
PalmPalm7 May 7, 2024
e70b430
Add files via upload
mkaramb May 7, 2024
0835530
Delete Spring2024/demo/dpcumentai_batch_processing_app directory
mkaramb May 7, 2024
d868b01
Add files via upload
mkaramb May 7, 2024
a13dc10
Update README.md
mkaramb May 7, 2024
f2a3c54
Update README.md
mkaramb May 7, 2024
4f47f0b
Update README.md
mkaramb May 7, 2024
9b41232
Update README.md
mkaramb May 7, 2024
8a67df5
Delete Spring2024/demo/documentai_batch_processing_app/herbaria-ai-3c…
mkaramb May 8, 2024
461fefe
Update README_scraper.md
PalmPalm7 May 8, 2024
2f3f20c
Updates on README.md
PalmPalm7 May 8, 2024
a56c0d5
Final updates
PalmPalm7 May 8, 2024
cecf8e0
Merge pull request #126 from BU-Spark/benchmark_2024_04_26
funkyvoong Sep 24, 2024
6,827 changes: 6,827 additions & 0 deletions Spring2024/EDA_spring2024.ipynb

Large diffs are not rendered by default.

137 changes: 137 additions & 0 deletions Spring2024/README.md
@@ -0,0 +1,137 @@
# ML-Herbarium Spring 2024 Summary Report

## 1. Overview

The Spring 2024 team continues the machine-learning-based approach to digitizing and mobilizing Asian herbarium collections. Our work is guided by our clients Professor Charles Davis, Professor Thomas Gardos, and Solutions Engineer Michelle Voong (via the NSF grant [with Prof. Charles Davis](https://oeb.harvard.edu/news/herbaria-awarded-47-million-mobilize-digital-collections-asian-plant-biodiversity) and [with BU Spark!](https://oeb.harvard.edu/news/herbaria-awarded-47-million-mobilize-digital-collections-asian-plant-biodiversity)).

We built a pipeline combining commercial OCR with an LLM, achieving strong accuracy: **62.3%** on taxon names, **98.5%** on collection locations (province/country), **89.2%** on collector names, and **80.4%** on collection dates. The work also opened doors for potential collaboration with the Chinese Virtual Herbarium (CVH).

## 2. Achievements

### 2.1 Pipeline and Performance

We have built a highly accurate pipeline and benchmarked it on **1,000 samples** randomly selected from the 15,000 samples we scraped from the [CVH dataset](https://www.cvh.ac.cn/index.php). The performance results below were obtained with Document AI and GPT-4-Turbo.

* Highly accurate:
  * Taxon name: 62.3%
  * Collection location (Province/Country): 98.5%
  * Collector name: 89.2%
  * Collection date: 80.4%
* Cost effective:
  * 351 min / 1,000 samples
  * $66.5 / 1,000 samples
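For illustration, below is a minimal sketch of the LLM extraction step, assuming the label text has already been produced upstream by an OCR service such as Document AI. It uses the OpenAI Python client; the prompt wording, field names, and `extract_fields` helper are illustrative, not the exact code of the benchmark pipeline.

```python
# Illustrative sketch of the field-extraction step (not the exact benchmark code).
# Assumes OCR has already produced `ocr_text` for one specimen label.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_fields(ocr_text: str) -> dict:
    """Ask GPT-4-Turbo to pull the four benchmark fields out of the OCR text."""
    prompt = (
        "Extract the taxon name, collection location (province/country), "
        "collector name, and collection date (YYYYMMDD) from this herbarium "
        "label text. Reply with a JSON object with the keys "
        "'taxon', 'location', 'collector', 'date'.\n\n" + ocr_text
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice the reply may need more defensive parsing than this.
    return json.loads(response.choices[0].message.content)
```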

The accuracy metrics are calculated as follows:

* Taxon accuracy definition: exact matching after both the groundtruth and the extracted taxon name are preprocessed, mainly to strip the author (scholar) name (e.g., Lysimachia fortunei Maxim. --> Lysimachia fortunei).
* Taxon accuracy caveat: the groundtruth itself (scraped from the website) contains discrepancies.
* Location accuracy definition: exact matching of the Province / Municipality name (the granularity required by Charles). The groundtruth holds a similar geographical granularity, so the metric ignores finer granularity (e.g., city, village, road).
* Collector accuracy definition: exact matching of collectors. The groundtruth often omits second authors (et al.).
* Collection date accuracy definition: exact matching of the YYYYMMDD timestamp.
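As a rough sketch of how these exact-match metrics can be computed, the snippet below implements the taxon-name case; the two-word preprocessing rule is a simplification of what benchmark_spring2024.ipynb actually does.

```python
def preprocess_taxon(name: str) -> str:
    """Keep only the genus and species epithet, dropping the author (scholar) name.

    Simplified rule for illustration, e.g. "Lysimachia fortunei Maxim." -> "lysimachia fortunei".
    """
    parts = name.strip().split()
    return " ".join(parts[:2]).lower()


def exact_match_accuracy(groundtruth: list[str], predictions: list[str]) -> float:
    """Fraction of samples whose preprocessed prediction equals the preprocessed groundtruth."""
    hits = sum(
        preprocess_taxon(gt) == preprocess_taxon(pred)
        for gt, pred in zip(groundtruth, predictions)
    )
    return hits / len(groundtruth)
```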

A demo can be found at: [https://huggingface.co/spaces/spark-ds549/TrOCR](https://huggingface.co/spaces/spark-ds549/TrOCR)

### 2.2 Benchmark

On the accuracy side, previous semesters' work mainly focused on

* **Approach 1**: open-source models (DETR, CRAFT, TrOCR, TaxoNERD) with GBIF datasets (SU23 and prior), and
* **Approach 2**: commercial OCR / ViT + LLM (FA23),

but both showed significant drawbacks. Approach 1's CV models were not fine-tuned for botanical tasks, its first step (DETR) pruned 30% of the 1,000 labeled samples (a significant handicap for downstream tasks), and TaxoNERD (an NER model for herbaria) only works on English text. Approach 2 showed significantly lower accuracy on Chinese and Cyrillic text.

![png](./cvh_images_examples/benchmark_results.png)

In terms of cost and time, our benchmark results were:

* 351 min / 1,000 samples
* $66.5 / 1,000 samples


The time performance was measured with a single linear thread for Document AI and GPT-4-Turbo (input $10.00 / 1M tokens, output $30.00 / 1M tokens). By comparison, one manual labeler takes around 8 to 16 hours and costs roughly $50 to $150 through an outsourcing service provider ([source1](https://mark.hk.cn/pricing/#), [source2](https://ai.baidu.com/support/news?action=detail&id=3192), [source3](https://scale.com/docs/rapid-faq)), without guaranteeing the same accuracy.
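For context, the dollar figure follows directly from the token pricing; the token counts in the example below are placeholders rather than measured values, and the Document AI charge is not included.

```python
# Back-of-the-envelope GPT-4-Turbo cost per sample (Document AI cost not included).
INPUT_PRICE_PER_M = 10.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 30.00  # USD per 1M output tokens


def llm_cost_per_sample(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one sample, given its token counts (placeholders, not measured)."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Example with placeholder counts: llm_cost_per_sample(3000, 300) -> 0.039 USD per sample.
```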

Furthermore, if a future team seeks to recreate Approach 1, please refer to the README.md under /trocr for detailed instructions. If problems arise (likely), please refer to the GitHub issues or the [Hugging Face discussions section](https://huggingface.co/spark-ds549/detr-label-detection/discussions/3).

Benchmark pipeline: /ml-herbarium/Spring2024/benchmark_spring2024.ipynb

### 2.3 CVH Scraper

During our search for training and validation datasets, we located the Chinese Virtual Herbarium (CVH), the largest herbarium in China, which has collected around 10 million samples, 2.8 million of them hand-labeled by identifiers over 20 years. A typical sample they host contains:

1. A high-resolution image of the specimen, including:
   * Image of the dried plant collection
   * Label created by the collector, documenting:
     * Taxon name
     * Collector
     * Collection date
     * Collection locality
     * Habitat
   * Label created by the identifier, documenting:
     * Identified taxon name
     * Identifier name
     * Identification date
2. CVH's digitized documentation (an illustrative record sketch follows this list), containing:
   * Taxonomy
   * Taxon name
   * Scientific Name
   * Chinese Name
   * Identified By
   * Identification Date
   * Collector
   * Collector's No.
   * Collection Date
   * Locality
   * Elevation
   * Habitat
   * Life Form
   * Reproductive Condition
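Purely as an illustration, one might represent such a digitized record with a dataclass like the one below; the field names are our own mapping of CVH's labels, not an official schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CVHSpecimenRecord:
    """Illustrative mapping of CVH's digitized documentation fields (not an official schema)."""
    taxon_name: str
    scientific_name: str
    chinese_name: Optional[str] = None
    identified_by: Optional[str] = None
    identification_date: Optional[str] = None
    collector: Optional[str] = None
    collector_no: Optional[str] = None
    collection_date: Optional[str] = None
    locality: Optional[str] = None
    elevation: Optional[str] = None
    habitat: Optional[str] = None
    life_form: Optional[str] = None
    reproductive_condition: Optional[str] = None
```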

It is worth noting that most modern samples contain printed Chinese and English labels created by both the identifier and the collector, with a high-contrast white background and black font.

![png](https://www.cvh.ac.cn/controller/spms/image.php?institutionCode=PE&catalogNumber=02334131)

<p align="center">
Example 1: Anaphalis margaritacea (L.) Benth. & Hook. f.
<a href="https://www.cvh.ac.cn/spms/detail.php?id=e6e73365">https://www.cvh.ac.cn/spms/detail.php?id=e6e73365</a>
</p>

However, older samples may contain handwritten Chinese and English labels from the collector on a darker, harder-to-read background, while also likely carrying a printed label from the identifier. Thus, when measuring OCR precision, it is extremely important to identify which label (the older handwritten label by the collector or the newer printed label by the identifier) we are extracting from.

![png](https://www.cvh.ac.cn/controller/spms/image.php?institutionCode=PE&catalogNumber=01996346)

<p align="center">
Example 2: Symplocos Jacq.
<a href="https://www.cvh.ac.cn/spms/detail.php?id=e82ce487">https://www.cvh.ac.cn/spms/detail.php?id=e82ce487</a>
</p>

The webpage has a dynamic, PHP-rendered layout, so a Selenium automation script was produced to scrape the results.

Please refer to ml-herbarium/Spring2024/scraper/README_scraper.md for instructions.
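For orientation only, here is a minimal sketch of that kind of Selenium scrape; the listing URL and the CSS selector are placeholders, not the ones used in the actual scraper, which is documented in README_scraper.md.

```python
# Minimal Selenium sketch for a dynamically rendered listing page.
# The URL and CSS selector are placeholders; see Spring2024/scraper/README_scraper.md
# for the actual scraper.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://www.cvh.ac.cn/spms/list.php")  # placeholder listing URL
    # Wait until the JavaScript-rendered result rows are present in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.specimen-row"))  # placeholder selector
    )
    links = [a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "a.specimen-row")]
    print(f"Found {len(links)} specimen detail links")
finally:
    driver.quit()
```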


### 2.4 Collaboration with the Chinese Virtual Herbarium
I (Handi Xie, @PalmPalm7) have successfully established communication with CVH; they hope to collaborate with BU Spark!, and our work is mutually beneficial.
A detailed record of the correspondence can be found at [DS 549 - SP24 - Harvard Herberia - Communication with CVH](https://docs.google.com/document/d/1V_uP6HtzuC6917mslPUZzEJtAX-lI_cGUjk92zC6l0k/edit).

Summary of the communication:

1. CVH is willing to provide us the necessary datasets in exchange for authorship in the final academic output.
2. CVH could provide expert labelers, but these resources are in high demand.
3. Edge cases CVH has discovered:
   * Localities are prone to many errors due to:
     * Transcribers' manual errors (homophones in Chinese, where the same sound maps to different characters, can result in vast differences)
     * Vague descriptions (e.g., "300 meters from village A, turn right onto road B for 50 meters, collections were found under the bridge")
4. Detailed collaboration methods are still to be discussed.

Note: CVH's roughly 8 million records could be highly beneficial for a multimodal model with a herbaria domain focus.


## 3. Words to future teams
All the past developers are more than happy to guide and discuss the future of this amazing project! You can reach out to us at:

* (SP24) Andy Xie [email protected]
* (SP24) George Trammell [email protected]
* (SP24) Max Karambelas [email protected]
* (FA23) Smriti Suresh [email protected]
* (SP23 and SU23) Kabilan Mohanraj [email protected]
69 changes: 69 additions & 0 deletions Spring2024/Technical Project Document.md
@@ -0,0 +1,69 @@
# Technical Project Document
### George Trammell, Max Karambelas, Andy Xie - 2024-Feb-8 0.0.1-dev
## Overview
In this document, based on the available project outline and summary of the project pitch, to the best of your abilities, you will come up with the technical plan or goals for implementing the project such that it best meets the stakeholder requirements.

A. Provide a solution in terms of human actions to confirm if the task is within the scope of automation through AI.

* Manually identifying and segmenting the label from the herbarium sheet.
* Reading and transcribing the text from the label, which includes taxon, geography, collection code, barcode, location, date collected, collector name, collector number, and habitat.
* Entering the transcribed data into a database.
* Validating the accuracy of the transcription against known data.

B. Problem Statement:
The project aims to automate the transcription of handwritten labels from herbarium specimens into a digital format. Specifically, it is a machine learning problem that involves developing and improving OCR (Optical Character Recognition) models, with a focus on LSTM-RNN and Transformer-based deep learning models, to accurately recognize and transcribe text from images of specimen labels. This includes enhancing OCR functionality for Chinese characters and integrating metadata and contextual information to improve accuracy.
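As a minimal, hedged sketch of the kind of Transformer-based OCR inference involved, the snippet below runs an off-the-shelf TrOCR checkpoint with Hugging Face Transformers; the checkpoint and image path are illustrative, and handling Chinese labels would require a different or fine-tuned checkpoint.

```python
# Off-the-shelf TrOCR inference sketch (illustrative; not the project's final model).
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("specimen_label_crop.jpg").convert("RGB")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```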

C. Checklist for project completion

Provide a bulleted list, to the best of your current understanding, of the concrete technical goals and artifacts that, when complete, define the completion of the project. This checklist will likely evolve as your project progresses.

* Develop an improved OCR model capable of handling Chinese characters.
* Test and validate the OCR model's accuracy on a dataset of pre-1940 plant specimen images.
* Incorporate metadata and contextual information into the model to enhance accuracy.
* Create clean code and thorough documentation for the project.

D. Outline a path to operationalization.
For this refined project focusing on the improvement of OCR functionality for digitizing natural history specimens, particularly with an emphasis on Chinese characters, and building a public repository, operationalization involves specific technological solutions and collaboration strategies. The project aims to enhance OCR accuracy by incorporating advanced deep learning models such as LSTM-RNN and Transformer models, while also considering the use of metadata and contextual information (e.g., location, collector details) as knowledge priors to improve classification processes. This necessitates a multi-faceted approach involving data gathering from specified sources, model refinement, and the creation of a publicly accessible repository for disseminating the results.
To make the project's outcomes accessible and usable beyond a Jupyter notebook or initial proof of concept, a web-based platform or API could be developed, allowing researchers and the public to upload herbarium images for OCR processing. This platform could be hosted on cloud services like AWS, Google Cloud, or Azure, providing scalable resources for processing and storage. GitHub will serve as the repository for both the codebase and the dataset, facilitating collaboration and open-source contributions. Technologies like Docker could be employed to containerize the application, ensuring ease of deployment and compatibility across different environments. Additionally, integrating the project's outputs into existing databases or platforms frequented by climate change scientists and biodiversity researchers, such as the GBIF, could further extend its impact and utility.
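As one hedged illustration of the web-based platform or API described above (a sketch, not a committed design), a minimal FastAPI upload endpoint might look like the following; `run_ocr` is a placeholder for whichever OCR backend is ultimately chosen.

```python
# Hypothetical upload-and-transcribe endpoint (sketch only, not a committed design).
from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Herbarium OCR (sketch)")


def run_ocr(image_bytes: bytes) -> str:
    """Placeholder for the chosen OCR backend (e.g. a TrOCR model or a commercial OCR API)."""
    raise NotImplementedError


@app.post("/transcribe")
async def transcribe(image: UploadFile = File(...)) -> dict:
    contents = await image.read()
    return {"filename": image.filename, "text": run_ocr(contents)}

# Run locally with:  uvicorn main:app --reload
# Containerizing this app with Docker would make deployment to AWS/GCP/Azure straightforward.
```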


## Resources
### Data Sets
CNH Portal: https://portal.neherbaria.org/portal/

Pre-1940 plant specimen images in GBIF: https://www.gbif.org/occurrence/gallery?basis_of_record=PRESERVED_SPECIMEN&media_ty[…]axon_key=6&year=1000,1941&advanced=1&occurrence_status=present

International Plant Names Index: https://www.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5#dataDescription

Use for synonyms (GBIF is recommended): GBIF: https://hosted-datasets.gbif.org/datasets/backbone/current/

IPNI: https://storage.cloud.google.com/ipni-data/

CVIT: https://cvit.iiit.ac.in/research/projects/cvit-projects/matchdocimgs

IAM: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database

### References
CRAFT (text detection): https://arxiv.org/abs/1904.01941

TrOCR: https://arxiv.org/abs/2109.10282

"What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis," 10.1109/ICCV.2019.00481

Kubeflow: https://www.kubeflow.org/docs/

Hugging Face Spaces: https://huggingface.co/docs/hub/spaces

GCP Vertex AI: https://cloud.google.com/vertex-ai/docs

AWS SageMaker: https://docs.aws.amazon.com/sagemaker/index.html

TensorFlow Serving: https://github.com/tensorflow/serving

TorchServe: https://github.com/pytorch/serve

# Weekly Meeting Updates

Keep track of ongoing meetings in the Project Description document prepared by Spark staff for your project.
Note: Once this markdown is finalized and merged, its contents should also be appended to the Project Description document.

## Temp Link
https://docs.google.com/document/d/1AkQW9WFcBbHqGl8Js3KIth1u3vtOKAgWTyO3nsYgzYI/edit?usp=sharing
Will be migrated to the GitHub repo at the end of the semester.