Merge pull request #111 from BU-Spark/research-doc-patch

Research doc patch

funkyvoong authored Jan 17, 2024
2 parents 0625948 + ae64058 commit df277a7
Showing 10 changed files with 591 additions and 251 deletions.
40 changes: 25 additions & 15 deletions trocr/README.md
@@ -15,7 +15,7 @@ This notebook is identical to the above notebook, except it does not include the

# Model Deployment Files
## trocr_with_detr_transcription.py
This is the main script for running the pipeline. This script performs several operations, such as object detection, text extraction, and Named Entity Recognition (NER) on a set of images.
This is the **main (inference)** script for running the pipeline. This script performs several operations, such as object detection, text extraction, and Named Entity Recognition (NER) on a set of images.

1. First, it initializes and runs a model called DETR to identify labels in each image and save their bounding boxes to a pickle file.
2. Second, it runs a text detection model called CRAFT on the images to identify areas containing text, saving these bounding areas to another pickle file.
@@ -42,13 +42,14 @@ python trocr_with_detr_transcription.py --input-dir /path/to/input --save-dir /p

## trocr.py
Contains all the functions related to running the TrOCR portion of the pipeline.
## utilities.py
Contains a number of functions which are primarily related to the invluded CVIT_Training.py file.

## requirements.txt
## [Not in use] utilities.py
Contains a number of functions which are primarily related to the included CVIT_Training.py file.

## [Not in use] requirements.txt
All required Python packages for running the pipeline.

## trocr_env.txt
## trocr_env.yml
Conda environment configuration to run the pipeline.

# Deployment instructions
Expand All @@ -65,32 +66,41 @@ cd ml-herbarium
git checkout dev
cd trocr
```
Create a new conda environment and activate it
Create a new conda environment with the environment YAML file and activate it
```
conda create -n my-conda-env python=3.9
conda env create -n my-conda-env --file=trocr_env.yml
conda activate my-conda-env
```

Install all required packages and Jupter
Install Jupyter and required packages
```
conda install jupyter
pip install -r requirements.txt
pip install taxonerd
pip install transformers==4.27.0 --no-deps
```
Currently, the setup uses `en_core_eco_biobert` model for entity recognition as part of the TaxoNERD pipeline. To download and add the model, run the folllowing command.
We install the `transformers` package separately because of the dependency requirement imposed by `spacy-transformers`. The conflict does not cause any issues at runtime, although pip reports an installation error.
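
A quick way to confirm that the pinned release is the one in use (a sanity check, not part of the original setup):
```python
import transformers

# The pipeline expects the pinned 4.27.0 release installed above
assert transformers.__version__ == "4.27.0", transformers.__version__
```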

Currently, the setup uses `en_core_eco_md` (and `en_core_eco_biobert`) models for entity recognition as part of the TaxoNERD pipeline. To download and add the models, run the following commands.
```
pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.0/en_core_eco_md-1.0.2.tar.gz
pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.0/en_core_eco_biobert-1.0.2.tar.gz
```
> **NOTE [SCC ONLY]:** If the `spacy` module throws an import error, you may have to uninstall the preinstalled cublas package with `pip uninstall nvidia-cublas-cu11`. This avoids conflicts between the CUDA module loaded on SCC and the packages installed from the requirements file.

Other available models can be viewed [here](https://github.com/nleguillarme/taxonerd#models). Respective model installation instructions can be found [here](https://github.com/nleguillarme/taxonerd#models:~:text=To%20download%20the%20models%3A).
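
Since the TaxoNERD models install as ordinary spaCy pipelines, a quick smoke test could look like the following (a sketch, assuming the `en_core_eco_md` install above succeeded):
```python
import spacy

# TaxoNERD models load like any other installed spaCy model package
nlp = spacy.load("en_core_eco_md")
doc = nlp("Acer saccharum was collected along the trail.")
print([(ent.text, ent.label_) for ent in doc.ents])  # the binomial should be tagged as a taxon
```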

To use the spaCy models for extracting location and date information, please run the following commands.
```
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download en_core_web_trf
```
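
A quick illustration of what these general-purpose models contribute (a sketch using `en_core_web_sm`; the larger models are used the same way):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Collected near Boston, Massachusetts on 12 June 1897.")
for ent in doc.ents:
    if ent.label_ in {"GPE", "LOC", "DATE"}:  # locations and dates
        print(ent.text, ent.label_)
```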

To start Jupyter Notebooks in the current folder, use the command
```
jupyter notebook
```

To run the pipeline, please execute the `cleaned_trocr_test.ipynb` notebook in the current (`trocr`) folder.
To run the pipeline, please execute the [`trocr_with_detr_label_extraction.ipynb`](https://github.com/BU-Spark/ml-herbarium/blob/research-doc-patch/trocr/trocr_with_detr_label_extraction.ipynb) notebook in the current (`trocr`) folder.

> **For docker deployment instructions, refer to the `docker` folder inside the current (`trocr`) folder.**

> **NOTE:** It is HIGHLY recommended to run the pipeline on a GPU (a V100 with 16 GB on SCC, so that multiple models in the pipeline can be hosted on the GPU; smaller GPUs have not been tested). Running on the CPU is significantly slower.
Expand All @@ -100,7 +110,7 @@ This column describes the position that a given image was processed
### Transcription
This column contains every transcription that was found in the image. They are ordered based on the relative position of the top left coordinate for each bounding box in an image.
## Transcription_Confidence
This contains the TrOCR model confidences in each transcription. This list of values is ordered based on the `Transcription` column (i.e. you can reference each individual transcription and its confidence using the same index number).
This contains the TrOCR model confidences in each transcription. This list of values is ordered based on the `Transcription` column (i.e., you can reference each individual transcription and its confidence using the same index number).
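
For example, to pair each transcription with its confidence (a sketch that assumes the pipeline's tabular output has been saved as a pandas pickle; the actual file name and location depend on your `--save-dir`):
```python
import pandas as pd

df = pd.read_pickle("results.pkl")  # hypothetical path to the saved results
row = df.iloc[0]
for text, conf in zip(row["Transcription"], row["Transcription_Confidence"]):
    print(f"{conf:.3f}  {text}")
```
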
## Image_Path
This is the absolute path to a given image.
## Bounding_Boxes
32 changes: 32 additions & 0 deletions trocr/docker/Dockerfile
@@ -0,0 +1,32 @@
# Use an official Miniconda3 as a parent image
FROM continuumio/miniconda3:latest

# Set the working directory in docker
WORKDIR /usr/src/app

# Declare argument for the conda environment name and persist it for use at runtime
ARG CONDA_ENV_NAME=trocr_env
ENV CONDA_ENV_NAME=$CONDA_ENV_NAME

# Clone the repository and switch to the dev branch
RUN git clone https://github.com/BU-Spark/ml-herbarium.git . && \
    git checkout dev

# Work from the trocr subdirectory, which contains trocr_env.yml
# (a `cd` inside a RUN layer would not persist to later instructions)
WORKDIR /usr/src/app/trocr

# Create a new conda environment from the YAML file and activate it
RUN conda env create -n $CONDA_ENV_NAME --file=trocr_env.yml && \
echo "conda activate $CONDA_ENV_NAME" >> ~/.bashrc

# Install Jupyter and other required packages
RUN conda install -n $CONDA_ENV_NAME jupyter -y && \
/opt/conda/envs/$CONDA_ENV_NAME/bin/pip install transformers==4.27.0 --no-deps && \
/opt/conda/envs/$CONDA_ENV_NAME/bin/pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.0/en_core_eco_md-1.0.2.tar.gz && \
/opt/conda/envs/$CONDA_ENV_NAME/bin/pip install https://github.com/nleguillarme/taxonerd/releases/download/v1.5.0/en_core_eco_biobert-1.0.2.tar.gz && \
/opt/conda/envs/$CONDA_ENV_NAME/bin/python -m spacy download en_core_web_sm && \
/opt/conda/envs/$CONDA_ENV_NAME/bin/python -m spacy download en_core_web_md && \
/opt/conda/envs/$CONDA_ENV_NAME/bin/python -m spacy download en_core_web_trf

# Make port 8888 available to the world outside this container
EXPOSE 8888

# Run Jupyter Notebook when the container launches (shell form, so $CONDA_ENV_NAME expands at runtime)
CMD /opt/conda/envs/$CONDA_ENV_NAME/bin/jupyter notebook --ip='*' --port=8888 --no-browser --allow-root
35 changes: 35 additions & 0 deletions trocr/docker/ReadMe.md
@@ -0,0 +1,35 @@
# Build and Run Instructions
## **Build the Docker Image:**
Navigate to the directory containing the Dockerfile and run:
```sh
docker build --build-arg CONDA_ENV_NAME=<your-conda-env-name> -t my-herbarium-app .
```
Replace `<your-conda-env-name>` with the desired conda environment name.

> ### Notes
> - If you don't provide the `--build-arg` while building, the default value `trocr_env` will be used as the conda environment name.
> - Remember to replace `<your-conda-env-name>` with the actual name you want to give to your conda environment when building the Docker image.
## **Run the Docker Container:**
### Using Docker Bind Mounts
When you run your Docker container, you can use the `-v` or `--mount` flag to bind-mount a directory or a file from your host into your container.
#### Example
If you have the input images in a directory named `images` on your host, you can mount this directory to a directory inside your container like this:
```sh
docker run -v $(pwd)/images:/usr/src/app/images -p 8888:8888 my-herbarium-app
```
or
```sh
docker run --mount type=bind,source=$(pwd)/images,target=/usr/src/app/images -p 8888:8888 my-herbarium-app
```
Here:
- `$(pwd)/images` is the absolute path to the `images` directory on your host machine.
- `/usr/src/app/images` is the path where the `images` directory will be accessible from within your container.
> ### Note
> When using bind mounts, any changes made to the files in the mounted directory will be reflected in both the host and the container, since they are actually the same files on the host’s filesystem.
> ### Modification in Script
> We would need to modify the script to read images from the mounted directory (`/usr/src/app/images` in this example) instead of the original host directory.
trocr/evaluation-dataset/handwritten-typed-text-classification/ReadMe.md
@@ -5,7 +5,9 @@
> 1. You can further experiment with the `trocr-base-handwritten` model instead of the TrOCR large model.
## Overview
Here, we aim to build a pipeline to classify handwritten text and typed/machine-printed text extracted from images. The ultimate goal of this pipeline is to classify the plant specimen images into typed/handwritten categories to create an evaluation set. The evaluation set will be used to test the main TrOCR pipeline. We utilize various machine learning models and techniques for this purpose.

The following sections of the ReadMe shed light on the files and folders in this directory and how to run the model scripts. For detailed insights on the models explored and the results of the implementation, refer to the [research.md](https://github.com/BU-Spark/ml-herbarium/blob/research-doc-patch/trocr/evaluation-dataset/handwritten-typed-text-classification/research.md) file.

## Getting Started

trocr/evaluation-dataset/handwritten-typed-text-classification/research.md
@@ -0,0 +1,45 @@
# [Research] TrOCR Encoder + FFN Decoder

#### Overview
To create a robust classification model for our task, multiple Convolutional Neural Network (CNN) models were explored and assessed. Details of each attempted model, along with their respective implementations, can be accessed in the [Introduction section](https://github.com/BU-Spark/ml-herbarium/blob/dev/trocr/evaluation-dataset/handwritten-typed-text-classification/notebooks/Classifier_NN.ipynb) of the project's Jupyter Notebook.

#### Issues Encountered with CNNs
During experimentation, I identified fundamental limitations with how CNNs process images containing text, affecting our ability to accurately classify text in images into either handwritten or machine-printed categories.

Specifically, it was observed that the text in images, particularly handwritten text, constitutes a minimal portion of the image in terms of pixel count, thereby reducing our Region of Interest (ROI). This small ROI posed challenges in information retention and propagation when image filters were applied, leading to the loss of textual details. To mitigate this, I employed the morphological operation of **erosion** on binarized images to emphasize the text, effectively enlarging the ROI. This process proved useful in counteracting some of the undesirable effects of CNN filters and preserving the integrity of the text in the images.
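
A minimal sketch of this preprocessing step (assuming OpenCV and a dark-text-on-light-background scan; file names are placeholders):

```python
import cv2
import numpy as np

img = cv2.imread("specimen_label.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

# Otsu binarization, then erosion: eroding the white background thickens the
# dark text strokes, enlarging the effective region of interest
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(binary, kernel, iterations=1)
cv2.imwrite("specimen_label_eroded.png", eroded)
```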

#### Methodology
Given the encountered limitations with CNNs, I approached the classification task in two primary steps to circumvent the challenges:

1. **Feature Extraction with TrOCR Encoder:**
Leveraged the encoder part of the TrOCR model to obtain reliable feature representations from the images, focusing on capturing the inherent characteristics of text. The TrOCR encoder was chosen because its feature representations must retain fine-grained textual detail in order to be decodable into characters; Convolutional Neural Networks (CNNs), by contrast, do not necessarily preserve such detailed textual information.

2. **Training a Custom FFN Decoder:**
Employed a custom Feed-Forward Neural Network (FFN) as the decoder to make predictions based on the feature representations extracted from the encoder. The model was trained specifically to discern the subtle differences in features between the two categories.

This methodology enabled us to maintain a high level of accuracy and reliability in our classification task while overcoming the inherent shortcomings identified in CNN models for processing images with text.
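
A minimal sketch of this two-stage setup (assuming the base handwritten TrOCR checkpoint, a frozen encoder, and mean pooling over patch tokens; the actual notebook may differ):

```python
import torch
import torch.nn as nn
from transformers import VisionEncoderDecoderModel

trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
encoder = trocr.encoder  # keep the ViT encoder, discard the text decoder

class HandwritingClassifier(nn.Module):
    """Custom FFN decoder over frozen TrOCR encoder features."""

    def __init__(self, hidden_size: int = 768, num_classes: int = 2):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, num_classes),
        )

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # the encoder stays frozen; only the FFN is trained
            features = encoder(pixel_values=pixel_values).last_hidden_state
        pooled = features.mean(dim=1)  # average over patch tokens
        return self.ffn(pooled)
```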

#### Readings

The [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper inspired me to use this encoder-decoder architecture. In this paper, the authors use multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Additionally, BERT-like architectures also act as an inspiration to the encoder-decoder paradigm.

Using an FFN as a decoder after feature extraction is valuable across classification tasks, especially when dealing with specialized forms of data like text in images, because it allows us to define a custom network specific to our task.

#### Results Summary

In our handwritten vs. typed-text classification task, the model performed impressively with an overall accuracy of \(96\%\). The test samples were handpicked to be challenging for the model to classify (since some of these were misclassified by a human).

- *Handwritten Text Class:*
- *Precision:* \(97.96\%\)
- *Recall:* \(96.00\%\)
- *F1-Score:* \(96.97\%\)
- *Support:* 50 samples

- *Typed Text Class:*
- *Precision:* \(96.23\%\)
- *Recall:* \(98.08\%\)
- *F1-Score:* \(97.14\%\)
- *Support:* 52 samples

The balanced performance across both classes, as shown in the nearly identical macro average and weighted average metrics, demonstrates the model's robustness in distinguishing between handwritten and typed texts.

2 changes: 2 additions & 0 deletions trocr/label-extraction/ReadMe.md
@@ -15,6 +15,8 @@
## Overview
Here, we aim to use DETR (DEtection TRansformer) to segment labels from our plant sample images through object detection.

The following sections of the ReadMe shed light on the files and folders in this directory and how to run the model scripts. For detailed insights on the models explored and the results of the implementation, refer to the [research.md](https://github.com/BU-Spark/ml-herbarium/blob/research-doc-patch/trocr/label-extraction/research.md) file.
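
For reference, a generic DETR inference sketch using the Hugging Face API (the `facebook/detr-resnet-50` checkpoint and the 0.9 threshold are placeholders; the project's fine-tuned model and settings may differ):

```python
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("specimen.jpg")  # hypothetical plant sample image
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Convert raw predictions to (x0, y0, x1, y1) boxes in image coordinates
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.9
)[0]
for score, box in zip(results["scores"], results["boxes"]):
    print(round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```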

## Getting Started

### Prerequisites and Installation
