Fix typo, acknowledge more contributors
lfoppiano authored Nov 29, 2023
1 parent 048eb6f commit 4e6f989
Showing 1 changed file with 16 additions and 11 deletions.
27 changes: 16 additions & 11 deletions README.md
@@ -19,13 +19,13 @@ license: apache-2.0
 ## Introduction

 Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
-The streamlit application demonstrate the implementaiton of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
-Differently to most of the projects, we focus on scientific articles.
-We target only the full-text using [Grobid](https://github.com/kermitt2/grobid) that provide and cleaner results than the raw PDF2Text converter (which is comparable with most of other solutions).
+The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
+Different to most of the projects, we focus on scientific articles.
+We target only the full-text using [Grobid](https://github.com/kermitt2/grobid) which provides cleaner results than the raw PDF2Text converter (which is comparable with most of other solutions).

 Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span stype="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span stype="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).

-The conversation is kept in memory up by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
+The conversation is kept in memory by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".

 (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
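
The buffered sliding window memory mentioned in the Introduction (the 4 most recent messages, injected into the context as "previous messages") can be illustrated with a minimal sketch. This is not the project's actual implementation: the class name `SlidingWindowMemory`, its methods, and the rendered text format are all hypothetical and chosen only for illustration.

```python
from collections import deque


class SlidingWindowMemory:
    """Hypothetical helper: keeps only the most recent `window_size` messages."""

    def __init__(self, window_size: int = 4) -> None:
        # A deque with maxlen drops the oldest entry automatically when full.
        self._messages = deque(maxlen=window_size)

    def add(self, role: str, text: str) -> None:
        """Store one message; older messages fall out of the window."""
        self._messages.append((role, text))

    def as_context(self) -> str:
        """Render the retained messages as a 'previous messages' block."""
        lines = [f"{role}: {text}" for role, text in self._messages]
        return "Previous messages:\n" + "\n".join(lines)


memory = SlidingWindowMemory(window_size=4)
for i in range(1, 7):
    memory.add("user", f"question {i}")

# Only questions 3 to 6 remain in the window.
print(memory.as_context())
```

With a window of 4, the first two of the six messages are discarded, so the prompt context stays bounded regardless of how long the conversation runs.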

@@ -35,9 +35,9 @@ The conversation is kept in memory up by a buffered sliding window memory (top 4

 ## Getting started

-- Select the model+embedding combination you want ot use
+- Select the model+embedding combination you want to use
 - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
-- Upload a scientific article as PDF document. You will see a spinner or loading indicator while the processing is in progress.
+- Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
 - Once the spinner stops, you can proceed to ask your questions

 ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)
@@ -53,9 +53,9 @@ With default settings, each question uses around 1000 tokens.

 ### Chunks size
 When uploaded, each document is split into blocks of a determined size (250 tokens by default).
-This setting allow users to modify the size of such blocks.
-Smaller blocks will result in smaller context, yielding more precise sections of the document.
-Larger blocks will result in larger context less constrained around the question.
+This setting allows users to modify the size of such blocks.
+Smaller blocks will result in a smaller context, yielding more precise sections of the document.
+Larger blocks will result in a larger context less constrained around the question.

 ### Query mode
 Indicates whether sending a question to the LLM (Language Model) or to the vector storage.
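
The "Chunks size" setting described above (documents split into blocks of 250 tokens by default) can be sketched as follows. This is an illustration under stated assumptions, not the application's code: the function name `split_into_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def split_into_chunks(text: str, chunk_size: int = 250) -> list[str]:
    """Hypothetical helper: split text into blocks of at most `chunk_size` tokens.

    Assumption: tokens are approximated by whitespace-separated words;
    a real tokenizer would count tokens differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# A synthetic 600-word document.
document = " ".join(f"word{i}" for i in range(600))

small_blocks = split_into_chunks(document, chunk_size=100)
default_blocks = split_into_chunks(document)

print(len(small_blocks), len(default_blocks))
```

Smaller blocks produce more, narrower pieces of the document, while larger blocks produce fewer pieces that each carry more surrounding text, which matches the trade-off described in the README.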
@@ -65,7 +65,7 @@ Indicates whether sending a question to the LLM (Language Model) or to the vecto
 ### NER (Named Entities Recognition)

 This feature is specifically crafted for people working with scientific documents in materials science.
-It enables to run NER on the response from the LLM, to identify materials mentions and properties (quantities, masurements).
+It enables to run NER on the response from the LLM, to identify materials mentions and properties (quantities, measurements).
 This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quanities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.


@@ -78,7 +78,9 @@ To release a new version:

 To use docker:

-- docker run `lfoppiano/document-insights-qa:latest`
+- docker run `lfoppiano/document-insights-qa:{latest_version)`
+
+- docker run `lfoppiano/document-insights-qa:latest-develop` for the latest development version

 To install the library with Pypi:

@@ -88,6 +90,9 @@ To install the library with Pypi:
 ## Acknolwedgement

 This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
+Contributed by Pedro Ortiz Suarez (@pjox), Tomoya Mato (@t29mato).
+Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).
+


