Daily‐Updates
Changes include: i. Integration of the Milvus vector store, along with Cohere for multilingual embeddings.
- While I am very fond of our Canadian Cohere team, the latest performance benchmarks suggest we may switch back to an OpenAI model. OpenAI's widespread adoption also means documentation, research, and tutorials are far easier to find.
- In addition, we are exploring the language-translation capabilities of other models to determine the best-suited options for development testing once we're ready.
ii. I've revisited llama-index and included its index modules within our Milvus vector integration. llama-index expands the hub of tools available to the chain, which means we can keep developing in the future without having to revise as much infrastructure.
iii. I've included the pyPDF and Tesseract libraries, adding an additional OCR layer to the ingestion process. Our model struggled to pull information from the tables and charts within patient notes. Unfortunately, I'm still unable to share the actual patient notes given privacy concerns at this stage of development. I'll continue to release the ingest files with a quick 'query' command so the application can be tested as a typical Q&A-over-docs application. The file-loader modules are still being redesigned for simplicity; I find them a bit janky as they stand.
iv. We've implemented the NLTK library as well. Its standard text-processing utilities should help keep the application robust as development continues beyond the initial release. A rough sketch of how these pieces might fit together in the ingestion pipeline follows below.
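To make items i–iv concrete, here is a minimal, hypothetical sketch of how the pieces could fit together: OCR-aware PDF loading (pypdf + Tesseract), sentence-aware chunking (NLTK), Cohere multilingual embeddings, and storage in Milvus via Langchain. File paths, the collection name, chunk sizes, and the Cohere model name are placeholders rather than the real Medex configuration, and llama-index's Milvus integration could be wrapped around the same store.

```python
# Hypothetical ingestion sketch: pypdf + Tesseract OCR -> NLTK chunking -> Cohere
# embeddings -> Milvus. Paths, names, and sizes are placeholders, not the Medex config.
import os

import nltk
import pytesseract
from nltk.tokenize import sent_tokenize
from pdf2image import convert_from_path
from pypdf import PdfReader
from langchain.embeddings import CohereEmbeddings
from langchain.vectorstores import Milvus

nltk.download("punkt", quiet=True)


def load_pdf_with_ocr(path: str) -> list[str]:
    """Extract text per page; fall back to Tesseract OCR for image-only pages."""
    pages = []
    reader = PdfReader(path)
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if not text:  # likely a scanned page (tables/charts) -> OCR it
            image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return pages


def chunk_by_sentences(pages: list[str], max_chars: int = 1000) -> list[str]:
    """Group whole sentences into chunks so embeddings never split mid-sentence."""
    chunks, current = [], ""
    for page in pages:
        for sentence in sent_tokenize(page):
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


if __name__ == "__main__":
    chunks = chunk_by_sentences(load_pdf_with_ocr("sample_patient_file.pdf"))
    embeddings = CohereEmbeddings(
        model="embed-multilingual-v2.0",  # Cohere's multilingual embedding model
        cohere_api_key=os.environ["COHERE_API_KEY"],
    )
    store = Milvus.from_texts(
        chunks,
        embeddings,
        collection_name="medex_notes",
        connection_args={"host": "localhost", "port": "19530"},
    )
    hits = store.similarity_search("What medications is the patient taking?", k=3)
    print(hits[0].page_content)
```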
More to follow. We're working diligently, and keeping this repo updated so the public knows what we're trying to do is a big priority of mine. Updates will continue as frequently as possible.
Yesterday, we hosted the Medex MLP tournament, which essentially simulated best-case scenarios for code components to be implemented into our application. The two categories we ran simulations for were (1) vector stores and (2) LLMs. Ultimately, Milvus came out on top for the vector store and Cohere for the LLM of choice. There are more categories we'll be testing in the future, but we've spent most of the past couple of days updating our block components to reflect a modernized Medex application. I believe that both Milvus and Cohere are the right choices for our application. Cohere is a Canadian company focused on advancing NLP, and its multilingual support will be important for a Canadian-founded company. Cohere also offers other benefits, such as embeddings with text-splitting and preprocessing included (more information to be provided). Milvus is a high-performance vector store and search engine that enables efficient embedding retrieval, with a variety of routing options to ensure the best retrieval for the appropriate context.
The purpose of the tournament was to feed GPT-4 relevant information pertaining to Medex and then run a single-elimination bracket, where half of the remaining choices were eliminated at each stage based on progressively narrower criteria. You can find more about the specific stages in our Wiki here.
Lastly, while the tournament served a purpose in defining some of Medex's initial software parameters, it was certainly not a scientific exercise; mostly I had fun exploring some of the Langchain integrations, since my day was spent researching these topics anyway. Overall, it was a fun experiment, and we'll continue to explore different challenges as we go.
I'm excited to have implemented the translation component for ingestion into a playground within Google Colab, and I'm able to share it directly by publishing it to the main page. You can tinker with your own medical questions, and our transformer will analyze PubMed for relevant articles, expand the embeddings based on the user query, create a hypothetical answer, search PubMed's findings for relevant answers, and then combine all of the user's information into a simple output for contextual understanding. Today we're implementing ideas from the OpenAI Cookbook [https://github.com/openai/openai-cookbook]. Recently, I discovered "Question answering using an API and HyDE," which I believe will be a successful preliminary implementation of how we're going to transform our user queries. The concept is simple enough:
- Step 1: Search. The user asks a question and GPT generates a list of potential queries; the search queries are executed in parallel.
- Step 2: Re-rank. Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user's question. Results are ranked and filtered based on this similarity metric.
- Step 3: Answer. Given the top search results, the model generates an answer to the user's question, including references and links. This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint without requiring the upkeep of a vector database. A minimal sketch of this flow follows.
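Here is a rough sketch of that three-step flow, assuming the openai Python package (the 0.27-era ChatCompletion/Embedding API) and a placeholder `search_pubmed` function standing in for whatever search endpoint we wire up; the model names are examples only.

```python
# Hypothetical sketch of the search / re-rank / answer flow described above.
# `search_pubmed` is a stub for a real search endpoint; model names are examples.
import numpy as np
import openai


def search_pubmed(query: str) -> list[str]:
    """Placeholder: return article abstracts for a query (real endpoint TBD)."""
    return [f"(stub abstract for query: {query})"]


def chat(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]


def embed(texts: list[str]) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])


def answer_question(question: str, top_k: int = 5) -> str:
    # Step 1: Search -- generate several search queries (run serially here for simplicity).
    queries = chat(f"List 3 short search queries for: {question}").splitlines()
    results = [doc for q in queries if q.strip() for doc in search_pubmed(q)]

    # Step 2: Re-rank -- embed a hypothetical ideal answer (HyDE) and rank results
    # by cosine similarity to it.
    hypothetical = chat(f"Write a plausible, ideal answer to: {question}")
    doc_vecs, hyde_vec = embed(results), embed([hypothetical])[0]
    scores = doc_vecs @ hyde_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(hyde_vec)
    )
    best = [results[i] for i in np.argsort(scores)[::-1][:top_k]]

    # Step 3: Answer -- answer the original question using only the top-ranked results.
    context = "\n\n".join(best)
    return chat(f"Using only these sources, answer '{question}' with references:\n{context}")
```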
We are just getting started uploading and creating the information to make this a coherent space. My best estimate is that it will be much easier to comprehend by the end of the month (July 2023). In the meantime, feel free to explore and check out what we have to offer. Alongside the Medex application itself, within the 'medex' folder you'll find our curated collection of resources that guide our research and understanding for this project. This includes Medex-specific research papers, course offerings, reading resources, helpful links, and more that will be beneficial to anyone working on Q&A applications. Alongside the development of the project, we strive to build an open platform not only for builders, but for researchers, inventors, creatives, and anyone else who can add to the repository in a positive way. You are a contributor simply by uploading a document that's relevant to the project.
Welcome to the Medex public repo, the public version for open-source Medex updates, changes, and contributions. Application-specific readme notes can be found within the main directory. Application overview (not complete; the below is a placeholder until the finished diagram is done):
Today, we're going to expand the ingestion pipeline by including multiple formats other than the .pdf format we've been using for testing to date.
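A hypothetical sketch of what that could look like, mapping file extensions to Langchain document loaders; the extension list and loader choices are assumptions for illustration, not the final Medex ingestion design.

```python
# Hypothetical multi-format loader: pick a Langchain document loader by file extension.
# The extension-to-loader mapping is an example only, not the final ingestion design.
from pathlib import Path

from langchain.document_loaders import (
    CSVLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredWordDocumentLoader,
)

LOADERS = {
    ".pdf": PyPDFLoader,
    ".txt": TextLoader,
    ".md": TextLoader,
    ".csv": CSVLoader,
    ".docx": UnstructuredWordDocumentLoader,
}


def load_documents(folder: str):
    """Load every supported file in a folder into Langchain Document objects."""
    docs = []
    for path in Path(folder).rglob("*"):
        loader_cls = LOADERS.get(path.suffix.lower())
        if loader_cls is None:
            continue  # skip unsupported formats for now
        docs.extend(loader_cls(str(path)).load())
    return docs


if __name__ == "__main__":
    documents = load_documents("./sample_docs")
    print(f"Loaded {len(documents)} document chunks")
```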
Today we're thinking about the strategy for what we're trying to retrieve. Our key elements for retrieval include, but are not limited to:
a. Quality of the retrieval response: the chat assistant needs to be able to take the query, translate it to the user's knowledge level, assess hypothetical/similar embeddings, evaluate multiple retrieval responses, and respond to the user with a coherent reply and appropriate metadata sources. a.a. The appropriate benchmarks and evaluation methods will need to be in place for this; that means collecting responses from other LLMs for comparison and defining evaluation criteria for quantifying responses (for example, a leaderboard on which our model should always strive for the top spot).
b. Along with retrieval, the user-query translation is important. I believe we only need some simple additions to get this working even modestly well, and we'll work on the robustness of the application once benchmarks and tests are passing (a rough sketch of the translation and scoring idea follows this list).
c. I think we're using a retrieve function in Langchain, but I've also found some interesting methods in Llama-Index. We need to begin creating a chart for evaluating the available methods and determining exactly how we want to achieve the main purpose of forward/backward layman-to-medical and medical-to-layman translation. We can begin by creating a list of possible retrievers that offer either partial functionality we can build on, or whole components that will need to be embedded within our source code.
d. I am concerned about hallucinations given the medical setting. So, just as important as the techniques used to transform the user query, we also need to be cognizant of where the retrieval information comes from and what the context of that document or file is. We know agents and (OpenAI) functions can reduce the likelihood of hallucinations, so that is the direction of our continued research in this space.
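As a rough, hypothetical sketch of items a and b: translate a layman query into clinical terminology before retrieval, and score candidate answers against a reference so they can be ranked on a simple leaderboard. The prompts, model name, and scoring metric below are placeholders, not settled design decisions.

```python
# Hypothetical sketch for items a-b: layman <-> medical query translation plus a crude
# similarity score for ranking candidate answers on a leaderboard. Prompts, the model
# name, and the scoring metric are placeholders only.
import numpy as np
import openai


def translate_query(query: str, direction: str = "layman-to-medical") -> str:
    """Rewrite a user query into clinical terminology (or back to plain language)."""
    prompt = (
        f"Rewrite this question in precise clinical terminology: {query}"
        if direction == "layman-to-medical"
        else f"Rewrite this in plain, patient-friendly language: {query}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp["choices"][0]["message"]["content"]


def similarity_score(answer: str, reference: str) -> float:
    """Crude evaluation metric: cosine similarity between answer and reference embeddings."""
    resp = openai.Embedding.create(
        model="text-embedding-ada-002", input=[answer, reference]
    )
    a, b = (np.array(d["embedding"]) for d in resp["data"])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def leaderboard(candidates: dict[str, str], reference: str) -> list[tuple[str, float]]:
    """Rank candidate answers (model name -> answer text) against a reference answer."""
    scores = {name: similarity_score(ans, reference) for name, ans in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```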
To get started, simply clone this repo and enter your key values in the playground. Once our initial setup here is complete, you should be able to enter your configurations into the playground code blocks as needed and then begin Q&A over your data. Given this is day two of our project, the code is still being updated. Keep in mind, this means health/medical-specific components will begin to be integrated in the coming weeks. As time progresses, generic data and documents will become less effective, and health/medical-specific data and documents will become more effective.
Last night, after evaluation and further consideration, I decided to implement the Llama-Index. The considerations for approval are the following:
a. Llama-Index integrates query revisions, such as the Transform Query function, which takes a user query and transforms it into one more likely to return the desired results.
b. The Hypothetical Document Embeddings (HyDE) query transform: HyDE takes the user query, generates a hypothetical answer, and then uses that answer's embedding to retrieve similar content.
c. It easily integrates with our existing Langchain and Medex-Index.
d. It brings improved ingestion functions, such as transformer-embedding acceleration, which uses a transformer model to generate embeddings for each chunk of text that are then used to index the document, allowing the application to quickly find similar documents or chunks. There is also query splitting, which lets the model query specific chunks in specific documents and return the results, and query transformations, where the query is expanded into five or more related queries, all answers are searched within the embeddings, and the best answer is then formed.
Generally, I believe implementing the Llama-Index will substantially accelerate our progress while providing moderate improvements to our existing models in the complex areas of forward/backward document processing.
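A minimal sketch of the HyDE query transform in llama-index, following the pattern from its mid-2023 documentation; the data directory and question are placeholders, and module paths may shift between llama-index releases.

```python
# Hypothetical sketch of llama-index's HyDE query transform (mid-2023 module layout;
# paths may differ in newer releases). The data directory and question are placeholders.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.indices.query.query_transform.base import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

# Build a simple vector index over local documents (OPENAI_API_KEY is read from the env).
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Wrap the default query engine so each user query is first expanded into a
# hypothetical answer, and that answer's embedding drives retrieval.
hyde = HyDEQueryTransform(include_original=True)
query_engine = TransformQueryEngine(index.as_query_engine(), query_transform=hyde)

response = query_engine.query("What do the notes say about the patient's medication history?")
print(response)
```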
These documents are being continuously updated. Today (6/7/23) was the first day of the project release. While I've been working diligently in the background, it's always been a goal of mine to create a meaningful public repository, and I'm excited to share this with you. I hope you enjoy it. I'm having some difficulty with the privacy concerns related to the sample_patient_files that I've been receiving. As of today (7/7/2023), I'm planning to black out the privacy details of these documents so that the samples can be used by everyone. However, given their length (2,000+ pages per document), it may take some time to ensure this is completed efficiently and according to our standards for user privacy, even at the prototyping stage. In the meantime, I will not be pushing the sample patient files until they're ready, likely next week.