A complete RAG system contains several flexible modules, such as document collections (corpora), retrieval algorithms (retrievers), and backbone LLMs, but the best practices for tuning these components are still unclear, hindering their optimal adoption in medicine. Below are notes on related research and implementation resources:
Clinical RAG Paper Reading
Almanac: Retrieval-Augmented Language Models for Clinical Medicine
Almanac frames language models as clinical knowledge bases that can use external tools such as search engines, medical databases, and calculators to answer queries about clinical concepts and the latest treatment recommendations.
The results suggest that retrieval augmentation effectively grounds clinical answers in verifiable source material.
Given a query, Almanac follows the steps below (a minimal pipeline sketch appears after the highlights):
Uses external tools (e.g., Tavily search) to retrieve relevant information
Synthesizes a response with citations referencing the source material
Highlights of Almanac:
Dynamic Retrieval:
Local Documents
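Below is a minimal Python sketch of Almanac's retrieve-then-synthesize loop. The in-memory corpus, keyword-overlap `retrieve`, and prompt format are illustrative stand-ins rather than Almanac's actual implementation, which queries external tools such as Tavily search, medical databases, and calculators.

```python
from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

# Stand-in corpus; Almanac would fetch snippets from external tools instead.
CORPUS = [
    Document("guideline-2023", "First-line therapy for condition X is drug A."),
    Document("review-2022", "Drug B is reserved for patients intolerant of drug A."),
]

def retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Toy keyword-overlap scoring standing in for tool-based search."""
    def score(doc: Document) -> int:
        return len(set(query.lower().split()) & set(doc.text.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(query: str, docs: list[Document]) -> str:
    """Number each snippet so the model can cite [1], [2], ... in its answer."""
    context = "\n".join(f"[{i + 1}] ({d.source}) {d.text}" for i, d in enumerate(docs))
    return (
        "Answer the clinical question using ONLY the sources below, "
        "citing them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

query = "What is the first-line therapy for condition X?"
print(build_prompt(query, retrieve(query, CORPUS)))
# Pass the printed prompt to any chat LLM to synthesize the cited answer.
```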
MedRAG: Benchmarking RAG for Medicine
RAG addresses hallucinations and outdated knowledge. MedRAG's experiments show that combining multiple medical corpora and retrievers achieves the best performance; they also reveal a log-linear scaling property and a "lost-in-the-middle" effect in medical RAG.
The performance of a RAG system is strongly tied to its choice of corpus and retriever. The benchmarked components are listed below, followed by a toy retriever sketch.
Retrievers:
Lexical Retriever: BM25
Semantic Retriever: Contriever, SPECTER, MedCPT
LLMs, evaluated in two settings:
with CoT prompting
with MedRAG + CoT
Corpora: general & domain-specific information sources
Wikipedia
PubMed, StatPearls, Textbooks, MedCorp
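The toy comparison below contrasts the two retriever families on an invented three-document corpus: a from-scratch BM25 scorer for lexical ranking, with a closing note on how a semantic retriever (e.g., MedCPT embeddings) would differ. Pure standard library; the formula follows classic Okapi BM25.

```python
import math
from collections import Counter

corpus = [
    "metformin is first line therapy for type 2 diabetes",
    "insulin therapy is used when oral agents fail",
    "statins reduce cardiovascular risk in diabetic patients",
]
docs = [d.split() for d in corpus]

def bm25_scores(query: str, docs, k1: float = 1.5, b: float = 0.75):
    """Classic BM25: rewards rare query terms, normalizes by document length."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query.split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

query = "first line therapy for diabetes"
for score, doc in sorted(zip(bm25_scores(query, docs), corpus), reverse=True):
    print(f"{score:5.2f}  {doc}")

# A semantic retriever would instead embed query and documents (e.g., with
# MedCPT or Contriever) and rank by cosine similarity, matching synonyms
# ("hyperglycemia" vs "diabetes") that lexical BM25 misses.
```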
STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking)
Clinical Multimodal Papers
PanDerm: A General-Purpose Multimodal Foundation Model for Dermatology
PanDerm is a multimodal dermatology foundation model pretrained on over 2M images. In the pretraining stage, PanDerm employs a novel combination of masked latent modeling and CLIP feature alignment for self-supervised learning, and it surpassed vision-language models such as CLIP, MONET, and BiomedCLIP in benchmark evaluations. Common tasks include:
Total body skin examination
Risk assessment at both patient and lesion levels
Differentiation of neoplastic from inflammatory diseases
Multimodal image analysis
Pathology Interpretation
Monitoring lesion changes
Predicting outcomes
Modalities include: total body photography (TBP), dermatopathology, clinical images, and dermoscopic images.
Core concepts:
Ablation study: a method commonly used in ML/DL to assess the importance of different components or features of a model. The main idea is to understand which parts of a model contribute most to its overall performance.
CLIP-based Teacher Model: CLIP (Contrastive Language-Image Pretraining) pairs an image encoder with a text encoder, trains them contrastively so matched image-text pairs align in a shared embedding space, and supports zero-shot classification. A minimal sketch of the contrastive objective follows this list.
CLIP-Large:
ViT-Large encoder: a vision transformer that processes images; the Large variant has 24 transformer layers and a wider hidden dimension than smaller ViT models.
A regressor: predicts the CLIP teacher's features from the masked latent representation, providing the feature-alignment target during pretraining.
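A minimal sketch of CLIP's symmetric contrastive objective, assuming PyTorch. The encoders are stubbed as linear layers purely for brevity (real CLIP-Large pairs a ViT-Large image encoder with a text transformer); PanDerm uses a frozen CLIP teacher's features as an alignment target rather than training CLIP from scratch.

```python
import torch
import torch.nn.functional as F

batch, in_dim, embed_dim = 8, 32, 16
image_encoder = torch.nn.Linear(in_dim, embed_dim)  # stand-in for ViT-Large
text_encoder = torch.nn.Linear(in_dim, embed_dim)   # stand-in for text transformer

images, texts = torch.randn(batch, in_dim), torch.randn(batch, in_dim)

# Project both modalities into a shared space and L2-normalize.
img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(texts), dim=-1)

# Cosine-similarity logits; the matched (i, i) pairs are the positives.
temperature = 0.07
logits = img_emb @ txt_emb.t() / temperature
labels = torch.arange(batch)

# Symmetric cross-entropy over rows (image->text) and columns (text->image).
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```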
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
BiomedGPT is the first open-source, lightweight vision-language foundation model designed as a generalist capable of performing diverse biomedical tasks. It was pretrained on a large-scale corpus comprising 592,567 images, approximately 183 million text sentences, 46,408 object-label pairs, and 271,804 image-text pairs.
Key Concepts:
Masked Modeling (a minimal sketch follows below)
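A minimal sketch of the masked-modeling idea, assuming PyTorch: randomly mask a fraction of input tokens and compute the reconstruction loss only on the masked positions. This illustrates the objective generically, not BiomedGPT's actual seq2seq architecture.

```python
import torch
import torch.nn.functional as F

vocab, dim, seq_len, mask_id = 100, 64, 12, 0
embed = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)  # predicts the original token id

tokens = torch.randint(1, vocab, (4, seq_len))  # a toy batch of token ids
mask = torch.rand(tokens.shape) < 0.15          # hide ~15% of positions
corrupted = tokens.masked_fill(mask, mask_id)

logits = head(embed(corrupted))                 # (batch, seq_len, vocab)
# Loss only on masked positions: reconstruct exactly what was hidden.
loss = F.cross_entropy(logits[mask], tokens[mask])
print(loss.item())
```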
Reference
Postgres pgvector Extension - Vector Database with PostgreSQL / Langchain Integration
OpenAI Embeddings and Vector Databases Crash Course
Mixpeek & FLUX for Multimodal RAG
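For the pgvector reference above, a minimal sketch of storing and querying embeddings, assuming a running PostgreSQL instance with the pgvector extension installed and the psycopg2 driver; the DSN, table name, and 3-dimensional vectors are placeholders (real embeddings are typically 768+ dimensions).

```python
import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")  # placeholder DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(
    """CREATE TABLE IF NOT EXISTS docs (
           id serial PRIMARY KEY,
           content text,
           embedding vector(3)  -- toy dimension for illustration
       );"""
)
cur.execute(
    "INSERT INTO docs (content, embedding) VALUES (%s, %s);",
    ("metformin dosing guidance", "[0.1, 0.2, 0.3]"),
)

# Nearest-neighbor search: <-> is pgvector's L2-distance operator
# (<=> is cosine distance, <#> is negative inner product).
cur.execute(
    "SELECT content FROM docs ORDER BY embedding <-> %s LIMIT 5;",
    ("[0.1, 0.2, 0.25]",),
)
print(cur.fetchall())
conn.commit()
```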