Image credit: VentureBeat with DALL-E 3

Multimodal RAG is growing, here’s the best way to get started

by · VentureBeat

As companies begin experimenting with multimodal retrieval augmented generation (RAG), providers of multimodal embeddings, which convert data into a form RAG systems can read, are advising enterprises to start small when embedding images and videos.

Multimodal RAG, which can surface a variety of file types including text, images and videos, relies on embedding models that transform data into numerical representations that AI models can read. Embeddings that can process all kinds of files let enterprises find information in financial graphs, product catalogs or any informational video they have, giving them a more holistic view of their company.

Cohere, which updated its embeddings model, Embed 3, to process images and videos last month, said enterprises need to prepare their data differently, ensure suitable performance from the embeddings, and better use multimodal RAG.

“Before committing extensive resources to multimodal embeddings, it’s a good idea to test it on a more limited scale. This enables you to assess the model’s performance and suitability for specific use cases and should provide insights into any adjustments needed before full deployment,” a blog post from Cohere staff solutions architect Yann Stoneman said. 
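One way to picture such a limited-scale test is to score a handful of hand-labeled queries before embedding a full corpus. The sketch below is an illustration and not something taken from Cohere's post: the metric (recall at k), the function names and the sample file names are all assumptions, and the ranked results would come from whichever embedding model is being trialed.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each pilot query needs at least one relevant item")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(
    results: Dict[str, List[str]],
    labels: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall@k over every labeled pilot query."""
    scores = [recall_at_k(results[query], labels[query], k) for query in labels]
    return sum(scores) / len(scores)


# Hypothetical pilot data: ranked results returned by the system under test,
# and the items a human reviewer marked as relevant for each query.
pilot_results = {
    "q1": ["chart_07.png", "memo_12.txt", "chart_02.png"],
    "q2": ["catalog_03.png", "catalog_09.png", "memo_01.txt"],
}
pilot_labels = {
    "q1": {"chart_07.png"},
    "q2": {"catalog_09.png", "catalog_11.png"},
}

score = mean_recall_at_k(pilot_results, pilot_labels, k=2)
print(f"mean recall@2 on the pilot set: {score:.2f}")
```

A low score on a small labeled set like this is a cheap signal that the data preparation or the model choice needs adjusting before a wider rollout.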

The company said many of the processes discussed in the post also apply to other multimodal embedding models.

Stoneman said that, depending on the industry, models may also need “additional training to pick up fine-grain details and variations in images.” He used medical applications as an example, where radiology scans or photos of microscopic cells require a specialized embedding system that understands the nuances in those kinds of images.

Data preparation is key

Before images are fed to a multimodal RAG system, they must be pre-processed so the embedding model can read them well.

Images may need to be resized to a consistent size. Organizations also need to decide whether to enhance low-resolution photos so important details don’t get lost, or to downscale very high-resolution pictures so they don’t strain processing time.
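The resizing decision can be sketched as a small planning function that picks target dimensions while preserving aspect ratio. This is a minimal illustration under stated assumptions: the 224 and 1024 pixel thresholds and the function name are hypothetical, and the actual resampling would be done by an imaging library of the organization's choice.

```python
from typing import Tuple


def plan_resize(
    width: int,
    height: int,
    min_side: int = 224,    # assumed lower bound, not from the article
    max_side: int = 1024,   # assumed upper bound, not from the article
) -> Tuple[int, int, str]:
    """Return (new_width, new_height, action) while keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    longest = max(width, height)
    shortest = min(width, height)

    if longest > max_side:
        # Too large: shrink so the longest side fits the processing budget.
        scale = max_side / longest
        action = "downscale"
    elif shortest < min_side:
        # Too small: enlarge, but never push the longest side past max_side.
        scale = min(min_side / shortest, max_side / longest)
        action = "upscale"
    else:
        return width, height, "keep"

    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, action


for size in [(4000, 3000), (160, 120), (800, 600)]:
    print(size, "->", plan_resize(*size))
```

Under these assumed thresholds, a 4000x3000 photo would be planned down to 1024x768, while an 800x600 image would be left untouched.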

“The system should be able to process image pointers (e.g. URLs or file paths) alongside text data, which may not be possible with text-based embeddings. To create a smooth user experience, organizations may need to implement custom code to integrate image retrieval with existing text retrieval,” the blog said. 
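To make the idea of image pointers sitting alongside text data concrete, the sketch below stores both kinds of records in one index and ranks them together. It is illustrative only: the `Record` fields, the helper names and the three-dimensional toy embeddings are invented for the example, and a real system would get its vectors from a multimodal embedding model and keep them in a vector database.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Record:
    record_id: str
    modality: str                        # "text" or "image"
    embedding: Sequence[float]           # toy vectors; a real model supplies these
    text: Optional[str] = None           # populated for text records
    image_pointer: Optional[str] = None  # URL or file path for image records


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    index: List[Record], query_embedding: Sequence[float], top_k: int = 3
) -> List[Tuple[float, Record]]:
    """Rank text and image records together by similarity to the query."""
    scored = [(cosine(r.embedding, query_embedding), r) for r in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def render(record: Record) -> str:
    """Return what the application would show: the text, or the image pointer."""
    if record.modality == "image":
        return f"[image] {record.image_pointer}"
    return f"[text] {record.text}"


# Hypothetical records; the paths and URL are placeholders.
index = [
    Record("t1", "text", [0.9, 0.1, 0.0], text="Q3 revenue summary"),
    Record("i1", "image", [0.8, 0.2, 0.1], image_pointer="charts/q3_revenue.png"),
    Record("i2", "image", [0.0, 0.1, 0.9],
           image_pointer="https://example.com/catalog/shoe.jpg"),
]

results = search(index, query_embedding=[1.0, 0.0, 0.0], top_k=2)
for similarity, record in results:
    print(f"{similarity:.3f}  {render(record)}")
```

Keeping the pointer on the record, instead of the image bytes, lets the retrieval layer stay lightweight while the application decides how to fetch and display the file.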

Multimodal embeddings become more useful 

Many RAG systems deal mainly with text data because text is easier to embed than images or videos. However, since most enterprises hold all kinds of data, RAG that can search both pictures and text has become more popular. Previously, organizations often had to implement separate RAG systems and databases, which prevented mixed-modality searches.

Multimodal search is nothing new; OpenAI and Google offer it in their respective chatbots. OpenAI launched its latest generation of embeddings models in January. Other companies also provide ways for businesses to harness their varied data for multimodal RAG. For example, Uniphore released a way to help enterprises prepare multimodal datasets for RAG.