LAI #78: RAG Evaluation, MCP 101, GRPO Fine-Tuning, and Multimodal Systems
Hallucination detection in healthcare AI, Langfuse tracing, and a new LLM aggregation tool from the community.
Good morning, AI enthusiasts,
This week’s issue is for the builders who care about what works — and how to measure it. We’re starting with a deep dive into RAG evaluation pipelines: why so many get it wrong, and the metrics that actually matter. From there, we explore MCP, the protocol designed to make AI agents more structured, scalable, and tool-aware.
You’ll also find a practical walkthrough on detecting hallucinations in healthcare AI, a beginner-friendly guide to fine-tuning Mistral-7B with GRPO, and a full-stack look at building multimodal RAG systems that integrate text, vision, and audio.
Also in the mix: a powerful LLM aggregator from the community, new collab threads, and a meme that probably hits too close to home.
Let’s get into it.
What’s AI Weekly
If you are implementing RAG but don’t have an evaluation pipeline, you are probably missing out on easy improvements. How will you know whether your system is optimal, or whether it actually improves when you change something? The answer is evaluation, and evaluating a RAG pipeline is different from evaluating the LLM itself. So this week in What’s AI, I am diving into the key evaluation metrics and methods we’ve found useful while developing RAG systems at Towards AI. Read the complete article here or watch the video on YouTube.
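To make the idea concrete, here is a minimal sketch of the kind of retrieval check such a pipeline can start from. The `eval_set` and the `retriever.search()` interface are hypothetical placeholders (not from the article); the point is simply that labeled question–document pairs let you compute hit rate and MRR before and after any change.

```python
# Minimal retrieval-evaluation sketch. `retriever` is a hypothetical object
# whose .search(query, top_k) returns a ranked list of document ids.

eval_set = [
    {"question": "What is RAG?", "relevant_id": "doc_42"},
    {"question": "How should long PDFs be chunked?", "relevant_id": "doc_17"},
]

def hit_rate_and_mrr(retriever, eval_set, k=5):
    hits, reciprocal_ranks = 0, []
    for item in eval_set:
        results = retriever.search(item["question"], top_k=k)  # ranked doc ids
        if item["relevant_id"] in results:
            hits += 1
            reciprocal_ranks.append(1 / (results.index(item["relevant_id"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(eval_set), sum(reciprocal_ranks) / len(eval_set)
```

Running this on the same evaluation set after every retriever or chunking change gives you a simple, repeatable signal instead of eyeballing outputs.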
— Louis-François Bouchard, Towards AI Co-founder & Head of Community
Learn AI Together Community Section!
Featured Community post from the Discord
Thevarsek has built NanthAI, an LLM aggregator app with advanced search, document handling, and customization features. It gives you access to 50+ premium models via OpenRouter integration and includes built-in reasoning capabilities for complex problem-solving. Users can also combine advanced internet search with RAG-based document retrieval and document context, and customize every AI “persona”. Check it out here and support a fellow community member. If you have any questions or feedback, share them in the thread!
AI poll of the week!
It’s telling that nearly a third of respondents still aren’t sure if agents will ever replace a full-time role on their team — while the rest are evenly spread across timelines as short as 6 months. This isn’t just about tech maturity — it’s about trust, workflow design, and the definition of what “replacement” really means. If you don’t see agents replacing roles outright, where do you expect them to start quietly replacing workflows? What tasks are on the edge already: manual, repetitive, or just begging for automation? Tell us in the thread!
Collaboration Opportunities
The Learn AI Together Discord community is overflowing with collaboration opportunities. If you are excited to dive into applied AI, want a study partner, or even want to find a partner for your passion project, join the collaboration channel! Keep an eye on this section, too — we share cool opportunities every week!
1. Ashish_82402 is a data scientist learning agentic workflows, MCP, CrewAI, and more, and is looking for a study partner who can spare a couple of hours every day. If you have a basic understanding of LangChain and LangGraph, message him in the thread!
2. Safar4352 is looking for a dedicated learning partner to study together, exchange knowledge, and grow collaboratively in the field. If you also want an accountability partner, connect with him in the thread!
3. Omegar1998 is looking to expand their DL, RL, and Gen AI skills. If you are on the same learning path, reach out to them in the thread!
Meme of the week!
Meme shared by marlonlp29_16646
TAI Curated section
Article of the week
MCP 101: Why This Protocol Matters in the Age of AI Agents 🤖 By Afaque Umer
This article introduces Anthropic’s Model Context Protocol (MCP), an open standard that streamlines LLM interactions with external tools. It explains MCP’s client-host-server architecture and its use of JSON-RPC 2.0 for communication, simplifying integration with APIs and data. It also outlines MCP’s structured lifecycle, from initialization to termination, highlighting its relevance for building scalable, tool-using AI agents.
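To illustrate the JSON-RPC 2.0 framing the article describes, here is a sketch of the first message in MCP’s lifecycle: a client’s `initialize` request. Field names follow the public MCP specification, but exact fields and the protocol version string can vary across spec revisions, so treat this as an approximation rather than a canonical message.

```python
import json

# Client-side "initialize" request, the opening step of MCP's lifecycle.
# Sent to an MCP server over stdio or HTTP as a JSON-RPC 2.0 call.
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",          # illustrative version string
        "capabilities": {"tools": {}},            # what this client supports
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}

print(json.dumps(initialize_request, indent=2))
```

The server replies with its own capabilities, after which the client can list and call tools; termination is just the reverse handshake at the end of the session.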
Our must-read articles
1. Detecting Hallucinations in Healthcare AI By Marie Humbert-Droz, PhD
The article addresses the challenge of hallucinations in healthcare AI, which persist even in Retrieval-Augmented Generation (RAG) systems that use citations. It introduces three complementary techniques to enhance safety: source attribution, which verifies that answers are grounded in evidence; consistency checking, which identifies unstable responses; and semantic entropy, which measures hidden uncertainty. Additionally, it describes a multi-stage retrieval approach for complex medical queries. Together, these layers aim to create a more reliable system by actively flagging inaccuracies.
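As a rough illustration of the consistency-checking and semantic-entropy ideas (not the author’s implementation), the sketch below samples several answers to the same question and measures how much they disagree. `generate_answer` and `same_meaning` are hypothetical stand-ins for an LLM call and a semantic-equivalence check (e.g., embeddings or an NLI model).

```python
import math

# Sample n answers, group them into semantic clusters, and compute the entropy
# of the cluster distribution. High entropy = unstable, potentially hallucinated.

def semantic_entropy(question, generate_answer, same_meaning, n_samples=5):
    answers = [generate_answer(question) for _ in range(n_samples)]
    clusters = []                                  # lists of semantically equivalent answers
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / n_samples for c in clusters]
    return -sum(p * math.log(p) for p in probs)    # 0.0 means fully consistent answers
```

In a safety layer, answers whose entropy exceeds a threshold would be flagged for review rather than shown to the user.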
2. Supercharge Mistral-7B with GRPO Finetuning: A Beginner-Friendly Tutorial with Code By Krishan Walia
The blog details using GRPO (Group Relative Policy Optimization) fine-tuning to improve the reasoning abilities of LLMs, specifically Mistral-7B. The process includes setting up the environment, loading a 4-bit quantized model via Unsloth, preparing the GSM8K dataset, and creating several reward functions to guide the model’s output structure and accuracy. It also covers configuring and running the GRPO training with TRL, with evaluations demonstrating improved reasoning. The article also notes GRPO’s efficiency, particularly for resource-constrained environments.
3. Enhancing LLM Capabilities: The Power of Multimodal LLMs and RAG By Sunil Rao
This article explores Multimodal Large Language Models (MLLMs) and their architecture, including specialized encoders (e.g., CLIP, LLaVA, Whisper) and fusion layers. It then discusses Multimodal RAG, which combines these MLLMs with Retrieval-Augmented Generation to draw on diverse data sources, and presents the key steps for building such a system, from data loading and multimodal embedding to retrieval and LLM integration. Finally, it highlights the importance of comprehensive evaluation metrics for retrieval, generation, and cross-modal consistency.
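A small sketch of the shared-embedding idea that makes cross-modal retrieval work: CLIP maps text and images into the same vector space, so a text query can retrieve images. This uses the `sentence-transformers` CLIP wrapper; the file names are illustrative and this is only one step of the full pipeline the article builds.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds both images and text into one space, enabling text-to-image retrieval.
model = SentenceTransformer("clip-ViT-B-32")

image_embeddings = model.encode([Image.open("chart.png"), Image.open("scan.png")])
query_embedding = model.encode("a bar chart of quarterly revenue")

scores = util.cos_sim(query_embedding, image_embeddings)   # 1 x N similarity matrix
print("Best matching image index:", scores.argmax().item())
```

A full multimodal RAG system would index these embeddings in a vector store and pass the retrieved images (or their captions/transcripts) to the MLLM at generation time.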
4. Monitor and Evaluate OpenAI SDK Agents using Langfuse By Steve George
The blog walks through building a simple agentic workflow with the OpenAI SDK, featuring an input guardrail, an assist agent, and a validation agent. It demonstrates capturing and visualizing trace data from this workflow using Langfuse, integrated via OpenTelemetry. The explanation covers configuring credentials and OpenTelemetry, implementing the agents, sending queries, and programmatically analyzing the trace data to build basic visualizations with Matplotlib.
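As a stripped-down illustration of the tracing idea, the sketch below assumes a recent version of the Langfuse Python SDK and its `@observe` decorator; the article instead instruments the OpenAI Agents SDK through OpenTelemetry, and the model name here is illustrative. Credentials are read from the standard `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables.

```python
import os
from langfuse import observe
from openai import OpenAI

# Assumes LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set in the environment.
os.environ.setdefault("LANGFUSE_HOST", "https://cloud.langfuse.com")

client = OpenAI()

@observe()  # each call to this function is recorded as a trace in Langfuse
def assist_agent(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

print(assist_agent("Summarize what an input guardrail does."))
```

Once traces are flowing, they can be pulled back via the Langfuse API and analyzed or plotted, which is the final step the blog demonstrates with Matplotlib.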
If you want to publish with Towards AI, check our guidelines and sign up. We will publish your work to our network if it meets our editorial policies and standards.