Innovaccer’s AI: Advancing Speaker Diarization for Healthcare Conversations

Discover how Innovaccer’s diarization backend powers Ambient AI—accurately identifying speakers, improving documentation, and enhancing patient care.

ByDeepak Sharma, Aditya N, Gaurav Singh, Nilav Baran Ghosh, Divyansh Mahajan, Amber Gautam

November 11, 2025

Innovaccer’s AI: Advancing Speaker Diarization for Healthcare Conversations

In healthcare, clear communication between providers, patients, and care teams is essential. Capturing who says what in conversations, from physician-patient interactions to care manager outreach, is critical for accurate clinical documentation, better patient outcomes, and operational efficiency.

At Innovaccer, we have developed a state-of-the-art diarization backend service that accurately identifies and separates speakers in healthcare conversations, even when audio is processed in multiple segments. This breakthrough powers our Ambient AI capabilities, transforming how healthcare conversations are transcribed and analyzed.

What Is Speaker Diarization, And Why Is It Important in Healthcare?

Speaker diarization is the process of determining “who spoke when” in an audio recording. In clinical settings, it helps distinguish between the voices of doctors, patients, nurses, and support staff. Accurate diarization ensures:

Precise clinical notes reflecting each speaker’s contributions
Better context for decision-making and follow-up care
Enhanced compliance with documentation and billing standards

Innovaccer’s Diarization Backend: How It Works

Our diarization service leverages advanced open-source and proprietary technologies to ensure consistent, accurate speaker identification throughout a conversation, even when audio is processed in discrete chunks.

Step 1: Audio Processing and Parallel Requests

The backend receives audio from our agent executable. If diarization is enabled, the system simultaneously sends the audio to both:

A transcription service that converts speech to text with precise word timestamps
A diarization API that splits audio into speaker-specific segments

Step 2: Speaker Segmentation with Pyannote and SpeechBrain

The diarization API uses the pyannote.audio library, powered by the speechbrain model, to segment the audio by speaker. If a speaker appears multiple times, their segments are grouped and merged into a longer clip to improve accuracy.

Step 3: Speaker Embeddings via NVIDIA NeMo

Each speaker’s audio segment is converted into a unique embedding using NVIDIA NeMo. These embeddings act as audio “fingerprints” that represent the speaker’s voice characteristics.

Step 4: Speaker Consistency with Redis and Threshold Matching

Our system maintains a Redis database of speaker embeddings for each call session. When a new segment arrives, it compares the embedding against stored ones, calculating a similarity score based on audio length and other parameters.

If the new segment matches an existing speaker beyond a dynamic threshold, labels are updated for consistency.
If not, and the maximum number of speakers isn’t exceeded, a new speaker label is created.
If the speaker limit is reached, the new segment is matched with the closest existing speaker.
If the stored audio sample is short, we append new incoming samples that have a strong dBFS value and high similarity to the existing sample. This helps build a more complete and reliable stored sample for accurate comparison.

This ensures consistent speaker labeling across multiple audio chunks, even if the speaker’s speech is intermittent.

Step 5: Merging Transcription and Diarization Outputs

Finally, the backend merges the transcription’s word timestamps with the diarization’s speaker timestamps, producing a clear transcript with labeled speaker turns. For example:

speaker_1: “Hello, I am calling from XYZ Healthcare.”

speaker_2: “Hi, thanks for calling.”

Flexible API for Integration

Our diarization backend exposes a simple POST API where clients can send audio files or bytes along with parameters like minimum and maximum expected speakers and a unique call ID. The API returns detailed speaker segments with start and end times, loudness (dBFS), and metadata for each chunk, supporting robust integration with healthcare workflows.

Why Innovaccer’s Diarization Matters for Healthcare

Improved Clinical Documentation: By accurately identifying speakers, our AI reduces manual transcription errors, improving the quality and usability of clinical notes.
Enhanced Patient Experience: Clear records of conversations help providers understand patient needs better and follow up appropriately.
Scalable Across Settings: Whether it’s telemedicine, in-person visits, or care coordination calls, this diarization backend supports diverse healthcare scenarios.
Supports Compliance and Billing: Reliable speaker labeling supports audit trails, coding, and regulatory requirements.
Real-Time Audio Feedback for Improved Call Quality: dBFS values and speaker labels help identify low volume in real time, guiding agents or patients to speak louder for better audio quality.

Innovaccer’s innovative diarization technology is a critical pillar in advancing Ambient AI solutions for healthcare. By ensuring every voice in a conversation is accurately captured and attributed, we empower providers with actionable insights—making every healthcare conversation count.