
In healthcare, clear communication between providers, patients, and care teams is essential. Capturing who says what in conversations, from physician-patient interactions to care manager outreach, is critical for accurate clinical documentation, better patient outcomes, and operational efficiency.
At Innovaccer, we have developed a state-of-the-art diarization backend service that accurately identifies and separates speakers in healthcare conversations, even when audio is processed in multiple segments. This breakthrough powers our Ambient AI capabilities, transforming how healthcare conversations are transcribed and analyzed.
Speaker diarization is the process of determining “who spoke when” in an audio recording. In clinical settings, it helps distinguish between the voices of doctors, patients, nurses, and support staff. Accurate diarization ensures:
Our diarization service leverages advanced open-source and proprietary technologies to ensure consistent, accurate speaker identification throughout a conversation, even when audio is processed in discrete chunks.
The backend receives audio from our agent executable. If diarization is enabled, the system simultaneously sends the audio to both:
The diarization API uses the pyannote.audio library, powered by the speechbrain model, to segment the audio by speaker. If a speaker appears multiple times, their segments are grouped and merged into a longer clip to improve accuracy.
Each speaker’s audio segment is converted into a unique embedding using NVIDIA NeMo. These embeddings act as audio “fingerprints” that represent the speaker’s voice characteristics.
Our system maintains a Redis database of speaker embeddings for each call session. When a new segment arrives, it compares the embedding against stored ones, calculating a similarity score based on audio length and other parameters.

This ensures consistent speaker labeling across multiple audio chunks, even if the speaker’s speech is intermittent.
Finally, the backend merges the transcription’s word timestamps with the diarization’s speaker timestamps, producing a clear transcript with labeled speaker turns. For example:
speaker_1: “Hello, I am calling from XYZ Healthcare.”
speaker_2: “Hi, thanks for calling.”
Our diarization backend exposes a simple POST API where clients can send audio files or bytes along with parameters like minimum and maximum expected speakers and a unique call ID. The API returns detailed speaker segments with start and end times, loudness (dBFS), and metadata for each chunk, supporting robust integration with healthcare workflows.
Innovaccer’s innovative diarization technology is a critical pillar in advancing Ambient AI solutions for healthcare. By ensuring every voice in a conversation is accurately captured and attributed, we empower providers with actionable insights—making every healthcare conversation count.