Build Audio Analysis Application With Speaker Diarization

Learn how to build an audio analysis application with speaker diarization to automate note-taking, improve content accessibility, and get useful insights.



In this blog, we build an audio analysis application that provides insights from conversational audio, with speaker tagging for multiple speakers (speaker diarization).

Our main focus is to get useful insights from any conversational audio. These can include a transcript of the audio with speaker tagging (speaker diarization), the text of the complete conversation, a summary of the conversation, topic analysis, and the emotion of the complete conversation.

To generate these insights, we have used only open-source LLMs (Large Language Models), Whisper for converting speech to text, and Whisper with pyannote.audio for speaker diarization.

Moreover, we can extend the audio analysis application to analyze each speaker’s emotions. For this, we need to collect each speaker’s spoken sentences and apply emotion analysis to them, or we can develop different logic for this.

Now, we will discuss the code used to generate the insights in the audio analysis application.

Code For Audio Analysis Application

1. Let’s import the libraries required to run the audio analysis application.

import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
import whisper
import datetime
from pyannote.audio import Audio
from pyannote.core import Segment
import speechbrain as sb
import wave
import contextlib
from sklearn.cluster import AgglomerativeClustering
import numpy as np
import subprocess
import random
import textwrap
from speechbrain.inference.interfaces import foreign_class
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoProcessor, AutoModel
from transformers import GenerationConfig, pipeline

2. Now, we give the path of the audio file. Note that the audio file for this code should be in .wav format. If you already have a .wav file, you are good to go; otherwise, it is easy to convert any format to .wav using Python.

3. Video files can also be converted directly to audio.wav using ffmpeg.

path = "/media/gpu/1924f65c-d623-4aba-a06c-3bc31ca72302/Downloads/Audio_project/Meet_Audio.wav"
num_speakers = 5  # number of speakers to identify during diarization
language = 'English'
model_size = 'base'  # Whisper model size

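4. Load a pretrained speaker embedding model from speechbrain to generate the embeddings.

# Assumption: the original arguments were truncated; this speechbrain model yields the 192-dimensional embeddings used in step 10.
embedding_model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb", device=torch.device("cuda"))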

5. The snippet below converts the audio file to .wav if it is not already in that format.

# Convert any other input format to audio.wav with ffmpeg ('-y' overwrites an existing file)
if path[-3:] != 'wav':
    subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
    path = 'audio.wav'
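
For readers who want a slightly more defensive conversion step, here is a minimal sketch. The helper names `build_ffmpeg_command` and `ensure_wav` are hypothetical, and the choice of mono 16 kHz output is an assumption made for illustration, not something the code above requires. It checks the file extension with `pathlib` and raises an error if ffmpeg fails.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst="audio.wav", sample_rate=16000):
    """Return the ffmpeg argument list that converts src to a mono WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file (audio or video)
        "-vn",                    # ignore any video stream
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample (16 kHz is an assumption)
        str(dst),
    ]


def ensure_wav(src, dst="audio.wav"):
    """Return a path to a .wav file, converting src with ffmpeg if needed."""
    if Path(src).suffix.lower() == ".wav":
        return str(src)
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return str(dst)


# A .wav input is returned unchanged, so ffmpeg is not invoked here
print(ensure_wav("Meet_Audio.wav"))
```

Keeping the command construction in its own function makes it easy to inspect or log the exact ffmpeg call before running it.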

6. Load the OpenAI Whisper model. After loading, make sure to clear the cache (for example with torch.cuda.empty_cache()) to use your GPU memory efficiently.

model = whisper.load_model(model_size)

7. After loading the Whisper model, use it to generate a transcript. I suggest printing the result variable to see what the model returns, such as its format and important keys. It is important to understand this output before going through the next steps.

result = model.transcribe(path)
# Each entry in result["segments"] holds the start time, end time, and text of one chunk of speech
segments = result["segments"]
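
To make the shape of that output concrete, here is a small sketch that uses a hand-written stand-in for the transcription result. The values are hypothetical, and a real result carries more keys per segment than shown here. Only the keys used later in this blog ("segments", "start", "end", "text") are illustrated.

```python
# Hypothetical stand-in for the dictionary returned by model.transcribe(path)
result = {
    "text": " Hello everyone. Hi, thanks for joining.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.4, "text": " Hello everyone."},
        {"id": 1, "start": 2.4, "end": 5.1, "text": " Hi, thanks for joining."},
    ],
}


def summarize_segments(result):
    """Return (start, end, text) tuples, with surrounding whitespace removed from the text."""
    rows = []
    for seg in result["segments"]:
        rows.append((seg["start"], seg["end"], seg["text"].strip()))
    return rows


for start, end, text in summarize_segments(result):
    print(f"{start:6.1f}s - {end:6.1f}s  {text}")
```

Note that the segment text starts with a space, which is why the transcript-writing code in step 12 drops the first character.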

8. This code snippet opens the .wav file located at path in read mode using the wave.open function. It reads the number of frames and the frame rate of the audio file. Then, it calculates the duration of the audio in seconds by dividing the number of frames by the frame rate.

with contextlib.closing(wave.open(path, 'r')) as f:
    frames = f.getnframes()
    rate = f.getframerate()
    duration = frames / float(rate)
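
The duration calculation can be tried without a real recording. The sketch below is illustrative only: `write_silence` and `wav_duration` are hypothetical helper names. It writes two seconds of silence to a temporary .wav file and measures it with the same frames-divided-by-rate logic.

```python
import contextlib
import os
import tempfile
import wave


def wav_duration(wav_path):
    """Duration in seconds = number of frames / frame rate."""
    with contextlib.closing(wave.open(wav_path, "r")) as f:
        return f.getnframes() / float(f.getframerate())


def write_silence(wav_path, seconds, rate=16000):
    """Write a mono, 16-bit .wav file containing only silence."""
    with contextlib.closing(wave.open(wav_path, "w")) as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample
        f.setframerate(rate)
        f.writeframes(b"\x00\x00" * int(seconds * rate))


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path, seconds=2)
print(wav_duration(demo_path))  # prints 2.0
```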

9. This code snippet defines a function that generates the embedding of the audio between two timestamps. If you have studied the output of result in step 7, the function is easy to follow.

audio = Audio()
def segment_embedding(segment):
    start = segment["start"]
    # Whisper overshoots the end timestamp in the last segment
    end = min(duration, segment["end"])
    clip = Segment(start, end)
    # Crop the audio file to the segment's time range
    waveform, sample_rate = audio.crop(path, clip)
    # Average the channels so the embedding model receives mono audio
    waveform = waveform.mean(dim=0, keepdim=True)
    # Add a batch dimension before calling the embedding model
    return embedding_model(waveform.unsqueeze(0))

10. This loop iterates over the segments variable defined in step 7 and generates the embedding for each segment.

# One 192-dimensional embedding per segment
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
    embeddings[i] = segment_embedding(segment)
# Replace any NaN values with zeros before clustering
embeddings = np.nan_to_num(embeddings)

11. The code snippet below performs speaker diarization: it clusters the segment embeddings into num_speakers groups and assigns each segment to a specific speaker.

clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
# Cluster labels start at 0, so add 1 to get SPEAKER 1, SPEAKER 2, ...
for i in range(len(segments)):
    segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)
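
To see how the clustering step behaves, here is a toy sketch on synthetic data. The embeddings below are randomly generated points around two hypothetical "voice" centres, not real speaker embeddings, and the cluster count is set to 2 for the demonstration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Two hypothetical voices: tight clouds around two distant centres in a 192-dimensional space
centre_a = np.zeros(192)
centre_b = np.full(192, 5.0)
embeddings = np.vstack([
    centre_a + 0.1 * rng.standard_normal((3, 192)),  # segments 0-2
    centre_b + 0.1 * rng.standard_normal((3, 192)),  # segments 3-5
])

# Same call pattern as in step 11, with the number of clusters fixed at 2
clustering = AgglomerativeClustering(n_clusters=2).fit(embeddings)
labels = clustering.labels_

speakers = ["SPEAKER " + str(label + 1) for label in labels]
print(speakers)
```

The first three segments share one tag and the last three share the other. Which group is called SPEAKER 1 is arbitrary, because cluster numbers carry no meaning beyond grouping.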

12. This code snippet writes the transcript of the audio, including speaker labels and corresponding timestamps, to a text file named “transcript.txt” and keeps a copy in the string x.

def time(secs):
    return datetime.timedelta(seconds=round(secs))

x = ""  # in-memory copy of the transcript
with open("transcript.txt", "w") as f:  # the file is closed automatically
    for (i, segment) in enumerate(segments):
        # Write a speaker header only when the speaker changes
        if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
            f.write("\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n')
        # The segment text starts with a space, so drop the first character
        f.write(segment["text"][1:] + ' ')
        # Unlike the file, x gets a speaker header for every segment
        x += "\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n'
        x += segment["text"][1:] + ' '

13. Now we can see the transcript with speaker diarization. Each speaker is tagged as SPEAKER 1, SPEAKER 2, ..., SPEAKER N, depending on the number of speakers to identify (num_speakers) set in the code under step 3.
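
The layout of the diarized transcript can be illustrated with a few hand-written segments. This is a sketch with hypothetical data; `render_transcript` and `format_time` are illustrative names that follow the same header-on-speaker-change logic as the file-writing code in step 12.

```python
import datetime


def format_time(secs):
    """Format a number of seconds as H:MM:SS."""
    return str(datetime.timedelta(seconds=round(secs)))


def render_transcript(segments):
    """Build the transcript text, adding a header whenever the speaker changes."""
    parts = []
    for i, seg in enumerate(segments):
        if i == 0 or segments[i - 1]["speaker"] != seg["speaker"]:
            parts.append("\n" + seg["speaker"] + " " + format_time(seg["start"]) + "\n")
        parts.append(seg["text"].strip() + " ")
    return "".join(parts)


# Hypothetical segments, shaped like the output of steps 7 and 11
demo_segments = [
    {"speaker": "SPEAKER 1", "start": 0.0, "text": " Good morning, everyone."},
    {"speaker": "SPEAKER 1", "start": 3.2, "text": " Let's get started."},
    {"speaker": "SPEAKER 2", "start": 75.4, "text": " Thanks, I have one update."},
]

print(render_transcript(demo_segments))
# SPEAKER 1 0:00:00
# Good morning, everyone. Let's get started.
# SPEAKER 2 0:01:15
# Thanks, I have one update.
```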


14. The code below generates the emotion of the audio file for our audio analysis application. We have used “speechbrain/emotion-recognition-wav2vec2-IEMOCAP” to recognize emotion from an audio file.

# pymodule_file was empty in the original listing; "custom_interface.py" is the interface file used on the model card
classifier = foreign_class(source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP", pymodule_file="custom_interface.py", classname="CustomEncoderWav2vec2Classifier")
# text_lab holds the predicted emotion label for the whole file
out_prob, score, index, text_lab = classifier.classify_file("Meet_Audio.wav")
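
As mentioned in the introduction, the same idea could be extended to per-speaker emotion. A first step would be to collect each speaker's turns from the diarized segments. The sketch below shows only that grouping step, on hypothetical data; `group_by_speaker` is an illustrative name, and cropping the audio for each turn and running the classifier on it is not shown.

```python
from collections import defaultdict


def group_by_speaker(segments):
    """Collect the time range and text of every segment under its speaker tag."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["speaker"]].append(
            {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        )
    return dict(grouped)


# Hypothetical diarized segments, shaped like the output of step 11
demo_segments = [
    {"speaker": "SPEAKER 1", "start": 0.0, "end": 2.4, "text": " Good morning."},
    {"speaker": "SPEAKER 2", "start": 2.4, "end": 5.0, "text": " Morning, all set?"},
    {"speaker": "SPEAKER 1", "start": 5.0, "end": 7.5, "text": " Yes, let's begin."},
]

per_speaker = group_by_speaker(demo_segments)
for speaker, turns in per_speaker.items():
    print(speaker, [turn["text"] for turn in turns])
```

Each speaker's time ranges could then be used to crop the audio, or the collected text could be analyzed separately for each speaker.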

15. The next steps load a model for sentiment analysis, topic analysis, and topic-wise summarization. We have used the Mistral-7B model to perform these tasks using an instruction prompt.

# Assumption: the original BitsAndBytesConfig arguments were truncated; 4-bit loading is shown as a placeholder setting
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model_id = "/media/gpu/1924f65c-d623-4aba-a06c-3bc31ca72302/Downloads/Audio_project/mistralai_quantized"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("/media/gpu/1924f65c-d623-4aba-a06c-3bc31ca72302/Downloads/Audio_project/mistral-7b_tokentizer")

# Function to generate summary and sentiment analysis using Mistral
def generate_summary_and_sentiment(text):
    # The <data> block, the JSON braces, and the closing [/INST] were lost in the original listing;
    # they are reconstructed here from the prompt's own description. Braces are doubled inside the f-string.
    prompt = f"""<s>[INST]I need to summarize a conversation. The transcript of the conversation is between the <data> XML like tags.
<data>
{text}
</data>
The summary must contain a one word sentiment analysis, and a list of issues, topic of detailed discussion during the conversation. The output must be provided in JSON format shown in the following example.
Example output:
{{
    "sentiment": <sentiment>,
    "issues": [
        {{
            "topic": <topic>,
            "summary": <issue_summary>
        }}
    ]
}}
[/INST]"""
    # Move the tokenized prompt to the same device as the model
    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    output = model.generate(**encodeds, max_length=1200, temperature=0.02, use_cache=True, early_stopping=True, pad_token_id=model.config.eos_token_id, eos_token_id=model.config.eos_token_id, do_sample=True)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response
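
Since the decoded response usually repeats the prompt before the model's answer, the JSON object still has to be pulled out of the text. Here is a minimal sketch of one way to do that with the standard library. The helper name `extract_json` and the sample response are hypothetical, and real model output may need more careful handling.

```python
import json


def extract_json(response):
    """Return the first valid JSON object found in the response text, or None."""
    decoder = json.JSONDecoder()
    pos = response.find("{")
    while pos != -1:
        try:
            # Try to parse a JSON value starting at this brace
            obj, _ = decoder.raw_decode(response, pos)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON here (e.g. the <sentiment> placeholder in the prompt)
        pos = response.find("{", pos + 1)
    return None


# Hypothetical decoded output: the echoed prompt example followed by the model's answer
demo_response = """Example output:
{
    "sentiment": <sentiment>,
    "issues": [
        {
            "topic": <topic>,
            "summary": <issue_summary>
        }
    ]
}
[/INST]
{"sentiment": "positive", "issues": [{"topic": "release plan", "summary": "The team agreed on a date."}]}"""

parsed = extract_json(demo_response)
print(parsed["sentiment"])
for issue in parsed["issues"]:
    print(issue["topic"], "->", issue["summary"])
```

The template in the prompt is skipped automatically because its placeholders are not valid JSON, so only the model's answer is parsed.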

Use Cases And Benefits

The audio analysis application can be applied in various sectors, such as:

  • Telecommunication 
  • Defence 
  • Customer Support and BPOs 
  • Generating Conversational Dataset

The benefits of the audio analysis application include:

  • Improve content accessibility
  • Get useful insights 
  • Automatic note-taking


Now, our audio analysis application is ready. We have seen how to generate useful insights from an audio file (.wav format). We used pre-trained open-source models: Whisper for transcription, Mistral-7B loaded through Hugging Face Transformers for summarization, sentiment analysis, and topic analysis, and speechbrain models for the speaker embeddings behind diarization and for emotion analysis.

We can use the audio analysis application to analyze conversational audio data in sectors like telecommunication, BPOs, customer support, technical support, and military or defence (where the process needs to be encrypted).


Speaker diarization is the process of partitioning an audio recording into segments according to who is speaking. It is used to enhance audio-to-text transcription by attributing each part of the transcript to a speaker.

For the speech-to-text task, we can use open-source models like Whisper from OpenAI, and libraries like pyannote.audio and speechbrain to crop the audio file into segments and generate embeddings.

To perform audio analysis or to generate useful insights from the audio file, we first generate text from the audio file and then partition the audio file into segments. Then, we generate the embedding of each audio segment and cluster the embeddings so that every segment is assigned to a speaker. Finally, we generate a diarized transcript and analyze the transcript and audio for summarization, emotion analysis, and topic analysis.

The audio analysis application has multiple use cases in sectors such as telecommunication, BPOs, customer support, technical support, and military or defence. It benefits these sectors by providing useful insights, such as text from audio and a summary of the conversation and its outcome, and it can be extended to perform tone analysis of each speaker.