Build Audio Analysis Application With Speaker Diarization

Ishita Kaur
May 7, 2024

Introduction

In this blog, we build an audio analysis application that provides insights from conversational audio, with speaker tagging for multiple speakers (speaker diarization).

Our main focus is to get useful insights from any conversational audio. These can include a transcript with speaker tagging (speaker diarization), the text of the complete conversation, a summary of the conversation, topic analysis, and the emotion of the complete conversation.

For generating these insights, we have used only open-source models: Whisper for converting speech to text, Whisper combined with pyannote.audio for speaker diarization, and an open-source LLM (Large Language Model) for summarization and analysis.

Moreover, we can extend the audio analysis application to analyze each speaker’s emotions. For this, we need to collect each speaker’s spoken sentences and apply emotion analysis to them, or we can develop different logic for this.

Now, we will discuss the code that we used to generate the insights from the audio analysis application.

Code For Audio Analysis Application

1. Let’s import the libraries required to run the audio analysis application.

    
    import gc  # needed for the gc.collect() calls in later steps
    import torch
    from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
    import whisper
    import datetime
    from pyannote.audio import Audio
    from pyannote.core import Segment
    import speechbrain as sb
    import wave
    import contextlib
    from sklearn.cluster import AgglomerativeClustering
    import numpy as np
    import subprocess
    import random
    import textwrap
    from speechbrain.inference.interfaces import foreign_class
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoProcessor, AutoModel
    from transformers import GenerationConfig, pipeline

2. Now, we give the path of the audio file. Note that the audio file for this code should be in .wav format. If you already have a .wav file, you are good to go; otherwise, it is easy to convert any format to .wav using Python.

3. Video files can also be converted directly to audio.wav using ffmpeg.

    
    path = "/media/gpu/1924f65c-d623-4aba-a06c-3bc31ca72302/Downloads/Audio_project/Meet_Audio.wav"
    num_speakers = 5
    language = 'English'
    model_size = 'base'

4. Load a pretrained speaker-embedding model from speechbrain to generate the embeddings.

    
    embedding_model = PretrainedSpeakerEmbedding(
        "speechbrain/spkrec-ecapa-voxceleb",
        device=torch.device("cuda"))

5. Lucky for you, I have added a code snippet that converts the audio file to .wav if it is not already in .wav format.

    
    if path[-3:] != 'wav':
        subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
        path = 'audio.wav'
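
The conversion described in steps 3 and 5 can also be wrapped in a small helper. The sketch below is one possible version, not the code used above: the names needs_conversion and build_ffmpeg_command are hypothetical, the input file name is made up, and the -vn, -ac 1 and -ar 16000 options (drop video, mono, 16 kHz) are assumptions that tend to suit speech models rather than settings taken from this blog.

```python
from pathlib import Path


def needs_conversion(path):
    """Return True when the file extension is anything other than .wav."""
    return Path(path).suffix.lower() != ".wav"


def build_ffmpeg_command(src, dst="audio.wav"):
    """Build an ffmpeg command that extracts mono 16 kHz audio from src."""
    return [
        "ffmpeg",
        "-i", str(src),   # input audio or video file
        "-vn",            # ignore any video stream
        "-ac", "1",       # assumption: downmix to a single channel
        "-ar", "16000",   # assumption: resample to 16 kHz
        "-y",             # overwrite dst if it already exists
        str(dst),
    ]


if __name__ == "__main__":
    source = "meeting.mp4"  # hypothetical input file
    if needs_conversion(source):
        # Pass the list to subprocess.run(..., check=True) to actually convert.
        print(" ".join(build_ffmpeg_command(source)))
```

Building the command as a list, rather than a single string, avoids shell quoting problems when the path contains spaces.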

6. Load the OpenAI Whisper model. After loading, run garbage collection and clear the CUDA cache to use your GPU memory efficiently.

    
    model = whisper.load_model(model_size)
    gc.collect()
    torch.cuda.empty_cache()

7. After loading the Whisper model, use it to generate a transcript. I suggest printing the result variable to study its structure, such as the format and the important keys. It is important to understand this output before we go through the next steps.

    
    result = model.transcribe(path)
    segments = result["segments"]
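
To make the structure of result concrete, the sketch below walks over a hand-written dictionary shaped like Whisper's output. The sample values are invented for illustration, and only the keys used later in this blog (text, segments, start, end) plus a few others are shown; a real result carries additional fields per segment. The helper name describe_segments is hypothetical.

```python
# Hand-written stand-in for the dictionary returned by model.transcribe(path).
# The values are invented; only the keys mirror Whisper's output.
sample_result = {
    "text": " Hello everyone. Thanks for joining.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.4, "text": " Hello everyone."},
        {"id": 1, "start": 2.4, "end": 5.1, "text": " Thanks for joining."},
    ],
}


def describe_segments(result):
    """Return one 'start-end: text' line per segment."""
    lines = []
    for seg in result["segments"]:
        lines.append(f'{seg["start"]:.1f}-{seg["end"]:.1f}: {seg["text"].strip()}')
    return lines


for line in describe_segments(sample_result):
    print(line)
```

Each segment's text starts with a space, which is why the later steps slice or strip it before writing the transcript.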

8. This code snippet opens a .wav file located at a path in read mode using the wave.open() function. It reads the number of frames and the frame rate of the audio file. Then, it calculates the duration of the audio in seconds by dividing the number of frames by the frame rate.

    
    with contextlib.closing(wave.open(path,'r')) as f:
        frames = f.getnframes()
        rate = f.getframerate()
        duration = frames / float(rate)
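
The duration calculation can be checked on a file whose length is known. The sketch below writes one second of silence to a temporary .wav file and applies the same frames / rate formula. The helper names, the file name and the 16 kHz sample rate are arbitrary choices for this example.

```python
import contextlib
import os
import tempfile
import wave


def wav_duration(path):
    """Duration in seconds = number of frames / frame rate."""
    with contextlib.closing(wave.open(path, "r")) as f:
        return f.getnframes() / float(f.getframerate())


def write_silence(path, seconds=1, rate=16000):
    """Write `seconds` of 16-bit mono silence to `path`."""
    with contextlib.closing(wave.open(path, "w")) as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        f.setframerate(rate)
        f.writeframes(b"\x00\x00" * rate * seconds)


tmp_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
write_silence(tmp_path)
print(wav_duration(tmp_path))
```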

9. This code snippet defines a function that generates the embedding of the audio between two timestamps. If you have studied the result from step 7, you can understand the function easily.

    
    audio = Audio()
    def segment_embedding(segment):
        start = segment["start"]
        # Whisper overshoots the end timestamp in the last segment
        end = min(duration, segment["end"])
        clip = Segment(start, end)
        waveform, sample_rate = audio.crop(path, clip)
        # Average the channels so the embedding model receives mono audio
        waveform = waveform.mean(dim=0, keepdim=True)
        return embedding_model(waveform.unsqueeze(0))

10. This loop iterates over the segments variable defined in step 7 and generates the embedding for each segment.

    
    embeddings = np.zeros(shape=(len(segments), 192))
    for i, segment in enumerate(segments):
        embeddings[i] = segment_embedding(segment)
    embeddings = np.nan_to_num(embeddings)

11. The code snippet below performs speaker diarization: it clusters the segment embeddings into num_speakers groups and assigns each segment to a specific speaker.

    
    clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
    labels = clustering.labels_
    for i in range(len(segments)):
        segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)
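
To see what the clustering step does in isolation, the sketch below runs AgglomerativeClustering on synthetic vectors instead of real speaker embeddings. The data is made up: two groups of three 192-dimensional points placed far apart, standing in for two speakers.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Synthetic stand-ins for speaker embeddings: three segments near 0, three near 5.
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(3, 192))
speaker_b = rng.normal(loc=5.0, scale=0.1, size=(3, 192))
toy_embeddings = np.vstack([speaker_a, speaker_b])

labels = AgglomerativeClustering(2).fit(toy_embeddings).labels_

# Cluster ids are arbitrary, so only the grouping of segments is meaningful.
speakers = ['SPEAKER ' + str(label + 1) for label in labels]
print(speakers)
```

The speaker numbers carry no identity: the same voice may be SPEAKER 1 in one run and SPEAKER 2 in another, but segments from one voice should share a label.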

12. This code snippet defines a function to write the transcript of the audio, including speaker labels and corresponding timestamps, to a text file named “transcript.txt.”

    
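    def time(secs):
        # Format seconds as H:MM:SS
        return datetime.timedelta(seconds=round(secs))

    x = ""  # in-memory copy of the transcript, with a header before every segment
    with open("transcript.txt", "w") as f:
        for (i, segment) in enumerate(segments):
            # Write a speaker header only when the speaker changes
            if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
                f.write("\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n')
            # [1:] drops the leading space at the start of each segment's text
            f.write(segment["text"][1:] + ' ')
            x += "\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n'
            x += segment["text"][1:] + ' '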

13. Now we can view the transcript with speaker diarization. Each speaker is tagged as SPEAKER 1, SPEAKER 2, ..., SPEAKER N, depending on the number of speakers to identify, which is set by num_speakers after step 3.

    
     print(open('transcript.txt').read())
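
The grouping logic from step 12 can also be tried without any audio. The sketch below merges consecutive segments from the same speaker into turns. The sample_segments list is invented sample data, and fmt and merge_turns are hypothetical helpers, not part of the code above.

```python
import datetime


def fmt(secs):
    """Format seconds as H:MM:SS, as in the transcript file."""
    return str(datetime.timedelta(seconds=round(secs)))


def merge_turns(segments):
    """Merge consecutive segments of one speaker into (speaker, start, text) turns."""
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1][0] == seg["speaker"]:
            # Same speaker as the previous turn: extend that turn's text
            speaker, start, previous = turns[-1]
            turns[-1] = (speaker, start, previous + " " + text)
        else:
            turns.append((seg["speaker"], fmt(seg["start"]), text))
    return turns


# Invented sample data shaped like the diarized segments.
sample_segments = [
    {"speaker": "SPEAKER 1", "start": 0.0, "text": " Good morning."},
    {"speaker": "SPEAKER 1", "start": 2.2, "text": " Shall we begin?"},
    {"speaker": "SPEAKER 2", "start": 75.4, "text": " Yes, go ahead."},
]

for speaker, start, text in merge_turns(sample_segments):
    print(speaker, start)
    print(text)
```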

14. The code below detects the emotion of the audio file for our audio analysis application. We have used “speechbrain/emotion-recognition-wav2vec2-IEMOCAP” to recognize emotion from an audio file.

    
    classifier = foreign_class(
        source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
        pymodule_file="custom_interface.py",
        classname="CustomEncoderWav2vec2Classifier")
    # Classify the same audio file that was transcribed above
    out_prob, score, index, text_lab = classifier.classify_file(path)
    print(text_lab)
    gc.collect()
    torch.cuda.empty_cache()
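
As mentioned in the introduction, the analysis can be extended to each speaker's emotions by first collecting what each speaker said. The sketch below shows only that grouping step on invented sample data. The helper group_by_speaker is hypothetical, and running an emotion model on each speaker's text or audio spans is left out.

```python
from collections import defaultdict


def group_by_speaker(segments):
    """Collect the text and (start, end) spans of every segment per speaker."""
    grouped = defaultdict(lambda: {"sentences": [], "spans": []})
    for seg in segments:
        entry = grouped[seg["speaker"]]
        entry["sentences"].append(seg["text"].strip())
        entry["spans"].append((seg["start"], seg["end"]))
    return dict(grouped)


# Invented sample data shaped like the diarized segments.
sample_segments = [
    {"speaker": "SPEAKER 1", "start": 0.0, "end": 2.0, "text": " I am not happy with this."},
    {"speaker": "SPEAKER 2", "start": 2.0, "end": 4.5, "text": " Let me fix that for you."},
    {"speaker": "SPEAKER 1", "start": 4.5, "end": 6.0, "text": " Thank you."},
]

for speaker, data in group_by_speaker(sample_segments).items():
    # Each speaker's sentences (or audio spans) could then be passed to an emotion model.
    print(speaker, data["sentences"], data["spans"])
```

Keeping the time spans alongside the sentences leaves both options open: text-based sentiment per speaker, or cropping the audio per span for an audio-based emotion model.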

15. The next steps are used to load a model for sentiment analysis, topic analysis, and topic-wise summarization. We have used the mistral-7b model to perform these tasks using the system prompting technique.

    
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    model_id = "/media/gpu/1924f65c-d623-4aba-a06c-3bc31ca72302/Downloads/Audio_project/mistralai_quantized"
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="cuda")
    tokenizer = AutoTokenizer.from_pretrained("/media/gpu/1924f65c-d623-4aba-a06c-3bc31ca72302/Downloads/Audio_project/mistral-7b_tokentizer")

    
    # Function to generate summary and sentiment analysis using Mistral
    def generate_summary_and_sentiment(text):
        prompt = f"""<s>[INST]I need to summarize a conversation. The transcript of the conversation is between the <data> XML like tags.
    <data>
    {text}
    </data>
    The summary must contain a one word sentiment analysis, and a list of issues, topic of detailed discussion during the conversation. The output must be provided in JSON format shown in the following example.
    Example output:
    {{
       "sentiment": <sentiment>,
       "issues": [
           {{
               "topic": <topic>,
               "summary": <issue_summary>
           }}
       ]
    }}
    [/INST]</s>"""
        # Move the tokenized prompt to the same device as the model
        encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
        output = model.generate(**encodeds, max_length=1200, temperature=0.02, use_cache=True, early_stopping=True, pad_token_id=model.config.eos_token_id, eos_token_id=model.config.eos_token_id, do_sample=True)
        response = tokenizer.decode(output[0], skip_special_tokens=True)
        print(response)
        return response

    # Assumption: the diarized transcript written to transcript.txt in step 12 is the input text
    text = open("transcript.txt").read()
    generate_summary_and_sentiment(text)
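
Because the decoded response contains the prompt followed by the model's answer, the JSON has to be separated out before it can be used. The sketch below is one way to do that, assuming the answer follows the last [/INST] marker when that marker survives decoding, and that the answer contains a JSON object. The helper extract_json is hypothetical and the sample response is invented, not real model output.

```python
import json


def extract_json(response):
    """Return the first parsable JSON object found after the prompt, or None."""
    # Assumption: the answer follows the last [/INST] marker, if the marker is present.
    answer = response.rsplit("[/INST]", 1)[-1]
    decoder = json.JSONDecoder()
    for pos, char in enumerate(answer):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(answer, pos)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace, try the next one
        if isinstance(obj, dict):
            return obj
    return None


# Invented response: the echoed prompt (with its placeholder example) plus an answer.
sample_response = """[INST]I need to summarize a conversation.
Example output:
{
   "sentiment": <sentiment>,
   "issues": [{"topic": <topic>, "summary": <issue_summary>}]
}
[/INST] {
   "sentiment": "positive",
   "issues": [{"topic": "release plan", "summary": "The team agreed on a date."}]
}"""

parsed = extract_json(sample_response)
print(parsed["sentiment"])
for issue in parsed["issues"]:
    print(issue["topic"], "-", issue["summary"])
```

The placeholder example inside the prompt is not valid JSON, so the scan skips it even when the [/INST] marker is missing from the decoded text.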

Use Cases And Benefits

The audio analysis application has applications in various sectors, such as:

Telecommunication
Defence
Customer Support and BPOs
Generating Conversational Dataset

The various benefits of audio analysis applications include:

Improve content accessibility
Get useful insights
Automatic note-taking

Conclusion

Now, our audio analysis application is ready. We have seen how we can generate useful insights from an audio file (.wav format). We used pre-trained models from Hugging Face to perform transcription, summarization, sentiment analysis, etc., and speechbrain models for speaker diarization and emotion analysis.

We can use the audio analysis application to analyze conversational audio data in sectors like telecommunication, BPOs, customer support, technical support, and military or defence (here, we need to encrypt our process).

FAQs

What is Speaker Diarization?

Speaker diarization is the process of partitioning an audio recording into segments according to who is speaking. It is used to enhance audio-to-text transcription by attributing each part of the transcript to a speaker.

What are the different models or resources we can use for Speech-to-Text?

For the speech-to-text task, we can use open-source models like Whisper from OpenAI, together with libraries like pyannote.audio and speechbrain to segment the audio file and generate embeddings.

How to perform audio analysis?

To perform audio analysis or to generate useful insights from the audio file, we first generate text from the audio file, which also partitions the audio into timestamped segments. Then, we generate an embedding for each audio segment and cluster the embeddings by speaker. Finally, we generate a diarized transcript and analyze the transcript and text for summarization, emotion analysis, and topic analysis.

What is the use case of the Audio Analysis Application and its important features?

The Audio Analysis Application has multiple use cases in sectors such as telecommunication, BPOs, customer support, technical support, and military or defence. It benefits these sectors by providing useful insights, such as text from audio and a summary of the conversation and its outcome, and it can be extended to perform tone analysis of each speaker.
