Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech
https://towardsdatascience.com/sesame-speech-model-how-this-viral-ai-model-generates-human-like-speech/
Sat, 12 Apr 2025

A deep dive into residual vector quantizers, conversational speech AI, and talkative transformers.
Recently, Sesame AI published a demo of their latest speech-to-speech model: a conversational AI agent that is really good at speaking. It provides relevant answers, speaks with expression, and honestly, it is just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon. 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or a CSM. It inputs both text and audio, and generates speech as audio. While they haven’t revealed their training data sources in the articles, we can still make an educated guess. The blog post heavily cites another CSM, 2024’s Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that converts audio waveforms into discrete “latent” tokens first, and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let’s learn how.

Mimi takes the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, then 5x, then 6x, and so on. In the end, the signal is downsampled by a factor of 1920, reducing it to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented by about 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 samples to just 12.5 frames per second and converts the signal into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by a factor of 1920 and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, with each frame as a 512-dimensional vector. (Image from author’s video)
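To make the downsampling concrete, here is a minimal PyTorch sketch of a Mimi-style strided convolutional encoder. The kernel sizes, padding, channel widths, and activation are assumptions for illustration; only the stride factors (4, 5, 6, 8, 2) and the 512-dimensional output come from the description above.

import torch
import torch.nn as nn

strides = [4, 5, 6, 8, 2]   # product = 1920
layers = []
in_channels = 1
for s in strides:
    # Each strided convolution downsamples the sequence by a factor of s
    layers.append(nn.Conv1d(in_channels, 512, kernel_size=2 * s, stride=s, padding=s // 2))
    layers.append(nn.ELU())
    in_channels = 512
encoder = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 24000)   # 1 second of 24 kHz audio (batch, channels, samples)
frames = encoder(waveform)
print(frames.shape)                   # torch.Size([1, 512, 12]) -- roughly 12.5 frames per second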

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layer, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language-modeling transformers to train generative models.

Mimi uses a Residual Vector Quantizer or RVQ tokenizer to achieve this. We will talk about the residual part soon, but first, let’s look at what a simple vanilla Vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1000 code vectors, all of size 512 (the same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we will map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic, where I go much deeper into it.

More about Vector Quantization! (Video by author)
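Here is a minimal sketch of that snapping operation. The codebook size and random initialization are purely illustrative; a real VQ layer is trained, typically with commitment and codebook losses.

import torch

codebook = torch.randn(1000, 512)       # 1000 code vectors of size 512 (illustrative)
frames = torch.randn(12, 512)           # ~1 second of Mimi frame embeddings

distances = torch.cdist(frames, codebook)   # (12, 1000) pairwise distances
token_ids = distances.argmin(dim=1)         # one discrete token per frame
quantized = codebook[token_ids]             # the "snapped" embeddings
print(token_ids)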

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high because we are mapping each vector to its cluster’s centroid. This “snap” is rarely perfect, so there is always an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
  3. Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
  4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.
Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get a collection of N discrete tokens from each stage of quantization to represent one audio frame.
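Here is a short sketch of that residual loop; the number of codebooks and their sizes are illustrative, not the actual Mimi configuration.

import torch

def rvq_encode(x, codebooks):
    # Quantize, subtract the chosen centroid, and repeat on whatever is left over.
    residual = x
    tokens = []
    for codebook in codebooks:
        ids = torch.cdist(residual, codebook).argmin(dim=1)
        tokens.append(ids)
        residual = residual - codebook[ids]   # the error this stage did not capture
    return tokens                             # N discrete tokens per frame

codebooks = [torch.randn(1024, 512) for _ in range(8)]   # N = 8 codebooks (illustrative)
frames = torch.randn(12, 512)
codes = rvq_encode(frames, codebooks)                    # list of 8 tensors, each of shape (12,)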

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) use multiple codebooks to encode the input vector — one entry from each codebook. (Screenshot from author’s video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space. 

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.
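As a rough sketch of that distillation objective (the shapes, the random tensors, and the absence of any loss weighting are assumptions for illustration, not the published training code):

import torch
import torch.nn.functional as F

semantic_code = torch.randn(12, 512, requires_grad=True)   # dequantized semantic VQ output
wavlm_embedding = torch.randn(12, 512)                      # frozen teacher embeddings

# Cosine distance = 1 - cosine similarity, averaged over the frames
distillation_loss = 1 - F.cosine_similarity(semantic_code, wavlm_embedding, dim=-1).mean()
distillation_loss.backward()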


Audio Decoder

Given a conversation containing text and audio, we first convert them into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then input into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lighter-weight transformer called the audio decoder then reconstructs the remaining codebook tokens, conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are generated using N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, whereas the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.
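A minimal sketch of those per-codebook output heads (the hidden size, number of codebooks, and codebook size are assumptions):

import torch
import torch.nn as nn

num_codebooks, codebook_size, hidden_dim = 8, 1024, 1024

# One classification head per remaining codebook, since each codebook is its own vocabulary
heads = nn.ModuleList(
    [nn.Linear(hidden_dim, codebook_size) for _ in range(num_codebooks - 1)]
)

decoder_state = torch.randn(1, hidden_dim)               # conditioned on the zeroth code
logits = [head(decoder_state) for head in heads]         # one distribution per codebook
next_codes = [logit.argmax(dim=-1) for logit in logits]  # pick a code from each vocabulary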

The Sesame Architecture (Illustration by the author)

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this audio back to a waveform. For this, we apply transposed convolutional layers to upsample the embedding from 12.5 Hz back to a 24 kHz waveform, basically reversing the transforms we applied during audio preprocessing.

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

  1. Sesame is built on a multimodal Conversational Speech Model (CSM).
  2. Text and audio are tokenized together to form a sequence of tokens and input into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed like in any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
  4. The multimodal backbone transformer consumes a sequence of tokens and predicts the next zeroth codeword.
  5. Another lightweight transformer, called the Audio Decoder, predicts the remaining codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the generated codewords and upsampling back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: https://arxiv.org/abs/2308.16692


Building a Local Voice Assistant with LLMs and Neural Networks on Your CPU Laptop
https://towardsdatascience.com/building-a-local-voice-assistant-with-llms-and-neural-networks-on-your-cpu-laptop-95a876c11130/
Tue, 19 Nov 2024

A practical guide to run lightweight LLMs using Python

With the rise of multimodal Large Language Models (LLMs), we can now interact with them in more ways than just typing text, like using audio inputs. OpenAI has recently released a voice feature for ChatGPT, allowing one to talk directly with the chat platform. This opens up a myriad of novel opportunities and applications built around it.

For machine learning and data science practitioners, it’s an exciting time to be involved. Using OpenAI’s realtime speech-to-speech APIs, you can create a voice assistant powered by these multimodal LLMs. However, if you prefer open-source libraries, you can build a voice assistant entirely in a local environment, without subscriptions to proprietary APIs!

Why local voice assistant?

  1. Data privacy
  2. No API calls limit
  3. Fine-tuning models

First, most people who use mainstream generative AI chatbots are aware that their data is transmitted through the provider’s servers, and many are concerned about data privacy and the leakage of information.

Second, using proprietary APIs means being subject to rate limits. For example, OpenAI’s Realtime API is rate-limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1–4.

Third, the LLMs hosted behind these proprietary API gates are powerful but are not fine-tuned or tailored to your specific domain. On the other hand, a locally hosted LLM-based voice assistant allows you to do inference without transferring data to a cloud server. And you can choose lightweight LLMs to fine-tune and deploy on a CPU machine (i.e., a laptop or mobile device). How nice is that! 🙂

In this post, I will walk you through how I built a voice assistant on a CPU-based machine. In fact, I did this on my Intel CPU (2 GHz quad-core Intel Core i5) MacBook Pro laptop with 32 GB of RAM, no GPU involved!

Local Voice Assistant

To build a voice assistant, there are four main components that we will need to set up:

  1. Record the voice
  2. Convert voice to text
  3. Answer the question in text format
  4. Convert the text answer back to voice

First, we need a library that can record audio from the device’s microphone. Conveniently, the sounddevice library provides the functionality to capture audio and save it as a WAV file.

import wave

import numpy as np
import sounddevice as sd

duration = 5  # recording length in seconds (adjust as needed)
sampling_rate = 16000  # set sample rate to 16 kHz for compatibility with whisper.cpp

# Record audio using sounddevice
recorded_audio = sd.rec(
    int(duration * sampling_rate),
    samplerate=sampling_rate,
    channels=1,
    dtype=np.int16,
)
sd.wait()  # Wait until recording is finished

# Save audio to WAV file
audio_file = "<PATH>/recorded_audio.wav"
with wave.open(audio_file, "w") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit audio
    wf.setframerate(sampling_rate)
    wf.writeframes(recorded_audio.tobytes())

The sampling rate is set to 16000 to match the rate used by OpenAI’s Whisper model.

Next, we use OpenAI’s Whisper model to transcribe audio to text. For this, we select the ggml-base.en.bin model. However, there is a wide range of models that you can choose and experiment with.


import subprocess

import streamlit as st  # used here for error reporting in the Streamlit app

WHISPER_BINARY_PATH = "/<PATH>/whisper.cpp/main"
MODEL_PATH = "/<PATH>/whisper.cpp/models/ggml-base.en.bin"

transcription = ""
try:
    result = subprocess.run(
        [
            WHISPER_BINARY_PATH,
            "-m",
            MODEL_PATH,
            "-f",
            audio_file,
            "-l",
            "en",
            "-otxt",
        ],
        capture_output=True,
        text=True,
    )
    # Extract the transcription
    transcription = result.stdout.strip()
except FileNotFoundError:
    st.error(
        "Whisper.cpp binary not found. Make sure the path to the binary is correct."
    )

Then, we can use an LLM to generate a text-based answer. Here, we use Ollama’s server to load a lightweight LLM, qwen:0.5b, which is about 400 MB, so that it can easily fit into my laptop’s memory. A utility function, run_ollama_command, is used to achieve that.

import subprocess

def run_ollama_command(model, prompt):
    try:
        # Execute the ollama command using subprocess
        result = subprocess.run(
            ["ollama", "run", model],
            input=prompt,
            text=True,
            capture_output=True,
            check=True,
        )

        # Output the result from Ollama
        print("Response from Ollama:")
        print(result.stdout)
        return result.stdout

    except subprocess.CalledProcessError as e:
        # Handle errors in case of a problem with the command
        print("Error executing Ollama command:")
        print(e.stderr)

We give it a simple prompt, asking the LLM to answer the transcribed question in less than 15 words.

import re

# Parse the transcription text
# Use regex to find all text after timestamps
matches = re.findall(r"] *(.*)", transcription)

# Concatenate all extracted text
concatenated_text = " ".join(matches)

# Call ollama to get an answer
prompt = f"""
Please ignore the text [BLANK_AUDIO]. Given this question: "{concatenated_text}", please answer it in less than 15 words.
"""
answer = run_ollama_command(model="qwen:0.5b", prompt=prompt)

Finally, we can convert the text answer back to audio using NVIDIA’s NeMo toolkit. The fastpitch_model (a transformer network) converts the text answer into a spectrogram, and then hifigan_model (a Generative Adversarial Network) is used to convert the spectrogram into an audio waveform.

from io import BytesIO

import nemo.collections.tts as nemo_tts
import torchaudio

# Integrate NVIDIA NeMo TTS to read the answer from ollama
if answer:
    try:
        # Load the FastPitch and HiFi-GAN models from NeMo
        fastpitch_model = nemo_tts.models.FastPitchModel.from_pretrained(
            model_name="tts_en_fastpitch"
        )
        hifigan_model = nemo_tts.models.HifiGanModel.from_pretrained(
            model_name="tts_en_lj_hifigan_ft_mixerttsx"
        )

        # Set the FastPitch model to evaluation mode
        fastpitch_model.eval()
        parsed_text = fastpitch_model.parse(answer)
        spectrogram = fastpitch_model.generate_spectrogram(tokens=parsed_text)

        # Convert the spectrogram into an audio waveform using HiFi-GAN vocoder
        hifigan_model.eval()
        audio = hifigan_model.convert_spectrogram_to_audio(spec=spectrogram)

        # Save the audio to a byte stream
        audio_buffer = BytesIO()
        torchaudio.save(audio_buffer, audio.cpu(), sample_rate=22050, format="wav")
        audio_buffer.seek(0)

    except Exception as e:
        print(f"An error occurred during speech synthesis: {e}")

Does it work?

Bringing everything together, I used Streamlit to create a prototype. Here’s the overall system diagram. The Streamlit app provides a start button for users to record audio. The audio is recorded and saved as a WAV file using sounddevice. Then a whisper.cpp model transcribes the WAV file to text. LatentDirichletAllocation is applied for topic modeling, along with CountVectorizer for word counts, which provides insights into the voice input. Afterward, a local LLM, qwen:0.5b, is used to generate a text-based answer to the question. Finally, NVIDIA’s NeMo toolkit is used to synthesize the text back into speech, which is then made available in the Streamlit app for users to review.

System diagram of the voice assistant by author

Please take a look at the video below to see how it works as well. I asked the voice assistant to provide a good recipe for making a delicious pizza. The spoken answer appears at 54 seconds into the video. Please feel free to fast forward to that point to check out the response. 🙂 There is definitely room for improvement in terms of latency!

What Else Can be Improved?

Great, I just walked you through setting up a local voice assistant on a CPU laptop! Now, what else could we improve? The list could be long, but here are my personal top picks: adding features to search and filter past conversations, organizing them with labels or tabs, making the assistant multilingual, and letting users see where the answers come from.

Final Thoughts

With the increased popularity of multi-modal LLMs, we now have more ways to interact with AI tools. However, the principles that apply to other machine learning models also apply to generative AI models. These models can sometimes generate hallucinated answers, so it’s important to verify the accuracy of their outputs and remain mindful of fairness and ethics. Nevertheless, the local voice assistant is helpful for many tasks and requires only a CPU to run. It can be extended to run on mobile devices too. If you have interesting ideas for extending this or suggestions, please don’t hesitate to reach out or share them with other readers as well. I would also suggest giving Open WebUI a try, as it integrates audio models and can run locally as well. I hope you enjoyed reading the post. 🙂

Exploring How the New OpenAI Realtime API Simplifies Voice Agent Flows
https://towardsdatascience.com/exploring-how-the-new-openai-realtime-api-simplifies-voice-agent-flows-7b136ef8483d/
Thu, 03 Oct 2024

Setting up a Voice Agent using Twilio and the OpenAI Realtime API

Introduction

At the recent OpenAI Dev Day on October 1st, 2024, OpenAI’s biggest release was the reveal of their Realtime API:

"Today, we’re introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps.

Similar to ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already supported in the API."

(source: OpenAI website)

As per their announcement, some of its key benefits include low latency and speech-to-speech capabilities. Let’s see how that plays out in practice when building voice AI agents.

It also has an interruption handling feature, so that the realtime stream will stop sending audio if it detects you are trying to speak over it, a useful feature for sure when building voice agents.

Contents

In this article we will:

  • Compare what a phone voice agent flow might have looked like before the Realtime API, and what it looks like now,
  • Review a GitHub project from Twilio that sets up a voice agent using the new Realtime API, so we can see what the implementation looks like in practice, and get an idea how the websockets and connections are setup for such an application,
  • Quickly review the React demo project from OpenAI that uses the Realtime API,
  • Compare the pricing of these various options.

Voice Agent Flows

Before the OpenAI Realtime API

To get a phone voice agent service working, there are some key services we require:

  • Speech-to-Text (e.g. Deepgram),
  • LLM/Agent (e.g. OpenAI),
  • Text-to-Speech (e.g. ElevenLabs).

These services are illustrated in the diagram below

(source https://github.com/twilio-labs/call-gpt, MIT license)

That of course means integration with a number of services, and separate API requests for each part.

The new OpenAI Realtime API allows us to bundle all of those together into a single request, hence the term, speech to speech.

After the OpenAI Realtime API

This is what the flow diagram would look like for a similar new flow using the new OpenAI Realtime API.

Obviously this is a much simpler flow. We are just passing the speech/audio from the phone call directly to the OpenAI Realtime API, with no need for an intermediary speech-to-text service.

And on the response side, the Realtime API again provides an audio stream as the response, which we can send right back to Twilio (i.e. to the phone call response). So again, there is no need for an extra text-to-speech service, as it is all taken care of by the OpenAI Realtime API.

Source code review for a Twilio and Realtime API voice agent

Let’s look at some code samples for this. Twilio has provided a great GitHub repository example for setting up this Twilio and OpenAI Realtime API flow. You can find it here:

GitHub – twilio-samples/speech-assistant-openai-realtime-api-node

Here are some excerpts from key parts of the code related to setting up

  • the websockets connection from Twilio to our application, so that we can receive audio from the caller, and send audio back,
  • and the websockets connection to the OpenAI Realtime API from our application.

I have added some comments in the source code below to try and explain what is going on, especially regarding the websocket connection between Twilio and our application, and the websocket connection from our application to OpenAI. The triple dots (…) refer to sections of the source code that have been removed for brevity, since they are not critical to understanding the core features of how the flow works.

// On receiving a phone call, Twilio forwards the incoming call request to
// a webhook we specify, which is this endpoint here. This allows us to 
// create programmatic voice applications, for example using an AI agent
// to handle the phone call
// 
// So, here we are providing an initial response to the call, and creating
// a websocket (called a MediaStream in Twilio, more on that below) to receive
// any future audio that comes into the call
fastify.all('/incoming', async (request, reply) => {
    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                          <Response>
                              <Say>Please wait while we connect your call to the A. I. Voice Assistant, powered by Twilio and the Open-A.I. Realtime API</Say>
                              <Pause length="1"/>
                              <Say>O.K. you can start talking!</Say>
                              <Connect>
                                  <Stream url="wss://${request.headers.host}/media-stream" />
                              </Connect>
                          </Response>`;

    reply.type('text/xml').send(twimlResponse);
});

fastify.register(async (fastify) => {

    // Here we are connecting our application to the websocket media stream we
    // setup above. That means all audio that comes though the phone will come
    // to this websocket connection we have setup here
    fastify.get('/media-stream', { websocket: true }, (connection, req) => {
        console.log('Client connected');

        // Now, we are creating websocket connection to the OpenAI Realtime API
        // This is the second leg of the flow diagram above
        const openAiWs = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01', {
            headers: {
                Authorization: `Bearer ${OPENAI_API_KEY}`,
                "OpenAI-Beta": "realtime=v1"
            }
        });

        ...

        // Here we are setting up the listener on the OpenAI Realtime API 
        // websockets connection. We are specifying how we would like it to
        // handle any incoming audio streams that have come back from the
        // Realtime API.
        openAiWs.on('message', (data) => {
            try {
                const response = JSON.parse(data);

                ...

        // This response type indicates an LLM response from the Realtime API
        // So we want to forward this response back to the Twilio Media Stream
        // websockets connection, which the caller will hear as a response
        // on the phone
                if (response.type === 'response.audio.delta' && response.delta) {
                    const audioDelta = {
                        event: 'media',
                        streamSid: streamSid,
                        media: { payload: Buffer.from(response.delta, 'base64').toString('base64') }
                    };
         // This is where we actually send the response back to the Twilio
         // MediaStream websockets connection. Notice how we are sending the
         // response back directly. No need for text to speech conversion from
         // the OpenAI response. The OpenAI Realtime API already provides the
         // response as an audio stream (i.e. speech to speech)
                    connection.send(JSON.stringify(audioDelta));
                }
            } catch (error) {
                console.error('Error processing OpenAI message:', error, 'Raw message:', data);
            }
        });

        // This part specifies how we handle incoming messages to the Twilio
        // MediaStream websockets connection i.e how we handle audio that comes
        // into the phone from the caller
        connection.on('message', (message) => {
            try {
                const data = JSON.parse(message);

                switch (data.event) {
        // This case ('media') is that state for when there is audio data 
        // available on the Twilio MediaStream from the caller
                    case 'media':
                        // we first check that our OpenAI Realtime API websockets
                        // connection is open
                        if (openAiWs.readyState === WebSocket.OPEN) {
                            const audioAppend = {
                                type: 'input_audio_buffer.append',
                                audio: data.media.payload
                            };
                        // and then forward the audio stream data to the
                        // Realtime API. Again, notice how we are sending the
                        // audio stream directly, with no speech to text conversion
                        // as would have been required previously
                            openAiWs.send(JSON.stringify(audioAppend));
                        }
                        break;

                  ...
                }
            } catch (error) {
                console.error('Error parsing message:', error, 'Message:', message);
            }
        });

...

fastify.listen({ port: PORT }, (err) => {
    if (err) {
        console.error(err);
        process.exit(1);
    }
    console.log(`Server is listening on port ${PORT}`);
});

So, that is how the new OpenAI Realtime API flow plays out in practice.

Regarding the Twilio MediaStreams, you can read more about them here. They are a way to set up a websockets connection between a call to a Twilio phone number and your application. This allows streaming of audio from the call to and from your application, letting you build programmable voice applications over the phone.

To get the code above running, you will also need to set up a Twilio number and ngrok. You can check out my other article over here for help setting those up.

AI Voice Agent with Twilio, Express and OpenAI

Since access to the OpenAI Realtime API has only just been rolled out, not everyone may have access yet. I initially was not able to access it: running the application worked, but as soon as it tried to connect to the OpenAI Realtime API, I got a 403 error. So in case you see the same issue, it could be related to not having access yet either.

React OpenAI Realtime API Demo

OpenAI has also provided a great demo for testing out the Realtime API in the browser using a React app. I tested this out myself and was very impressed with the speed of response from the voice agent: the response feels instant, with no noticeable latency, and makes for a great user experience.

Sharing a link to the source code here. It has instructions in the README.md for how to get set up.

GitHub – openai/openai-realtime-console: React app for inspecting, building and debugging with the…

This is a picture of what the application looks like once you get it running locally.

(source https://github.com/openai/openai-realtime-console, MIT license)

Pricing

Let’s compare the cost of using the OpenAI Realtime API versus a more conventional approach using Deepgram for speech to text (STT) and text to speech (TTS) and using OpenAI GPT-4o for the LLM part.

Comparison using the prices from their websites shows that for a 1 minute conversation, with the caller speaking half the time, and the AI agent speaking the other half, the cost per minute using Deepgram and GPT-4o would be $0.0117/minute, whereas using the OpenAI Realtime API would be $0.15/minute.

That means using the OpenAI Realtime API would be just over 10x the price per minute.
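As a quick back-of-the-envelope check of the Realtime API figure, here is the arithmetic. The per-minute audio rates below are assumptions based on the pricing published at launch and may well have changed by the time you read this.

# Assumed Realtime API audio rates at launch (check the current pricing page)
audio_input_per_min = 0.06    # $ per minute of audio sent to the API
audio_output_per_min = 0.24   # $ per minute of audio generated by the API

# 1-minute call: caller speaks ~30 seconds, the AI agent speaks ~30 seconds
realtime_cost = 0.5 * audio_input_per_min + 0.5 * audio_output_per_min
print(f"Realtime API: ${realtime_cost:.2f} per minute")   # ~$0.15 per minute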

It does sound like a fair amount more expensive, though we should balance that with some of the benefits the OpenAI Realtime API could provide, including

  • reduced latencies, crucial for having a good voice experience,
  • ease of setup due to fewer moving parts,
  • conversation interruption handling provided out of the box.

Also, please do be aware that prices can change over time, so the prices you find at the time of reading this article, may not be the same as those reflected above.

Conclusion

Hope that was helpful! What do you think of the new OpenAI Realtime API? Think you will be using it in any upcoming projects?

While we are here, are there any other tutorials or articles around voice agents and voice AI you would be interested in? I am deep diving into that field a bit just now, so would be happy to look into anything people find interesting.

Happy hacking!


All images provided are by the author, unless stated otherwise

I Invented a Way to Speak to an AI, Keeping Your Privacy
https://towardsdatascience.com/i-invented-a-way-to-speak-to-an-ai-keeping-your-privacy-ddbca5f24e4a/
Fri, 28 Jun 2024

The tech is called "Silent Voice."
Cutting-edge smart assistants, like GPT-4o, could be awesome for voice interaction with an AI, but sometimes voice interaction itself has its drawbacks:

  • You may cringe about talking to a device in front of others and be afraid to look silly.
  • Sometimes, you are not supposed to talk, like in an office meeting (even less so to talk on a phone).
  • You don’t want others to overhear private information, such as dictating a phone number inside a train wagon full of people.

I was thinking about those issues and thought that perhaps the same AI that is bringing this problem could help with a solution. And I got an idea. I called it "Silent Voice."

With Silent Voice, you would put the phone in front of your mouth and talk out your request, but without applying your voice—not even a whisper.

How is that possible? Is it a form of lip-reading? No. Is it a way of amplifying any noise coming from your mouth? Nope. What is it, then?

How Silent Voice works

Silent Voice consists of an ultrasound generator and speaker, which emits short ultrasound pulses. You activate Silent Voice first, bring the small ultrasound speaker close to your mouth, and start speaking normally.

Not exactly "speaking normally" because that would defeat the purpose. You speak almost normally but without applying your voice. You don’t need to whisper, as Silent Voice doesn’t work with sound at all—it uses the ultrasound that enters your mouth and bounces back, being reflected and disturbed in multiple and complex ways.

The critical part of Silent Voice is that the reflected ultrasound, which has been distorted by the vocal tract (mouth’s internal parts such as the tongue), is picked up by a microphone, digitized, and then passed to a Machine Learning classifier. The classifier predicts which "phoneme" corresponds to a given vector (a phoneme is each elemental sound, similar to a letter).

Once phonemes are predicted, Silent Voice uses standard speech recognition technology to identify the corresponding letters and words. The recognized text is then delivered to the operating system and then to any application you are using, such as WhatsApp. In the end, WhatsApp will show the text you dictate without using your voice.

TLDR version of "Silent Voice"

Somebody told me that "Silent Voice" can be described by the following equation:

Silent Voice = Ultrasound echo + Machine Learning

That’s it.

Easier said than done, of course.

The Machine Learning process

One critical phase is the classification of phonemes from the ultrasound echo picked up by the microphone. In (supervised) Machine Learning, there are several phases that I’ll explain—adapted to the Silent Voice case:

  1. Raw data is collected, with "tags" indicating to which class each sample belongs. In Silent Voice, each sample contains the echo (digitized signal) of a single ultrasound pulse. The tag is an identification (provided by a human) of which phoneme the user pronounced at that exact moment.
  2. Features are extracted from the ultrasound samples, so each sample is converted to a vector of signal characteristics, such as its intensity and many more (this part is way too technical to describe here). The result of this phase is a matrix called the "dataset," where the columns are the calculated features and the rows are the samples.
  3. The dataset is partitioned into "train" and "test" parts.
  4. Using the training partition, a previously chosen classifier is trained (more on this below). Training is computationally intensive, but fortunately, training procedures are highly optimized in standard platforms like Google’s Colab.
  5. The classifier’s predictions for the test partition are calculated, and then the predictions’ quality is evaluated using standard metrics like "accuracy," "precision," "recall," and many more.

The choice of the exact classifier (SVM, Random Forest, Neural Nets, etc.) can have a big impact on the quality of the predictions, so several are tried in practice to see which one works best. This is a highly empirical process.

Once the classifier is trained and its performance verified, we can use it to predict, in the case of Silent Voice, the phoneme corresponding to the last ultrasound echo collected by the microphone. This information is pure gold.
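To make the pipeline concrete, here is a hypothetical scikit-learn sketch of the supervised workflow described above. The feature matrix and phoneme labels are random stand-ins for data that would have to be collected and tagged, and Random Forest is just one of the classifiers worth trying.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.random.rand(5000, 64)               # 5000 echo samples, 64 extracted features each
y = np.random.randint(0, 40, size=5000)    # ~40 phoneme classes (illustrative)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))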

What would Silent Voice look like in real life?

Silent Voice is mostly intended to enhance smartphones, which means that the user wouldn’t see anything at all different from a regular phone. The "augmented" smartphones would have an ultrasound speaker (which could be an adapted version of the speaker they have at the bottom) and an ultrasound microphone near the speaker to collect the echo. This is the "embodiment" I presented in the patent I mention at the bottom of this post. All the Machine Learning parts would most likely need just software for advanced smartphones or a few additional chips.

In principle, it is possible to build a "Silent Voice peripheral" with the ultrasound speaker and mic, as well as all the electronic processing, until the text is obtained and sent to a phone or a computer via Bluetooth. But from the cost point of view, it makes much more sense to integrate Silent Voice into the smartphone you carry with you anyway.

Other Silent Voice use cases

I think the most relevant application of Silent Voice is dictating chat messages or emails, even with people around you, without losing privacy or looking silly or weird when talking to a device.

But there are other scenarios where Silent Voice could be a lifesaver:

  • If you get a phone call in an extremely noisy environment, you’ll have to yell to make yourself understood… unless you use Silent Voice. With it, you can get a phone call, and the system will replace ambient sound with just an artificial voice pronouncing the same phonemes you are pronouncing. Eventually, it will be possible to use a voice that imitates your own.
  • There are people who, due to an illness or accident, have damaged their throats and cannot produce their voices normally or at all. However, the Silent Voice system could allow these people to produce speech simply by moving the oral cavity without needing to correct the throat problem. Although the number of people with voice loss is not very high, their cases are important because it can be a disabling injury.

Where I got the idea from

Previously, I had worked as an AI researcher at a university, and one of my PhD students (Edgar) worked on a way of detecting (even counting) human bodies from outside a room using mostly regular WiFi. I clarify that we (Edgar and myself) are not spies or anything of the sort. However, the problem was interesting, and we made a nice contribution to the area by reporting the findings in an academic paper. I’m not going to bore you with the published paper: you can read a very digestible account of the techniques involved in the post "Do You Know that Human Bodies Can Be Seen From Outside a Room with WiFi?" published here at Medium.

I’m recalling that research work about detecting human bodies because the technique used in "Silent Voice" is basically the same. The idea is:

(…) human bodies produce characteristic disturbances on electromagnetic signals traversing them, making it possible to analyze the disturbances for different purposes—counting people in our case.

I thought this idea of analyzing disturbances could be applied to detecting the position of the mouth, including the jaw, the tongue, and everything else—without actually analyzing the vocal tract positions.

What if we send an acoustic signal, like an ultrasound pulse, to the mouth, then pick the ultrasound coming back with a microphone and compare it with patterns corresponding to the different letters we pronounce (more exactly, the phonemes)?

Is Silent Voice entirely original?

No, Silent Voice is an example of what is called a "Silent Speech Interface." These are intended, well, to convey speech information without sound, but the specific solutions are wildly varied:

  • Some use medical ultrasound imaging equipment attached to the chin – the method analyzes the images.
  • "Non-audible murmur" technology, which tries to amplify whispers.
  • Electromagnetic and radar analysis of vocal tract activity.
  • Surface electromyography and encephalographic sensors.
  • Brain implants!

Silent Voice differs by relying entirely on Machine Learning analysis of ultrasound disturbances instead of "analytical" solutions (that is, mathematical models trying to predict the exact tongue position, and so on). Analytical solutions are extremely hard to develop, but I avoided them not because of laziness but because I wanted a data-driven Machine Learning-based solution.

In most situations where you can have lots of data, Data Science trumps mathematical ad-hoc models any day of the week.

Final remarks

I registered Silent Voice in a "Provisional Patent Application" at the USPTO, and they gave me the number "63/637,554" for it. Patenting like this is the most basic form of intellectual property protection for my idea.

Silent Voice is in the concept phase, meaning that it hasn’t been implemented as a working prototype. That is because I no longer work in an AI lab—I retired from my full-time job 5 years ago. So, I couldn’t know exactly how well Silent Voice works.

But I know Silent Voice could work because it follows a data-driven process that has succeeded in so many recent projects, particularly the person-detection one mentioned above.

Perhaps you think that talking –even silently– to a device makes you look silly anyway. But normally, you’d put the phone (and your hand) in front of your mouth (like in the figure at the top), and this would conceal your "talking." In the end, only real-life usage of Silent Voice will make evident its associated social concerns.

What I intend to do with Silent Voice is to give this tech to somebody who could implement it and make it as impactful as I think it could be.

With Silent Voice, in some years, we could all have conversations with AI systems without losing our privacy. Yes, voice is a great way to communicate with AI, but not at the expense of our privacy.

Get my personally curated AI news analysis and tech explainers with my short free newsletter, "The Skeptic AI Enthusiast," at https://rafebrena.substack.com/

Build a Locally Running Voice Assistant
https://towardsdatascience.com/build-a-locally-running-voice-assistant-2f2ead904fe9/
Fri, 29 Dec 2023

Ask an LLM a question without leaking private information
Image generated by the author, with help from openart.ai

I have to admit that I was initially skeptical about the ability of Large Language Models (LLMs) to generate code snippets that actually worked. I tried it expecting the worst, and I was pleasantly surprised. Like any interaction with a chatbot, the way the question is formatted matters, but with time, you get to know how to specify the boundaries of the problem you need help with.

I was getting used to having an online chatbot service always available while writing code when my employer issued a company-wide policy prohibiting employees from using it. I could go back to my old googling habits, but I decided to build a locally running LLM service that I could question without leaking information outside the company walls. Thanks to the open-source LLM offering on HuggingFace, and the chainlit project, I could put together a service that satisfies the need for coding assistance.

The next logical step was to add some voice interaction. Although voice is not well-suited for coding assistance (you want to see the generated code snippets, not hear them), there are situations where you need help with inspiration on a creative project. The feeling of being told a story adds value to the experience. On the other hand, you may be reluctant to use an online service because you want to keep your work private.

In this project, I’ll take you through the steps to build an assistant that allows you to interact vocally with an open-source LLM. All the components are running locally on your computer.

Architecture

The architecture involves three separate components:

Flowchart of the three components. Image by the author.

The three components are standalone projects, each having its own github repository. Let’s walk through each component and see how they interact.

Chat service

The chat service runs the open-source LLM called HuggingFaceH4/zephyr-7b-alpha. The service receives a prompt through a POST call, passes the prompt through the LLM, and returns the output as the call response.
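A minimal sketch of such a call from a client is shown below; the port, path, and field names are assumptions, so check chat_server.py and chat_server_config.xml for the actual values.

import requests

response = requests.post(
    "http://localhost:5000/chat",                  # hypothetical host, port, and path
    json={"prompt": "Write a haiku about Python."},
    timeout=120,
)
print(response.json())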

You can find the code here.

In …/chat_service/server/, rename chat_server_config.xml.example to chat_server_config.xml.

You can then start the chat server with the following command:

python chat_server.py

When the service runs for the first time, it takes several minutes to start because large files get downloaded from the HuggingFace website and stored in a local cache directory.

You get a confirmation from the terminal that the service is running:

Confirmation that the chat service is running. Image by the author.

If you want to test the interaction with the LLM, go to …/chat_service/chainlit_interface/.

Rename app_config.xml.example to app_config.xml. Launch the web chat service with

./start_interface.sh

Browse to the local address localhost:8000

You should be able to interact with your locally running LLM through a text interface:

Text interaction with the locally running LLM. Image by the author.

Voice assistant service

The voice assistant service is where the speech-to-text and text-to-speech conversions happen. You can find the code here.

Go to …/voice_assistant/server/.

Rename voice_assistant_service_config.xml.example to voice_assistant_service_config.xml.

The assistant starts by playing the greeting to indicate that it is listening to the user. The greeting text is configured in voice_assistant_config.xml, under the element __:

The voice_assistant_config.xml file. Image by the author.

The text-to-speech engine that converts text into spoken audio you can hear through your audio output device is pyttsx3. From my experience, this engine speaks with a reasonably natural tone, both in English and in French. Unlike other packages that rely on an API call, it runs locally.
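For reference, a minimal pyttsx3 sketch looks like this (the rate setting is optional and the greeting text is just an example):

import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)    # speaking speed, in words per minute
engine.say("Hello! How can I help you?")
engine.runAndWait()                # blocks until the sentence has been spoken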

A model called facebook/seamless-m4t-v2-large performs the speech-to-text inference. Model weights get downloaded when voice_assistant_service.py is first run.

The principal loop in voice_assistant_service.main() performs the following tasks:

  • Get a sentence from the microphone. Convert it to text using the speech-to-text model.
  • Check if the user spoke the message defined in the __ element from the configuration file. In this case, the conversation ends, and the program terminates after playing the goodbye message.
  • Check if the sentence is gibberish. The speech-to-text engine often outputs a valid English sentence, even if I didn’t say anything. By chance, these undesirable outputs tend to repeat themselves. For example, gibberish sentences will sometimes start with "[" or "i’m going to". I collected a list of prefixes often associated with a gibberish sentence in the __ element of the configuration file (this list would likely change for another speech-to-text model). Whenever an audio input starts with one of the prefixes in the list, then the sentence is ignored.
  • If the sentence doesn’t appear to be gibberish, send a request to the chat service. Play the response.

Wake-word service

The last component is a service that continually listens to the user’s microphone. When the user speaks the wake-word, a system call starts the voice assistant service. The wake-word service runs a smaller model than the voice assistant service models. For this reason, it makes sense to have the wake-word service running continuously while the voice assistant service only launches when we need it.

You can find the wake-word service code here.

After cloning the project, move to …/wakeword_service/server.

Rename wakeword_service_gui_config.xml.example to wakeword_service_gui_config.xml.

Rename command.bat.example to command.bat. You’ll need to edit command.bat so the virtual environment activation and the call to voice_assistant_service.py correspond to your directory structure.

You can start the service by the following call:

python gui.py

The core of the wake-word detection service is the openwakeword project. Out of a few wake-word models, I picked the "hey jarvis" model. I found that simply saying "Jarvis?" will trigger the detection.
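A hedged sketch of how openwakeword can be used in such a detection loop is shown below; the model name string, expected audio format, and score threshold are assumptions that may differ between versions, so check the openwakeword documentation for your installed release.

import numpy as np
from openwakeword.model import Model

model = Model(wakeword_models=["hey_jarvis"])   # pretrained "hey jarvis" model (assumed name)

def is_wakeword(frame: np.ndarray, threshold: float = 0.5) -> bool:
    # frame: a chunk of 16 kHz, 16-bit mono microphone audio
    scores = model.predict(frame)               # dict of {model_name: score}
    return any(score > threshold for score in scores.values())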

Whenever the wake-word is detected, a command file gets called, as specified in the __ element of the configuration file. In our case, the command.bat file activates the virtual environment and starts the voice assistant service.

The configuration file of the wake-word detection service GUI. Image by the author.

If you want to use the wake-word detection service for something else, you can edit the command.bat file to make it start whatever program you want.

When the wake-word is detected, the wake-word detection service emits a two-beep audio signal to indicate it was triggered. This quick feedback tells the user that they must wait as the text-to-speech and the speech-to-text models load in the computer’s GPU, which can take several seconds. When both models are loaded, the voice assistant service plays the greeting, and the conversation can begin. After the conversation ends, the voice assistant service terminates, and the wake-word detection service emits another two-beep signal.

When the three components are installed and tested, we can start the chat and the wake-word detection services. The wake-word detection service will constantly listen to the microphone. When the wake-word gets detected ("Jarvis?"), it will start the voice assistant service. You can then ask questions out loud and receive a spoken answer. When the end-of-conversation phrase ("Thank you and goodbye") is detected, the voice assistant service ends.

Example of an interaction with the assistant

Here is an example of an interaction with the voice assistant:

"Jarvis?" [beep-beep]

[Several seconds of silence as the voice assistant models load]

"Hello! How can I help you?"

"What are Saturn’s rings made of?"

"Saturn’s rings are primarily composed of ice particles, with the majority being made up of water ice. However, there are also smaller amounts of other ices, such as ammonia ice and rocky debris, in the rings. The exact composition of the rings varies depending on the distance from Saturn, with the inner rings being denser and made up of smaller particles, while the outer rings are more spread out and contain larger ice particles."

"Tell me the plot of the movie ‘Back to the Future’."

"The movie ‘Back to the Future’ follows the story of Marty McFly, a high school student who is accidentally transported back to 1955 in a time-traveling DeLorean car created by his eccentric scientist friend, Doc Brown. Marty must navigate the challenges of the past while trying to repair the damage to the present and prevent his teenage mother from falling in love with his teenage father. Along the way, Marty befriends a younger version of Doc Brown and helps him overcome a personal tragedy. The movie explores themes of fate, family, and the consequences of altering the past."

"Thank you and goodbye"

"Goodbye" [beep-beep]

Conclusion

In this project, we implemented a voice assistant system composed of a wake-word detection service, a voice assistant service, and a chat service. Each one of these components is a standalone building block reusable for other projects.

Among the many surprises I had while working on this voice assistant, what struck me the most was the quality of the speech-to-text conversion. If you’re like me, you have probably struggled with automated voice recognition systems that fail to transcribe simple commands such as "Turn down the volume"! I expected speech-to-text conversion to be the main stumbling block of the pipeline. After experimenting with a few unsatisfying models, I landed on facebook/seamless-m4t-v2-large and was impressed with the quality of the results. I can even speak a sentence in French, and the neural network will automatically translate it into English. Nothing less than amazing!

I hope you’ll try this fun project, and let me know what you use it for!

Creating Your Personalized Voice Assistant with GPT and Whisper
https://towardsdatascience.com/creating-your-personalized-voice-assistant-with-gpt-and-whisper-ddb9bd8c419/
Thu, 18 May 2023

A Step-by-Step Guide
Photo by Ivan Bandura on Unsplash

This article aims to guide you in creating a simple yet powerful voice assistant tailored to your preferences. We’ll use two powerful tools, Whisper and GPT, to make this happen. You probably already know GPT and how powerful it is, but what is Whisper?

Whisper is an advanced speech recognition model from OpenAI that offers accurate audio-to-text transcription.

We’ll walk you through each step, with coding instructions included. At the end, you’ll have your very own voice assistant up and running.

Before you begin

OpenAI API keys

If you already have an OpenAI API key, you can skip this section.

Both the Whisper and GPT APIs require an OpenAI API key to be accessed. Unlike ChatGPT, where the subscription is a fixed fee, API access is billed based on how much you use the service. The prices are reasonable. At the time of writing, Whisper is priced at $0.006 / minute and GPT (with the model gpt-3.5-turbo) at $0.002 / 1K tokens (a token is approximately 0.75 words).

OpenAI's website. Image by the author.

To get your key, first create an account on the OpenAI website. After signing in, click on your name at the top-right corner and choose View API keys. Once you click Create new secret key, your key is displayed. Make sure to save it, because you won't be able to see it again.

Packages

The code chunk below shows the required libraries for the project: OpenAI's Python library for the AI tasks, pyttsx3 for generating speech, sounddevice for recording and playing back audio, and numpy and scipy for numerical operations. As always, you should create a new virtual environment before installing packages when starting a new project.
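As a rough sketch of the setup, assuming the package names below and the pre-1.0 interface of the openai library (the version current when this article was written), the dependencies and imports might look like this:

```python
# pip install openai pyttsx3 sounddevice numpy scipy
import os
import tempfile

import numpy as np                  # audio buffers are NumPy arrays
import openai                       # Whisper and GPT APIs (pre-1.0 interface)
import pyttsx3                      # local text-to-speech
import sounddevice as sd            # microphone recording and playback
from scipy.io.wavfile import write  # save a NumPy buffer as a WAV file
```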

Code structure

Our code is structured around a single class and takes up approximately 90 lines in total. It assumes that you have a basic understanding of Python classes.

Image by the author.

The listen method captures the user's spoken input and converts it to text using Whisper. The think method sends the text to GPT, which generates a natural language response. The speak method converts the response text into audio that is played back. The process then repeats: the user can continue the conversation by making another request.

Code structure. Image by the author.

init

This function takes care of initializing the history and setting up the API key. We need a history that keeps track of the previous messages. It's basically our assistant's short-term memory, and it allows the assistant to remember what you said earlier in the conversation.
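A minimal sketch of the constructor, assuming the class is named Assistant (a name chosen here for illustration) and that the API key is read from an environment variable:

```python
class Assistant:
    def __init__(self):
        # The API key is assumed to be stored in the OPENAI_API_KEY variable.
        openai.api_key = os.environ["OPENAI_API_KEY"]
        # Short-term memory: the running list of messages sent to GPT.
        # The system prompt is an assumption; change it to give the assistant a role.
        self.history = [
            {"role": "system", "content": "You are a helpful voice assistant."}
        ]
```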

listen

The listen function. Image by the author.

This method is our assistant's ears. The listen function allows the assistant to receive input from the user: it records audio from your microphone and transcribes it into text. Here's what it does:

  • Prints Listening… when recording audio.
  • Records audio for 3 seconds (or any duration you want) using sounddevice at a sample rate of 44100 Hz.
  • Saves the recorded audio as a NumPy array in a temporary WAV file.
  • Uses the OpenAI API's transcribe method to send the audio to Whisper, which transcribes it.
  • Prints the transcribed text to the console to confirm that the transcription was successful.
  • Returns the transcribed text as a string.

In the example, the assistant listens for 3 seconds, but you can change the time as you want.
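Putting those steps together, a sketch of the method might look like the following (continuing the hypothetical Assistant class; the pre-1.0 openai.Audio.transcribe call and the temporary-file handling are assumptions):

```python
    def listen(self, seconds=3, sample_rate=44100):
        """Record from the microphone and transcribe the audio with Whisper."""
        print("Listening...")
        recording = sd.rec(int(seconds * sample_rate),
                           samplerate=sample_rate, channels=1)
        sd.wait()  # block until the recording is finished
        # Whisper expects a file, so save the NumPy buffer as a temporary WAV.
        tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        tmp.close()
        write(tmp.name, sample_rate, recording)
        with open(tmp.name, "rb") as audio_file:
            transcript = openai.Audio.transcribe("whisper-1", audio_file)
        text = transcript["text"]
        print(f"You said: {text}")
        return text
```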

think

The think function. Image by the author.

Our assistant's brain is powered by GPT. The think function receives what the assistant hears and elaborates a response. How? The response is not created on your computer: the text is sent to OpenAI's servers and processed through the API. The response is then saved in the response variable, and both the user message and the response are added to the history, the assistant's short-term memory, which provides the context the GPT model needs to generate its responses.
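A sketch of this method, again as part of the hypothetical Assistant class and using the pre-1.0 ChatCompletion interface with gpt-3.5-turbo:

```python
    def think(self, text):
        """Send the transcribed text to GPT and return its reply."""
        self.history.append({"role": "user", "content": text})
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=self.history,   # the full history gives the model context
        )
        response = completion["choices"][0]["message"]["content"]
        self.history.append({"role": "assistant", "content": response})
        return response
```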

speak

The speak function. Image by the author.

The speak function is responsible for converting text into speech and playing it back to the user. It takes a single parameter, text: a string representing the text to be converted to speech. When the function is called, it initializes the pyttsx3 speech engine with the command engine = pyttsx3.init(). This engine object is the main interface for converting text to speech.

The function then instructs the speech engine to convert the provided text into speech using the command engine.say(text), which queues up the text to be spoken. The command engine.runAndWait() tells the engine to process the queued commands and blocks until the speech has finished playing.

Pyttsx3 handles all text-to-speech conversion locally, which can be a significant advantage in terms of latency.
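A minimal sketch of the method as described above:

```python
    def speak(self, text):
        """Convert text to speech locally with pyttsx3 and play it back."""
        engine = pyttsx3.init()   # main interface to the TTS engine
        engine.say(text)          # queue the text to be spoken
        engine.runAndWait()       # process the queue and block until done
```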

Final touches

The assistant is now ready. We just need to create an assistant object and begin the conversation. The conversation runs in an infinite loop that ends when the user says a sentence containing "Goodbye".
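A sketch of that loop, assuming the Assistant class sketched above; the exact stop condition is up to you:

```python
if __name__ == "__main__":
    assistant = Assistant()
    while True:
        heard = assistant.listen()
        reply = assistant.think(heard)
        assistant.speak(reply)
        if "goodbye" in heard.lower():
            break  # end the conversation
```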

Tips to Personalize Your Experience

Customizing your GPT assistant is a breeze! The code we built is very modular and lets you customize it by adding a variety of features. Here are some ideas to get you started:

  • Give a role to the assistant: Change the initial prompt to make your assistant act as your English teacher, motivational speaker, or anything else you can think of! Check out Awesome ChatGPT Prompts for more ideas.
  • Change the language: Want to use another language? No problem! Simply change english in the code to your desired language.
  • Build an app: You can easily integrate the assistant in any application.
  • Add personality: Give your assistant a unique personality by adding custom responses or using different tones and language styles.
  • Integrate with other APIs: Integrate your assistant with other APIs to provide more advanced functionality, such as weather forecasts or news updates.

Conclusion

In this article, we explained how to retrieve your OpenAI API key and provided code examples for the listen, think, and speak functions that are used to capture user input, generate responses, and convert text to speech for playback.

With this knowledge, you may begin creating your own unique Voice Assistant that is suited to your specific demands. The possibilities are infinite, from creating a personal assistant to help with daily tasks, to building a voice-controlled automation system. You can access all the code in the linked GitHub repo.


Enjoyed this article? Get weekly data science interview questions delivered to your inbox by subscribing to my newsletter, The Data Interview.

Also, you can find me on LinkedIn.

The post Creating Your Personalized Voice Assistant with GPT and Whisper appeared first on Towards Data Science.

]]>
How to build your own AI personal assistant using Python https://towardsdatascience.com/how-to-build-your-own-ai-personal-assistant-using-python-f57247b4494b/ Fri, 24 Jul 2020 20:13:10 +0000 https://towardsdatascience.com/how-to-build-your-own-ai-personal-assistant-using-python-f57247b4494b/ Guide to build your AI assistant:

The post How to build your own AI personal assistant using Python appeared first on Towards Data Science.

]]>
A guide to building your own AI assistant

An AI personal assistant is a piece of software that understands verbal or written commands and completes the tasks assigned by the user. It is an example of weak AI: it can only execute the tasks it was designed to perform.

Want to build your own personal AI assistant, like Apple's Siri, Microsoft's Cortana, or the Google Assistant?

You can check out this blog to build one in a few simple steps!

With the Python programming language, a simple script is all you need to build a personal AI assistant that performs the tasks you design for it.

Picture credits: Freepik

Now, let's write a script for our personal voice assistant using Python.

Skills:

The implemented voice assistant can perform the following tasks: open YouTube, Gmail, Google Chrome, and Stack Overflow; tell the current time; take a photo; search Wikipedia and extract the required data; report the weather in different cities; fetch top headline news from The Times of India; and answer computational and geographical questions.

The following queries of the voice assistant can be modified to fit the user's needs.

Packages required:

To build a personal voice assistant, it's necessary to install the following packages on your system using the pip command.

1) SpeechRecognition – Speech recognition is an important feature used in home automation and in artificial intelligence devices. The main job of this library is to understand whatever the user speaks and convert the speech to text.

2) pyttsx3 – pyttsx3 is a text-to-speech conversion library in Python. This package supports text-to-speech engines on macOS, Windows, and Linux.

3) wikipedia – Wikipedia is a multilingual online encyclopedia used by people across the academic community, from freshmen to professors, who want to gain information on a particular topic. This Python package extracts the required data from Wikipedia.

4) ecapture – This module is used to capture images from your camera.

5) datetime – This is a built-in Python module for working with dates and times.

6) os – This module is part of Python's standard library and provides functions for interacting with the operating system.

7) time – The time module provides time-related functions, for example adding delays with time.sleep().

8) webbrowser – This is a built-in Python package used to open web pages in the default browser.

9) subprocess – This is a standard library module used to run system commands, such as logging off or restarting your PC.

10) json – The json module is used for storing and exchanging data in JSON format.

11) requests – The requests module is used to send all types of HTTP requests. It accepts a URL as a parameter and gives access to that URL's content.

12) wolframalpha – Wolfram Alpha is an API that can compute expert-level answers using Wolfram's algorithms, knowledge base, and AI technology, all powered by the Wolfram Language.

Implementation:

Import the following libraries
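Since the original listing was an image, here is a sketch of the imports, assuming the install names above (for example pip install SpeechRecognition pyttsx3 wikipedia ecapture wolframalpha requests) and the usual ecapture import style:

```python
import datetime
import json
import os
import subprocess
import time
import webbrowser

import pyttsx3
import requests
import speech_recognition as sr
import wikipedia
import wolframalpha
from ecapture import ecapture as ec
```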

Setting up the speech engine:

The pyttsx3 engine is stored in a variable named engine. SAPI5 is Microsoft's speech API, used here as the text-to-speech engine. The voice ID can be set to either 0 or 1: 0 selects the male voice and 1 selects the female voice.

Now define a function speak that converts text to speech. The speak function takes the text as its argument and passes it to the engine.

runAndWait: this method blocks while processing all currently queued commands. It invokes callbacks for engine notifications appropriately and returns when all commands queued before the call have been emptied from the queue.
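A sketch of this setup and of the speak function, following the description above:

```python
engine = pyttsx3.init('sapi5')               # SAPI5: the Windows speech backend
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)    # 0 = male voice, 1 = female voice

def speak(text):
    """Queue the text and block until it has been spoken."""
    engine.say(text)
    engine.runAndWait()
```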

Initiate a function to greet the user:

Define a function wishMe for the AI assistant to greet the user.

The datetime.now().hour attribute extracts the hour from the current time.

If the hour is between 0 and 12, the voice assistant greets you with "Good Morning".

If the hour is between 12 and 18, it greets you with "Good Afternoon".

Otherwise, it says "Good Evening".
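A minimal sketch of the greeting function:

```python
def wishMe():
    """Greet the user according to the current hour."""
    hour = datetime.datetime.now().hour
    if 0 <= hour < 12:
        speak("Good Morning")
    elif 12 <= hour < 18:
        speak("Good Afternoon")
    else:
        speak("Good Evening")
```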

Setting up the command function for your AI assistant :

Define a function takeCommand that lets the AI assistant understand and accept human speech. The microphone captures the speech and a recognizer transcribes it so the assistant can respond.

Exception handling is used to catch errors at run time, and the recognize_google function uses Google's speech recognition service to transcribe the audio.
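A sketch of the function, assuming the speech_recognition package imported as sr; the language code is illustrative:

```python
def takeCommand():
    """Listen on the microphone and return the recognized text."""
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = r.listen(source)
    try:
        statement = r.recognize_google(audio, language='en-in')
        print(f"User said: {statement}")
    except Exception:
        # Triggered when the speech could not be understood or the API failed.
        speak("Sorry, I did not get that. Please say that again.")
        return "None"
    return statement
```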

The Main function:

The main function starts here: the command given by the user is stored in the variable statement.

If any of the following trigger words appear in the user's statement, the corresponding skill is invoked and the virtual assistant speaks its answer.
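A skeleton of the main loop, with the individual skills filled in below; the greeting, wake name, and stop words are assumptions based on the examples in this article:

```python
if __name__ == '__main__':
    speak("Loading your AI personal assistant G-One")
    wishMe()
    while True:
        statement = takeCommand().lower()
        if statement == "none":
            continue
        if "goodbye" in statement or "stop" in statement:
            speak("G-One is shutting down. Goodbye!")
            break
        # The if/elif branches for the skills described below go here.
```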

Skill 1 -Fetching data from Wikipedia:

The following commands help extract information from Wikipedia. The wikipedia.summary() function takes two arguments: the statement given by the user and the number of sentences to extract from Wikipedia. The summary is stored in a variable named result.
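A sketch of this branch, to be placed inside the main loop's if/elif chain; the trigger word and the three-sentence limit are illustrative choices:

```python
        if 'wikipedia' in statement:
            speak('Searching Wikipedia...')
            statement = statement.replace("wikipedia", "")
            result = wikipedia.summary(statement, sentences=3)  # query, sentence count
            speak("According to Wikipedia")
            print(result)
            speak(result)
```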

Skill 2 -Accessing the Web Browsers – Google chrome , G-Mail and YouTube:

The webbrowser module opens web pages in the default browser. The open_new_tab function accepts the URL to be opened as a parameter.

The Python time.sleep function adds a delay to the program's execution: it halts the program for a given number of seconds.
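A sketch of these branches, again inside the main loop; the URLs are reasonable defaults rather than values taken from the original listing:

```python
        elif 'open youtube' in statement:
            webbrowser.open_new_tab("https://www.youtube.com")
            speak("YouTube is open now")
            time.sleep(5)   # pause so the confirmation is not cut off
        elif 'open gmail' in statement:
            webbrowser.open_new_tab("https://mail.google.com")
            speak("Gmail is open now")
            time.sleep(5)
        elif 'open google' in statement:
            webbrowser.open_new_tab("https://www.google.com")
            speak("Google Chrome is open now")
            time.sleep(5)
```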

Skill 3 -Predicting time:

The current time is extracted with the datetime.now() function, formatted as hours, minutes, and seconds, and stored in a variable named strTime.
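A sketch of the time branch:

```python
        elif 'time' in statement:
            strTime = datetime.datetime.now().strftime("%H:%M:%S")
            speak(f"The time is {strTime}")
```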

Skill 4 -To fetch latest news:

If the user asks for the latest news, the voice assistant fetches the top headlines from The Times of India by opening the page with the webbrowser module.
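A sketch of the news branch; the headlines URL is an assumption:

```python
        elif 'news' in statement:
            webbrowser.open_new_tab("https://timesofindia.indiatimes.com/home/headlines")
            speak("Here are some headlines from The Times of India")
            time.sleep(6)
```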

Skill 5 -Capturing photo:

The ec.capture() function is used to capture images from your camera. It accepts three parameters, listed here, with a sketch after the list.

Camera index – the first connected webcam is indicated by index 0, the next by index 1, and so on.

Window name – it can be a variable or a string. If you don't wish to see the window, pass False.

Save name – a name can be given to the image; if you don't want to save the image, pass False.
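A sketch of the camera branch, following the three parameters listed above; the trigger phrase, window title, and file name are illustrative:

```python
        elif 'take a photo' in statement:
            # camera index 0, window named "G-One Camera", image saved as img.jpg
            ec.capture(0, "G-One Camera", "img.jpg")
            speak("Photo taken")
```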

Skill 6-Searching data from web:

From the web browser, you can search for the required data by passing the user's statement (the command) to the open_new_tab() function; a code sketch follows the example below.

User: Hey G-One, please search images of butterfly

The voice assistant opens a Google window and fetches butterfly images from the web.
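A sketch of this branch; building a Google search URL from the remaining words is my own choice here, since the article passes the statement to open_new_tab directly:

```python
        elif 'search' in statement:
            query = statement.replace("search", "").strip()
            webbrowser.open_new_tab("https://www.google.com/search?q=" + query)
            time.sleep(5)
```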

Skill 7- Setting your AI assistant to answer geographical and computational questions:

Here we can use a third-party API, Wolfram Alpha, to answer computational and geographical questions; it is powered by the Wolfram Language. The client is an instance of the Wolfram Alpha client class, and the res variable stores the response returned by Wolfram Alpha.

To access the Wolfram Alpha API, a unique App ID is required, which can be generated as follows (a code sketch follows these steps):

  1. Log in to the official Wolfram Alpha page and create an account if you do not already have one.
Image by author
  2. Sign in using your Wolfram ID.
Image by author
  3. You will now see the website's homepage. Head to the account section in the top-right corner, where your email is shown, and select the My Apps (API) option from the drop-down menu.
Image by author
  4. In the window that appears, click the Get APP_ID button.
Image by author
  5. In the dialog box that follows, give a suitable name and description and click the App ID button. A unique App ID will be generated; using it, you can access the Wolfram Alpha API.
Image by author
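With the App ID in hand, a sketch of this branch might look like the following; 'YOUR_APP_ID' is a placeholder, and the question-detection logic is deliberately simplified:

```python
        elif 'calculate' in statement or 'what is' in statement:
            client = wolframalpha.Client('YOUR_APP_ID')   # placeholder App ID
            res = client.query(statement)                 # send the question
            answer = next(res.results).text               # first text result pod
            print(answer)
            speak(answer)
```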

Human: Hey G-One, what is the capital of California?

G-One Voice assistant: Sacramento, United States of America

Skill 8- Extra features:

It would be interesting to program your AI assistant to answer questions such as what it can do and who created it, wouldn't it?

Skill 9- To forecast weather:

To program your AI assistant to report the weather, we need to generate an API key from OpenWeatherMap.

OpenWeatherMap is an online service that provides weather data. By generating an APP_ID on its official website, you can make your voice assistant report the weather of any place on request. The modules needed for weather detection are json and requests.

The city_name variable stores the city spoken by the user, captured with the takeCommand() function.

The get method of the requests module returns a response object, and the response object's json method converts the JSON payload into Python data structures.

The variable x holds the nested dictionaries returned by the API; checking whether the value of 'cod' is 404 tells us whether the city was found.

Values such as temperature and humidity are stored under the main key, which we keep in the variable y.
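A sketch of the weather branch; 'YOUR_OPENWEATHERMAP_APP_ID' is a placeholder for the key generated above:

```python
        elif 'weather' in statement:
            api_key = 'YOUR_OPENWEATHERMAP_APP_ID'        # placeholder
            base_url = "https://api.openweathermap.org/data/2.5/weather?"
            speak("What is the city name?")
            city_name = takeCommand()
            complete_url = base_url + "appid=" + api_key + "&q=" + city_name
            response = requests.get(complete_url)
            x = response.json()                           # JSON payload as a dict
            if x.get("cod") != "404":                     # "404" means city not found
                y = x["main"]
                temperature = y["temp"]                   # kelvin by default
                humidity = y["humidity"]
                description = x["weather"][0]["description"]
                speak(f"Temperature in kelvin unit is {temperature}, "
                      f"humidity in percentage is {humidity}, "
                      f"description {description}")
            else:
                speak("City not found")
```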

Human: Hey G-One, I want to get the weather data

G-One: What is the city name?

Human: Himachal Pradesh

G-One: Temperature in kelvin unit is 301.09 , Humidity in percentage is 52 and Description is light rain.

Skill 10- To log off your PC:

The subprocess.call() function is used here to run the system command that logs off or shuts down your PC, letting your AI assistant turn off the computer for you.
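A sketch of the log-off branch on Windows, where shutdown /l is the standard log-off command; the trigger phrases are illustrative:

```python
        elif 'log off' in statement or 'sign out' in statement:
            speak("Your PC will log off now. Make sure your work is saved.")
            subprocess.call(["shutdown", "/l"])
```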

Hurray! We have finally built our own AI voice assistant. You can still add more functionality to make it perform even more tasks.

Image by author

Check out my GitHub profile for code:

https://github.com/mmirthula02/AI-Personal-Voice-assistant-using-Python

Happy Coding !!

The post How to build your own AI personal assistant using Python appeared first on Towards Data Science.

]]>