AI voice detection using spectrograms in an OSINT investigation to analyze text-to-speech audio from YouTube content. Title text: “This voice might be fake.”

How to Detect AI Voices Using OSINT Tools

Welcome to part 2 of our mini investigation into A.I. slop disinformation. If you’re just tuning in, you’ll want to catch up with part 1.

To briefly summarize, we talked about what disinformation is, some ways to combat it, and what A.I. slop is.

Then we broke down what we observed on a few YouTube channels we stumbled on that were created using generative A.I.

This leads us to the present.

In part 1, we left off with a question: can we figure out which text-to-speech (TTS) platform was used to narrate these videos? We’re going to focus on one video and experiment.

To be completely clear, we’re expecting this little experiment to fail, but we’re going to learn a lot regardless because this isn’t an area of OSINT that we regularly find ourselves working in.

This is a perfect time to discuss what audio OSINT is.

Audio OSINT

For the uninitiated, OSINT stands for Open Source Intelligence. This discipline involves collecting publicly accessible data, analyzing it, and finally exploiting what was learned.

Audio OSINT is taking audio data that’s publicly available, processing it, and finding actionable ways to use what you learned. You can check out a short article we wrote about audio OSINT here.

With that little primer on audio OSINT out of the way, we’re going to start with the most low-tech way to figure out if we can identify which text-to-speech platform was used in one of the videos. That method is using your ears.

Listening

The first thing we need to do is learn what text-to-speech A.I. platforms exist and, preferably, which ones are free to use. This is easily done using your search engine of choice and/or your favorite A.I. chatbot.

A word of caution, before we go any further.

We’re doing this for fun and to learn something new. If you are figuring out what tools to use for an active investigation for legal purposes, avoid entering case data into something like ChatGPT, Gemini, Meta AI, etc. It’s a privacy risk and puts the integrity of the case in jeopardy. If you’d like to see uses of A.I. with OSINT, check out this article.

With that disclaimer aside: using the results from search engines and A.I. chatbots, we generated a list of TTS platforms.

Our focus is on platforms that are free, don’t require you to create an account, and allow you to generate (usually) up to 30 seconds of audio from the text you input. This is being done for simplicity’s sake while exploring. The drawback is that the A.I. slop video we’re looking at is 21 minutes and change in length, which far exceeds the limits of any free tier among the platforms we found. Paid tiers of various platforms allow longer audio clips to be generated and offer many more voices to choose from.

Platforms explored:

The text we entered into the platforms

The following text snippet, which creates around a 14-second audio clip, comes from the YouTube video we’re looking at. This is what the input was: “her heels clicking sharply against the polished tile floor it was 9:27 a.m. 12 minutes past the scheduled hearing time the delay was uncharacteristic for her but unavoidable”

The next thing we did was record the audio clip.

We used Audacity, which is an open source DAW (digital audio workstation). It allows you to record and edit audio.
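If you’d rather script that step, here’s a minimal sketch that uses Python’s built-in wave module to trim an exported recording down to the roughly 14-second phrase. The file names and start/end times are placeholders, and it assumes you’ve already exported the recording from Audacity as a standard PCM WAV file.

```python
import wave

# Placeholder values: point these at your exported recording and at where
# the target phrase actually starts and ends within it.
SRC = "recording.wav"
DST = "target_clip.wav"
START_SEC = 12.0   # where the phrase begins in the recording
END_SEC = 26.0     # roughly 14 seconds later

with wave.open(SRC, "rb") as src:
    params = src.getparams()
    rate = src.getframerate()
    # Seek to the start of the phrase, then read only the frames we need.
    src.setpos(int(START_SEC * rate))
    frames = src.readframes(int((END_SEC - START_SEC) * rate))

with wave.open(DST, "wb") as dst:
    dst.setparams(params)     # same sample rate, channels, and bit depth
    dst.writeframes(frames)   # the frame count in the header is fixed on close
```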

Since the target audio in the YouTube video is a man’s voice, we went through and listened to each male voice on the TTS platforms, looking for ones that were as close a match to the original as possible. If a voice was close, the clip was recorded.

The following are the TTS platforms whose free versions had “close” matches:

  • ElevenLabs
  • ReadSpeaker
  • NaturalReader
  • Platform.text.com
  • Unreal Speech

Microsoft’s product didn’t have a voice that really came close to the timbre, pitch, inflection, intonation and cadence of the target, so we’re eliminating this platform from the list of suspects.

Here are other issues we see when going the OSINT route to identify the source TTS platform used to create the A.I. slop narration:

  • Some platforms like Unreal Speech allow you to adjust the speed of the bot talking.
  • The cadence of the bots speaking our sample text varied. Sometimes, as we’ve experienced when experimenting with NaturalReader in the past, the same bot can “speak” the same sentence, but its cadence or inflection makes it sound different.
  • Dumping the output from one of these TTS platforms into a DAW can make it even more difficult to identify the source application, because settings that change the pitch make a voice sound different (see the sketch after this list). In the context of A.I. slop, though, if a TTS platform allows you to change pitch, it’d be an all-in-one solution, so no DAW is needed.
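To illustrate that last point, here’s a rough sketch of how a pitch shift could be applied to a TTS clip before it’s compared against anything. It assumes the librosa and soundfile Python packages are installed and uses a placeholder file name; it isn’t how any particular channel works, just a demonstration of how quickly a voice can drift away from its source settings.

```python
import librosa
import soundfile as sf

# Load the TTS clip at its native sample rate (file name is a placeholder).
y, sr = librosa.load("tts_sample.wav", sr=None)

# Shift the voice up by two semitones. Even a small shift changes the
# perceived timbre and moves the harmonics you'd see on a spectrogram.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

sf.write("tts_sample_shifted.wav", y_shifted, sr)
```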

Out of 16 samples total across the 5 TTS platforms, only 5 clips sounded even remotely close to the target audio: 3 from ElevenLabs, 1 from NaturalReader, and 1 from Unreal Speech.

Listening to these 5 samples against the target audio, there is one that comes close but still doesn’t sound like the original. This “voice” is called Bill, from ElevenLabs.

Before we break out another tool, we have to acknowledge that, at this point in this mini investigation, we did not achieve the goal of finding an exact match.

What can we do next?

Well, we can look at audio frequencies. This leads us to the discussion of spectrograms.

Spectrograms

What is a spectrogram you ask?

It’s the “visual representation of a signal’s spectrum of frequencies over time.” [Source: Wikipedia]

Tools you can use to create a spectrogram that are FOSS (Free and Open Source Software) or freemium include Sonic Visualiser and Audacity.

Here’s what a human voice looks like using Sonic Visualiser to create a spectrogram. The audio was recorded using Audacity with noise reduction applied, so it’s not totally raw audio. Some of the frequencies are filtered.

A spectrogram of a human voice. The parts that are red are the frequencies that are the loudest in terms of decibels. Yellows, greens, and dark blue/green represent frequencies with lower decibel levels.
Spectrogram of a human voice.

In the image, the red that you see is where the sound was the loudest at that particular frequency.

Over time, we see that the loudest energy sits in a horizontal band around 900 Hz, which is the reddest area.

Moving up the y axis (frequency), we see some more clusters of red across the x axis (time) at roughly 5500 Hz.

Then, as we get up to about 14,000 Hz, we see a few peaks across time that have a pretty consistent decibel level at various points.
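Sonic Visualiser is point-and-click, but you can reproduce the same kind of picture, and the rough band readings above, with a few lines of Python. Here’s a minimal sketch using scipy and matplotlib, assuming the clip is saved as a WAV file (the file name is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("human_voice.wav")   # placeholder file name
if samples.ndim > 1:                              # fold stereo down to mono
    samples = samples.mean(axis=1)

# Frequencies (Hz), time bins (s), and power for each frequency/time cell.
freqs, times, power = spectrogram(samples, fs=rate, nperseg=2048)

# Plot on a dB scale, similar in spirit to Sonic Visualiser's view.
plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12), shading="gouraud")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Power (dB)")
plt.show()

# Which frequency band carries the most energy overall?
loudest = freqs[power.mean(axis=1).argmax()]
print(f"Loudest band is around {loudest:.0f} Hz")
```

The numbers won’t match Sonic Visualiser exactly, since window sizes and scaling differ, but they’re close enough to sanity-check what your eyes are telling you.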

Now let’s look at the target.

Target sample

Below is the spectrogram of the target sample from the A.I. slop YouTube channel we’re looking into.

Spectrogram of an A.I.-generated voice. There are a lot of greens and yellows and very little red when looking at frequency/decibels. It’s also very blocky and compressed looking.
Spectrogram of A.I. slop YouTube video.

Compared to the human sample, it is very clean looking. There’s no noise in the pauses between the bot’s speech. Also, all the frequencies are very consistent and very cleaned up.
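That “no noise between pauses” observation can also be put into numbers by measuring the level of the quietest stretches of each clip. Here’s a rough sketch that estimates a noise floor for two WAV files; the file names are placeholders, and the 50 ms window and 5th percentile are arbitrary choices:

```python
import numpy as np
from scipy.io import wavfile

def noise_floor_db(path, window_ms=50):
    """Estimate the level of the quietest windows in a clip, in dB relative to peak."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)          # fold stereo down to mono
    samples = samples.astype(np.float64)
    samples /= np.max(np.abs(samples)) or 1.0   # normalize so 0 dB = loudest point

    win = int(rate * window_ms / 1000)
    n = len(samples) // win
    rms = np.sqrt((samples[: n * win].reshape(n, win) ** 2).mean(axis=1))

    quietest = np.percentile(rms, 5)            # the quietest 5% of windows
    return 20 * np.log10(quietest + 1e-12)

print("Human recording :", round(noise_floor_db("human_voice.wav"), 1), "dB")
print("A.I. slop clip  :", round(noise_floor_db("target_clip.wav"), 1), "dB")
```

A recording made in a real room will typically show a higher (less negative) floor than a clean TTS export, but treat the numbers as a cue to look closer, not as proof.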

What else can you see when comparing the human voice versus this A.I.-generated one? Let us know.

Now let’s get on to the ElevenLabs sample.

Spectrogram of an A.I.-generated voice from ElevenLabs. It’s very blocky looking, and most frequencies are green or yellow/yellowish-green.
Spectrogram of the A.I.-generated voice from ElevenLabs using the same phrase as the target A.I. slop YouTube video.

Comparing it to the YouTube sample, it has the same very clean look, but it’s different. The areas of silence are not in the same places, and it looks much quieter based on the minimal amount of red showing up.

Conclusions?

Having been exposed to a lot of A.I.-generated voices, we find that right now it’s not too difficult to discern robot from human when a TTS platform is used. It’s not only the sound of the voice and everything associated with it; there are also times when the algorithm doesn’t know how we humans speak. Because of this, there are awkward phrasings of sentences that would sound very weird coming from a human.

When it comes to audio OSINT, right now, by listening to people versus bots, you can come to a conclusion about whether something was synthesized or natural. As the technology improves, that gap will most likely close, and we won’t be able to tell if we’re listening or talking to a bot or a human.

Using spectrograms to look further at the audio, it becomes very apparent that A.I.-generated audio doesn’t look “normal” at all. It’s too clean and looks overly produced, whereas the raw (and very slightly edited) audio of the human voice is pretty messy. Some of that mess includes room noise.
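If you want to chase that “too clean” impression with a number, spectral flatness is one common measure: values near 1.0 mean a noise-like (flat) spectrum, while values near 0.0 mean a tonal, peaky one. Here’s a minimal sketch with librosa, reusing the placeholder file names from earlier; whether it cleanly separates any particular human and A.I. clip is something to test rather than assume.

```python
import numpy as np
import librosa

def mean_flatness(path):
    # Average spectral flatness per frame: ~1.0 is noise-like, ~0.0 is tonal.
    y, _ = librosa.load(path, sr=None)
    return float(np.mean(librosa.feature.spectral_flatness(y=y)))

# File names are placeholders for the clips compared above.
print("Human recording :", mean_flatness("human_voice.wav"))
print("A.I. slop clip  :", mean_flatness("target_clip.wav"))
```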

Were we able to find the exact TTS used?

No.

There are many variables at play that make this difficult.

What needs exploration is audio watermarking, and that might be left for a part 3.

Your thoughts

We learned some interesting things here, but as stated at the very beginning, this currently isn’t a realm of OSINT that we find ourselves in. In fact, this leans more toward the forensic side of things.

So, what are your thoughts about this? Where did we fail? What did we get right? What should we learn more about on this matter?

Let us know in the contact form.

Reach out

Use the contact form if you’re an attorney who wants to understand how our OSINT services add value to your cases or help put your firm in a better security posture. The same goes if you’re an SMB looking to enhance your information security goals.

And while you’re here, sign up for our newsletter
