Living With AI Podcast: Challenges of Living with Artificial Intelligence

AI & Audio

Sean Riley Season 3 Episode 12

From deepfakes to voice assistants, AI audio is ubiquitous yet often overlooked. Dr Jennifer Williams tells Sean all about the project: "The Next Big Thing in Trustworthy AI: Codesign of Context Aware Trustworthy Audio Capture" which asks the question "Do we really trust the autonomous audio systems (AAUS) that exist around us?"

This is the final episode in this season of Living with AI. We hope you've enjoyed listening, and hopefully we'll see you next season!

If "Living with AI" floats your boat and you're looking for a new podcast, you might like to check out 'Robot Talk' - here's an episode from last year featuring TAS Hub director Gopal Ramchurn: Episode 24 – Gopal Ramchurn - Robot Talk

Podcast production by boardie.com

Podcast Host: Sean Riley

Producers: Louise Male and Stacha Hicks

If you want to get in touch with us here at the Living with AI Podcast, you can visit the TAS Hub website at
www.tas.ac.uk where you can also find out more about the Trustworthy Autonomous Systems Hub Living With AI Podcast.


The UKRI Trustworthy Autonomous Systems (TAS) Hub Website




 This podcast digs into key issues that arise when building, operating, and using machines and apps that are powered by artificial intelligence. We look at industry, homes and cities. AI is increasingly being used to help optimise our lives, making software and machines faster, more precise, and generally easier to use. However, they also raise concerns when they fail, misuse our data, or are too complex for users to understand their implications. Set up by the UKRI Trustworthy Autonomous Systems Hub, this podcast brings in experts in the field from industry and academia to discuss robots in space, driverless cars, autonomous ships, drones, Covid-19 track and trace and much more.

 

 

Episode Transcript:

 

Sean:                  And welcome to Living with AI, a podcast where we look at how artificial intelligence is changing our lives and what impact it is having on us. Today we are looking at AI and audio, so I am joined by Jennifer Williams, assistant professor at the University of Southampton, who has been leading a TAS project titled “The Next Big Thing in Trustworthy AI: Codesign of Context Aware Trustworthy Audio Capture”. Welcome to the podcast Jennifer.

 

Jennifer:           Oh thank you very much Sean.

 

Sean:                  Just before we get chatting, the day we are recording this is the 6th of September 2023, so keep that in mind, you know, if you are listening to us way into the future, sitting in your hover car and wondering whether what we are saying is relevant or not. It may well not be.  

                            

So anyway, back to the main point of today. Jennifer, can we have a sort of précis or an overview of the project? What was it you were trying to achieve, or is it still happening? Tell us about it.

 

Jennifer:           Well the project is wrapping up now, and what we have done is formulate a survey that we distributed to the general public, including industry tech makers and artists who use audio as an artistic medium. In that survey we explore people’s perceptions about their rights, their privacy preferences, issues of security and a whole array of different issues related to audio AI.

 

Sean:                  What sorts of things are coming up? I mean, when I think of audio and AI the most obvious thing is our voice assistants and this kind of persistent myth, hopefully it’s a myth, that they are listening to us all the time when really they are probably just waiting for a wake word. Is that the sort of thing we are talking about here?

 

Jennifer:           Sort of. So that’s one of the most commonly known types of audio AI, in part because it’s on our mobile phones and also in the home with, for example, Alexa. But actually audio AI covers a very wide range of different types of technologies, many of which have not yet been made into products and are in the research and development stages. These range from all kinds of things, like voice cloning technology to help people who cannot speak due to a disability or an injury to have a voice that is reconstructed so they can communicate. It also includes technologies like hearing aids to help people from the deaf and hard of hearing communities, and for artists this can involve things like converting spoken words, like I am saying now, into singing.

 

Sean:                  I have used some AI in my work in audio and video editing. There are some amazing AI tools that can basically take away noise from sound, and I am sure I am teaching grandma to suck eggs telling you this, but you know. Say I have recorded something near a waterfall and the horrible noise of the water is drowning out what someone is saying; you click a button and it is incredible what you can reconstruct these days. But with the Trustworthy Autonomous Systems Hub, obviously trust is the question, and this is what you have been looking at isn’t it, really: what people’s rights are, trust, privacy. How does that come into some of those technologies you just talked about?

 

Jennifer:           Well the example you gave earlier about a device that’s always listening, that’s actually a really great example. On the one hand it makes the device easier for people to use, especially hands free when it’s listening for a wake word. On the other hand a lot of people feel like they might be surveilled and they have concerns about whether or not devices are recording their conversations.

 

Sean:                  Yeah you often find people say oh I was talking about, I don’t know, potatoes and then suddenly an advert for potatoes came up and I am sure it was listening to me. And often, I would like to think, these are usually coincidences right?

 

Jennifer:           Yeah, and that actually depends on the product; each manufacturer has different terms of agreement. One of the questions we ask in our survey is whether or not people read those terms that they agree to when they use the service.

 

Sean:                  It’s certain most people don’t, they just tick the ‘I agree’, let’s get going, come on, come on, you are keeping me waiting here. 

 

I noticed in some of the documentation about the project that there was talk of people who have impairments, visual or hearing impairments. How does AI audio help these members of the community?

 

Jennifer:           Well there are different types of technology that can help people with that need, and one of them would just be simple noise cancellation, like you mentioned a little bit earlier, where you can reduce background noise, for example static noise or AC on a phone call. But the deaf and hard of hearing community specifically need a technology called speech enhancement. Now speech enhancement is a little different from just turning up the volume and making it louder; that’s what traditional hearing aids do, and they are not always effective, because sometimes when you amplify the sound of the voice you are also amplifying the noise.

 

So these technologies can become really complex because we need to suppress certain types of noise while enhancing the speech in a way that people can understand better, hear better.

 

Sean:                  The other side of this, where I wondered if perhaps this also came in, was turning sounds into text, or into, you know, something that is tokenised, things that a computer is going to understand. Is that part of what you have been looking at or is that on the other edge?

 

Jennifer:           Yeah absolutely, and that’s another technology that helps people who can’t hear. So in our survey one of the questions we ask is about an imagined scenario where there is a group of people and someone has a device that they use for medical purposes, and that device can not only enhance the speech of people but can also transcribe the words that they say, so that they can then read the transcript later if they need to. And we ask in our survey how people feel about that, whether or not they are concerned that their words might be taken out of context or transcribed incorrectly.

 

Sean:                  And of course this is similarly aligned to having a recording device, you know, a bug, as we might have seen in the films, isn’t it? Because usually when you are having a conversation it’s kind of felt that you are in the heat of the moment and it’s not being recorded for posterity, unlike this podcast; what you are saying is transient, or ephemeral. But if it’s actually being taken down, and possibly, like you say, annotated, or maybe incorrectly taken down, then that is a problem isn’t it, or could be?

 

Jennifer:           Yeah absolutely and there is some evidence out there that people will speak a little differently when they know that they are being recorded or transcribed. This can of course cause some communication barriers and make the interaction less natural. But your point about having like for example a bug that can do this, so this is also an issue for some people because any time we enhance speech it has a potential to be misused. This type of technology could be misused if it’s in the wrong hands or used for nefarious purposes. For example to overhear a conversation that was never meant to be overheard.

 

Sean:                  Yeah because audience is key here isn’t it. I mean you know, the resounding kind of thing that the podcast keeps coming back to is often the fact that context is really important and that AI doesn’t always get the wider picture. So does that come with AI and audio as well?

 

Jennifer:           Yeah absolutely, yeah.

 

Sean:                  And I was thinking as well that actually pretty much every smartphone now has the ability to turn what it’s hearing into text and do live captioning and live translating, hasn’t it? So I mean these things are in most people’s pockets, aren’t they?

 

Jennifer:           Yeah, yeah that’s very true. It seems like everyone has a camera and a microphone everywhere they go these days.

 

Sean:                  And I suppose that can be problematic but then everyone is just getting used to it. So perhaps generations down the line just won’t even think twice about, you know, everything being recorded.

 

Jennifer:           Well I think that’s interesting because we have sort of gotten used to having CCTV and some people are interested in equipping buildings with microphones in order to make smart buildings. For example a smart building in the workplace where you can unlock your office door just with your voice when your hands are full, so that is an example.

 

Or provide different kinds of services during meetings, for example when people say okay I will start recording the meeting now, that could then start taking notes during the meeting. Or analyse how people interact during the meeting and are people participating equally, is anyone left out.

 

So this idea of equipping buildings with microphones is controversial but we have gotten used to having CCTV. And then people have Alexa in their home and on their smart phone they have Siri and other voice services. So we think that this will be adopted more and more. We just want to make sure that the way that this technology is being developed takes into account different stakeholder opinions.

 

Sean:                  This kind of idea of asking the building to do things for you I mean is completely doable now, I think we have three smart speakers in this house and actually the problem is I will go onto the landing to ask one to do something and all of them respond. So you know, I am sure these are problems that can be, you know, tackled. And we have been seeing this in science fiction for decades haven’t we, you know, Star Trek is a really classic example, computer do this, computer do that, computer do the other. But it can be quite clunky can’t it? It’s handy in certain circumstances but sometimes it’s easier just to press a button isn’t it?

 

Jennifer:           It can be yeah and we have to take into consideration the controllability that the users have as well as passive bystanders. So just because I’m consenting to my voice being used in a smart building or smart home doesn’t mean that the visitors would also agree with that.

 

[00:10:10]

                            So if I have a friend over at my home and I have an Alexa, a really interesting question is, do I have an obligation to let that person know that I have an in home recording device? And this is one of the things we look at in the project.

 

Sean:                  Well it kind of ties in a little bit to my other job, which is that I am a videographer. As a videographer, obviously I am very used to people not wanting to be on camera. And usually you are carrying quite a large device, or a relatively obvious recording device, and what’s more, you can sort of see which direction it’s pointing, okay. With a microphone that is not always the case, is it? 

 

                            I mean, you know, you might be wearing an Apple AirPod, or whatever they are called, that is recording everything anyway, or potentially recording everything that you walk past. I mean, how many microphones must be sitting on a tube carriage when you are going through central London, literally potentially recording everything? It’s quite a wide issue.

                            

                            And sorry my long-winded way of coming round to the point but often in venues that are doing recordings they have to have notices and signs don’t they?

 

Jennifer:           Yeah absolutely. But individuals don’t seem to have that same obligation. And it’s interesting in our survey some of our earliest results show that people are very concerned about their rights in these scenarios where they may be a passive bystander getting recorded with their audio when they pass someone who is using such a recording device. But the catch is that people tend to change their mind based on the context of why it’s being used.

                            

                            So we find that people are less apprehensive when they know that a person is using a medical device. Now that opens a whole other set of questions as to whether or not people using medical devices must declare that because that also violates their personal privacy about their medical condition as well.

 

Sean:                  Yeah and I have just had a vision then of people walking round with lanyards kind of, you know, with some big sign telling people what, you know what they are or are not doing with whatever equipment they happen to be carrying. And as you say it’s fraught with problems isn’t it?

                            

                            And just from a more mainstream point of view, we’ve heard of lots of politicians getting caught out by still having their radio microphones on and saying things that they think are not being heard. Just as you say, opening that up to every smart device, every person sat next to someone on a train carriage, it’s massive isn’t it?

 

Jennifer:           It is massive. And there is so much audio data out there. For example, I know there are videos of me on YouTube, so there is enough of my voice on the internet to probably create a deepfake at this point. And anybody in the public domain, including politicians, but also radio hosts and news broadcasters, anyone who has public data out there, is somewhat vulnerable now to having deepfakes, which may violate their security or their privacy as we continue incorporating speech technology into areas like banking. 

 

Sean:                  So the deepfake side of it, obviously we haven’t quite touched on that yet, but that’s the idea of effectively simulating someone’s voice. And is that a viable technology right now?

 

Jennifer:           Yes that’s true it is simulating another person’s voice and impersonating another person’s voice. And the technology, whether or not it’s viable, I think that’s a question that’s disputed because it depends on how we evaluate that circumstance. But we have seen in the news cases where people have tried to make transactions over phone banking by impersonating the sound of someone’s voice and then by-passing the security measures.

 

Sean:                  So that side of it I am not quite that aware of. Obviously I can imagine somebody doing it; we have been watching comedians do impressions of famous people for years. I won’t try to do any impressions here because I am terrible at them. But imagining that technology can do that, the flipside of that is, is voice recognition smart enough to be able to detect individuals’ voices? Is that something that is already there as well?

 

Jennifer:            Yeah, so there is actually a technique that requires only three to five seconds of a person’s voice in order to create a clone of that voice, or a deepfake. The sophistication of that deepfake is minimal, so it is probably enough to bypass certain banking security protocols, but probably not enough to, for example, trick a person’s mother into believing that they are real.

                            

                            So we all have different mannerisms in how we speak, and different disfluencies, and all of those would go into a true deepfake, and those are really difficult to mimic. But simple sentences or statements are absolutely possible.

 

Sean:                  But the recognition side of it is the bit that I am not sure about. What does that use? Is it just literally that it hears a lot of your voice and then thinks it knows what your voice should sound like? I am using very non-technical terms there.

 

Jennifer:            The kind of technology that what we call speaker verification is based on, for example as used in UK banks, creates a model of a person’s vocal tract. So every individual has a unique vocal tract, from their throat all the way up through their nasal cavity, and this uniqueness acts like a fingerprint. The AI is modelling that uniqueness of an individual, and that’s how we can detect whether a person is in fact a unique speaker.

 

Sean:                  That is fascinating, because I hadn’t really thought that that existed. I know that the smart speakers in our house have asked in the past for us to give examples of how we speak so that they can identify us and therefore open, say, a specific calendar or something. But I thought it was ropey at best. I didn’t realise it was an actual technology.

 

Jennifer:           Yes.

 

Sean:                  So I was watching on television last night a documentary about a former rugby player; he has MND and is now able to use a voice generation system which, to me, sounded like what he used to sound like. So is that something that other people could use, and how does that work going forwards?

 

Jennifer:           So right now this area of audio AI is not regulated so I don’t have a good answer for that. I think some people can use their moral compass and say maybe it’s not right to reuse a person’s voice without their consent, so that’s a privacy issue. But whether or not a reconstructed voice is genuinely unique is both a scientific question and a question for regulators. 

 

Sean:                  Because there is obviously the potential, and again this comes back to what you were saying, if there is enough footage or whatever out there, to recreate anyone’s voice from enough recordings. And then, potentially also using large language models, you could have some kind of simulated person, in theory. It sounds quite like science fiction, but yeah.

 

Jennifer:           It does sound like science fiction, but Hollywood has already done this with Val Kilmer’s voice. So in Top Gun: Maverick, Val Kilmer had lost his ability to speak, and they reconstructed his voice to allow him to appear in the movie and have speaking parts. 

 

                            There is also an example from Hollywood with Paul Walker, I believe in the Fast and Furious franchise. Now he died, and they used bits of his voice from previous films, as well as a family member, with consent from the family, to reconstruct his voice enough to finish one of the films. 

                            

                            Now it is a question of who is consenting to that. Whether or not the actor would have given his consent if he had known that was possible, we don’t know and we will never know. But now that we have this technology, it is a question that a lot of artists and actors and everyday people should consider: what happens to their voice data after they have passed away? And I think this is a really deep and philosophical question, but it also touches on issues of voice rights, voice intellectual property, and the uniqueness of people’s voices and really what that means.

 

                            Now there may be the possibility that some people want to reconstruct a voice of a deceased family member and use a large language model to, for example, have conversations and feel connected to that person that they have lost. But again just because the technology is possible doesn’t mean that it is something that should be created.

 

                            And this is what we are exploring is what are the issues surrounding that and what can we do today to help prepare for that future?

 

Sean:                  Audio feels like it’s relatively new to the party with those kinds of Hollywood techniques. But the visual side has been used even as far back as Gladiator, the 2000 Russell Crowe film, where Oliver Reed died suddenly during production before they had finished shooting his scenes, and they did some digital face replacement even back in 1999-2000, twenty odd years ago. 

 

[00:20:10]

 

                            There have been quite a few films that have used images of Carrie Fisher and various other people; Peter Cushing, I think, maybe appeared in one of the Star Wars movies. Anyway, these people have been used, but almost as a kind of cameo; it didn’t feel like it was intruding. It was almost wallpaper, if that’s not too dismissive a way to put it.

 

                            But the reason for me saying this is that I am going to connect it to the fact that, in my other job making videos, the most important part of a lot of videos is the audio. It’s the thing that people really have an emotional connection to, and it is the thing that, if it’s lacking, people will notice, more so than picture quality in fact, ironically. So the fact that we are able to do this is a huge change.

 

Jennifer:            It is a big change. And we are coming to a very interesting juncture, where we have talked about technologies that act as medical devices and voice reconstruction, as well as deepfakes, which can be used as a form of misinformation but also as free expression and humour and sarcasm and irony, right. As well as voice property, voice property is probably the wrong term, but voice related rights posthumously. And all of these different technologies are not only related, but some are built on the same AI algorithms.

 

                            So I think it’s important that the public as well as researchers and government really come together to think about what are we going to do about this? Because the landscape is changing quickly and these technologies are not going away they are only growing.

 

Sean:                  It’s going to be really interesting to see what is going forward. And as you say I think education is very important there, people need to know what things are capable of and what technology is out there.

 

                            Jennifer I would just like to say thank you so much for joining us on the podcast today. It’s been really great to hear your kind of side of things and learn a bit more about AI and audio.

 

Jennifer:           Thank you so much.

 

Sean:                  If you have enjoyed this season of the podcast, one thing you might have noticed is it’s very difficult to avoid the subject of robots whenever you are talking about AI. And you might just enjoy a podcast that I really like which is called ‘Robot Talk’. And here to talk about it today we have got the producer and host of ‘Robot Talk’ it’s Claire. Welcome to ‘Living with AI’ Claire.

 

Claire:                Hi Sean, great to meet you.

 

Sean:                  What do you do on ‘Robot Talk’, so you know, just so people can get a sense of it?

 

Claire:                Sure. So ‘Robot Talk’ is a weekly podcast and we explore everything relating to robotics and intelligent machines. So I chat to robotics experts from research, from industry and beyond, find out about their work and try and discuss a bit about the future of robotics.

 

Sean:                  And it’s accessible is it? I am not going to need a PhD in kind of mechanical engineering or something to understand what’s going on?

 

Claire:                No absolutely not. I myself am not a roboticist by training and I try and make ‘Robot Talk’ jargon free. I also take questions from listeners so to try and make sure that we cover topics that, you know, people actually care about and want to hear about. So yeah ‘Robot Talk’ is for everyone.

 

Sean:                  That’s superb, fantastic, so I will put a link in the show notes. How can people find you if they are using podcast apps and things, is that easy enough?

 

Claire:                Yes, yes so you can find ‘Robot Talk’ on all major podcast providers, just try searching ‘Robot Talk’. You can also check out our website robottalk.org and we are on social media at ‘Robot Talk pod’.

 

Sean:                  Thank you Claire for joining us on ‘Living with AI’. Well this is the last episode in this season of ‘Living with AI’ so hopefully we will be back next year with a whole new season but I know what I will be doing, downloading plenty of episodes of ‘Robot Talk’ to binge on in the meantime.

 

Claire:                Fantastic. Thank you so much for having me and I look forward to a new season.

 

Sean:                  If you want to get in touch with us here at the ‘Living with AI’ podcast, you can visit the TAS website at www.tas.ac.uk, where you can also find out more about the Trustworthy Autonomous Systems Hub. 

                            The ‘Living with AI’ podcast is a production of the Trustworthy Autonomous Systems Hub. Audio engineering was by Boardie Limited. Our theme music is Weekend in Tatooine by Unicorn Heads and it was presented by me, Sean Riley. 

 

[00:24:38]