
Neural Network/deep learning for crowd noise removal


Recommended Posts

This just occurred to me as a thing that could be done, so I thought I'd post on here to see if anyone knows anything about this subject, as basically all I know is what I've read in Wired articles. So possibly even less than nothing.

 

So any ideas? It would be really nice to clean all chatter etc off of bootlegs...autechre 2008, 2010, 2016 I am looking at you.

I have worked with neural networks and machine learning stuff a bit, although I feel I probably know fuck all about deep learning (convolutional neural networks). I think I can grasp the general concept though. 

 

Generally this stuff is done by training the neural network with samples of crowd noise and samples of music, so the thing would eventually be able to tell the difference between the two. However it seems really difficult to get the data for training, because you probably don't have a lot of noise recorded from the same event that you are trying to clean up and using sound from a different event may throw off the neural network (different crowd type, room acoustics, sound system etc.). Similarly you would ideally need "pure" samples of the same audio that was played at the event (although maybe album versions would do in a pinch if the neural network is good enough).
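One way around the training-data problem is to synthesize pairs yourself: mix known-clean audio with separately recorded crowd noise at a controlled signal-to-noise ratio, then train on (mixture, clean) pairs. A minimal numpy sketch of that mixing step, with a sine wave and white noise standing in for music and crowd (both stand-ins are illustrative assumptions, not anything from a real dataset):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix clean audio with noise at a target signal-to-noise ratio (dB),
    returning the mixture and the scaled noise. (mixture, clean) pairs can
    then serve as supervised training data."""
    # Tile/truncate the noise to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scaled = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scaled, scaled

# Stand-ins: a 440 Hz sine for "music", white noise for "crowd".
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
crowd = rng.normal(size=8000)
mixture, scaled = mix_at_snr(clean, crowd, snr_db=10.0)
snr = 10 * np.log10(np.mean(clean ** 2) / np.mean(scaled ** 2))
print(round(snr, 3))  # → 10.0
```

Convolving the clean side with a room impulse response before mixing would be one crude way to approximate the venue sound mentioned below.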

 

Assuming you get all the data you need, what you basically want is to feed in the training data - samples of crowd noise and "pure" audio (maybe fed through some reverb to approximate the venue sound) - and build the model, then feed in the noisy audio. I think the model would then basically go through every sample (realistically more like 1-2 seconds of audio at a time) and add/subtract what it thinks is the "noise" part of the signal. On paper this looks simple enough, but I think there could be some audible artifacts: for instance, when there's a quieter part and some asshole hollers over it in the recording, the neural network is going to have a hell of a time recreating the audio that should actually be there. Basically, what I want to say is that the model may well add unpredictable audio artifacts to the result no matter what, so even if you successfully remove the noise, there will still be some extra sound that wasn't actually played by any artist.
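The "subtract what it thinks is the noise part" idea actually has a classical, non-neural baseline: spectral subtraction, which estimates the average noise magnitude per frequency bin and subtracts it from every frame. A toy numpy sketch (the STFT sizes, the 5% spectral floor, and the sine-plus-white-noise demo signals are all my own illustrative choices):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    out = np.zeros((len(spec) - 1) * hop + n_fft)
    for i, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=1)):
        out[i * hop:i * hop + n_fft] += frame * win
    return out / 1.5  # Hann at 75% overlap: squared windows sum to ~1.5

def spectral_subtract(noisy, noise_profile, floor=0.05):
    """Subtract the average noise magnitude per frequency bin, keeping a
    small spectral floor to limit 'musical noise' artifacts."""
    S = stft(noisy)
    noise_mag = np.abs(stft(noise_profile)).mean(axis=0)
    mag = np.maximum(np.abs(S) - noise_mag, floor * np.abs(S))
    return istft(mag * np.exp(1j * np.angle(S)))

# Toy demo: a sine "concert" buried in white-noise "crowd".
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.normal(size=len(t))
denoised = spectral_subtract(noisy, 0.3 * rng.normal(size=4000))
```

This is also where the artifacts described above come from: whatever survives the subtraction at quiet moments is left as isolated spectral peaks, the watery "musical noise" you hear from crude denoisers.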

 

However, all this is back-of-the-envelope theory; it would be nice to have the opportunity to actually try it out in real life. There's some amazing shit possible with deep learning neural networks, so I would not be surprised if someone puts together a thing for noise removal. I haven't really checked the literature, but I think there is definitely something being researched.

I'm guessing a recurrent neural network would be the way to go, though I have no idea how much training data would be necessary, nor how one would go about picking hyperparameters.

A comathematician is a device for turning cotheorems into ffee.


 

My guess/intuition: don't bother.

 

Unlikely to outperform already existing methods. But who knows? Whatever technique you're going to use, it'll probably do two things:

1. Extract sounds from waveform

2. Interpolate the changed waveform to restore the assumed ideal waveform without the extracted sounds.
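The two steps above can be sketched in the time domain with the crudest possible "extraction": flag outlier samples (a loud holler or click) and linearly interpolate across the gap. Everything here (the RMS-based threshold, the synthetic 5-sample click) is a hypothetical stand-in:

```python
import numpy as np

def extract_and_restore(x, thresh=4.0):
    """Step 1: flag samples whose amplitude is an outlier relative to the
    overall RMS (the 'extracted' sound). Step 2: linearly interpolate
    across the gap to restore an assumed ideal waveform."""
    rms = np.sqrt(np.mean(x ** 2))
    bad = np.abs(x) > thresh * rms
    good = np.flatnonzero(~bad)
    restored = x.copy()
    restored[bad] = np.interp(np.flatnonzero(bad), good, x[good])
    return restored, bad

# Toy demo: a loud 5-sample "holler" on top of a sine wave.
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
x = clean.copy()
x[1000:1005] += 10.0
restored, bad = extract_and_restore(x)
```

Both steps bake in assumptions, exactly as argued: what counts as an outlier, and that the underlying waveform is smooth enough to interpolate.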

 

Both steps use implicit assumptions and estimations, regardless of the technique you're going to use - deep neural nets or something else (think of the techniques used in Photoshop to edit images, which is a similar problem to editing sounds). My guess is that a complex/expensive technique like deep neural nets will hardly, if at all, outperform a simple technique that's already available.

 

My reasoning is that it's unlikely you'll find a solution which can model "crowd noise" better than what's already out there. Deep neural nets tend to depend on lots of data and need problems with a static set of rules, like games or language. When it comes to crowd noise, and especially autechre gigs, I'm afraid you'll be limited by a lack of static rules defining crowd noise, or defining autechre to not be noise.

 

You could try though. But be prepared to spend a lot of time without a likely benefit, other than experience and lessons learned.

 

In the next tweet is a link to an interesting article on the possibilities/limitations of deep neural nets. It's non-technical/readable.

 

What you should take away from it, imo: you're not learning to label images or play a game. To an extent you're trying to label noise. But additionally, you're also predicting sounds, as you need to restore the sounds after extracting the noise. Or, put the other way around, you'll be predicting what the noise would sound like without the music. So you'll be predicting crowd noise, which might be more problematic, as it doesn't look like that will follow strict rules. Less so than music, arguably. But who knows, right?

Hmmmm. Yes, it's a rabbit hole.

 

I was more hoping that someone had already done the heavy lifting on this one.

 

I've got a mate who's a lecturer in music tech at a university, I might see if he can get his students on it. That's what they're for, right?

 

Then, the clean autechre recordings can be mine. ALL MINE. Mwhahahaha etc

Izotope's RX repair tools (particularly De-noise & Dialogue Isolate) are probably the closest we have at the moment:

 

I haven't eaten a Wagon Wheel since 07/11/07... ilovecubus.co.uk - 25ml of mp3 taken twice daily.

  On 1/4/2018 at 10:30 AM, mcbpete said:

Izotope's RX repair tools (particularly De-noise & Dialogue Isolate) are probably the closest we have at the moment:

 

 

 

Yeah, that's what I would recommend. I recently used it to (partially) remove crowd noise from an interview.

Doing a little reading around, it seems the majority of the research going on at the moment is geared around isolating the human voice from background noise (usually other human voices). Some really impressive stuff, including some that can run faster than realtime on a Raspberry Pi.

 

However, I couldn't find anything regarding the separation of musical information from background noise specifically. I would think it's theoretically possible (especially with music like autechre's that is more or less all synthetic), but not without great difficulty, and with a low likelihood of reaching the quality I'm hankering after - artifacts are likely to be almost as irritating as that guy chuntering away in the background. Also, I imagine preserving the sound of the space would be very difficult.

 

I guess in my simplistic pre-sleep mind I was hoping for a process like: input dataset A (autechre oeuvre) > input dataset B (crowd noise) > run iterative learning processes on the datasets > remove B from A. Alas, it turns out that cutting-edge tech is more complicated than that! Who would have known?
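Funnily enough, that A/B pipeline isn't far off a classical Wiener filter: estimate the average power spectrum of dataset A and of dataset B, form a per-frequency gain that keeps bins where A dominates, and apply it to the noisy recording. A toy numpy sketch, with a two-tone signal standing in for the autechre oeuvre and white noise for the crowd (both stand-ins assumed purely for illustration; real systems estimate the spectra over many short frames, not one long FFT):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16384
t = np.arange(n) / 16000

# Dataset A: tonal stand-in "music"; dataset B: broadband stand-in "crowd".
music = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
crowd = 0.5 * rng.normal(size=n)

# "Learn" from the two datasets: average power spectra -> Wiener gain.
p_a = np.abs(np.fft.rfft(music)) ** 2
p_b = np.abs(np.fft.rfft(crowd)) ** 2
gain = p_a / (p_a + p_b)  # near 1 where A dominates, near 0 where B does

# "Remove B from A": apply the learned gain to a fresh noisy mixture.
noisy = music + 0.5 * rng.normal(size=n)
denoised = np.fft.irfft(np.fft.rfft(noisy) * gain, n=n)
```

The catch is the same one raised earlier in the thread: this only works to the extent that the "crowd" spectrum is stable and distinct from the music, which is exactly what's dubious at an autechre gig.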

Edited by EXTRASUPER81
  On 1/4/2018 at 12:07 PM, EXTRASUPER81 said:

including some that can run faster than realtime

it's really impressive that investigation of noise removal led to the discovery of time travel

  On 1/4/2018 at 12:24 PM, span said:

 

  On 1/4/2018 at 12:07 PM, EXTRASUPER81 said:

including some that can run faster than realtime

it's really impressive that investigation of noise removal led to the discovery of time travel
Boom

 

The latest Raspberry Pi is really powerful.

Edited by EXTRASUPER81

Pretty impossible but go for it, I think you're more likely to get a direct recording off an artist if they see you trying to train neural networks, right?

  On 1/5/2018 at 12:51 AM, fenton said:

Pretty impossible but go for it, I think you're more likely to get a direct recording off an artist if they see you trying to train neural networks, right?

 

I am thinking that if you train a neural network on some artist's material, you can then generate brand new stuff that sounds like that artist.

Also - I think this has been discussed here already - you could in theory take recordings of some famous pianist or guitarist, annotate them with note data, train your network and feed it your own piano roll to automagically have your solos played in their style. If it works for text, it might work for other audio too. Might be worth trying out some day.
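As a far simpler stand-in for that idea, "train on annotated notes, generate in that style" can be sketched with a first-order Markov chain over MIDI note numbers (the toy "solo" below is made up; a real attempt would use an RNN and far more data):

```python
import random
from collections import defaultdict

def train_markov(notes):
    """Build a first-order Markov model: note -> list of observed successors."""
    model = defaultdict(list)
    for a, b in zip(notes, notes[1:]):
        model[a].append(b)
    return model

def generate(model, start, length, seed=0):
    """Walk the model, sampling each next note from observed successors."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break  # dead end: note never had a successor in training data
        out.append(rng.choice(successors))
    return out

# Toy "annotated solo" as MIDI note numbers (made up for illustration).
solo = [60, 62, 64, 62, 60, 64, 65, 64, 62, 60]
model = train_markov(solo)
print(generate(model, 60, 8, seed=42))
```

Every generated transition was actually observed in the training sequence, which is the Markov equivalent of "playing in their style" - and also why it can't produce anything genuinely new.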

I think most artists' music is too complex to just ram the audio into an NN and expect it to shit out something recognizably theirs.

 

That said, I reckon it stands a chance against Nickelback.

  • 2 months later...

Been reading about neural networks and machine learning. So incredible.

Edited by Lane Visitor