Carlton Banks - 2 months ago
Python Question

How do I ensure a consistent frame size among separate audio files?

I am currently trying to build a regression network for extracting MFCC features. The input to the network is sampled and framed audio files, but I seem to have problems producing a consistent output, meaning a consistent frame size, which is a problem since an inconsistently shaped input would not work for a neural network.

I am currently sampling and framing each audio file as such:

import librosa

def load_sound_files(file_paths):
    raw_sounds = []
    for fp in file_paths:
        y, sr = librosa.load(fp)      # decode at librosa's default 22050 Hz sample rate
        X = librosa.util.frame(y)     # slice the signal into overlapping frames (columns)
        raw_sounds.append(X)          # keep the framed array for this file
    return raw_sounds

Meaning that each framed audio file is appended to a list, so each element of the list is an array whose columns are the frames of one file.

[array([[frame],[frame],...,[frame]],dtype=float32), ...]

I tried printing this

print raw_sounds[0].shape
print raw_sounds[1].shape

And got this result

(2048, 121)
(2048, 96)

But why am I getting this result? I am not changing any of the framing options, so why are the shapes different?

And if there is no way to keep it consistent, how would anyone train a neural network on this kind of data, given the inconsistent input?


Your results

(2048, 121)
(2048, 96)

give the frame length and the number of frames, in that order. So the frame size is in fact consistent: every frame is 2048 samples long. The only difference between the two is the number of frames, 121 for the first sound file and 96 for the second, simply because the first file is longer.
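To see why the frame count tracks the signal length while the frame size stays fixed, here is a minimal pure-NumPy sketch of the same column-wise framing that librosa.util.frame performs. The helper name frame_signal and the zero-filled test signals are my own for illustration, and librosa's historical defaults of frame_length=2048 and hop_length=512 are assumed:

    import numpy as np

    def frame_signal(y, frame_length=2048, hop_length=512):
        # Slice a 1-D signal into overlapping frames, stacked as columns,
        # mimicking the shape convention of librosa.util.frame.
        n_frames = 1 + (len(y) - frame_length) // hop_length
        return np.stack([y[i * hop_length : i * hop_length + frame_length]
                         for i in range(n_frames)], axis=1)

    f1 = frame_signal(np.zeros(22050))   # a 1-second signal at 22050 Hz
    f2 = frame_signal(np.zeros(44100))   # a 2-second signal

    # Both have 2048 rows (the frame length); only the number of
    # columns (frames) differs, because the signals differ in length.

Common ways to turn this into a fixed-size network input are to pad or truncate every file to the same number of frames, or to treat each individual 2048-sample frame as its own training example.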