I am currently trying to build a regression network for the purpose of extracting MFCC features. The input to the network is sampled and framed audio files, and I am having trouble producing them in a way that gives a consistent output, meaning a consistent frame size. Without that, the data would not work as input for a neural network.
I am currently sampling and framing each audio file as such:
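    import librosa

    def load_sound_files(file_paths, data_input):
        raw_sounds = []
        data_output = []
        for fp in file_paths:
            # Load the audio at librosa's default sampling rate
            y, sr = librosa.load(fp)
            # Slice into frames; the result has shape (frame_length, n_frames)
            X = librosa.util.frame(y, frame_length=2048, hop_length=512)
            print(X.shape)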
(2048, 121) (2048, 96)
Each shape gives the frame length followed by the number of frames. So the frame sizes are actually consistent: every frame is 2048 samples long. The only difference between the two is the number of frames, 121 for the first sound file and 96 for the second.
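If the network needs one fixed-size array per file, one possible approach is to zero-pad or truncate along the frame axis. The following is only a sketch under assumptions: the helper names `expected_num_frames` and `fix_num_frames` are made up for illustration, the hop length of 512 is assumed to match the shapes above, and the target of 121 frames is an arbitrary choice. Zero arrays stand in for the two framed files.

```python
import numpy as np

def expected_num_frames(n_samples, frame_length=2048, hop_length=512):
    """Number of full frames that fit in a signal of n_samples samples."""
    return 1 + (n_samples - frame_length) // hop_length

def fix_num_frames(X, target_frames):
    """Zero-pad or truncate X of shape (frame_length, n_frames) along axis 1."""
    n_frames = X.shape[1]
    if n_frames >= target_frames:
        # Too many frames: keep only the first target_frames
        return X[:, :target_frames]
    # Too few frames: append all-zero frames at the end
    pad = target_frames - n_frames
    return np.pad(X, ((0, 0), (0, pad)), mode="constant")

# Stand-ins for the two framed files (hypothetical data, not real audio)
a = np.zeros((2048, 121), dtype=np.float32)
b = np.zeros((2048, 96), dtype=np.float32)

# After fixing the frame count, the files can be stacked into one batch
batch = np.stack([fix_num_frames(a, 121), fix_num_frames(b, 121)])
print(batch.shape)  # (2, 2048, 121)
```

Another option, depending on what the network should learn, is to treat every 2048-sample frame as its own training example, in which case the differing number of frames per file does not matter.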