I am working with the HTK toolkit on a word-spotting task and have a classic training/testing data mismatch. The training data consists only of "clean" data recorded over a microphone. It was converted to MFCC_E_D_A parameters, which were then modelled by phone-level HMMs. My test data, however, was recorded over landline and mobile phone channels, which introduce distortion and other channel effects. Using MFCC_E_D_A parameters with HVite produces incorrect output. I would like to use cepstral mean normalisation with MFCC_E_D_A_Z parameters, but that alone would not help much, since the HMMs were not trained on such data. My questions are as follows:
1. Is there any way to convert MFCC_E_D_A_Z into MFCC_E_D_A? That way I could follow this pipeline: input -> MFCC_E_D_A_Z -> MFCC_E_D_A -> HMM log-likelihood computation.
2. Is there any way to convert the existing HMMs, which model MFCC_E_D_A parameters, to MFCC_E_D_A_Z?
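To see why direction (1) is problematic, it helps to look at what the _Z qualifier actually does: cepstral mean normalisation subtracts the per-utterance mean from every coefficient, and that mean is not stored in the feature file. The following sketch (plain Python, not HTK itself; the function name `cmn` and the toy 2-dimensional frames are illustrative) shows that two utterances differing only by a constant offset produce identical normalised features, so the offset cannot be recovered from the _Z features alone:

```python
# Sketch: why MFCC_E_D_A_Z -> MFCC_E_D_A is not recoverable in general.
# Cepstral mean normalisation (_Z) subtracts the per-utterance mean from each
# coefficient; once only the normalised features are stored, the mean is gone.

def cmn(frames):
    """Subtract the per-coefficient mean across all frames (what _Z does)."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[c] for f in frames) / n for c in range(dim)]
    return [[f[c] - means[c] for c in range(dim)] for f in frames], means

frames = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]       # toy 2-dim "MFCC" frames
normalised, means = cmn(frames)

# A copy of the utterance with a constant channel offset added normalises
# to exactly the same features -- the offset information is lost:
shifted = [[c + 100.0 for c in f] for f in frames]
normalised_shifted, _ = cmn(shifted)
assert normalised == normalised_shifted

# Reconstruction is only possible if the means were stored separately:
restored = [[f[c] + means[c] for c in range(2)] for f in normalised]
assert restored == frames
```

In other words, the conversion only works if you keep the subtracted means alongside the _Z features, which HTK does not do for you.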
If there is a way to do (1) from above, what would the config file for HCopy look like? I wrote the following HCopy config file for the conversion:
SOURCEFORMAT = MFCC_E_D_A_Z
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
This does not work. How can I improve this?
You need to understand that telephone recordings cover a different frequency range, because the signal is band-limited by the channel: usually only about 200 to 3500 Hz is present. A wideband acoustic model is trained on roughly 100 to 6800 Hz, so it will not decode telephone speech reliably, because telephone speech is missing the 3500 to 6800 Hz band the model expects. This is not a matter of feature type, mean normalisation, or distortion; you simply cannot bridge that mismatch by converting features.
You need to retrain your original model on audio downsampled to 8 kHz, or at least modify the filterbank parameters to match the telephone frequency range.
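For the second option, HTK lets you restrict the mel filterbank with the LOFREQ and HIFREQ configuration variables. A minimal front-end fragment for telephone-band analysis might look like this (the 200/3500 Hz cut-offs follow the telephone band mentioned above; the other values mirror the config in the question, and you would still need to retrain the HMMs on features produced this way):

```
# Telephone-band front end (sketch): restrict the mel filterbank to the
# frequencies that actually survive the telephone channel.
TARGETKIND  = MFCC_E_D_A
TARGETRATE  = 100000.0
WINDOWSIZE  = 250000.0
USEHAMMING  = T
PREEMCOEF   = 0.97
NUMCHANS    = 26
CEPLIFTER   = 22
NUMCEPS     = 12
LOFREQ      = 200       # lower filterbank cut-off in Hz
HIFREQ      = 3500      # upper filterbank cut-off in Hz
```

For the first option, downsampling the training audio to 8 kHz before feature extraction can be done with any resampling tool, for example SoX.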