What I plan on doing:

I want to develop an English accent (without professional training).

Set of axioms behind my reasoning with executive summary:

The following is knowingly oversimplified, sorry for that. I tried to keep the question short.

Part 1 : Understanding how learning works.

At the moment I assume that Broca's area and Wernicke's area must be aware of the language, and that muscle memory with the existing phonetic alphabet will build the speech. Accents just form naturally over time through assimilation of the phonetic alphabet.

Using Google, I found that speech shadowing can potentially be used for phonetic symbol assimilation. Muscle memory, on the other hand, can be easily trained by repetitive action. This is most effective if the person is 23-24 years of age and has lots of uninterrupted time on his/her hands, as losing focus can dramatically flatten the learning curve. This kind of procedural memory can probably be optimized to be flushed into memory with a designed sleep pattern.

Part 2 : Designing behavioral pattern

Finding a fluent speaker whose accent I want to sound like.
Distinguishing target accent phonemes and phones.
Training muscle memory to produce target accent.

Part 3 : Finding a fluent speaker whose accent I want to sound like.

YouTube is a powerful free resource. The sample audio that I thought about picking:

Someone Like You - Adele (Cover) in HD.

It does not bother me that it is a high-pitched female voice.

Part 4 : Distinguishing target accent phonemes and phones.

It is not a trivial task to identify and judge whether a spoken phone is correct, or how correctly a given text is spoken by a human. It seems so complex, in fact, that I won't bother automating it and will just use IPA as a baseline.

Here is the first verse, with word stress, in American IPA for the sample audio above:

[Image: IPA transcription of the first verse]

No copyright infringement intended. The image was created with upodn (alternative: photransedit).

Part 5 : Training muscle memory to produce target accent.

Although it is fun to just try to mimic and achieve synchronization, I would prefer building a tool that extracts words as audio files, so I can use Winamp or an iPod to loop and shuffle the words I want.

I imagine that I can use MS Expression Encoder for this.

Question

Given an audio file (e.g. in WAV format, size < 32 MB) and its text equivalent (a finite number of words, e.g. 2000), how can I split it into multiple files that each contain one word? A word can contain some excess whitespace (silence), and boundary checks can be user-approved. If this cannot be done accurately, what is the best way to get a good estimate of the word boundaries?

The main intention is to reduce the work I would be doing if this were done manually.

Detecting word boundaries is an intensely complex task! I don't know if you've looked into this more, but see Saffran et al. (1996), "Word Segmentation: The Role of Distributional Cues". There are also many, many corpora of language production out there for many languages, so rather than using a new speaker, I'd look into what's already been done in the linguistics literature on detecting word boundaries.

First of all, I would convert the signal from the time domain into the frequency domain by running an FFT over it. That might allow you to match certain consonant sounds in your text to broadband noise in the FFT output. The thing here is that you're not trying to do full speech recognition, just find the best match of signal to text. (I did something similar for document image highlighting back when I was at uni; I didn't need to resort to OCR because I already had the text.) My guess is that looking for dips in amplitude won't help you that much, because some words run into each other.
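
As a rough sketch of that first step, the snippet below slices the signal into short windowed frames, runs an FFT on each, and reports how much of each frame's energy sits above a cutoff frequency. Fricatives such as "s" or "sh" tend to show up as broadband high-frequency noise, so this ratio is one candidate feature to match against the text. It assumes NumPy is available; the name `high_band_ratio`, the 4 kHz cutoff, and the frame sizes are illustrative assumptions, not tuned values.

```python
import numpy as np


def high_band_ratio(samples, rate, frame_len=512, hop=256, cutoff_hz=4000.0):
    """Return, per frame, the fraction of spectral energy at or above cutoff_hz."""
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(frame_len)                    # reduce spectral leakage
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / rate)  # bin centre frequencies in Hz
    high = freqs >= cutoff_hz
    ratios = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2       # power spectrum of this frame
        total = power.sum()
        ratios.append(power[high].sum() / total if total > 0 else 0.0)
    return np.array(ratios)


if __name__ == "__main__":
    # Synthetic check: a low tone followed by white noise (stand-in for a fricative).
    rate = 16000
    t = np.arange(rate) / rate
    tone = np.sin(2 * np.pi * 200 * t)
    noise = np.random.default_rng(0).standard_normal(rate)
    ratios = high_band_ratio(np.concatenate([tone, noise]), rate)
    print(ratios[:3], ratios[-3:])
```

Frames with a high ratio are candidates for noisy consonants. Vowels and voiced sounds should score low, but real speech will be much less clean than the synthetic signal used here.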

Here's how I'd approach it for a first attempt:

Analyze the text/IPA for words that start with consonants that result in an easily-identifiable pattern in the frequency spectrum.
Starting with a high threshold, detect instances of the pattern.
Lower the threshold until you get the right number of instances and the relative distances between them match your estimate of the distance from the text.
(If possible, get user verification of split points here.)
This should give you a set of hopefully short phrases and blocks of spectrum.
Split these blocks into words by using another feature detection method.
Continue until you have only single words.
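
The threshold-lowering part of these steps could look something like the sketch below. It assumes you already have a per-frame feature score (for example, a high-frequency energy ratio) and rough word-start estimates from the text, such as character offsets. All names here (`find_onsets`, `lower_until_match`, the tolerance and step values) are hypothetical and only illustrate the idea.

```python
def find_onsets(score, threshold, min_gap):
    """Frame indices where score rises to or above threshold, at least min_gap apart."""
    onsets = []
    above = False
    for i, value in enumerate(score):
        now_above = value >= threshold
        if now_above and not above and (not onsets or i - onsets[-1] >= min_gap):
            onsets.append(i)
        above = now_above
    return onsets


def relative_positions(values):
    """Rescale positions so the first maps to 0.0 and the last to 1.0."""
    span = values[-1] - values[0]
    if span == 0:
        return [0.0] * len(values)
    return [(v - values[0]) / span for v in values]


def lower_until_match(score, text_offsets, start=0.9, step=0.1, floor=0.1,
                      min_gap=5, tolerance=0.1):
    """Lower the threshold until the onset count and relative spacing match the text.

    text_offsets: estimated positions of the target words in the text
    (any unit, e.g. character offsets); only relative spacing is used.
    Returns (threshold, onsets), or (None, []) if nothing matched.
    """
    expected = relative_positions(text_offsets)
    k = 0
    while start - k * step >= floor:
        threshold = start - k * step
        onsets = find_onsets(score, threshold, min_gap)
        if len(onsets) == len(expected):
            found = relative_positions(onsets)
            error = sum(abs(a - b) for a, b in zip(found, expected)) / len(expected)
            if error <= tolerance:
                return threshold, onsets
        k += 1
    return None, []
```

Note that the count can jump past the expected number between two threshold steps, in which case this returns no match; a smaller step or a user check at that point would be the obvious fallback.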

I'm sure it could be generalized, but that's how I'd attempt it.
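
Once the split points are approved, writing the per-word files is the easy part. Here is a minimal sketch using only Python's standard `wave` module, assuming an uncompressed PCM WAV file and a list of `(label, start_seconds, end_seconds)` tuples that you obtained some other way. The function name `cut_words`, the file name pattern, and the padding value are assumptions for illustration.

```python
import wave


def cut_words(src_path, boundaries, out_pattern="word_{index:04d}_{label}.wav",
              pad_s=0.05):
    """Write one WAV file per (label, start_s, end_s) entry and return the paths.

    pad_s adds a little extra audio on each side of every word.
    Labels are used in file names as-is, so they must be file-system safe.
    """
    written = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for index, (label, start_s, end_s) in enumerate(boundaries):
            # Convert seconds to frame indices and clamp to the file length.
            first = min(total, max(0, round((start_s - pad_s) * rate)))
            last = min(total, max(first, round((end_s + pad_s) * rate)))
            src.setpos(first)
            frames = src.readframes(last - first)
            out_path = out_pattern.format(index=index, label=label)
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # frame count in the header is fixed on close
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

For example, `cut_words("song.wav", [("never", 12.30, 12.71)])` would write a single clip with 50 ms of padding on each side; `song.wav` and the timestamps are made up here.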