On a text page, I would like to examine each word. What is the best way to read one word at a time? It is easy to find words that are surrounded by spaces, but once you get into parsing words out of text it can get complicated.
Instead of defining my own way of parsing the words from text, is there something already built that parses out the words, using regular expressions or other methods?
Some examples of words in text:
word word. word(word) word's word word' "word" .word. 'word' sub-word
You can use:
var text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
var words = text.match(/[-\w]+/g);
This will give you an array with all your words.
In regular expressions, \w means any character that is either a-z, A-Z, 0-9 or _. [-\w] means any character that is a \w or a -. [-\w]+ means one or more of these characters in a row.
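To make the difference between \w and [-\w] concrete, here is a small sketch that runs both patterns over the sample text from the question. The variable names plain and hyphenated are only for illustration.

```javascript
// Sample text taken from the question.
const text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";

// \w+ alone: a hyphen is not a word character, so "sub-word" is split in two.
const plain = text.match(/\w+/g);

// [-\w]+ also accepts a hyphen, so "sub-word" stays in one piece.
const hyphenated = text.match(/[-\w]+/g);

console.log(plain[plain.length - 1]);           // "word"
console.log(hyphenated[hyphenated.length - 1]); // "sub-word"
```

Note that neither pattern includes the apostrophe, so "word's" comes back as two matches, "word" and "s".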
If you would like a word to include more than the above expression allows, add the other characters that make up your words inside the [-\w] character class. For example, if you'd like words to also contain ( and ), make the character class [-\w()].
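As a rough sketch of that idea, the helper below builds the character class from a string of extra characters. The name extractWords and its extraChars parameter are hypothetical, not part of any library. It also returns an empty array when nothing matches, because match() with the g flag returns null in that case.

```javascript
// Hypothetical helper: extraChars lists additional characters allowed inside a word.
function extractWords(text, extraChars = "") {
  // Escape the characters that are special inside a character class: \ ] ^ -
  const escaped = extraChars.replace(/[\\\]^-]/g, "\\$&");
  // Same pattern as above, [-\w]+, with the extra characters appended to the class.
  const pattern = new RegExp(`[-\\w${escaped}]+`, "g");
  // match() returns null when there is no match, so fall back to an empty array.
  return text.match(pattern) || [];
}

console.log(extractWords("word(word) word's sub-word", "()"));
// [ 'word(word)', 'word', 's', 'sub-word' ]
console.log(extractWords("word's", "'"));
// [ "word's" ]
```

Which characters belong in the class depends on what you count as a word. Allowing the apostrophe keeps "word's" together, but it also keeps the quotes in 'word'.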
For an introduction to regular expressions, check out the great tutorial at regular-expressions.info.
What you're talking about is tokenisation. It's non-trivial to say the least, and a subject of intense research at the major search engines. There are a number of open source tokenisation libraries in various server-side languages (e.g. see the Stanford NLP and Lucene projects), but as far as I am aware there's nothing that would even come close to these in JavaScript. You may have to roll your own :) or perhaps do the processing server-side and load the results via AJAX?
I support Richard's answer here - but to add to it - one of the easiest ways of building a tokeniser (imho) is Antlr; and some maniac has built a Javascript target for it; thus allowing you to run and execute a grammar in the web browser (look under 'runtime libraries' section here)
I won't pretend that there's not a learning curve there though.
Take a look at regular expressions - you can define almost any parsing algorithm you want.