I need a regular expression that will extract sentences from text that contain a year in them.
Example text:
Next, in 1988 the Bradys were back again for a holiday celebration, "A Very Brady Christmas". Susan Olsen (Cindy) would be missing from this reunion, Jennifer Runyon took her place. This was a two hour movie in which the Bradys got together to celebrate Christmas, introducing the world to the spouses and children of the Brady kids. This movie was the highest rated TV-movie of 1988.
If the example text were stored in the variable $string, I need it to return:
- $sentenceWithYear[0] = Next, in 1988 the Bradys were back again for a holiday celebration, "A Very Brady Christmas".
- $sentenceWithYear[1] = This movie was the highest rated TV-movie of 1988.
If it's possible to retain the year via regex, I'd use the year within the sentence and eventually insert the sentences into a database like:
INSERT INTO table_name (year, sentence) VALUES ('$year', '$sentenceWithYear[x]')
(This is not an answer, but a suggestion)
I think you're trying to make this too complicated. You really have two problems:
- Break a paragraph into sentences
- Identify which sentences contain a 4-digit number, probably in the range of 1900-2100 or so.
Point #1 is quite difficult, because of the ambiguous use of the . character. For example, how would you process the sentences:
I was born in 1986. Mr. Smith was born in 1976.
You need to be able to recognize that the period after "Mr" is not a sentence terminating character, and that there are actually two sentences. Most answers you get (including @Tatu's) will do a naïve split based on the period.
Edit: another problem case is money:
I earned $42.00 yesterday that I don't have to report on my 2010 tax return.
Once you're able to adequately identify sentences, point #2 is pretty trivial.
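To make the two steps concrete, here is a rough sketch, not a complete solution. The function names splitSentences and sentencesWithYear and the abbreviation list are made up for illustration. It assumes a sentence ends at ., ! or ? followed by whitespace and then a capital letter or an opening quote, and it glues a piece back onto the next one when it ends in a known abbreviation such as "Mr.". It handles the two examples above, but it will still trip over abbreviations that are not in the list, and an amount such as $2000 would be picked up as a year.

```php
<?php
// Hypothetical helper: split text into sentences.
// Assumption: a terminator only counts when it is followed by whitespace
// and then an uppercase letter or an opening double quote.
function splitSentences(string $text, array $abbreviations = ['Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'St']): array
{
    $parts = preg_split('/(?<=[.!?])\s+(?=[A-Z"])/', trim($text));

    $sentences = [];
    $buffer = '';
    foreach ($parts as $part) {
        // Pieces are re-joined with a single space, so original whitespace is not preserved.
        $buffer = ($buffer === '') ? $part : $buffer . ' ' . $part;
        // If the buffer ends in a known abbreviation ("Mr."), the sentence is not finished yet.
        if (preg_match('/\b(\w+)\.$/', $buffer, $m) && in_array($m[1], $abbreviations, true)) {
            continue;
        }
        $sentences[] = $buffer;
        $buffer = '';
    }
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }
    return $sentences;
}

// Hypothetical helper: keep only the sentences containing a standalone
// four-digit number starting with 19 or 20, and return the first such year.
function sentencesWithYear(string $text): array
{
    $result = [];
    foreach (splitSentences($text) as $sentence) {
        if (preg_match('/\b((?:19|20)\d{2})\b/', $sentence, $m)) {
            $result[] = ['year' => $m[1], 'sentence' => $sentence];
        }
    }
    return $result;
}

// Two entries: 1986 with "I was born in 1986." and 1976 with "Mr. Smith was born in 1976."
$found = sentencesWithYear('I was born in 1986. Mr. Smith was born in 1976.');
```

The "$42.00" case survives because the period there is not followed by whitespace, so it is never treated as a split point.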
Try this:
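// Wrap every sentence in its own pair of dots so the regex has break points.
$string = ".".str_replace(".", "..", rtrim($string, '.')).".";
// (?:19|20) is non-capturing, so the only captured group is the four-digit year.
preg_match_all("~\.[^.]*?((?:19|20)\d{2})[^.]*?\.~", $string, $sentenceWithYear);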
Note that you need to add additional dots to act as the break points for the regex. Every sentence must have its own dots before and after itself, so that this:
'Sentence 1. Sentence 2.'
Becomes this:
'.Sentence 1.. Sentence 2.'
That regex would generate matches such as these:
Array (
0 => Array (
0 => '.Next, in 1988 the Bradys were back again for a holiday celebration, "A Very Brady Christmas".',
1 => '. This movie was the highest rated TV-movie of 1988.'
),
1 => Array (
0 => 1988,
1 => 1988
)
)
You can then easily loop through the results and insert them into the database. Note that the sentences still have the preceding dot present, so you need to use ltrim to get rid of it.
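foreach ($sentenceWithYear[0] as $key => $sentence) {
    // $sentenceWithYear[1][$key] is the year captured for this sentence.
    // The values are concatenated unescaped here; escape them before running this for real.
    $q = "INSERT INTO
            table_name (year, sentence)
          VALUES ('".$sentenceWithYear[1][$key]."', '".ltrim($sentence, ". ")."')";
    // mysql_query() was removed in PHP 7; use mysqli or PDO on current versions.
    mysql_query($q);
}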
That would generate queries like this:
INSERT INTO table_name (year, sentence) VALUES ('1988', 'Next, in 1988 the Bradys were back again for a holiday celebration, "A Very Brady Christmas".')
INSERT INTO table_name (year, sentence) VALUES ('1988', 'This movie was the highest rated TV-movie of 1988.')
Be sure to escape your queries, though.
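One way to handle the escaping is to let the database driver do it through a prepared statement. The following is only a sketch under a few assumptions: it uses PDO rather than the mysql_* functions above, insertSentences is a made-up helper name, and the in-memory SQLite database and hand-built $sentenceWithYear array are there just so the example runs on its own. With MySQL you would pass your own DSN and credentials to the PDO constructor.

```php
<?php
// Hypothetical helper: insert every (year, sentence) pair from the
// preg_match_all result using a prepared statement, so quotes inside
// a sentence cannot break the SQL.
function insertSentences(PDO $pdo, array $sentenceWithYear): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO table_name (year, sentence) VALUES (:year, :sentence)'
    );
    $count = 0;
    foreach ($sentenceWithYear[0] as $key => $sentence) {
        $stmt->execute([
            ':year'     => $sentenceWithYear[1][$key],
            ':sentence' => ltrim($sentence, '. '), // drop the leading dot and space
        ]);
        $count++;
    }
    return $count;
}

// Demo setup (assumption): an in-memory SQLite database standing in for the real one.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE table_name (year TEXT, sentence TEXT)');

// Same shape as the preg_match_all result shown earlier.
$sentenceWithYear = [
    [
        '.Next, in 1988 the Bradys were back again for a holiday celebration, "A Very Brady Christmas".',
        '. This movie was the highest rated TV-movie of 1988.',
    ],
    ['1988', '1988'],
];

$inserted = insertSentences($pdo, $sentenceWithYear);
```

The values never become part of the SQL text, so the double quotes in the first sentence need no manual escaping.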