Given an arbitrary string, for example ("I'm going to play croquet next Friday" or "Gadzooks, is it 17th June already?"), how would you go about extracting the dates from there?
If this is looking like a good candidate for the too-hard basket, perhaps you could suggest an alternative. I want to be able to parse Twitter messages for dates. The tweets I'd be looking at would be ones which users are directing at this service, so they could be coached into using an easier format; however, I'd like it to be as transparent as possible. Is there a good middle ground you could think of?
If you have the horsepower, you could try the following algorithm. I'm showing an example, and leaving the tedious work up to you :)
//Attempt to perform strtotime() on each contiguous subset of words...
//1st iteration
strtotime("Gadzooks, is it 17th June already")
strtotime("is it 17th June already")
strtotime("it 17th June already")
strtotime("17th June already")
strtotime("June already")
strtotime("already")
//2nd iteration
strtotime("Gadzooks, is it 17th June")
strtotime("is it 17th June")
strtotime("it 17th June")
strtotime("17th June") //date!
strtotime("June") //date!
//3rd iteration
strtotime("Gadzooks, is it 17th")
strtotime("is it 17th")
strtotime("it 17th")
strtotime("17th") //date!
//4th iteration
strtotime("Gadzooks, is it")
//etc
And we can assume that strtotime("17th June") is more accurate than strtotime("17th") simply because it contains more words... i.e. "next Friday" will always be more accurate than "Friday".
I would do it this way:
First check if the entire string is a valid date with strtotime(). If so, you're done.
If not, determine how many words are in your string (split on whitespace for example). Let this number be n.
Loop over every contiguous run of n-1 words and use strtotime() to see if the phrase is a valid date. If so, you've found the longest valid date string within your original string.
If not, loop over every contiguous run of n-2 words and use strtotime() to see if the phrase is a valid date. If so, you've found the longest valid date string within your original string.
...and so on until you've found a valid date string or searched every single/individual word. By finding the longest matches, you'll get the most informed dates (if that makes sense). Since you're dealing with tweets, your strings will never be huge.
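The steps above can be sketched roughly as follows. This is only an illustration, not code from any of the answers: the function name longestDatePhrase and the optional $now parameter (a base timestamp passed to strtotime() so results are reproducible) are assumptions made for the example.

```php
<?php
// Hypothetical helper: returns the longest contiguous run of words that
// strtotime() accepts, or null if no run of words parses as a date.
function longestDatePhrase(string $text, ?int $now = null): ?array {
    $now = $now ?? time();
    // Split on whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($words);
    // Try window sizes n, n-1, ..., 1 so that longer phrases win.
    for ($len = $n; $len >= 1; $len--) {
        // Slide a window of $len words from left to right.
        for ($start = 0; $start + $len <= $n; $start++) {
            $phrase = implode(' ', array_slice($words, $start, $len));
            $ts = strtotime($phrase, $now);
            if ($ts !== false) {
                return ['phrase' => $phrase, 'timestamp' => $ts];
            }
        }
    }
    return null;
}
```

For a tweet of n words this makes at most n(n+1)/2 calls to strtotime(), which should be cheap for strings of tweet length.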
Inspired by Juan Cortes's broken link based off Dolph's algorithm, I went ahead and wrote it up myself. Note that I decided to just return on first successful match.
<?php
// Returns the first substring of $string that strtotime() accepts, or false.
function extractDatetime($string) {
    // strtotime() returns false on failure, so compare against false explicitly
    if (strtotime($string) !== false) return $string;
    // Strip filler words, then retry the whole string
    $string = str_replace(array(" at ", " on ", " the "), " ", $string);
    if (strtotime($string) !== false) return $string;
    $list = explode(" ", $string);
    $first_length = count($list);
    // Outer loop: drop one more word from the end on each pass
    for ($j = 0; $j < $first_length; $j++) {
        $original_length = count($list);
        // Inner loop: drop $i words from the front
        for ($i = 0; $i < $original_length; $i++) {
            $candidate = implode(" ", array_slice($list, $i));
            //echo "<code>".$candidate."</code><br/>"; // for visualizing the tests, if you want to see it
            if (strtotime($candidate) !== false) return $candidate;
        }
        array_pop($list);
    }
    return false;
}
Inputs
$array = array(
    "Gadzooks, is it 17th June already",
    "I’m going to play croquet next Friday",
    "Where was the Rav Four yesterday at 6 PM?",
    "Where was Steve on Monday at 7am?"
);
foreach($array as $a) echo "$a => ".extractDatetime(str_replace("?", "", $a))."<hr/>";
Outputs (with the debugging echo uncommented)
Gadzooks, is it 17th June already
is it 17th June already
it 17th June already
17th June already
June already
already
Gadzooks, is it 17th June
is it 17th June
it 17th June
17th June
Gadzooks, is it 17th June already => 17th June
-----
I’m going to play croquet next Friday
going to play croquet next Friday
to play croquet next Friday
play croquet next Friday
croquet next Friday
next Friday
I’m going to play croquet next Friday => next Friday
-----
Where was Rav Four yesterday 6 PM
was Rav Four yesterday 6 PM
Rav Four yesterday 6 PM
Four yesterday 6 PM
yesterday 6 PM
Where was the Rav Four yesterday at 6 PM? => yesterday 6 PM
-----
Where was Steve Monday 7am
was Steve Monday 7am
Steve Monday 7am
Monday 7am
Where was Steve on Monday at 7am? => Monday 7am
-----
Something like the following might do it:
$months = array(
    "01" => "January",
    "02" => "February",
    "03" => "March",
    "04" => "April",
    "05" => "May",
    "06" => "June",
    "07" => "July",
    "08" => "August",
    "09" => "September",
    "10" => "October",
    "11" => "November",
    "12" => "December"
);
$weekDays = array(
    "01" => "Monday",
    "02" => "Tuesday",
    "03" => "Wednesday",
    "04" => "Thursday",
    "05" => "Friday",
    "06" => "Saturday",
    "07" => "Sunday"
);
foreach($months as $value){
    // stripos() is case-insensitive; compare with false because a match at position 0 is falsy
    if(stripos($string, $value) !== false){
        // extract and assign as you like...
    }
}
You would probably do another loop to check for the weekDays or other formats, or just nest them.
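One possible way to flesh out the "extract and assign" step is sketched below. The function name findDateWords and its return shape are made up for this example; it only reports which month and weekday names occur, and leaves turning them into an actual date to you.

```php
<?php
// Hypothetical helper: lists the month and weekday names found in $text.
function findDateWords(string $text): array {
    $months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December'];
    $weekDays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
                 'Saturday', 'Sunday'];
    $found = ['months' => [], 'weekDays' => []];
    foreach ($months as $month) {
        // \b stops "May" from matching inside "Mayor"; the i flag ignores case.
        if (preg_match('/\b' . $month . '\b/i', $text)) {
            $found['months'][] = $month;
        }
    }
    foreach ($weekDays as $day) {
        if (preg_match('/\b' . $day . '\b/i', $text)) {
            $found['weekDays'][] = $day;
        }
    }
    return $found;
}

print_r(findDateWords("Gadzooks, is it 17th June already?"));
```

Note that a plain word match cannot tell the month "May" from the verb "may", so this kind of scan will produce some false positives.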
Use the strtotime PHP function.
Of course you would need to set up some rules to parse them since you need to get rid of all the extra content on the string, but aside from that, it's a very flexible function that will more than likely help you out here.
For example, it can take strings like "next Friday" and "June 15th" and return the appropriate UNIX timestamp for the date in the string. I guess that if you consider some basic rules like looking for "next X" and week and month names you would be able to do this.
If you could locate the "next Friday" in "I'm going to play croquet next Friday" you could extract the date. Looks like a fun project to do! But keep in mind that strtotime only takes English phrases and will not work with any other language.
For example, a rule that will locate all the "Next weekday" cases would be as simple as:
$datestring = "I'm going to play croquet next Friday";
$weekdays = array('monday','tuesday','wednesday',
'thursday','friday','saturday','sunday');
foreach($weekdays as $weekday){
if(strpos(strtolower($datestring),"next ".$weekday) !== false){
echo date("F j, Y, g:i a",strtotime("next ".$weekday));
}
}
This will return the date of the next weekday mentioned in the string as long as it follows the rule! In this particular case, the output was June 18, 2010, 12:00 am.
With a few (maybe more than a few!) of those rules you will more than likely extract the correct date in a high percentage of cases, provided that the users spell things correctly.
Like it's been pointed out, with regular expressions and a little patience you can do this. The hardest part of coding is deciding what way you are going to approach your problem, not coding it once you know what!
Following Dolph Mathews' idea and basically ignoring my previous answer, I built a pretty nice function that does exactly that. It returns the string it thinks is the one that matches a date, the Unix timestamp of it, and the date itself, either in the user-specified format or the predefined one (F j, Y). I wrote a small post about it, "Extracting a date from a string with PHP". As a teaser, here's the output for the two example strings:
Input: “I’m going to play croquet next Friday”
Output: Array (
[string] => "next friday",
[unix] => 1276844400,
[date] => "June 18, 2010"
)
Input: “Gadzooks, is it 17th June already?”
Output: Array (
[string] => "17th june",
[unix] => 1276758000,
[date] => "June 17, 2010"
)
I hope it helps someone.
Based on Dolph's suggestion, I wrote out a function that I think serves the purpose.
public function parse_date($text, $offset, $length){
    //split on whitespace, commas and periods
    $parseArray = preg_split( "/[\s,.]/", $text);
    //a $length of 0 means "to the end"; a negative $length trims words off the end
    $dateTest = implode(" ", array_slice($parseArray, $offset, $length == 0 ? null : $length));
    $date = strtotime($dateTest);
    if ($date !== false){
        return $date;
    }
    //make the string one word shorter in the front
    $offset++;
    //have we reached the end of the array?
    if($offset >= count($parseArray)){
        //reset the start of the string
        $offset = 0;
        //trim the end by one
        $length--;
        //reached the very bottom with no date found
        if(abs($length) >= count($parseArray)){
            return false;
        }
    }
    //try to find the date with the new substring
    return $this->parse_date($text, $offset, $length);
}
You would call it like this (from inside the same class, since it is a method):
$this->parse_date('Setting the due date january 5th 2017 now', 0, 0);
What you're looking for is a temporal expression parser. You might look at the Wikipedia article to get started. Keep in mind that the parsers can get pretty complicated, because this is really a language recognition problem. That is commonly a problem tackled by the artificial intelligence/computational linguistics field.
The majority of the suggested algorithms are in fact pretty lame. I suggest using some nice regex for dates and testing the sentence with it. Use this as an example:
(\d{1,2})?
((mon|tue|wed|thu|fri|sat|sun)|(monday|tuesday|wednesday|thursday|friday|saturday|sunday))?
(\d{1,2})? (\d{2,4})?
I skipped months, since I'm not sure I remember them in the right order.
This is the easiest solution, yet it will do the job better than the other, compute-power-based solutions. (And yeah, it's hardly a fail-proof regex, but you get the point.) Then apply the strtotime function to the matched string. This is the simplest and the fastest solution.
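A rough sketch of that regex-then-strtotime idea follows. The pattern is not the one from the answer: it is an assumed, deliberately small pattern that covers "17th June", "June 17th, 2010" and "[next|last] Friday" style phrases, and the function name regexExtractDate and the $now parameter are made up for the example.

```php
<?php
// Hypothetical helper: pull the first date-like phrase out of $text with a
// regex, then let strtotime() interpret it relative to $now.
function regexExtractDate(string $text, ?int $now = null): ?array {
    // Month names, full or abbreviated to three letters.
    $month = 'jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?';
    $pattern = '/\b(?:'
        . '\d{1,2}(?:st|nd|rd|th)?\s+(?:' . $month . ')(?:\s+\d{4})?'        // 17th June [2010]
        . '|(?:' . $month . ')\s+\d{1,2}(?:st|nd|rd|th)?(?:,?\s+\d{4})?'     // June 17th[, 2010]
        . '|(?:(?:next|last)\s+)?(?:mon|tues|wednes|thurs|fri|satur|sun)day' // [next] Friday
        . ')\b/i';
    if (!preg_match($pattern, $text, $m)) {
        return null; // nothing date-like found
    }
    $ts = strtotime($m[0], $now ?? time());
    return $ts === false ? null : ['phrase' => $m[0], 'timestamp' => $ts];
}
```

Compared with trying every run of words, this calls strtotime() only once per string, at the cost of missing any phrasing the pattern does not anticipate.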