I'm working on a project that screen scrapes a list of departure times from a train schedule posted on the web. I realize this would be a lot easier if I wasn't using such a crude method to access the data but there's no API available, and it's more of a learning project than the kind of thing I expect to release publicly.
Anyhow, the schedule I'm reading from displays times in 12-hour format but without AM/PM (so for example, just 9:43). I'm storing the time in a database as an approximate unix timestamp, which means I need my script to be able to figure out if a time is AM or PM.
The data I'm scraping from lists times that are, potentially, between two hours ago and six hours in the future. So at 9am when the script runs, an upcoming 2pm train could be listed, and a 7am train could still be on the board if it didn't leave on time.
I wrote a function that takes two parameters -- the hour to be evaluated, and the current system hour to base the "guess" on (I realize I could have the function get the time itself, but I was trying to write a unit test that failed horribly, that's why I did that). I'd post it here but it doesn't really work, and I'd like to start fresh with some guidance or tips from you fine folks.
Can anyone help me out? What's a good way to approach this?
If you know what time you scraped the page (you should), and you know the time listed (clearly you do), and you know that the times are -2 to +6 hours from the page access (i.e., the time you scraped the page)... I'm failing to see where the trouble comes in. It seems like you have all the information you need.
I scrape a page at 11:30 (AM). There is a departure listed for 2:15. Well, when choosing between 2:15AM and 2:15PM, there's only one of the two that's less than 6 hours after 11:30(AM). If I saw an entry for 10:30, I'd know it had to be "an hour ago" because an arrival 11 hours in the future wouldn't be listed (per your explanation).
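Here is a minimal sketch of that reasoning in PHP, assuming the -2/+6 hour window from the question. The function name `resolveDeparture`, its parameters, and the UTC timezone are my own choices for illustration, not anything from the original script. It builds the AM and PM candidates for yesterday, today, and tomorrow, then returns the one that lands inside the window. Because the window is 8 hours wide and the candidates are 12 hours apart, at most one can match.

```php
<?php
// Assumption: the server and the schedule share one timezone; UTC is used here only for illustration.
date_default_timezone_set('UTC');

/**
 * Resolve a 12-hour time with no AM/PM marker to a Unix timestamp.
 * Returns null if neither reading falls inside the listing window.
 */
function resolveDeparture(int $hour12, int $minute, int $scrapedAt, int $hoursBack = 2, int $hoursAhead = 6): ?int
{
    $earliest = $scrapedAt - $hoursBack * 3600;
    $latest   = $scrapedAt + $hoursAhead * 3600;

    $year  = (int) date('Y', $scrapedAt);
    $month = (int) date('n', $scrapedAt);
    $day   = (int) date('j', $scrapedAt);

    // Try yesterday, today and tomorrow so windows that cross midnight still work.
    foreach ([-1, 0, 1] as $dayOffset) {
        foreach ([0, 12] as $half) {
            // 12 maps to 0 (midnight) or 12 (noon); 1-11 map to AM or PM.
            $hour24 = ($hour12 % 12) + $half;
            // mktime() normalises out-of-range days (e.g. day 0 or day 32).
            $candidate = mktime($hour24, $minute, 0, $month, $day + $dayOffset, $year);
            if ($candidate >= $earliest && $candidate <= $latest) {
                return $candidate;
            }
        }
    }
    return null;
}

// Usage: page scraped at 11:30 AM on an arbitrary example date.
$scrapedAt = mktime(11, 30, 0, 3, 10, 2024);
echo date('Y-m-d H:i', resolveDeparture(2, 15, $scrapedAt)), "\n";  // 2024-03-10 14:15
echo date('Y-m-d H:i', resolveDeparture(10, 30, $scrapedAt)), "\n"; // 2024-03-10 10:30
```

Passing the scrape time in as a parameter also keeps the function easy to unit test, which was the reason you gave for passing the current hour in.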
Or am I missing something?
Traditionally, train schedules distinguish a.m. and p.m. with lightface and boldface times. As best I can remember, p.m. is always bold. If that's the case for your source, just keep track if the text is inside <b> or <strong>.
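If your source does follow that convention (worth checking against a few known departures first), a rough sketch of the bold check could look like this. The HTML snippet and the `extractTimes` function are made up for illustration; the real page's markup will almost certainly differ, so the `//td` query is only a placeholder.

```php
<?php
// Hypothetical markup: lightface = a.m., <b>/<strong> = p.m.
$html = '<table><tr><td>9:43</td><td><b>2:15</b></td><td><strong>11:05</strong></td></tr></table>';

function extractTimes(string $html): array
{
    // Collect parser warnings quietly; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $times = [];
    foreach ($xpath->query('//td') as $cell) {
        $text = trim($cell->textContent);
        if (!preg_match('/^(\d{1,2}):(\d{2})$/', $text, $m)) {
            continue; // not a time cell
        }
        // Treat the time as p.m. if the cell contains a <b> or <strong> element.
        $isPm = $xpath->query('.//b | .//strong', $cell)->length > 0;
        $hour24 = ((int) $m[1] % 12) + ($isPm ? 12 : 0);
        $times[] = sprintf('%02d:%s', $hour24, $m[2]);
    }
    return $times;
}

print_r(extractTimes($html)); // 09:43, 14:15, 23:05
```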
OK, I forgot that this script runs to initialize trains as they appear on the board hours in advance, so the "2 hours ago" thing is not an issue. Here's what I came up with; it seems to be working:
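function convertTime($input, $currentHour) {
    // Morning run (8-11): a listed hour before 8 must be an afternoon departure.
    if ($currentHour >= 8 && $currentHour < 12 && $input < 8) {
        $input += 12;
    }
    // Afternoon run (12-19): any listed hour before 12 is p.m.
    if ($currentHour >= 12 && $currentHour < 20 && $input < 12) {
        $input += 12;
    }
    // Evening run (20-23): listed hours 9-11 are still p.m. tonight.
    if ($currentHour >= 20 && $currentHour < 24 && $input > 8 && $input < 12) {
        $input += 12;
    }
    return $input;
}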