I went to a PHP job interview, where I was asked to implement a piece of code to detect whether visitors are bots crawling through the website to steal content.
So I implemented a few lines of code to detect whether the site is being refreshed/visited too quickly or too often, using a session variable to store the last visit timestamp.
I was told that session variables can be manipulated via cookies etc., so I am wondering if there is an application-level variable that I can use to store the timestamp information against visitor IPs, e.g. $_SERVER['REMOTE_ADDR']?
I know that I could write the data to a file, but that's not a good fit for a high-traffic website.
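For reference, the check I implemented looks roughly like this (simplified to a plain function; the real page calls session_start() first, and the 2-second threshold is arbitrary):

```php
<?php
// Session-based rate check, roughly as described above. The real page
// would call session_start() and pass $_SESSION; a plain array is used
// here so the logic can run standalone. The threshold is arbitrary.
function too_fast(array &$session, int $now, int $minInterval = 2): bool {
    $last = $session['last_visit'] ?? 0;
    $session['last_visit'] = $now;       // record this visit
    return ($now - $last) < $minInterval; // true = suspiciously quick
}

$session = [];
var_dump(too_fast($session, 100)); // bool(false) — first visit
var_dump(too_fast($session, 101)); // bool(true)  — only 1s later
var_dump(too_fast($session, 110)); // bool(false) — 9s later
```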
Regards
James
I was told that session variables can be manipulated via cookies etc.
Just to be clear, clients can't edit session variables to their liking. They can, however, delete or change the PHPSESSID cookie, which simply gets them a different session. Superglobals such as $_SERVER are not persistent, so any changes you make to them will not survive to the next page load.
The best way to go about detecting crawlers is to store the IP address, user agent and timestamp of every page load in a database. The overhead is minuscule.
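A minimal sketch of that approach, using PDO with an SQLite store (the table and function names are illustrative; a high-traffic site would point the DSN at MySQL and prune old rows):

```php
<?php
// Open (or create) the log store; swap the DSN for MySQL in production.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS page_loads (
    ip TEXT NOT NULL,
    user_agent TEXT,
    loaded_at INTEGER NOT NULL
)');

// Record one request.
function log_page_load(PDO $db, string $ip, string $ua, int $now): void {
    $stmt = $db->prepare(
        'INSERT INTO page_loads (ip, user_agent, loaded_at) VALUES (?, ?, ?)'
    );
    $stmt->execute([$ip, $ua, $now]);
}

// How many requests has this IP made in the last $window seconds?
function recent_hits(PDO $db, string $ip, int $now, int $window = 60): int {
    $stmt = $db->prepare(
        'SELECT COUNT(*) FROM page_loads WHERE ip = ? AND loaded_at > ?'
    );
    $stmt->execute([$ip, $now - $window]);
    return (int) $stmt->fetchColumn();
}

// In a real page you would pass $_SERVER['REMOTE_ADDR'] and
// $_SERVER['HTTP_USER_AGENT']; hard-coded values here for illustration.
$now = time();
log_page_load($db, '203.0.113.5', 'ExampleBot/1.0', $now);
log_page_load($db, '203.0.113.5', 'ExampleBot/1.0', $now);
echo recent_hits($db, '203.0.113.5', $now), "\n"; // 2 hits in the window
```

Whether two hits in a window is "too many" is up to you; the point is that the query stays cheap even with an index on (ip, loaded_at).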
In a word, no. Your options are cookies, session vars (aka server-side cookies) and storage (file/db).
Your best bet for this might be after-the-fact analysis of the logs. It won't stop content theft on-the-fly, but it'll be much easier to find abuse patterns and block those IPs from future accesses.
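As an illustration of that after-the-fact analysis, a script along these lines could tally requests per IP from an Apache combined-format access log (the regex, sample lines and threshold are just assumptions):

```php
<?php
// Count requests per IP from access-log lines and flag heavy hitters.
// Assumes Apache "combined" format, where the IP is the first field.
function count_hits_per_ip(array $logLines): array {
    $hits = [];
    foreach ($logLines as $line) {
        if (preg_match('/^(\S+)\s/', $line, $m)) {
            $ip = $m[1];
            $hits[$ip] = ($hits[$ip] ?? 0) + 1;
        }
    }
    arsort($hits); // busiest IPs first
    return $hits;
}

// IPs at or above the threshold are candidates for blocking.
function suspicious_ips(array $hits, int $threshold = 1000): array {
    return array_keys(array_filter($hits, fn ($n) => $n >= $threshold));
}

$sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /a HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /b HTTP/1.1" 200 734',
    '198.51.100.9 - - [10/Oct/2023:13:55:40 +0000] "GET / HTTP/1.1" 200 1024',
];
$hits = count_hits_per_ip($sample);
print_r($hits);
```

In practice you would feed this the real log via file() or a streaming read, and look at request spacing as well as raw counts.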
You would need to store the IP and timestamps server-side. It's unlikely that a bot would send cookies back, and even a URL-based session is not reliable.
The overhead of a file should not be too much, unless you are doing flat-file logging, which will kill you. You can use SQLite or similar, perhaps stored on a memory-based filesystem for a small speed boost. Or you could go with something like memcached. If you need to persist the data, use MySQL. The overhead of a full-blown database is practically nothing compared with the time it takes PHP to do pretty much anything.
If you really want to do something like this with sessions, display a user agreement page unless there is a defined "I Agree" variable in the session. That way, if a bot doesn't send a valid session back, all it gets is the user agreement. If it does, then you can track it with session variables.
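A rough sketch of that gate, with the session passed in as an array so the logic is testable outside a web request (the key and page names are made up):

```php
<?php
// Decide what to serve based on session state. In a real page you would
// call session_start() and pass $_SESSION; an array is used here so the
// logic can run standalone.
function page_for(array &$session, bool $agreedNow = false): string {
    if ($agreedNow) {
        $session['agreed'] = true;   // visitor clicked "I Agree"
    }
    if (empty($session['agreed'])) {
        return 'user-agreement';     // bots that drop the cookie stay here
    }
    return 'content';                // visitors with a live session pass
}

$session = [];
echo page_for($session), "\n";       // first visit: user-agreement
echo page_for($session, true), "\n"; // after agreeing: content
echo page_for($session), "\n";       // session remembered: content
```

A bot that never returns the session cookie starts from an empty session on every request, so it only ever sees the agreement page.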
Bear in mind that the session-based solution is not necessary, since you don't need to remember client state between requests, and that sessions will incur as much overhead as, if not more than, most custom alternatives.
Regarding the statement that session variables can be manipulated by cookies: it's not entirely true. However, if you're silly enough to leave register_globals on and you ask for a global variable, I wouldn't like to hazard a guess as to whether it came from a session, a cookie, a query string, the environment, or was previously undefined. This is all moot if you explicitly access $_SESSION, of course.
Bots can simply ignore the cookie data (as in never passing the session ID back), so each of their requests starts a fresh session. The best option would be some sort of external DB or storage system, e.g. a small C++ socket program that stores each IP and compares it against recent connections.
Don't expect to defeat them by refresh times alone. I did something very similar to combat contact form spam and some bots wait as long as people before taking the next action.
I'd look more at IP addresses that load just the HTML document and ignore things like the favicon, CSS stylesheets, etc. If you configure your CSS files to be parsed as PHP, you can put some logic in there to mark that IP as looking legitimate. Just be careful about caching.
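A sketch of that trick: a stylesheet served through PHP records which IPs fetch page assets, and later you score HTML-only visitors as bot-like. All the names here are illustrative, and in reality the store would be memcached or a DB rather than an array:

```php
<?php
// style.css.php — served with a CSS content type, but executed as PHP,
// so we can note which IPs actually fetch page assets. Browsers do;
// scrapers that grab only the HTML usually never will.
// header('Content-Type: text/css');  // uncomment in a real web context

function mark_asset_loader(array &$store, string $ip, int $now): void {
    $store[$ip] = $now; // remember when this IP last fetched assets
}

// Later, when scoring an IP: no recent asset fetch looks bot-like.
function looks_like_browser(array $store, string $ip, int $now,
                            int $maxAge = 3600): bool {
    return isset($store[$ip]) && ($now - $store[$ip]) <= $maxAge;
}

$seen = [];                                     // in reality: memcached/DB
$now  = time();
mark_asset_loader($seen, '203.0.113.5', $now);  // this IP fetched the CSS
var_dump(looks_like_browser($seen, '203.0.113.5', $now));  // bool(true)
var_dump(looks_like_browser($seen, '198.51.100.9', $now)); // bool(false)
```

The caching caveat above matters here: if the CSS is cached by the browser or a proxy, repeat visits won't hit the PHP at all, so the $maxAge window has to be generous.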
Also, are you taking steps to make sure you don't lock out legitimate bots like Googlebot?