I need to put WHOIS data into a table, with fields like
- registrant,
- created date,
- expire date etc.
I have a script that extracts data from WHOIS servers, but the output is different for each domain extension.
For example, for .com domains the registrant details come as one complete address, and for .org domains they come as registrant name, street1, street2, street3, etc.
So I'm not able to extract the registrant details as a unit to put in the DB.
I heard somewhere that if we get the data as XML we would be able to extract it. Can somebody help me get around this? Thanks!
Actually the problem is a bit larger than that:
- there is no unified syntax for requests
- nor a defined set of capabilities
- there is no defined scheme for answers
- local legislation makes the contents differ
- there is no standardized set of errors
- the recorded information is often of poor quality
- you must deal with internationalization
The WHOIS service is defined by RFC 3912. It is a very basic request protocol that does not define the format of the answer at all. So the answers often reflect the format of the database that holds the data, and you may get a different syntax from each database. Since WHOIS can be used for whatever content the operator wants, you cannot make many assumptions about the format of the answer you will get. Hopefully, however, you can expect to receive parseable content, and similarly formatted answers for each request to the same server.
So you need to develop parsing logic for each server, which you will have to do in a very empirical manner.
However, here are a few tips for your development that come from the RFC:
- you need to send the request over TCP to port 43, as a single line terminated by the ASCII characters CR+LF
- the only signal that the answer is finished is the server closing the TCP connection
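Those two tips can be sketched in Python roughly as follows. This is only a minimal sketch: the function names are my own, and decoding the answer as UTF-8 is an assumption, since RFC 3912 does not specify a character set.

```python
import socket


def build_query(name: str) -> bytes:
    """Encode a single-line request terminated by CR+LF."""
    return name.encode("ascii") + b"\r\n"


def read_until_closed(sock: socket.socket) -> bytes:
    """Collect data until the peer closes the connection (recv returns b'')."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def whois_query(name: str, server: str, port: int = 43, timeout: float = 10.0) -> str:
    """Send one request to a WHOIS server and return the raw text answer."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_query(name))
        raw = read_until_closed(sock)
    # Assumption: UTF-8 with replacement characters; the RFC names no charset.
    return raw.decode("utf-8", errors="replace")
```

You would call it as `whois_query("example.com", "whois.example.net")`, where the server name is a placeholder for whichever server is responsible for the domain.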
About domain names specifically: the former restriction to ASCII led to Punycode being used to encode non-ASCII strings (accented characters, for example) in the DNS, so you should be prepared to meet these in some WHOIS replies as well. Internationalized Domain Names have existed since 2003, so you will need to support Unicode. The algorithms that convert the names are complex; RFC 3490 should give you some useful details about this.
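If you happen to work in Python, you may not need to implement the conversion yourself: its built-in `idna` codec implements RFC 3490 (the 2003 version of IDNA, not the later revisions). A small sketch, with helper names of my own choosing:

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (possibly Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("bücher.example"))  # xn--bcher-kva.example
```

Converting in both directions lets you query the server with the ASCII form and still show the readable form in your table.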
Good luck!
You need to detect the format and use a different regular expression for each one. Alternatively, as you mentioned, you can use XML or even JSON APIs: http://whoisxmlapi.com/ http://www.domaintools.com/api/docs/
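As a rough sketch of the regular expression approach, the function below tries two layouts and returns the registrant details as one string. Both layouts are invented for illustration (one "Registrant Name: ..." line per field, and a "Registrant:" header followed by an indented block); real servers will need their own patterns.

```python
import re
from typing import Optional

# Layout A (invented): one "Registrant <Field>: value" line per field.
KEY_VALUE = re.compile(
    r"^[ \t]*Registrant (Name|Street\d*|City|Country):[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)
# Layout B (invented): a "Registrant:" header followed by indented lines.
BLOCK = re.compile(r"^Registrant:[ \t]*\n((?:[ \t]+\S.*\n?)+)", re.MULTILINE)


def parse_registrant(text: str) -> Optional[str]:
    """Return the registrant details as one comma-separated string."""
    text = text.replace("\r\n", "\n")  # answers usually arrive with CR+LF
    fields = KEY_VALUE.findall(text)
    if fields:
        return ", ".join(value for _key, value in fields)
    block = BLOCK.search(text)
    if block:
        lines = [line.strip() for line in block.group(1).splitlines()]
        return ", ".join(line for line in lines if line)
    return None  # unknown layout: leave it for a parser written later
```

Whichever layout matches, the caller gets a single value that fits in one database column.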
You need to extend your database and your processing to deal with the problem better.
As you've already noted, the data provided by the remote service comes in different formats. So you need to separate the concerns of fetching the data and parsing it, because the two are independent of each other. For example, the format for one TLD can change over time.
So first of all, fetch the plain-text data per domain and store it together with its metadata:
- domain
- whois server
- timestamp of fetch operation
- response
- status code (if the protocol has this)
You can then do the parsing later, in a second processing step. You can use the metadata you have already stored to decide which parsing algorithm you need. That also helps you maintain your application over time.
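A minimal sketch of that two-step design in Python could look like this. All names here are hypothetical, including the server name and the simple "Key: value" layout the example parser assumes. The status code field is left out because WHOIS itself does not return one.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict


@dataclass
class RawWhoisRecord:
    """One fetched answer, stored untouched so it can be re-parsed later."""
    domain: str
    whois_server: str
    fetched_at: datetime
    response: str


# Registry mapping a WHOIS server name to its parsing function.
PARSERS: Dict[str, Callable[[str], Dict[str, str]]] = {}


def parser_for(server: str):
    """Decorator that registers a parser for one WHOIS server."""
    def register(func):
        PARSERS[server] = func
        return func
    return register


@parser_for("whois.example-registry.test")  # hypothetical server name
def parse_example(response: str) -> Dict[str, str]:
    # Assumes a simple "Key: value" layout, invented for this sketch.
    result = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            result[key.strip().lower()] = value.strip()
    return result


def normalize(record: RawWhoisRecord) -> Dict[str, str]:
    """Second pass: choose the parser from the stored metadata."""
    try:
        parser = PARSERS[record.whois_server]
    except KeyError:
        raise LookupError(f"no parser registered for {record.whois_server}")
    return parser(record.response)
```

Because the raw response is kept, a parser can be fixed or replaced when a server changes its format, and the stored records can simply be run through `normalize` again.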
Once the parsing has succeeded, you have the normalized format, which is what you are aiming for.
Besides these technical steps, you should take care of the usage conditions set by the WHOIS service(s). Not everything that is technically possible is legally or morally acceptable. Take care, and treat other people's personal records with the respect they deserve. Protect the data you collect, e.g. archive and scramble or lock away data you no longer need for your ongoing processing.
See also:
- RFC3912 WHOIS Protocol Specification