I am looking to extract specific items from a large pool of unstructured documents. These documents are typically 1-5 pages of text, formatted in various ways by the user, but in most cases they contain at least:
- Name
- Address (physical)
- Email Address
- Phone number
- Website URL
I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.
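For illustration, the database side could be as simple as a single contacts table. This is a minimal sketch using Python's built-in sqlite3 module; the table name, the column names and the sample record are all assumptions made up for the example, not part of any existing schema.

```python
import sqlite3

# In-memory database for the example; a file path would be used in practice.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one column per field listed above.
conn.execute(
    """
    CREATE TABLE contacts (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        address TEXT,
        email   TEXT,
        phone   TEXT,
        website TEXT
    )
    """
)

# A made-up record standing in for the output of whatever parser is used.
record = {
    "name": "Jane Doe",
    "address": "12 Example Street, Springfield",
    "email": "jane.doe@example.com",
    "phone": "(555) 123-4567",
    "website": "www.example.com",
}

# Named placeholders keep the insert safe against odd characters in the text.
conn.execute(
    "INSERT INTO contacts (name, address, email, phone, website) "
    "VALUES (:name, :address, :email, :phone, :website)",
    record,
)
conn.commit()

rows = conn.execute("SELECT name, email FROM contacts").fetchall()
```

Fields the parser cannot find can simply be left as NULL, since none of the columns is declared NOT NULL.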
Other services I've looked at, while valuable for other purposes, do not address this specific need:
- Alchemy API
- Open Calais
- Saplo
Any thoughts, suggestions or leads?
Have you found an answer to your question? I found some research articles:
www.cis.upenn.edu/~pereira/papers/crf.pdf
citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9192&rep=rep1&type=pdf
www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf
But I found no specific code examples implementing any of these ideas.
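As a rough starting point for the more regular fields (email, phone, URL), plain regular expressions may get you part of the way. This is only a sketch and not the approach from the articles above: the three patterns, the function name `extract_contact_fields` and the sample text are my own assumptions, and the phone pattern only covers North American style numbers. Names and physical addresses are much less regular, so they will probably need something statistical along the lines of those articles.

```python
import re

# Deliberately simple patterns. They are illustrative assumptions,
# not complete grammars for emails, URLs or phone numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>()]+", re.IGNORECASE)
# Matches forms such as (555) 123-4567, 555-123-4567 or 555.123.4567.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")


def extract_contact_fields(text):
    """Return candidate emails, URLs and phone numbers found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        # Drop sentence punctuation that the URL pattern may swallow.
        "urls": [u.rstrip(".,;") for u in URL_RE.findall(text)],
        "phones": PHONE_RE.findall(text),
    }


sample = (
    "Jane Doe, 12 Example Street, Springfield. "
    "Email: jane.doe@example.com, phone (555) 123-4567, "
    "site: www.example.com."
)
fields = extract_contact_fields(sample)
print(fields)
```

Each value is a list because a 1-5 page document can easily mention several emails or numbers; deciding which one belongs to the contact is a separate step.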
Take a look at this too: stackoverflow.com/questions/953150/general-address-parser-for-freeform-text
(Sorry, I left out the http prefix; this system does not allow me to post more than one URL/link.)