I posted the following question on another thread:
"Does anybody know of a good solution that can be used from php that will effectively remove contact information like phone numbers, email addresses and maybe even contact addresses from a document?"
I quickly got told what I suspected... I am asking too much :)
So now I am looking for alternative solutions. One I am considering is using Amazon's Mechanical Turk to do the contact information removal.
So, two questions:
- Would this be a good fit for Mechanical Turk?
- How effective is the service?
Check out http://www.microtask.com. (I'm not affiliated with this company.)
You might be able to cast a wide net with your regular expressions and then have the human workers sift out the real addresses, phone numbers, and e-mail addresses. Whether "such-and-such" is an address, phone number, or e-mail address is a fairly straightforward question for a human.
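To make the "wide net" idea concrete, here is a minimal sketch in Python. The two patterns, the function names (`find_candidates`, `redact`) and the `[removed]` placeholder are all made up for illustration, and the patterns are deliberately loose: they are meant to over-match so that a human reviewer can throw out the false positives. Patterns this simple should port to PHP's `preg_match_all` with little change, but I haven't tested that.

```python
import re

# Deliberately loose patterns (illustrative assumptions, not a vetted set).
# They favor recall over precision because human workers are expected
# to reject the false positives afterwards.
CANDIDATE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}


def find_candidates(text, context=20):
    """Return every candidate span, with some surrounding text for the reviewer."""
    candidates = []
    for kind, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(text):
            start, end = match.span()
            candidates.append({
                "kind": kind,
                "start": start,
                "end": end,
                "text": match.group(),
                "context": text[max(0, start - context):end + context],
            })
    return sorted(candidates, key=lambda c: c["start"])


def redact(text, confirmed):
    """Replace the spans a human confirmed, right to left so offsets stay valid."""
    for c in sorted(confirmed, key=lambda c: c["start"], reverse=True):
        text = text[:c["start"]] + "[removed]" + text[c["end"]:]
    return text
```

Each candidate, together with its short context, would become one yes/no question for a worker ("is this a phone number or e-mail address?"). Only the spans that come back confirmed get passed to `redact`, so something like an order number that merely looks like a phone number stays in the document.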
Since they chop the form up (or say they do; I haven't used it), you have less to worry about on the privacy front, or you may at least be able to justify the remaining risk. If MicroTask has hundreds of clients, they can take all of the microtasks and throw them into a giant hopper that randomizes which ones each individual worker sees. Hence, they could virtually guarantee that the workers will have almost no means to correlate any of the sensitive information they work on. Each worker would see thousands of independent pieces of information each day. Under these conditions, who would be able to discern that Task 347 on day 1 had the e-mail address that corresponds to Task 1133 on day 3? Even if they could, it's hardly worth it to them. They'll probably make more money just doing what is asked of them.
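The hopper idea can be sketched in a few lines of Python. To be clear, this is only my guess at the general technique, not how MicroTask actually works; `build_hopper`, `deal`, and the client/worker names are hypothetical.

```python
import random


def build_hopper(tasks_by_client, seed=None):
    """Pool every client's microtasks into one list and shuffle it."""
    rng = random.Random(seed)
    hopper = [
        (client, task)
        for client, tasks in tasks_by_client.items()
        for task in tasks
    ]
    rng.shuffle(hopper)
    return hopper


def deal(hopper, workers):
    """Hand the shuffled tasks out round-robin, one at a time."""
    assignments = {worker: [] for worker in workers}
    for i, item in enumerate(hopper):
        assignments[workers[i % len(workers)]].append(item)
    return assignments


# Hypothetical example: two clients, three workers.
tasks = {
    "client_a": ["a-snippet-1", "a-snippet-2", "a-snippet-3"],
    "client_b": ["b-snippet-1", "b-snippet-2"],
}
assignments = deal(build_hopper(tasks, seed=1), ["w1", "w2", "w3"])
```

With only a handful of tasks a worker could still see two snippets from the same document. The argument above depends on the pool being large enough that this becomes rare.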