How to develop a program to minimize errors in human transcription of handwritten surveys

I need to develop custom software to do surveys. Questions may be of multiple choice, or free text in a very few cases.

I was asked to design a subsystem to check if there is any error in the manual data entry for the multiple choice part. We're trying to speed up the user data entry process and to minimize human input differences between digital forms and the original questionnaires. The surveys are filled with handwritten marks and text by human interviewers, so it's possible to find hard-to-read marks, and the user could also accidentally select a different value in some question, and we would like to avoid that.

The software must include some automatic control to detect possible typing differences. Each answer of the multiple choice questions has the same probability of being selected.

This question has two parts:

The GUI.

The simplest thing I have in mind is to implement the most usable design for displaying the questions: large, readable fonts and generous spacing between the choices. Is there something else? For faster input, I would like to use drop down lists (favoring keyboard over mouse). Given that the questions are grouped in sections, I would like to show the answers selected for the questions of that section, but this could slow down the process. Any other ideas?

The error checking subsystem.

What else can I do to minimize or to check human typos in the multiple choice questions? Is this a solvable problem? Is there some statistical methodology to check that the values entered by the users are the same as those on the hand-filled forms? For example, let's suppose the survey has 5 questions, and each has 4 options. Let's say I have n survey forms filled in on paper by interviewers, and they're ready to be entered in the software. How can I minimize the accidental differences that manual transcription of the n surveys can introduce, without having to double check all 5 questions of all n surveys?

My first suggestion is that at the end of the processing of all the hand-filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances, but on what criteria can I make this selection? Would this validation be enough to cover everything in a significant way?

The actual survey is nationwide and has 56 pages with over 200 questions in total, so there will be a lot of handwritten pages by many people, and the intention is to reduce the likelihood of errors and to optimize speed in the data entry process. The surveys must be filled in on paper first, given the complications of sending laptops or handhelds out with the interviewers.

Call me old-school, but I still think the most pragmatic way to do this is to use double entry. Two data entry clerks enter their surveys, then swap stacks and enter the other clerk's surveys. Whenever your system detects a difference between the two, it throws up a flag - then the two clerks put their heads together and decide on the correct answer (or maybe it gets reviewed by a more senior research staff member, etc.). Combined with some of the other suggestions here (I like mdma's suggestions for the GUI a lot), this would make for a low-error system.
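The flagging step of double entry can be sketched in a few lines. This is only an illustration: the dictionary layout (question id mapped to the keyed option) and the name `find_conflicts` are assumptions, not part of any existing system.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: one keyed form is a dict of question id -> selected option.
Form = Dict[str, Optional[str]]

def find_conflicts(first: Form, second: Form) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (question, first_value, second_value) for every disagreement."""
    conflicts = []
    for question in sorted(set(first) | set(second)):
        a = first.get(question)   # None if the clerk skipped the question
        b = second.get(question)
        if a != b:
            conflicts.append((question, a, b))
    return conflicts

pass_one = {"q1": "A", "q2": "C", "q3": "B"}
pass_two = {"q1": "A", "q2": "D", "q3": "B"}
print(find_conflicts(pass_one, pass_two))  # [('q2', 'C', 'D')]
```

Only the flagged questions need to go back to the paper form, so the review effort scales with the number of disagreements rather than with the size of the survey.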

Yes, this will double your data entry time (maybe) - but it's dead simple and will cut your errors way, way down. The OMR idea is a great one, but it doesn't sound to me like this project (a national, 56-page survey) is the best case for a lone hacker to try to implement that for the first time. What software do you need? What hardware is available to do that? There will still be a lot of human work involved in identifying the goofy stuff where an interviewer marks all four possible answers and then writes a note off to the side - you'll likely want to randomly sample surveys to get a sense of what the machine-read error rate is. Even then you still just have an estimate of the error rate, not corrected data.

Try a simpler method to give your employer quality results this time - then use those results as a pre-validated data set for experimenting with the OMR stuff for next time.

OCR/OMR is probably the best choice, since you rule out unpredictable human error and replace it with fairly predictable machine error. It may even be possible to filter out forms that the OCR may struggle with and have these amended to improve scan accuracy.

But, tackling the original question head on:

Error Checking

Have questions correlated, so that essentially the same thing is asked more than once, or asked again in the negative. If the answers from correlated questions do not also correlate, then this could be an indication of input error.
Deviations from the norm: if there are patterns in the typical responses, then deviations from these typical responses could be considered potential input errors. E.g. if questions 2 and 3 are answered A, then question 4 is likely to be C or D. This is a generalization of the correlation above. The correlations can be computed dynamically based on already entered data.
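A rough sketch of the "deviations from the norm" idea, computed from forms that have already been entered. The function names, the thresholds `min_share` and `min_seen`, and the data layout are all assumptions chosen for illustration; a flag only means "worth a second look at the paper", not "definitely an error".

```python
from collections import Counter, defaultdict

def build_profile(forms, given, target):
    """Count how often each target answer follows each combination of 'given' answers."""
    profile = defaultdict(Counter)
    for form in forms:
        key = tuple(form[q] for q in given)
        profile[key][form[target]] += 1
    return profile

def is_unusual(form, profile, given, target, min_share=0.05, min_seen=20):
    """Flag a form whose target answer is rare for its combination of 'given' answers."""
    key = tuple(form[q] for q in given)
    counts = profile.get(key)
    if counts is None:
        return False              # combination never seen: nothing to compare against
    total = sum(counts.values())
    if total < min_seen:
        return False              # too little data to call anything unusual
    return counts[form[target]] / total < min_share

# Made-up history: after A, A on q2 and q3, question q4 is almost always C or D.
history = (
    [{"q2": "A", "q3": "A", "q4": "C"}] * 30
    + [{"q2": "A", "q3": "A", "q4": "D"}] * 30
    + [{"q2": "A", "q3": "A", "q4": "B"}]
)
profile = build_profile(history, given=("q2", "q3"), target="q4")
print(is_unusual({"q2": "A", "q3": "A", "q4": "B"}, profile, ("q2", "q3"), "q4"))  # True
print(is_unusual({"q2": "A", "q3": "A", "q4": "C"}, profile, ("q2", "q3"), "q4"))  # False
```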

GUI

have the GUI mimic the paper form, so that what entry clerks see on paper is reflected on the screen. Entering a paper question response into the wrong question in the GUI is less likely then.
provide visual assistance to data entry clerks, such as using a slider to maintain the current question location on paper.
A custom entry device for inputting the data may be easier to use than keyboard navigation and listboxes. For example, a touch display with all options spelled out A B C D. The clerk only has to hit an option, and it is selected and the next question shown - after a brief pause. In the event the clerk makes an error, they can use the prev/next buttons next to each question.
provide audio feedback of entered data, so when the clerk enters "A" they hear "A".

EDIT: If you consider performing dual-entry of data or implementing an improved GUI, it may be worth conducting a pilot scheme to assess the effectiveness of various approaches. Dual-entry can be expensive (doubling the cost of the data entry task) - which may or may not be justified by the improvement in accuracy. A pilot scheme will allow you to assess the effectiveness of dual-entry, quickly and relatively inexpensively. It will also give you an idea of the level of error from a single data entry clerk without any UI changes, which can help determine whether UI changes or other error-reducing strategies are needed and how much cost can be justified in implementing them.

Related links

A device that inputs data from multiple choice tests
Wikipedia: OMR - Optical Mark Recognition
ReadSoft - Automated Data Entry
Data capture hardware

My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances

I don't think this will actually produce a meaningful outcome. Presumably the errors are unintentional and random. Random checks would find systemic errors, but you'll only find 10% of random errors if you double-check 10% of the forms (and 20% of errors if you check 20% of forms, etc).

What do the paper surveys look like? If possible, I would guess that an OCR system which scans the hand-written tests and compares what the OCR detects the answer to be with what the data entry operator gave would be a better solution. You might still end up manually double-checking a fair number of surveys but you'll have some confidence that the surveys you double-check are more likely to contain an error than if you just picked them out at random.

If you also control what the paper surveys look like, then that's even better: you can design them specifically so that OCR can be made as accurate as possible.

Forgive me for totally side-stepping the question, but yesterday I went to eBay and paid US $99 for a 7 inch Android slate PC. Not the world's fastest processor, nor with heaps of RAM, but certainly enough to fill in user surveys in the field.

I can't believe that your organization can't afford $99 per interviewer to make this problem go away.

It's worth suggesting to your boss, at least, isn't it?

I would support Matt Parker's suggestion of using double entry to reduce errors. I have even seen triple entry used for very error-sensitive data entry tasks.

The good thing about double entry is it enables you to come up with a ballpark estimate of your overall error rate by making some assumptions (mainly that the error rate is consistent across entry items and clerks) and using the rate at which entry conflicts are encountered.

More sophisticated double entry systems can also measure the error rates of parts of the data entry task and individual clerks so that you can make improvements to reduce the error rate.
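To make the ballpark estimate concrete, here is one possible model; it is an assumption, not something the double entry method dictates. Suppose both clerks make independent errors at the same per-item rate p, and a wrong entry is equally likely to be any of the other k - 1 options. Then the two passes disagree with probability d = 2p(1 - p) + p^2 (k - 2)/(k - 1), which can be solved for p given the observed conflict rate d. The function names are hypothetical.

```python
import math

def clerk_error_rate(disagreement_rate: float, n_options: int) -> float:
    """Solve d = 2p - p^2 * k/(k-1) for the per-item error rate p (smaller root)."""
    k = n_options
    x = disagreement_rate * k / (k - 1)
    if not 0.0 <= x <= 1.0:
        raise ValueError("disagreement rate is outside the range this model can explain")
    return (1.0 - math.sqrt(1.0 - x)) * (k - 1) / k

def residual_error_rate(p: float, n_options: int) -> float:
    """Chance that both clerks make the SAME mistake, which double entry cannot detect."""
    return p * p / (n_options - 1)

# Example: conflicts on 4% of items, 4 options per question.
p = clerk_error_rate(0.04, 4)
print(p)                          # a little over 0.02, i.e. roughly d / 2
print(residual_error_rate(p, 4))  # undetected error rate under the same assumptions
```

For small conflict rates the estimate is close to d / 2, which matches the intuition that each disagreement usually comes from exactly one of the two clerks.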

It sounds like there is a need for a combined approach: the actual forms should be suitable for automated processing. You could scan the documents and just deal with the electronic version; if the multiple choice input can be automatically processed, you might get better error ratios by keeping the user out of the loop. Depending on the OCR package, I would guess that you will get a value back that tells you how sure the system is about a selection it has made; depending on that value, you will want to have the form verified by a person. Note that I am talking about using OCR on the marks of the multiple choice questions, not on the freeform entries, which are probably an issue by themselves.

In parallel you will probably want to do random checks to find the error ratio of the ocr system. This value can then be used for determining the confidence value for the sum of the multiple choice question.

I think a similar approach would be helpful if you just go with human input, you will probably not get rid of all the errors because people will make errors and they will make errors correcting errors, but with a large enough sample size you will probably be able to determine the ratio of errors in the human input. This number can then be used for determining the results of the survey.
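As a sketch of how a random re-check could be turned into an error ratio with an uncertainty range: the code below draws a uniform random sample of forms to audit and computes a Wilson score interval for the error rate. The Wilson interval is my choice of technique here, not something the answer prescribes, and the function names are hypothetical.

```python
import math
import random

def pick_audit_sample(form_ids, sample_size, seed=None):
    """Draw a uniform random sample of form ids to re-check against the paper."""
    rng = random.Random(seed)
    return rng.sample(list(form_ids), sample_size)

def wilson_interval(errors, checked, z=1.96):
    """Approximate 95% interval (z = 1.96) for the true error rate."""
    if checked <= 0:
        raise ValueError("need at least one checked item")
    p = errors / checked
    denom = 1 + z * z / checked
    center = (p + z * z / (2 * checked)) / denom
    half = z * math.sqrt(p * (1 - p) / checked + z * z / (4 * checked * checked)) / denom
    return max(0.0, center - half), min(1.0, center + half)

to_audit = pick_audit_sample(range(1, 5001), sample_size=200, seed=42)
low, high = wilson_interval(errors=3, checked=len(to_audit))
print(f"estimated error rate between {low:.3%} and {high:.3%}")
```

The sample must stay uniform for the interval to mean anything; re-checking only the forms that look suspicious answers a different question.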

As for other UI ideas, you could use the scanned forms and overlay the UI in a way that the UI checkbox is close to the written checkbox. If you have a couple of known lines at angles, straightening and scaling the form should not be too hard. If the UI input element is close to the pencil marks chances are you are going to get higher rates for correct classification.

You can also probably use statistical analysis to pick forms that seem out of line, but you might then be skewing the result by non-uniform selection, which might be worse than a uniform random error. Depending on the design of the paper survey, it might be helpful to copy that in the UI: it will be easier for everybody to find errors if the two look similar. If you don't stick to that, some of the references on survey design might be helpful.

This seems to be a rather large operation, I am sure there are some statisticians on staff, talk to them on what they need and what you could do to help them and should not do to skew results even more.

After you've implemented your best mix of software approaches to this problem, you could also consider running the output through Amazon's Mechanical Turk program to perform a human cross-check of the transcription against the originals. Other projects along those lines are reCaptcha (though it's only for printed text OCR as far as I can tell), and I just came across Beextra, which seems to be doing things like cataloging Smithsonian media.

Regarding detection of errors in transcription of multiple-choice answers, my suggestion is to use multiple data entry people and statistical profiling.

A statistician could compare the results to see if any questions stand out as having a markedly different answer distribution for answers entered by one data entry user vs. those of others. If so, then those questions can be flagged to be reentered from the forms.

Assuming that the forms are randomly assigned to data entry personnel, the entered results should have fairly similar answer distributions for a sufficiently large number of forms per data entry user.
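One way a statistician might run that comparison for a single question is a chi-square test of one clerk's answer counts against everyone else's. The sketch below computes the statistic by hand; the 11.345 cutoff is the 1% critical value for 3 degrees of freedom (4 options), and the function name and sample counts are made up for illustration.

```python
def chi_square_stat(clerk_counts, others_counts):
    """Chi-square statistic for a 2 x k table: one clerk versus all other clerks."""
    options = sorted(set(clerk_counts) | set(others_counts))
    rows = [
        [clerk_counts.get(o, 0) for o in options],
        [others_counts.get(o, 0) for o in options],
    ]
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][i] + rows[1][i] for i in range(len(options))]
    grand = sum(row_totals)
    stat = 0.0
    for r in range(2):
        for c in range(len(options)):
            expected = row_totals[r] * col_totals[c] / grand
            if expected > 0:
                stat += (rows[r][c] - expected) ** 2 / expected
    return stat

CRITICAL_DF3_1PCT = 11.345  # chi-square critical value, 3 degrees of freedom, alpha = 0.01

clerk = {"A": 55, "B": 20, "C": 15, "D": 10}       # hypothetical counts for one question
others = {"A": 250, "B": 260, "C": 240, "D": 250}
stat = chi_square_stat(clerk, others)
if stat > CRITICAL_DF3_1PCT:
    print("flag this question for re-entry from the paper forms")
```

With over 200 questions and several clerks, some flags will appear by chance alone, so the cutoff probably needs to be stricter than a single-test threshold.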

Human double checking is probably the most popular way to reach a low number of errors. If you'd like to speed it up, one person can simply count the total number of given answers and write this number at the bottom of the survey (a sort of 'control sum'). The person who enters data into your application should also fill that number in a special field; the system can then count the entered answers and compare the result with the expected value. This can catch a wrong quantity of answers, but not wrong values.
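The control sum check is simple to sketch. The field layout (a dict with `None` for unanswered questions) and the function names are assumptions for illustration.

```python
def count_answered(form):
    """Number of questions with an answer keyed in (None means left blank)."""
    return sum(1 for value in form.values() if value is not None)

def control_sum_matches(form, written_total):
    """Compare the keyed answer count with the total written at the bottom of the paper form."""
    return count_answered(form) == written_total

keyed = {"q1": "A", "q2": None, "q3": "C", "q4": "B"}
print(control_sum_matches(keyed, written_total=3))  # True
print(control_sum_matches(keyed, written_total=4))  # False: an answer was probably skipped
```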

You can also use some methods from data mining to detect errors in inserted data. Example: if you ask for age and salary range, you can create a rule that says: if age < X, it is most likely that the person does not earn more than Y, so give an alert and ask for revision. These are called association rules.
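A minimal sketch of such a rule check follows. The rule itself, the field names `age` and `salary_band`, and the thresholds are invented for the example; real rules would come from the survey designers or be mined from existing data.

```python
# Each rule is (description, predicate); the predicate returns True when the
# combination of answers looks implausible and should be re-checked on paper.
RULES = [
    ("respondent under 18 in the top salary bands",
     lambda form: form["age"] < 18 and form["salary_band"] >= 4),
]

def rule_alerts(form, rules=RULES):
    """Return the descriptions of all rules the keyed form violates."""
    return [description for description, violated in rules if violated(form)]

print(rule_alerts({"age": 16, "salary_band": 5}))  # one alert
print(rule_alerts({"age": 45, "salary_band": 5}))  # []
```

An alert should ask for revision rather than block the entry, since an unlikely answer can still be what the respondent actually marked.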

GUI: it should be a 1:1 representation of the paper form. Some keyboard shortcuts might be helpful to speed up work.

As has been mentioned, key it twice. Yes it's "double the work", but that leads to point 2.

Make the surveys EASY TO KEY.

They should be simple to read for the keyers, with the sections requiring their attention well highlighted so they stand out from the noise of the form.

Your "GUI" shouldn't be. The GUI's primary benefit is "discoverability", and these folks shouldn't be "discovering" anything. Keyboard navigation should be the "only" way once they start keying stuff in. One or two hands on the keyboard, one hand for changing survey page == no hands for a mouse. Attention to the screen (for a mouse, or anything really) is attention away from the survey for keying.

The keyers should be "heads down", not having to look at the screen at all. If practical, you can use audio prompts to tell the keyers when they've switched pages, to help ensure that what they're keying and what the computer is expecting are basically the same thing. If audio prompts aren't possible, then simply have the entry people key in the page of the survey that they are on. The computer will already "know" it's on page "2", and so when the keyer keys in the page number, it can validate that they're on the same spot.

DO use audible prompts for keying errors. Don't let them key in garbage, hit "save" and then correct errors. If you KNOW the data is wrong right away, STOP them and have them fix it immediately. Nothing catches their attention like 5 or 6 "ding ding dings", because they're already keying 3 fields later before they realize the computer stopped them. Auditing a long questionnaire for errors is a waste of time.

Do NOT "scroll" your data screens. Page back and forth. Scrolling sucks. When you scroll, fields on the screens move. When you don't they're always in the same spot so when the entry person DOES need to look at the screen, they can always look at the same place.

Because of this, drop down lists of any length -- suck. They shouldn't be using drop downs anyway, as they shouldn't be looking at the screen anyway. The form should TELL THEM EXACTLY what they need to key.

Be consistent with the data entry. Use the 10 key as much as possible. If you have more than 10 options, and 0-9 isn't practical for the entire questionnaire, then you should use 00-99. Don't use A-Z for options, as people don't think of keys that way. They don't memorize letters on the keyboard as much as they memorize word patterns on the keyboard. 01-26 is far faster to key than A-Z any day of the week.
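To go with the "stop them immediately" advice, here is a small sketch of validating a fixed-width numeric code the moment it is keyed. The function name and the convention that options are numbered from 01 are assumptions; the caller would sound the audible prompt whenever `None` comes back.

```python
def parse_option_code(keyed, n_options, width=2):
    """Return the option number for a keyed code such as '03', or None if invalid."""
    if len(keyed) != width or not (keyed.isascii() and keyed.isdigit()):
        return None                # wrong length or a non-digit key was hit
    value = int(keyed)
    if 1 <= value <= n_options:
        return value
    return None                    # out of range for this question

print(parse_option_code("03", n_options=4))  # 3
print(parse_option_code("07", n_options=4))  # None
```

A fixed width means the keyer never has to press Enter or Tab between fields, which keeps the 10-key rhythm going.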

Also, the SHIFT key is NOT your friend. But it'll be fine when they're in "typing english" mode.

Finally, organize the survey so all the "typing", "fill in the blank" stuff is in one section (ideally at the end). This lets them 10 key the rest in a blaze, get in to a zone, and not have to move their hands back and forth. Many folks will "top key" numbers when typing "english" (i.e. use the top row) and 10 key numbers when not.

For the multiple choice questions, it seems like an automated scan would be fairly reliable. If you have the option of scanning in all the documents before data entry starts, then incorporate the scans into the UI with computer guesses in place.

For a multiple choice question, have the data entry form on one side and the original scan on the other side. If the computer guess is above a certain threshold, fill in that choice in the data entry area. If the computer guess is below a certain threshold (multiple answers or no answer found) then do not mark an initial answer and highlight that question as needing attention. Even without the guesses, having the scanned paper visible on screen next to the data entry seems helpful.
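The prefill logic might look roughly like this, assuming the scanning software returns some per-option confidence score between 0 and 1. That score format, the thresholds, and the function name are all assumptions; a real OMR package will have its own output format.

```python
def prefill_choice(scores, accept=0.9, margin=0.5):
    """Return the option to prefill, or None when a human must look at the scan.

    scores: dict of option -> confidence that the box is marked (assumed 0..1).
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_option, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Prefill only when one mark is clear and no second box looks marked too.
    if best_score >= accept and best_score - runner_up >= margin:
        return best_option
    return None

print(prefill_choice({"A": 0.02, "B": 0.97, "C": 0.05, "D": 0.01}))  # B
print(prefill_choice({"A": 0.95, "B": 0.93, "C": 0.03, "D": 0.02}))  # None: two marks
print(prefill_choice({"A": 0.10, "B": 0.12, "C": 0.08, "D": 0.05}))  # None: no mark
```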

For the handwritten answers I have no real suggestions beyond having the scanned input beside the data entry area. Even if the image is not as legible as the original document, it will help ensure that the correct text is entered for each question. A fairly common input error is to be off by one, where the correct answer is entered for the wrong question. Having the image on screen could reduce that a little, and make it easier for another human to verify.

This assumes that all the forms are identical in layout so you can write some code to display a certain part of a certain page and expect it to be the right part of the form.

Design a closed loop system.

You have to inject, once in a while, doubly blind "reference forms" to be entered by your regular personnel to automatically rate their performance, and provide feedback based on the success rate.

This will control the human factor motivation and eliminate the major source of input errors.
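Scoring the reference forms could be as simple as the sketch below. The answer key, the clerk names, and the function name are hypothetical; the point is only that a form with known correct answers gives a direct error rate per clerk.

```python
def score_reference_entries(answer_key, keyed_by_clerk):
    """Return each clerk's error rate on a reference form with known answers."""
    rates = {}
    for clerk, keyed in keyed_by_clerk.items():
        wrong = sum(1 for question, correct in answer_key.items()
                    if keyed.get(question) != correct)
        rates[clerk] = wrong / len(answer_key)
    return rates

answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
keyed_by_clerk = {
    "clerk_1": {"q1": "A", "q2": "C", "q3": "B", "q4": "D"},
    "clerk_2": {"q1": "A", "q2": "D", "q3": "B", "q4": "D"},
}
print(score_reference_entries(answer_key, keyed_by_clerk))
# {'clerk_1': 0.0, 'clerk_2': 0.25}
```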