How to test the word count program to see if there are any uncovered bugs?

I just revisited the classic C textbook K&R and read exercise 1-11:

How would you test the word count program? What kinds of input are most likely to uncover bugs if there are any?

Actually, I only have a basic idea: manually count the words in an existing paragraph and compare that number with the result the word count program calculates.

Is there anything I've missed? And what is the trick to this kind of test?

EDIT

Answers summary:

Semantic definition of word, some special cases:

linked word: "cat-walk"
small words: a, b, c
biiiiiig words: "a fooooooooo<40MILLIONLETTERS>ooooooo a" has 3 words

boundary conditions:

Texts with multiple spaces between words.
Texts bigger than 2GB
Words which contain a dash but no whitespace.
Non-ascii words.
Files in some different encoding (if your program supports that)
Characters which are surrounded by whitespace but do not contain any word characters (e.g. "hello - world")
Texts without any words
Texts with all words on a single line

Well, it depends on what you semantically define as words. Since it is you who's writing the word count program, you are supposed to know what a word is.

So to test this program, you have to think about where the corner cases are: does a "linked-word" count as one or two words? Do you consider "I'm" to be one or two? Etc.

As for the K&R exercise, I guess they deliberately left out some of these corner cases, and they suggest that you find these pitfalls by analyzing their code.

Here are some examples of texts that could uncover bugs:

Texts with multiple spaces between words.
Texts bigger than 2GB
Words which contain a dash but no whitespace.
Non-ascii words.
Files in some different encoding (if your program supports that)
Characters which are surrounded by whitespace but do not contain any word characters (e.g. "hello - world")
Texts without any words
Texts with all words on a single line
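To make a few of these concrete, here is a minimal sketch of a counter in the style of the K&R state machine, working on a string instead of stdin so it is easy to feed test inputs. The name count_words and the string-based interface are my own assumptions, not the book's code, and it only treats space, newline and tab as separators.

```c
#define IN  1  /* currently inside a word */
#define OUT 0  /* currently outside a word */

/* Count whitespace-delimited words in a NUL-terminated string.
   Assumption: only ' ', '\n' and '\t' separate words. */
long count_words(const char *s)
{
    long nw = 0;
    int state = OUT;

    for (; *s != '\0'; ++s) {
        if (*s == ' ' || *s == '\n' || *s == '\t') {
            state = OUT;
        } else if (state == OUT) {
            /* first character of a new word */
            state = IN;
            ++nw;
        }
    }
    return nw;
}
```

Under this definition, "a   b" has 2 words, an empty or all-whitespace text has 0, "cat-walk" has 1, and "hello - world" has 3 because the lone dash is counted as a word. Whether that last result is a bug depends on what you decided a word is.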

I haven't re-read exercise 1-11 for this answer ... (my book is 60km away)

Things that might have been coded wrong

small words: "a b c d" has 4 words
biiiiiig words: "a fooooooooo<40MILLIONLETTERS>ooooooo a" has 3 words
use of several symbols: ",.!? ...
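For the very long word case, one way to test it without keeping a huge file around is to generate the input on the fly. The following is only a sketch under my own assumptions: the helper names count_words_stream and write_big_word_case are made up for illustration, and the counter is a whitespace-delimited state machine that reads one character at a time, so it never needs a buffer as large as the word.

```c
#include <stdio.h>

/* Count whitespace-delimited words read from `in`.
   The counter is unsigned long long so it does not overflow on huge inputs. */
unsigned long long count_words_stream(FILE *in)
{
    unsigned long long nw = 0;
    int c;
    int in_word = 0;

    while ((c = getc(in)) != EOF) {
        if (c == ' ' || c == '\n' || c == '\t') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            ++nw;
        }
    }
    return nw;
}

/* Write the text "a f<letters times 'o'> a" to `out`:
   three words, the middle one arbitrarily long. */
void write_big_word_case(FILE *out, unsigned long long letters)
{
    unsigned long long i;

    fputs("a f", out);
    for (i = 0; i < letters; ++i)
        putc('o', out);
    fputs(" a", out);
}
```

A possible way to use it is to write the case into a file obtained from tmpfile(), rewind() it, and check that count_words_stream() reports 3 regardless of how many letters were requested. An implementation that copies each word into a fixed-size buffer is the kind that tends to break here.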

Definitions that may not have been understood

"cat-walk" 1 word? 2 words?
"under-\nstood" line break (with hyphen) in the middle of a word
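To see how much the answer depends on the definition, here is a small sketch where the definition of "word character" is passed in as a function. The names count_runs, not_space and letter_or_digit are hypothetical, chosen just for this illustration.

```c
#include <ctype.h>

/* Count maximal runs of characters for which is_word_char() is nonzero. */
long count_runs(const char *s, int (*is_word_char)(int))
{
    long n = 0;
    int in_word = 0;

    for (; *s != '\0'; ++s) {
        /* cast avoids passing a negative value to the <ctype.h> functions */
        if (is_word_char((unsigned char)*s)) {
            if (!in_word) {
                in_word = 1;
                ++n;
            }
        } else {
            in_word = 0;
        }
    }
    return n;
}

/* Definition 1: anything that is not whitespace belongs to a word. */
int not_space(int c) { return !isspace(c); }

/* Definition 2: only letters and digits belong to a word. */
int letter_or_digit(int c) { return isalnum(c); }
```

With not_space, "cat-walk" is 1 word; with letter_or_digit it is 2. Both definitions report 2 for "under-\nstood", so treating a hyphenated line break as a single word would need extra logic that neither of these simple definitions has.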

To test an algorithm you should create a set of test cases with known expected results.

These test cases should cover:

Most possible combinations of the input;
"border" cases. In your case it could be: one word, 2 words with a lot of delimiters, short text started and ended with delimiters, and so on;
Some weird text. Just look at the algorithm and try to think of the strange input which can break it. Usually it is a quite small text (3-4 words) but with some strange delimiters between them like "hello,word", "hello ,word", "hello word,,,,"
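Such a set of cases could be written down as a table of inputs and expected results, roughly like this. It is a sketch, not a definitive harness: wc_words, struct wc_case and run_wc_cases are names I made up, and the expected values assume that only whitespace separates words, so a comma stays glued to its neighbor.

```c
#include <ctype.h>
#include <stdio.h>

/* Function under test: counts whitespace-delimited words. */
long wc_words(const char *s)
{
    long n = 0;
    int in_word = 0;

    for (; *s != '\0'; ++s) {
        if (isspace((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            ++n;
        }
    }
    return n;
}

struct wc_case {
    const char *input;
    long expected;
};

/* Expected values assume the whitespace-only definition of a word. */
static const struct wc_case wc_cases[] = {
    { "",                  0 },  /* no text at all */
    { "hello",             1 },  /* one word */
    { "hello      world",  2 },  /* many delimiters between two words */
    { "  hello world \n",  2 },  /* text starts and ends with delimiters */
    { "hello,word",        1 },  /* comma is not a separator here */
    { "hello ,word",       2 },
    { "hello word,,,,",    2 },
};

/* Run every case, report mismatches, and return the number of failures. */
int run_wc_cases(void)
{
    int failures = 0;
    size_t i;

    for (i = 0; i < sizeof wc_cases / sizeof wc_cases[0]; ++i) {
        long got = wc_words(wc_cases[i].input);
        if (got != wc_cases[i].expected) {
            printf("FAIL \"%s\": expected %ld, got %ld\n",
                   wc_cases[i].input, wc_cases[i].expected, got);
            ++failures;
        }
    }
    return failures;
}
```

Calling run_wc_cases() from main and returning its result as the exit status would be one way to use it. If you decide that punctuation should separate words, the expected column changes, which is exactly the point of writing the expectations down first.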

The other guys already gave some great practical suggestions. Let me add two things:

First of all, K&R don't want you to find all deficiencies of their code. The goal of the exercise is to make you aware of the fact that there often exists bogus input and that you may some day be expected to somehow deal with it in a similar situation. How you do that is completely up to you. Just remember that some seemingly easy problems sometimes require hard thinking.

Case in point: when my stupid iPhone receives a message which reads "foo is bad.it smells.", it recognizes "bad.it" as a URL. Seems funny, but as of yet, you can't fix this bug without changing the message content itself.

And second, your title is misleading. There's no way to find all bugs in a program by mere testing. Or as Edsger Dijkstra once put it:

Testing shows the presence, not the absence of bugs.

This is a fundamental result from theoretical computer science and can actually be proven. See Rice's theorem if you're interested.

EDIT: in writing this posting, I found a bug that is somehow related to our topic: the StackOverflow parser won't recognize " http://en.wikipedia.org/wiki/Rice's_theorem " as a URL. :-)

EDIT2: filed a bug report on meta here.