This was the main question posed by Greg Wilson's "bits of evidence" presentation. I'm paraphrasing in my question here, so please read the presentation for all the details.
I'd also like to know if you disagree with the premise of the question, i.e., whether you think that current development practices actually do reflect evidence.
Most rigorous empirical studies of programming (by deliberate, designed experiment, not just observation of whatever happens to occur), accounting for all the variables likely to affect the results, would be scary-costly.
For example, just as in experimental psychology but even more so, many such empirical studies (e.g., Prechelt's, as quoted in the presentation) are based on volunteers (and any statistician can tell you that using a self-selected sample totally biases the results and makes the whole study essentially useless) and/or students (and might 5, 10 or 20 years' worth of professional experience not make a huge difference to the results -- i.e., can experience be blindly assumed to be irrelevant, so that professionals learn nothing at all from it that might affect the results?).
Finding a representative, random sample would be fraught with difficulty for most researchers -- e.g., even if you could offer participants $40 an hour, a scarily high amount for a study of any decent size (in terms of both the number of participants and its length), you'd be biasing your sample towards unemployed or low-to-mid-salary programmers, a systematic bias that might well affect your results.
You could do it (get a random sample) in a coercion-capable structure -- one where refusing to take part in the study when randomly selected for the sample could carry retribution (most firms would be in such a position, and definitely so would, e.g., military programming outfits). You might get some grumbling, not-really-willing participants, but that's more or less inevitable. A firm with, say, 1000 programmers might pull a random sample of 100 of them to participate for two days -- enough for some studies, though definitely not for many of the most interesting ones quoted (e.g., those about the effects of different phases of the development cycle), and the sample would at least be representative of the programmers currently employed at that firm.
The cost to the firm (considering fully loaded employee and infrastructure costs) might be something like $100,000. How would the firm's investment get repaid? Unless the study's results can be effectively kept secret (unlikely with so many persons involved, and wouldn't the researchers want to publish?-), "improving programmer productivity" (by possibly changing some practice based on the study) is no real answer, because all of the firm's competitors (those with similar programmer populations and practices, at least) could easily imitate any successful innovation. (I do hope and believe such results would not be patentable!-).
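As a back-of-the-envelope check on that order of magnitude (the hourly rate below is an assumption for illustration, not a figure from any study):

```python
participants = 100          # randomly selected programmers
days = 2
hours_per_day = 8
loaded_cost_per_hour = 60   # assumed fully loaded hourly cost (salary + overhead), USD

study_cost = participants * days * hours_per_day * loaded_cost_per_hour
print(f"${study_cost:,}")   # -> $96,000, roughly the $100,000 ballpark
```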
So, studies based on students and/or volunteers, very short studies, and purely observational (which is not the same as empirical!-) ones, are most of what's around. If you're not clear about the difference between observational and empirical: for most of humanity's history, people were convinced heavy objects fall faster, based on observational evidence; it took deliberate experiments (set up by Galileo to compare falling speeds while trying to reduce effects, such as air resistance, that he couldn't actually deal with rigorously), i.e., empirical evidence, to change opinions on the subject.
This is not totally worthless evidence, but it is somewhat weak -- one set of semi-convincing data points out of many, which decision-making managers must weigh, but only up to a point. Say there's this study based on students somewhere, or volunteers from the net, or even a proper sample of 100 people... from a company that does software completely different from mine and in my opinion hires mediocre programmers; how should I weigh those studies, compared with my own observational evidence based on accurate knowledge of the specific sectors, technologies, and people my firm is dealing with? "Somewhat" seems a reasonable adverb to use here;-)
Because...
- empirical "evidence" is hard to measure and expensive to produce;
- studies which produce such evidence are often tainted by commercial concerns or other particular motives;
- the idea of systematic reproducibility in the context of software development is in part flawed.
Disclosure: The above assertions are themselves the product of my own analysis, based mostly on personal experience and precious little scientific data. ;-) Nevertheless, here are some more details that somewhat support these assertions.
Pertinent metrics regarding any sophisticated system are difficult to find. That's because the numerous parts of a complex system provide an even greater number of possible parameters to measure, assert, and compare. It is also, maybe mainly, because of the high level of correlation between these various metrics. Software design is no exception: with thousands of technology offerings, hundreds of languages, dozens of methodologies, and many factors that lie outside the discipline proper, effective metrics are hard to find. Furthermore, many of the factors in play are discrete/qualitative in nature, and hence less easily subjected to numerical treatment. No wonder the "number of lines of code" is still much talked about ;-)
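To make the correlation point concrete, here is a minimal sketch with made-up numbers (a hypothetical defect rate and LOC figures, not data from any study) showing how a raw defect count largely restates code size -- part of why "lines of code" keeps creeping back into the conversation:

```python
import numpy as np

rng = np.random.default_rng(42)

defect_rate = 0.005          # assumed defects per line, identical for both teams
loc_team_a = 20_000          # hypothetical: team A ships less code this quarter
loc_team_b = 80_000          # hypothetical: team B ships four times as much

defects_a = rng.poisson(defect_rate * loc_team_a)   # ~100 expected
defects_b = rng.poisson(defect_rate * loc_team_b)   # ~400 expected

print(f"Team A: {loc_team_a:,} LOC, {defects_a} defects")
print(f"Team B: {loc_team_b:,} LOC, {defects_b} defects")
# Raw counts make B look four times "worse"; normalizing by size shows
# the two teams are essentially indistinguishable:
print(f"defects/KLOC  A: {1000 * defects_a / loc_team_a:.1f}   B: {1000 * defects_b / loc_team_b:.1f}")
```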
It is easy to find many instances where a particular study (or indeed a particular write-up by some "consulting entity") is sponsored in the context of a particular product or a particular industry. For example, folks selling debugging tools will tend to overestimate the percentage of time spent on debugging as compared to overall development time, etc. Folks selling profilers...
The premise that software development processes should adopt methodologies associated with the mass production of identical products (it is no coincidence that at least two of the slides featured an automobile assembly line) is, in my opinion, greatly flawed. To be sure, individual steps in the process can and should be automated and produce predictable results, but as a whole there are too many factors and too few instances of projects/products to seek the kind of rationalization found in mass-production industries.
Commentary on Greg's presentation per se:
Generally I found the slides a pleasant read, humorous and all, but they left me somewhat hungry for substance and relevance. It is nice to motivate folks to strive towards evidence-based processes, but this should be followed by practical observations in the domain of software engineering to help outline the impediments and opportunities in this area.
I'm personally a long time advocate of the use of evidence-based anything, and I'm glad to live in a time where online technologies, computing power and general mathematical frameworks come together to deliver many opportunities in various domains, including but not limited to the domain of software engineering.
Because making decisions that go against conventional wisdom is risky.
A manager puts his job on the line when he goes against the accepted ways of doing things. It's a much safer decision to just stick with the wisdom of the crowd.
Interesting presentation!
It's really hard (and very costly) to run controlled experiments that are large enough and real enough to be compelling to practitioners. We tend to get small experiments involving 20 graduate students over a few hours, when what we really need is to measure teams of experienced developers working for a few weeks or months on the same task under different conditions (see slide 12). And of course the latter studies are very expensive.
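To give a rough sense of the scale problem, here is a standard back-of-the-envelope power calculation (normal approximation for a two-group comparison; the effect sizes are illustrative assumptions, not numbers from the presentation):

```python
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate participants per group needed to detect a standardized
    mean difference of `effect_size` between two conditions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = z.inv_cdf(power)           # desired statistical power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

for d in (0.8, 0.5, 0.2):                # large, medium, small effects (Cohen's rough scale)
    print(f"effect size {d}: ~{n_per_group(d):.0f} participants per group")
# -> ~25, ~63, ~392: a 20-student study split across two conditions can only
#    hope to detect quite large effects, and says little about smaller ones.
```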
While smaller studies can be suggestive, real development organizations can't draw many real conclusions from them. I think instead that the more effective teams mainly learn from experience, a much less empirical process. If something works well enough, it will carry forward to the next project; if something goes wrong, a different approach will be tried next time. Small pilot projects will be undertaken, new technologies will be tried out, and notes will be compared with colleagues in other organizations.
So I'd say that (non-dysfunctional) development groups behave more or less rationally, but that they could really use evidence from more ambitious experiments.