Wednesday, November 30, 2011

Incorrect bug fixes

Whenever developers fix a bug there is of course a chance of making a mistake – not fixing the bug correctly or completely, or introducing a regression: an unexpected side effect. This is a serious problem in system maintenance. In Geriatric Issues of Aging Software, Capers Jones says that
Roughly 7 percent of all defect repairs will contain a new defect that was not there before. For very complex and poorly structured applications, these bad fix injections have topped 20 percent.
An interesting new study, "How do Fixes Become Bugs?", looks at mistakes made by developers trying to fix bugs. The study analyzed bug fixes made to large operating system code bases. The findings aren't surprising, but they are interesting:
  • Somewhere between 15% and 25% of bug fixes are found to be incorrect in the field. Almost half of these mistakes are serious (they can cause crashes, hangs, data corruption or security problems).
  • Concurrency bugs are the most difficult to fix correctly: 39% of concurrency bug fixes are wrong, and fixes for data race bugs can easily introduce deadlocks or reveal other bugs that were previously hidden. Not surprising, given that the analysis was of operating system code, but it still highlights the risks in trying to fix concurrent code – see the sketch after this list.
  • The risk of making mistakes is magnified if the person making the fix is not familiar with the code. More than 25% of incorrect fixes are made by developers who had never touched that part of the code before.
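To make the concurrency risk concrete, here is a minimal, hypothetical sketch (in Python; not from the study) of how an innocent-looking data race fix can introduce a deadlock: two code paths guard the same two pieces of shared state, but take the locks in opposite order.

    import threading

    lock_a = threading.Lock()
    lock_b = threading.Lock()

    def transfer_a_to_b():
        with lock_a:        # the race "fix": lock account A first...
            with lock_b:    # ...then account B
                pass        # move the money

    def transfer_b_to_a():
        with lock_b:        # the same fix, made somewhere else, takes
            with lock_a:    # the locks in the opposite order
                pass

If these two functions run on different threads, each can acquire its first lock and then block forever waiting for the other. The usual defence – a single, global lock ordering – is exactly the kind of system-wide knowledge that a developer who has never touched the code won't have.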
The main reasons for incorrect bug fixes:
  • Bug fixes are usually done under tight timelines – bug fixers don't have time to think about potential side effects and interactions with the rest of the system, and testers don't have enough time to thoroughly regression test the fix.
  • Bug fixing has a narrow focus – the developer is focused on understanding and fixing the bug, doesn't stop to understand the wider context of the system, and often doesn't even check for other places where the same fix needs to be made. Testers are also narrowly focused on proving that the fix works, and don't look outside of the specific problem.
  • Lack of understanding of the code base – fixes, especially high-risk fixes like concurrency changes, should be made by whoever understands the code best.

So: don't let people who don't know the code well try to fix high-risk problems like concurrency bugs. But you knew that already, didn't you?

Tuesday, November 29, 2011

Iterationless Development – the latest New New Thing

Thanks to the Lean Startup movement, Iterationless Development and Continuous Deployment have become the New New Thing in software development methods. Apparently this has gone so far that “there are venture firms in Silicon Valley that won’t even fund a company unless they employ Lean startup methodologies”.

Although most of us don’t work in a Web 2.0 social media startup, or anything like one, it’s important to cut through the hype and see what we can learn from these ideas. One of the most comprehensive descriptions I’ve seen so far of Iterationless Development is a (good, but buzzword-heavy) presentation by Erik Huddleston that explains how development is done at Dachis Group, which builds online social communities. The development team’s backlog is updated on a just-in-time basis, and includes customer business requirements (defined as minimum features), feedback from Operations (data from analytics and results of Devops retrospectives), and minimally required technical architecture.

Work is managed using Kanban WIP limits and queues. Developers create tests for each change or fix up front. Every check-in kicks off automated tests and static analysis checks for complexity and code duplication as part of Continuous Integration. If it passes these steps, the change is promoted to a test environment, and the code must then be reviewed for architectural oversight (they use Atlassian’s Crucible online code review tool to do this).

Once all of the associated change sets have been reviewed, the code changes are deployed to staging for acceptance testing and review by product management, before being promoted to production. All production changes (code change sets, environment changes and database migration sets) are packaged into Chef recipes and progressively rolled out online. It’s a disciplined and well-structured approach that depends a lot on automation and a good tool set.
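To make the Kanban control structure concrete, here is a minimal sketch of a WIP limit; the stage names and limits are my own illustration, not Dachis Group's:

    class Stage:
        def __init__(self, name, wip_limit):
            self.name = name
            self.wip_limit = wip_limit
            self.items = []

        def pull(self, item):
            # a stage only accepts new work while under its WIP limit;
            # otherwise work queues upstream, making the bottleneck visible
            if len(self.items) >= self.wip_limit:
                raise RuntimeError(f"{self.name} is at its WIP limit")
            self.items.append(item)

        def finish(self, item):
            self.items.remove(item)

    development = Stage("development", wip_limit=3)

Keeping the limits low is what forces work to flow through one piece at a time instead of piling up between stages.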

Death to Time Boxing

What makes Iterationless Development different is obviously the lack of time boxing – instead of being structured in sprints or spikes, work is done in a continuous flow. According to Huddleston, iterationless Kanban is “here to stay” and is “much more productive than artificial time boxing”.

In a separate blog post, he talks about the death of iterations. He agrees that iterations have benefits – they provide a fixed and consistent routine for the team to follow, a forcing function to drive work to conclusion (nothing focuses the mind like a deadline), and logical points for the team to synch up with the rest of the business – but he asserts that working in time boxes is unnatural and unnecessary: the artificial and arbitrary boundaries defined by time boxes force people to compromise on solutions, and to cut corners in order to meet deadlines.

I agree that time boxes are arbitrary – but no more arbitrary than a work day or work week, or a month or a financial quarter; all cycles that businesses follow. In business we are always working towards a deadline, whether it is hard and real or soft and arbitrary. This is how work gets done. And this doesn’t change if we are working in time boxes or without them.

In iterationless Kanban, the pressure to meet periodic time box deadlines is replaced with the constant pressure to deliver work as fast as possible, to meet individual task deadlines. Rapid cycling in short time boxes is hard enough on teams over a long period of time. Continuous, interrupt-driven development with a tight focus on optimizing cycle time is even harder. The dials are set to on and they stay that way. Kanban makes this easy, giving the team, the customer and management the tools to continuously visualize work in progress, identify bottlenecks and delays, and squeeze out waste – to maximize efficiency. This is a manufacturing process model, remember. The emphasis on tactical optimization and fast feedback loops, and the "myopic focus on eliminating waste", are just that – short-sighted and difficult to sustain.

With time boxes there are at least built-in synch points, chances for the team to review and reset, so that people can reflect on what they have done, look for ways to improve, look ahead at what they need to do next, and then build up again to an optimal pace. This isn’t waste. Cycling up and down is important and necessary to keep people from getting burnt out and to give them a chance to think and to get better at what they do.

Risk is managed in the same tactical, short-sighted way. Teams working on one issue at a time end up managing risk one issue at a time, relying heavily on automated testing and in-stream controls like code reviews. This is good, but not good enough for many environments: security and reliability risks need to be managed in a more comprehensive, systemic way. Even integrating feedback from Ops isn’t enough to find and prevent deep problems. Working in Agile time boxes is already trading technical risks for speed and efficiency. Iterationless Development and Continuous Deployment, focused on eliminating waste and on accelerating cycle time, pushes these tradeoffs even further, into the danger zone.

Huddleston is also critical of "boxcaring" – batching different pieces of work together in a time box – because it interferes with simple prioritization and introduces unnecessary delays. But batching work together that makes sense to do together can be a useful way to reduce risk and cost. Take a simple example: the team is working on feature 1a; once it's done, they move on to feature 1b, then 1c. All of this work requires changing the same parts of the code, the same or similar testing and reviews, and has a similar impact on operations. By batching this work together you might deliver it more slowly, but you can reduce waste and minimize risk by delivering it once rather than three times, as the sketch below shows.
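A back-of-the-envelope sketch of that tradeoff, with made-up numbers – each delivery carries a fixed overhead (regression testing, review, deployment) on top of the development cost of the features in it:

    overhead_per_delivery = 2.0   # days of testing, review and deployment
    cost_per_feature = 1.0        # days of development

    features = ["1a", "1b", "1c"]

    separate = len(features) * (overhead_per_delivery + cost_per_feature)
    batched = overhead_per_delivery + len(features) * cost_per_feature

    print(separate, batched)   # 9.0 days delivered separately, 5.0 batched

The bigger the fixed overhead relative to the size of each change, the more batching pays off; continuous delivery of single changes only wins when that overhead has been automated down to nearly nothing.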

Iterationless Development Makes Sense…

Iterationless Development using Kanban as a control structure is an effective way to deal with excessive pressure and uncertainty – in an early-stage startup, say, or a firefighting support team. It's good for rapid innovation and experimental prototyping, building on continuous feedback from customers and from Operations – situations where speed and responsiveness to the business and customers are critical, more important than minimizing technical and operational risks. It formalizes the way that most successful web startups work – come up with a cool idea, build a prototype as quickly as possible, then put it out and find out what customers actually want before you run out of cash. But it's not a one-size-fits-all solution to software development problems.

All software development methods are compromises – imperfect attempts at managing risks and uncertainty. Sequential or serial development methods attempt to specify and fix the solution space upfront, and then manage to this fixed scope. Iterative, time-boxed development helps teams deal with uncertainty by breaking business needs down into small, concrete problems and delivering a working solution in regular steps. And iterationless, continuous-flow development lets teams rapidly test ideas and alternatives when the problem isn't clear and nobody is sure yet what direction to go in.

There’s no one right answer. What approach you follow depends on what your priorities and circumstances are, and what kind of problems and risks you need to solve today.

Tuesday, November 15, 2011

Diminishing Returns in software development and maintenance

Everyone knows from reading The Mythical Man Month that as you add more people to a software development project you will see diminishing marginal returns.

When you add a person to a team, there's a short-term hit as the rest of the team slows down to bring the new team member up to speed and adjusts to working with another person, making sure that they fit in and can contribute. There's also a long-term cost. More people means more people who need to talk to each other – the number of communication paths grows as n(n-1)/2 – which means more opportunities for misunderstandings, mistakes, misdirections and missed handoffs, more chances for disagreements and conflicts, and more bottleneck points.
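The arithmetic gets ugly quickly:

    def communication_paths(n):
        # each pair of people is a potential communication path: n(n-1)/2
        return n * (n - 1) // 2

    for n in (5, 10, 20):
        print(n, communication_paths(n))   # 5 -> 10, 10 -> 45, 20 -> 190

Doubling a 10-person team doesn't double the communication overhead – it more than quadruples it.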

As you continue to add people, the team needs to spend more time getting each new person up to speed and more time keeping everyone on the team in synch. Adding more people means that the team speeds up less and less, while people costs and communications costs and overhead costs keep going up. At some point negative returns set in – if you add more people, the team’s performance will decline and you will get less work done, not more.

Diminishing Returns from any One Practice

But adding too many people to a project isn’t the only case of diminishing returns in software development. If you work on a big enough project, or if you work in maintenance for long enough, you will run into problems of diminishing returns everywhere that you look.

Pushing too hard in one direction, depending too much on any tool or practice, will eventually yield diminishing returns. This applies to:
- Manual functional and acceptance testing
- Test automation
- Any single testing technique
- Code reviews
- Static analysis bug finding tools
- Penetration tests and other security reviews

Aiming for 100% code coverage on unit tests is a good example. Building a good automated regression safety net is important – as you wire in tests for key areas of the system, programmers get more confidence and can make more changes faster.

How many tests are enough? In Continuous Delivery, Jez Humble and David Farley set 80% coverage as a target for each of automated unit testing, functional testing and acceptance testing. You could get by with lower coverage in many areas, higher coverage in core areas. You need enough tests to catch common and important mistakes. But beyond this point, more tests get more difficult to write, and find fewer problems.
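If you want to hold a team to a target like this, it's easy to automate. A minimal sketch using coverage.py's Python API – the "tests" module name is hypothetical, and how you actually run your suite is up to you:

    import unittest
    import coverage

    cov = coverage.Coverage()
    cov.start()
    unittest.main(module="tests", exit=False)   # run the test suite
    cov.stop()

    total = cov.report()   # prints a report and returns total coverage (%)
    if total < 80.0:
        raise SystemExit(f"coverage {total:.1f}% is below the 80% target")

Enforcing the floor in Continuous Integration keeps the safety net from quietly eroding as the code grows.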

Unit testing can only find so many problems in the first place. In Code Complete, Steve McConnell explains that unit testing can only find between 15% and 50% (on average 30%) of the defects in your code. Rather than writing more unit tests, people's time would be better spent on other approaches – exploratory system testing, code reviews, stress testing or fuzzing – that find different kinds of errors.
Too much of anything is bad, but too much whiskey is enough.
– Mark Twain, as quoted in Code Complete
Refactoring is important for maintaining and improving the structure and readability of code over time. It is intended to be a supporting practice – to help make changes and fixes simpler and clearer and safer. When refactoring becomes an end in itself or turns into Obsessive Refactoring Disorder, it not only adds unnecessary costs as programmers waste time over trivial details and style issues, it can also add unnecessary risks and create conflict in a team.

Make sure that refactoring is done in a disciplined way, and focus it on the areas that need it the most: code that is frequently changed, routines that are too big, too hard to read, too complex and error-prone. Putting most of your attention on refactoring (or, if necessary, rewriting) this code will get you the highest returns.

Less and Less over Time

Diminishing returns also set in over time. The longer you spend working the same way and with the same tools, the less benefit you will see. Even core practices that you've grown to depend on pay back less and less over time, and at some point may cost more than they are worth.

It’s time again for New Year’s resolutions – time to sign up at a gym and start lifting weights. If you stick with the same routine for a couple of months, you will start to see good results. But after a while your body will get used to the work – if you keep doing the same things the same way your performance will plateau and you will stop seeing gains. You will get bored and stop going to the gym, which will leave more room for people like me. If you do keep going, trying to push harder for returns, you will overtrain and injure yourself.

The same thing happens to software teams following the same practices, using the same tools. Some of this is due to inertia. Teams and organizations reach an equilibrium point and want to stay there – because it is comfortable, and it works, or at least they understand it. And because the better the team is working, the harder it is to get better – all the low-hanging fruit has been picked. People keep doing what worked for them in the past. They stop looking beyond their established routines, stop looking for new ideas. Competence and control lead to complacency and acceptance. Instead of trying to be as good as possible, they settle for being good enough.

This is the point of inspect-and-adapt in Scrum and other time-boxed methods – asking the team to regularly re-evaluate what they are doing and how they are doing it, what's going well and what isn't, what they should do more of or less of, challenging the status quo and finding new ways to move forward. But even the act of assessing and improving is subject to diminishing returns. If you are building software in 2-week time boxes, and you've been doing this for 3, 4 or 5 years, how much meaningful feedback should you really expect from so many superficial reviews? After a while the team finds themselves going over the same issues and problems and coming up with the same results. Reviews become an unnecessary and empty ritual, another waste of time.

The same thing happens with tools. When you first start using a static analysis bug checking tool for example, there’s a good chance that you will find some interesting problems that you didn’t know were in the code – maybe even more problems than you can deal with. But once you triage this and fix up the code and use the tool for a while, the tool will find fewer and fewer problems until it gets to the point where you are paying for insurance – it isn’t finding problems any more, but it might someday.

In "Has secure software development reached its limits?” William Jackson argues that SDLCs – all of them – eventually reach a point of diminishing returns from a quality and security standpoint, and that Microsoft and Oracle and other big shops are already seeing diminishing returns from their SDLCs. Their software won’t get any better – all they can do is to keep spending time and money to stay where they are. The same thing happens with Agile methods like Scrum or XP – at some point you’ve squeezed everything that you can from this way or working, and the team’s performance will plateau.

What can you do about diminishing returns?

First, understand and expect returns to diminish over time. Watch for the signs, and factor this into your expectations – that even if you maintain discipline and keep spending on tools, you will get less and less return for your time and money. Watch for the team’s velocity to plateau or decline.

Expect this to happen and be prepared to make changes, even force fundamental changes on the team. If the tools that you are using aren’t giving returns any more, then find new ones, or stop using them and see what happens.

Keep reviewing how the team is working, but do these reviews differently: review less often, make the reviews more focused on specific problems, involve different people from inside and outside of the team. Use problems or mistakes as an opportunity to shake things up and challenge the status quo. Dig deep using Root Cause Analysis and challenge the team’s way of thinking and working, look for something better. Don’t settle for simple answers or incremental improvements.

Remember the 80/20 rule. Most of your problems will happen in the same small number of areas, from a small number of common causes. And most of your gains will come from a few initiatives.

Change the team’s driving focus and key metrics, set new bars. Use Lean methods and Lean Thinking to identify and eliminate bottlenecks, delays and inefficiencies. Look at the controls and tests and checks that you have added over time, question whether you still need them, or find steps and checks that can be combined or automated or simplified. Focus on reducing cycle time and eliminating waste until you have squeezed out what you can. Then change your focus to quality and eliminating bugs, or to simplifying the release and deployment pipeline, or some other new focus that will push the team to improve in a meaningful way. And keep doing this and pushing until you see the team slowing down and results declining. Then start again, and push the team to improve again along another dimension. Keep watching, keep changing, keep moving ahead.

Thursday, November 3, 2011

Real, useful security help for software developers

There's lots of advice on designing and building secure software. All you need to do is: Think like an attacker. Minimize the Attack Surface. Apply the principles of Least Privilege and Defense in Depth and Economy of Mechanism. Canonicalize and validate all input. Encode and escape output within the correct context. Use encryption properly. Manage sessions in a secure way....
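Every one of those one-liners hides context. Take just one – encoding and escaping output – which means something different in HTML bodies, HTML attributes, JavaScript and URLs. A minimal sketch for the HTML-body case, using Python's standard library (illustrative only):

    import html

    user_input = '<script>alert("xss")</script>'
    print(html.escape(user_input))
    # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;

Escaping the same value for a URL or a JavaScript string needs a different encoder – exactly the kind of detail the advice above glosses over.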

But how are development teams actually supposed to do all of this? How do they know what's important, and what's not? What frameworks and libraries should they use? Where are code samples that they can review and follow? How can they test the software to see if they did everything correctly?

Read my latest post at the SANS Appsec Street Fighter blog for the best of the tools, cheat sheets and programming books that I've found to help development teams deal with the details of building secure software.