Wednesday, November 11, 2015
Thursday, August 20, 2015
Some Rules of failure in Complex Systems
4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.
3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.
14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.
The net of this: Complex systems are essentially and unavoidably fragile. We can try, but we can’t stop them from failing – there are too many moving pieces, too many variables and too many combinations to understand and to test. And even the smallest change or mistake can trigger a catastrophic failure.
A New Hope
But new research at the University of Toronto on catastrophic failures in complex distributed systems offers some hope – a potentially simple way to reduce the risk and impact of these failures.
The researchers looked at distributed online systems that had been extensively reviewed and tested, but still failed in spectacular ways.
They found that most catastrophic failures were initially triggered by minor, non-fatal errors: mistakes in configuration, small bugs, hardware failures that should have been tolerated. Then, following rule #3 above, a specific and unusual sequence of events had to occur for the catastrophe to unravel.
The bad news is that this sequence of events can’t be predicted – or tested for – in advance.
The good news is that catastrophic failures in complex, distributed systems may actually be easier to fix than anyone previously thought. Looking closer, the researchers found that almost all (92%) catastrophic failures are the result of incorrect handling on non-fatal errors. These mistakes in error handling caused the system to behave unpredictably, causing other errors, which weren’t always handled correctly or predictably, creating a domino effect.
More than half (58%) of catastrophic failures could be prevented by careful review and testing of error handling code. In 35% of the cases, the faults in error handling code were trivial: the error handler was empty or only logged a failure, or the logic was clearly incomplete. Easy mistakes to find and fix. So easy that the researchers built a freely available static analysis checker for Java byte code, Aspirator, to catch many of these problems.
In another 23% of the cases, the error handling logic of a non-fatal error was so wrong that basic statement coverage testing or careful code reviews would have caught the mistakes.
The next challenge that the researchers encountered was convincing developers to take these mistakes seriously. They had to walk developers through understanding why small bugs in error handling, bugs that “would never realistically happen” needed to be fixed – and why careful error handling is so important.
This is a challenge that we all need to take up – if we hope to prevent catastrophic failure in complex distributed systems.
Tuesday, July 7, 2015
There’s a lot of bad software out there. Unreliable, insecure, unsafe and unusable. It’s become so bad that some people are demanding regulation of software development and licensing software developers as “software engineers” so that they can be held to professional standards, and potentially sued for negligence or malpractice.
Licensing would ensure that everyone who develops software has at least a basic level of knowledge and an acceptable level of competence. But licensing developers won’t ensure good software. Even well-trained, experienced and committed developers can’t always build good software. Because most of the decisions that drive software quality aren’t made by developers – they’re made by somebody else in the organization.
Product managers and Product Owners. Project managers and program managers. Executive sponsors. CIOs and CTOs and VPs of Engineering. The people who decide what’s important to the organization, what gets done and what doesn’t, and who does it – what problems the best people work on, what work gets shipped offshore or outsourced to save costs. The people who do the hiring and firing, who decide how much money is spent on training and tools. The people who decide how people are organized and what processes they follow. And how much time they get to do their work.
Managers – not developers – decide what quality means for the organization. What is good, and what is “good enough”.
As a manager, I’ve made a lot of mistakes and bad decisions over my career. Short-changing quality to cut costs. Signing teams up for deadlines that couldn’t be met. Giving marketing control over schedules and priorities, trying to squeeze in more features to make the customer or a marketing executive happy. Overriding developers and testers who told me that the software wouldn’t be ready, that they didn’t have enough time to do things properly. Letting technical debt add up. Insisting that we had to deliver now or never, and that somehow we would make it all right later.
I’ve learned from these mistakes. I think I know what it takes to build good software now. And I try to hold to it. But I keep seeing other managers make the same mistakes. Even at the world’s biggest and most successful technology companies, at organizations like Microsoft and Apple.
These are organizations that control their own destinies. They get to decide what they will build and when they need to deliver it. They have some of the best engineering talent in the world. They have all the good tools that money can buy – and if they need better tools, they just write their own. They’ve been around long enough to know how to do things right, and they have the money and scale to accomplish it.
They should write beautiful software. Software that is a joy to use, and that the rest of us can follow as examples. But they don’t even come close. And it’s not the fault of the engineers.
Problems with software quality at Microsoft are so long-running that “Microsoft Quality” has become a recognized term, for software that is just barely “good enough” to be marginally accepted – and sometimes not even that good.
Even after Microsoft became a dominant, global enterprise vendor, quality has continued to be a problem. A 2014 Computer World article “At Microsoft, quality seems to be job none” complains about serious quality and reliability problems in early versions of Windows 10. But Windows 10 is supposed to represent a sea change for Microsoft under their new CEO, a chance to make up for past mistakes, to do things right. So what's going wrong?
The culture and legacy of “good enough” software has been in place for so long that Microsoft seems to be trapped, unable to improve even when they have recognized that good enough isn’t good enough anymore. This is a deep-seated organizational and cultural problem. A management problem. Not an engineering problem.
Apple’s Software Quality Problems
Apple sets themselves apart from Microsoft and the rest of the technology field, and charges a premium based on their reputation for design and engineering excellence. But when it comes to software, Apple is no better than anyone else.
From the epic public face plant of Apple Maps, to constant problems in iTunes and the App Store, problems with iOs updates that fail to install, data lost somewhere in the iCloud, serious security vulnerabilities, error messages that make no sense, and baffling inconsistencies and restrictions on usability, Apple’s software too often disappoints in fundamental and embarrassing ways.
And like Microsoft, Apple management seems have lost their way:
I fear that Apple’s leadership doesn’t realize quite how badly and deeply their software flaws have damaged their reputation, because if they realized it, they’d make serious changes that don’t appear to be happening. Instead, the opposite appears to be happening: the pace of rapid updates on multiple product lines seems to be expanding and accelerating.
I suspect the rapid decline of Apple’s software is a sign that marketing is too high a priority at Apple today: having major new releases every year is clearly impossible for the engineering teams to keep up with while maintaining quality. Maybe it’s an engineering problem, but I suspect not — I doubt that any cohesive engineering team could keep up with these demands and maintain significantly higher quality.
Marco Arment, Apple has lost the functional high ground, 2015-01-04
Recent announcements at this year’s WWDC indicate that Apple is taking some extra time to make sure that their software works. More finish, less flash. We’ll have to wait and see whether this is a temporary pause or a sign that management is starting to understand (or remember) how important quality and reliability actually is.
Managers: Stop Making the Same Mistakes
If companies like Microsoft and Apple, with all of their talent and money, can’t build quality software, how are the rest of us supposed to do it? Simple. By not making the same mistakes:
Putting speed-to-market and cost in front of everything else. Pushing people too hard to hit “drop dead dates”. Taking “sprints” literally: going as fast as possible, not giving the team time to do things right or a chance to pause and reflect and improve.
We all have to work within deadlines and budgets, but in most business situations there’s room to make intelligent decisions. Agile methods and incremental delivery provide a way out when you can’t negotiate deadlines or cost, and don’t understand or can’t control the scope. If you can’t say no, you can say “not yet”. Prioritize work ruthlessly and make sure that you deliver the important things as early as you can. And because these things are important, make sure that you do them right.
Leaving testing to the end. Which means leaving bug fixing to after the end. Which means delivering late and with too many bugs.
Disciplined Agile practices all depend on testing – and fixing – as you code. TDD even forces you to write the tests before the code. Continuous Integration makes sure that the code works every time someone checks in. Which means that there is no reason to let bugs build up.
Not talking to customers, not testing ideas out early. Not learning why they really need the software, how they actually use it, what they love about it, what they hate about it.
Deliver incrementally and get feedback. Act on this feedback, and improve the software. Rinse and repeat.
Ignoring fundamental good engineering practices. Pretending that your team doesn’t need to do these things, or can’t afford to do them or don’t have time to do them, even though we’ve known for years that doing things right will help to deliver better software faster.
As a Program Manager or Product Owner or a Business Owner you don’t need to be an expert in software engineering. But you can’t make intelligent trade-off decisions without understanding the fundamentals of how the software is built, and how software should be built. There’s a lot of good information out there on how to do software development right. There’s no excuse for not learning it.
Ignoring warning signs.
Listen to developers when they tell you that something can’t be done, or shouldn’t be done, or has to be done. Developers are generally too willing to sign on for too much, to reach too far. So when they tell you that they can’t do something, or shouldn’t do something, pay attention.
And when you make mistakes - which you will, learn from them, don’t waste them. When something goes wrong, get the team to review it in a retrospective or run a blameless post mortem to figure out what happened and why, and how you can get better. Learn from audits and pen tests. Take negative feedback from customers seriously. This is important, valuable information. Treat it accordingly.
As a manager, the most important thing you can do is to not set your team up for failure. That’s not asking for too much.
Wednesday, June 24, 2015
OWASP Top 10
The OWASP Top 10 is a community-built list of the 10 most common and most dangerous security problems in online (especially web) applications. Injection flaws, broken authentication and session management, XSS and other nasty security bugs.
These are problems that you need to be aware of and look for, and that you need to prevent in your design and coding. The Top 10 explains how to test for each kind of problem to see if your app is vulnerable (including common attack scenarios), and basic steps you can take to prevent each problem.
If you’re working on mobile apps, take time to understand the OWASP Top 10 Mobile list.
IEEE Top Design Flaws
The OWASP Top 10 is written more for security testers and auditors than for developers. It’s commonly used to classify vulnerabilities found in security testing and audits, and is referenced in regulations like PCI-DSS.
The IEEE Center for Secure Design, a group of application security experts from industry and university researchers, has taken a different approach. They have come up with a Top 10 list that focuses on identifying and preventing common security mistakes in architecture and design.
This list includes good design practices such as: earn or give, but never assume trust; identify sensitive data and how they should be handled; understand how integrating external components changes your attack surface. The IEEE’s list should be incorporated into design patterns and used in design reviews to try and deal with security issues early.
OWASP Proactive Controls
IEEE’s approach is principle-based – a list of things that you need to think about in design, in the same way that you think about things like simplicity and encapsulation and modularity.
The OWASP Proactive Controls, originally created by security expert Jim Manico, is written at the developer level. It is a list of practical, concrete things that you can do as a developer to prevent security problems in coding and design. How to parameterize queries, and encode or validate data safely and correctly. How to properly store passwords and to implement a forgot password feature. How to implement access control – and how not to do it.
It points you to Cheat Sheets and other resources for more information, and explains how to leverage the security features of common languages and frameworks, and how and when to use popular, proven security libraries like Apache Shiro and the OWASP Java Encoder.
Katy Anton and Jason Coleman have mapped all of these controls together (the OWASP Top 10, OWASP Proactive Controls and the IEEE Security Flaws), showing how the OWASP Proactive Controls implement safe design practices from the IEEE list and how they prevent or mitigate OWASP Top 10 risks.
You can use these maps to look for gaps in your application security practices, in your testing and coding, and in your knowledge, to identify areas where you can learn and improve.
Monday, June 22, 2015
Wednesday, June 17, 2015
DevOps can help reduce technical debt in some fundamental ways.
First, building a Continuous Delivery/Deployment pipeline, automating the work of migration and deployment, will force you to clean up inconsistencies and holes in configuration and code deployment, and inconsistencies between development, test and production environments.
And automated Continuous Delivery and Infrastructure as Code gets rid of dangerous one-of-a-kind snowflakes and configuration drift caused by making configuration changes and applying patches manually over time. Which makes systems easier to setup and manage, and reduces the risk of an un-patched system becoming the target of a security attack or the cause of an operational problem.
A CD pipeline also makes it easier, cheaper and faster to pay down other kinds of technical debt. With Continuous Delivery/Deployment, you can test and push out patches and refactoring changes and platform upgrades faster and with more confidence.
The Lean feedback cycle and Just-in-Time prioritization in DevOps ensures that you’re working on whatever is most important to the business. This means that bugs and usability issues and security vulnerabilities don’t have to wait until after the next feature release to get fixed. Instead, problems that impact operations or the users will get fixed immediately.
But there’s a negative side to DevOps that can add to technical debt costs.
Michael Feathers’ research has shown that constant, iterative change is erosive: the same code gets changed over and over, the same classes and methods become bloated (because it is naturally easier to add code to an existing method or a method to an existing class), structure breaks down and the design is eventually lost.
DevOps can make this even worse.
DevOps and Continuous Delivery/Deployment involves pushing out lots of small changes, running experiments and iteratively tuning features and the user experience based on continuous feedback from production use.
Many DevOps teams work directly on the code mainline, “branching in code” to “dark launch” code changes, while code is still being developed, using conditional logic and flags to skip over sections of code at run-time. This can make the code hard to understand, and potentially dangerous: if a feature toggle is turned on before the code is ready, bad things can happen.
Feature flags are also used to run A/B experiments and control risk on release, by rolling out a change incrementally to a few users to start. But the longer that feature flags are left in the code, the harder it is to understand and change.
There is a lot of housekeeping that needs to be done in DevOps: upgrading the CD pipeline and making sure that all of the tests are working; maintaining Puppet or Chef (or whatever configuration management tool you are using) recipes; disciplined, day-to-day refactoring; keeping track of features and options and cleaning them up when they are no longer needed, getting rid of dead code and trying to keep the code as simple as possible.
Microservices and Technology Choices
This is because loosely-coupled Microservices are easier for individual teams to independently deploy, change, refactor or even replace.
And a Microservices-based approach provides developers with more freedom when deciding on language or technology stack: teams don’t necessarily have to work the same way, they can choose the right tool for the job, as long as they support an API contract for the rest of the system.
In the short term there are obvious advantages to giving teams more freedom in making technology choices. They can deliver code faster, quickly try out prototypes, and teams get a chance to experiment and learn about different technologies and languages.
But Microservices “are not a free lunch”. As you add more services, system testing costs and complexity increase. Debugging and problem solving gets harder. And as more teams choose different languages and frameworks, it’s harder to track vulnerabilities, harder to operate, and harder for people to switch between teams. Code gets duplicated because teams want to minimize coupling and it is difficult or impossible to share libraries in a polyglot environment. Data is often duplicated between services for the same reason, and data inconsistencies creep in over time.
There is a potentially negative side to the Lean delivery feedback cycle too.
Constantly responding to production feedback, always working on what’s most immediately important to the organization, doesn’t leave much space or time to consider bigger, longer-term technical issues, and to work on paying off deeper architectural and technical design debt that result from poor early decisions or incorrect assumptions.
Smaller, more immediate problems get fixed fast in DevOps. Bugs that matter to operations and the users can get fixed right away instead of waiting until all the features are done, and patches and upgrades to the run-time can be pushed out more often. Which means that you can pay off a lot of debt before costs start to compound.
But behind-the-scenes, strategic debt will continue to add up. Nothing’s broke, so you don’t have to fix anything right away. And you can’t refactor your way out of it either, at least not easily. So you end up living with a poor design or an aging technology platform, slowly slowing down your ability to respond to changes, to come up with new solutions. Or forcing you to continue filling in security holes as they come up, or scrambling to scale as load increases.
DevOps can reduce technical debt. But only if you work in a highly disciplined way. And only if you raise your head up from tactical optimization to deal with bigger, more strategic issues before they become real problems.
Friday, June 5, 2015
A new book by Len Bass, Ingo Weber and Liming Zhu “DevOps: A Software Architect’s Perspective”, part of the SEI Series in Software Engineering, looks at how DevOps affects architectural decisions, and a software architect’s role in DevOps.
The authors focus on the goals of DevOps: to get working software into production as quickly as possible while minimizing risk, balancing time-to-market against quality.
“DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while insuring high quality”These fundamental practices are:
- Engaging operations as a customer and partner, “a first-class stakeholder”, in development. Understanding and satisfying requirements for deployment, logging, monitoring and security in development of an application.
Engaging developers in incident handling. Developers taking responsibility for their code, making sure that it is working correctly, helping (often taking the role of first responders) to investigate and resolve production problems.
This includes the role of a “reliability engineer” on every development team, someone who is responsible for coordinating downstream changes with operations and for ensuring that changes are deployed successfully.
- Ensuring that all changes to code and configuration are done using automated, traceable and repeatable mechanisms – a deployment pipeline.
- Continuous Deployment of changes from check-in to production, to maximize the velocity of delivery, using these pipelines.
- Infrastructure as Code. Operations provisioning and configuration through software, following the same kinds of quality control practices (versioning, reviews, testing) as application software.
Cloud Architecture and Microservices
As a reference for architects, the book focuses on architectural considerations for DevOps. It walks through how Cloud-based systems work, virtualization concepts and especially microservices.
While DevOps does not necessarily require making major architectural changes, the authors argue that most organizations adopting DevOps will find that a microservices-based approach, as pioneered at organizations like Netflix and Amazon, by minimizing dependencies between different parts of the system and between different teams, will also minimize the time required to get changes into production – the first goal of DevOps.
Conway’s Law also comes into play here. DevOps work is usually done by small agile cross-functional teams solving end-to-end problems independently, which means that they will naturally end up building small, independent services:
“Having an architecture composed of small services is a response to having small teams.”
But there are downsides and costs to a microservice-based approach.
As Martin Fowler and James Lewis point out, microservices introduce many more points of failure. Which means that resilience has to be designed and built into each service. Services cannot trust their clients or the other services that they call out to. You need to add defensive checking on data and anticipate failures of other services, implement time-outs and retries, and fall back alternatives or safe default behaviors if another service is unavailable. You also need to design your service to minimize the impact of failure on other services, and to make it easier and faster to recover/restart.
Microservices also increase the cost and complexity of end-to-end system testing. Run-time performance and latency degrade due to the overhead of remote calls. And monitoring and troubleshooting in production can be much more complicated, since a single action often involves many microservices working together (an example at LinkedIn, where a single user request may chain to as many as 70 services).
DevOps in Architecture: Monitoring
In DevOps, monitoring becomes a much more important factor in architecture and design, in order to meet operations requirements.
The chapter on monitoring explains what you need to monitor and why, DevOps metrics, challenges in monitoring systems under continuous change, monitoring microservices and monitoring in the Cloud, and common log management and monitoring tools for online systems.
Monitoring also becomes an important part of live testing in DevOps (Monitoring as Testing), and plays a key role in Continuous Deployment. The authors look at common kinds of live testing, including canaries, A/B testing, and Netflix’s famous Simian Army in terms of passive checking (Security Monkey, Compliance Monkey) and active live testing (Chaos Monkey and Latency Monkey).
DevOps in Architecture: Security
Security is another important cross-cutting concern in software architecture addressed in this book. It looks at security fundamentals including how to identify threats (using Microsoft’s STRIDE model) and the resources that need to be protected, CIA, identity management, access controls. It provides an overview of the security controls in NIST 800-53, and common security issues with VMs and in Cloud architectures (specifically AWS).
In DevOps, security needs to be wired into Continuous Deployment:
- Enforcing that all changes to code and configuration are done through the Continuous Deployment pipeline
- Security testing should be included in different stages of the Continuous Deployment pipeline
- Securing the pipeline itself, including the logs and the artifacts
Continuous Deployment Pipeline and Gatekeepers
Developers – and architects – have to take responsibility for building their automated testing and deployment pipelines. The book explains how Continuous Deployment leverages Continuous Integration, and common approaches to code management and test automation. And it emphasizes the role of gatekeepers along the pipeline – manual decisions or automated checks at different points to determine if it is ok to go forward, from development to testing to staging to live production testing and then to production.
DevOps and Modern Software Architecture
“DevOps: A Software Architect’s Perspective” does a good job of explaining common DevOps practices, especially Continuous Deployment, in a development, instead of operations, context. It also looks at contemporary issues in software architecture, including virtualization and microservices.
It is less academic than Bass’s other book “Software Architecture in Practice”, and emphasizes the importance of real-world operations concerns like reliability, security and transparency (monitoring and live checks and testing) in architecture and deployment.
This is a book written mostly for enterprise software architects and managers who want to understand more about DevOps and Continuous Deployment and Cloud services.
If you’re already deep into DevOps and working with microservices in the Cloud, you probably won’t find much new here.
But if you are looking at how to apply DevOps at scale, or how to migrate legacy enterprise systems to microservices and the Cloud, or if you are a developer who wants to understand operations and what DevOps will mean to you and your job, this is worth reading.