Wednesday, December 23, 2009

New Communities for Project Managers and Security

The other day I happened across a new Q&A community forum for project managers called AskAboutProjects.com. This site is built on the Stack Overflow Knowledge Exchange Engine, the same platform that hosts the popular software development Q&A site Stack Overflow and Server Fault, a similar resource for IT system administrators.

The Stack Overflow engine is an effective and low-cost platform for quickly building communities. It has some quirks, many of them around its security model, which make it awkward to use at times: for example, it is difficult to enter links in answers or profiles (sometimes it works, sometimes it doesn’t). I haven’t taken the time to figure out why, and I shouldn’t need to: the UI should be more seamless. And Firefox’s NoScript plug-in (I never leave home without it) occasionally catches XSS problems on some of the sites.

But the community experience is addictive – I find myself spending way too much time scanning the boards and offering help where I can. Some of the programmers on my team have found Stack Overflow handy when working with new technology or debugging obscure technical problems.

There is something of a gold rush going on, with people hurrying to set up new communities using this engine: there are communities being launched for gamers, amateur radio, technology support forums, sports bettors, dating, industrial robots, the iPhone, travel, diving, professional stock traders, musicians, real estate, organizational psychology, aerospace engineering, startups, world cup soccer, natural living, electronics, climate change, mountain biking, money, moms, spirituality… you name it.

Another one of these sites that I am following is SecurityCrunch, a new community focused on IT security issues.

Of course there is no guarantee which communities will catch on. AskAboutProjects is new and the community is still small. Many of the forum questions so far are either homework assignments (which plague Stack Overflow as well) or seed questions from the founders of the community. Although it appears to be intended as a general resource for project managers, it is clearly focused at the moment on IT, and more specifically software development, projects and related issues, reflecting the founders’ backgrounds.

It will be worth keeping an eye on these communities over the next few months, to see which, if any of them, can replicate the success of Stack Overflow.

Friday, December 11, 2009

Much ado about... nothing much (Agile Vancouver 2009)

In early November I attended Much Ado About Agile, the Agile Vancouver interest group’s annual conference. I was looking for a short break, and this conference offered a chance to get away from daily responsibilities, reflect, and learn more about the state of the art in software development.

I’ve decided to look back to see what stayed with me, what I learned that was worth taking forward.

First off, it was grand being back in Vancouver – I lived there for a couple of years and always enjoy going back: the mountains and the water, the parks and markets and the seashore, dining at some of the city’s excellent restaurants, and of course snacking at wonderful, quirky Japadog.

The conference agenda was a mixed bag: a handful of Agile community rock stars re-playing old hits or pushing their latest books, including Martin Fowler, Johanna Rothman, and Mary Poppendieck; some consultants from ThoughtWorks and wherever else presenting commercials in the guise of case studies; and some earnest hands-on real developers telling war stories, from which you could hope to learn something.

I was surprised by the number of (mostly young) well-intentioned, enthusiastic people at the sessions. There was sincere interest in the rooms; you could feel the intensity, the heat from so many questing minds. We were looking for answers, for insight, for research and experience.

But unfortunately, what we got wasn’t much.

The rock stars were polished and confident, but mostly kept to safe, introductory stuff. I remember attending Martin Fowler’s keynote. Martin is indisputably a smart guy and worth listening to: I had the pleasure of spending a few days with him on round tables at last year’s Construx Software Executive Summit where we explored some interesting problems in software development. To be honest, I had to go back to my notes to remember what he spoke about in Vancouver: a couple of short talks, one on agile fundamentals and something smart about technical debt and simple design. If you’ve read Martin’s books and follow his posts, there was nothing new here, nothing to take back. Maybe I expected too much.

I decided to avoid the professional entertainment for a while and see what I could learn from some less polished, real-life practitioners. I stuck to the “hard” track, avoiding the soft presentations on teamwork, building trust and such.

One talk, “Agile vs the Iron Triangle”, was about using lightweight methods to deliver large projects on a fixed-cost, fixed-schedule basis: making commitments, freezing the schedule and then managing scope, following incremental, build-to-schedule methods. Most of the challenges here of course are in estimating the size of work that needs to be done, understanding the team’s capacity to deliver the work, and making trade-offs with the customer: accepting but managing change, trading changes in scope in order to adhere to the schedule. This lecture was interesting because it was real, representing the efforts of an organization trying to reconcile plan-driven and agile practices, working with customers with real demands, under real constraints.

Another session was on operations at a small Internet startup where the development team was also responsible for operations. The focus here was on lightweight, open source operations tooling: essential tools for availability checks, log monitoring, performance and capacity analysis, and system configuration using technology like Puppet. Nothing new here, but it was fun to see a developer so excited about and focused on technical operations issues, and committed to keeping the developers and operations staff working closely together as the company continued to grow.
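The kind of lightweight, single-purpose tooling described in that session is easy to sketch. As a rough illustration (my own, not from the talk – the health-check URL and threshold are hypothetical), a minimal availability check needs little more than the standard library:

```python
import time
import urllib.request


def check_availability(url, timeout=5.0):
    """Fetch a health-check URL; report whether it is up and how long it took."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            latency = time.monotonic() - start
            return {"up": resp.status == 200, "latency_s": latency}
    except OSError:
        # Connection refused, DNS failure, timeout, etc. all count as "down".
        return {"up": False, "latency_s": time.monotonic() - start}
```

A cron job calling something like this and alerting when `up` is false, or when `latency_s` drifts past a threshold, covers a surprising amount of ground for a small operation.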

There were some more talks about the basics of performance tuning, an advertisement for ThoughtWorks’ Cruise continuous integration platform, and some other sessions that weren’t worth remembering. I had the most fun at Philippe Kruchten’s lecture on backlog management: recognizing and managing not only work for business features, but architecture / plumbing, and technical debt, “making invisible work visible”. Dr. Kruchten is an entertaining speaker who clearly enjoys performing in front of a crowd, and he enjoys his work; his enthusiasm was infectious.

And finally there was a technical session by Michael Feathers on Error Processing and Error Handling as First Class Considerations in Design. Feathers bucked the trend, playing the cool professor who could not care less if half the class was left behind. His focus was on idioms and patterns for improving error handling in code – in particular, the idea of creating “safe zones”, where you only need to worry about construction problems at, or outside, the edge of the zone, making for cleaner and more robust code in the safe core. Definitely the hardest, geekiest of the talks that I attended. And like several of the sessions I attended, it had little to do directly with agile development methods – instead it challenged the audience to think about ways to write good code, which is what it all comes down to in the end.
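The safe-zone idea can be made concrete with a small sketch (my own illustration, not Feathers’ code): do all validation and error handling at the boundary, so that code inside the zone only ever sees objects that are valid by construction and stays free of defensive checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Inside the safe zone: an Order is always valid once constructed."""
    symbol: str
    quantity: int


def parse_order(raw: dict) -> Order:
    """Boundary code: all validation and error handling happens here."""
    symbol = raw.get("symbol")
    if not isinstance(symbol, str) or not symbol:
        raise ValueError("order must have a non-empty symbol")
    quantity = raw.get("quantity")
    if not isinstance(quantity, int) or quantity <= 0:
        raise ValueError("order quantity must be a positive integer")
    return Order(symbol=symbol, quantity=quantity)


def total_cost(order: Order, price: float) -> float:
    """Core code: no defensive checks needed; inputs are already valid."""
    return order.quantity * price
```

The payoff is that functions like `total_cost` in the core can stay simple and obviously correct, while the messy, error-prone handling of raw input is concentrated in one place at the edge.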

Michael Feathers aside, most of the speakers underestimated their audiences – at least I hope that they did – and spoke down, spoon feeding the newbies in the audience. It made for dull stuff much of the time – as earnest, or entertaining as the speaker might be, there wasn’t much to chew on. There could have been much more to learn with so many smart people there, and I wasn’t the only one looking for more meat, less bun. The conference wasn’t expensive, it was well managed, but it didn’t offer an effective forum to dig deep, to find new ways to build better software, or software better. For me, at least, there wasn't much ado.

Tuesday, December 8, 2009

Reliability and the Risks of Using Enterprise Middleware

If you are building systems with high requirements for performance and reliability, it is important to be careful and selective, of course, but even more important to be sparing in your use of general-purpose middleware solutions to solve your technical problems.

There are strong, obvious arguments in favor of using proven middleware solutions, whether commercial off the shelf software (COTS) or open source solutions - arguments that are based on time-to-market, risk mitigation, and cost leveraging:

Time-to-market
In most cases, it will take much less time to evaluate, acquire, install, configure and understand a commercial product or open source solution than to build your own plumbing. This is especially important early in the project when your focus should be on understanding and solving important business problems, delivering value early, getting something working in the customer’s hands as soon as possible for feedback and validation.

Risk mitigation
Somebody has already gone down this path, taken the time to understand a complex technical problem space, made some mistakes and learned from them. The results are in front of you. You can take advantage of what they have already learned, and focus on solving your customer’s business problems, rather than risking falling into a technical black hole.

Of course you take on a different set of risks: whether the solution is of high quality, whether you will get adequate support (from the vendor or the community), whether you are buying into a dead end.

Cost leverage
For open source solutions, the cost argument is obvious: you can take advantage of the time and knowledge invested by the community for close to nothing.

In the case of enterprise middleware, companies like Oracle and IBM have spent an awful lot of money hiring smart people, or buying companies that were created by smart people, invested millions of dollars into R&D and millions more into their support infrastructures. You get to take advantage of all of this through comparatively modest license and support fees.

The do-it-yourself, not-invented-here arguments for building instead of buying are essentially that your company is so different, your needs are unique: that most of the money and time invested by Oracle and IBM, or the code built up by an open source community, does not apply to your situation, that you need something that nobody else has anticipated, nobody else has built.

I can safely say that this is almost always bullshit: naïve arguments put forward by people who might be smart, but are too intellectually lazy or inexperienced to properly understand and frame the problem, to bother to look at the choice of solutions available, to appreciate the risks and costs involved in taking a proprietary path. But, when you are pushing the limits in performance and reliability, it may actually be true.

A fascinating study on software complexity by NASA’s Office of the Chief Engineer Technical Excellence Program examines a number of factors that contribute to complexity and risk in high reliability / safety critical software systems (in this case flight systems), and success factors in delivery of these systems. One of the factors that NASA examined was the risks and benefits of using commercial off the shelf software (COTS) solutions:
Finding:
Commercial off-the-shelf (COTS) software can provide valuable and well-tested functionality, but sometimes comes bundled with additional features that are not needed and cannot easily be separated. Since the unneeded features might interact with the needed features, they must be tested too, creating extra work.

Also, COTS software sometimes embodies assumptions about the operating environment that don’t apply well to [specific] applications. If the assumptions are not apparent or well documented, they will take time to discover. This creates extra work in testing; in some cases, a lot of extra work.

Recommendation:

Make-versus-buy decisions about COTS software should include an analysis of the COTS software to: (a) determine how well the desired components or features can be separated from everything else, and (b) quantify the effect on testing complexity. In that way, projects will have a better basis for make/buy and fewer surprises.

The costs and risks involved with using off the shelf solutions can be much greater than this, especially when working with enterprise middleware solutions. Enterprise solutions offer considerable promise: power and scale, configuration to handle different environments, extensive management capabilities, interface plug-and-play… all backed up by deep support capabilities. But you must factor in the costs and complexities of properly setting up and working with these products, and the costs and complexities in understanding the software and its limits: how much time and money must you invest in a technology before you know if it is a good fit, if it fulfills its promise?

Let’s use the example of an enterprise middleware database management system: Oracle’s Real Application Clusters (RAC), a maximum-availability database cluster solution.

Disclaimer: I am not an Oracle DBA, I am not going to argue fine technical details here. I chose RAC because of recent and extensive experience working with this product, because it is representative of the problems that teams can have working with enterprise middleware. I could have chosen other technologies from other projects, say Weblogic Suite or Websphere Application Server and so on, but I didn’t.

The promise of RAC is to solve many of the problems of managing data and ensuring reliability in high-volume, high-availability distributed systems. RAC shares and manages data across multiple servers, masks failures and provides instant failover in an active-active cluster, and allows you to scale the system horizontally, adding more servers to the cluster as needed to handle increasing demands. RAC is a powerful data management solution, involving many software layers, including clustering and storage management and data management and operations management, designed to solve a set of complex problems.

In particular, one of these technical problems is maintaining cache fusion across the cluster: fusing the in-memory data on each server together into a global, cluster-wide cache so that each server node in the cluster can access information locally as it changes on any other node.

As you would expect, there are limits to the speed and scaling of cluster-wide cache fusion, especially at high transaction rates. And this power and complexity comes with costs. You need to invest both in infrastructure, in a highly reliable and performant network interconnect fabric and shared storage subsystem, and in making fundamental application changes, to carefully and consistently partition data within the database and carefully design your indexes in order to minimize the overhead costs of maintaining global cache state consistency. As the number of server nodes in the cluster increases (for scaling purposes or for higher availability), the overhead costs and the costs involved in managing this overhead increase.
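The partitioning idea generalizes beyond RAC. As a toy sketch (my own illustration of the principle, not Oracle’s mechanism), a routing layer can deterministically send all transactions for a given account to the same node, so that account’s data stays hot in only one node’s cache instead of being shipped back and forth across the interconnect:

```python
import hashlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Route a key to a node deterministically, so all activity for one
    account lands on one node and its blocks stay in one local cache."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes
```

Real cluster databases offer far more sophisticated service- and partition-based routing, but the goal is the same: minimize the number of nodes that touch any given piece of data, and you minimize the global cache coordination overhead.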

RAC is difficult to set up, tune and manage in production conditions: this is to be expected – the software does a lot for you. But it is especially difficult to set up, tune and manage effectively in high-volume environments with low tolerance for variability and latency, where predictable performance under sustained load, and predictable behavior in failure situations, is required. It requires a significant investment in time to understand the trade-offs in setup and operations of RAC, to balance reliability and integrity factors against performance; choosing between automated and manual management options, testing and measuring system behavior, setting up and testing failover scenarios, carefully managing and monitoring system operations. To do all of this will require you to invest in setting up and maintaining test and certification labs, in training for your operations staff and DBAs, in expert consulting and additional support from Oracle.

To effectively work with enterprise technology like this, at or close to the limits of its design capabilities, you need to understand it in depth: this understanding comes from months of testing and tuning your system, working through support issues and fixing problems in the software, modifying your application and re-testing. The result is like a race car engine: highly optimized and efficient, running hot and fast, highly sensitive to change. Upgrades to your application or to the Oracle software must be reviewed carefully and extensively tested, including planning and testing rollback scenarios: you must be prepared to manage the very real risk that a software upgrade can affect the behavior of the database engine or cluster manager or operations manager or other layers, impacting the reliability or performance of the system.

Clearly one of the major risks of working with enterprise software is that it is difficult, if not impossible, to learn enough about the costs and limits of this technology early enough in the project – especially if you are pushing these limits. Hiring experienced specialists, bringing in expert consultants, investing in training, testing in the lab: all of this might not be enough. While you can get up and running much faster and cheaper than you would trying to solve so many technical problems yourself from the start, you face the risk that you may not understand the technology well enough, the design points and real limits, how to make the necessary balances and trade-offs – and whether these trade-offs will be acceptable to you or your customers. The danger is that you become over-invested in the solution, that you run out of time or resources to explore alternatives, that you give yourself no choice.

You are making a big bet when working with enterprise products. The alternative is to avoid making big bets, avoid having to find big solutions to big problems. Break your problems down, and find narrow, specific answers to these smaller, well-bounded problems. Look for lightweight, single-purpose solutions, and design the simplest possible solution to the problem if you have to build it yourself. Spread the risks out, attack your problems iteratively and incrementally.

In order to do this you need to understand the problem well – but whether you break the problem down or try to solve it with an enterprise product, you can’t avoid the need to understand the problem. Look (carefully) at the options available, at open source and commercial products, look for the smallest, simplest approach that fits. Don’t over-specify, or design, yourself into a corner. Don’t force yourself to over-commit. And think twice, or three or four times, before looking at an enterprise solution as the answer.