Archive for the ‘reliability’ Category

How to write a bug

Tuesday, August 21st, 2007

In order to break my slump I thought I’d give a quick description of a project I recently worked on.  I was helping set up a demo of an application that had been moved onto the fabric.  The app was written in C and we didn’t change any of its code.  We ended up running it by shelling out from Java to make it easy for the developers who didn’t know C.  The person asking for the demo wanted to see our reliability feature but didn’t want to see the usual hardware demo we give, which involves pulling a power cord or network cable.  So I had to invent a software bug that would illustrate how an application can be made reliable by running it on top of Appistry’s software.

So to make a short story long, here’s what my first try looked like:

Random generator = new Random(System.currentTimeMillis());
if( (generator.nextInt() % 4 ) > 0 )
{
   doSomething();
}

The basic idea was that by generating a random integer in the code running on the fabric, I could use the modulo operator to inject a bug about 25% of the time.  Because it’s random, it would be unpredictable (the best kind of bug).  The modulo operator would give me values from 0 to 3, so if I got a 0 I would run the bug:

else
{
   throw new Exception();
}

The error case just throws an Exception because the fabric will attempt to retry the job if an unchecked exception is thrown.  We tested out our new bug and it didn’t work.  It was running the error case way too much.  So basically I had a bug in my bug.  After reading the javadoc on the Random.nextInt() method I figured out that it will generate negative numbers!  In Java the result of the modulo operator takes the sign of the left operand, so I was really getting values from -3 to 3, and everything from -3 to 0 landed in the error case.  So I added the following fix:

if( Math.abs(generator.nextInt() % 4 ) > 0 )
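
To see the difference between the two conditions, here is a minimal sketch that counts how often each one falls into the error case.  The class and method names (ModuloBugDemo, failureRateWithoutAbs, failureRateWithAbs) and the fixed seed are made up for illustration; they are not part of the demo code.

```java
import java.util.Random;

public class ModuloBugDemo {
    // First attempt: success only when the remainder is 1, 2 or 3.
    // Negative remainders (-1, -2, -3) and 0 all count as failures.
    static double failureRateWithoutAbs(Random generator, int trials) {
        int failures = 0;
        for (int i = 0; i < trials; i++) {
            if (!((generator.nextInt() % 4) > 0)) {
                failures++;
            }
        }
        return (double) failures / trials;
    }

    // Fixed version: Math.abs folds -3..-1 onto 1..3, so only 0 fails.
    static double failureRateWithAbs(Random generator, int trials) {
        int failures = 0;
        for (int i = 0; i < trials; i++) {
            if (!(Math.abs(generator.nextInt() % 4) > 0)) {
                failures++;
            }
        }
        return (double) failures / trials;
    }

    public static void main(String[] args) {
        // Fixed seed (an arbitrary choice) so repeated runs are comparable.
        Random generator = new Random(42L);
        int trials = 1_000_000;
        System.out.printf("without Math.abs: %.3f%n", failureRateWithoutAbs(generator, trials));
        System.out.printf("with Math.abs:    %.3f%n", failureRateWithAbs(generator, trials));
    }
}
```

The first rate should come out near 5/8 (half of all values are negative and always fail, plus a quarter of the non-negative ones) and the second near 1/4.  Another option would be generator.nextInt(4), which returns a value from 0 to 3 directly and avoids the sign question altogether.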

Now I get my 25% failure rate like we wanted.  We show the client running and submitting a handful of requests to the fabric.  Then, in the log on the fabric, we show in real time that the failure randomly occurs.  But when it does, the job is automatically resubmitted on the fabric and the answer is sent back to the client.  The client doesn’t have to be concerned with retry logic and only gets its correct answer.
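
For anyone who wants to picture the behavior without a fabric handy, here is a rough stand-in.  This is not Appistry’s API: runWithRetry, flakyJob and the attempt limit are hypothetical names and values invented for this sketch, and the real fabric resubmits jobs on its own without any loop in the client.  The sketch also throws a RuntimeException so that the injected failure is unchecked.

```java
import java.util.Random;
import java.util.function.Supplier;

public class RetryDemo {
    // Hypothetical stand-in for the fabric: rerun the job whenever it
    // throws an unchecked exception, up to maxAttempts times.
    static <T> T runWithRetry(Supplier<T> job, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.get();
            } catch (RuntimeException e) {
                last = e;
                System.out.println("attempt " + attempt + " failed, resubmitting");
            }
        }
        throw new IllegalStateException("job failed after " + maxAttempts + " attempts", last);
    }

    // The injected bug from the post: fail when the remainder is 0.
    static Supplier<String> flakyJob(Random generator) {
        return () -> {
            if (Math.abs(generator.nextInt() % 4) > 0) {
                return "answer";
            }
            throw new RuntimeException("injected bug");
        };
    }

    public static void main(String[] args) {
        Random generator = new Random(System.currentTimeMillis());
        // The "client" only ever sees the answer, never the failures.
        for (int request = 1; request <= 5; request++) {
            String result = runWithRetry(flakyJob(generator), 10);
            System.out.println("request " + request + " -> " + result);
        }
    }
}
```

The point of the demo is that the loop in runWithRetry lives in the fabric, not in the application or the client.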

I thought a cool web demo would be to use Ajax and set up a button for people to cause a software bug to happen while the results are coming back to the browser in real time.  Maybe I’ll write that in my "spare" time…

-jasen

That depends on what the definition of Grid is

Friday, May 25th, 2007

I have been doing some reading and a few weeks ago I came across an article by Corey Klaasmeyer on JavaWorld’s website.  He discusses the original definition of a "grid" along with some of the history.  For example, he quotes Ian Foster and Carl Kesselman from The Grid: Blueprint for a New Computing Infrastructure as saying, "A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."

By this definition Appistry’s EAF qualifies.  But then Corey goes on to discuss the evolution of the definition and how Ian Foster has defined three requirements for being considered a grid.  According to Ian a grid:

  1. Coordinates resources that are not subject to centralized control;
  2. Uses standard, open, general-purpose protocols and interfaces; and
  3. Delivers nontrivial qualities of service.

So Appistry again easily meets 1 and 3 above.  However, 2 is debatable.  We have our own XML-based meta-data syntax that we use to coordinate task execution.  We have discussed exposing this as WSDL and BPEL, which would be much more verbose, but then at least it would be "standard".  We are based on good old TCP/IP and UDP/IP, so that sounds like general-purpose protocols to me.  So basically, if I provided a set of XSLT transforms from our succinct meta-data syntax to WSDL and/or BPEL, would that then make us a true grid by Ian Foster’s definition?

Corey and I met for lunch today and discussed this idea and other things over awesome food at Illegal Pete’s in downtown Denver.  Ultimately I think we both decided that Appistry isn’t the purest form of a grid, but that we are compatible with it.  We also discussed the industry’s move toward using the term "grid" in a larger way in the marketplace.  Fuzzy’s blog discusses this concept and refers to others who attempt to define the classes of grids that exist in the marketplace today.

We don’t like to call our product an application server or a cluster because those terms have so much baggage.  If we just call it a grid then it’s not clear how we are different from the efforts of the original fathers of grid computing.  So we’ve settled on the term "virtualized grid".  It would be cool to hold a vote to see how developers and architects describe software in this space.  Maybe Slashdot could do a poll.  I would vote for Cowboy Neal Grid.

-j

Reliability and how to upgrade a demo during a trade show

Friday, May 4th, 2007

I was in Chicago this week at a trade show.  We had our cool demo running on a table for people to see and play with.  It’s a stack of 5 AOpen computers hooked into a Linksys router.  I run Tomcat on my laptop and then run a couple of apps that I got from other tech reps.  One is what we call the "Dependability Demo" and it’s a Java Swing app that sends a bunch of transactions into the fabric that is running on the boxes.  I’m currently using Suse 10.2 Linux for the OS on the boxes.  We always ask the people watching to come up and unplug an ethernet cable or a power cord from any of the boxes.  Then the Swing app display shows that some transactions are hung in various states.  As we are explaining what happened, the fabric figures out what just happened and resubmits the tasks that did not complete to different workers with the same state that existed before the cord was unplugged.  The transactions complete and the rest of the app just keeps working.  People really like that; they get the concept when they see the software do its job.  Then they also know we aren’t vaporware.

While I was at the show, we decided to change our ethernet cables.  On the first day we had 6-inch cables that were just too short to unplug easily because they were stretched so tight between the router and the machines.  So I had someone pick me up some 3-foot cables and bring them to the booth.  Then with the demo still running, I just started unplugging things and replacing cables.  Halfway through I remarked that someone should be videotaping it because I was taking down the entire network and taking machines offline without caring one bit about the application that was running.  The demo was happily completing all the tasks.  It was a good example of how IT management could change if someone employed an Appistry fabric on their mission-critical apps.  How many times have you heard the story of someone unplugging the wrong ethernet cable in the server room and taking down the production app?  With Appistry it wouldn’t even be worth talking about anymore.  Imagine how much money is spent to avoid that scenario.  What if it just didn’t matter anymore?  I wonder how many middleware solutions would survive a live network cable swap?

I’ll be at JavaOne next week.  Come over to our booth and I’ll let you unplug some cables.

-jasen