Wednesday, April 9, 2008

Off the [Bubble]Mark

James Ward has been looking into the BubbleMark application recently, and we were talking about some of the strange results we saw in different situations. For example, I was able to take the existing application, which gets something like 65 frames per second in IE7 on my Vista laptop, recast it as an AIR application, and get about double that frame rate, or 130 fps. James posted some variations of the application on his blog; check them out to see what I mean.

(Some background, for those that didn't avail themselves of the awesome opportunity of clicking on the recent link to the site: BubbleMark is an application that moves a number of 'balls' around in a window, and bounces them off one another while doing so, and then measures how many times this critical and awe-inspiring process happens per second).

What's going on? Is that gremlin in my keyboard back to haunt me? Or have I been staring at the screen too long and I'm mis-reading the numbers? Or maybe age is creeping up on me and I'm just getting things wrong more often, as a prelude to slipping on the ice and breaking my hip.

Circular Reasoning

The fundamental problem with the current test is that it is limited by the resolution (the time between events) of the timing mechanism used by the application. In this case, the timing mechanism is a Flash Timer object that is scheduled to call our code at frequent intervals (supposedly far more frequent than the frame rate we're seeing). So while the application purports to test the number of frames that it can render per second, it actually tests the number of times per second that the Timer sends out events. In the situation where Flex can render the balls much faster than the Timer fires, this means that the frames-per-second (fps) reported has very little to do with the speed at which the balls are being moved or rendered.

By producing an alternate test that runs on AIR (which takes Flex out of the browser and runs Flex applications in a standalone window on the desktop), we remove a constraint where the Timer resolution tends to be more limited in the browser than in standalone applications. Since the Timer can call our ball-moving routine more often in this AIR version, and since we were taking less than a frame's worth of time to move the balls before (although that wasn't reflected in the fps result), we get a higher effective frame rate and better results.

So the AIR version is closer to the ideal, where we're actually measuring more of what the test was presumably intended for; moving and rendering the balls.
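To make the Timer-bound effect concrete, here's a toy model in Python (not the BubbleMark code, and not Flex). It assumes one frame per Timer callback and that a callback can't start until the previous one has finished. The function name and all of the millisecond values are made up for illustration; they are not measurements of any real runtime.

```python
def reported_fps(timer_interval_ms: float, work_ms: float) -> float:
    """Frames per second that a timer-driven loop would report.

    Simplified model: one frame per timer callback, and the next callback
    cannot start before the previous one has finished its work.
    """
    frame_period_ms = max(timer_interval_ms, work_ms)
    return 1000.0 / frame_period_ms


# Hypothetical numbers, chosen only to illustrate the effect.
WORK_MS = 2.0               # time to move and draw the balls once
BROWSER_TIMER_MS = 16.0     # a coarse timer, standing in for the browser case
STANDALONE_TIMER_MS = 8.0   # a finer timer, standing in for the standalone case

browser = reported_fps(BROWSER_TIMER_MS, WORK_MS)
standalone = reported_fps(STANDALONE_TIMER_MS, WORK_MS)
faster_code = reported_fps(BROWSER_TIMER_MS, WORK_MS / 2)

print(f"browser-like timer:    {browser:.1f} fps")
print(f"standalone-like timer: {standalone:.1f} fps")
print(f"browser, work halved:  {faster_code:.1f} fps")
```

In this model, halving the timer interval doubles the reported fps even though the ball code is untouched, while halving the work per frame changes nothing as long as the work still fits inside one timer interval.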

But there's still a problem here, which has been implied by the wide variation in the results. The test is confusing three orthogonal issues and trying to combine them into one result (fps). The three things actually being tested are:

  • Calculation speed: How fast can the runtime position the balls, including moving them, calculating collisions, and offsetting the collided balls? This comes down to VM and math performance, since it's just raw computations.
  • Rendering performance: There is a certain amount of time spent each frame actually drawing the balls on the screen. That seems to be insignificant in our case, compared to the other factors here, but you would think that a graphics benchmark might actually care about this item the most. How fast can we draw the balls? I don't know, because we're spending most of our time just waiting around for Timer events and calculating the balls' positions. That's like timing how long it takes you to get dressed in the morning, but starting the timer the previous night when you go to sleep.
  • Timer resolution: This is the primary issue called out above. I don't believe it was ever the intention for the benchmark to measure the rate at which the various runtimes could schedule callbacks. Because, well, that'd be a pretty dull graphics benchmark. But that's exactly what's happening here; we're not measuring graphics performance at all in the default test.

To add to the problem of these three factors getting each other's chocolate stuck in the others' peanut butter, you've got the problem that all of the different platforms (Flex, Java, .net, etc.) probably rely on these three factors to different degrees (are they faster at rendering but slower at computation? Do they have more limited Timer resolution? How do browsers affect these factors on each platform? Or hardware? Or operating systems?)

I think that if you wanted real results that you could use and compare from this benchmark, you'd have to de-incestuate the factors and measure them all separately. I could envision three completely separate tests coming out of this:

  • Calculation speed: Let's have a non-graphical test that simply moves/collides the balls as fast as possible and reports how many times it was able to do this per second. No need to render here; we only care about how fast we can do the math and related setting of properties.
  • Rendering performance: Render a given number of balls as fast as you can and report how often you can do it per second. In order to not run up against the Timer resolution issue, you'd have to adjust the number of balls so that you could be assured that you would be spending very little time waiting around between callback events. In fact, you could even try to render far more than could be done in a single frame, but then report a number like balls/second instead of frames/second.
  • Timer resolution: Write a test that simply issues callbacks from the Timer mechanism, without doing any work in those callbacks, to see how many times this happened per second. (I wrote this test and posted a blog about it; see that post for more information on this no-op timer test).
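As a rough idea of what the first of these (the calculation-only test) could look like, here's a minimal sketch in Python. It is not the BubbleMark code: the `Ball` class, the arena size, the ball radius, and the crude swap-velocities collision response are all assumptions made up for this example. The point is only that nothing gets drawn and no timer is involved; the loop runs flat out and reports steps per second.

```python
import random
import time

WIDTH, HEIGHT, RADIUS = 500.0, 300.0, 10.0  # hypothetical arena and ball size


class Ball:
    """Position and velocity of one ball (illustrative, not BubbleMark's type)."""

    def __init__(self, x: float, y: float, vx: float, vy: float) -> None:
        self.x, self.y, self.vx, self.vy = x, y, vx, vy


def make_balls(count: int, seed: int = 1) -> list:
    """Create `count` balls at reproducible pseudo-random positions."""
    rng = random.Random(seed)
    return [
        Ball(rng.uniform(RADIUS, WIDTH - RADIUS),
             rng.uniform(RADIUS, HEIGHT - RADIUS),
             rng.uniform(-3.0, 3.0),
             rng.uniform(-3.0, 3.0))
        for _ in range(count)
    ]


def step(balls: list) -> None:
    """Advance one 'frame': move, bounce off the walls, handle collisions."""
    for b in balls:
        b.x += b.vx
        b.y += b.vy
        if b.x < RADIUS or b.x > WIDTH - RADIUS:
            b.vx = -b.vx
            b.x = min(max(b.x, RADIUS), WIDTH - RADIUS)
        if b.y < RADIUS or b.y > HEIGHT - RADIUS:
            b.vy = -b.vy
            b.y = min(max(b.y, RADIUS), HEIGHT - RADIUS)
    for i, a in enumerate(balls):
        for b in balls[i + 1:]:
            dx, dy = a.x - b.x, a.y - b.y
            if dx * dx + dy * dy < (2 * RADIUS) ** 2:
                # Crude response for the sketch: exchange velocities.
                a.vx, b.vx = b.vx, a.vx
                a.vy, b.vy = b.vy, a.vy


def steps_per_second(ball_count: int, duration_s: float = 0.2) -> float:
    """Run step() as fast as possible for duration_s; no rendering, no timer."""
    balls = make_balls(ball_count)
    steps = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        step(balls)
        steps += 1
    return steps / (time.perf_counter() - start)


print(f"{steps_per_second(16):.0f} steps/second with 16 balls")
```

The number it prints depends entirely on the machine and the runtime, which is the idea: it says something about math and property-setting speed and nothing about drawing or timer scheduling.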

With results from these three tests, it would be easier to understand where a platform was being bottlenecked: at the runtime, rendering, or resolution level.
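For the resolution side of that comparison, a no-op callback test can be very small. The sketch below is not the Flash Timer test mentioned above; it uses Python's `asyncio` event loop as a stand-in scheduler, and the function name and the requested rate are assumptions for illustration. The callback does no work beyond counting itself and rescheduling.

```python
import asyncio
import time


async def callbacks_per_second(interval_s: float, duration_s: float = 0.3) -> float:
    """Count how often a do-nothing callback fires when rescheduled every interval_s."""
    loop = asyncio.get_running_loop()
    finished = loop.create_future()
    count = 0
    start = time.perf_counter()

    def tick() -> None:
        nonlocal count
        count += 1  # no real work: only the scheduler is being measured
        if time.perf_counter() - start >= duration_s:
            if not finished.done():
                finished.set_result(None)
        else:
            loop.call_later(interval_s, tick)

    loop.call_later(interval_s, tick)
    await finished
    return count / (time.perf_counter() - start)


requested_hz = 1000.0  # hypothetical request: one callback per millisecond
achieved_hz = asyncio.run(callbacks_per_second(1.0 / requested_hz))
print(f"requested {requested_hz:.0f} callbacks/s, achieved {achieved_hz:.0f}/s")
```

The achieved rate can differ from the requested one, and by how much depends on the platform. Whatever rate comes out is the ceiling that a timer-driven fps number is measured against, no matter how fast the calculation and rendering are.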

At this point, I'd love to say, "...and here's that benchmark!" But I'm not going to, because I've already spent a pile of time just trying to understand the weird results I was seeing from BubbleMark. If someone else wants to take a whack at it, that'd be great. But at the very least, I hope this helps you understand and filter the results you see from the BubbleMark tests a little better.

7 comments:

Anonymous said...

I would certainly be interested in a calculation test along the lines of this benchmark, for Flash vs a Flash-based AIR vs Silverlight versus WPF vs JavaScript vs Java.

I may get around to throwing together an example myself, when I get a chance. Unfortunately, with an incredibly busy and booked-up schedule, it might be a month or two until I get a chance to put together such a test for all these various technologies.

David Moles said...

Don't forget JavaFX. :)

The competing BubbleMark numbers being thrown around at Javapolis last year were really pretty amusing. I don't have a horse in this race, but I'll be interested to see how RIA benchmarking shakes out.

coffeejolts said...

These are apples to oranges benchmarks, which require an additional runtime installed. How about saving the swf to the desktop, then launching it locally- without the AIR runtime installed?

Anonymous said...

I decided to finally post this in response to your blog post. http://www.craftymind.com/2008/04/11/why-bubblemark-is-a-poor-ui-benchmark/
Hopefully a better alternative is available to Bubblemark.

Chet Haase said...

coffeejolts: Yep, playing the SWF file in the standalone player is another good way to show the non-browser results. In fact, that's how I was doing it while chasing down the issues. I just packaged it up as an AIR app to make it more of a one-click thing to run, rather than leaving instructions on saving the SWF and launching it.

Chet.

Chet Haase said...

Sean,

Nice post.

I hope there's a better alternative eventually, too, but in the meanwhile maybe it will help to at least explain to people why BubbleMark scores are pretty meaningless for comparing the platforms...

Chet.

Chet Haase said...

coffeejolts,

By the way, comparing AIR to browser-Flash isn't apples-to-oranges (apart from the timing constraint issues that I mentioned). It's the same Flash runtime in both; it's just that AIR wraps it with a different container to allow it to play as a standalone application. Thus playing a SWF in the standalone player should be quite similar to playing it in an AIR application.

Chet.