Parsing very small XML: beware of overheads

I recently needed to parse some very tiny chunks of XML, for a toy project related to OpenStreetMap data.

The documents ranged from:

  • just one node with 3 or 4 small attributes, to
  • 10-20 nodes with a total of 50-80 attributes

My first reaction was to just use the DOM API. Performance was terrible. I checked the usual suspects: the DocumentBuilderFactory was correctly kept around, and the DocumentBuilder itself was correctly reused between XML documents.
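For reference, the DOM setup looked roughly like this (a minimal sketch; the class and variable names are mine, not taken from the original code):

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import java.io.StringReader;

    public class DomFragmentParser {
        // Factory and builder are created once and kept, as recommended
        private static final DocumentBuilderFactory FACTORY = DocumentBuilderFactory.newInstance();
        private final DocumentBuilder builder;

        public DomFragmentParser() throws Exception {
            builder = FACTORY.newDocumentBuilder();
        }

        public Document parse(String xml) throws Exception {
            builder.reset(); // allows reusing the same builder between documents
            return builder.parse(new InputSource(new StringReader(xml)));
        }
    }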

Now, it is common knowledge that “SAX parsing is faster than DOM”. That is true to some extent. DOM parsing definitely can’t handle very large XML documents, since everything is instantiated at once, but building the DOM itself is often quite quick compared to the XML parsing proper.
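Here is, roughly, the SAX version I tried next (again a minimal sketch; the handler only reads attributes, which is all these tiny fragments contain):

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.StringReader;

    public class SaxFragmentParser {
        private final SAXParser parser;

        public SaxFragmentParser() throws Exception {
            parser = SAXParserFactory.newInstance().newSAXParser(); // created once, reused
        }

        public void parse(String xml) throws Exception {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    // pick up the few attributes we care about here
                }
            });
        }
    }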

So, switching to SAX… no real improvement. Let’s use the simplest performance monitoring tool available! A thread dump, taken with kill -3:

  java.lang.Thread.State: RUNNABLE
     at java.lang.Throwable.fillInStackTrace(Native Method)
     - locked <0x00000007ad82ba98> (a
     at java.lang.Throwable.<init>(
     at java.lang.Exception.<init>(
     at java.lang.RuntimeException.<init>(
     at SAXParsersTest.timeOne(

And basically, here is our answer. On very small XML documents, a huge part of the time is actually spent in the startup of the parser, not in the parsing itself. Each time you call parse, it performs some entity management initialization which, somewhere along the way, throws an exception (probably because a property is not present). Exception throwing is not really optimized by the JIT, so you pay a huge overhead on each call.
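To get a feel for that overhead, here is a small, self-contained illustration (not the parser’s actual code) of what one throw/catch per call costs. The dominant part is Throwable.fillInStackTrace(), which is exactly where the thread dump above was caught; absolute numbers will vary a lot with the JVM and stack depth:

    public class ExceptionCost {
        public static void main(String[] args) {
            int n = 100000;
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                try {
                    // fillInStackTrace() runs here, capturing the whole call stack
                    throw new RuntimeException("property not found");
                } catch (RuntimeException e) {
                    // swallowed, just like the parser does internally
                }
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("%d throw/catch cycles in %.1f ms%n", n, elapsed / 1e6);
        }
    }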

Fortunately, JAXP allows you to easily change the implementation. We can, for example, try switching from the default one (Xerces) to the Piccolo implementation. Piccolo publishes benchmarks that show up to a 2x gain on “small” XML (500 bytes).
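Switching only takes setting the standard JAXP factory property before the first factory lookup. The Piccolo factory class name below is the one from its documentation; double-check it against the version you actually use:

    // Select Piccolo through the standard JAXP lookup mechanism
    // (do this before the first call to SAXParserFactory.newInstance()).
    System.setProperty("javax.xml.parsers.SAXParserFactory",
                       "com.bluecast.xml.JAXPSAXParserFactory");
    SAXParserFactory factory = SAXParserFactory.newInstance(); // now resolves to Piccolo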

What happens on very tiny XML?

  • Times are in milliseconds for 100,000 loops
  • All loops were run once beforehand to warm up the JIT and eliminate its effects (the timing loop is sketched below)
  • I’ve added the old Crimson parser for comparison
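The timing loop itself is nothing fancy; it roughly has this shape (my reconstruction using the SaxFragmentParser sketch from above, not the exact test code):

    // Runs the loop once untimed to warm up the JIT, then measures 100,000 iterations.
    static long timeOne(SaxFragmentParser parser, String xml, int loops) throws Exception {
        for (int i = 0; i < loops; i++) {
            parser.parse(xml);            // warm-up pass, not timed
        }
        long start = System.currentTimeMillis();
        for (int i = 0; i < loops; i++) {
            parser.parse(xml);            // timed pass
        }
        return System.currentTimeMillis() - start;
    }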


                                Piccolo   Xerces   factor   Crimson
 "empty"  (1 node, 1 attr)           80     1153     14.4       371
 tiny     (a few nodes)              81     1161     14.3       368
 lessTiny (a few dozen)             490     1654      3.4       958
 big      (hundreds)               4015     4638      1.2      5590
 veryBig  (several thousand)      39200    36834      0.9     52240

Yup: for parsing large numbers of extremely small XML fragments with only a few nodes, Piccolo is 14 times faster than the default implementation! All told, that led to a 5-10x speedup of my OSM parsing job!
When we get to very big XML, both implementations are almost head-to-head.

So, I would probably not recommend switching your default implementation, as Xerces is probably more mature and may support more exotic features. But if you have lots of very small XML documents to parse, then definitely go for Piccolo! And beware of startup times and overheads in your code, not only of raw sustained throughput.

Because performance does matter…

Software performance is a very complex subject. There are a huge number of factors that play a role in “how fast my software will be”.

Software developers must always keep performance in mind, together with maintainability, extensibility and pace of development. On the other hand, it is very easy to become obsessed with it. Trying to optimize too early is one of the most common mistakes. Even when you do need to optimize, another risk is to go for micro-optimizations, like shaving a few instructions off a tight loop, without thinking about the big picture or without knowing exactly where the problem lies.

“Performance” is also very vague by itself. Are we talking about execution speed, memory consumption, disk space…? Depending on your program, the focus can be very different.

I initially wanted to create a blog dedicated to this topic, but I decided that a category on this website would be a simpler addition. It will focus on how to analyze the performance of your software, how to know where and when you should optimize, and on some optimization techniques. The articles will not follow a strictly logical path, but hopefully they will come together into a structured how-to. The examples and data will mostly be about Java and C/C++.

We’ll start with some general insights on the optimization process, then with a series about object pooling and reuse, and memory allocation.

Tracking the edits on OpenStreetMap in real time

What is even better than analyzing and visualizing interesting data? Doing that on open data… from OpenStreetMap… and in real time!

Somewhat inspired by this view of data uploads to OpenWeatherMap, Christian Quest and I created this service.

It displays, in near real time (with data from 2 minutes ago), all edits made to OpenStreetMap, with live activity graphs and zooms on the areas being edited.

It is very exciting for an existing contributor to see their work displayed and broadcast so quickly. But above all, this is intended as a communication tool, a very visual demonstration of the activity and energy of the OSM project. Initial tests showed very positive reactions from sample audiences.

The tool also works well as a very visual eye-catcher for a booth at an exhibition, for example.

We now intend to leverage the backend to extract more insightful data that, beyond the communication aspect, can provide real services to OSM contributors and users. The main idea is an “interest feed”: you could subscribe to a bounding box, or to some tags in the OSM data, and get notified when changes are made.

For example:

  • Discover who works in your area
  • If you are a transportation operator, for example, it can be interesting to be notified when changes are made to OSM transportation data, in the context of an open data initiative.

Stay tuned for more announcements related to this service!

Data Web Services and a look at Pachube

I recently built an Arduino-based home electricity usage monitoring device, following the guides from the folks at OpenEnergyMonitor.

Being quite a data geek, I obviously wanted to be able to display various historical graphs and stats about the collected data. My device therefore has built-in Ethernet connectivity so it can upload its data to the Internet.

After writing a bit of code myself, I started looking at existing services that could help, which could be called Data Web Services. One really stands out: Pachube. Actually, it looks like it is currently the only service of this kind. Simply put, it is a web service that lets you create data feeds, upload data through a REST-like API, and later retrieve it. Everything is organized around time series. The concept is quite neat, and setting everything up was quite easy.
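As an illustration, pushing one datapoint from Java is little more than an HTTP PUT. The URL scheme, the CSV body format and the X-PachubeApiKey header below are my recollection of the v2 API, so treat them as assumptions and check the official documentation; the feed id and datastream name are made up:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PachubeUpload {
        public static void main(String[] args) throws Exception {
            // Hypothetical feed id, for illustration only.
            URL url = new URL("http://api.pachube.com/v2/feeds/12345.csv");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setRequestProperty("X-PachubeApiKey", "YOUR_API_KEY");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write("power,1234".getBytes("UTF-8")); // one "datastream,value" line
            out.close();
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }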

We want to analyse

Versatile storage and retrieval of data are obviously the basis of a Data Web Service. However, I think it should provide more than just “store and retrieve” features.

Historical graphing first comes to mind. Users should be able to navigate through their data without limit and without excessive smoothing, even for data from previous months.

But everything becomes more interesting when you start doing more:

  • Computations: how much did I spend on electricity last week? What was the average temperature last month?
  • Comparison graphs

A platform or an end-user solution?

Pachube is really a data platform. It does not know what kind of data it handles, or what the user wants to do with it. Such a robust platform is obviously required, but I think users want more. They want solutions, they want packaged applications, and this is where the real added value lies.

If you want to build a real home energy solution based on Pachube (and not just a history viewer), you have to build an application that uses the service as a data backend. And to build this application, you have to do everything yourself (hosting, handling users, and so on).

An ideal Data Web Service should offer a platform for building applications based on its data. It should make it trivial to define user-centric dashboards.

The focus on the Internet Of Things and Open Data

Pachube’s main focus is the Internet of Things (IoT). It is really designed from the ground up to handle data coming from embedded sensors and to allow actions based on it, as put by their tagline: “Manage real-time data from sensors, devices, and environments”. Pachube also emphasizes the social and open data aspects.

I am a strong believer in the IoT. In a not-so-distant future, many of our devices will be Internet-enabled, and having all of these devices interact and exchange data is very exciting. It also ties into the concept of the Smart Grid, for energy usage monitoring and feedback.

However, not all data is open, nor does it all come from devices. A Data Web Service should probably provide its users with robust and easy storage and analysis of private data. Wouldn’t you want a service that analyzes in detail the evolution of your phone bills, for example?

A Data Web Service should also be able to retrieve large amounts of open data from public repositories and offer it to its users. Two major use cases come to mind:

  • Public open data repositories rarely come with sensible visualization services. Providing generic or specific (through the application building blocks) visualizations of open data can help spread this idea and help citizens gain better insight into the data provided to them.
  • Letting a user compare open data against their own private data can be of tremendous value. For example, Pachube became famous when a host of individuals started publishing radiation counter data following the Fukushima disaster. Comparing all of these readings with official figures is at least as interesting as the raw values from each counter.

What now?

While I was thinking about this, several things happened. First, both Google and Microsoft shut down their services dedicated to home energy management (“Smart Grid”). Also, Pachube was acquired by LogMeIn, a provider of remote access and back-up services, both for personal and business users. While I am a bit puzzled by this acquisition, it definitely means that things are moving for Data Web Services.

There is room for new ideas, new concepts, new services. I am currently toying with the idea of creating a prototype for several of the ideas I have about this exciting kind of service.

[FR] Looking for a photo hosting site

For a long time, I wanted to maintain my own Gallery installation. This do-it-yourself approach offers a number of advantages, the main one being the control you have over your whole site.

However, the drawbacks are numerous as well:

  • Gallery is very powerful, but not necessarily very practical or pleasant to use, for visitors as well as for the administrator.
  • You have to keep up with the fairly frequent security updates, which, as with most web applications, do not necessarily go through (or go through badly) the normal update system of the server’s Linux distribution.
  • No help with search engine visibility.

These constraints, which already pushed me to outsource my software performance blog to WordPress, are now making me do the same for hosting my photographs.

For now, I am still looking for the ideal solution. As you may have seen, I have settled for the moment on a Picasa Web gallery.

So far, I have looked at the following providers:


Picasa Web Albums

A service offered by Google.

Advantages:

  • Excellent integration (obviously) with the Picasa desktop client. A single click and an album is uploaded. It is really perfect, nothing to complain about there.
  • A relatively generous 1 GB of free storage.
  • Cheap additional storage (from €5 per year for 20 GB).

Drawbacks:

  • No way to customize the gallery.
  • Fairly limited features, rather weak slideshow.
  • No customization of the gallery URL.
  • Rather little presence on the Web, in the end.


Flickr

The photo sharing service, run by Yahoo.

Advantages:

  • Ubiquitous: many systems, blogs and others support it natively.
  • Advanced features for image tagging.
  • Very pleasant slideshows.

Drawbacks:

  • Strong restrictions on uploads (100 MB / month), unless you go for the “Pro” version at €25/year.

What next?

I am currently very tempted by SmugMug, a paid site, a bit more expensive than the previous ones (from $40/year), but which seems to offer an impressive range of features, customizable URLs, and no size limits.

Since they offer a 14-day free trial, I think I will give it a try. Do you know it?

Software performance blog

Both in my work and in the various blog posts I read, I often find that people tend not to focus on the right questions when talking about software performance.

This can lead to wasted time, or even to actually making the performance of what you are trying to optimize worse.

Based on this, I started a new blog, called Software Performance. I intend to publish various tips and explanations about my vision of how to optimize your software, and how to do it as efficiently as possible.