I recently needed to parse some very tiny chunks of XML, for a toy project related to OpenStreetMap data.
Data ranged from
- Just one node with 3 or 4 small attributes
- 10-20 nodes with a total of 50-80 attributes
First reaction is to just use the DOM API. Performance was just terrible. The usual suspects were checked: the DocumentBuilderFactory was correctly kept, the DocumentBuilder itself was correctly reused between XML documents.
So, it is common knowledge that “SAX parsing is faster than DOM”. That is true to some extent. DOM parsing definitely can’t handle very large XML as everything is instantiated at once, but the DOM instantiation itself is often quite quick compared to XML parsing.
So, switching to SAX … No real improvement. Let’s use the very first performance monitoring tool available ! Thread dump using kill -3 :
java.lang.Thread.State: RUNNABLE at java.lang.Throwable.fillInStackTrace(Native Method) - locked <0x00000007ad82ba98> (a com.sun.org.apache.xerces.internal.xni.parser.XMLConfigurationException) at java.lang.Throwable.<init>(Throwable.java:213) at java.lang.Exception.<init>(Exception.java:58) at java.lang.RuntimeException.<init>(RuntimeException.java:60) at com.sun.org.apache.xerces.internal.xni.XNIException.<init>(XNIException.java:59) at com.sun.org.apache.xerces.internal.xni.parser.XMLConfigurationException.<init>(XMLConfigurationException.java:78) at com.sun.org.apache.xerces.internal.util.ParserConfigurationSettings.checkProperty(ParserConfigurationSettings.java:284) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.checkProperty(XML11Configuration.java:1398) at com.sun.org.apache.xerces.internal.util.ParserConfigurationSettings.getProperty(ParserConfigurationSettings.java:229) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.getProperty(XML11Configuration.java:924) at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.reset(XMLEntityManager.java:1521) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.resetCommon(XML11Configuration.java:994) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:781) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:748) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1208) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:525) at SAXParsersTest.timeOne(SAXParsersTest.java:173)
And basically, here is our answer. On very small XML documents, a huge part of the time is actually spent in startup time of the parser, not in the parsing itself. Each time you call parse, it does some entity management initialization, which, somewhere throws an exception (probably because a property is not present). Exceptions throwing is not really optimized by the JIT, so you pay a huge overhead each time.
Fortunately, JAXB allows you to easily change the implementation. We can for example try to go from the default one (Xerces) to the Piccolo implementation. Piccolo publishes some benchmarks that show up to 2 times gain when using “small” XML (500 bytes).
What happens on very tiny XML ?
- Times are in milliseconds for 100K loops
- All loops have been preloaded to eliminate any JIT effect
- I’ve added the old Crimson parser to compare
picco xerces factor crimson "empty" (1 node, 1 attr) 80 1153 14.4 371 tiny (a few nodes) 81 1161 14.3 368 lessTiny (a few dozens) 490 1654 3.4 958 big (hundreds) 4015 4638 1.2 5590 veryBig (several K) 39200 36834 0.9 52240
Yup, to parse large number of extremely small XML fragments with only a few nodes, Picco is 14 times faster than the default implementation ! Which, all taken into account, lead to a 5-10 times increase in the speed of my OSM parsing job !
When we get to very big XML, both implementations are almost head-to-head.
So, I would probably not recommend switching your default implementation, as Xerces is probably more mature and might support some more exotic stuff, but if you have some very small XML to parse, then definitely go for Picco ! And beware of startup times and overheads in your code, not only of raw sustained throughput.