JSON vs XML: some hard numbers about verbosity

Introduction

JSON is steadily becoming the data interchange format of the web, and it is even starting to leak outside of this world, replacing XML wherever it can, and there are really good reasons for that.
But people are often driven towards JSON for other reasons, not necessarily bad ones, but based on more anecdotal claims, like the so-called verbosity of XML.
Indeed this is the argument you’ll hear most often; e.g. just have a look at this nice comparison of the two formats: the first con listed is, of course, “verbosity“.

And it’s a factual argument: the size gains can be significant when your values are small, typically when representing business objects like customers, because the markup overhead (all the opening and closing tags) becomes large relative to the carried information (e.g. the names and zip codes of your customers).

But you rarely send big chunks of data in a raw text format like XML or JSON, because nowadays servers and clients (e.g. web browsers) support on-the-fly gzipping of the payloads and use it transparently.
So the size advantage of JSON over XML should shrink, because GZIP knows how to factor out redundant information like markup.
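As an aside, the transparency mentioned above is plain HTTP content negotiation; here is a minimal sketch of a client opting into gzip with the JDK’s HttpURLConnection (the URL is of course a placeholder, and browsers do all of this automatically):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class GzipClient
{
	public static void main(String[] args) throws Exception
	{
		// Placeholder URL: any endpoint serving JSON or XML will do.
		HttpURLConnection connection = (HttpURLConnection) new URL("http://example.com/users").openConnection();

		// Advertise gzip support; most servers will then compress the body.
		connection.setRequestProperty("Accept-Encoding", "gzip");

		InputStream body = connection.getInputStream();

		// Decompress transparently when the server actually gzipped the response.
		if ("gzip".equalsIgnoreCase(connection.getContentEncoding()))
		{
			body = new GZIPInputStream(body);
		}

		try (BufferedReader reader = new BufferedReader(new InputStreamReader(body, "UTF-8")))
		{
			System.out.println(reader.readLine());
		}
	}
}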

This shrinking seems a reasonable speculation, but while intuition is good, hard numbers are better, both to be definitely convinced and to get a quantitative idea of the impact.
So I’ve written a small Java benchmark that I’ll present, along with its results, in this article.

The source code is available in this archive: JSON vs XML Source Code

The model

The benchmark reproduces a very common scenario in web development: serializing a big batch of business data, here a set of two million users.

Here are the different representations of the user entity:
Java:

public class User
{
	private int id;
	private String name;

	public int getId()
	{
		return id;
	}

	public String getName()
	{
		return name;
	}

	public User(int id, String name)
	{
		this.id = id;
		this.name = name;
	}
}

XML:

<user><id>%d</id><name>%s</name></user>

Note that I’ve used a verbose format to clearly illustrate the point; of course “id” and “name” should have been implemented as attributes, but sometimes you have no choice, e.g. when you have to conform to an ill-conceived XML schema.

And JSON:

{id:%d,name:"%s"}

We can already see that the XML template is quite verbose compared to the JSON one. (Note that the keys are left unquoted here, which is fine for JavaScript object literals but not strictly valid JSON; quoting them would add four more bytes per user.)

Data generation

All the data, i.e. the user IDs and names, are randomly generated using some helper methods, to avoid any bias that fixed values could introduce:

private static final Random random = new Random(); // java.util.Random

private static final char[] letters = new char[26];

static
{
	// Fill the alphabet used to build random names.
	for (int i = 0; i < 26; ++i)
	{
		letters[i] = (char) ('a' + i);
	}
}

private static int getId()
{
	// Random ID in [0, 99998], i.e. at most 5 digits.
	return random.nextInt(99999);
}

private static String getName(int length)
{
	char[] chars = new char[length];

	for (int i = 0; i < length; ++i)
	{
		chars[i] = letters[random.nextInt(letters.length)];
	}

	return new String(chars);
}

private static User[] getUsers(int count)
{
	User[] users = new User[count];

	for (int i = 0; i < count; ++i)
	{
		users[i] = new User(getId(), getName(6));
	}

	return users;
}

As the benchmark tries to compare the cost of the formatting overhead of each document format, the sizes of the values are kept small so that they don’t become too prevalent: names are fixed at 6 characters and IDs are capped below 100,000, i.e. at most 5 digits.

Data compression

The zipping process is based on the standard Java GZip implementation and is as simple as this:

private static byte[] zip(String string) throws Exception
{
	// Uses java.io.ByteArrayOutputStream and java.util.zip.GZIPOutputStream.
	ByteArrayOutputStream memory = new ByteArrayOutputStream();

	GZIPOutputStream zip = new GZIPOutputStream(memory);
	zip.write(string.getBytes("UTF-8")); // explicit charset, the default is platform-dependent
	zip.close();

	return memory.toByteArray();
}

The inputs are the text versions of the XML and JSON documents; the outputs are the raw binary representations of the zipped content.
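For completeness, here is a sketch of the symmetric helper, not used by the benchmark itself but handy to check the round trip (assuming the same UTF-8 encoding):

private static String unzip(byte[] bytes) throws Exception
{
	// Uses java.io.ByteArrayInputStream and java.util.zip.GZIPInputStream.
	GZIPInputStream zip = new GZIPInputStream(new ByteArrayInputStream(bytes));

	ByteArrayOutputStream memory = new ByteArrayOutputStream();
	byte[] buffer = new byte[4096];
	int read;

	// Inflate the stream chunk by chunk into memory.
	while ((read = zip.read(buffer)) != -1)
	{
		memory.write(buffer, 0, read);
	}
	zip.close();

	return new String(memory.toByteArray(), "UTF-8");
}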

The benchmark

And here is the benchmark:

  • generate a set of random users
  • generate the XML and JSON representations of this set
  • compare the sizes of the text documents
  • generate the zipped versions of the XML and JSON documents
  • compare the sizes of the zipped documents and the time it took to compress

Note that the benchmark also measures the time necessary to zip the documents because, as you’d guess, the zipping duration depends on the size of the content, and CPU time is an important factor that can’t be ignored.

And the implementation:

public static void main(String[] args) throws Exception
{
	User[] users = getUsers(2000000);

	String xml = getXML(users);
	String json = getJSON(users);

	System.out.println(String.format("xml(%d)/json(%d): %f", xml.length(), json.length(), 1.0 * xml.length()/json.length()));

	long t1 = System.currentTimeMillis();
	byte[] xmlZip = zip(xml);
	long t2 = System.currentTimeMillis();
	byte[] jsonZip = zip(json);
	long t3 = System.currentTimeMillis();

	System.out.println(String.format("xmlDuration(%d)/jsonDuration(%d): %f", t2 - t1, t3 - t2, 1.0 * (t2 - t1) / (t3 - t2)));

	System.out.println(String.format("xmlZip(%d)/jsonZip(%d): %f", xmlZip.length, jsonZip.length, 1.0 * xmlZip.length/jsonZip.length));
}

Not rocket science but it should do the job.
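The getXML and getJSON helpers live in the source archive; if you don’t want to download it, here is a minimal sketch of what they might look like, assuming each document is simply the concatenation of the per-user templates shown above:

private static String getXML(User[] users)
{
	StringBuilder builder = new StringBuilder();

	for (User user : users)
	{
		// Fill the XML template shown above for each user.
		builder.append(String.format("<user><id>%d</id><name>%s</name></user>", user.getId(), user.getName()));
	}

	return builder.toString();
}

private static String getJSON(User[] users)
{
	StringBuilder builder = new StringBuilder();

	for (User user : users)
	{
		// Fill the JSON template shown above for each user.
		builder.append(String.format("{id:%d,name:\"%s\"}", user.getId(), user.getName()));
	}

	return builder.toString();
}

Presizing the StringBuilder would be faster for two million users, but this keeps the sketch close to the templates.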

And the winner is…

Enough suspense, here are the results:

              Text     Gzip     Zip duration
XML           91.78M   18.74M   3.38s
JSON          49.78M   17.09M   2.78s
XML overhead  84.38%   9.62%    21.3%

As expected, XML shows a size overhead in both the text and the zipped versions, but while this overhead is really significant for the text version (84%, almost twice as big), it falls under 10% once gzipped. A quick sanity check on the text figures: the XML template carries 35 bytes of fixed markup per user against 13 for the JSON one, so with 6-character names and mostly 5-digit IDs we expect roughly 46 versus 24 bytes per user, consistent with the measured 91.78M and 49.78M for two million users.

But to obtain this gain in size we had to spend some additional CPU time: gzipping the XML document takes more than 20% longer than gzipping the JSON one.

So depending on your use case this could be completely acceptable or not at all: if your server does not handle many requests and is never overloaded, the 20% additional time is not an issue because it buys a dramatic reduction in size; but if your server is already overloaded, 20% more CPU load could push latency into the red.

Conclusion

As you’ve seen, while the “angle bracket tax” of XML is real, it can be dramatically reduced to an acceptable level, at the cost of some additional processing time.
Keep in mind that, like any benchmark, it’s worth what it’s worth: if the data format is critical in your situation you should carry out your own study, inspired by this one but using your own data and technologies, because your mileage may vary.

In my humble opinion, what makes JSON the natural choice for a lot of applications is not its inherent qualities, real as they are as demonstrated above, but its strong integration into the web ecosystem, because JSON is the native way of representing JavaScript object trees.
And as JavaScript is no longer limited to the client side, with the rise of JavaScript on the server through Node.js, JSON becomes the logical candidate to handle communication between the client and the server: JavaScript talking to JavaScript using some … JavaScript.

Moreover, JSON has been chosen as the data format of some NoSQL databases like MongoDB and CouchDB, one more good reason to use it in order to build a uniform stack.

For a deeper perspective on JSON and XML you can check this great article: A Deep Look at JSON vs. XML



5 thoughts on “JSON vs XML: some hard numbers about verbosity”

  1. It would be more fair to use attributes instead of elements for the XML representation. Besides, XML has the advantage of having more semantic meaning. It starts with a root element that can actually describe the content as a User, whereas the JSON requires the receiver to understand it is parsing a User.

    When used right, XML is a far superior format in my opinion, and if you ran a new test with the use of attributes, I think the difference would be hard to notice. Well, the difference would be that the XML would include the “User” semantics describing the entity, and this costs slightly more in terms of storage.

    Also keep in mind that XML is more readable to most people, otherwise we should maybe start writing HTML pages in JSON 🙂

    • Hi Kenneth,
      indeed there is a bias, which may be significant here because of the small size of the entities; that’s why I mentioned it first.
      So yes, with this limited benchmark the difference with an optimized XML would be small, but my point was to illustrate the overhead for a real schema like:
      XML:
      <company name="ACME">
          <services>
              <service name="HR">
                  <employees>
                      ...
                  </employees>
              </service>
          </services>
      </company>

      JSON:
      {
          "$type": "company",
          name: "ACME",
          services: [{
              name: "HR",
              employees: [{
                  ...
              }]
          }]
      }

      I agree for semantics, but JSON serializers often support some metadata that brings all the necessary typing information.
      As for readability, I think it highly depends on habits: when I first discovered JSON after years of XML it was harder for me to find the right information, but now I’m equally “fluent” with both, though I prefer YAML, which has even less distracting plumbing than JSON. 🙂

      • After compression those tags become a few bits, as they are repeated and common; that’s what compression does. In fact I have seen it go below JSON quite often. For human readability of large amounts of data with some structure, XML is better.

        • Indeed, that’s why I’ve also compared the sizes of the compressed documents, to show that the difference is less obvious and less decisive when making a choice.
          As for human readability, “beauty is in the eye of the beholder”. 😉
