Friday, February 26, 2016

JEP 254: Compact Strings

JEP 254 proposes changing the internal representation of strings in the JDK. As most readers surely know, a Java String stores its text as UTF-16 in a char array, which takes two bytes per char. The proposal suggests a more compact internal representation that uses one byte per character for strings containing only Latin-1 characters: “Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused,” says the JEP.
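
To make the space argument concrete, here is a minimal sketch that compares the bytes needed for a string's character data under the two encodings. The class and method names (CompactStorageSketch, isLatin1, utf16Bytes, compactBytes) are made up for illustration and are not part of the JDK or the JEP; object headers and other per-object overhead are ignored.

```java
// Hypothetical illustration only; none of these names come from the JDK.
class CompactStorageSketch {

    // True when every UTF-16 code unit is in the Latin-1 range (U+0000..U+00FF),
    // i.e. when each char would fit in a single byte.
    static boolean isLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Bytes used by the character data when stored as a char array (two bytes per char).
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    // Bytes used if Latin-1-only strings are stored one byte per character
    // and all other strings stay in UTF-16.
    static int compactBytes(String s) {
        return isLatin1(s) ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String[] samples = { "hello", "caf\u00e9", "\u65e5\u672c\u8a9e" };
        for (String s : samples) {
            System.out.println(s.length() + " chars: UTF-16 = " + utf16Bytes(s)
                    + " bytes, compact = " + compactBytes(s) + " bytes");
        }
    }
}
```

For "hello" this reports 10 bytes as UTF-16 and 5 bytes in the compact form. The third sample contains characters outside Latin-1, so it stays at two bytes per char.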


Changing to the more compact form would not affect existing code or any public APIs; it would be a purely internal change to the String implementation and not visible to programmers. Interestingly, the JEP’s web page reveals that an optional string compression feature was tried in JDK 6 update releases. It changed String.value to an Object reference that pointed either to a byte array, for strings containing only 7-bit US-ASCII characters, or to a regular char array. That feature, though, was subsequently removed.
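
The idea behind that earlier feature can be sketched roughly as follows. This is not the JDK 6 code, only a guess at its shape under the description above; the class CompressedStringSketch and all of its members are hypothetical.

```java
// Hypothetical sketch of an Object-typed value field that holds either
// a byte[] (ASCII-only text) or a char[] (everything else).
final class CompressedStringSketch {

    private final Object value;

    CompressedStringSketch(String s) {
        char[] chars = s.toCharArray();
        boolean ascii = true;
        for (char c : chars) {
            if (c > 0x7F) {          // outside 7-bit US-ASCII
                ascii = false;
                break;
            }
        }
        if (ascii) {
            byte[] bytes = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                bytes[i] = (byte) chars[i];
            }
            value = bytes;           // one byte per character
        } else {
            value = chars;           // regular two-byte chars
        }
    }

    boolean isCompressed() {
        return value instanceof byte[];
    }

    int length() {
        return isCompressed() ? ((byte[]) value).length : ((char[]) value).length;
    }

    // Every access has to check which representation is in use.
    char charAt(int index) {
        if (isCompressed()) {
            return (char) ((byte[]) value)[index]; // ASCII bytes are never negative
        }
        return ((char[]) value)[index];
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

Callers see the same characters either way, which is the point: the choice of representation stays hidden behind methods such as charAt and length.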