The Force.com developer blog recently posted a notice to everyone with a developer edition organization. Starting last Friday, all Force.com Developer Edition organizations will use a new Apex runtime that compiles directly to Java bytecode, which will provide significant improvements in performance.
Of course, I was thrilled to hear about a performance boost, but this raised some questions for me. What was Apex code compiling to before this upgrade? Was it not using Java bytecode before?
To answer these questions, I did a bit of internet searching. According to an old Salesforce blog post that introduced the language when it was first created, Apex is parsed into an Abstract Syntax Tree (AST), and this AST is then executed at runtime.
This information was a good start, but there are some missing details. At what point is this AST created? How is an AST ‘executed’? Is Apex ever compiled directly to Java bytecode? To answer questions like these, I had to dig deep and ask around. I found two resources that helped immensely in decoding this puzzle: Salesforce’s patent application for this technology, and a Twitter exchange with one of the developers on the Apex dev team, Rich Unger. After some critical thinking, here’s my Unofficial Grand Theory of Apex Compilation and Execution, both old and new.
The Old Apex Runtime
When an Apex class is created, if its syntax and references are valid, the code is saved in a record in the ApexClasses table, along with various metadata about the class, stored as fields on the record.
At this point, we have a valid Apex class stored in our org’s database. Looking at it in the database, it looks just as it did when the developer wrote it; it is in an un-compiled, un-interpreted state. So how can a plain string retrieved from the database be executed?
According to page 3 of the patent application, “In one embodiment, Apex is implemented as an Abstract Syntax Tree (AST)-based interpreter.” The Apex runtime is not a bytecode interpreter, like the JVM, but an AST interpreter! So, to run Apex, it must first be compiled into the AST that was mentioned earlier. The Apex runtime does this when a request arrives to use that class or its members. For every request that wants to use the class, the Salesforce system pulls the uncompiled class out of the database, compiles it into an AST node graph, and executes the AST. Each of the node objects in this graph has a method called ‘execute’ which the Apex interpreter can call to run the functionality defined by that node. By traversing the tree in this way, an Apex AST can be ‘executed’ by the Apex interpreter as if it were a program. Very cool solution!
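To make the idea concrete, here is a minimal sketch of an AST-based interpreter in Java. The `Node`, `Literal`, and `Add` types are my own illustrative inventions, not anything from the actual Apex runtime; the point is just that every node exposes an ‘execute’ method, and running the program is a recursive traversal of the tree:

```java
// Hypothetical sketch of an AST interpreter: each node knows how to
// execute itself, and composite nodes execute their children first.
interface Node {
    int execute();
}

// A leaf node: executing it simply yields its value.
class Literal implements Node {
    private final int value;
    Literal(int value) { this.value = value; }
    public int execute() { return value; }
}

// An interior node: executing it recursively executes both subtrees.
class Add implements Node {
    private final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    public int execute() { return left.execute() + right.execute(); }
}

public class AstDemo {
    public static void main(String[] args) {
        // Node graph for the expression (1 + 2) + 3
        Node program = new Add(new Add(new Literal(1), new Literal(2)),
                               new Literal(3));
        System.out.println(program.execute()); // prints 6
    }
}
```

Notice that nothing here is ever translated into machine code or bytecode; the object graph itself *is* the executable artifact, which is exactly why it has to be serialized wholesale if you want to cache it.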
I mentioned that an Apex class is compiled into an AST for *each request* to that class. Doesn’t this seem a bit inefficient? Indeed it is. To reduce the number of times this AST must be created, the AST object graph is serialized and cached in memcached, a popular RAM-based key-value store. This reduces the load on the database and also makes fetching the class faster, since RAM access beats HDD-backed database access by orders of magnitude. This serializing of the AST into a cache, however, is the key reason a new Apex runtime was created: it is the performance bottleneck of the old runtime, and we’ll see why below.
The New Apex Runtime
Serializing an object structure is not an easy operation in most programming languages, and deserializing it is just as difficult. The Apex AST can be potentially huge, roughly 10x the size of the source code on the heap, and serializing and deserializing it becomes much more costly as the AST grows larger and more complex. According to page 3 of the patent application, “Deserializing the AST from memcached is the dominant cost in many Apex requests.” The new Apex runtime eliminates this complex AST structure to remove the bottleneck: Apex is interpreted not as an object graph, but as bytecode. With the new system, when a request arrives to use a class, the class is pulled out of the database, compiled into a Java-like bytecode, and then executed by the new Apex bytecode interpreter. Like before, memcached is utilized to store this bytecode for rapid retrieval and execution in the near future. What has improved over the old system, however, is that serializing and deserializing bytecode is trivial rather than costly. It can be scraped right off the heap and onto the wire to be stored elsewhere, no complex algorithms required.
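The contrast is easy to see in a sketch: bytecode is already a flat array of bytes, so “serializing” it for the cache is nothing more than a copy, with no graph traversal or reconstruction involved. The sample bytes below are just the standard magic number at the start of a Java class file, used here purely as illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class BytecodeCopyDemo {
    public static void main(String[] args) throws IOException {
        // Bytecode is a flat byte array (these are the classic CAFEBABE
        // magic bytes that open every Java class file).
        byte[] bytecode = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

        // "Serializing" to the cache is a straight copy of the bytes...
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        wire.write(bytecode); // no reflection, no object graph walk

        // ...and "deserializing" is the same copy back.
        byte[] restored = wire.toByteArray();
        System.out.println(Arrays.equals(bytecode, restored)); // prints true
    }
}
```

Compare that to Java object serialization of an AST, which must walk every node, record its class and fields, and rebuild the whole graph on the way out; it is clear where the savings come from.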
As you can now see, the speed boost that Salesforce advertises comes from time saved by not needing to serialize and deserialize the relatively complex AST to and from cache. By eliminating the AST completely, Apex executable caching becomes a much simpler process, and simplicity is king in the land of software engineering.
There is one more piece to the puzzle that hasn’t been discussed yet, and that is the replacement for the actual Apex interpreter. Is it just an off-the-shelf Java interpreter? As Rich Unger tweeted to me, they can’t use the standard JVM used by Java, since it wasn’t designed for a multi-tenant environment. Instead, a custom Apex bytecode interpreter had to be written, one optimized to operate on specific records in the multi-tenant database that Salesforce uses. The bytecode interpreter may be a clean-room implementation, but Apex bytecode should be the same as Java bytecode.
Questions only give rise to more questions. If Apex bytecode is now the same as Java bytecode, will Apex receive more Java-like features soon, such as more deeply nested inner classes and generics? Will we soon be able to download an Apex runtime and run Apex code locally in a Salesforce emulator? These are questions that can only be answered with time.
Maybe the reader knows the answers to these questions or can otherwise contribute to the knowledge here. You can leave a comment below or find me on Twitter or Google+ by using the links in the sidebar.
(Thanks to Rich Unger for sharing some info about the inner-workings of the Apex runtime!)