
Adaptive Optimisation using Hardware Performance Monitors

This project is about optimizing a virtual machine's bytecode using collected performance samples. These samples are not collected via additional instructions in the bytecode but via a special hardware interface.

In my work I use a special kernel module that activates the Pentium 4's Precise Event Based Sampling (PEBS). With it, the CPU runs special microcode that collects the samples on the fly and copies them into memory.

The paper "Online optimizations driven by hardware performance monitoring" was presented at PLDI'07 in San Diego.

 

Objectives
The main objective of this thesis is to provide an interface to the Pentium 4's Hardware Performance Monitors (HPM) and its Precise Event Based Sampling (PEBS) facilities. PEBS is a hardware-based method for sampling events such as L1 and L2 cache misses. Special microcode programs on the processor record important information, such as the register contents and the instruction pointer (IP), whenever an HPM counter overflows.
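The overflow-triggered sampling described above can be illustrated with a small software model. This is only a sketch in Python: the real mechanism runs in processor microcode, and the names `OverflowSampler`, `interval`, and the layout of a sample record are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OverflowSampler:
    """Software model of a counter that records a sample each time it overflows."""

    interval: int                      # events between two samples (assumed parameter)
    count: int = 0                     # events seen since the last overflow
    samples: list = field(default_factory=list)

    def event(self, ip: int, registers: dict) -> None:
        """Count one monitored event, e.g. a cache miss at instruction pointer `ip`."""
        self.count += 1
        if self.count == self.interval:
            # Overflow: store the state seen at this event, then re-arm the counter.
            self.samples.append({"ip": ip, "registers": dict(registers)})
            self.count = 0


# Ten hypothetical events with a sampling interval of four.
sampler = OverflowSampler(interval=4)
for i in range(10):
    sampler.event(ip=0x1000 + 4 * i, registers={"eax": i})

sampled_ips = [s["ip"] for s in sampler.samples]
```

Only every fourth event produces a record here, which is why the cost of sampling can stay small compared to instrumenting every event.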
This hardware information is then forwarded into the Jikes Research Virtual Machine. Inside the VM the IP is resolved to the corresponding VM_Method and bytecode instruction. More data can then be gathered depending on the bytecode instruction.
All of this collected information can then be used to guide the optimization system and to increase overall performance.
A new garbage collector was implemented as a second objective. This garbage collector distinguishes between hot types, which are used and sampled often, and cold types. The GC is a generational garbage collector with a nursery and two mature spaces. Cold objects are copied into the mark-and-sweep mature space, while hot objects are copied into a copying mature space. At collection time the objects are ordered and copied according to field heat. This should increase the cache locality of the objects and reduce cache misses.
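The hot/cold placement policy above can be sketched as follows. This is a minimal illustration in Python, not the collector's actual code: the threshold `HOT_THRESHOLD`, the function names, and the sample counts are all hypothetical, and ordering referenced objects by the heat of the referring field is one possible reading of "ordered and copied according to field heat".

```python
from collections import Counter

HOT_THRESHOLD = 2  # assumed cut-off: samples needed before a type counts as hot


def choose_mature_space(type_name, type_samples):
    """Return the mature space an object of this type is promoted into."""
    if type_samples[type_name] >= HOT_THRESHOLD:
        return "copy"
    return "mark-sweep"


def copy_order(references, field_heat):
    """Order the objects referenced by one object, hottest referring field first."""
    hottest_first = sorted(references, key=lambda name: -field_heat.get(name, 0))
    return [references[name] for name in hottest_first]


# Hypothetical sample counts per type and per field of one hot type.
type_samples = Counter({"Node": 5, "Config": 0})
field_heat = {"next": 7, "payload": 1, "key": 3}

# Hypothetical outgoing references of one Node object: field name -> referent.
references = {"next": "nodeB", "payload": "blob", "key": "str1"}
```

With these numbers a `Node` object would go to the copying space and its referents would be copied in the order `nodeB`, `str1`, `blob`, so the objects reached through the most frequently sampled fields end up closest together.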

 

Results
A new user-space library named libpebsi uses the Perfmon2 kernel interface to sample all PEBS events. These samples are then copied into Jikes. The JNI bindings use some tricks to keep the overhead low. As many samples as possible are copied from kernel space to user space at once, to limit the number of slow transitions between the two. The samples are then transferred into the virtual machine by a direct memory copy into the VM's address space.
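The benefit of copying samples in bulk can be seen by counting transitions. The following Python sketch is purely illustrative and does not use the Perfmon2 or libpebsi APIs: `drain_in_batches`, the buffer contents, and the batch size of 256 are assumptions.

```python
def drain_in_batches(kernel_buffer, batch_size):
    """Move all samples to user space; return (samples, number_of_transitions)."""
    user_samples = []
    transitions = 0
    for start in range(0, len(kernel_buffer), batch_size):
        # Each loop iteration stands for one kernel-to-user transition.
        user_samples.extend(kernel_buffer[start:start + batch_size])
        transitions += 1
    return user_samples, transitions


kernel_buffer = list(range(1000))          # 1000 hypothetical samples
copied, one_at_a_time = drain_in_batches(kernel_buffer, 1)
_, batched = drain_in_batches(kernel_buffer, 256)
```

In this model the same 1000 samples need 1000 transitions when copied one by one, but only 4 when copied in batches of 256.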
The lookup from IP to method is then done with a special three-level lookup table. The sample's address is split into three parts. The first two parts (8 and 12 bits long) are used to look up the specific page. Finally, a linear search is done within the page itself. There are normally only 2-4 methods inside a page, so the linear search is sufficient.
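The three-level lookup described above can be sketched as below. This is a minimal Python model, not the Jikes implementation: it assumes 32-bit addresses, so that the remaining 12 bits are the offset within a 4 KiB page, and the class `MethodTable`, its methods, and the example addresses are hypothetical. Registering a method in every page it touches is likewise an assumption.

```python
class MethodTable:
    """IP-to-method table: 8-bit index, then 12-bit index, then a per-page list."""

    def __init__(self):
        self.top = {}  # level 1: top 8 bits -> {middle 12 bits -> list of methods}

    @staticmethod
    def _split(address):
        """Return the 8-bit and 12-bit indices that select the page."""
        return address >> 24, (address >> 12) & 0xFFF

    def add(self, start, end, name):
        """Register a method occupying [start, end) in every page it touches."""
        page_addr = start & ~0xFFF
        while page_addr < end:
            hi, mid = self._split(page_addr)
            self.top.setdefault(hi, {}).setdefault(mid, []).append((start, end, name))
            page_addr += 0x1000

    def lookup(self, ip):
        """Resolve an instruction pointer to a method name, or None."""
        hi, mid = self._split(ip)
        # Level 3: linear search over the few methods stored for this page.
        for start, end, name in self.top.get(hi, {}).get(mid, ()):
            if start <= ip < end:
                return name
        return None


table = MethodTable()
table.add(0x08049000, 0x08049100, "foo")
table.add(0x08049100, 0x0804A020, "bar")   # crosses a page boundary
```

The two indexed levels cost a constant number of steps, and because each page holds only a handful of methods, the final linear scan stays short.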
With these techniques the overhead is normally between 0.6% and 1%, which is very low given the amount of information gathered.
Using second-level cache misses as sample input, the hot/cold copy garbage collector achieves an overall speedup of 1% across all benchmarks.
Further work can be done to improve the optimization. The information gathering is very stable, fast and simple to use. The low overhead should make it possible to implement new optimizations that offer a good speedup. One of the most promising fields would be to optimize method references.

 

The documentation and presentation are available as PDF.