21 May 2018

S10 Overview

Superoptimization

Superoptimization was introduced by H. Massalin in his paper Superoptimizer: a look at the smallest program^[1] back in 1987. The idea is that for every sequence of instructions for a given computer architecture and a metric, there is an infinite number of sequences which perform the same task (are semantically equivalent to the original sequence), but one of them is the best one with respect to the metric.

Take for example the instruction sequence in RV32I:

addi x1, zero, 1
mv   x2, zero
add  x2, x2, x1
add  x2, x2, x1

This sets the register x2 to 2. If we choose a metric that orders programs by increasing order of size, then the best possible program is:

addi x2, zero, 2

Now take another example in RV32IM (which includes the multiplication extension):

mul x1, x1, 2

This doubles the value in register x1. Given the same metric, it is not possible to find a better program. However, if we change our metric to the number of clock cycles taken by the whole sequence of instructions and our micro-architecture takes 8 cycles for multiplications and 2 cycles for shifting then there's a better program:

slli x1, x1, 1

Which shifts register x1 by 1 bit to the left, in essence multiplying its value by two and taking only 2 cycles whereas the original example took 8 (under our micro-architectural assumptions).

For long sequence of instructions the state space that requires searching for the best sequence is enormous, which makes superoptimization a CPU intensive program. Research on superoptimization slowed down after the initial paper but has seen a resurgence in the last decade.

The compiler

Will superoptimizers replace your compiler? No!

Production compilers are highly complex pieces of software that generate very good code, under very tight time constraints. On the other hand, superoptimizers are very slow since they need to search very large state spaces and therefore can take minutes, hours or even days to find the best possible instruction sequence.

The placement of the superoptimizer in the development toolchain is up to the user but, in general, it will be placed after the compilation step picking up where the compiler left off and trying to optimize the code it generated.

S10

S10 was created out of the resurgence in superoptimization research mentioned earlier. Mangpo Phothilimthana from UC Berkeley published in 2016 a paper^[2] and technical report on Greenthumb. The initial prototype of S10 was built upon Greenthumb and once we verified the potential we decided to turn it into a commercial product -- where at this point about 15% of code comes from Mangpo's work.

Since most of the code was rewritten, including the interface between the backend architectures and the main modules, the architectures supported by Greenthumb are not supported by S10, namely arm, llvm ir, and greenarrays. We have therefore chosen to implement support initially for RISC-V, with other architectures on the roadmap.

S10 is written completely in Racket and implements three distinct algorithms that cooperate in a distributed setting. A symbolic^[2:1], stochastic^[3] and lens^[4] algorithms are distributed over several cores (or different machines) and communicate with each other through message passing finding incrementally better instruction sequences until it's proved that the best sequence has already been found.

S10 includes a statistical profiler and debugging aids to ensure that any error is found quickly and any hotspots are first to be optimized. Due to the statistical nature of some methods it is also possible to automatically tune these parameters on an architecture independent way to ensure the best possible performance from the framework.

Another area we focused on, given the long times taken to find solutions to the optimization problem was partial caching. S10 is supported by a database which uses previous found solutions to the superoptimization problem to simplify future searches.

Superoptimization in practice

In practice waiting for the perfect solution is not always possible or realistic, that is why S10 will incrementally provide incrementally better instruction sequences until it proves there is none better.

An interaction with S10 for the first example I mentioned in this post might look like this:

$ riscv-s10 -j 8 --timeout 60 --verbose --progress add2.s
S10 Superoptimization Framework, by Linki Tools UG
https://linki.tools/s10

Booting up 8 cores (stochastic 3, lens 4, symbolic 1)
Input score: 16

* stochastic found optimization (score: 12)

  mv x2, zero
  addi x2, x2, 1
  addi x2, x2, 1

* lens found optimization (score: 4)

  addi x2, zero, 2
  
* symbolic found best

  addi x2, zero, 2
  
Shutting down cores

Caching solution:
  addi x2, zero, 2

The superoptimizer will incrementally give information (due to the --progress flag) about optimizations found until we can prove there's no better solution, or a timeout is reached (which in this case was set to 60 seconds with --timeout).

The metric for the previous example was instruction sequence size, but had our metric been performance the situation would have been similar with one exception. The metric has to be fast to execute because it is executed potentially millions of times. If a cycle accurate simulation takes a long time, then it is better to implement a fast heuristic and request the best few solutions found and run an accurate metric on those.

$ riscv-s10 -j 8 --timeout 20 --verbose --progress \
  --metric speed --accurate-metric-script sim.sh --run-metric-on 2 add2.s
S10 Superoptimization Framework, by Linki Tools UG
https://linki.tools/s10

Booting up 8 cores (stochastic 3, lens 4, symbolic 1)
Input score: 16

* stochastic found optimization (score: 12)

  mv x2, zero
  addi x2, x2, 1
  addi x2, x2, 1

* lens found optimization (score: 4)

  addi x2, zero, 2

Timeout reached.
Shutting down cores

Best two solutions:

[1] mv x2, zero
    addi x2, x2, 1
    addi x2, x2, 1

[2] addi x2, zero, 2

Running sim.sh on [1]... Done
Running sim.sh on [2]... Done

Best solution (sim.sh):

[2] addi x2, zero, 2

Caching solution:
  addi x2, zero, 2

S10 in this case ran against a tight timeout and at that point since it was given a script to calculate an accurate metric (with --accurate-metric-script) it called it on the best two optimizations found at that point (due to --run-metric-on). The metric scored solution [2] as better and therefore that was the one which was returned and cached.

This is a simple example of what S10 can do. I have not mentioned how to extend the algorithms and internal structures, how to extend S10 to a new architecture or autotune it for better performance. I hope to touch those points on future posts.

Linki Tools is the first company to offer a commercial product for Superoptimization. Join us in superoptimizing your software! Contact me for more information.

Henry Massalin, Superoptimizer: a look at the smallest program, 1987 ↩︎
Mangpo Phothilimthana, et al., GreenThumb: superoptimizer construction framework, 2016 ↩︎ ↩︎
Alex Aiken, et al., Stochastic superoptimization, 2013 ↩︎
Mangpo Phothilimthana, et al., Scaling up Superoptimization, 2016 ↩︎