Make It Visible A01: Java’s Math.max

etki · Ayte Labs · Oct 7, 2020


This is an article in the Make It Visible series. In Make It Visible we drill holes in black boxes by dissecting frequently used code and common assumptions to find out what happens inside and what their real impact is. This time the subject is completely insignificant (has anyone ever seen a project where Math.max would account for at least 10% of calls?), but it makes for a nice expedition into uncharted territory.

Disclaimer: i’m not an expert on CPU internals, but i’m striving to become one. I may be wrong at some points, but the tools themselves don’t lie: if the input is correct, you can trust them. Since all i do is measure tiny operations and check their disassembly, there’s not much room for a wrong step, so hopefully i won’t lie below.

The catch

While writing the kind of utility library every developer ends up writing, i looked into the source code of Java’s Math.max:

public static int max(int a, int b) {
    return (a >= b) ? a : b;
}

A seasoned developer¹ will instantly notice the thing: it’s the condition. While conditional branching is something we use every day, texts on the web usually discourage it in tight spots because of how it is processed. As you might know, a CPU consists of several different units which can work in parallel, forming a pipeline, so (in layman’s terms) while the CPU actually executes instruction N, instruction N+1 is being decoded and instruction N+2 is being fetched. Branching can kill this optimization: when the CPU decodes an instruction that may or may not result in a jump, it doesn’t know which way the pipeline will go, either to the instruction following the conditional jump or to the instruction the jump targets. Because of that it can’t fetch instructions in advance: to know which instruction comes next, it first has to resolve the so-called control dependency, the jump instruction itself.

CPU engineers addressed that with branch prediction: the CPU tracks the outcomes of such conditions and predicts which branch will be taken, starting to fetch and process the corresponding instructions for the predicted branch ahead of time². If the branch was predicted incorrectly, the pipeline has to start from scratch, meaning every unit must wait for the result of the previous one, causing a stall. When input values constantly change their relationship, which i expected to be common for Math.max calls³, branch predictions should also be wrong quite often, resulting in performance drops; the first whitepaper i found on the subject suggests a misprediction may easily cost 10+ CPU cycles (which may be more than the hot spot itself).
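
To get a feel for the effect outside of Math.max, there is the classic sorted-versus-shuffled demonstration. The sketch below is mine (the class name and everything in it are made up for illustration, and naive System.nanoTime timing is nowhere near JMH rigor), but on typical x86 hardware the sorted pass tends to be noticeably faster simply because the branch becomes predictable:

import java.util.Arrays;
import java.util.Random;

public class BranchToy {
    public static void main(String[] args) {
        int[] shuffled = new Random(42).ints(1 << 20, 0, 256).toArray();
        int[] sorted = shuffled.clone();
        Arrays.sort(sorted);
        // warm up both paths a bit before "measuring"
        for (int i = 0; i < 10; i++) {
            sum(shuffled);
            sum(sorted);
        }
        long t1 = System.nanoTime();
        long shuffledSum = sum(shuffled);
        long t2 = System.nanoTime();
        long sortedSum = sum(sorted);
        long t3 = System.nanoTime();
        System.out.printf("shuffled: %d ns, sorted: %d ns (sums %d / %d)%n",
                t2 - t1, t3 - t2, shuffledSum, sortedSum);
    }

    // the branch inside the loop is taken roughly half the time for shuffled
    // input and forms long predictable runs for sorted input
    private static long sum(int[] values) {
        long total = 0;
        for (int value : values) {
            if (value >= 128) {
                total += value;
            }
        }
        return total;
    }
}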

It is also worth checking the corresponding JMH sample: ~3.75 ns.

But if an algorithm is straightforward and has no conditions, it may perform better than a more compact one that is subject to the branch misprediction penalty. It is common to see bitwise hacks on the web that perform standard operations on various number types in a non-standard way for performance reasons (branch elimination included), as well as compiler optimizations of the same kind. Let’s confirm once more that Math.max(int, int) indeed contains branching:

$ javap -c java.lang.Math...
public static int max(int, int);
  Code:
     0: iload_0
     1: iload_1
     2: if_icmplt 9   # Here's the bad boye
     5: iload_0
     6: goto 10       # And here's another
     9: iload_1
    10: ireturn
...

(now you know one more reason why goto is evil)

Like any developer, as soon as i saw it i couldn’t resist the urge to try and outsmart the standard library.

We already know how it ends

Implementation

So, how does one calculate the maximum of two integers using bitwise-only operations? The simplest approach uses the argument difference (left - right): the maximum is determined by the sign of the difference and the presence of overflow.

  • Negative difference, no overflow: right
  • Positive difference, no overflow: left
  • Negative difference, overflow: left
  • Positive difference, overflow: right

It boils down to computing a mask based on those two conditions, applying that mask to one argument and its inversion to the other, and merging the results as a final step:

public static int maximum(int left, int right) {
    // 0b11... if negative, 0b0 if positive
    int difference = (left - right) >> 31;
    // 0b11... if left is negative, 0b0 if it is positive
    int alpha = left >> 31;
    // 0b11... if right is negative, 0b0 if it is positive
    int beta = right >> 31;
    // (0b11... if arguments have different signs)
    // &
    // (0b11... if difference and left have different signs)
    int overflow = (alpha ^ beta) & (alpha ^ difference);
    // (0b11... if difference is negative and no overflow
    // or
    // difference is positive and overflow is present)
    int mask = difference ^ overflow;
    return (left & ~mask) | (right & mask);
}
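
Before benchmarking anything, it is worth a quick sanity check that this thing is actually correct, overflow cases included. A throwaway test along these lines does it (the class name is hypothetical and it is not part of the benchmark suite; the bitwise method is copied in only to keep the snippet self-contained):

import java.util.Random;

public class MaximumSanityCheck {
    public static void main(String[] args) {
        // edge cases that exercise the overflow part of the mask logic
        int[] edges = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int left : edges) {
            for (int right : edges) {
                verify(left, right);
            }
        }
        Random random = new Random();
        for (int i = 0; i < 1_000_000; i++) {
            verify(random.nextInt(), random.nextInt());
        }
        System.out.println("bitwise maximum matches Math.max");
    }

    private static void verify(int left, int right) {
        int expected = Math.max(left, right);
        int actual = maximum(left, right);
        if (expected != actual) {
            throw new AssertionError(left + ", " + right + ": " + actual + " != " + expected);
        }
    }

    // copy of the implementation above
    private static int maximum(int left, int right) {
        int difference = (left - right) >> 31;
        int alpha = left >> 31;
        int beta = right >> 31;
        int overflow = (alpha ^ beta) & (alpha ^ difference);
        int mask = difference ^ overflow;
        return (left & ~mask) | (right & mask);
    }
}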

Welp, that’s much more code than i hoped for. But according to the JMH sample, branch misprediction in the simplest case can introduce as much as 3.75 ns of average latency (~15 CPU cycles if that sample was taken at something like 4 GHz), and execution time doesn’t map linearly onto instruction count anyway, so maybe the bitwise algorithm (which should perform well, since it consists of trivial operations only) will outrun the classic comparison? Let’s destroy my fairy castle with the aforementioned JMH.

Benchmark

First, i’ve set up a test value provider:

@UtilityClass
public class Corpus {
    private final Random RANDOM = new SecureRandom();

    public final int SIZE = Integer.parseInt(
        System.getProperty(
            "pls no constant folding",
            "16384" // actually this is in Configuration class
        )
    );

    public int[][] random() {
        val values = new int[SIZE][2];
        for (var i = 0; i < SIZE; i++) {
            values[i][0] = RANDOM.nextInt();
            values[i][1] = RANDOM.nextInt();
        }
        return values;
    }

    public int[][] constant() {
        val values = new int[SIZE][2];
        val left = RANDOM.nextInt();
        val right = RANDOM.nextInt();
        for (var i = 0; i < SIZE; i++) {
            values[i][0] = left;
            values[i][1] = right;
        }
        return values;
    }

    /**
     * Returns a two-dimensional array of {@link #SIZE} size, which
     * is initialized with a repeating pattern of {@code interval} two
     * element arrays, the first {@code length} of which contain the
     * greater number as the zero element, and as the element with
     * index one otherwise.
     */
    public int[][] tripping(int interval, int length) {
        val values = new int[SIZE][2];
        for (var i = 0; i < SIZE; i++) {
            val number = RANDOM.nextInt(Integer.MAX_VALUE - 1);
            val inverted = i % interval < length;
            values[i][0] = number;
            values[i][1] = inverted ? number - 1 : number + 1;
        }
        return values;
    }
}

The benchmark is similar for all algorithms:

@Threads(Configuration.THREADS)
@Fork(Configuration.FORKS)
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(
    time = Configuration.ITERATION_DURATION,
    iterations = Configuration.ITERATIONS
)
@Warmup(
    time = Configuration.WARMUP_DURATION,
    iterations = Configuration.WARMUPS
)
public class BenchmarkX {
    @Param({"true", "false"})
    public boolean constant;

    private int[][] corpus;

    @Setup
    public void setUp() {
        corpus = constant ? Corpus.constant() : Corpus.random();
    }

    @Benchmark
    public void run(Blackhole blackhole) {
        for (var i = 0; i < Corpus.SIZE; i++) {
            blackhole.consume(implementation(corpus[i][0], corpus[i][1]));
        }
    }

    private static int implementation(int left, int right) { ... }
}

I’ve created the following benchmarks:

  • LoopBaseline: Same as all other benchmarks, but doing virtually nothing in the loop body.
  • MaximumBaseline: Math.max is substituted with a call to a method that simply returns its first argument. It is expected that nothing can be faster than a direct return.
  • CoreMaximum: direct usage of Math.max.
  • BitwiseMaximum: my algo.
  • ExtractedMaximum: Math.max source code copied as a separate method. I won’t spoil the surprise of why it exists, but seasoned developers probably already know the importance of such a benchmark.
  • EntangledExtractedMaximum: to make the compiler’s life even harder, the condition check is extracted into a separate method (see the sketch right after this list). The only reason for it is to have fun and see how it compiles relative to the other benchmarks.
  • InlinedExtractedMaximum: same as ExtractedMaximum, but inlining is forced by a JMH annotation instead of being disabled.
  • InlinedBitwiseMaximum: again, inlined twin of BitwiseMaximum.
  • NaiveMaximum: no calls to non-inlined methods, the comparison is done right in place.
  • PrematurelyOptimizedNaiveMaximum: the same, but instead of i[0] > i[1] ? i[0] : i[1], the values are put in separate variables first: var a = i[0]; var b = i[1]; return a > b ? a : b. This one is for fun only as well, just to see that the compiler is smart enough to do that optimization by itself.
  • You may also see a couple of inspection benchmarks that exist purely so that one can attach to them while they are running; they don’t actually matter and are not run in a batch with the others. In fact, forget about them completely.
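
To give an idea of the EntangledExtractedMaximum shape mentioned above, it boils down to something like this (my reconstruction from memory, not the exact repository code):

import org.openjdk.jmh.annotations.CompilerControl;

public class EntangledExtractedMaximum {
    // kept out of line on purpose, so the caller can't fuse cmp with the selection
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private static boolean condition(int left, int right) {
        return left >= right;
    }

    private static int implementation(int left, int right) {
        return condition(left, right) ? left : right;
    }
}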

All benchmarks are the same except for the calls to different implementations. Inlining⁴ of all implementations, and of the condition method in the entangled extracted benchmark, is forbidden by JMH annotations and by the CompileCommandFile⁵ JVM option with the following content:

dontinline java/lang/Math.max
print java/lang/Math.max
print io/ayte/performance/breakdown/number/maximum/*.run
print io/ayte/performance/breakdown/number/maximum/*.implementation
print io/ayte/performance/breakdown/number/maximum/inspection/*.run

The print directives simply tell the JVM to dump the disassembled JIT code to the output⁶, inline tells it to put a method body directly into the caller’s body, and dontinline forbids that optimization (which the compiler performs quite often, because it removes additional jumps and stack management).
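
The JMH-annotation counterpart mentioned above presumably looks roughly like this (a sketch using JMH’s org.openjdk.jmh.annotations.CompilerControl; the class and method names here are made up for illustration):

import org.openjdk.jmh.annotations.CompilerControl;

public class InliningControlSketch {
    // forbid inlining into the benchmark loop, as in ExtractedMaximum
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    static int extracted(int left, int right) {
        return (left >= right) ? left : right;
    }

    // ...or force it, as in the Inlined* twins
    @CompilerControl(CompilerControl.Mode.INLINE)
    static int inlined(int left, int right) {
        return (left >= right) ? left : right;
    }
}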

Since i have a 4-core CPU, i set the thread count to that very number (fuck hyperthreading, i have only four physical cores, and who knows how HT would affect things), extended the warmup and measurement time, and didn’t tweak anything else, since there’s not much sense in it. The Java runtime is OpenJDK 14, and i don’t have enough time to try out every JDK version out there: we’re not testing JDKs here, we’re checking whether Math.max is affected by branch prediction on HotSpot and whether a straightforward bitwise algorithm can compete with it. The last thing to do is to enable the perfnorm profiler and write the compilation log to a separate file.

$ ./mvnw package
$ OPTIONS="src/main/resources/number/maximum/compiler.options"
$ java -XX:+UnlockDiagnosticVMOptions \
    -XX:CompileCommandFile=$OPTIONS \
    -XX:+LogCompilation \
    -XX:LogFile=target/number.maximum.compilation.log \
    -jar target/io.ayte.performance.breakdown.jar \
    io.ayte.performance.breakdown.number.maximum \
    -prof perfnorm 2>&1 | tee target/number.maximum.log

Lift off!

Benchmark                          Mode          Score    Error  Units
-----------------------------------------------------------------------
LoopBaseline                       -            65.674 ±  0.023  us/op
CoreMaximum                        constant     88.596 ±  0.205  us/op
CoreMaximum                        random       88.568 ±  0.096  us/op
MaximumBaseline                    constant    104.173 ±  0.087  us/op
MaximumBaseline                    random      104.414 ±  0.350  us/op
BitwiseMaximum                     constant    120.357 ±  0.079  us/op
BitwiseMaximum                     random      120.667 ±  0.389  us/op
InlinedBitwiseMaximum              constant    104.652 ±  0.087  us/op
InlinedBitwiseMaximum              random      104.952 ±  0.332  us/op
ExtractedMaximum                   constant    111.125 ±  0.149  us/op
ExtractedMaximum                   random      110.896 ±  0.018  us/op
InlinedExtractedMaximum            constant     88.456 ±  0.091  us/op
InlinedExtractedMaximum            random       88.817 ±  0.189  us/op
NaiveMaximum                       constant     88.466 ±  0.112  us/op
NaiveMaximum                       random       88.366 ±  0.042  us/op
PrematurelyOptimizedNaiveMaximum   constant     88.372           us/op
PrematurelyOptimizedNaiveMaximum   random       88.394           us/op
EntangledExtractedMaximum          constant    140.545           us/op
EntangledExtractedMaximum          random      140.398           us/op

SO CLOSE

Ok, i admit it. I’ve lost.

Investigations

But all of the above is just numbers; they don’t tell why things happen the way they do. Before going forward i want to note a couple of things:

  • For every benchmark, including CoreMaximum, results are practically the same for both constant and random input.
  • CoreMaximum is actually faster than MaximumBaseline. WTF?
  • InlinedExtractedMaximum repeats CoreMaximum performance and is faster than MaximumBaseline, which is somewhat expected due to no additional stack management and jumps.
  • NaiveMaximum repeats InlinedExtractedMaximum performance. That’s pretty much expected as well, since in compiled form it should be exactly the same as InlinedExtractedMaximum.
  • PrematurelyOptimizedNaiveMaximum repeats NaiveMaximum performance because, obviously, compiler knows how to do such primitive optimizations.
  • EntangledExtractedMaximum results are worse (that’s expected, there’s an extra call), but still there is no difference for constant and random input.

The first one blames me as harshly as possible. If constant input gives the same performance as random input, either there is no branch misprediction at all, or i just made the whole thing up and am deceiving you. I must have been wrong from the very start. Let’s confirm that by looking at the perf metrics:

Metric                        Mode              Value      Error  Units
------------------------------------------------------------------------
CoreMaximum:branches          constant     196637.483 ±  127.379   #/op
CoreMaximum:branch-misses     constant          1.897 ±    0.109   #/op
CoreMaximum:branches          random       196642.713 ±   60.221   #/op
CoreMaximum:branch-misses     random            1.939 ±    0.252   #/op

Yes, there is no significant difference in branch count and misses.

So, what may have happened here? Isn’t that a classic comparison, which should compile to cmp and jX⁷? Am i wrong about the branch prediction stuff? Let’s just inspect the assembly that was dumped during benchmarking. If you’ve never seen such a thing, don’t worry: writing assembly is really tough, but reading it is simple enough thanks to your favorite search engine. Basically, assembly is just a long tape of instructions that can be treated as if executed sequentially (reality is quite a bit more complex, but that’s another story). Execution goes from one instruction to the next, unless it hits some kind of jump instruction, which may force the cursor to continue at another point in the tape.

First, Math.max was compiled using C1⁸:

# {method} {0x0000000800322798} 'max' '(II)I' in 'java/lang/Math'
# parm0: rsi = int
# parm1: rdx = int
# [sp+0x40] (sp of caller)
; some management code i'm not good enough with
0x00007f3960cbf420: mov %eax,-0x14000(%rsp)
0x00007f3960cbf427: push %rbp
0x00007f3960cbf428: sub $0x30,%rsp
0x00007f3960cbf42c: movabs $0x7f394d2a7870,%rax ; {metadata(method data for {method} {0x0000000800322798} 'max' '(II)I' in 'java/lang/Math')}
0x00007f3960cbf436: mov 0x12c(%rax),%edi
0x00007f3960cbf43c: add $0x8,%edi
0x00007f3960cbf43f: mov %edi,0x12c(%rax)
0x00007f3960cbf445: and $0x1ff8,%edi
0x00007f3960cbf44b: cmp $0x0,%edi
; actual method start
; [1] First it jumps for unknown reason down there
0x00007f3960cbf44e: je 0x00007f3960cbf4b7 ; *iload_0
; [3] Actual method body
0x00007f3960cbf454: cmp %edx,%esi
0x00007f3960cbf456: movabs $0x7f394d2a7870,%rax ; {metadata}
0x00007f3960cbf460: movabs $0x170,%rdi
; [4] Some strange code, if %esi (parameter #1) is lesser
; than %edx (parameter #2), skip movabs instruction. Let's move on
; though.
0x00007f3960cbf46a: jl 0x00007f3960cbf47a
0x00007f3960cbf470: movabs $0x180,%rdi
0x00007f3960cbf47a: mov (%rax,%rdi,1),%rbx
0x00007f3960cbf47e: lea 0x1(%rbx),%rbx
0x00007f3960cbf482: mov %rbx,(%rax,%rdi,1)
; [5] Since the flags register shouldn't have changed its state since
; the previous cmp, this still reads as "if %esi (parameter #1)
; is less than %edx (parameter #2)"
0x00007f3960cbf486: jl 0x00007f3960cbf4a1 ; *if_icmplt
0x00007f3960cbf48c: movabs $0x7f394d2a7870,%rax ; {metadata}
0x00007f3960cbf496: incl 0x190(%rax)
; [7] Else jump straight to the return sequence where parameter #1 is
; put in %rax, the return register
0x00007f3960cbf49c: jmpq 0x00007f3960cbf4a4 ; *goto
; [6] Then put parameter #2 in parameter #1 register
0x00007f3960cbf4a1: mov %rdx,%rsi ; *ireturn
0x00007f3960cbf4a4: mov %rsi,%rax
0x00007f3960cbf4a7: add $0x30,%rsp
0x00007f3960cbf4ab: pop %rbp
0x00007f3960cbf4ac: mov 0x110(%r15),%r10
0x00007f3960cbf4b3: test %eax,(%r10) ; safepoint
0x00007f3960cbf4b6: retq
; [2] Some management again, then jump back
0x00007f3960cbf4b7: movabs $0x800322798,%r10 ; {metadata}
0x00007f3960cbf4c1: mov %r10,0x8(%rsp)
0x00007f3960cbf4c6: movq $0xffffffffffffffff,(%rsp)
0x00007f3960cbf4ce: callq 0x00007f39607d7080 ; *synchr. entry
0x00007f3960cbf4d3: jmpq 0x00007f3960cbf454

Here we would jump to the C2 output, but let’s check the compilation log first:

<klass id='1164' name='java.lang.Math' flags='17'/>
<method id='1165' holder='1164' name='max' return='1023' arguments='1023 1023' flags='9' bytes='11' compile_id='234' compiler='c1' level='3' iicount='2081'/>
<call method='1165' count='1025' prof_factor='0,880634' inline='1'/>
<intrinsic id='_max' nodes='7'/>

Now wait a second. What’s an intrinsic? That’s the surprise i was mentioning above and the reason why i benchmarked not only Math.max, but also its identical copies. JVM intrinsics are built-in implementations of standard library methods, hard-coded inside the JVM itself (which is written in C++, cue the common rant that C/C++ is always faster than other languages). While JIT and AOT outperform each other in different scenarios, it has to be admitted that the Java compiler may behave differently from release to release, may lack domain information and therefore not optimize where it could, and a thousand other things, so why not optimize a simple method like Math.max just once and be sure it never compiles differently? Maybe there’s even the very same bitwise implementation inside, and my own performs worse only because the JIT inserts some JVM things like safepoints? Let’s find out.
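
Before digging through sources, there is also a quick way to catch the intrinsic in the act at runtime: HotSpot has a diagnostic flag for printing intrinsics as the JIT substitutes them. Something along these lines should mention java.lang.Math::max in its output (the flag is a diagnostic one and requires unlocking; consult -XX:+PrintFlagsFinal if in doubt):

$ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics \
    -jar target/io.ayte.performance.breakdown.jar \
    io.ayte.performance.breakdown.number.maximum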

The OpenJDK source code is available via its Mercurial repo and GitHub; i’ll be using the latter just because of an irrational feeling of it being more comfortable.
Yeap, there’s an intrinsic for Math.max. The next sign of it is located in the C2 compiler source code: here’s our candidate, and here’s the call to the function that generates the code, which is fucking huge. I will be honest with you: i’m not expert enough in C++/JDK source code to understand what exactly it does and which branches it takes on which platform. I’ll again just skip right to the juicy part, the disassembly from the C2 output.

; Yes, that's the whole method with all required management code
# {method} {0x0000000800322798} 'max' '(II)I' in 'java/lang/Math'
# parm0: rsi = int
# parm1: rdx = int
# [sp+0x20] (sp of caller)
0x00007f92541d5080: sub $0x18,%rsp
0x00007f92541d5087: mov %rbp,0x10(%rsp) ; *synchr. entry
0x00007f92541d508c: cmp %edx,%esi
0x00007f92541d508e: mov %esi,%eax
0x00007f92541d5090: cmovl %edx,%eax ; *ireturn
0x00007f92541d5093: add $0x10,%rsp
0x00007f92541d5097: pop %rbp
0x00007f92541d5098: mov 0x110(%r15),%r10
0x00007f92541d509f: test %eax,(%r10) ; safepoint
0x00007f92541d50a2: retq

Strictly speaking, one can’t be sure that this is exactly what gets called, since C2 is expected to emit JIT-compiled Java code, and intrinsics aren’t written in Java at all. However, later we will see that this is exactly the code being executed.

Now, there’s something special there:

; Compare the two parameters
0x00007f92541d508c: cmp %edx,%esi
; Move parameter #1 into the return register
0x00007f92541d508e: mov %esi,%eax
; cmovl, something new! Overwrite the return register with parameter #2,
; but only if parameter #1 was less than parameter #2
0x00007f92541d5090: cmovl %edx,%eax

And that’s pure bliss. The whole method body takes only three CPU instructions, while even the straightforward approach would require four:

    mov %edx,%eax
    cmp %edx,%esi
    jl  RESULT
    mov %esi,%eax
RESULT:
    <restore stack and base pointers and exit method>

But i still haven’t proved that this code is not affected by branching. cmp and mov surely don’t do any branching, but what is cmovl? The CMOVcc family stands for “conditional move”, and while i couldn’t find official confirmation (ok, ok: i didn’t have time for that, it’s probably in Intel’s Software Developer Manual), i believe they were created exactly to prevent branching and turn a control dependency into a data dependency. While branch prediction tries to guess ahead which path will be taken and which dependencies the predicted branch requires (so they can be processed ahead of time), CMOVcc instructions do no extra work: they just swap a register value if the condition is met, and that’s all, there are simply no jumps that could reset the pipeline⁹. I can only envy the experience of those who wrote this.

Math.max’s twins

There was another benchmark testing an exact copy of the Math.max source code. It’s there to test just one thing: the performance of a method that doesn’t have a backing intrinsic. But in terms of branching it turned out to perform as well as Math.max itself. So, let’s open Pandora’s box:

# {method} {0x00007f25850b4920} 'implementation' '(II)I' in 
# 'io/ayte/performance/breakdown/number/maximum/ExtractedMaximum'
# parm0: rsi = int
# parm1: rdx = int
# [sp+0x20] (sp of caller)
0x00007f25a01fcb00: sub $0x18,%rsp
0x00007f25a01fcb07: mov %rbp,0x10(%rsp) ; *synchr. entry
0x00007f25a01fcb0c: cmp %edx,%esi
0x00007f25a01fcb0e: mov %esi,%eax
0x00007f25a01fcb10: cmovl %edx,%eax ; *ireturn
0x00007f25a01fcb13: add $0x10,%rsp
0x00007f25a01fcb17: pop %rbp
0x00007f25a01fcb18: mov 0x110(%r15),%r10
0x00007f25a01fcb1f: test %eax,(%r10) ; safepoint
0x00007f25a01fcb22: retq

It’s exactly the same¹⁰! My envy just tripled.

Math.max’s melancholic twin

Everything seen above relies on the ability to place cmp right next to CMOVcc. So i’ve introduced that other benchmark (EntangledExtractedMaximum), which forbids such a combination by forcing the cmp outside of the actual implementation, which in turn would result in branching with the straightforward approach. Checking the numbers, this implementation still does not suffer from branch misprediction:

Metric                                     Mode             Value  Units
--------------------------------------------------------------------------
EntangledExtractedMaximum:branches         constant    327722.910   #/op
EntangledExtractedMaximum:branch-misses    constant         3.356   #/op
EntangledExtractedMaximum:branches         random      327721.812   #/op
EntangledExtractedMaximum:branch-misses    random           2.448   #/op

Let’s look at the compiled code:

; *invokestatic condition
0x00007fc84c1d5193: callq 0x00007fc844d45d20
; condition() returns boolean, so test %eax,%eax makes sense
0x00007fc84c1d5198: test %eax,%eax
; same pattern again: load one value and substitute with another
; should it be bigger
0x00007fc84c1d519a: mov (%rsp),%eax
0x00007fc84c1d519d: cmovne %ebp,%eax ; *ireturn

Of course there’s CMOVcc again. While the comparison itself is placed outside of the method, the same choice can be reconstructed from the returned value.

Still, why the timings?

Let’s recap the results:

Benchmark                   Mode          Score    Error  Units
----------------------------------------------------------------
CoreMaximum                 constant     88.596 ±  0.205  us/op
CoreMaximum                 random       88.568 ±  0.096  us/op
MaximumBaseline             constant    104.173 ±  0.087  us/op
MaximumBaseline             random      104.414 ±  0.350  us/op
ExtractedMaximum            constant    111.125 ±  0.149  us/op
ExtractedMaximum            random      110.896 ±  0.018  us/op
InlinedExtractedMaximum     constant     88.456 ±  0.091  us/op
InlinedExtractedMaximum     random       88.817 ±  0.189  us/op
NaiveMaximum                constant     88.466 ±  0.112  us/op
NaiveMaximum                random       88.366 ±  0.042  us/op

MaximumBaseline and ExtractedMaximum perform worse than everything else, while every other benchmark shows nearly identical results. We’ve seen that the max() code is in fact identical in all benchmarks, and yet Math.max performs as well as the inlined / in-place implementations. How so? I’ve forbidden inlining of Math.max using CompileCommand, haven’t i? Let’s dive into the assembly once again:

# {method} {0x00007f9223c947c8} 'run' 
; '(Lorg/openjdk/jmh/infra/Blackhole;)V' in
; 'io/ayte/performance/breakdown/number/maximum/CoreMaximum'
; irrelevant code
0x00007f92541d5785: xchg %ax,%ax ; alignment
0x00007f92541d5787: callq 0x00007f924c723d80 ; *iload_2
; {external_word}
0x00007f92541d578c: movabs $0x7f926b7c8198,%rdi
0x00007f92541d5796: and $0xfffffffffffffff0,%rsp
; {runtime_call MacroAssembler::debug64(char*, long, long*)}
0x00007f92541d579a: callq 0x00007f926b3cc420
0x00007f92541d579f: hlt
0x00007f92541d57a0: mov %r13,0x8(%rsp)
0x00007f92541d57a5: mov %r11,(%rsp)
0x00007f92541d57a9: mov %ebx,%ebp ;*iload_2
0x00007f92541d57ab: mov 0x10(%r12,%r10,8),%edx ;*iaload
0x00007f92541d57b0: mov 0x14(%r12,%r10,8),%r11d ;*iaload
; Now would you just look at that?
; *invokestatic max
0x00007f92541d57b5: cmp %r11d,%edx
0x00007f92541d57b8: cmovl %r11d,%edx
0x00007f92541d57bc: mov %r13,%rsi
; *invokevirtual consume
0x00007f92541d57bf: callq 0x00007f92541d53b0
; Loop management, blah-blah-blah
; Due to jumps in irrelevant code section,
; execution starts somewhere close to here
0x00007f92541d57c4: mov %ebp,%ebx
0x00007f92541d57c6: inc %ebx ; *iinc

It got inlined, even though i explicitly forbade it. I guess this is another property of intrinsics: they ignore what everybody tells them and do what they want, i mean, get inlined no matter what. Here you can also see that the C2 output was correct, since this is exactly what gets inlined into other methods.

Destroying what’s left of castle

Remember the bitwise implementation that lost to the three-instruction smart move? Let’s take a look at what it compiled to:

# method handling code omitted
0x00007f181056ae0c: mov %esi,%r11d
0x00007f181056ae0f: sub %edx,%r11d
0x00007f181056ae12: mov %edx,%r8d
0x00007f181056ae15: sar $0x1f,%r8d
0x00007f181056ae19: sar $0x1f,%r11d
0x00007f181056ae1d: mov %esi,%r10d
0x00007f181056ae20: sar $0x1f,%r10d
0x00007f181056ae24: mov %r11d,%r9d
0x00007f181056ae27: xor %r10d,%r9d
0x00007f181056ae2a: xor %r10d,%r8d
0x00007f181056ae2d: and %r9d,%r8d
0x00007f181056ae30: xor %r11d,%r8d
0x00007f181056ae33: and %r8d,%edx
0x00007f181056ae36: andn %esi,%r8d,%eax
0x00007f181056ae3b: or %edx,%eax

(yet another round of applause for the JVM compiler and CPU engineers who introduced this: x & ~y is compiled into a single andn)

And here’s the thing. Fifteen instructions. Imagining that one instruction corresponds to one CPU cycle (which is totally false and will be mythbusted just a few lines below), it should perform five times worse than the standard implementation.

Benchmark                   Mode          Score    Error  Units
----------------------------------------------------------------
LoopBaseline                -            65.674 ±  0.023  us/op
CoreMaximum                 constant     88.596 ±  0.205  us/op
CoreMaximum                 random       88.568 ±  0.096  us/op
InlinedBitwiseMaximum       constant    104.652 ±  0.087  us/op
InlinedBitwiseMaximum       random      104.952 ±  0.332  us/op

(I took the inlined benchmark scores to pair up with the inlined Math.max)

Well, after subtracting the loop baseline it’s somewhere around 23 us vs 39 us¹¹ (88.6 − 65.7 ≈ 23, 105 − 65.7 ≈ 39), closer to 5:8 than to 1:5 (6:7, if we count the loop). How does this happen? Let’s check the perf output for a metric called CPI:

CoreMaximum.run:CPI              0.328 ± 0.001   #/op
InlinedBitwiseMaximum.run:CPI    0.281 ± 0.002   #/op
# inverse:
CoreMaximum.run:IPC              ~3.05           #/op
InlinedBitwiseMaximum.run:IPC    ~3.56           #/op
# Again, don't forget that those numbers include loop code as well

CPI stands for Cycles Per Instruction, or, if you invert the fraction, how many instructions are executed per cycle. First of all, you now have to live with the knowledge that a CPU is capable of executing more than a single instruction per cycle. Second, as you can see, the bitwise implementation is indeed faster in terms of instructions per cycle, and don’t forget that the presented numbers include the loop as well (which takes most of the time), so the actual numbers differ even more. That takes us to the next section:

Drawbacks of CMOVcc

Of course, everything comes at a cost. CMOVcc instructions introduce a data dependency, which means some reordering opportunities are lost: the instruction has to wait both for the preceding cmp result and for a valid value in the source register, and dependent instructions, which are likely to be the very next ones, will do the same. A conditional jump requires the source value only for one of the branches, and in the case of replacing CMOVcc here, even if the wrong branch is predicted, the correct one either follows the predicted one or sits one instruction before it, which shouldn’t be a big deal: everything is probably already in the L1 cache.

There is also critique from Linus Torvalds which, even though it was written more than thirteen years ago, hints that CMOVcc does not guarantee faster execution than branching. How about actually giving branch prediction a try?

@Threads(Configuration.THREADS)
@Fork(Configuration.CURIOSITY_FORKS)
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(
    time = Configuration.ITERATION_DURATION,
    iterations = Configuration.ITERATIONS
)
@Warmup(
    time = Configuration.WARMUP_DURATION,
    iterations = Configuration.WARMUPS
)
public class IntentionalBranching {
    @Param({
        "constant",
        "random",
        "1:0",
        "2:1",
        "4:1", "4:2",
        "8:1", "8:2", "8:4",
        "32:1", "32:2", "32:4", "32:8",
        "128:1", "128:4", "128:16", "128:64",
        "512:1", "512:16", "512:32", "512:128"
    })
    public String configuration;

    private int[][] corpus;

    @Setup
    public void setUp() {
        if (configuration.equals("constant")) {
            corpus = Corpus.constant();
            return;
        }
        if (configuration.equals("random")) {
            corpus = Corpus.random();
            return;
        }
        String[] split = configuration.split(":");
        corpus = Corpus.tripping(
            Integer.parseInt(split[0]),
            Integer.parseInt(split[1])
        );
    }

    @Benchmark
    public void run(Blackhole blackhole) {
        for (var i = 0; i < Corpus.SIZE; i++) {
            if (corpus[i][0] > corpus[i][1]) {
                blackhole.consume(corpus[i][0]);
            } else {
                blackhole.consume(corpus[i][1]);
            }
        }
    }
}

Don’t be frightened by the amount of code, it is in fact a very simple benchmark. Code that should contain branching and does pretty much the same job as Math.max is executed against three types of data sets: constant, random, and a thing i called tripping. Tripping is a data set that flips its behavior once per interval for a specified length: for example, 32:8 means that in the first 8 of every 32 pairs the first element is bigger than the second, while for the remainder of the interval it is the opposite. Let me demonstrate what configuration 8:2 could produce:

          i  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ...
corpus[i][0] | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ...
corpus[i][1] | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | ...
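
If you want to double-check the pattern yourself, a quick loop over the generated corpus does the job (this reuses the Corpus class from earlier; the TrippingDemo class name is made up, and the actual values are random, only the left/right relationship is fixed):

public class TrippingDemo {
    public static void main(String[] args) {
        int[][] corpus = Corpus.tripping(8, 2);
        for (int i = 0; i < 12; i++) {
            String bigger = corpus[i][0] > corpus[i][1] ? "left" : "right";
            System.out.printf("%2d: %s is bigger%n", i, bigger);
        }
        // expected: left, left, then right six times, then left, left, right, right
    }
}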

The actual branch prediction heuristics are a complete mystery to me. Because of that i just took several data sets that, i thought, could be hard for the CPU to predict, and added constant input as a baseline. I can’t, at least for now, really interpret the numbers or explain why they behave the way they do, but i’m desperate to see the patterns. There is one thing that must be done before going forward: verifying that the benchmark does in fact branch. Of course, this leads us to disassembly once again. The method body is strange as heck (i don’t know why the JIT loves unnecessary jumps), but there’s enough information to confirm that branching takes place:

; This is just an excerpt, method start and end are excluded for
; erm
; brevity, if it can be called so
0x00007febd41d4840: mov %rbx,0x8(%rsp)
0x00007febd41d4845: mov %r8,(%rsp)
0x00007febd41d4849: mov %rbx,%rsi
0x00007febd41d484c: data16 xchg %ax,%ax
; *invokevirtual Blackhole.consume
; As you'll see below, the necessary value is already
; in %edx register, which carries parameter for .consume()
0x00007febd41d484f: callq 0x00007febcc725a00
0x00007febd41d4854: mov 0x110(%r15),%r10
0x00007febd41d485b: inc %ebp
0x00007febd41d485d: test %eax,(%r10)
0x00007febd41d4860: mov (%rsp),%r8
; *iload_2
0x00007febd41d4864: mov 0x8(%rsp),%rbx
; This is loop handling code, and note how compiler inlined
; Corpus.SIZE field over here - because it's static final,
; there's no reason not to do so.
0x00007febd41d4869: cmp $0x8000,%ebp
; *if_icmpge
0x00007febd41d486f: jge 0x00007febd41d48ba
; *getfield corpus
0x00007febd41d4871: mov 0x10(%r8),%r11d
; implicit exception: dispatches to 0x00007febd41d496e
0x00007febd41d4875: mov 0xc(%r12,%r11,8),%r10d
0x00007febd41d487a: cmp %r10d,%ebp
0x00007febd41d487d: jae 0x00007febd41d48ca
0x00007febd41d487f: lea (%r12,%r11,8),%r10
; *aaload
0x00007febd41d4883: mov 0x10(%r10,%rbp,4),%r10d
; implicit exception: dispatches to 0x00007febd41d498c
0x00007febd41d4888: mov 0xc(%r12,%r10,8),%r9d
0x00007febd41d488d: cmp $0x1,%r9d
0x00007febd41d4891: jbe 0x00007febd41d48f8
; *iaload: load one value into %r11d
0x00007febd41d4893: mov 0x10(%r12,%r10,8),%r11d
; *iaload: load other value into %edx
0x00007febd41d4898: mov 0x14(%r12,%r10,8),%edx
0x00007febd41d489d: cmp %edx,%r11d
; if %r11d is lesser or equal to %edx, jump back to the start
0x00007febd41d48a0: jle 0x00007febd41d4840
0x00007febd41d48a2: mov %rbx,0x8(%rsp)
; *iload_2
0x00007febd41d48a7: mov %r8,(%rsp)
0x00007febd41d48ab: mov %rbx,%rsi
; since %r11d is bigger than %edx, set %edx to %r11d value
; because %edx is intended to carry method parameter
0x00007febd41d48ae: mov %r11d,%edx
0x00007febd41d48b1: xchg %ax,%ax
; *invokevirtual Blackhole.consume
0x00007febd41d48b3: callq 0x00007febcc725a00

So ok, there’s definitely branching here, and this is the first time i’ve seen something compiled by C2 that could be optimized further: it could use the same cmovl into %edx and drop all the jump logic, but it didn’t, so it is a perfect candidate that does pretty much the same work as Math.max.

Ready for a surprise?

benchmark              constant          84.038 ± 0.053        us/op
branch-misses constant 1.916 #/op
branches constant 237155.399 #/op
instructions constant 1269993.736 #/op
benchmark random 239.572 ± 0.085 us/op
branch-misses random 16046.829 #/op
branches random 245910.039 #/op
# just to show that there's no hidden extra work:
# number of instructions difference is under 1%
# and not in constant configuration favor
instructions random 1262072.377 #/op
benchmark 1:0 81.891 ± 0.042 us/op
branch-misses 1:0 1.843 #/op
branches 1:0 229403.104 #/op
benchmark 2:1 85.184 ± 0.045 us/op
branch-misses 2:1 1.925 #/op
branches 2:1 245776.276 #/op
benchmark 4:1 84.121 ± 0.027 us/op
branch-misses 4:1 2.130 #/op
branches 4:1 237554.534 #/op
benchmark 4:2 86.319 ± 0.043 us/op
branch-misses 4:2 1.922 #/op
branches 4:2 245733.109 #/op
benchmark 8:1 82.505 ± 0.027 us/op
branch-misses 8:1 2.121 #/op
branches 8:1 233473.128 #/op
benchmark 8:2 83.969 ± 0.019 us/op
branch-misses 8:2 1.986 #/op
branches 8:2 237592.787 #/op
benchmark 8:4 85.302 ± 0.025 us/op
branch-misses 8:4 1.848 #/op
branches 8:4 245794.427 #/op
benchmark 32:1 82.212 ± 0.038 us/op
branch-misses 32:1 2.016 #/op
branches 32:1 230430.378 #/op
benchmark 32:2 82.969 ± 0.042 us/op
branch-misses 32:2 1.836 #/op
branches 32:2 231451.211 #/op
benchmark 32:4 83.339 ± 0.027 us/op
branch-misses 32:4 2.161 #/op
branches 32:4 233532.520 #/op
benchmark 32:8 84.248 ± 0.045 us/op
branch-misses 32:8 1.932 #/op
branches 32:8 237603.894 #/op
benchmark 128:1 86.064 ± 0.029 us/op
branch-misses 128:1 256.895 #/op
branches 128:1 229514.753 #/op
benchmark 128:4 84.914 ± 0.023 us/op
branch-misses 128:4 256.904 #/op
branches 128:4 230412.359 #/op
benchmark 128:16 86.397 ± 0.083 us/op
branch-misses 128:16 263.730 #/op
branches 128:16 233504.259 #/op
benchmark 128:64 90.789 ± 0.050 us/op
branch-misses 128:64 513.002 #/op
branches 128:64 245791.021 #/op
benchmark 512:1 83.103 ± 0.043 us/op
branch-misses 512:1 65.115 #/op
branches 512:1 229461.628 #/op
benchmark 512:16 82.834 ± 0.049 us/op
branch-misses 512:16 64.885 #/op
branches 512:16 230401.189 #/op
benchmark 512:32 83.874 ± 0.029 us/op
branch-misses 512:32 128.843 #/op
branches 512:32 231439.314 #/op
benchmark 512:128 85.070 ± 0.033 us/op
branch-misses 512:128 128.910 #/op
branches 512:128 237580.527 #/op

In the majority of cases, except for the two where the branch miss count was ~512 or higher (128:64 and random), the predicting variant showed slightly better results. One of them underperforms by something like just 2%, but random input gives the CPU a really hard time.

Again, i can’t say what really goes on under the hood, but it looks like the CPU can detect small repeating patterns (the branch miss count is close to zero up to the 32:* configurations), and small deviations in input values don’t confuse it too much. There are 32,768 entries in the corpus, so for interval 512 there are at most 128 pattern changes (two changes per interval, and 64 intervals of 512 entries in 32,768 entries), which may or may not trigger a misprediction. In 512:32 and above one can see that all 128 changes produce a misprediction, while 512:16 triggers only half of that. Random input, on the opposite end, seems to fail on roughly every other sample, and i think this number is so high precisely because the CPU tries to guess and apply a pattern that simply doesn’t exist.

Again: “looks like” and “i think”. Not only am i no expert in this, i haven’t even read “Branch Prediction for Dummies” or anything of the sort.

All in all, code with branching is not as inherently evil as you may think. It just may execute differently, and, as with any other micro-optimization, you should not write non-branching code just for the sake of speed. While branching may cause some performance loss, quite often it will be negligible, and you will have more high-level things to optimize. And, as Linus wrote in the mails above, most input is predictable; you’ll be totally OK until you start processing huge amounts of truly random data, and when you do, you’ll surely know where to look for performance leaks, and this article will be of no use to you.

So you’re telling me that the variant with prediction is better than the one with CMOVcc? Why is it used then? You’re confuseful.

No, i only said that in my benchmark (which certainly doesn’t cover a lot of cases and hasn’t been verified by other people) the majority of configurations outperformed the CMOVcc version by no more than 10%, and, well, those configurations are very machine-friendly (everything is a power of two). However, truly random input resulted in nearly 200% degradation (and again, don’t forget that this number includes loop management; without it the number would be way worse). Just look at this wonderful difference again:

benchmark              constant          84.038 ± 0.053        us/op
branch-misses constant 1.916 #/op
benchmark random 239.572 ± 0.085 us/op
branch-misses random 16046.829 #/op

The point of choosing code without branching is that it is much more stable: you don’t expect it to deviate based on the input pattern. When you’re writing highly generic code that is reused across the globe, which the JVM is, you’d rather sacrifice those possible 10% (at most, and only on very rarely changing input) to guarantee that the end user never has to dig down to assembly to find out why the same Spark job runs twice as slow when the input lines are shuffled. I think that would be really confusing.

Le finale

Well, that’s pretty much it. I’ve demonstrated some facts that are probably completely useless for your career, but still fun, and, hopefully, you’ve started to feel that all the magic happening inside the black JVM box is not that far away, and everyone may dive as deep as they want, regardless of whether they have graduated from MIT.

The whole code, launch instructions and results are stored in the performance breakdown repository, and you may try doing the very same thing on your machine.

But be aware that the whole suite takes over twelve hours (intentional branching is not launched with the regular suite and takes another four hours IIRC), and you’re not allowed to browse YouTube during that time; in fact, you have to turn off everything you can, i even shut down everything but systemd and some services of kernel importance. It is also a bad idea to run benchmarks on a laptop, since things like power management may affect CPU behavior. And if you run on a virtual server, you have to watch CPU steal time and be aware that you have no control over which instructions are really executed on the underlying CPU (though chances are they’ll be the same, there’s not much sense in doing otherwise unless you’re consciously testing every architecture through QEMU). And there is that special note at the end of every JMH run which i love:

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.

There are a lot of things in play. It is not just “welp, how many CPU instructions are executed in a time slice”. There is RAM, access to which is very expensive relative to small algorithms; there are even costlier page faults; there is the data cache, there is the instruction cache, there are several cache layers of different sizes, there is the cache line size, there is false sharing, there are different CPU vendors, architectures and models, there is floating CPU frequency, there is microcode, there are micro-operations (uops) which don’t map one-to-one to instructions, there are operations that were processed but never executed, executed but never committed, operations with a warmup that speed up as they get executed, there is the OS, there are context switches, there are syscalls, there are non-syscall interrupts, there is clock skew, there may be side effects, there may be starvation, there may be contention, there may be other hardware in play.

Footnotes

[1]: A seasoned developer. A performance engineer would be more careful with conclusions.

[2]: In fact, everything is quite a bit more complex under the hood. Under some conditions both branches may be preprocessed, but let’s put that aside for a while.

[3]: That was naively written by me nearly a year ago.

[4]: Inlining is the process of inserting the code of one method directly into another, thus eliminating the expense of performing a call. Everything that is compiled to machine code performs a call this way:

  • Prepare the arguments for the function
  • Perform a jump to the instruction where the function starts
  • Save the current base pointer
  • Set the base pointer to the stack pointer value
  • Set the stack pointer so that local variables fit
  • Perform the calculation
  • Put the result in the EAX register (this may differ between calling conventions). Quite often this is a zero-cost operation because the computation is already done in EAX, as with the cmovl seen above.
  • Restore the base and stack pointers
  • Pop the return address (the address of the instruction that called the current function plus the size of that instruction with its operands, i.e. the next instruction) and perform yet another jump to it.

If the «perform calculation» step is relatively small, the management may take a decent share of the whole time the call takes. So it makes sense to just take the function body and insert it into the parent function, as if it had been written as a single function from the very start. This removes all the management work and saves execution time in exchange for code size (if one function is inlined in different places, it contributes to the overall size proportionally to the number of inlinings).
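
A contrived Java illustration of what the optimization does conceptually (the JIT performs it on machine code, not on sources, and the class and method names here are made up, but the effect is the same):

public class InliningIllustration {
    static int max(int a, int b) {
        return (a >= b) ? a : b;
    }

    // what the source says: every call to max() pays the call-management tax
    static int clampToZero(int value) {
        return max(value, 0);
    }

    // what the compiler effectively executes after inlining max():
    // the callee's body is pasted in, no call, no extra stack frame
    static int clampToZeroInlined(int value) {
        return (value >= 0) ? value : 0;
    }
}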

Since the described benchmarks have to execute identically except for the max implementations, it’s necessary to ensure all of them run under the same conditions; also, inlining the condition method would render it useless: if it is inlined, the compiler has every right to compile the whole thing just like regular Math.max.

[5]: The CompileCommand and CompileCommandFile JVM options allow one to alter compiler behavior. There’s a more modern way of doing the same, called CompilerControl, but i stuck to the old-fashioned way: i didn’t know how well the new one plays with JMH and didn’t want to lose time there.

[6]: To print compiled code as assembly, you will need the hsdis library. Here are instructions for getting it on Linux.

[7]: cmp stands for compare; it compares two values and sets bits in the flags register according to the result, while jX instructions (e.g. jle, jg) jump to another section of code if the condition encoded in their suffix is met.

[8]: C1 and C2 are HotSpot’s two JIT compilers (you may also find them under the client and server names). C1 is a basic compiler that aims to emit barely optimized code as fast as possible, while C2 takes its time to produce the most optimized version. I couldn’t find a proper official help page, but this article will probably do. Things are going to change, though.

[9]: In fact, there is a thing called a trace cache that would render at least half of the pipeline unnecessary at such a spot (if not all of it, depending on whether the implementation caches opcodes or retired uops), but i learned about it just a couple of days ago (just in time, pun intended). Still, that wouldn’t help with bigger code that doesn’t constantly cycle within a single kilobyte.

[10]: Of course, that applies only to the specific JVM on my machine, but i’m pretty sure it will be the same for all modern JVMs on every x86 machine.

[11]: In fact, you should never count a composition as the sum of its elements, but for the sake of a thought experiment it will do.
