There are some flaws in the current official standard Go compiler implementation (v1.24.n). One of them is some nil array pointer checks are not moved out of loops. Here is an example to show this flaw.
// unnecessary-checks.go
package pointers
import "testing"
const N = 1000
var a [N]int
//go:noinline
func g0(a *[N]int) {
for i := range a {
a[i] = i // line 12
}
}
//go:noinline
func g1(a *[N]int) {
_ = *a // line 18
for i := range a {
a[i] = i // line 20
}
}
func Benchmark_g0(b *testing.B) {
for i := 0; i < b.N; i++ { g0(&a) }
}
func Benchmark_g1(b *testing.B) {
for i := 0; i < b.N; i++ { g1(&a) }
}
Let's run the benchmarks with the -S
compiler option, the following outputs are got (uninterested texts are omitted):
$ go test -bench=. -gcflags=-S unnecessary-checks.go
...
0x0004 00004 (unnecessary-checks.go:12) TESTB AL, (AX)
0x0006 00006 (unnecessary-checks.go:12) MOVQ CX, (AX)(CX*8)
...
0x0000 00000 (unnecessary-checks.go:18) TESTB AL, (AX)
0x0002 00002 (unnecessary-checks.go:18) XORL CX, CX
0x0004 00004 (unnecessary-checks.go:19) JMP 13
0x0006 00006 (unnecessary-checks.go:20) MOVQ CX, (AX)(CX*8)
...
Benchmark_g0-4 494.8 ns/op
Benchmark_g1-4 399.3 ns/op
From the outputs, we could find that the g1
implementation is more performant than the g0
implementation,
even if the g1
implementation contains one more code line (line 18).
Why? The question is answered by the outputted assembly instructions.
In the g0
implementation, the TESTB
instruction is generated within the loop, whereas in the g1
implementation, the TESTB
instruction is generated out of the loop.
The TESTB
instruction is used to check whether or not the argument a
is a nil pointer.
For this specified case, checking once is enough.
The one more code line avoids the flaw in the compiler implementation.
There is a third implementation which is as performant as the g1
implementation.
The third implementation uses a slice derived from the array pointer argument.
//go:noinline
func g2(x *[N]int) {
a := x[:]
for i := range a {
a[i] = i
}
}
Please note that the flaw might be fixed in future compiler versions.
And please note that, if the three implementation functions are inline-able, the benchmark results will change much.
That is the reason why the //go:noinline
compiler directives are used here.
(Before Go toolchain v1.18, the //go:noinline
compiler directives are actually unnecessary here.
Because Go toolchain v1.18- never inlines a function containing a for-range
loop.)
For the cases in which an array pointer is a struct field, things are a little complex.
The _ = *t.a
line in the following code is useless to avoid the compiler flaw.
For example, in the following code, the performance difference between the f1
function and the f0
function is small.
(In fact, the f1
function might be even slower, if a NOP
instruction is generated within its loop.)
type T struct {
a *[N]int
}
//go:noinline
func f0(t *T) {
for i := range t.a {
t.a[i] = i
}
}
//go:noinline
func f1(t *T) {
_ = *t.a
for i := range t.a {
t.a[i] = i
}
}
To move the nil array pointer checks out of the loop, we should copy the t.a
field to a local variable, then adopt the trick introduced above:
//go:noinline
func f3(t *T) {
a := t.a
_ = *a
for i := range a {
a[i] = i
}
}
Or simply derive a slice from the array pointer field:
//go:noinline
func f4(t *T) {
a := t.a[:]
for i := range a {
a[i] = i
}
}
The benchmark results:
Benchmark_f0-4 622.9 ns/op
Benchmark_f1-4 637.4 ns/op
Benchmark_f2-4 511.3 ns/op
Benchmark_f3-4 390.1 ns/op
Benchmark_f4-4 387.6 ns/op
The results verify our previous conclusions.
Note, the f2
function mentioned in the benchmark results is declared as
//go:noinline
func f2(t *T) {
a := t.a
for i := range a {
a[i] = i
}
}
The f2
implementation is not fast as the f3
and f4
implementations, but it is faster than the f0
and f1
implementations. However, that is another story.
If the elements of an array pointer field are not modified (only read) in the loop, then the f1
way is as performant as the f3
and f4
way.
Personally, for most cases, I think we should try to use the slice way (the f4
way) to get the best performance,
because generally slices are optimized better than arrays by the official standard Go compiler.
Sometimes, the current official standard Go compiler (v1.24.n) is not smart enough to generate assembly instructions in the most optimized way. We have to write the code in another way to get the best performance. For example, in the following code, the f
function is much less performant than the g
function.
// avoid-indirects_test.go
package pointers
import "testing"
//go:noinline
func f(sum *int, s []int) {
for _, v := range s { // line 8
*sum += v // line 9
}
}
//go:noinline
func g(sum *int, s []int) {
var n = *sum
for _, v := range s { // line 16
n += v // line 17
}
*sum = n
}
var s = make([]int, 1024)
var r int
func Benchmark_f(b *testing.B) {
for i := 0; i < b.N; i++ {
f(&r, s)
}
}
func Benchmark_g(b *testing.B) {
for i := 0; i < b.N; i++ {
g(&r, s)
}
}
The benchmark results (uninterested texts are omitted):
$ go test -bench=. -gcflags=-S avoid-indirects_test.go
...
0x0009 00009 (avoid-indirects_test.go:9) MOVQ (AX), SI
0x000c 00012 (avoid-indirects_test.go:9) ADDQ (BX)(DX*8), SI
0x0010 00016 (avoid-indirects_test.go:9) MOVQ SI, (AX)
0x0013 00019 (avoid-indirects_test.go:8) INCQ DX
0x0016 00022 (avoid-indirects_test.go:8) CMPQ CX, DX
0x0019 00025 (avoid-indirects_test.go:8) JGT 9
...
0x000b 00011 (avoid-indirects_test.go:16) MOVQ (BX)(DX*8), DI
0x000f 00015 (avoid-indirects_test.go:16) INCQ DX
0x0012 00018 (avoid-indirects_test.go:17) ADDQ DI, SI
0x0015 00021 (avoid-indirects_test.go:16) CMPQ CX, DX
0x0018 00024 (avoid-indirects_test.go:16) JGT 11
...
Benchmark_f-4 3024 ns/op
Benchmark_g-4 566.6 ns/op
The outputted assembly instructions show the pointer sum
is dereferenced within the loop in the f
function.
A dereference operation is a memory operation.
For the g
function, the dereference operations happen out of the loop,
and the instructions generated for the loop only process registers.
It is much faster to let CPU instructions process registers than process memory,
which is why the g
function is much more performant than the f
function.
This is not a compiler flaw. In fact, the f
and g
functions are not equivalent
(though for most use cases in practice, their results are the same).
For example, if they are called like the following code shows, then they return different results
(thanks to skeeto@reddit for making this correction).
{
var s = []int{1, 1, 1}
var sum = &s[2]
f(sum, s)
println(*sum) // 6
}
{
var s = []int{1, 1, 1}
var sum = &s[2]
g(sum, s)
println(*sum) // 4
}
Another performant implementation for this specified case is to move the pointer parameter out of the function body
(again, it is not totally equivalent to either f
or g
function):
//go:noinline
func h(s []int) int {
var n = 0
for _, v := range s {
n += v
}
return n
}
func use_h(s []int) {
var sum = new(int)
*sum += h(s)
...
}
The Go 101 project is hosted on Github. Welcome to improve Go 101 articles by submitting corrections for all kinds of mistakes, such as typos, grammar errors, wording inaccuracies, description flaws, code bugs and broken links.
If you would like to learn some Go details and facts every serveral days, please follow Go 101's official Twitter account @zigo_101.