Three new books, Go Optimizations 101, Go Details & Tips 101 and Go Generics 101 are published now.

Pointers

Avoid unnecessary nil array pointer checks in a loop

There are some flaws in the current official standard Go compiler implementation (v1.17). One of them is some nil array pointer checks are not moved out of loops. Here is an example to show this flaw.

// unnecessary-checks.go
package pointers

import "testing"

const N = 1000
var a [N]int

//go:noinline
func g0(a *[N]int) {
	for i := range a {
		a[i] = i // line 12
	}
}

//go:noinline
func g1(a *[N]int) {
	_ = *a // line 18
	for i := range a {
		a[i] = i // line 20
	}
}

func Benchmark_g0(b *testing.B) {
	for i := 0; i < b.N; i++ { g0(&a) }
}

func Benchmark_g1(b *testing.B) {
	for i := 0; i < b.N; i++ { g1(&a) }
}

Let's run the benchmarks with the -S compiler option, the following outputs are got (uninterested texts are omitted):

$ go test -bench=. -gcflags=-S unnecessary-checks.go
...
0x0004 00004 (unnecessary-checks.go:12)	TESTB	AL, (AX)
0x0006 00006 (unnecessary-checks.go:12)	MOVQ	CX, (AX)(CX*8)
...
0x0000 00000 (unnecessary-checks.go:18)	TESTB	AL, (AX)
0x0002 00002 (unnecessary-checks.go:18)	XORL	CX, CX
0x0004 00004 (unnecessary-checks.go:19)	JMP	13
0x0006 00006 (unnecessary-checks.go:20)	MOVQ	CX, (AX)(CX*8)
...
Benchmark_g0-4  517.6 ns/op
Benchmark_g1-4  398.1 ns/op

From the outputs, we could find that the g1 implementation is more performant than the g0 implementation, even if the g1 implementation contains one more code line (line 18). Why? The question is answered by the outputted assembly instructions. In the g0 implementation, the TESTB instruction is generated within the loop, whereas in the g1 implementation, the TESTB instruction is generated out of the loop. The TESTB instruction is used to check whether or not the argument a is a nil pointer. For this specified case, checking once is enough. The one more code line avoids the flaw in the compiler implementation.

There is a third implementation which is as performant as the g1 implementation. The third implementation uses a slice derived from the array pointer argument.

//go:noinline
func g2(x *[N]int) {
	a := x[:]
	for i := range a {
		a[i] = i
	}
}

Please note that the flaw might be fixed in future compiler versions.

And please note that, if the three implementation functions are inline-able, the benchmark results will change much. That is the reason why the //go:noinline compiler directives are used here. However, we should know that, up to Go toolchain v1.17, the //go:noinline compiler directives are actually unnecessary here, because a function containing a for-range loop is not inline-able.

The case in which an array pointer is a struct field

For the cases in which an array pointer is a struct field, things are a little complex. The _ = *t.a line in the following code is useless to avoid the compiler flaw. For example, in the following code, the performance difference between the f1 function and the f0 function is small. (In fact, the f1 function might be even slower, if a NOP instruction is generated within its loop.)

type T struct {
	a *[N]int
}

//go:noinline
func f0(t *T) {
	for i := range t.a {
		t.a[i] = i
	}
}

//go:noinline
func f1(t *T) {
	_ = *t.a
	for i := range t.a {
		t.a[i] = i
	}
}

To move the nil array pointer checks out of the loop, we should copy the t.a field to a local variable, then adopt the trick introduced above:

//go:noinline
func f3(t *T) {
	a := t.a
	_ = *a
	for i := range a {
		a[i] = i
	}
}

Or simply derive a slice from the array pointer field:

//go:noinline
func f4(t *T) {
	a := t.a[:]
	for i := range a {
		a[i] = i
	}
}

The benchmark results:

Benchmark_f0-4  622.9 ns/op
Benchmark_f1-4  637.4 ns/op
Benchmark_f2-4  511.3 ns/op
Benchmark_f3-4  390.1 ns/op
Benchmark_f4-4  387.6 ns/op

The results verify our previous conclusions.

Note, the f2 function mentioned in the benchmark results is declared as

//go:noinline
func f2(t *T) {
	a := t.a
	for i := range a {
		a[i] = i
	}
}

The f2 implementation is not fast as the f3 and f4 implementations, but it is faster than the f0 and f1 implementations. However, that is another story.

If the elements of an array pointer field are not modified (only read) in the loop, then the f1 way is as performant as the f3 and f4 way.

Personally, for most cases, I think should try to use the slice way (the f4 way) to get the best performance, because generally slices are optimized better than arrays by the official standard Go compiler.

Avoid unnecessary point dereferences in a loop

Sometimes, the current official standard Go compiler (v1.17) is not smart enough to generate assembly instructions in the most optimized way. We have to write the code in another way to get the best performance. For example, in the following code, the f function is must less performant than the g function.

// avoid-indirects_test.go
package pointers

import "testing"

func f(sum *int, s []int) {
	for _, v := range s { // line 7
		*sum += v // line 8
	}
}

func g(sum *int, s []int) {
	var n = 0
	for _, v := range s { // line 14
		n += v // line 15
	}
	*sum = n
}

var s = make([]int, 1024)
var r int

func Benchmark_f(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f(&r, s)
	}
}

func Benchmark_g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		g(&r, s)
	}
}

The benchmark results (uninterested texts are omitted):

$ go test -bench=. -gcflags=-S avoid-indirects_test.go
...
0x0009 00009 (avoid-indirects_test.go:8)	MOVQ	(AX), SI
0x000c 00012 (avoid-indirects_test.go:8)	ADDQ	(BX)(DX*8), SI
0x0010 00016 (avoid-indirects_test.go:8)	MOVQ	SI, (AX)
0x0013 00019 (avoid-indirects_test.go:7)	INCQ	DX
0x0016 00022 (avoid-indirects_test.go:7)	CMPQ	CX, DX
0x0019 00025 (avoid-indirects_test.go:7)	JGT	9
...
0x000b 00011 (avoid-indirects_test.go:14)	MOVQ	(BX)(DX*8), DI
0x000f 00015 (avoid-indirects_test.go:14)	INCQ	DX
0x0012 00018 (avoid-indirects_test.go:15)	ADDQ	DI, SI
0x0015 00021 (avoid-indirects_test.go:14)	CMPQ	CX, DX
0x0018 00024 (avoid-indirects_test.go:14)	JGT	11
...
Benchmark_f-4  3024 ns/op
Benchmark_g-4   566.6 ns/op

The outputted assembly instructions show the pointer sum is dereferenced within the loop in the f function. A dereference operation is a memory operation. For the g function, the dereference operations happen out of the loop, and the instructions generated for the loop only process registers. It is much faster to let CPU instructions process registers than process memory, which is why the g function is much more performant than the f function.

Another performant implementation for this specified case is to move the pointer parameter out of the function body:

func h(s []int) int {
	var n = 0
	for _, v := range s {
		n += v
	}
	return n
}

func use_h(s []int) {
	var sum = new(int)
	*sum = h(s)
	...
}

Index↡

The Go 101 project is hosted on Github. Welcome to improve Go 101 articles by submitting corrections for all kinds of mistakes, such as typos, grammar errors, wording inaccuracies, description flaws, code bugs and broken links.

If you would like to learn some Go details and facts every serveral days, please follow Go 101's official Twitter account @go100and1 or join Go 101 slack channels.

The digital versions of this book are available at the following places:
Tapir, the author of Go 101, has been on writing the Go 101 series books and maintaining the go101.org website since 2016 July. New contents will be continually added to the book and the website from time to time. Tapir is also an indie game developer. You can also support Go 101 by playing Tapir's games (made for both Android and iPhone/iPad):
Individual donations via PayPal are also welcome.

Index: