The Go Optimizations 101 and Go Details & Tips 101 books have been updated to Go 1.25.

If you would like to learn some Go details and facts every serveral days, please follow @zigo_101.

Pointers

Avoid unnecessary nil array pointer checks in a loop

There are some flaws in the current official standard Go compiler implementation (v1.24.n). One of them is some nil array pointer checks are not moved out of loops. Here is an example to show this flaw.

// unnecessary-checks.go
package pointers

import "testing"

const N = 1000
var a [N]int

//go:noinline
func g0(a *[N]int) {
	for i := range a {
		a[i] = i // line 12
	}
}

//go:noinline
func g1(a *[N]int) {
	_ = *a // line 18
	for i := range a {
		a[i] = i // line 20
	}
}

func Benchmark_g0(b *testing.B) {
	for i := 0; i < b.N; i++ { g0(&a) }
}

func Benchmark_g1(b *testing.B) {
	for i := 0; i < b.N; i++ { g1(&a) }
}

Let's run the benchmarks with the -S compiler option, the following outputs are got (uninterested texts are omitted):

$ go test -bench=. -gcflags=-S unnecessary-checks.go
...
0x0004 00004 (unnecessary-checks.go:12)	TESTB	AL, (AX)
0x0006 00006 (unnecessary-checks.go:12)	MOVQ	CX, (AX)(CX*8)
...
0x0000 00000 (unnecessary-checks.go:18)	TESTB	AL, (AX)
0x0002 00002 (unnecessary-checks.go:18)	XORL	CX, CX
0x0004 00004 (unnecessary-checks.go:19)	JMP	13
0x0006 00006 (unnecessary-checks.go:20)	MOVQ	CX, (AX)(CX*8)
...
Benchmark_g0-4  494.8 ns/op
Benchmark_g1-4  399.3 ns/op

From the outputs, we could find that the g1 implementation is more performant than the g0 implementation, even if the g1 implementation contains one more code line (line 18). Why? The question is answered by the outputted assembly instructions.

In the g0 implementation, the TESTB instruction is generated within the loop, whereas in the g1 implementation, the TESTB instruction is generated out of the loop. The TESTB instruction is used to check whether or not the argument a is a nil pointer. For this specified case, checking once is enough. The one more code line avoids the flaw in the compiler implementation.

There is a third implementation which is as performant as the g1 implementation. The third implementation uses a slice derived from the array pointer argument.

//go:noinline
func g2(x *[N]int) {
	a := x[:]
	for i := range a {
		a[i] = i
	}
}

Please note that the flaw might be fixed in future compiler versions.

And please note that, if the three implementation functions are inline-able, the benchmark results will change much. That is the reason why the //go:noinline compiler directives are used here. (Before Go toolchain v1.18, the //go:noinline compiler directives are actually unnecessary here. Because Go toolchain v1.18- never inlines a function containing a for-range loop.)

The case in which an array pointer is a struct field

For the cases in which an array pointer is a struct field, things are a little complex. The _ = *t.a line in the following code is useless to avoid the compiler flaw. For example, in the following code, the performance difference between the f1 function and the f0 function is small. (In fact, the f1 function might be even slower, if a NOP instruction is generated within its loop.)

type T struct {
	a *[N]int
}

//go:noinline
func f0(t *T) {
	for i := range t.a {
		t.a[i] = i
	}
}

//go:noinline
func f1(t *T) {
	_ = *t.a
	for i := range t.a {
		t.a[i] = i
	}
}

To move the nil array pointer checks out of the loop, we should copy the t.a field to a local variable, then adopt the trick introduced above:

//go:noinline
func f3(t *T) {
	a := t.a
	_ = *a
	for i := range a {
		a[i] = i
	}
}

Or simply derive a slice from the array pointer field:

//go:noinline
func f4(t *T) {
	a := t.a[:]
	for i := range a {
		a[i] = i
	}
}

The benchmark results:

Benchmark_f0-4  622.9 ns/op
Benchmark_f1-4  637.4 ns/op
Benchmark_f2-4  511.3 ns/op
Benchmark_f3-4  390.1 ns/op
Benchmark_f4-4  387.6 ns/op

The results verify our previous conclusions.

Note, the f2 function mentioned in the benchmark results is declared as

//go:noinline
func f2(t *T) {
	a := t.a
	for i := range a {
		a[i] = i
	}
}

The f2 implementation is not fast as the f3 and f4 implementations, but it is faster than the f0 and f1 implementations. However, that is another story.

If the elements of an array pointer field are not modified (only read) in the loop, then the f1 way is as performant as the f3 and f4 way.

Personally, for most cases, I think we should try to use the slice way (the f4 way) to get the best performance, because generally slices are optimized better than arrays by the official standard Go compiler.

Avoid unnecessary pointer dereferences in a loop

Sometimes, the current official standard Go compiler (v1.24.n) is not smart enough to generate assembly instructions in the most optimized way. We have to write the code in another way to get the best performance. For example, in the following code, the f function is much less performant than the g function.

// avoid-indirects_test.go
package pointers

import "testing"

//go:noinline
func f(sum *int, s []int) {
	for _, v := range s { // line 8
		*sum += v // line 9
	}
}

//go:noinline
func g(sum *int, s []int) {
	var n = *sum
	for _, v := range s { // line 16
		n += v // line 17
	}
	*sum = n
}

var s = make([]int, 1024)
var r int

func Benchmark_f(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f(&r, s)
	}
}

func Benchmark_g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		g(&r, s)
	}
}

The benchmark results (uninterested texts are omitted):

$ go test -bench=. -gcflags=-S avoid-indirects_test.go
...
0x0009 00009 (avoid-indirects_test.go:9)	MOVQ	(AX), SI
0x000c 00012 (avoid-indirects_test.go:9)	ADDQ	(BX)(DX*8), SI
0x0010 00016 (avoid-indirects_test.go:9)	MOVQ	SI, (AX)
0x0013 00019 (avoid-indirects_test.go:8)	INCQ	DX
0x0016 00022 (avoid-indirects_test.go:8)	CMPQ	CX, DX
0x0019 00025 (avoid-indirects_test.go:8)	JGT	9
...
0x000b 00011 (avoid-indirects_test.go:16)	MOVQ	(BX)(DX*8), DI
0x000f 00015 (avoid-indirects_test.go:16)	INCQ	DX
0x0012 00018 (avoid-indirects_test.go:17)	ADDQ	DI, SI
0x0015 00021 (avoid-indirects_test.go:16)	CMPQ	CX, DX
0x0018 00024 (avoid-indirects_test.go:16)	JGT	11
...
Benchmark_f-4  3024 ns/op
Benchmark_g-4   566.6 ns/op

The outputted assembly instructions show the pointer sum is dereferenced within the loop in the f function. A dereference operation is a memory operation. For the g function, the dereference operations happen out of the loop, and the instructions generated for the loop only process registers. It is much faster to let CPU instructions process registers than process memory, which is why the g function is much more performant than the f function.

This is not a compiler flaw. In fact, the f and g functions are not equivalent (though for most use cases in practice, their results are the same). For example, if they are called like the following code shows, then they return different results (thanks to skeeto@reddit for making this correction).

{
	var s = []int{1, 1, 1}
	var sum = &s[2]
	f(sum, s)
	println(*sum) // 6
}
{
	var s = []int{1, 1, 1}
	var sum = &s[2]
	g(sum, s)
	println(*sum) // 4
}

Another performant implementation for this specified case is to move the pointer parameter out of the function body (again, it is not totally equivalent to either f or g function):

//go:noinline
func h(s []int) int {
	var n = 0
	for _, v := range s {
		n += v
	}
	return n
}

func use_h(s []int) {
	var sum = new(int)
	*sum += h(s)
	...
}

The digital versions of this book are available at the following places:

Leanpub store, $9.99+.
Apple Books store, $9.99.
Amazon Kindle store, $9.99.
Google Play store, $9.99.

(more articles ↡)

The Go 101 project is hosted on GitHub. Welcome to improve Go 101 articles by submitting corrections for all kinds of mistakes, such as typos, grammar errors, wording inaccuracies, description flaws, code bugs and broken links.

Tapir, the author of Go 101, has been on writing the Go 101 series books since 2016 July. New contents will be continually added to the books (and the go101.org website) from time to time. Tapir is also an indie game developer. You can also support Go 101 by playing Tapir's games:

Color Infection (★★★★★), a physics based original casual puzzle game. 140+ levels.
Rectangle Pushers (★★★★★), an original casual puzzle game. Two modes, 104+ levels.
Let's Play With Particles, a casual action original game. Three mini games are included.

Individual donations via PayPal are also welcome.

Articles in this book:

Acknowledgments
About Go Optimizations 101
Value Parts and Value Sizes ^{(available in the paid ebooks)}

value/type sizes
memory alignments
struct padding
avoid larger copy costs

Memory Allocations
Stack and Escape Analysis ^{(available in the paid ebooks)}

escape analysis
how to control value allocation places
stacks growth and shrinkage
how to reduce stack grow times

Garbage Collection ^{(available in the paid ebooks)}

GC pacer
how to reduce GC pressure
control GC frequency

Pointers
Structs ^{(available in the paid ebooks)}

3 facts/suggestions

Arrays and Slices ^{(available in the paid ebooks)}

10+ facts/suggestions

String and Byte Slices ^{(available in the paid ebooks)}

10+ facts/suggestions

BCE (Bound Check Elimination)

the cases BCE works for
the cases BCE doesn't work for
the cases BCE works for when given hints

Maps
Channels ^{(available in the paid ebooks)}

3 facts/suggestions

Functions ^{(available in the paid ebooks)}

how to make a function inline-able
how to make a function not inline-able
pointer parameters/results vs. non-pointer ones
named results vs. anonymous ones
10+ facts/suggestions

Interfaces ^{(available in the paid ebooks)}

value boxing costs
3+ facts/suggestions