Three new books, Go Optimizations 101, Go Details & Tips 101 and Go Generics 101 are published now. It is most cost-effective to buy all of them through this book bundle in the Leanpub book store.

Please follow Go 101 Twitter account @Go100and1 to get latest news of Go 101 books/articles/apps/libraries.

Memory Allocations

Memory blocks

The basic memory allocation units are called memory blocks. A memory block is a continuous memory segment. As aforementioned, at run time, a value part is carried on a single memory block.

A single memory block might carry multiple value parts. The size of a memory block must not be smaller than the size of any value part it carries.

When a memory block is carrying a value part, we may say the value part is referencing the memory bock.

Memory allocation operations will consume some CPU resources to find the suitable memory blocks. So the more memory blocks are created (for memory allocations), the more CPU resources are consumed. In programming, we should try to avoid unnecessary memory allocations to get better code execution performances.

Memory allocation places

Go runtime might find a memory block on the stack (one kind of memory zone) of a goroutine or the heap (the other kind of memory zone) of the whole program to carry some value parts. The finding-out process is called a (memory) allocation.

The memory management manners of stack and heap are quite different. For most cases, finding a memory block on stack is much cheaper than on heap.

Collecting stack memory blocks is also much cheaper than collecting heap memory blocks. In fact, stack memory blocks don't need to be collected. The stack of a goroutine could be actually viewed as a single memory block, and it will be collected as a whole when the goroutine exits.

On the other hand, when all the value parts being carried on/by a heap memory block are not used any more (in other words, no alive value part is still referencing the memory block), the memory block will be viewed as garbage and automatically collected eventually, during runtime garbage collection cycles, which might consume certain CPU resources (garbage collection will be talked in detail in a later chapter). Generally, the more memory blocks are allocated on heap, the larger pressure is made for garbage collection.

As heap allocations are much more expensive, only heap memory allocations contribute to the allocation metrics in Go code benchmark results. But please note that allocating on stack still has a cost, though it is often comparatively much smaller.

The escape analysis module of a Go compiler could detect some value parts will be only used by one goroutine and try to let those value parts allocated on stack at run time if certain extra conditions are satisfied. Stack memory allocations and escape analysis will be explained with more details in the next chapter.

Memory allocation scenarios

Generally, each of the following operations will make at least one allocation.

However, the official standard Go compiler makes some special code optimizations so that at certain cases, some of the above listed operations will not make allocations. These optimizations will be introduced later in this book.

Memory wasting caused by allocated memory blocks larger than needed

The sizes of different memory blocks might be different. But they are not arbitrary. In the official standard Go runtime implementation, for memory blocks allocated on heap,

So,

In other words, memory blocks are often larger than needed. The strategies are made to manage memory easily and efficiently, but might cause a bit memory wasting sometimes (yes, a trade-off).

These could be proved by the following program:

package main

import "testing"
import "unsafe"

var t *[5]int64
var s []byte

func f(b *testing.B) {
	for i := 0; i < b.N; i++ {
		t = &[5]int64{}
	}
}

func g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = make([]byte, 32769)
	}
}

func main() {
	println(unsafe.Sizeof(*t))      // 40
	rf := testing.Benchmark(f)
	println(rf.AllocedBytesPerOp()) // 48
	rg := testing.Benchmark(g)
	println(rg.AllocedBytesPerOp()) // 40960
}

Another example:

package main

import "testing"

var s = []byte{32: 'b'} // len(s) == 33
var r string

func Concat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r = string(s) + string(s)
	}
}

func main() {
	br := testing.Benchmark(Concat)
	println(br.AllocsPerOp())       // 3
	println(br.AllocedBytesPerOp()) // 176
}

There are 3 allocations made within the Concat function. Two of them are caused by the byte slice to string conversions string(s), and the sizes of the two memory blocks carrying the underlying bytes of the two result strings are both 48 (which is the smallest size class which is not smaller than 33). The third allocation is caused by the string concatenation, and the size of the result memory block is 80 (the smallest size class which is not smaller than 66). The three allocations allocate 176 (48+48+80) bytes totally. In the end, 14 bytes are wasted. And 44 (15 + 15 + 14) bytes are wasted during executing the Concat function.

In the above example, the results of the string(s) conversions are used temporarily in the string concatenation operation. By the current official standard Go compiler/runtime implementation (v1.19), the string bytes are allocated on heap (see below sections for details). After the concatenation is done, the memory blocks carrying the string bytes become into memory garbage and will be collected eventually later.

Reduce memory allocations and save memory

The less memory (block) allocations are made, the less CPU resources are consumed, and the smaller pressure is made for garbage collection.

Memory is cheap nowadays, but this is not true for the memory sold by cloud computing providers. So if we run programs on cloud servers, the more memory is saved by the Go programs, the more money is saved.

The following are some suggestons to reduce memory allocations and save memory in programming.

Avoid unnecessary allocations by allocating enough in advance

We often use the built-in append function to push some slice elements. In the statement r = append(s, elements...), if the free capacity of s is not large enough to hold all appended elements, Go runtime will allocate a new memory block to hold all the elements of the result slice r.

If the append function needs to be called multiple times to push some elements, it is best to ensure that the base slice has a large enough capacity, to avoid several unnecessary allocations in the whole pushing process.

For example, to merge some slices into one, the following shown MergeWithTwoLoops implementation is more efficient than the MergeWithOneLoop implementation, because the former one makes less allocations and copies less values.

package allocations

import "testing"

func getData() [][]int {
	return [][]int{
		{1, 2},
		{9, 10, 11},
		{6, 2, 3, 7},
		{11, 5, 7, 12, 16},
		{8, 5, 6},
	}
}

func MergeWithOneLoop(data ...[]int) []int {
	var r []int
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func MergeWithTwoLoops(data ...[]int) []int {
	n := 0
	for _, s := range data {
		n += len(s)
	}
	r := make([]int, 0, n)
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func Benchmark_MergeWithOneLoop(b *testing.B) {
	data := getData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = MergeWithOneLoop(data...)
	}
}

func Benchmark_MergeWithTwoLoops(b *testing.B) {
	data := getData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = MergeWithTwoLoops(data...)
	}
}

The benchmark results:

Benchmark_MergeWithOneLoop-4   636.6 ns/op  352 B/op  4 allocs/op
Benchmark_MergeWithTwoLoops-4  268.4 ns/op  144 B/op  1 allocs/op

The benchmark results show that allocations affect code execution performance much.

Let's print some logs to see when each of the 4 allocations happens in a MergeWithOneLoop function call.

package main

import "fmt"

func getData() [][]int {
	return [][]int{
		{1, 2},
		{9, 10, 11},
		{6, 2, 3, 7},
		{11, 5, 7, 12, 16},
		{8, 5, 6},
	}
}

const format = "Allocate from %v to %v (when append slice#%v).\n"

func MergeWithOneLoop(data [][]int) []int {
	var oldCap int
	var r []int
	for i, s := range data {
		r = append(r, s...)
		if oldCap == cap(r) {
			continue
		}
		fmt.Printf(format, oldCap, cap(r), i)
		oldCap = cap(r)
	}
	return r
}

func main() {
	MergeWithOneLoop(getData())
}

The outputs (for the official standard Go compiler v1.19):

Allocate from 0 to 2 (when append slice#0).
Allocate from 2 to 6 (when append slice#1).
Allocate from 6 to 12 (when append slice#2).
Allocate from 12 to 24 (when append slice#3).

From the outputs, we could get that only the last append call doesn't allocate.

In fact, the Merge_TwoLoops function could be faster in theory. As of the official standard Go compiler v1.19, the make call in the Merge_TwoLoop function will zero all just created elements, which is actually unnecessary. Compiler optimizations in future versions might avoid the zero operation.

BTW, the above implementation of the Merge_TwoLoops function has an imperfection. It doesn't handle the integer overflowing case. The following is a better implementation.

func Merge_TwoLoops(ss ...[][]byte) []byte {
	n := 0
	for _, s := range data {
		if k := n + len(s); k < n {
			panic("slice length overflows")
		} else {
			n = k
		}
	}
	r := make([]int, 0, n)
	...
}

Avoid allocations if possible

Allocating less is better, but allocating none is the best.

The following is another example to show the performance differences between two implementations. One of the implementations makes no allocations, the other one makes one allocation.

package allocations

import "testing"

func buildOrginalData() []int {
	s := make([]int, 1024)
	for i := range s {
		s[i] = i
	}
	return s
}

func check(v int) bool {
	return v%2 == 0
}

func FilterOneAllocation(data []int) []int {
	var r = make([]int, 0, len(data))
	for _, v := range data {
		if check(v) {
			r = append(r, v)
		}
	}
	return r
}

func FilterNoAllocations(data []int) []int {
	var k = 0
	for i, v := range data {
		if check(v) {
			data[i] = data[k]
			data[k] = v
			k++
		}
	}
	return data[:k]
}


func Benchmark_FilterOneAllocation(b *testing.B) {
	data := buildOrginalData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FilterOneAllocation(data)
	}
}

func Benchmark_FilterNoAllocations(b *testing.B) {
	data := buildOrginalData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FilterNoAllocations(data)
	}
}

The benchmark results:

Benchmark_FilterOneAllocation-4  7263 ns/op   8192 B/op  1 allocs/op
Benchmark_FilterNoAllocations-4   903.3 ns/op    0 B/op  0 allocs/op

From the benchmark results, we could get that the FilterNoAllocations implementation is more performant. (Surely, if the input data is not allowed to be modified, then we have to choose an implementation which makes allocations.)

Save memory and reduce allocations by combining memory blocks

Sometimes, we could allocate one large memory block to carry many value parts instead of allocating a small memory block for each value part. Doing this will reduce many memory allocations, so less CPU resources are consumed and GC pressure is relieved to some extend. Sometimes, doing this could decrease memory wasting, but this is not always true.

Let's view an example:

package allocations

import "testing"

const N = 100

type Book struct {
	Title  string
	Author string
	Pages  int
}

//go:noinline
func CreateBooksOnOneLargeBlock(n int) []*Book {
	books := make([]Book, n)
	pbooks := make([]*Book, n)
	for i := range pbooks {
		pbooks[i] = &books[i]
	}
	return pbooks
}

//go:noinline
func CreateBooksOnManySmallBlocks(n int) []*Book {
	books := make([]*Book, n)
	for i := range books {
		books[i] = new(Book)
	}
	return books
}

func Benchmark_CreateBooksOnOneLargeBlock(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = CreateBooksOnOneLargeBlock(N)
	}
}

func Benchmark_CreateBooksOnManySmallBlocks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = CreateBooksOnManySmallBlocks(N)
	}
}

Run the benchmarks, we get:

Benchmark_CreateOnOneLargeBlock-4     4372 ns/op  4992 B/op    2 allocs/op
Benchmark_CreateOnManySmallBlocks-4  18017 ns/op  5696 B/op  101 allocs/op

From the results, we could get that allocating many small value parts on one large memory block

  1. spends much less CPU time.
  2. consumes a bit less memory.

The first conclusion is easy to understand. Two allocation operations spend much less time than 101 allocation operations.

The second conclusion has actually already been explained before. As aforementioned, when the size of a small value (part) doesn't exactly match any memory block classes supported by the official standard Go runtime, then a bit larger memory block than needed will be allocated for the small value (part) if the small value (part) is created individually. The size of the Book type is 40 bytes (on 64-bit architectures), whereas the size of the smallest memory block size class larger than 40 is 48. So allocating a Book value individually will waste 8 bytes.

In fact, the second conclusion is only right under certain conditions. Specifically, the conclusion is not right when the value N is in the range from 820 to 852. In particular, when N == 820, the benchmark results show allocating many small value parts on one large memory block consumes 3.5% more memory.

Benchmark_CreateOnOneLargeBlock-4     30491 ns/op  47744 B/op    2 allocs/op
Benchmark_CreateOnManySmallBlocks-4  145204 ns/op  46144 B/op  821 allocs/op

Why does the CreateBooksOnOneLargeBlock function consume more memory when N == 820? Because it needs to allocate a memory block with the minimum size as 32800 (820 * 40), which is larger than the largest small memory block class 32768. So the memory block needs 5 memory pages, which total size is 40960 (8192 * 5). In other words, 8160 (40960 - 32800) bytes are wasted.

Despite it sometimes wastes more memory, generally speaking, allocating many small value parts on one large memory block is comparatively better than allocating each of them on a separated memory block. This is especially true when the life times of the small value parts are almost the same, in which case allocating many small value parts on one large memory block could often effectively avoid memory fragmentation.

Use value cache pool to avoid some allocations

Sometimes, we need to frequently allocate and discard values of a specified type from time to time. It is a good idea to reuse allocated values to avoid a large quantity of allocations.

For example, there are many non-player characters (NPC) in RTS games. A large quantity of NPCs will be spawned and destroyed from time to time in a game session. The related code is like

type NPC struct {
	name       [64]byte
	nameLen    uint16
	blood      uint16
	properties uint32
	x, y       float64
}

func SpawnNPC(name string, x, y float64) *NPC {
	var npc = newNPC()
	npc.nameLen = uint16(copy(npc.name[:], name))
	npc.x = x
	npc.y = y
	return npc
}

func newNPC() *NPC {
	return &NPC{}
}

func releaseNPC(npc *NPC) {
}

As Go supports automatic GC, the releaseNPC function may do nothing. However, such implementation will lead to a large quantity of allocations in game playing and cause large pressure for garbage collection, so that it is hard to guarentee a good game FPS (frames per second).

We could instead use a cache pool to reduce allocations, like the code shown below.

import "container/list"

var npcPool = struct {
	sync.Mutex
	*list.List
}{
	List: list.New(),
}

func newNPC() *NPC {
	npcPool.Lock()
	defer npcPool.Unlock()
	if npcPool.Len() == 0 {
		return &NPC{}
	}
	return npcPool.Remove(npcPool.Front()).(*NPC)
}

func releaseNPC(npc *NPC) {
	npcPool.Lock()
	defer npcPool.Unlock()
	*npc = NPC{} // zero the released NPC
	npcPool.PushBack(npc)
}

By using the pool (also called free list), allocations of NPC values will be reduced much, which is very helpful to keep a smooth FPS (frames per second) in game playing.

If the cache pool is used in only one goroutine, then concurrency synchronizations are not necessary in the implementation.

We could also set a max size for the pool to avoid the pool occupies too much memory.

The standard sync package provides a Pool type to provide similar functionalities but with several design differences:

Personally, I find the design of sync.Pool seldom satisfies the needs in practice. So I often use custom value cache pool implementations in my Go projects.


Index↡

The Go 101 project is hosted on Github. Welcome to improve Go 101 articles by submitting corrections for all kinds of mistakes, such as typos, grammar errors, wording inaccuracies, description flaws, code bugs and broken links.

If you would like to learn some Go details and facts every serveral days, please follow Go 101's official Twitter account @go100and1 or join Go 101 slack channels.

The digital versions of this book are available at the following places:
Tapir, the author of Go 101, has been on writing the Go 101 series books and maintaining the go101.org website since 2016 July. New contents will be continually added to the book and the website from time to time. Tapir is also an indie game developer. You can also support Go 101 by playing Tapir's games (made for both Android and iPhone/iPad):
Individual donations via PayPal are also welcome.

Index: