The basic memory allocation units are called memory blocks. A memory block is a continuous memory segment. As aforementioned, at run time, a value part is carried on a single memory block.
A single memory block might carry multiple value parts. The size of a memory block must not be smaller than the size of any value part it carries.
When a memory block is carrying a value part, we may say the value part is referencing the memory block.
Memory allocation operations consume some CPU resources in finding suitable memory blocks. So the more memory blocks are created (for memory allocations), the more CPU resources are consumed. In programming, we should try to avoid unnecessary memory allocations to get better code execution performance.
The Go runtime might find a memory block on the stack (one kind of memory zone) of a goroutine or on the heap (the other kind of memory zone) of the whole program to carry some value parts. The finding process is called a (memory) allocation.
The memory management manners of stack and heap are quite different. In most cases, finding a memory block on the stack is much cheaper than on the heap.
Collecting stack memory blocks is also much cheaper than collecting heap memory blocks. In fact, stack memory blocks don't need to be collected individually: the stack of a goroutine can actually be viewed as a single memory block, which is collected as a whole when the goroutine exits.
On the other hand, when all the value parts carried by a heap memory block are no longer used (in other words, when no live value part is still referencing the memory block), the memory block will be viewed as garbage and automatically collected eventually, during runtime garbage collection cycles, which consume certain CPU resources (garbage collection will be discussed in detail in a later chapter). Generally, the more memory blocks are allocated on the heap, the more pressure is put on garbage collection.
As heap allocations are much more expensive, only heap memory allocations contribute to the allocation metrics in Go code benchmark results. But please note that allocating on the stack still has a cost, though it is often comparatively much smaller.
The escape analysis module of a Go compiler can detect that some value parts will be used by only one goroutine, and let those value parts be allocated on the stack at run time if certain extra conditions are satisfied. Stack memory allocations and escape analysis will be explained in more detail in the next chapter.
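For a quick taste, the following is a minimal sketch (the function names are made up for illustration; the escape decisions can be checked with go build -gcflags=-m, and they might change across compiler versions):

package main

// Check the escape decisions with: go build -gcflags=-m

// The address of x outlives the call, so x escapes to the heap:
// each call makes one heap allocation.
func newInt() *int {
	x := 42
	return &x
}

// The array a is used only within this call and its address never
// leaks out, so the compiler may place it on the goroutine stack:
// no heap allocation is made.
func sumLocal() int {
	var a [8]int
	for i := range a {
		a[i] = i
	}
	s := 0
	for _, v := range a {
		s += v
	}
	return s
}

func main() {
	println(*newInt(), sumLocal())
}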
Generally, each of the following operations will make at least one allocation:
- calling the built-in new function;
- calling the built-in make function;
- concatenating strings with the + operator.

However, the official standard Go compiler makes some special code optimizations so that in certain cases, some of the above listed operations will not make allocations. These optimizations will be introduced later in this book.
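For example, a byte-slice-to-string conversion used directly as a map index key is one such optimized case. The following is a minimal sketch to verify this (the expected zero count is an assumption based on the current official standard Go compiler; other compilers might behave differently):

package main

import "testing"

var m = map[string]int{"hello": 1}
var v int

func main() {
	s := []byte("hello")
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			// The conversion here is optimized to avoid an allocation.
			v = m[string(s)]
		}
	})
	println(r.AllocsPerOp()) // expected: 0
}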
The sizes of different memory blocks might be different, but they are not arbitrary. In the official standard Go runtime implementation, for memory blocks allocated on the heap,
- a set of memory block size classes is predefined, the largest of which is 32768 bytes (the smallest classes are 8, 16, 24, 32, 48, 64, ... bytes);
- memory blocks larger than 32768 bytes are composed of whole memory pages, and the memory page size used by the official standard Go runtime is 8192 bytes.

So, to allocate a heap memory block for a value part whose size falls in the range [33, 48], the size of the memory block is generally (must be at least) 48. In other words, there might be up to 15 bytes wasted (when the value size is 33). That is, memory blocks are often larger than needed. These strategies are adopted to manage memory easily and efficiently, but they might cause a bit of memory waste sometimes (yes, a trade-off).
These facts can be verified by the following program:
package main

import "testing"
import "unsafe"

var t *[5]int64
var s []byte

func f(b *testing.B) {
	for i := 0; i < b.N; i++ {
		t = &[5]int64{}
	}
}

func g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = make([]byte, 32769)
	}
}

func main() {
	println(unsafe.Sizeof(*t)) // 40
	rf := testing.Benchmark(f)
	println(rf.AllocedBytesPerOp()) // 48
	rg := testing.Benchmark(g)
	println(rg.AllocedBytesPerOp()) // 40960
}
Another example:
package main

import "testing"

var s = []byte{32: 'b'} // len(s) == 33
var r string

func Concat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r = string(s) + string(s)
	}
}

func main() {
	br := testing.Benchmark(Concat)
	println(br.AllocsPerOp())       // 3
	println(br.AllocedBytesPerOp()) // 176
}
There are 3 allocations made within the Concat function. Two of them are caused by the byte-slice-to-string conversions string(s), and the sizes of the two memory blocks carrying the underlying bytes of the two result strings are both 48 (the smallest size class not smaller than 33). The third allocation is caused by the string concatenation, and the size of the result memory block is 80 (the smallest size class not smaller than 66). The three allocations allocate 176 (48+48+80) bytes in total. In the end, 14 bytes are wasted; and during the execution of the Concat function, 44 (15 + 15 + 14) bytes are wasted.
In the above example, the results of the string(s) conversions are used only temporarily, in the string concatenation operation. By the current official standard Go compiler/runtime implementation (the 1.24 versions), the string bytes are allocated on the heap (see below sections for details). After the concatenation is done, the memory blocks carrying the bytes of the temporary strings become garbage and will eventually be collected.
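If only the final concatenation result is needed, the two temporary strings can be avoided altogether. The following is a minimal sketch using strings.Builder to do the same job with a single allocation (the expected per-op numbers are assumptions for the current official standard Go compiler/runtime):

package main

import (
	"strings"
	"testing"
)

var s = []byte{32: 'b'} // len(s) == 33
var r string

func main() {
	br := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			var sb strings.Builder
			sb.Grow(2 * len(s)) // preallocate the buffer in one allocation
			sb.Write(s)
			sb.Write(s)
			r = sb.String() // reuses the builder's buffer; no extra copy
		}
	})
	println(br.AllocsPerOp())       // expected: 1
	println(br.AllocedBytesPerOp()) // expected: 80 (the size class not smaller than 66)
}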
The fewer memory (block) allocations are made, the fewer CPU resources are consumed, and the less pressure is put on garbage collection.
Memory is cheap nowadays, but this is not true for the memory sold by cloud computing providers. So if we run programs on cloud servers, the more memory our Go programs save, the more money we save.
The following are some suggestions to reduce memory allocations and save memory in programming.
We often use the built-in append function to push elements onto a slice. In the statement r = append(s, elements...), if the free capacity of s is not large enough to hold all the appended elements, the Go runtime will allocate a new memory block to hold all the elements of the result slice r. If the append function needs to be called multiple times to push elements, it is best to ensure that the base slice has a large enough capacity, to avoid several unnecessary allocations in the whole pushing process.
For example, to merge some slices into one, the MergeWithTwoLoops implementation shown below is more efficient than the MergeWithOneLoop implementation, because the former makes fewer allocations and copies fewer values.
package allocations

import "testing"

func getData() [][]int {
	return [][]int{
		{1, 2},
		{9, 10, 11},
		{6, 2, 3, 7},
		{11, 5, 7, 12, 16},
		{8, 5, 6},
	}
}

func MergeWithOneLoop(data ...[]int) []int {
	var r []int
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func MergeWithTwoLoops(data ...[]int) []int {
	n := 0
	for _, s := range data {
		n += len(s)
	}
	r := make([]int, 0, n)
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func Benchmark_MergeWithOneLoop(b *testing.B) {
	data := getData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = MergeWithOneLoop(data...)
	}
}

func Benchmark_MergeWithTwoLoops(b *testing.B) {
	data := getData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = MergeWithTwoLoops(data...)
	}
}
The benchmark results:
Benchmark_MergeWithOneLoop-4 636.6 ns/op 352 B/op 4 allocs/op
Benchmark_MergeWithTwoLoops-4 268.4 ns/op 144 B/op 1 allocs/op
The benchmark results show that allocations affect code execution performance a lot.
Let's print some logs to see when each of the 4 allocations happens in a MergeWithOneLoop function call.
package main

import "fmt"

func getData() [][]int {
	return [][]int{
		{1, 2},
		{9, 10, 11},
		{6, 2, 3, 7},
		{11, 5, 7, 12, 16},
		{8, 5, 6},
	}
}

const format = "Allocate from %v to %v (when append slice#%v).\n"

func MergeWithOneLoop(data [][]int) []int {
	var oldCap int
	var r []int
	for i, s := range data {
		r = append(r, s...)
		if oldCap == cap(r) {
			continue
		}
		fmt.Printf(format, oldCap, cap(r), i)
		oldCap = cap(r)
	}
	return r
}

func main() {
	MergeWithOneLoop(getData())
}
The outputs (for the official standard Go compiler v1.24.n):
Allocate from 0 to 2 (when append slice#0).
Allocate from 2 to 6 (when append slice#1).
Allocate from 6 to 12 (when append slice#2).
Allocate from 12 to 24 (when append slice#3).
From the outputs, we can see that only the last append call doesn't allocate.
In fact, the MergeWithTwoLoops function could be even faster in theory. As of the official standard Go compiler version 1.24, the make call in the MergeWithTwoLoops function zeroes all the just-created elements, which is actually unnecessary here. Compiler optimizations in future versions might avoid the zeroing operation.
By the way, the above implementation of the MergeWithTwoLoops function has an imperfection: it doesn't handle the integer overflow case. The following is a better implementation.
func MergeWithTwoLoops(data ...[]int) []int {
	n := 0
	for _, s := range data {
		if k := n + len(s); k < n {
			panic("slice length overflows")
		} else {
			n = k
		}
	}
	r := make([]int, 0, n)
	...
}
Allocating less is better, but allocating none is the best.
The following is another example showing the performance differences between two implementations: one makes no allocations, the other makes one allocation.
package allocations

import "testing"

func buildOriginalData() []int {
	s := make([]int, 1024)
	for i := range s {
		s[i] = i
	}
	return s
}

func check(v int) bool {
	return v%2 == 0
}

func FilterOneAllocation(data []int) []int {
	var r = make([]int, 0, len(data))
	for _, v := range data {
		if check(v) {
			r = append(r, v)
		}
	}
	return r
}

func FilterNoAllocations(data []int) []int {
	var k = 0
	for i, v := range data {
		if check(v) {
			data[i] = data[k]
			data[k] = v
			k++
		}
	}
	return data[:k]
}

func Benchmark_FilterOneAllocation(b *testing.B) {
	data := buildOriginalData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FilterOneAllocation(data)
	}
}

func Benchmark_FilterNoAllocations(b *testing.B) {
	data := buildOriginalData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FilterNoAllocations(data)
	}
}
The benchmark results:
Benchmark_FilterOneAllocation-4 3711 ns/op 8192 B/op 1 allocs/op
Benchmark_FilterNoAllocations-4 903.3 ns/op 0 B/op 0 allocs/op
From the benchmark results, we can see that the FilterNoAllocations implementation is more performant. (Of course, if the input data is not allowed to be modified, then we have to choose an implementation that makes allocations.)
Sometimes, we can allocate one large memory block to carry many value parts instead of allocating a small memory block for each value part. Doing this reduces the number of memory allocations, so fewer CPU resources are consumed and the pressure on garbage collection is relieved to some extent. Sometimes it also decreases memory waste, though this is not always true.
Let's view an example:
package allocations

import "testing"

const N = 100

type Book struct {
	Title  string
	Author string
	Pages  int
}

//go:noinline
func CreateBooksOnOneLargeBlock(n int) []*Book {
	books := make([]Book, n)
	pbooks := make([]*Book, n)
	for i := range pbooks {
		pbooks[i] = &books[i]
	}
	return pbooks
}

//go:noinline
func CreateBooksOnManySmallBlocks(n int) []*Book {
	books := make([]*Book, n)
	for i := range books {
		books[i] = new(Book)
	}
	return books
}

func Benchmark_CreateBooksOnOneLargeBlock(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = CreateBooksOnOneLargeBlock(N)
	}
}

func Benchmark_CreateBooksOnManySmallBlocks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = CreateBooksOnManySmallBlocks(N)
	}
}
Run the benchmarks and we get:
Benchmark_CreateBooksOnOneLargeBlock-4      4372 ns/op    4992 B/op      2 allocs/op
Benchmark_CreateBooksOnManySmallBlocks-4   18017 ns/op    5696 B/op    101 allocs/op
From the results, we can see that allocating many small value parts on one large memory block
1. spends much less CPU time;
2. consumes a bit less memory.

The first conclusion is easy to understand: two allocation operations spend much less time than 101 allocation operations. The second conclusion has actually already been explained before.
As aforementioned, when the size of a small value (part) doesn't exactly match any memory block size class supported by the official standard Go runtime, a slightly larger memory block than needed will be allocated for it if the small value (part) is created individually. The size of the Book type is 40 bytes (on 64-bit architectures), whereas the smallest memory block size class not smaller than 40 is 48. So allocating a Book value individually wastes 8 bytes.
In fact, the second conclusion only holds under certain conditions. Specifically, it does not hold when the value N is in the range from 820 to 852. In particular, when N == 820, the benchmark results show that allocating many small value parts on one large memory block consumes 3.5% more memory:

Benchmark_CreateBooksOnOneLargeBlock-4      30491 ns/op    47744 B/op      2 allocs/op
Benchmark_CreateBooksOnManySmallBlocks-4   145204 ns/op    46144 B/op    821 allocs/op
Why does the CreateBooksOnOneLargeBlock function consume more memory when N == 820? Because it needs to allocate a memory block with a minimum size of 32800 (820 * 40), which is larger than the largest small memory block class, 32768. So the memory block needs 5 memory pages, whose total size is 40960 (8192 * 5). In other words, 8160 (40960 - 32800) bytes are wasted.
Although it sometimes wastes more memory, generally speaking, allocating many small value parts on one large memory block is better than allocating each of them on a separate memory block. This is especially true when the lifetimes of the small value parts are almost the same, in which case allocating them on one large memory block can often effectively avoid memory fragmentation.
Sometimes, we need to frequently allocate and discard values of a particular type. It is a good idea to reuse allocated values to avoid a large quantity of allocations.
For example, there are many non-player characters (NPCs) in RTS games. A large quantity of NPCs will be spawned and destroyed from time to time in a game session. The related code is like
type NPC struct {
	name       [64]byte
	nameLen    uint16
	blood      uint16
	properties uint32
	x, y       float64
}

func SpawnNPC(name string, x, y float64) *NPC {
	var npc = newNPC()
	npc.nameLen = uint16(copy(npc.name[:], name))
	npc.x = x
	npc.y = y
	return npc
}

func newNPC() *NPC {
	return &NPC{}
}

func releaseNPC(npc *NPC) {
}
As Go supports automatic garbage collection, the releaseNPC function may do nothing. However, such an implementation will lead to a large quantity of allocations during game play and put much pressure on garbage collection, so that it is hard to guarantee a good game FPS (frames per second).
We could instead use a cache pool to reduce allocations, like the code shown below.
import "container/list"
var npcPool = struct {
sync.Mutex
*list.List
}{
List: list.New(),
}
func newNPC() *NPC {
npcPool.Lock()
defer npcPool.Unlock()
if npcPool.Len() == 0 {
return &NPC{}
}
return npcPool.Remove(npcPool.Front()).(*NPC)
}
func releaseNPC(npc *NPC) {
npcPool.Lock()
defer npcPool.Unlock()
*npc = NPC{} // zero the released NPC
npcPool.PushBack(npc)
}
By using the pool (also called a free list), allocations of NPC values will be reduced greatly, which is very helpful to keep a smooth FPS in game play. If the cache pool is used in only one goroutine, then concurrency synchronizations are not necessary in the implementation. We could also set a max size for the pool, to prevent the pool from occupying too much memory.
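For example, a max size can be enforced naturally with a buffered channel. The following is a minimal sketch of such a bounded pool (the capacity 128 is an arbitrary illustrative value); as a bonus, the channel itself handles the needed concurrency synchronizations:

// A bounded free list built on a buffered channel. The channel
// capacity caps how many released NPCs are kept for reuse.
var boundedNPCPool = make(chan *NPC, 128)

func newNPC() *NPC {
	select {
	case npc := <-boundedNPCPool: // reuse a cached NPC
		return npc
	default: // the pool is empty
		return &NPC{}
	}
}

func releaseNPC(npc *NPC) {
	*npc = NPC{} // zero the released NPC
	select {
	case boundedNPCPool <- npc: // cache it for later reuse
	default: // the pool is full; let the GC collect npc
	}
}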
The standard sync package provides a Pool type which provides similar functionalities, but with several design differences:
- The free objects cached in a sync.Pool will be automatically garbage collected if they are found to be unused in two successive garbage collection cycles. This means the max size of a sync.Pool is dynamic and depends on the demand at run time.
- The objects put in a sync.Pool could be of different types and sizes. But the best practice is to make sure the objects put in the same sync.Pool are of the same type and have the same size.
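For comparison, the following is a minimal sketch of the NPC free list built on sync.Pool (using only its documented New/Get/Put API):

import "sync"

var npcSyncPool = sync.Pool{
	// New is called when the pool has no free object.
	New: func() any { return new(NPC) },
}

func newNPC() *NPC {
	return npcSyncPool.Get().(*NPC)
}

func releaseNPC(npc *NPC) {
	*npc = NPC{} // zero the released NPC before caching it
	npcSyncPool.Put(npc)
}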
Personally, I find the design of sync.Pool seldom satisfies the needs in practice, so I often use custom value cache pool implementations in my Go projects.