Strings In Go

As in many other programming languages, string is an important kind of type in Go. This article lists the facts about strings.

The Internal Structure Of String Types

For the standard Go compiler, the internal structure of any string type is declared like:
type _string struct {
	elements *byte // underlying bytes
	len      int   // number of bytes
}

From the declaration, we know that a string is actually a byte sequence wrapper (or a container whose elements are bytes).

Note, in Go, byte is a built-in alias of type uint8.
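
The layout above can be checked with a small sketch. It assumes the standard Go compiler, and the helper name stringWords is made up for this illustration. Whatever the number of bytes a string holds, the string value itself should occupy two machine words (one pointer and one length).

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringWords reports how many machine words a string value
// (the header, not the underlying bytes) occupies.
// The name is hypothetical and only used for this sketch.
func stringWords(s string) uintptr {
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
	short, long := "hi", "a considerably longer string value"
	fmt.Println(stringWords(short), stringWords(long)) // 2 2

	// byte is an alias of uint8, so no conversion is needed here.
	var b byte = 'h'
	var u uint8 = b
	fmt.Println(u == short[0]) // true
}
```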

Some Simple Facts About Strings

The following example reviews some facts about strings that we have learned from previous articles:
package main

import "fmt"

func main() {
	const World = "world"
	var hello = "hello"
	
	// Concat strings.
	var helloWorld = hello + " " + World
	helloWorld += "!"
	fmt.Println(helloWorld) // hello world!
	
	// Compare strings.
	fmt.Println(hello == "hello")   // true
	fmt.Println(hello > helloWorld) // false
}

The following example shows more facts about string types and values in Go:
package main

import (
	"fmt"
	"strings"
)

func main() {
	var helloWorld = "hello world!"
	
	var hello = helloWorld[:5] // substring
	// 104 is the ASCII code (and Unicode code point) of 'h'
	fmt.Println(hello[0])         // 104
	fmt.Printf("%T \n", hello[0]) // uint8
	// hello[0] = 'H'             // error: hello[0] is immutable
	// fmt.Println(&hello[0])     // error: hello[0] is not addressable
	
	fmt.Println(len(hello), len(helloWorld))          // 5 12
	fmt.Println(strings.HasPrefix(helloWorld, hello)) // true
}

String Encoding And Runes

In Go, all strings are viewed as UTF-8 encoded. The basic units in UTF-8 encoded strings are called code points. Most code points can be viewed as the characters we speak of in daily life. But some characters are each composed of several code points.

Code points are represented as rune values in Go. The built-in rune type is an alias of type int32.

At compile time, illegal UTF-8 encoded runes in string literals (and rune literals) make compilation fail. At run time, the Go runtime can't prevent the bytes stored in a string from being illegal UTF-8. In other words, a string at run time may not be well-formed UTF-8.
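
A minimal sketch of the run-time case: it builds strings from arbitrary bytes and checks them with the ValidString function in the unicode/utf8 standard package. The helper name fromBytes is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// fromBytes builds a string from arbitrary bytes at run time.
// The name is hypothetical and only used for this sketch.
func fromBytes(bs ...byte) string {
	return string(bs)
}

func main() {
	good := fromBytes('G', 'o')
	bad := fromBytes('G', 0xff, 'o') // 0xff never appears in valid UTF-8

	fmt.Println(utf8.ValidString(good)) // true
	fmt.Println(utf8.ValidString(bad))  // false
	fmt.Println(len(bad))               // 3, the bytes are kept as they are
}
```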

As mentioned above, each string is actually a byte sequence wrapper. So each rune in a string is stored as one or more bytes (up to four bytes). For example, each English code point is stored as one byte in Go strings, whereas common Chinese code points are each stored as three bytes.
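
The byte counts can be observed directly. In this sketch, the helper name encodedLen is made up for illustration, and the RuneLen function in the unicode/utf8 standard package is used as a cross-check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen returns the number of bytes used to store r in a string.
// The name is hypothetical and only used for this sketch.
func encodedLen(r rune) int {
	return len(string(r))
}

func main() {
	for _, r := range []rune{'a', 'π', '汉', '\U0001F600'} {
		// utf8.RuneLen reports the same number without building a string.
		fmt.Printf("%U: %d %d\n", r, encodedLen(r), utf8.RuneLen(r))
	}
	// Output:
	// U+0061: 1 1
	// U+03C0: 2 2
	// U+6C49: 3 3
	// U+1F600: 4 4
}
```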

String Related Conversions

In the article constants and variables, we have learned that integers can be explicitly converted to strings (but not vice versa). Here are two more string related conversion rules in Go:
  1. a string value can be explicitly converted to a byte slice, and vice versa. A byte slice is a slice whose underlying type is []byte (a.k.a. []uint8).
  2. a string value can be explicitly converted to a rune slice, and vice versa. A rune slice is a slice whose underlying type is []rune (a.k.a. []int32).
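
As a reminder of the integer case, here is a small sketch. Converting an integer to a string yields the UTF-8 encoding of the code point with that value, not its decimal text. The helper name codePointString is made up for this illustration, and the strconv functions are shown only as the usual way to get (or parse) decimal text.

```go
package main

import (
	"fmt"
	"strconv"
)

// codePointString converts an integer to the string holding
// the UTF-8 encoding of the code point with that value.
// The name is hypothetical and only used for this sketch.
func codePointString(n int) string {
	return string(rune(n))
}

func main() {
	fmt.Println(codePointString(65))     // A
	fmt.Println(codePointString(0x4e16)) // 世
	fmt.Println(strconv.Itoa(65))        // 65 (the decimal text)

	// A string can't be converted to an integer; parse it instead.
	n, err := strconv.Atoi("65")
	fmt.Println(n, err) // 65 <nil>
}
```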

In the conversions from rune slices to strings, each slice element (a rune value) will be converted to its respective UTF-8 encoding byte sequence representation. If a slice element value is outside the range of valid Unicode code points, then it will be viewed as 0xFFFD, the code point for the Unicode replacement character. 0xFFFD is represented with three bytes in UTF-8 encoding ("\uFFFD" == "\xef\xbf\xbd").

When a string is converted to a rune slice, the bytes stored in the string will be viewed as successive UTF-8 encoding byte sequence representations of many Unicode code points. Bad UTF-8 encoding representations will be converted to a rune value 0xFFFD.
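
Both replacement rules can be seen in a short sketch. The helper name roundTrip is made up for this illustration.

```go
package main

import "fmt"

// roundTrip converts s to a rune slice and back to a string.
// The name is hypothetical and only used for this sketch.
func roundTrip(s string) string {
	return string([]rune(s))
}

func main() {
	// 0x110000 is beyond the last valid code point (0x10FFFF).
	s := string([]rune{'a', 0x110000, 'b'})
	fmt.Println(s == "a\uFFFDb", len(s)) // true 5

	// The lone byte 0xff is not a valid UTF-8 encoding.
	rs := []rune("a\xffb")
	fmt.Printf("%x\n", rs) // [61 fffd 62]

	// So the round trip replaces the bad byte with three bytes.
	fmt.Println(len("a\xffb"), len(roundTrip("a\xffb"))) // 3 5
}
```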

When a string is converted to a byte slice, the resulting byte slice is just a deep copy of the underlying byte sequence of the string. When a byte slice is converted to a string, the underlying byte sequence of the resulting string is also just a deep copy of the byte slice. A memory allocation is needed to store the deep copy in each of such conversions. A deep copy is essential because slice elements are mutable but the bytes stored in strings are immutable, so a byte slice and a string can't share byte elements.
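
A minimal sketch of the consequence: modifying a byte slice never affects a string it was converted from or converted to. The helper name replaceFirst is made up for this illustration.

```go
package main

import "fmt"

// replaceFirst returns a copy of s with its first byte replaced by b.
// The name is hypothetical and only used for this sketch.
func replaceFirst(s string, b byte) string {
	bs := []byte(s) // a copy of the bytes of s
	if len(bs) > 0 {
		bs[0] = b // modifies the copy only
	}
	return string(bs) // another copy
}

func main() {
	s := "hello"
	t := replaceFirst(s, 'H')
	fmt.Println(s, t) // hello Hello

	bs := []byte("abc")
	u := string(bs)
	bs[0] = 'x' // u is not affected
	fmt.Println(u, string(bs)) // abc xbc
}
```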

Conversions between byte slices and rune slices are not supported directly in Go. We can use the following ways to achieve this purpose:
package main

import (
	"bytes"
	"unicode/utf8"
)

func Runes2Bytes(rs []rune) []byte {
	n := 0
	for _, r := range rs {
		n += utf8.RuneLen(r)
	}
	n, bs := 0, make([]byte, n)
	for _, r := range rs {
		n += utf8.EncodeRune(bs[n:], r)
	}
	return bs
}

func main() {
	s := "Color Infection is a good game."
	bs := []byte(s) // string -> []byte
	s = string(bs)  // []byte -> string
	rs := []rune(s) // string -> []rune
	s = string(rs)  // []rune -> string
	rs = bytes.Runes(bs) // []byte -> []rune
	bs = Runes2Bytes(rs) // []rune -> []byte
}

Perhaps unintentionally, the standard Go compiler also appears to allow a string to be converted to a slice type whose element type's underlying type is byte, but not vice versa.

package main

func main() {
	type MyByte byte
	str := "hello"
	bs := []MyByte(str) // compile okay
	str = string(bs)    // error: cannot use []MyByte as []byte
}

Compiler Optimizations For Conversions Between Strings And Byte Slices

As mentioned above, the underlying bytes are copied in conversions between strings and byte slices. The current standard Go compiler (v1.10) makes optimizations for some special scenarios to avoid these copies. The following example shows two such scenarios:
package main

import "fmt"

func main() {
	var str = "world"
	// Here, the []byte(str) conversion will not copy
	// the underlying bytes of str.
	for i, b := range []byte(str) {
		fmt.Println(i, ":", b)
	}
	
	key := []byte{'k', 'e', 'y'}
	m := map[string]string{}
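	// This string(key) conversion copies the bytes in key,
	// for the map needs to keep its own copy of the new key.
	m[string(key)] = "value"
	// Here, the string(key) conversion will not copy the bytes
	// in key. The optimization will be still made even if key
	// is a package-level variable.
	fmt.Println(m[string(key)]) // value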
}
Another example:
package main

import "fmt"
import "testing"

var s string
var x = []byte{1024: 'x'}
var y = []byte{1024: 'y'}

func fc() {
	// None of the below 4 conversions will
	// copy the underlying bytes of x and y.
	if string(x) != string(y) {
		s = (" " + string(x) + string(y))[1:]
	}
}

func fd() {
	// Only the two conversions in the comparison
	// will not copy the underlying bytes of x and y.
	if string(x) != string(y) {
		s = string(x) + string(y)
	}
}

func main() {
	fmt.Println(testing.AllocsPerRun(1, fc)) // 1
	fmt.Println(testing.AllocsPerRun(1, fd)) // 3
}

for-range On Strings

The for-range loop control flow applies to strings. But please note, for-range will iterate the Unicode code points (as rune values), instead of bytes, in a string. Bad UTF-8 encoding representations in the string will be interpreted as rune value 0xFFFD.

Example:
package main

import "fmt"

func main() {
	s := "éक्षिaπ汉字"
	for i, rn := range s {
		fmt.Printf("%v: 0x%x %v \n", i, rn, string(rn))
	}
}
The output of the above program:
0: 0x65 e
1: 0x301 ́
3: 0x915 क
6: 0x94d ्
9: 0x937 ष
12: 0x93f ि
15: 0x61 a
16: 0x3c0 π
18: 0x6c49 汉
21: 0x5b57 字
Please note:
  1. the index values may not be contiguous, for one code point may need more than one byte to represent.
  2. the first character, é (an e followed by a combining acute accent), is composed of two runes (3 bytes total).
  3. the second character, क्षि, is composed of four runes (12 bytes total).
  4. the English character, a, is composed of one rune (1 byte).
  5. the character, π, is composed of one rune (2 bytes).
  6. each of the two Chinese characters, 汉字, is composed of one rune (3 bytes each).
Then how to iterate bytes in a string? Do this:
package main

import "fmt"

func main() {
	s := "éक्षिaπ汉字"
	for i := 0; i < len(s); i++ {
		fmt.Printf("The byte at index %v: 0x%x \n", i, s[i])
	}
}

As you have seen, len(s) returns the number of bytes in the string s. The time complexity of len(s) is O(1). How do we get the number of runes in a string? Using for-range to iterate and count all runes is one way, and using the RuneCountInString function in the unicode/utf8 standard package is another. The efficiencies of the two ways are almost the same. Please note that the time complexities of both ways are O(n).
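
The two ways of counting runes can be compared in a short sketch. The helper name countRunes is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes counts the runes in s with a for-range loop.
// The name is hypothetical and only used for this sketch.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "aπ汉字"
	fmt.Println(len(s))                    // 9
	fmt.Println(countRunes(s))             // 4
	fmt.Println(utf8.RuneCountInString(s)) // 4

	// Each byte of a bad encoding is counted as one rune.
	bad := "a\xffb"
	fmt.Println(countRunes(bad), utf8.RuneCountInString(bad)) // 3 3
}
```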

Of course, we can also make use of the compiler optimization mentioned above to iterate bytes in a string. For the standard Go compiler, this way is a little more efficient than the index-based loop shown above.
package main

import "fmt"

func main() {
	s := "éक्षिaπ汉字"
	for i, b := range []byte(s) { // here, the underlying bytes are not copied.
		fmt.Printf("The byte at index %v: 0x%x \n", i, b)
	}
}

More String Concatenation Methods

Besides using the + operator, there are other ways to concatenate strings.

The standard Go compiler makes optimizations for string concatenations that use the + operator. So generally, using the + operator to concatenate strings is convenient and efficient if the number of the concatenated strings is known.
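
When the number of strings is not known in advance, the standard library offers some alternatives. The following sketch assumes that strings.Builder (available since Go 1.10), strings.Join and fmt.Sprintf are among the options of interest; it is not meant as a complete list, and the helper name joinWords is made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWords concatenates words with single spaces
// by using a strings.Builder.
// The name is hypothetical and only used for this sketch.
func joinWords(words []string) string {
	var b strings.Builder
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	words := []string{"hello", "brave", "new", "world"}
	fmt.Println(joinWords(words))         // hello brave new world
	fmt.Println(strings.Join(words, " ")) // hello brave new world

	// fmt.Sprintf can mix strings with values of other types.
	s := fmt.Sprintf("%s-%d", "go", 101)
	fmt.Println(s) // go-101
}
```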

Sugar: Use Strings As Byte Slices

From the article arrays, slices and maps, we have learned that we can use the built-in copy and append functions to copy and append slice elements. In fact, as a special case, if the type of the first argument of a call to either of the two functions is a byte slice, then the second argument can be a string (if the call is an append call, then the string argument must be followed by three dots ...). In other words, a string can be used as a byte slice for this special case.

Example:
package main

import "fmt"

func main() {
	hello := []byte("Hello ")
	world := "world!"
	
	// helloWorld := append(hello, []byte(world)...) // the normal way
	helloWorld := append(hello, world...)            // the sugar way
	fmt.Println(string(helloWorld))
	
	helloWorld2 := make([]byte, len(hello) + len(world))
	copy(helloWorld2, hello)
	// copy(helloWorld2[len(hello):], []byte(world)) // the normal way
	copy(helloWorld2[len(hello):], world)            // the sugar way
	fmt.Println(string(helloWorld2))
}

More About String Comparisons

Comparing two strings is in fact comparing their underlying bytes. Most compilers make some optimizations for string comparisons.

So for two equal strings, the time complexity of comparing them depends on whether or not their underlying byte sequence pointers are equal. If the two equal string values don't share the same underlying bytes, then the time complexity of comparing them is O(n), where n is the length of the two strings. Otherwise, the time complexity is O(1).

For the standard Go compiler, in a string value assignment, the destination string value and the source string value share the same underlying byte sequence in memory. So after the assignment, the cost of comparing the two strings is very small.

Example:
package main

import (
	"fmt"
	"time"
)

func main() {
	bs := make([]byte, 1<<26)
	s0 := string(bs)
	s1 := string(bs)
	s2 := s1
	
	// s0, s1 and s2 are three equal strings.
	// The underlying bytes of s0 are a copy of bs.
	// The underlying bytes of s1 are also a copy of bs.
	// The underlying bytes of s0 and s1 are two different copies of bs.
	// s2 shares the same underlying bytes with s1.

	startTime := time.Now()
	_ = s0 == s1
	duration := time.Since(startTime)
	fmt.Println("duration for (s0 == s1):", duration)
	
	startTime = time.Now()
	_ = s1 == s2
	duration = time.Since(startTime)
	fmt.Println("duration for (s1 == s2):", duration)
}
Output:
duration for (s0 == s1): 10.462075ms
duration for (s1 == s2): 136ns

1ms is 1000000ns! So please try to avoid comparing two long strings if they don't share the same underlying byte sequence.
