Adventures in making small Go binaries
In this post, we'll write a small Go program to talk with a websocket server while trying to make the generated binary as small as possible. This was performed as part of one of Dyte's hiring challenges, but the methods discussed here can be applied to any Go program in general. Do note that this is just for fun and not something you should try in production!
Problem statement
So, we have a basic websocket server that accepts a connection from a client and checks whether it sent a hello message; the client, in turn, has to print out the server's response. The server-side code is written using the gorilla/websocket package and can be found here, but we won't go through it, as our focus is on making the client-side binary small.
We'll be covering various methods throughout this post, ranging from swapping out the Go compiler, using an ELF packer, and tweaking linker flags to using raw syscalls instead of the standard library.
Humble Beginnings
Let's start out by writing an obvious Go program using the x/net/websocket package:
package main
import (
"fmt"
"log"
"golang.org/x/net/websocket"
)
func main() {
url := "ws://localhost:8080/"
ws, err := websocket.Dial(url, "", url)
if err != nil {
log.Fatal(err)
}
defer ws.Close()
// Write the `hello` message
if _, err := ws.Write([]byte("hello")); err != nil {
log.Fatal(err)
}
// 512 byte buffer for storing the response
var response = make([]byte, 512)
// No. of bytes received
var received int
if received, err = ws.Read(response); err != nil {
log.Fatal(err)
}
fmt.Printf("Received: %s\n", response[:received])
}
Building and running it, we get a ~5.8 MiB binary (6084899 bytes), which is far from our goal:
$ go build -o main && ./main
Received: dyte
$ wc -c main
6084899 main
The go build command allows us to tweak the flags passed to various components like the assembler (go tool asm, -asmflags), the linker (go tool link, -ldflags) and the compiler itself (go tool compile, -gcflags). Only the linker flags are relevant to us for reducing the binary size, and this is quite widely known. In -ldflags, -s omits the symbol table and -w omits the DWARF debug information, while go build's -trimpath flag removes absolute file system paths from the binary, further reducing the size to ~3.9 MiB:
$ go build -trimpath -ldflags '-s -w' -o main && wc -c main
4128768 main
Reinventing the wheel
Now, we'll start moving into the more esoteric side of things while still sticking with our trusty Go compiler. For starters, let's abandon the x/net/websocket package and talk over the TCP socket directly, crafting the HTTP and websocket payload by hand.
Refer to this MDN document about writing websocket servers, as we won't be covering the payload in-depth here, though it is extensively commented on in the code below:
package main
import (
"fmt"
"log"
"net"
)
func main() {
httpInitMsg := []byte("GET / HTTP/1.1\r\nHost:dyte.io\r\nUpgrade:websocket\r\nConnection:Upgrade\r\nSec-WebSocket-Key:dGhlIHNhbXBsZSBub25jZQ==\r\nSec-WebSocket-Version:13\r\nConnection:Upgrade\r\n\r\n")
wsPayload := []byte{
// FIN Bit (Final fragment), OpCode (1 for text payload)
0b10000001,
// Mask Bit (Required), followed by 7 bits for length (0b0000101 == 5)
0b10000101,
// We don't set the extended payload bits as our payload is only 5 bytes
// Mask (can be any arbitrary 32 bit integer)
0b00000001,
0b00000010,
0b00000011,
0b00000100,
// Payload, the string "hello" with each character XOR'd with the
// corresponding mask bits
0b01101001, // 'h' ^ 0b00000001
0b01100111, // 'e' ^ 0b00000010
0b01101111, // 'l' ^ 0b00000011
0b01101000, // 'l' ^ 0b00000100
0b01101110, // 'o' ^ 0b00000001
}
// Establish a TCP connection to the server
conn, err := net.Dial("tcp", "localhost:8080")
if err != nil {
log.Fatal(err)
}
defer conn.Close()
// Send the initial HTTP message to start talking over the WebSocket protocol
_, err = conn.Write(httpInitMsg)
if err != nil {
log.Fatal(err)
}
response := make([]byte, 512)
// Receive the initial HTTP response
received, err := conn.Read(response)
if err != nil {
log.Fatal(err)
}
// Write the websocket frame
_, err = conn.Write(wsPayload)
if err != nil {
log.Fatal(err)
}
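// Read the reply into the remaining space in the existing buffer
n, err := conn.Read(response[received:])
if err != nil {
log.Fatal(err)
}
// Print only the bytes that were actually received
fmt.Println(string(response[:received+n]))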
}
We've made quite some progress, down to ~1.7 MiB!
$ go build -trimpath -ldflags '-s -w' -o main && ./main
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
dyte
$ wc -c main
1814528 main
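The hand-assembled bytes above follow a fixed recipe, which can be sketched as a small helper. This is only an illustration and not part of the original solution: maskedTextFrame and acceptKey are hypothetical names, the frame builder assumes payloads of at most 125 bytes (so the extended length fields are never needed), and the GUID constant is the one defined by RFC 6455. The second function shows where the Sec-WebSocket-Accept value in the server's response comes from.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// maskedTextFrame (hypothetical helper) builds a single-fragment, masked
// WebSocket text frame. It assumes len(payload) <= 125, so the length fits
// in the 7-bit field and no extended length bytes are required.
func maskedTextFrame(payload []byte, mask [4]byte) []byte {
	frame := make([]byte, 0, 2+len(mask)+len(payload))
	frame = append(frame, 0b10000001)                    // FIN bit + opcode 1 (text)
	frame = append(frame, 0b10000000|byte(len(payload))) // mask bit + 7-bit length
	frame = append(frame, mask[:]...)                    // 4-byte masking key
	for i, b := range payload {
		// Each payload byte is XOR'd with the mask byte at index i mod 4
		frame = append(frame, b^mask[i%4])
	}
	return frame
}

// acceptKey (hypothetical helper) derives the Sec-WebSocket-Accept value a
// server sends back: base64(SHA-1(key + GUID)), as described in RFC 6455.
func acceptKey(key string) string {
	const guid = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
	sum := sha1.Sum([]byte(key + guid))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	frame := maskedTextFrame([]byte("hello"), [4]byte{1, 2, 3, 4})
	fmt.Printf("% x\n", frame)
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))
}
```

With the mask 1, 2, 3, 4 this reproduces the eleven bytes of wsPayload, and the key from our handshake request yields the Sec-WebSocket-Accept header seen in the output above. Since our client never validates that header, hardcoding both the key and the frame is safe for this challenge.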
Now, we'll use UPX, an executable packer that compresses the binary and strips unneeded ELF sections. Do note that this impacts cold start times a bit due to the decompression overhead. This takes us down to ~710 KiB!
$ upx -9 main # Max compression level
Ultimate Packer for eXecutables
Copyright (C) 1996 - 2024
UPX 4.2.2 Markus Oberhumer, Laszlo Molnar & John Reiser Jan 3rd 2024
File size Ratio Format Name
-------------------- ------ ----------- -----------
1814528 -> 727684 40.10% linux/amd64 main
Packed 1 file.
One step closer to insanity
Till now, we've just switched to the standard library for talking to the server. We can go one step further and use raw syscalls to handle all the socket interactions, becoming our own standard library in a sense :p
Note that syscalls are a lower level of abstraction than libc, as libc functions such as recv internally wrap the corresponding system calls. This might not make much sense if you've never done socket programming in C, but the comments should give you enough of an idea of what's going on.
Essentially, a socket is a file descriptor that we create via the socket() syscall (identified by SYS_SOCKET here, which is syscall no. 41 on amd64), and we use it in subsequent syscalls to connect to the server and exchange data. The sockaddr_in structure describes the address and port we want to connect to, which we encode by hand here:
func main() {
httpInitMsg := []byte(...)
wsPayload := []byte{...}
// Connects to an IPv4 server at 127.0.0.1 on port 8080
sockaddr := []byte{
// family - AF_INET (0x2), padded to 16 bits
0b00000010,
0b00000000,
// port - 8080 (0x1F90), 16 bits in network (big-endian) byte order
0b00011111,
0b10010000,
// addr - 127.0.0.1, 32 bits
// 127 << 0 | 0 << 8 | 0 << 16 | 1 << 24
0b01111111,
0b00000000,
0b00000000,
0b00000001,
// 64 bits of padding
0b00000000, 0b00000000, 0b00000000, 0b00000000,
0b00000000, 0b00000000, 0b00000000, 0b00000000,
}
// The response buffer for receiving server responses
var response [135]byte
// Create an IPv4 (AF_INET), TCP (SOCK_STREAM) socket FD
// __NR_socket, AF_INET, SOCK_STREAM
var sock, _, _ = syscall.Syscall(syscall.SYS_SOCKET, 0x2, 0x1, 0)
// Connect to the server using the `sockaddr_in` structure
// __NR_connect, fd, sockaddr_in, len(sockaddr_in)
syscall.Syscall6(syscall.SYS_CONNECT, sock, uintptr(unsafe.Pointer(&sockaddr[0])), uintptr(len(sockaddr)), 0, 0, 0)
// Send the HTTP message over the socket
// __NR_sendto, fd, buf, len(buf), flags, addr, addr_len
syscall.Syscall6(syscall.SYS_SENDTO, sock, uintptr(unsafe.Pointer(&httpInitMsg[0])), uintptr(len(httpInitMsg)), 0, 0, 0)
// Receive the response
// __NR_recvfrom, fd, buf, len(buf), flags, addr, addr_len
var n, _, _ = syscall.Syscall6(syscall.SYS_RECVFROM, sock, uintptr(unsafe.Pointer(&response[0])), uintptr(len(response)), 0, 0, 0)
// Send the WebSocket frame
// __NR_sendto
syscall.Syscall6(syscall.SYS_SENDTO, sock, uintptr(unsafe.Pointer(&wsPayload[0])), uintptr(len(wsPayload)), 0, 0, 0)
// Receive the response
// __NR_recvfrom
syscall.Syscall6(syscall.SYS_RECVFROM, sock, uintptr(unsafe.Pointer(&response[n])), uintptr(len(response))-n, 0, 0, 0)
// Close the socket FD
// __NR_close
syscall.Syscall(syscall.SYS_CLOSE, sock, 0, 0)
// Write the response string to standard output
// __NR_write, STDOUT_FILENO
syscall.Syscall(syscall.SYS_WRITE, 1, uintptr(unsafe.Pointer(&response[0])), uintptr(len(response)))
}
Now, the stock binary is ~828 KiB, and with UPX, it goes down to a not-so-measly ~352 KiB:
$ go build -trimpath -ldflags '-s -w' -o main && upx -9 main
Ultimate Packer for eXecutables
Copyright (C) 1996 - 2024
UPX 4.2.2 Markus Oberhumer, Laszlo Molnar & John Reiser Jan 3rd 2024
File size Ratio Format Name
-------------------- ------ ----------- -----------
847872 -> 360692 42.54% linux/amd64 main
Packed 1 file.
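The hand-encoded sockaddr bytes in the program above can also be derived programmatically. The following is a minimal sketch, assuming a little-endian Linux target such as amd64; sockaddrIn is a hypothetical helper name, not something from the original solution. It makes the two byte orders explicit: the family field is stored in host order, while the port is stored in network (big-endian) order.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sockaddrIn (hypothetical helper) encodes an IPv4 address and port into the
// 16-byte layout of struct sockaddr_in, assuming a little-endian host.
func sockaddrIn(ip [4]byte, port uint16) [16]byte {
	var sa [16]byte
	// sin_family: AF_INET (2), stored in host (little-endian) byte order
	binary.LittleEndian.PutUint16(sa[0:2], 2)
	// sin_port: stored in network (big-endian) byte order
	binary.BigEndian.PutUint16(sa[2:4], port)
	// sin_addr: the four octets, most significant first
	copy(sa[4:8], ip[:])
	// sa[8:16] is the sin_zero padding and stays zeroed
	return sa
}

func main() {
	sa := sockaddrIn([4]byte{127, 0, 0, 1}, 8080)
	fmt.Printf("% x\n", sa[:])
}
```

Writing the bytes out by hand, as the post does, avoids pulling in encoding/binary and fmt, which is the whole point of the exercise; the helper is only meant to show where each byte comes from.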
Swapping the Go compiler
Unfortunately, that's a dead-end for how far the vanilla Go compiler can take us. We can now start experimenting with TinyGo, an alternative LLVM-based Go compiler that produces significantly smaller binaries. Spoiler alert: Our binaries will fall below the minimum size accepted by UPX for compression!
One nifty feature TinyGo provides is the -size flag, which shows the various packages that make up our final binary. This is what it shows for the 2nd example, which uses the standard library's net package:
$ tinygo build -o main -size full
code rodata data bss | flash ram | package
------------------------------- | --------------- | -------
0 45 4 18 | 49 22 | (padding)
45 20330 18 78 | 20393 96 | (unknown)
1647 3627 0 72 | 5274 72 | /usr/lib/go/src/syscall
3870 58 12 536 | 3940 548 | C musl
401 0 0 0 | 401 0 | Go interface assert
477 0 0 0 | 477 0 | Go interface method
0 5816 0 0 | 5816 0 | Go types
50 0 0 0 | 50 0 | errors
7780 161 40 0 | 7981 40 | fmt
26 0 0 0 | 26 0 | internal/bytealg
1690 21 0 0 | 1711 0 | internal/fmtsort
443 51 0 48 | 494 48 | internal/godebug
157 369 1280 0 | 1806 1280 | internal/godebugs
31 12 48 88 | 91 136 | internal/intern
155 2 0 0 | 157 0 | internal/itoa
0 57 48 0 | 105 48 | internal/oserror
486 24 0 16 | 510 16 | internal/task
336 22 0 0 | 358 0 | io/fs
2767 3 40 64 | 2810 104 | log
127 0 0 0 | 127 0 | main
27 0 0 0 | 27 0 | math
122 0 0 0 | 122 0 | math/bits
0 25 16 160 | 41 176 | net
298 16 56 24 | 370 80 | os
6272 715 96 0 | 7083 96 | reflect
8993 258 12 95 | 9263 107 | runtime
822 0 0 0 | 822 0 | sort
7280 16705 1338 0 | 25323 1338 | strconv
1822 200 0 0 | 2022 0 | sync
141 75 0 1 | 216 1 | sync/atomic
193 1455 0 8 | 1648 8 | syscall
20700 1029 184 128 | 21913 312 | time
1132 288 0 0 | 1420 0 | unicode/utf8
------------------------------- | --------------- | -------
68290 51364 3192 1336 | 122846 4528 | total
Meanwhile, for the syscalls-only example:
$ tinygo build -o main -size full
code rodata data bss | flash ram | package
------------------------------- | --------------- | -------
0 1 4 21 | 5 25 | (padding)
25 2494 8 31 | 2527 39 | (unknown)
92 0 0 40 | 92 40 | /usr/lib/go/src/syscall
2894 27 4 536 | 2925 540 | C musl
0 208 0 0 | 208 0 | Go types
365 24 0 16 | 389 16 | internal/task
268 162 0 0 | 430 0 | main
3020 135 8 91 | 3163 99 | runtime
80 75 0 1 | 155 1 | sync/atomic
------------------------------- | --------------- | -------
6744 3126 24 736 | 9894 760 | total
This makes sense, as we skip over a ton of abstractions by using raw syscalls, but we can reduce this even further with the control that TinyGo gives us! We can disable goroutines & channels, swap out the GC, pass arbitrary linker flags, etc., as can be seen in the documentation.
First off, let's get a baseline for how much TinyGo can help us:
$ tinygo build -o main -no-debug && wc -c main
18160 main
17.7 KiB, that's already 1/20th the size of our previous attempt! Let's go ahead and disable goroutines (with -scheduler none), switch to a smaller GC implementation that just leaks memory (-gc leaking), and execute a trap instruction instead of printing the panic message in case of panics (-panic trap). This brings us to 12.75 KiB:
$ tinygo build -o main -no-debug -scheduler none -gc leaking -panic trap && wc -c main
13056 main
Ripping out the GC
The leaking GC, while quite small, still includes code to request memory via syscalls, so we can provide our own allocator instead, one that hands out addresses from a fixed-size, statically allocated buffer (a global array, which is zeroed at program startup). A small buffer suffices, as only a few allocations are made in our program: the variables we declared (which end up on the heap due to Go's escape analysis, as we take pointers to them) and those made by the runtime's startup code:
--- a/main.go
+++ b/main.go
@@ -5,6 +5,26 @@ import (
"unsafe"
)
+var buffer [1024]byte
+var used uintptr = 0
+
+// We disable the go GC entirely and provide this stub for handling
+// allocations, giving out addresses from a static buffer
+// This saves many bytes over using the "leaking" GC, it is more or less
+// used exclusively by the runtime's startup code for tasks like setting up
+// the process's environment variables
+// If it crashes, run it with a clean environment (env -i ./main)
+
+//go:linkname alloc runtime.alloc
+func alloc(size uintptr, layoutPtr unsafe.Pointer) unsafe.Pointer {
+ var ptr = unsafe.Pointer(&buffer[used])
+
+ // Align for x64
+ used += ((size + 15) &^ 15)
+
+ return ptr
+}
+
func main() {
Now, building with -gc none brings us to 12.42 KiB:
$ tinygo build -o main -no-debug -scheduler none -gc none -panic trap && wc -c main
12720 main
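The allocation strategy in the stub above is a plain bump allocator. Here is a minimal sketch of the same arithmetic in ordinary Go, so that it can be run without TinyGo; the bumpAllocator type is hypothetical, it returns offsets instead of pointers, and it adds a bounds check that the stub in this post deliberately leaves out to save bytes.

```go
package main

import "fmt"

// bumpAllocator (hypothetical type) hands out regions of a fixed buffer and
// never frees them, mirroring the idea behind the alloc stub.
type bumpAllocator struct {
	buffer [1024]byte
	used   uintptr
}

// alloc reserves size bytes, rounded up to a multiple of 16, and returns the
// offset of the reserved region. ok is false when the buffer is exhausted.
func (b *bumpAllocator) alloc(size uintptr) (offset uintptr, ok bool) {
	// Adding 15 and clearing the low four bits rounds up to the next multiple of 16
	rounded := (size + 15) &^ 15
	if rounded > uintptr(len(b.buffer))-b.used {
		return 0, false
	}
	offset = b.used
	b.used += rounded
	return offset, true
}

func main() {
	var b bumpAllocator
	for _, size := range []uintptr{1, 16, 17, 135, 2048} {
		offset, ok := b.alloc(size)
		fmt.Printf("size=%4d offset=%4d ok=%v\n", size, offset, ok)
	}
}
```

Without the bounds check, an allocation past the end of the buffer would index out of range, which is presumably why the comment in the stub suggests running with a clean environment if the program crashes.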
Linker flags
As mentioned before, TinyGo allows us to pass arbitrary flags to the linker at compile time. This can be done via spec files, which tell TinyGo some information about the target architecture; some examples can be seen here. The format isn't formally documented, but all the possible keys with their defaults can be found in target.go, which we use as a reference for creating our own.
This is what our spec.json looks like; all the values are at their defaults except for ldflags, which we will now go through:
{
"llvm-target": "x86_64-unknown-linux-musl",
"cpu": "x86-64",
"goos": "linux",
"goarch": "amd64",
"build-tags": [
"amd64",
"linux"
],
"linker": "ld.lld",
"rtlib": "compiler-rt",
"libc": "musl",
"defaultstacksize": 65536,
"ldflags": [
"--gc-sections",
"--discard-all",
"--strip-all",
"--no-rosegment",
"-znorelro",
"-znognustack"
]
}
Let's refer to the lld linker's man-page for these flags:
- --gc-sections: Enables garbage collection of unused sections, explained in more detail in this blog
- --discard-all: Deletes all local symbols
- --strip-all: Removes the symbol table and debug information
- --no-rosegment: Allows the linker to combine read-only and read-execute segments of the binary
- -znorelro: Disables emitting the PT_GNU_RELRO segment, used to specify certain regions of the binary that should be marked as read-only after performing relocations. Good security measure, but we just care about trimming bytes in this post :p
- -znognustack: Disables emitting the PT_GNU_STACK segment, used to determine whether the stack should be executable or not. Again, a security measure
On top of this, we can strip even more from the compiled binary with the strip command: strip --strip-section-headers -R .comment -R .note -R .eh_frame main. This removes the section headers (used by tools like objdump to locate sections), along with the .comment section (which contains toolchain-related info), the .note section (which holds auxiliary metadata) and the .eh_frame section (used for stack unwinding, which we don't need here).
Finally, our binary is down to 6.44 KiB:
$ tinygo build -o main -no-debug -scheduler none -gc none -panic trap -target spec.json
$ strip --strip-section-headers -R .comment -R .note -R .eh_frame main
$ wc -c main
6600 main
Ripping out the standard library
6.44 KiB is still too big for a program that basically just makes a few syscalls (technically, every program fits this definition, but you get the intent), and this part gets its own section as it is basically cheating in the context of this challenge :p
So, we're still pulling in quite a bit of code from the standard library, mainly the startup code that sets up the program's execution environment before our main function is actually called; look at runtime_unix.go and scheduler_none.go for more clarity.
All we have to do is export our main function under a different name (e.g. smol_main) and tell the linker to treat that as the actual entry point, which prevents the standard library's startup code from making its way into our binary.
- In spec.json, we pass the -entry flag to the linker and drop libc completely, as it is only needed by TinyGo's standard library for certain functions:
--- a/spec.json
+++ b/spec.json
@@ -9,7 +9,6 @@
],
"linker": "ld.lld",
"rtlib": "compiler-rt",
- "libc": "musl",
"defaultstacksize": 65536,
"ldflags": [
"--gc-sections",
@@ -17,6 +16,7 @@
"--strip-all",
"--no-rosegment",
"-znorelro",
- "-znognustack"
+ "-znognustack",
+ "-entry=smol_main"
]
}
- In main.go, we make our local variables global, so that they are placed in the binary's static data sections rather than on the heap (remember the escape analysis mentioned earlier?), which allows us to get rid of our dummy GC implementation. We annotate the main function with directives to export it as smol_main and to disable bounds checking, as the panic handler for it indirectly pulls in some libc symbols:
--- a/main.go
+++ b/main.go
@@ -5,29 +5,9 @@ import (
"unsafe"
)
-var buffer [1024]byte
-var used uintptr = 0
-
-// We disable the go GC entirely and provide this stub for handling
-// allocations, giving out addresses from a static buffer
-// This saves many bytes over using the "leaking" GC, it is more or less
-// used exclusively by the runtime's startup code for tasks like setting up
-// the process's environment variables
-// If it crashes, run it with a clean environment (env -i ./main)
-
-//go:linkname alloc runtime.alloc
-func alloc(size uintptr, layoutPtr unsafe.Pointer) unsafe.Pointer {
- var ptr = unsafe.Pointer(&buffer[used])
-
- // Align for x64
- used += ((size + 15) &^ 15)
-
- return ptr
-}
-
-func main() {
- httpInitMsg := []byte("GET / HTTP/1.1\r\nHost:dyte.io\r\nUpgrade:websocket\r\nConnection:Upgrade\r\nSec-WebSocket-Key:dGhlIHNhbXBsZSBub25jZQ==\r\nSec-WebSocket-Version:13\r\nConnection:Upgrade\r\n\r\n")
- wsPayload := []byte{
+var (
+ httpInitMsg = []byte("GET / HTTP/1.1\r\nHost:dyte.io\r\nUpgrade:websocket\r\nConnection:Upgrade\r\nSec-WebSocket-Key:dGhlIHNhbXBsZSBub25jZQ==\r\nSec-WebSocket-Version:13\r\nConnection:Upgrade\r\n\r\n")
+ wsPayload = []byte{
// FIN Bit (Final fragment), OpCode (1 for text payload)
0b10000001,
// Mask Bit (Required), followed by 7 bits for length (0b0000101 == 5)
@@ -47,7 +27,7 @@ func main() {
0b01101110, // 'o' ^ 0b00000001
}
// Connects to an IPv4 server at 127.0.0.1 on port 8080
- sockaddr := []byte{
+ sockaddr = []byte{
// family - AF_INET (0x2), padded to 16 bits
0b00000010,
0b00000000,
@@ -65,8 +45,12 @@ func main() {
0b00000000, 0b00000000, 0b00000000, 0b00000000,
}
// The response buffer for receiving server responses
- var response [135]byte
+ response [135]byte
+)
+//export smol_main
+//go:nobounds
+func main() {
// Create an IPv4 (AF_INET), TCP (SOCK_STREAM) socket FD
// __NR_socket, AF_INET, SOCK_STREAM
var sock, _, _ = syscall.Syscall(syscall.SYS_SOCKET, 0x2, 0x1, 0)
@@ -98,4 +82,11 @@ func main() {
// Write the response string to standard output
// __NR_write, STDOUT_FILENO
syscall.Syscall(syscall.SYS_WRITE, 1, uintptr(unsafe.Pointer(&response[0])), uintptr(len(response)))
+
+ // Cleanly exit the program with status code 0
+ // The libc does this for us in the usual flow, that goes like so:
+ // __libc_start_main (libc) -> main (runtime_unix.go) -> main (main.go)
+ // But here, the entrypoint is in main.go itself
+ // __NR_exit, EXIT_SUCCESS
+ syscall.Syscall(syscall.SYS_EXIT, 0, 0, 0)
}
Now, we're down to just 810 bytes:
$ tinygo build -o main -scheduler none -gc none -panic trap -target spec.json \
&& strip --strip-section-headers -R .comment -R .note -R .eh_frame main \
&& wc -c main
810 main
Compiling for 32-bits
One last trick up our sleeves is to compile the binary for 32-bit x86 (i386) rather than amd64, as 32-bit binaries are significantly smaller in comparison. We'll still be able to run this binary on most 64-bit Linux systems (given that CONFIG_IA32_EMULATION is enabled in the kernel).
To do this, all we need to do is flip the target-related switches in spec.json. Note that we don't need to update the syscalls to reflect i386, as we're using constants like syscall.SYS_SOCKET rather than hardcoding the syscall numbers:
spec.json
--- a/spec.json
+++ b/spec.json
@@ -1,10 +1,10 @@
{
- "llvm-target": "x86_64-unknown-linux-musl",
- "cpu": "x86-64",
+ "llvm-target": "i386-unknown-linux-musl",
+ "cpu": "i386",
"goos": "linux",
- "goarch": "amd64",
+ "goarch": "386",
"build-tags": [
- "amd64",
+ "386",
"linux"
],
"linker": "ld.lld",
Now, our binary is just 538 bytes, and it still works!
$ tinygo build -o main -scheduler none -gc none -panic trap -target spec.json \
&& strip --strip-section-headers -R .comment -R .note -R .eh_frame main \
&& wc -c main
538 main
$ file main
main: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
$ ./main
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
dyte
Conclusion
| Attempt | Size (in bytes) | Compiler |
| --- | --- | --- |
| Using x/net/websocket | 4128768 (stripped) | Go |
| Pure standard library | 727684 (1814528 without UPX) | Go |
| Syscalls only | 360692 (847872 without UPX) | Go |
| Syscalls only | 13056 | TinyGo |
| Syscalls with dummy GC | 12720 | TinyGo |
| Syscalls with dummy GC, custom ldflags | 6600 | TinyGo |
| Syscalls with no GC, custom ldflags, custom entrypoint | 810 | TinyGo |
| Syscalls with no GC, custom ldflags, custom entrypoint, 32-bit | 538 | TinyGo |
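For reference, the KiB and MiB figures quoted throughout this post are binary units (1 KiB = 1024 bytes). A minimal sketch of the conversion follows; humanSize is a hypothetical helper, and it rounds to two decimal places, so a figure may differ in the last digit from one quoted above.

```go
package main

import "fmt"

// humanSize (hypothetical helper) formats a byte count using binary units.
func humanSize(n int) string {
	switch {
	case n >= 1<<20:
		return fmt.Sprintf("%.2f MiB", float64(n)/(1<<20))
	case n >= 1<<10:
		return fmt.Sprintf("%.2f KiB", float64(n)/(1<<10))
	default:
		return fmt.Sprintf("%d B", n)
	}
}

func main() {
	// Byte counts reported by wc -c and UPX at each step of the post
	sizes := []int{6084899, 4128768, 1814528, 847872, 18160, 13056, 12720, 6600, 810, 538}
	for _, s := range sizes {
		fmt.Printf("%8d bytes = %s\n", s, humanSize(s))
	}
}
```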
As we did not cover each topic in a lot of depth in this post, here are some handy resources:
- TinyGo docs
- ld.lld man-page
- Using raw syscalls in C
- Misc. linker-related blogs
  - Linker garbage collection
  - Explain GNU style linker options
Check out the full solution here: https://github.com/git-bruh/wscodegolf/ and, if you want to look at some of our other challenges, check out https://hacktofinale.dyte.io/