Hiring Challenge: Smallest Golang Websocket Client

Adventures in making small Go binaries

In this post, we'll write a small Go program to talk with a websocket server while trying to make the generated binary as small as possible. This was performed as part of one of Dyte's hiring challenges, but the methods discussed here can be applied to any Go program in general. Do note that this is just for fun and not something you should try in production!

Problem statement

We have a basic websocket server that accepts a connection from a client and checks whether the client sent a hello message; the client, in turn, has to print out the server's response. The server-side code is written using the gorilla/websocket package and can be found here, but we won't go through it as our focus is on making the client-side binary small.

We'll be covering various methods throughout this post, ranging from swapping out the Go compiler, using an ELF packer, and tweaking linker flags to using raw syscalls instead of the standard library.

Humble Beginnings

Let's start out by writing an obvious Go program using the x/net/websocket package:

package main

import (
	"fmt"
	"log"

	"golang.org/x/net/websocket"
)

func main() {
	url := "ws://localhost:8080/"
	ws, err := websocket.Dial(url, "", url)

	if err != nil {
		log.Fatal(err)
	}

	defer ws.Close()

	// Write the `hello` message
	if _, err := ws.Write([]byte("hello")); err != nil {
		log.Fatal(err)
	}

	// 512 byte buffer for storing the response
	var response = make([]byte, 512)

	// No. of bytes received
	var received int

	if received, err = ws.Read(response); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Received: %s\n", response[:received])
}

Building and running it, we get a ~5.8 MiB binary (6084899 bytes), which is far from our goal:

$ go build -o main && ./main
Received: dyte
$ wc -c main
6084899 main

The go build command lets us tweak the flags passed to components such as the assembler (go tool asm, -asmflags), the linker (go tool link, -ldflags), and the compiler itself (go tool compile, -gcflags). Only the linker flags are relevant to us for reducing the binary size, and they are quite widely known. In -ldflags, -s omits the symbol table and -w omits the DWARF debug information, while the -trimpath flag replaces the absolute file system paths recorded in the binary with shorter relative ones, further reducing the size to ~3.9 MiB:

$ go build -trimpath -ldflags '-s -w' -o main && wc -c main
4128768 main

Reinventing the wheel

Now, we'll start moving into the more esoteric side of things while still sticking with our trusty Go compiler. For starters, let's abandon the x/net/websocket package and talk over the TCP socket directly, crafting the HTTP and websocket payload by hand.

Refer to this MDN document about writing websocket servers, as we won't be covering the payload in-depth here, though it is extensively commented on in the code below:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	httpInitMsg := []byte("GET / HTTP/1.1\r\nHost:dyte.io\r\nUpgrade:websocket\r\nConnection:Upgrade\r\nSec-WebSocket-Key:dGhlIHNhbXBsZSBub25jZQ==\r\nSec-WebSocket-Version:13\r\nConnection:Upgrade\r\n\r\n")
	wsPayload := []byte{
		// FIN Bit (Final fragment), OpCode (1 for text payload)
		0b10000001,
		// Mask Bit (Required), followed by 7 bits for length (0b0000101 == 5)
		0b10000101,
		// We don't set the extended payload bits as our payload is only 5 bytes
		// Mask (can be any arbitrary 32 bit integer)
		0b00000001,
		0b00000010,
		0b00000011,
		0b00000100,
		// Payload, the string "hello" with each character XOR'd with the
		// corresponding mask bits
		0b01101001, // 'h' ^ 0b00000001
		0b01100111, // 'e' ^ 0b00000010
		0b01101111, // 'l' ^ 0b00000011
		0b01101000, // 'l' ^ 0b00000100
		0b01101110, // 'o' ^ 0b00000001
	}

	// Establish a TCP connection to the server
	conn, err := net.Dial("tcp", "localhost:8080")

	if err != nil {
		log.Fatal(err)
	}

	defer conn.Close()

	// Send the initial HTTP message to start talking over the WebSocket protocol
	_, err = conn.Write(httpInitMsg)

	if err != nil {
		log.Fatal(err)
	}

	response := make([]byte, 512)

	// Receive the initial HTTP response
	received, err := conn.Read(response)

	if err != nil {
		log.Fatal(err)
	}

	// Write the websocket frame
	_, err = conn.Write(wsPayload)

	if err != nil {
		log.Fatal(err)
	}

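	// Read the reply into the remaining space of the existing buffer
	n, err := conn.Read(response[received:])

	if err != nil {
		log.Fatal(err)
	}

	// Print only the bytes that were actually received
	fmt.Println(string(response[:received+n]))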
}
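
The frame bytes hand-encoded above follow directly from the framing rules, so they can also be derived with a few lines of Go. The following is a minimal sketch rather than part of the original solution: maskedTextFrame is a hypothetical helper name, and it only handles single-fragment text frames with payloads under 126 bytes.

```go
package main

import "fmt"

// maskedTextFrame (hypothetical helper) builds a single-fragment, masked
// text frame. It assumes the payload fits in the 7-bit length field.
func maskedTextFrame(payload []byte, mask [4]byte) []byte {
	if len(payload) > 125 {
		panic("payload too long for the 7-bit length field")
	}

	frame := []byte{
		0x80 | 0x1,                // FIN bit + text opcode
		0x80 | byte(len(payload)), // mask bit + payload length
	}
	frame = append(frame, mask[:]...)

	// Each payload byte is XOR'd with the mask byte at index i % 4
	for i, b := range payload {
		frame = append(frame, b^mask[i%4])
	}

	return frame
}

func main() {
	frame := maskedTextFrame([]byte("hello"), [4]byte{1, 2, 3, 4})
	fmt.Printf("% x\n", frame) // 81 85 01 02 03 04 69 67 6f 68 6e
}
```

The printed bytes match the binary literals in wsPayload; note how the fifth payload byte wraps around to the first mask byte.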

We've made quite some progress, down to ~1.7 MiB!

$ go build -trimpath -ldflags '-s -w' -o main && ./main
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

dyte
$ wc -c main
1814528 main
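
Our client never validates the Sec-WebSocket-Accept header, but the value in the output above is easy to reproduce. As a sketch (acceptKey is a hypothetical name, not something from the original code), the server concatenates the client's key with a fixed GUID defined by the WebSocket RFC, hashes it with SHA-1, and base64-encodes the digest:

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// Fixed GUID from RFC 6455 that every server appends to the client key
const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey (hypothetical helper) computes the expected
// Sec-WebSocket-Accept value for a given Sec-WebSocket-Key.
func acceptKey(clientKey string) string {
	sum := sha1.Sum([]byte(clientKey + wsGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// The key hardcoded in httpInitMsg
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))
	// s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
}
```

Since the key is hardcoded, the server's answer is always the same, which is why skipping the check costs us nothing in this challenge.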

Now, we'll use UPX, an executable packer that compresses the binary and strips unneeded ELF sections. Do note that this impacts cold start times a bit due to the decompression overhead. This takes us down to ~710 KiB!

$ upx -9 main # Max compression level
                       Ultimate Packer for eXecutables
                          Copyright (C) 1996 - 2024
UPX 4.2.2       Markus Oberhumer, Laszlo Molnar & John Reiser    Jan 3rd 2024

        File size         Ratio      Format      Name
   --------------------   ------   -----------   -----------
   1814528 ->    727684   40.10%   linux/amd64   main

Packed 1 file.

One step closer to insanity

Till now, we've just switched to the standard library for talking to the server. We can go one step further and use raw syscalls to handle all the socket interactions, becoming our own standard library in a sense :p

Note that syscalls are a lower level of abstraction than libc, as libc functions such as recv internally wrap the corresponding system calls. This might not make much sense if you've never done socket programming in C, but the comments should give you enough of an idea of what's going on.

Essentially, a socket is a file descriptor that we create via the socket() syscall (identified by SYS_SOCKET here, which is syscall no. 41 on amd64), and we use it in subsequent syscalls to connect to the server and exchange data. The sockaddr_in structure describes the address and port we want to connect to, which we encode by hand here:

package main

import (
	"syscall"
	"unsafe"
)

func main() {
	httpInitMsg := []byte(...)
	wsPayload := []byte{...}
	// Connects to an IPv4 server at 127.0.0.1 on port 8080
	sockaddr := []byte{
		// family - AF_INET (0x2), padded to 16 bits
		0b00000010,
		0b00000000,
		// port - 8080, padded to 16 bits
		0b00011111,
		0b10010000,
		// addr - 127.0.0.1, 32 bits
		// 127 << 0 | 0 << 8 | 0 << 16 | 1 << 24
		0b01111111,
		0b00000000,
		0b00000000,
		0b00000001,
		// 64 bits of padding
		0b00000000, 0b00000000, 0b00000000, 0b00000000,
		0b00000000, 0b00000000, 0b00000000, 0b00000000,
	}
	// The response buffer for receiving server responses
	var response [135]byte

	// Create a IPv4 (AF_INET), TCP (SOCK_STREAM) socket FD
	// __NR_socket, AF_INET, SOCK_STREAM
	var sock, _, _ = syscall.Syscall(syscall.SYS_SOCKET, 0x2, 0x1, 0)

	// Connect to the server using the `sockaddr_in` structure
	// __NR_connect, fd, sockaddr_in, len(sockaddr_in)
	syscall.Syscall6(syscall.SYS_CONNECT, sock, uintptr(unsafe.Pointer(&sockaddr[0])), uintptr(len(sockaddr)), 0, 0, 0)

	// Send the HTTP message over the socket
	// __NR_sendto, fd, buf, len(buf), flags, addr, addr_len
	syscall.Syscall6(syscall.SYS_SENDTO, sock, uintptr(unsafe.Pointer(&httpInitMsg[0])), uintptr(len(httpInitMsg)), 0, 0, 0)

	// Receive the response
	// __NR_recvfrom, fd, buf, len(buf), flags, addr, addr_len
	var n, _, _ = syscall.Syscall6(syscall.SYS_RECVFROM, sock, uintptr(unsafe.Pointer(&response[0])), uintptr(len(response)), 0, 0, 0)

	// Send the WebSocket frame
	// __NR_sendto
	syscall.Syscall6(syscall.SYS_SENDTO, sock, uintptr(unsafe.Pointer(&wsPayload[0])), uintptr(len(wsPayload)), 0, 0, 0)

	// Receive the response
	// __NR_recvfrom
	syscall.Syscall6(syscall.SYS_RECVFROM, sock, uintptr(unsafe.Pointer(&response[n])), uintptr(len(response))-n, 0, 0, 0)

	// Close the socket FD
	// __NR_close
	syscall.Syscall(syscall.SYS_CLOSE, sock, 0, 0)

	// Write the response string to standard output
	// __NR_write, STDOUT_FILENO
	syscall.Syscall(syscall.SYS_WRITE, 1, uintptr(unsafe.Pointer(&response[0])), uintptr(len(response)))
}
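
The hand-encoded sockaddr bytes above can be cross-checked with a small helper. This is only a sketch under stated assumptions: sockaddrIn is a hypothetical name, and the family field is written little-endian because that is the host byte order on x86, whereas the port is always in network (big-endian) byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sockaddrIn (hypothetical helper) lays out a 16-byte sockaddr_in
// for an IPv4 address and port, as expected by connect() on x86 Linux.
func sockaddrIn(ip [4]byte, port uint16) [16]byte {
	var sa [16]byte

	// sin_family: AF_INET (2), host byte order (little-endian on x86)
	binary.LittleEndian.PutUint16(sa[0:2], 2)
	// sin_port: network byte order (big-endian)
	binary.BigEndian.PutUint16(sa[2:4], port)
	// sin_addr: the four address octets in order
	copy(sa[4:8], ip[:])
	// sa[8:16] stays zero, it is just padding

	return sa
}

func main() {
	sa := sockaddrIn([4]byte{127, 0, 0, 1}, 8080)
	fmt.Printf("% x\n", sa[:8]) // 02 00 1f 90 7f 00 00 01
}
```

Port 8080 is 0x1f90, which is where the 0b00011111, 0b10010000 pair in the listing comes from.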

Now, the stock binary is ~828 KiB, and with UPX, it goes down to a not-so-measly ~352 KiB:

$ go build -trimpath -ldflags '-s -w' -o main && upx -9 main
                       Ultimate Packer for eXecutables
                          Copyright (C) 1996 - 2024
UPX 4.2.2       Markus Oberhumer, Laszlo Molnar & John Reiser    Jan 3rd 2024

        File size         Ratio      Format      Name
   --------------------   ------   -----------   -----------
    847872 ->    360692   42.54%   linux/amd64   main

Packed 1 file.
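
As an aside, the 135-byte response buffer in the syscall version is not arbitrary. One plausible reading, assuming the server replies with exactly the four header lines shown in the earlier output (CRLF line endings) followed by a 2-byte frame header and the 4-byte payload dyte, is that it is sized to fit both reads exactly. The sketch below just does that arithmetic; expectedResponseLen is a hypothetical name.

```go
package main

import "fmt"

// expectedResponseLen (hypothetical helper) adds up the handshake
// response and the unmasked server frame, assuming the exact headers
// seen in the program's output.
func expectedResponseLen() (handshakeLen, frameLen int) {
	handshake := "HTTP/1.1 101 Switching Protocols\r\n" +
		"Upgrade: websocket\r\n" +
		"Connection: Upgrade\r\n" +
		"Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=\r\n" +
		"\r\n"

	// FIN + text opcode, unmasked length 4, then the payload
	frame := []byte{0x81, 0x04, 'd', 'y', 't', 'e'}

	return len(handshake), len(frame)
}

func main() {
	h, f := expectedResponseLen()
	fmt.Println(h, f, h+f) // 129 6 135
}
```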

Swapping the Go compiler

Unfortunately, that's a dead-end for how far the vanilla Go compiler can take us. We can now start experimenting with TinyGo, an alternative LLVM-based Go compiler that produces significantly smaller binaries. Spoiler alert: Our binaries will fall below the minimum size accepted by UPX for compression!

One nifty feature TinyGo provides is the -size flag, which shows the various packages that make up our final binary. This is what it shows in the 2nd example that uses the standard library's net package:

$ tinygo build -o main -size full
   code  rodata    data     bss |   flash     ram | package
------------------------------- | --------------- | -------
      0      45       4      18 |      49      22 | (padding)
     45   20330      18      78 |   20393      96 | (unknown)
   1647    3627       0      72 |    5274      72 | /usr/lib/go/src/syscall
   3870      58      12     536 |    3940     548 | C musl
    401       0       0       0 |     401       0 | Go interface assert
    477       0       0       0 |     477       0 | Go interface method
      0    5816       0       0 |    5816       0 | Go types
     50       0       0       0 |      50       0 | errors
   7780     161      40       0 |    7981      40 | fmt
     26       0       0       0 |      26       0 | internal/bytealg
   1690      21       0       0 |    1711       0 | internal/fmtsort
    443      51       0      48 |     494      48 | internal/godebug
    157     369    1280       0 |    1806    1280 | internal/godebugs
     31      12      48      88 |      91     136 | internal/intern
    155       2       0       0 |     157       0 | internal/itoa
      0      57      48       0 |     105      48 | internal/oserror
    486      24       0      16 |     510      16 | internal/task
    336      22       0       0 |     358       0 | io/fs
   2767       3      40      64 |    2810     104 | log
    127       0       0       0 |     127       0 | main
     27       0       0       0 |      27       0 | math
    122       0       0       0 |     122       0 | math/bits
      0      25      16     160 |      41     176 | net
    298      16      56      24 |     370      80 | os
   6272     715      96       0 |    7083      96 | reflect
   8993     258      12      95 |    9263     107 | runtime
    822       0       0       0 |     822       0 | sort
   7280   16705    1338       0 |   25323    1338 | strconv
   1822     200       0       0 |    2022       0 | sync
    141      75       0       1 |     216       1 | sync/atomic
    193    1455       0       8 |    1648       8 | syscall
  20700    1029     184     128 |   21913     312 | time
   1132     288       0       0 |    1420       0 | unicode/utf8
------------------------------- | --------------- | -------
  68290   51364    3192    1336 |  122846    4528 | total

Meanwhile, for the syscalls-only example:

$ tinygo build -o main -size full
   code  rodata    data     bss |   flash     ram | package
------------------------------- | --------------- | -------
      0       1       4      21 |       5      25 | (padding)
     25    2494       8      31 |    2527      39 | (unknown)
     92       0       0      40 |      92      40 | /usr/lib/go/src/syscall
   2894      27       4     536 |    2925     540 | C musl
      0     208       0       0 |     208       0 | Go types
    365      24       0      16 |     389      16 | internal/task
    268     162       0       0 |     430       0 | main
   3020     135       8      91 |    3163      99 | runtime
     80      75       0       1 |     155       1 | sync/atomic
------------------------------- | --------------- | -------
   6744    3126      24     736 |    9894     760 | total

This makes sense as we skip over a ton of abstractions by using raw syscalls, but we can reduce this even further with the control that TinyGo gives us! We can disable goroutines & channels, swap out the GC, pass arbitrary linker flags, etc., as can be seen in the documentation.

First off, let's get a baseline for how much TinyGo can help us:

$ tinygo build -o main -no-debug && wc -c main
18160 main

17.7 KiB, which is already about 1/20th the size of our previous attempt! Let's go ahead and disable goroutines (with -scheduler none), switch to a smaller GC implementation that just leaks memory (-gc leaking), and execute a trap instruction instead of printing the panic message in case of panics (-panic trap). This brings us down to 12.75 KiB:

$ tinygo build -o main -no-debug -scheduler none -gc leaking -panic trap && wc -c main
13056 main

Ripping out the GC

The leaking GC, while quite small, still includes code to request memory via syscalls, so we can provide our own allocator that hands out addresses from a fixed-size static buffer (a global array, so no memory has to be requested at runtime). A small buffer is enough as only a few allocations are made in our program: the variables we declared (which escape to the heap under Go's escape analysis, as we take pointers to them) and the runtime startup code:

--- a/main.go
+++ b/main.go
@@ -5,6 +5,26 @@ import (
        "unsafe"
 )

+var buffer [1024]byte
+var used uintptr = 0
+
+// We disable the go GC entirely and provide this stub for handling
+// allocations, giving out addresses from a static buffer
+// This saves many bytes over using the "leaking" GC, it is more or less
+// used exclusively by the runtime's startup code for tasks like setting up
+// the process's environment variables
+// If it crashes, run it with a clean environment (env -i ./main)
+
+//go:linkname alloc runtime.alloc
+func alloc(size uintptr, layoutPtr unsafe.Pointer) unsafe.Pointer {
+       var ptr = unsafe.Pointer(&buffer[used])
+
+       // Align for x64
+       used += ((size + 15) &^ 15)
+
+       return ptr
+}
+
 func main() {
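
The alignment arithmetic in the stub above is easier to see in isolation. Here is a minimal sketch in ordinary Go (no linkname tricks), assuming a hypothetical bump type that hands out offsets into a fixed array; unlike the stub, it also reports when the buffer is exhausted:

```go
package main

import "fmt"

// bump is a hypothetical bump allocator over a fixed 1024-byte buffer
type bump struct {
	buf  [1024]byte
	used uintptr
}

// alloc reserves size bytes, rounded up to the next multiple of 16,
// and returns the offset of the reservation within buf.
func (b *bump) alloc(size uintptr) (offset uintptr, ok bool) {
	// (size + 15) &^ 15 clears the low 4 bits after rounding up
	aligned := (size + 15) &^ 15

	if b.used+aligned > uintptr(len(b.buf)) {
		return 0, false
	}

	offset = b.used
	b.used += aligned

	return offset, true
}

func main() {
	var b bump

	for _, size := range []uintptr{1, 17, 8} {
		off, ok := b.alloc(size)
		fmt.Println(size, off, ok)
	}
	// 1 0 true
	// 17 16 true
	// 8 48 true

	fmt.Println(b.used) // 64
}
```

Nothing is ever freed, which is fine for a program that makes a handful of allocations and then exits.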

Now, building with -gc none - 12.42 KiB:

$ tinygo build -o main -no-debug -scheduler none -gc none -panic trap && wc -c main
12720 main

Linker flags

As mentioned before, TinyGo allows us to pass arbitrary flags to the linker at compile time. This can be done via spec files, which tell TinyGo some information about the target architecture; some examples can be seen here. The format itself is not really covered in the documentation, but all the possible keys with their defaults can be found in target.go, which we use as a reference for creating our own.

This is what our spec.json looks like. All the values are at their defaults except for ldflags, which we will now go through:

{
  "llvm-target": "x86_64-unknown-linux-musl",
  "cpu": "x86-64",
  "goos": "linux",
  "goarch": "amd64",
  "build-tags": [
    "amd64",
    "linux"
  ],
  "linker": "ld.lld",
  "rtlib": "compiler-rt",
  "libc": "musl",
  "defaultstacksize": 65536,
  "ldflags": [
    "--gc-sections",
    "--discard-all",
    "--strip-all",
    "--no-rosegment",
    "-znorelro",
    "-znognustack"
  ]
}

Let's refer to the lld linker's man-page for these flags:

  • --gc-sections: Enables garbage collection of unused sections, explained in more detail in this blog
  • --discard-all: Deletes all local symbols
  • --strip-all: Removes the symbol table and debug information
  • --no-rosegment: Allows the linker to combine read-only and read-execute segments of the binary
  • -znorelro: Disables emitting the PT_GNU_RELRO segment, used to specify certain regions of the binary that should be marked as read-only after performing relocations. Good security measure, but we just care about trimming bytes in this post :p
  • -znognustack: Disables emitting the PT_GNU_STACK segment, used to determine whether the stack should be executable or not. Again, a security measure

On top of this, we can strip more sections from the compiled binary with the command strip --strip-section-headers -R .comment -R .note -R .eh_frame main. This removes the section headers (used by tools like objdump to locate sections), along with the .comment section (which contains toolchain-related info), the .note section, and the .eh_frame section (used for stack unwinding, which we don't need here).

Finally, our binary is down to 6.44 KiB:

$ tinygo build -o main -no-debug -scheduler none -gc none -panic trap -target spec.json
$ strip --strip-section-headers -R .comment -R .note -R .eh_frame main
$ wc -c main
6600 main

Ripping out the standard library

6.44 KiB is still too big for a program that basically just makes a few syscalls (technically, every program fits this definition, but you get the intent), and this part gets its own section as it is basically cheating in the context of this challenge :p

So, we're still pulling in quite a bit of code from the standard library, mainly the startup code that sets up the program's execution environment before our main function is actually called. Look at runtime_unix.go and scheduler_none.go for more clarity.

All we have to do is export our main function with a different name (e.g. smol_main), and tell the linker to treat that as the actual entry point, which prevents the standard library startup code from making its way into our binary.

  • In spec.json, we pass the entry flag to the linker, and drop libc completely, as it is only needed by TinyGo's standard library for certain functions
--- a/spec.json
+++ b/spec.json
@@ -9,7 +9,6 @@
   ],
   "linker": "ld.lld",
   "rtlib": "compiler-rt",
-  "libc": "musl",
   "defaultstacksize": 65536,
   "ldflags": [
     "--gc-sections",
@@ -17,6 +16,7 @@
     "--strip-all",
     "--no-rosegment",
     "-znorelro",
-    "-znognustack"
+    "-znognustack",
+    "-entry=smol_main"
   ]
 }
  • In main.go, we make our local variables global, allowing them to be placed in the binary's static data rather than on the heap (remember the escape analysis mentioned earlier?), which in turn lets us get rid of our dummy GC implementation. We annotate the main function with directives to export it as smol_main and to disable bounds checking, as the panic handler for bounds checks indirectly pulls in some libc symbols.
--- a/main.go
+++ b/main.go
@@ -5,29 +5,9 @@ import (
        "unsafe"
 )

-var buffer [1024]byte
-var used uintptr = 0
-
-// We disable the go GC entirely and provide this stub for handling
-// allocations, giving out addresses from a static buffer
-// This saves many bytes over using the "leaking" GC, it is more or less
-// used exclusively by the runtime's startup code for tasks like setting up
-// the process's environment variables
-// If it crashes, run it with a clean environment (env -i ./main)
-
-//go:linkname alloc runtime.alloc
-func alloc(size uintptr, layoutPtr unsafe.Pointer) unsafe.Pointer {
-       var ptr = unsafe.Pointer(&buffer[used])
-
-       // Align for x64
-       used += ((size + 15) &^ 15)
-
-       return ptr
-}
-
-func main() {
-       httpInitMsg := []byte("GET / HTTP/1.1\r\nHost:dyte.io\r\nUpgrade:websocket\r\nConnection:Upgrade\r\nSec-WebSocket-Key:dGhlIHNhbXBsZSBub25jZQ==\r\nSec-WebSocket-Version:13\r\nConnection:Upgrade\r\n\r\n")
-       wsPayload := []byte{
+var (
+       httpInitMsg = []byte("GET / HTTP/1.1\r\nHost:dyte.io\r\nUpgrade:websocket\r\nConnection:Upgrade\r\nSec-WebSocket-Key:dGhlIHNhbXBsZSBub25jZQ==\r\nSec-WebSocket-Version:13\r\nConnection:Upgrade\r\n\r\n")
+       wsPayload   = []byte{
                // FIN Bit (Final fragment), OpCode (1 for text payload)
                0b10000001,
                // Mask Bit (Required), followed by 7 bits for length (0b0000101 == 5)
@@ -47,7 +27,7 @@ func main() {
                0b01101110, // 'o' ^ 0b00000001
        }
        // Connects to an IPv4 server at 127.0.0.1 on port 8080
-       sockaddr := []byte{
+       sockaddr = []byte{
                // family - AF_INET (0x2), padded to 16 bits
                0b00000010,
                0b00000000,
@@ -65,8 +45,12 @@ func main() {
                0b00000000, 0b00000000, 0b00000000, 0b00000000,
        }
        // The response buffer for receiving server responses
-       var response [135]byte
+       response [135]byte
+)

+//export smol_main
+//go:nobounds
+func main() {
        // Create a IPv4 (AF_INET), TCP (SOCK_STREAM) socket FD
        // __NR_socket, AF_INET, SOCK_STREAM
        var sock, _, _ = syscall.Syscall(syscall.SYS_SOCKET, 0x2, 0x1, 0)
@@ -98,4 +82,11 @@ func main() {
        // Write the response string to standard output
        // __NR_write, STDOUT_FILENO
        syscall.Syscall(syscall.SYS_WRITE, 1, uintptr(unsafe.Pointer(&response[0])), uintptr(len(response)))
+
+       // Cleanly exit the program with status code 0
+       // The libc does this for us in the usual flow, that goes like so:
+       //   __libc_start_main (libc) -> main (runtime_unix.go) -> main (main.go)
+       // But here, the entrypoint is in main.go itself
+       // __NR_exit, EXIT_SUCCESS
+       syscall.Syscall(syscall.SYS_EXIT, 0, 0, 0)
 }

Now, we're down to just 810 bytes:

$ tinygo build -o main -scheduler none -gc none -panic trap -target spec.json \
    && strip --strip-section-headers -R .comment -R .note -R .eh_frame main \
    && wc -c main
810 main

Compiling for 32-bits

One last trick up our sleeves is to compile the binary for 32-bit x86 (i386) rather than amd64, as 32-bit binaries are significantly smaller in comparison. We'll still be able to run this binary on most 64-bit Linux systems, provided that CONFIG_IA32_EMULATION is enabled in the kernel.

To do this, all we need to do is flip the target-related switches in spec.json. Note that we don't need to update syscalls to reflect i386 as we're using constants like syscall.SYS_SOCKET rather than hardcoding the syscall numbers:

spec.json

--- a/spec.json
+++ b/spec.json
@@ -1,10 +1,10 @@
 {
-  "llvm-target": "x86_64-unknown-linux-musl",
-  "cpu": "x86-64",
+  "llvm-target": "i386-unknown-linux-musl",
+  "cpu": "i386",
   "goos": "linux",
-  "goarch": "amd64",
+  "goarch": "386",
   "build-tags": [
-    "amd64",
+    "386",
     "linux"
   ],
   "linker": "ld.lld",

Now, our binary is just 538 bytes, and it still works!

$ tinygo build -o main -scheduler none -gc none -panic trap -target spec.json \
    && strip --strip-section-headers -R .comment -R .note -R .eh_frame main \
    && wc -c main
538 main
$ file main
main: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, no section header
$ ./main
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

dyte
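
What the file command reports above comes from the first few bytes of the binary. As a rough sketch (elfClass is a hypothetical helper, and it looks only at the magic number and the class byte, nothing else), the same check can be done in Go:

```go
package main

import "fmt"

// elfClass (hypothetical helper) inspects the start of the ELF
// identification header and reports the object's word size.
func elfClass(ident []byte) string {
	// Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'
	if len(ident) < 5 || string(ident[:4]) != "\x7fELF" {
		return "not ELF"
	}

	// The fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects
	switch ident[4] {
	case 1:
		return "ELF 32-bit"
	case 2:
		return "ELF 64-bit"
	}

	return "ELF (unknown class)"
}

func main() {
	fmt.Println(elfClass([]byte{0x7f, 'E', 'L', 'F', 1})) // ELF 32-bit
	fmt.Println(elfClass([]byte{0x7f, 'E', 'L', 'F', 2})) // ELF 64-bit
}
```

To try it on a real binary, read the first five bytes of the file and pass them in.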

Conclusion

Attempt                                                        | Size (in bytes)              | Compiler
-------------------------------------------------------------- | ---------------------------- | --------
Using x/net/websocket                                          | 4128768 (stripped)           | Go
Pure standard library                                          | 727684 (1814528 without UPX) | Go
Syscalls only                                                  | 360692 (847872 without UPX)  | Go
Syscalls only                                                  | 13056                        | TinyGo
Syscalls with dummy GC                                         | 12720                        | TinyGo
Syscalls with dummy GC, custom ldflags                         | 6600                         | TinyGo
Syscalls with no GC, custom ldflags, custom entrypoint         | 810                          | TinyGo
Syscalls with no GC, custom ldflags, custom entrypoint, 32-bit | 538                          | TinyGo

As we did not cover each topic in a lot of depth in this post, here are some handy resources:

Check out the full solution here: https://github.com/git-bruh/wscodegolf/ and, if you want to look at some of our other challenges, check out https://hacktofinale.dyte.io/