err := what() 🤷‍♂️

From ‘An Error Occurred’ to actionable information.

The Backstory 📖

It’s no secret that Go’s error handling is a contentious topic. Whether you love the explicitness of the syntax or hate its verbosity, almost all developers hit that brief hurdle when learning Go where they try to find a way to make error handling “nice”. This never-ending battle against the syntax has been so long and drawn out that the Go development team have seen enough proposals to fill a blog post of their own, and have recently decided to call it quits on any further changes, closing all existing and future proposals on the topic.

But for me, the main problem has never been about syntax…

Error Types 🗺️

The first language I learned beyond basic syntax was C++, so naturally I learned exceptions as the go-to method of error handling, although with a background in games I was encouraged to avoid using them, due to the performance issues they had on consoles. Over time my journey took me to JavaScript and Python, both exception heavy, then on to CICD and web development, where numerical codes facilitate cross-process error handling, and finally some TypeScript development took me to the outcome pattern.

By the time I learned Go, I had experienced three different ways of handling errors; codes, exceptions, and outcomes. So encountering a new way was no big deal, especially as the Go error pattern is conceptually a trivialised outcome.

For me, this was never the problem; the problem was one of observability.

“An unexpected error occurred” 😵‍💫

Within minutes of starting to learn Go I encountered obscure error messages spat out at the entrypoint to my application: timeouts, EOFs, and so on. Whatever the message, the Go pattern is that errors are chained upwards until something, somewhere, handles them, typically the root handler of a web server, as Go is predominantly used for server application code;

err := handleHTTP()
if err != nil {
  fmt.Printf("An unexpected error occurred! %v\n", err)

  // No return or exit - we're just gunna keep going
}

You don’t want your entire server being ripped down due to a handler failing, so you catch the error and log it, ideally with metrics so you can see the increase in failures on your dashboards.

But here’s the crux: Go errors don’t include any information as to why the error occurred. No callstack, no dump, no variables. 🕵️‍♀️

And adding this information into your error strings is an antipattern, both because the error is no longer comparable with == or errors.Is(), and because messages should be static for observability tools. Searching for Failed on attempt %d becomes exponentially more annoying with each parameter.
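To make the comparability point concrete, here is a minimal sketch. ErrAttemptFailed and the helper functions are hypothetical names invented for illustration. Two errors built from the same format string never match each other, whereas a static sentinel always matches itself;

```go
package main

import (
  "errors"
  "fmt"
)

// ErrAttemptFailed is a hypothetical sentinel with a static message.
var ErrAttemptFailed = errors.New("attempt failed")

// parameterised builds a brand new error value on every call, with the
// attempt number baked into the message.
func parameterised(attempt int) error {
  return fmt.Errorf("failed on attempt %d", attempt)
}

// static always returns the same sentinel value; any attempt number would
// have to travel alongside it as structured data instead.
func static() error {
  return ErrAttemptFailed
}

// sameError reports whether a matches b under errors.Is.
func sameError(a, b error) bool {
  return errors.Is(a, b)
}

func main() {
  fmt.Println(sameError(parameterised(1), parameterised(1))) // false
  fmt.Println(sameError(static(), static()))                 // true
}
```

The message text is identical in the first case, but each call allocates a distinct error value, so only a comparison on err.Error() would match them.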

Further obscurity comes from the fact that errors are identified by their values, so if I ever want to identify what actually went wrong I have to add a conditional check on err.Error(), or resolve the type and do more detailed comparisons. This matters for situations like queue processors, which might want to distinguish between a transient error, which can be retried on a backoff, and a functional one, which should be sent to a dead letter queue. N.B. this is alleviated with errors.New() and errors.Is() if the errors are not parameterised strings, nor redefined - see “Registries” later.
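As a rough sketch of the queue processor case, assuming static sentinels are available: ErrTransient, ErrInvalidMessage, the Action type, and the decide and wrap functions are all hypothetical names for illustration, not part of any real queue library;

```go
package main

import (
  "errors"
  "fmt"
)

// Hypothetical sentinels; a real system would define its own.
var (
  ErrTransient      = errors.New("transient failure")
  ErrInvalidMessage = errors.New("invalid message")
)

// Action is what the processor does with a message once its handler returns.
type Action string

const (
  Ack        Action = "ack"
  Retry      Action = "retry"
  DeadLetter Action = "dead-letter"
)

// decide maps a handler error onto a queue action. errors.Is walks the
// wrap chain, so the check still works after the error has been wrapped.
func decide(err error) Action {
  switch {
  case err == nil:
    return Ack
  case errors.Is(err, ErrTransient):
    return Retry
  default:
    return DeadLetter
  }
}

// wrap adds a static prefix while keeping the original error in the chain.
func wrap(err error) error {
  return fmt.Errorf("handling message: %w", err)
}

func main() {
  fmt.Println(decide(nil), decide(wrap(ErrTransient)), decide(wrap(ErrInvalidMessage)))
  // ack retry dead-letter
}
```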

Improving on Strings ✨

So one of the first things I did was build a custom error type. error is an interface after all, so I was simply adding context to my own internal errors;

package simple_error

import (
  "errors"
  "fmt"
  "runtime"
  "strings"
)

type SimpleError struct {
  cause          error
  callstack      []string
}

func (instance *SimpleError) Error() (message string) {
  return fmt.Sprintf("%s\n\n%s\n", instance.cause.Error(), strings.Join(instance.callstack, "\n"))
}

func (instance *SimpleError) Is(other error) (is bool) {
  return errors.Is(instance.cause, other)
}

func (instance *SimpleError) As(other any) (ok bool) {
  return errors.As(instance.cause, other)
}

func New(cause error) (err *SimpleError) {
  callstack := make([]string, 0)
  stackDepth := 1
  for stackDepth < 1000 {
    _, file, line, success := runtime.Caller(stackDepth)
    if !success {
      break
    }
    stackDepth++
    callstack = append(callstack, fmt.Sprintf("%s: %d", file, line))
  }

  return &SimpleError{
    cause:     cause,
    callstack: callstack,
  }
}

Wonderful, now when we create an error we get a callstack, nothing too fancy but we no longer have to spend ages trying to figure out what code could possibly create the error string we saw in our output;

func TestError(t *testing.T) {
  var e error
  e = New(fmt.Errorf("this is a simple error"))
  fmt.Println(e.Error())
}

Outputs;

this is a simple error

.../simple_error/simple_error_test.go: 18
.../go/src/testing/testing.go: 1792
.../go/src/runtime/asm_amd64.s: 1700

In reality there was slightly more complexity for detecting if I passed a SimpleError into the New function and preventing duplicate callstack generation - but blogposts yo.
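For the curious, one way that detection could work is sketched below. It is an assumption about the approach rather than the real implementation: the type is cut down, an Unwrap method is added so errors.As can walk the chain, and errDiskFull is a made-up sentinel;

```go
package main

import (
  "errors"
  "fmt"
  "runtime"
)

// errDiskFull is a hypothetical root cause used only for this demonstration.
var errDiskFull = errors.New("disk full")

// SimpleError is a cut-down stand-in for the type shown earlier.
type SimpleError struct {
  cause     error
  callstack []string
}

func (e *SimpleError) Error() string { return e.cause.Error() }

// Unwrap is an addition for this sketch; it lets errors.As find a
// SimpleError that is already somewhere in the chain.
func (e *SimpleError) Unwrap() error { return e.cause }

// New captures a callstack only if the chain does not already carry one.
func New(cause error) *SimpleError {
  var existing *SimpleError
  if errors.As(cause, &existing) {
    // Reuse the stack captured closest to the original failure.
    return &SimpleError{cause: cause, callstack: existing.callstack}
  }

  var callstack []string
  for depth := 1; depth < 1000; depth++ {
    _, file, line, ok := runtime.Caller(depth)
    if !ok {
      break
    }
    callstack = append(callstack, fmt.Sprintf("%s: %d", file, line))
  }
  return &SimpleError{cause: cause, callstack: callstack}
}

func main() {
  inner := New(errDiskFull)
  outer := New(inner) // does not walk the stack a second time
  fmt.Println(outer.callstack[0] == inner.callstack[0]) // true
}
```

Reusing the innermost stack keeps the frames nearest the failure, which is usually the part worth reading.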

Now some of you will rightly point out that grabbing the callstack is significantly more costly than creating an error with fmt.Errorf() or errors.New(), and this is absolutely correct. But errors should not be part of the standard path. Exceptions are for the exceptional path, and although errors come closer to the outcome pattern than exceptions, I do feel they should represent system failure, not a logical outcome.

This in turn means that yes, if I am generating a callstack for errors which should result in 4xx codes, a standard path in any web server, then a malicious actor could spam the endpoint to increase CPU and memory utilisation and form the basis of a denial of service attack. But I would counter with a question: is a 4xx error an error or an outcome? I would argue that your 401 has successfully executed and determined the user is not authorised, and your input validations successfully executed and determined the parameters were invalid.

So where am I arguing these errors are created? Well, probably due to my games industry background I work on the principle that errors protect the boundaries - wherever execution leaves my code and enters third party code, the standard library, or is communicating with another process, via files, pipes, or networking - and code within my system should not throw errors, but rather return outcomes, e.g. success booleans, number of items processed, parameter failed validation rule #1 etc.
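A small sketch of that split, under the assumption that an authorisation check is internal logic and a file read is a boundary. AuthOutcome, authorise, and readConfig are hypothetical names for illustration only;

```go
package main

import (
  "fmt"
  "os"
)

// AuthOutcome is a hypothetical outcome type: the check ran successfully,
// and this records what it concluded.
type AuthOutcome struct {
  Authorised bool
  Reason     string
}

// authorise is internal logic. A rejected token is a normal result, so it
// is reported as an outcome rather than an error.
func authorise(token string, known map[string]bool) AuthOutcome {
  if !known[token] {
    return AuthOutcome{Authorised: false, Reason: "unknown token"}
  }
  return AuthOutcome{Authorised: true}
}

// readConfig crosses a boundary (the file system), so a failure here is a
// genuine error and is returned as one.
func readConfig(path string) ([]byte, error) {
  data, err := os.ReadFile(path)
  if err != nil {
    return nil, fmt.Errorf("reading config: %w", err)
  }
  return data, nil
}

func main() {
  known := map[string]bool{"secret-token": true}
  fmt.Println(authorise("wrong-token", known).Authorised) // false

  // The path is deliberately one that should not exist.
  _, err := readConfig("/nonexistent-dir-for-sketch/config.json")
  fmt.Println(err != nil)
}
```

Only readConfig would pay for a callstack in this arrangement; a flood of bad tokens never creates an error value at all.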

Improving on Improvements 🔍

Whilst this absolutely is an improvement, telling us where the error came from, the callstack, along with what the error is, the error itself, we still can’t really see why.

My first iteration for this was to add a simple “additional info” system;

package additional_info

type AdditionalInfo interface {
  GetKey() (key string)
  GetValue() (value any)
}

// N.B. does not support nested JSON values
type jsonValue interface {
  ~bool | ~int64 | ~uint64 | ~float64 | ~string
}

type typedInfo[T jsonValue] struct {
  key   string
  value T
}

func (instance typedInfo[T]) GetKey() (key string) {
  return instance.key
}

func (instance typedInfo[T]) GetValue() (value any) {
  return instance.value
}

func WithBoolInfo[T ~bool](key string, value T) (info typedInfo[T]) {
  return typedInfo[T]{key: key, value: value}
}

type typedInfoSlice[T jsonValue] struct {
  key   string
  value []T
}

func (instance typedInfoSlice[T]) GetKey() (key string) {
  return instance.key
}

func (instance typedInfoSlice[T]) GetValue() (value any) {
  return instance.value
}

func WithBoolSliceInfo[T ~bool](key string, value []T) (info typedInfoSlice[T]) {
  return typedInfoSlice[T]{key: key, value: value}
}

// Other With*Info helper functions follow
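As an illustration of one of those helpers, and of how the key/value pairs might eventually reach a JSON log line, here is a self-contained sketch. The types are redeclared in cut-down form so the example compiles on its own, and toFields is a hypothetical function, not part of the package above;

```go
package main

import (
  "encoding/json"
  "fmt"
)

// AdditionalInfo matches the interface shown above.
type AdditionalInfo interface {
  GetKey() (key string)
  GetValue() (value any)
}

// typedInfo is restricted to strings here to keep the sketch short.
type typedInfo[T ~string] struct {
  key   string
  value T
}

func (instance typedInfo[T]) GetKey() (key string)  { return instance.key }
func (instance typedInfo[T]) GetValue() (value any) { return instance.value }

// WithStringInfo follows the same shape as WithBoolInfo.
func WithStringInfo[T ~string](key string, value T) (info typedInfo[T]) {
  return typedInfo[T]{key: key, value: value}
}

// toFields flattens infos into a map suitable for a structured logger.
// When a key repeats, the later value overwrites the earlier one.
func toFields(infos ...AdditionalInfo) map[string]any {
  fields := make(map[string]any, len(infos))
  for _, info := range infos {
    fields[info.GetKey()] = info.GetValue()
  }
  return fields
}

func main() {
  fields := toFields(WithStringInfo("user", "alice"), WithStringInfo("region", "eu"))
  out, err := json.Marshal(fields)
  if err != nil {
    panic(err)
  }
  fmt.Println(string(out)) // {"region":"eu","user":"alice"}
}
```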

Then add these to our error type, now called StructuredError;

package structured_error

type StructuredError struct {
  cause          error
  callstack      []string
  additionalInfo []additional_info.AdditionalInfo
}

func New(cause error, additionalInfo ...additional_info.AdditionalInfo) (err *StructuredError) {
  // As simple_error.New, but now also storing the additionalInfo
}

That’s cool and all, we now have errors defined with any additional information we want to add to give context to the error;

func DoSomething(user string) (err error) {
  err = structured_error.New(fmt.Errorf("this is a structured error"),
    additional_info.WithStringInfo("user", user),
  )

  return err
}

But this proliferates the verbosity of Go error handling, adding an additional line per info we want to add - or one huge unreadable monster line, your choice.

The next obvious improvement ties into tracing and contexts.

Adding Context 🔗

Unsurprisingly, given that I am audacious enough to create StructuredError, I have also created a structured Context. In fact I have a few depending on the use case, but the base is ServiceContext;

package service_context

type Context interface {
  context.Context

  NewServiceContextChild(name string, kind tracing.SpanKind, additionalInfo ...additional_info.AdditionalInfo) (ctx Context)
  Close(err *error)

  GetAdditionalInfo() (additionalInfo additional_info.AdditionalInfos)
}

package structured_error

type StructuredError struct {
  context        service_context.Context
  cause          error
  callstack      []string
  additionalInfo []additional_info.AdditionalInfo
}

func New(ctx service_context.Context, cause error, additionalInfo ...additional_info.AdditionalInfo) (err *StructuredError) {
  // As before, but now also storing the context
}

Again, it’s actually more complicated, but for blog reasons I’ve trivialised it. The result is that my contexts are now both my spans for tracing, and providers of AdditionalInfo to my StructuredError, so for a trivialised example;

func HandleRegisterUser(ctx service_context.Context, user *User) (err error) {
  ctx = ctx.NewServiceContextChild("HandleRegisterUser", tracing.SpanKindInternal,
    additional_info.WithStringInfo("user_id", user.ID),
  )
  defer ctx.Close(&err)

  // ... more work

  err = AddUserToSearchEngine(ctx, user)
  if err != nil {
    // N.B. for simplicity, we always create a new StructuredError
    return structured_error.New(ctx, err)
  }

  return nil
}

func AddUserToSearchEngine(ctx service_context.Context, user *User) (err error) {
  ctx = ctx.NewServiceContextChild("AddUserToSearchEngine", tracing.SpanKindInternal,
    additional_info.WithStringInfo("user_name", user.Name),
  )
  defer ctx.Close(&err)

  err = searchEngine.Write(user.ID, user.Name)
  if err != nil {
    return structured_error.New(ctx, err)
  }

  return nil
}

Now, the error created from a failure to searchEngine.Write() contains additional info for the user_id and user_name from the spans, without having to duplicate each parameter in each error case.

For simplicity we always create a new StructuredError; there’s code in the New function to avoid creating multiple callstacks, and similar code to handle nesting of ServiceContext information. This means developers don’t need “privileged” knowledge of what type of error a function returns - just always wrap, and we will always get a callstack and additional info.
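The nesting side of that could be approached along these lines. This is a heavily simplified sketch under stated assumptions, not the real New: Info, infoProvider, spanContext, and errWriteFailed are hypothetical stand-ins, callstacks are left out, and each span is assumed to hold only its own info;

```go
package main

import (
  "errors"
  "fmt"
)

// errWriteFailed is a hypothetical root cause for the demonstration.
var errWriteFailed = errors.New("write failed")

// Info is a flattened stand-in for AdditionalInfo.
type Info struct{ Key, Value string }

// infoProvider stands in for the GetAdditionalInfo part of the context.
type infoProvider interface{ GetAdditionalInfo() []Info }

// spanContext is a toy context that only carries its own span's info.
type spanContext struct{ info []Info }

func (c spanContext) GetAdditionalInfo() []Info { return c.info }

type StructuredError struct {
  cause error
  info  []Info
}

func (e *StructuredError) Error() string { return e.cause.Error() }
func (e *StructuredError) Unwrap() error { return e.cause }

// New gathers the span's info, then the info passed at the error site, then
// anything an already-structured cause collected further down the stack.
func New(ctx infoProvider, cause error, info ...Info) *StructuredError {
  merged := append([]Info{}, ctx.GetAdditionalInfo()...)
  merged = append(merged, info...)

  var inner *StructuredError
  if errors.As(cause, &inner) {
    merged = append(merged, inner.info...)
  }
  return &StructuredError{cause: cause, info: merged}
}

func main() {
  inner := New(spanContext{info: []Info{{"user_name", "alice"}}}, errWriteFailed)
  outer := New(spanContext{info: []Info{{"user_id", "42"}}}, inner)
  fmt.Println(outer.info) // [{user_id 42} {user_name alice}]
}
```

Because the outermost wrapper already holds everything, a logger only ever needs to inspect the first StructuredError it finds.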

Registries 🗄️

The final step on this journey was about consistency. For sure you can create custom error messages using fmt.Errorf or errors.New, and custom additional info by using the helper functions, but if you’ve ever tried debugging production systems with inconsistent naming in the observability tooling you’ll know it can be a pain having to write searches for userid, user_id, and user.id where each developer wrote a different key. So for my errors I have an error registry;

package error_registry

var (
  ErrTableNameEmpty                             = errors.New("table name cannot be empty")
  ErrTableNameRegex                             = errors.New("table name can only contain lowercase letters, numbers, and underscores, and must be less than 63 characters")
  // ...
)

for my additional info, an info registry;

package info_registry

func WithInstanceID(instanceID string) (info additional_info.AdditionalInfo) {
  return additional_info.WithStringInfo("instance_id", instanceID)
}

func WithUserID(user users.User) (info additional_info.AdditionalInfo) {
  return additional_info.WithStringInfo("user_id", user.ID)
}

// ...

and even my logs in a log registry;

package log_registry

type LogString string

var (
  ConfigParsingFailed LogString = "Failed to parse config"
  StartupComplete     LogString = "Startup complete"
)

overall making my code simpler, observability tools more consistent, and making it easier to find all instances of each type for any refactoring, debugging, and auditing.
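To show what the error registry buys at the call site, here is a small sketch. The two sentinels are copied from the registry above, while validateTableName and the regular expression are assumptions made for illustration;

```go
package main

import (
  "errors"
  "fmt"
  "regexp"
)

// Registry entries, as declared in error_registry above.
var (
  ErrTableNameEmpty = errors.New("table name cannot be empty")
  ErrTableNameRegex = errors.New("table name can only contain lowercase letters, numbers, and underscores, and must be less than 63 characters")
)

// tableNamePattern is an assumed encoding of the rule in the message.
var tableNamePattern = regexp.MustCompile(`^[a-z0-9_]{1,62}$`)

// validateTableName only ever returns registry errors, so callers can match
// on identity instead of comparing message strings.
func validateTableName(name string) error {
  if name == "" {
    return ErrTableNameEmpty
  }
  if !tableNamePattern.MatchString(name) {
    return ErrTableNameRegex
  }
  return nil
}

func main() {
  err := validateTableName("Users")
  fmt.Println(errors.Is(err, ErrTableNameRegex)) // true
  fmt.Println(errors.Is(err, ErrTableNameEmpty)) // false
}
```

Since every message lives in one place, a search for ErrTableNameRegex finds every producer and every consumer of that failure.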

Sample Code 💾

I’ve prepared some sample code to demonstrate how my StructuredError works alongside a simple OpenTelemetry span context.

In it you can find an example of how to use this system, which demonstrates both how nesting of the StructuredError is expected, reducing cognitive load, and how the context system works for root and child spans, as well as automatically handling error states;

// F1 is the start of our chain of execution, it creates a root context
func F1() (err error) {
  // If we were continuing a distributed trace, we would pass the traceID and parentID from the incoming
  // request, instead of nils
  ctx := NewRootContext("F1", nil, nil, WithStringInfo("key1", "value1"))
  defer ctx.Close(&err)

  // Emulate work
  time.Sleep(100 * time.Millisecond)

  err = F2(ctx)
  if err != nil {
    return New(ctx, err, WithStringInfo("error1", "value1"))
  }

  return nil
}

// F2 is still within our system, so it creates a child context to denote the nested span
func F2(ctx *OtelContext) (err error) {
  ctx = ctx.NewChildContext("F2", WithStringInfo("key2", "value2"))
  defer ctx.Close(&err)

  // Emulate work
  time.Sleep(100 * time.Millisecond)

  err = F3(ctx)
  if err != nil {
    return New(ctx, err, WithStringInfo("error2", "value2"))
  }

  return nil
}

// F3 is an external call, it doesn't understand OtelContext, and will return a normal error
func F3(ctx context.Context) (err error) {
  return &BadError{}
}

Which automatically generates the observability information discussed in this post, whilst still supporting errors.As, errors.Is, and the error interface;

  err := F1()

  var badError *BadError
  assert.ErrorAs(t, err, &badError)
  assert.ErrorIs(t, err, ErrSomethingBad)
  // testify's assert.Equal takes the expected value first, then the actual
  assert.Equal(t, ErrSomethingBad.Error(), err.Error())

  cause, callstack, additionalInfo := GetLoggingInfo(err)
  assert.Equal(t, ErrSomethingBad.Error(), cause)
  assert.Contains(t, callstack, "example_test.go:")
  assert.Equal(t, AdditionalInfos{
    WithStringInfo("key1", "value1"),
    WithStringInfo("error1", "value1"),
    WithStringInfo("key2", "value2"),
    WithStringInfo("error2", "value2"),
  }, additionalInfo)

Summing Up 🎁

For me the issues with Go’s error handling have never been about syntax; they’ve been about the loss of observability information. I understand this was done with the aim of performance, but being unable to fix a bug for hours because you can’t find what actually went wrong is, to me, an unacceptable trade-off. However, as error and context.Context are both interfaces, the Go dev team facilitate our own implementations, and this in turn allows us to add the information we need as we need it.