Teju's Blog

Full stack engineer and AI architect. Notes from the work.


Building an AutoGPT-style agent in Go

AutoGPT was the thing that woke a lot of people up to what agents could actually do. In March 2023 you could point a GPT-4 prompt at a goal like “research the best CRM for a five-person team” and let it bumble around the web for an hour. It would come back with a half-decent answer and a list of nine browser tabs it would have opened if it had hands. That demo did more for agent adoption than every research paper that year combined.

Two years on, the pattern is clear enough to build cleanly. The shape is two loops, not one: an outer loop that decomposes a goal into tasks and tracks them in a queue, and an inner loop per task that calls tools to do the work. Between iterations the agent writes to a memory store and reflects on what it just did. A single run can last hundreds of model calls. The inner-loop executor here is the ReAct pattern; the rest of this post is the machinery around it.

The pattern

A canonical Plan-Execute-Reflect loop, in sequence:

[Sequence diagram. Participants: User, Outer Loop, Planner, Task Queue, Executor, Memory, Reflector. Messages, in order: goal; goal + memory; initial tasks; next task; task; task + memory; read context; result; store result; task + result + memory; add / retry / done; final answer. The per-task messages sit inside a loop labelled "until done or budget hit".]

Five moving parts, each a small Go struct. Three of them (planner, executor, reflector) wrap LLM calls; the other two (task queue, memory) are plain storage:

  • The Planner reads the goal plus what the agent has already done and proposes new tasks.
  • The Task Queue holds pending work and tracks status.
  • The Executor runs one task. It calls tools, takes notes, returns a result. In practice this is a single-loop agent.
  • The Memory stores retrievable bits of context: past task results, scraped pages, intermediate notes.
  • The Reflector reads what just happened and decides whether to mark the task done, retry it, split it, or abandon the whole goal.

Each piece is small. The interesting code is in how they hand off to each other.

Architecture

[Architecture diagram. Components: Goal, Planner, Task Queue, Executor (inner loop), Tools (web, files, sql), Memory (SQLite + sqlite-vec), Reflector, Final answer.]

The memory store is the part that gets the most attention in production. Get it wrong and the agent restarts every task from zero, reinventing the same tool sequences for ten model calls in a row.

Core types

Three small Go types carry the state.

go
type Goal struct {
    ID         string
    Text       string
    StartedAt  time.Time
    StepBudget int
    WallBudget time.Duration
}

type Task struct {
    ID        string
    GoalID    string
    Text      string
    ParentID  string // tasks created during reflection point back at their parent
    Status    string // pending | running | done | failed | abandoned
    Result    string
    CreatedAt time.Time
    UpdatedAt time.Time
}

type MemoryEntry struct {
    ID        string
    GoalID    string
    TaskID    string    // empty for global notes
    Kind      string    // task_result | observation | retrieval
    Text      string
    Embedding []float32 // sqlite-vec / pgvector stores as a blob
    CreatedAt time.Time
}

There are no agents-as-classes here. The structs are data. The behaviour lives in functions that take state plus the LLM client and return new state. This is partly a Go thing and partly a debugging thing: when an agent does something stupid at 2am, you want to be able to replay any single step from any past state by hand.

The planner

The planner takes the goal, the tasks completed so far, and a slice of recent memory, and asks the model for the next batch of tasks. One structured LLM call.

go
type PlannerOutput struct {
    Tasks       []string `json:"tasks"`
    Done        bool     `json:"done"`
    FinalAnswer string   `json:"final_answer,omitempty"`
}

func (p *Planner) Plan(ctx context.Context, g Goal, completed []Task, recent []MemoryEntry) (PlannerOutput, error) {
    sys := `You are the planner for an autonomous agent.

Read the goal, the tasks already completed, and the recent memory.
Decide whether the goal is satisfied. If yes, set done=true and write a final answer.
If no, return a small batch of concrete next tasks (3-7 of them).

Rules:
- Each task must be doable by a single-loop agent in under 10 tool calls.
- Do not repeat work that is already in the completed list.
- Prefer narrow, verifiable tasks over big aspirational ones.
`
    user := buildPlanPrompt(g, completed, recent)
    return callStructured[PlannerOutput](ctx, p.llm, sys, user)
}

Two things worth pointing out.

The system prompt enumerates the rules in plain English. Models will follow rules in a short system prompt much more reliably than rules buried in a long user message. Keep the system prompt under a screen and rewrite it whenever the agent gets worse.

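The structured output is the bit that makes this maintainable. Hand-rolled parsing of free-form model text breaks on stray prose and malformed JSON; a provider's structured-output or tool-use API constrains the response to a schema, so it fails far less often. Most major providers support structured outputs now. Use them.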

The task queue

Boring code on purpose. The queue is just a table:

sql
CREATE TABLE tasks (
  id         TEXT PRIMARY KEY,
  goal_id    TEXT NOT NULL,
  text       TEXT NOT NULL,
  parent_id  TEXT,
  status     TEXT NOT NULL,    -- pending | running | done | failed | abandoned
  result     TEXT,
  created_at DATETIME NOT NULL,
  updated_at DATETIME NOT NULL
);
CREATE INDEX idx_tasks_goal_status ON tasks(goal_id, status);

The core queue methods are PullNext, Push, MarkDone, MarkFailed, and MarkAbandoned: five functions, each a few lines of SQL. The outer loop also leans on two helpers, SeedFromPlan and Completed.

The one design choice that matters: do you let the planner add tasks in front of pending ones (priority), or only at the back (FIFO)? I default to FIFO, with the option for the reflector to insert “retry” tasks at the front. That keeps planning roughly chronological and makes traces easier to read at 2am.
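The ordering rule is easier to see without SQL in the way. This is an in-memory sketch of FIFO with front-of-queue retries; taskQueue, PushBack, PushFront and Pull are illustrative names, not the table-backed methods the rest of the post uses.

```go
package main

import "fmt"

// taskQueue is an illustrative in-memory stand-in for the tasks table.
type taskQueue struct {
    items []string
}

// PushBack appends planner tasks in arrival order (FIFO).
func (q *taskQueue) PushBack(task string) {
    q.items = append(q.items, task)
}

// PushFront lets the reflector put a retry ahead of all pending work.
func (q *taskQueue) PushFront(task string) {
    q.items = append([]string{task}, q.items...)
}

// Pull removes and returns the next task; ok is false when the queue is empty.
func (q *taskQueue) Pull() (task string, ok bool) {
    if len(q.items) == 0 {
        return "", false
    }
    task, q.items = q.items[0], q.items[1:]
    return task, true
}

func main() {
    q := &taskQueue{}
    q.PushBack("evaluate sqlite-vec")
    q.PushBack("evaluate sqlite-vss")
    q.PushFront("retry: search sqlite vector extensions")
    for {
        t, ok := q.Pull()
        if !ok {
            break
        }
        fmt.Println(t)
    }
}
```

In the SQL version the same effect could come from an explicit position column that retries write below the current minimum, but that is one option among several.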

The executor

The executor takes one task and runs it. It returns a result plus a list of memory entries to store. The inner loop slots in unchanged: the task text becomes the user message, the agent has the full tool set available, the answer becomes the result.

go
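type Executor struct {
    inner *innerloop.Agent
    mem   *Memory
}

func (e *Executor) Run(ctx context.Context, t Task) (string, []MemoryEntry, error) {
    // Pull a few relevant memory entries to seed the agent's context.
    seed, err := e.mem.SearchSimilar(ctx, t.Text, 5)
    if err != nil {
        return "", nil, fmt.Errorf("memory search: %w", err)
    }

    prompt := buildExecutorPrompt(t, seed)

    events := make(chan innerloop.Event, 32)
    collected := make(chan []MemoryEntry, 1)
    go func() {
        // Only this goroutine touches notes until it hands them back,
        // so there is no data race with the caller.
        var notes []MemoryEntry
        for ev := range events {
            if ev.Type == "tool_result" {
                notes = append(notes, MemoryEntry{
                    GoalID: t.GoalID,
                    TaskID: t.ID,
                    Kind:   "observation",
                    Text:   oneLine(ev.Output),
                })
            }
        }
        collected <- notes
    }()

    answer, err := e.inner.RunOnce(ctx, prompt, events)
    // Assumes RunOnce leaves the channel open for the caller to close.
    close(events)
    notes := <-collected // wait until every event has been drained
    if err != nil {
        return "", notes, err
    }

    // Entry IDs are unset here; assign them before Write, since memory.id is the primary key.
    notes = append(notes, MemoryEntry{GoalID: t.GoalID, TaskID: t.ID, Kind: "task_result", Text: answer})
    return answer, notes, nil
}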

A few notes on what is important here.

The memory search at the top is what turns a forgetful agent into one that learns. Skip it and every task starts from a blank conversation. In practice the recall step roughly halves the tool-call count per task because the agent has the five most relevant past notes in front of it.

The notes you store should be small. Tool result summaries, key sentences from scraped pages, one-line conclusions. If you store the whole HTML of every page the agent fetched, your memory will dilute fast and similarity search will return noise.

Every observation flowing through the events channel gets a chance to be persisted, which gives you a free run log at the granularity of individual tool calls.

Memory: SQLite plus sqlite-vec

The memory store is RAG over the agent’s own observations. The general retrieval techniques (chunking, embedding choice, hybrid search) all apply; the only thing that is agent-specific is what gets embedded and when.

I use sqlite-vec for memory because it lets the agent run on a single SQLite file with no external services. For a single-user agent this is plenty. For a multi-tenant production agent, swap in pgvector and keep the rest.

sql
CREATE VIRTUAL TABLE memory_vec USING vec0(
  embedding float[1536]
);

CREATE TABLE memory (
  id      TEXT PRIMARY KEY,
  goal_id TEXT NOT NULL,
  task_id TEXT,
  kind    TEXT NOT NULL,
  text    TEXT NOT NULL,
  rowid   INTEGER       -- foreign key into memory_vec
);

The Go side:

go
func (m *Memory) Write(ctx context.Context, e MemoryEntry) error {
    if e.Embedding == nil {
        v, err := m.embed.Embed(ctx, e.Text)
        if err != nil {
            return err
        }
        e.Embedding = v
    }
    return m.db.WithTx(ctx, func(tx *sql.Tx) error {
        res, err := tx.ExecContext(ctx,
            `INSERT INTO memory_vec(embedding) VALUES (?)`,
            float32Blob(e.Embedding))
        if err != nil {
            return err
        }
        rowid, err := res.LastInsertId()
        if err != nil {
            return err
        }
        _, err = tx.ExecContext(ctx,
            `INSERT INTO memory(id, goal_id, task_id, kind, text, rowid)
             VALUES (?, ?, ?, ?, ?, ?)`,
            e.ID, e.GoalID, e.TaskID, e.Kind, e.Text, rowid)
        return err
    })
}

func (m *Memory) SearchSimilar(ctx context.Context, query string, k int) ([]MemoryEntry, error) {
    v, err := m.embed.Embed(ctx, query)
    if err != nil {
        return nil, err
    }
    rows, err := m.db.QueryContext(ctx, `
        SELECT m.id, m.text, m.kind, m.task_id, m.goal_id
        FROM memory_vec v
        JOIN memory m ON m.rowid = v.rowid
        WHERE v.embedding MATCH ? AND k = ?
        ORDER BY distance ASC
    `, float32Blob(v), k)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var out []MemoryEntry
    for rows.Next() {
        var e MemoryEntry
        if err := rows.Scan(&e.ID, &e.Text, &e.Kind, &e.TaskID, &e.GoalID); err != nil {
            return nil, err
        }
        out = append(out, e)
    }
    // rows.Next can stop early on an error; surface it instead of returning a short list.
    return out, rows.Err()
}

Embedding model: any of the smaller hosted models will do. text-embedding-3-small is 1536-dim, cheap, and works well enough for short observations. Embed at write time, store the vector inline, never re-embed unless you change the model.
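The float32Blob helper used in Write and SearchSimilar is not shown either. To my understanding sqlite-vec accepts vectors as a compact blob of raw little-endian float32 values, so a minimal sketch under that assumption looks like this:

```go
package main

import (
    "encoding/binary"
    "fmt"
    "math"
)

// float32Blob packs a vector as consecutive little-endian IEEE 754 float32
// values, four bytes per dimension.
func float32Blob(v []float32) []byte {
    buf := make([]byte, 4*len(v))
    for i, f := range v {
        binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
    }
    return buf
}

func main() {
    blob := float32Blob([]float32{1, 0.5})
    fmt.Printf("%d bytes: % x\n", len(blob), blob)
}
```

A 1536-dimension embedding comes out at 6144 bytes, which is worth keeping in mind when estimating the size of the SQLite file.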

The reflector

The reflector reads the task, the result, and a slice of recent memory, and produces one of four verdicts: done, retry, split, or abandon.

go
type ReflectionVerdict struct {
    Status   string   `json:"status"`             // done | retry | split | abandon
    Reason   string   `json:"reason"`
    NewTasks []string `json:"new_tasks,omitempty"` // only used for split
}

func (r *Reflector) Reflect(ctx context.Context, t Task, result string, recent []MemoryEntry) (ReflectionVerdict, error) {
    sys := `You are reviewing the result of a single task in an autonomous agent.

Decide whether the task is genuinely complete (done), needs a retry with a hint (retry),
should be split into smaller sub-tasks (split), or is impossible and should be dropped (abandon).

Be strict. A confident-sounding answer is not the same as a correct answer.
If the result claims to have done something without observable evidence in memory, retry.
`
    return callStructured[ReflectionVerdict](ctx, r.llm, sys, buildReflectionPrompt(t, result, recent))
}

The last two lines of that prompt earn their place. The “be strict” phrasing keeps the reflector from waving tasks through with confident hallucinations. The “without observable evidence in memory, retry” line catches the most common failure: the executor says “I have written the file” when no file write tool call appears in the trace.

The outer loop

Everything ties together in one function:

go
func (a *Agent) Run(ctx context.Context, goal Goal) (string, error) {
    a.queue.SeedFromPlan(ctx, goal, a.planner)
    start := time.Now()

    for step := 0; step < goal.StepBudget; step++ {
        if time.Since(start) > goal.WallBudget {
            return "", errors.New("wall budget exceeded")
        }

        task, ok, err := a.queue.PullNext(ctx, goal.ID)
        if err != nil { return "", err }
        if !ok {
            // The queue drained without the planner declaring the goal done.
            return "", errors.New("task queue empty before goal was done")
        }

        result, notes, err := a.executor.Run(ctx, task)
        // Persist notes even on failure: the executor returns what it saw so far.
        for _, n := range notes {
            if werr := a.mem.Write(ctx, n); werr != nil {
                return "", fmt.Errorf("memory write: %w", werr)
            }
        }
        if err != nil {
            a.queue.MarkFailed(ctx, task.ID, err.Error())
            continue
        }

        recent, err := a.mem.SearchSimilar(ctx, task.Text, 8)
        if err != nil { return "", err }
        verdict, err := a.reflector.Reflect(ctx, task, result, recent)
        if err != nil { return "", err }

        switch verdict.Status {
        case "done":    a.queue.MarkDone(ctx, task.ID, result)
        case "retry":   a.queue.Push(ctx, restateWithHint(task, verdict.Reason))
        case "split":   for _, t := range verdict.NewTasks { a.queue.Push(ctx, newTaskUnder(task, t)) }
        case "abandon": a.queue.MarkAbandoned(ctx, task.ID, verdict.Reason)
        default:        return "", fmt.Errorf("unknown verdict %q", verdict.Status)
        }

        completed, err := a.queue.Completed(ctx, goal.ID)
        if err != nil { return "", err }
        plan, err := a.planner.Plan(ctx, goal, completed, recent)
        if err != nil { return "", err }
        if plan.Done {
            return plan.FinalAnswer, nil
        }
        for _, t := range plan.Tasks { a.queue.Push(ctx, newRootTask(goal, t)) }
    }
    return "", errors.New("step budget exceeded")
}

The loop is dense but every line is doing a thing. The planner runs once per step, which is the expensive design choice: it adds a second structured LLM call to every step, on top of the reflector, but keeps the plan responsive to new information. A cheaper variant calls the planner only every Nth step, or only when the reflector returns split or abandon. I usually start with every-step planning and back off when the bill hurts.
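The cheaper variant fits in one predicate. This is a sketch with hypothetical names: shouldReplan and its every parameter are mine. I added the queueEmpty case because only the planner can declare the goal done, so skipping it on an empty queue would strand the run.

```go
package main

import "fmt"

// shouldReplan reports whether to call the planner after this step.
// every is an assumed tuning knob: 1 (or less) means replan on every step.
func shouldReplan(step, every int, verdict string, queueEmpty bool) bool {
    if queueEmpty {
        return true // only the planner can add work or declare the goal done
    }
    if verdict == "split" || verdict == "abandon" {
        return true // the plan just changed shape; replan now
    }
    if every <= 1 {
        return true
    }
    return (step+1)%every == 0
}

func main() {
    verdicts := []string{"done", "done", "split", "done", "done", "done"}
    for step, v := range verdicts {
        fmt.Println(step, v, shouldReplan(step, 3, v, false))
    }
}
```

In the outer loop this would wrap the Completed and Plan calls, so steps that skip planning cost one structured call instead of two.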

Control flow, end to end

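[Flowchart: Start -> Initial plan -> Tasks pending? -> (yes) Execute one task -> Write memory -> Reflect -> Verdict -> one of Mark done (done), Push retry (retry), Push subtasks (split), Mark abandoned (abandon) -> Replan -> Plan says done? -> (yes) Return final answer, or (no) back to Tasks pending?. The loop also exits at "Step or wall budget hit".]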

A real run

Goal: "Research the top three SQLite vector extensions for under 100K embeddings, and recommend one for a single-binary Go service."

Edited trace:

plan:    [search sqlite vector extensions, list candidates, evaluate sqlite-vec,
          evaluate sqlite-vss, evaluate libsql vector, compare and recommend]

run:     search sqlite vector extensions
inner:   -> web_search(q="sqlite vector extensions 2026")
inner:   <- [sqlite-vec, sqlite-vss, libsql, chroma...]
inner:   answer: "Three primary candidates: sqlite-vec, sqlite-vss, libsql vector"
reflect: done

run:     list candidates
reflect: done (already covered by previous task)

run:     evaluate sqlite-vec
inner:   -> web_fetch(url="https://github.com/asg017/sqlite-vec")
inner:   -> web_fetch(url="https://alexgarcia.xyz/sqlite-vec/")
inner:   answer: "sqlite-vec: pure C, no deps, k-NN, MIT, used in production by..."
reflect: done

run:     evaluate sqlite-vss
inner:   answer: "sqlite-vss: faiss-backed, larger dep, requires runtime extension..."
reflect: retry, reason: "no version or benchmark numbers"
inner:   -> web_search(q="sqlite-vss benchmark p99")
inner:   answer: "sqlite-vss: faiss-backed, ~3MB shared lib, P99 250ms at 100K..."
reflect: done

run:     evaluate libsql vector
reflect: split into [check libsql vector status, check libsql Go binding]

...

plan:    done. final answer: "Use sqlite-vec. Pure C, MIT, no separate shared
         library, k-NN sufficient for 100K vectors at 1536 dims, benchmarks
         around 40ms P99. Both alternatives have heavier deployment stories
         that defeat the single-binary goal."

The reflector caught the sqlite-vss task answering with no numbers and pushed it back with a hint. The libsql task got split because the planner had wanted one task but the work was actually two.

When the pattern breaks

Five failure modes I keep running into:

Drift. After a hundred steps on a complex goal, the planner is producing tasks that are tangentially related to the original goal. The fix is to re-anchor on the goal text in every planner prompt and keep the completed-tasks list visible.

Confident hallucinated success. The model says “I have done X” with no tool calls that could have done X. The reflector’s job is to catch this, but the reflector is also a model and can be fooled. Defence in depth: require certain task types to have specific tool calls in their trace before being marked done.
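The trace check can be a plain lookup that runs before the reflector's verdict is accepted. This is a sketch: the task kinds and the file_write and email_send tool names are assumptions for illustration, and the Task struct in this post has no kind field, so it would have to be added or inferred.

```go
package main

import "fmt"

// requiredTools maps an assumed task kind to the tools that must appear in
// the trace before the task may be marked done. Kinds and names are illustrative.
var requiredTools = map[string][]string{
    "write_file": {"file_write"},
    "send_email": {"email_send"},
}

// hasEvidence reports whether every tool required for kind shows up among
// the tool names actually called. Kinds with no entry always pass.
func hasEvidence(kind string, called []string) bool {
    seen := make(map[string]bool, len(called))
    for _, name := range called {
        seen[name] = true
    }
    for _, need := range requiredTools[kind] {
        if !seen[need] {
            return false
        }
    }
    return true
}

func main() {
    fmt.Println(hasEvidence("write_file", []string{"web_search"}))
    fmt.Println(hasEvidence("write_file", []string{"web_search", "file_write"}))
}
```

When the check fails, downgrading a done verdict to retry keeps the decision out of the model's hands.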

Infinite split. A task gets split into subtasks, each subtask gets split, the tree never bottoms out. Cap the parent chain depth. After three levels of splitting, force the next reflector verdict to be done or abandon.
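A sketch of the depth cap, assuming tasks can be looked up by ID. It uses a simpler variant than re-prompting the reflector with only two choices: a split verdict at the cap is rewritten to abandon. splitDepth, clampVerdict and maxSplitDepth are hypothetical names.

```go
package main

import "fmt"

// Task carries only the fields this sketch needs.
type Task struct {
    ID       string
    ParentID string // empty for root tasks
}

const maxSplitDepth = 3

// splitDepth counts the ancestors of t by walking ParentID links.
// The hop limit guards against a corrupted, cyclic parent chain.
func splitDepth(t Task, byID map[string]Task) int {
    d := 0
    for t.ParentID != "" && d <= len(byID) {
        d++
        parent, ok := byID[t.ParentID]
        if !ok {
            break
        }
        t = parent
    }
    return d
}

// clampVerdict turns a split into abandon once the task is already
// maxSplitDepth levels deep; every other verdict passes through unchanged.
func clampVerdict(status string, depth int) string {
    if status == "split" && depth >= maxSplitDepth {
        return "abandon"
    }
    return status
}

func main() {
    byID := map[string]Task{
        "root": {ID: "root"},
        "a":    {ID: "a", ParentID: "root"},
        "b":    {ID: "b", ParentID: "a"},
        "c":    {ID: "c", ParentID: "b"},
    }
    d := splitDepth(byID["c"], byID)
    fmt.Println(d, clampVerdict("split", d))
}
```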

Plan thrash. The planner adds and removes the same tasks each round. Usually a memory problem: the planner is not seeing the right slice of memory. Make the recent-memory window deterministic by combining most-recent-N with most-similar-K.
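A sketch of that merge, assuming the caller has already fetched the N most recent and K most similar entries. plannerWindow is a hypothetical name, and the struct uses a Unix timestamp to keep the example short.

```go
package main

import (
    "fmt"
    "sort"
)

// MemoryEntry carries only the fields this sketch needs.
type MemoryEntry struct {
    ID        string
    Text      string
    CreatedAt int64 // Unix seconds; the post's struct uses time.Time
}

// plannerWindow merges the most recent and most similar entries, drops
// duplicates by ID, and sorts oldest-first (ties broken by ID) so the same
// inputs always produce the same prompt.
func plannerWindow(recent, similar []MemoryEntry) []MemoryEntry {
    seen := make(map[string]bool)
    var out []MemoryEntry
    for _, group := range [][]MemoryEntry{recent, similar} {
        for _, e := range group {
            if !seen[e.ID] {
                seen[e.ID] = true
                out = append(out, e)
            }
        }
    }
    sort.Slice(out, func(i, j int) bool {
        if out[i].CreatedAt != out[j].CreatedAt {
            return out[i].CreatedAt < out[j].CreatedAt
        }
        return out[i].ID < out[j].ID
    })
    return out
}

func main() {
    recent := []MemoryEntry{{ID: "n3", Text: "vss needs faiss", CreatedAt: 30}}
    similar := []MemoryEntry{
        {ID: "n1", Text: "vec is pure C", CreatedAt: 10},
        {ID: "n3", Text: "vss needs faiss", CreatedAt: 30},
    }
    for _, e := range plannerWindow(recent, similar) {
        fmt.Println(e.ID, e.Text)
    }
}
```

The stable ordering matters as much as the merge: if the same entries arrive in a different order each round, the planner sees a different prompt and is more likely to churn.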

Goal completion ambiguity. The planner thinks the goal is done; the user disagrees. There is no clean fix for this. The practical workaround is to keep the agent’s deliverable concrete enough that “done” is visible from outside: a written file, a sent email, a closed PR.

Tools you actually need

Beyond the basics (web fetch, web search, file read/write, SQL), an agent running this long wants a few specific things:

  • append_to_file: results accumulate across many tasks. Do not let each task overwrite the previous.
  • write_note: a tool that explicitly writes to memory. The agent learns faster when it can decide what to remember.
  • summarize_page: prevents the agent from dumping 50KB of HTML into a tool result. Run the page through the model with a short prompt and return the summary.
  • list_completed: the agent should be able to query its own task history. Otherwise the planner has to do all the bookkeeping in the prompt.
