Retrieval Augmented Generation (RAG) AI for the Blog
What if there was an expert, pre-trained on my blog content, who could give comprehensive and insightful answers to user questions?
Below is how I added a non-invasive, personalized AI assistant to my blog.
The Tech Stack
For my purposes, I decided to use Google's Gemini 2.5 Flash model, for a couple of reasons:
- The 2.5 Flash model is very cheap and efficient.
- For my use case, Retrieval Augmented Generation (RAG), we don't need a more advanced model.
The Implementation
To actually handle this, we need to expose yet another API endpoint for the blog via the router.
appMux.HandleFunc("POST /api/questions", h.HandleQuestion)
With this route exposed, the frontend can make requests to the /api/questions endpoint.
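For reference, an exchange with that endpoint looks roughly like this (the field names match the Question and Answer structs in the handler below; the example question is just illustrative):

Request body:

{ "question": "What do you write about on this blog?" }

Response body:

{ "answer": "..." }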
Handler
The Handler for the /api/questions route looks like this:
type Question struct {
    Body string `json:"question"`
}

type Answer struct {
    Answer string `json:"answer"`
}

func (h *Handler) HandleQuestion(w http.ResponseWriter, r *http.Request) {
    var req Question
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        HttpErrorResponse(w, "bad request", http.StatusBadRequest)
        return
    }

    if len(req.Body) < 5 || len(req.Body) > 300 {
        HttpErrorResponse(w, "message too short or too long", http.StatusBadRequest)
        return
    }

    answer, err := h.queryProcessor.Query(r.Context(), req.Body)
    if err != nil {
        h.log.Error("unable to query model", "error", err)
        HttpErrorResponse(w, "unable to query model", http.StatusInternalServerError)
        return
    }

    res := Answer{Answer: answer}

    json.NewEncoder(w).Encode(res)
}
The most important call in that block is this one:

answer, err := h.queryProcessor.Query(r.Context(), req.Body)
Now is a great time to go over the QueryProcessor implementation.
QueryProcessor
The QueryProcessor looks like this:
package ai

import (
    "context"
    "fmt"
    "time"
)

const bufferSize = 10

type QueryProcessor struct {
    client *Client
    sem    chan struct{}
}

func NewQueryProcessor(client *Client) *QueryProcessor {
    sem := make(chan struct{}, bufferSize)
    q := &QueryProcessor{
        sem:    sem,
        client: client,
    }
    return q
}

func (q *QueryProcessor) Query(requestCtx context.Context, question string) (string, error) {
    select {
    case q.sem <- struct{}{}:
        defer func() { <-q.sem }()
        return q.client.Ask(requestCtx, question)
    case <-time.After(time.Second * 8):
        return "", fmt.Errorf("query timed out")
    case <-requestCtx.Done():
        return "", nil
    }
}
The most important detail is the Query function:
In it, we use the sem channel as a semaphore. What does that mean? A semaphore is a concurrency primitive that limits how many consumers can use a shared resource at once.
For example, say 12 users are on my website at the same time and they all ask the AI a question. Because bufferSize is 10, users 11 and 12 have to wait until other users' queries finish before theirs are sent. Below is a commented version of Query to make this clearer:
func (q *QueryProcessor) Query(requestCtx context.Context, question string) (string, error) {
    select {
    // q.sem <- struct{}{} will block if q.sem is full, giving us natural rate limiting.
    case q.sem <- struct{}{}:
        // we must release the resource after our question is answered so other users can participate.
        defer func() { <-q.sem }()
        return q.client.Ask(requestCtx, question)
    // spam protection: if we've been waiting for longer than 8 seconds, get rid of this user's request.
    case <-time.After(time.Second * 8):
        return "", fmt.Errorf("query timed out")
    // if the user ends up leaving or closing the connection before we are able to acquire the semaphore, we just return.
    case <-requestCtx.Done():
        return "", nil
    }
}
I decided to implement it this way for a couple of reasons:
- By limiting it to 10 queries at a time, we protect ourselves from being spammed by a script (and thus from spamming Google). The first 10 queries will succeed, the remaining ones will hang, and after 8 seconds any still-hanging requests get booted.
- I didn't want to use an explicit queue like I did for anonymous messages, because in this case the process is synchronous: the user expects to wait for the AI model's response. In the anonymous message case, we could "fire and forget." We can't do that when we need the AI's response in order to respond to the user. A rough sketch of the difference is below.
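To make that contrast concrete, here is a minimal sketch of the two shapes. This is not the blog's actual anonymous-message code; the function names and the queue are illustrative assumptions.

// Fire and forget: the handler pushes the message onto a queue and returns
// immediately; a background worker processes it later. The user never waits
// on the result. (Illustrative only, not the blog's real code.)
func EnqueueAnonymousMessage(queue chan<- string, msg string) {
    queue <- msg
}

// Synchronous: the HTTP response *is* the model's answer, so the handler has
// to block on Query until the answer (or an error) comes back.
func AnswerQuestion(ctx context.Context, q *QueryProcessor, question string) (string, error) {
    return q.Query(ctx, question)
}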
Now is a perfect time to discuss the client implementation, which is how we actually communicate with the Google API.
Client
This is what client.go looks like:
package ai

import (
    "context"
    "errors"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/google/generative-ai-go/genai"
    "github.com/thornhall/blog/internal/posts"
    "google.golang.org/api/option"
)

const (
    modelName       = "gemini-2.5-flash"
    maxOutputTokens = 8192 // legacy limit for output tokens
)

type Client struct {
    genClient    *genai.Client
    model        *genai.GenerativeModel
    systemPrompt *genai.Content
}

func NewClient(ctx context.Context, apiKey string) (*Client, error) {
    if apiKey == "" {
        log.Println("ERROR: gemini api key empty")
        return nil, fmt.Errorf("gemini api key empty")
    }
    client, err := genai.NewClient(ctx, option.WithAPIKey(apiKey))
    if err != nil {
        return nil, fmt.Errorf("failed to create genai client: %w", err)
    }

    gm := client.GenerativeModel(modelName)

    gm.SetTemperature(1)
    gm.SetMaxOutputTokens(maxOutputTokens)

    return &Client{
        genClient: client,
        model:     gm,
    }, nil
}

// LoadContent takes all of the blog posts and loads them into the model's memory.
// Critically, it also injects context into the queries so the AI knows how to respond to users.
func (c *Client) LoadContent(posts []posts.TextPost) {
    if len(posts) == 0 {
        return
    }

    var sb strings.Builder
    sb.WriteString("CRITICAL INSTRUCTIONS:\n")
    sb.WriteString("1. You are a helpful assistant for a personal blog. Use the following blog posts as your knowledge base to answer user questions.\n")
    sb.WriteString("2. You must answer the user's question directly and completely in a single response.\n")
    sb.WriteString("3. Do NOT ask follow-up questions (e.g., 'Does that help?', 'Would you like to know more?').\n")
    sb.WriteString("4. Assume this is a one-off interaction and the user cannot reply to you.\n")

    sb.WriteString("--- START BLOG CONTENT ---\n")

    for i, post := range posts {
        sb.WriteString(fmt.Sprintf("\n--- Post %d ---\n%s\n", i+1, post.Body))
    }

    sb.WriteString("\n--- END BLOG CONTENT ---\n")

    c.systemPrompt = genai.NewUserContent(genai.Text(sb.String()))
}

func (c *Client) Ask(ctx context.Context, prompt string) (string, error) {
    if c.systemPrompt == nil {
        return "", errors.New("no content loaded; call LoadContent first")
    }

    var lastErr error

    numAttempts := 5
    // Retries with backoff: the API occasionally returns transient errors, so we retry up to numAttempts times.
    for attempt := range numAttempts {
        if attempt > 0 {
            sleepDuration := time.Duration(attempt) * time.Second
            select {
            case <-time.After(sleepDuration):
                // Continue
            case <-ctx.Done():
                return "", ctx.Err()
            }
        }

        cs := c.model.StartChat()
        cs.History = []*genai.Content{
            c.systemPrompt,
            {
                Role: "model",
                Parts: []genai.Part{
                    genai.Text("Understood. I have read the blog posts and am ready to answer questions about them."),
                },
            },
        }

        resp, err := cs.SendMessage(ctx, genai.Text(prompt))
        if err == nil {
            return printResponse(resp)
        }

        lastErr = err

        errStr := err.Error()
        if !strings.Contains(errStr, "503") && !strings.Contains(errStr, "429") {
            return "", fmt.Errorf("non-retriable error: %w", err)
        }
    }

    return "", fmt.Errorf("failed after %d attempts: %w", numAttempts, lastErr)
}

func (c *Client) Close() error {
    return c.genClient.Close()
}

func printResponse(resp *genai.GenerateContentResponse) (string, error) {
    var result string
    for _, cand := range resp.Candidates {
        if cand.Content != nil {
            for _, part := range cand.Content.Parts {
                if txt, ok := part.(genai.Text); ok {
                    result += string(txt)
                }
            }
        }
    }
    if result == "" {
        return "", errors.New("model returned empty response")
    }
    return result, nil
}
I've annotated the code with comments to communicate its intention.
We call LoadContent once on server startup, initialize the QueryProcessor with the Client, and then hand the QueryProcessor to the Handler. A rough sketch of that wiring is below.
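Here's a minimal sketch of what that startup wiring might look like. The import paths, the GEMINI_API_KEY env var name, loadAllPosts, and NewHandler are illustrative assumptions, not the blog's exact code:

package main

import (
    "context"
    "log"
    "net/http"
    "os"

    "github.com/thornhall/blog/internal/ai"
    "github.com/thornhall/blog/internal/posts"
)

// loadAllPosts is a stand-in for however the blog actually loads its posts.
func loadAllPosts() []posts.TextPost { return nil }

func main() {
    ctx := context.Background()

    // Build the Gemini client once at startup and load the blog content into it.
    client, err := ai.NewClient(ctx, os.Getenv("GEMINI_API_KEY"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()
    client.LoadContent(loadAllPosts())

    // Wrap the client in the QueryProcessor so the semaphore guards every query.
    qp := ai.NewQueryProcessor(client)

    // Hand the QueryProcessor to the HTTP handler (NewHandler is illustrative).
    h := NewHandler(qp)

    appMux := http.NewServeMux()
    appMux.HandleFunc("POST /api/questions", h.HandleQuestion)
    log.Fatal(http.ListenAndServe(":8080", appMux))
}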
Conclusion
With this system in place, users can now ask an expert specific questions about my engineering journey, my tutorials, or my opinions on Go, all powered by a lightweight, context-aware AI.