Retrieval Augmented Generation (RAG) AI for the Blog
What if there was an expert, pre-trained on my blog content, who could give comprehensive and insightful answers to user questions?
Below is how I added a non-invasive, personalized AI assistant to my blog.
The Tech Stack
For my purposes, I decided to use Google's Gemini 2.5 Flash model, for a couple of reasons:
- The 2.5 Flash model is very cheap and efficient.
- For my use case, Retrieval Augmented Generation (RAG), we don't need a more advanced model.
The Implementation
To actually handle this, we need to expose yet another API endpoint for the blog via the router.
appMux.HandleFunc("POST /api/questions", h.HandleQuestion)
With this route exposed, the frontend can make requests to the /api/questions endpoint.
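For reference, an exchange with that endpoint looks roughly like this (the field names match the Question and Answer structs in the handler below; the example question is just illustrative):

Request body:

{ "question": "What do you write about on this blog?" }

Response body:

{ "answer": "..." }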
Handler
The Handler for the /api/questions route looks like this:
type Question struct {
    Body string `json:"question"`
}

type Answer struct {
    Answer string `json:"answer"`
}

func (h *Handler) HandleQuestion(w http.ResponseWriter, r *http.Request) {
    var req Question
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        HttpErrorResponse(w, "bad request", http.StatusBadRequest)
        return
    }

    if len(req.Body) < 5 || len(req.Body) > 300 {
        HttpErrorResponse(w, "message too short or too long", http.StatusBadRequest)
        return
    }

    answer, err := h.queryProcessor.Query(r.Context(), req.Body)
    if err != nil {
        h.log.Error("unable to query model", "error", err)
        HttpErrorResponse(w, "unable to query model", http.StatusInternalServerError)
        return
    }

    res := Answer{Answer: answer}

    json.NewEncoder(w).Encode(res)
}
The most important call in that block is this one:

answer, err := h.queryProcessor.Query(r.Context(), req.Body)
Now is a great time to go over the QueryProcessor implementation.
QueryProcessor
The QueryProcessor looks like this:
package ai

import (
    "context"
    "fmt"
    "time"
)

const bufferSize = 10

type QueryProcessor struct {
    client *Client
    sem    chan struct{}
}

func NewQueryProcessor(client *Client) *QueryProcessor {
    sem := make(chan struct{}, bufferSize)
    q := &QueryProcessor{
        sem:    sem,
        client: client,
    }
    return q
}

func (q *QueryProcessor) Query(requestCtx context.Context, question string) (string, error) {
    select {
    case q.sem <- struct{}{}:
        defer func() { <-q.sem }()
        return q.client.Ask(requestCtx, question)
    case <-time.After(time.Second * 8):
        return "", fmt.Errorf("query timed out")
    case <-requestCtx.Done():
        return "", nil
    }
}
The most important detail is the Query function:
In it, we use the sem channel as a semaphore. What does that mean? A semaphore is a concurrency primitive that limits how many consumers can use a shared resource at once.
For example, say 12 users are on my website at the same time and they all ask the AI a question. Because bufferSize is 10, users 11 and 12 have to wait until other users' queries finish before theirs are sent. Below is a commented version of Query to make this clearer:
func (q *QueryProcessor) Query(requestCtx context.Context, question string) (string, error) {
    select {
    // q.sem <- struct{}{} will block if q.sem is full, giving us natural rate limiting.
    case q.sem <- struct{}{}:
        // we must release the resource after our question is answered so other users can participate.
        defer func() { <-q.sem }()
        return q.client.Ask(requestCtx, question)
    // spam protection: if we've been waiting for longer than 8 seconds, get rid of this user's request.
    case <-time.After(time.Second * 8):
        return "", fmt.Errorf("query timed out")
    // if the user ends up leaving or closing the connection before we are able to acquire the semaphore, we just return.
    case <-requestCtx.Done():
        return "", nil
    }
}
I decided to implement it this way for a couple of reasons:
- By limiting it to 10 queries at a time, we protect ourselves from being spammed by a script (and thus from spamming Google). The first 10 queries will succeed, the remaining ones will hang, and after 8 seconds any still-hanging requests get booted.
- I didn't want to use an explicit queue like I did for anonymous messages, because in this case the process is synchronous: the user expects to wait for the AI model's response. In the anonymous message case, we could "fire and forget." We can't do that when we need the AI's response in order to respond to the user. A rough sketch of the difference is below.
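To make that contrast concrete, here is a minimal sketch of the two shapes. This is not the blog's actual anonymous-message code; the function names and the queue are illustrative assumptions.

// Fire and forget: the handler pushes the message onto a queue and returns
// immediately; a background worker processes it later. The user never waits
// on the result. (Illustrative only, not the blog's real code.)
func EnqueueAnonymousMessage(queue chan<- string, msg string) {
    queue <- msg
}

// Synchronous: the HTTP response *is* the model's answer, so the handler has
// to block on Query until the answer (or an error) comes back.
func AnswerQuestion(ctx context.Context, q *QueryProcessor, question string) (string, error) {
    return q.Query(ctx, question)
}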
Now is a perfect time to discuss the client implementation, which is how we actually communicate with the Google API.
Client
This is what client.go looks like:
package ai

import (
    "context"
    "errors"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/google/generative-ai-go/genai"
    "github.com/thornhall/blog/internal/posts"
    "google.golang.org/api/option"
)

const (
    modelName       = "gemini-2.5-flash"
    maxOutputTokens = 8192 // legacy limit for output tokens
)

type Client struct {
    genClient    *genai.Client
    model        *genai.GenerativeModel
    systemPrompt *genai.Content
}

func NewClient(ctx context.Context, apiKey string) (*Client, error) {
    if apiKey == "" {
        log.Println("ERROR: gemini api key empty")
        return nil, fmt.Errorf("gemini api key empty")
    }
    client, err := genai.NewClient(ctx, option.WithAPIKey(apiKey))
    if err != nil {
        return nil, fmt.Errorf("failed to create genai client: %w", err)
    }

    gm := client.GenerativeModel(modelName)

    gm.SetTemperature(1)
    gm.SetMaxOutputTokens(maxOutputTokens)

    return &Client{
        genClient: client,
        model:     gm,
    }, nil
}

// LoadContent takes all of the blog posts and loads them into the model's memory.
// Critically, it also injects context into the queries so the AI knows how to respond to users.
func (c *Client) LoadContent(posts []posts.TextPost) {
    if len(posts) == 0 {
        return
    }

    var sb strings.Builder
    sb.WriteString("CRITICAL INSTRUCTIONS:\n")
    sb.WriteString("1. You are a helpful assistant for a personal blog. Use the following blog posts as your knowledge base to answer user questions.\n")
    sb.WriteString("2. You must answer the user's question directly and completely in a single response.\n")
    sb.WriteString("3. Do NOT ask follow-up questions (e.g., 'Does that help?', 'Would you like to know more?').\n")
    sb.WriteString("4. Assume this is a one-off interaction and the user cannot reply to you.\n")

    sb.WriteString("--- START BLOG CONTENT ---\n")

    for i, post := range posts {
        sb.WriteString(fmt.Sprintf("\n--- Post %d ---\n%s\n", i+1, post.Body))
    }

    sb.WriteString("\n--- END BLOG CONTENT ---\n")

    c.systemPrompt = genai.NewUserContent(genai.Text(sb.String()))
}

func (c *Client) Ask(ctx context.Context, prompt string) (string, error) {
    if c.systemPrompt == nil {
        return "", errors.New("no content loaded; call LoadContent first")
    }

    var lastErr error

    numAttempts := 5
    // Retries with backoff: the API occasionally returns transient errors, so we retry up to numAttempts times.
    for attempt := range numAttempts {
        if attempt > 0 {
            sleepDuration := time.Duration(attempt) * time.Second
            select {
            case <-time.After(sleepDuration):
                // Continue
            case <-ctx.Done():
                return "", ctx.Err()
            }
        }

        cs := c.model.StartChat()
        cs.History = []*genai.Content{
            c.systemPrompt,
            {
                Role: "model",
                Parts: []genai.Part{
                    genai.Text("Understood. I have read the blog posts and am ready to answer questions about them."),
                },
            },
        }

        resp, err := cs.SendMessage(ctx, genai.Text(prompt))
        if err == nil {
            return printResponse(resp)
        }

        lastErr = err

        errStr := err.Error()
        if !strings.Contains(errStr, "503") && !strings.Contains(errStr, "429") {
            return "", fmt.Errorf("non-retriable error: %w", err)
        }
    }

    return "", fmt.Errorf("failed after %d attempts: %w", numAttempts, lastErr)
}

func (c *Client) Close() error {
    return c.genClient.Close()
}

func printResponse(resp *genai.GenerateContentResponse) (string, error) {
    var result string
    for _, cand := range resp.Candidates {
        if cand.Content != nil {
            for _, part := range cand.Content.Parts {
                if txt, ok := part.(genai.Text); ok {
                    result += string(txt)
                }
            }
        }
    }
    if result == "" {
        return "", errors.New("model returned empty response")
    }
    return result, nil
}
I've annotated the code with comments to communicate its intention.
We call LoadContent once on server startup, initialize the QueryProcessor with the Client, and then hand the QueryProcessor to the Handler. A rough sketch of that wiring is below.
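Here's a minimal sketch of what that startup wiring might look like. The import paths, the GEMINI_API_KEY env var name, loadAllPosts, and NewHandler are illustrative assumptions, not the blog's exact code:

package main

import (
    "context"
    "log"
    "net/http"
    "os"

    "github.com/thornhall/blog/internal/ai"
    "github.com/thornhall/blog/internal/posts"
)

// loadAllPosts is a stand-in for however the blog actually loads its posts.
func loadAllPosts() []posts.TextPost { return nil }

func main() {
    ctx := context.Background()

    // Build the Gemini client once at startup and load the blog content into it.
    client, err := ai.NewClient(ctx, os.Getenv("GEMINI_API_KEY"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()
    client.LoadContent(loadAllPosts())

    // Wrap the client in the QueryProcessor so the semaphore guards every query.
    qp := ai.NewQueryProcessor(client)

    // Hand the QueryProcessor to the HTTP handler (NewHandler is illustrative).
    h := NewHandler(qp)

    appMux := http.NewServeMux()
    appMux.HandleFunc("POST /api/questions", h.HandleQuestion)
    log.Fatal(http.ListenAndServe(":8080", appMux))
}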
Conclusion
With this system in place, users can now ask an expert specific questions about my engineering journey, my tutorials, or my opinions on Go, all powered by a lightweight, context-aware AI.