
Retrieval Augmented Generation (RAG) AI for the Blog

Author Thorn Hall

What if there were an expert, pre-trained on my blog content, who could give comprehensive and insightful answers to user questions?

Below is how I added a non-invasive, personalized AI assistant to my blog.

The Tech Stack

For my purposes, I decided to use Google's Gemini 2.5 Flash model, for a couple of reasons:

  • The 2.5 Flash model is very cheap and efficient.
  • For my use case, Retrieval Augmented Generation (RAG), we don't need a more advanced model.

The Implementation

To actually handle this, we need to expose yet another API endpoint for the blog via the router.

appMux.HandleFunc("POST /api/questions", h.HandleQuestion)

With this route exposed, the frontend can make requests to the /api/questions endpoint.
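
To make the request/response contract concrete, here is a minimal sketch of a caller, written as a standalone Go client for consistency with the rest of the post. The host, port, and question text are placeholders; only the JSON field names match the Question and Answer structs shown below.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Build the request body the handler expects: {"question": "..."}.
	payload, _ := json.Marshal(map[string]string{"question": "What does this blog cover?"})

	// The host and port are placeholders; the real frontend calls its own origin.
	resp, err := http.Post("http://localhost:8080/api/questions", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The handler responds with {"answer": "..."}.
	var out struct {
		Answer string `json:"answer"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Answer)
}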

Handler

The Handler for the /api/questions route looks like this:

type Question struct {
	Body string `json:"question"`
}

type Answer struct {
	Answer string `json:"answer"`
}

func (h *Handler) HandleQuestion(w http.ResponseWriter, r *http.Request) {
	var req Question
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		HttpErrorResponse(w, "bad request", http.StatusBadRequest)
		return
	}

	if len(req.Body) < 5 || len(req.Body) > 300 {
		HttpErrorResponse(w, "message too short or too long", http.StatusBadRequest)
		return
	}

	answer, err := h.queryProcessor.Query(r.Context(), req.Body)
	if err != nil {
		h.log.Error("unable to query model", "error", err)
		HttpErrorResponse(w, "unable to query model", http.StatusInternalServerError)
		return
	}

	res := Answer{Answer: answer}

	json.NewEncoder(w).Encode(res)
}

The most important line in that block is the following call:

answer, err := h.queryProcessor.Query(r.Context(), req.Body)

Now is a great time to go over the QueryProcessor implementation.

QueryProcessor

The QueryProcessor looks like this:

package ai

import (
	"context"
	"fmt"
	"time"
)

const bufferSize = 10

type QueryProcessor struct {
	client *Client
	sem    chan struct{}
}

func NewQueryProcessor(client *Client) *QueryProcessor {
	sem := make(chan struct{}, bufferSize)
	q := &QueryProcessor{
		sem:    sem,
		client: client,
	}
	return q
}

func (q *QueryProcessor) Query(requestCtx context.Context, question string) (string, error) {
	select {
	case q.sem <- struct{}{}:
		defer func() { <-q.sem }()
		return q.client.Ask(requestCtx, question)
	case <-time.After(time.Second * 8):
		return "", fmt.Errorf("query timed out")
	case <-requestCtx.Done():
		return "", nil
	}
}

The most important detail is the Query function.

In it, we use the sem channel as a semaphore. What does that mean? A semaphore is a concurrency primitive that limits how many consumers can use a resource at once.

For example, say 12 users are on my website at the same time and they all ask the AI a question. Because bufferSize is set to 10, users 11 and 12 will have to wait until other users have finished their queries. Below is a commented Query to make this clearer:

func (q *QueryProcessor) Query(requestCtx context.Context, question string) (string, error) {
	select {
	// q.sem <- struct{}{} will block if q.sem is full, giving us natural rate limiting.
	case q.sem <- struct{}{}:
		// we must release the resource after our question is answered so other users can participate.
		defer func() { <-q.sem }()
		return q.client.Ask(requestCtx, question)
	// spam protection: if we've been waiting for longer than 8 seconds, get rid of this user's request.
	case <-time.After(time.Second * 8):
		return "", fmt.Errorf("query timed out")
	// if the user leaves or closes the connection before we acquire the semaphore, we just return.
	case <-requestCtx.Done():
		return "", nil
	}
}

I decided to implement it this way for a couple of reasons:

  1. By limiting it to 10 concurrent queries, we protect ourselves from being spammed by a script (and from spamming Google in turn). The first 10 queries will succeed, and the remaining ones will wait. After 8 seconds, any requests still waiting are dropped.
  2. I didn't want to use an explicit queue like I did for anonymous messages, because in this case the process is synchronous: the user expects to wait for the AI model's response. With anonymous messages we could "fire and forget," but we can't do that here, because we need the AI's answer in order to respond to the user. A rough sketch of the difference follows this list.
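
For contrast, this is roughly what the "fire and forget" shape looks like. It is an illustrative sketch only, not the actual anonymous-message code; the type and method names are made up.

// A buffered channel plus a background worker lets the handler return
// immediately, because nobody is waiting on the result.
type MessageQueue struct {
	jobs chan string
}

func NewMessageQueue(size int) *MessageQueue {
	q := &MessageQueue{jobs: make(chan string, size)}
	go q.worker()
	return q
}

// Enqueue returns as soon as the message is buffered; the caller never waits.
func (q *MessageQueue) Enqueue(msg string) {
	q.jobs <- msg
}

func (q *MessageQueue) worker() {
	for msg := range q.jobs {
		_ = msg // persist or forward the message in the background
	}
}

That pattern works when the caller doesn't need the result. An AI answer has to travel back in the same HTTP response, so blocking on a semaphore fits better here.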

Now is a perfect time to discuss the client implementation, which is how we actually communicate with the Google API.

Client

This is what client.go looks like:

package ai

import (
	"context"
	"errors"
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/google/generative-ai-go/genai"
	"github.com/thornhall/blog/internal/posts"
	"google.golang.org/api/option"
)

const (
	modelName       = "gemini-2.5-flash"
	maxOutputTokens = 8192 // legacy limit for output tokens
)

type Client struct {
	genClient    *genai.Client
	model        *genai.GenerativeModel
	systemPrompt *genai.Content
}

func NewClient(ctx context.Context, apiKey string) (*Client, error) {
	if apiKey == "" {
		log.Println("ERROR: gemini api key empty")
		return nil, fmt.Errorf("gemini api key empty")
	}
	client, err := genai.NewClient(ctx, option.WithAPIKey(apiKey))
	if err != nil {
		return nil, fmt.Errorf("failed to create genai client: %w", err)
	}

	gm := client.GenerativeModel(modelName)

	gm.SetTemperature(1)
	gm.SetMaxOutputTokens(maxOutputTokens)

	return &Client{
		genClient: client,
		model:     gm,
	}, nil
}

// LoadContent takes all of the blog posts and loads them into the model's memory.
// Critically, it also injects context into the queries so the AI knows how to respond to users.
func (c *Client) LoadContent(posts []posts.TextPost) {
	if len(posts) == 0 {
		return
	}

	var sb strings.Builder
	sb.WriteString("CRITICAL INSTRUCTIONS:\n")
	sb.WriteString("1. You are a helpful assistant for a personal blog. Use the following blog posts as your knowledge base to answer user questions.\n")
	sb.WriteString("2. You must answer the user's question directly and completely in a single response.\n")
	sb.WriteString("3. Do NOT ask follow-up questions (e.g., 'Does that help?', 'Would you like to know more?').\n")
	sb.WriteString("4. Assume this is a one-off interaction and the user cannot reply to you.\n")

	sb.WriteString("--- START BLOG CONTENT ---\n")

	for i, post := range posts {
		sb.WriteString(fmt.Sprintf("\n--- Post %d ---\n%s\n", i+1, post.Body))
	}

	sb.WriteString("\n--- END BLOG CONTENT ---\n")

	c.systemPrompt = genai.NewUserContent(genai.Text(sb.String()))
}

func (c *Client) Ask(ctx context.Context, prompt string) (string, error) {
	if c.systemPrompt == nil {
		return "", errors.New("no content loaded; call LoadContent first")
	}

	var lastErr error

	numAttempts := 5
	// Retries with backoff: the API doesn't always work, so we retry up to 5 times.
	for attempt := range numAttempts {
		if attempt > 0 {
			sleepDuration := time.Duration(attempt) * time.Second
			select {
			case <-time.After(sleepDuration):
				// Continue
			case <-ctx.Done():
				return "", ctx.Err()
			}
		}

		cs := c.model.StartChat()
		cs.History = []*genai.Content{
			c.systemPrompt,
			{
				Role: "model",
				Parts: []genai.Part{
					genai.Text("Understood. I have read the blog posts and am ready to answer questions about them."),
				},
			},
		}

		resp, err := cs.SendMessage(ctx, genai.Text(prompt))
		if err == nil {
			return printResponse(resp)
		}

		lastErr = err

		errStr := err.Error()
		if !strings.Contains(errStr, "503") && !strings.Contains(errStr, "429") {
			return "", fmt.Errorf("non-retriable error: %w", err)
		}
	}

	return "", fmt.Errorf("failed after %d attempts: %w", numAttempts, lastErr)
}

func (c *Client) Close() error {
	return c.genClient.Close()
}

func printResponse(resp *genai.GenerateContentResponse) (string, error) {
	var result string
	for _, cand := range resp.Candidates {
		if cand.Content != nil {
			for _, part := range cand.Content.Parts {
				if txt, ok := part.(genai.Text); ok {
					result += string(txt)
				}
			}
		}
	}
	if result == "" {
		return "", errors.New("model returned empty response")
	}
	return result, nil
}

I've annotated the code with comments to communicate its intent.

We call LoadContent once on server startup, then initialize the QueryProcessor with the Client. Finally, the Handler is given the QueryProcessor.
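
Roughly, that startup wiring looks like the sketch below. LoadPosts, NewHandler, the ai import path, the port, and the GEMINI_API_KEY variable name are stand-ins for illustration, not necessarily the real names.

package main

import (
	"context"
	"log"
	"net/http"
	"os"

	"github.com/thornhall/blog/internal/ai"
	"github.com/thornhall/blog/internal/posts"
)

func main() {
	ctx := context.Background()

	client, err := ai.NewClient(ctx, os.Getenv("GEMINI_API_KEY"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Load every blog post into the system prompt exactly once at startup.
	client.LoadContent(posts.LoadPosts()) // LoadPosts is a stand-in for however posts are read

	queryProcessor := ai.NewQueryProcessor(client)
	h := NewHandler(queryProcessor) // NewHandler is a stand-in constructor

	appMux := http.NewServeMux()
	appMux.HandleFunc("POST /api/questions", h.HandleQuestion)

	log.Fatal(http.ListenAndServe(":8080", appMux))
}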

Conclusion

With this system in place, users can now ask an expert specific questions about my engineering journey, my tutorials, or my opinions on Go, all powered by a lightweight, context-aware AI.
