
Search Engine Optimization in the Build

Author Thorn Hall

How does Google find websites to show users when they search a term?

The answer is that Google has web crawlers that scour the web. Below is how I gave Google's web crawlers a tour guide.

robots.txt and sitemap.xml

First, we need to understand the two main components that web crawlers use when accessing our site. One is the robots.txt data, and the other is the sitemap.xml file.

robots.txt

robots.txt is sort of like the "bouncer" for the site. It tells web crawlers which paths they're allowed to crawl and which ones they aren't. Here is what my robots.txt looks like:

const robotsTxt = `User-agent: *
Allow: /
Disallow: /api/
Sitemap: https://thorn.sh/sitemap.xml`

Note that I'm serving it directly from Go instead of an actual robots.txt file.
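For context, a minimal sketch of what that can look like with net/http is below. The handler and package names are my assumptions, not the site's actual code; only the robotsTxt constant comes from above.

package server

import "net/http"

const robotsTxt = `User-agent: *
Allow: /
Disallow: /api/
Sitemap: https://thorn.sh/sitemap.xml`

// handleRobotsTxt serves robots.txt straight from memory instead of the
// file system. The handler and package names here are hypothetical.
func handleRobotsTxt(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	w.Write([]byte(robotsTxt))
}

// Registered during server setup, for example:
//   mux := http.NewServeMux()
//   mux.HandleFunc("/robots.txt", handleRobotsTxt)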

The robots.txt data tells the web crawler 3 important pieces of information:

  • Which URLs it is allowed to crawl.
  • Which URLs it should avoid.
  • The path of our sitemap.xml.

Now is a perfect time to discuss the sitemap.

sitemap.xml

sitemap.xml exists to tell the web crawler about relevant content on the site. In my example, my sitemap.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://thorn.sh/about/</loc>
    <lastmod>2025-12-07T19:32:14-06:00</lastmod>
  </url>
  <url>
    <loc>https://thorn.sh/adding-metrics-sse/</loc>
    <lastmod>2025-12-07T19:29:21-06:00</lastmod>
  </url>
  <url>
    <loc>https://thorn.sh/architecture/</loc>
    <lastmod>2025-12-07T19:32:14-06:00</lastmod>
  </url>
  ...

Together, they make it easier for Google to discover the content on our site, which improves how it shows up in search results.

How sitemap.xml is Implemented

We build the sitemap.xml file during the build phase. Here's what the process looks like:

  • Our build process generates all of our public files in /public.
  • Then our build process walks the entire /public directory, looking for any files with .html as a suffix.
  • Using each file's path, it forms a valid URL for the sitemap.

Here is what the code for generating the sitemap looks like:

package builder

import (
	"encoding/xml"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"
)

const (
	baseURL   = "https://thorn.sh"
	publicDir = "./public"
)

type URLSet struct {
	XMLName xml.Name `xml:"http://www.sitemaps.org/schemas/sitemap/0.9 urlset"`
	URLs    []URL    `xml:"url"`
}

type URL struct {
	Loc     string `xml:"loc"`
	LastMod string `xml:"lastmod"`
}

func GenerateSiteMap() {
	log.Println("Generating site map...")

	urls, err := generateURLs()
	if err != nil {
		panic(err)
	}

	f, err := os.Create(filepath.Join(publicDir, "sitemap.xml"))
	if err != nil {
		panic(err)
	}
	defer f.Close()

	f.Write([]byte(xml.Header))
	enc := xml.NewEncoder(f)
	enc.Indent("", "  ")
	if err := enc.Encode(URLSet{URLs: urls}); err != nil {
		panic(err)
	}

	log.Printf("Sitemap generated with %d URLs\n", len(urls))
}

func generateURLs() ([]URL, error) {
	var urls []URL

	err := filepath.Walk(publicDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}

		// Skip directories and non-HTML files
		if info.IsDir() || filepath.Ext(path) != ".html" {
			return nil
		}

		// Convert file path to URL
		// Rel path: "public/blog/post.html" -> "blog/post.html"
		relPath, _ := filepath.Rel(publicDir, path)

		// Fix Windows slashes to web slashes
		slug := filepath.ToSlash(relPath)

		// Handle "clean URLs" (remove index.html and .html extensions)
		if filepath.Base(slug) == "index.html" {
			slug = filepath.Dir(slug) // "blog/index.html" -> "blog"
			if slug == "." {
				slug = "" // root
			}
		} else {
			slug = strings.TrimSuffix(slug, ".html") // "about.html" -> "about"
		}

		finalURL := ""
		if slug == "" {
			finalURL = strings.TrimSuffix(fmt.Sprintf("%s/%s", baseURL, slug), "/")
		} else {
			finalURL = fmt.Sprintf("%s/%s/", baseURL, slug)
		}

		urls = append(urls, URL{
			Loc: finalURL,
			// Note: this relies on self-hosted runners + incremental builds.
			// On ephemeral runners (like standard GitHub Actions), file ModTime
			// resets on every clone, causing every page to look "new" and
			// confusing Google.
			LastMod: info.ModTime().Format(time.RFC3339),
		})

		return nil
	})

	return urls, err
}
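For completeness, here's roughly how this fits into the build step. The entry point and module path below are hypothetical; only builder.GenerateSiteMap comes from the code above.

// Hypothetical build entry point; the module path and the surrounding build
// steps are assumptions for illustration.
package main

import "example.com/site/internal/builder"

func main() {
	// ... render templates and copy assets into ./public first ...

	// Then walk ./public and write ./public/sitemap.xml.
	builder.GenerateSiteMap()
}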

Final Optimization

Previously, I talked about the custom compression middleware implemented for this site. One thing I missed in that implementation is this:

return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.Header().Add("Vary", "Accept-Encoding")

It signals to downstream caches (like CDNs) that they should store separate versions of the file, keyed on the request's Accept-Encoding header. That way, a user whose browser can't handle compression doesn't accidentally get served a cached .br file meant for a modern browser.
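As a simplified, self-contained sketch of where that header sits (not the site's actual middleware, which also serves the pre-compressed files):

package main

import "net/http"

// withVary is a stripped-down illustration: it only sets the Vary header.
// A real compression middleware would also inspect Accept-Encoding and swap
// in a pre-compressed .br or .gz file before calling next.
func withVary(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Tell caches the response body depends on Accept-Encoding,
		// even when the uncompressed version ends up being served.
		w.Header().Add("Vary", "Accept-Encoding")
		next.ServeHTTP(w, r)
	})
}

func main() {
	fs := http.FileServer(http.Dir("./public"))
	http.ListenAndServe(":8080", withVary(fs))
}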

Conclusion

Now, during the build phase, we generate a sitemap.xml file, and our robots.txt lives directly in our server code instead of on the file system. Web crawlers have better information about our website, so our ranking in Google's search results should (theoretically) improve.
