
Search Engine Optimization in the Build

Author Thorn Hall

How does Google find websites to show users when they search a term?

The answer is that Google has web crawlers that scour the web. Below is how I gave Google's web crawlers a tour guide.

robots.txt and sitemap.xml

First, we need to understand the two main components that web crawlers use when accessing our site. One is the robots.txt data, and the other is the sitemap.xml file.

robots.txt

robots.txt is sort of like the "bouncer" for the site. It tells web crawlers which paths they're allowed to crawl and which ones they aren't. Here is what my robots.txt looks like:

const robotsTxt = `User-agent: *
Allow: /
Disallow: /api/
Sitemap: https://thorn.sh/sitemap.xml`

Note that I'm serving it directly from Go instead of an actual robots.txt file.
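For context, a minimal sketch of what that can look like with net/http is below. The handler and package names are my assumptions, not the site's actual code; only the robotsTxt constant comes from above.

package server

import "net/http"

const robotsTxt = `User-agent: *
Allow: /
Disallow: /api/
Sitemap: https://thorn.sh/sitemap.xml`

// handleRobotsTxt serves robots.txt straight from memory instead of the
// file system. The handler and package names here are hypothetical.
func handleRobotsTxt(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	w.Write([]byte(robotsTxt))
}

// Registered during server setup, for example:
//   mux := http.NewServeMux()
//   mux.HandleFunc("/robots.txt", handleRobotsTxt)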

The robots.txt data tells the web crawler 3 important pieces of information:

  • Which URLs it is allowed to crawl.
  • Which URLs it should avoid.
  • The path of our sitemap.xml.

Now is a perfect time to discuss the sitemap.

sitemap.xml

sitemap.xml exists to tell the web crawler about relevant content on the site. In my example, my sitemap.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://thorn.sh/about/</loc>
    <lastmod>2025-12-07T19:32:14-06:00</lastmod>
  </url>
  <url>
    <loc>https://thorn.sh/adding-metrics-sse/</loc>
    <lastmod>2025-12-07T19:29:21-06:00</lastmod>
  </url>
  <url>
    <loc>https://thorn.sh/architecture/</loc>
    <lastmod>2025-12-07T19:32:14-06:00</lastmod>
  </url>
  ...

Together, they make it easier for Google to discover the content on our site, which improves how it shows up in search results.

How sitemap.xml is Implemented

We build the sitemap.xml file during the build phase. Here's what the process looks like:

  • Our build process generates all of our public files in /public.
  • Then our build process walks the entire /public directory, looking for any files with .html as a suffix.
  • Using each file's path, it forms a valid URL for the sitemap.

Here is what the code for generating the sitemap looks like:

package builder

import (
	"encoding/xml"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"
)

const (
	baseURL   = "https://thorn.sh"
	publicDir = "./public"
)

type URLSet struct {
	XMLName xml.Name `xml:"http://www.sitemaps.org/schemas/sitemap/0.9 urlset"`
	URLs    []URL    `xml:"url"`
}

type URL struct {
	Loc     string `xml:"loc"`
	LastMod string `xml:"lastmod"`
}

func GenerateSiteMap() {
	log.Println("Generating site map...")

	urls, err := generateURLs()
	if err != nil {
		panic(err)
	}

	f, err := os.Create(filepath.Join(publicDir, "sitemap.xml"))
	if err != nil {
		panic(err)
	}
	defer f.Close()

	f.Write([]byte(xml.Header))
	enc := xml.NewEncoder(f)
	enc.Indent("", "  ")
	if err := enc.Encode(URLSet{URLs: urls}); err != nil {
		panic(err)
	}

	log.Printf("Sitemap generated with %d URLs\n", len(urls))
}

func generateURLs() ([]URL, error) {
	var urls []URL

	err := filepath.Walk(publicDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}

		// Skip directories and non-HTML files
		if info.IsDir() || filepath.Ext(path) != ".html" {
			return nil
		}

		// Convert file path to URL
		// Rel path: "public/blog/post.html" -> "blog/post.html"
		relPath, _ := filepath.Rel(publicDir, path)

		// Fix Windows slashes to web slashes
		slug := filepath.ToSlash(relPath)

		// Handle "clean URLs" (remove index.html and .html extensions)
		if filepath.Base(slug) == "index.html" {
			slug = filepath.Dir(slug) // "blog/index.html" -> "blog"
			if slug == "." {
				slug = "" // root
			}
		} else {
			slug = strings.TrimSuffix(slug, ".html") // "about.html" -> "about"
		}

		finalURL := ""
		if slug == "" {
			finalURL = strings.TrimSuffix(fmt.Sprintf("%s/%s", baseURL, slug), "/")
		} else {
			finalURL = fmt.Sprintf("%s/%s/", baseURL, slug)
		}

		urls = append(urls, URL{
			Loc: finalURL,
			// Note: this relies on self-hosted runners + incremental builds.
			// On ephemeral runners (like standard GitHub Actions), file ModTime
			// resets on every clone, causing every page to look "new" and
			// confusing Google.
			LastMod: info.ModTime().Format(time.RFC3339),
		})

		return nil
	})

	return urls, err
}
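For completeness, here's roughly how this fits into the build step. The entry point and module path below are hypothetical; only builder.GenerateSiteMap comes from the code above.

// Hypothetical build entry point; the module path and the surrounding build
// steps are assumptions for illustration.
package main

import "example.com/site/internal/builder"

func main() {
	// ... render templates and copy assets into ./public first ...

	// Then walk ./public and write ./public/sitemap.xml.
	builder.GenerateSiteMap()
}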

Final Optimization

Previously, I talked about the custom compression middleware implemented for this site. One thing I missed in that implementation is this:

return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.Header().Add("Vary", "Accept-Encoding")

It signals to downstream caches (like CDNs) that they should store separate versions of the file, keyed on the request's Accept-Encoding header. That way, a user whose browser can't handle compression doesn't accidentally get served a cached .br file meant for a modern browser.
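As a simplified, self-contained sketch of where that header sits (not the site's actual middleware, which also serves the pre-compressed files):

package main

import "net/http"

// withVary is a stripped-down illustration: it only sets the Vary header.
// A real compression middleware would also inspect Accept-Encoding and swap
// in a pre-compressed .br or .gz file before calling next.
func withVary(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Tell caches the response body depends on Accept-Encoding,
		// even when the uncompressed version ends up being served.
		w.Header().Add("Vary", "Accept-Encoding")
		next.ServeHTTP(w, r)
	})
}

func main() {
	fs := http.FileServer(http.Dir("./public"))
	http.ListenAndServe(":8080", withVary(fs))
}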

Conclusion

Now, during the build phase, we generate a sitemap.xml file, and our robots.txt lives directly in our server code instead of on the file system. Web crawlers have better information about our website, so our ranking in Google's search results should (theoretically) improve.
