Google Search Console emailed me. Four URLs on ricsmo.com were returning 404 errors. I hadn’t submitted these URLs. I didn’t recognize two of them. And one was a page that definitely existed.
Here’s what I found, in order from most to least embarrassing.
The One I Shipped Myself
The URL was /blog/tag/blog/. My tags page lives at /blog/tags. There is no /blog/tag/ route on my site.
Google found it because my sitemap generated a link to /blog/tag/blog/. How? One of my blog posts had "blog" listed as a tag. The tag system turned it into a page at /blog/tag/blog/, and the sitemap generator picked it up and handed it to Google.
The tag “blog” isn’t a real tag. It was a catch-all I threw on a post about share buttons. The content wasn’t about blogging. It was about React. But once it was in the frontmatter, the build system treated it as legitimate.
The fix was removing the tag. The build no longer generates that page, and the sitemap no longer includes it.
The lesson: every tag in your frontmatter becomes a public URL. Don’t use tags as categories or catch-alls. If it wouldn’t make sense as a standalone page, it doesn’t belong in the tags array.
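To make that concrete, here is a minimal sketch of the kind of dynamic tag route that produces these pages, written against Astro's content collections. It is not my actual file, and it assumes a collection named blog whose frontmatter schema has an optional tags array; the point is just that every string in that array becomes a public URL.

```astro
---
// src/pages/blog/tag/[tag].astro (hypothetical sketch, not my real code)
// Every unique tag found in any post's frontmatter becomes a page here,
// which is exactly how a throwaway "blog" tag turned into /blog/tag/blog/.
import { getCollection } from 'astro:content';

export async function getStaticPaths() {
  const posts = await getCollection('blog');
  const tags = [...new Set(posts.flatMap((post) => post.data.tags ?? []))];
  return tags.map((tag) => ({ params: { tag } }));
}

const { tag } = Astro.params;
const posts = await getCollection('blog', (post) =>
  (post.data.tags ?? []).includes(tag)
);
---
<h1>Posts tagged {tag}</h1>
<ul>
  {posts.map((post) => (
    <li><a href={`/blog/${post.slug}/`}>{post.data.title}</a></li>
  ))}
</ul>
```

The sitemap integration walks the same generated routes, so a stray tag propagates straight into the sitemap without any extra step.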
The Ones From a Previous Site
Two of the 404s were URLs from an older version of ricsmo.com that no longer exist:
/blog/choosing-your-digital-stage-udemy-udacity-pluralsight-skillshare-coursera-edx-or-linkedin-learning/
/blog/how-to-create-an-online-course-for-free-on-udemy/
Google had these in its index from a previous crawl and was checking whether they still resolved. They don’t. The fix is submitting removal requests in Search Console and waiting for Google to drop them from the index.
The www Subdomain
The fourth 404 was www.ricsmo.com. Google was crawling the www subdomain, which doesn’t have a site configured. Every URL on www returned 404.
This one was my fault for not setting up a redirect. Cloudflare makes this trivial. One redirect rule: if the hostname matches www.ricsmo.com, redirect to https://ricsmo.com$request_uri. Takes two minutes to configure.
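I used a dashboard redirect rule, but the same behavior fits in a few lines if you prefer it in code. Here is a sketch of the equivalent as a Cloudflare Worker (hypothetical; not what I actually deployed):

```ts
// A www-to-apex redirect as a Cloudflare Worker. Sketch only; I used a redirect rule instead.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.hostname === 'www.ricsmo.com') {
      url.hostname = 'ricsmo.com'; // keeps the path and query string intact
      return Response.redirect(url.toString(), 301);
    }
    return fetch(request); // anything not on www passes through untouched
  },
};
```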
I also added a Host: https://ricsmo.com directive to my robots.txt as a hint for crawlers that respect it. The redirect rule is the real fix. The robots.txt directive is extra credit.
The Response Code
Interestingly, the referring pages listed in Search Console all showed a 200 status code. Google wasn’t saying “these pages return 404.” It was saying “these pages contain links to 404s.” The pages themselves were fine. The outgoing links were the problem.
This is an important distinction. My blog posts were healthy. The links within them were not. SERanking, the SEO audit tool that flagged these, was crawling my pages and checking every external link. When it found links pointing to URLs that no longer existed, it reported them.
Most of the time, broken external links are someone else’s problem. A site you linked to changed its URL structure or went offline. Not your fault, not your responsibility. But in my case, two of the four were self-inflicted.
The Broader Audit
This incident prompted a full SEO audit. I ran SERanking against the entire site and found a few other issues:
External links with no anchor text. My share buttons (X, LinkedIn, Facebook, Reddit) wrap SVG icons in <a> tags with no visible text. The links work fine. Screen readers read the aria-label. But some SEO crawlers only check text content and flag icon-only links as missing anchors. I added title attributes as a fallback.
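For reference, the fixed markup looks roughly like this (a simplified Astro component; the real icons and share URLs differ):

```astro
---
// ShareButton.astro (simplified sketch). aria-label is what screen readers announce;
// title is the redundant fallback that text-only SEO crawlers can pick up.
const { href, network } = Astro.props;
---
<a
  href={href}
  aria-label={`Share on ${network}`}
  title={`Share on ${network}`}
  target="_blank"
  rel="noopener noreferrer"
>
  <slot /><!-- the SVG icon goes here -->
</a>
```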
Text-to-HTML ratio below 10%. Every page was flagged for this. The cause: Next.js static export injects ~28KB of inline React scripts into every page. The actual text content on a blog post is maybe 5KB. The framework overhead was 5x the content. This was one of the reasons I migrated to Astro.
Canonical URL inheritance. My root layout set a canonical URL pointing to the homepage. Every child page inherited it. Google saw every page as a duplicate of the homepage. This one required removing the canonical from the root layout and adding self-referencing canonicals to every individual page.
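One common pattern for the self-referencing version (not necessarily how I wired mine) is to derive the canonical from the page being rendered, so every page points at itself:

```astro
---
// In the <head>: build the canonical from the current page's own path
// instead of hard-coding the homepage. Assumes `site: 'https://ricsmo.com'`
// is set in astro.config.
const canonical = new URL(Astro.url.pathname, Astro.site);
---
<link rel="canonical" href={canonical} />
```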
None of these were visible problems. The site looked fine. Content loaded. Links worked. But under the surface, Google was getting confused signals about my site structure.
The Takeaway
Google finds things you didn’t know were broken. It crawls URLs you didn’t create, follows links you forgot about, and caches pages long after you’ve updated them.
The fix isn’t to panic about every 404 in Search Console. Most resolve themselves over time. The fix is to have clean site structure from the start: no ghost pages from bad tags, proper redirects for subdomains, and canonical URLs that actually point to the right page.
Check Search Console regularly. Fix the things you can. Submit removal requests for the things you can’t. And make sure your sitemap only contains URLs that actually exist.
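That last check is easy to automate: fetch the sitemap, request every URL in it, and flag anything that does not come back 200. A minimal sketch follows; the sitemap path, the naive regex parsing, and the plain-sitemap (not sitemap-index) assumption are all mine, so adapt it to your setup.

```ts
// check-sitemap.ts: a quick sketch; needs Node 18+ for the built-in fetch.
// Assumes a plain sitemap (not a sitemap index) at the URL below.
const SITEMAP = 'https://ricsmo.com/sitemap.xml';

async function main(): Promise<void> {
  const xml = await (await fetch(SITEMAP)).text();
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  for (const url of urls) {
    // HEAD with redirect: 'manual' so redirects get flagged too; a sitemap
    // should only list final, 200-responding URLs.
    const res = await fetch(url, { method: 'HEAD', redirect: 'manual' });
    if (res.status !== 200) console.log(`${res.status}  ${url}`);
  }
  console.log(`Checked ${urls.length} URLs from ${SITEMAP}`);
}

main();
```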
If you’re building a site and want to avoid these headaches, let’s talk.