You're Not Testing What You Think You're Testing

Jack Hartzler — Mon, 16 Mar 2026 04:49:24 GMT

The other day I had to change my car battery. I've done it before, so I was pretty confident that it would go smoothly. I had the tools, I knew how to scrape corrosion off the terminals, and I knew how to work the finicky bolts that held my old battery in. I was so confident that I decided to do it outdoors in sub-freezing weather, reasoning that I would get back inside before I got too cold. This turned out not to be true.
What I discovered was that the socket wrench I had bought since the last time I changed my battery did not have the extender piece that lets it fit over the long bolt holding my battery in place. My other mistake was not anticipating that I would drop that same socket wrench deep into my engine, and I did not have the long fishing pole magnet needed to get it out again. This meant I had to get under my car and fish it out by hand. After a long and embarrassing hour, I was covered in dirt, oil, and snow. I had a fresh battery in my car and a new conviction: being confident about a job and understanding what the job actually involves are two different things.

In the same way, I've come to realize that in software engineering, confidence and knowledge are not the same at all.

Recently I've been building a platform for hosting live in-person group games. And very quickly I hit a testing wall.

Multiplayer games are surprisingly hard to test by yourself.

Obviously, automated tests go a long way, and there's no substitute for having good unit and integration tests. But automated tests can only catch bugs you can think of in advance. And especially with AI code generation, it's more important than ever to be the human in the loop when testing your own product. There are a lot of people online who promise that if you use enough agents for enough time, AI can verify AI. I think those people are very smart, but they are also probably very rich and can afford like 35 Claudes. I can afford maybe 4 Claudes, and only for a short time. Eventual consistency is great, but I need consistency at 8pm on a Friday night when 50 teens are using my app. But that's another blog post.

The point is, I need to be able to test my code manually. Let's consider this example.
I was working on a game mode where players write funny answers to prompts and vote on which answer is funniest. The thing is, this game mode needs at least 3 players to work: two to write answers and one to vote on which is funniest. You can see the problem with testing it manually. After a day of flipping between three browsers trying to reproduce the game flow, I realized this wasn't sustainable.

My first pass at solving it was to make a /dev/testing page where I could kick off a game using bots, and a function to orchestrate some bot responses going through the flow of a game.

Eventually I cleaned it up and created a Playtest registry where games could register a handler to define what steps would occur in a round of the game. The auto_play method would call those steps and push the game through its state machine to the end. In the end it looked something like this.

# Dev Testing Controller, called from /dev/testing view
  def auto_play
    room = Room.find_by!(code: params[:id])

    # Start the game first if still in lobby
    if room.lobby?
      room.start_game!
      handler = playtest_handler_for(room)
      handler.start(room:)
      room.reload
    end

    game = room.current_game
    if game && !game.finished?
      handler = DevPlaytest::Registry.handler_for(game)
      handler.auto_play_step(game:)
    end

    redirect_to show_test_game_path(room, auto_play: params[:auto_play], interval: params[:interval])
  end
  
# Example Playtest module within the WriteAndVote game module
    module Playtest
      def self.start(room:)
        Games::WriteAndVote.game_started(room:, show_instructions: true)
      end

      def self.advance(game:)
        case game.status
        when "instructions"
          Games::WriteAndVote.start_from_instructions(game:)
        end
      end

      def self.bot_act(game:, exclude_player:)
        case game.status
        when "writing"
          submit_responses(game:, exclude_player:)
        when "voting"
          cast_votes(game:, exclude_player:)
        end
      end

      def self.auto_play_step(game:)
        case game.status
        when "instructions"
          Games::WriteAndVote.start_from_instructions(game:)
        when "writing", "voting"
          bot_act(game:, exclude_player: nil)
        end
      end

      # ... more game-specific method definitions ...

      private_class_method :submit_responses, :cast_votes
    end
  end # closes the enclosing game module, whose opening line is not shown here

It let me see the entire flow of a game, in the browser, with as many bot players as I cared to make. I could click into any bot's player view, as well as the central shared-screen view that the host would project on a TV. I was pretty happy with it and figured my testing woes were over.
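
For illustration, a registry like the one described could be as small as the sketch below. This is not the actual RoomRally code: only `DevPlaytest::Registry.handler_for` appears in the controller above, while `register`, the `game_type` attribute, `FakeGame`, and `FakePlaytest` are names made up so the example runs on its own.

```ruby
# Hypothetical sketch of a playtest registry. Everything except the name
# DevPlaytest::Registry.handler_for is an assumption for illustration.
module DevPlaytest
  module Registry
    @handlers = {}

    class << self
      # Each game type registers the module that knows how to drive it.
      def register(game_type, handler)
        @handlers[game_type.to_s] = handler
      end

      # Look up the handler by the game's type; fail loudly if none exists.
      def handler_for(game)
        @handlers.fetch(game.game_type.to_s) do
          raise KeyError, "no playtest handler registered for #{game.game_type}"
        end
      end
    end
  end
end

# Stand-ins for a real game record and a real Playtest module.
FakeGame = Struct.new(:game_type, :status)

module FakePlaytest
  # Pushes the fake game one step through a two-state flow.
  def self.auto_play_step(game:)
    game.status = "voting" if game.status == "writing"
  end
end

DevPlaytest::Registry.register("write_and_vote", FakePlaytest)

game = FakeGame.new("write_and_vote", "writing")
DevPlaytest::Registry.handler_for(game).auto_play_step(game: game)
puts game.status # prints "voting"
```

The point of the indirection is that the controller never needs to know which game it is driving. It asks the registry for a handler and calls the same step method every time.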

After a little more dev time, I got some friends of mine to try it out for their apartment game night. To my horror they weren't able to cast a single vote - the view was locking the button before the vote was being cast.

In my excitement to test the game state machine, I never had my 'clever' bot actors go through the Stimulus controller in the view layer. They were hitting the service layer directly, bypassing the view entirely. D'oh!

The bot was hitting:

vote = Vote.create!(player: bot_player, response: chosen_response)
Games::WriteAndVote.process_vote(game:, vote:)

But a human would have hit:

// app/javascript/vote_feedback_controller.js
import { Controller } from "@hotwired/stimulus"

export default class extends Controller {
    static targets = ["button"]

    vote(event) {
        // ... bunch of other JS code dealing with animations ...

        // The bug: this was disabling the button BEFORE the vote was cast.
        // Disable ALL vote buttons to prevent multiple votes
        const allVoteButtons = document.querySelectorAll('button[data-action*="vote-feedback#vote"]')
        allVoteButtons.forEach(btn => {
            btn.disabled = true
            btn.classList.add("opacity-75", "cursor-not-allowed")
        })
    }
}

Looking back over my system specs, which should have caught this - it was in the happy path for crying out loud - I saw the problem. My write_and_vote_happy_path_spec, happily running on every push to CI, was testing up until the point of casting a vote.
It wasn't checking whether the vote was actually registered in the DB. Since a full cycle of the game was just writing two answers and voting on them, there had seemed to be no point in going further. Why waste precious CI time running through two cycles of the game?

I could blame Claude for this, as it wrote the happy path spec, under my direction of course. But really it was on me. Spec quality is even more important than code quality, since it determines the acceptable baseline for your code. Garbage specs, garbage code, as I painfully saw here. I cleaned up the system specs and resolved to make sure each one tested the full flow for its area.
Later on I created some other dev testing tools, but those are for other posts.
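
The gap between "the click happened" and "the vote was recorded" can be shown with a toy model in plain Ruby. This is a sketch, not the real system spec: `VoteStore`, `VoteButton`, `weak_check`, and `strong_check` are hypothetical stand-ins for the DB, the view, and two styles of assertion. It assumes that a button disabled before submission never delivers its vote, which is how the bug is described above.

```ruby
# Toy stand-in for the votes table (hypothetical, not RoomRally code).
class VoteStore
  attr_reader :votes

  def initialize
    @votes = []
  end

  def record(player, response)
    @votes << { player: player, response: response }
  end
end

# Toy stand-in for the vote button and its click handler.
class VoteButton
  attr_reader :clicked

  def initialize(store, disable_before_submit:)
    @store = store
    @disable_before_submit = disable_before_submit
    @disabled = false
    @clicked = false
  end

  def disabled?
    @disabled
  end

  # A disabled button never submits, so the order of these steps matters.
  def click(player:, response:)
    @clicked = true
    @disabled = true if @disable_before_submit
    @store.record(player, response) unless @disabled
    @disabled = true # lock the button afterwards to prevent double votes
  end
end

# Weak check: stops once the click has happened.
def weak_check(button)
  button.clicked
end

# Strong check: asserts on the outcome the player actually cares about.
def strong_check(store)
  store.votes.size == 1
end

buggy_store = VoteStore.new
buggy_button = VoteButton.new(buggy_store, disable_before_submit: true)
buggy_button.click(player: "p3", response: "r1")

fixed_store = VoteStore.new
fixed_button = VoteButton.new(fixed_store, disable_before_submit: false)
fixed_button.click(player: "p3", response: "r1")

puts weak_check(buggy_button)  # prints "true": the weak check misses the bug
puts strong_check(buggy_store) # prints "false": the strong check catches it
puts strong_check(fixed_store) # prints "true"
```

In a real system spec the strong check would be an assertion on the persisted vote after the click, rather than on the click itself.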

The lesson: understand what your tests test. Understand what the acceptable baseline for your product is. In my case, it was being able to complete the whole flow of a game, start to finish, on every layer from view to DB. My dev testing dashboard was not truly testing that. It's still a useful tool for testing a game's state machine, but it's a much more limited solution than I thought. I needed to test the product, not the state machine.

My name is Jack and I am writing about my journey coding RoomRally, a platform for hosting live in-person group games, as well as the software engineering lessons I learn along the way.

Jack's Tech Blog
