What Nine Tickets Taught Me About Reviewing My Own Code

I built a financial analysis pipeline — collects market data and news for US stocks, runs FinBERT sentiment analysis, generates trading signals, and spits out reports in text, JSON, and HTML. The pipeline itself is straightforward. What I want to talk about is the process I used to build it, because that's where I actually learned something.

I worked through nine tickets end-to-end, each one covering a distinct layer of the system. And after doing post-mortems on all of them, some patterns became impossible to ignore.

The Review Rounds Problem

I started with the assumption that a good spec review catches most issues before implementation. That turned out to be partially true — but only if you do multiple rounds with different lenses.

What I found across every ticket: Round 1 catches structural gaps. Missing files, undefined types, wrong import paths. Round 2 catches design issues — logical gaps, missing edge cases, how the component actually connects to the pipeline. Round 3 and beyond catch spec-reality mismatches: field names that don't exist in the actual dataclasses, version-specific constants that changed, contradictions between new acceptance criteria and existing code.

The tokenization ticket (T5) looked reasonable on first pass — eight acceptance criteria, clear API spec, three tokenizer backends (tiktoken, HuggingFace tokenizers, sentencepiece). But when I actually loaded the FusedRecord dataclass from the data fusion ticket, I realized "fused analysis text" was completely undefined. There was no contract between the structured output from fusion and the text-only input that tokenization expected. That's not a subtle bug — that's a missing interface. It took four review rounds to surface it.

The sentiment ticket (T6) had a similar problem. The initial spec listed dependencies: [Ticket 5] — the tokenizer. But FinBERT uses BertTokenizer internally (WordPiece, ~30,522 vocab tokens). It doesn't use any of the BPE backends I built in T5. The correct dependency was T4 (data fusion), which provides FusedRecord. Wrong dependency declaration, caught before implementation, but only because I actually read the upstream type definitions.

What I changed: I added a dependency analysis phase before writing any code. Read every referenced file. Verify every type and function exists. Check every pseudocode invocation against real signatures. This consistently found 3–4 gaps per ticket that spec-only review missed.

The Vague Third AC

Every single ticket started with three acceptance criteria. Every single ticket had a vague third one. I'm not exaggerating — this held across all nine tickets.

The pattern: AC-01 and AC-02 usually covered the happy path reasonably well. AC-03 was always something like "warnings included" or "human-readable rationale" or "dashboard-ready HTML." Completely unverifiable.

Compare these two versions of the same acceptance criterion for the trading signal ticket:

Vague:

Given strong positive sentiment and market uptrend, When signal is generated, Then result is buy

Concrete:

Given sentiment_score=+0.7, sentiment.confidence=0.8, market_return=+5%, When generate() is called, Then signal == "buy" and confidence > 0.5

The concrete version is computable. I can verify it by hand without interpreting "strong" or "uptrend." The signal formula is:

combined_score = 0.5 * sentiment_score + 0.5 * market_signal
# combined_score > 0.3 -> "buy"
# combined_score < -0.3 -> "sell"
# disagreement -> "hold" with confidence halved

With concrete inputs, I can plug in the numbers and know exactly what to expect. With vague inputs, I'm writing tests against my own interpretation of the spec — which is circular.

My rule now: Every AC should be verifiable by manual calculation from given inputs to expected outputs. If you can't pin down the numbers, the formula isn't specified well enough.

Code Review Finds Different Bugs Than Spec Review

I almost skipped code review on a few tickets where the spec review had been thorough. That would have been a mistake.

Some of what code review caught:

T1: Metrics accumulation never resets — unbounded growth over time
T8: getattr(result, "txt") crashes at runtime because the field is text, not txt
T4: _extract_date() crashes on None input — not covered by any test
T6: mock_model.__call__ = func is silently ignored by MagicMock

That last one cost me real time. Setting __call__ on a MagicMock instance does nothing because Python looks up __call__ on the type, not the instance. The fix is side_effect:

# Wrong - silently ignored
mock_model.__call__ = lambda input_ids, attention_mask: fake_output

# Correct
mock_model.side_effect = lambda input_ids, attention_mask: fake_output

The T8 field name bug (txt vs text) is the kind of thing that only shows up at runtime. Spec review doesn't catch it. Type checking would catch it, but only if the mock returns a real typed object. These are different failure modes and they need different review passes.

Python Gotchas I Actually Hit

A few specific things that bit me:

float('nan') bypasses range checks silently:

# Both of these are False for nan
nan > 0   # False
nan <= 0  # False

# Correct approach
import math
if math.isnan(value) or value <= 0:
    raise ValidationError(...)

json.load() returns float for all numbers, even when your dataclass declares volume: int. After a JSON roundtrip, isinstance(data.volume, int) returns False. Explicit coercion during deserialization is required.

asyncio.get_event_loop().time() is deprecated — use time.perf_counter() instead.

And for tiktoken: encode("<|endoftext|>") raises ValueError without allowed_special={"<|endoftext|>"}. Also, vocab_size is 100,277 — not 100,257 — because 20 additional special tokens were added in tiktoken 0.13.0. Hardcoding 100,257 fails silently in the wrong contexts.

The Shift That Actually Mattered

Looking at the effort distribution across all nine tickets, the ratio of spec review to implementation time shifted significantly:

T1–T3: Heavy implementation, light review — bugs surfaced downstream
T4–T6: Roughly equal
T7–T9: Heavier review, lighter implementation — code review found fewer issues

The later tickets weren't simpler. The review investment upfront just meant there was less to fix afterward.

The thing I keep coming back to: the bugs that cost the most time weren't hard bugs. They were boring bugs — wrong field names, missing None guards, incorrect dependency declarations. The kind of thing that a careful read of the actual code would catch immediately. The lesson isn't a new technique. It's just: read the actual files, not the spec's description of the files.

That's the habit I'm taking forward.

← Agentic AI Is Already in Your Codebase — and Most Teams Aren't Ready