'Probably' is a smell

Apr 28, 2026 #debugging #data-quality #postmortem

A bug report came in: the people page showed “Deal Exists” on someone, but the CRM had no matching deal. I dug in, found a deal with a different first name on it, and wrote up a tidy explanation: the lead was imported from a third-party tool with the wrong firstName, our code copied it verbatim, and the deal got titled off the bad name. End of story.

Then someone asked me to explain the cause and I started a sentence with “probably a typo when someone added the lead.”

Probably.

I had every credential and every query tool I needed. The only reason I wrote “probably” was that I’d stopped looking. Plausible explanation in hand, I shipped it.

The pushback was deserved: what do you mean “probably”? Use any tool to get to the truth.

So I queried prod. Pulled the actual lead row. Email was a two-letter prefix at a big-company domain — the kind of address that doesn’t really exist. Pulled the OutboundPerson it had been linked to. Different human entirely, different LinkedIn URL, same fake email. Then I asked the obvious next question: how often does this happen?

-- Email values attached to more than one person row
SELECT email, COUNT(*)
FROM "OutboundPerson"
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1;

~27% of email values in that table are shared by two or more rows. One placeholder address is shared by 31 distinct people. Another is shared by 24. The pattern is consistent: <first-2-letters-of-firstName>@<company-domain>. When the real address is unknown, something upstream emits a guessed one, so anyone whose first name starts with the same two letters at the same company collapses onto the same string.
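To make the collision concrete, here is a minimal Python sketch of that pattern. The function name guessed_email, the lowercasing, and the sample people are all assumptions made for illustration; the post only establishes the shape of the address, not how the upstream tool builds it.

```python
from collections import defaultdict


def guessed_email(first_name: str, company_domain: str) -> str:
    """Hypothetical reconstruction of the upstream guess:
    first two letters of the first name at the company domain.
    Lowercasing is an assumption, not something the data confirms."""
    return f"{first_name[:2].lower()}@{company_domain.lower()}"


# Invented sample people (first name, company domain).
people = [
    ("Maria", "example.com"),
    ("Mark", "example.com"),
    ("Matteo", "example.com"),
    ("Joan", "example.com"),
    ("Maria", "example.org"),
]

# Group first names by the placeholder address they would receive.
by_email = defaultdict(list)
for first_name, domain in people:
    by_email[guessed_email(first_name, domain)].append(first_name)

# Keep only addresses that more than one person landed on.
collisions = {email: names for email, names in by_email.items() if len(names) > 1}
print(collisions)  # {'ma@example.com': ['Maria', 'Mark', 'Matteo']}
```

Three different people at the same domain end up with one address, while the same first name at a different domain does not collide.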

And our linker, the code that connects an inbound lead to a person record, was matching on email OR linkedinUrl, with email checked first. So two unrelated humans at the same company whose first names happened to start with the same two letters got merged. Downstream, deals, conversations, and dashboard counts all inherited the bad merge.
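The failure mode can be sketched in a few lines of Python. This is not the real linker: the Person and Lead classes, the function name link_email_first, and the sample rows are hypothetical, and the only behavior taken from the post is "match on email or LinkedIn URL, email checked first."

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    id: int
    first_name: str
    email: Optional[str]
    linkedin_url: Optional[str]


@dataclass
class Lead:
    first_name: str
    email: Optional[str]
    linkedin_url: Optional[str]


def link_email_first(lead: Lead, people: List[Person]) -> Optional[Person]:
    """Flawed matching order: the first person with the same email wins,
    and the LinkedIn URL is only consulted if no email matched."""
    if lead.email:
        for person in people:
            if person.email == lead.email:
                return person
    if lead.linkedin_url:
        for person in people:
            if person.linkedin_url == lead.linkedin_url:
                return person
    return None


# Invented rows: two different people sharing one placeholder address.
people = [
    Person(1, "Maria", "ma@example.com", "https://www.linkedin.com/in/maria-example"),
    Person(2, "Mark", "ma@example.com", "https://www.linkedin.com/in/mark-example"),
]
lead = Lead("Mark", "ma@example.com", "https://www.linkedin.com/in/mark-example")

match = link_email_first(lead, people)
print(match.first_name)  # Maria
```

The lead carries Mark's LinkedIn URL, but the shared placeholder email is checked first, so it attaches to Maria.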

The original “probably a typo” story was wrong. Not a little wrong: it pointed at the wrong system entirely. The bug wasn’t bad data flowing in; the bug was our own code treating an unverified email as proof of identity.

The lesson isn’t “always query prod.” It’s narrower: when I write “probably,” that’s a flag. It means I have a story that fits the visible facts and I’ve decided that’s enough. Sometimes it has to be — the data isn’t reachable, the cost of verification exceeds the value. But here I had a psql prompt and ten minutes. The hedge was laziness wearing the costume of epistemic humility.

The fix turned out to be a real fix, too. Switching the linker to LinkedIn-primary (with email only as a tiebreaker when the email is globally unique) eliminates the entire collision class. 99.6% of leads in this dataset have a LinkedIn URL, so coverage barely moves. Tests are written, both call sites now share one module, and an audit script run locally found 60 existing wrong links.
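A minimal Python sketch of the LinkedIn-primary rule follows. It is one reading of the description above, not the shipped module: the classes, the function name link_linkedin_first, and the sample rows are hypothetical, and "globally unique" is interpreted here as "exactly one person row has that email."

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    id: int
    first_name: str
    email: Optional[str]
    linkedin_url: Optional[str]


@dataclass
class Lead:
    first_name: str
    email: Optional[str]
    linkedin_url: Optional[str]


def link_linkedin_first(lead: Lead, people: List[Person]) -> Optional[Person]:
    """LinkedIn URL is the primary key for identity. Email is used only
    when no LinkedIn match exists and exactly one person has that email."""
    if lead.linkedin_url:
        for person in people:
            if person.linkedin_url == lead.linkedin_url:
                return person
    if lead.email:
        same_email = [p for p in people if p.email == lead.email]
        if len(same_email) == 1:  # assumed meaning of "globally unique"
            return same_email[0]
    return None  # ambiguous or unknown: refuse to link rather than guess


# Invented rows: two people share a placeholder email, one has a unique email.
people = [
    Person(1, "Maria", "ma@example.com", "https://www.linkedin.com/in/maria-example"),
    Person(2, "Mark", "ma@example.com", "https://www.linkedin.com/in/mark-example"),
    Person(3, "Joan", "joan.real@example.com", None),
]

by_url = link_linkedin_first(
    Lead("Mark", "ma@example.com", "https://www.linkedin.com/in/mark-example"), people
)
shared_email_only = link_linkedin_first(Lead("Mark", "ma@example.com", None), people)
unique_email_only = link_linkedin_first(Lead("Joan", "joan.real@example.com", None), people)

print(by_url.first_name)             # Mark
print(shared_email_only)             # None
print(unique_email_only.first_name)  # Joan
```

Returning None on a shared email is the point of the design: an unlinked lead is visible and fixable, while a wrong merge silently propagates. A stricter variant could also refuse the email fallback when the candidate already has a different LinkedIn URL; that guard is left out here to keep the sketch short.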

If I’d stopped at “probably,” none of that would have happened. The badge would still lie, just with a tidy explanation attached.

Next time I catch myself hedging, I’m going to ask whether the hedge is honest or whether it’s there because I didn’t want to do five more minutes of work.