"It’s better for you to be wrong than for the data to be wrong."
This quote was on a whiteboard in a common area at a job I had several years ago, where it stared me in the face every day and eventually wormed its way into my brain. I still think about it every so often when staring down a data anomaly.
Although the quote is open to some interpretation, what I took from it was an encouragement to speak up when you uncover a potential bug that might affect the data. It is better to speak up and be mistaken than to squash your suspicions for fear of being proven wrong.
In other words, when it comes to identifying bugs in data that drives important decisions, Type I errors (false positives) are generally much less costly than Type II errors (false negatives).
Often when we are evaluating a classifier, we rely on the aptly named confusion matrix and the Type I/Type II distinction. Various measures are based on the ratios between these types of errors. Whether we use the F1 score or the Area Under the Receiver Operating Characteristic curve (AUROC, which deserves a post of its own some day), we are really talking about relationships between true positive, false positive, true negative, and false negative classifications.
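To make those relationships concrete, here is a minimal sketch using scikit-learn; the labels and scores are invented purely for illustration. The confusion matrix counts all four outcomes, F1 is computed from predictions at one particular threshold, and AUROC is computed from the underlying scores across all thresholds.

```python
# Toy example: how the confusion matrix, F1, and AUROC relate.
# The labels and scores below are made up for illustration only.
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]                      # ground truth
y_score = [0.1, 0.35, 0.4, 0.8, 0.45, 0.6, 0.7, 0.9]   # classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]       # threshold at 0.5

# Confusion matrix for binary labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")

# F1 depends on the chosen threshold; AUROC sweeps over every threshold.
print("F1:   ", f1_score(y_true, y_pred))
print("AUROC:", roc_auc_score(y_true, y_score))
```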
One thing that is often glossed over is what each type of error actually costs. If a false positive and a false negative cost the same, then using AUROC without any adjustment is fine. But there are very few real-life situations where those two costs are exactly the same.
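One way to make that asymmetry explicit is to attach a cost to each error type and compare classifiers by total misclassification cost rather than by a symmetric metric. The sketch below assumes, purely for illustration, that a false negative costs ten times as much as a false positive; the expected_cost function is my own invention, not a scikit-learn API.

```python
# Hypothetical cost-sensitive scoring: the cost values are invented for
# illustration; in practice they come from the domain (e.g. a missed
# diagnosis vs. a false alarm).
from sklearn.metrics import confusion_matrix

COST_FP = 1.0    # assumed cost of one false positive
COST_FN = 10.0   # assumed cost of one false negative

def expected_cost(y_true, y_pred, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total misclassification cost for a set of hard predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * cost_fp + fn * cost_fn

# Two classifiers that make the same number of mistakes (and so look
# identical to a symmetric metric like accuracy) can have very different
# costs once the asymmetry between error types is taken into account.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_a = [0, 0, 1, 1, 1, 1, 1, 1]   # two false positives
y_pred_b = [0, 0, 0, 0, 0, 0, 1, 1]   # two false negatives

print("cost A:", expected_cost(y_true, y_pred_a))  # 2 * 1.0  = 2.0
print("cost B:", expected_cost(y_true, y_pred_b))  # 2 * 10.0 = 20.0
```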
In the words of a 2015 paper by Halligan, Altman, and Mallett on using AUROC to evaluate classifiers in clinical settings:
ROC AUC does not account for prevalence or different misclassification costs arising from false-negative and false-positive diagnoses. Change in ROC AUC has little direct clinical meaning for clinicians.
The potential costs of providing data that later turns out to be incomplete or simply incorrect are significant.
When you see some baroque logic in an ETL that you don’t understand, or the data outputs seem suspiciously clean or malformed in some way, it’s quite likely that there is a perfectly good reason for it that someone on your team knows. There may also be documentation you can scrounge up for clues.
Data often doesn’t look the way we expect it to. Anomalies or unexpected results that we might think are bugs are frequently the exact insights stakeholders want! When you think you have found a bug and bring it forward, you will turn out to be wrong a lot — that’s why it’s important to be respectful when doing so and to choose the right time and venue. But if you are on the fence, and worrying about being wrong is part of your hesitation, you should generally err on the side of raising the question.
Because it’s better for you to be wrong than for the data to be wrong.
[Image: A data bug, according to Craiyon]
Halligan S, Altman DG, Mallett S. Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: a discussion and proposal for an alternative approach. Eur Radiol. 2015 Apr;25(4):932-9. doi: 10.1007/s00330-014-3487-0. Epub 2015 Jan 20. PMID: 25599932; PMCID: PMC4356897.