Nobody gets excited about data dictionaries. Most statisticians I know would rather spend their time on analysis than on writing up variable definitions and classification notes. But after years of working on agricultural statistics in Cambodia, I think bad documentation can quietly erode data value over time in ways that are much harder to fix than bad collection.

In every statistical office I've worked with, I've watched what happens when institutional memory depends on a single person. A senior statistician retires or transfers to another department, and suddenly nobody can explain why certain methodological decisions were made, or what the codes in a particular column actually represent. The dataset is technically intact. It's also practically useless for anyone trying to build on it, because the context that gives the numbers meaning walked out the door. The problem is especially acute in smaller offices, where specialised knowledge concentrates in very few heads.

Standards like DDI (Data Documentation Initiative) and SDMX exist precisely for this reason. They provide structured, machine-readable ways to describe datasets: what each variable means, how it was measured, what classifications were used, how the sample was designed. The problem is that implementing these standards feels like overhead when you're already stretched thin trying to get the survey into the field and the results published on time. Documentation is always the thing that gets cut when deadlines press in. I've been guilty of this myself.

What's changed my thinking is the idea of automating the process. The approach that makes the most sense is building metadata extraction into the data processing pipeline itself, so that variable labels, value labels, skip patterns, and universe definitions get captured as the data moves through cleaning and tabulation. It means the documentation doesn't depend on someone remembering to write it up after the fact. The best metadata is the kind that generates itself as a by-product of doing the work, not as a separate task that competes for already-scarce time.
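To make the idea concrete, here is a minimal sketch of what "documentation as a by-product" can look like. The variable names, labels, and universe definitions below are hypothetical stand-ins; in a real pipeline they would come from the processing system itself (for instance, the labels already attached to an SPSS or Stata file), and the output could be mapped onto a DDI codebook rather than plain JSON.

```python
import json

# Hypothetical metadata as it might exist inside a processing pipeline.
# In practice this structure would be populated automatically from the
# labelled survey file, not written by hand.
VARIABLES = {
    "hh_size": {
        "label": "Number of household members",
        "universe": "All households",
        "values": {},
    },
    "crop_code": {
        "label": "Main crop grown last season",
        "universe": "Farming households only",
        "values": {"1": "Rice", "2": "Maize", "3": "Cassava"},
    },
}

def build_data_dictionary(variables):
    """Emit a machine-readable data dictionary as a by-product of processing.

    Each entry records the variable's name, human-readable label, universe,
    and any value labels, so the documentation stays in sync with the data.
    """
    entries = []
    for name, meta in variables.items():
        entries.append({
            "name": name,
            "label": meta["label"],
            "universe": meta["universe"],
            "value_labels": meta["values"],
        })
    return json.dumps({"variables": entries}, indent=2)

print(build_data_dictionary(VARIABLES))
```

The point is not the format but the workflow: because the dictionary is generated from the same structures the cleaning and tabulation code already uses, it cannot silently drift out of date the way a separately maintained document can.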

The payoff is real but delayed, which is why it's a hard sell. When a new analyst joins the team two years from now and can actually understand a dataset without spending three weeks tracking down the person who built it, that's the return on investment. It's the most boring kind of infrastructure, and one of the most important.
