Speaking Data: Data Provenance

In the world of art and antiquities, the quality of the 'provenance' ('to originate') affects the value, and the same is true of data, although typically less explicitly.

Lots of data looks straightforward enough. However, in any data there will be a number of stages before it gets to be analysed and reported, where things might go awry to a greater or lesser extent without being particularly noticeable. Therein lies the danger.

So here’s the broad journey. This is intended as a simple 12-step guide for thinking about where in this journey there may be issues that might affect the analysis, reporting and messages that emerge. In the very simplest of terms those 12 steps combine into three broad parts: a planning part, an implementation part, and a data management part.

PART A: Planning

1. Purpose

This is the point at which the headline purpose of the data collection is determined. There may be multiple purposes, but at the very least there is a question to be answered. The clarity here is the benchmark against which all further stages in this journey are tested and aligned. Often data collected for one purpose may be used for another purpose further down the line. This may be fine, but equally it may be inappropriate, depending on the effect of other decisions in the process. Age data cannot be used as flexibly as date of birth data, for example.
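To make the age versus date of birth point concrete, here is a minimal Python sketch. The function name `age_on` and the dates are hypothetical, chosen purely for illustration. A stored date of birth lets you work out an age at any reference date, whereas a stored age only describes the day it was collected.

```python
from datetime import date

def age_on(date_of_birth: date, reference: date) -> int:
    """Whole years completed between date_of_birth and reference."""
    years = reference.year - date_of_birth.year
    # Take one off if the birthday has not yet happened in the reference year.
    if (reference.month, reference.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Hypothetical record: the same date of birth serves two different purposes.
dob = date(1990, 6, 15)
age_at_collection = age_on(dob, date(2020, 6, 14))  # the day before a birthday
age_at_reuse = age_on(dob, date(2023, 6, 15))       # reused three years later
```

Had only the age been recorded at collection, the later figure could not be recovered reliably, because the birthday within the year is unknown.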

2. Requirements

If ever there was a case for a “start with the end in mind” approach this is it. At this stage there is a question to be answered, with a specific degree of granularity and confidence. That granularity might be geographical (national, regional, local authority), or socio-economic (male/female, age group and so on) or any other dimension, and any multiple combination of those dimensions.
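A rough sketch of why combining dimensions matters. The category counts and the sample size below are invented for illustration only. Each extra dimension multiplies the number of cells that the data has to fill.

```python
from math import prod

# Hypothetical number of categories in each reporting dimension.
categories = {"region": 9, "sex": 2, "age_group": 5}

# Hypothetical number of responses collected.
sample_size = 1000

# Every combination of categories is one cell in the final breakdown.
cells = prod(categories.values())

# If responses were spread evenly, this is what each cell would hold.
average_per_cell = sample_size / cells
```

With these made-up numbers there are 90 cells and only around 11 responses in each, which is worth knowing before the collection starts rather than after.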

3. Constraints

Worthy of consideration in their own right, but broadly an extension of requirements. These constraints often include finance, quality and time. Constraints might not become obvious until after the event, so they are really worth considering up front. A cautionary tale… All police forces measured the level of satisfaction of members of the public involved in road traffic collisions (née accidents). A once (and only once) overlooked constraint was not to seek the satisfaction level of those known to be fatally involved. Enough said.

4. Design

So with a question to answer, there’s the overall methodology to determine. Ask everyone (census) or just some (sample)? The approach may be mainstream quantitative or more qualitative. This stage also includes setting acceptable tolerances for data quality, to be applied further down the line.
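For the sample route, one common way to put a number on the confidence is the textbook margin of error for a proportion. The sketch below assumes a simple random sample from a large population and a 95% confidence level (z = 1.96). Those assumptions, and the function name, are mine rather than part of any particular survey design.

```python
from math import sqrt

def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for an estimated proportion.

    Assumes a simple random sample from a large population.
    proportion=0.5 gives the widest (most cautious) margin.
    """
    return z * sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical sample sizes: a headline figure versus a small subgroup.
headline = margin_of_error(1000)
subgroup = margin_of_error(100)
```

Under these assumptions a sample of 1,000 gives roughly plus or minus 3 percentage points, while a subgroup of 100 gives nearer 10, so the tolerance that is acceptable needs deciding before the sample size is fixed.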

PART B: Implementation

5. Definitions

This is about making sure there is a clear definition of each measure or question, so that responses can be differentiated. Great policing example… When is a crime not a crime? The front door shows significant marks around the lock… an attempted burglary using a jemmy, just some (criminal) damage, or the home of someone with an unsteady hand?

6. Specifications

This might be the units of measurement, or the number of decimal places. These need to be consistent to ensure quality further down the line.
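As a small illustration, here is a sketch of holding distance measurements to a single specification. The choice of kilometres, one decimal place and the function name `to_km` are all hypothetical.

```python
# Exact by definition of the international mile.
KM_PER_MILE = 1.609344

def to_km(value: float, unit: str, decimals: int = 1) -> float:
    """Convert a distance to kilometres, rounded to the agreed decimal places."""
    if unit == "km":
        km = value
    elif unit == "miles":
        km = value * KM_PER_MILE
    else:
        # Refuse unknown units rather than guess.
        raise ValueError(f"unknown unit: {unit!r}")
    return round(km, decimals)

# Two hypothetical readings of the same distance, recorded in different units.
reading_a = to_km(10, "miles")
reading_b = to_km(16.1, "km")
```

Rejecting an unrecognised unit is a deliberate choice here: a silent guess is exactly the sort of thing that goes unnoticed until the analysis.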

7. Collection

This might be ask, measure, count, read, hear, feel, smell, taste.

8. Recording

Having collected some data we then need to record it somehow. That might be writing it down, data entry to a computer, audio recording, and in some cases it might even be ‘remember’.

PART C: Data Management

9. Enter

Having recorded something in one form, it may need entering into a separate system for consistent handling further down the line… such as storage. Crime details transcribed from a notebook to a computer crime system, for example.

10. Validation

Ensuring data is valid means checking that it conforms to specific criteria. In the simplest of terms this might be ensuring that data is technically correct: a date is not the 32nd day of the 13th month. This can then extend to logical validity, where technically correct items are cross-referenced with each other, for example checking that the date of birth is before the date of marriage. These might be thought of as ‘hard’ validations, where the tests are quite certain, as opposed to ‘soft’ validations, where the more unusual events are highlighted to be queried. So the more subjective soft validation might be to check that there is a minimum of, say, 16 years between date of birth and date of marriage.
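The three kinds of check described above might be sketched in Python as follows. The function names and message wording are hypothetical, and the 16-year threshold is simply the example figure from the text.

```python
from datetime import date

def parse_date(year, month, day):
    """Hard check: return a date, or None if the values cannot form a real date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

def whole_years_between(earlier, later):
    """Whole years completed between two dates."""
    years = later.year - earlier.year
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years

def check_marriage(date_of_birth, date_of_marriage, min_years=16):
    """Return (errors, queries): hard failures, and soft items to be queried."""
    errors, queries = [], []
    if date_of_birth >= date_of_marriage:
        # Hard, logical validation: this cannot be right.
        errors.append("date of birth is not before date of marriage")
    elif whole_years_between(date_of_birth, date_of_marriage) < min_years:
        # Soft validation: unusual enough to query, not necessarily wrong.
        queries.append(f"fewer than {min_years} years between birth and marriage")
    return errors, queries
```

Keeping errors and queries in separate lists reflects the distinction in the text: a hard failure should stop the record, while a soft one only asks someone to take a second look.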

11. Processing

There may well be some processing of the data, perhaps to organise it more efficiently for storage, or indeed to split the data into various time categories, perhaps monthly. One of my favourite examples was the processing of car crime data in order to do some hotspot mapping. Car crime is predominantly a mixture of theft of motor vehicle and theft from motor vehicle. In this case, the mapping required postcodes.
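The monthly split mentioned above is about as simple as processing gets. Here is a minimal sketch, using made-up incident dates, of grouping records into year-month categories.

```python
from collections import Counter
from datetime import date

# Hypothetical incident dates, purely for illustration.
incident_dates = [
    date(2023, 1, 5),
    date(2023, 1, 20),
    date(2023, 2, 3),
]

# Label each record with its year and month, then count per label.
monthly_counts = Counter(d.strftime("%Y-%m") for d in incident_dates)
```

Including the year in the label is a small design choice that stops January of one year being merged with January of the next.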

12. Storage

The data gets stored, new data is added and backups are typically made, whether automatically or manually. Hopefully no data gets lost or over-written.

And then the analysis starts… and that’s another story.

This all looks straightforward enough, and mostly can be. The point here is that before there is any data analysis, all these stages will have been addressed to a greater or lesser degree. Any inadequacies earlier in the journey cannot be compensated for further down the line, so in short the data 'is what it is' when it gets to the analysis. Knowing enough about 'what it is' and indeed 'what it is not' is the foundation for a genuinely insightful analysis.