
Zombie Apocalypse – Data Provenance Saves the World

I frequently promote the concept of curated data provenance that is fully traceable to source. Most people quietly accept this as a generally worthwhile concept. However, most people also expect that active maintenance of data provenance is subject to diminishing returns. In other words, there is some practical limit beyond which further traceability is no longer worth the cost, time and effort.

I can accept that people have limited expectation for finding value from their investment in maintaining provenance. Their past experiences may not have been rewarded with direct benefits that are greater than the costs incurred in preserving provenance. Ironically, if past experience has ever successfully predicted anything it is that past experience is a poor predictor of future experiences in a world that is continuously adapting to rapidly advancing technology.

I am suggesting that the benefit/cost ratio of managing data provenance is highly non-linear (Figure 1). Furthermore, this non-linearity means the incremental cost of managing traceable provenance is small relative to the potential for benefit.

Figure 1: The value of maintaining full data provenance is non-linear. There are many ways source data can be re-purposed for great benefit.

In a conventional data acquisition system an electrical stimulus is applied to some sort of reactive surface exposed to the elements (i.e. the sensor). The effect of the environment on the exposed surface alters the electrical properties of the surface and the altered stimulus is, usually, detected as an analog signal. The sensor is carefully engineered so that the alteration of the electrical signal is indicative of the value of a parameter of interest.

Converting the signal to a parameter value is done by correcting for unwanted influences (e.g. the sensor may be sensitive to other variables such as temperature); specifying state conditions that are assumed to be invariant (e.g. converting a pressure sensor response to water level requires an assumption of water density); and then solving for the parameter value using an equation determined by laboratory calibration.
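The steps above can be sketched in code. Everything here is illustrative: the coefficient values, the linear form of the temperature correction, and the calibration equation are all assumptions standing in for a real laboratory calibration, but the structure (correct, assume, solve) follows the paragraph.

```python
# Hypothetical sketch of converting a raw pressure-sensor signal to water level.
# All coefficients are invented for illustration, not a real calibration.

def signal_to_water_level(raw_mv, temp_c,
                          temp_coeff=0.02,       # mV of drift per deg C (assumed)
                          cal_slope=0.5,         # m of water per mV (assumed lab calibration)
                          cal_offset=-0.1,       # m (assumed lab calibration)
                          water_density=1000.0,  # kg/m^3, assumed invariant state condition
                          ref_density=1000.0):   # density used during calibration
    """Correct for an unwanted influence, assume a state condition, then calibrate."""
    # 1. Correct for unwanted influences (here, temperature sensitivity).
    corrected_mv = raw_mv - temp_coeff * (temp_c - 20.0)
    # 2. Solve for the parameter using the laboratory calibration equation.
    level_m = cal_slope * corrected_mv + cal_offset
    # 3. Adjust for the assumed (invariant) water density.
    return level_m * (ref_density / water_density)

print(round(signal_to_water_level(2.4, temp_c=25.0), 3))  # 1.05
```

Note that once only `level_m` is retained, the raw millivolt reading and the temperature it was corrected for are gone for good.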

The result of all of this is a quantized value valid for a specific time. There may be further processing of the signal on the sensor (e.g. electronic dampening of noise by averaging several samples) before the value is transmitted to the data logger. The data logger itself may do further processing of the signal before a value is stored for telemetry transmission or physical retrieval.

While it may be true that the sensor is sensitive to other environmental variables, the software on the sensor systematically scrubs out all of that information so that the residual of this sensitivity is reduced to white noise, which is further scrubbed out by electronic dampening. Reverse engineering the data to discover other, potentially useful, information in the original signal is not possible once any averaging has been done — even if the calibration equations are known.
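A tiny example makes the irreversibility concrete. The block average below is a stand-in for the sensor's electronic dampening (the real firmware is of course more sophisticated); the point is that two very different raw records can produce identical averaged output, so no amount of post-processing can recover the raw signal from the averages alone.

```python
# Averaging is lossy: different raw sample windows can yield identical
# averaged records, so the raw signal cannot be reconstructed afterwards.

def electronic_dampening(samples, window=4):
    """Block-average consecutive windows of raw samples (illustrative stand-in)."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

raw_a = [10, 10, 10, 10, 12, 12, 12, 12]  # clean signal with a simple step
raw_b = [7, 13, 9, 11, 15, 9, 12, 12]     # noisy signal with embedded structure

print(electronic_dampening(raw_a))  # [10.0, 12.0]
print(electronic_dampening(raw_b))  # [10.0, 12.0] -- same output, different source
```

Whatever "zombie signature" might have been hiding in `raw_b` is indistinguishable from `raw_a` once only the averages are kept.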

Maintenance of data provenance typically starts when the derived parameter is received in the office. It is the experience of most hydrographers that they rarely, if ever, have wished they had access to the primary electrical analog signal which is the source of their water level time-series. The perceived potential for benefit is low with respect to the perceived cost of transmitting and managing the source data in perpetuity, hence all information upstream of the derived parameter value is routinely discarded.

Looking forward, we will increasingly be using advanced sensing technology that may have great unrealized potential for alternate analysis and use. It will be increasingly beneficial to preserve rather than discard the constituent elements of your data. For example, suppose you deploy a network of passive gamma soil moisture sensors and then only transmit the derived soil moisture data from a dense network of these sensors to the office in real time.

This is all well and good until the zombie apocalypse.

A scientist runs into your office with his latest zombie research that shows that zombies are detectable by passive gamma ray detection systems. You cannot go out in the field to reprogram your sensors; you cannot reconstruct the source gamma signal from the soil moisture data. Your only chance to protect the world with an early detection system for zombie attacks is if you had the foresight to ensure you have access to the primary data from your sensor network such that you can post-process the data to reveal the presence of zombies.

Will you be able to save the world?

Fully traceable data provenance can also be valuable even without a zombie apocalypse. For example, Dave Gunderson from the US Bureau of Reclamation exploits the pre-processed signals on his stage sensors to derive all sorts of useful diagnostic variables that help him monitor station health in real-time.

We can’t predict how the technology we’re deploying today could be repurposed to learn more about our environment. We can’t discover how useful the pre-processed information is for monitoring station health unless we have access to that data. One thing for sure is that if we continue to discard the primary information sensed by our technologies we will be severely limiting the potential for re-analysis that could potentially add unlimited value at a relatively small incremental cost.

  • Pete Dupen
    Posted at 2:09 am, November 2, 2015

Stu, that was a clever and contemporary post, thank you.

    Dumbing down to the zombie holocaust is both apposite and amusing – but your message that data is important and often in ways we can’t predict is heard. It’s a great addition to your consistent message encouraging us all to make the best available use of hydrometric data.



  • Michael Allchin
    Posted at 8:09 am, November 3, 2015

    I would echo Pete’s plaudits on this piece – but maybe there’s another angle to be developed here? If a dataset is not associated with appropriate metadata, describing in sufficient detail the ‘What?’, ‘How?’, ‘Where?’, ‘By Whom?’, ‘Why?’ and of course ‘When?’ of its origins, together with other pertinent information about validation and post-processing, it runs the risk of persisting in an un-dead state – still nominally alive, but highly dangerous if approached without due caution. I have frequently been frustrated by such Zombie Data.

  • Dave Gunderson
    Posted at 2:54 pm, November 3, 2015

@Michael ‘dead on’ (pardon the pun)… Think of simple data from a sensor that was not simple. I’m thinking of flow measuring equipment in particular. To an end user, the simple data may be velocity or flow and nothing more. Internally within the flowmeter or ADVM we look at many factors that are used to make up the reading. Other internal data points relate to Signal to Noise Ratio, Path Loss, water temperature, etc. Such data is often transparent to the end user but often points to why a measurement was an outlier or was unstable under the known conditions. It is seeing this ‘diagnostic’ data that points us into understanding the stability of the readings and knowing if there are site problems. You have to dig for it.

    Like any good grave digger, you have to dig deep to get to the truth. The other truth is the deeper you dig, the more dirt you will find.
