Dark Data – Hydrometric Trash on the Hydrologic Landscape

In the modern world, it is rude and inconsiderate to indiscriminately consume resources for one-time use.

It is not socially acceptable to litter the landscape with trash. It may have taken years of public education for the message to take hold, but the outcome is less pressure on our environment and a higher quality of life for everyone.

Why is there a different ethic for hydrometric data?

Hydrometric data are acquired for a wide variety of programs and projects, often at public expense. However, after they have served their primary purpose, many of these data are filed away, never to be seen again, like trash in a landfill. It is only when data are collected explicitly for re-use (e.g. by national hydrometric programs) that hydrometric data are properly curated to be easily searchable, discoverable, and accessible.

The prevailing attitude in the hydrometric community is that managing data in a manner that would make them suitable for re-use is too burdensome, too expensive, and too time-consuming. These are exactly the same arguments that have been put up as barriers to all recycling initiatives: resources are abundant, so it is cheaper to mine the environment for new resources than it is to reuse or recycle resources that have already been mined.

Data mining is a bit more complicated than that. If data are acquired for a given water body for a given period of time, those data are unique and irreplaceable. It will never be possible to go back to that place and time and reproduce those data. Once data are lost, the loss is irreversible.

Hydrologic data are a scarce resource. Hydrological science has a voracious appetite for data. Water resource management is a data-driven activity. In spite of this, it is considered normal and acceptable to acquire data with no expectation that meaningful metadata will be created and managed. This, inevitably, causes data to go dark.

A common argument is that data producers are actually doing data consumers a favor by hiding their data from re-use, because otherwise the re-use may be mis-use. After all, so the argument goes, no data are better than bad data!

I would argue that good, bad, and in between are subjective terms that are entirely dependent on context. One man’s trash is another man’s treasure. One might argue that only data that are collected to the highest standards should be made publicly accessible for use. I would argue that even data collected to the most primitive of standards are potentially useful.

The famous quote by George Box that “all models are wrong, but some models are useful” can be adapted for hydrometric data sharing: “all data have uncertainty, and uncertain data can be useful.”

As a thought experiment, consider a small creek running past an elementary school. Suppose that, as a class project, a teacher provides the children with some rudimentary tools and training for streamflow measurements. Each of the 30 students in the class measures the stream every day, and each produces an annual hydrograph. The technology, skill, and training of these junior hydrographers would produce what most professionals would dismiss as ‘bad’ data. However, if the techniques they were taught are inherently unbiased, then a hydrograph aggregated from all 30 students would characterize the flow, complete with a pattern of dispersion that characterizes measurement uncertainty.

This is the basic concept behind the emergence of low-cost sensor networks, whereby abundant, but cheap, data acquisition enables statistical grooming techniques that can, sometimes, outperform expensive, but sparse, data acquisition techniques. ‘Bad’ data can be quite informative if there is enough of it to be statistically meaningful.
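The classroom thought experiment can be sketched numerically. The following is only an illustration under stated assumptions, not a description of any real measurement program: the sinusoidal "true" flow, the 20% relative error, the Gaussian error model, and the function names `simulate_class`, `aggregate`, and `mean_abs_rel_error` are all hypothetical. The essential assumption is the one made above, that each student's errors are unbiased and independent.

```python
import math
import random
import statistics


def simulate_class(true_flows, n_students=30, rel_error=0.2, seed=1):
    """Return one noisy hydrograph per student.

    Assumption: each reading is the true flow multiplied by an
    independent Gaussian factor with mean 1.0 (i.e. unbiased).
    """
    rng = random.Random(seed)
    return [
        [q * rng.gauss(1.0, rel_error) for q in true_flows]
        for _ in range(n_students)
    ]


def aggregate(hydrographs):
    """Per-day mean and sample standard deviation across all students."""
    days = list(zip(*hydrographs))  # regroup readings by day
    means = [statistics.mean(day) for day in days]
    spreads = [statistics.stdev(day) for day in days]
    return means, spreads


def mean_abs_rel_error(estimate, truth):
    """Average of |estimate - truth| / truth over the whole record."""
    return statistics.mean(abs(e - t) / t for e, t in zip(estimate, truth))


# Hypothetical "true" annual hydrograph: a smooth seasonal cycle (always > 0).
true_flows = [2.0 + 1.5 * math.sin(2 * math.pi * day / 365) for day in range(365)]

hydrographs = simulate_class(true_flows)
means, spreads = aggregate(hydrographs)

single_error = mean_abs_rel_error(hydrographs[0], true_flows)
class_error = mean_abs_rel_error(means, true_flows)

print(f"one student's mean relative error: {single_error:.3f}")
print(f"class-average mean relative error: {class_error:.3f}")
```

Under these assumptions, the class average tracks the true flow more closely than any single student's record, and the per-day spread gives a direct estimate of measurement uncertainty. If the errors shared a common bias (for example, every student using the same faulty method), averaging would not remove it, which is why the condition of unbiased techniques matters.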

Suppose you are an engineer tasked with designing a culvert for this creek. Would you be interested in looking at this ‘bad’ data? Or would you sooner pretend that there are no data so you can treat it like any other ungauged stream? Suppose your design fails prematurely: would a lawyer for a litigant find value in this ‘bad’ data if you had chosen to ignore it?

The key to re-usability is that the provenance of the data must be discoverable.

This requires some effort on the part of the data producer and this effort is, I believe, at the heart of the argument. In the absence of any sort of reward, hydrographers would prefer to hoard rather than share their data. Data that will never be used again do not require metadata.

Even if provenance is available, the data consumer still needs to evaluate it. Dr. Keith Beven has led a campaign to educate the hydrological community about the dangers of disinformative data. Bad model design and bad model calibration are too often the result of attempting to reproduce extremes in hydrographs where those extremes result from extrapolation of poorly conceived rating curves. Hydrologists are notoriously indiscriminate in their consumption of data.

There are no connoisseurs during a famine!

Hydrology is a data-starved science. If a hydrologist finds data, he/she will consume it. It is only in the case of over-abundance that people care about provenance. When the Red Cross delivers a truckload of food to a refugee camp, no one is questioning whether it is certified organic, free range, and gluten free. Hydrologists don’t question the provenance of their data when they have so little to work with.

If we could somehow unlock the storehouses of dark data there could be relative abundance.

In the face of such abundance hydrologists and other data consumers might actually learn to care more about their data provenance.

In the face of modern technology, is the effort required to manage data in a way that facilitates data sharing so burdensome? Data that are collected for a single purpose have finite value. Data that are shared have an unbounded number of potential uses, now and into the future. Shared data therefore have infinite value.

The question is: will the hydrometric community rise to the challenge? Please reply below.

One response to “Dark Data – Hydrometric Trash on the Hydrologic Landscape”

  1. Violeta Cabello Villarejo February 23, 2015 at 1:37 pm

    Well, I can’t agree more with your reflections, Stu. I just attended the Citizen Science conference in San Jose last week. People in other fields are gathering in communities of data collection and sharing to overcome barriers similar to the ones you posed for the water realm. Surprisingly (?), there is no water community in Citizen Science yet; there are some initiatives on water quality monitoring but no coherent strategy… It seems the big challenge is to first create the community and then start working together on our challenges. The question that comes to my mind is: why are water scientists so far apart?
