Data Rounding and Why You Should Care


Many people believe that it makes no sense to store data at a resolution more precise than the resolution at which it can be observed.

For example, it is believed that if you round a water level to the nearest millimeter, the stored value will never be more than half a millimeter from the original.

This idea was accepted as a reasonable compromise in the 20th century, and data management systems from that era were designed around it as a core concept.

Modern data processing requirements, however, demand a different approach.

In order to understand why modern data management systems all store data to double precision, we must consider the distinction between the storing and the reporting of data.

Data storage is considered to be lossy if inexact approximations and partial discarding of data result in a loss of information.

Data reporting is the presentation of final results. Rounding the final result to a sensible resolution can be very useful.

Rounding to a given level of numerical precision communicates the limit of useful resolution to the end-user.

A value expressed as 10 implies that the ‘truth’ is somewhere between 9.5 and 10.5, whereas a value expressed as 10.0 implies that the truth is somewhere between 9.95 and 10.05, and this is a very important and useful distinction.

Rounding data to a meaningful resolution facilitates meaningful analysis and data interpretation. For example, given a column of data at very high resolution, the position of the decimal place in each number can be lost in the visual noise, whereas if data are presented at a resolution of just a few digits then order-of-magnitude errors are readily identifiable.

Finally, rounding of final data makes any final reports formatted for printing look tidy and professional.
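To make that implied interval concrete, here is a minimal sketch; the helper name `implied_interval` and the example values are illustrative assumptions, not part of any particular system.

```python
# A minimal sketch: the interval of 'truth' implied by a reported resolution.
# The helper name and example values are illustrative assumptions.

def implied_interval(reported: str) -> tuple[float, float]:
    """Return the (low, high) range consistent with a value as reported,
    assuming conventional rounding at the printed resolution."""
    value = float(reported)
    decimals = len(reported.split(".")[1]) if "." in reported else 0
    half_step = 0.5 * 10 ** (-decimals)
    return value - half_step, value + half_step

print(implied_interval("10"))    # roughly (9.5, 10.5)
print(implied_interval("10.0"))  # roughly (9.95, 10.05)
```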

Whereas rounding is merely useful for reporting the final data product, rounding is an inevitable consequence of digital data acquisition.

If we assume that the observed phenomenon is a continuous process, then the analog signal must be discretized to create a digital signal. There is an irreversible loss of information when we digitize that signal to store it on a datalogger and communicate those data to a data management system.

As a general rule, data are acquired and communicated at a resolution similar to that at which independent observations can be used to validate the result. For a continuous water level signal it is convenient to record values to the nearest millimeter. Acknowledging that this discretization of the underlying signal is only an approximation, this is a pragmatic limit because we have no empirical way of verifying that any more precise increment of water level is actually true.
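As a rough illustration of this discretization step, the sketch below quantizes a synthetic water level trace to the nearest millimeter; the signal shape, noise level, and sampling interval are assumptions for the example, not recommendations.

```python
# A minimal sketch of logging a continuous water level signal to the nearest
# millimeter; the synthetic signal and its parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(24 * 60)                                   # one day of 1-minute samples
level_m = (0.750
           + 0.050 * np.sin(2 * np.pi * t / (24 * 60))   # slow diurnal variation
           + rng.normal(0.0, 0.0004, t.size))            # sensor noise, metres

recorded_m = np.round(level_m, 3)                        # record to the nearest millimeter
residual_mm = (level_m - recorded_m) * 1000.0            # quantization error, in mm

print("largest quantization error (mm):", np.abs(residual_mm).max())  # bounded by ~0.5
```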


Figure 1. The effect of approximating an analog signal to a 3 decimal place digital representation as the signal approaches zero, illustrating that the residuals are scale-dependent and serially auto-correlated.

We can accept these errors, even though they are not normally distributed (i.e. all errors over the full range from -0.5 to +0.5 millimeters are equally probable), are scale-independent in absolute terms (i.e. the fixed error becomes disproportionately large relative to very small values), and are highly structured (i.e. the error for any point is strongly correlated with the errors of the preceding and succeeding points), if, and only if, we do no further rounding of intermediate values in the processing chain.
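A small numerical sketch of these three properties, using an assumed recession-style signal that decays toward zero (as in Figure 1); the values are illustrative only.

```python
# A minimal sketch of the quantization residual structure described above;
# the synthetic recession and its values are illustrative assumptions.
import numpy as np

true_m = np.linspace(0.020, 0.0, 2001)        # smooth recession toward zero, in metres
stored_m = np.round(true_m, 3)                # stored to the nearest millimeter
residual = true_m - stored_m

# 1. Residuals are roughly uniform between -0.5 mm and +0.5 mm, not normal.
print("residual range (mm):", residual.min() * 1000, residual.max() * 1000)

# 2. The fixed absolute error becomes disproportionately large near zero.
mask = true_m > 0
rel = np.abs(residual[mask]) / true_m[mask]
print("mean relative error, first 100 samples:", rel[:100].mean())
print("mean relative error, last 100 samples :", rel[-100:].mean())

# 3. Residuals are serially autocorrelated for a slowly varying signal.
print("lag-1 autocorrelation:", np.corrcoef(residual[:-1], residual[1:])[0, 1])
```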

Clearly, data rounding is a necessary condition for getting analog data into a data processing chain and data rounding is a desirable condition for data leaving a data processing chain. Less clear is the harm that can be done by rounding of intermediate values within a data processing chain.

One way of evaluating this harm is to submit the data to a round trip: reverse all events in the data processing chain and check whether the information surviving the data processing system is identical to the information at the source (Figure 2).


Figure 2. A schematic for estimating the ‘lossiness’ resulting from intermediate rounding of values to some specified numerical precision.

If the errors inherently introduced by digitizing the analog signal were independent (in both time and proportion to the signal) and normally distributed, then it might be expected that linear corrections and non-linear transformations could be applied to the data without fundamentally altering the error structure, and hence without loss of information.

However, this is not the case. A round-trip analysis comparing the effect of carrying all intermediate values at double precision versus storing linearly corrected values to 3 decimal places and non-linearly transformed values to 3 significant figures shows that the error structure is irreversibly transformed by data processing.
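A sketch of what such a round-trip comparison might look like, using an assumed datum correction and an assumed power-law rating curve; all coefficients and resolutions here are invented for illustration.

```python
# A minimal sketch of the Figure 2 round trip: process stage through a linear
# correction and a non-linear rating, then reverse the chain and compare.
# The correction, rating coefficients, and resolutions are assumptions.
import numpy as np

def sig_round(x, sig=3):
    """Round to a given number of significant figures."""
    x = np.asarray(x, dtype=float)
    mag = np.floor(np.log10(np.abs(np.where(x == 0, 1.0, x))))
    return np.round(x / 10**mag, sig - 1) * 10**mag

rng = np.random.default_rng(1)
stage = np.round(0.300 + 0.200 * rng.random(1000), 3)     # sensor stage, metres to 1 mm
offset, C, h0, b = 0.015, 18.0, 0.250, 1.6                # illustrative parameters

# Chain A: carry every intermediate value at double precision.
discharge_a = C * ((stage + offset) - h0) ** b

# Chain B: round the corrected stage to 3 decimals and the discharge to 3 sig figs.
discharge_b = sig_round(C * (np.round(stage + offset, 3) - h0) ** b, 3)

# Reverse the chain and compare with the original sensor values.
back_a = (discharge_a / C) ** (1 / b) + h0 - offset
back_b = (discharge_b / C) ** (1 / b) + h0 - offset
print("max round-trip error, double precision chain:", np.abs(back_a - stage).max())
print("max round-trip error, rounded chain         :", np.abs(back_b - stage).max())
```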

One might hope that the errors from intermediate rounding would all magically average out when time-series statistics, such as daily and monthly averages and extremes, are calculated. Unfortunately, this is also not the case.

The error structure evident at the unit-value level will be amplified and magnified by data processing and can be highly impactful on the statistics that people in water-scarce regions care about most: low extremes.
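As an illustration, the sketch below compares a low-flow statistic computed from a double-precision recession against the same statistic computed after intermediate rounding; the recession curve, resolution, and 7-day statistic are all assumptions for the example.

```python
# A minimal sketch of how intermediate rounding can distort low-flow statistics;
# the synthetic recession, resolution, and 7-day statistic are illustrative.
import numpy as np

t = np.arange(30 * 96)                             # 30 days of 15-minute values
q_true = 0.020 * np.exp(-t / (10 * 96))            # recession from 20 L/s, in m3/s
q_rounded = np.round(q_true, 3)                    # intermediate rounding to 3 decimals

days = t // 96
daily_min_true = np.array([q_true[days == d].min() for d in range(30)])
daily_min_rounded = np.array([q_rounded[days == d].min() for d in range(30)])

print("7-day low flow, double precision:", daily_min_true[-7:].mean())
print("7-day low flow, rounded chain   :", daily_min_rounded[-7:].mean())
```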

Even when data are stored at double precision, a very small bias can be introduced at every step in the data processing chain by tie-breaking. For this reason, modern data management systems (at least AQUARIUS) use unbiased tie-breaking rules.

However, if the common round-half-up tie-breaking rule is used for intermediate rounding to the desired end-result resolution, then one value in ten will be biased slightly high. The more steps there are in the processing chain, the greater the net positive bias will grow. This systematic, and hence entirely unnecessary, bias can be identified in low-flow statistics where the data processing system does not preserve double-precision data integrity.
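A small sketch of the difference between the two tie-breaking rules, using Python's decimal module; the value range is an arbitrary illustration.

```python
# A minimal sketch of tie-breaking bias: round half up vs. round half to even
# ('banker's rounding'); the example values are illustrative.
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

# Ties (an exact 5 in the dropped digit) occur about one time in ten when
# rounding one digit away, e.g. 10.000, 10.001, ..., 10.999 rounded to 0.01.
values = [Decimal(n) / Decimal(1000) for n in range(10000, 11000)]

half_up = [v.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) for v in values]
half_even = [v.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN) for v in values]

bias_up = sum(r - v for r, v in zip(half_up, values)) / len(values)
bias_even = sum(r - v for r, v in zip(half_even, values)) / len(values)
print("mean bias, round half up     :", bias_up)    # small but systematic positive bias
print("mean bias, round half to even:", bias_even)  # zero on this symmetric set
```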

In summary, we can accept that data rounding is necessary for digitization of an analog signal.

We can demonstrate that rounding of intermediate values in a data processing chain creates disinformation. We can explain why rounding for reporting of final results is desirable. But this raises the question: what are final results, and for whom?

This last question is becoming increasingly germane as we shift the paradigm from data reporting to data sharing.

In the 20th century, it was pretty clear that data publication (most often formatted for hard-copy printing) was the final result of the data processing chain. However, in the modern era the unit value discharge may be an intermediate value for computing another variable (e.g. suspended sediment), which may be an intermediate value for some other variable (e.g. contaminant loading).

Perhaps it is time to distinguish between ‘publishing’ rounded values versus ‘sharing’ double precision versions of our data.

In the data sharing paradigm it is certainly possible to include the metadata needed to interpret the data correctly as explicit information, rather than continuing to provide it implicitly through rounding rules. The choice should be yours.
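For example, a shared record might carry the full-precision value together with explicit metadata about its useful resolution, rather than encoding that information in the rounding itself; the field names below are purely hypothetical.

```python
# A minimal sketch of sharing a double-precision value with explicit
# resolution metadata; all field names here are hypothetical.
import json

shared_record = {
    "parameter": "discharge",
    "units": "m^3/s",
    "value": 0.123456789012,        # full double-precision value
    "display_decimals": 3,          # how the value would be rounded for reporting
    "resolution_comment": "validated to the nearest 0.001 m^3/s",
}
print(json.dumps(shared_record, indent=2))
```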

 

3 responses to “Data Rounding and Why You Should Care”

  1. dave gunderson March 11, 2016 at 7:54 pm

    Report from the field: (It’s worse than you think)

    Quite a few years ago I realized the evils of data rounding and the problems introduced into the collection. The harsh reality of this is that the damage is quite often done at the data logger or even at the sensor itself. I call this phenomenon "Death by Default". This is where the sensor (or data logger) defaults the output to two digits to the right of the decimal in the output product. In this case, the data rounding is done to the IEEE rounding algorithm (I forget the standard). Here is the rub. We want the resultant data to apply the "Banker's Rounding Rules" (some use the term 'USGS Rounding'). If the sensor defaults the digital output to two-digit resolution, or the data logger defaults the output to two digits, the resultant data is already "biased".

    Of course, the mistake is made at the ground level of the data collection cycle. The field staff are unaware of the problem. The product is already tainted and nobody is aware of it.

    What is the solution? Program your sensors to output at three-digit resolution (pushing the rounding error to a lower 'level'). The data logger should log the results at three-digit resolution as well. This way, the downloaded log (or telemetry product) has no biasing in the collection, and at least it will now conform to USGS standards for data collection.

    For what it is worth, I brought this subject up at a users meeting several years back. I have suggested to a couple of vendors that they revisit the "default settings" in their product lines. As I don't support several hundred installations, this is just another user's opinion….

    Stu, spot on — as usual in your observations.

    Dave

    • Hi Dave
      As you imply, there are some complex transformations of data on board even a simple water level transducer long before the first touch of the data by a stream hydrographer. There are already many simplifying assumptions implicit in these transformations (e.g. the temperature and density of water) and there can be quite a bit of electronic smoothing of the signal. Lossy data storage is likely somewhat less important for monitoring of large streams but can be impactful where small errors in stage lead to large uncertainties in calculated flow. It has also been shown that barometrically compensated transducers can amplify diurnal variation in water level by as much as 1.5 cm, indicating that intermediate calculations do introduce bias that is irreversible unless intermediate values are preserved (at a suitable resolution). The lack of traceable provenance of data from field devices is becoming less excusable as storage becomes cheaper and data transfer protocols become faster.

  2. We can program the sensing and recording devices to as many decimals as they are designed to record. But the reality of stream gaging in natural channels is that precisions beyond two decimal places in flow measurements are dreams (and in many channels with irregular bottoms, or bottoms with boulders, the best precision is probably only to one decimal place).
