Papers with graphs on them.

Lying with Statistics: Every Day Has 2 Midnights & Every Midnight has 2 Days

“It is easy to lie with statistics. It is hard to tell the truth without statistics.” Andrejs Dunkel.

Time series statistics are awkward. Time series statistic are absolutely necessary. They distill the essential truth from very large volumes of data. The statistics that are most useful are very simple concepts such as max, min, mean and total.

Anyone with at least a grade 6 education can explain exactly how any of these statistics should be calculated. Right?

Maybe not.

There is the issue of the fencepost problem.

Which section of wire is a fencepost holding up – both of them. The boundary between the statistical intervals might be the top of the hour for hourly values, midnight for daily values or the stroke of New Year’s Eve for annual values. What, exactly, do you do with the New Year’s Eve storm that generates the maximum discharge for the previous year at exactly 12-31 24:00:00 and the maximum discharge for the succeeding year at exactly 01-01 00:00:00. It is only one value that is both the very last value of the preceding year and the very first value of the succeeding year by virtue of the special nature of midnight. In a set of data for instantaneous annual peak flow the value would be repeated in two successive years. However, including both values in an extremes frequency analysis would be wrong.

There is the issue of time weighted mean values.

Many people used to working in Excel would simply use the ‘average’ function which provides an arithmetic average (sum divided by count).

A more robust method is to calculate the area under the curve using, for example, trapezoidal integration and divide by time. The trapezoidal technique is based on the assumption that the data are continuous. This implies that if there are no values at the bin boundaries (e.g. midnight) then they must be estimated by interpolation. This is non-intuitive. If there are no midnight values then the algorithm must ‘look’ into the preceding and succeeding intervals to find a value to use to estimate the midnight value. This may mean that data from nearest neighbors potentially have influence on the statistical results.

There is the issue of matching the statistical technique to the type of data.

A total calculated for discharge data is obtained by trapezoidal integration with a unit conversion (e.g. from cubic meters per second to mega liters per day). A total for precipitation data is calculated as the net accumulation over the interval with no unit conversion.

There is the issue of time stamp.

In many jurisdictions it is conventional to place the timestamp at the end of the interval. This means that a value calculated for January first will have a time stamp of 01-01 24:00:00. This in turn will likely get reported on and plotted as 01-02 00:00:00 (exactly the same time except most computer systems ignore 24:00:00) hence in a tabular report the New Year’s Day statistic will either be missing or be the value for New Year’s Eve of the preceding year. Other jurisdictions deal with this by choosing a timestamp to be the start of the interval. This means in comparing data from one agency to another you need to know which data storage convention they have chosen. In either of these cases the data would likely be plotted as a step graph. In some jurisdictions they choose to plot the average at the average time (e.g. at noon) and plot the time series as a continuous plot. This is very intuitive for daily statistics but becomes a bit weird for monthly and annual statistics. The method of assigning valid time duration for time series statistics is rarely used (i.e. both the start and end of the interval that the statistic covers are explicitly declared). Representing statistics as a duration solves the problem of 24:00:00 = 00:00:00 but most data management systems were developed in an era when durational data were difficult to deal with.

There is the issue of the classroom problem.

It is a common use case for hydrologists to do statistics on statistics (e.g. to take the mean of daily means). The classroom problem is simply that the average for the school is different than the average of the average of the all of the classrooms. In general, the classroom problem is solved for time series statistics by using the technique of trapezoidal integration which is a time weighted method. A time weighted average of time weighted averages should, theoretically, be identical to a time weighted average of the underlying unit values. There can however be differences due to the occurrence of gaps in the data and the use of business rules for how those gaps are handled at different time scales.

There is the issue of time zones.

The time zone of the data logger may be in UTC but the report of daily statistics in Local Standard Time. Calculating the statistics in UTC and applying a time offset to the result is not the same as applying a time offset to the data and then computing the statistics.

In most cases time series statistics behave the way you expect.

This lulls us all into a sense of complacency that we think we understand them. Unfortunately, there are edge cases where careful attention to the details of the case are required to correctly understand the underlying truth that the statistics are designed to represent.

1 Comment
  • Ferdinand Quiñones
    Posted at 12:20 pm, November 28, 2013

    Stu: Greetings again. I am trying to locate an Excel subroutine or formula to integrate variable time-intervals flow (Q) and suspended sediment (SS) data to calculate the weighted daily average discharge of SS at a streamgaging station. The SS data is in a regresion equation. Normally, if the instantaneous values of Q at the gage are uniform in time (every 15 minutes), it is an easy weighted integration. The problem arises when, to improve accuracy of calculations, the stage sensors at the gage are triggered to take readings every 5 or 10 minutes, depending on the change in stage, and not at a predicted interval. The lineal weighted calculation formula for a day does not work when you have many days since you can have from a minimum of 96 points (every 15 min) to as many as 288 points (every 5 minutes). The problem arises when you go from one day to the next in duplicating the formula, since the range changes. If anybody has this figured out in an Excel formula, I would appreciate getting a copy. Thanks.

    Ferdinand Quiñones, PE

Post a Comment