Grouping runs of data in SQL

There have been a handful of times over the last few years when I have needed to take time series data and group the runs together to determine when a certain value changed and how long it stayed that way. Every time I do this, I have to go back and figure out how I did it the last time, so this time I am actually going to write it down.

First, the data. We have a log table that logs every operation done to a table. It stores all of the columns in the base table, plus who made the change, when it was done, and the operation (INSERT, UPDATE, DELETE). It isn't particularly efficient as far as storage goes, and newer versions of SQL Server support this type of logging with built-in features, but we are using an older version.

For the sake of simplicity, I am dropping all but the most important parts of this table for this exercise. Assume there are more columns in the table and that there are DELETEs being logged too. I'm just going to show rows that were inserted or updated, and I have limited it to just two ids.
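Think of the trimmed-down version as something like this (change_log is a made-up name for the post, and the column types are illustrative; the real table also carries the rest of the base table's columns, who made the change, and the operation):

-- Assumed shape of the trimmed log table for this exercise
CREATE TABLE change_log (
    logged_at DATETIME NOT NULL, -- when the operation was logged
    id        INT      NOT NULL, -- key of the row in the base table
    value1    INT      NOT NULL  -- the value whose runs we want to group
);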

logged_at id value1
2016-05-31 11:22:23.647 1 4
2016-05-31 11:22:26.047 2 5
2016-06-06 07:42:08.417 2 10
2016-06-06 07:42:08.417 1 4
2016-06-14 09:47:55.743 1 4
2016-06-14 09:47:55.743 2 10
2016-06-14 13:12:55.613 2 10
2016-06-14 13:12:55.613 1 8
2016-06-17 10:03:23.020 1 8
2016-06-17 10:03:23.020 2 10
2016-09-23 09:55:01.280 1 4
2016-09-23 10:26:12.720 1 4
2017-01-03 18:23:33.647 2 5
2017-11-20 09:22:09.350 1 8
2017-11-20 09:22:09.350 2 10
2017-11-28 08:48:32.827 2 5
2017-11-28 08:48:37.883 2 5

We can see in this data that the values oscillate – for id 1, value1 is either 4 or 8, and for id 2, it is either 5 or 10 – going back and forth over time. We can also see that a value will repeat: there may be other changes to these records, but the value1 field stays the same across those updates.

What we want to do is eliminate the duplicate values within each run of data, capturing the timestamp where that value was first seen in the run and the timestamp where it was last seen.

For example, for id 1, we should end up with four rows: 4, 8, 4, 8. For id 2, we should expect five rows: 5, 10, 5, 10, 5.

StackOverflow was helpful in figuring out how to do this. This post closely matched what I was trying to do. In particular, I wanted to see how the row numbering worked – using two ROW_NUMBER() calls and subtracting one from the other.

Let's start with a query along these lines (change_log is my stand-in name for the real log table):
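SELECT
    logged_at,
    id,
    value1,
    -- byId numbers every row within an id, ordered by time
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY logged_at) AS byId,
    -- byIdValue numbers rows within an (id, value1) pair
    ROW_NUMBER() OVER (PARTITION BY id, value1 ORDER BY logged_at) AS byIdValue,
    -- the difference between the two is constant for every row in a run
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY logged_at)
        - ROW_NUMBER() OVER (PARTITION BY id, value1 ORDER BY logged_at) AS idMinusIdValue
FROM change_log -- assumed table name
ORDER BY id, logged_at;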

This is what we get:

logged_at id value1 byId byIdValue idMinusIdValue
2016-05-31 11:22:23.647 1 4 1 1 0
2016-06-06 07:42:08.417 1 4 2 2 0
2016-06-14 09:47:55.743 1 4 3 3 0
2016-06-14 13:12:55.613 1 8 4 1 3
2016-06-17 10:03:23.020 1 8 5 2 3
2016-09-23 09:55:01.280 1 4 6 4 2
2016-09-23 10:26:12.720 1 4 7 5 2
2017-11-20 09:22:09.350 1 8 8 3 5
2016-05-31 11:22:26.047 2 5 1 1 0
2016-06-06 07:42:08.417 2 10 2 1 1
2016-06-14 09:47:55.743 2 10 3 2 1
2016-06-14 13:12:55.613 2 10 4 3 1
2016-06-17 10:03:23.020 2 10 5 4 1
2017-01-03 18:23:33.647 2 5 6 2 4
2017-11-20 09:22:09.350 2 10 7 5 2
2017-11-28 08:48:32.827 2 5 8 3 5
2017-11-28 08:48:37.883 2 5 9 4 5

Notice that the value for idMinusIdValue is not sequential, but it does group together the runs of the data. idMinusIdValue will also repeat across ids, which is why the next step groups on id and value1 as well, not just the difference.

Now we want to compress the runs and sort correctly. We can wrap the first query in a derived table, group on id, value1, and idMinusIdValue, and take the MIN and MAX of logged_at as the endpoints of each run. A sketch, again with the stand-in change_log name:

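SELECT
    id,
    value1,
    MIN(logged_at) AS first_seen,
    MAX(logged_at) AS last_seen
FROM (
    SELECT
        logged_at,
        id,
        value1,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY logged_at)
            - ROW_NUMBER() OVER (PARTITION BY id, value1 ORDER BY logged_at) AS idMinusIdValue
    FROM change_log -- assumed table name
) AS runs
-- idMinusIdValue is constant within a run, so grouping on it along with
-- id and value1 collapses each run to a single row
GROUP BY id, value1, idMinusIdValue
ORDER BY id, first_seen;

And this is what we get: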

id value1 first_seen last_seen
1 4 2016-05-31 11:22:23.647 2016-06-14 09:47:55.743
1 8 2016-06-14 13:12:55.613 2016-06-17 10:03:23.020
1 4 2016-09-23 09:55:01.280 2016-09-23 10:26:12.720
1 8 2017-11-20 09:22:09.350 2017-11-20 09:22:09.350
2 5 2016-05-31 11:22:26.047 2016-05-31 11:22:26.047
2 10 2016-06-06 07:42:08.417 2016-06-17 10:03:23.020
2 5 2017-01-03 18:23:33.647 2017-01-03 18:23:33.647
2 10 2017-11-20 09:22:09.350 2017-11-20 09:22:09.350
2 5 2017-11-28 08:48:32.827 2017-11-28 08:48:37.883

We see the expected rows: 4, 8, 4, 8 for id 1 and 5, 10, 5, 10, 5 for id 2.
