How to Merge Consecutive Rows in SQL?
Recently, I was trying to group consecutive rows in SQL using one and then multiple column values. I looked up the internet on how to do and couldn't find anything so finally when I was able to figure it out, I thought I should share this so that it can be of some help others who might be facing the problem.
Let's start with defining and understanding the problem first. Let's say we are running a mobility company (take for example, Rapido). In order to solve a particular use case, we are collecting information on the time spent by each bike in different areas. Let's say this is represented as a table named lapse_time in our database with bike_id, area_id and the corresponding time_spent (in minutes).
Here is the information we have collected for example:
*Note: This data is sorted by bike_id and then by the order of entry of the row in the database. For example, bike_1 first spent 120 minutes in area_1. Then, it went to area_2 and spent 10 minutes over there and so on.
Now if you take a close look at row numbers 2, 3, and 4 in figure 1, you will observe that all these consecutive rows have the same area_id and bike_id, and as the data is sorted by the order of entry of the row in the database, it means that bike_1 spent 10 minutes in area_2, then 20 minutes again in area_2 , and then again 40 minutes in area_2. This looks very cumbersome and to solve the particular use case I was looking at, I needed to group together these 3 consecutive columns so that it conveys that bike_1 spent 70 (40 + 20 +10) minutes in area_2. Let's take another example to make it abundantly clear. Both the consecutive rows, 6 and 7, have the same area_id and bike_id, and thus can be grouped to the time_spent by the bike_2 in area_2 as 37 (12 + 25) minutes.
Many of you might be wondering why at all have we got separate rows for the same bike_id in the same area. That can be because of multiple reasons, primarily owing to the way how the data collection has been done.
For example, the hardware might be collecting information at specific intervals (e.g. every 10 minutes or so) and calculating the time spent in that interval. It might be the case that a bike is at the same location in multiple intervals, in which case we get separate rows. So, this scenario is expected in real-time data.
So, now let's try to solve this problem. The end goal here is to convert what it looks like in the figure 2 to that in the figure 3.
To this →
I personally encountered this problem while I was trying to calculate median idle time_spent in an area. If you take the data present in figure 2 and try to calculate the median idle time spent in let's say area_2, you won't get the desired result as the data hasn't been arranged accordingly. In figure 3, we have arranged the data by merging consecutive rows having the same bike_id and area_id; thus making it perfect for all sorts of calculations. Merging consecutive rows according to values of multiple columns was something that wasn't easily available, which then inspired me to write a blog regarding it.
So, how to do this? Let's go step by step.
Let's first make a dummy table in postgres named as lapse_time
Now, let's insert some dummy data into it.
We have a table exactly like the one in figure 2.
- Now, we need to add a row_number() to our table. This can be done simply using the following query. Let's call it query_1
Here's the result:
We need to calculate LAG of the columns based on which we want to perform the sum (in this case, bike_id and area_id). So, let's call the following operation query_2.
Here's the result for this query
Now comes the tricky part. We'll need to use sum over window function. In order to know more about the same, one should read more about window functions in postgres.
Before going to the actual query, here's one query that one should try to get clarity:
Here's the result for this query
For the actual query, we would use this with a slight modification in the sum over window function. We would be using CASE condition with sum, so here's the actual query. Let's call this query_3.
We would do the sum only when the column value is not equal to its lag value. Here's the result that we would get.
If we look at the grp column over here, we would notice that the consecutive rows with the same values of bike_id and area_id have the same value for grp. The final step is to use group by clause over the grp column, here's the query for the same:
And here's the result:
Now one can further go above this and select just the bike_id, area_id and time_spent from this table to get the exact result as in the Figure 3.
The final query
Aggregating all the steps mentioned above, here's the final query for merging consecutive rows in SQL using one or multiple columns
- This problem can further be generalized as recognition of patterns in SQL and rapid detection of unbroken sequences, better known as gaps and island problem.
- I personally faced this problem while working to measure fleet utilization and idle time. I have written another detailed blog post on the same. You might find interesting to deep dive into that problem here: