Engineering

How to Merge Consecutive Rows in SQL?

A tutorial explaining where you might need to merge consecutive rows in SQL and how to do it

November 14, 2022

Graphic introducing a tutorial on How to Merge Consecutive Rows in SQL

TL;DR

This article details step by step process of merging consecutive rows in SQL with a dummy example:

  • Use SQL to create a table named lapse_time in PostgreSQL with columns bike_id, area_id, and time_spent and insert dummy data into the table.
  • Use the ROW_NUMBER() function to assign row numbers to each row in the table, ordering by bike_id.
  • Use the LAG() function to find the previous values of bike_id, area_id for each row.
  • Use the SUM() function with a CASE condition to group consecutive rows with the same bike_id and area_id.
  • Finally, use the GROUP BY clause to group the data based on the calculated groups and sum the time_spent for each group.
  • The result will be the desired output showing aggregated time spent for each unique combination of bike_id and area_id.
Recently, I was trying to group consecutive rows in SQL using one and then multiple column values. I looked up the internet on how to do and couldn't find anything so finally when I was able to figure it out, I thought I should share this so that it can be of some help others who might be facing the problem.

The Problem

Let's start with defining and understanding the problem first. Let's say we are running a mobility company (take for example, Rapido). In order to solve a particular use case, we are collecting information on the time spent by each bike in different areas. Let's say this is represented as a table named lapse_time in our database with bike_id, area_id and the corresponding time_spent (in minutes).



CREATE table lapse_time(
	bike_id varchar,
  	area_id VARCHAR,
  	time_spent INTEGER
)


Here is the information we have collected for example:

Graphic showing SQL query results
Figure 1

*Note: This data is sorted by bike_id and then by the order of entry of the row in the database. For example, bike_1 first spent 120 minutes in area_1. Then, it went to area_2 and spent 10 minutes over there and so on.

Now if you take a close look at row numbers 2, 3, and 4 in figure 1, you will observe that all these consecutive rows have the same area_id and bike_id, and as the data is sorted by the order of entry of the row in the database, it means that bike_1 spent 10 minutes in area_2, then 20 minutes again in area_2 , and then again 40 minutes in area_2. This looks very cumbersome and to solve the particular use case I was looking at, I needed to group together these 3 consecutive columns so that it conveys that bike_1 spent 70 (40 + 20 +10) minutes in area_2. Let's take another example to make it abundantly clear. Both the consecutive rows, 6 and 7, have the same area_id and bike_id, and thus can be grouped to the time_spent by the bike_2 in area_2 as 37 (12 + 25) minutes.

Many of you might be wondering why at all have we got separate rows for the same  bike_id in the same area. That can be because of multiple reasons, primarily owing to the way how the data collection has been done.

For example, the hardware might be collecting information at specific intervals (e.g. every 10 minutes or so) and calculating the time spent in that interval. It might be the case that a bike is at the same location in multiple intervals, in which case we get separate rows. So, this scenario is expected in real-time data.

The Solution

So, now let's try to solve this problem. The end goal here is to convert what it looks like in the figure 2 to that in the figure 3.

Graphic showing SQL query results
Figure 2

To this →

Graphic showing SQL query results
Figure 3

I personally encountered this problem while I was trying to calculate median idle time_spent in an area. If you take the data present in figure 2 and try to calculate the median idle time spent in let's say area_2, you won't get the desired result as the data hasn't been arranged accordingly. In figure 3, we have arranged the data by merging consecutive rows having the same bike_id and area_id; thus making it perfect for all sorts of calculations. Merging consecutive rows according to values of multiple columns was something that wasn't easily available, which then inspired me to write a blog regarding it.

So, how to do this? Let's go step by step.

Let's first make a dummy table in postgres named as lapse_time


CREATE table lapse_time(
		bike_id varchar,
  	area_id VARCHAR,
  	time_spent INTEGER
)


Now, let's insert some dummy data into it.


INSERT into lapse_time VALUES 
('bike_1', 'area_1', 120),
('bike_1', 'area_2', 20),
('bike_1', 'area_2', 40),
('bike_1', 'area_2', 10),
('bike_1', 'area_3', 45),
('bike_2', 'area_2', 12),
('bike_2', 'area_2', 25),
('bike_2', 'area_3', 50)


We have a table exactly like the one in figure 2.

  • Now, we need to add a row_number() to our table. This can be done simply using the following query. Let's call it query_1


select *, ROW_NUMBER() OVER (ORDER BY bike_id) as rw_nm from lapse_time;

Here's the result:

Graphic showing SQL query results
Figure 4

We need to calculate LAG of the columns based on which we want to perform the sum (in this case, bike_id and area_id). So, let's call the following operation query_2.


SELECT t.*,
  LAG(bike_id) OVER (
    ORDER BY rw_nm
  ) as prev_1,
  LAG(area_id) OVER (
    ORDER BY rw_nm
  ) AS prev_2
from ({query_1}) t;

Here's the result for this query

Graphic showing SQL query results
Figure 5

Now comes the tricky part. We'll need to use sum over window function. In order to know more about the same, one should read more about window functions in postgres.

Before going to the actual query, here's one query that one should try to get clarity:


SELECT t.*,
  SUM(1) OVER (
    ORDER BY rw_nm
  ) as grp
from ({ query_2 }) t

Here's the result for this query

Graphic showing SQL query results
Figure 6

For the actual query, we would use this with a slight modification in the sum over window function. We would be using CASE condition with sum, so here's the actual query. Let's call this query_3.


SELECT t.*,
  SUM(
    CASE
      WHEN bike_id = prev_1
      AND area_id = prev_2 THEN 0
      ELSE 1
    END
  ) OVER (
    ORDER BY rw_nm
  ) as grp
from ({ query_2 }) t


We would do the sum only when the column value is not equal to its lag value. Here's the result that we would get.

Graphic showing SQL query results
Figure 7

If we look at the grp column over here, we would notice that the consecutive rows with the same values of bike_id and area_id have the same value for grp. The final step is to use group by clause over the grp column, here's the query for the same:


select bike_id,
  area_id,
  SUM(time_spent) as time_spent,
  min(rw_nm) as rw_nm
from ({query_3}) t
GROUP by grp,
  bike_id,
  area_id
ORDER by rw_nm


And here's the result:

Graphic showing SQL query results
Figure 8

Now one can further go above this and select just the bike_id, area_id and time_spent from this table to get the exact result as in the Figure 3.

The final query

Aggregating all the steps mentioned above, here's the final query for merging consecutive rows in SQL using one or multiple columns


select bike_id,
  area_id,
  SUM(time_spent) as time_spent,
  min(rw_nm) as rw_nm
from (
    SELECT t.*,
      SUM(
        CASE
          WHEN bike_id = prev_1
          AND area_id = prev_2 THEN 0
          ELSE 1
        END
      ) OVER (
        ORDER BY rw_nm
      ) as grp
    from (
        SELECT t.*,
          LAG(bike_id) OVER (
            ORDER BY rw_nm
          ) as prev_1,
          LAG(area_id) OVER (
            ORDER BY rw_nm
          ) AS prev_2
        from (
            select *,
              ROW_NUMBER() OVER (
                ORDER BY bike_id
              ) as rw_nm
            from lapse_time
          ) t
      ) t
  ) t
GROUP by grp,
  bike_id,
  area_id
ORDER by rw_nm


Further Reading

  • This problem can further be generalized as recognition of patterns in SQL and rapid detection of unbroken sequences, better known as gaps and island problem.
  • I personally faced this problem while working to measure fleet utilization and idle time. I have written another detailed blog post on the same. You might find interesting to deep dive into that problem here:

Receive Latest InsideOps Updates

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
By clicking “Accept All Cookies”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.