Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data

Exploring if sponsor segments are getting longer by the year

I’m definitely not the only person who feels that YouTube sponsor segments have become longer and more frequent recently. Sometimes, I watch videos that seem to be trying to sell me something every couple of seconds.

On one hand, it’s great that both small and medium-sized YouTubers are able to make a living from their craft, but on the other hand, it sure is annoying to be bombarded by ads. 

In this blog post, I will explore these sponsor segments, using data from a popular browser extension called SponsorBlock, to figure out if the perceived increase in ads actually did happen and also to quantify how many ads I’m watching.

I will walk you through my analysis, providing code snippets in SQL, DuckDB, and pandas. All the code is available on my GitHub, and since the dataset is open, I will also teach you how to download it, so that you can follow along and play with the data yourself.

These are the questions I will be trying to answer in this analysis:

  • Have sponsor segments increased over the years?
  • Which channels have the highest percentage of sponsor time per video?
  • What is the density of sponsor segments throughout a video?

To get to these answers, we will have to cover a lot of ground, from understanding and cleaning the data to the analysis itself.

Let’s get this started!

How SponsorBlock Works

SponsorBlock is an extension that allows you to skip ad segments in videos, similar to how you skip Netflix intros. It’s incredibly accurate, as I don’t remember seeing one wrong segment since I started using it around a month ago, and I watch a lot of smaller non-English creators.

You might be asking yourself how the extension knows which parts of the video are sponsors, and, believe it or not, the answer is through crowdsourcing!

Users submit the timestamps for the ad segments, and other users vote if it’s accurate or not. For the average user, who isn’t contributing at all, the only thing you have to do is to press Enter to skip the ad.

Okay, now that you know what SponsorBlock is, let’s talk about the data. 

Cleaning the Data

If you want to follow along, you can download a copy of the data using this SponsorBlock Mirror (it might take you quite a few minutes to download it all). The database schema can be seen here, although most of it won’t be useful for this project.

As one might expect, their database schema is made for the extension to work properly, and not for some guy to basically leech from a huge community effort to find what percentage of ads his favorite creator runs. For this, some work will need to be done to clean and model the data.

The only two tables that are important for this analysis are:

  • sponsorTimes.csv : This is the most important table, containing the startTime and endTime of all crowdsourced sponsor segments. The CSV is around 5GB.
  • videoInfo.csv : Contains the video title, publication date, and channel ID associated with each video.

Before we get into it, these are all the libraries I ended up using. I will explain the less obvious ones as we go.

pandas
duckdb
requests
requests-cache
python-dotenv
seaborn
matplotlib
numpy

The first step, then, is to load the data. Surprisingly, this was already a bit challenging, as I was getting a lot of errors parsing some rows of the CSV. These were the settings I found to work for the majority of the rows:

import duckdb
import os

# Connect to an in-memory DuckDB instance
con = duckdb.connect(database=':memory:')

sponsor_times = con.read_csv(
    "sb-mirror/sponsorTimes.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "startTime": "DOUBLE",
        "endTime": "DOUBLE",
        "votes": "INTEGER",
        "locked": "INTEGER",
        "incorrectVotes": "INTEGER",
        "UUID": "VARCHAR",
        "userID": "VARCHAR",
        "timeSubmitted": "DOUBLE",
        "views": "INTEGER",
        "category": "VARCHAR",
        "actionType": "VARCHAR",
        "service": "VARCHAR",
        "videoDuration": "DOUBLE",
        "hidden": "INTEGER",
        "reputation": "DOUBLE",
        "shadowHidden": "INTEGER",
        "hashedVideoID": "VARCHAR",
        "userAgent": "VARCHAR",
        "description": "VARCHAR",
    },
    ignore_errors=True,
    quotechar="",
)

video_info = con.read_csv(
    "sb-mirror/videoInfo.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "channelID": "VARCHAR",
        "title": "VARCHAR",
        "published": "DOUBLE",
    },
    ignore_errors=True,
    quotechar=None,
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Here is what a sample of the data looks like:

con.sql("SELECT videoID, startTime, endTime, votes, locked, category FROM sponsor_times LIMIT 5")

con.sql("SELECT * FROM video_info LIMIT 5")
Sample of sponsorTimes.csv
Sample of videoInfo.csv

Understanding the data in the sponsorTimes table is ridiculously important; otherwise, the cleaning process won’t make any sense.

Each row represents a user-submitted timestamp for a sponsored segment. Since multiple users can submit segments for the same video, the dataset contains duplicate and potentially incorrect entries, which will need to be dealt with during cleaning.

To find incorrect segments, I will use the votes and locked columns, as the latter marks segments that were confirmed to be correct.

Another important column is the category. There are a bunch of categories like Intro, Outro, Filler, etc. For this analysis, I will only work with Sponsor and Self-Promo.
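
If you want to see which categories exist in your copy of the data and how common they are, a quick aggregation over the loaded table does the trick. This is just a sketch against the sponsor_times relation defined above; the exact counts will depend on when you downloaded the mirror.

SELECT
    category,
    COUNT(*) AS segment_count
FROM sponsor_times
GROUP BY category
ORDER BY segment_count DESC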

I started by applying some filters:

CREATE TABLE filtered AS
SELECT
    *
FROM sponsor_times
WHERE category IN ('sponsor', 'selfpromo') AND (votes > 0 OR locked=1)

Filtering for locked segments or segments with more than 0 votes was a big decision. This reduced the dataset by a huge percentage, but doing so made the data very reliable. For example, before doing this, all of the Top 50 channels with the highest percentage of ads were just spam, random channels that ran 99.9% of ads.
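
If you’re curious how much this filter actually removed, you can compare the row counts of the two tables. A small sketch, assuming the sponsor_times and filtered tables from above:

SELECT
    (SELECT COUNT(*) FROM sponsor_times) AS rows_before,
    (SELECT COUNT(*) FROM filtered) AS rows_after,
    1 - (SELECT COUNT(*) FROM filtered) * 1.0 / (SELECT COUNT(*) FROM sponsor_times) AS fraction_removed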

With this done, the next step is to get a dataset where each sponsor segment shows up only once. For example, a video with a sponsor segment at the beginning and another at the end should have only two rows of data.

This is very much not the case so far, since in one video we can have multiple user-submitted entries for each segment. To do this, I will use window functions to identify if two or more rows of data represent the same segment. 

The first window function compares the startTime of one row with the endTime of the previous. If these values don’t overlap, it means they are entries for separate segments, otherwise they are repeated entries for the same segment. 

CREATE TABLE new_segments AS
SELECT
    -- Coalesce to TRUE to deal with the first row of every window
    -- as the values are NULL, but it should count as a new segment.
    COALESCE(startTime > LAG(endTime) 
      OVER (PARTITION BY videoID ORDER BY startTime), true) 
      AS new_ad_segment,
    *
FROM filtered
Window Function example for a single video.

The new_ad_segment column is TRUE every time a row represents a new segment of a video. The first two rows, as their timestamps overlap, are properly marked as the same segment.

Next up, the second window function will label each ad segment by number:

CREATE TABLE ad_segments AS
SELECT
    SUM(new_ad_segment) 
      OVER (PARTITION BY videoID ORDER BY startTime)
      AS ad_segment,
    *
FROM new_segments
Example of labels for ad segments for a single video.

Finally, now that each segment is properly numbered, it’s easy to get the segment that is either locked or has the highest amount of votes.

CREATE TABLE unique_segments AS
SELECT DISTINCT ON (videoID, ad_segment)
    *
FROM ad_segments
ORDER BY videoID, ad_segment, locked DESC, votes DESC
Example of what the final dataset looks like for a single video.

That’s it! Now this table has one row for each unique ad segment, and I can start exploring the data.

If these queries feel complicated, and you need a refresher on window functions, check out this blog post that will teach you all you need to know about them! The last example covered in the blog post is almost exactly the process I used here.

Exploring and Enhancing the Data

Finally, the dataset is good enough to start exploring. The first thing I did was to get a sense of the size of the data:

  • 36.0k Unique Channels
  • 552.6k Unique Videos
  • 673.8k Unique Sponsor Segments, for an average of 1.22 segments per video
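
For reference, these summary numbers can be computed with a single aggregation over the cleaned segments, joined with videoInfo to get the channel IDs. This is a sketch of the idea, not necessarily the exact query from the repo:

SELECT
    COUNT(DISTINCT video_info.channelID) AS unique_channels,
    COUNT(DISTINCT unique_segments.videoID) AS unique_videos,
    COUNT(*) AS unique_sponsor_segments,
    COUNT(*) * 1.0 / COUNT(DISTINCT unique_segments.videoID) AS segments_per_video
FROM unique_segments
LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID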

As mentioned earlier, filtering for segments that were either locked or had at least one upvote reduced the dataset massively, by around 80%. But this is the price I had to pay to have data that I could work with.

To check that nothing is immediately wrong with the data, I gathered the channels that have the most videos:

CREATE TABLE top_5_channels AS 
SELECT
    channelID,
    count(DISTINCT unique_segments.videoID) AS video_count
FROM
    unique_segments
    LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID 
WHERE
    channelID IS NOT NULL
    -- Some channel IDs are blank
    AND channelID != '""'
GROUP BY
    channelID
ORDER BY
    video_count DESC
LIMIT 5

The number of videos per channel looks realistic… But this is terrible to work with. I don’t want to go to my browser and look up channel IDs every time I want to know the name of a channel.

To fix this, I created a small script with functions to get these values from the YouTube API in Python. I’m using the library requests_cache to make sure I won’t be repeating API calls and depleting the API limits.

import requests
import requests_cache
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv("YT_API_KEY")

# Cache responses indefinitely
requests_cache.install_cache("youtube_cache", expire_after=None)

def get_channel_name(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()

    try:
        return data.get("items", [])[0].get("snippet", {}).get("title", "")
    except (IndexError, AttributeError):
        return ""

Besides this, I also created very similar functions to get the country and thumbnail of each channel, which will be useful later. If you’re interested in the code, check the GitHub repo.

In my DuckDB code, I’m now able to register these Python functions and call them within SQL! I just need to be very careful to always use them on aggregated and filtered data; otherwise, I can say bye-bye to my API quota.

# This is the script created above
from youtube_api import get_channel_name

# Try registering the function, ignore if already exists
try:
    con.create_function('get_channel_name', get_channel_name, [str], str)
except Exception as e:
    print(f"Skipping function registration (possibly already exists): {e}")

# Get the channel names
channel_names = con.sql("""
    select
        channelID,
        get_channel_name(channelID) as channel_name,
        video_count
    from top_5_channels
""")

Much better! I looked up two channels that I’m familiar with on YouTube for a quick sanity check. Linus Tech Tips has a total of 7.2k videos uploaded, with 2.3k present in this dataset. Gamers Nexus has 3k videos, with 700 in the dataset. Looks good enough for me!

The last thing to do, before moving over to actually answering the question I set myself to answer, is to have an idea of the average duration of videos. 

This matches my expectations, for the most part. I’m still a bit surprised by the number of 20–40-minute videos, as for many years the “meta” was to make 10-minute videos to maximize YouTube’s own ads.

Also, I thought those buckets of video durations used in the previous graph were quite representative of how I think about video lengths, so I will be sticking with them for the next sections.

For reference, this is the pandas code used to create those buckets.

import pandas as pd

video_lengths = con.sql("""
  SELECT DISTINCT ON (videoID)
      videoID,
      videoDuration
  FROM
      unique_segments
  WHERE
      videoID IS NOT NULL
      AND videoDuration > 0
"""
).df()

# Define custom bins, in minutes
bins = [0, 3, 7, 12, 20, 40, 90, 180, 600, 9999999] 
labels = ["0-3", "3-7", "7-12", "12-20", "20-40", "40-90", "90-180", "180-600", "600+"]

# Assign each video to a bucket (transform duration to minutes)
video_lengths["duration_bucket"] = pd.cut(video_lengths["videoDuration"] / 60, bins=bins, labels=labels, right=False)

Have Sponsor Segments Increased Over the Years?

The big question. This will prove if I’m being paranoid or not about everyone trying to sell me something at all times. I will start, though, by answering a simpler question, which is the percentage of sponsors for different video durations.

My expectation is that shorter videos have a higher share of their runtime from sponsors in comparison to longer videos. Let’s check if this is actually the case.

CREATE TABLE video_total_ads AS
SELECT
    videoID,
    MAX(videoDuration) AS videoDuration,
    SUM(endTime - startTime) AS total_ad_duration,
    SUM(endTime - startTime) / 60 AS ad_minutes,
    SUM(endTime - startTime) / MAX(videoDuration) AS ad_percentage,
    MAX(videoDuration) / 60 AS video_duration_minutes
FROM
    unique_segments
WHERE
    videoDuration > 0
    AND videoDuration < 5400
    AND videoID IS NOT NULL
GROUP BY
    videoID

To keep the visualization simple, I’m applying similar buckets, but only up to 90 minutes.

# Define duration buckets (in minutes, up to 90min)
bins = [0, 3, 7, 12, 20, 30, 40, 60, 90]    
labels = ["0-3", "3-7", "7-12", "12-20", "20-30", "30-40", "40-60", "60-90"]

video_total_ads = video_total_ads.df()

# Apply the buckets again
video_total_ads["duration_bucket"] = pd.cut(video_total_ads["videoDuration"] / 60, bins=bins, labels=labels, right=False)

# Group by bucket and sum ad times and total durations
bucket_data = video_total_ads.groupby("duration_bucket")[["ad_minutes", "videoDuration"]].sum()

# Convert to percentage of total video time
bucket_data["ad_percentage"] = (bucket_data["ad_minutes"] / (bucket_data["videoDuration"] / 60)) * 100
bucket_data["video_percentage"] = 100 - bucket_data["ad_percentage"]

As expected, if you’re watching shorter-form content on YouTube, then around 10% of it is sponsored! Videos of 12–20 min in duration have 6.5% of sponsors, while 20–30 min have only 4.8%.

To move forward to the year-by-year analysis I need to join the sponsor times with the videoInfo table.

CREATE TABLE video_total_ads_joined AS
SELECT
    *
FROM
    video_total_ads
LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID

Next, let’s just check how many videos we have per year:

SELECT
    *,
    to_timestamp(NULLIF (published, 0)) AS published_date,
    extract(year FROM to_timestamp(NULLIF (published, 0))) AS published_year
FROM
    video_total_ads_joined
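
To actually get counts per year, I then aggregate on that extracted year. Roughly like this (a sketch; the full code is in the repo):

SELECT
    extract(year FROM to_timestamp(NULLIF (published, 0))) AS published_year,
    COUNT(*) AS video_count
FROM
    video_total_ads_joined
GROUP BY
    published_year
ORDER BY
    published_year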

Not good, not good at all. I’m not exactly sure why, but a lot of videos don’t have their publication timestamp recorded. It seems that only in 2021 and 2022 were videos reliably stored with their published date.

I do have some ideas on how I can improve this dataset with other public data, but it’s a very time-consuming process and I will leave this for a future blog post. I don’t intend to settle for an answer based on limited data, but for now, I will have to make do with what I have.

I chose to keep the analysis between the years 2018 and 2023, given that those years had more data points.

# Limiting the years as for these here I have a decent amount of data.
start_year = 2018
end_year = 2023

plot_df = (
    video_total_ads_joined.df()
    .query(f"published_year >= {start_year} and published_year <= {end_year}")
    .groupby(["published_year", "duration_bucket"], as_index=False)
    [["ad_minutes", "video_duration_minutes"]]
    .sum()
)

# Calculate ad_percentage & content_percentage
plot_df["ad_percentage"] = (
    plot_df["ad_minutes"] / plot_df["video_duration_minutes"] * 100
)
plot_df["content_percentage"] = 100 - plot_df["ad_percentage"]

There is a steep increase in ad percentage, especially from 2020 to 2021, but afterward it plateaus, especially for longer videos. This makes a lot of sense, since during those years online advertising grew a lot as people spent more and more time at home.

For shorter videos, there does seem to be an increase from 2022 to 2023. But as the data is limited, and I don’t have data for 2024, I can’t get a conclusive answer to this. 

Next up, let’s move into questions that don’t depend on the publishing date, this way I can work with a larger portion of the dataset.

Which Channels Have the Highest Percentage of Sponsor Time Per Video?

This is a fun one for me, as I wonder if the channels I actively watch are the ones that run the most ads. 

Continuing from the table created previously, I can easily group the ad and video amount by channel:

CREATE TABLE ad_percentage_per_channel AS
SELECT
    channelID,
    sum(ad_minutes) AS channel_total_ad_minutes,
    sum(videoDuration) / 60 AS channel_total_video_minutes,
    sum(ad_minutes) / (sum(videoDuration) / 60) AS channel_ad_percentage
FROM
    video_total_ads_joined
GROUP BY
    channelID

I decided to filter for channels that had at least 30 minutes of videos in the data, as a way of eliminating outliers.

SELECT
    channelID,
    channel_total_video_minutes,
    channel_total_ad_minutes,
    channel_ad_percentage
FROM
    ad_percentage_per_channel
WHERE
    -- At least 30 minutes of video
    channel_total_video_minutes > 1800
    AND channelID IS NOT NULL
ORDER BY
    channel_ad_percentage DESC
LIMIT 50

As quickly mentioned earlier, I also created some functions to get the country and thumbnail of channels. This allowed me to create this visualization.

I’m not sure if this surprised me or not. Some of the channels on this list I watch very frequently, especially Gaveta (#31), a Brazilian YouTuber who covers movies and film editing.

I also know that both he and Corridor Crew (#32) do a lot of self-promotion, plugging their own content and products, so maybe this is also the case for other channels! 

In any case, the data seems good, and the percentages seem to match my manual checks and personal experience.

I would love to know if channels that you watch were present in this list, and if it surprised you or not!

If you want to see the Top 150 Creators, subscribe to my free newsletter, as I will be publishing the full list as well as more information about this analysis in there!

What Is the Density of Sponsor Segments Throughout a Video?

Have you ever thought about at which point of a video ads work best? People probably just skip sponsor segments placed at the beginning, and simply close the video for those placed at the end.

From personal experience, I feel that I’m more likely to watch an ad if it plays around the middle of a video, but I don’t think this is what creators do in most cases.

My goal, then, is to create a heatmap that shows the density of ads during a video runtime. Doing this was surprisingly not obvious, and the solution that I found was so clever that it kinda blew my mind. Let me show you.

The data needed for this analysis is simple: one row per ad segment, with the timestamps at which each segment starts and ends.

The first step is to normalize the intervals, e.g., I don’t care that an ad started at 63s, what I want to know is if it started at 1% of the video runtime or 50% of the video runtime.

CREATE TABLE ad_intervals AS
SELECT
    videoID,
    startTime,
    endTime,
    videoDuration,
    startTime / videoDuration AS start_fraction,
    endTime / videoDuration AS end_fraction
FROM
    unique_segments
WHERE
    -- Just to make sure we don't have bad data
    videoID IS NOT NULL
    AND startTime >= 0
    AND endTime <= videoDuration
    AND startTime < endTime
    -- Less than 40h
    AND videoDuration < 144000

Great, now all intervals are comparable, but the problem is far from solved.

I want you to think: how would you solve this if I asked you, “At 10% of the runtime, across all videos, how many ads are running?”

I do not believe that this is an obvious problem to solve. My first instinct was to create a bunch of buckets, and then, for each row, I would ask “Is there an ad running at 1% of the runtime? What about at 2%? And so on…”

This seemed like a terrible idea, though. I wouldn’t be able to do it in SQL, and the code to solve it would be incredibly messy. In the end, the implementation of the solution I found was remarkably simple, using the Sweep Line Algorithm, a technique often used in programming interviews and puzzles.

I will show you how I solved it but don’t worry if you don’t understand what is happening. I will share other resources for you to learn more about it later on.

The first thing to do is to transform each interval (startTime, endTime) into two events, one that will count as +1 when the ad starts, and another that will count as -1 when the ad finishes. Afterward, just order the dataset by the “start time”.

CREATE TABLE ad_events AS
WITH unioned as (
  -- This is the most important step.
  SELECT
      videoID,
      start_fraction as fraction,
      1 as delta
  FROM ad_intervals
  UNION ALL
  SELECT
      videoID,
      end_fraction as fraction,
      -1 as delta
  FROM ad_intervals
), ordered AS (
  SELECT
      videoID,
      fraction,
      delta
  FROM unioned
  ORDER BY fraction, delta
)
SELECT * FROM ordered

Now it’s already much easier to see the path forward! All I have to do is use a running sum on the delta column, and then, at any point of the dataset, I can know how many ads are running! 

For example, if from 0s to 10s three ads started, but two of those also finished, I would have a delta of +3 and then -2, which means that there is only one ad currently running!

Going forward, and to simplify the data a bit, I first round the fractions to 4 decimal places and aggregate them. This is not strictly necessary, but having too many rows was a problem when trying to plot the data. Finally, I divide the number of running ads by the total number of videos, to express it as a percentage.

CREATE TABLE ad_counter AS 
WITH rounded_and_grouped AS (
  SELECT
      ROUND(fraction, 4) as fraction,
      SUM(delta) as delta
  FROM ad_events
  GROUP BY ROUND(fraction, 4)
  ORDER BY fraction
), running_sum AS (
  SELECT
      fraction,
      SUM(delta) OVER (ORDER BY fraction) as ad_counter
  FROM rounded_and_grouped
), density AS (
  SELECT
      fraction,
      ad_counter,
      ad_counter / (SELECT COUNT(DISTINCT videoID) FROM ad_intervals) as density
  FROM running_sum
)
SELECT * FROM density

With this data, I know not only that at the very beginning of videos (the 0.0% fraction) there are 69,987 videos running ads, but also that this represents 17% of all videos in the dataset.

Now I can finally plot it as a heatmap:

As expected, the bumps at the extremities show that it’s way more common for channels to run ads at the beginning and end of the video. It’s also interesting that there is a plateau around the middle of the video, but then a drop, as the second half of the video is generally more ad-free.

What I found funny is that it’s apparently common for some videos to start straight away with an ad. I couldn’t picture this, so I manually checked 10 videos and it’s actually true… I’m not sure how representative it is, but most of the ones that I opened were gaming-related and in Russian, and they started directly with ads!

Before we move on to the conclusions, what did you think of the solution to this problem? I was surprised at how simple doing this was with the Sweep Line trick. If you want to know more about it, I recently published a blog post covering some SQL patterns, and the last one is exactly this problem! Just repackaged in the context of counting concurrent meetings.

Conclusion

I really enjoyed doing this analysis since the data feels very personal to me, especially because I’ve been addicted to YouTube lately. I also feel that the answers I found were quite satisfactory, at least for the most part. To finish it off, let’s do a last recap!

Have Sponsor Segments Increased Over the Years?

There was a clear increase from 2020 to 2021. This was an effect that happened throughout all digital media and it’s clearly shown in this data. In more recent years, I can’t say whether there was an increase or not, as I don’t have enough data to be confident. 

Which Channels Have the Highest Percentage of Sponsor Time Per Video?

I got to create a very convincing list of the Top 50 channels that run the highest amount of ads. And I discovered that some of my favorite creators are the ones that spend the most amount of time trying to sell me something!

What is the density of sponsor segments throughout a video?

As expected, most people run ads at the beginning and the end of videos. Besides this, a lot of creators run ads around the middle of the video, making the second half slightly more ad-free. 

Also, there are YouTubers who immediately start a video with ads, which I think is a crazy strategy. 

Other Learnings and Next Steps

I liked how clearly the data showed the percentage of ads in different video sizes. Now I know that I’m probably spending 5–6% of my time on YouTube watching ads if I’m not skipping them, since I mostly watch videos that are 10–20 minutes long.

I’m still not fully happy though with the year-by-year analysis. I’ve already looked into other data and downloaded more than 100 GB of YouTube metadata datasets. I’m confident that I can use it, together with the YouTube API, to fill some gaps and get a more convincing answer to my question.

Visualization Code

You might have noticed that I didn’t provide snippets to plot the charts shown here. This was on purpose to make the blog post more readable, as matplotlib code occupies a lot of space.

You can find all the code in my GitHub repo, that way you can copy my charts if you want to.


That’s it for this one! I really hope you enjoyed reading this blog post and learned something new!

If you’re curious about interesting topics that didn’t make it into this post, or enjoy learning about data, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.

Want to connect directly or have questions? Reach out anytime at mtrentz.com.

All images and animations by the author unless stated otherwise.

7 Powerful DBeaver Tips and Tricks to Improve Your SQL Workflow

Straight-to-the-point tips for the best SQL IDE

DBeaver is the most powerful open-source SQL IDE, but there are several features people don’t know about. In this post, I will share with you several features to speed up your workflow, with zero fluff.

I’ve learned these as I’m currently digging deeper into the tools I use daily, starting with DBeaver. In a future post, I’ll compare the workflow in DBeaver versus building your own SQL development environment in VSCode (or Cursor). If you’re interested in that, be sure to follow my publications!

Today, though, the focus is on learning the cool features of DBeaver. Let’s get started.

The Command Palette

This is one of the most powerful yet hidden features in DBeaver. Maybe people overlook it because it’s not called a “Command Palette”. You can open it with CMD + 3 (Mac) or CTRL + 3 (Windows).

From here, you can access basically any action in the IDE. I mostly use it for:

  • Switching between SQL scripts.
  • Navigating to specific settings.
  • Quickly accessing actions like Export Results, Refresh Schema, Open Templates, Rename File, etc.

(Officially, this feature in DBeaver is called “Find Actions.”)

Custom SQL formatter

Did you know you can easily set up a different formatter in DBeaver? I’m personally not a fan of the default formatting, and since I mainly use PostgreSQL, I prefer pg_formatter.

Let me show you how it works to set up pg_formatter, but keep in mind that the process will be similar for any SQL formatter that you can call via the terminal.

# Install PG Formatter
brew install pgformatter

# Find where the program is located.
# In my case: /opt/homebrew/bin/pg_format
which pg_format

Next, go to Preferences → Editors → SQL Editor → Formatting, select an “External Formatter,” and then paste the path to your desired formatter.

💡 Or you can simply open the command palette and search for “Formatting”.

Expand columns on SELECT

Often, you might need to select most columns from a table, excluding just a few. DBeaver makes this easier by expanding your SELECT * into explicit column names.

You can do this with the CTRL + Space hotkey on both Mac and Windows. This might not work if you have it bound to another system shortcut, in which case you can look up “Content Assist” in the command palette.

Quickly find column statistics

DBeaver has a bunch of features to speed up your analysis. One that I use frequently is the “Calc tab”, located to the right of your query results. It lets you quickly get information about the columns in whatever you just queried.

Here is what you can do with it:

  • Find the number of unique and non-null values in categorical columns.
  • Get min, max, mean, median, etc., of numerical columns.

Very handy for quickly understanding your dataset!

Ad-hoc groups

Similarly to the Calc tab, the “Groupings tab” lets you quickly create group-by queries without writing SQL manually.

What you can do with it:

  • Quickly count occurrences of values.
  • Add multiple aggregations.

Although this one is pretty nice for simple aggregations, I find it a bit underwhelming as there is no way to count unique values, as I do in the GIF above, without having to manually write the metric function.

SQL templates

SQL Templates are extremely powerful, although I admit I don’t use them as often as I should. Templates save you from repeatedly writing common expressions.

You can see the built-in templates by opening the command palette and searching for “Templates”. You will see shortcuts for:

  • SELECT * FROM {table}
  • SELECT * FROM {table} WHERE {col} = {value}
  • And others, like selecting and ordering, counting by groups, etc.

All you need to do is write the shorthand for the query and press Tab:

You can also create your own templates, which is not hard if you just copy the existing ones and adapt them.
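
To give you an idea, a custom template for a quick group-by count could look something like the snippet below. I’m reusing the placeholder notation from the built-in templates listed above, so copy the exact variable syntax from one of those when you create yours:

SELECT {column}, COUNT(*) AS cnt
FROM {table}
GROUP BY {column}
ORDER BY cnt DESC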

Advanced copy tips

You might already know DBeaver has extensive data export options. However, the standard export wizard can feel a bit overwhelming, as it shows you tons of configurations, even when you just want to quickly export a CSV.

A quicker way is selecting data in the Results tab, right-clicking, and choosing “Advanced Copy”. By doing this, you can copy your data in many formats like CSV, JSON, Markdown, TXT, and even SQL Insert Statements.

I find this one super helpful, especially when I need to quickly send data to a teammate.

A bonus tip is that you can copy this data to TSV, which gets properly recognized by Excel and Google Sheets into their proper cells! For this one though, you have to be double advanced, according to DBeaver 😅

Conclusions

I love DBeaver as a SQL IDE. It’s incredibly powerful and the interface is very clean. It’s honestly mindblowing that this tool is free and open-source. If you haven’t tried it yet, I recommend it a lot! 

I haven’t seen that many people talking about some of its nicest features and tricks, and most tips I’ve shared came from just using the software. There is probably a lot that I missed, especially when it comes to Plugins, which I haven’t used much.

I know I went through all of the tips very fast, so if you’re in doubt, feel free to reach out. Also, if you have more workflow tips, I would love to hear about them!


I hope you learned something new!

If you’re curious about other tips that didn’t make it into this post, or enjoy learning about general data topics, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.

Want to connect directly or have questions? Reach out anytime at mtrentz.com.

All images and animations, unless otherwise noted, are by the author

Practical SQL Puzzles That Will Level Up Your Skill

Three real-world SQL patterns that can be applied to many problems

There are some SQL patterns that, once you know them, you start seeing everywhere. The solutions to the puzzles that I will show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries you write on a day-to-day basis.

These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query that I had to build. I also encourage you to try them yourself before reading the solutions, as challenging yourself first will improve your learning!

All queries to generate the datasets will be provided in a PostgreSQL- and DuckDB-friendly syntax, so that you can easily copy and play with them. At the end, I will also provide a link to a GitHub repo containing all the code, as well as the answer to the bonus challenge I will leave you with!

I organized these puzzles in order of increasing difficulty, so, if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won’t have seen before.

Okay, let’s get started.

Analyzing ticket moves

I love this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving in between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.

The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the “from” and “to” stages of the move. The stages present are New, Doing, Review, and Done.

Some things you need to know (edge cases):

  • Tickets can move backwards, meaning tickets can go back to the Doing stage.
  • You should not include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there for.
  • Tickets are not always created in the New stage.

CREATE TABLE ticket_moves (
    ticket_id INT NOT NULL,
    create_date DATE NOT NULL,
    move_date DATE NOT NULL,
    from_stage TEXT NOT NULL,
    to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
    VALUES
        -- Ticket 1: Created in "New", then moves to Doing, Review, Done.
        (1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
        (1, '2024-09-01', '2024-09-07', 'Doing', 'Review'),
        (1, '2024-09-01', '2024-09-10', 'Review', 'Done'),
        -- Ticket 2: Created in "New", then moves: New → Doing → Review → Doing again → Review.
        (2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
        (2, '2024-09-05', '2024-09-12', 'Doing', 'Review'),
        (2, '2024-09-05', '2024-09-15', 'Review', 'Doing'),
        (2, '2024-09-05', '2024-09-20', 'Doing', 'Review'),
        -- Ticket 3: Created in "New", then moves to Doing. (Edge case: no subsequent move from Doing.)
        (3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
        -- Ticket 4: Created already in "Doing", then moves to Review.
        (4, '2024-09-15', '2024-09-22', 'Doing', 'Review');

A summary of the data:

  • Ticket 1: Created in the New stage, moves normally to Doing, then Review, and then Done.
  • Ticket 2: Created in New, then moves: New → Doing → Review → Doing again → Review.
  • Ticket 3: Created in New, moves to Doing, but it is still stuck there.
  • Ticket 4: Created in the Doing stage, moves to Review afterward.

It might be a good idea to stop for a bit and think about how you would deal with this. Can you figure out how long a ticket stays in a single stage?

Honestly, this sounds intimidating at first, and it looks like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I will explain what is happening afterward.

WITH stage_intervals AS (
    SELECT
        ticket_id,
        from_stage,
        move_date 
        - COALESCE(
            LAG(move_date) OVER (
                PARTITION BY ticket_id 
                ORDER BY move_date
            ), 
            create_date
        ) AS days_in_stage
    FROM
        ticket_moves
)
SELECT
    SUM(days_in_stage) / COUNT(DISTINCT ticket_id) as avg_days_in_doing
FROM
    stage_intervals
WHERE
    from_stage = 'Doing';

The first CTE uses the LAG function to find the previous move of the ticket, which will be the time the ticket entered that stage. Calculating the duration is as simple as subtracting the previous date from the move date.

What you should notice is the use of the COALESCE in the previous move date. What that does is that if a ticket doesn’t have a previous move, then it uses the date of creation of the ticket. This takes care of the cases of tickets being created directly into the Doing stage, as it still will properly calculate the time it took to leave the stage.

This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries, as it visited the Doing stage on two separate occasions.

With this done, it’s just a matter of getting the average as the SUM of total days spent in doing, divided by the distinct number of tickets that ever left the stage. Doing it this way, instead of simply using the AVG, makes sure that the two rows for Ticket 2 get properly accounted for as a single ticket.
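
To make the difference concrete with the sample data (if I’ve added it up right): Ticket 1 spent 4 days in Doing, Ticket 2 spent 4 + 5 = 9 days across its two visits, and Ticket 4 spent 7 days, for a total of 20 days. Dividing by the 3 distinct tickets that ever left the stage gives roughly 6.7 days, while a plain AVG over the 4 rows would give 20 / 4 = 5 days, silently treating Ticket 2’s two visits as two separate tickets.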

Not so bad, right?

Finding contract sequences

The goal of this second challenge is to find the most recent contract sequence of every employee. A break in the sequence happens when two contracts have a gap of more than one day between them. 

In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap or ends a day before the new one starts.

CREATE TABLE contracts (
    contract_id integer PRIMARY KEY,
    employee_id integer NOT NULL,
    start_date date NOT NULL,
    end_date date NOT NULL
);

INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES 
    -- Employee 1: Two continuous contracts
    (1, 1, '2024-01-01', '2024-03-31'),
    (2, 1, '2024-04-01', '2024-06-30'),
    -- Employee 2: One contract, then a gap of three days, then two contracts
    (3, 2, '2024-01-01', '2024-02-15'),
    (4, 2, '2024-02-19', '2024-04-30'),
    (5, 2, '2024-05-01', '2024-07-31'),
    -- Employee 3: One contract
    (6, 3, '2024-03-01', '2024-08-31');

As a summary of the data:

  • Employee 1: Has two continuous contracts.
  • Employee 2: One contract, then a gap of three days, then two contracts.
  • Employee 3: One contract.

The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one that has a gap.

Before explaining the logic behind the solution, I would like you to think about what operation can be used to join contracts that belong to the same sequence. Focus only on the second row of data: what information do you need to know whether this contract was a break in the sequence or not?

I hope it’s clear that this is the perfect situation for window functions, again. They are incredibly useful for solving problems like this, and understanding when to use them helps a lot in finding clean solutions to problems.

The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. With that in hand, it’s simple to compare both dates and check if there was a break in the sequence.

WITH ordered_contracts AS (
    SELECT
        *,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
    FROM
        contracts
),
gapped_contracts AS (
    SELECT
        *,
        -- Deals with the case of the first contract, which won't have
        -- a previous end date. In this case, it's still the start of a new
        -- sequence.
        CASE WHEN previous_end_date IS NULL
            OR previous_end_date < start_date - INTERVAL '1 day' THEN
            1
        ELSE
            0
        END AS is_new_sequence
    FROM
        ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee who has no gaps will always be on their first sequence, but an employee who had 5 breaks in contracts will be on their 5th sequence. Funnily enough, this is done with yet another window function.

--
-- Previous CTEs
--
sequences AS (
    SELECT
        *,
        SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
FROM
    gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;

Notice how Employee 2 starts sequence #2 after the first gapped value. To finish this query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences table to keep only the most recent one.

--
-- Previous CTEs
--
max_sequence AS (
    SELECT
        employee_id,
        MAX(sequence_id) AS max_sequence_id
FROM
    sequences
GROUP BY
    employee_id
),
latest_contract_sequence AS (
    SELECT
        c.contract_id,
        c.employee_id,
        c.start_date,
        c.end_date
    FROM
        sequences c
        JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
            AND c.employee_id = m.employee_id
        ORDER BY
            c.employee_id,
            c.start_date
)
SELECT
    *
FROM
    latest_contract_sequence;

As expected, our final result is basically our starting query just with the first contract of Employee 2 missing! 

Tracking concurrent events

Finally, the last puzzle — I’m glad you made it this far. 

For me, this is the most mind-blowing one, as when I first encountered this problem I thought of a completely different solution that would be a mess to implement in SQL.

For this puzzle, I’ve changed the context from what I had to deal with for my job, as I think it will make it easier to explain. 

Imagine you’re a data analyst at an event venue, and you’re analyzing the talks scheduled for an upcoming event. You want to find the time of day where there will be the highest number of talks happening at the same time.

This is what you should know about the schedules:

  • Rooms are booked in increments of 30min, e.g. from 9h-10h30.
  • The data is clean, there are no overbookings of meeting rooms.
  • There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data). 

CREATE TABLE meetings (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);

INSERT INTO meetings (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way to solve this is to use what is called a Sweep Line Algorithm, also known as an event-based solution. This last name actually helps in understanding what will be done, as the idea is that instead of dealing with intervals, which is what we have in the original data, we deal with events instead.

To do this, we need to transform every row into two separate events. The first event will be the Start of the meeting, and the second event will be the End of the meeting.

WITH events AS (
  -- Create an event for the start of each meeting (+1)
  SELECT 
    start_time AS event_time, 
    1 AS delta
  FROM meetings
  UNION ALL
  -- Create an event for the end of each meeting (-1)
  SELECT 
   -- Small trick to work with the back-to-back meetings (explained later)
    end_time - interval '1 minute' as end_time,
    -1 AS delta
  FROM meetings
)
SELECT * FROM events;

Take the time to understand what is happening here. To create two events from a single row of data, we’re simply unioning the dataset with itself; the first half uses the start time as the timestamp, and the second half uses the end time.

You might already notice the delta column created and see where this is going. When an event starts, we count it as +1, when it ends, we count it as -1. You might even be already thinking of another window function to solve this, and you’re actually right!

But before that, let me just explain the trick I used with the end dates. As I don’t want back-to-back meetings to count as two concurrent meetings, I’m subtracting a single minute from every end date. This way, if a meeting ends and another starts at 10h30, it won’t be assumed that two meetings are happening concurrently at 10h30.

Okay, back to the query and yet another window function. This time, though, the function of choice is a rolling SUM.

--
-- Previous CTEs
--
ordered_events AS (
  SELECT
    event_time,
    delta,
    SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
  FROM events
)
SELECT * FROM ordered_events ORDER BY event_time DESC;

The rolling SUM over the delta column essentially walks down every record and finds how many events are active at that time. For example, at 9 am sharp, it sees two events starting, so it marks the number of concurrent meetings as two!

When the third meeting starts, the count goes up to three. But when it gets to 9h59 (10 am), two meetings end, bringing the counter back to one. With this data, the only thing missing is to find when the highest number of concurrent meetings happens.

--
-- Previous CTEs
--
max_events AS (
  -- Find the maximum concurrent meetings value
  SELECT 
    event_time, 
    concurrent_meetings,
    RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
  FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That’s it! The interval of 9h30–10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!

This solution looks incredibly simple in my opinion, and it works for many situations. Every time you are dealing with intervals now, you should consider whether the query wouldn’t be easier if you thought about it from the perspective of events.

But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!

Bonus challenge

The context for this one is still the same as the last puzzle, but now, instead of trying to find the period when there are most concurrent meetings, the objective is to find bad scheduling. It seems that there are overlaps in the meeting rooms, which need to be listed so it can be fixed ASAP.

How would you find out if the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:

  • It’s still the same algorithm.
  • This means you will still do the UNION, but it will look slightly different.
  • You should think from the perspective of each meeting room.

You can use this data for the challenge:

CREATE TABLE meetings_overlap (
    room TEXT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL
);

INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
    -- Room A meetings
    ('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
    ('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
    ('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
    -- Room B meetings
    ('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
    -- Room C meetings
    ('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
    -- Overlaps with previous meeting.
    ('Room C', '2024-10-01 09:30', '2024-10-01 12:00');

If you’re interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.

Conclusion

The first takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable with using them, I feel that my queries have gotten so much simpler and easier to read, and I hope the same happens to you.

If you’re interested in learning more about them, you would probably enjoy reading this other blog post I’ve written, where I go over how you can understand and use them effectively.

The second takeaway is that these patterns used in the challenges really do happen in many other places. You might need to find sequences of subscriptions, customer retention, or you might need to find overlap of tasks. There are many situations when you will need to use window functions in a very similar fashion to what was done in the puzzles.

The third thing I want you to remember is this approach of using events instead of dealing with intervals directly. I’ve looked at some problems I solved a long time ago that I could have used this pattern on to make my life easier, and unfortunately, I didn’t know about it at the time.


I really do hope you enjoyed this post and gave a shot to the puzzles yourself. And I’m sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions! 

Thank you so much for reading. If you have questions or just want to get in touch with me, don’t hesitate to contact me at mtrentz.com.

All images by the author unless stated otherwise.

Understand SQL Window Functions Once and For All

A step-by-step guide to understanding window functions

Window functions are key to writing SQL code that is both efficient and easy to understand. Knowing how they work and when to use them will unlock new ways of solving your reporting problems.

The objective of this article is to explain window functions in SQL step by step in an understandable way so that you don’t need to rely on only memorizing the syntax.

Here is what we will cover:

  • An explanation on how you should view window functions
  • Go over many examples in increasing difficulty
  • Look at one specific real-case scenario to put our learnings into practice
  • Review what we’ve learned

Our dataset is simple: six rows of revenue data for two regions in the year 2023.

Window Functions Are Sub Groups

If we took this dataset and ran a GROUP BY sum on the revenue of each region, it would be clear what happens, right? It would result in only two remaining rows, one for each region, and then the sum of the revenues:

The way I want you to view window functions is very similar to this but, instead of reducing the number of rows, the aggregation will run "in the background" and the values will be added to our existing rows.

First, an example:

Sql">SELECT
 id,
    date,
    region,
    revenue,
    SUM(revenue) OVER () as total_revenue
FROM
    sales

Notice that we don’t have any GROUP BY and our dataset is left intact, and yet we were able to get the sum of all revenues. Before we go more in depth into how this worked, let’s quickly talk about the full syntax, and then we will start building up our knowledge.

The Window Function Syntax

The syntax goes like this:

SUM([some_column]) OVER (PARTITION BY [some_columns] ORDER BY [some_columns])

Picking apart each section, this is what we have:

  • An aggregation or window function: SUM, AVG, MAX, RANK, FIRST_VALUE
  • The OVER keyword which says this is a window function
  • The PARTITION BY section, which defines the groups
  • The ORDER BY section which defines if it’s a running function (we will cover this later on)

Don’t stress over what each of these means yet, as it will become clear when we go over the examples. For now just know that to define a window function we will use the OVER keyword. And as we saw in the first example, that’s the only requirement.

Building Our Understanding Step By Step

Moving on to something actually useful, we will now apply a grouping in our function. The initial calculation will be kept to show you that we can run more than one window function at once, which means we can do different aggregations in the same query, without requiring sub-queries.

SELECT
    id,
    date,
    region,
    revenue,
    SUM(revenue) OVER (PARTITION BY region) as region_total,
    SUM(revenue) OVER () as total_revenue
FROM sales

As said, we use the PARTITION BY to define our groups (windows) that are used by our aggregation function! So, keeping our dataset intact we’ve got:

  • The total revenue for each region
  • The total revenue for the whole dataset

We’re also not restrained to a single group. Similar to GROUP BY we can partition our data on Region and Quarter, for example:

SELECT
    id,
    date,
    region,
    revenue,
    SUM(revenue) OVER (PARTITION BY 
          region, 
          date_trunc('quarter', date)
    ) AS region_quarterly_revenue
FROM sales

In the image we see that the only two data points for the same region and quarter got grouped together!

At this point, I hope it’s clear how we can view this as doing a GROUP BY but in place, without reducing the number of rows in our dataset. Of course, we don’t always want that, but it’s not that uncommon to see queries where someone groups data and then joins it back into the original dataset, complicating what could be a single window function.
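
To make that concrete, here is a sketch of the pattern I mean, written against the same sales table. It produces the exact same region_total column as the single window function above, just with more moving parts:

SELECT
    s.id,
    s.date,
    s.region,
    s.revenue,
    r.region_total
FROM sales s
JOIN (
    -- The "group then join back" version of SUM(revenue) OVER (PARTITION BY region)
    SELECT region, SUM(revenue) AS region_total
    FROM sales
    GROUP BY region
) r ON s.region = r.region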

Moving on to the ORDER BY keyword. This one defines a running window function. You’ve probably heard of a Running Sum once in your life, but if not, we should start with an example to make everything clear.

SELECT
    id,
    date,
    region,
    revenue,
    SUM(revenue) OVER (ORDER BY id) as running_total
FROM sales

What happens here is that we’ve gone, row by row, summing the revenue with all previous values. This was done following the order of the id column, but it could’ve been any other column.

This specific example is not particularly useful because we’re summing across random months and two regions, but using what we’ve learned we can now find the cumulative revenue per region. We do that by applying the running sum within each group.

SELECT
    id,
    date,
    region,
    revenue,
    SUM(revenue) OVER (PARTITION BY region ORDER BY date) as running_total
FROM sales

Take the time to make sure you understand what happened here:

  • For each region we’re walking up month by month and summing the revenue
  • Once it’s done for that region we move to the next one, starting from scratch and again moving up the months!

It’s quite interesting to notice here that when we’re writing these running functions we have the “context” of other rows. What I mean is that to get the running sum at one point, we must know the previous values for the previous rows. This becomes more obvious when we learn that we can manually choose how many rows before/after we want to aggregate on.

SELECT
    id,
    date,
    region,
    revenue,
    SUM(revenue) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) 
    AS useless_sum
FROM
    sales

For this query we specified that for each row we wanted to look at one row behind and two rows ahead, which means we get the sum of that range! Depending on the problem you’re solving, this can be extremely powerful, as it gives you complete control over how you’re grouping your data.

Finally, one last function I want to mention before we move into a harder example is the RANK function. This gets asked a lot in interviews and the logic behind it is the same as everything we’ve learned so far.

SELECT
    *,
    RANK() OVER (PARTITION BY region ORDER BY revenue DESC) as rank,
    RANK() OVER (ORDER BY revenue DESC) as overall_rank
FROM
    sales
ORDER BY region, revenue DESC

Just as before, we used ORDER BY to specify the order which we will walk, row by row, and PARTITION BY to specify our sub-groups.

The first column ranks each row within each region, meaning that we will have multiple “rank ones” in the dataset. The second calculation is the rank across all rows in the dataset.

Forward Filling Missing Data

This is a problem that shows up every now and then and to solve it on SQL it takes heavy usage of Window Functions. To explain this concept we will use a different dataset containing timestamps and temperature measurements. Our goal is to fill in the rows missing temperature measurements with the last measured value.

Here is what we expect to have at the end:

Before we start I just want to mention that if you’re using Pandas you can solve this problem simply by running df.ffill() but if you’re on SQL the problem gets a bit more tricky.

The first step to solve this is to, somehow, group the NULLs with the previous non-null value. It might not be clear how we do this, but I hope it’s clear that it will require a running function, meaning a function that “walks row by row”, knowing when we hit a null value and when we hit a non-null value.

The solution is to use COUNT and, more specifically, count the values of temperature measurements. In the following query I run both a normal running count and also a count over the temperature values.

SELECT
    *,
    COUNT(*) OVER (ORDER BY timestamp) as normal_count,
    COUNT(temperature) OVER (ORDER BY timestamp) as group_count
from sensor

  • In the first calculation we simply counted each row incrementally
  • In the second one we counted every value of temperature we saw, not counting when it was NULL

The normal_count column is useless for us; I just wanted to show what a running COUNT looks like. Our second calculation, though, the group_count, moves us closer to solving our problem!

Notice that this way of counting makes sure that the first value, just before the NULLs start, is counted and then, every time the function sees a null, nothing happens. This makes sure that we’re "tagging" every subsequent null with the same count we had when we stopped having measurements.

Moving on, we now need to copy the first value that got tagged over to all the other rows within that same group. Meaning that group 2, for example, needs to be entirely filled with the value 15.0.

Can you think of a function we can use here? There is more than one answer to this, but, again, I hope it’s at least clear that we’re now looking at a simple window aggregation with PARTITION BY.

SELECT
    *,
    FIRST_VALUE(temperature) OVER (PARTITION BY group_count) as filled_v1,
    MAX(temperature) OVER (PARTITION BY group_count) as filled_v2
FROM (
    SELECT
        *,
        COUNT(temperature) OVER (ORDER BY timestamp) as group_count
    from sensor
) as grouped
ORDER BY timestamp ASC

We can use both FIRST_VALUE or MAX to achieve what we want. The only goal is that we get the first non-null value. Since we know that each group contains one non-null value and a bunch of null values, both of these functions work!

This example is a great way to practice window functions. If you want a similar challenge try to add two sensors and then forward fill the values with the previous reading of that sensor. Something similar to this:

Could you do it? It doesn’t use anything that we haven’t learned here so far.
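
If you want to compare your answer afterwards, here is one possible sketch. It assumes a hypothetical sensor_id column identifying each sensor; the only real change is partitioning everything by the sensor as well:

SELECT
    *,
    -- Each group holds one non-null reading followed by its NULLs, so MAX fills the gap
    MAX(temperature) OVER (PARTITION BY sensor_id, group_count) AS filled
FROM (
    SELECT
        *,
        -- sensor_id is assumed to exist in this two-sensor variant of the table
        COUNT(temperature) OVER (PARTITION BY sensor_id ORDER BY timestamp) AS group_count
    from sensor
) as grouped
ORDER BY sensor_id, timestamp ASC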

By now we know everything that we need about how window functions work in SQL, so let’s just do a quick recap!

Recap Time

This is what we’ve learned:

  • We use the OVER keyword to write window functions
  • We use PARTITION BY to specify our sub-groups (windows)
  • If we provide only the OVER() keyword our window is the whole dataset
  • We use ORDER BY when we want to have a running function, meaning that our calculation walks row by row
  • Window functions are useful when we want to group data to run an aggregation but we want to keep our dataset as is

I hope this helped you understand how window functions work and helps you apply it in the problems you need to solve.

All images by the author unless stated otherwise
