<p><strong>Advanced SQL - window frames</strong><br />Michał Konarski, 2019-09-18</p>
<p>This article is part of my series discussing advanced SQL concepts that have been supported by popular databases for quite some time, but are not very well known by database users. My idea is to explain them in simple terms, with examples.</p>
<p>What you’re reading is a continuation of <a href="/advanced-sql-window-functions/">my post</a> published almost two years ago describing one of the most powerful features of modern SQL - <strong>window functions</strong>. They allow us to perform calculations across a set of related rows, without actually grouping these rows together. In this article, I’m going to focus on an important aspect of window functions that makes them even more flexible and useful - <strong>window frames</strong>.</p>
<p>Window frames have been a part of the SQL standard for some time now. All popular database systems <a href="https://data-xtractor.com/blog/query-builder/window-functions-support/">support them to some extent</a>, but none of them has all the features implemented. PostgreSQL is currently <a href="https://modern-sql.com/blog/2019-02/postgresql-11#over">the leader in this field</a>. Version 11, the latest at the time of writing, introduced most of the window-frame-related features described by the standard. Therefore I’ll be using it throughout this article.</p>
<h3 id="window-functions---a-quick-recap">Window functions - a quick recap</h3>
<p>Let’s start with a quick reminder about what window functions are. Let’s say that we’re working with the following table:</p>
<p><img src="/images/blog/advanced-sql-window-frames/films-schema.svg" alt="Table schema" class="center-image" />
<em>Table schema</em></p>
<p><img src="/images/blog/advanced-sql-window-frames/films-input-rows.svg" alt="Input rows" class="center-image" />
<em>Input rows</em></p>
<p>Now we have the following task to solve:</p>
<h5 id="for-each-film-find-an-average-rating-of-all-films-in-its-release-year">For each film find an average rating of all films in its release year.</h5>
<p><img src="/images/blog/advanced-sql-window-frames/result-rows-year-avg.svg" alt="Result rows with year_avg" class="center-image" /></p>
<p>To solve this problem we have to aggregate the input rows. We need to take rows belonging to the same year, then group them and compute an average for every group. This can be easily done with a <code class="language-plaintext highlighter-rouge">GROUP BY</code> clause, but in the result set we’d get just one row for every <code class="language-plaintext highlighter-rouge">release_year</code>. The output set should contain an additional column, but the same number of rows as the input set. This is a job for a window function:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">,</span> <span class="n">rating</span><span class="p">;</span>
</code></pre></div></div>
<p>The above code (<code class="language-plaintext highlighter-rouge">PARTITION BY release_year</code>) instructs the database engine to divide the input rows into disjoint sets called partitions, using the <code class="language-plaintext highlighter-rouge">release_year</code> column. Each partition receives only the rows that have the same value of <code class="language-plaintext highlighter-rouge">release_year</code>. Then, by using an aggregate function <code class="language-plaintext highlighter-rouge">AVG(rating)</code>, we tell the database engine how to calculate the final result for each partition. The diagram below presents the whole operation:</p>
<p><img src="/images/blog/advanced-sql-window-frames/partitioning.svg" alt="Partitioning" class="center-image" />
<em>Window functions partitioning</em></p>
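<p>For comparison, here is a minimal sketch of the <code class="language-plaintext highlighter-rouge">GROUP BY</code> variant mentioned above. The inline rows are made up for illustration and only stand in for the real <code class="language-plaintext highlighter-rouge">films</code> table:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows, not the data from the figures.
WITH films (id, release_year, rating) AS (
  VALUES (1, 2015, 8.0), (2, 2015, 8.5), (3, 2016, 7.0)
)
SELECT release_year, AVG(rating) AS year_avg
FROM films
GROUP BY release_year   -- collapses each year into a single row
ORDER BY release_year;
</code></pre></div></div>
<p>This returns one row per release year (two rows for this sample), whereas the window function query keeps all three input rows and only adds a column.</p>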
<p>Now, let’s complicate our initial problem a little bit:</p>
<h5 id="for-each-film-find-an-average-rating-of-all-strictly-better-films-in-its-release-year">For each film find an average rating of all strictly better films in its release year.</h5>
<p><img src="/images/blog/advanced-sql-window-frames/result-rows-avg-better.svg" alt="Result rows with avg_of_better" class="center-image" /></p>
<p>It’s clear that now we also need to divide the rows into <code class="language-plaintext highlighter-rouge">release_year</code> partitions. But the calculation of average needs to be done only on a subset of a partition. The subset needs to be different for every row - we need to consider only the rows that have a greater value in the <code class="language-plaintext highlighter-rouge">rating</code> column. Partitions are exactly the same for every row they contain, so they will not help us achieve this effect. We need something more powerful.</p>
<h3 id="window-frames">Window frames</h3>
<p>Window frames are a feature that allows us to divide partitions into smaller subsets. What’s even more important, these subsets can differ from row to row. This is something that can’t be achieved with partitioning only. For example, we can have window frames that contain all the rows with the same or greater value in a given column:</p>
<p><img src="/images/blog/advanced-sql-window-frames/window-frames.svg" alt="Window frames" class="center-image" />
<em>Mechanism of creating window frames</em></p>
<p>SQL gives us many ways to specify which rows should be included in window frames. In the next paragraphs I will describe all these ways in detail.</p>
<h4 id="syntax">Syntax</h4>
<p>A general (and <a href="https://www.postgresql.org/docs/11/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS">much simplified</a>) format of a window function call is:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">function_name</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="p">...</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="p">...</span> <span class="n">frame_clause</span><span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">frame_clause</code> is the part that defines window frames. It looks as follows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">mode</span> <span class="k">BETWEEN</span> <span class="n">frame_start</span> <span class="k">AND</span> <span class="n">frame_end</span> <span class="p">[</span> <span class="n">frame_exclusion</span> <span class="p">]</span>
</code></pre></div></div>
<p>This syntax can be divided into three sections:</p>
<ul>
<li><strong><em>mode</em></strong> sets the way a database engine treats input rows. There are three possible values: <code class="language-plaintext highlighter-rouge">ROWS</code>, <code class="language-plaintext highlighter-rouge">GROUPS</code> and <code class="language-plaintext highlighter-rouge">RANGE</code>.</li>
<li><strong><em>frame_start</em></strong> and <strong><em>frame_end</em></strong> define where a window frame starts and where it ends.</li>
<li><strong><em>frame_exclusion</em></strong> can be used to specify parts of a window frame that have to be excluded from the calculations.</li>
</ul>
<p>It’s important to remember that window frames are constructed for every single input row separately, so their content can differ from row to row. Therefore it’s essential to consider a window frame with regard to the row that the frame is built for. We’ll call it <strong>the current row</strong>.</p>
<p>What is also usually crucial is to specify the order in which rows appear in a window frame. In most cases the exact position of the current row compared to other rows will have a direct impact on the content of a frame. Therefore it’s always safe to assume that if you want to use window framing you need to have the rows sorted consistently. It can be done by adding an <code class="language-plaintext highlighter-rouge">ORDER BY</code> clause to a window function call.</p>
<p>All the following examples have the rows sorted by the <code class="language-plaintext highlighter-rouge">rating</code> column in ascending order. For the sake of simplicity, I also slightly modified the input set - now it contains films released in a single year only.</p>
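<p>To make the syntax more concrete, here is a sketch of a complete call that uses every part of the frame clause. The sample rows and the <code class="language-plaintext highlighter-rouge">avg_of_neighbours</code> alias are made up for illustration; the individual options are explained in the sections that follow:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows standing in for the films table.
WITH films (id, release_year, rating) AS (
  VALUES (1, 2015, 6.0), (2, 2015, 7.0), (3, 2015, 8.0), (4, 2016, 9.0)
)
SELECT id, release_year, rating,
  AVG(rating) OVER (
    PARTITION BY release_year                 -- partitioning
    ORDER BY rating                           -- ordering within a partition
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING  -- mode, frame_start, frame_end
    EXCLUDE CURRENT ROW                       -- frame_exclusion
  ) AS avg_of_neighbours
FROM films
ORDER BY release_year, rating;
</code></pre></div></div>
<p>The film with id 4 is alone in its partition, so its frame is empty after the exclusion and the average is <code class="language-plaintext highlighter-rouge">NULL</code>.</p>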
<h3 id="window-frame-modes">Window frame modes</h3>
<h4 id="rows-mode">Rows mode</h4>
<p>The <code class="language-plaintext highlighter-rouge">ROWS</code> mode is the simplest one. It instructs the database to treat each input row separately, as individual entities:</p>
<p><img src="/images/blog/advanced-sql-window-frames/rows-mode.svg" alt="Window frames" class="center-image" />
<em>ROWS mode</em></p>
<p>In the <code class="language-plaintext highlighter-rouge">ROWS</code> mode <strong><em>frame_start</em></strong> and <strong><em>frame_end</em></strong> allow us to specify which rows the window frame starts and ends with. They accept the following values:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">UNBOUNDED PRECEDING</code> - (possible only as a <strong><em>frame_start</em></strong>) start with the first row of the partition</li>
<li><code class="language-plaintext highlighter-rouge">offset PRECEDING</code> - start/end with a given number of rows before the current row</li>
<li><code class="language-plaintext highlighter-rouge">CURRENT ROW</code> - start/end with the current row</li>
<li><code class="language-plaintext highlighter-rouge">offset FOLLOWING</code> - start/end with a given number of rows after the current row</li>
<li><code class="language-plaintext highlighter-rouge">UNBOUNDED FOLLOWING</code> - (possible only as a <strong><em>frame_end</em></strong>) end with the last row of the partition</li>
</ul>
<p>Let’s take a look at some examples. Remember that it’s crucial to know which row is the current row, because for different rows the window frame can look different. All the figures below present what the frame looks like for a single, chosen input row.</p>
<p>Let’s start with a do-nothing option. It simply selects all rows from the beginning of the partition to the end:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="n">FOLLOWING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/rows-unbound-preceding-unbound-following.svg" alt="UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING" class="center-image" /></p>
<p>Now we’ll do something more interesting. In the example below we start with the beginning of the partition, but end with the current row. This is where the order of rows begins to matter:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/rows-unbound-preceding-current-row.svg" alt="ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" class="center-image" /></p>
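<p>A handy way to inspect a frame is to use <code class="language-plaintext highlighter-rouge">array_agg</code> as a window function, so that every output row lists the rows its frame contains. The sketch below assumes a small, made-up set of films with distinct ratings:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows with distinct ratings.
WITH films (id, rating) AS (
  VALUES (1, 6.0), (2, 7.0), (3, 8.0), (4, 9.0)
)
SELECT id, rating,
  array_agg(id) OVER (ORDER BY rating
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS frame_ids
FROM films
ORDER BY rating;
-- frame_ids: {1}, {1,2}, {1,2,3}, {1,2,3,4}
</code></pre></div></div>
<p>The frame grows by one row as the current row moves through the partition, which is the classic running-total pattern.</p>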
<p>Here we start with the first row before the current row and end with the first row after the current row:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">FOLLOWING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/rows-1-preceding-1-following.svg" alt="ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING" class="center-image" /></p>
<p>It’s not mandatory to include the current row though. In the example below we start and end before the current row:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="mi">3</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">PRECEDING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/rows-3-preceding-1-preceding.svg" alt="ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING" class="center-image" /></p>
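<p>One consequence of such a frame is that it can be empty, for example for the first row of a partition. The sketch below (sample rows are made up; the named window <code class="language-plaintext highlighter-rouge">w</code> is only there to avoid repeating the definition) shows how aggregates behave in that case:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows with distinct ratings.
WITH films (id, rating) AS (
  VALUES (1, 6.0), (2, 7.0), (3, 8.0), (4, 9.0), (5, 9.5)
)
SELECT id, rating,
  COUNT(*)    OVER w AS rows_in_frame,
  AVG(rating) OVER w AS avg_of_previous
FROM films
WINDOW w AS (ORDER BY rating ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY rating;
-- rows_in_frame: 0, 1, 2, 3, 3
-- avg_of_previous is NULL for the first row, because its frame is empty
</code></pre></div></div>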
<p>We can do more interesting things using the <strong><em>frame_exclusion</em></strong> part.</p>
<p><strong><em>frame_exclusion</em></strong> allows us to exclude some specific rows from the window frame, even if they would be included according to the <strong><em>frame_start</em></strong> and <strong><em>frame_end</em></strong> options. What’s worth mentioning is that <strong><em>frame_exclusion</em></strong> works exactly the same regardless of the selected mode. Possible values are:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">EXCLUDE CURRENT ROW</code> - exclude the current row.</li>
<li><code class="language-plaintext highlighter-rouge">EXCLUDE GROUP</code> - exclude the current row and all peer rows, i.e. rows that have the same value in the sorting column.</li>
<li><code class="language-plaintext highlighter-rouge">EXCLUDE TIES</code> - exclude all peer rows, but not the current row.</li>
<li><code class="language-plaintext highlighter-rouge">EXCLUDE NO OTHERS</code> - exclude nothing. This is the default option in case you omit the <strong><em>frame_exclusion</em></strong> part altogether.</li>
</ul>
<p>Let me show you an example.</p>
<p>Here we want to select all rows from the beginning of the partition to the end of the partition, but exclude the current row:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="n">FOLLOWING</span> <span class="n">EXCLUDE</span> <span class="k">CURRENT</span> <span class="k">ROW</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/rows-unbouned-preceding-unbounded-following-exclude-current-row.svg" alt="ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING EXCLUDE CURRENT ROW" class="center-image" /></p>
<p>The rest of <strong><em>frame_exclusion</em></strong> options become interesting only in the case when the partition has duplicate values in the sorting column. I haven’t included them in the examples above on purpose, because there’s an important caveat when the <code class="language-plaintext highlighter-rouge">ROWS</code> mode is used together with sorting duplicates. In that case, the rows with duplicated sorting values are processed in an unspecified order, so their relative positions are not deterministic. This can lead to incorrect results when <code class="language-plaintext highlighter-rouge">offset PRECEDING</code> and <code class="language-plaintext highlighter-rouge">offset FOLLOWING</code> clauses are specified. The next section will explain the rest of the <strong><em>frame_exclusion</em></strong> options.</p>
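<p>If you do need <code class="language-plaintext highlighter-rouge">ROWS</code> mode on data with duplicates, one common way to make the result repeatable is to add a unique column as a tie-breaker to <code class="language-plaintext highlighter-rouge">ORDER BY</code>. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">id</code> is unique and using made-up rows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows; films 2 and 3 share the same rating.
WITH films (id, rating) AS (
  VALUES (1, 7.0), (2, 8.0), (3, 8.0), (4, 9.0)
)
SELECT id, rating,
  array_agg(id) OVER (ORDER BY rating, id   -- id breaks ties deterministically
    ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS frame_ids
FROM films
ORDER BY rating, id;
-- frame_ids: {1}, {1,2}, {2,3}, {3,4}
</code></pre></div></div>
<p>Keep in mind that the tie-breaker also changes what counts as a peer: with a unique sort key every row forms its own group, so this trick fits <code class="language-plaintext highlighter-rouge">ROWS</code> mode only.</p>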
<h4 id="groups-mode">Groups mode</h4>
<p><code class="language-plaintext highlighter-rouge">GROUPS</code> mode is made exactly for the case when the sorting column contains duplicates. Therefore in this paragraph I’ll use a sample of input rows that contains duplicates. In the <code class="language-plaintext highlighter-rouge">GROUPS</code> mode rows with duplicate sorting values are grouped together:</p>
<p><img src="/images/blog/advanced-sql-window-frames/groups-mode.svg" alt="Window frames" class="center-image" />
<em>GROUPS mode</em></p>
<p>The syntax looks as follows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="n">frame_start</span> <span class="k">AND</span> <span class="n">frame_end</span> <span class="p">[</span> <span class="n">frame_exclusion</span> <span class="p">]</span>
</code></pre></div></div>
<p>The <strong><em>frame_start</em></strong> and <strong><em>frame_end</em></strong> parameters accept the same options as in <code class="language-plaintext highlighter-rouge">ROWS</code> mode, but the meaning of some of them differs:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">UNBOUNDED PRECEDING</code> and <code class="language-plaintext highlighter-rouge">UNBOUNDED FOLLOWING</code> work the same and mean either the first row or the last row of the current partition.</li>
<li><code class="language-plaintext highlighter-rouge">offset PRECEDING</code> and <code class="language-plaintext highlighter-rouge">offset FOLLOWING</code> now work with regard to groups. You can use them to specify a number of groups before or after the current group to be taken into account.</li>
<li><code class="language-plaintext highlighter-rouge">CURRENT ROW</code> also gets a different meaning, which might seem a bit misleading. When used as <strong><em>frame_start</em></strong> it means the first row in a group containing the current row. When used as <strong><em>frame_end</em></strong> it means the last row in a group containing the current row.</li>
</ul>
<p>As always it’s best to look at some examples.</p>
<p>Let’s start with a do-nothing option. It simply includes all partition rows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="n">FOLLOWING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-unbounded-preceding-and-unbounded-following.svg" alt="GROUPS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING" class="center-image" /></p>
<p>The example below shows the real power of the <code class="language-plaintext highlighter-rouge">GROUPS</code> mode. We start with the first row in the partition and we want to include everything up to the current group:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-unbounded-preceding-and-current-row.svg" alt="GROUPS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" class="center-image" /></p>
<p>If there are any duplicates, we can be sure that all of them will be included in the calculation.</p>
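<p>The difference from <code class="language-plaintext highlighter-rouge">ROWS</code> mode is easiest to see side by side. The sketch below uses made-up rows in which two films share a rating and counts the rows in each frame:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows; films 2 and 3 share the same rating.
WITH films (id, rating) AS (
  VALUES (1, 7.0), (2, 8.0), (3, 8.0), (4, 9.0)
)
SELECT id, rating,
  COUNT(*) OVER (ORDER BY rating
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)   AS rows_count,
  COUNT(*) OVER (ORDER BY rating
    GROUPS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS groups_count
FROM films
ORDER BY rating;
-- groups_count: 1, 3, 3, 4 (both 8.0 films see the same frame)
-- rows_count: 1 and 4 for the first and last film; the two 8.0 films
-- get 2 and 3, but which one gets which is unspecified
</code></pre></div></div>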
<p>The example below is symmetrical. Here we start with the current group and end at the end of the partition. As you can see, the meaning of <code class="language-plaintext highlighter-rouge">CURRENT ROW</code> changes depending on whether it is used to define the beginning or the end of a window frame. Once again all duplicates are included:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="k">CURRENT</span> <span class="k">ROW</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="n">FOLLOWING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-current-row-and-unbounded-following.svg" alt="GROUPS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING" class="center-image" /></p>
<p>Now, let’s explicitly include other groups too. In the example below we start with the first group before the current group and end with the first group after the current group:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">FOLLOWING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-1-preceding-and-1-following.svg" alt="GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING" class="center-image" /></p>
<p>We can make it more complicated by using frame exclusions. For example, the statement below gives us the same result as above, but with the current group excluded:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">FOLLOWING</span> <span class="n">EXCLUDE</span> <span class="k">GROUP</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-1-preceding-and-1-following-exclude-group.svg" alt="GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING EXCLUDE GROUP" class="center-image" /></p>
<p>Here we exclude not the whole current group, but only the current row:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">FOLLOWING</span> <span class="n">EXCLUDE</span> <span class="k">CURRENT</span> <span class="k">ROW</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-1-preceding-and-1-following-exclude-current-row.svg" alt="GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING EXCLUDE CURRENT ROW" class="center-image" /></p>
<p>In the last example we exclude ties, i.e. all peer rows, but we leave the current row intact:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">FOLLOWING</span> <span class="n">EXCLUDE</span> <span class="n">TIES</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/groups-between-1-preceding-and-1-following-exclude-ties.svg" alt="GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING EXCLUDE TIES" class="center-image" /></p>
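<p>To compare all four exclusion options at once, the sketch below counts the rows left in the frame for each of them. The sample rows and column aliases are made up for illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows; films 2 and 3 form one peer group.
WITH films (id, rating) AS (
  VALUES (1, 7.0), (2, 8.0), (3, 8.0), (4, 9.0)
)
SELECT id, rating,
  COUNT(*) OVER (ORDER BY rating GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                 EXCLUDE NO OTHERS)   AS no_exclusion,
  COUNT(*) OVER (ORDER BY rating GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                 EXCLUDE CURRENT ROW) AS without_current_row,
  COUNT(*) OVER (ORDER BY rating GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                 EXCLUDE GROUP)       AS without_group,
  COUNT(*) OVER (ORDER BY rating GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                 EXCLUDE TIES)        AS without_ties
FROM films
ORDER BY rating, id;
-- no_exclusion:        3, 4, 4, 3
-- without_current_row: 2, 3, 3, 2
-- without_group:       2, 2, 2, 2
-- without_ties:        3, 3, 3, 3
</code></pre></div></div>
<p>For the films rated 7.0 and 9.0 there are no peers, so <code class="language-plaintext highlighter-rouge">EXCLUDE TIES</code> removes nothing and <code class="language-plaintext highlighter-rouge">EXCLUDE GROUP</code> behaves like <code class="language-plaintext highlighter-rouge">EXCLUDE CURRENT ROW</code>.</p>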
<h4 id="range-mode">Range mode</h4>
<p><code class="language-plaintext highlighter-rouge">RANGE</code> mode is different from the previous two, because it doesn’t tie the rows together in any way. It instructs the database to work on a given range of values instead. The values that it looks at are the values of the sorting column. Postgres imposes a requirement that, when an offset is used in this mode, you can put only one column in the <code class="language-plaintext highlighter-rouge">ORDER BY</code> clause.</p>
<p>The syntax is as follows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="n">frame_start</span> <span class="k">AND</span> <span class="n">frame_end</span> <span class="p">[</span> <span class="n">frame_exclusion</span> <span class="p">]</span>
</code></pre></div></div>
<p>Instead of specifying the number of rows or groups, here we have to specify the maximum difference of values that the window frame should comprise. Both <strong><em>frame_start</em></strong> and <strong><em>frame_end</em></strong> have to be expressed in the same units as the sorting column.</p>
<p>Let’s look at some examples.</p>
<p>In the one below we want to include all rows whose sorting values are at most 0.5 lower or at most 0.2 higher than the value of the current row. The boundaries are inclusive, which means that the rows that differ by exactly 0.5 (or exactly 0.2 on the upper side) will be taken into consideration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">0</span><span class="p">.</span><span class="mi">2</span> <span class="n">FOLLOWING</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/ranges-between-05-and-02.svg" alt="RANGE BETWEEN 0.5 PRECEDING AND 0.2 FOLLOWING" class="center-image" /></p>
<p>We can also mix the range with <code class="language-plaintext highlighter-rouge">CURRENT ROW</code>, which surprisingly means <em>the current group</em>. The effect is similar to what we saw in <code class="language-plaintext highlighter-rouge">GROUPS</code> mode. Again, all duplicates are included:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/ranges-between-05-and-current-row.svg" alt="RANGE BETWEEN 0.5 PRECEDING AND CURRENT ROW" class="center-image" /></p>
<p>Frame exclusion options will work exactly the same as in the other modes. For example, to exclude the current row (which in this context really means <em>the current row</em>):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span> <span class="n">EXCLUDE</span> <span class="k">CURRENT</span> <span class="k">ROW</span>
</code></pre></div></div>
<p><img src="/images/blog/advanced-sql-window-frames/ranges-between-05-and-current-row-exclude-current-row.svg" alt="RANGE BETWEEN 0.5 PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW" class="center-image" /></p>
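<p>Here is a sketch of a complete query in <code class="language-plaintext highlighter-rouge">RANGE</code> mode. It assumes made-up rows with <code class="language-plaintext highlighter-rouge">numeric</code> ratings and lists the ratings that fall into each frame:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical sample rows; the ratings are numeric literals.
WITH films (id, rating) AS (
  VALUES (1, 7.0), (2, 7.5), (3, 8.0), (4, 8.0), (5, 8.2), (6, 9.0)
)
SELECT id, rating,
  array_agg(rating) OVER (ORDER BY rating
    RANGE BETWEEN 0.5 PRECEDING AND 0.2 FOLLOWING) AS ratings_in_frame
FROM films
ORDER BY rating;
-- For a film rated 8.0 the frame covers ratings from 7.5 to 8.2,
-- so it contains 7.5, 8.0, 8.0 and 8.2.
-- For the film rated 9.0 it covers 8.5 to 9.2, so it contains only 9.0.
</code></pre></div></div>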
<h3 id="real-world-examples">Real-world examples</h3>
<p>Now, it’s finally time to do some realistic examples (or at least as close to being realistic as possible). Let’s start with the problem I mentioned at the beginning of the article.</p>
<h4 id="example-1-for-each-film-find-an-average-rating-of-all-strictly-better-films-in-its-release-year">Example 1. For each film find an average rating of all strictly better films in its release year.</h4>
<p><img src="/images/blog/advanced-sql-window-frames/result-rows-1.svg" alt="Result rows with average ratings of better films" class="center-image" />
<em>Result rows with average ratings of better films</em></p>
<p>Because it’s a real-world example we can’t just assume that the input set will not contain duplicates in the <code class="language-plaintext highlighter-rouge">rating</code> column. Therefore using <code class="language-plaintext highlighter-rouge">ROWS</code> mode could give us an incorrect result. We have to choose <code class="language-plaintext highlighter-rouge">GROUPS</code> mode instead.</p>
<p>What we need here is all the films that are <em>strictly better</em> than the current one. We need to exclude the current row and all others that are rated the same as the current row, regardless of the order they come in.</p>
<p>To achieve that we should start with the first row in the group after the current group and finish at the end of the partition:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span>
<span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">FOLLOWING</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="n">FOLLOWING</span><span class="p">)</span>
<span class="k">AS</span> <span class="n">avg_of_better</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">,</span> <span class="n">rating</span><span class="p">;</span>
</code></pre></div></div>
<p>Of course, there are many correct solutions. We can also start with the current group and exclude it:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span>
<span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="k">CURRENT</span> <span class="k">ROW</span> <span class="k">AND</span> <span class="n">UNBOUNDED</span> <span class="n">FOLLOWING</span> <span class="n">EXCLUDE</span> <span class="k">GROUP</span><span class="p">)</span>
<span class="k">AS</span> <span class="n">avg_of_better</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">,</span> <span class="n">rating</span><span class="p">;</span>
</code></pre></div></div>
<h4 id="example-2-how-many-other-films-have-the-same-rank-as-me">Example 2. How many other films have the same rating as me?</h4>
<p><img src="/images/blog/advanced-sql-window-frames/result-rows-2.svg" alt="Result rows with a count of equally rated films" class="center-image" />
<em>Result rows with a count of equally rated films</em></p>
<p>Now we need to select all the rows belonging to the current row’s peer group and count them:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span>
<span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="k">CURRENT</span> <span class="k">ROW</span> <span class="k">AND</span> <span class="k">CURRENT</span> <span class="k">ROW</span> <span class="n">EXCLUDE</span> <span class="k">CURRENT</span> <span class="k">ROW</span><span class="p">)</span>
<span class="k">AS</span> <span class="n">count_of_equal</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">,</span> <span class="n">rating</span><span class="p">;</span>
</code></pre></div></div>
<p>We use <code class="language-plaintext highlighter-rouge">CURRENT ROW</code> as both the beginning and the end of the frame in order to narrow the window frame down to the current group only. Remember that we’re in the <code class="language-plaintext highlighter-rouge">GROUPS</code> mode, so <code class="language-plaintext highlighter-rouge">CURRENT ROW</code> actually means the current group. The last thing is to exclude the actual current row, so we add the <code class="language-plaintext highlighter-rouge">EXCLUDE CURRENT ROW</code> clause, which always excludes just the current row. If you think that this syntax is misleading, don’t worry, you’re not the only one.</p>
<h4 id="example-3-find-the-rank-of-an-immediately-better-rated-film">Example 3. Find the rating of an immediately better rated film</h4>
<p><img src="/images/blog/advanced-sql-window-frames/result-rows-3.svg" alt="Result rows with a rating of an immediately better film" class="center-image" />
<em>Result rows with a rating of an immediately better film</em></p>
<p>This example becomes tricky when we think about duplicates. There can be many rows with the same value as the current row, but we’re interested in the first one that has a greater value. We need to skip all the duplicates:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="n">FIRST_VALUE</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span>
<span class="n">GROUPS</span> <span class="k">BETWEEN</span> <span class="mi">1</span> <span class="n">FOLLOWING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">FOLLOWING</span><span class="p">)</span> <span class="k">AS</span> <span class="n">rating_of_better</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">,</span> <span class="n">rating</span><span class="p">;</span>
</code></pre></div></div>
<p>Here, with the <code class="language-plaintext highlighter-rouge">GROUPS</code> mode, we narrow down the window frame to the group immediately after the current group. Because this is a single group, all sorting values in the group are the same. We can choose any of them, e.g. by using <code class="language-plaintext highlighter-rouge">FIRST_VALUE()</code>.</p>
<p>We can also reuse the frame from Example 1 and apply the <code class="language-plaintext highlighter-rouge">MIN()</code> function to it. As always, there are many correct answers.</p>
<h4 id="example-4-how-many-films-are-better-by-05-or-less">Example 4. How many films are better by 0.5 or less?</h4>
<p>The last example requires the <code class="language-plaintext highlighter-rouge">RANGE</code> mode. One important thing to remember is to exclude the current group: in this example, we don’t consider equally rated films as better.</p>
<p><img src="/images/blog/advanced-sql-window-frames/result-rows-4.svg" alt="Result rows with a count of films with ratings higher by 0.5 or less" class="center-image" />
<em>Result rows with a count of films with ratings higher by 0.5 or less</em></p>
<p>In the <code class="language-plaintext highlighter-rouge">RANGE</code> mode, specifying the upper boundary is easy - <code class="language-plaintext highlighter-rouge">0.5 FOLLOWING</code>. But the lower one is more problematic. To start with exactly the first strictly better film, we need to include the current group and everything above (<code class="language-plaintext highlighter-rouge">BETWEEN CURRENT ROW ...</code>) but then exclude the current group again (<code class="language-plaintext highlighter-rouge">EXCLUDE GROUP</code>). We’re not in the <code class="language-plaintext highlighter-rouge">GROUPS</code> mode, so we can’t just say <code class="language-plaintext highlighter-rouge">1 FOLLOWING</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span>
<span class="n">RANGE</span> <span class="k">BETWEEN</span> <span class="k">CURRENT</span> <span class="k">ROW</span> <span class="k">AND</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span> <span class="n">FOLLOWING</span> <span class="n">EXCLUDE</span> <span class="k">GROUP</span><span class="p">)</span>
<span class="k">AS</span> <span class="n">count_of_better</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">,</span> <span class="n">rating</span><span class="p">;</span>
</code></pre></div></div>
<h3 id="summary">Summary</h3>
<p>Window functions are a very powerful SQL feature that can be extremely useful when you need your relational database system to do more complicated calculations for you. This article described window frames, which make window functions even more powerful. They allow you to flexibly narrow down the set of rows being used for calculations, so you can solve problems that previously couldn’t be solved with window functions only.</p>
<p>It’s worth keeping track of the latest SQL features being introduced to the popular relational database systems. It might take some time to learn and fully understand them, but if you want your database engine to do the heavy lifting, then they will prove useful for you.</p>
<div class="infobox">
<p>If you like my style of explaining things, you can check my article about another advanced SQL feature - <a href="/advanced-sql-cte/">Common Table Expressions</a>.</p>
</div>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://www.postgresql.org/docs/11/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS">PostgreSQL documentation</a></li>
<li><a href="https://modern-sql.com">Modern SQL</a></li>
</ul>Deep dive into more advanced use of window functions with window framesGit aliases I can’t live without2018-11-06T08:00:00+00:002018-11-06T08:00:00+00:00https://mjk.space/git-aliases-i-cant-live-without<p>People are often surprised and curious at the same time when they see how I work with Git:</p>
<p><img src="/images/blog/git-aliases/workflow.gif" alt="My Git workflow" class="center-image" />
<em>My Git workflow</em></p>
<p>My love for aliases started when I installed <em>zsh</em> and its addon suite <em><a href="https://github.com/robbyrussell/oh-my-zsh">oh-my-zsh</a></em> for the first time. It contains a big set of predefined aliases and helper functions for different command line programs. I immediately liked the concept of typing just a few letters instead of regular, long, parametrized invocations. The tool that I work with most often is Git, so it was a natural candidate for the alias revolution. Now, a few years later, I can’t imagine using Git with the <code class="language-plaintext highlighter-rouge">git</code> command itself.</p>
<p>Of course, Git has its own <a href="https://git-scm.com/book/en/v2/Git-Basics-Git-Aliases">system for defining aliases</a>, which is perfectly fine. Personally I just don’t like that space between <code class="language-plaintext highlighter-rouge">git</code> and the alias. Shell aliases are also more flexible and can be used for other commands too, e.g. <code class="language-plaintext highlighter-rouge">docker</code>.</p>
<p>Below you’ll find the list of aliases that I use the most. Some of them come directly from <em>oh-my-zsh</em> and some were created by me. I hope you’ll find at least some of them useful! If you want to try them all on your own - just go and grab them from <a href="https://github.com/mjkonarski/oh-my-git-aliases">my repository</a>.</p>
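<p>The entries in this list are written as <code class="language-plaintext highlighter-rouge">alias name = command</code> for readability. An actual shell expects no spaces around <code class="language-plaintext highlighter-rouge">=</code> and a quoted command, so here is a minimal sketch of how a few of them might look in <code class="language-plaintext highlighter-rouge">~/.zshrc</code> or <code class="language-plaintext highlighter-rouge">~/.bashrc</code>. The single quotes matter for anything that contains <code class="language-plaintext highlighter-rouge">$(current_branch)</code> (a helper that comes from <em>oh-my-zsh</em>, not from Git itself): they delay the substitution until the alias is actually used.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># No spaces around "=", command in single quotes
alias gcl='git clone'
alias gst='git status'
# Single quotes keep $(current_branch) unevaluated until the alias runs
alias ggpush='git push origin $(current_branch)'
</code></pre></div></div>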
<h5 id="1-lets-start-working-with-this-repo">1. Let’s start working with this repo!</h5>
<p><code class="language-plaintext highlighter-rouge">alias gcl = git clone</code></p>
<p>This is maybe not the most frequent Git command programmers use, but I personally like to get my hands on this <em>awesome-github-project-I-have-just-seen</em> as soon as possible.</p>
<h5 id="2-download-the-latest-state-from-the-remote">2. Download the latest state from the remote</h5>
<p><code class="language-plaintext highlighter-rouge">alias gfe = git fetch</code></p>
<p>I usually use fetch to get the newest changes from the remote repository because it doesn’t affect the working directory or <em>HEAD</em> in any way. Later I can use other commands to modify local files explicitly.</p>
<h5 id="3-lets-see-some-other-branch">3. Let’s see some other branch!</h5>
<p><code class="language-plaintext highlighter-rouge">alias gco = git checkout</code></p>
<p>This is definitely one of the most useful commands on a daily basis. One of the reasons I decided to write this article is that I still see people writing <code class="language-plaintext highlighter-rouge">git checkout</code> every time they want to switch to another branch.</p>
<h5 id="4-get-back-to-the-previous-branch">4. Get back to the previous branch!</h5>
<p><code class="language-plaintext highlighter-rouge">gco -</code></p>
<p>This dash is a little trick that means “the previous branch”. I know that strictly speaking this is not an alias, but it’s just too useful not to mention. Also I’ve got the impression that not many people know about it.</p>
<p><code class="language-plaintext highlighter-rouge">checkout</code> is not the only command that accepts a dash - you can use it also with e.g. <code class="language-plaintext highlighter-rouge">merge</code>, <code class="language-plaintext highlighter-rouge">cherry-pick</code> and <code class="language-plaintext highlighter-rouge">rebase</code>.</p>
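<p>The dash is shorthand for <code class="language-plaintext highlighter-rouge">@{-1}</code>, the previously checked out branch. A small sketch of the round trip in a throwaway repository (the repository, the branch name <code class="language-plaintext highlighter-rouge">feature</code> and the demo identity are made up for illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Throwaway repository with a single empty commit
repo=$(mktemp -d) && cd "$repo" && git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "initial"
first=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git checkout -q -b feature               # switch to a new branch
git checkout -q -                        # ...and jump straight back
back=$(git symbolic-ref --short HEAD)
echo "$first -> feature -> $back"
</code></pre></div></div>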
<h5 id="5-get-me-to-master-quickly">5. Get me to master quickly!</h5>
<p><code class="language-plaintext highlighter-rouge">alias gcm = git checkout master</code></p>
<p>If we often switch between some well defined branches, why not make it as simple as possible? Depending on your workflow you can also find other similar aliases useful: <code class="language-plaintext highlighter-rouge">gcd</code> (<em>develop</em>), <code class="language-plaintext highlighter-rouge">gcu</code> (<em>uat</em>), <code class="language-plaintext highlighter-rouge">gcs</code> (<em>stable</em>).</p>
<h5 id="6-where-am-i-and-whats-going-on">6. Where am I and what’s going on?</h5>
<p><code class="language-plaintext highlighter-rouge">alias gst = git status</code></p>
<p>Simple and self-explanatory.</p>
<h5 id="7-i-dont-care-about-the-current-working-changes-just-give-me-the-latest-state-from-origin">7. I don’t care about the current working changes, just give me the latest state from origin!</h5>
<p><code class="language-plaintext highlighter-rouge">alias ggrh = git reset --hard origin/$(current_branch)</code></p>
<p>My personal favourite. How many times have you made such a terrible mess that you just wanted to get both staging area and working directory back to their original state? Now it’s only four keystrokes away.</p>
<p>Please note that this particular command resets the current branch to the latest commit from <em>origin</em>. This is exactly what <em>I</em> usually need, but may not be the thing that <em>you</em> need. I use it every time I don’t care about local changes and I simply want my current branch to reflect its remote counterpart. You may say that <code class="language-plaintext highlighter-rouge">git pull</code> can be used instead, but I just don’t like the fact that it tries to merge the remote branch instead of just resetting the current one to it.</p>
<p>Note that <code class="language-plaintext highlighter-rouge">current_branch</code> is a custom function (made by the author of <em>oh-my-zsh</em>). You can see it e.g. <a href="https://github.com/mjkonarski/oh-my-git-aliases/blob/master/oh-my-git-aliases.sh#L71">here</a>.</p>
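<p>If you don’t use <em>oh-my-zsh</em>, a simplified stand-in could look like the sketch below. This is an assumption-laden approximation rather than the original helper: it only prints the short name of the branch that <em>HEAD</em> points to and fails on a detached <em>HEAD</em>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical, simplified replacement for oh-my-zsh's current_branch helper
current_branch() {
  # Print the short name of the branch HEAD points to (errors out on a detached HEAD)
  git symbolic-ref --short HEAD
}
</code></pre></div></div>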
<h5 id="8-what-are-the-current-changes">8. What are the current changes?</h5>
<p><code class="language-plaintext highlighter-rouge">alias gd = git diff</code></p>
<p>Another classic. It simply shows all changes made but not yet staged. If you want to see what changes have already been staged, use this version:</p>
<p><code class="language-plaintext highlighter-rouge">alias gdc = git diff --cached</code></p>
<h5 id="9-lets-commit-these-changed-files">9. Let’s commit these changed files!</h5>
<p><code class="language-plaintext highlighter-rouge">alias gca = git commit -a</code></p>
<p>This commits all changed files, so you don’t need to add them manually. However, if there are some new files that Git doesn’t track yet, you need to add them explicitly:</p>
<p><code class="language-plaintext highlighter-rouge">alias ga = git add</code></p>
<h5 id="10-i-have-some-changes-that-id-like-to-add-to-the-previous-commit">10. I have some changes that I’d like to add to the previous commit!</h5>
<p><code class="language-plaintext highlighter-rouge">alias gca! = git commit -a --amend</code></p>
<p>I use this one very often, as I like to keep my Git history clean and tidy (no “pull request fixes” or “forgot to add this file” type of commit messages). It simply takes all changes and adds them to the previous commit.</p>
<h5 id="11-i-did-the-previous-one-too-quick-how-to-uncommit-a-file">11. I did the previous one too quickly, how to “uncommit” a file?</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gfr() {
git reset @~ "$@" && git commit --amend --no-edit
}
</code></pre></div></div>
<p>This one is a function, not an alias, and may seem a bit complicated at first glance. It takes the name of a file you want to “uncommit”, removes all changes made to this file from the <em>HEAD</em> commit, but leaves it untouched in the working directory. Then it’s ready to be staged again, maybe as a separate commit. This is how it works in practice:</p>
<p><img src="/images/blog/git-aliases/grf.gif" alt="grf example" class="center-image" /></p>
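<p>The same flow can be replayed as a script. This is only a sketch in a throwaway repository; the file names and the demo identity are made up, and it needs at least one earlier commit so that <code class="language-plaintext highlighter-rouge">@~</code> exists:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gfr() {
  git reset @~ "$@" && git commit --amend --no-edit
}

# Throwaway repository; file names and identity are made up
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name demo && git config user.email demo@example.com
echo base > a.txt && git add a.txt && git commit -q -m "base"

# One commit that touches two files
echo change >> a.txt && echo new > b.txt
git add a.txt b.txt && git commit -q -m "both files"

gfr b.txt   # take b.txt back out of the HEAD commit

# List the files touched by the amended HEAD commit
files_in_head=$(git diff-tree --no-commit-id --name-only -r HEAD)
echo "$files_in_head"
</code></pre></div></div>
<p>After the call, the <em>HEAD</em> commit should list only <code class="language-plaintext highlighter-rouge">a.txt</code>, while <code class="language-plaintext highlighter-rouge">b.txt</code> stays in the working directory as an untracked file.</p>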
<h5 id="12-ok-ready-to-push">12. Ok, ready to push!</h5>
<p><code class="language-plaintext highlighter-rouge">alias ggpush = git push origin $(current_branch)</code></p>
<p>I use this one every time I want to do a push. Because it explicitly passes the branch argument, I can be sure that only one branch is pushed, regardless of the <code class="language-plaintext highlighter-rouge">push.default</code> <a href="https://git-scm.com/docs/git-config#git-config-pushdefault">setting</a>. Starting with Git 2.0 this is the default behaviour anyway, but the alias gives me extra safety in case I ever work with some legacy Git version.</p>
<p>This is maybe not that critical with a normal push, but critical as hell with the next command.</p>
<h5 id="13-im-ready-to-push-and-i-know-what-im-doing">13. I’m ready to push and I know what I’m doing</h5>
<p><code class="language-plaintext highlighter-rouge">alias ggpushf = git push --force-with-lease origin $(current_branch)</code></p>
<p>Pushing with force is clearly a controversial habit and many people will say that you should never ever do that. I agree, but only when it comes to critical, shared branches like <em>master</em>.</p>
<p>As I’ve already mentioned, I like to keep my git history clean. That sometimes involves changing already pushed commits. The <code class="language-plaintext highlighter-rouge">--force-with-lease</code> switch is particularly useful here, as it rejects the push when your local repository doesn’t have the latest state of the remote branch. Therefore it’s not possible to discard someone else’s modifications. At least not unintentionally.</p>
<p>I started using this alias with the remote branch name set to <code class="language-plaintext highlighter-rouge">$(current_branch)</code> after my colleague had once mistakenly invoked <code class="language-plaintext highlighter-rouge">git push -f</code> (with <code class="language-plaintext highlighter-rouge">push.default</code> set to <code class="language-plaintext highlighter-rouge">matching</code>) and force-pushed all local branches to the <em>origin</em>. Including an old version of <em>master</em>. I still remember the panic in his eyes after he realised what had happened.</p>
<h5 id="14-oh-no-the-push-has-been-rejected-somebody-has-been-touching-my-branch">14. Oh no, the push has been rejected! Somebody has been touching my branch!</h5>
<p>You tried to push your branch to the remote repository, but got the following message:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>To gitlab.com:mjkonarski/my-repo.git
! [rejected] my-branch -> my-branch (non-fast-forward)
error: failed to push some refs to 'git@gitlab.com:mjkonarski/my-repo.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
</code></pre></div></div>
<p>This happens when more than one person works on the same branch. Maybe your colleague pushed a change when you were not looking? Or you used two computers without syncing the branch first? Here’s a simple solution:</p>
<p><code class="language-plaintext highlighter-rouge">alias glr = git pull --rebase</code></p>
<p>It pulls the latest changes and rebases your commits on top of them automatically. If you’re lucky enough (and the remote changes were made to different files) you may even avoid resolving conflicts. Voilà, ready to push again!</p>
<h5 id="15-i-want-my-branch-to-reflect-the-latest-changes-from-master">15. I want my branch to reflect the latest changes from master!</h5>
<p>Let’s say that you have a branch you created from <em>master</em> some time ago. You’ve pushed some changes, but in the meantime <em>master</em> itself has also been updated. Now you’d like your branch to reflect those latest commits. I strongly prefer rebasing over merging in that case - your commit history stays short and clean. It’s as easy as typing:</p>
<p><code class="language-plaintext highlighter-rouge">alias grbiom = git rebase --interactive origin/master</code></p>
<p>I use this command so often that this alias was one of the first I started using. The <code class="language-plaintext highlighter-rouge">--interactive</code> switch spins up your favourite editor and lets you quickly check the list of commits that are about to be rebased on master. You can also use this opportunity to <em>squash</em>, <em>reword</em> or <em>reorder</em> commits. So many options with that simple alias!</p>
<h5 id="16-damn-i-tried-to-rebase-but-wild-conflicts-appeared-get-me-the-hell-out-of-here">16. Damn, I tried to rebase, but wild conflicts appeared! Get me the hell out of here!</h5>
<p>Nobody likes getting this message:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFLICT (content): Merge conflict in my_file.md
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
</code></pre></div></div>
<p>Sometimes you may want just to abort the whole process and leave resolving the conflict for later. The above message gives a clue how to do it, but why in so many keystrokes?</p>
<p><code class="language-plaintext highlighter-rouge">alias grba = git rebase --abort</code></p>
<p>And we’re safe again. When you finally find the courage to do the rebase again and resolve these conflicts, after <code class="language-plaintext highlighter-rouge">git add</code>-ing the affected files you can simply carry on with the rebase by typing:</p>
<p><code class="language-plaintext highlighter-rouge">alias grbc = git rebase --continue</code></p>
<h5 id="17-put-these-changes-away-for-a-second-please">17. Put these changes away for a second, please!</h5>
<p>Let’s say you have made some changes, but haven’t committed them yet. Now you want to quickly switch to a different branch and do some unrelated work:</p>
<p><code class="language-plaintext highlighter-rouge">alias gsta = git stash</code></p>
<p>This command puts your modifications aside and restores the clean state of <em>HEAD</em>.</p>
<h5 id="18-now-give-them-back">18. Now, give them back!</h5>
<p>When you’re done with your unrelated work you may bring back your changes with a quick:</p>
<p><code class="language-plaintext highlighter-rouge">alias gstp = git stash pop</code></p>
<h5 id="19-this-one-little-commit-looks-nice-lets-put-it-on-my-branch">19. This one little commit looks nice, let’s put it on my branch!</h5>
<p>Git has a nice feature called <em>cherry-pick</em>. You can use it to add any existing commit to the top of your current branch. It’s as simple as using this alias:</p>
<p><code class="language-plaintext highlighter-rouge">alias gcp = git cherry-pick</code></p>
<p>This can of course lead to a conflict, depending on the content of this commit. Resolving this conflict is exactly the same as resolving rebase conflicts. Therefore we’ve got similar options to abort and continue cherry picking as well:</p>
<p><code class="language-plaintext highlighter-rouge">alias gcpa = git cherry-pick --abort</code></p>
<p><code class="language-plaintext highlighter-rouge">alias gcpc = git cherry-pick --continue</code></p>
<hr />
<p>The above list certainly doesn’t cover all possible Git use cases. I’d like to encourage you to take it as a good start for building your own suite of aliases. It’s always a good idea to look for possible improvements in your daily workflow.</p>
<p>You can find all these aliases (and more!) in <a href="https://github.com/mjkonarski/oh-my-git-aliases">my Github repository</a>.</p>A list of handy Git aliases inspired by oh-my-zsh suite.I got infected with malware and appreciated by its author2018-03-05T17:00:00+00:002018-03-05T17:00:00+00:00https://mjk.space/got-infected-with-malware-and-appreciated-by-its-author<p>Events depicted in this article happened some time ago, but I’ve never had enough time and determination to actually write them down and publish them. Quite recently some of my friends who had heard this story convinced me to do so. Here it is.</p>
<p>It was a hot summer evening back in August 2014. I was working on my master’s thesis project. The project was about building software controlling a group of mobile robots and at that time I was running some simulations. The main part was running on a Linux machine, but the simulator had to be run on Windows, so I decided to put it in a VMware virtual machine. Looking back, I must admit it was very overcomplicated, but I was still a student and I had a lot of free time.</p>
<p>Everything worked fine up until my control software lost connection to the simulator. Turning it off and on again didn’t help, so I looked at the VMware window and found out that the virtual machine had restarted and showed the following message:</p>
<p><img src="/images/blog/malware-analysis/malware-message.png" alt="Ransom message" class="center-image" />
<em>Ransom message</em></p>
<p>The message was in Polish, my native language. It said:</p>
<blockquote>
<p>Your computer has been locked and your disk has been encrypted. Please send a text message “WP A4792” to number 7928 to get the unlocking code. Enter the code in the box below.</p>
</blockquote>
<p>The price for this message was 9 PLN, which is around $3.</p>
<p>Thoughts started to run through my head. What the hell? A Polish CryptoLocker? All my data lost? But how? And why is it that cheap to recover?</p>
<p>Then I nervously tried to recall what data I had on this virtual machine. Much to my relief, I realised that it was just the system and a couple of programs. Every important file was backed up somewhere else, so in fact, the whole situation seemed like it wasn’t much of a problem. So why not take a break from the university stuff and spend some time on analysing it? Maybe I’ll be able to repair the machine without actually paying the ransom.</p>
<p>My first idea was to find out whether the disk was really encrypted. The original <a href="https://en.wikipedia.org/wiki/CryptoLocker">CryptoLocker</a> encrypts all user files, but displays the ransom message on a running system instead of replacing the bootloader. But who knows, maybe this one is more radical.</p>
<p>To figure this out I had to take a look at the Master Boot Record. MBR is a small part at the beginning of a hard disk that contains two things: a partition table - information about how the disk is organised logically - and a boot loader - a piece of code that is used to start the operating system. It looked like the malware had changed the bootloader, so the system didn’t start, but what about the partition table?</p>
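<p>For reference, a classic MBR is a single 512-byte sector: 446 bytes of boot code, a 64-byte partition table (four 16-byte entries) and a two-byte <code class="language-plaintext highlighter-rouge">0x55 0xAA</code> signature at the very end. Below is a minimal sketch of checking that signature with standard tools. It runs against a throwaway image file rather than a real disk, and the <code class="language-plaintext highlighter-rouge">img</code> variable is made up for illustration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Build a blank 512-byte "disk image" (a stand-in for a real MBR sector)
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
# Stamp the boot signature 0x55 0xAA (octal 125 252) at byte offset 510
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null

# Layout: bytes 0-445 boot code, 446-509 partition table, 510-511 signature
signature=$(dd if="$img" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
echo "$signature"
</code></pre></div></div>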
<p>I used a Windows installation image to boot the virtual machine. Unfortunately, it did not recognize any system on the disk, so it seemed that the partition table was corrupted as well, or maybe the entire disk was encrypted.</p>
<p><img src="/images/blog/malware-analysis/recovery-no-systems.png" alt="No systems found" class="center-image" />
<em>Recovery console. No systems found</em></p>
<p>So I used TestDisk - a partition recovery tool that scans the actual data space and tries to recognise existing partitions. If only the MBR was corrupted, TestDisk would easily be able to repair it. After just a second, it announced a complete success and restored the partition table:</p>
<p><img src="/images/blog/malware-analysis/test-disk.png" alt="TestDisk restored the partition" class="center-image" />
<em>TestDisk restored the partition</em></p>
<p>Good, it seems that the files are fine. Now I can get back to the Windows recovery console and restore the bootloader.</p>
<p><img src="/images/blog/malware-analysis/recovery-system-found.png" alt="Recovery console found the system" class="center-image" />
<em>Recovery console found the system</em></p>
<p>Much to my surprise, the system was then able to boot and operate normally. Moreover, I didn’t notice any missing or inaccessible files. So this malware had only replaced the MBR, but it hadn’t encrypted any data. What’s the deal with an “unlocking code” then? How does it check if the code is correct and restore the system to an operational state? I decided to solve this mystery by digging into its machine code. At this point I kinda started to enjoy myself.</p>
<p>Because the malware attacked a system on a virtual machine, I was able to copy its state before playing with the MBR recovery tools. Now I could use this snapshot as a playground. I booted it up again with some basic Linux distribution and dumped the first 10 sectors of the disk into a file:</p>
<p><code class="language-plaintext highlighter-rouge">dd if=/dev/sda of=mbr bs=512 count=10</code></p>
<p>I got back to my host Linux machine and started analysing the dump. First of all it would be nice to get human-readable assembly instructions out of this binary machine code:</p>
<p><code class="language-plaintext highlighter-rouge">objdump -D -b binary -mi386 -Maddr16,data16 mbr</code></p>
<p>The whole thing looked like a good puzzle, so I printed the code listing on paper and started going through it instruction by instruction. And here is what I found.</p>
<p><img src="/images/blog/malware-analysis/printed-code.png" alt="In the middle of the analysis" class="center-image" />
<em>In the middle of the analysis</em></p>
<p>Let’s start with the first chunk, located at the very beginning of the MBR. This is the place where the execution starts. The part on the left contains a hexadecimal representation of the raw bytes from the file. On the right side, you can see the assembler mnemonics that represent disassembled machine code. I’ve also added some comments.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0: b8 00 00 mov ax,0x0 # put 0x0 into AX
3: 8e d8 mov ds,ax # copy AX into DS
5: b8 03 00 mov ax,0x3 # put 0x3 into AX
8: cd 10 int 0x10 # call BIOS interrupt 0x10
</code></pre></div></div>
<p>The only way that this kind of program can communicate with computer peripherals like disks, keyboards and screens is through BIOS interrupts. They are special signals sent to the processor instructing it to run a particular external procedure. An interrupt can be called by using the <code class="language-plaintext highlighter-rouge">int</code> instruction with an interrupt identifier and some other parameters placed in processor registers.</p>
<p>The above piece of code calls BIOS interrupt number <code class="language-plaintext highlighter-rouge">0x10</code>. It can be used to perform different operations with a screen, like setting the display mode or writing and reading characters. With AX set to <code class="language-plaintext highlighter-rouge">0x0003</code>, AH is <code class="language-plaintext highlighter-rouge">0x00</code> and AL is <code class="language-plaintext highlighter-rouge">0x03</code>, so this call selects the “set video mode” function with the standard 80x25 text mode. As a side effect it clears the screen, which is a sensible thing to do before printing a message. Let’s treat it as a nice warm up before the next part.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a: b8 00 80 mov ax,0x8000 # set destination memory address
d: 8e c0 mov es,ax # ES = 0x8000 (destination segment)
f: b8 01 02 mov ax,0x201 # set AX to read one sector from disk
12: b5 00 mov ch,0x0 # configure the source: cylinder 0
14: b1 02 mov cl,0x2 # sector 2
16: b6 00 mov dh,0x0 # head 0
18: b2 80 mov dl,0x80 # drive 0x80 (first hard disk)
1a: 31 db xor bx,bx # BX = 0 (destination offset)
1c: cd 13 int 0x13 # run disk-related interrupt 0x13
</code></pre></div></div>
<p>Now things start to make more sense. Interrupt <code class="language-plaintext highlighter-rouge">0x13</code> with the AH register set to <code class="language-plaintext highlighter-rouge">0x02</code> reads sectors from the disk into memory. In this case, it reads the entire second sector of the disk (512 bytes) into memory at <code class="language-plaintext highlighter-rouge">0x8000:0x0000</code>. Why? A quick look at this sector with a tool called <code class="language-plaintext highlighter-rouge">hexdump</code> immediately answers this question:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000200 b8 00 00 8e d8 20 4b 6f 6d 70 75 74 65 72 20 7a |..... Komputer z|
00000210 61 62 6c 6f 6b 6f 77 61 6e 79 2c 20 64 79 73 6b |ablokowany, dysk|
00000220 20 7a 61 73 7a 79 66 72 6f 77 61 6e 79 0d 0a 20 | zaszyfrowany.. |
00000230 41 62 79 20 6f 64 62 6c 6f 6b 6f 77 61 63 20 77 |Aby odblokowac w|
00000240 79 73 6c 69 6a 20 73 6d 73 20 6f 20 74 72 65 73 |yslij sms o tres|
00000250 63 69 0d 0a 20 22 57 50 20 41 34 37 39 32 22 20 |ci.. "WP A4792" |
00000260 6e 61 20 6e 72 20 37 39 32 38 2e 20 28 4b 6f 73 |na nr 7928. (Kos|
00000270 7a 74 20 39 7a 6c 29 0d 0a 20 4b 6f 64 20 77 70 |zt 9zl).. Kod wp|
00000280 69 73 7a 20 70 6f 6e 69 7a 65 6a 2e 0d 0a 0d 0a |isz ponizej.....|
00000290 20 4b 6f 64 3a 20 5b 20 20 20 20 20 20 20 20 5d | Kod: [ ]|
</code></pre></div></div>
<p>It simply contains human readable characters forming the malware’s message.</p>
<p>Let’s move on. The next part calls a procedure at address <code class="language-plaintext highlighter-rouge">0x83</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1e: b8 00 80 mov ax,0x8000
21: e8 5f 00 call 0x83
</code></pre></div></div>
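<p>As I expected, the procedure at address <code class="language-plaintext highlighter-rouge">0x83</code> prints the message from memory to the screen. It starts at offset 5 of the loaded sector (right after the five code bytes at its beginning) and prints one character at a time until it reaches a zero byte:</p>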
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>83: 8e c0 mov es,ax
85: be 05 00 mov si,0x5
88: 26 8a 04 mov al,BYTE PTR es:[si] # read one character
8b: 3c 00 cmp al,0x0 # check if it is equal to zero
8d: 74 09 je 0x98 # if yes then jump to return
8f: 83 c6 01 add si,0x1 # advance to the next character
92: b4 0e mov ah,0xe # select the teletype output function
94: cd 10 int 0x10 # otherwise print it to the screen
96: eb f0 jmp 0x88 # and jump back in a loop
98: c3 ret
</code></pre></div></div>
<p>Once all characters are printed, the program returns to the caller and moves on. The next part sets the cursor at row 5, column 6, which is the start of the text input area surrounded by square brackets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>24: b4 02 mov ah,0x2 # function 0x02: set cursor position
26: b7 00 mov bh,0x0 # page number 0
28: b6 05 mov dh,0x5 # set row number
2a: b2 06 mov dl,0x6 # set column number
2c: cd 10 int 0x10 # move the cursor
</code></pre></div></div>
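<p>Now things get really interesting. Interrupt <code class="language-plaintext highlighter-rouge">0x16</code> is responsible for reading characters from the keyboard. With AH set to <code class="language-plaintext highlighter-rouge">0x00</code> it waits for a key press and returns its ASCII code in AL. Here it reads one character:</p>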
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3f: b8 e7 7c mov ax,0x7ce7 # set a destination for a character
42: 8e d8 mov ds,ax
44: bf 00 00 mov di,0x0
47: b4 00 mov ah,0x0
49: cd 16 int 0x16 # read one character
</code></pre></div></div>
<p>What happens next?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4b: 3c 0d cmp al,0xd
4d: 74 0f je 0x5e
</code></pre></div></div>
<p>The malware checks if the user pressed ENTER (ASCII code <code class="language-plaintext highlighter-rouge">0x0d</code>) and, if so, jumps to address <code class="language-plaintext highlighter-rouge">0x5e</code>. Otherwise, it enters a loop where it reads exactly 10 characters (this is the length of the unlocking code). If none of the entered keys was ENTER, the program clears the input field and starts the whole character-reading loop from the beginning.</p>
<p>And now the final part. What happens when the user presses ENTER? Let’s go to <code class="language-plaintext highlighter-rouge">0x5e</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>5e: 80 3e 08 00 32 cmp BYTE PTR ds:0x8,0x32
63: 75 09 jne 0x6e
65: 80 3e 09 00 37 cmp BYTE PTR ds:0x9,0x37
6a: 75 02 jne 0x6e
6c: eb 3d jmp 0xab
</code></pre></div></div>
<p>Here the whole mystery unravels. The malware looks at the 9th and 10th characters of the entered sequence and checks if they are equal to <code class="language-plaintext highlighter-rouge">2</code> (ASCII <code class="language-plaintext highlighter-rouge">0x32</code>) and <code class="language-plaintext highlighter-rouge">7</code> (ASCII <code class="language-plaintext highlighter-rouge">0x37</code>), respectively. So you can actually enter any “unlocking code”, as long as it ends with <code class="language-plaintext highlighter-rouge">27</code>. Maybe it was only me, but I actually expected something more fancy. What if I had sent the text message and paid the ransom? I would probably have received some random sequence of characters, with the last two matching this pattern.</p>
<p>The last question that remained unanswered was what happens when the user enters the correct code. To find that out we have to jump to address <code class="language-plaintext highlighter-rouge">0xab</code>. The instructions below read the 5th sector from the disk and write it to the 1st sector, and as you can imagine this 5th sector contained my original MBR. After this operation, everything goes back to normal and my operating system can boot.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ab: b8 00 80 mov ax,0x8000 # destination memory address
ae: 8e c0 mov es,ax
b0: b8 01 02 mov ax,0x201 # AH = 0x02 (read), AL = 0x01 (one sector)
b3: b5 00 mov ch,0x0
b5: b1 05 mov cl,0x5 # source - 5th sector
b7: b6 00 mov dh,0x0
b9: b2 80 mov dl,0x80
bb: 31 db xor bx,bx
bd: cd 13 int 0x13 # read one sector
bf: b8 00 80 mov ax,0x8000 # source memory address
c2: 8e c0 mov es,ax
c4: b8 01 03 mov ax,0x301 # AH = 0x03 (write), AL = 0x01 (one sector)
c7: b5 00 mov ch,0x0
c9: b1 01 mov cl,0x1 # destination - 1st sector
cb: b6 00 mov dh,0x0
cd: b2 80 mov dl,0x80
cf: 31 db xor bx,bx
d1: cd 13 int 0x13 # write one sector
</code></pre></div></div>
<p>Having solved this mystery, I thought that maybe I wasn’t the only person to have this problem. I googled the ransom message text and actually found a few results.</p>
<p>First of all, I came across a local classifieds page where somebody offered to fix computers locked with that malware (price negotiable). The person stated that they were located in Cracow, the place where I live, so I started to suspect that this whole thing may be very local. By the way, to make this business look attractive the service would have to be cheaper than 9 PLN. Doesn’t sound like a good deal for the service provider.</p>
<p><img src="/images/blog/malware-analysis/service-ad.png" alt="Removing service ad" class="center-image" width="600px" />
<em>Removing service ad</em></p>
<p>The second thing I found was <a href="https://www.elektroda.pl/rtvforum/topic2876447-30.html">a thread</a> on a well-known Polish discussion board about electronics and computers called <em>elektroda.pl</em>. One of its users described exactly the same problem and asked for help. Other people in this thread also noticed that the malware corrupts only the MBR and advised him with a solution more or less similar to mine. Nevertheless, I wrote a post with steps that I came up with. I also included the correct code pattern to unlock the system.</p>
<p>And then I forgot about the whole case.</p>
<hr />
<p>Three months later, out of the blue, somebody replied to this thread:</p>
<blockquote>
<p>Hello everybody and thank you for the time spent on this. I’m the author of this application. Respect to m4jkel (<strong>that’s me</strong>), who decided to analyze the code of my program. It was written in Assembler (which I adore) with Fasm compiler.</p>
<p>(…)</p>
<p>To become infected with it, the user had to install a pirated program, such as a game or operating system. If he had bought the original software he would not have any problems.</p>
<p>(…)</p>
<p>Please, don’t send messages to the given number, because the code that you’ll receive will not unlock your computer. It’s a flaw of the messaging service, which I was not able to configure to generate codes ending with desired characters.</p>
<p>Regards,
Karol</p>
</blockquote>
<p>Actually, it was quite nice that somebody appreciated my effort, even if it was the author. Unfortunately for the victims who decided to send a message, the code they received was not correct. It also meant that the author had not tested it before releasing the malware. Too bad for the victims. Maybe he was also the person who offered his repairing services to actually make some money after discovering that his initial plan hadn’t worked? Who knows.</p>
<p>I don’t know how many people had been affected besides me, but I’ve found a few more discussion boards mentioning the same problem. It looks like the author had prepared installation images of several programs and games with his malicious code and had put them on warez sites.</p>
<p>The last question that you probably have is how I got infected with the malware. I have to admit that <em>Karol</em> had a point there. I had downloaded and installed a pirated game. My bad. As an excuse I can say that this game was not available on Steam back then, and I really wanted to take a break from working on my master’s thesis ;) And I actually bought it later.</p>
<p>So, that’s what I’ve learned. Do not, under any circumstances, download and install pirated software. Apart from the fact that it’s illegal, you just put yourself, your data and your privacy at unnecessary risk. On the other hand, if you want to write malware you’d better test that it works correctly. And, by the way, always do backups.</p>
<p><em>A story about how I got infected with malware, analyzed its machine code and got appreciated by its author.</em></p>
<hr />
<p><strong>Advanced SQL - Common Table Expressions</strong> (2018-01-26, https://mjk.space/advanced-sql-cte)</p>
<p>This is the second article in my series discussing advanced SQL concepts. I want to describe features that have been well supported in popular database management systems for quite some time, but somehow many people still don’t know about their existence. I’d like to explain them with examples, first giving a problem to solve using “plain old” SQL and then showing a better solution using advanced SQL.</p>
<p>You can find the first article about window functions <a href="/advanced-sql-window-functions/">here</a>.</p>
<p>This time I’d like to discuss <strong>Common Table Expressions (CTE)</strong>.</p>
<p>In this post I’ll be using PostgreSQL 10, because it’s the most feature-rich open source database available. Common Table Expressions have been available since Postgres 8.4, so any modern version will be fine. They are also <a href="https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL#Common_table_expression">supported by other popular RDBMSes</a>.</p>
<h3 id="problem">Problem</h3>
<p>This time we’ll be working with three tables: <code class="language-plaintext highlighter-rouge">films</code>, <code class="language-plaintext highlighter-rouge">actors</code> and a linking table <code class="language-plaintext highlighter-rouge">films_actors</code> between them:</p>
<p><img src="/images/blog/advanced-sql-cte/db_schema.svg" alt="Films table schema" class="center-image" />
<em>Database schema</em></p>
<p>Something that may attract your attention are the columns <code class="language-plaintext highlighter-rouge">prequel_id</code> and <code class="language-plaintext highlighter-rouge">sequel_id</code>. They are foreign keys referencing the very same table <code class="language-plaintext highlighter-rouge">films</code> and pointing respectively to the prequel or sequel of a given film. To make things clear I prepared a set of sample rows for this table:</p>
<p><img src="/images/blog/advanced-sql-cte/films-rows.svg" alt="Input rows" class="center-image" />
<em>Films rows</em></p>
<p>As you can see there are two chains of the prequel-sequel relation (1-3-5 and 2-4) and one film that has no connections at all. I don’t think it’s necessary to provide the content of the other two tables - let’s pretend they have some meaningful data.</p>
<p>Here comes the first example:</p>
<h5 id="example-1-for-each-film-count-number-of-actors-starring-in-it">Example 1. For each film count number of actors starring in it.</h5>
<p>Easy. The result looks like this:</p>
<p><img src="/images/blog/advanced-sql-cte/example-1-results.svg" alt="Result rows with actors counts" class="center-image" />
<em>Result rows with actors counts</em></p>
<p>There is no catch here - it’s as simple as it looks. All we need to do is to join <code class="language-plaintext highlighter-rouge">films</code> with <code class="language-plaintext highlighter-rouge">films_actors</code> and count the number of rows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
</code></pre></div></div>
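<p>One detail worth checking in the query above: with a <code class="language-plaintext highlighter-rouge">LEFT JOIN</code>, a film that has no rows in <code class="language-plaintext highlighter-rouge">films_actors</code> still produces one joined row (with <code class="language-plaintext highlighter-rouge">NULL</code>s on the right side), so <code class="language-plaintext highlighter-rouge">COUNT(*)</code> reports 1 for it. If such films should show 0 instead, a possible variant is to count a column from the joined table, because <code class="language-plaintext highlighter-rouge">COUNT(column)</code> skips <code class="language-plaintext highlighter-rouge">NULL</code>s. This is only a sketch; whether the sample data contains any film without actors is not shown here:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  f.id, COUNT(fa.film_id) AS actors  -- NULLs from unmatched rows are not counted
FROM films f
LEFT JOIN films_actors fa ON f.id = fa.film_id
GROUP BY f.id
</code></pre></div></div>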
<p>With that solution in mind let’s move on to the next thing.</p>
<h5 id="example-2-for-each-film-its-prequel-and-sequel-count-number-of-actors-starring-in-them">Example 2. For each film, its prequel and sequel count number of actors starring in them.</h5>
<p>This task doesn’t look any more complicated than the previous one. Right?</p>
<p><img src="/images/blog/advanced-sql-cte/example-2-results.svg" alt="Result rows with actors counts for prequel and sequel" class="center-image" />
<em>Result rows with actors counts for prequel and sequel</em></p>
<p>There is more than one solution, but the most straightforward just uses three identical queries to get information about the film and both its prequel and sequel:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">films</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">films</span><span class="p">.</span><span class="n">prequel_id</span><span class="p">,</span> <span class="n">films</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span>
<span class="n">films</span><span class="p">.</span><span class="n">actors</span><span class="p">,</span>
<span class="n">prequels</span><span class="p">.</span><span class="n">actors</span> <span class="k">AS</span> <span class="n">prequel_actors</span><span class="p">,</span>
<span class="n">sequels</span><span class="p">.</span><span class="n">actors</span> <span class="k">AS</span> <span class="n">sequel_actors</span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="o">*</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">as</span> <span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span> <span class="n">films</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="o">*</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">as</span> <span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span> <span class="n">prequels</span> <span class="k">ON</span> <span class="n">films</span><span class="p">.</span><span class="n">prequel_id</span> <span class="o">=</span> <span class="n">prequels</span><span class="p">.</span><span class="n">id</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="o">*</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">as</span> <span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span> <span class="n">sequels</span> <span class="k">ON</span> <span class="n">films</span><span class="p">.</span><span class="n">sequel_id</span> <span class="o">=</span> <span class="n">sequels</span><span class="p">.</span><span class="n">id</span>
</code></pre></div></div>
<p>The query works, but has some problems. It’s not easy to read and understand, as you have to carefully compare, line by line, all three subqueries to make sure that they do exactly the same thing. Modifying one also means that you need to change the others. Wouldn’t it be nice to write identical parts once and only refer to them somehow?</p>
<p>Moreover this also shows one general disadvantage of SQL - you need to read queries from inside to outside - because that’s the order in which they are executed. I think it would look much better if we had them one below the other.</p>
<p>And that’s what CTE are mainly about.</p>
<h3 id="common-table-expressions-cte">Common Table Expressions (CTE)</h3>
<p>CTE are a mechanism that allows you to define temporary named result sets existing just for one query (you may also think about them as temporary “tables” or “views”). Let’s see how they work in practice by solving Example 1 once again:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">film_with_actors_count</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="o">*</span><span class="p">,</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">actors</span>
<span class="k">FROM</span> <span class="n">film_with_actors_count</span> <span class="n">f</span>
</code></pre></div></div>
<p>CTE are defined using the <code class="language-plaintext highlighter-rouge">WITH … AS</code> clause. In PostgreSQL you can put almost any SQL statement inside them (not only <code class="language-plaintext highlighter-rouge">SELECT</code>, but also <code class="language-plaintext highlighter-rouge">INSERT</code>, <code class="language-plaintext highlighter-rouge">UPDATE</code> or <code class="language-plaintext highlighter-rouge">DELETE</code>). Every CTE has a name, so you can easily refer to it in the main query, just like I did in the example above. Fun fact: there is no comma or semicolon between the last CTE definition and the main query.</p>
<p>Of course you can refer to them as many times as you want. For example, take a look at the new solution to Example 2:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">film_with_actors_count</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="o">*</span><span class="p">,</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">films</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">films</span><span class="p">.</span><span class="n">prequel_id</span><span class="p">,</span> <span class="n">films</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span>
<span class="n">films</span><span class="p">.</span><span class="n">actors</span><span class="p">,</span>
<span class="n">prequels</span><span class="p">.</span><span class="n">actors</span> <span class="k">AS</span> <span class="n">prequel_actors</span><span class="p">,</span>
<span class="n">sequels</span><span class="p">.</span><span class="n">actors</span> <span class="k">AS</span> <span class="n">sequel_actors</span>
<span class="k">FROM</span> <span class="n">film_with_actors_count</span> <span class="n">films</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">film_with_actors_count</span> <span class="n">prequels</span> <span class="k">on</span> <span class="n">films</span><span class="p">.</span><span class="n">prequel_id</span> <span class="o">=</span> <span class="n">prequels</span><span class="p">.</span><span class="n">id</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">film_with_actors_count</span> <span class="n">sequels</span> <span class="k">on</span> <span class="n">films</span><span class="p">.</span><span class="n">sequel_id</span> <span class="o">=</span> <span class="n">sequels</span><span class="p">.</span><span class="n">id</span>
</code></pre></div></div>
<p>Looks better, doesn’t it?</p>
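<p>A single <code class="language-plaintext highlighter-rouge">WITH</code> clause can also hold several CTE separated by commas, and a later one may refer to an earlier one. Here is a minimal sketch on the same <code class="language-plaintext highlighter-rouge">films</code> and <code class="language-plaintext highlighter-rouge">films_actors</code> tables; the names <code class="language-plaintext highlighter-rouge">actor_counts</code> and <code class="language-plaintext highlighter-rouge">busy_films</code> and the threshold of 3 actors are made up for this illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH actor_counts AS (
  -- one row per film that has at least one actor
  SELECT fa.film_id, COUNT(*) AS actors
  FROM films_actors fa
  GROUP BY fa.film_id
),
busy_films AS (
  -- builds on the previous CTE; the threshold is arbitrary
  SELECT ac.film_id, ac.actors
  FROM actor_counts ac
  WHERE ac.actors &gt;= 3
)
SELECT f.id, b.actors
FROM films f
JOIN busy_films b ON b.film_id = f.id
</code></pre></div></div>
<p>Each step has a name and can be read from top to bottom, which is the main readability gain over nested subqueries.</p>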
<p>Let’s see what other problems CTE can solve. To visualize the first one I’ll use the example from my <a href="/advanced-sql-window-functions/">previous article</a> about window functions. This time with a little complication:</p>
<h5 id="example-3-return-a-single-film-with-the-greatest-number-of-actors-for-each-release-year">Example 3. Return a single film with the greatest number of actors for each release year.</h5>
<p>So, only one film from each year and only the one with the most actors:</p>
<p><img src="/images/blog/advanced-sql-cte/example-3-results.svg" alt="Films with greatest number of actors for each year" class="center-image" />
<em>Films with greatest number of actors for each year</em></p>
<p>To solve this problem we need to use window functions. Adding a new column with the correct values is just a matter of using <code class="language-plaintext highlighter-rouge">RANK()</code> over a correctly partitioned and ordered window:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">actors</span><span class="p">,</span>
<span class="n">RANK</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_rank</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
</code></pre></div></div>
<p>That’ll give us the following result:</p>
<p><img src="/images/blog/advanced-sql-cte/example-3-mid-results.svg" alt="Films with year rank" class="center-image" />
<em>Films with year rank</em></p>
<p>And now we can simply add a <code class="language-plaintext highlighter-rouge">HAVING</code> clause, right?</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">actors</span><span class="p">,</span>
<span class="n">RANK</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_rank</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="k">HAVING</span> <span class="n">year_rank</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>Unfortunately not:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ERROR: column "year_rank" does not exist
LINE 5: HAVING year_rank = 1
</code></pre></div></div>
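<p>That’s because window functions are evaluated only after the <code class="language-plaintext highlighter-rouge">WHERE</code>, <code class="language-plaintext highlighter-rouge">GROUP BY</code> and <code class="language-plaintext highlighter-rouge">HAVING</code> clauses of the same query, so their results can’t be used for filtering there. To overcome this issue we can simply wrap the above query with another query and add the necessary filtering there. Or, to make things clearer and simpler, use CTE:</p>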
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">films_actors_year_rank</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">actors</span><span class="p">,</span>
<span class="n">RANK</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_rank</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">films_actors</span> <span class="n">fa</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fa</span><span class="p">.</span><span class="n">film_id</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">actors</span>
<span class="k">FROM</span> <span class="n">films_actors_year_rank</span> <span class="n">f</span>
<span class="k">WHERE</span> <span class="n">year_rank</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>And now it works just fine. Once more a CTE was used to improve readability. This is good, but are CTEs really only about making SQL code nicer?</p>
<p>Well, not exactly. In fact there are problems that simply can’t be solved without a CTE.</p>
<h5 id="example-4-for-each-film-return-number-of-all-its-prequels-and-sequels">Example 4. For each film return number of all its prequels and sequels.</h5>
<p>Here’s what I have in mind:</p>
<p><img src="/images/blog/advanced-sql-cte/example-4-results.svg" alt="Films with numbers of all their prequels and sequels" class="center-image" />
<em>Films with numbers of all their prequels and sequels</em></p>
<p>We need to count the length of both the prequel chain and the sequel chain for each film.</p>
<p>If you think about this problem for a while you may realize that it’s not difficult at all to check if a film has a single prequel or sequel by simply looking at its corresponding foreign key column. It’s not hard to extend it to the second level either. In other words we can check if the prequel’s prequel (or sequel’s sequel) exists by doing a self join. Each additional level of nesting, however, requires one more join. To solve this problem for any length of the prequel/sequel sequence we’d need something more powerful.</p>
<p>Something like a recursion. Wait, what? In SQL? Yes, it’s possible.</p>
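<p>For comparison, here is a rough sketch of the fixed-depth approach described above. It assumes only the <code class="language-plaintext highlighter-rouge">films</code> table with its <code class="language-plaintext highlighter-rouge">prequel_id</code> column; the aliases <code class="language-plaintext highlighter-rouge">p1</code> and <code class="language-plaintext highlighter-rouge">p2</code> are made up for illustration. Every extra level needs one more self join, so a query like this can never handle chains of arbitrary length:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Two levels only: the prequel and the prequel's prequel
SELECT
  f.id,
  p1.id AS prequel_id,          -- NULL when the film has no prequel
  p2.id AS prequels_prequel_id  -- NULL when the prequel has no prequel
FROM films f
LEFT JOIN films p1 ON f.prequel_id = p1.id
LEFT JOIN films p2 ON p1.prequel_id = p2.id;
</code></pre></div></div>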
<h3 id="recursive-cte">Recursive CTE</h3>
<p>A recursive CTE has the interesting ability to invoke itself. You can put the name of a CTE in its body and therefore make it run recursively. This kind of CTE takes the form of:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">cte</span> <span class="k">AS</span> <span class="p">(</span>
<span class="c1">-- [non-recursive term]</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="c1">-- [recursive term]</span>
<span class="p">)</span>
</code></pre></div></div>
<p>A very simple working example from <a href="https://www.postgresql.org/docs/current/static/queries-with.html#QUERIES-WITH-SELECT">Postgres documentation</a> goes as follows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">t</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1">-- non-recursive term</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">n</span><span class="o">+</span><span class="mi">1</span> <span class="k">FROM</span> <span class="n">t</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="mi">100</span> <span class="c1">-- recursive term</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t</span><span class="p">;</span>
</code></pre></div></div>
<p>The above query generates numbers from 1 to 100.</p>
<p>As you can see, there are a few differences between a normal and a recursive Common Table Expression. The first is the <code class="language-plaintext highlighter-rouge">RECURSIVE</code> keyword in the definition, which enables the recursive mode. The second is that the query consists of two separate parts connected with a <code class="language-plaintext highlighter-rouge">UNION ALL</code> operator. They are called the “non-recursive term” and the “recursive term”, respectively. You can make sure that the result table will not contain any duplicates by using <code class="language-plaintext highlighter-rouge">UNION</code> instead of <code class="language-plaintext highlighter-rouge">UNION ALL</code>. The last difference is that the recursive term refers to its own CTE - something not possible in the normal mode.</p>
<p>Let’s see how recursive CTEs work exactly. Under the hood Postgres uses two temporary tables: a working table and a result table. The latter accumulates the final result of the CTE. Technically the whole process is iterative, not recursive, but “recursive” is the name the SQL standards committee chose for it. It can be visualized in three steps:</p>
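<p>To make the difference between the two operators more tangible, here is a small sketch of my own (it is not taken from the Postgres documentation). The recursive term cycles through the values 1, 2, 3. With <code class="language-plaintext highlighter-rouge">UNION</code> the query terminates, because a row that duplicates any earlier result row is discarded and eventually no new rows are produced. With <code class="language-plaintext highlighter-rouge">UNION ALL</code> the same query would never stop.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE t(n) AS (
    VALUES (1)                 -- non-recursive term
  UNION                        -- duplicates are discarded
    SELECT (n % 3) + 1 FROM t  -- recursive term: 1 gives 2, 2 gives 3, 3 gives 1
)
SELECT * FROM t;
-- Returns 1, 2 and 3: the 1 produced from 3 is a duplicate,
-- so nothing new is added and the evaluation ends.
</code></pre></div></div>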
<p><strong>1. Initial step</strong></p>
<p>The initial step is evaluated only once. The executor runs the non-recursive term and puts the result in both the working table and the result table:</p>
<p><img src="/images/blog/advanced-sql-cte/recursive-step-1.svg" alt="First step of CTE recursion" class="center-image" /></p>
<p><strong>2. Repetitive step</strong></p>
<p>This step is evaluated many times, in a loop. The executor runs the recursive term against the content of the working table and then merges its output with the result table. It removes duplicates if the <code class="language-plaintext highlighter-rouge">UNION</code> operator was used. The output also replaces the content of the working table, preparing it for the next iteration. The whole process repeats as long as the working table is not empty.</p>
<p><img src="/images/blog/advanced-sql-cte/recursive-step-2.svg" alt="Second step of CTE recursion" class="center-image" /></p>
<p><strong>3. Final step</strong></p>
<p>The database simply returns the content of the result table.</p>
<p><img src="/images/blog/advanced-sql-cte/recursive-step-3.svg" alt="Third step of CTE recursion" class="center-image" /></p>
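<p>The three steps can be traced by hand on a shortened version of the number generator shown earlier. The limit of 3 is my own choice to keep the trace short; the comments describe the state of both tables as explained above.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n + 1 FROM t WHERE n &lt; 3
)
SELECT * FROM t;
-- Initial step: non-recursive term yields {1}
--               working = {1}, result = {1}
-- Iteration 1:  recursive term over working yields {2}
--               working = {2}, result = {1, 2}
-- Iteration 2:  recursive term over working yields {3}
--               working = {3}, result = {1, 2, 3}
-- Iteration 3:  recursive term yields nothing (3 &lt; 3 is false)
--               working is empty, so the loop ends
-- Final step:   the query returns 1, 2, 3
</code></pre></div></div>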
<p>Armed with that knowledge let’s get back to Example 4. I asked you to find out how many prequels and sequels each film has.</p>
<p>Let’s start with the prequel count and do it step by step. The first thing is to write the CTE header:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">films_with_prequels_number</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">sequel_id</span><span class="p">,</span> <span class="n">prequels_num</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span>
</code></pre></div></div>
<p>I added the part with column names in brackets here. From now on I can skip all column aliases, but I need to be careful to list the selected expressions in the same order. The additional column <code class="language-plaintext highlighter-rouge">prequels_num</code> will hold the number of prequels for a particular film.</p>
<p>Now let’s write the non-recursive term, which prepares the first set of rows for both the working table and the result table. Because we’re counting prequels we have to select all films that have zero prequels - their <code class="language-plaintext highlighter-rouge">prequel_id</code> column will be set to <code class="language-plaintext highlighter-rouge">NULL</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span> <span class="mi">0</span> <span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span> <span class="k">WHERE</span> <span class="n">f</span><span class="p">.</span><span class="n">prequel_id</span> <span class="k">IS</span> <span class="k">NULL</span>
</code></pre></div></div>
<p>Now we have to choose the linking operator. In our case both <code class="language-plaintext highlighter-rouge">UNION</code> and <code class="language-plaintext highlighter-rouge">UNION ALL</code> work identically, because we’re not expecting any duplicates.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">UNION</span> <span class="k">ALL</span>
</code></pre></div></div>
<p>Now the hardest part - the recursive term. We need to take the content of the working table by selecting from the CTE itself and join it with the <code class="language-plaintext highlighter-rouge">films</code> table, effectively replacing each film with its sequel. We also have to remember to increment the <code class="language-plaintext highlighter-rouge">prequels_num</code> column:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span> <span class="n">fr</span><span class="p">.</span><span class="n">prequels_num</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">FROM</span> <span class="n">films_with_prequels_number</span> <span class="n">fr</span>
<span class="k">JOIN</span> <span class="n">films</span> <span class="n">f</span> <span class="k">ON</span> <span class="n">fr</span><span class="p">.</span><span class="n">sequel_id</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span>
</code></pre></div></div>
<p>And that’s it. Using an inner <code class="language-plaintext highlighter-rouge">JOIN</code> instead of a <code class="language-plaintext highlighter-rouge">LEFT JOIN</code> ensures that the execution will eventually stop, because it effectively discards all rows that have a <code class="language-plaintext highlighter-rouge">NULL</code> value in their joining column (in this case <code class="language-plaintext highlighter-rouge">sequel_id</code>). It also matters that our data contains no cycles.</p>
<p>Now let’s look at the whole query with CTEs for both prequels and sequels:</p>
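<p>If the data could contain cycles (for example, two films wrongly pointing at each other as sequels), the query above might loop forever. One common way to guard against that in Postgres is to carry an array of already visited ids. The sketch below is only an illustration: the <code class="language-plaintext highlighter-rouge">path</code> column is hypothetical and not part of the original solution, and I assume that <code class="language-plaintext highlighter-rouge">id</code> is an integer.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE films_with_prequels_number(id, sequel_id, prequels_num, path) AS (
    -- start from films without a prequel; the path holds just the film itself
    SELECT f.id, f.sequel_id, 0, ARRAY[f.id]
    FROM films f
    WHERE f.prequel_id IS NULL
  UNION ALL
    -- move to the sequel, unless it was already visited on this path
    SELECT f.id, f.sequel_id, fr.prequels_num + 1, fr.path || f.id
    FROM films_with_prequels_number fr
    JOIN films f ON fr.sequel_id = f.id
    WHERE f.id &lt;&gt; ALL (fr.path)
)
SELECT id, prequels_num FROM films_with_prequels_number;
</code></pre></div></div>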
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">films_with_prequels_number</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">sequel_id</span><span class="p">,</span> <span class="n">prequels_num</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span> <span class="mi">0</span> <span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span> <span class="k">WHERE</span> <span class="n">f</span><span class="p">.</span><span class="n">prequel_id</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span> <span class="n">fr</span><span class="p">.</span><span class="n">prequels_num</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">FROM</span> <span class="n">films_with_prequels_number</span> <span class="n">fr</span>
<span class="k">JOIN</span> <span class="n">films</span> <span class="n">f</span> <span class="k">ON</span> <span class="n">fr</span><span class="p">.</span><span class="n">sequel_id</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">),</span>
<span class="n">films_with_sequels_number</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">prequel_id</span><span class="p">,</span> <span class="n">sequels_num</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">prequel_id</span><span class="p">,</span> <span class="mi">0</span> <span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span> <span class="k">WHERE</span> <span class="n">f</span><span class="p">.</span><span class="n">sequel_id</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">prequel_id</span><span class="p">,</span> <span class="n">fr</span><span class="p">.</span><span class="n">sequels_num</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">FROM</span> <span class="n">films_with_sequels_number</span> <span class="n">fr</span>
<span class="k">JOIN</span> <span class="n">films</span> <span class="n">f</span> <span class="k">ON</span> <span class="n">fr</span><span class="p">.</span><span class="n">prequel_id</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">prequel_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">sequel_id</span><span class="p">,</span> <span class="n">fpn</span><span class="p">.</span><span class="n">prequels_num</span><span class="p">,</span> <span class="n">fsn</span><span class="p">.</span><span class="n">sequels_num</span> <span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">films_with_prequels_number</span> <span class="n">fpn</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fpn</span><span class="p">.</span><span class="n">id</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">films_with_sequels_number</span> <span class="n">fsn</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="n">fsn</span><span class="p">.</span><span class="n">id</span>
</code></pre></div></div>
<p>The very last part is the final query - the one that puts everything together. It’s as simple as selecting from the <code class="language-plaintext highlighter-rouge">films</code> table and joining it with both CTEs.</p>
<h5 id="bonus-example-find-fibonacci-sequence-with-numbers-below-100">Bonus example. Find Fibonacci sequence with numbers below 100.</h5>
<p>I’ll just post the solution here and leave it for a curious reader to analyze ;)</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">fibo</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">VALUES</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="k">UNION</span> <span class="k">ALL</span>
<span class="k">SELECT</span> <span class="n">b</span><span class="p">,</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
<span class="k">FROM</span> <span class="n">fibo</span>
<span class="k">WHERE</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o">&lt;</span> <span class="mi">100</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">b</span> <span class="k">FROM</span> <span class="n">fibo</span>
</code></pre></div></div>
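<p>If you’d like to check your analysis, the variant below selects both columns, so every intermediate pair becomes visible. The trace in the comments is my own walk-through of the evaluation rules described earlier, not output copied from a database.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE fibo(a, b) AS (
    VALUES (0, 1)
  UNION ALL
    SELECT b, a + b FROM fibo WHERE a + b &lt; 100
)
SELECT a, b FROM fibo;
-- Pairs (a, b) produced, one per iteration:
-- (0,1) (1,1) (1,2) (2,3) (3,5) (5,8) (8,13) (13,21) (21,34) (34,55) (55,89)
-- For the last pair, 55 + 89 = 144 is not below 100, so the recursive term
-- returns no rows, the working table becomes empty and the evaluation stops.
</code></pre></div></div>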
<h3 id="cte----inconspicuous-but-powerful">CTE - inconspicuous but powerful</h3>
<p>CTEs are an interesting SQL feature. They help to organize and simplify complicated queries and also make them easier to maintain by allowing a user to get rid of duplicated parts. In their simplest form however they don’t offer anything more, especially nothing in terms of manipulating data.</p>
<p>You may therefore think that they’re not very useful. But their true potential lies in the recursive mode. It enables you to do a thing otherwise impossible in pure SQL - write a query that invokes itself, which gives you a lot of new possibilities. For example, you can traverse your relational tables as if they were graphs. Recursive CTEs might seem hard at first glance, but once you get familiar with them, you will appreciate the power they give.</p>
<div class="infobox">
<p>If you like my style of explaining things, you can check my article about another advanced SQL feature - <a href="/advanced-sql-window-functions/">window functions</a>.</p>
</div>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="http://modern-sql.com/">Modern SQL</a></li>
<li><a href="https://www.postgresql.org/docs/current/static/queries-with.html">PostgreSQL documentation</a></li>
<li><a href="https://stackoverflow.com/questions/4740748/when-to-use-common-table-expression-cte">Stack Overflow - When to use Common Table Expression (CTE)</a></li>
</ul>
<p><strong>Advanced SQL - window functions</strong> (published 2017-11-09, https://mjk.space/advanced-sql-window-functions)</p>
<p>This post starts a series of articles discussing advanced SQL concepts that have been well supported in popular database management systems for quite some time, but somehow many people still don’t know about their existence. I’d like to explain them with examples, first giving a problem to solve using “plain old” SQL and then showing a better solution using advanced SQL.</p>
<p>The first feature that I’d like to present is <strong>window functions</strong>.</p>
<p>In this article I’ll be using PostgreSQL 10, because it’s the most feature-rich open source database available. Version 10 <a href="https://www.postgresql.org/about/news/1786/">has just been released</a>, but window functions have been available since 8.4, so any modern version will be fine.</p>
<div class="infobox">
<p>This post was written in 2017, but everything it describes works in later versions of Postgres.</p>
</div>
<h3 id="problem">Problem</h3>
<p>As I promised, let’s start with a problem. We’ll be working with a very simple one-table database. The table contains information about films: years they were released in, ID of a category and their ratings according to some imaginary movie database:</p>
<p><img src="/images/blog/advanced-sql-window-functions/films_schema.svg" alt="Films table schema" class="center-image" />
<em>Films table schema</em>
<img src="/images/blog/advanced-sql-window-functions/input-rows.svg" alt="Input rows" class="center-image" />
<em>Input rows</em></p>
<p>Here comes the task:</p>
<h5 id="example-1-for-each-film-find-an-average-rating-for-all-films-released-in-the-same-year">Example 1. For each film find an average rating for all films released in the same year.</h5>
<p>The result should look like this. All films released in the same year have the same average:</p>
<p><img src="/images/blog/advanced-sql-window-functions/result-rows-1.svg" alt="Result rows with year_avg" class="center-image" />
<em>Result rows with year’s average</em></p>
<p>Stop here for a second and think about how you would tackle this problem using plain old SQL concepts like <code class="language-plaintext highlighter-rouge">JOIN</code>, <code class="language-plaintext highlighter-rouge">GROUP BY</code> and other things that come to your mind. There are at least a few possible solutions.</p>
<p>One of them is to use a subquery computing averages for all distinct years and join it back with the query fetching all films:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span> <span class="n">years</span><span class="p">.</span><span class="n">year_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span>
<span class="p">)</span> <span class="n">years</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span> <span class="o">=</span> <span class="n">years</span><span class="p">.</span><span class="n">release_year</span>
</code></pre></div></div>
<p>It doesn’t look very complicated so far. Why don’t we solve one more problem then?</p>
<h5 id="example-2-for-each-film-find-average-ratings-for-all-films-released-in-the-same-year-and-separately-in-the-same-category">Example 2. For each film find average ratings for all films released in the same year and separately in the same category.</h5>
<p>And I’m expecting the following. Again, films in the same category have equal category averages:</p>
<p><img src="/images/blog/advanced-sql-window-functions/result-rows-2.svg" alt="Result rows with year_avg and category_avg" class="center-image" />
<em>Result rows with year’s and category’s averages</em></p>
<p>Looks easy. We just have to compute the category averages in a similar way and join them with the previous query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="n">years</span><span class="p">.</span><span class="n">year_avg</span><span class="p">,</span> <span class="n">categories</span><span class="p">.</span><span class="n">category_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span>
<span class="p">)</span> <span class="n">years</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span> <span class="o">=</span> <span class="n">years</span><span class="p">.</span><span class="n">release_year</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="k">AS</span> <span class="n">category_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span>
<span class="p">)</span> <span class="n">categories</span> <span class="k">ON</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span> <span class="o">=</span> <span class="n">categories</span><span class="p">.</span><span class="n">category_id</span>
</code></pre></div></div>
<p>The query gets more and more complicated though. It takes some time to read it and realize what exactly we are joining here.</p>
<p>Please also notice a pattern: we select a set of rows from a table and then join them with aggregated versions of the same row set. But we can’t just use <code class="language-plaintext highlighter-rouge">GROUP BY</code> in the main query because we want to get the full list of films as a result. Thus we have to copy-paste the main query into each subquery. Just imagine if the main query were complicated itself, with a lot of joins, <code class="language-plaintext highlighter-rouge">WHERE</code> clauses or even its own grouping… Ideally we’d like to have a way to do some computations on a row set without altering it at the same time.</p>
<p>And this is exactly what <strong>window functions</strong> are all about.</p>
<h3 id="solution---window-functions">Solution - window functions</h3>
<p><a href="https://www.postgresql.org/docs/current/static/tutorial-window.html">PostgreSQL documentation</a> has a nice definition of what window functions are:</p>
<blockquote>
<p>A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities.</p>
</blockquote>
<p>In other words, window functions allow us to get aggregated results without actually aggregating the result set. Let’s see how they work in practice.</p>
<p>A simplified syntax looks like this:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">function_name</span> <span class="n">OVER</span> <span class="p">(</span> <span class="n">window_definition</span> <span class="p">)</span>
<span class="k">FROM</span> <span class="p">(...)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">window_definition</code> defines the set of rows that the current row is related to (I’m going to call it <em>a window</em>) and <code class="language-plaintext highlighter-rouge">function_name</code> specifies the function that we’re gonna use to operate on rows in each window. For full syntax see the <a href="https://www.postgresql.org/docs/current/static/sql-expressions.html#syntax-window-functions">documentation</a>.</p>
<p>Let’s get back to the initial problem, where we needed to calculate year’s average for each film. The solution using window functions is much simpler:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_avg</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
</code></pre></div></div>
<p>The window here is defined by a <code class="language-plaintext highlighter-rouge">PARTITION BY</code> clause. It instructs the database to divide the row set into smaller parts, partitions, putting all rows with the same <code class="language-plaintext highlighter-rouge">release_year</code> together. Then the aggregate function <code class="language-plaintext highlighter-rouge">AVG(rating)</code> is run against each partition and the result is added to each row.</p>
<p><img src="/images/blog/advanced-sql-window-functions/partitioning.svg" alt="Partitioning" class="center-image" />
<em>Window functions partitioning</em></p>
<p>As you can see, all input rows are transferred to the result set, safe and sound. Additionally, any condition that we set on the main query also applies to the window function’s input. In other words, if we had added a <code class="language-plaintext highlighter-rouge">WHERE</code> clause filtering out some rows, these rows would also have been missing from the window function computation.</p>
<p>Window functions are a powerful feature. We can choose from a wide range of functions to use and ways to define windows. I’ll mention just a few interesting possibilities here.</p>
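<p>The same idea extends naturally to Example 2. The sketch below is my own take on it, assuming the same <code class="language-plaintext highlighter-rouge">films</code> table: each average gets its own window, so no subquery or join is needed. Adding a <code class="language-plaintext highlighter-rouge">WHERE</code> clause to this query would restrict the rows seen by both windows.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  f.id, f.release_year, f.category_id, f.rating,
  -- average over all films released in the same year
  AVG(rating) OVER (PARTITION BY release_year) AS year_avg,
  -- average over all films in the same category
  AVG(rating) OVER (PARTITION BY category_id) AS category_avg
FROM films f;
</code></pre></div></div>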
<h5 id="example-3-for-each-film-find-its-ranking-position-within-its-release-year">Example 3. For each film find its ranking position within its release year.</h5>
<p><img src="/images/blog/advanced-sql-window-functions/result-rows-rank.svg" alt="Result rows with year's ranking position" class="center-image" />
<em>Result rows with year’s ranking position</em></p>
<p>This task is different, because each row now has a distinct value within a partition - its position according to the <code class="language-plaintext highlighter-rouge">rating</code>. To solve this we have to use one of the order-aware functions - <code class="language-plaintext highlighter-rouge">RANK()</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="n">RANK</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">year_rank</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">RANK()</code> returns the position of a row within a window (with appropriate gaps when two or more rows have the same rank). To make it possible we had not only to partition the row set by a release year, but also to ensure that the rows inside each partition are sorted properly (otherwise we would get just rubbish). That’s why we used the <code class="language-plaintext highlighter-rouge">ORDER BY</code> clause.</p>
<h5 id="example-4-for-each-film-find-its-general-ranking-position">Example 4. For each film find its general ranking position.</h5>
<p><img src="/images/blog/advanced-sql-window-functions/result-rows-general-rank.svg" alt="Result rows with general ranking position" class="center-image" />
<em>Result rows with general ranking position</em></p>
<p>It’s also possible to have the <code class="language-plaintext highlighter-rouge">ORDER BY</code> without <code class="language-plaintext highlighter-rouge">PARTITION BY</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="n">RANK</span><span class="p">()</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span> <span class="k">DESC</span><span class="p">)</span> <span class="k">AS</span> <span class="n">general_rank</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
</code></pre></div></div>
<p>This way we instructed the database to create one big partition with all rows. It’s useful when we want to operate on the whole row set altogether.</p>
<h5 id="example-5-for-each-film-find-the-rating-of-the-best-film-in-its-release-year">Example 5. For each film find the rating of the best film in its release year.</h5>
<p><img src="/images/blog/advanced-sql-window-functions/result-rows-best-rating.svg" alt="Result rows with year's best rating" class="center-image" />
<em>Result rows with year’s best rating</em></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="n">FIRST_VALUE</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span> <span class="k">DESC</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
</code></pre></div></div>
<p>In the query above I used a new function, <code class="language-plaintext highlighter-rouge">FIRST_VALUE()</code>, which returns the requested value of the first row in a window. There are also <a href="https://www.postgresql.org/docs/current/static/functions-window.html">other similar functions</a> like <code class="language-plaintext highlighter-rouge">LAST_VALUE()</code> or <code class="language-plaintext highlighter-rouge">NTH_VALUE()</code>, returning the value of the last or a specific row, respectively. What’s worth mentioning here is that it’s possible to change the boundaries of a window, so that it doesn’t contain the whole partition. This can be done by using the <code class="language-plaintext highlighter-rouge">RANGE</code> or <code class="language-plaintext highlighter-rouge">ROWS</code> clause.</p>
<h5 id="example-6-for-each-film-find-an-average-rating-of-all-better-films-in-its-release-year">Example 6. For each film find an average rating of all better films in its release year.</h5>
<p><img src="/images/blog/advanced-sql-window-functions/result-rows-avg-better.svg" alt="Result rows with an average rating of better films" class="center-image" />
<em>Result rows with an average rating of better films</em></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
<span class="n">f</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">release_year</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">category_id</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">rating</span><span class="p">,</span>
<span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">release_year</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">rating</span> <span class="k">DESC</span>
<span class="k">ROWS</span> <span class="k">BETWEEN</span> <span class="n">UNBOUNDED</span> <span class="n">PRECEDING</span> <span class="k">AND</span> <span class="mi">1</span> <span class="n">PRECEDING</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">films</span> <span class="n">f</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING</code> part instructs the database to move the end of the window frame. By default, when <code class="language-plaintext highlighter-rouge">ORDER BY</code> is present, the frame runs from the start of the partition up to the current row (together with its peers). Now it stops at the row right before the current row. So, effectively we operate only on rows that have a higher <code class="language-plaintext highlighter-rouge">rating</code>.</p>
<p>Things get more complicated when the <code class="language-plaintext highlighter-rouge">rating</code> column contains duplicates. To achieve the same result we would need to use the <code class="language-plaintext highlighter-rouge">RANGE</code> modifier instead of <code class="language-plaintext highlighter-rouge">ROWS</code>, but unfortunately Postgres <a href="https://sonra.io/2017/09/15/window-functions-vendor-functionality-comparison/">doesn’t currently support</a> the <code class="language-plaintext highlighter-rouge">1 PRECEDING</code> part in that case.</p>
<div class="infobox">
<p>A lot of new features have been added to Postgres since version 10. See <a href="/advances-sql-window-frames/">my article about window frames</a>, which covers the latest developments in window functions.</p>
</div>
<h3 id="conclusion">Conclusion</h3>
<p>Window functions are my favorite advanced SQL feature. They simply allow us to perform aggregations without actually aggregating the result set. They are a flexible way to create sophisticated SQL queries that would otherwise need to be long, complicated and hard to read and maintain. PostgreSQL and other databases offer a wide variety of different functions and options to specify the exact subset of rows we’d like to operate on.</p>
<p>Window functions were introduced in <a href="https://en.wikipedia.org/wiki/SQL:2003">SQL 2003</a>. Quite some time ago. Therefore, almost all popular RDBMSes implement them <a href="https://sonra.io/2017/09/15/window-functions-vendor-functionality-comparison/">at least to some extent</a>. The only exceptions are MySQL and SQLite. When it comes to MySQL however, <a href="https://dev.mysql.com/doc/refman/8.0/en/mysql-nutshell.html">it has been announced</a> that the upcoming version 8.0 will support window functions.</p>
<p>Even though window functions are often considered “advanced SQL”, I believe that they are something that every SQL-oriented software developer should be familiar with.</p>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="http://modern-sql.com/">Modern SQL</a></li>
<li><a href="https://www.postgresql.org/docs/current/static/tutorial-window.html">PostgreSQL documentation</a></li>
</ul>This post starts a series of articles discussing advanced SQL concepts that are well supported in popular database management systems for quite some time, but somehow many people still don’t know about their existence. I’d like to explain them with examples, first giving a problem to solve using “plain old” SQL and then showing a better solution using advanced SQL.5 things about programming I learned with Go2017-08-22T08:00:00+00:002017-08-22T08:00:00+00:00https://mjk.space/5-things-about-programming-learned-with-go<p>Go has been gaining significant popularity over the last few months. Language-related articles and blog posts are written every day. New Go projects are started on GitHub. Go conferences and meetups attract more and more people. This language certainly has its time now. It became a <a href="https://www.tiobe.com/tiobe-index/go/">language of the year 2016</a> according to TIOBE and recently even made its way to their elite club of 10 most popular languages in the world.</p>
<p>I came across Go a year ago and decided to give it a try. After spending some time with it I can say that it’s definitely a language worth learning. Even if you’re not planning to use it in the long run, playing with it for a while may help you to improve your programming skills in general. In this post I’d like to tell you about five things that I’ve learned with Go and found useful in other languages.</p>
<p><img src="/images/blog/5-things/go-mascot.svg" alt="Gopher - Go's mascot" class="center-image" width="150px" />
<em>Gopher - Go’s mascot</em></p>
<h3 id="1-it-is-possible-to-have-both-dynamic-like-syntax-and-static-safety">1. It is possible to have both dynamic-like syntax and static safety</h3>
<p>On a daily basis I work in Ruby and I really like its dynamic typing system. It makes the language easy to learn, easy to use and allows programmers to write code very quickly. In my opinion, however, it works well mostly in smaller codebases. When my project starts to grow and becomes more and more complicated I tend to miss the safety and reliability that statically typed languages provide. Even if I test my code carefully, it can always happen that I forget to cover some edge case and suddenly my object appears in a context that I didn’t expect. Is it possible then to have a dynamic-like programming language without giving up static safety at the same time? I think so. Let me speak in Go code!</p>
<p><del>Go is not an object-oriented language.</del> There is an ongoing discussion whether Go is or is not an object-oriented language <a href="https://www.quora.com/Is-the-programming-language-Go-a-functional-or-object-oriented-programming-language">[1]</a><a href="https://nathany.com/good/">[2]</a>. Even the authors don’t have a strong opinion <a href="https://golang.org/doc/faq#Is_Go_an_object-oriented_language">[3]</a>. But one of the OO features that Go definitely has is interfaces. And they are pretty much the same as those you can find in Java or C++. They have names and define a set of function signatures:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Animal</span> <span class="k">interface</span> <span class="p">{</span>
<span class="n">Speak</span><span class="p">()</span> <span class="kt">string</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Then we have Go’s equivalent of classes - structs. Structs are simple things that bundle together attributes:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Dog</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">name</span> <span class="kt">string</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now we can add a function to the struct:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="p">(</span><span class="n">d</span> <span class="n">Dog</span><span class="p">)</span> <span class="n">Speak</span><span class="p">()</span> <span class="kt">string</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">"Woof!"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It means that, from now on, you can invoke that function on any instance of the <code class="language-plaintext highlighter-rouge">Dog</code> struct.</p>
<p>This piece of code may seem strange at first. Why did we write it outside the struct? And what is this weird <code class="language-plaintext highlighter-rouge">(d Dog)</code> part before the function name? Let me explain. The authors of Go wanted to give users more flexibility by allowing them to add their logic to any type they like (as long as it is a part of the same package). <del>Even to the ones they’re not authors of (like some external libraries). Therefore they decided to keep functions outside the structs.</del> And because the compiler needs to know which type you’re extending, you have to specify its name explicitly and put it into this strange part called the <em>receiver</em>.</p>
<p>To use the above code we can write a function that simply takes <code class="language-plaintext highlighter-rouge">Animal</code> as an argument and calls its method.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">SaySomething</span><span class="p">(</span><span class="n">a</span> <span class="n">Animal</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">Speak</span><span class="p">())</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And as you can imagine we’re gonna put the <code class="language-plaintext highlighter-rouge">Dog</code> as an argument to the <code class="language-plaintext highlighter-rouge">SaySomething</code> function:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dog</span> <span class="o">:=</span> <span class="n">Dog</span><span class="p">{</span><span class="n">name</span><span class="o">:</span> <span class="s">"Charlie"</span><span class="p">}</span>
<span class="n">SaySomething</span><span class="p">(</span><span class="n">dog</span><span class="p">)</span>
</code></pre></div></div>
<p>“Very well”, you think, “but what do we need to do for the <code class="language-plaintext highlighter-rouge">Dog</code> to implement <code class="language-plaintext highlighter-rouge">Animal</code> interface?” Absolutely nothing, it’s done already! Go uses a concept called “automatic interface implementation”. A struct containing all methods defined in the interface automatically fulfills it. There is no <code class="language-plaintext highlighter-rouge">implements</code> keyword. Isn’t that cool? A friend of mine even likes to call it “a statically typed duck typing”, referring to the famous principle:</p>
<blockquote>
<p>“If it quacks like a duck, then it probably is a duck”.</p>
</blockquote>
<p>Thanks to that feature and type inference that allows us to omit the type of a variable while defining it, we can feel like we’re working in a dynamically typed language. But here we get the safety of a static type system too.</p>
<p>Why is this important? If your project is written in a dynamic, high-level language, one day you may find out that some parts of it need to be rewritten in a lower-level, compiled language. I noticed, however, that it’s quite hard to convince a Ruby or Python programmer to start writing in a static language and give up the flexibility they had. But it may be easier to do with “statically-duck-typed” Go.</p>
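<p>The pieces from this section can be put together into one small program. This is only a sketch: the <code class="language-plaintext highlighter-rouge">Describe</code> function and the <code class="language-plaintext highlighter-rouge">Robot</code> type are names I made up for illustration. It shows that <code class="language-plaintext highlighter-rouge">Dog</code> satisfies <code class="language-plaintext highlighter-rouge">Animal</code> without declaring it, and that a type without the method is rejected by the compiler rather than at runtime.</p>

```go
package main

import "fmt"

// Animal is satisfied by any type that has a Speak() string method.
type Animal interface {
	Speak() string
}

// Dog never mentions Animal, yet it satisfies the interface
// because it has the right method.
type Dog struct {
	name string
}

func (d Dog) Speak() string {
	return "Woof!"
}

// Robot is a hypothetical type without a Speak method, used only for contrast.
type Robot struct{}

// Describe (a made-up helper) accepts anything that satisfies Animal.
func Describe(a Animal) string {
	return a.Speak()
}

func main() {
	dog := Dog{name: "Charlie"} // the type of dog is inferred as Dog
	fmt.Println(Describe(dog))

	// Uncommenting the next line makes the build fail,
	// because Robot has no Speak method:
	// Describe(Robot{})
}
```

<p>The duck-typing feel comes from the first call, the static safety from the commented-out one.</p>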
<h3 id="2-its-better-to-compose-than-inherit">2. It’s better to compose than inherit</h3>
<p>In my <a href="/how-to-avoid-inheritance-in-ruby/">previous blog post</a> I described a problem that we can run into if we use object-oriented features too much. I told the story of a client who initially asks for software that can be modeled with a single class and then gradually extends his concept, in a way that makes inheritance seem like a perfect answer to his increasing demands. Unfortunately, going that way led us to a huge tree of closely related classes where adding new logic, maintaining simplicity and avoiding code duplication was very hard.</p>
<p>My conclusion to that story was that if we want to mitigate the risk of getting lost inside the dark forest of code complexity we need to <strong>avoid inheritance</strong> and <strong>prefer composition</strong> instead. I know however that it can be hard to change your mind from one paradigm to another. In my case the thing that helped me the most was writing code in a language that doesn’t support inheritance at all. You guessed it - that language was Go.</p>
<p>Go doesn’t have the concept of inheriting structs by design. The authors wanted to keep the language simple and clear. They didn’t find inheritance necessary, but they included a feature that is particularly useful when you want to use composition. In order to describe it, I’ll use an example taken from that other blog post.</p>
<p>Let’s say that we’re modeling a vehicle that can have different types of engines and bodies:</p>
<p><img src="/images/blog/inheritance/composition.png" alt="Vehicle" class="center-image" /></p>
<p>Let’s create two interfaces representing these features:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Engine</span> <span class="k">interface</span> <span class="p">{</span>
<span class="n">Refill</span><span class="p">()</span>
<span class="p">}</span>
<span class="k">type</span> <span class="n">Body</span> <span class="k">interface</span> <span class="p">{</span>
<span class="n">Load</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now we need to create a <code class="language-plaintext highlighter-rouge">Vehicle</code> struct that will compose the above interfaces:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Vehicle</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">Engine</span>
<span class="n">Body</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Can you see anything strange here? I deliberately omitted the names of the fields that hold these interfaces. This way I used a feature called embedding. From now on every method of an embedded interface will also be visible directly on the <code class="language-plaintext highlighter-rouge">Vehicle</code> struct itself. That means that we can invoke, let’s say, the <code class="language-plaintext highlighter-rouge">Refill()</code> method on any instance of <code class="language-plaintext highlighter-rouge">Vehicle</code> and Go will pass that through to the Engine implementation. We get a proper composition for free and we don’t need to add any explicit delegation boilerplate. That’s how it works in practice:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vehicle</span> <span class="o">:=</span> <span class="n">Vehicle</span><span class="p">{</span><span class="n">Engine</span><span class="o">:</span> <span class="n">PetrolEngine</span><span class="p">{},</span> <span class="n">Body</span><span class="o">:</span> <span class="n">TruckBody</span><span class="p">{}}</span>
<span class="n">vehicle</span><span class="o">.</span><span class="n">Refill</span><span class="p">()</span>
<span class="n">vehicle</span><span class="o">.</span><span class="n">Load</span><span class="p">()</span>
</code></pre></div></div>
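<p>For completeness, here is a self-contained sketch of the same idea. The concrete types (<code class="language-plaintext highlighter-rouge">PetrolEngine</code>, <code class="language-plaintext highlighter-rouge">ElectricEngine</code>, <code class="language-plaintext highlighter-rouge">TruckBody</code>) are only illustrative implementations, and I assumed that the methods return a string, so that the delegation is easy to observe.</p>

```go
package main

import "fmt"

// Engine and Body mirror the interfaces above, except that the methods
// return a string here (an assumption made for this sketch).
type Engine interface {
	Refill() string
}

type Body interface {
	Load() string
}

// Illustrative implementations; their names and messages are made up.
type PetrolEngine struct{}

func (PetrolEngine) Refill() string { return "filling the tank with petrol" }

type ElectricEngine struct{}

func (ElectricEngine) Refill() string { return "charging the battery" }

type TruckBody struct{}

func (TruckBody) Load() string { return "loading cargo onto the truck" }

// Vehicle embeds both interfaces, so their methods are promoted to Vehicle.
type Vehicle struct {
	Engine
	Body
}

func main() {
	truck := Vehicle{Engine: PetrolEngine{}, Body: TruckBody{}}
	fmt.Println(truck.Refill()) // delegated to PetrolEngine
	fmt.Println(truck.Load())   // delegated to TruckBody

	// Swapping a part needs no new subclass, just a different value.
	electricTruck := Vehicle{Engine: ElectricEngine{}, Body: TruckBody{}}
	fmt.Println(electricTruck.Refill()) // delegated to ElectricEngine
}
```

<p>Adding another engine or body type means writing one more small struct, not another level of a class tree.</p>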
<p>If you can’t switch your mind to prefer composition over inheritance in your object-oriented language - try Go and write something more complex than “hello world”. Because it doesn’t support inheritance at all, you’re gonna need to learn how to compose. Quickly.</p>
<h3 id="3-channels-and-goroutines-are-powerful-way-to-solve-problems-involving-concurrency">3. Channels and goroutines are a powerful way to solve problems involving concurrency</h3>
<p>Go has some really simple and cool tools that help you work with concurrency: channels and goroutines. What are they?</p>
<p>Goroutines are Go’s “green threads”. As you can imagine, they are not handled by an operating system, but by the Go scheduler that is included in each binary. And fortunately this scheduler is smart enough to automatically utilize all CPU cores. Goroutines are small and lightweight, therefore you can easily create many of them and get advanced parallelism for free.</p>
<p>Channel is a simple “pipe” you can use to connect goroutines together. You can take it, write something to one end and read it from the other end. It simply allows goroutines to communicate with each other in an asynchronous way.</p>
<p>Here is a quick example of how they can work together. Let’s imagine that we’ve got a function that runs a long computation and we don’t want it to block the whole program. This is what can be done:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">HeavyComputation</span><span class="p">(</span><span class="n">ch</span> <span class="k">chan</span> <span class="kt">int32</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// long, serious math stuff</span>
<span class="n">ch</span> <span class="o"><-</span> <span class="n">result</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As you can see, this function takes a channel in its list of arguments. Once it obtains a result it pushes the computed value directly to that channel.</p>
<p>Now let’s see how we can use it. First we need to create a new channel of type <code class="language-plaintext highlighter-rouge">int32</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ch</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="kt">int32</span><span class="p">)</span>
</code></pre></div></div>
<p>Then we can call our heavy function:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">go</span> <span class="n">HeavyComputation</span><span class="p">(</span><span class="n">ch</span><span class="p">)</span>
</code></pre></div></div>
<p>Here comes a bit of magic - the <code class="language-plaintext highlighter-rouge">go</code> keyword. You can put it in front of any function call. Go will then create a new goroutine with the same address space and use it to run the function. All of this happens in the background, so the execution will return immediately to allow you to do other things.</p>
<p>And that’s exactly what’s gonna happen in this case. The just created goroutine will live asynchronously doing its job and then it’ll send the result to the channel once ready. We can try to obtain the result in the following way:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">result</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">ch</span>
</code></pre></div></div>
<p>If the result is ready, we’ll get it immediately. Otherwise we’ll block here until <code class="language-plaintext highlighter-rouge">HeavyComputation</code> finishes and writes back to the channel.</p>
<p>Goroutines and channels are simple, yet very powerful mechanisms to work with concurrency and parallelism. Once you learn them, you’ll get a fresh look at how to solve these kinds of problems. They offer an approach that is similar to the actor model known from languages and frameworks like Erlang and Akka, but I think they give more flexibility.</p>
<p>Programmers of other languages seem to start noticing their advantages. For instance, the authors of <a href="https://github.com/ruby-concurrency/concurrent-ruby">concurrent-ruby</a> library, an unopinionated concurrency tools framework, ported Go’s channels directly to their project.</p>
<p>With that knowledge we can jump directly to the next paragraph.</p>
<h3 id="4-dont-communicate-by-sharing-memory-share-memory-by-communicating">4. Don’t communicate by sharing memory, share memory by communicating.</h3>
<p>Traditional programming languages with their standard libraries (like C++, Java, Ruby or Python) encourage users to tackle concurrency problems in a way that many threads should have access to the same shared memory. In order to synchronize them and avoid simultaneous access programmers use locks. Locks prevent two threads from accessing a shared resource at the same time.</p>
<p>An example of this concept in Ruby may look like this:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lock</span> <span class="o">=</span> <span class="no">Mutex</span><span class="p">.</span><span class="nf">new</span>
<span class="n">a</span> <span class="o">=</span> <span class="no">Thread</span><span class="p">.</span><span class="nf">new</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="nf">synchronize</span> <span class="p">{</span>
<span class="c1"># access shared resource</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">b</span> <span class="o">=</span> <span class="no">Thread</span><span class="p">.</span><span class="nf">new</span> <span class="p">{</span>
<span class="n">lock</span><span class="p">.</span><span class="nf">synchronize</span> <span class="p">{</span>
<span class="c1"># access shared resource</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Thanks to goroutines and channels Go programmers can take a different approach. Instead of using locks to control access to a shared resource, they can simply use channels to pass around its pointer. Then only a goroutine that holds the pointer can use it and make modifications to the shared structure.</p>
<p>There is a great explanation in <a href="https://golang.org/doc/effective_go.html#sharing">Go’s documentation</a> that helped me to understand this mechanism:</p>
<blockquote>
<p>One way to think about this model is to consider a typical single-threaded program running on one CPU. It has no need for synchronization primitives. Now run another such instance; it too needs no synchronization. Now let those two communicate; if the communication is the synchronizer, there’s still no need for other synchronization.</p>
</blockquote>
<p>This is definitely not a new idea, but somehow to many people a lock is still the default solution for any concurrency problem. Of course it doesn’t mean that locking is useless. It can be used to implement simple things, like an atomic counter. But for higher level abstractions it’s good to consider different techniques, like the one that authors of Go suggest.</p>
<h3 id="5-there-is-nothing-exceptional-in-exceptions">5. There is nothing exceptional in exceptions</h3>
<p>Programming languages that handle errors in the form of exceptions encourage users to think about them in a certain way. They are called “exceptions”, so something exceptional, extraordinary and uncommon must happen for the “exception” to be triggered, right? Maybe I shouldn’t care too much about it? Maybe I can just pretend it won’t happen?</p>
<p>Go is different, because it doesn’t have the concept of exceptions by design. It might look like a lack of feature is called a feature, but it actually makes sense if you think about it for a while. In fact there is nothing exceptional in exceptions. They are usually just one of possible return values from a function. IO error during socket communication? It’s a network so we need to be prepared. No space left on device? It happens, nobody has unlimited hard drive. Database record not found? Well, doesn’t sound like something impossible.</p>
<p>If errors are merely return values why should we treat them differently? We shouldn’t. Here is how they are handled in Go. Let’s try to open a file:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="s">"filename.ext"</span><span class="p">)</span>
</code></pre></div></div>
<p>As you can see, this function (like many others in Go) returns two values - the file handle and the error. The whole safety check is as simple as comparing the error to <code class="language-plaintext highlighter-rouge">nil</code>. When the file is successfully opened we receive the handle, and the error is set to <code class="language-plaintext highlighter-rouge">nil</code>. Otherwise we can find the error value there.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
<span class="c">// do something with the file</span>
</code></pre></div></div>
<p>To be honest I’m not sure if this is the most beautiful way of handling errors I’ve ever seen, but it definitely does a good job in encouraging programmers not to ignore them. You can’t simply omit assigning the second return value. In case you do, Go will complain:</p>
<p><code class="language-plaintext highlighter-rouge">multiple-value os.Open() in single-value context</code></p>
<p>Go will also force you to read it later at least once. Otherwise you’ll get another error:</p>
<p><code class="language-plaintext highlighter-rouge">err declared and not used</code></p>
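<p>A complete, runnable version of this pattern might look as follows. The <code class="language-plaintext highlighter-rouge">FileSize</code> helper and the file name are made up for this sketch. The point is that every error is checked right where it appears and then passed up as an ordinary return value.</p>

```go
package main

import (
	"fmt"
	"os"
)

// FileSize (a hypothetical helper) returns the size of the file at path.
// Errors are ordinary return values that the caller has to look at.
func FileSize(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := FileSize("filename.ext")
	if err != nil {
		// A missing file is just another possible outcome to handle.
		fmt.Println("could not read the file:", err)
		return
	}
	fmt.Println("size in bytes:", size)
}
```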
<p>Regardless of the language that you use on a daily basis it’s good to think about exceptions like they were regular return values. Don’t pretend that they just won’t occur. Bad things usually happen at the least expected moment. Don’t leave your catch blocks empty.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Go is an interesting language that presents a different approach to writing code. It deliberately omits some features that we know from other languages, like inheritance or exceptions. Instead it encourages users to tackle problems with its own toolset. Therefore, if you want to write maintainable, clean and robust code, you have to start thinking in a different, Go-like way. This is however a good thing, since the skills that you learn here can be successfully used in other languages. Your mileage may vary, but I think that once you start playing with Go you’ll quickly find out that it actually helps you become a better programmer in general.</p>Go has been gaining significant popularity over the last few months. Language-related articles and blog posts are written every day. New Go projects are started on GitHub. Go conferences and meetups attract more and more people. This language certainly has its time now. It became a language of the year 2016 according to TIOBE and recently even made its way to their elite club of 10 most popular languages in the world.How to avoid inheritance in Ruby?2017-07-10T10:00:00+00:002017-07-10T10:00:00+00:00https://mjk.space/how-to-avoid-inheritance-in-ruby<p>What’s wrong with inheritance? Let me illustrate it with an example.</p>
<p>Let’s say that a client asked you to create a traffic simulator application. He wants it to be able to simulate the movement of some vehicles. If you use an object-oriented language like Ruby, you’ll probably come up with a model class that contains all the logic and properties, like this:</p>
<p><img src="/images/blog/inheritance/inheritance_level1.png" alt="Vehicle class" class="center-image" />
<em>Vehicle class</em></p>
<p>You get back to the client with the complete solution. Fine. But now he tells you that he wants these vehicles to be either cars or trucks. You know these types will share at least some behavior, so you don’t want to duplicate the code. No problem! Let’s use inheritance! It’s a proper <em>is-a</em> relationship, so why not?</p>
<p><img src="/images/blog/inheritance/inheritance_level2.png" alt="Vehicle with inheritance" class="center-image" />
<em>Vehicle with inheritance</em></p>
<p>The client is happy again. But then he gets back to you and says that it would be great if these cars and trucks could have different types of engines. Let’s say: petrol or electric ones. Again, inheritance to the rescue!</p>
<p><img src="/images/blog/inheritance/inheritance_level3.png" alt="Full inheritance tree" class="center-image" />
<em>Vehicle with even more inheritance</em></p>
<p>The client is more than happy now. But what if he calls you back to ask for another fragmentation level? Say private cars, police cruisers, fire brigade trucks, ambulances and so on? Our inheritance tree will grow bigger and become more complicated. Instead of reducing code duplication, we’ll end up with the same logic in many places. There is even a <a href="https://en.wikipedia.org/wiki/Combinatorial_explosion">Wikipedia article</a> describing this phenomenon.</p>
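<p>To get a feel for how quickly the tree grows, here is a minimal sketch that simply counts the leaf classes a pure inheritance design would need. The variant lists and the <em>leaf_classes</em> helper are made up for illustration; they are not part of the simulator.</p>

```ruby
# Hypothetical variant lists, used only to illustrate the growth.
BODIES   = %w[Car Truck].freeze
ENGINES  = %w[Petrol Electric].freeze
PURPOSES = %w[Private Police FireBrigade Ambulance].freeze

# In a pure inheritance tree every combination needs its own leaf class.
def leaf_classes(*dimensions)
  first, *rest = dimensions
  first.product(*rest).map { |parts| parts.reverse.join }
end

two_levels   = leaf_classes(BODIES, ENGINES)
three_levels = leaf_classes(BODIES, ENGINES, PURPOSES)

puts two_levels.inspect # ["PetrolCar", "ElectricCar", "PetrolTruck", "ElectricTruck"]
puts two_levels.size    # 4
puts three_levels.size  # 16
```

<p>Each new level multiplies the number of leaf classes, and every shared piece of logic has to live in several of them.</p>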
<p>This is not an artificial problem that I’ve just made up. I encountered it many times during my professional career, whether developing a new feature or trying to add new behavior to legacy code. It’s even more likely to happen when you use Rails, which forces you to inherit from classes like <em>ApplicationRecord</em> or <em>ApplicationController</em>.</p>
<p>For reference, here is the code that inheritance may produce:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Vehicle</span>
  <span class="k">def</span> <span class="nf">run</span>
    <span class="n">refill</span>
    <span class="nb">load</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">Car</span> <span class="o">&lt;</span> <span class="no">Vehicle</span>
  <span class="k">def</span> <span class="nf">load</span>
    <span class="c1"># load passengers</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">Truck</span> <span class="o">&lt;</span> <span class="no">Vehicle</span>
  <span class="k">def</span> <span class="nf">load</span>
    <span class="c1"># load cargo</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">PetrolCar</span> <span class="o">&lt;</span> <span class="no">Car</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with fuel</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">ElectricCar</span> <span class="o">&lt;</span> <span class="no">Car</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with electricity</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">PetrolTruck</span> <span class="o">&lt;</span> <span class="no">Truck</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with fuel (code duplication!)</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">ElectricTruck</span> <span class="o">&lt;</span> <span class="no">Truck</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with electricity (code duplication!)</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Can we do something about it? Yes, we can.</p>
<h3 id="maybe-mixins">Maybe mixins?</h3>
<p>Mixins are usually the first thing that comes to the minds of Ruby programmers when they notice that inheritance is not a solution anymore. What are they? Basically, they are modules with a set of methods that can be included into a class and become an indistinguishable part of it. We can simply use them to extract any common logic and avoid code duplication.</p>
<p>Let’s see what we can do with the mixins. First, we need to create the modules that we’ll include later on:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="nn">Vehicle</span>
  <span class="k">def</span> <span class="nf">run</span>
    <span class="n">refill</span>
    <span class="nb">load</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">module</span> <span class="nn">Truck</span>
  <span class="k">def</span> <span class="nf">load</span>
    <span class="c1"># load cargo</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">module</span> <span class="nn">Car</span>
  <span class="k">def</span> <span class="nf">load</span>
    <span class="c1"># load passengers</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">module</span> <span class="nn">ElectricEngine</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with electricity</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">module</span> <span class="nn">PetrolEngine</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with petrol</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Then we can define specific classes and include the mixins that we’ve just created:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PetrolCar</span>
  <span class="kp">include</span> <span class="no">Vehicle</span>
  <span class="kp">include</span> <span class="no">Car</span>
  <span class="kp">include</span> <span class="no">PetrolEngine</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">ElectricCar</span>
  <span class="kp">include</span> <span class="no">Vehicle</span>
  <span class="kp">include</span> <span class="no">Car</span>
  <span class="kp">include</span> <span class="no">ElectricEngine</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">PetrolTruck</span>
  <span class="kp">include</span> <span class="no">Vehicle</span>
  <span class="kp">include</span> <span class="no">Truck</span>
  <span class="kp">include</span> <span class="no">PetrolEngine</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">ElectricTruck</span>
  <span class="kp">include</span> <span class="no">Vehicle</span>
  <span class="kp">include</span> <span class="no">Truck</span>
  <span class="kp">include</span> <span class="no">ElectricEngine</span>
<span class="k">end</span>
</code></pre></div></div>
<p>It looks better: no code is duplicated, we can add a new level of specialization and easily build any type of vehicle. It’s also clear what features our vehicles have.</p>
<p>There are still some problems though. When you look at one of these classes, you can’t be sure how the included behavior is used. A mixin adds a couple of new methods, but it’s not immediately obvious what they are, how the class interacts with them or how they affect the execution flow. If by any chance two modules contain methods with the same name, you’re going to run into problems - the method from the module included last silently overrides the other one. In the same way a module can mess up the code in your own class.</p>
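<p>The silent override is easy to reproduce. In the sketch below the <em>Radio</em> and <em>Engine</em> modules and the <em>Taxi</em> class are hypothetical names chosen just for this demonstration; both modules happen to define a method called <em>start</em>.</p>

```ruby
# Hypothetical modules that accidentally share a method name.
module Radio
  def start
    "radio on"
  end
end

module Engine
  def start
    "engine running"
  end
end

class Taxi
  include Radio
  include Engine # included last, so Engine#start is found first
end

taxi = Taxi.new
puts taxi.start                         # prints "engine running"
puts Taxi.instance_method(:start).owner # prints Engine
```

<p>Ruby raises no error and prints no warning here. <em>Radio#start</em> is still defined, but nothing calls it unless <em>Engine#start</em> invokes <em>super</em>.</p>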
<p>Mixins are not bad and there are definitely some good use cases for them. In my opinion they might work well when you want to define meta behavior of a class like logging, authorization or validation. The good thing is that they keep the code clean and small. They’re fine as long as you trust their implementation and know that they don’t break any other logic. The thing to remember is that in fact they’re just <strong>a way to implicitly implement multiple inheritance</strong> in Ruby.</p>
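<p>One way to see the inheritance hiding behind mixins is to ask a class for its ancestors. The sketch below reuses the module names from the example above; the string return values are placeholders that I added so there is something to print.</p>

```ruby
module Vehicle
  def run
    [refill, load]
  end
end

module Car
  def load
    "passengers loaded" # placeholder return value, assumed for the demo
  end
end

module PetrolEngine
  def refill
    "petrol refilled" # placeholder return value, assumed for the demo
  end
end

class PetrolCar
  include Vehicle
  include Car
  include PetrolEngine
end

# Included modules are inserted into the method lookup chain,
# right after the class itself and in reverse order of inclusion.
puts PetrolCar.ancestors.take(4).inspect # [PetrolCar, PetrolEngine, Car, Vehicle]
puts PetrolCar.new.is_a?(Car)            # true
puts PetrolCar.new.run.inspect           # ["petrol refilled", "passengers loaded"]
```

<p>As far as Ruby is concerned, a <em>PetrolCar</em> object <em>is a</em> <em>Car</em>, a <em>Vehicle</em> and a <em>PetrolEngine</em>, all at the same time.</p>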
<blockquote class="twitter-tweet" data-lang="pl"><p lang="en" dir="ltr">In OOP there’s this thing to prefer composition over inheritance. And in Ruby people constantly forget that modules == multiple inheritance</p>— Piotr Solnica (@_solnic_) <a href="https://twitter.com/_solnic_/status/623224611212251136">20 lipca 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Can we do better? Yes, we can!</p>
<h3 id="composition">Composition</h3>
<p>Composition is a term that I’ve known for a long time but started using only recently. I simply didn’t <em>feel</em> it well enough to be able to use it comfortably. Then one day I came across an absolutely fantastic talk given by Sandi Metz in 2015 in Atlanta, called <a href="https://www.youtube.com/watch?v=OMPfEXIlTVE">Nothing is something</a>. Among other things she speaks about composition and solves exactly the same problem that I mentioned in the beginning.</p>
<p>How does composition work? Instead of trying to share <strong>the same</strong> behavior between classes, you should identify the concepts behind the things that <strong>differ</strong>, name them, extract them into separate classes and then compose them into your final object.</p>
<p>If inheritance is about an <em>is-a</em> relationship, then composition is about <em>has-a</em>. Therefore we’ve got to change the structure of our problem in order to leverage composition. Our vehicle <strong>is not</strong> an electric vehicle anymore but rather it <strong>has</strong> an electric engine. It <strong>is not</strong> a truck but it <strong>has</strong> a truck body. In that way we can identify two concepts: <strong>engine</strong> and <strong>body</strong>.</p>
<p>The structure of our application can now look like this. We have implemented the engine and body concepts and created two placeholders for them in the <em>Vehicle</em> class:</p>
<p><img src="/images/blog/inheritance/composition.png" alt="Composition" class="center-image" />
<em>Composition in action</em></p>
<p>What does it look like in the code? Let’s start with the main class:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Vehicle</span>
  <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="n">engine</span><span class="p">:,</span> <span class="n">body</span><span class="p">:)</span>
    <span class="vi">@engine</span> <span class="o">=</span> <span class="n">engine</span>
    <span class="vi">@body</span> <span class="o">=</span> <span class="n">body</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="nf">run</span>
    <span class="vi">@engine</span><span class="p">.</span><span class="nf">refill</span>
    <span class="vi">@body</span><span class="p">.</span><span class="nf">load</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Now we can create the implementations of our concepts. We will inject them into the <em>Vehicle</em> object.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ElectricEngine</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with electricity</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">PetrolEngine</span>
  <span class="k">def</span> <span class="nf">refill</span>
    <span class="c1"># refill with petrol</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">TruckBody</span>
  <span class="k">def</span> <span class="nf">load</span>
    <span class="c1"># load cargo</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="k">class</span> <span class="nc">CarBody</span>
  <span class="k">def</span> <span class="nf">load</span>
    <span class="c1"># load passengers</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Finally, we can put everything together:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">petrol_car</span> <span class="o">=</span> <span class="no">Vehicle</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">engine: </span><span class="no">PetrolEngine</span><span class="p">.</span><span class="nf">new</span><span class="p">,</span> <span class="ss">body: </span><span class="no">CarBody</span><span class="p">.</span><span class="nf">new</span><span class="p">)</span>
<span class="n">electric_car</span> <span class="o">=</span> <span class="no">Vehicle</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">engine: </span><span class="no">ElectricEngine</span><span class="p">.</span><span class="nf">new</span><span class="p">,</span> <span class="ss">body: </span><span class="no">CarBody</span><span class="p">.</span><span class="nf">new</span><span class="p">)</span>
<span class="n">petrol_truck</span> <span class="o">=</span> <span class="no">Vehicle</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">engine: </span><span class="no">PetrolEngine</span><span class="p">.</span><span class="nf">new</span><span class="p">,</span> <span class="ss">body: </span><span class="no">TruckBody</span><span class="p">.</span><span class="nf">new</span><span class="p">)</span>
<span class="n">electric_truck</span> <span class="o">=</span> <span class="no">Vehicle</span><span class="p">.</span><span class="nf">new</span><span class="p">(</span><span class="ss">engine: </span><span class="no">ElectricEngine</span><span class="p">.</span><span class="nf">new</span><span class="p">,</span> <span class="ss">body: </span><span class="no">TruckBody</span><span class="p">.</span><span class="nf">new</span><span class="p">)</span>
</code></pre></div></div>
<p>This approach has many advantages. The way that the vehicle classes use external logic is perfectly clear at first glance. There are no problems with conflicting names either. Each class does exactly one thing (satisfying the <a href="https://en.wikipedia.org/wiki/Single_responsibility_principle"><strong>single responsibility principle</strong></a>). Therefore you can easily test each of them by checking how well it does that one thing.</p>
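<p>As a rough illustration of that testability, the sketch below checks <em>Vehicle</em> in isolation by injecting a hand-written test double. The <em>Recorder</em> class is a hypothetical helper that I made up; a real project would more likely use the doubles from its testing library.</p>

```ruby
class Vehicle
  def initialize(engine:, body:)
    @engine = engine
    @body = body
  end

  def run
    @engine.refill
    @body.load
  end
end

# Hypothetical test double: it only records which methods were called.
class Recorder
  attr_reader :calls

  def initialize
    @calls = []
  end

  def refill
    @calls.push(:refill)
  end

  def load
    @calls.push(:load)
  end
end

engine = Recorder.new
body = Recorder.new
Vehicle.new(engine: engine, body: body).run

# Vehicle has a single job: delegate to its parts.
raise "engine was not refilled" unless engine.calls == [:refill]
raise "body was not loaded" unless body.calls == [:load]
puts "Vehicle delegates to its engine and body"
```

<p>No real engine or body is needed to verify the only thing <em>Vehicle</em> is responsible for.</p>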
<p>We also achieved <strong>high cohesion</strong> (keeping the same logic together) while maintaining <strong>low coupling</strong> (making classes loosely dependent on each other). We can easily change the code responsible for the engine or the body without worrying about their clients, as long as we don’t change the interface.</p>
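<p>To make the loose coupling concrete, here is a small sketch in which a brand new engine type is plugged in without touching <em>Vehicle</em>. The <em>HydrogenEngine</em> class is hypothetical, and the string return values are placeholders that I added so the result can be printed.</p>

```ruby
class Vehicle
  def initialize(engine:, body:)
    @engine = engine
    @body = body
  end

  def run
    [@engine.refill, @body.load]
  end
end

class CarBody
  def load
    "passengers loaded" # placeholder return value, assumed for the demo
  end
end

class PetrolEngine
  def refill
    "petrol refilled" # placeholder return value, assumed for the demo
  end
end

# Hypothetical new engine type. It only has to respond to #refill.
class HydrogenEngine
  def refill
    "hydrogen refilled"
  end
end

old_car = Vehicle.new(engine: PetrolEngine.new, body: CarBody.new)
new_car = Vehicle.new(engine: HydrogenEngine.new, body: CarBody.new)

puts old_car.run.inspect # ["petrol refilled", "passengers loaded"]
puts new_car.run.inspect # ["hydrogen refilled", "passengers loaded"]
```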
<p>Does composition have any downsides? Of course it does. It tends to make the code longer, especially when it comes to injecting all the dependencies into the final object. You have to write additional boilerplate in order to store references, set up delegations and enforce correct execution flow. As a remedy you can use one of many <a href="https://en.wikipedia.org/wiki/Creational_pattern">creational patterns</a>, like <em>Factory</em> or <em>Builder</em>.</p>
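<p>A factory for our vehicles could look more or less like the sketch below. <em>VehicleFactory</em>, its lookup tables and the <em>attr_reader</em> on <em>Vehicle</em> are my own assumptions, added only to show the idea; the engine and body classes are left empty for brevity.</p>

```ruby
class Vehicle
  attr_reader :engine, :body # readers assumed here so the result can be inspected

  def initialize(engine:, body:)
    @engine = engine
    @body = body
  end
end

class PetrolEngine; end
class ElectricEngine; end
class CarBody; end
class TruckBody; end

# Hypothetical factory that hides the wiring behind symbolic names.
class VehicleFactory
  ENGINES = { petrol: PetrolEngine, electric: ElectricEngine }.freeze
  BODIES = { car: CarBody, truck: TruckBody }.freeze

  def self.build(engine:, body:)
    # Hash#fetch raises KeyError for an unknown name instead of returning nil.
    Vehicle.new(engine: ENGINES.fetch(engine).new, body: BODIES.fetch(body).new)
  end
end

electric_truck = VehicleFactory.build(engine: :electric, body: :truck)
puts electric_truck.engine.class # prints ElectricEngine
puts electric_truck.body.class   # prints TruckBody
```

<p>The callers no longer need to know which classes exist or how to put them together, so the boilerplate stays in one place.</p>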
<p>For me the hardest thing in composition was to change my mindset in order to be able to think about problems in that way. What unexpectedly helped me in this matter was playing with <strong>Go</strong>. It is a programming language that doesn’t have inheritance by design but makes it possible to write code in an object-oriented-like way. It also contains features which encourage programmers to use composition. Once I had spent some time with it, I suddenly realized that I had become way more fluent in using this pattern. I’m going to describe it soon in the next blog post.</p>
<h3 id="conclusion">Conclusion</h3>
<p>I gave you examples of three different approaches to structuring your code: inheritance, mixins and composition. <strong>Inheritance</strong> is the first choice for many programmers but to me it’s extremely overused and makes code complicated and hard to maintain. <strong>Mixins</strong> seem like a smart and more powerful replacement but in fact they are just a way to achieve implicit multiple inheritance, which can even increase code complexity. <strong>Composition</strong> is the most verbose but at the same time the most straightforward and clear approach to managing dependencies between classes. It helps to keep them small, separated and easy to test. It’s my personal favorite.</p>
<p>You have to remember though that object-oriented programming is just a convention that some programmers came up with in order to help other programmers solve their problems. Don’t be a slave to these rules. Choose the solution that fits your situation best.</p>
<p>And above all, keep in mind that:</p>
<blockquote>
<p>Designing object-oriented software is hard, and designing reusable object-oriented software is even harder. - Gang of Four</p>
</blockquote>
<p>It comes with experience.</p>
<h4 id="resources">Resources</h4>
<ul>
<li><a href="https://www.youtube.com/watch?v=OMPfEXIlTVE">RailsConf 2015 - Nothing is Something</a></li>
<li><a href="https://learnrubythehardway.org/book/ex44.html">https://learnrubythehardway.org/book/ex44.html</a></li>
<li><a href="https://robots.thoughtbot.com/reusable-oo-inheritance">https://robots.thoughtbot.com/reusable-oo-inheritance</a></li>
<li><a href="https://www.thoughtworks.com/insights/blog/composition-vs-inheritance-how-choose">https://www.thoughtworks.com/insights/blog/composition-vs-inheritance-how-choose</a></li>
</ul>