SQL FOR DATA ANALYSIS

Nerthiga Balakrishnan
Mar 11, 2021

This article is about using Structured Query Language (SQL) to perform analysis on data stored in relational tables. It covers seven chapters, starting with basic SQL and moving on to advanced SQL features. It is a summary of the course at: https://classroom.udacity.com/courses/ud198

Chapter 1: Basic characteristics of SQL

An Entity Relationship Diagram (ERD) represents how data is arranged in a database. The ERD of an imaginary company called Parch and Posey, which sells different types of paper to customers, is used here to perform data analysis.

All the rows from the orders table with a total amount of less than USD 500 were found using the WHERE clause along with a comparison operator, and the rows were ordered by total amount starting from the smallest value. Since only the ten orders with the lowest totals were wanted, the limit was set to 10. The WHERE clause can also be used to filter non-numeric data; for example, "WHERE name = 'Jacob Lee'" could be used to find information about the customer account with the name Jacob Lee.
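The original query screenshot is not included here; a sketch of such a query, assuming the Parch and Posey column name total_amt_usd from the course, might look like:

```sql
-- Ten cheapest orders under USD 500 (column names assumed from the course schema)
SELECT *
FROM orders
WHERE total_amt_usd < 500
ORDER BY total_amt_usd
LIMIT 10;
```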

An arithmetic operation was performed in the SELECT statement to calculate the percentage of revenue from poster paper for every order. Division and addition were combined here, and the results were again limited to 10.
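A sketch of that calculation, assuming the course's per-paper-type revenue columns (standard_amt_usd, gloss_amt_usd, poster_amt_usd), could be:

```sql
-- Fraction of each order's revenue that came from poster paper
-- (may divide by zero for empty orders; a sketch, not a hardened query)
SELECT id,
       poster_amt_usd / (standard_amt_usd + gloss_amt_usd + poster_amt_usd)
         AS poster_revenue_fraction
FROM orders
LIMIT 10;
```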

The task above was to find the names of customer accounts that start with C or W and contain the letters 'ana' or 'Ana' in primary_poc. This was achieved using the AND, OR and LIKE operators. Names containing the letters 'eana' in primary_poc were excluded from the results using the NOT LIKE operator.
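A query matching that description might read:

```sql
-- Accounts starting with C or W whose contact contains 'ana'/'Ana' but not 'eana'
SELECT name, primary_poc
FROM accounts
WHERE (name LIKE 'C%' OR name LIKE 'W%')
  AND (primary_poc LIKE '%ana%' OR primary_poc LIKE '%Ana%')
  AND primary_poc NOT LIKE '%eana%';
```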

Chapter 2: SQL Joins

Joins in SQL are used to link tables that are related to each other through a particular column. The JOIN clause pulls the required columns from different tables, which is simple, powerful and makes analysis much easier when different tables hold different information.

First, all the rows from the orders table were pulled. The accounts table was then joined to the orders table by specifying a relationship between the two tables, which is done with a logical statement in the ON clause. Here the account id column of the orders table matches the id column of the accounts table, so this relationship is used to link the two tables. Columns from the accounts table can also be pulled in the SELECT statement as shown below.
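In place of the missing screenshot, a sketch of that join (column names assumed from the course schema):

```sql
-- Link each order to its account via orders.account_id = accounts.id
SELECT orders.*, accounts.name
FROM orders
JOIN accounts
  ON orders.account_id = accounts.id;
```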

The code above shows how to join and pull columns from three different tables. The region name and account name of every order were required, along with the order's unit price. All the required columns were first listed in the SELECT statement, and each column in the results was given an alias. The accounts table was then joined to the orders table using the link between the ids of the two tables. Since there was no direct relationship between the regions and accounts tables, the sales_reps table first had to be joined to the accounts table, and the regions table was then joined to it using the link between the regions and sales_reps tables.

The previous code was later modified to include only orders with a standard paper quantity above 100 and a poster paper quantity above 50. This was done with the WHERE clause, and the results were ordered by unit price from highest to lowest using the DESC keyword.
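Combining the two paragraphs above, the multi-table join with the filter and ordering might be sketched like this (table and column names, and the +0.01 guard against division by zero, are assumptions based on the course schema):

```sql
-- Region, account and unit price per order, filtered and sorted
SELECT r.name AS region_name,
       a.name AS account_name,
       o.total_amt_usd / (o.total + 0.01) AS unit_price  -- avoid divide-by-zero
FROM orders o
JOIN accounts a   ON o.account_id = a.id
JOIN sales_reps s ON a.sales_rep_id = s.id
JOIN region r     ON s.region_id = r.id
WHERE o.standard_qty > 100
  AND o.poster_qty > 50
ORDER BY unit_price DESC;
```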

Chapter 3: SQL Aggregations

SQL has a few functions that can be used to aggregate data:

  • COUNT : counts the number of rows in a column
  • SUM : adds up the values in a column
  • AVG : calculates the mean of a column
  • MIN & MAX : return the lowest and highest values of a column, respectively

The code above finds the accounts that have used Facebook as a channel to reach their customers more than 6 times. The web_events table is joined to the accounts table. First the number of times each channel has been used is counted, and then the HAVING clause is used to keep only the results where Facebook appears more than 6 times. The results were grouped by account id, account name and channel, and ordered by the number of times the channel was used.
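Such a query might be sketched as follows (the lowercase channel value 'facebook' is an assumption about how the data is stored):

```sql
-- Accounts that used the facebook channel more than 6 times
SELECT a.id, a.name, w.channel, COUNT(*) AS channel_count
FROM accounts a
JOIN web_events w ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
HAVING COUNT(*) > 6 AND w.channel = 'facebook'
ORDER BY channel_count;
```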

The month in which Parch and Posey had the greatest sales in dollars between 2014 and 2017 was found using the DATE_PART function. First, the sum of total dollars earned every month was calculated. The results were then grouped by month and ordered from the highest to the lowest total. Since only the month with maximum earnings was wanted, the limit was set to 1.
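A sketch of that query, using PostgreSQL's DATE_PART as taught in the course:

```sql
-- Month (across all years) with the highest total sales in USD
SELECT DATE_PART('month', occurred_at) AS sales_month,
       SUM(total_amt_usd) AS total_spent
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
```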

A CASE statement is used above to categorize each order into 3 different categories based on the total number of items sold in the order.
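The category thresholds below are illustrative, since the original screenshot is unavailable:

```sql
-- Bucket each order by the total number of items sold (thresholds assumed)
SELECT id, total,
       CASE WHEN total >= 2000 THEN 'Large'
            WHEN total >= 1000 THEN 'Medium'
            ELSE 'Small'
       END AS order_category
FROM orders;
```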

Chapter 4: Sub-queries

Sometimes questions cannot be answered with the existing tables of data, and this is where sub-queries help. A query can be written to create a new table, and then another query can be written to answer the question using the newly created table.

The code below shows the total orders for the region with highest sales.

The total amount of sales for every region was found using the SUM aggregation and then grouped by region, as shown below.

Next, a sub-query of the above results was created, and the region with the highest sales was found using the MAX aggregation. The sub-query was named sub.

In the last step, the total orders from this particular region were counted. The HAVING clause was used to connect this last part of the code to the previous code to get the required results, as shown below.
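Putting the three steps together, the nested version might look like this (joins and column names assumed from the course schema):

```sql
-- Order count for the region with the highest total sales
SELECT r.name, COUNT(o.total) AS total_orders
FROM orders o
JOIN accounts a   ON o.account_id = a.id
JOIN sales_reps s ON a.sales_rep_id = s.id
JOIN region r     ON s.region_id = r.id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
  -- step 2: the highest regional sales total
  SELECT MAX(total_amt)
  FROM (
    -- step 1: total sales per region
    SELECT r.name AS region_name, SUM(o.total_amt_usd) AS total_amt
    FROM orders o
    JOIN accounts a   ON o.account_id = a.id
    JOIN sales_reps s ON a.sales_rep_id = s.id
    JOIN region r     ON s.region_id = r.id
    GROUP BY r.name
  ) sub
);
```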

The above code can be made easier to read by writing every sub-query as a common table expression (CTE).

A WITH statement is used to define each table. The task of finding the total sales amount for every region sits in the first expression, named t1. Table t2 gets the maximum of the total sales amounts. The count of total orders is then found in the main query.
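The same query restructured with CTEs could be sketched as:

```sql
-- CTE version: t1 = sales per region, t2 = maximum regional total
WITH t1 AS (
  SELECT r.name AS region_name, SUM(o.total_amt_usd) AS total_amt
  FROM orders o
  JOIN accounts a   ON o.account_id = a.id
  JOIN sales_reps s ON a.sales_rep_id = s.id
  JOIN region r     ON s.region_id = r.id
  GROUP BY r.name
), t2 AS (
  SELECT MAX(total_amt) AS max_amt
  FROM t1
)
SELECT r.name, COUNT(o.total) AS total_orders
FROM orders o
JOIN accounts a   ON o.account_id = a.id
JOIN sales_reps s ON a.sales_rep_id = s.id
JOIN region r     ON s.region_id = r.id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (SELECT max_amt FROM t2);
```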

Chapter 5: Data cleaning

Raw data is mostly messy and unstructured. Cleaning is therefore required to make it structured and to convert columns into different data types as per requirement.

The question was to separate the primary_poc into two different columns, i.e. first name and last name.

The character that splits the two words in primary_poc is a space; for example, the name John Lee is separated by a space between the two words. Therefore, the STRPOS function found the position of the space in primary_poc, and the characters from the left up to the space were taken as the first name. The length of primary_poc was then calculated, and the length of the first name plus the space character was subtracted from it. The RIGHT function was then used to take that many characters from the right of primary_poc as the last name.
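That splitting logic can be sketched with PostgreSQL string functions:

```sql
-- Split 'John Lee' into first and last name at the space
SELECT primary_poc,
       LEFT(primary_poc, STRPOS(primary_poc, ' ') - 1)                  AS first_name,
       RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) AS last_name
FROM accounts;
```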

STRPOS is useful when it comes to separating two words. Similarly, CONCAT is a function that can be used to join two or more words or characters. The above solution creates an email address by combining names and characters. A sub-query was first created to find the first and last names separately. These names were then pulled into the main query and joined together in the format of an email address.
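A sketch of that solution; the first.last@company.com address format is an assumption:

```sql
-- Build an email address from the split contact name and the account name
WITH names AS (
  SELECT name,
         LEFT(primary_poc, STRPOS(primary_poc, ' ') - 1)                  AS first_name,
         RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) AS last_name
  FROM accounts
)
SELECT CONCAT(first_name, '.', last_name, '@', name, '.com') AS email
FROM names;
```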

The date in the database was not in the correct format and had to be changed. SUBSTR comes into use when certain characters have to be extracted from a string. Take 01/03/2015 as an example. SQL expects dates in the format yyyy/mm/dd. As seen above, SUBSTR(date, 7, 4) gives 4 characters starting from the 7th position, i.e. 2015. LEFT(date, 2) gives the 2 leftmost characters, which is 01. SUBSTR(date, 4, 2) gives 2 characters starting from the 4th position, which is 03. This changes the date to the correct format, 2015/01/03. The || symbol is another form of CONCAT and is used to join characters.
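Assembled with the || concatenation operator, the reformatting might look like this (the table name is a placeholder for whichever table holds the text dates):

```sql
-- Reorder mm/dd/yyyy text into yyyy/mm/dd: '01/03/2015' becomes '2015/01/03'
SELECT date,
       SUBSTR(date, 7, 4) || '/' || LEFT(date, 2) || '/' || SUBSTR(date, 4, 2)
         AS formatted_date
FROM crime_data;  -- hypothetical table name
```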

Chapter 6: Window Functions

Window functions are a powerful tool in SQL which allow you to compare one row to another without doing joins. Two important keywords of window functions are OVER and PARTITION BY.

The code above computes the running total of standard paper sales in USD for every year. The amounts were added up, and PARTITION BY grouped the results into years starting from the first year of sales. ORDER BY occurred_at then orders the results within every year by when each sale happened. A running total adds up the rows of a column cumulatively.
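A sketch of that window function, using DATE_TRUNC to form the yearly partitions:

```sql
-- Running total of standard paper sales, restarting each year
SELECT standard_amt_usd,
       occurred_at,
       SUM(standard_amt_usd) OVER (
         PARTITION BY DATE_TRUNC('year', occurred_at)
         ORDER BY occurred_at
       ) AS running_total
FROM orders;
```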

The code above is similar to the previous code but with several aggregations this time. Instead of repeating the OVER (PARTITION BY ...) clause under each aggregation, it is written once at the end of the code as an alias using the WINDOW clause. This makes the code simpler and shorter.
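The WINDOW clause pattern might be sketched like this (the alias name and partitioning choice are illustrative):

```sql
-- Define the window once and reuse it across aggregations
SELECT id,
       SUM(standard_qty)   OVER monthly_window AS running_sum,
       COUNT(standard_qty) OVER monthly_window AS running_count,
       AVG(standard_qty)   OVER monthly_window AS running_avg
FROM orders
WINDOW monthly_window AS (
  PARTITION BY account_id
  ORDER BY DATE_TRUNC('month', occurred_at)
);
```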

NTILE is a function which divides the given column into a required number of subsets by comparing every row with all the other rows in the column. As seen above, standard_qty, which appears in the ORDER BY statement, is divided into 4 subsets, i.e. into quartiles. Therefore, a standard_qty of 0 would fall into the first quartile, and the highest value of standard_qty would fall into the last quartile.
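A minimal NTILE sketch:

```sql
-- Assign each order to a quartile of standard_qty (1 = lowest, 4 = highest)
SELECT id,
       account_id,
       standard_qty,
       NTILE(4) OVER (ORDER BY standard_qty) AS standard_quartile
FROM orders;
```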

Chapter 7: Advanced Joins and Performance Tuning

The joins used earlier were inner joins, which return only the matched rows when two or more tables are joined (or, with LEFT and RIGHT joins, the unmatched rows from just one table). Advanced joins include outer joins, which also return the unmatched rows from all the joined tables.

The code above gives all the rows from both the accounts and sales_reps tables. FULL OUTER JOIN not only gives accounts which have a sales rep and sales reps who have an account, but also the accounts which do not have a sales rep and the sales reps who do not have an account.
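Such a full outer join might be sketched as:

```sql
-- Every account and every sales rep, matched where possible, NULL where not
SELECT a.name AS account_name,
       s.name AS sales_rep_name
FROM accounts a
FULL OUTER JOIN sales_reps s
  ON a.sales_rep_id = s.id;
```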

UNION and UNION ALL are operators used to combine the results of different SELECT statements. UNION ALL does not remove duplicate rows, so the above code gives every result twice. Replacing UNION ALL with UNION would give half as many results, as UNION removes duplicate rows and gives every result only once.
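The contrast can be sketched with the accounts table stacked on itself:

```sql
-- UNION ALL keeps duplicates: every account appears twice
SELECT * FROM accounts
UNION ALL
SELECT * FROM accounts;

-- UNION removes duplicates: every account appears once
SELECT * FROM accounts
UNION
SELECT * FROM accounts;
```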

Though SQL is well suited to working with huge databases, some queries may take hours to execute if the database is very large. Such queries can be made to run faster by reducing the number of calculations the query has to perform. Runtime further depends on factors such as:

  • Aggregations
  • Joins
  • Table size
  • Other users executing queries on the same database

Another way to reduce runtime is to reduce the number of rows entering a join, which makes the join less complicated. This can be done by pre-aggregating one table in a sub-query and then joining it in the outer query.
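A sketch of the pre-aggregation pattern: the counts are computed once per account before the join, so the join touches far fewer rows than joining the raw web_events table would.

```sql
-- Aggregate web_events down to one row per account before joining
SELECT a.name, sub.web_event_count
FROM accounts a
JOIN (
  SELECT account_id, COUNT(*) AS web_event_count
  FROM web_events
  GROUP BY account_id
) sub
  ON a.id = sub.account_id;
```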
