Data normalization with SQL
⏱ Время чтения текста – 5 минутAccording to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.
For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.
In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().
We will work with the simple table students that contains the data on the height of the students:
name | height |
Ivan | 174 |
Peter | 181 |
Dan | 199 |
Kate | 158 |
Mike | 179 |
Silvia | 165 |
Giulia | 152 |
Robert | 188 |
Steven | 177 |
Sophia | 165 |
Min-max rescaling
Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.
The formula for the min-max scaling is given as:
We multiply the numerator by 1.0 in order to get a floating point number at the end.
SQL-query with a subquery:
SELECT height,
1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
FROM students,
(SELECT min(height) as min_height,
max(height) as max_height
FROM students
) as t1;
SQL-query with a window function:
SELECT height,
(height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
FROM students;
As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.
name | height | scaled_minmax |
Ivan | 174 | 0.46809 |
Peter | 181 | 0.61702 |
Dan | 199 | 1 |
Kate | 158 | 0.12766 |
Mike | 179 | 0.57447 |
Silvia | 165 | 0.2766 |
Giulia | 152 | 0 |
Robert | 188 | 0.76596 |
Steven | 177 | 0.53191 |
Sophia | 165 | 0.2766 |
Rescaling within a given range
This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.
This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.
SQL-query with subquery:
SELECT height,
((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
FROM students,
(SELECT MAX(height) as max_height,
MIN(height) as min_height
FROM students
) t1;
SQL-query with a window function:
SELECT height,
((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
FROM students;
We receive similar results as before, but with data spread between 10 and 20.
name | height | scaled_ab |
Ivan | 174 | 14.68085 |
Peter | 181 | 16.17021 |
Dan | 199 | 20 |
Kate | 158 | 11.2766 |
Mike | 179 | 15.74468 |
Silvia | 165 | 12.76596 |
Giulia | 152 | 10 |
Robert | 188 | 17.65957 |
Steven | 177 | 15.31915 |
Sophia | 165 | 12.76596 |
Z-score normalization
Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.
Z-score is calculated using the formula:
SQL-query with a subquery:
SELECT height,
(height - t1.mean) * 1.0 / t1.sigma AS zscore
FROM students,
(SELECT AVG(height) AS mean,
STDDEV(height) AS sigma
FROM students
) t1;
SQL-query with a window function:
SELECT height,
(height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
FROM students;
As a result, we can easily notice the outliers that exceed the standard deviation.
name | height | zscore |
Ivan | 174 | 0.01488 |
Peter | 181 | 0.53582 |
Dan | 199 | 1.87538 |
Kate | 158 | -1.17583 |
Mike | 179 | 0.38698 |
Silvia | 165 | -0.65489 |
Giulia | 152 | -1.62235 |
Robert | 188 | 1.05676 |
Steven | 177 | 0.23814 |
Sophia | 165 | -0.65489 |