Data normalization with SQL

⏱ Время чтения текста – 5 минут

According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.

For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.

In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().

We will work with the simple table students that contains the data on the height of the students:

name	height
Ivan	174
Peter	181
Dan	199
Kate	158
Mike	179
Silvia	165
Giulia	152
Robert	188
Steven	177
Sophia	165

Min-max rescaling

Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.

The formula for the min-max scaling is given as:

We multiply the numerator by 1.0 in order to get a floating point number at the end.

SQL-query with a subquery:

SELECT height, 
       1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
  FROM students, 
      (SELECT min(height) as min_height, 
              max(height) as max_height 
         FROM students
      ) as t1;

SQL-query with a window function:

SELECT height, 
       (height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
  FROM students;

As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.

name	height	scaled_minmax
Ivan	174	0.46809
Peter	181	0.61702
Dan	199	1
Kate	158	0.12766
Mike	179	0.57447
Silvia	165	0.2766
Giulia	152	0
Robert	188	0.76596
Steven	177	0.53191
Sophia	165	0.2766

Rescaling within a given range

This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.

This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.

SQL-query with subquery:

SELECT height, 
       ((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
  FROM students,
      (SELECT MAX(height) as max_height, 
              MIN(height) as min_height
         FROM students  
      ) t1;

SQL-query with a window function:

SELECT height, 
       ((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
  FROM students;

We receive similar results as before, but with data spread between 10 and 20.

name	height	scaled_ab
Ivan	174	14.68085
Peter	181	16.17021
Dan	199	20
Kate	158	11.2766
Mike	179	15.74468
Silvia	165	12.76596
Giulia	152	10
Robert	188	17.65957
Steven	177	15.31915
Sophia	165	12.76596

Z-score normalization

Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.

Z-score is calculated using the formula:

SQL-query with a subquery:

SELECT height, 
       (height - t1.mean) * 1.0 / t1.sigma AS zscore
  FROM students,
      (SELECT AVG(height) AS mean, 
              STDDEV(height) AS sigma
         FROM students
        ) t1;

SQL-query with a window function:

SELECT height, 
       (height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
  FROM students;

As a result, we can easily notice the outliers that exceed the standard deviation.

name	height	zscore
Ivan	174	0.01488
Peter	181	0.53582
Dan	199	1.87538
Kate	158	-1.17583
Mike	179	0.38698
Silvia	165	-0.65489
Giulia	152	-1.62235
Robert	188	1.05676
Steven	177	0.23814
Sophia	165	-0.65489

Your password

LEFT JOIN: blog on analytics, visualisation & data science

Data normalization with SQL

Min-max rescaling

Rescaling within a given range

Z-score normalization