Z-score normalization (standardization) is a statistical technique that transforms data so that it has a mean of 0 and a standard deviation of 1. This is achieved by subtracting the mean of the dataset from each data point and then dividing the result by the standard deviation of the dataset. The technique is also available in the scikit-learn library under the name StandardScaler. The formula for Z-score normalization can be written as:
\begin{equation}
z = \frac{x-\mu}{\sigma}
\end{equation}
where \(z\) is the Z-score, \(x\) is the data point, \(\mu\) is the mean of the dataset feature/variable, and \(\sigma\) is the standard deviation of the dataset feature/variable.
Steps for performing Z-score normalization
The first step is to calculate the mean value of the dataset feature/variable.
\begin{equation}
\mu = \frac{1}{N}\sum_{i=1}^N x_i
\end{equation}
where \(N\) is the number of data points and \(x_i\) represents each data point.
The second step is to calculate the standard deviation \(\sigma\). The population form is used here, so the sum of squared deviations is divided by \(N\).
\begin{equation}
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2}
\end{equation}
The third and final step is to standardize all the values of the feature/variable.
\begin{equation}
z_i = \frac{x_i-\mu}{\sigma}
\end{equation}
where \(z_i\) is the Z-score of the i-th data sample.
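The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `z_score_normalize` is hypothetical, and the guard against a zero standard deviation is an added assumption, since the formula is undefined when all values are equal.

```python
import math


def z_score_normalize(values):
    """Standardize a sequence of numbers (hypothetical helper).

    Uses the population standard deviation (divides by N),
    matching the formula for sigma given in the text.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")

    # Step 1: mean of the feature
    mu = sum(values) / n

    # Step 2: population standard deviation
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        # Assumption: refuse to divide by zero for a constant feature
        raise ValueError("standard deviation is zero; Z-scores are undefined")

    # Step 3: standardize every value
    return [(x - mu) / sigma for x in values]
```

Calling `z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])` uses a mean of 5 and a standard deviation of 2, so the first value maps to -1.5 and the last to 2.0.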
Example - Z-score normalization
Perform Z-score normalization for the following array:
\begin{equation}
x = [10,20,30,40,50]
\end{equation}
To perform Z-score normalization, the first step is to calculate the mean of the array.
\begin{equation}
\mu = \frac{10+20+30+40+50}{5} = \frac{150}{5} = 30.
\end{equation}
The second step is to calculate the standard deviation (\(\sigma\)).
\begin{eqnarray}
\sigma &=& \sqrt{\frac{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2}{5}}\\
\sigma &=& \sqrt{\frac{400+100+0+100+400}{5}}\\
\sigma &=& \sqrt{\frac{1000}{5}} = \sqrt{200} \approx 14.14.
\end{eqnarray}
Once the mean and standard deviation of the array have been calculated, the Z-score normalization can be performed.
\begin{eqnarray}
z_1 &=& \frac{10-30}{14.14} \approx -1.41\\
z_2 &=& \frac{20-30}{14.14} \approx -0.71\\
z_3 &=& \frac{30-30}{14.14} = 0\\
z_4 &=& \frac{40-30}{14.14} \approx 0.71\\
z_5 &=& \frac{50-30}{14.14} \approx 1.41
\end{eqnarray}
So the standardized dataset (the Z-scores) is equal to:
\begin{equation}
z \approx [-1.41, -0.71, 0, 0.71, 1.41]
\end{equation}
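The hand calculation can be checked with Python's standard `statistics` module. This is only a quick verification sketch; `pstdev` is chosen because it computes the population standard deviation (dividing by \(N\)), which is the form used in this example.

```python
from statistics import fmean, pstdev

x = [10, 20, 30, 40, 50]

mu = fmean(x)      # arithmetic mean of the array
sigma = pstdev(x)  # population standard deviation (divides by N)

# Standardize each value: z_i = (x_i - mu) / sigma
z = [(xi - mu) / sigma for xi in x]

print(mu)                          # 30.0
print(round(sigma, 2))             # 14.14
print([round(zi, 2) for zi in z])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

The standardized values themselves have a mean of 0 and a population standard deviation of 1, which is the defining property of Z-score normalization.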
Uses and Importance
Z-score normalization is very useful for machine learning algorithms that are sensitive to the scale of the features, or that perform better when each feature is centered around 0 with a standard deviation of 1, such as k-nearest neighbors and principal component analysis. Note that standardization only shifts and rescales the data; it does not change the shape of its distribution. Z-scores can also be used to identify outliers in the data, as values far from 0 indicate unusual data points.
Z-score normalization ensures that different features in a dataset contribute equally to the analysis, preventing features with larger ranges from dominating those with smaller ranges.
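The outlier use case can be illustrated with a short sketch. The function name `find_outliers`, the sample `readings`, and the cutoff of 2.0 are all assumptions made for illustration; the text does not prescribe a threshold, and a suitable value depends on the data and the sample size.

```python
from statistics import fmean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose absolute Z-score exceeds `threshold`.

    Hypothetical helper: the default threshold of 2.0 is an
    illustrative choice, not a rule taken from the text.
    """
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing can be an outlier
        return []
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Made-up sample data with one obviously unusual value
readings = [12, 14, 13, 15, 14, 13, 40]
print(find_outliers(readings))  # [40]
```

For very small samples the largest possible Z-score is limited by the sample size, so a strict cutoff may never trigger; this is one reason the threshold is left as a parameter here.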