Tuesday, 30 July 2024

About the Website

AnalyzeMyData.blogspot.com is a comprehensive online platform designed to streamline and enhance the data analysis process for users of all levels. This website offers a suite of powerful tools that enable users to convert datasets from various file formats, perform detailed statistical analyses, apply a variety of preprocessing techniques, and clean datasets with ease. All that is required from the user is to provide the dataset; the platform takes care of the rest. Here’s a closer look at what AnalyzeMyData.blogspot.com offers:

Key Features

Dataset Conversion
- Versatility: Convert datasets from numerous file formats including CSV, Excel, JSON, XML, and more. This flexibility ensures that users can work with data in the format they are most comfortable with or that their specific project requires.
- Ease of Use: The conversion process is straightforward and user-friendly, making it accessible even for those with minimal technical expertise.

Statistical Analysis
- Comprehensive Tools: Perform a wide range of statistical analyses, from basic descriptive statistics (mean, median, mode, standard deviation) to more complex inferential statistics (regression analysis, hypothesis testing, ANOVA).
- Detailed Reports: Generate detailed reports that provide insights into the dataset, helping users understand underlying patterns and trends.

Preprocessing Techniques
- Data Transformation: Apply various data transformation techniques such as normalization, standardization, and scaling to prepare data for analysis.
- Feature Engineering: Create new features from existing data to enhance model performance and improve analysis outcomes.

Dataset Cleaning
- Error Detection: Identify and correct errors in the dataset, such as missing values, duplicates, and outliers, to ensure the data is accurate and reliable.
- Automated Cleaning: Utilize automated tools that streamline the cleaning process, saving time and reducing the potential for human error.

Benefits

User-Friendly Interface: The website is designed with a user-friendly interface that guides users through each step of the data analysis process. This ensures that even those with limited technical skills can effectively use the platform.
Time Efficiency: By automating many aspects of data conversion, analysis, and cleaning, AnalyzeMyData.blogspot.com significantly reduces the time and effort required to prepare data for analysis.
Enhanced Accuracy: Automated tools and detailed statistical reports enhance the accuracy of data analysis, leading to more reliable insights and better decision-making.
Versatility and Flexibility: The platform’s ability to handle various file formats and apply multiple preprocessing techniques makes it a versatile tool suitable for a wide range of applications across different industries.

Use Cases

Academic Research: Researchers can use the platform to quickly convert and clean their data, allowing them to focus on analysis and interpretation.
Business Analytics: Businesses can leverage the platform to preprocess and analyze sales, marketing, and operational data, driving more informed strategic decisions.
Healthcare Data Analysis: Healthcare professionals can utilize the tools to clean and analyze patient data, improving the quality of care and operational efficiency.
Financial Analysis: Financial analysts can benefit from the platform’s capabilities to process and analyze financial data, enhancing investment strategies and risk management.

Getting Started

Getting started with AnalyzeMyData.blogspot.com is simple:

Upload Your Dataset: Upload your dataset in a supported file format (at the moment only the .csv format is supported).
Choose Your Tools: Select the tools and techniques you wish to apply, whether it’s file conversion, statistical analysis, preprocessing, or data cleaning.
Execute and Review: Execute the chosen operations and review the results. The platform will provide detailed outputs and reports to guide your next steps.

AnalyzeMyData.blogspot.com is the go-to platform for anyone looking to efficiently manage and analyze their data. By providing a comprehensive set of tools in an easy-to-use interface, it empowers users to derive meaningful insights and make data-driven decisions with confidence.

Monday, 29 July 2024

Z-Score Normalization (StandardScaler)

Step one: Select a .csv format file.

Step two: Download the file after the process is completed.

Z-score normalization (standardization) is a statistical technique used to transform data so that it has a mean of 0 and a standard deviation of 1. The technique is also available in the scikit-learn library under the name StandardScaler. This is achieved by subtracting the mean of the dataset from each data point and then dividing the result by the standard deviation of the dataset. The formula for Z-score normalization can be written as: \begin{equation} z = \frac{x-\mu}{\sigma} \end{equation} where \(z\) is the Z-score, \(x\) is the data point, \(\mu\) is the mean of the dataset feature/variable, and \(\sigma\) is the standard deviation of the dataset feature/variable.

Steps for performing Z-score normalization

The first step is to calculate the mean value of the dataset feature/variable. \begin{equation} \mu = \frac{1}{N}\sum_{i=1}^N x_i \end{equation} where \(N\) is the number of data points and \(x_i\) represents each data point. The second step is to calculate the standard deviation \(\sigma\). \begin{equation} \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2} \end{equation} The third and final step is to standardize all the values of the feature/variable. \begin{equation} z_i = \frac{x_i-\mu}{\sigma} \end{equation} where \(z_i\) is the Z-score of the i-th data sample.

Example - Z-score normalization

Perform the z-score normalization for the following array: \begin{equation} x = [10,20,30,40,50] \end{equation} To perform the z-score normalization the first step is to calculate the mean of the array. \begin{equation} \mu = \frac{10+20+30+40+50}{5} = \frac{150}{5} = 30. \end{equation} The second step is to calculate the standard deviation (\(\sigma\)). \begin{eqnarray} \sigma &=& \sqrt{\frac{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2}{5}}\\ \sigma &=& \sqrt{\frac{400+100+0+100+400}{5}}\\ \sigma &=& \sqrt{\frac{1000}{5}} = \sqrt{200} = 14.14. \end{eqnarray} After the mean and standard deviation of the array have been calculated, the Z-score normalization can be performed. \begin{eqnarray} z_1 &=& \frac{10-30}{14.14} = -1.41\\ z_2 &=& \frac{20-30}{14.14} = -0.71\\ z_3 &=& \frac{30-30}{14.14} = 0\\ z_4 &=& \frac{40-30}{14.14} = 0.71\\ z_5 &=& \frac{50-30}{14.14} = 1.41 \end{eqnarray} So the standardized dataset (Z-scores) is equal to: \begin{equation} z = [-1.41, -0.71, 0, 0.71, 1.41] \end{equation}
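
The three steps of the example can also be checked with a few lines of Python. This is only a minimal sketch, not the code that runs on the website: the function name `z_score_normalize` is made up for illustration, and it assumes the population standard deviation (division by \(N\)), as in the formulas above.

```python
from statistics import fmean, pstdev


def z_score_normalize(values):
    """Standardize values to mean 0 and standard deviation 1.

    Assumption: the population standard deviation (divide by N) is used,
    matching the formula for sigma given in the text.
    """
    mu = fmean(values)           # step 1: mean
    sigma = pstdev(values, mu)   # step 2: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; Z-scores are undefined")
    return [(x - mu) / sigma for x in values]  # step 3: standardize


x = [10, 20, 30, 40, 50]
z = z_score_normalize(x)
print([round(v, 2) for v in z])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

A constant column has a standard deviation of 0, so the sketch raises an error instead of dividing by zero; other tools may handle that case differently.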

Uses and Importance

Z-score normalization is very useful for machine learning algorithms that assume, or perform better when, the features are centered around 0 with a standard deviation of 1, such as k-nearest neighbors and principal component analysis. Note that standardization only shifts and rescales the data; it does not change the shape of its distribution. The Z-scores can also be used to identify outliers in the data, as values far from 0 indicate unusual data points.
Z-score normalization puts the different features of a dataset on a comparable scale, preventing features with larger ranges from dominating those with smaller ranges.

Sunday, 28 July 2024

MaxAbsScaler

Step one: Select a .csv format file.

Step two: Download the file after the process is completed.

The Max Abs Scaler

The MaxAbsScaler is a data preprocessing technique commonly used in machine learning to scale the dataset features (variables). The main purpose of this scaler is to transform the data to the range from -1 to 1, based on the maximum absolute value of each feature. This technique is particularly useful for data that contains both positive and negative values and when maintaining sparsity is important (text processing and sparse data matrices).

How does MaxAbsScaler work?

The MaxAbsScaler consists of two steps: identifying the maximum absolute value and scaling the feature (variable). After the dataset is uploaded, the algorithm finds the maximum absolute value of each feature/variable (column) in the dataset. Then each value in the feature/variable (column) is divided by the maximum absolute value of that feature. The formula for scaling the values of any feature/variable (column) can be written in the following form: \begin{equation} x_s = \frac{x}{\max(|x|)} \end{equation} where \(x\) is the original value, and \(\max(|x|)\) is the maximum absolute value of the feature/variable (column).

Example: Application of MaxAbsScaler

In this example the MaxAbsScaler will be applied to two features/variables (columns) of the dataset. Here the features are defined in the form of arrays. \begin{eqnarray} x_1 &=& [1,-2,3,-4]\\ x_2 &=& [-3,4,-1,2] \end{eqnarray} The first step is to identify the maximum absolute value of each feature/variable column. For the first feature \(x_1\) the maximum absolute value is equal to 4. \begin{equation} \max(|1|,|-2|,|3|,|-4|) = \max(1,2,3,4) = 4 \end{equation} For the second feature \(x_2\) the maximum absolute value is also equal to 4. \begin{equation} \max(|-3|, |4|, |-1|, |2|) = \max(3,4,1,2) = 4 \end{equation} The second step is to scale each value in the dataset by dividing it by the maximum absolute value of the respective feature.

Scaling First Feature

The original value of the first element in the first feature \(x_1\) array is 1 while the scaled value is equal to 0.25. \begin{equation} x_{s11} = \frac{x_{11}}{\max(|x_1|)} = \frac{1}{4} = 0.25. \end{equation} The original value of the second element in the first feature \(x_1\) array is -2 while the scaled value is equal to -0.5. \begin{equation} x_{s12} = \frac{x_{12}}{\max(|x_1|)} = \frac{-2}{4} = -0.5. \end{equation} The original value of the third element in the first feature \(x_1\) array is 3 while the scaled value is equal to 0.75. \begin{equation} x_{s13} = \frac{x_{13}}{\max(|x_1|)} = \frac{3}{4} = 0.75. \end{equation} The original value of the fourth element in the first feature \(x_1\) array is -4 while the scaled value is equal to -1.00. \begin{equation} x_{s14} = \frac{x_{14}}{\max(|x_1|)} = \frac{-4}{4} = -1.00. \end{equation} The scaled array can be written as: \begin{equation} x_{1s} = [0.25,-0.50,0.75,-1.00] \end{equation}

Scaling Second Feature

The original value of the first element in the second feature \(x_2\) array is -3 while the scaled value is equal to -0.75. \begin{equation} x_{s21} = \frac{x_{21}}{\max(|x_2|)} = \frac{-3}{4} = -0.75. \end{equation} The original value of the second element in the second feature \(x_2\) array is 4 while the scaled value is equal to 1.00. \begin{equation} x_{s22} = \frac{x_{22}}{\max(|x_2|)} = \frac{4}{4} = 1.00. \end{equation} The original value of the third element in the second feature \(x_2\) array is -1 while the scaled value is equal to -0.25. \begin{equation} x_{s23} = \frac{x_{23}}{\max(|x_2|)} = \frac{-1}{4} = -0.25. \end{equation} The original value of the fourth element in the second feature \(x_2\) array is 2 while the scaled value is equal to 0.5. \begin{equation} x_{s24} = \frac{x_{24}}{\max(|x_2|)} = \frac{2}{4} = 0.5. \end{equation} The scaled array can be written as: \begin{equation} x_{2s} = [-0.75,1.00,-0.25,0.5] \end{equation} The MaxAbsScaler is used when the sparsity of the data must be preserved, when the data contains both positive and negative values, and when scaling would not significantly distort the range of values.
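
Both features from the example can be scaled with a short Python sketch. The helper name `max_abs_scale` is hypothetical and the code is not taken from the website; leaving an all-zero column unchanged is an assumption made here to avoid dividing by zero.

```python
def max_abs_scale(column):
    """Divide every value by the maximum absolute value of the column."""
    max_abs = max(abs(v) for v in column)
    if max_abs == 0:
        # Assumption: an all-zero column is returned unchanged.
        return [0.0 for _ in column]
    return [v / max_abs for v in column]


x1 = [1, -2, 3, -4]
x2 = [-3, 4, -1, 2]

x1_scaled = max_abs_scale(x1)
x2_scaled = max_abs_scale(x2)

print(x1_scaled)  # [0.25, -0.5, 0.75, -1.0]
print(x2_scaled)  # [-0.75, 1.0, -0.25, 0.5]
```

Because each value is only divided by a positive constant, zeros stay zeros and signs are preserved, which is why this scaler keeps sparse data sparse.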

Saturday, 27 July 2024

Pearson Correlation Heatmap

To obtain the Pearson's correlation heatmap for your dataset, click on the Choose File button. The "Open" window will appear. All you need to do is locate the dataset on your computer (.csv format only) and click "Open", and the Pearson's correlation heatmap will appear.

Pearson's correlation heatmap

Pearson's correlation coefficient, denoted as \(r\), is a measure of the linear relationship between two variables. It quantifies the degree to which two variables are linearly related, providing both the direction and the strength of the relationship.
The range of \(r\) is from -1 to +1, where +1 indicates a perfect positive linear correlation and -1 indicates a perfect negative linear correlation. The value 0 indicates no linear relationship. A positive correlation \(r > 0\) between two variables indicates that if the value of one variable increases, the value of the other also increases, and if the value of one variable decreases, the value of the other also decreases. A negative correlation coefficient \(r < 0\) between two dataset variables indicates that if the value of one variable increases, the value of the other decreases, and vice versa.
The magnitude of the \(r\) can be:

\(0.0 \leq |r|< 0.3\) - weak correlation
\(0.3 \leq |r| < 0.7\) - moderate correlation
\(0.7 \leq |r| \leq 1\) - strong correlation

Pearson's correlation coefficient between two dataset variables can be calculated using the formula: \begin{equation} r = \frac{\sum{(x_i-\overline{x})(y_i-\overline{y})}}{\sqrt{\sum{(x_i - \overline{x})^2}\sum{(y_i-\overline{y})^2}}} \end{equation} where:

\(x_i\) and \(y_i\) are the individual dataset samples
\(\overline{x}\) and \(\overline{y}\) are the mean values of the variable \(x\) and variable \(y\).

Steps to calculate Pearson's correlation coefficient

The calculation of Pearson's correlation coefficient consists of the following steps:

Compute the mean - calculate the mean values of \(x\) and \(y\) (\(\overline{x}\),\(\overline{y}\)).
Compute the deviations - Subtract the mean of x from each \(x_i\) to get deviations for \(x\). Subtract the mean of \(y\) from each \(y_i\) to get deviations for \(y\).
Compute the products of deviations - Multiply the deviations of \(x\) and \(y\) for each pair of observations.
Sum the products - sum all the products obtained in the previous step
Compute the sum of squared deviations - square the deviations of \(x\) and sum them. Square the deviations of \(y\) and sum them.
Calculate \(r\) - divide the sum of the products of deviations by the square root of the product of the sum of squared deviations.

Example: Calculation of Pearson's correlation coefficient step by step.

Two dataset variables are represented as arrays: \begin{eqnarray} x &=& [2,4,6,8,10]\\ y &=& [3,5,7,9,11] \end{eqnarray} As seen, both variable arrays have the same number of samples, i.e. 5.

Compute the mean value of \(x\) and \(y\)

\begin{equation} \overline{x} = \frac{2+4+6+8+10}{5} = \frac{30}{5} = 6 \end{equation} \begin{equation} \overline{y} = \frac{3+5+7+9+11}{5} = \frac{35}{5} = 7 \end{equation}

Compute the deviations

For \(x\) array the deviations are: \begin{equation} -4,-2,0,2,4 \end{equation} For \(y\) array the deviations are: \begin{equation} -4,-2,0,2,4 \end{equation}

Compute the product of deviations

Multiply the deviations of \(x\) and \(y\) for each pair of dataset samples. \begin{equation} 16,4,0,4,16 \end{equation}

Sum the products

\begin{equation} \sum{(x_i-\overline{x}) (y_i - \overline{y})} = 16+4+0+4+16 = 40 \end{equation}

Compute the sum of squared deviations

\begin{equation} \sum{(x_i - \overline{x})^2} = 16+4+0+4+16 = 40 \end{equation} \begin{equation} \sum{(y_i - \overline{y})^2} = 16+4+0+4+16 = 40 \end{equation}

Calculate the \(r\)

\begin{equation} r = \frac{40}{\sqrt{40\cdot 40}} = \frac{40}{40} = 1 \end{equation}
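
The same step-by-step calculation can be written as a small Python function. This is a sketch for checking the arithmetic, not the website's implementation; the function name `pearson_r` and the error handling are choices made here for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient, following the steps in the text."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    n = len(x)
    mean_x = sum(x) / n                                 # compute the means
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]                        # deviations of x
    dy = [v - mean_y for v in y]                        # deviations of y
    sum_products = sum(a * b for a, b in zip(dx, dy))   # sum of products
    sum_sq_x = sum(a * a for a in dx)                   # squared deviations
    sum_sq_y = sum(b * b for b in dy)
    denominator = sqrt(sum_sq_x * sum_sq_y)
    if denominator == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sum_products / denominator


x = [2, 4, 6, 8, 10]
y = [3, 5, 7, 9, 11]
r = pearson_r(x, y)
print(r)  # 1.0
```

The result is exactly 1 because every \(y\) value equals the matching \(x\) value plus 1, which is a perfect positive linear relationship.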

Conclusion

Pearson's correlation coefficient is a powerful statistical tool for understanding the linear relationship between two variables. It provides a quantifiable measure of both the direction and strength of this relationship, which can be invaluable in data analysis, research, and many applied fields. However, it is essential to be aware of its assumptions and limitations to interpret the results correctly.

Friday, 26 July 2024

Analyze my data

Load CSV in JavaScript

Here you can analyze any dataset you want as long as all the variables are in numeric format. To analyze the dataset you first need to upload it (.csv format only). When you click on the Choose File button, the Open file window will pop up and ask you to select the dataset (.csv format) from your computer. After you select the file, click OK and the website will automatically generate all statistics for the dataset: basic information about the dataset (number of features/variables and samples), what each feature/variable (column) contains (numbers, strings or mixed type), the dataset itself, the results of the statistical analysis (mean, median, mode, range, variance, standard deviation, interquartile range, skewness, kurtosis, count of unique values, maximum values, minimum values, sum, coefficient of variation, geometric mean, harmonic mean, confidence interval, quartile 1, quartile 2, and quartile 3), histograms of each dataset variable, Pearson's correlation analysis, and outlier analysis. Each of these statistical methods is described below.

Display Dataset

Histograms

Pearson's correlation heatmap

Outlier Detection using Boxplot

Dataset Basic Information

Dataset Statistical Analysis

For the uploaded dataset this web application calculates the measures of central tendency, measures of dispersion, measures of shape, unique values, extremes, specialized means, confidence intervals, and quartiles for each dataset variable. It should be noted that the web application also counts the number of samples of each variable. If the count values are not the same for all dataset variables, it means that some variable is missing a value for at least one dataset sample.

Measures of Central Tendency

The measures of central tendency are statistical metrics that describe the center point or typical value of a dataset. They provide a summary that represents the entire dataset, which can be useful for understanding the general behavior of the data. The most common measures of central tendency are the mean, median and mode.
The measures of central tendency are important for simplification and summary, descriptive statistics, decision making, and data analysis.

Simplification and summary
- Condensing information - the measures of central tendency reduce a large dataset to a single representative value. This makes it easier to understand and communicate.
- Comparative Analysis - they facilitate comparison between different datasets by providing a common ground.
Descriptive statistics
- Insight into dataset distribution - the statistics offer insight into the distribution of the variables, highlighting where the central point lies.
- Initial Analysis - they serve as the starting point for further statistical analysis and hypothesis testing.
Decision making
- Guiding Decisions - central tendency measures are often used in business, economics, social sciences, and other fields to make informed decisions.
- Policy and Planning - they aid in policy formulation, planning, and setting benchmarks.
Data Analysis
- Identifying Trends - they help in identifying trends and patterns within the data.
- Detecting Outliers - these measures are primarily focused on determining the central value of each dataset variable (at least in this case); however, they can also assist in detecting outliers when combined with measures of dispersion.

Mean

The mean value of a dataset variable, or average, is calculated by summing all the values of the variable and dividing the sum by the number of values that the variable contains (the length of the array in which all variable values are stored). It is represented as \(\overline{x}\). The formula for calculating the mean is written as: \begin{equation} \overline{x} = \frac{\sum_{i=1}^nx_i}{n}, \end{equation} where:

\(\sum_{i=1}^nx_i\) - sum of all the values that variable contains
\(n\) - is the number of samples that variable contains.

Example: Calculating the mean of an array. Calculate the mean value of the array: \begin{equation} x = [2,4,6,8,10,12,14,16] \end{equation} First step - determine the array length by counting the number of elements in the array. The array consists of 8 elements. Second step - calculate the sum. \begin{equation} \sum_{i=1}^8 x_i = 2+4+6+8+10+12+14+16 = 72 \end{equation} Third step - calculate the mean value. \begin{equation} \overline{x} = \frac{\sum_{i=1}^8x_i}{8} = \frac{72}{8} = 9 \end{equation}

Median

The median is the middle value of the dataset variable when it is ordered from the lowest to the highest value. If the dataset has an even number of observations (samples), the median is the average of the two middle numbers. Example: Calculation of the median value. Calculate the median value for the arrays \begin{equation} x_1 = [20,30,40,50,60], \end{equation} and \begin{equation} x_2 = [50,30,40,20] \end{equation} The array \(x_1\) is already ordered and has 5 elements, so its median is the third (middle) element, 40. In this case the mean gives the same value: the sum of all array elements is \begin{equation} 20+30+40+50+60 = 200 \end{equation} and dividing it by 5 (5 elements in the array) gives \begin{equation} \frac{200}{5} = 40 \end{equation} The mean and the median coincide here only because the values are spread symmetrically around the middle element; in general they differ. For the second array \(x_2\) the array elements first have to be ordered from the smallest to the largest \begin{equation} x_{2O} = [20,30,40,50] \end{equation} As the median definition states, the average of the two middle values is the median of this array. \begin{equation} \frac{30+40}{2} = \frac{70}{2} = 35 \end{equation} The mean of this array is also 35 (the sum of the elements, 140, divided by 4 elements), again because the values are symmetric around the center.

Mode

The mode of a dataset variable is the value that appears most frequently in that variable. A dataset may have one mode, more than one mode, or no mode at all.
Example: Calculating the mode of an array. Determine the mode values for the arrays [1,4,4,6,5] and [1,2,3,4,5]. The first array contains 5 elements and only the value 4 occurs more than once, so the mode is 4. The second array also contains 5 elements; however, none of the elements occurs more than once, so there is no mode.
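
The three measures of central tendency can be reproduced with Python's standard library. This is a minimal sketch using the arrays from the examples above; the helper `modes` is a hypothetical name, and returning an empty list when no value repeats is a convention chosen here to match the "no mode" case in the text.

```python
from collections import Counter
from statistics import fmean, median


def modes(values):
    """Return all most frequent values, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return [value for value, count in counts.items() if count == highest]


print(fmean([2, 4, 6, 8, 10, 12, 14, 16]))  # 9.0
print(median([20, 30, 40, 50, 60]))         # 40
print(median([50, 30, 40, 20]))             # 35.0 (sorted internally)
print(modes([1, 4, 4, 6, 5]))               # [4]
print(modes([1, 2, 3, 4, 5]))               # []
```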

Measures of Dispersion

The measures of dispersion are range, variance, standard deviation, and interquartile range.

Range

The range is the difference between the maximum and the minimum value in the variable array. The range formula can be written as: \begin{equation} range = max - min \end{equation} Example: Range calculation of an array Calculate the range of an array [20,30,40,50,60]. First step: determine the minimum and the maximum value. The minimum value in the variable array is 20 and the maximum is 60. The range can be calculated by subtracting minimum from the maximum value. \begin{equation} range = 60 - 20 = 40 \end{equation} The range of this array is 40.

Variance

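The variance is a measure of the spread of the data points around the mean value. It is the average of the squared differences from the mean. To determine the variance, subtract the mean from each value, square the differences, and average them: \begin{equation} \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})^2 \end{equation}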

Standard Deviation

The standard deviation is the square root of the variance. It provides a measure of the average distance from the mean. The population standard deviation is calculated using the formula: \begin{equation} \sigma = \sqrt{\sigma^2} \end{equation}

Interquartile Range (IQR)

The IQR is the range of the middle 50\% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). The formula for calculating the IQR can be written as: \begin{equation} IQR = Q_3 - Q_1 \end{equation}
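
The four measures of dispersion can be computed together for the array used in the range example. This is a sketch under two assumptions: the population forms of the variance and standard deviation (division by the number of samples) are used, and the quartiles come from the "inclusive" method of Python's `statistics.quantiles`. Other quartile conventions exist and can give slightly different IQR values for the same data.

```python
from statistics import pstdev, pvariance, quantiles

x = [20, 30, 40, 50, 60]

value_range = max(x) - min(x)   # 60 - 20 = 40
variance = pvariance(x)         # mean of squared deviations = 1000 / 5 = 200
std_dev = pstdev(x)             # sqrt(200), about 14.14

# Assumption: the "inclusive" quartile method; other methods may differ.
q1, q2, q3 = quantiles(x, n=4, method="inclusive")
iqr = q3 - q1                   # 50.0 - 30.0 = 20.0

print(value_range, variance, round(std_dev, 2), iqr)
```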

Measures of Shape

The measures of shape are skewness and kurtosis.

Skewness

The skewness measures the asymmetry of the distribution of values in a dataset. Positive skew (right-skewed) means the tail on the right side is longer or fatter. Negative skew (left-skewed) means the tail on the left side is longer or fatter. The formula for calculating skewness can be written as: \begin{equation} Skew = \frac{n}{(n-1)(n-2)}\sum\left(\frac{x_i-\overline{x}}{s}\right)^3 \end{equation} where \(n\) is the number of samples, \(\overline{x}\) is the mean, and \(s\) is the sample standard deviation.

Kurtosis

Kurtosis measures the tailedness of the distribution of values in a dataset. High kurtosis means more of the variance is due to infrequent extreme deviations. Low kurtosis means the variance is more evenly distributed. \begin{equation} Kurtosis = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum\left( \frac{x_i-\overline{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} \end{equation}
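
Both shape formulas can be written out directly in Python. The sketch below is illustrative only: the function names are made up, \(s\) is assumed to be the sample standard deviation (division by \(n-1\)), and the kurtosis formula is read as excess kurtosis, so a normal distribution would score close to 0.

```python
from statistics import fmean, stdev


def sample_skewness(values):
    """Skewness with the n / ((n-1)(n-2)) correction factor."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 values are required")
    mean, s = fmean(values), stdev(values)  # s: sample standard deviation
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Excess kurtosis, following the formula in the text."""
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 values are required")
    mean, s = fmean(values), stdev(values)
    if s == 0:
        raise ValueError("kurtosis is undefined for constant data")
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    factor = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return factor * fourth - correction


symmetric = [10, 20, 30, 40, 50]
right_skewed = [1, 2, 3, 4, 100]  # hypothetical data with a long right tail

print(sample_skewness(symmetric))         # approximately 0
print(sample_skewness(right_skewed) > 0)  # True
print(sample_excess_kurtosis(symmetric))  # approximately -1.2
```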

Unique Values

The count of the distinct values in the dataset. For example, in the dataset [1,2,3,4,5,6] there are 6 unique values: 1, 2, 3, 4, 5, and 6.

Extremes

The extremes are the minimum and maximum values.

Maximum

The largest value in a variable. Example: Determine the maximum value in an array. The array of the dataset variable consists of the following elements. \begin{equation} x = [4,5,10,20,100,2000,200,300] \end{equation} As seen from the array, the maximum value is 2000.

Minimum

The smallest value in a variable. Example: Determine the minimum value in the array In this example the array from the previous example will be used. The minimum value in the array is 4.

Specialized means

Geometric mean

The geometric mean is the nth root of the product of n values. It is useful for data that are multiplicative or that vary exponentially. The formula for calculating the geometric mean is: \begin{equation} GMean = \left(\prod_{i=1}^nx_i\right)^\frac{1}{n} \end{equation}

Harmonic mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values. It is useful for rates and ratios. The formula for calculating the harmonic mean can be written as: \begin{equation} HMean = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}} \end{equation}
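
Both specialized means follow directly from their formulas. The sketch below writes them out by hand and compares the results with the functions of the same purpose in Python's `statistics` module. The function names `gmean` and `hmean` and the sample array are hypothetical, and the inputs are assumed to be strictly positive.

```python
from math import isclose, prod
from statistics import geometric_mean, harmonic_mean


def gmean(values):
    """nth root of the product of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return prod(values) ** (1 / len(values))


def hmean(values):
    """n divided by the sum of reciprocals of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires positive values")
    return len(values) / sum(1 / v for v in values)


x = [1, 2, 4]  # hypothetical example data

print(gmean(x))  # cube root of 8, i.e. about 2
print(hmean(x))  # 3 / 1.75, i.e. about 1.714

# The hand-written versions agree with the standard library.
print(isclose(gmean(x), geometric_mean(x)))  # True
print(isclose(hmean(x), harmonic_mean(x)))   # True
```

For very long arrays the direct product can overflow or underflow, so library implementations typically work with logarithms instead.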

Confidence Intervals

The confidence interval gives an estimated range of values which is likely to include an unknown population parameter. The interval has an associated confidence level that quantifies the level of confidence that the parameter lies within the interval. The formula for the confidence interval can be written as: \begin{equation} CI = \overline{x}\pm z \left(\frac{s}{\sqrt{n}}\right) \end{equation} where \(\overline{x}\) is the sample mean, \(z\) is the critical value for the chosen confidence level, \(s\) is the standard deviation, and \(n\) is the number of samples. This formula gives the confidence interval for the mean when the standard deviation is known.
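
A z-based confidence interval for the mean can be sketched as follows. The function name, the default 95\% confidence level, and the use of the sample standard deviation for \(s\) are assumptions made for this illustration; the website may use different settings.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds of a z-based interval for the mean."""
    n = len(values)
    mean = fmean(values)
    s = stdev(values)  # assumption: sample standard deviation
    # Two-sided critical value, e.g. about 1.96 for 95 % confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / sqrt(n)
    return mean - margin, mean + margin


x = [2, 4, 6, 8, 10, 12, 14, 16]
lower, upper = mean_confidence_interval(x)
print(round(lower, 2), round(upper, 2))  # roughly 5.61 and 12.39
```

For small samples with an unknown standard deviation, a critical value from the t-distribution is commonly used in place of \(z\).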

Quartiles

Quartile 25 (Q1)

The first quartile (Q1) is the 25th percentile of the data. It is the median of the lower half of the dataset.

Quartile 50 (Q2)

The second quartile (Q2) is the median, which is the 50th percentile.

Quartile 75 (Q3)

The third quartile (Q3) is the 75th percentile of the data. It is the median of the upper half of the dataset.

Pearson's Correlation Analysis

Pearson's correlation analysis is a statistical method used to measure the strength and direction of the linear relationship between two continuous variables. Represented by the correlation coefficient \(r\), the value of r ranges from -1 to 1. A value of 1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable also increases proportionally. Conversely, a value of -1 signifies a perfect negative linear relationship, where one variable increases as the other decreases. A value of 0 indicates no linear relationship between the variables. Pearson's correlation is widely used in fields such as finance, economics, social sciences, and natural sciences to understand and quantify the degree to which two variables are related. Pearson's correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. This normalization ensures that the coefficient is dimensionless and thus comparable across different datasets. The interpretation of the correlation coefficient depends on its magnitude and sign. For instance, values close to 0 imply a weak linear relationship, while values closer to ±1 indicate a stronger linear relationship. It is important to note that Pearson's correlation only measures linear relationships and can be influenced by outliers, which may distort the true relationship between the variables. Therefore, it is often accompanied by visual inspections such as scatterplots to assess the nature of the relationship more comprehensively.

Outlier detection

Detecting outliers using boxplot involves identifying data points that lie significantly outside the interquartile range (IQR). The outlier detection consists of the following steps:

Calculate the quartiles - Only first and third quartile are required.
Calculate the interquartile range (IQR) - as mentioned previously, to calculate the interquartile range the first quartile is subtracted from the third one.
Determine the outlier thresholds - the lower and upper bound have to be calculated. The formula for the lower bound can be written as: \begin{equation} LB = Q_1 - 1.5\cdot IQR, \end{equation} The formula for the upper bound can be written as: \begin{equation} UB = Q_3 + 1.5\cdot IQR. \end{equation}
Identify the outliers - Any data points below the lower bound or above the upper bound are considered outliers.
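
The four steps above can be sketched in Python, applied to the array from the maximum/minimum example. The function name `iqr_outliers` is hypothetical, and the quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is an assumption: the website and other tools may use a different quartile convention and therefore slightly different bounds.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the IQR rule."""
    # Step 1: quartiles (assumption: "inclusive" interpolation method).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    # Step 2: interquartile range.
    iqr = q3 - q1
    # Step 3: outlier thresholds.
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Step 4: points outside the thresholds.
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers


x = [4, 5, 10, 20, 100, 2000, 200, 300]
lower_bound, upper_bound, outliers = iqr_outliers(x)
print(lower_bound, upper_bound)  # -315.625 549.375
print(outliers)                  # [2000]
```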

Wednesday, 24 July 2024

MinMaxScaler

Here you can upload your dataset by clicking on the Choose File button. The pop-up "Open" window will appear and you will have to locate the dataset on your computer. After finding the dataset, click the Open button and all the dataset variables will be automatically scaled using MinMaxScaler. This scaler will scale each dataset variable to the 0-1 range. The detailed description of the MinMaxScaler is given below.

Step one: Select a .csv format file.

Step two: Download the file after the process is completed.

What is MinMaxScaler?

MinMaxScaler is a feature (variable) scaling technique that is popular in machine learning and is available in the scikit-learn library in Python. This scaler transforms each feature (variable) by scaling it to a given range. So far the scaler on this website can scale to the 0-1 range only.

How does it work?

The scaler works by subtracting the minimum value of each feature and then dividing it by the range (maximum value-minimum value) of the feature. The formula for scaling each feature (variable) value can be written as: \begin{equation} x_s = \frac{x-x_{min}}{x_{max} - x_{min}} \end{equation} where:

\(x\) is the original feature (variable) value,
\(x_{min}\) - is the minimum value of the feature (variable), and
\(x_{max}\) - is the maximum value of the feature (variable).

Example of MinMaxScaler

Let's say we have the following array of values: \begin{equation} x = [2,4,6,8,10] \end{equation} The following steps are necessary to scale this array to the 0-1 range:

Find the minimum value
Find the maximum value
Apply the min-max scaling formula
Calculate scaled values

Find the minimum value

The minimum value of the array is 2. \begin{equation} x_{min} = \min([2,4,6,8,10]) = 2 \end{equation}

Find the maximum value

The maximum value of the array is 10. \begin{equation} x_{max} = \max([2,4,6,8,10]) = 10 \end{equation}

Apply the min-max scaling formula

The formula for scaling each value of the array to the range [0,1] can be written as: \begin{equation} x_s = \frac{x-x_{min}}{x_{max} - x_{min}} \end{equation}

Calculate scaled values

For \(x_0 = 2\) \begin{equation} x_{s0} = \frac{x_0-x_{min}}{x_{max} - x_{min}} = \frac{2-2}{10-2} = \frac{0}{8} = 0 \end{equation}
For \(x_1 = 4\) \begin{equation} x_{s1} = \frac{x_1-x_{min}}{x_{max} - x_{min}} = \frac{4-2}{10-2} = \frac{2}{8} = 0.25 \end{equation}
For \(x_2 = 6\) \begin{equation} x_{s2} = \frac{x_2-x_{min}}{x_{max} - x_{min}} = \frac{6-2}{10-2} = \frac{4}{8} = 0.5 \end{equation}
For \(x_3 = 8\) \begin{equation} x_{s3} = \frac{x_3-x_{min}}{x_{max} - x_{min}} = \frac{8-2}{10-2} = \frac{6}{8} = 0.75 \end{equation}
For \(x_4 = 10\) \begin{equation} x_{s4} = \frac{x_4-x_{min}}{x_{max} - x_{min}} = \frac{10-2}{10-2} = \frac{8}{8} = 1. \end{equation}
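
The whole example can be reproduced with a few lines of Python. This is a minimal sketch rather than the website's code: the function name `min_max_scale` is made up, and mapping a constant column to 0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Scale values to the 0-1 range: (x - min) / (max - min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant column is mapped to 0.
        return [0.0 for _ in values]
    return [(v - x_min) / (x_max - x_min) for v in values]


x = [2, 4, 6, 8, 10]
x_scaled = min_max_scale(x)
print(x_scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value in the column changes the scaled position of every other value.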

What are the benefits of utilizing the MinMaxScaler?

The scaler ensures that all the dataset features (variables) are on the same scale, which is crucial for many machine learning algorithms that are sensitive to the scales of the input features, such as k-nearest neighbors or artificial neural networks.
Scaling features to a fixed range bounds every value, but the minimum and maximum are themselves set by the most extreme values, so outliers still influence the scaling and can compress the remaining values into a narrow part of the range. MinMaxScaler is a simple yet effective tool for feature scaling, especially useful when a machine learning algorithm expects its input features to lie within a fixed range.