Friday, 26 July 2024

Analyze my data

Load CSV in JavaScript
Here you can analyze any dataset you want, as long as all the variables are in numeric format. To analyze a dataset, you first need to upload it (.csv format only). When you click the Choose File button, the Open file window pops up and asks you to select the dataset (.csv format) from your computer. After you select the file and click OK, the website automatically generates all statistics for the dataset: it provides the basic information about the dataset (number of features/variables and samples), shows what each feature/variable (column) contains (numbers, strings, or mixed type), displays the dataset, and provides the results of the statistical analysis (mean, median, mode, range, variance, standard deviation, interquartile range, skewness, kurtosis, count of unique values, maximum values, minimum values, sum, variation coefficients, geometric mean, harmonic mean, confidence interval, quartile 1, quartile 2, and quartile 3), histograms of each dataset variable, Pearson's correlation analysis, and outlier analysis. Each of these statistical methods is described below.

Display Dataset

Histograms

Pearson's correlation heatmap

Outlier Detection using Boxplot

Dataset Basic Information

Dataset Statistical Analysis

For the uploaded dataset, this web application calculates the measures of central tendency, measures of dispersion, measures of shape, unique values, extremes, specialized means, confidence intervals, and quartiles for each dataset variable. The web application also counts the number of samples of each variable. If the count values are not the same for all dataset variables, at least one variable is missing a value for some dataset sample.
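The per-variable count described above can be sketched in a few lines. This is a minimal Python illustration, not the web application's own code; the function name `count_numeric_values` and the sample data are hypothetical, and the sketch assumes that an empty or non-numeric cell counts as a missing value.

```python
import csv
import io

def count_numeric_values(csv_text):
    """Count the numeric cells in each column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    counts = {name: 0 for name in names}
    for row in reader:
        for name in names:
            cell = (row[name] or "").strip()
            try:
                float(cell)
            except ValueError:
                continue  # empty or non-numeric cell: treated as missing
            counts[name] += 1
    return counts

# Hypothetical sample data: column "b" has one empty cell.
sample = "a,b\n1,2\n3,\n5,6\n"
print(count_numeric_values(sample))  # {'a': 3, 'b': 2}
```

Because the counts for "a" and "b" differ, column "b" is missing a value in one sample.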

Measures of Central Tendency

The measures of central tendency are statistical metrics that describe the center point or typical value of a dataset. They provide a summary that represents the entire dataset, which is useful for understanding the general behavior of the data. The most common measures of central tendency are the mean, the median, and the mode.
The measures of central tendency are important for simplification and summary, descriptive statistics, decision making, and data analysis.
  • Simplification and summary
    • Condensing information - the measures of central tendency reduce a large dataset to a single representative value. This makes the data easier to understand and communicate.
    • Comparative analysis - they facilitate comparison between different datasets by providing a common ground.
  • Descriptive statistics
    • Insight into dataset distribution - the statistics offer insight into the distribution of each variable, highlighting where the central point lies.
    • Initial analysis - they serve as the starting point for further statistical analysis and hypothesis testing.
  • Decision making
    • Guiding decisions - central tendency measures are often used in business, economics, social sciences, and other fields to make informed decisions.
    • Policy and planning - they aid in policy formulation, planning, and setting benchmarks.
  • Data analysis
    • Identifying trends - they help in identifying trends and patterns within the data.
    • Detecting outliers - these measures are primarily focused on determining the central value of each dataset variable (at least in this case); however, they can also assist in detecting outliers when combined with measures of dispersion.

Mean

The mean value of a dataset variable, or average, is calculated by summing all the values of the variable and dividing the sum by the number of values that the variable contains (the length of the array in which all variable values are stored). It is represented as \(\overline{x}\). The formula for calculating the mean is written as: \begin{equation} \overline{x} = \frac{\sum_{i=1}^nx_i}{n}, \end{equation} where:
  • \(\sum_{i=1}^nx_i\) - sum of all the values that variable contains
  • \(n\) - is the number of samples that variable contains.
Example: Calculating Mean of an Array
Calculate the mean value of an array: \begin{equation} x = [2,4,6,8,10,12,14,16] \end{equation} First step - determine the array length by counting the number of elements in the array. The array consists of 8 elements. Second step - calculate the sum. \begin{equation} \sum_{i=1}^8 x_i = 2+4+6+8+10+12+14+16 = 72 \end{equation} Third step - calculate the mean value. \begin{equation} \overline{x} = \frac{\sum_{i=1}^8x_i}{8} = \frac{72}{8} = 9 \end{equation}
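The three steps of the example can be reproduced with a short Python sketch. It is only an illustration of the formula (the helper name `mean` is chosen here), not the code the web application runs.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)

x = [2, 4, 6, 8, 10, 12, 14, 16]
print(len(x))   # 8
print(sum(x))   # 72
print(mean(x))  # 9.0
```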

Median

The median is the middle value of the dataset variable when its values are ordered from lowest to highest. If the dataset has an even number of observations (samples), the median is the average of the two middle numbers.
Example: Calculation of the median value
Calculate the median value for the arrays \begin{equation} x_1 = [20,30,40,50,60], \end{equation} and \begin{equation} x_2 = [50,30,40,20] \end{equation} The array \(x_1\) is already ordered and contains 5 elements, so the median is the middle (third) element, which is 40. For this particular array the mean gives the same value, \begin{equation} \frac{20+30+40+50+60}{5} = \frac{200}{5} = 40, \end{equation} but only because the values are spread symmetrically around the center; in general the mean and the median differ. For the second array \(x_2\) the elements first have to be ordered from smallest to largest: \begin{equation} x_{2O} = [20,30,40,50] \end{equation} The array has an even number of elements, so, as the definition states, the median is the average of the two middle values: \begin{equation} \frac{30+40}{2} = \frac{70}{2} = 35 \end{equation} Here too the mean, \(140/4 = 35\), happens to coincide with the median because the array is symmetric.
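A minimal Python sketch of the median definition follows; the helper name `median` is chosen for illustration. The last call uses an asymmetric array (an assumption added here, not taken from the example) to show that the median is found by position in the ordered array and generally differs from the mean.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([20, 30, 40, 50, 60]))  # 40
print(median([50, 30, 40, 20]))      # 35.0
print(median([1, 2, 100]))           # 2 (the mean of this array is about 34.3)
```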

Mode

The mode of a dataset variable is the value that appears most frequently in that variable. A variable may have one mode, more than one mode, or no mode at all.
Example: Calculating the mode of an array
Determine the mode of the arrays [1,4,4,6,5] and [1,2,3,4,5]. The first array contains 5 elements and only the value 4 occurs more than once, so the mode is 4. The second array also contains 5 elements, but none of them occurs more than once, so there is no mode.
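The following Python sketch returns all modes of an array. It follows the convention used in the example above, where an array in which no value repeats has no mode; the function name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every most frequent value, or an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 4, 4, 6, 5]))  # [4]
print(modes([1, 2, 3, 4, 5]))  # []
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```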

Measures of Dispersion

The measures of dispersion are range, variance, standard deviation, and interquartile range.

Range

The range is the difference between the maximum and the minimum value in the variable array. The range formula can be written as: \begin{equation} range = max - min \end{equation}
Example: Range calculation of an array
Calculate the range of the array [20,30,40,50,60]. First step: determine the minimum and the maximum value. The minimum value in the array is 20 and the maximum is 60. Second step: subtract the minimum from the maximum value. \begin{equation} range = 60 - 20 = 40 \end{equation} The range of this array is 40.

Variance

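The variance is a measure of the spread of the data points around the mean value. It is the average of the squared differences from the mean. To determine the variance, subtract the mean from each value, square each difference, and average the squared differences. The population variance can be written as: \begin{equation} \sigma^2 = \frac{\sum_{i=1}^n\left(x_i-\overline{x}\right)^2}{n} \end{equation} The sample variance \(s^2\) uses \(n-1\) instead of \(n\) in the denominator.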

Standard Deviation

The standard deviation is the square root of the variance. It provides a measure of the typical distance of the values from the mean. The population standard deviation is calculated using the formula: \begin{equation} \sigma = \sqrt{\sigma^2} \end{equation}
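As an illustration, the variance and standard deviation can be computed as in the Python sketch below. It assumes the population form (division by \(n\)) by default; the `sample` flag, which switches to division by \(n-1\), and the function names are choices made for this sketch and may not match what the web application does.

```python
import math

def variance(values, sample=False):
    """Average squared deviation from the mean (divide by n, or by n - 1 if sample=True)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough values")
    m = sum(values) / n
    squared_diffs = sum((v - m) ** 2 for v in values)
    return squared_diffs / (n - 1 if sample else n)

def std_dev(values, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))

x = [20, 30, 40, 50, 60]
print(variance(x))               # 200.0
print(variance(x, sample=True))  # 250.0
print(round(std_dev(x), 4))      # 14.1421
```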

Interquartile Range (IQR)

The IQR is the range of the middle 50\% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). The formula for calculating the IQR can be written as: \begin{equation} IQR = Q_3 - Q_1 \end{equation}

Measures of Shape

The measures of shape are skewness and kurtosis.

Skewness

The skewness measures the asymmetry of the distribution of values in a dataset. Positive skew (right-skewed) means the tail on the right side is longer or fatter. Negative skew (left-skewed) means the tail on the left side is longer or fatter. The formula for calculating the sample skewness can be written as: \begin{equation} Skew = \frac{n}{(n-1)(n-2)}\sum_{i=1}^n\left(\frac{x_i-\overline{x}}{s}\right)^3 \end{equation} where \(s\) is the sample standard deviation.
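A minimal Python sketch of the adjusted sample skewness is shown below. It assumes the standardized deviations are cubed and that \(s\) is the sample standard deviation (division by \(n-1\)); the function name and the test arrays are chosen for illustration.

```python
import math

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

print(abs(sample_skewness([20, 30, 40, 50, 60])) < 1e-12)  # True: symmetric data
print(sample_skewness([1, 2, 3, 4, 100]) > 0)              # True: long right tail
```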

Kurtosis

Kurtosis measures the tailedness of the distribution of values in a dataset. High kurtosis means more of the variance is due to infrequent extreme deviations. Low kurtosis means the variance is more evenly distributed. The sample excess kurtosis can be written as: \begin{equation} Kurtosis = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^n\left( \frac{x_i-\overline{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} \end{equation}
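The kurtosis formula above translates directly into the Python sketch below, assuming \(s\) is the sample standard deviation. The function name is hypothetical; the result is the excess kurtosis, so a normal distribution would score close to 0.

```python
import math

def sample_excess_kurtosis(values):
    """Sample excess kurtosis, following the formula term by term."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least four values")
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    if s == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth_powers = sum(((v - m) / s) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction

print(round(sample_excess_kurtosis([20, 30, 40, 50, 60]), 6))  # -1.2
```

The negative value reflects that evenly spaced values have lighter tails than a normal distribution.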

Unique Values

The count of the distinct values in the dataset. For example, in the dataset [1,2,3,4,5,6] there are 6 unique values: 1, 2, 3, 4, 5, and 6.

Extremes

The extremes are the minimum and maximum values.

Maximum

The largest value in a variable.
Example: Determine the maximum value in the array
The array of the dataset variable consists of the following elements. \begin{equation} x = [4,5,10,20,100,2000,200,300] \end{equation} As seen from the array, the maximum value is 2000.

Minimum

The smallest value in a variable.
Example: Determine the minimum value in the array
In this example the array from the previous example will be used. The minimum value in the array is 4.

Specialized means

Geometric mean

The geometric mean is the nth root of the product of n values. It is useful for data that are multiplicative or that vary exponentially. The formula for calculating the geometric mean is: \begin{equation} GMean = \left(\prod_{i=1}^nx_i\right)^\frac{1}{n} \end{equation}
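A short Python sketch of the geometric mean follows. It assumes all values are strictly positive and averages the logarithms instead of multiplying the values directly, which gives the same result mathematically while avoiding overflow for long arrays. The function name and sample arrays are illustrative.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(round(geometric_mean([2, 8]), 6))        # 4.0
print(round(geometric_mean([1, 10, 100]), 6))  # 10.0
```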

Harmonic mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values. It is useful for rates and ratios. The formula for calculating the harmonic mean can be written as: \begin{equation} HMean = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}} \end{equation}
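The harmonic mean can be sketched in Python as below, assuming strictly positive values. The speed example is an illustration added here: covering the same distance at 40 km/h and then at 60 km/h gives an average speed of 48 km/h, not 50.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("the harmonic mean needs positive values")
    return len(values) / sum(1 / v for v in values)

print(round(harmonic_mean([40, 60]), 6))  # 48.0
```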

Confidence Intervals

The confidence interval gives an estimated range of values which is likely to include an unknown population parameter. The interval has an associated confidence level that quantifies the level of confidence that the parameter lies within the interval. The formula for the confidence interval can be written as: \begin{equation} CI = \overline{x}\pm z \left(\frac{s}{\sqrt{n}}\right) \end{equation} where \(z\) is the critical value of the standard normal distribution for the chosen confidence level (for example, \(z \approx 1.96\) for 95\%). This formula gives the confidence interval for the mean when the standard deviation is known.
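The confidence interval formula can be illustrated with the Python sketch below. It makes two assumptions that the text does not confirm for the web application: \(s\) is estimated as the sample standard deviation, and the default confidence level is 95%. The critical value \(z\) comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds of mean ± z * s / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * s / math.sqrt(n)
    return m - margin, m + margin

low, high = mean_confidence_interval([2, 4, 6, 8, 10, 12, 14, 16])
print(round(low, 2), round(high, 2))  # 5.61 12.39
```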

Quartiles

Quartile 25 (Q1)

The first quartile (Q1) is the 25th percentile of the data. It is the median of the lower half of the dataset.

Quartile 50 (Q2)

The second quartile (Q2) is the median, which is the 50th percentile of the data.

Quartile 75 (Q3)

The third quartile (Q3) is the 75th percentile of the data. It is the median of the upper half of the dataset.
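The three quartile definitions can be sketched in Python as follows. The sketch uses the "median of each half" convention described above and leaves the middle element out of both halves when the number of values is odd; this is an assumption, and tools that interpolate between values can return slightly different quartiles. The function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed")
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2 :]  # skips the middle element when n is odd
    return median(lower_half), median(ordered), median(upper_half)

q1, q2, q3 = quartiles([4, 5, 10, 20, 100, 2000, 200, 300])
print(q1, q2, q3)  # 7.5 60.0 250.0
print(q3 - q1)     # 242.5 (the interquartile range)
```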

Pearson's Correlation Analysis

Pearson's correlation analysis is a statistical method used to measure the strength and direction of the linear relationship between two continuous variables. Represented by the correlation coefficient \(r\), the value of r ranges from -1 to 1. A value of 1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable also increases proportionally. Conversely, a value of -1 signifies a perfect negative linear relationship, where one variable increases as the other decreases. A value of 0 indicates no linear relationship between the variables. Pearson's correlation is widely used in fields such as finance, economics, social sciences, and natural sciences to understand and quantify the degree to which two variables are related. Pearson's correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. This normalization ensures that the coefficient is dimensionless and thus comparable across different datasets. The interpretation of the correlation coefficient depends on its magnitude and sign. For instance, values close to 0 imply a weak linear relationship, while values closer to ±1 indicate a stronger linear relationship. It is important to note that Pearson's correlation only measures linear relationships and can be influenced by outliers, which may distort the true relationship between the variables. Therefore, it is often accompanied by visual inspections such as scatterplots to assess the nature of the relationship more comprehensively.
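As a sketch of the calculation described above, the Python function below divides the covariance of two variables by the product of their standard deviations. The \(1/n\) factors cancel, so plain sums are used. The function name `pearson_r` and the sample arrays are chosen for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric arrays."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / (spread_x * spread_y)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # -1.0
```

Applying such a function to every pair of columns yields the matrix that a correlation heatmap displays.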

Outlier detection

Detecting outliers using boxplot involves identifying data points that lie significantly outside the interquartile range (IQR). The outlier detection consists of the following steps:
  • Calculate the quartiles - Only first and third quartile are required.
  • Calculate the interquartile range (IQR) - as mentioned previously, to calculate the interquartile range the first quartile is subtracted from the third one.
  • Determine the outlier thresholds - the lower and upper bound have to be calculated. The formula for the lower bound can be written as: \begin{equation} LB = Q_1 - 1.5\cdot IQR, \end{equation} The formula for the upper bound can be written as: \begin{equation} UB = Q_3 + 1.5\cdot IQR. \end{equation}
  • Identify the outliers - Any data points below the lower bound or above the upper bound are considered outliers.
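The four steps above can be sketched in Python as follows. The quartiles are computed as the medians of the lower and upper halves of the ordered data, which is an assumption; other quartile conventions shift the bounds slightly. The function name `boxplot_outliers` and the parameter `k` (the 1.5 multiplier) are hypothetical.

```python
from statistics import median

def boxplot_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed")
    q1 = median(ordered[: n // 2])        # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])  # median of the upper half
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers

x = [4, 5, 10, 20, 100, 2000, 200, 300]
print(boxplot_outliers(x))  # (-356.25, 613.75, [2000])
```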
