Here you can analyze any dataset you want, as long as all the variables are in numeric format. To analyze a dataset, you first need to upload it (.csv format only). When you click the Choose File button, the Open file window appears and asks you to select the dataset (.csv format) from your computer. After you select the file, click OK and the website automatically generates all statistics for the dataset. It provides the basic information about the dataset (number of features/variables and samples), shows what each feature/variable (column) contains (numbers, strings, or mixed type), shows the dataset, and provides the results of the statistical analysis (mean, median, mode, range, variance, standard deviation, interquartile range, skewness, kurtosis, count of unique values, maximum values, minimum values, sum, coefficient of variation, geometric mean, harmonic mean, confidence interval, quartile 1, quartile 2, and quartile 3), histograms of each dataset variable, Pearson's correlation analysis, and outlier analysis. Each of these statistical methods is described below.
Display Dataset
Histograms
Pearson's correlation heatmap
Outlier Detection using Boxplot
Dataset Basic Information
Dataset Statistical Analysis
For the uploaded dataset, this web application calculates the measures of central tendency, measures of dispersion, measures of shape, unique values, extremes, specialized means, confidence intervals, and quartiles for each dataset variable. The web application also counts the number of samples of each variable. If the counts are not the same for all dataset variables, at least one variable has missing values for some dataset samples.
Measures of Central Tendency
The measures of central tendency are statistical metrics that describe the center point or typical value of a dataset. They provide a summary that represents the entire dataset, which is useful for understanding the general behavior of the data. The most common measures of central tendency are the mean, median, and mode. The measures of central tendency are important for simplification and summary, descriptive statistics, decision making, and data analysis.
- Simplification and summary
  - Condensing information - the measures of central tendency reduce a large dataset to a single representative value. This makes it easier to understand and communicate.
  - Comparative analysis - they facilitate comparison between different datasets by providing a common ground.
- Descriptive statistics
  - Insight into dataset distribution - the statistics offer insight into the distribution of the variables, highlighting where the central point lies.
  - Initial analysis - they serve as the starting point for further statistical analysis and hypothesis testing.
- Decision making
  - Guiding decisions - central tendency measures are often used in business, economics, social sciences, and other fields to make informed decisions.
  - Policy and planning - they aid in policy formulation, planning, and setting benchmarks.
- Data analysis
  - Identifying trends - they help in identifying trends and patterns within the data.
  - Detecting outliers - these measures are primarily focused on determining the central value of each dataset variable (at least in this case); however, they can also assist in detecting outliers when combined with measures of dispersion.
Mean
The mean value of a dataset variable, or average, is calculated by summing all the values of that variable and dividing the sum by the number of values the variable contains (the length of the array in which all variable values are stored). It is represented as \(\overline{x}\). The formula for calculating the mean is written as:
\begin{equation}
\overline{x} = \frac{\sum_{i=1}^nx_i}{n},
\end{equation}
where:
- \(\sum_{i=1}^nx_i\) - sum of all the values that variable contains
- \(n\) - is the number of samples that variable contains.
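As a minimal sketch of the formula above, the mean can be computed in plain Python. The function name `mean` and the sample list are illustrative assumptions, not the web application's actual code.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count n."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


sample = [2, 4, 4, 5, 10]  # hypothetical variable values
print(mean(sample))  # 5.0
```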
Median
The median is the middle value of the dataset variable when its values are ordered from lowest to highest. If the variable has an even number of observations (samples), the median is the average of the two middle numbers.
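The odd and even cases can be sketched as follows. This is an illustrative implementation under the definition above; the function name `median` and the sample lists are assumptions.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle element exists.
        return ordered[mid]
    # Even count: average the two elements around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```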
Mode
The mode of a dataset variable is the value that appears most frequently in that variable. A variable may have one mode, more than one mode, or no mode at all.
Measures of Dispersion
The measures of dispersion are range, variance, standard deviation, and interquartile range.
Range
The range is the difference between the maximum and the minimum value in the variable array. The range formula can be written as:
\begin{equation}
range = max - min
\end{equation}
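A one-line sketch of the range formula in Python; the name `value_range` is chosen here only to avoid shadowing the built-in `range`.

```python
def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


print(value_range([3, 9, 1, 6]))  # 8
```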
Variance
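The variance is a measure of the spread of the data points around the mean value. It is the average of the squared differences from the mean. The population variance is calculated using the formula:
\begin{equation}
\sigma^2 = \frac{\sum_{i=1}^n(x_i-\overline{x})^2}{n}
\end{equation}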
Standard Deviation
The standard deviation is the square root of the variance. It provides a measure of the typical distance of the values from the mean. The population standard deviation is calculated using the formula:
\begin{equation}
\sigma = \sqrt{\sigma^2}
\end{equation}
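The following sketch computes the population variance (the average of the squared differences from the mean) and its square root. It assumes the population form, which divides by \(n\); the sample form divides by \(n-1\) instead, and which of the two the web application reports is not stated here. The function names are illustrative.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance requires at least one value")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.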
Interquartile Range (IQR)
The IQR is the range of the middle 50\% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). The formula for calculating the IQR can be written as:
\begin{equation}
IQR = Q_3 - Q_1
\end{equation}
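A short sketch of the IQR using the standard library. Quartiles can be computed under several conventions; this example assumes linear interpolation between data points (`method="inclusive"`), which matches the default behavior of NumPy and pandas. Whether the web application uses the same convention is an assumption.

```python
import statistics


def interquartile_range(values):
    """IQR = Q3 - Q1, with quartiles found by linear interpolation."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


print(interquartile_range([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 4.0
```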
Measures of Shape
The measures of shape are skewness and kurtosis.
Skewness
Skewness measures the asymmetry of the distribution of values in a dataset. Positive skew (right-skewed) means the tail on the right side is longer or fatter. Negative skew (left-skewed) means the tail on the left side is longer or fatter.
The formula for calculating the sample skewness, where \(s\) is the sample standard deviation, can be written as:
\begin{equation}
Skew = \frac{n}{(n-1)(n-2)}\sum_{i=1}^n\left(\frac{x_i-\overline{x}}{s}\right)^3
\end{equation}
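A minimal sketch of the adjusted sample skewness, in which the standardized deviations are cubed and \(s\) is the sample standard deviation (dividing by \(n-1\)). The function name and sample data are illustrative assumptions.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness requires at least three values")
    m = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - m) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


symmetric = [1, 2, 3, 4, 5]      # skewness close to 0
right_tail = [1, 2, 3, 4, 10]    # long right tail, positive skewness
print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
```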
Kurtosis
Kurtosis measures the tailedness of the distribution of values in a dataset. High kurtosis means more of the variance is due to infrequent extreme deviations. Low kurtosis means the variance is more evenly distributed. The formula for calculating the sample excess kurtosis can be written as:
\begin{equation}
Kurtosis = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum\left( \frac{x_i-\overline{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
\end{equation}
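The kurtosis formula above can be sketched term by term in Python. The result is the excess kurtosis, so a normal distribution scores near 0. The function name is an illustrative assumption.

```python
import statistics


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis, following the formula term by term."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis requires at least four values")
    m = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Evenly spread values have lighter tails than a normal distribution,
# so the excess kurtosis is negative (about -1.2 for this sample).
print(sample_excess_kurtosis([1, 2, 3, 4, 5]))
```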
Unique Values
The count of the distinct values in a dataset variable. For example, in the dataset [1,2,3,4,5,6] there are 6 unique values: 1, 2, 3, 4, 5, and 6.
Extremes
The extremes are the minimum and maximum values.
Maximum
The largest value in a variable.
Minimum
The smallest value in a variable.
Specialized means
Geometric mean
The geometric mean is the nth root of the product of n values. It is useful for data that are multiplicative or that vary exponentially, and it is defined only for positive values.
The formula for calculating the geometric mean is:
\begin{equation}
GMean = \left(\prod_{i=1}^n x_i\right)^\frac{1}{n}
\end{equation}
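A sketch of the geometric mean for positive values. It is computed through logarithms, which is mathematically equivalent to taking the nth root of the product but avoids overflow for long arrays; this is a design choice of the example, not necessarily what the web application does.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values, computed via logarithms."""
    if not values:
        raise ValueError("geometric mean requires at least one value")
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(x) for x in values) / len(values))


# 1 * 3 * 9 = 27, and the cube root of 27 is 3 (up to floating-point rounding).
print(geometric_mean([1, 3, 9]))
```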
Harmonic mean
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values. It is useful for rates and ratios. The formula for calculating the harmonic mean can be written as:
\begin{equation}
HMean = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}
\end{equation}
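A sketch of the harmonic mean, using average speed as an illustrative example of a rate. The function name and the values are assumptions.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the values."""
    if not values:
        raise ValueError("harmonic mean requires at least one value")
    if any(x <= 0 for x in values):
        raise ValueError("harmonic mean requires strictly positive values")
    return len(values) / sum(1 / x for x in values)


# Driving the same distance at 40 km/h and then at 60 km/h gives an
# average speed of about 48 km/h, not the arithmetic mean of 50 km/h.
print(harmonic_mean([40, 60]))
```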
Confidence Intervals
The confidence interval gives an estimated range of values which is likely to include an unknown population parameter. The interval has an associated confidence level that quantifies the level of confidence that the parameter lies within the interval. The formula for confidence interval can be written as:
\begin{equation}
CI = \overline{x}\pm z \left(\frac{s}{\sqrt{n}}\right)
\end{equation}
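The formula above gives the confidence interval for the mean when the standard deviation is known, where \(z\) is the critical value of the standard normal distribution for the chosen confidence level (for example, \(z \approx 1.96\) for a 95\% confidence level).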
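A sketch of the z-based interval for the mean. It assumes a 95% confidence level by default and uses the sample standard deviation as the estimate of \(s\); both choices, as well as the function name, are assumptions of this example rather than documented behavior of the web application.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) for mean +/- z * s / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    m = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / math.sqrt(n)
    return m - margin, m + margin


low, high = mean_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
print(low, high)  # an interval centered on the sample mean of 5
```

For small samples, the critical value is usually taken from Student's t distribution instead of the normal distribution, which yields a wider interval.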
Quartiles
Quartile 25 (Q1)
The first quartile (Q1) is the 25th percentile of the data. It is the median of the lower half of the dataset.
Quartile 50 (Q2)
The second quartile (Q2) is the median, which is the 50th percentile of the data.
Quartile 75 (Q3)
The third quartile (Q3) is the 75th percentile of the data. It is the median of the upper half of the dataset.
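The "median of each half" description can be sketched as follows. The example assumes that, for an odd number of values, the middle value is excluded from both halves. This is only one of several quartile conventions, and interpolation-based methods (such as the defaults in NumPy and pandas) can return slightly different values for the same data.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are required")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


print(quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # (2.5, 5, 7.5)
```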
Pearson's Correlation Analysis
Pearson's correlation analysis is a statistical method used to measure the strength and direction of the linear relationship between two continuous variables. Represented by the correlation coefficient \(r\), the value of \(r\) ranges from -1 to 1. A value of 1 indicates a perfect positive linear relationship, meaning that as one variable increases, the other variable also increases proportionally. Conversely, a value of -1 signifies a perfect negative linear relationship, where one variable increases as the other decreases. A value of 0 indicates no linear relationship between the variables. Pearson's correlation is widely used in fields such as finance, economics, social sciences, and natural sciences to understand and quantify the degree to which two variables are related.
Pearson's correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. This normalization ensures that the coefficient is dimensionless and thus comparable across different datasets. The interpretation of the correlation coefficient depends on its magnitude and sign. For instance, values close to 0 imply a weak linear relationship, while values closer to ±1 indicate a stronger linear relationship. It is important to note that Pearson's correlation only measures linear relationships and can be influenced by outliers, which may distort the true relationship between the variables. Therefore, it is often accompanied by visual inspections such as scatterplots to assess the nature of the relationship more comprehensively.
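The calculation described above (covariance divided by the product of the standard deviations) can be sketched in plain Python. The \(1/n\) factors in the covariance and in the standard deviations cancel, so the example works with raw sums. The function name and sample data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("both variables must have the same number of samples")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares; the common 1/n factor cancels in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # close to 1 (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1 (perfect negative)
```

A correlation heatmap is this coefficient computed for every pair of dataset variables and arranged in a matrix.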
Outlier detection
Detecting outliers using a boxplot involves identifying data points that lie significantly outside the interquartile range (IQR). The outlier detection consists of the following steps:
- Calculate the quartiles - only the first and third quartiles are required.
- Calculate the interquartile range (IQR) - as mentioned previously, the interquartile range is obtained by subtracting the first quartile from the third quartile.
- Determine the outlier thresholds - the lower and upper bound have to be calculated. The formula for the lower bound can be written as: \begin{equation} LB = Q_1 - 1.5\cdot IQR, \end{equation} The formula for the upper bound can be written as: \begin{equation} UB = Q_3 + 1.5\cdot IQR. \end{equation}
- Identify the outliers - Any data points below the lower bound or above the upper bound are considered outliers.
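The four steps above can be sketched as one small function. It assumes quartiles computed by linear interpolation (`method="inclusive"` in the standard library); a different quartile convention can shift the bounds slightly. The function name and the sample data are illustrative.

```python
import statistics


def boxplot_outliers(values):
    """Return the values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    # Step 1: quartiles (only Q1 and Q3 are needed).
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Step 2: interquartile range.
    iqr = q3 - q1
    # Step 3: outlier thresholds.
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Step 4: keep only the points outside the bounds.
    return [x for x in values if x < lower_bound or x > upper_bound]


data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical values
print(boxplot_outliers(data))  # [102]
```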