On this page you can generate boxplots for all dataset variables together and for each variable in the dataset individually. The dataset must be in .csv format. Click on "Choose file" and navigate to the dataset stored on your local computer. After the dataset is uploaded, the web application generates the boxplot graphs within a couple of seconds (for larger datasets this may take a couple of minutes, so please be patient). Boxplots are good for finding outliers in the dataset variables. A full description of boxplots and outliers is given below.
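
The code behind the web application is not shown on this page. For readers who want to prepare the same kind of input locally, the following is only a minimal sketch, assuming a .csv file with a header row. The function name load_numeric_columns and the sample data are hypothetical; columns containing text or empty cells are simply skipped.

```python
import csv
import io


def load_numeric_columns(csv_file):
    """Return {column name: list of floats} for fully numeric columns."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])

    numeric = {}
    for name, values in columns.items():
        try:
            numeric[name] = [float(v) for v in values]
        except (TypeError, ValueError):
            # Column holds text or missing cells, so no boxplot for it.
            continue
    return numeric


# Hypothetical in-memory CSV standing in for a file on disk.
sample = io.StringIO("temperature,city\n20.5,Rijeka\n19.0,Zagreb\n60.0,Split\n")
columns = load_numeric_columns(sample)
print(columns)  # {'temperature': [20.5, 19.0, 60.0]}
```

For a real file, the same function can be called as load_numeric_columns(open("data.csv", newline="")), where data.csv is a placeholder name.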

Boxplot Analysis

Boxplots, also known as box-and-whisker plots, are a tool used to display the distribution of a dataset. This type of plot helps to visualize the central tendency, variability, and presence of outliers in the data.

Central tendency - a summary measure that describes the entire array with a single value representing the middle or centre of its distribution.
Variability - refers to the spread or dispersion of data points in the dataset. It shows how much the data values differ from the central tendency. Boxplots display variability through the length of the box (interquartile range) and the extent of the whiskers.
Outliers - an outlier is a data point that deviates significantly from the other values in the dataset. For instance, if you measure room temperatures daily for a year and most readings are around 20°C, but some readings are as high as 60°C or more, these unusually high values are considered outliers. They lie far from the typical range of temperatures and stand out because they are much higher than the majority of the data.

The boxplot components

The components of the boxplot are the box, the whiskers and the outliers. The box consists of the interquartile range (IQR) and the median. The whiskers consist of the upper whisker and the lower whisker. The outliers are the individual points plotted beyond the whiskers.

Interquartile range - the box represents the middle 50% of the data, between the first quartile (Q1) and the third quartile (Q3). This range is known as the IQR (IQR = Q3 - Q1), which measures the spread of the middle 50% of the data.
Median - inside the box (IQR), a line indicates the median (the middle value of the dataset). This line divides the box into two parts, showing the central tendency.

Whiskers - they show the range of the data that is not classified as outliers.

Upper Whisker - extends from Q3 to the highest data value that lies within 1.5*IQR above Q3.
Lower Whisker - extends from Q1 to the lowest data value that lies within 1.5*IQR below Q1.
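
The whisker rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the web application: the function names and the temperature readings are made up, and Q1 and Q3 are computed with the median-of-halves rule described further down on this page.

```python
def median_sorted(data):
    """Median of an already sorted, non-empty list."""
    n = len(data)
    if n % 2 == 1:
        return data[n // 2]
    return (data[n // 2 - 1] + data[n // 2]) / 2


def whiskers_and_outliers(values):
    """Return (lower whisker end, upper whisker end, outliers)."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    q1 = median_sorted(data[: n // 2])        # lower half, median excluded if n is odd
    q3 = median_sorted(data[(n + 1) // 2:])   # upper half, median excluded if n is odd
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return inside[0], inside[-1], outliers


# Hypothetical room temperatures in °C with one implausible reading.
temperatures = [20, 21, 19, 22, 18, 60, 21, 20]
low, high, outliers = whiskers_and_outliers(temperatures)
print(low, high, outliers)  # 18 22 [60]
```

Here Q1 = 19.5 and Q3 = 21.5, so IQR = 2 and the acceptable range is 16.5 to 24.5. The reading of 60 falls outside it and is drawn as an outlier point, while the whiskers stop at 18 and 22.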

Notches - these are optional, i.e. in some boxplots the box is notched, which provides a visual indication of the confidence interval around the median. This is useful for comparing the medians of different groups.

How to calculate the elements of the boxplot

The elements you need for visualizing the boxplot of a dataset variable are: minimum, Q1 (first quartile), median, Q3 (third quartile) and maximum value. The minimum and maximum values are very simple to find, i.e. in Python the built-in functions min and max immediately return these values for a variable represented as a list (array). To calculate the rest of the boxplot elements, the dataset variable array must be ordered from the minimum to the maximum value.
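
As a small illustration of this step, the built-in functions min, max and sorted can be used as follows. The list x is a made-up example, not data from this page.

```python
# Hypothetical values of one dataset variable, in arbitrary order.
x = [39, 7, 50, 41, 15]

minimum = min(x)
maximum = max(x)
x_sorted = sorted(x)  # returns a new list, x itself stays unchanged

print(minimum, maximum)  # 7 50
print(x_sorted)          # [7, 15, 39, 41, 50]
```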

Median

The median is the middle value of a dataset when it is ordered in ascending or descending order. If the dataset has an odd number of samples, the median is the middle number. If the dataset has an even number of samples, the median is the average of the two middle numbers. The initial steps for calculating the median are:

Order the dataset from the smallest to largest.
Count the number of samples \(n\)

The number of samples \(n\) of the dataset variable can be even or odd. Positions are counted from 1 in the sorted array. If \(n\) is odd, the formula for calculating the median can be written as: \begin{equation} \mathrm{Median} = \mathrm{Value \ at \ position} \left(\frac{n+1}{2}\right) \end{equation} If \(n\) is even, the formula for calculating the median can be written as: \begin{equation} \mathrm{Median} = \frac{\mathrm{Value \ at \ position} \left(\frac{n}{2}\right) + \mathrm{Value \ at \ position}\left(\frac{n}{2}+1\right)}{2} \end{equation}
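
The two formulas can be sketched in Python as shown below. The only subtlety is that the formulas count positions from 1, while Python lists are indexed from 0, so every position is shifted by one. The function name median is chosen for illustration only.

```python
def median(values):
    """Median using the odd/even position formulas (positions start at 1)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    if n % 2 == 1:
        # Position (n + 1) / 2 corresponds to index (n + 1) // 2 - 1.
        return data[(n + 1) // 2 - 1]
    # Positions n/2 and n/2 + 1 correspond to indices n/2 - 1 and n/2.
    return (data[n // 2 - 1] + data[n // 2]) / 2


print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```

The standard library function statistics.median follows the same definition and can be used instead of a hand-written version.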

The first quartile (Q1)

The first quartile is the median of the lower half of the dataset. The lower half excludes the median if the number of observations is odd.

Order the dataset from smallest to largest.
Find the median of the dataset (this splits the data into two halves).

For an odd number of observations:

Exclude the median
Calculate the median of the lower half of the dataset

For an even number of observations:

Include all observations in the lower half
Calculate the median of the lower half.

\begin{equation} Q1 = \mathrm{Median \ of \ the \ lower \ half\ of\ the \ dataset} \end{equation}

Third quartile (Q3)

The third quartile is the median of the upper half of the dataset. The upper half excludes the median if the number of observations is odd. As before, order the dataset from smallest to largest and find the median of the dataset (this splits the data into two halves).

For an odd number of observations:

Exclude the median.
Calculate the median of the upper half of the dataset.

For an even number of observations:

Include all observations in the upper half.
Calculate the median of the upper half.

\begin{equation} Q3 = \mathrm{Median \ of \ the \ upper \ half \ of \ the \ dataset } \end{equation}
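
The rules for Q1 and Q3 can be sketched together in Python. This is a minimal illustration of the median-of-halves rule described above, with function names chosen for this example. Other tools may use different quartile conventions (for example linear interpolation between positions), so their results can differ slightly.

```python
def median_sorted(data):
    """Median of an already sorted, non-empty list."""
    n = len(data)
    if n % 2 == 1:
        return data[n // 2]
    return (data[n // 2 - 1] + data[n // 2]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    lower = data[: n // 2]        # excludes the median when n is odd
    upper = data[(n + 1) // 2:]   # excludes the median when n is odd
    return median_sorted(lower), median_sorted(upper)


print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2, 6)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 6.5)
```

In the first call the median 4 is excluded, so the halves are [1, 2, 3] and [5, 6, 7]. In the second call all eight values are used, four in each half.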

Example 1 - boxplot of the variable array

Let's say that your dataset contains a variable array which can be written as: \begin{equation} x1 = \{7,15,36,39,40,41,42,43,47,49,50\} \end{equation}

Median:
- Determine the number of samples in the array. There are a total of 11 samples in the array, \(n = 11\).
- Since there is an odd number of samples in the array, the formula for calculating the median can be written as: \begin{eqnarray} \mathrm{Median} &=& \mathrm{Value \ at \ position} \left(\frac{n+1}{2}\right)\\ \mathrm{Median} &=& \mathrm{Value \ at \ position} \ 6 = 41 \end{eqnarray}
First Quartile (Q1):
- The lower half of the dataset (median excluded) is {7,15,36,39,40}
- The lower half of the dataset has 5 samples \(n = 5\)
- Since there is an odd number of samples in the lower half of the dataset, the formula for calculating Q1 can be written as: \begin{eqnarray} Q1 &=& \mathrm{Value \ at \ position} \left(\frac{n+1}{2}\right) \end{eqnarray} The Q1 is equal to: \begin{eqnarray} Q1 &=& \mathrm{Value \ at \ position} \left(\frac{5+1}{2}\right)\\ Q1 &=& \mathrm{Value \ at \ position} \ 3 = 36 \end{eqnarray}
Third Quartile (Q3):
- The upper half of the dataset (median excluded) is {42,43,47,49,50}
- The upper half of the dataset has 5 samples \(n = 5\)
- Since there is an odd number of samples in the upper half of the dataset, Q3 is equal to: \begin{eqnarray} Q3 &=& \mathrm{Value \ at \ position} \left(\frac{5+1}{2}\right)\\ Q3 &=& \mathrm{Value \ at \ position} \ 3 = 47 \end{eqnarray}
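
The hand calculation for x1 can be checked with a short Python sketch. The helper names are chosen for this illustration, and the quartiles follow the median-of-halves rule from the sections above.

```python
def median_sorted(data):
    """Median of an already sorted, non-empty list."""
    n = len(data)
    if n % 2 == 1:
        return data[n // 2]
    return (data[n // 2 - 1] + data[n // 2]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    q1 = median_sorted(data[: n // 2])        # lower half, median excluded if n is odd
    q3 = median_sorted(data[(n + 1) // 2:])   # upper half, median excluded if n is odd
    return data[0], q1, median_sorted(data), q3, data[-1]


x1 = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 50]
print(five_number_summary(x1))  # (7, 36, 41, 47, 50)
```

With Q1 = 36 and Q3 = 47 the IQR is 11, so any value below 36 - 1.5*11 = 19.5 would be drawn as an outlier. In x1 this applies to 7 and 15.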

The graphical representation of the boxplot is shown in the graph below.

Example 2 - Boxplot of the array

Calculate the elements of the boxplot for the variable array: \begin{eqnarray} x2 &=& [0.22282631981886203, 0.33327499339415323, 0.6035745101276722, 0.11984346131101398,\\ && 0.09702272572745141, \\ &&0.43879013796137145, 0.42263257722406067, 0.9100302404596698, 0.9949987505067759,\\ && 0.34989456081026016, 0.7210719733028825, 0.938852437455908, 0.5210582975500546,\\ && 0.9069480435304358, 0.5297307869586085, 0.1305760512414731, 0.49945149998051996,\\ && 0.6535912723639796, 0.04642877887705432, 0.44500933465218173, 0.07401745439724872,\\ && 0.8665330106373801, 0.2544749444404535, 0.7162709439537049, 0.6291732003042578,\\ && 0.6425323123985618, 0.7805864525598794, 0.37618816080352957, 0.2532760981989497,\\ && 0.0005817565786245815, 0.4336000904165307, 0.9518023266267381, 0.3676744414089834,\\ && 0.060691789839560695, 0.29470242322554596, 0.3046535295525603,\\ && 0.08014071827393332, 0.4116261229385494, 0.22285393108998708, 0.16236132103229384,\\ && 0.9261693942918452, 0.6143896997694736, 0.4853980597454164, 0.3698979316877262,\\ && 0.05146100365925799, 0.7999062088122444, 0.2527144557601462, 0.5987730125165471,\\ && 0.21363412555365469, 0.5543859163697267] \end{eqnarray} The first step is to sort the dataset from smallest to largest value. 
The sorted variable array can be written as: \begin{eqnarray} x2 &=& [0.0005817565786245815, 0.04642877887705432, 0.05146100365925799, 0.060691789839560695,\\ && 0.07401745439724872, 0.08014071827393332, 0.09702272572745141, 0.11984346131101398,\\ && 0.1305760512414731, 0.16236132103229384, 0.21363412555365469, 0.22282631981886203,\\ && 0.22285393108998708, 0.2527144557601462, 0.2532760981989497, 0.2544749444404535,\\ && 0.29470242322554596, 0.3046535295525603, 0.33327499339415323, 0.34989456081026016,\\ && 0.3676744414089834, 0.3698979316877262, 0.37618816080352957, 0.4116261229385494,\\ && 0.42263257722406067, 0.4336000904165307, 0.43879013796137145, 0.44500933465218173,\\ && 0.4853980597454164, 0.49945149998051996, 0.5210582975500546, 0.5297307869586085, \\ && 0.5543859163697267, 0.5987730125165471, 0.6035745101276722, 0.6143896997694736, \\ && 0.6291732003042578, 0.6425323123985618, 0.6535912723639796, 0.7162709439537049,\\ && 0.7210719733028825, 0.7805864525598794, 0.7999062088122444, 0.8665330106373801,\\ && 0.9069480435304358, 0.9100302404596698, 0.9261693942918452, 0.938852437455908,\\ && 0.9518023266267381, 0.9949987505067759] \end{eqnarray} Now we can determine the minimum and maximum values.

The minimum value is equal to 0.0005817565786245815
The maximum value is equal to 0.9949987505067759

The array is now sorted from minimum to maximum value, so the next step is to determine the number of elements, because this number decides which formulas for the median, Q1, and Q3 apply. The total number of elements in the array is 50 (\(n = 50\)), which is even.

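Median \begin{eqnarray} \mathrm{Median} &=& \frac{\mathrm{Value \ at \ position} \left(\frac{50}{2}\right) + \mathrm{Value \ at \ position} \left(\frac{50}{2} + 1\right)}{2}\\ \mathrm{Median} &=& \frac{\mathrm{Value \ at \ position} \ 25+ \mathrm{Value \ at \ position} \ 26}{2}\\ \mathrm{Median} &=& \frac{0.42263257722406067+0.4336000904165307}{2}\\ \mathrm{Median} &\approx& 0.4281163338 \end{eqnarray}
Quartile Q1 - the lower half of the sorted array consists of positions 1 to 25, i.e. 25 values (odd), so Q1 is the middle value of the lower half: \begin{eqnarray} Q1 &=& \mathrm{Value \ at \ position} \left(\frac{25+1}{2}\right) \mathrm{\ of \ the \ lower \ half}\\ Q1 &=& \mathrm{Value \ at \ position} \ 13 = 0.22285393108998708 \end{eqnarray}
Quartile Q3 - the upper half of the sorted array consists of positions 26 to 50, i.e. 25 values (odd), so Q3 is the middle value of the upper half, which is position 38 of the full sorted array: \begin{eqnarray} Q3 &=& \mathrm{Value \ at \ position} \left(\frac{25+1}{2}\right) \mathrm{\ of \ the \ upper \ half}\\ Q3 &=& \mathrm{Value \ at \ position} \ 38 \mathrm{\ of \ the \ sorted \ array} = 0.6425323123985618 \end{eqnarray}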
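
For an array of this size it is easy to misread a position by hand, so the elements can also be computed with a short Python sketch. It applies the formulas from the Median and quartile sections with positions counted from 1 (index = position - 1). The helper name median_sorted is chosen for this illustration.

```python
def median_sorted(data):
    """Median of an already sorted, non-empty list."""
    n = len(data)
    if n % 2 == 1:
        return data[n // 2]
    return (data[n // 2 - 1] + data[n // 2]) / 2


x2 = [
    0.22282631981886203, 0.33327499339415323, 0.6035745101276722, 0.11984346131101398,
    0.09702272572745141,
    0.43879013796137145, 0.42263257722406067, 0.9100302404596698, 0.9949987505067759,
    0.34989456081026016, 0.7210719733028825, 0.938852437455908, 0.5210582975500546,
    0.9069480435304358, 0.5297307869586085, 0.1305760512414731, 0.49945149998051996,
    0.6535912723639796, 0.04642877887705432, 0.44500933465218173, 0.07401745439724872,
    0.8665330106373801, 0.2544749444404535, 0.7162709439537049, 0.6291732003042578,
    0.6425323123985618, 0.7805864525598794, 0.37618816080352957, 0.2532760981989497,
    0.0005817565786245815, 0.4336000904165307, 0.9518023266267381, 0.3676744414089834,
    0.060691789839560695, 0.29470242322554596, 0.3046535295525603,
    0.08014071827393332, 0.4116261229385494, 0.22285393108998708, 0.16236132103229384,
    0.9261693942918452, 0.6143896997694736, 0.4853980597454164, 0.3698979316877262,
    0.05146100365925799, 0.7999062088122444, 0.2527144557601462, 0.5987730125165471,
    0.21363412555365469, 0.5543859163697267,
]

data = sorted(x2)
n = len(data)

minimum, maximum = data[0], data[-1]
median = median_sorted(data)
q1 = median_sorted(data[: n // 2])        # positions 1 to 25
q3 = median_sorted(data[(n + 1) // 2:])   # positions 26 to 50

print("n       =", n)
print("minimum =", minimum)
print("Q1      =", q1)
print("median  =", median)
print("Q3      =", q3)
print("maximum =", maximum)
```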

How to interpret boxplots

After the boxplots are generated, it is important to know how to interpret them. The key aspects of interpretation are the central tendency, spread, skewness, outliers, and comparisons between multiple boxplots.
Central tendency - the median line inside the box provides information about the dataset's central value.
Spread - the length of the box (IQR) indicates the spread or variability of the middle 50% of the data. A longer box means greater spread.
Skewness - if the median line is closer to the top or the bottom of the box, this indicates skewness. Whiskers of uneven length also indicate skewness in the data distribution.
Outliers - the samples (points) located outside the whiskers are considered outliers. Their presence and distribution can indicate anomalies or variability in the dataset.

The advantages and limitations of the boxplots

The advantages of boxplots are:

They provide a clear summary of the data distribution.
They are effective for comparing distributions, which is especially useful when comparing multiple groups.
They help to identify unusual data points (outliers).

The limitations of the boxplots are:

Boxplots do not show the exact shape or density of the data distribution; for that, check histograms or violin plots.
They are less intuitive for some audiences and can be harder to interpret without additional context.

AnalyzeMyData

Thursday, 8 August 2024

Boxplot Generator
