On this page you can generate boxplots for all dataset variables together and for each variable in the dataset individually. The dataset must be in .csv format. Click the "Choose file" button and navigate to the dataset stored on your local computer. After the dataset is uploaded, the web application generates the boxplot graphs within a couple of seconds (larger datasets may take a couple of minutes, so please be patient). Boxplots are good for finding outliers in the dataset variables. A full description of boxplots and outliers is given below.
Boxplot Analysis
Boxplots, a.k.a. box-and-whisker plots, are a tool used to display the distribution of a dataset. This type of plot helps to visualize the central tendency, variability, and presence of outliers in the data.
- central tendency - a summary measure that describes the entire array with a single value representing the middle or centre of its distribution.
- variability - refers to the spread or dispersion of data points in the dataset. It shows how much the data values differ from the central tendency. Boxplots display variability through the width of the box (interquartile range) and the extent of the whiskers.
- outliers - An outlier is a data point that significantly deviates from the other values in the dataset. For instance, if you measure room temperatures daily for a year and most readings are around 20°C, but some readings are as high as 60°C or more, these unusually high values are considered outliers. They are far from the typical range of temperatures and stand out because they are much higher than the majority of the data.
The boxplot components
The components of the boxplot are the box, the whiskers and the outliers. The box consists of the interquartile range (IQR) and the median. The whiskers consist of the upper whisker and the lower whisker. The outliers are drawn as individual points beyond the whiskers.
- Interquartile range - the box represents the middle 50% of the data, between the first quartile (Q1) and the third quartile (Q3). This range is known as IQR, which measures the spread of the middle 50% of the data.
- Median - inside the box (IQR), a line indicates the median (the middle value of the dataset). This line divides the box into two parts, showing the central tendency.
- Upper Whisker - the area that extends from Q3 to the highest value within 1.5*IQR above Q3.
- Lower Whisker - the area that extends from Q1 to the lowest value within 1.5*IQR below Q1.
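The 1.5*IQR rule above can be sketched in a few lines of Python. This is only an illustration, not the code used by the web application: the function name `whiskers_and_outliers` and the temperature readings are made up for the example, and Q1 and Q3 are assumed to be already known.

```python
def whiskers_and_outliers(values, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers) using the 1.5*IQR rule."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Points inside the limits decide where the whiskers end.
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    # Points beyond the limits are drawn individually as outliers.
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return min(inside), max(inside), outliers


# Hypothetical room temperatures in °C, with Q1 = 20 and Q3 = 22 assumed known.
temperatures = [19, 20, 20, 21, 21, 22, 60]
result = whiskers_and_outliers(temperatures, q1=20, q3=22)
print(result)  # (19, 22, [60])
```

Here the IQR is 2, so the limits are 17 and 25; the reading of 60 falls outside them and is reported as an outlier, while the whiskers end at 19 and 22.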
How to calculate the elements of the boxplot
The elements you need to draw the boxplot of a dataset variable are: minimum, Q1 (first quartile), median, Q3 (third quartile) and maximum. The minimum and maximum are simple to obtain: in Python, the built-in functions min and max return these values for a variable represented as a list (array). To calculate the remaining boxplot elements, the variable array must be ordered from the minimum to the maximum value.
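As a minimal sketch of this step, the built-in functions min, max and sorted can be applied to a plain Python list. The list `values` is a made-up example, not data from the application.

```python
# Hypothetical variable array (illustrative values only).
values = [40, 7, 50, 15, 36]

smallest = min(values)    # minimum value of the variable
largest = max(values)     # maximum value of the variable
ordered = sorted(values)  # new list, ordered from minimum to maximum

print(smallest, largest, ordered)  # 7 50 [7, 15, 36, 40, 50]
```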
Median
The median is the middle value of a dataset when it is ordered in ascending or descending order. If the dataset has an odd number of samples, the median is the middle number. If the dataset has an even number of samples, the median is the average of the two middle numbers.
The initial steps for calculating median are:
- Order the dataset from the smallest to largest.
- Count the number of samples \(n\)
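The median rule described above (middle value for odd \(n\), average of the two middle values for even \(n\)) could be written in Python roughly as follows. The function name `median_of` is chosen for this sketch only.

```python
def median_of(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)  # order from smallest to largest
    n = len(ordered)          # count the number of samples
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the middle value (position (n + 1) / 2 when counting from 1).
        return ordered[mid]
    # Even n: average of the values at positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([3, 1, 2]))     # 2
print(median_of([4, 1, 3, 2]))  # 2.5
```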
The first quartile (Q1)
The first quartile is the median of the lower half of the dataset. The lower half excludes the median if the number of observations is odd.
- Order the dataset from smallest to largest.
- Find the median of the dataset (this splits the data into two halves).
- For an odd number of observations: exclude the median, then calculate the median of the lower half of the dataset.
- For an even number of observations: include all observations in the lower half, then calculate the median of the lower half.
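The procedure for Q1 can be sketched with the standard-library function `statistics.median`. The function name `first_quartile` is an assumption of this example; note that other tools may use different quartile conventions and return slightly different values.

```python
from statistics import median


def first_quartile(values):
    """Q1 as the median of the lower half (needs at least two values)."""
    ordered = sorted(values)
    n = len(ordered)
    # n // 2 samples lie below the split point: for odd n this leaves the
    # median out, for even n it keeps every observation of the lower half.
    lower_half = ordered[: n // 2]
    return median(lower_half)


print(first_quartile([1, 2, 3, 4, 5]))     # 1.5 (lower half is [1, 2])
print(first_quartile([1, 2, 3, 4, 5, 6]))  # 2 (lower half is [1, 2, 3])
```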
Third quartile (Q3)
The third quartile is the median of the upper half of the dataset. The upper half excludes the median if the number of observations is odd.
- Order the dataset from smallest to largest.
- Find the median of the dataset (this splits the data into two halves).
- For an odd number of observations: exclude the median, then calculate the median of the upper half of the dataset.
- For an even number of observations: include all observations in the upper half, then calculate the median of the upper half.
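The same idea applies to Q3, only with the upper half. The sketch below again relies on `statistics.median`; the function name `third_quartile` is hypothetical.

```python
from statistics import median


def third_quartile(values):
    """Q3 as the median of the upper half (needs at least two values)."""
    ordered = sorted(values)
    n = len(ordered)
    # (n + 1) // 2 skips the lower half and, for odd n, the median itself.
    upper_half = ordered[(n + 1) // 2:]
    return median(upper_half)


print(third_quartile([1, 2, 3, 4, 5]))     # 4.5 (upper half is [4, 5])
print(third_quartile([1, 2, 3, 4, 5, 6]))  # 5 (upper half is [4, 5, 6])
```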
Example 1 - boxplot of the variable array
Let's say that your dataset contains a variable array which can be written as:
\begin{equation}
x1 = \{7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 50\}
\end{equation}
- Median:
- Determine the number of samples in the array. There is a total of 11 samples in the array, so \(n = 11\).
- Since there is an odd number of samples in the array, the formula for calculating the median can be written as: \begin{eqnarray} \mathrm{Median} &=& \mathrm{Value \ at \ position} \left(\frac{n+1}{2}\right)\\ \mathrm{Median} &=& \mathrm{Value \ at \ position} \ 6 = 41 \end{eqnarray}
- First Quartile (Q1)
- Lower half of the dataset is {7,15,36,39,40}
- The lower half of the dataset has 5 samples \(n = 5\)
- Since there is an odd number of samples in the lower half of the dataset the formula for calculating Q1 can be written as: \begin{eqnarray} Q1 &=& \mathrm{Value \ at \ position} \left(\frac{n+1}{2}\right) \end{eqnarray} The Q1 is equal to: \begin{eqnarray} Q1 &=& \mathrm{Value \ at \ position} \left(\frac{5+1}{2}\right)\\ Q1 &=& \mathrm{Value \ at \ position} 3 = 36 \end{eqnarray}
- Third Quartile (Q3):
- Upper half of the dataset is {42, 43, 47,49, 50}
- The upper half of the dataset has a total of 5 samples, so \(n=5\). Since there is an odd number of samples in the upper half of the dataset, the formula for calculating Q3 can be written as: \begin{eqnarray} Q3 &=& \mathrm{Value \ at \ position} \left(\frac{n+1}{2}\right) \end{eqnarray} The Q3 is equal to: \begin{eqnarray} Q3 &=& \mathrm{Value \ at \ position} \left(\frac{5+1}{2}\right)\\ Q3 &=& \mathrm{Value \ at \ position} \ 3 = 47 \end{eqnarray}
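The hand calculation for Example 1 can be cross-checked with Python's standard library. This is a sketch under one assumption: `statistics.quantiles` with `method="exclusive"` happens to land exactly on positions 3, 6 and 9 for this array, so it reproduces the values above; for other arrays it may interpolate and differ from the median-of-halves procedure. The block also applies the 1.5*IQR rule to the same array.

```python
import statistics

x1 = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 50]

# Quartiles of x1; agrees with the hand calculation for this particular array.
q1, med, q3 = statistics.quantiles(x1, n=4, method="exclusive")

# Whisker limits from the 1.5*IQR rule.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in x1 if v < lower_limit or v > upper_limit]

print(min(x1), q1, med, q3, max(x1))  # 7 36.0 41.0 47.0 50
print(iqr, lower_limit, upper_limit)  # 11.0 19.5 63.5
print(outliers)                       # [7, 15]
```

Under this rule the values 7 and 15 lie below the lower limit of 19.5, so a boxplot of x1 would draw them as outliers and end the lower whisker at 36.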
Example 2 - Boxplot of the array
Calculate the elements of the boxplot for the variable array:
\begin{eqnarray}
x2 &=& [0.22282631981886203, 0.33327499339415323, 0.6035745101276722, 0.11984346131101398,\\ && 0.09702272572745141, \\ &&0.43879013796137145, 0.42263257722406067, 0.9100302404596698, 0.9949987505067759,\\ && 0.34989456081026016, 0.7210719733028825, 0.938852437455908, 0.5210582975500546,\\ && 0.9069480435304358, 0.5297307869586085, 0.1305760512414731, 0.49945149998051996,\\ && 0.6535912723639796, 0.04642877887705432, 0.44500933465218173, 0.07401745439724872,\\ && 0.8665330106373801, 0.2544749444404535, 0.7162709439537049, 0.6291732003042578,\\ && 0.6425323123985618, 0.7805864525598794, 0.37618816080352957, 0.2532760981989497,\\ && 0.0005817565786245815, 0.4336000904165307, 0.9518023266267381, 0.3676744414089834,\\ && 0.060691789839560695, 0.29470242322554596, 0.3046535295525603,\\ && 0.08014071827393332, 0.4116261229385494, 0.22285393108998708, 0.16236132103229384,\\ && 0.9261693942918452, 0.6143896997694736, 0.4853980597454164, 0.3698979316877262,\\ && 0.05146100365925799, 0.7999062088122444, 0.2527144557601462, 0.5987730125165471,\\ && 0.21363412555365469, 0.5543859163697267]
\end{eqnarray}
The first step is to sort the dataset from smallest to largest value. The sorted variable array can be written as:
\begin{eqnarray}
x2 &=& [0.0005817565786245815, 0.04642877887705432, 0.05146100365925799, 0.060691789839560695,\\ && 0.07401745439724872, 0.08014071827393332, 0.09702272572745141, 0.11984346131101398,\\ && 0.1305760512414731, 0.16236132103229384, 0.21363412555365469, 0.22282631981886203,\\ && 0.22285393108998708, 0.2527144557601462, 0.2532760981989497, 0.2544749444404535,\\ && 0.29470242322554596, 0.3046535295525603, 0.33327499339415323, 0.34989456081026016,\\ && 0.3676744414089834, 0.3698979316877262, 0.37618816080352957, 0.4116261229385494,\\ && 0.42263257722406067, 0.4336000904165307, 0.43879013796137145, 0.44500933465218173,\\ && 0.4853980597454164, 0.49945149998051996, 0.5210582975500546, 0.5297307869586085, \\ && 0.5543859163697267, 0.5987730125165471, 0.6035745101276722, 0.6143896997694736, \\ && 0.6291732003042578, 0.6425323123985618, 0.6535912723639796, 0.7162709439537049,\\ && 0.7210719733028825, 0.7805864525598794, 0.7999062088122444, 0.8665330106373801,\\ && 0.9069480435304358, 0.9100302404596698, 0.9261693942918452, 0.938852437455908,\\ && 0.9518023266267381, 0.9949987505067759]
\end{eqnarray}
Now we can determine the minimum and maximum values.
- The minimum value is equal to 0.0005817565786245815
- The maximum value is equal to 0.9949987505067759
- Median: since the number of samples is even (\(n = 50\)), the median is the average of the two middle values: \begin{eqnarray} \mathrm{Median} &=& \frac{\mathrm{Value \ at \ position} \left(\frac{50}{2}\right) + \mathrm{Value \ at \ position} \left(\frac{50}{2} + 1\right)}{2}\\ \mathrm{Median} &=& \frac{\mathrm{Value \ at \ position} \ 25+ \mathrm{Value \ at \ position} \ 26}{2}\\ \mathrm{Median} &=& \frac{0.42263257722406067+0.4336000904165307}{2}\\ \mathrm{Median} &\approx& 0.4281163338 \end{eqnarray}
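- Quartile Q1: the lower half of the sorted array (positions 1 to 25) has 25 samples, so Q1 is its middle value at position 13, which is \(0.25\cdot 50 = 12.5\) rounded up: \begin{eqnarray} Q1 &=& \mathrm{Value \ at \ position} \left\lceil 0.25\cdot 50 \right\rceil = \mathrm{Value \ at \ position} \ 13\\ Q1 &=& 0.22285393108998708 \end{eqnarray}
- Quartile Q3: the upper half of the sorted array (positions 26 to 50) has 25 samples, so Q3 is its middle value at position 38, which is \(0.75\cdot 50 = 37.5\) rounded up: \begin{eqnarray} Q3 &=& \mathrm{Value \ at \ position} \left\lceil 0.75\cdot 50 \right\rceil = \mathrm{Value \ at \ position} \ 38\\ Q3 &=& 0.6425323123985618 \end{eqnarray}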
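With Q1 and Q3 known, the whisker limits for this array follow from the 1.5*IQR rule described in the boxplot components section (values rounded to four decimals):
\begin{eqnarray} IQR &=& Q3 - Q1 \approx 0.4197\\ Q1 - 1.5\cdot IQR &\approx& -0.4067\\ Q3 + 1.5\cdot IQR &\approx& 1.2721 \end{eqnarray}
Both the minimum (about 0.0006) and the maximum (about 0.9950) lie inside these limits, so this array has no outliers and the whiskers extend all the way to the minimum and maximum values.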
How to interpret boxplots
After the boxplots are generated, it is important to know how to interpret them. The key components in the interpretation are the central tendency, spread, skewness, outliers, and comparisons between multiple boxplots.
Central tendency - the median line inside the box provides information about the dataset's central value.
Spread - the length of the box (IQR) indicates the spread or variability of the middle 50% of the data. A longer box means greater spread.
Skewness - if the median line is closer to the top or the bottom of the box, the data are skewed. Uneven whiskers also indicate skewness in the data distribution.
Outliers - the samples (points) located outside the whiskers are considered outliers. Their presence and distribution can indicate anomalies or variability in the dataset.
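The skewness reading can also be expressed numerically by comparing the two parts of the box. The helper below is only an illustrative sketch; the name `box_skew_hint` and the wording of its messages are made up, and the result is a rough hint rather than a formal skewness statistic.

```python
def box_skew_hint(q1, median, q3):
    """Describe where the median sits inside the box."""
    lower_part = median - q1  # distance from Q1 to the median line
    upper_part = q3 - median  # distance from the median line to Q3
    if lower_part < upper_part:
        return "median closer to Q1: suggests right (positive) skew"
    if lower_part > upper_part:
        return "median closer to Q3: suggests left (negative) skew"
    return "median centred: the box shows no skew"


# Hypothetical quartile values, for illustration only.
print(box_skew_hint(10, 12, 20))  # median closer to Q1: suggests right (positive) skew
```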
The advantages and limitations of the boxplots
The advantages of boxplots are:
- They provide a clear summary of the data distribution.
- They are effective for comparing distributions, which is especially useful when comparing multiple groups.
- They help to identify unusual data points (outliers).
The limitations of boxplots are:
- They don't show the exact distribution shape or density of the data; use histograms or violin plots for that.
- They are less intuitive for some audiences and can be less straightforward to interpret without additional context.