In this article, we will try to understand the concept behind box plots. When i first saw a box plot, I was utterly confused and could not extract much information out of it on the first go. This article will help you to avoid the situation I faced in understanding a box plot.
Introduction to box plots
A Box and Whisker Plot (or Box Plot) is a convenient way of visually displaying the data distribution through their quartiles. It is a graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum. The term “box plot” comes from the fact that the graph looks like a rectangle with lines extending from the top and bottom. Because of the extending lines, this type of graph is sometimes called a box-and-whisker plot.
Let us understand these 5 components of the box plot
- Median Value
Value or quantity that falls halfway between a set of values arranged in an ascending or descending order. When the set contains an odd number of values, the median value is exactly in middle. If the number of values is even, the median is computed by averaging the two numbers closest to the middle.
- Lower Quartile(Q1)
The lower quartile is also known as the first quartile, splits the lower 25% of the data. Quartiles are three points that divide the data set into four equal groups. Each group represents the one-fourth of the data set. The lower quartile is the middle value of the lower half.
- Upper Quartile(Q3)
Upper quartile is also known as the third quartile. It splits lowest 75% (or highest 25%) of data. It can be also seen as the middle value of the upper half.
- Interquartile Range(Q3-Q1)
The Interquartile range is from Q1 to Q3. It is the difference between the lower quartile and upper quartile. The IQR is often seen as a better measure of a spread than the range (highest value-lowest value) as it is not affected by outliers.
- Highest Value
This point in the box plot represents the highest value in the data distribution over which the box plot is built which is not an outlier. This point does not correspond to the highest value in your dataset. Suppose you have some data like 65,76,87,100,105,100000. Here the largest value is 100000 but it is most likely to be an outlier and hence the box plot will not mark this as the maximum value. The most feasible option will be 105 as the maximum value of the box plot.
- Lowest Value
This point in the box plot represents the lowest value in the data distribution over which the box plot is built and is not an outlier (smallest value in the Interquartile range of the distribution). This point does not correspond to the smallest value in your dataset. Suppose you have some data like 0.005,65,76,87,100,105. Here the smallest value is 0.005 but it is most likely to be an outlier and hence the box plot will not mark this as the minimum value. The most feasible option will be 65 as the minimum value of the box plot.
Why box plots?
- Handles Large Data Easily
Due to the five-number data summary, a box plot can handle and present a summary of a large amount of data. A box plot consists of the median, which is the midpoint of the range of data; the upper and lower quartiles, which represent the numbers above and below the highest and lower quarters of the data and the minimum and maximum data values. Organizing data in a box plot by using five key concepts is an efficient way of dealing with large data too unmanageable for other graphs, such as line plots or stem and leaf plots.
- Exact Values Not Retained
The box plot does not keep the exact values and details of the distribution results, which is an issue with handling such large amounts of data in this graph type. A box plot shows only a simple summary of the distribution of results so that you can quickly view it and compare it with other data. Use a box plot in combination with another statistical graph method, like a histogram, for a more thorough, more detailed analysis of the data.
- A clear summary
A box plot is a highly visually effective way of viewing a clear summary of one or more sets of data. It is particularly useful for quickly summarizing and comparing different sets of results from different experiments. At a glance, a box plot allows a graphical display of the distribution of results and provides indications of symmetry within the data.
- Displays outliers
A box plot is one of very few statistical graph methods that show outliers. There might be one outlier or multiple outliers within a set of data, which occurs both below and above the minimum and maximum data values. By extending the lesser and greater data values to a max of 1.5 times the inter-quartile range, the box plot delivers outliers or obscure results. Any results of data that fall outside of the minimum and maximum values known as outliers are easy to determine on a box plot graph.
Understanding different box plots
We have data on different house prices in 5 different areas of Bangalore. We will try to understand the distribution of this data and try to find some insights out of it.
The Box plot as an Indicator of Centrality
We will try to gather our first insight by observing the centrality of the box plots. Centerline represents the median value for the house price in different areas. Houses on airport road have the highest median value of the house which makes it a comparatively expensive place to live in whereas houses in Marathali have the least median value which allows us to conclude that houses here are relatively cheapest to live.
The Box plot as an indicator of the spread
The spread of a box plot talks about the variance present in the data. More the spread, more the variance. If you look closely at the first two box plots, both Whitefield and Hoskote areas have the same median house price value so it seems like both places fall into the same budget category. But if we look more closely, we can observe that width of Hoskote box plot is more than Whitefield box plot. Hoskote area has more variance in house price as compared to Whitefield i.e. Hoskote offers more variety of budget in houses as compared to Whitefield. If we look at the overall graph, we find that Bellathur area has the most spread in its box plot. This clearly states that this area has the widest variety in the budget of the houses.
The Box plot as an indicator of symmetry
Symmetry around the median talks about skewness present in the data. If the median line is towards the lower half of the box plot, then it is right skewed (positive skew) and if the median line is towards the upper portion of the box plot then it is left-skewed (negative skew). If we look at the box plot representing Marathalli, we can observe that median is towards the lower half of the box plot and hence it is right skewed (positive skew) which means that most of the houses are on the cheaper side in Marathalli and only a few are expensive.
The Box plot as an indicator of tail length
Tail length talks about the kurtosis present in data. There are three cases here. Either your data will be normally distributed or it will have more data in its tail as compared to a normal distribution(platykurtic) or it will have fewer data in tails as compared to a normal distribution(leptokuritc). A long tail shows that the distribution is platykurtic and shorter tail gives the idea of distribution being leptokurtic. In above example, Marathalli has the shortest tail as compared to other box plots which may mean that in Marathalli most of the house prices lie in the interquartile range (q3-q1).
Types of box plots
Variable width box plots
Box plot represents a numeric vector of data that is split in several groups. When the number of points in each group is highly different, it can be great to represent it using the width of the box. The widths of the box plot indicate the size of the samples. The wider the box, the larger the sample. This is usually an option in statistical software programs, not all Box Plots have the widths proportional to the sample size. One common convention is to make the width of the boxes for a group of data proportional to the square roots of the number of observations in a given sample.
Notched box plots
It works the same as a standard Box Plot, but has a narrowing of the box around the median value. This acts as a handy visual guide to help read and compare the differences between the median values across each data series. Notches visually illustrate an estimate on whether there is a significant difference of medians. The width of the notches is proportional to the inter quartile range of the sample.
Complications in box plots
- Box plots generally do not go well when the sample size of distribution is small.
- One case of particular concern — where a box plot can be deceptive — is when the data are distributed into “two lumps” rather than the “one lump” cases we’ve considered so far. A “bee swarm” plot shows that in this dataset there are lots of data near 10 and 15 but relatively few in between. See that a box plot would not give you any evidence of this.