Python is a popular programming language for data visualization, largely because of its matplotlib library, which provides a wide range of built-in tools for presenting data visually.
This tutorial will teach you how to create boxplots, scatterplots, and histograms in Python using matplotlib.
About the author
Nick McCullum is a Python and JavaScript developer from New Brunswick, Canada. Nick teaches Python, SQL, and JavaScript courses on his website.
In this article, we will be working with a data set from the UCI Machine Learning Repository, a free, open-source catalog of data sets that you can use to practice your machine learning or data visualization skills.
Specifically, we will be using the Iris data set, which records flower measurements and is commonly used for predicting species. Introduced by Ronald Fisher in 1936, the Iris data set is one of the best-known data sets in data science education.
We'll be importing this dataset into a pandas DataFrame in this tutorial. Because of this, we'll start by importing pandas under the alias pd like this:
import pandas as pd
Once this is done, you can import the data required for this tutorial with the following statement:
data = pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/python-visualization/master/iris/iris.json')
This imports a nicely-formatted version of the Iris data set into our program from a GitHub file that I have uploaded for the public to use.
The Iris data set is a collection of observations from flowers with five features:
sepalLength
sepalWidth
petalLength
petalWidth
species
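Before plotting, it can help to confirm which of these columns are numerical. The sketch below is only an illustration: it builds a tiny DataFrame by hand with the same five column names instead of downloading the JSON file, so the values and the variable names `sample` and `numeric_columns` are my own assumptions, not part of the real data import.

```python
import pandas as pd

# Hand-typed illustrative rows (NOT loaded from the tutorial's JSON file).
# Only the column names are taken from the tutorial.
sample = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

print(sample.head())   # the first few observations
print(sample.dtypes)   # four numerical columns and one text column

# Keep only the names of the numerical columns.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

The same calls should work unchanged on the full `data` DataFrame once it has been imported.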
With our data import out of the way, let's move on to learning how to create boxplots in Python using matplotlib!
The first type of chart that we will create is a boxplot. Before doing this, we need to import the matplotlib data visualization library. Specifically, we will be importing the pyplot interface from matplotlib under the alias plt.
Here's the code to do this:
import matplotlib.pyplot as plt
Now that this is done, we need to make a slight change to our data.
Boxplots can only be drawn from numerical data, and the species column is categorical, not numerical. To fix this, we'll create a new variable called boxplot_data that excludes the species column:
boxplot_data = data.drop('species', axis=1)
Next, create the boxplot by passing boxplot_data into the plt.boxplot method. After creating the boxplot, you can use the plt.show() command to open it in a new window.
That boxplot is not very informative, and is generally not what we'd expect.
Why is this?
It's because the boxplot function created a separate boxplot for each row, instead of a separate boxplot for each column (as desired).
Fortunately, the solution for this problem is quite simple. We just need to call the transpose method on boxplot_data within the boxplot function, like this: plt.boxplot(boxplot_data.transpose())
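As far as I can tell, whether matplotlib treats the rows or the columns of a DataFrame as separate groups has differed between releases, so the transpose may or may not be needed on your installation. A sketch that avoids the question is to pass one array per column explicitly. This is a swapped-in approach rather than the transpose call described above, and it runs on hand-typed sample rows (an assumption for illustration). The names `sample`, `columns`, `per_column`, and `result` are mine. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hand-typed illustrative rows, not the real Iris download.
sample = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})
boxplot_data = sample.drop("species", axis=1)

# Build one array per column so that each feature gets its own box,
# regardless of how the installed matplotlib orients a DataFrame.
columns = list(boxplot_data.columns)
per_column = [boxplot_data[name].to_numpy() for name in columns]

plt.figure()
result = plt.boxplot(per_column)

# Boxes are drawn at x positions 1, 2, 3, ... so label them by feature name.
plt.xticks(range(1, len(columns) + 1), columns)
```

If the number of boxes you see matches the number of rows instead of the number of features, the orientation is the likely cause.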
Scatterplots can be created in matplotlib using the plt.scatter method.
Like boxplots, scatterplots can only be created using numerical data. We will not need to create a separate dataset in this case because the plt.scatter method requires us to name specific columns for both the x and y axes. Because of this, we can work directly with our original data DataFrame and pass in particular column names in square brackets.
As an example, here is how you would plot sepalLength on the x axis and sepalWidth on the y axis using the plt.scatter method: plt.scatter(x=data['sepalLength'], y=data['sepalWidth'])
In this particular case, you do not need to actually include the x= and y= specifications within the method's parameters. Passing the two columns positionally generates an identical scatterplot: plt.scatter(data['sepalLength'], data['sepalWidth'])
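To check that the keyword and positional forms really are interchangeable, here is a small sketch that draws both and compares the plotted points. It uses hand-typed sample values instead of the real data set, and the names `sample`, `with_keywords`, and `positional` are assumptions of mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hand-typed illustrative values, not the real Iris download.
sample = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# Form 1: name the axes explicitly with x= and y=.
plt.figure()
with_keywords = plt.scatter(x=sample["sepalLength"], y=sample["sepalWidth"])

# Form 2: pass the same columns positionally (x first, then y).
plt.figure()
positional = plt.scatter(sample["sepalLength"], sample["sepalWidth"])

# Each call returns a collection whose offsets are the (x, y) points drawn.
same_points = (with_keywords.get_offsets().tolist()
               == positional.get_offsets().tolist())
print(same_points)
```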
Histograms are bar charts that show the frequency of observations across a data distribution. We can create histograms in Python using matplotlib with the plt.hist method.
As an example, let's see what the distribution looks like within the petalLength feature of the Iris data set:
plt.hist(data['petalLength'])
As you can see, there seems to be a high degree of concentration for petalLength values around 1 and 5.
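If you would rather read the concentration off as numbers than from the chart, plt.hist also returns the bin counts and bin edges. The sketch below runs on ten hand-typed petal lengths (illustrative values, not the real column) and relies on matplotlib's standard default of 10 bins. The variable names are my own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Ten hand-typed illustrative petal lengths, not the real Iris column.
sample = pd.DataFrame({
    "petalLength": [1.3, 1.4, 1.5, 1.4, 1.7, 4.5, 4.7, 5.1, 5.0, 5.9],
})

plt.figure()
# plt.hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, patches = plt.hist(sample["petalLength"])

# Print each bin's range next to the number of observations in it.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {int(count)}")
```

Passing a bins argument, for example plt.hist(sample["petalLength"], bins=5), changes how finely the values are grouped.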
You can also create histograms that plot multiple features at once. These different features will be identified by different colors within the histogram.
As an example, here's how we would plot every feature from the Iris data set (excluding species, since it is non-numerical) in a histogram:
plt.hist(data.drop('species', axis=1).transpose())
This is a nice visualization, but it is pretty useless without knowing which color represents each feature.
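One way to show which color belongs to which feature is to pass a label for each data set and then call plt.legend. The sketch below is an assumption-laden illustration rather than the exact call above: it uses hand-typed sample rows, and it passes one array per feature explicitly so that the grouping does not depend on how the installed matplotlib orients a DataFrame. The names `features`, `names`, and `legend_labels` are mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hand-typed illustrative rows, not the real Iris download.
sample = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})
features = sample.drop("species", axis=1)
names = list(features.columns)

plt.figure()
# One array per feature; label= attaches a name to each colored series.
plt.hist([features[name].to_numpy() for name in names], label=names)

# The legend maps each color back to its feature name.
legend = plt.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```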
So far in this tutorial, we have learned how to create basic boxplots, scatterplots, and histograms in Python using matplotlib.
We have not discussed how to make these charts visually appealing.
Accordingly, I want to conclude this article by discussing some general guidelines for styling your matplotlib visualizations.
Here's the plot we will be using as an example in this last section of this tutorial:
plt.hist(data['sepalWidth'])
How to add titles to matplotlib visualizations
You can add a title to a matplotlib visualization with the plt.title method.
As an example, here's how you would add the title A Histogram of Sepal Widths from the Iris Data Set to our sample histogram:
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
How to label the x-axis in matplotlib visualizations
You can label the x-axis of a matplotlib visualization with the plt.xlabel method.
As an example, here's how we could give our x-axis the label Sepal Width:
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
How to label the y-axis in matplotlib visualizations
Just like with the x-axis, we can label the y-axis of a matplotlib visualization with the plt.ylabel method.
Here's how we would add the label Frequency to the y-axis of our histogram:
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
plt.ylabel('Frequency')
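The plt.title, plt.xlabel, and plt.ylabel calls all act on whichever chart is currently active. As an alternative sketch (not the approach used in this tutorial), matplotlib's object-oriented interface sets the same text through methods on an explicit Axes object, which also makes it easy to read the values back. The sample values and the names `sample`, `fig`, and `ax` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hand-typed illustrative sepal widths, not the real Iris column.
sample = pd.DataFrame({"sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7]})

# plt.subplots returns a Figure and a single Axes to draw on.
fig, ax = plt.subplots()
ax.hist(sample["sepalWidth"])

# Object-oriented equivalents of plt.title, plt.xlabel, and plt.ylabel.
ax.set_title("A Histogram of Sepal Widths from the Iris Data Set")
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Frequency")

# The text can be read back from the same Axes object.
print(ax.get_title())
print(ax.get_xlabel(), "/", ax.get_ylabel())
```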
How to change the size of matplotlib visualizations
The last styling technique that we will explore is resizing matplotlib visualizations.
The height and width of a matplotlib canvas can be changed by passing a figsize tuple into the plt.figure method. matplotlib's built-in default is (6.4, 4.8), which means that the figure is 6.4 inches wide and 4.8 inches tall, although some environments override this default.
Here's how you could increase the size of the figure to 10 inches wide and 8 inches tall:
plt.figure(figsize=(10, 8))
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
plt.ylabel('Frequency')
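To confirm that a resize took effect, you can ask the figure for its size in inches. The sketch below uses hand-typed sample values, and the names `sample`, `fig`, `width`, `height`, and `default_size` are assumptions of mine. The default it prints comes from your own matplotlib configuration, so it may differ from one machine to another.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hand-typed illustrative sepal widths, not the real Iris column.
sample = pd.DataFrame({"sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7]})

# The size that plt.figure() would use if figsize were not given.
default_size = list(plt.rcParams["figure.figsize"])
print("default:", default_size)

# Create the figure first so that the histogram is drawn onto it.
fig = plt.figure(figsize=(10, 8))
plt.hist(sample["sepalWidth"])

# get_size_inches returns the width and height in inches.
width, height = fig.get_size_inches()
print("resized:", width, height)
```

Note that plt.figure always starts a new, empty figure, so it should be called before the plotting commands rather than after them.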
In this tutorial, you learned how to create boxplots, scatterplots, and histograms in Python using matplotlib. You also learned the basics of how to add titles and axis labels to matplotlib plots and how to change their size.
Data visualization is a highly in-demand field for Python developers. This tutorial should set you on a path towards becoming an experienced practitioner of the matplotlib library.