In this article, we will work with a data set from the UCI Machine Learning Repository, a free, open-source catalogue of data sets that you can use to practice your machine learning or data visualization skills.
Specifically, we will use the Iris data set, which is intended to be used for predicting flower characteristics and species. Introduced by Ronald Fisher in 1936, the Iris data set is one of the best-known data sets in data science education.
We'll be importing this data set into a pandas DataFrame in this tutorial. Because of this, we'll start by importing pandas under the alias pd, like this:
import pandas as pd
Once this is done, you can import the data required for this tutorial with the following statement:
data = pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/python-visualization/master/iris/iris.json')
This imports a nicely-formatted version of the Iris data set into our program from a GitHub file that I have uploaded for the public to use.
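To get a feel for what pd.read_json produces without downloading anything, here is a minimal sketch that parses a small in-memory JSON string instead of the hosted file. The three records are a hypothetical sample, and the record-per-row layout is my assumption about the file rather than something confirmed here. The column names follow the ones used in this tutorial, with petalWidth assumed by analogy.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real file is assumed (not confirmed)
# to be a JSON list of records shaped like this.
sample_json = """[
  {"sepalLength": 5.1, "sepalWidth": 3.5, "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"},
  {"sepalLength": 7.0, "sepalWidth": 3.2, "petalLength": 4.7, "petalWidth": 1.4, "species": "versicolor"},
  {"sepalLength": 6.3, "sepalWidth": 3.3, "petalLength": 6.0, "petalWidth": 2.5, "species": "virginica"}
]"""

# read_json accepts a file-like object, so StringIO stands in for the URL.
sample = pd.read_json(io.StringIO(sample_json), orient="records")

print(sample.head())   # first rows of the DataFrame
print(sample.dtypes)   # numerical columns vs. the text-valued species column
```

Running the same two print calls on the real data DataFrame is a quick way to confirm that the import worked as expected.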
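The Iris data set is a collection of observations from flowers with five features: four numerical measurements (sepal length, sepal width, petal length, and petal width) and the categorical species column.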
With our data import out of the way, let's move on to learning how to create boxplots in Python using matplotlib!
The first type of chart that we will create is a boxplot. Before doing this, we need to import the matplotlib data visualization library. Specifically, we will be importing the pyplot interface from matplotlib under the alias plt.
Here's the code to do this:
import matplotlib.pyplot as plt
Now that this is done, we need to make a slight change to our data.
Boxplots can only be created from numerical data, and the species column is categorical, not numerical. To fix this, we'll create a new variable called boxplot_data that excludes the species column:
boxplot_data = data.drop('species', axis=1)
After creating the boxplot with plt.boxplot(boxplot_data), you can use the plt.show() command to open it in a new window. It will look like this:
That boxplot is not very informative, and is generally not what we'd expect.
Why is this?
It's because the boxplot function created a separate boxplot for each row, instead of a separate boxplot for each column (as desired). Note that this depends on how your installed matplotlib version converts a DataFrame, so newer releases may already draw one boxplot per column.
If you do see one boxplot per row, the solution is quite simple. We just need to call the transpose method on boxplot_data within the boxplot function:
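The text above describes passing boxplot_data.transpose() into plt.boxplot. Because the way matplotlib interprets a DataFrame passed directly appears to differ between releases, the sketch below takes a version-independent route instead (my substitution, not the original code): it converts the DataFrame to a 2D NumPy array, for which matplotlib documents one box per column. The small DataFrame is a hypothetical stand-in for the downloaded data, and the petalWidth column name is assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the tutorial's `data` DataFrame.
data = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

boxplot_data = data.drop('species', axis=1)

# For a 2D NumPy array, matplotlib draws one box per column.
# (The form described in the text is plt.boxplot(boxplot_data.transpose());
# its result may depend on the matplotlib version.)
boxes = plt.boxplot(boxplot_data.to_numpy())

# Boxes sit at x = 1, 2, 3, ...; label each with its column name.
positions = list(range(1, len(boxplot_data.columns) + 1))
plt.xticks(positions, list(boxplot_data.columns))
```

Call plt.show() afterwards to display the figure. You should see one box for each numerical column.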
Scatterplots can be created in matplotlib using the plt.scatter method.
Like boxplots, scatterplots can only be created using numerical data. We will not need to create a separate data set in this case because the plt.scatter method requires us to name specific columns for both the x and y axes. Because of this, we can work directly with our original data DataFrame and pass in particular column names in square brackets.
As an example, here is how you would plot sepalLength on the x axis and sepalWidth on the y axis using the plt.scatter method:
In this particular case, you do not need to actually include the x= and y= specifications within the method's parameters, because x and y are the first two positional arguments. The following code generates an identical scatterplot:
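Here is a minimal sketch of both forms side by side, once with the x= and y= keywords and once with positional arguments. The small DataFrame is a hypothetical stand-in for the downloaded data, so the points will not match the full Iris scatterplot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the tutorial's `data` DataFrame.
data = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Form 1: name the axes explicitly with keyword arguments.
scatter_keywords = plt.scatter(x=data['sepalLength'], y=data['sepalWidth'])

# Start a fresh figure so the second scatterplot is drawn separately.
plt.figure()

# Form 2: rely on argument order (x first, then y).
scatter_positional = plt.scatter(data['sepalLength'], data['sepalWidth'])
```

Both calls place the same points; call plt.show() to see the two figures.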
Let's move on to learning about how to create histograms in Python using matplotlib.
Histograms are bar charts that show the frequency of observations across a data distribution. We can create histograms in Python using matplotlib with the plt.hist method.
As an example, let's see what the distribution looks like within the petalLength feature of the Iris data set:
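A minimal sketch of that call follows. The handful of petalLength values is a hypothetical stand-in for the downloaded column, so the shape of the histogram will differ from the one the full data set produces.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the tutorial's `data` DataFrame.
data = pd.DataFrame({
    "petalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# plt.hist returns the count per bin, the bin edges, and the drawn bars.
# With no `bins` argument, matplotlib uses 10 equal-width bins.
counts, bin_edges, bars = plt.hist(data['petalLength'])
```

The returned counts are handy when you want to read exact frequencies rather than estimate them from the chart. Call plt.show() to display the histogram.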
As you can see, petalLength values are concentrated around 1 and 5.
You can also create histograms that plot multiple features at once. These different features will be identified by different colors within the histogram.
As an example, here's how we would plot every feature from the Iris data set (excluding species, since it is non-numerical) in a histogram:
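One way to sketch this is shown below. It passes a 2D NumPy array to plt.hist, since matplotlib documents that each column of such an array is treated as a separate data set; passing the DataFrame itself may behave differently depending on your matplotlib version. The small DataFrame is a hypothetical stand-in for the downloaded data, and the petalWidth column name is assumed.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the tutorial's `data` DataFrame.
data = pd.DataFrame({
    "sepalLength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepalWidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petalLength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

features = data.drop('species', axis=1)

# Each column of the 2D array becomes one data set with its own color.
counts, bin_edges, bar_groups = plt.hist(features.to_numpy())
```

Here counts holds one row of bin counts per feature, and bar_groups holds one group of bars per feature, in column order.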
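This is a nice visualization, but it is not very useful without knowing which color represents each feature.
To fix this, let's add a legend:
# Convert to a NumPy array so that each column is read as one data set
# (calling .transpose() on the DataFrame instead may be needed on older matplotlib versions)
plt.hist(data.drop('species', axis=1).to_numpy())
# Label the data sets in column order
plt.legend(data.drop('species', axis=1).columns)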
So far in this tutorial, we have learned how to create basic boxplots, scatterplots, and histograms in Python using Matplotlib.
We have not discussed how to make these charts visually appealing.
Accordingly, I wanted to conclude this article by discussing some general guidelines for styling your matplotlib visualizations.
Here's the plot we will be using as an example in this last section of this tutorial:
How to add titles to matplotlib visualizations
You can add a title to a matplotlib visualization with the plt.title method.
As an example, here's how you would add the title A Histogram of Sepal Widths from the Iris Data Set to our sample histogram:
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
How to label the x-axis in matplotlib visualizations
You can label the x-axis of a matplotlib visualization with the plt.xlabel method.
As an example, here's how we could label our x-axis with the title Sepal Width:
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
How to label the y-axis in matplotlib visualizations
Just like with the x-axis, we can label the y-axis of a matplotlib visualization with the plt.ylabel method.
Here's how we would add the title Frequency to our histogram:
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
plt.ylabel('Frequency')
How to change the size of matplotlib visualizations
The last styling tool that we will explore is how to resize matplotlib visualizations.
The height and width of a matplotlib canvas can be changed by passing a figsize tuple into the plt.figure function. The default tuple in current matplotlib releases is (6.4, 4.8), which means that the figure is 6.4 inches wide and 4.8 inches tall. Some notebook environments override this, for example with (6.0, 4.0).
Here's how you could increase the size of the figure to 10 inches wide and 8 inches tall:
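# Create the 10 x 8 inch figure first, then draw on it
plt.figure(figsize=[10,8])
plt.hist(data['sepalWidth'])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
plt.ylabel('Frequency')
# Calling plt.figure again starts a new, empty figure (here 50 inches wide and 8 inches tall)
plt.figure(figsize=[50,8])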
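If you want to confirm that the new size took effect, one option is to keep the Figure object that plt.figure returns and ask it for its dimensions. The sketch below does this with a few hypothetical sepal width values standing in for data['sepalWidth'].

```python
import matplotlib.pyplot as plt

# The default size that applies when no figsize is given.
default_size = tuple(plt.rcParams['figure.figsize'])

# plt.figure returns the new Figure, so we can inspect it later.
fig = plt.figure(figsize=(10, 8))

# Hypothetical sepal widths standing in for data['sepalWidth'].
plt.hist([3.5, 3.0, 3.2, 3.2, 3.3, 2.7])
plt.title('A Histogram of Sepal Widths from the Iris Data Set')
plt.xlabel('Sepal Width')
plt.ylabel('Frequency')

# Width and height of the figure, in inches.
size_in_inches = tuple(fig.get_size_inches())
print(default_size, size_in_inches)
```

Note that plt.figure needs to be called before the plotting functions. Otherwise the histogram is drawn on a default-sized figure and the resized figure stays empty.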
In this tutorial, you learned how to create boxplots, scatterplots, and histograms in Python using matplotlib. You also learned the basics of adding titles and axis labels to matplotlib plots and changing their size.
Data visualization is a highly in-demand field for Python developers. This tutorial should set you on a path towards becoming an experienced practitioner of the matplotlib library.