Introduction to Python generators

Generators in Python are incredibly powerful yet often hard to understand. In this guide we'll cover generators in depth: how and why to use them, the difference between generator functions and regular functions, and the yield keyword, with examples along the way.

This guide assumes you have a basic knowledge of Python (especially how regular functions work).

Throughout this guide we are going to work towards solving a problem. Suppose we are tasked with writing a function that accepts two parameters - a list of file names and a pattern to match. We need to read from all of the files and return an iterable containing only the lines that match the pattern.

1

Regular functions

Regular functions are straightforward. Execution starts at the first line and continues until it reaches a return statement, raises an exception, or runs off the end of the function, which implies return None.
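As a small illustration of those exit paths, here is a sketch using a hypothetical helper, first_negative, which is not part of the problem we are solving. It either hits an explicit return or runs off the end of the function.

```python
def first_negative(numbers):
    """Return the first negative number in numbers, if there is one."""
    for n in numbers:
        if n < 0:
            return n  # execution stops here; the rest of the loop never runs
    # No return statement was reached, so Python returns None implicitly.

found = first_negative([3, -1, -7])
missing = first_negative([3, 1, 7])
print(found)    # -1
print(missing)  # None
```

Either way, once the call finishes, nothing about it survives: calling first_negative again starts over from the first line.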

We might be tempted to solve our problem using a regular function. It may look something like this:

def find_matches(filenames, pattern):
    matches = []
    for fname in filenames:
        # the with block closes each file once we are done with it
        with open(fname) as f:
            for line in f:
                if pattern in line:
                    matches.append(line)
    return matches

Then we could use our function like this:

files = ['t1.txt', 't2.txt', 't3.txt']
for match in find_matches(files, 'the'):
    print(match)

Our find_matches function is pretty easy to follow. We loop through the file names, open each file, then loop through each line in the file and test it against the pattern. If it matches, we append it to a list. Finally we return the list.

This meets all of our requirements, sort of.

Let's talk about what is happening here. When we call find_matches we are passing control over to this function. find_matches starts executing at the first line, does all of the work required to build up our list of matches, and then returns the full list. When it returns, it hands control back to the caller and is completely finished executing.

This is generally what we expect from a function. However, there are some problems. What if we're dealing with extremely large log files? And what if there are a lot of them? Fortunately for us, iterating over the file object returned by Python's open reads one line at a time rather than loading the entire file into memory, so we're safe there. But what if our matches list far exceeds the available memory on our machine? I know what you're thinking: we could just buy more memory or a new computer altogether, but that doesn't scale. The right solution is to be more efficient.

Enter generators.

2

Generator functions and generator objects

A generator function looks very similar to a regular function, but there's one major difference: yield. When you include the yield keyword in a function, the function automatically becomes a generator function.

This means that when we call the generator function, it immediately returns a generator object without executing any of the function body. Only when we call next on that object does execution begin, and it runs until it reaches a yield statement.
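To make the delayed start visible, here is a minimal sketch. The countdown function and the log list are made up for illustration; the list simply records when the function body actually runs.

```python
log = []  # records when the body of countdown actually executes

def countdown(n):
    log.append("started")
    while n > 0:
        yield n
        n -= 1
    log.append("finished")

gen = countdown(2)            # returns a generator object; no body code has run
log_before_next = list(log)   # still empty

first = next(gen)             # runs the body up to the first yield
log_after_next = list(log)    # now contains "started"

print(log_before_next)  # []
print(first)            # 2
print(log_after_next)   # ['started']
```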

Here's a basic, useless example:

def useless():
    yield "King Arthur"
    yield "Brave Sir Robin"
    yield "Sir Galahad the Chaste"

u = useless()

print(next(u))
# King Arthur
print(next(u))
# Brave Sir Robin
print(next(u))
# Sir Galahad the Chaste

You can see here that whenever you call next on the generator object, it continues executing until it reaches another yield statement. So what happens if you call next again?

Well, it raises a StopIteration exception.

$ python useless.py
King Arthur
Brave Sir Robin
Sir Galahad the Chaste

Traceback (most recent call last):
  File "useless.py", line 10, in <module>
    print(next(u))
StopIteration
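In practice you rarely see this exception, because a for loop calls next for you and stops quietly when StopIteration is raised. The sketch below reuses the useless generator to show three ways of dealing with exhaustion; the fallback string is just a placeholder value.

```python
def useless():
    yield "King Arthur"
    yield "Brave Sir Robin"
    yield "Sir Galahad the Chaste"

# 1. A for loop (or comprehension) handles StopIteration for us.
knights = [name for name in useless()]

# 2. next accepts an optional default, returned instead of raising.
u = useless()
for _ in range(3):
    next(u)                        # use up all three values
fallback = next(u, "nobody left")

# 3. Catch the exception by hand.
try:
    next(u)
    exhausted = False
except StopIteration:
    exhausted = True

print(knights)    # ['King Arthur', 'Brave Sir Robin', 'Sir Galahad the Chaste']
print(fallback)   # nobody left
print(exhausted)  # True
```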

What is interesting about the generator function is that even though control is passed back to the caller, its state is frozen. Calling next simply resumes execution until it reaches another yield statement.
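Frozen state includes local variables. In this sketch, running_total is a hypothetical generator whose local variable total keeps its value between calls to next.

```python
def running_total(numbers):
    total = 0  # local variable, preserved while the generator is paused
    for n in numbers:
        total += n
        yield total  # pause here; total and the loop position are kept

totals = running_total([10, 20, 30])
a = next(totals)
b = next(totals)
c = next(totals)
print(a, b, c)  # 10 30 60
```

A regular function would lose total as soon as it returned. The generator picks up exactly where it left off.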

The value of using a generator for our purpose is clear. We can now write a generator function that yields one match at a time rather than loading all of the matches into memory.

def find_matches(filenames, pattern):
    for fname in filenames:
        # the with block closes each file once we are done with it
        with open(fname) as f:
            for line in f:
                if pattern in line:
                    yield line

We can call it the same way and get the same apparent results.

files = ['t1.txt', 't2.txt', 't3.txt']
for match in find_matches(files, 'the'):
    print(match)

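The difference is that our generator hands back one match at a time instead of building the whole list, so it can handle extremely large files and many of them. Getting the same behavior without a generator would mean tracking the current file and position by hand, which gets messy quickly.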
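One consequence of yielding matches one at a time is that the caller can stop early. The sketch below is self-contained under a few assumptions: it writes two small throwaway files with made-up contents to a temporary directory, and uses itertools.islice to take only the first two matches.

```python
import itertools
import os
import tempfile

def find_matches(filenames, pattern):
    for fname in filenames:
        with open(fname) as f:
            for line in f:
                if pattern in line:
                    yield line

# Create two small sample files (contents are made up for illustration).
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["the cat\na dog\n", "over the moon\nthe end\n"]):
    path = os.path.join(tmpdir, "t%d.txt" % i)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

gen = find_matches(paths, "the")
# islice asks for two matches only; later lines are never examined.
first_two = list(itertools.islice(gen, 2))
gen.close()  # finish the paused generator so its open file is closed

print(first_two)  # ['the cat\n', 'over the moon\n']
```

With the list-based version, every file would have been read to the end before the first match was available.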