Python collections

Much of what you need to do with Python can be done using built-in containers like dict, list, set, and tuple. But these aren't always the best fit. In this guide, I'll cover why and when to use the specialized containers in the collections module and provide interesting examples of each. This is designed to supplement the documentation with examples and explanation, not replace it.

1

from collections import Counter

A counter is a dictionary-like object designed to keep tallies. With a counter, the key is the item to be counted and the value is the count. You could certainly use a regular dictionary to keep a count, but a counter adds conveniences such as a zero default for missing keys, the most_common method, and arithmetic between counters.

A Counter is a subclass of dict, so it looks just like a dictionary and supports the dictionary interface.

ctr = Counter({'birds': 200, 'lizards': 340, 'hamsters': 120})
ctr['hamsters'] # 120

One thing to note is that if you try to access a key that doesn't exist, the counter will return 0 rather than raising a KeyError as a standard dictionary would.
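A quick sketch of that behaviour, reusing the animal counts from above. The 'snakes' key is just an illustrative missing key, and plain is a hypothetical name for a regular dict copy.

```python
from collections import Counter

ctr = Counter({'birds': 200, 'lizards': 340, 'hamsters': 120})

# A missing key reads as 0 instead of raising KeyError
missing = ctr['snakes']
print(missing)  # 0

# Looking up a missing key does not add it to the counter
print('snakes' in ctr)  # False

# A plain dict raises KeyError for the same lookup
plain = dict(ctr)
try:
    plain['snakes']
except KeyError:
    print('KeyError from the plain dict')
```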

Counters come with a brilliant set of methods that will make your life easier if you learn how to use them.

Get the most common word in a text file

import re
# read the whole file, lowercase it, and split it into words
with open('ipencil.txt') as f:
    words = re.findall(r'\w+', f.read().lower())
Counter(words).most_common(1) # [('the', 148)]

Get the count of each number in a long string of numbers

numbers = """
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
"""
numbers = re.sub("\n", "", numbers)
Counter(numbers).most_common()
[('2', 112),
 ('5', 107),
 ('4', 107),
 ('6', 103),
 ('9', 100),
 ('8', 100),
 ('1', 99),
 ('0', 97),
 ('7', 91),
 ('3', 84)]

most_common is a very valuable method. If you pass in an integer as the first argument, it returns that many results, most frequent first. If you call it without any arguments, it returns every element with its count. As you can see, it returns a list of tuples, each structured as (element, count).
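Here is a small sketch of most_common with and without an argument. The word 'abracadabra' is just a convenient sample input, and the order shown for tied counts assumes Python 3.7 or later, where ties keep first-seen order.

```python
from collections import Counter

letters = Counter('abracadabra')

top_one = letters.most_common(1)     # only the single most frequent element
top_three = letters.most_common(3)   # the three most frequent elements
everything = letters.most_common()   # every element, most frequent first

print(top_one)     # [('a', 5)]
print(top_three)   # [('a', 5), ('b', 2), ('r', 2)]
print(everything)

# Each entry unpacks as (element, count)
for element, count in top_three:
    print(element, count)
```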

When dealing with multiple Counter objects you can perform operations against them. For instance, you can add two counters, which adds the counts for each key. You can also take the intersection (&) or union (|) of two counters: for each key, intersection keeps the minimum of the two counts and union keeps the maximum.

For example, a student has taken 4 quizzes two times each. She is allowed to keep the highest score for each quiz.

first_attempt = Counter({1: 90, 2: 65, 3: 78, 4: 88})
second_attempt = Counter({1: 88, 2: 84, 3: 95, 4: 92})
final = first_attempt | second_attempt
final # Counter({3: 95, 4: 92, 1: 90, 2: 84})
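The other operations work the same way. Below is a minimal sketch with made-up fruit counts; the names monday and tuesday and all of the numbers are purely illustrative.

```python
from collections import Counter

monday = Counter({'apples': 3, 'pears': 1})
tuesday = Counter({'apples': 2, 'pears': 4, 'plums': 5})

total = monday + tuesday        # add counts key by key
difference = tuesday - monday   # subtract, dropping results that are not positive
lowest = monday & tuesday       # intersection: minimum of each count
highest = monday | tuesday      # union: maximum of each count

print(dict(total))       # {'apples': 5, 'pears': 5, 'plums': 5}
print(dict(difference))  # {'pears': 3, 'plums': 5}
print(dict(lowest))      # {'apples': 2, 'pears': 1}
print(dict(highest))     # {'apples': 3, 'pears': 4, 'plums': 5}
```

Note that subtraction and intersection drop any key whose result is zero or negative, which is why 'apples' is missing from difference and 'plums' is missing from lowest.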

2

from collections import deque

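deque stands for "double-ended queue" and can be used as a stack or a queue. Although lists offer many of the same operations, they are optimized for fixed-length operations: inserting or popping at the front of a list is O(n) because every other element has to shift, while a deque appends and pops at either end in approximately O(1) time.

How do you know when to use a deque versus a list?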

Basically if you're structuring the data in a way that requires quickly appending to either end or retrieving from either end then you would want to use a deque. For instance, if you're creating a queue of objects that need to be processed and you want to process them in the order they arrived, you would want to append new objects to one end and pop objects off of the other end for processing.

queue = deque()
# append values to wait for processing
queue.appendleft("first")
queue.appendleft("second")
queue.appendleft("third")
# pop values when ready
process(queue.pop()) # would process "first"
# add values while processing
queue.appendleft("fourth")
# what does the queue look like now?
queue # deque(['fourth', 'third', 'second'])

As you can see we're adding items to the left and popping them from the right. Deque provides four commonly used methods for appending and popping from either side of the queue: append, appendleft, pop, and popleft.
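To make the four methods concrete, here is a short sketch on a tiny deque of letters (the values are arbitrary):

```python
from collections import deque

d = deque(['b', 'c'])

d.append('d')        # add on the right
d.appendleft('a')    # add on the left
print(d)             # deque(['a', 'b', 'c', 'd'])

right = d.pop()      # remove from the right
left = d.popleft()   # remove from the left
print(left, right)   # a d
print(d)             # deque(['b', 'c'])
```

The queue example above pairs appendleft with pop; pairing append with popleft gives the same first-in, first-out behaviour from the opposite direction.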

In the above example we started with an empty deque, but we can also create a deque from another iterable.

>>> numbers = [0, 1, 2, 3, 5, 7, 11, 13]
>>> queue = deque(numbers)
>>> print(queue)
deque([0, 1, 2, 3, 5, 7, 11, 13])

Or how about from a range:

>>> queue = deque(range(0, 10))
>>> print(queue)
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

3

from collections import defaultdict

Suppose you have a sequence of key-value pairs. Perhaps you are keeping track of how many miles you run each day, and you want to know which day of the week you are most active.

days = [('monday', 2.5), ('wednesday', 2), ('friday', 1.5), ('monday', 3), ('tuesday', 3.5), ('thursday', 2), ('friday', 2.5)]
active_days = defaultdict(float)
for k, v in days:
    active_days[k] += v
# defaultdict(<class 'float'>, {'monday': 5.5, 'wednesday': 2.0, 'friday': 4.0, 'tuesday': 3.5, 'thursday': 2.0})

This can be accomplished with other data types, but defaultdict lets us specify a factory (here, float) that supplies the default value for any missing key. This is simpler and faster than using a regular dict with dict.setdefault.

You pass in the default factory upon instantiation. Then you can immediately begin updating values even if the key is not yet set. A normal dictionary would raise a KeyError if you tried this.

Here is an example using a list as the default value. Here we have a list of tuples. Each tuple has a letter and a number, and the letters are both uppercase and lowercase. Suppose we want to make a list of values grouped by letter, ignoring case.

letters = [('A', 10), ('B', 3), ('C', 4), ('a', 36), ('b', 8), ('c', 10)]
grouped_letters = defaultdict(list)
for k, v in letters:
    grouped_letters[k.lower()].append(v)
# defaultdict(<class 'list'>, {'a': [10, 36], 'b': [3, 8], 'c': [4, 10]})
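For comparison, here is a rough sketch of the same grouping attempted with a regular dict. The names plain and with_setdefault are only for illustration.

```python
from collections import defaultdict

letters = [('A', 10), ('B', 3), ('C', 4), ('a', 36), ('b', 8), ('c', 10)]

# A regular dict raises KeyError on the first missing key
plain = {}
try:
    plain['a'].append(10)
except KeyError:
    print('regular dict: KeyError')

# dict.setdefault works, but the default is spelled out on every call
with_setdefault = {}
for k, v in letters:
    with_setdefault.setdefault(k.lower(), []).append(v)

# defaultdict calls list() for missing keys automatically
grouped_letters = defaultdict(list)
for k, v in letters:
    grouped_letters[k.lower()].append(v)

print(with_setdefault == grouped_letters)  # True
```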

4

from collections import namedtuple

A namedtuple is a ... named tuple. When you use a standard tuple it's difficult to convey the meaning of each position of the tuple. A named tuple is just like a normal tuple, but it allows you to give names to each position making the code more readable and self-documenting. Also with a namedtuple you can access the positions by name as well as index.

To create a namedtuple type, we pass in the name of the type we want to create, followed by a list of field names.

coordinate = namedtuple('Coordinate', ['x', 'y'])

Now when we want to use our named tuple, coordinate, we can use it like a tuple.

c = coordinate(10, 20)

Or we can instantiate by name:

c = coordinate(x=10, y=20)

And just like a normal tuple we can still access by index and unpack, but our namedtuple lets us access values by name as well.

>>> x, y = c
>>> x, y
(10, 20)
>>> c.x
10
>>> c.y
20
>>> c[0]
10
>>> c[1]
20

A great example comes straight from the documentation. If we want to grab data from a csv and provide useful names for the positions rather than just indices, we can use a named tuple:

import csv
User = namedtuple('User', 'name, email, username, staff')
# each CSV row becomes a User; fields are then read by name
with open("users.csv", newline="") as f:
    for user in map(User._make, csv.reader(f)):
        print(user.name, user.email)

In the above example, we're using the _make method which accepts an iterable and produces the namedtuple based on those values.

Using our coordinate example, we can create a coordinate from a list using _make.

>>> c = [30, 45]
>>> coordinate._make(c)
Coordinate(x=30, y=45)

You can convert a dictionary to a namedtuple using the double-star operator (**).

>>> c = {'x': 30, 'y': 45}
>>> coordinate(**c)
Coordinate(x=30, y=45)
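Going the other direction, namedtuples also provide _asdict and _replace. A minimal sketch, reusing the coordinate type from above (the names as_dict and moved are just for illustration):

```python
from collections import namedtuple

coordinate = namedtuple('Coordinate', ['x', 'y'])
c = coordinate(x=30, y=45)

# _asdict returns the fields as a mapping of name to value;
# dict() is used here so the result prints the same on any Python 3 version
as_dict = dict(c._asdict())
print(as_dict)  # {'x': 30, 'y': 45}

# Namedtuples are immutable, so _replace returns a new instance
moved = c._replace(x=0)
print(moved)    # Coordinate(x=0, y=45)
print(c)        # Coordinate(x=30, y=45)
```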

5

from collections import OrderedDict

OrderedDicts act just like regular dictionaries except they remember the order that items were added. This matters primarily when you are iterating over the OrderedDict as the order will reflect the order in which the keys were added.

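Before Python 3.7, a regular dictionary made no promise about order (from 3.7 on, plain dicts also preserve insertion order):

d = {}
d['a'] = 1
d['b'] = 10
d['c'] = 8
for letter in d:
    print(letter)
# possible output on older Pythons:
# a
# c
# b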

You can imagine what an OrderedDict would do:

d = OrderedDict()
d['a'] = 1
d['b'] = 10
d['c'] = 8
for letter in d:
    print(letter)
# a
# b
# c

It simply maintains the order. As a subclass of dict, OrderedDict has all of the same methods. Because it cares about order, it also adds a few of its own. OrderedDict.popitem pops the most recently added element (LIFO), unless last=False is specified, in which case it pops the first element added (FIFO).

d = OrderedDict()
d['a'] = 1
d['b'] = 10
d['c'] = 8
d.popitem()
# ('c', 8)
d
# OrderedDict([('a', 1), ('b', 10)])
d.popitem(last=False)
# ('a', 1)
d
# OrderedDict([('b', 10)])
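On Python 3.2 and later, move_to_end is another of the order-aware methods. A short sketch, using the same three keys (after_first and after_second are illustrative names):

```python
from collections import OrderedDict

d = OrderedDict()
d['a'] = 1
d['b'] = 10
d['c'] = 8

# Move 'a' to the right end
d.move_to_end('a')
after_first = list(d)
print(after_first)   # ['b', 'c', 'a']

# Move 'c' to the left end
d.move_to_end('c', last=False)
after_second = list(d)
print(after_second)  # ['c', 'b', 'a']
```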

Since order matters in iteration, you can iterate over an OrderedDict backwards using reversed().

d = OrderedDict()
d['a'] = 1
d['b'] = 10
d['c'] = 8
for letter in reversed(d):
    print(letter)
# c
# b
# a

6

Check out the official Python documentation for the collections module for the full reference.