Iterables in Segments


Start with an iterable (list, tuple, set, etc.) and loop through it. Only instead of taking one element at a time, grab a segment of elements and loop through those. Recently I needed to improve on some ways I found to do this, so I documented three approaches.

To keep things simple,

  • I refer to the iterable as itr
  • the sub iterable as sub_itr
  • the segment size or len() of the sub iterable as n
  • and the functions as grouper

Method 1: The Naive Approach

The solution here is trivial: loop through the iterable n elements at a time.

We can use the step parameter of Python's built-in range function to generate the start index of each segment, then slice the iterable as we loop. That code would look something like this:

for start in range(0, len(itr), n):
    sub_itr = itr[start:start + n]
    for i in sub_itr:
        doSomething(i)

This method is obviously not the most conservative, in either memory or time. Luckily both can easily be corrected.
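To see the slicing idea end to end, here is a minimal runnable sketch of it. It's written for Python 3 (where print is a function), and it collects the segments into a list just so the result is visible; the names and sample data are mine, not from the original:

```python
# Chunk a list by slicing: the step parameter of range gives us the
# start index of each segment. (Python 3; the idea is the same in 2.)
itr = list(range(1, 11))
n = 3

segments = []
for start in range(0, len(itr), n):
    segments.append(itr[start:start + n])  # at most n elements each

print(segments)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```

Note the last segment is simply shorter when len(itr) isn't a multiple of n; slicing never pads.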

Method 2: The Conservative Approach

If our problem is memory, we can certainly improve upon that by simply swapping our range function for its ex, the xrange.

[xrange returns an] opaque sequence type which yields the same values as the corresponding list, without actually storing them all simultaneously

So now we're being smarter with memory: range materialized its entire sequence of indices up front, and xrange doesn't. Cool! But what about that nested loop?

We can improve upon that tremendously with the use of a generator. I am not going to explain generators any better than Jeff Knupp did in this blog post. In short, generators are iterators that you can only iterate over once. They don't store their values; rather, they 'generate' them lazily, one at a time as requested. You'll commonly see the yield statement in generators.
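As a quick illustration of that lazy, one-pass behavior (my toy example, not from Knupp's post), a generator produces one value per yield and remembers where it left off:

```python
def countdown(k):
    # Each value is produced only when the caller asks for it.
    while k > 0:
        yield k
        k -= 1

gen = countdown(3)
print(next(gen))   # 3
print(list(gen))   # [2, 1] -- the rest of the values
print(list(gen))   # [] -- exhausted; a generator runs only once
```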

Our code now looks like this, and I suspect it's going to clean up our memory footprint and speed things up a bit.

def grouper(itr, n):
    for start in xrange(0, len(itr), n):
        yield itr[start:start + n]
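For anyone on Python 3, where xrange is gone because range is already lazy, a direct translation of this generator would be the following sketch:

```python
def grouper(itr, n):
    # Python 3: range no longer materializes its indices, so it plays
    # the role xrange played in Python 2.
    for start in range(0, len(itr), n):
        yield itr[start:start + n]

print(list(grouper([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```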

Method 3: itertools

When you write enough Python you develop a sense of how 'pythonic' a piece might be. This implementation didn't feel there yet. So I did what I always do when I need a piece of code to impress my friends: I googled it with the phrase 'most pythonic way to' in front of it. This led me to an obviously perfect question on Stack Overflow: What is the most “pythonic” way to iterate over a list in chunks?. And sure enough, the selected answer struck a chord.

So now let's import some itertools and see what happens:

from itertools import izip_longest

def grouper(itr, n):
    args = [iter(itr)] * n
    return izip_longest(*args)
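The same trick runs on Python 3 under the name itertools.zip_longest. A runnable sketch, with my own sample data, shows the one behavioral difference worth knowing: unlike the slicing versions, it pads the final group with None (configurable via the fillvalue parameter):

```python
from itertools import zip_longest  # izip_longest in Python 2

def grouper(itr, n):
    # One shared iterator repeated n times: zip_longest pulls from it
    # round-robin, yielding tuples of n consecutive elements.
    args = [iter(itr)] * n
    return zip_longest(*args)

print(list(grouper([1, 2, 3, 4, 5], 2)))  # [(1, 2), (3, 4), (5, None)]
```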

Complexity Comparisons

I've shown you three ways to iterate over a list in segments with Python. I hope they've built upon each other in ways that make the process of improving the code and its implementations easy to grasp. Ideally that progression correlates with the performance of each method, but let's see:

Some test values:

>>> ten = range(1, 10+1)
>>> thousand = range(1, 1000+1)
>>> million = range(1, 1000000+1)

To test for time I used the timeit module:

$ python grouper.py
naive_grouper(ten, 2), 0.0598 seconds
naive_grouper(thousand, 200), 0.0303 seconds
naive_grouper(million, 200000), 0.0332 seconds
conservative_grouper(ten, 2), 0.0033 seconds
conservative_grouper(thousand, 200), 0.0009 seconds
conservative_grouper(million, 200000), 0.001 seconds
itertools_grouper(ten, 2), 0.0017 seconds
itertools_grouper(thousand, 200), 0.0095 seconds
itertools_grouper(million, 200000), 10.1163 seconds
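The timing harness itself isn't shown in the post; a minimal timeit setup along these lines would produce output in that format (the function body and run count here are my stand-ins, not the original script):

```python
import timeit

def naive_grouper(itr, n):
    # Stand-in for the slicing version being timed.
    return [itr[i:i + n] for i in range(0, len(itr), n)]

ten = list(range(1, 11))

# Total wall-clock time for 1000 calls, in seconds.
elapsed = timeit.timeit(lambda: naive_grouper(ten, 2), number=1000)
print('naive_grouper(ten, 2), %.4f seconds' % elapsed)
```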

Right away we see some pretty obvious evidence as to which method is the most time efficient. Let's check memory usage. Using the memory_profiler module and the @profile decorator, I'll quickly sum up the memory usage of each function line by line:

$ python -m memory_profiler grouper.py

Filename: grouper.py

Line #    Mem usage    Increment   Line Contents
================================================
     7   39.469 MiB    0.000 MiB   @profile
     8                             def naive_grouper(itr, n):
     9   39.469 MiB    0.000 MiB       for i in range(0, len(ten), 2):
    10   39.469 MiB    0.000 MiB           print ten[i:i+3]


Filename: grouper.py

Line #    Mem usage    Increment   Line Contents
================================================
    18   39.469 MiB    0.000 MiB   @profile
    19                             def itertools_grouper(itr, n):
    20   39.469 MiB    0.000 MiB       args = [iter(itr)] * n
    21   39.477 MiB    0.008 MiB       return izip_longest(*args)

Seems like method 2 is a winner.

