Mastering Python’s Iteration Protocol: Iterables, Iterators, and Generators


Over the years, I’ve received innumerable queries from (mostly) undergraduate computer science students on the difference between iterables, iterators, and generators in programming languages like Python and JavaScript.

This article consolidates into one master reference all of the answers I’ve given, previously scattered across internal lecture notes, private Notion documents, Discord servers, and direct messages.

This edition has been adapted for Python idioms. The JavaScript edition, which covers the exact same concepts but with JavaScript’s primitives instead, will be coming soon.



The Iterator Contract

Our story begins with the Iterator Contract: the very interface that underpins our entire discussion.



Iterables and __iter__

Mechanically and contractually, an Iterable is any object that implements the __iter__ dunder method (and can thus be passed as an argument to the builtin iter function). Semantically, it is any container- or collection-like object (e.g., lists, sets, trees, etc.) that can be “iterated” on.

The mechanics of “being iterated on” are quite simple. When we pass an object to the iter function, we expect it to return an object called the iterator. The iterator is responsible for maintaining the current iteration state of the collection traversal.



Iterators and __next__

Mechanically and contractually, an Iterator is any object that implements the __next__ dunder method (and can thus be passed as an argument to the builtin next function). Semantically, it is an object that keeps track of the current iteration state. (Pedantically, an Iterator should also implement the Iterable interface, but this detail is sometimes omitted in practice.)
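To make that pedantic note concrete, the abstract base classes in collections.abc let us check both interfaces at runtime. A quick sketch:

```python
from collections.abc import Iterable, Iterator

# A `list` is an `Iterable`, but it is not an `Iterator` itself.
items = [1, 2, 3]
assert isinstance(items, Iterable)
assert not isinstance(items, Iterator)

# Its iterator satisfies both interfaces (recall the pedantic note:
# an `Iterator` should also implement the `Iterable` interface).
iterator = iter(items)
assert isinstance(iterator, Iterable)
assert isinstance(iterator, Iterator)
```

These checks work even without explicit subclassing because the ABCs structurally detect the `__iter__` and `__next__` dunder methods.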

As you might expect, the next function returns the “next” item in the iteration (over all elements of an Iterable container). You can visualize the Iterator as a “stream” of items that you can request by repeatedly invoking the next function. When there are no more items left, the underlying __next__ implementation must raise the StopIteration exception to signal that the Iterator has been exhausted.

# The `range` object is an `Iterable` container.
container = range(3)

# We can therefore obtain an `Iterator` over its elements.
iterator = iter(container)

# We can safely extract the first three items via `next`.
print(next(iterator)) # 0
print(next(iterator)) # 1
print(next(iterator)) # 2

# But the final invocation will raise an exception!
try:
    next(iterator) # StopIteration
except StopIteration:
    print('Done!')

Under the hood, this is exactly how the for loop works. The object being iterated on is first passed to the iter function to obtain an Iterator. Then, the for loop invokes next until it hits the StopIteration exception.

# This is fancy syntactic sugar for...
for i in range(3):
    print(i)

# ... a loop that invokes `next` over and over again
# until there are no more elements left.
container = range(3)
iterator = iter(container)
while True:
    try:
        print(next(iterator))
    except StopIteration:
        break



Aside: The Importance of Brand New Independent Instances

I want to emphasize why iter is typically expected to return a brand new iterator object. Naturally, when you create an iterator, you expect the returned instance to maintain its own self-contained iteration state, independent of other invocations of iter.

# A `list` of `str` objects is an `Iterable` out of the box.
items = ['hello', 'world']

# These two invocations of `iter` create independent iterators
# that each maintain their own state.
first_iterator = iter(items)
other_iterator = iter(items)
assert first_iterator is not other_iterator

# Mutating the iteration state of one iterator should not leak
# into other iterators. They are their own instances!

print(next(first_iterator)) # 'hello'
print(next(other_iterator)) # 'hello'

print(next(first_iterator)) # 'world'
print(next(other_iterator)) # 'world'

There are exceptions to this rule, however, which we’ll see in the worked examples of the following section.
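As a preview of one such exception: by convention, calling iter on an object that is already an iterator returns that very same object, so the iteration state is shared rather than independent.

```python
items = ['hello', 'world']
iterator = iter(items)

# `iter` on an existing iterator hands back the same object...
assert iter(iterator) is iterator

# ...so "both" variables advance one shared iteration state.
same = iter(iterator)
print(next(iterator)) # 'hello'
print(next(same))     # 'world'
```

This convention is what lets a for loop accept a bare iterator (e.g. a generator) directly, not just a container.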



Worked Examples

To get a feel for how the Iterator Contract is typically implemented, let’s go through a few worked examples.



Infinite Counter

Let’s implement an infinite counter. It takes in a start argument which determines where to start counting to infinity from.

from dataclasses import dataclass

@dataclass
class InfiniteCounter:
    # The starting point of the counter.
    start: int = 0

    # Implementing the `Iterable` contract
    def __iter__(self):
        # Observe that we're returning a newly constructed iterator every time.
        return _InfiniteCounterIterator(self.start)

@dataclass
class _InfiniteCounterIterator:
    '''
    Tracks where we are now in the iteration.
    This is our private `Iterator` class.
    '''
    current: int

    def __iter__(self):
        # We could clone ourselves here so that we don't hand out
        # references to our iteration state. Otherwise, multiple
        # variables could "advance" the same underlying iterator.
        # We don't want that!
        #
        # However, it is totally legal to just return `self` here. In fact,
        # many iterators from the standard library simply `return self`
        # (likely to avoid the performance overhead of cloning). As such,
        # we will do the same. We must keep in mind the footguns, though!
        return self
        # return _InfiniteCounterIterator(self.current)

    def __next__(self):
        value = self.current # take the current value
        self.current += 1    # advance the iterator
        return value

# This is an infinite loop because we never raise `StopIteration`!
container = InfiniteCounter()
for i in container:
    print(i) # 0, 1, 2, 3, ...



Singly Linked List

Now let’s do the same drill for a basic linked list implementation.

from dataclasses import dataclass
from typing import Optional, Self

@dataclass
class Node[T]:
    next: Optional[Self]
    data: T

@dataclass
class LinkedList[T]:
    head: Optional[Node[T]]

    def __iter__(self):
        return _LinkedListIterator(self.head)

@dataclass
class _LinkedListIterator[T]:
    current: Optional[Node[T]]

    def __iter__(self):
        return self

    def __next__(self):
        if self.current is None:
            # No more data to yield!
            raise StopIteration

        data = self.current.data         # take the current value
        self.current = self.current.next # advance the iterator
        return data



Generators

Now I’m sure you’re thinking that writing explicit Iterator classes is such a hassle—and you’d be right! In Pythonic code, we rarely interface with iterators at such a low level. Typically, we use generators and comprehension syntax to make our lives easier.

Syntactically, a Generator is a function that contains at least one yield keyword. Contractually, it returns an Iterator that streams the yield-ed items. Colloquially, it is a “pausable function”.

# Generators are best understood by example.
def generate_hello_world():
    print('start')
    yield 'hello' # pause!
    print('middle')
    yield 'world' # pause!
    print('end')

# Generators return iterators!
iterator = generate_hello_world()

print(next(iterator))
# 'start'
# 'hello'

print(next(iterator))
# 'middle'
# 'world'

print(next(iterator))
# 'end'
# StopIteration!

There are also times when it is useful to exhaust an inner iterator from within the generator before proceeding. This is uncommon in practice, but it’s handy when you need it! For that, we have the yield from syntax.

from collections.abc import Iterator

def concat_iterators[T](
    first: Iterator[T],
    other: Iterator[T],
    bonus: T,
):
    # Exhaust this iterator first and forward the results.
    yield from first
    # Then we can exhaust the second iterator after.
    yield from other
    # We can even yield our own items at any point.
    yield bonus

from collections.abc import Iterator

def flatten_iterators[T](iterators: Iterator[Iterator[T]]):
    ''' A fancy way to flatten an iterator of iterators. '''
    for iterator in iterators:
        yield from iterator
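As a sanity check, `yield from inner` behaves like looping over `inner` and re-yielding each item. A minimal, self-contained sketch (simplified from `concat_iterators` above):

```python
def concat(first, other):
    # Equivalent to: for item in first: yield item
    yield from first
    # Only once `first` is exhausted do we move on.
    yield from other

merged = concat(iter('AB'), iter('CD'))
print(list(merged)) # ['A', 'B', 'C', 'D']
```

(Note that `yield from` also forwards sent values and return values in coroutine-style usage, but that is beyond our scope here.)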



The Comprehension Syntax

You may also be familiar with generators through a different syntax: list/set/dictionary comprehensions!

# Plain comprehensions are just fancy syntactic sugar over generators!
iterator = (i for i in range(10) if i % 2 == 0)
print(next(iterator)) # 0
print(next(iterator)) # 2
print(next(iterator)) # 4
print(next(iterator)) # 6
print(next(iterator)) # 8
print(next(iterator)) # StopIteration!

One common use case for generators is the ability to “aggregate” them into a result or some other collection.

# These are functionally equivalent!
items = [i for i in range(10) if i % 2 == 0]

# The `list` constructor also accepts a `Generator` argument.
# The yielded items will be streamed into the final `list`.
items = list(i for i in range(10) if i % 2 == 0)

# The same is true for sets and dictionaries!
items = { i for i in range(10) if i % 2 == 0 }
items = set(i for i in range(10) if i % 2 == 0)

# Dictionary comprehensions stream key-value pairs instead.
squares = { i: i * i for i in range(10) }
squares = dict((i, i * i) for i in range(10))

Internally, you can imagine these “generator consumers” as for loops that process each element into an aggregation.

from collections.abc import Iterator

def collect_into_list[T](items: Iterator[T]):
    ''' This is effectively what the `list` constructor does. '''
    result = list[T]()
    for item in items:
        result.append(item)
    return result

def collect_into_set[T](items: Iterator[T]):
    ''' This is effectively what the `set` constructor does. '''
    result = set[T]()
    for item in items:
        result.add(item)
    return result
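One consequence of this streaming model is worth flagging: an iterator can only be consumed once, so collecting the same generator a second time yields nothing.

```python
evens = (i for i in range(10) if i % 2 == 0)

# The first collection drains the stream...
print(list(evens)) # [0, 2, 4, 6, 8]

# ...so the second collection finds it already exhausted.
print(list(evens)) # []
```

If you need to traverse the data twice, collect it into a real container first (or call iter on the original Iterable again).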

from collections.abc import Iterator

def sum_floats(numbers: Iterator[float]):
    ''' This is roughly what the builtin `sum` function does. '''
    total = 0.0
    for number in numbers:
        total += number
    return total

def avg_floats(numbers: Iterator[float]):
    ''' Now let's get fancy with arithmetic means! '''
    items = 0
    total = 0.0

    for number in numbers:
        items += 1
        total += number

    # Let's assume non-empty iterators for simplicity.
    assert items > 0

    # Alternatively, we may implement this as a running average.
    return total / items
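For the curious, the running-average alternative mentioned in the comment can be sketched as an incremental mean (the function name is mine):

```python
from collections.abc import Iterator

def running_avg(numbers: Iterator[float]) -> float:
    '''
    Incremental (running) mean: nudge the average toward each
    new number instead of tracking the raw total.
    '''
    average = 0.0
    count = 0
    for number in numbers:
        count += 1
        # Shift the old average toward the new number by 1/count.
        average += (number - average) / count

    # Again, assume non-empty iterators for simplicity.
    assert count > 0
    return average

print(running_avg(iter([1.0, 2.0, 3.0, 4.0]))) # 2.5
```

This form avoids accumulating a large total, which can help with floating-point overflow on very long streams.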

from random import random

# Pro tip: you don't need a list comprehension here!
# You can use plain generators to avoid the unnecessary
# memory allocation for the temporary `list`.
total = sum(i for i in range(10) if i % 2 == 0)
float_total = sum_floats(random() for _ in range(10))
float_average = avg_floats(random() for _ in range(10))



Simplifying the Worked Examples

Now let’s take the worked Iterator examples from the previous section and see how we can leverage generators to simplify their implementation.

from dataclasses import dataclass

@dataclass
class InfiniteCounter:
    # The starting point of the counter.
    start: int = 0

    def __iter__(self):
        # Now the infinite loop is more apparent...
        current = self.start
        while True:
            yield current
            current += 1
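A quick check, using a condensed copy of the class above (the shortened name is mine), that a generator-based __iter__ still satisfies the “brand new independent instances” rule for free:

```python
class Counter:
    ''' A condensed copy of `InfiniteCounter` for demonstration. '''

    def __init__(self, start: int = 0):
        self.start = start

    def __iter__(self):
        # Each call to `__iter__` creates a fresh generator object.
        current = self.start
        while True:
            yield current
            current += 1

counter = Counter()
a = iter(counter)
b = iter(counter)

# Two independent iterators, each with their own state.
assert a is not b
print(next(a), next(a), next(b)) # 0 1 0
```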

from dataclasses import dataclass
from typing import Optional, Self

@dataclass
class Node[T]:
    next: Optional[Self]
    data: T

    def __iter__(self):
        # Do you notice the parallels to
        # the explicit iterator version?
        current = self
        while current is not None:
            yield current.data
            current = current.next

@dataclass
class LinkedList[T]:
    head: Optional[Node[T]]

    def __iter__(self):
        # Now we can just invoke the inner generator
        # since we've moved the generator to the `Node`.
        if self.head is not None:
            yield from self.head



The Conveniences of itertools

Now that we’ve seen how we can build iterator primitives simply out of the next function and even the yield keyword, let’s explore some of the builtin iterator utilities of the standard library: the itertools module.

# Infinite Iterators
import itertools as it

# Careful: unpacking these directly would loop forever!
# We clamp each one with `islice` for demonstration.
print(*it.islice(it.count(10), 4))     # 10 11 12 13 ...
print(*it.islice(it.cycle('ABCD'), 8)) # A B C D A B C D ...
print(*it.islice(it.repeat(10), 4))    # 10 10 10 10 ...

# Iterator Combinators
import itertools as it

print(*it.accumulate(range(1, 6))) # 1 3 6 10 15
print(*it.batched('ABCDEFG', n=2)) # AB CD EF G
print(*it.chain('ABC', 'DEF')) # A B C D E F
print(*it.chain.from_iterable(['ABC', 'DEF'])) # A B C D E F
print(*it.pairwise('ABCDEFG')) # AB BC CD DE EF FG
print(*it.zip_longest('ABCD', 'ab', fillvalue='-')) # Aa Bb C- D-
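Many of these combinators are themselves thin wrappers over the generator machinery we built by hand earlier. As an illustration, here is a rough sketch of how a `pairwise`-like combinator could be written (simplified; the real `itertools.pairwise` is implemented in C):

```python
def my_pairwise(iterable):
    ''' A simplified sketch of `itertools.pairwise`. '''
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return # fewer than two items: nothing to pair

    for current in iterator:
        yield (previous, current)
        previous = current

print(list(my_pairwise('ABCD'))) # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```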

# Combinatoric Iterators
import itertools as it

print(*it.product('ABCD', repeat=2))
# AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD

print(*it.permutations('ABCD', 2))
# AB AC AD BA BC BD CA CB CD DA DB DC

print(*it.combinations('ABCD', 2))
# AB AC AD BC BD CD

print(*it.combinations_with_replacement('ABCD', 2))
# AA AB AC AD BB BC BD CC CD DD
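As a sanity check on the sizes of these outputs, the counts line up with the counting functions in the math module:

```python
import itertools as it
from math import comb, perm

# 4 * 4 = 16 ordered pairs with repetition.
assert len(list(it.product('ABCD', repeat=2))) == 4 ** 2

# P(4, 2) = 12 ordered pairs without repetition.
assert len(list(it.permutations('ABCD', 2))) == perm(4, 2)

# C(4, 2) = 6 unordered pairs without repetition.
assert len(list(it.combinations('ABCD', 2))) == comb(4, 2)

# C(4 + 2 - 1, 2) = 10 unordered pairs with repetition.
assert len(list(it.combinations_with_replacement('ABCD', 2))) == comb(4 + 2 - 1, 2)
```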



Recap: what did we learn?

We’ve discussed a lot, so let’s recap what we’ve learned.

  • An Iterable is something you can create an Iterator from via the builtin iter function.
    • You can then repeatedly invoke the builtin next function with an Iterator as its argument until the StopIteration exception is raised.
    • Internally, a for loop is just syntactic sugar for advancing an Iterator.
  • Generator functions and the yield keyword are another syntactic sugar over explicit Iterator implementations.

    • Generators stream data to the caller via the yield keyword.
    • Generators use the yield from keyword to exhaust internal iterators first before proceeding.
    • Generators are almost like “pausable functions”, but not intended for that exact purpose.
  • The comprehension syntax is syntactic sugar over explicit simple Generator implementations.
  • The itertools module is your best friend.

There’s a common thread here: it’s always one syntactic sugar over the other. That’s what makes iterables, iterators, and generators so confusing to beginners! To best understand fancy generators, we had to go back to first principles: the Iterator Contract. Everything hinges on that basic design pattern as the primitive for composable utilities.

Now, a lot of things in computer programming (and computer science in general!) are like that. We face abstractions upon abstractions upon abstractions every day. Although it’s more convenient to live in higher levels of abstraction, we must never neglect understanding how things really work at the lowest levels.

The best engineers never shy away from digging deeper.


