Data Structures Internals in Python

Deep dive into how Set, Dict, OrderedDict, HeapQ, List, and Deque are implemented internally in CPython

Table of Contents

  1. OrderedDict
  2. HeapQ
  3. Set
  4. Dict & DefaultDict
  5. List
  6. Deque
  7. Hash Tables - Deep Dive
  8. Self-Balancing Binary Search Trees

1. OrderedDict

Overview

OrderedDict is a dict subclass that has methods specialized for rearranging dictionary order. It maintains insertion order while providing O(1) operations.

Key Methods

popitem(last=True) - Returns and removes a (key, value) pair in LIFO order if last is true or FIFO order if false.

move_to_end(key, last=True) - Moves an existing key to either end of an ordered dictionary. Useful in LRU Cache implementations.

from collections import OrderedDict

d = OrderedDict.fromkeys('abcde')
d.move_to_end('b')
''.join(d)  # 'acdeb'

d.move_to_end('b', last=False)
''.join(d)  # 'bacde'

Internal Implementation (CPython)

A hash table (normal dict) + a doubly linked list

This design gives:

  - O(1) key lookup through the hash table
  - O(1) reordering, and removal at either end, through the doubly linked list
  - O(n) iteration in insertion order

The OrderedDict Object

typedef struct {
    PyDictObject dict;        // underlying dict
    _ODictNode *od_first;     // head of linked list
    _ODictNode *od_last;      // tail of linked list
    Py_ssize_t od_state;      // mutation counter
} PyODictObject;

It literally embeds a PyDictObject, and order is maintained outside the hash table.

The Linked List Node

typedef struct _odictnode {
    PyObject *key;
    Py_hash_t hash;
    struct _odictnode *next;
    struct _odictnode *prev;
} _ODictNode;

Critical Detail: Each key in the dict has exactly one node in this doubly linked list.

How Lookup Works (Critical Detail)

Unlike a naive design, OrderedDict does not store nodes inside dict values.

Instead, the C implementation keeps a separate array of node pointers (od_fast_nodes) that runs parallel to the dict's internal entry table: a key's slot index in the dict doubles as its index into the node array.

So conceptually:

dict[key] → _ODictNode*

One hash lookup therefore locates both the value and the linked-list node - no extra hashing.

Insertion Flow

od[key] = value

  1. Compute hash(key)
  2. Check if key already exists in dict
  3. If new key: allocate a _ODictNode, link it at the tail of the list, and record it in the node array
  4. Store the actual value in the underlying dict as usual

Important: Order is updated only on first insertion, not on overwrite.
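This rule is easy to verify directly:

```python
from collections import OrderedDict

od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['a'] = 99               # overwrite: value changes, position does not

print(list(od.items()))    # [('a', 99), ('b', 2)]
```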

Deletion Flow

del od[key]

  1. Lookup node via dict
  2. Unlink node from doubly linked list
  3. Remove entry from dict
  4. Free node

All steps are O(1).

Iteration (Why It's Fast)

Iteration does not touch the hash table order at all.

od_first → node → node → node → od_last

This is why:

  - Iterating an OrderedDict is a plain O(n) pointer walk
  - reversed(od) is equally cheap, following prev pointers from od_last
  - Iteration order is unaffected by resizes of the underlying hash table

Move-to-End Operation

This is the secret sauce behind LRU caches.

od.move_to_end(key, last=True)

Internally:

  1. Find node in dict
  2. Unlink it
  3. Reinsert at head or tail

No rehashing. No copying.
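Those three steps are enough for a minimal LRU cache. The LRUCache class below is an illustrative sketch (not functools.lru_cache):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch built on OrderedDict (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used: O(1)
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used: O(1)

cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')         # 'a' becomes most recently used
cache.put('c', 3)      # evicts 'b'
print(list(cache.data))  # ['a', 'c']
```

Every cache operation is a constant number of O(1) dict and linked-list steps.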

Operation Complexities

Operation Complexity Explanation
popitem(last=True) O(1) Head/tail removal
move_to_end(key, last=True) O(1) Unlink + relink node
od[key] = value O(1) average Hash lookup + append node if new
od[key] O(1) average Dict lookup
del od[key] O(1) average Dict delete + unlink node

2. HeapQ

Overview

The heapq module provides an implementation of the heap queue algorithm, also known as the priority queue algorithm. Min-heaps are binary trees for which every parent node has a value less than or equal to any of its children (heap invariant).

Why Use HeapQ?

  - O(1) access to the smallest element (heap[0])
  - O(log n) insertion and extraction
  - O(n) construction from an existing list via heapify
  - No separate data structure: it operates in place on a plain list

Function Complexities

Function Complexity Why
heappush(heap, x) O(log n) Sift-up
heappop(heap) O(log n) Sift-down
heapreplace(heap, x) O(log n) Pop + push combined
heappushpop(heap, x) O(log n) (best-case O(1)) Conditional sift
heapify(x) O(n) Floyd's algorithm

Internal Implementation

Core design choice: A heap is just a Python list.

There is:

  - No heap class
  - No node objects
  - No parent or child pointers

The tree lives entirely in list indices: the children of heap[i] are heap[2*i + 1] and heap[2*i + 2], and its parent is heap[(i - 1) // 2].

The heap invariant (min-heap):

heap[i] <= heap[2*i + 1]
heap[i] <= heap[2*i + 2]

Why an Array-Based Heap?

Python design constraints:

  - Every element is a heap-allocated PyObject, so per-node structures would be expensive
  - Element comparisons (__lt__) dominate the cost, not index arithmetic
  - Lists are contiguous and cache-friendly

So heapq is optimized to:

  - Minimize the number of comparisons
  - Shift elements instead of swapping them (one write per level instead of three)

Example Code

import heapq

li = [25, 20, 15, 30, 40]

# Convert the list into a heap
heapq.heapify(li)

print("Heap queue:", li)  # Heap queue: [15, 20, 25, 30, 40]

The Only Two Real Operations

Everything in heapq reduces to two internal helpers:

_siftdown → used when inserting

Moves an element up the tree.

_siftup → used when removing

Moves an element down the tree.

_siftdown (Bubble Up)

Called after inserting at the end.

Simplified logic (from heapq.py):

def _siftdown(heap, startpos, pos):
    newitem = heap[pos]
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if newitem < parent:
            heap[pos] = parent
            pos = parentpos
        else:
            break
    heap[pos] = newitem

Key Internal Optimizations

  - newitem is kept in a local variable; parents are shifted down rather than swapped with it
  - Exactly one comparison and at most one write per level

⏱ O(log n) comparisons, minimal writes.

_siftup (Bubble Down)

Used after popping the root.

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    childpos = 2*pos + 1

    while childpos < endpos:
        rightpos = childpos + 1
        if rightpos < endpos and heap[rightpos] < heap[childpos]:
            childpos = rightpos
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1

    heap[pos] = newitem
    _siftdown(heap, startpos, pos)

Why _siftdown Again at the End?

This is a classic CPython micro-optimization:

  - The element moved to the root came from a leaf, so it is usually large and belongs near the bottom
  - Sinking it straight to a leaf costs one comparison per level (which child is smaller)
  - Checking it against the smaller child at every level would roughly double the comparison count
  - The final _siftdown bubbles it back up the few levels it overshot

heappush

def heappush(heap, item):
    heap.append(item)
    _siftdown(heap, 0, len(heap) - 1)

That's literally it.

heappop

def heappop(heap):
    lastelt = heap.pop()
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
        return returnitem
    return lastelt

Important detail:

  - The last element is popped first, so the list shrinks before any sifting happens
  - If that made the heap empty, the popped element was the only (and smallest) one, and is returned directly
  - Otherwise the old root is saved for return and the former last leaf is sifted down from position 0

Why heapify is O(n)

def heapify(x):
    for i in reversed(range(len(x)//2)):
        _siftup(x, i)

Explanation:

  - The loop skips the second half of the array - leaves are already valid heaps
  - A node at height h costs at most O(h) to sift down
  - Summing the work: about n/4 nodes of height 1, n/8 of height 2, ... and the series converges to O(n)
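The gap between O(n) heapify and n repeated pushes can be measured by counting comparisons. The Counting wrapper below is a hypothetical helper, not part of heapq; descending input is used because it is the worst case for repeated pushes:

```python
import heapq

class Counting:
    """Wrapper whose __lt__ counts comparisons (illustrative helper)."""
    count = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counting.count += 1
        return self.v < other.v

n = 10_000
values = list(range(n, 0, -1))   # descending: each push is a new minimum

Counting.count = 0
heapq.heapify([Counting(v) for v in values])
heapify_cmps = Counting.count    # grows linearly with n

Counting.count = 0
h = []
for v in values:
    heapq.heappush(h, Counting(v))  # each new minimum sifts all the way up
push_cmps = Counting.count          # grows like n log n

print(heapify_cmps < push_cmps)  # True
```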

How to Implement Max Heap?

Use:

heapq.heappush(heap, -x)

That is, store and pop elements by their negative values.
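For example:

```python
import heapq

nums = [5, 1, 9, 3]
max_heap = []
for x in nums:
    heapq.heappush(max_heap, -x)     # negate on the way in

largest = -heapq.heappop(max_heap)   # negate on the way out
print(largest)  # 9
```

For tuples or objects without a clean negation, a common alternative is wrapping items in a class whose __lt__ reverses the comparison.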

3. Set

Overview

A Set in Python is used to store a collection of items with the following properties:

  - Unordered: there is no positional indexing
  - Unique: duplicate items are silently discarded
  - Elements must be hashable (immutable), though the set itself is mutable

Operation Complexities

Operation Complexity (avg) Worst Notes
x in s O(1) O(n) Hash lookup
s.add(x) O(1) O(n) May resize
s.remove(x) O(1) O(n) Raises KeyError
s.discard(x) O(1) O(n) No error
len(s) O(1) O(1) Stored count
s.clear() O(n) O(n) Frees table
x not in s O(1) O(n) Negated lookup

Operations NOT Supported

Operation Reason
Indexing (s[0]) No order
Slicing No order
Stable iteration Hash-table scan
Sorting in-place Not meaningful

Internal Implementation

A Python set is a hash table, not a tree, not a bitset.

It is very similar to dict, but:

  - No values: each entry stores only (key, hash)
  - Entries are smaller (16 bytes vs 24 bytes per entry on 64-bit builds)
  - It adds C-level implementations of the set algebra (union, intersection, difference)

High-Level Structure

typedef struct {
    PyObject_HEAD
    Py_ssize_t fill;        // active + dummy entries
    Py_ssize_t used;        // active entries only
    Py_ssize_t mask;        // table size - 1
    setentry *table;        // hash table
    Py_hash_t hash;         // for frozenset only
    Py_ssize_t finger;      // iteration state
} PySetObject;

What These Mean

Field Meaning
used Number of live elements
fill Live + deleted ("dummy")
mask table_size - 1 (power of two)
table Pointer to entry array
finger Iterator progress
hash Cached hash (frozenset only)

Entry Structure

typedef struct {
    PyObject *key;
    Py_hash_t hash;
} setentry;

Special values:

  - key == NULL → the slot has never been used (empty)
  - key == dummy → the slot held an element that was deleted (tombstone)

Hashing & Indexing

Index calculation:

i = hash & mask;

Table size is always a power of two, so modulo is a cheap bitmask.

Collision Resolution (THIS IS KEY)

Python sets use open addressing with perturbation probing (not chaining, not linear probing).

Probe sequence:

i = (i * 5 + 1 + perturb) & mask;
perturb >>= 5;

Why this design?

  - The recurrence i = i*5 + 1 (mod 2^k) eventually visits every slot of a power-of-two table
  - perturb mixes in the high bits of the hash, so keys that collide in the low bits separate quickly
  - It avoids the primary clustering that plain linear probing suffers from

This same scheme is used by dict.
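The probe sequence can be sketched in pure Python. probe_sequence is an illustrative helper, not a real API; note how, once perturb is exhausted, the i*5 + 1 recurrence cycles through every slot of an 8-slot table:

```python
def probe_sequence(h, mask, n):
    """First n slots visited for hash h in a table of size mask + 1 (sketch)."""
    perturb = h
    i = h & mask
    for _ in range(n):
        yield i
        perturb >>= 5
        i = (i * 5 + 1 + perturb) & mask

# 8 probes in an 8-slot table hit all 8 slots
print(list(probe_sequence(17, 7, 8)))  # [1, 6, 7, 4, 5, 2, 3, 0]
```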

Lookup Operation (x in set)

Steps

  1. Compute hash(x)
  2. Compute initial index: hash & mask
  3. Probe until either:
     - an empty (never-used) slot is reached → x is not in the set
     - a slot with matching hash and an equal key is found → x is in the set

Complexity

Case Cost
Average O(1)
Worst (pathological hashes) O(n)

Python randomizes string hashes at startup (see PYTHONHASHSEED) to prevent DoS attacks based on engineered collisions.

Insertion (set.add(x))

  1. Lookup slot using probing
  2. If key already exists → do nothing
  3. Else: write the key and its hash into the free slot, increment used (and fill, if the slot was never used)
  4. Check load factor → resize if needed

Load Factor Rules

Resize when:

fill * 3 >= mask * 2  (~66%)

This keeps probe lengths short.

Deletion (set.remove, discard)

Deletion does not clear the slot.

Instead:

entry->key = dummy;

Why?

  - Clearing the slot would break probe chains: an element placed further along the chain would become unreachable
  - A dummy tells lookups "keep scanning past here"

Too many dummies → triggers resize.

Resizing (Rehashing)

When resizing:

  - A new table is allocated, sized from used (live elements), not fill
  - Every live entry is re-inserted at its position in the new table
  - Dummy entries are dropped, so fill collapses back to used

Cost: O(n), but amortized O(1) per operation.

Iteration

Iteration is a linear scan over the table:

for i in range(table_size):
    if entry[i].key is real:
        yield entry[i].key

Important Consequences

  - Iteration order is arbitrary and can change after any insert that triggers a resize
  - Mutating a set while iterating over it raises RuntimeError
  - Iteration cost is proportional to table size, not just to len(s)

4. Dict & DefaultDict

Overview

Python's dict (dictionary) is one of the most fundamental and heavily optimized data structures in the language. It implements an associative array that maps hashable keys to arbitrary values.

Dict vs Set

A dict is essentially a set with values attached:

  - A set entry stores (key, hash)
  - A dict entry stores (key, hash, value)

Both use the same underlying hash table implementation with perturbation probing.

Internal Structure

Modern Python dicts (3.6+) use a compact dict design:

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;          // number of items
    PyDictKeysObject *ma_keys;   // keys table
    PyObject **ma_values;        // values array
} PyDictObject;

The Compact Design

Before Python 3.6, dicts stored full entries directly in one sparse array. The compact design splits this into two arrays:

  - A sparse index array (dk_indices) of small integers, indexed by hash & mask
  - A dense entries array of (hash, key, value) triples, stored in insertion order

Benefits:

  - Less memory: the sparse part holds tiny indices (1-8 bytes each) instead of 24-byte entries
  - Faster iteration: it walks the dense array and skips no empty slots
  - Insertion order is preserved as a side effect (an implementation detail in 3.6, guaranteed by the language since 3.7)

Operation Complexities

Operation Average Worst Notes
d[key] O(1) O(n) Hash lookup
d[key] = value O(1) O(n) May resize
del d[key] O(1) O(n) Marks entry as deleted
key in d O(1) O(n) Hash lookup
d.get(key, default) O(1) O(n) Safe lookup
Iteration O(n) O(n) Linear scan of entries

DefaultDict

collections.defaultdict is a subclass of dict that provides a default value for missing keys:

from collections import defaultdict

# Instead of KeyError, returns default value
d = defaultdict(int)
d['missing']  # Returns 0

# Common use case: grouping
groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)
# groups == {'a': ['apple', 'avocado'], 'b': ['banana']}

Implementation: DefaultDict simply overrides __missing__() to call the default_factory function when a key is not found.
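A minimal sketch of the same mechanism - DefaultInt is a hypothetical class standing in for defaultdict(int):

```python
class DefaultInt(dict):
    """Sketch of defaultdict(int) via the __missing__ hook (illustrative)."""
    def __missing__(self, key):
        value = int()      # what default_factory() would return
        self[key] = value  # insert it so the key exists afterwards
        return value

d = DefaultInt()
d['x'] += 1                # no KeyError: __missing__ supplies 0 first
print(d)                   # {'x': 1}
```

The hook is only called by d[key]; methods like d.get() bypass it, exactly as with defaultdict.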

Dict Views

Python dicts provide dynamic views:

d = {'a': 1, 'b': 2}
keys = d.keys()      # dict_keys view
values = d.values()  # dict_values view
items = d.items()    # dict_items view

These views are dynamic - they reflect changes to the dictionary in real-time and don't create a copy.
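A quick demonstration:

```python
d = {'a': 1, 'b': 2}
keys = d.keys()            # a live view, not a snapshot

d['c'] = 3                 # mutate the dict after creating the view
print('c' in keys)         # True - the view sees the new key
print(len(keys))           # 3
```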

5. List

Overview

Python's list is a dynamic array that can grow or shrink as needed. It's one of the most commonly used data structures in Python.

Internal Implementation

A list is a dynamic array with over-allocation for efficiency.

typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;    // pointer to array of object pointers
    Py_ssize_t allocated;  // how many slots allocated
} PyListObject;

Key points:

  - ob_item points to a contiguous C array of PyObject* - the list stores pointers, not the objects themselves
  - ob_size (inside PyObject_VAR_HEAD) is the current length; allocated is the capacity
  - allocated >= ob_size, so most appends just write into an already-free slot

Growth Strategy

When a list needs to grow, it doesn't just add one slot. Instead, it over-allocates:

Growth pattern:

new_allocated = (size >> 3) + (size < 9 ? 3 : 6) + size

This means:

  - Roughly 12.5% extra capacity, plus a small constant
  - The capacity sequence goes 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
  - The growth factor is ~1.125, much lower than the 2x some languages use

Why over-allocate? Makes append operations amortized O(1) instead of O(n).

Operation Complexities

Operation Complexity Notes
list[i] O(1) Direct array access
list.append(x) O(1) amortized May trigger reallocation
list.insert(i, x) O(n) Shifts all elements after i
list.pop() O(1) Remove from end
list.pop(i) O(n) Shifts all elements after i
list.remove(x) O(n) Search + shift
x in list O(n) Linear search
list.sort() O(n log n) Timsort algorithm
list.reverse() O(n) In-place reversal
Slicing list[a:b] O(b-a) Creates new list

Memory Considerations

Lists store pointers, not values directly:

# Each list element is a pointer (8 bytes on 64-bit)
lst = [1, 2, 3]  # Stores 3 pointers to int objects

# The actual integers are separate Python objects
# List overhead: ~56 bytes + 8 bytes per element

Why Append is Fast

Example of list growth:

import sys

lst = []
for i in range(10):
    lst.append(i)
    print(f"len={len(lst)}, bytes={sys.getsizeof(lst)}")

Note that sys.getsizeof reports bytes, not slots, and it jumps only when the capacity grows. The underlying capacity follows the sequence 0, 4, 8, 16, 25, 35, 46, 58, 72, 88...

This means appending 10 items triggers only 3 reallocations (at capacities 4, 8, and 16) instead of 10.

6. Deque

Overview

collections.deque (double-ended queue) is optimized for fast appends and pops from both ends. Unlike lists, deques use a different internal structure.

Internal Implementation

A deque is implemented as a doubly-linked list of blocks (arrays).

Structure:

typedef struct {
    PyObject_VAR_HEAD
    block *leftblock;      // leftmost block
    block *rightblock;     // rightmost block
    Py_ssize_t leftindex;  // index in leftblock
    Py_ssize_t rightindex; // index in rightblock
    size_t state;          // modification counter
    Py_ssize_t maxlen;     // max size (or -1)
} dequeobject;

Each block:

#define BLOCKLEN 64  // elements per block

typedef struct block {
    struct block *leftlink;
    struct block *rightlink;
    PyObject *data[BLOCKLEN];
} block;

Why This Design?

Hybrid approach combining benefits of:

  - Linked lists: O(1) append/pop at both ends, no large reallocations
  - Arrays: contiguous 64-element blocks give cache locality and low per-element overhead

Instead of one node per element, deque uses blocks of 64 elements, reducing memory overhead and improving cache performance.

Operation Complexities

Operation Complexity Notes
deque.append(x) O(1) Add to right end
deque.appendleft(x) O(1) Add to left end
deque.pop() O(1) Remove from right end
deque.popleft() O(1) Remove from left end
deque[i] O(n) Must traverse blocks
deque.rotate(k) O(k) Rotate elements
x in deque O(n) Linear search
len(deque) O(1) Stored count

Deque vs List

Operation List Deque
Append right O(1) amortized O(1)
Append left O(n) O(1)
Pop right O(1) O(1)
Pop left O(n) O(1)
Index access O(1) O(n)
Memory Contiguous Blocks

Use deque when:

  - You need fast appends/pops at both ends (queues, sliding windows)
  - You rarely or never need random access by index

Use list when:

  - You need O(1) indexing and slicing
  - You mostly append/pop at the right end

Bounded Deque

Deques can have a maximum length:

from collections import deque

# Bounded deque - automatically drops from opposite end
dq = deque(maxlen=3)
dq.extend([1, 2, 3])  # deque([1, 2, 3])
dq.append(4)          # deque([2, 3, 4]) - 1 was dropped

This is perfect for implementing:

  - Sliding windows over a stream
  - "Last N items" history buffers
  - Moving averages and simple rate limiters
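For example, a moving average falls out naturally. moving_average is an illustrative helper, not a library function:

```python
from collections import deque

def moving_average(stream, window):
    """Sliding-window mean using a bounded deque (illustrative)."""
    buf = deque(maxlen=window)
    for x in stream:
        buf.append(x)                  # old values fall off the left automatically
        yield sum(buf) / len(buf)

print(list(moving_average([1, 2, 3, 4], 2)))  # [1.0, 1.5, 2.5, 3.5]
```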

7. Hash Tables - Deep Dive

What a Hash Table Fundamentally Is

A hash table implements a mapping: key → value

With the goal of:

  - O(1) average-case insert, lookup, and delete
  - Memory proportional to the number of stored items

It achieves this by turning a key into an array index using a hash function.

Hash Functions

A hash function maps a key to an integer: h = hash(key)

Good hash functions aim for:

  - Determinism: equal keys always produce the same hash
  - Uniformity: outputs spread evenly, so collisions are rare
  - Speed: hashing must cost less than the lookups it saves

From Hash to Table Index

Hash tables store data in an array of size N. To map a hash to an index:

index = hash % N

In real implementations (including Python):

index = hash & (N - 1)

This only works when N is a power of two (8, 16, 32, 64, 128...). Bitwise AND is much faster than modulo.
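The equivalence is easy to check with Python integers (Python's % returns a non-negative result for a positive divisor, so it holds for negative hashes too):

```python
N = 64            # table size: always a power of two
mask = N - 1      # 0b111111

for h in (123456789, -97, 2**61 - 1):
    # for power-of-two N, masking and modulo give the same slot
    print(h & mask, h % N, (h & mask) == (h % N))
```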

Collision Resolution Strategies

A) Separate Chaining

Each bucket holds a linked list: index 3 → (k1,v1) → (k2,v2) → (k3,v3)

Simple, but pointer-heavy and cache-unfriendly. Used in many textbooks, not used in Python.

B) Open Addressing (Python uses this)

All entries live inside the array itself. If a slot is taken, you probe for another slot.

Perturbation Probing (Python's choice):

i = hash & mask
perturb = hash

while slot not usable:
    i = (i * 5 + 1 + perturb) & mask
    perturb >>= 5

Why this is brilliant:

  - Once perturb reaches 0, the recurrence i = i*5 + 1 (mod 2^k) visits every slot, so termination is guaranteed
  - While perturb is nonzero, the high bits of the hash steer the probe, scattering keys that collide in their low bits
  - It needs only a multiply, adds, and shifts - no second hash function

Deletion - The Tricky Part

You cannot just clear a slot. Why?

A collides → placed later
B collides → placed even later

If A is deleted and slot cleared,
lookup(B) will stop early → BUG

Solution: DUMMY markers

Instead of clearing: slot = DUMMY

This preserves probe chains. Too many DUMMY slots triggers a resize.
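A toy pure-Python version of the whole scheme - ToySet is a teaching sketch, not CPython's implementation; it omits resizing and simplifies slot reuse:

```python
# Teaching sketch of open addressing with tombstones.
EMPTY = object()
DUMMY = object()

class ToySet:
    def __init__(self, size=8):          # size must be a power of two
        self.table = [EMPTY] * size
        self.mask = size - 1

    def _probe(self, key):
        h = hash(key) & 0x7FFFFFFF       # force non-negative so perturb terminates
        perturb = h
        i = h & self.mask
        while True:
            yield i
            perturb >>= 5
            i = (i * 5 + 1 + perturb) & self.mask

    def add(self, key):
        for i in self._probe(key):
            slot = self.table[i]
            if slot is EMPTY or slot is DUMMY:   # simplified: take first free slot
                self.table[i] = key
                return
            if slot == key:
                return

    def __contains__(self, key):
        for i in self._probe(key):
            slot = self.table[i]
            if slot is EMPTY:            # a DUMMY does NOT stop the scan
                return False
            if slot is not DUMMY and slot == key:
                return True

    def discard(self, key):
        for i in self._probe(key):
            slot = self.table[i]
            if slot is EMPTY:
                return
            if slot is not DUMMY and slot == key:
                self.table[i] = DUMMY    # tombstone keeps probe chains intact
                return

s = ToySet()
s.add('a'); s.add('b'); s.discard('a')
print('a' in s, 'b' in s)  # False True
```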

Load Factor & Resizing

Load factor = used_slots / table_size

Python keeps it below ~66%

When exceeded:

  - A larger table is allocated
  - All live entries are rehashed into it
  - DUMMY slots are discarded in the process

Cost: O(n), but amortized O(1) per operation.

Why Hash Tables Beat Trees in Python

Structure Access Cache Allocations
Hash table O(1) Excellent Minimal
Tree O(log n) Poor Many
Linked list O(n) Poor Many

Hash tables align perfectly with modern CPUs.

Hash Randomization (Security)

Python randomizes hash seeds at startup. This prevents attackers from crafting keys that all collide, which stopped real-world DoS attacks.

8. Self-Balancing Binary Search Trees

What are Self-Balancing Binary Search Trees?

In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.

The Height Problem

Most operations on a binary search tree (BST) take time directly proportional to the height of the tree, so it is desirable to keep the height small.

A binary tree with height h can contain at most:

2^0 + 2^1 + ... + 2^h = 2^(h+1) - 1 nodes

This implies that for any tree with n nodes and height h:

n ≤ 2^(h+1) - 1

Which means:
h ≥ ⌈log₂(n+1)⌉ - 1 ≥ ⌊log₂(n)⌋

In other words, the minimum height of a binary tree with n nodes is log₂(n), rounded down: ⌊log₂(n)⌋

The Degenerate Case

However, the simplest algorithms for BST item insertion may yield a tree with height n in rather common situations. For example, when items are inserted in sorted key order, the tree degenerates into a linked list with n nodes.

The difference in performance may be enormous:

For n = 1,000,000:
Minimum height: ⌊log₂(1,000,000)⌋ = 19
Worst-case height: 1,000,000

That's a 50,000x difference!
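Checking the arithmetic:

```python
import math

n = 1_000_000
min_height = math.floor(math.log2(n))  # 2**19 = 524288 <= 1e6 < 2**20
print(min_height)                      # 19
print(n // min_height)                 # 52631 - the "~50,000x" gap
```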

The Solution

Self-balancing binary trees solve this problem by performing transformations (such as tree rotations) at key insertion times, to keep the height proportional to log₂(n). Although a certain overhead is involved, it is not bigger than the always necessary lookup cost.

Language Implementations

Language Data Structure Implementation
C++ STL set, map Red-Black Tree
Java TreeSet, TreeMap Red-Black Tree
Python No native support Use bisect module or PyPI: rbtree, pyavl

Use Cases for Self-Balancing BSTs

Self-balancing BSTs can be used in a natural way to construct and maintain:

  - Ordered lists and sorted sequences with fast insertion and deletion
  - Priority queues
  - Associative arrays whose keys must stay in sorted order

Advantages vs Hash Tables

  - Keys stay in sorted order, so in-order traversal, range queries, and floor/ceiling lookups are cheap
  - Guaranteed O(log n) worst case - no pathological collision behavior
  - Requires only a comparison function, not a hash function

Disadvantages vs Hash Tables

  - O(log n) average lookup vs O(1) for hash tables
  - Pointer-heavy nodes: more allocations and worse cache behavior
  - Considerably harder to implement correctly (rotations, rebalancing)

AVL Trees

Definition

An AVL tree is a self-balancing Binary Search Tree (BST) where the difference between heights of left and right subtrees for any node cannot be more than one.

Balance Factor: For each node:

balance_factor = height(left_subtree) - height(right_subtree)

For an AVL tree: -1 ≤ balance_factor ≤ +1

The value in parentheses next to each node is its balance factor - the difference between its left and right subtree heights:

Example AVL Tree:

        12 (+1)
       /      \
    8 (+1)    18 (+1)
   /   \      /
5 (+1) 11(0) 17(0)
/
4(0)

Each node shows its balance factor in parentheses.

Insertion in AVL Tree

Insertion follows the same basic rules as in a Binary Search Tree (BST):

  1. A new key is placed in its correct position based on BST rules (left < node < right)
  2. After insertion, the balance factor of each node is checked during the path back up to the root
  3. If any node becomes unbalanced (balance factor < -1 or > +1), a rotation is required

Rotation Operations

Two basic operations can rebalance a BST without violating the BST property:

Right Rotation:
     y                x
    / \              / \
   x   T3    →     T1  y
  / \                 / \
 T1  T2              T2  T3

Left Rotation:
   x                  y
  / \                / \
 T1  y       →      x   T3
    / \            / \
   T2  T3         T1  T2
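Both rotations translate directly into pointer updates; the Node class and rotation helpers below are an illustrative sketch, not a full AVL implementation:

```python
class Node:
    """Minimal BST node for demonstrating rotations (illustrative)."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def right_rotate(y):
    x = y.left
    y.left = x.right       # T2 crosses from x to y
    x.right = y
    return x               # x is the new subtree root

def left_rotate(x):
    y = x.right
    x.right = y.left       # T2 crosses from y to x
    y.left = x
    return y

# y(2) with left child x(1) is left-heavy: a right rotation fixes it
root = right_rotate(Node(2, left=Node(1)))
print(root.key, root.right.key)  # 1 2
```

Only two or three pointers change, which is why a rotation is O(1).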

Four Cases for Rebalancing

1. Left-Left Case: Node is inserted in left subtree of left child

Solution: Single right rotation at the unbalanced node

       z                     y
      / \                  /   \
     y   T4      →        x     z
    / \                  / \   / \
   x   T3               T1 T2 T3 T4
  / \
 T1  T2

2. Left-Right Case: Node is inserted in right subtree of left child

Solution: Left rotation on left child, then right rotation on root

     z                 z                   x
    / \               / \                /   \
   y   T4     →      x   T4     →       y     z
  / \               / \                / \   / \
 T1  x             y   T3             T1 T2 T3 T4
    / \           / \
   T2  T3        T1  T2

3. Right-Right Case: Node is inserted in right subtree of right child

Solution: Single left rotation at the unbalanced node

   z                      y
  / \                   /   \
 T1  y        →        z     x
    / \               / \   / \
   T2  x             T1 T2 T3 T4
      / \
     T3  T4

4. Right-Left Case: Node is inserted in left subtree of right child

Solution: Right rotation on right child, then left rotation on root

   z                 z                     x
  / \               / \                  /   \
 T1  y      →      T1  x          →     z     y
    / \               / \              / \   / \
   x   T4            T2  y            T1 T2 T3 T4
  / \                   / \
 T2  T3                T3  T4

Insertion Algorithm

Follow these steps:

  1. Perform the normal BST insertion
  2. Update the height of the current node (ancestor of newly inserted node)
  3. Get the balance factor: left_height - right_height
  4. If balance factor > 1 (left-heavy): if the new key went into the left child's left subtree it is a Left-Left case (single right rotation), otherwise Left-Right (left rotation on the left child, then right rotation)
  5. If balance factor < -1 (right-heavy): if the new key went into the right child's right subtree it is a Right-Right case (single left rotation), otherwise Right-Left (right rotation on the right child, then left rotation)

Time Complexity

Operation Complexity Explanation
Search O(log n) Tree height is O(log n)
Insert O(log n) Search + constant rotation operations
Delete O(log n) Search + constant rotation operations
Rotation O(1) Only pointer changes

Deletion in AVL Tree

To ensure the tree remains AVL after deletion:

  1. Perform standard BST delete
  2. Update heights of affected nodes
  3. Get balance factor of each ancestor
  4. If unbalanced, perform appropriate rotations (same cases as insertion)

Important differences from insertion:

  - Insertion needs at most one rebalancing step; after deletion, rotations may be needed at several ancestors (up to O(log n) of them)
  - Balance factors must therefore be rechecked along the entire path to the root

Auxiliary Space

O(log n) for recursion call stack (since we use recursive insertion/deletion).

Why Python Doesn't Use Self-Balancing BSTs

Python's standard library does not support self-balancing BSTs natively because:

  - dict and set already give O(1) average lookup for most workloads
  - Sorted-order needs are usually covered by sorted() plus the bisect module, or by heapq for priority queues
  - Maintaining a balanced-tree C implementation adds complexity for comparatively rare use cases

When to use PyPI BST implementations (rbtree, pyavl):

  - You need a collection that stays sorted under frequent inserts and deletes
  - You need range queries, floor/ceiling lookups, or ordered iteration with guaranteed O(log n) updates

Summary

Key Takeaways:

  - Almost everything is either a hash table (dict, set, OrderedDict) or a flat array (list, heapq); deque is a hybrid of linked 64-element blocks
  - The recurring hash-table tricks are open addressing with perturbation probing, power-of-two table sizes, dummy tombstones, and resizing near 2/3 load
  - Over-allocation (lists) and sift-to-leaf (heapq) are comparison- and write-minimizing micro-optimizations
  - Python omits balanced trees because hash tables plus bisect and heapq cover the same needs with better constants

Quick Reference Table

Data Structure Best For Worst Operation Memory
List Random access, iteration Insert/delete in middle: O(n) Contiguous, over-allocated
Deque Queue, stack operations Random access: O(n) Blocks of 64 elements
Dict Key-value mapping Worst-case lookup: O(n) Sparse indices + dense entries
Set Membership testing, uniqueness Worst-case lookup: O(n) Hash table, no values
OrderedDict Ordered key-value, LRU cache Extra memory for linked list Dict + doubly linked list
HeapQ Priority queue, top-k problems Not for random access Just a list

Understanding these internals helps you write more efficient Python code and make better decisions about which data structure to use for your specific use case.