Data Structures: Collections and Containers

In computer science, a collection (often interchangeably referred to as a container) is a data structure designed to hold multiple entities; these could be simple elements like numbers or text strings, or more complex objects like user-defined structures. Collections play a critical role in helping you store, organize, and manipulate different types of data in your programs.

There are two primary perspectives to view collections from:

  1. The Abstract Level: This refers to the conceptual view of collections. Here, we define what collections are, and establish their capabilities and characteristics. For instance, we might define collections based on how they store and retrieve items. This could be based on a specific order (like arrays or lists), or using unique identifiers or keys (like dictionaries or maps).

  2. The Machine Level: This pertains to the actual implementation of collections. In this perspective, we concern ourselves with the most efficient ways to actualize our abstract collection models. To decide the best data structures and algorithms for a specific task, we need to consider various factors, such as memory usage, speed of operations (like insertion, deletion, search), and the flexibility of the structure.

Understanding and studying collections in-depth is crucial for every programmer. This knowledge empowers you to choose the most appropriate collection type for a given task, leading to programs that are not only more efficient in terms of time and space complexities, but also much more maintainable and robust due to clear structure and organization.

Linked Lists

A Linked List is a fundamental data structure in computer science, which comprises a sequence of elements, known as nodes. Each node in this sequence contains two distinct parts:

  1. A value (or data), which could range from simple data types to more complex structures.
  2. A reference (commonly referred to as a link or a pointer) to the subsequent node in the sequence.

Importantly, the last node in the linked list points to a null value, effectively signifying the end of the list.

Linked lists can be highly valuable when the data doesn't require contiguous storage in memory or when the exact size of the data is undetermined at the time of memory allocation. While they can serve as an alternative to arrays, it's essential to note that linked lists generally have slower access times and are more challenging to manipulate due to their non-contiguous storage and absence of direct access to individual elements.

[HEAD]-->[1| ]-->[2| ]-->[3| ]-->[4| ]-->[NULL]

Common Applications

Linked lists find utility in a wide variety of areas including, but not limited to:

  1. Implementing other collections such as stacks, queues, and adjacency lists for graphs.
  2. Sequences where elements are frequently inserted or removed in the middle.
  3. Free lists and similar bookkeeping structures in memory allocators.
  4. Navigation histories and undo chains, where each item links to its predecessor or successor.

Typical Interface

Here are some standard operations associated with linked lists:

  1. insert(value): add a new node, typically at the head or tail of the list.
  2. delete(value): remove the first node holding the given value.
  3. search(value): return the node holding the given value, if present.
  4. traverse(): visit every node in order, for example to print or aggregate values.
  5. is_empty() / size(): report whether the list is empty and how many nodes it holds.

Time Complexity

Understanding the time complexity of basic operations is key to efficient use of linked lists:

Operation | Average case | Worst case
----------|--------------|-----------
Access    | O(n)         | O(n)
Insert    | O(1)         | O(1)
Delete    | O(1)         | O(1)
Search    | O(n)         | O(n)

Note that insertion and deletion are constant-time operations only when you already hold a reference to the relevant node, for example when inserting at the head; reaching an arbitrary position first requires a traversal. Access and search always involve walking the list, resulting in linear time complexity.

Implementation Details

Linked lists can be implemented in various ways, largely dependent on the programming language used. A conventional implementation strategy involves defining a custom class or structure to represent a node. This class encapsulates a value and a reference to the next node. Additionally, methods for adding, removing, and accessing elements are provided to manipulate the list.

In some languages, linked lists are provided as part of the standard library, abstracting away the underlying implementation details and providing a rich set of functionality.
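
To make this concrete, below is a minimal sketch of a singly linked list in Python. The class and method names (Node, LinkedList, prepend, find) are illustrative choices rather than any standard API:

class Node:
    """A single element of the list: a value plus a link to the next node."""
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None  # an empty list: head points to nothing

    def prepend(self, value):
        """O(1): insert a new node at the front of the list."""
        node = Node(value)
        node.next = self.head
        self.head = node

    def find(self, value):
        """O(n): walk the chain until the value is found or the list ends."""
        current = self.head
        while current is not None:
            if current.value == value:
                return current
            current = current.next
        return None

Prepending is constant time because only the head pointer changes, while find must follow the links one node at a time.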

Vectors

In computer science, a vector can be thought of as a dynamic array with the ability to adjust its size as required. While similar to traditional arrays, vectors distinguish themselves with their superior flexibility and efficiency in numerous scenarios. They store elements in contiguous blocks of memory, ensuring swift access to elements through index-based referencing.

Vectors are often favored over low-level arrays due to their augmented feature set and integrated safety mechanisms. These characteristics contribute to their easier usability and overall reliability.

-------------------------
| 0 | 1 | 2 | 3 | 4 | 5 |
-------------------------

Practical Applications

Vectors can be leveraged for a wide array of tasks, such as:

  1. Storing collections whose size is not known in advance or changes at runtime.
  2. Serving as the underlying storage for other structures such as stacks, queues, and matrices.
  3. Buffering streamed or batched data before processing.
  4. Workloads dominated by index-based reads, where contiguous storage keeps access fast.

Typical Interface

Some of the commonly used operations for vectors are:

  1. push_back(value) / append(value): add an element at the end.
  2. pop_back(): remove the last element.
  3. at(index) / operator[]: read or overwrite the element at a given position.
  4. insert(index, value) and erase(index): add or remove an element at an arbitrary position.
  5. size() and capacity(): report the number of stored elements and the currently allocated space.

Time Complexity

A thorough understanding of the time complexity of basic operations is key to efficient use of vectors:

Operation | Average case | Worst case
----------|--------------|-----------
Access    | O(1)         | O(1)
Insert    | O(1)         | O(n)
Delete    | O(1)         | O(n)
Search    | O(n)         | O(n)

Access to any element by index is a constant-time operation. Insertions and deletions at the end run in amortized constant time, since only an occasional resize requires copying; anywhere else, the subsequent elements must be shifted, resulting in linear time complexity.

Implementation Details

Vectors are typically implemented using an array of elements accompanied by additional metadata such as the current size and total capacity. When a resize operation is necessary, a new, larger array is created, and the existing elements are copied over to the new array. As resizing can be time and resource-consuming, vectors often allocate extra memory ahead of time to minimize the frequency of resizing.

In some programming languages, vectors might internally use other data structures like linked lists. However, regardless of the underlying implementation, the interface presented to the programmer and the time complexity of the operations remain similar.
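
As a rough sketch of this resizing scheme, the toy Python class below stores elements in a fixed-size backing list and doubles its capacity when it runs out of room; the names (DynamicArray, append, _grow) are made up for illustration:

class DynamicArray:
    """Toy growable array: fixed-capacity storage plus size/capacity bookkeeping."""
    def __init__(self):
        self._capacity = 4
        self._size = 0
        self._data = [None] * self._capacity

    def append(self, value):
        if self._size == self._capacity:
            self._grow()                     # resize rarely, so appends stay O(1) amortized
        self._data[self._size] = value
        self._size += 1

    def _grow(self):
        self._capacity *= 2                  # doubling keeps the total copy cost linear
        new_data = [None] * self._capacity
        for i in range(self._size):          # copy existing elements into the new storage
            new_data[i] = self._data[i]
        self._data = new_data

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        return self._data[index]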

Stacks

A stack is a fundamental data structure that follows a Last-In-First-Out (LIFO) strategy, sometimes equivalently described as First-In-Last-Out (FILO): the most recently added element (the top of the stack) is the first one to be removed. Stacks are used across virtually all programming languages for tasks such as evaluating arithmetic expressions, reversing character sequences, and managing function calls.

[5]  <- top
[4]
[3]
[2]
[1]  <- bottom

Key Interface Methods

Typical operations associated with a stack include:

  1. push(item): place an item on top of the stack.
  2. pop(): remove and return the item on top.
  3. peek() / top(): return the top item without removing it.
  4. is_empty(): report whether the stack holds any items.

Common Applications

Stacks are incredibly versatile and are used in a multitude of applications, including:

  1. Managing function calls and recursion via the call stack.
  2. Undo/redo functionality in editors and other interactive tools.
  3. Parsing and evaluating expressions, including matching parentheses and converting infix to postfix notation.
  4. Backtracking algorithms such as depth-first search and maze solving.

Time Complexity

Knowing the time complexity of stack operations is crucial for writing efficient code:

Operation     | Average case | Worst case
--------------|--------------|-----------
Access        | O(n)         | O(n)
Insert (Push) | O(1)         | O(1)
Delete (Pop)  | O(1)         | O(1)
Search        | O(n)         | O(n)

As can be seen, while push and pop operations are constant time, access (retrieving an element without popping) and search operations require linear time since, in the worst case, all elements must be inspected.

Implementation Details

Stacks can be implemented using various underlying data structures, but arrays and linked lists are among the most common due to their inherent characteristics which nicely complement the operations and needs of a stack.

The choice between arrays and linked lists often depends on the specific requirements of the application, such as memory usage and the cost of operations. With an array-backed stack, an occasional resize is expensive, but elements are stored compactly and accessed quickly; with a linked-list-backed stack, pushes and pops never trigger a resize, but each element carries the extra memory overhead of a pointer.
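
A minimal array-backed stack can be sketched in Python as follows; the Stack class and its method names are illustrative, and the built-in list supplies the underlying dynamic array:

class Stack:
    """Array-backed stack; Python's list gives O(1) amortized append/pop at the end."""
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)   # the new item becomes the top

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()   # remove and return the top

    def peek(self):
        if not self._items:
            raise IndexError("peek at empty stack")
        return self._items[-1]

    def is_empty(self):
        return not self._items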

Queues

A queue is a fundamental data structure that stores elements in a sequence, with modifications made by adding elements to the end (enqueuing) and removing them from the front (dequeuing). The queue follows a First In First Out (FIFO) strategy, implying that the oldest elements are processed first.

[1]  <- front
[2]
[3]
[4]
[5]  <- rear

Core Interface Methods

The standard operations associated with a queue are:

  1. enqueue(item): add an item at the rear of the queue.
  2. dequeue(): remove and return the item at the front.
  3. front() / peek(): return the front item without removing it.
  4. is_empty(): report whether the queue holds any items.

Practical Applications

Queues are incredibly versatile and find use in a variety of real-world scenarios, including:

  1. Task scheduling and job processing, where work is handled in arrival order.
  2. Breadth-first search and other level-by-level graph or tree traversals.
  3. Buffering between producers and consumers, such as print queues, message queues, and I/O buffers.

Comparing Stacks and Queues

While stacks and queues are similar sequential data structures, they differ in several key aspects:

  1. Ordering: a stack is Last-In-First-Out, while a queue is First-In-First-Out.
  2. Access points: a stack inserts and removes at the same end (the top), whereas a queue inserts at the rear and removes at the front.
  3. Typical uses: stacks suit nested or most-recent-first processing (calls, undo, backtracking), while queues suit fair, arrival-order processing (scheduling, buffering, breadth-first search).

Time Complexity

Understanding the time complexity of queue operations is critical for writing efficient code:

Operation        | Average case | Worst case
-----------------|--------------|-----------
Access           | O(n)         | O(n)
Insert (Enqueue) | O(1)         | O(1)
Delete (Dequeue) | O(1)         | O(1)
Search           | O(n)         | O(n)

Enqueue and dequeue operations run in constant time, while accessing or searching for a specific element requires linear time, since all elements may need to be inspected.

Implementation Details

Queues can be implemented using various underlying data structures, including arrays, linked lists, circular buffers, and dynamic arrays. The choice of implementation depends on the specific requirements of the application, such as memory usage, the cost of operations, and whether the size of the queue changes frequently.
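
As an illustration, the sketch below builds a queue on top of Python's collections.deque, which supports constant-time appends and pops at both ends; the Queue class and its method names are illustrative:

from collections import deque

class Queue:
    """Queue backed by collections.deque for O(1) operations at both ends."""
    def __init__(self):
        self._items = deque()

    def enqueue(self, item):
        self._items.append(item)      # add to the rear

    def dequeue(self):
        if not self._items:
            raise IndexError("dequeue from empty queue")
        return self._items.popleft()  # remove from the front

    def front(self):
        if not self._items:
            raise IndexError("front of empty queue")
        return self._items[0]

A plain Python list would also work, but removing from its front costs O(n) because the remaining elements shift left, which is exactly why a deque or circular buffer is the usual choice.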

Heaps

A heap is a specialized tree-based data structure that satisfies the heap property. There are two main types: in a max-heap, every parent node is greater than or equal to its children, so the largest value sits at the root; in a min-heap, every parent node is less than or equal to its children, so the smallest value sits at the root.

Heaps are always complete binary trees, which means they are filled at all levels, except the last, which is filled from left to right. This structure ensures that the heap remains balanced.

        [100]
       /     \
    [19]     [36]
   /   \     /   \
 [17] [3]  [25]  [1]

Key Methods of the Heap Interface

The basic operations of a heap are as follows:

  1. insert(value): add a new element and restore the heap property by sifting it up.
  2. find min/max (peek): return the root element without removing it.
  3. delete min/max (extract): remove the root, move the last element into its place, and sift it down.
  4. merge: combine two heaps into one while preserving the heap property.

Practical Applications of Heaps

Heaps are versatile and find widespread use across various computational problems:

  1. Priority queues, where the highest- (or lowest-) priority item must be served first.
  2. Heapsort, an in-place O(n log n) sorting algorithm.
  3. Graph algorithms such as Dijkstra's shortest paths and Prim's minimum spanning tree.
  4. Selecting the k largest or smallest elements of a stream or collection.

Time Complexity of Heap Operations

The time complexity of standard heap operations is as follows:

Operation      | Average case
---------------|--------------
Find min/max   | O(1)
Delete min/max | O(log n)
Insert         | O(log n)
Merge          | O(m log(m+n))

These time complexities make heaps especially useful in situations where we need to repeatedly remove the minimum (or maximum) element.

Implementation Details

Heaps can be represented efficiently using a one-dimensional array, which allows for easy calculation of the parent, left, and right child positions. The standard mapping from the heap to this array representation is as follows:

  1. The root of the tree is placed at index 1.
  2. For an element at index i in the array:
    1. The left child is placed at index 2*i.
    2. The right child is placed at index 2*i + 1.
    3. The parent is placed at index floor(i/2).

This layout keeps the parent-child relationships consistent and makes navigation within the heap straightforward and efficient.
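
The sketch below shows an insert-only min-heap in Python using the 1-based array layout just described; extraction (sifting down) is omitted to keep the example short, and the class and method names are illustrative:

class MinHeap:
    """Array-backed min-heap using the 1-based index scheme described above."""
    def __init__(self):
        self._data = [None]  # index 0 is unused so that the root sits at index 1

    def insert(self, value):
        self._data.append(value)            # place the new element in the last slot
        self._sift_up(len(self._data) - 1)  # then bubble it up to its proper place

    def find_min(self):
        if len(self._data) == 1:
            raise IndexError("heap is empty")
        return self._data[1]                # the minimum is always at the root

    def _sift_up(self, i):
        # Swap the element with its parent (index i // 2) until the heap property holds.
        while i > 1 and self._data[i] < self._data[i // 2]:
            self._data[i], self._data[i // 2] = self._data[i // 2], self._data[i]
            i //= 2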

Binary Search Trees (BST)

A Binary Search Tree (BST) is a type of binary tree where each node has up to two children and maintains a specific ordering property among its values. For every node, the values of all nodes in its left subtree are less than its own value, and the values of all nodes in its right subtree are greater than its own value. In a well-balanced BST, this property enables efficient lookup, addition, and deletion operations, ideally distributing the nodes evenly across the left and right subtrees. Note that BSTs typically disallow duplicate values.

        [8]
       /   \
     [3]   [10]
    /   \     \
  [1]   [6]   [14]

Key Operations of the BST Interface

The fundamental operations provided by a BST are:

  1. insert(value): descend from the root, comparing values, until an empty spot that preserves the ordering property is found.
  2. search(value): follow the ordering property left or right until the value is found or a null link is reached.
  3. delete(value): remove a node, replacing it with its in-order successor (or predecessor) when it has two children.
  4. traversal: an in-order walk visits the values in sorted order; the minimum and maximum are found by walking fully left or right.

Time Complexity of BST Operations

The efficiency of BST operations depends on the height of the tree. In a balanced tree, this results in an average-case time complexity of O(log n). However, in the worst-case scenario (a degenerate or unbalanced tree), the time complexity degrades to O(n).

Operation | Average case | Worst case
----------|--------------|-----------
Access    | O(log n)     | O(n)
Insert    | O(log n)     | O(n)
Delete    | O(log n)     | O(n)
Search    | O(log n)     | O(n)

BST Implementation Details

A BST is typically implemented using a node-based model where each node encapsulates a value and pointers to the left and right child nodes. A special root pointer points to the first node, if any.

Insertion, deletion, and search operations are often implemented recursively to traverse the tree starting from the root, making comparisons at each node to guide the direction of traversal (left or right). This design pattern exploits the recursive nature of the tree structure and results in clean, comprehensible code.
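
A typical node-based, recursive implementation might look like the following Python sketch; the names (TreeNode, insert, search) are illustrative, and duplicates are simply ignored, matching the convention mentioned above:

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Recursively descend until a free spot preserving the ordering property is found."""
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    # equal values are ignored, so the tree holds no duplicates
    return root

def search(root, value):
    """Return the node holding value, or None if it is absent."""
    if root is None or root.value == value:
        return root
    if value < root.value:
        return search(root.left, value)
    return search(root.right, value)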

AVL Trees

Named after its inventors, G.M. Adelson-Velsky and E.M. Landis, AVL trees are a subtype of binary search trees (BSTs) that self-balance. Unlike regular BSTs, AVL trees maintain a stringent balance by ensuring the height difference between the left and right subtrees of any node is at most one. This tight control on balance guarantees optimal performance for lookups, insertions, and deletions, making AVL trees highly efficient.

      [20]
      /    \
   [10]    [30]
  /   \     /   \
[5]  [15] [25]  [35]

Key Characteristics of AVL Trees

The defining characteristics of AVL trees include:

  1. They satisfy the standard BST ordering property.
  2. Every node has a balance factor, the height of its left subtree minus the height of its right subtree, and that factor is always -1, 0, or +1.
  3. Insertions and deletions that break the balance are repaired immediately with rotations.
  4. As a consequence, the height of the tree stays in O(log n).

Applications of AVL Trees

The AVL tree's self-balancing property lends itself to various applications, such as:

  1. In-memory indexes and lookup tables where search performance must stay predictable.
  2. Ordered sets and maps that require fast range queries and sorted traversal.
  3. Read-heavy workloads, where the stricter balance (compared with other self-balancing trees) pays off in faster lookups.

AVL Tree Interface and Methods

AVL trees support standard BST operations with some modifications to maintain balance:

  1. insert(value): a normal BST insertion followed by rebalancing rotations along the path back to the root.
  2. delete(value): a normal BST deletion followed by the same rebalancing pass.
  3. search(value): identical to a BST search, since lookups never change the structure.
  4. rotations: left, right, left-right, and right-left rotations, applied automatically as described below.

Time Complexity of AVL Tree Operations

Thanks to their self-balancing property, AVL trees ensure logarithmic time complexity for essential operations, even in the worst-case scenario.

Operation | Average case | Worst case
----------|--------------|-----------
Access    | O(log n)     | O(log n)
Insert    | O(log n)     | O(log n)
Delete    | O(log n)     | O(log n)
Search    | O(log n)     | O(log n)

Implementation of AVL Trees

The key to maintaining AVL trees' balance is the application of rotation operations. Four types of rotations can occur based on the tree's balance state:

  1. Right rotation: Applies when the left subtree of a node is taller than its right subtree. This rotation sets the node's left child as the new root, making the old root the right child of the new root. It concludes by updating the nodes' heights.

  2. Left rotation: Applies when the right subtree of a node is taller than its left subtree. This rotation sets the node's right child as the new root and the old root as the left child of the new root. It concludes by updating the nodes' heights.

  3. Right-left rotation: Applies when the right subtree of a node is taller, and the right child's left subtree is taller than its right. This rotation first applies a right rotation on the right child, followed by a left rotation on the initial node, updating node heights as needed.

  4. Left-right rotation: Applies when the left subtree of a node is taller, and the left child's right subtree is taller than its left. This rotation first applies a left rotation on the left child, followed by a right rotation on the initial node, updating node heights as needed.

These rotations ensure that AVL trees maintain their crucial property of balance, offering consistently high performance.
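
As a sketch of the two single rotations (the double rotations are simply compositions of them), the Python fragment below rotates a subtree and updates the stored heights; the node layout and function names are illustrative:

class AVLNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.height = 1  # height of the subtree rooted at this node

def height(node):
    return node.height if node else 0

def update_height(node):
    node.height = 1 + max(height(node.left), height(node.right))

def rotate_right(y):
    """Right rotation: the left child x becomes the new root of this subtree."""
    x = y.left
    y.left = x.right        # x's right subtree moves under y
    x.right = y             # y becomes x's right child
    update_height(y)        # y is now lower in the tree, so update it first
    update_height(x)
    return x                # caller re-links x as the subtree root

def rotate_left(x):
    """Left rotation: the mirror image of rotate_right."""
    y = x.right
    x.right = y.left
    y.left = x
    update_height(x)
    update_height(y)
    return y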

Red-Black Trees

Red-Black trees are a type of self-balancing binary search trees, with each node bearing an additional attribute: color. Each node in a Red-Black tree is colored either red or black, and these color attributes follow specific rules to maintain the tree's balance. This balance ensures that fundamental tree operations such as insertions, deletions, and searches remain efficient.

        (30B)
         /     \
     (20R)    (40B)
    /   \      /   \
(10B) (25B) (35R) (50R)

Key Properties of Red-Black Trees

Red-Black trees abide by the following essential properties:

  1. Every node is either red or black.
  2. The root is always black.
  3. Every leaf (NIL node) is considered black.
  4. A red node never has a red child, so red nodes cannot appear consecutively on any path.
  5. Every path from a given node down to its descendant NIL leaves contains the same number of black nodes.

Applications of Red-Black Trees

Red-Black trees find application in several areas, including:

  1. Ordered map and set containers in standard libraries (for example, C++'s std::map and Java's TreeMap are commonly implemented as Red-Black trees).
  2. Kernel and systems code that needs balanced lookup structures, such as the Linux kernel's rbtree utility.
  3. Schedulers, interval bookkeeping, and other settings where ordered data is updated frequently and must stay balanced.

Red-Black Tree Interface and Methods

Red-Black trees support typical BST operations, albeit with additional steps to manage node colors and balance:

  1. insert(value): a standard BST insertion of a red node, followed by a fix-up pass of recoloring and rotations.
  2. delete(value): a standard BST deletion followed by its own fix-up pass to restore the color properties.
  3. search(value): identical to a BST search, since colors do not affect lookups.

Time Complexity of Red-Black Tree Operations

Due to their self-balancing nature, Red-Black trees ensure efficient performance even in the worst-case scenarios, with a logarithmic time complexity for key operations:

Operation | Average case | Worst case
----------|--------------|-----------
Access    | O(log n)     | O(log n)
Insert    | O(log n)     | O(log n)
Delete    | O(log n)     | O(log n)
Search    | O(log n)     | O(log n)

Implementation of Red-Black Trees

Red-Black trees are a sophisticated binary search tree variant. They require careful management of node colors during insertions and deletions, often involving multiple rebalancing steps:

  1. Standard BST insertion is performed initially.
  2. If the new node is the tree root, it's colored black to satisfy the Red-Black tree properties.
  3. If the new node's parent is black, no further action is needed.
  4. If the new node's parent is red, and its uncle (the sibling of its parent) is also red, both the parent and uncle are recolored black, and the grandparent is recolored red (if it is not the root). This step is known as "recoloring," and the fix-up then continues from the grandparent, since the recoloring may introduce a new violation higher up the tree.
  5. If the new node's parent is red, but its uncle is black or NIL, the tree might need a rotation or a series of rotations. These rotations can be left, right, or a combination of both based on the relative placements of the new node, its parent, and its grandparent. These rotations aim to restructure the tree while preserving the BST property and restoring the Red-Black tree properties.

These careful rotations and recoloring steps enable Red-Black trees to maintain balance and ensure efficient performance across various operations.
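
To illustrate just the recoloring case (step 4 above), here is a partial Python sketch; the node layout and function names are hypothetical, and the rotation cases as well as the full insertion routine are deliberately omitted:

RED, BLACK = "red", "black"

class RBNode:
    def __init__(self, value, color=RED, parent=None):
        self.value = value
        self.color = color      # newly inserted nodes start out red
        self.parent = parent
        self.left = None
        self.right = None

def fixup_recoloring_case(node):
    """Handle only the 'red parent, red uncle' case; rotations are not shown."""
    parent = node.parent
    grandparent = parent.parent
    uncle = grandparent.left if parent is grandparent.right else grandparent.right
    if parent.color == RED and uncle is not None and uncle.color == RED:
        parent.color = BLACK
        uncle.color = BLACK
        grandparent.color = RED
        return grandparent      # continue the fix-up from the grandparent
    return node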

Hash Tables

Hash tables, often referred to as hash maps or dictionaries, are a type of data structure that employs a hash function to pair keys with their corresponding values. This mechanism enables efficient insertion, deletion, and lookup of items within a collection, making hash tables a fundamental tool in many computing scenarios. They are typically used in contexts where swift lookup and insertion operations are critical, such as in database management and networking applications.

+-----------------+
| Key | Value     |
|-----|-----------|
| K1  | V1        |
| K2  | V2        |
| K3  | V3        |
| K4  | V4        |
| ... | ...       |
+-----------------+

Key Applications of Hash Tables

Hash tables are ubiquitous across various domains, including:

  1. Database indexing and caching layers that need constant-time lookups on average.
  2. Symbol tables in compilers and interpreters, mapping identifiers to their metadata.
  3. Counting frequencies, de-duplicating items, and implementing sets.
  4. Associative containers built into programming languages, such as dictionaries and maps.

Basic Operations of a Hash Table

Hash tables support a range of operations, which include:

  1. put(key, value) / insert: add a key-value pair, overwriting any existing value for that key.
  2. get(key) / lookup: return the value associated with a key, if present.
  3. remove(key) / delete: discard a key and its value.
  4. contains(key): report whether a key is present.

Time Complexity of Hash Table Operations

The performance of hash table operations depends on the hash function's quality, the strategy for resolving collisions, and the load factor (the ratio of the number of key-value pairs to the number of slots or buckets in the table).

Operation | Average case | Worst case
----------|--------------|-----------
Access    | -            | -
Insert    | O(1)         | O(n)
Delete    | O(1)         | O(n)
Search    | O(1)         | O(n)

Positional access is not meaningful for a hash table (elements are reached by key, hence the dashes), and the O(n) worst cases arise when many keys collide into the same bucket.

Implementing Hash Tables

The creation of a hash table involves setting up a data structure to store key-value pairs, designing a hash function to map keys to slots in the data structure, and devising a collision resolution strategy to handle instances where multiple keys hash to the same slot.

Common data structures used include arrays (for open addressing) and linked lists or balanced trees (for separate chaining).

Hash Function

The hash function plays a pivotal role in a hash table's performance as it determines where a key-value pair will be stored within the table. An ideal hash function distributes keys evenly across the table, balancing the number of keys in each bucket. This uniform distribution can be achieved by leveraging prime numbers both for the size of the hash table and within the hash function itself. When hashing strings, a common approach is to calculate the ASCII value for each character, add them, and then compute the modulo with respect to the size of the table.

# x: integer key (for strings, e.g., the sum of the character codes)
# m: number of buckets in the table
h(x) = x % m
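
For example, a toy string-hashing function along the lines described above could be written in Python as follows (the function name and parameters are illustrative):

def simple_string_hash(key: str, table_size: int) -> int:
    # Sum the character codes, then take the remainder so the
    # result is always a valid bucket index in [0, table_size).
    total = sum(ord(ch) for ch in key)
    return total % table_size

# With 7 buckets, "cat" hashes to (99 + 97 + 116) % 7 = 312 % 7 = 4.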

Collision Resolution

Collisions occur when two or more keys hash to the same slot. There are two common strategies to handle these collisions: chaining and open addressing.

Chaining

With chaining, each slot or bucket in the hash table acts as a linked list. Each new key-value pair is appended to its designated list. While this approach allows for efficient insertions and deletions, the worst-case time complexity for lookups escalates to O(n), where n is the number of keys stored.

The following is an example:

index  Bucket
---------------------
0      -> ["apple"]
1      -> ["banana", "mango"]
2      -> ["cherry"]
3      -> []
4      -> ["dates", "elderberry", "fig"]
...

In the above example, "banana" and "mango" hash to the same index (1), so they are stored in the same linked list.
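
A minimal separate-chaining table can be sketched in Python as below; ChainedHashTable and its method names are illustrative, and Python's built-in hash is used in place of a custom hash function:

class ChainedHashTable:
    """Separate chaining: each bucket is a list of (key, value) pairs."""
    def __init__(self, num_buckets=8):
        self._buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # otherwise append to the chain

    def get(self, key):
        bucket = self._buckets[self._index(key)]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        raise KeyError(key)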

Open Addressing

In open addressing, the hash table is an array, and each key-value pair is stored directly in an array slot. When a collision arises, the hash table searches for the next available slot according to a predefined probe sequence. Common probing techniques include linear probing, where the table checks each slot one by one, and quadratic probing, which checks slots based on a quadratic function of the distance from the original hash.

Here's an example using linear probing:

index  Bucket
---------------------
0      -> "apple"
1      -> "cherry"
2      -> "banana"  (originally hashed to 1, moved due to collision)
3      -> "mango"   (originally hashed to 1, moved due to collision)
4      -> "dates"
5      -> "elderberry"
6      -> "fig"
...

In this case, "banana" and "mango" were hashed to index 1, but since it was already occupied by "cherry", they were moved to the next available slots (2 and 3, respectively).
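
The following Python sketch illustrates linear probing for insertion and lookup; the class and method names are illustrative, and deletion is omitted because it requires tombstone markers that would roughly double the example:

class LinearProbingHashTable:
    """Open addressing with linear probing over a fixed-size slot array."""
    def __init__(self, capacity=8):
        self._capacity = capacity
        self._slots = [None] * capacity    # each occupied slot holds a (key, value) pair

    def _probe(self, key):
        # Yield slot indices starting at the home slot and wrapping around the table.
        start = hash(key) % self._capacity
        for offset in range(self._capacity):
            yield (start + offset) % self._capacity

    def put(self, key, value):
        for i in self._probe(key):
            if self._slots[i] is None or self._slots[i][0] == key:
                self._slots[i] = (key, value)
                return
        raise RuntimeError("hash table is full")

    def get(self, key):
        for i in self._probe(key):
            if self._slots[i] is None:
                break                      # an empty slot means the key is absent
            if self._slots[i][0] == key:
                return self._slots[i][1]
        raise KeyError(key)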

Table of Contents

  1. Data Structures: Collections and Containers
  2. Linked Lists
    1. Common Applications
    2. Typical Interface
    3. Time Complexity
    4. Implementation Details
  3. Vectors
    1. Practical Applications
    2. Typical Interface
    3. Time Complexity
    4. Implementation Details
  4. Stacks
    1. Key Interface Methods
    2. Common Applications
    3. Time Complexity
    4. Implementation Details
  5. Queues
    1. Core Interface Methods
    2. Practical Applications
    3. Comparing Stacks and Queues
    4. Time Complexity
    5. Implementation Details
  6. Heaps
    1. Key Methods of the Heap Interface
    2. Practical Applications of Heaps
    3. Time Complexity of Heap Operations
    4. Implementation Details
  7. Binary Search Trees (BST)
    1. Key Operations of the BST Interface
    2. Time Complexity of BST Operations
    3. BST Implementation Details
  8. AVL Trees
    1. Key Characteristics of AVL Trees
    2. Applications of AVL Trees
    3. AVL Tree Interface and Methods
    4. Time Complexity of AVL Tree Operations
    5. Implementation of AVL Trees
  9. Red-Black Trees
    1. Key Properties of Red-Black Trees
    2. Applications of Red-Black Trees
    3. Red-Black Tree Interface and Methods
    4. Time Complexity of Red-Black Tree Operations
    5. Implementation of Red-Black Trees
  10. Hash Tables
    1. Key Applications of Hash Tables
    2. Basic Operations of a Hash Table
    3. Time Complexity of Hash Table Operations
    4. Implementing Hash Tables
      1. Hash Function
      2. Collision Resolution