Last modified: June 06, 2026

This article is written in: ๐Ÿ‡บ๐Ÿ‡ธ

Choosing a Database

Choosing the right database can significantly influence your projectโ€™s reliability, performance, cost, and ability to scale. A good database choice is not only about picking a popular engine; it is about matching the database to your data shape, query patterns, consistency needs, latency requirements, operational capacity, and long-term growth.

A practical way to start is to ask:

A useful rule of thumb:

Choose the database model based on the shape of your data and access pattern first. Choose the specific database engine later.

Data Models

Different database models are optimized for different types of data and workloads. Most systems do not use only one database forever. It is common to start with one primary database and introduce specialized systems later for caching, search, analytics, or time-series workloads.

Relational Databases / Row-Store SQL

Relational databases store data in tables with rows and columns. They usually enforce a fixed schema and support strong consistency, transactions, constraints, joins, and relational integrity.

They are often the safest default choice for business applications because they are mature, well understood, and flexible enough for many use cases.

Best for: structured data, transactions, financial records, inventory, orders, user accounts, permissions, and applications where correctness matters.

Best for Typical engines Key traits
Financial ledgers, inventory, order processing, SaaS apps PostgreSQL, MySQL, SQL Server, Oracle ACID transactions, joins, indexes, constraints, mature tooling

Example:

+----+----------+----------+
| id | customer | balance  |
+----+----------+----------+
|  1 | Omar     | 1250.00  |
|  2 | Layla    |   93.70  |
+----+----------+----------+

Use relational databases when your data has clear relationships and you need reliable updates, joins, and transactional correctness.

Avoid forcing everything into relational tables when the data is deeply nested, rapidly changing, or mostly accessed as whole documents.

Document and Key-Value Databases

Document databases store semi-structured records, usually as JSON-like documents. Key-value databases store values under unique keys and are optimized for simple lookups.

These systems are useful when the schema changes frequently or when the application usually reads and writes entire objects rather than joining many tables.

Best for: user profiles, content feeds, product catalogs, configuration, shopping carts, sessions, and high-scale applications with simple access patterns.

Best for Typical engines Key traits
User profiles, catalogs, content feeds, session-like data MongoDB, DynamoDB, Couchbase, Firestore Flexible schema, horizontal scaling, document access, tunable consistency depending on engine

Example:

{
  "user": "Omar",
  "posts": [
    { "id": 1, "text": "Hello" },
    { "id": 2, "text": "World" }
  ]
}

Use document databases when your data is naturally object-shaped and usually accessed as a whole document.

Use key-value databases when your access pattern is mostly:

get value by key
put value by key
delete value by key

Be careful when your workload needs many joins, complex ad-hoc queries, or multi-record transactions. Some NoSQL systems support these features, but they are usually not their strongest area.

Time-Series Databases

Time-series databases are optimized for data points indexed by time. They are designed for high write rates, compression, retention policies, rollups, and time-window queries.

Best for: metrics, logs, IoT telemetry, sensor readings, monitoring data, market ticks, and event streams.

Best for Typical engines Key traits
Server metrics, IoT telemetry, market data, observability TimescaleDB, InfluxDB, Amazon Timestream, Prometheus High ingest rate, timestamp indexing, retention policies, downsampling, window queries

Example:

value
20 โ”ค                              *
18 โ”ค                         *
16 โ”ค                    *
14 โ”ค               *
12 โ”ค          *
10 โ”ค     *
 8 โ”ค *
 0 โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ time

Use a time-series database when most queries ask questions like:

A normal relational database can handle small or moderate time-series workloads, especially with good indexing and partitioning. Dedicated time-series systems become more valuable when ingest volume, retention, and aggregation needs grow.

Graph Databases

Graph databases represent data as nodes and edges. They are optimized for traversing relationships, finding paths, and querying highly connected data.

Best for: social networks, fraud detection, recommendation systems, identity graphs, dependency graphs, and permission relationships.

Best for Typical engines Key traits
Social graphs, fraud rings, recommendations, network dependencies Neo4j, Amazon Neptune, ArangoDB Fast relationship traversal, path finding, pattern matching

Example:

(Omar)โ”€โ”€knowsโ”€โ”€(Layla)
  โ”‚              โ”‚
knows          knows
  โ”‚              โ”‚
  โ–ผ              โ–ผ
(Zayn)       (Noura)

Use a graph database when the relationships between entities are as important as the entities themselves.

A relational database can model graphs using tables, but deep relationship traversal can become slow or complicated as the number of hops increases.

In-Memory Databases and Caches

In-memory systems keep data in RAM to provide very low-latency access. They are often used as a cache in front of a primary database, but some can also be used as primary data stores for specific workloads.

Best for: session state, leaderboards, counters, rate limiting, temporary data, queues, and frequently accessed hot data.

Best for Typical engines Key traits
Session state, leaderboard counters, low-latency lookups, caching Redis, Memcached Very low latency, TTL support, simple data structures, optional persistence depending on engine

Example:

Client โ†’ Cache โ†’ Primary DB
          โ”‚
          โ””โ”€โ”€ TTL: 60 seconds

Use caching when the same data is read frequently and does not need to be recomputed or refetched every time.

Be careful with cache invalidation. A cache can improve performance, but it also introduces consistency questions:

Scalability

Scalability is about how your database handles growth in users, traffic, data volume, and query complexity.

There are two broad approaches:

Most systems start with vertical scaling because it is simpler. Horizontal scaling becomes necessary when one machine is no longer enough or when availability and geographic distribution become important.

Scale-Up / Vertical Scaling

Vertical scaling increases the resources of a single database node by adding more CPU, RAM, storage, or IOPS.

Best for How it scales Trade-offs
MVPs, small-to-medium applications, early OLTP systems Add more CPU, RAM, storage, or IOPS to one server Simple to operate, but has hardware limits and can become expensive

Example:

+------------+
| 4 vCPU     |
| 8 GB RAM   |
+------------+
      โ”‚ upgrade
      โ–ผ
+------------+
| 32 vCPU    |
| 256 GB RAM |
+------------+

Vertical scaling is often the best first step because it avoids distributed-system complexity. However, it eventually reaches a limit.

Read Replication

Read replication creates one or more copies of the primary database. The primary handles writes, while replicas handle read queries.

Best for How it scales Trade-offs
Read-heavy applications, dashboards, reporting, APIs with many reads Send read traffic to replicas The primary can still bottleneck writes; replicas may lag behind the primary

Example:

Clients
  โ”‚
  โ”œโ”€โ–บ Replica 1  SELECT
  โ”œโ”€โ–บ Replica 2  SELECT
  โ””โ”€โ–บ Primary    READ / WRITE

Read replicas help when reads dominate writes. They do not automatically solve write scalability.

Be careful when your application reads immediately after writing. If it reads from a lagging replica, it may not see the latest data.

Sharding / Partitioning

Sharding splits data across multiple nodes. Each shard stores only part of the data, usually based on a shard key such as user_id, tenant_id, or a hash of a key.

Best for How it scales Trade-offs
Very large datasets, high-write systems, multi-tenant SaaS Split data by key range, tenant, or hash across nodes Cross-shard joins and transactions become more complex

Example:

Shard key = user_id % 3

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Shard 0    โ”‚ Shard 1    โ”‚ Shard 2    โ”‚
โ”‚ users 0,3  โ”‚ users 1,4  โ”‚ users 2,5  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The shard key is one of the most important design decisions. A poor shard key can create hot shards, uneven data distribution, or difficult queries.

Good shard keys usually have:

Geo-Distributed / Active-Active

Geo-distributed systems place database nodes in multiple regions. Active-active systems allow writes in more than one region.

Best for How it scales Trade-offs
Global applications, low-latency regional access, disaster resilience Users read and write to nearby regions Higher complexity, higher cost, consistency trade-offs, conflict handling or consensus overhead

Example:

โ•ญโ”€โ”€ us-east โ”€โ”€โ•ฎ        โ•ญโ”€โ”€ eu-west โ”€โ”€โ•ฎ
โ”‚  RW node    โ”‚โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚  RW node    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ        โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

There are two broad types of active-active systems:

  1. Strongly consistent active-active Uses consensus or global transaction coordination. This reduces conflict risk but increases write latency.
  2. Eventually consistent active-active Allows local writes and resolves conflicts later. This improves local latency but requires careful conflict-resolution rules.

Use active-active only when the business requirement justifies the operational complexity.

Serverless / Elastic Autoscaling

Serverless and elastic database platforms automatically adjust capacity based on demand. They can be useful for unpredictable or spiky workloads.

Best for How it scales Trade-offs
Spiky traffic, unpredictable workloads, low-ops teams Capacity grows and shrinks automatically Possible cold starts, pricing surprises, operational limits, less control

Example:

Load
  โ”‚
  โ”œโ”€โ”€ low traffic  โ”€โ”€โ–บ small capacity
  โ””โ”€โ”€ burst        โ”€โ”€โ–บ increased capacity

Serverless databases can reduce operational burden, but they are not automatically cheaper. They are often cost-effective for variable workloads, but steady high traffic may be cheaper on provisioned capacity.

Dedicated Cache Layer

A cache layer stores frequently accessed data in memory to reduce pressure on the primary database.

Best for How it scales Trade-offs
Hot reads, sessions, rate limits, expensive queries Offload repeated reads to an in-memory tier Extra system to operate; stale data and eviction strategy must be handled

Example:

Client โ†’ Redis Cluster โ†’ Primary DB
          fast path       source of truth

Caching is usually most effective when:

Avoid using caching to hide a poor data model too early. First make sure your queries, indexes, and schema are reasonable.

Availability and Reliability

Availability is about keeping the database accessible. Reliability is about keeping the data correct and recoverable.

A highly available database strategy usually combines:

Two important terms:

Example:

RTO = 15 minutes
RPO = 1 minute

This means the system should recover within 15 minutes and lose no more than 1 minute of data.

A useful principle:

Resilience is an architecture decision, not a lucky outcome. Choose a pattern that matches your SLA, risk tolerance, and budget.

Single Node + Local Backup

This is the simplest setup: one database node and periodic backups.

Best for How it protects Trade-offs
Development, prototypes, demos, non-critical internal tools Periodic backup of database files or dumps Downtime during failure; possible data loss since last backup

Example:

App โ”€โ”€โ–บ DB โ”€โ”€โ–บ local backup

This is not enough for important production systems because the database server remains a single point of failure.

Synchronous Failover

Synchronous replication writes data to a replica before confirming the transaction. This reduces data loss risk because the replica has the latest committed writes.

Best for How it protects Trade-offs
Production systems needing low data loss Primary writes synchronously to standby; standby can be promoted Higher write latency; if both nodes are in the same zone, zone failure can still take both down

Example:

Primary โ”€โ”€syncโ”€โ”€โ–บ Standby
   โ”‚
   โ””โ”€โ”€ write acknowledged after replica confirms

For stronger availability, synchronous failover is usually more useful when the standby is in a different availability zone, not the same rack or same failure domain.

Cross-Zone Asynchronous Replication

Asynchronous replication sends changes to another node after the primary commits. This improves availability and geographic fault tolerance, but the replica may lag behind.

Best for How it protects Trade-offs
Standard web and mobile backends Replica in another availability zone Possible data loss during failover; replica lag; split-brain risk if failover is mishandled

Example:

AZ-A Primary โ”€โ”€asyncโ”€โ”€โ–บ AZ-B Standby

This is a common and cost-effective production pattern. It improves resilience but does not guarantee zero data loss.

Multi-Region Active-Passive / Warm Standby

In this pattern, one region serves production traffic while another region stays ready as a standby. Data is continuously replicated to the standby region.

Best for How it protects Trade-offs
Disaster recovery, compliance, regional outage protection Logs or changes are shipped to another region; failover redirects traffic More expensive than single-region; failover must be tested; some data loss may occur depending on replication mode

Example:

Primary Region โ”€โ”€logs / replicationโ”€โ”€โ–บ Standby Region

Warm standby is often a practical compromise between cost and resilience. It is less complex than active-active but still protects against major regional outages.

Multi-Region Active-Active

Active-active systems allow multiple regions to accept traffic at the same time. This can improve global latency and availability, but it is significantly more complex.

Best for How it protects Trade-offs
Global 24/7 products, low-latency regional writes, high availability Multiple regions serve reads and writes Higher cost, more complex operations, consistency or conflict-resolution challenges

Example:

โ•ญโ”€โ”€ us-east โ”€โ”€โ•ฎ  โ—€โ”€โ”€sync/consensusโ”€โ”€โ–ถ   โ•ญโ”€โ”€ eu-west โ”€โ”€โ•ฎ
โ”‚  RW node    โ”‚                         โ”‚  RW node    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ                         โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Use active-active only when you truly need it. Many applications can meet their requirements with active-passive disaster recovery, read replicas, or regional caching.

Immutable Backups + Point-in-Time Restore

Backups protect against data loss, corruption, ransomware, accidental deletion, and bad deployments.

Point-in-time restore allows you to restore the database to a specific moment before an error occurred.

Best for How it protects Trade-offs
Operator error, ransomware, accidental deletes, bad migrations Immutable snapshots plus transaction logs or WAL Does not provide live high availability; restore takes time

Example:

DB snapshots โ”€โ”€โ–บ Object storage
DB logs      โ”€โ”€โ–บ Point-in-time restore

Backups should be tested regularly. An untested backup is only a hope, not a recovery plan.

Observability and Chaos Testing

Observability helps you detect problems before users notice them. Chaos testing verifies whether the system behaves correctly under failure.

Best for How it protects Trade-offs
Systems with SLOs, audits, production reliability goals Metrics, logs, traces, alerts, controlled fault injection Requires tooling, discipline, and operational maturity

Important database signals include:

Example:

Fault injection โ†’ Monitoring โ†’ Alert โ†’ Runbook โ†’ Recovery

Reliability is not only about database features. It also depends on the application, deployment process, network, backups, monitoring, and team response.

Decision-Making Trees

Use these trees as a starting point, not as strict rules. Real systems often combine multiple database types.

For example:

PostgreSQL for core transactions
Redis for caching
OpenSearch for search
S3 for files
ClickHouse or BigQuery for analytics

Start with the simplest system that meets your needs, then add specialized components when the workload proves they are necessary.

First Question: What Shape Is the Data?

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                       โ”‚   What does your application store?    โ”‚
                       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                       โ”‚
                โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                โ–ผ                      โ–ผ                         โ–ผ
      Fixed tables and          JSON, XML, documents,       Files, images,
      relationships?            flexible fields, keys?      videos, logs, blobs?
                โ”‚                      โ”‚                         โ”‚
                โ–ผ                      โ–ผ                         โ–ผ
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚  STRUCTURED  โ”‚       โ”‚ SEMI-STRUCTURED  โ”‚      โ”‚  UNSTRUCTURED   โ”‚
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Use this as a guideline:

Do not overreact to small schema variation. A relational database can still handle some flexible fields through JSON columns. But if flexible nested objects are the core of your workload, a document model may be more natural.

Structured Data Decision Tree

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                       โ”‚ Structured Data โ”‚
                       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                โ”‚
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ–ผ                                   โ–ผ
        OLTP / Transactions                 Analytics / OLAP
   row lookups, writes, updates         scans, aggregates, reports
              โ”‚                                   โ”‚
     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
     โ–ผ                 โ–ผ              โ–ผ                       โ–ผ
Single-node      Distributed SQL   Column store          Serverless analytics
Postgres/MySQL   or sharded SQL    warehouse             query engine

Recommended choices:

Need Good fit
Standard application transactions PostgreSQL, MySQL
Strong consistency with global scale Spanner, CockroachDB, YugabyteDB
Large analytical scans Snowflake, BigQuery, Redshift, ClickHouse
Querying data in object storage Athena, BigQuery external tables, Trino

Rule of thumb:

Start with PostgreSQL or MySQL for transactional apps unless you already know you need global distribution, extreme scale, or specialized analytics.

Semi-Structured Data Decision Tree

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                      โ”‚ Semi-Structured Data โ”‚
                      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                  โ”‚
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ–ผ                         โ–ผ                         โ–ผ
 Document-shaped?          Simple key-value?          Relationship-heavy?
        โ”‚                         โ”‚                         โ”‚
        โ–ผ                         โ–ผ                         โ–ผ
 Document DB                 KV / Cache                 Graph DB
 MongoDB, Couchbase,         Redis, DynamoDB,           Neo4j, Neptune,
 Firestore                   Memcached                  ArangoDB

        โ”‚                         โ”‚                         โ”‚
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                      โ–ผ                          โ–ผ
              Full-text search?          Time-ordered events?
                      โ”‚                          โ”‚
                      โ–ผ                          โ–ผ
              Search database             Time-series database
              Elasticsearch,              InfluxDB, TimescaleDB,
              OpenSearch                  Timestream

Recommended choices:

Need Good fit
Flexible JSON documents MongoDB, Couchbase, Firestore
Simple key-value access DynamoDB, Redis
Low-latency temporary data Redis, Memcached
Full-text search Elasticsearch, OpenSearch, Meilisearch
Relationship traversal Neo4j, Neptune
Metrics and time-window queries TimescaleDB, InfluxDB, Timestream

Rule of thumb:

Pick the system based on the dominant access pattern: document retrieval, key lookup, search, graph traversal, or time-window analysis.

Unstructured Data Decision Tree

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                       โ”‚  Unstructured Data  โ”‚
                       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                  โ”‚
             โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
             โ–ผ                    โ–ผ                    โ–ผ
      File semantics?       Large blobs?       Low-latency block access?
             โ”‚                    โ”‚                    โ”‚
             โ–ผ                    โ–ผ                    โ–ผ
   Distributed file system   Object storage        Block storage
   NFS, EFS, HDFS            S3, GCS, Azure Blob    EBS, Persistent Disk

Recommended choices:

Need Good fit
Images, videos, backups, large files Object storage such as S3, GCS, Azure Blob
Shared POSIX-like filesystem EFS, NFS, Filestore
Disk attached to a VM or database EBS, Persistent Disk
Global file delivery CDN in front of object storage

Rule of thumb:

Choose by access semantics: object, file, or block. Then decide based on latency, durability, cost, and access frequency.

How to Use These Trees

  1. Identify the data shape. Is it structured, semi-structured, or unstructured?
  2. Identify the access pattern. Are you doing transactions, key lookups, document reads, graph traversals, time-window queries, full-text search, or analytics?
  3. Start with the simplest reliable option. For many applications, this is a relational database plus good indexing.
  4. Prototype using real queries. Test the actual workload, not only theoretical examples.
  5. Measure before adding complexity. Add caching, sharding, replicas, or specialized systems when measurements prove you need them.
  6. Revisit the choice periodically. Traffic, data volume, team size, and cloud services change over time.

Practical Selection Checklist

Before choosing a database, answer these questions:

Data shape:

Query pattern:

Consistency:

Scale:

Availability:

Operations:

Cost: