Backup and Recovery Strategies

Last modified: January 26, 2018

This article is written in: 🇺🇸

Backup and Recovery Strategies

Backup and recovery strategies are essential components of any robust database management plan, ensuring that data remains durable, available, and that business operations can continue uninterrupted. One of the significant challenges in designing these strategies is performing backups without disrupting or slowing down production read and write operations. In this discussion, we'll delve into various backup types, explore methods to minimize production impact, examine recovery strategies, and highlight best practices and tools.

After reading the material, you should be able to answer the following questions:

What are the primary objectives of effective backup and recovery strategies, and why are they important for database management?
What are the different types of backups (full, incremental, differential, logical, and physical), and what are the advantages and trade-offs of each?
What methods can be employed to minimize the impact of backup operations on production systems, and how do they work?
What are the recovery strategies, such as Point-in-Time Recovery and Continuous Data Protection, and in what scenarios are they most effectively used?
What best practices and tools should be implemented to ensure reliable, efficient, and secure backup and recovery processes?

Objectives

The primary goals of effective backup and recovery strategies revolve around three main objectives:

Ensuring data durability is a fundamental priority, as it involves protecting data against potential loss caused by hardware malfunctions, software bugs, or human errors, and enabling full recovery to maintain data integrity.
Focusing on data availability is critical to reduce downtime, employing strategies that ensure systems remain operational or are quickly restored following disruptions to maintain consistent access to data.
Preserving performance during backup operations is essential, aiming to execute backups without affecting the speed or reliability of production read and write operations, ensuring a seamless experience for users and applications.

Challenges in Non-Disruptive Backups

Designing backup processes that don't interfere with production systems presents several challenges. Backup operations can consume significant resources, leading to competition with production workloads for I/O, CPU, and network bandwidth. Ensuring data consistency during backups, especially without locking tables or halting transactions, adds another layer of complexity. Additionally, as databases grow in size, scalability becomes a concern, requiring efficient methods to handle large volumes of data without affecting application performance.

Types of Backups

Understanding the different types of backups is fundamental to creating an effective backup strategy. Each type has its advantages and trade-offs, and often, a combination of these methods is employed to balance recovery needs and resource utilization.

Full Backup

A full backup involves creating a complete copy of the entire database at a specific point in time. This method is straightforward and simplifies the recovery process since all the data is contained in a single backup set. However, full backups can be time-consuming to create, especially for large databases, and require substantial storage space. If not managed properly, performing full backups can impact production systems due to the resources they consume.

Incremental Backup

Incremental backups save only the data that has changed since the last backup, whether it was a full or another incremental backup. This approach reduces the amount of data that needs to be backed up, resulting in faster backup times and reduced storage requirements. While incremental backups are efficient in terms of backup resources, the recovery process can be more complex. Restoring from incremental backups requires the last full backup and all subsequent incremental backups, which can increase recovery time.

Differential Backup

Differential backups capture all the data that has changed since the last full backup. Unlike incremental backups, differential backups do not consider previous differential backups, which simplifies the recovery process. Restoring from a differential backup requires only the last full backup and the latest differential backup. However, as time passes, differential backups can grow in size, potentially approaching the size of a full backup, which may impact backup windows and storage needs.

Logical vs. Physical Backups

Backups can also be categorized based on whether they are logical or physical.

I. Logical Backups:

These backups involve exporting database objects such as schemas and data using database utilities like mysqldump for MySQL or pg_dump for PostgreSQL. Logical backups are portable across different database versions and platforms and allow for selective restoration of specific objects. However, they can be slower to create and may impact performance due to the need to scan tables. Additionally, they may not capture certain database-specific features or configurations.

II. Physical Backups:

Physical backups involve copying the physical files that store the database data, such as data files and logs. This method tends to be faster for both backup and restoration and generally has less impact on performance if managed correctly. However, physical backups are usually platform-specific and require measures to ensure data consistency, such as pausing writes or using snapshots.

Backup Methods Minimizing Production Impact

To prevent backups from interfering with production operations, it's important to employ methods designed to minimize resource contention and maintain data consistency without halting read or write activities.

Online (Hot) Backups

Online, or hot backups, are performed while the database is running and accessible to users. This method leverages database features that allow for consistent backups without locking tables. The backup tool reads data files while tracking changes, often through mechanisms like write-ahead logs. It's essential to ensure that the backup tool supports online backups and to monitor resource utilization during the process to prevent performance degradation.

Example with PostgreSQL:

Using pg_basebackup, you can perform an online backup without stopping the database:

pg_basebackup -h localhost -D /backup/data -U replicator -Fp -Xs -P

Example Output:

Password:
32768/32768 kB (100%), 1/1 tablespace
Base backup completed.

The Password Prompt signifies that user authentication is necessary before proceeding with the backup operation, ensuring secure access to the database or system.
The Progress Indicator provides real-time updates on the backup process, displaying details such as the amount of data transferred (in kilobytes) and the specific tablespaces being backed up.
The Completion Message serves as a confirmation that the base backup operation has finished successfully, indicating that all required data has been captured.

Snapshot-Based Backups

Snapshot-based backups utilize filesystem or storage-level snapshots to capture the state of the database at a specific point in time. Filesystem snapshots, like those provided by LVM, ZFS, or Btrfs, and storage snapshots from SAN or NAS systems can create instant snapshots with minimal impact on the running system.

+------------------------+
|   Production Database  |
|       (Running)        |
+-----------+------------+
            |
       Initiate Snapshot
            |
            v
+------------------------+
|   Snapshot Created     |
|   (Point-in-Time Copy) |
+------------------------+

In this setup, the production database continues to operate without interruption while the snapshot is created. The snapshot can then be backed up or moved without affecting the production workload.

Example with LVM Snapshots:

# Create LVM snapshot
lvcreate --size 10G --snapshot --name db_snapshot /dev/vg0/db_volume

# Mount snapshot and perform backup
mount /dev/vg0/db_snapshot /mnt/db_snapshot
tar -czf db_backup.tar.gz /mnt/db_snapshot

# Remove snapshot after backup
umount /mnt/db_snapshot
lvremove /dev/vg0/db_snapshot

The lvcreate command is used to create a snapshot volume named db_snapshot, allocating 10G of space for the snapshot from the original logical volume /dev/vg0/db_volume.
The mount command is employed to attach the snapshot volume to the directory /mnt/db_snapshot, enabling access to its files for further operations.
The tar command facilitates the archiving and compression of the snapshot's data into a single file, db_backup.tar.gz, for efficient storage and transfer.
The umount and lvremove commands are executed sequentially to unmount the snapshot from the filesystem and delete the snapshot volume, ensuring no residual artifacts remain after the backup process.

Replication-Based Backups

Using a replica or standby server to perform backups can offload the backup workload from the primary server. Replication keeps the standby server synchronized with the primary, and backups are performed on the standby, avoiding any impact on the primary server's performance.

+-------------------+
          |   Primary Server  |
          |   (Production)    |
          +---------+---------+
                    |
               Replication
                    |
                    v
          +-------------------+
          |   Standby Server  |
          |     (Backup)      |
          +---------+---------+
                    |
              Backup Process
                    |
                    v
          +-------------------+
          |   Backup Storage  |
          +-------------------+

In this configuration, the standby server receives updates from the primary server and serves as the source for backups. This approach ensures that the backup process does not consume resources on the primary server, thereby preserving performance.

Considerations:

It's important to monitor the replication lag to ensure the standby server has the most recent data before performing backups.
The standby server should have sufficient resources to handle the backup operations without becoming a bottleneck.

Recovery Strategies

An effective recovery strategy is essential to restore operations quickly after data loss or corruption. Recovery strategies should align with business requirements, such as acceptable downtime and data loss thresholds.

Point-in-Time Recovery (PITR)

Point-in-Time Recovery allows the database to be restored to a specific moment by using backups and transaction logs. The process involves restoring the latest full backup, applying any incremental backups, and replaying transaction logs up to the desired point.

Requirements:

Regular full and incremental backups.
Continuous archiving of transaction logs (e.g., WAL files in PostgreSQL or binary logs in MySQL).

Example with PostgreSQL:

I. Configure Continuous Archiving:

In the postgresql.conf file:

archive_mode = on
archive_command = 'cp %p /archive_location/%f'

This setup ensures that transaction logs are copied to a designated archive location.

II. Perform Recovery:

Restore the base backup to the desired location.
Create a recovery.conf file specifying the target recovery time:

restore_command = 'cp /archive_location/%f %p'
recovery_target_time = '2023-09-14 12:34:56'

Start the PostgreSQL server, which will enter recovery mode and apply logs up to the specified time.

Continuous Data Protection (CDP)

Continuous Data Protection involves capturing and storing all changes in real-time or near-real-time, allowing for recovery to any point. This method provides the most granular recovery option but requires significant storage for logs and may impact performance due to the overhead of continuous logging.

Standby Databases

Maintaining one or more standby databases synchronized with the primary via replication offers immediate failover capabilities. In case of a primary server failure, a standby can be promoted to become the new primary, minimizing downtime.

Failover Process Illustration:

[Normal Operation]
+-------------------+       Replication       +-------------------+
|   Primary Server  | ----------------------> |   Standby Server  |
|     (Active)      |                         |     (Passive)     |
+-------------------+                         +-------------------+

[After Primary Failure]
+-------------------+                          +-------------------+
|   Primary Server  |      Promote Standby     |   Standby Server  |
|     (Failed)      | -----------------------> |    (Now Active)   |
+-------------------+                          +-------------------+

Clients are redirected to the new primary server, ensuring continuity of service.

Best Practices

Implementing best practices enhances the effectiveness of backup and recovery strategies:

Establish a backup schedule that aligns with your Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Automate backups to reduce the risk of human error.
Regularly test your restore procedures to ensure backups are valid and that data can be recovered successfully.
Store backups in multiple locations, including offsite or cloud storage, to protect against site-specific disasters.
Implement monitoring for backup processes and set up alerts for failures or performance issues to enable prompt response.
Maintain up-to-date documentation of backup and recovery processes, and train staff on how to execute recovery procedures effectively.
Encrypt backups to protect sensitive data and restrict access to backup storage to authorized personnel only.
Schedule backups during off-peak hours when possible, and consider throttling backup processes to limit their impact on system resources.
Adhere to legal and regulatory requirements for data retention and implement policies for backup retention and secure disposal of outdated backups.

Tools and Technologies for Efficient Backups

Efficient backups require tools and technologies that minimize disruption to production systems while ensuring data reliability and accessibility.

Database-Specific Tools

Percona XtraBackup is an open-source solution ideal for MySQL/MariaDB environments, offering non-blocking backups, incremental features, and data compression.
MySQL Enterprise Backup provides a comprehensive hot backup solution exclusive to MySQL Enterprise users.
PostgreSQL includes pg_basebackup, a built-in utility that facilitates streaming physical backups without halting the database.
Tools like Barman and pgBackRest provide advanced capabilities for PostgreSQL, such as incremental backups, compression, and precise point-in-time recovery.
Recovery Manager (RMAN) is integrated with Oracle databases and supports a wide range of backup options, including incremental and full backups with robust recovery features.

Filesystem and Storage Solutions

LVM snapshots create point-in-time backups of logical volumes, enabling backups without disrupting database operations.
ZFS snapshots are integrated into the ZFS filesystem and allow efficient point-in-time snapshots and data replication.
AWS Elastic Block Store (EBS) provides incremental snapshots that only save changes since the last backup, reducing storage needs.
Azure Managed Disks offer snapshot capabilities for virtual machines, ensuring reliable backups and easy restoration.
Google Cloud Persistent Disk Snapshots create point-in-time disk copies for streamlined backup and recovery within Google Cloud.

Backup and Recovery Software

Bacula is a robust open-source platform offering enterprise-grade backup for diverse operating systems and storage configurations.
Amanda simplifies backup infrastructure by providing a unified open-source solution for managing servers and clients.
Veeam Backup & Replication delivers broad-spectrum data protection for environments spanning virtual, physical, and cloud systems.
Commvault offers an all-encompassing data management solution with advanced backup and recovery features across numerous applications and platforms.

Continuous Data Protection Tools

Distributed Replicated Block Device (DRBD) ensures real-time block-level replication between servers, enhancing high availability.
MongoDB Ops Manager includes continuous backup and enables precise point-in-time recovery for MongoDB databases.
SQL Server Always On Availability Groups support high availability and real-time replication for Microsoft SQL Server deployments.