Commvault

Deduplication Database and Store

Credits:

Many thanks to Mike Byrne for his hard work on the screen captures.

The Deduplication Database (DDB) maintains all signature records for a deduplication engine. During data protection operations, signatures are generated on data blocks and sent to the DDB to determine whether each block is duplicate or unique. During data aging operations, the DDB is used to decrement signature counters for blocks from aged jobs and subsequently prune signature and block records when the signature counter reaches zero. For these reasons, it is critical that the DDB is located on high performance solid state or PCI storage technology.

Database Disk Requirements

To meet DDB IOPS requirements, use high performance disks based on Commvault's sizing guidelines found in the Commvault Online Documentation. Enterprise-class solid state disks or PCI storage technology, such as Fusion-io cards, are highly recommended. Plan DDB disk requirements based on the current size of the environment and expected future growth.

Deduplication Database Technical Structure

The Deduplication Database (DDB) is made up of three tables, illustrated conceptually in the sketch that follows the list:

  • Primary table – contains all deduplication signatures and a counter of the number of references to the signature.
  • Secondary table files – contain archive file records related to jobs. In Commvault V11, each secondary table file contains up to 16 archive file records. When all records within a file are purged, the file is deleted and recreated. This process mitigates the risk of large DDB bloat over time.
  • Zero reference table – contains all deduplication signatures that are no longer referenced by any jobs. This table is used during the physical pruning process.
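
The interaction between these tables can be pictured as a small reference-counting model. The following Python sketch is illustrative only; the class, table, and method names are hypothetical and do not reflect Commvault's actual implementation or on-disk format.

  class DeduplicationDatabase:
      """Conceptual model of DDB reference counting (illustrative only)."""

      def __init__(self):
          self.primary = {}            # signature -> reference count
          self.zero_reference = set()  # signatures no longer referenced by any job

      def record_block(self, signature):
          """Data protection: return True if the block is unique and must be
          written to the store, False if it is a duplicate."""
          if signature in self.primary:
              self.primary[signature] += 1   # duplicate block: add a reference only
              return False
          self.primary[signature] = 1        # unique block: new primary record
          return True

      def age_block(self, signature):
          """Data aging: decrement the counter; at zero, the signature moves to
          the zero reference table and becomes eligible for physical pruning."""
          self.primary[signature] -= 1
          if self.primary[signature] == 0:
              del self.primary[signature]
              self.zero_reference.add(signature)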

 
Deduplication database structure:




Deduplication Database Volume

If you plan on using Commvault deduplication with the disk library attached to the MediaAgent, prepare the volume that will be used to host the Deduplication Database (DDB). The DDB should be isolated on a dedicated set of disks, preferably SSD drives. For more information on requirements for the DDB volume, please refer to Commvault Online Documentation.

Once the volume is created, format it using a 4 KB block size. The current default Windows block size during format is 4 KB, but it is preferable to manually select 4 KB (4,096 bytes), as the default might change in the future.
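
The allocation unit size can be verified after formatting. The following Python sketch is a minimal check, assuming a Windows MediaAgent where 'fsutil fsinfo ntfsinfo' reports a 'Bytes Per Cluster' line (the exact output format can vary by Windows version, and the command requires administrative rights); the drive letter is a hypothetical example:

  import subprocess

  def bytes_per_cluster(drive):
      """Return the NTFS allocation unit size for a drive by parsing
      'fsutil fsinfo ntfsinfo' output."""
      output = subprocess.run(
          ["fsutil", "fsinfo", "ntfsinfo", drive],
          capture_output=True, text=True, check=True,
      ).stdout
      for line in output.splitlines():
          if "Bytes Per Cluster" in line:
              return int(line.split(":")[1].split()[0])
      raise RuntimeError("'Bytes Per Cluster' not found in fsutil output")

  if __name__ == "__main__":
      size = bytes_per_cluster("D:")   # hypothetical DDB volume drive letter
      print(f"Allocation unit size: {size} bytes",
            "(OK)" if size == 4096 else "(expected 4096)")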




DDB volume format block size

1 - Use a 4,096-byte (4 KB) block size to format the DDB volume.



Deduplication Database Protection

When a deduplication-enabled storage policy is created, a DDB backup subclient called 'DDBBackup' is automatically created on the MediaAgent hosting the DDB.

The DDB is automatically configured to perform a backup every twenty-four hours, and it is highly recommended to keep this schedule, as running more frequent backups could impact the performance of other operations.

When a DDB backup runs, the database is placed in a quiesced state to ensure database consistency during the backup. For Windows® MediaAgents, the Volume Shadow Copy Service (VSS) is enabled on the volume hosting the DDB. The 'Copy on Write' (COW) cache is configured to be at least 10% of the size of the volume hosting the DDB.

For Linux MediaAgents, Logical Volume Manager (LVM) is used to create software snapshots of the DDB (the LVM volume must have at least 15% of unallocated space for the snapshots).

Deduplication Database Backup Frequency

Each time the DDB is backed up, it creates a recovery point in the event that the database needs to be restored. The more frequently the backups run, the less time it takes to fully recover the DDB. On the other hand, running frequent DDB backups could potentially impact other operations, such as client backups and disk library data pruning. This is because the DDB backup engages a Volume Shadow Copy Service (VSS) snapshot that may impact deduplication database performance. Therefore, if DDB backups are executed during client backups or data pruning, they may increase the time needed to complete these operations.

There is a 'System Created for DDB subclients' schedule policy that is created when installing Commvault® software. This schedule policy protects DDBs every twenty-four hours by default. When a new DDBBackup subclient is created by the system, it is automatically associated to this schedule policy. If you want to modify the frequency of the DDB backups, simply edit this schedule policy.

DDB backup subclients created before Service Pack 11 follow the previous default schedule (which was every eight hours). It is recommended to modify the schedule to reflect the new recommendation of one backup every twenty-four hours.




To modify the DDB backup schedule policy

1 - Click Schedule Policies | Right-click System Created for DDB subclients | Select Edit.

2 - Select the schedule.

3 - Click Edit.

4 - Set schedule options such as the repetition frequency.


Configuring the DDB Subclient

In Commvault® V11, there are several performance enhancements to provide faster DDB backups. The following subclient options can be tuned to provide faster DDB backup performance:

  • Data Readers – Increasing the number of 'Data Readers' and checking the 'Allow Multiple Readers in a Drive or Mount Point' option provides multi-stream backups of the DDB volume. Ensure the disks are fast enough to support multiple streams.
  • Application Read Size – The default read size for a file system subclient is 64KB. Increasing this number to 256KB or higher will improve DDB backup performance. Test the settings to achieve best results.
  • Select VSS shadow copy storage associations – The VSS COW cache can be placed on a separate disk to provide faster backup performance, as block changes will be redirected to the cache location on the other disk. When configuring this option, only volumes that do not contain DDBs are listed.

 



To configure the DDB subclient

1 - Right-click the DDBBackup subclient | Properties.

2 - Set VSS cache storage location.

3 - Set the number of data readers used for DDB backups.

4 - Check to use multiple readers on the same volume.

5 - Click Advanced.

6 - From the Performance tab, check the Application Read Size box.

7 - Set the application read size.



Deduplication Database Reconstruction

The Deduplication Database (DDB) is highly resilient and reconstruct operations can rebuild the database to match the latest job and chunk information maintained in the CommServe® database.

In the unlikely event that the DDB becomes corrupt, the system automatically recovers the DDB from the most recent backup. Once the DDB backup has been restored, a reconstruct process occurs which 'crawls' job data written since the last DDB backup point. This brings the restored DDB to the most up-to-date state. Keep in mind that the more frequently DDB backups are conducted, the shorter the 'crawl' period needed to completely restore the DDB. Note that during this entire recovery process, jobs that require the DDB must not be running.

How the DDB Reconstruct Works

During data protection jobs, as each chunk completes it is logged in the CommServe database. If the Deduplication Database (DDB) needs to be restored, the chunk information is used to re-read signatures and add them to the DDB. Upon initial restore of the DDB, the checkpoint at backup time is used to determine which chunks are more recent than the restored database. An auxiliary copy operation then processes the chunk data, extracts block signatures from the job metadata and adds the entries back into the DDB.
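
Conceptually, the partial reconstruction replays every chunk that completed after the restored database's backup checkpoint. The Python sketch below is illustrative only, assuming hypothetical chunk records (dictionaries with a completion time and a list of signatures) and reusing the record_block method from the earlier DDB sketch; it is not Commvault's implementation.

  def reconstruct_ddb(restored_ddb, commserve_chunks, checkpoint_time):
      """Illustrative partial reconstruction: re-add signatures from every
      chunk logged in the CommServe database after the DDB backup checkpoint."""
      for chunk in commserve_chunks:
          if chunk["completed_at"] <= checkpoint_time:
              continue                              # already reflected in the restored DDB
          for signature in chunk["signatures"]:     # signatures re-read from chunk metadata
              restored_ddb.record_block(signature)  # re-insert or re-reference each block
      return restored_ddb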

There are two methods available to reconstruct the deduplication database:

  • Partial Database Reconstruction – If the DDB is lost or corrupt, a backup copy of the database is restored, and the database is reconstructed using chunk metadata.
  • Full Database Reconstruction – If the DDB is lost and no backup copy is available, the entire database is reconstructed from chunk metadata.

Full deduplication database reconstruction process





To configure DDB recovery options

1 - Expand Storage Resources | Deduplication Engines | Right-click the desired DDB | Properties.

2 - Select the option to pause and recover (default) or seal and start a new DDB automatically.



Deduplication Store

Each global deduplication policy has a dedicated deduplication store. A deduplication store is a logical folder structure in the library used to write deduplicated data to disk. Each store is completely self-contained and multiple stores can exist within the same disk library. Data blocks from one store cannot be written to another store, and blocks in one store cannot be referenced by the DDB of another store. This means that the more deduplication engines there are, the more duplicate data will exist in disk storage.

How the Deduplication Store Works

The size of a deduplication store is based on the number of deduplication block records maintained in the DDB and the available capacity on the disk library. A single deduplication store can grow based on the size of the library and performance characteristics of the Deduplication Database (DDB).


The deduplication store is made up of volume folders. Within the volume folders, there are several files (a simplified sketch of this layout follows the list):

  • SFILE_Container – contains all unique block data. An SFILE is written as part of a data protection job, but upon chunk completion, other jobs can reference blocks within the SFILE.
  • SFILE_Container.idx – contains indexing information and reference points to blocks in the SFILE.
  • Chunk_Meta_Data – contains metadata information related to a specific job. Chunk metadata is always exclusive to the job that wrote the metadata.
  • Chunk_Meta_Data.idx – contains indexing information and reference points to job metadata.
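
The following Python sketch models this layout in a simplified way: a unique block is appended to an SFILE container and a reference point is recorded in the companion .idx file. It is illustrative only; the file names, index format, and use of SHA-512 signatures here are assumptions rather than Commvault's actual store format, and it reuses the DeduplicationDatabase sketch from earlier in this section.

  import hashlib
  import json
  import os

  def store_block(volume_folder, block, ddb):
      """Illustrative only: write a unique block to an SFILE container and
      record its offset in the .idx file; duplicates only add a DDB reference."""
      signature = hashlib.sha512(block).hexdigest()     # illustrative signature choice
      if not ddb.record_block(signature):
          return signature                              # duplicate: nothing written to disk
      sfile = os.path.join(volume_folder, "SFILE_CONTAINER_001")
      with open(sfile, "ab") as container:
          offset = container.tell()
          container.write(block)                        # unique block data
      with open(sfile + ".idx", "a") as index:
          index.write(json.dumps({"signature": signature,
                                  "offset": offset,
                                  "size": len(block)}) + "\n")  # reference point
      return signature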


Deduplication store structure


 




Copyright © 2021 Commvault | All Rights Reserved.