Commvault

Overview of Deduplication

In any modern data center, duplicate data exists on storage media, networks, and virtual servers. Some examples include identical DLL files existing on different servers, or multiple users working on the same document, where each user modifies different blocks in the file while other blocks remain unchanged. Traditionally, this redundant data is stored on disk or tape, which requires a significant amount of space to protect. With Commvault® deduplication storage techniques, a single copy of redundant data is stored only once and subsequent occurrences are saved as references to that copy, reducing the amount of space needed to protect data against loss.

Deduplication high-level concept

Benefits and Features

Commvault® software has a unique set of deduplication features that are not available with most third-party deduplication solutions. By taking full advantage of Commvault deduplication, you can reduce storage and network resource requirements, shrink backup windows, efficiently copy data to off-site locations, and copy deduplicated data to disk or to a cloud environment.

Commvault deduplication offers the following benefits:

  • Efficient use of storage
  • Efficient use of network bandwidth
  • Significantly faster Synthetic Full operations
  • Significantly faster auxiliary copy operations
  • Resilient indexing and restorability 

Efficient use of Storage

Commvault deduplication provides two types of storage policies that are used to efficiently move large amounts of data:

  • Deduplication Storage Policy – performs deduplication on all data blocks written to the storage policy.
  • Global Deduplication Storage Policy (optional) – writes blocks from multiple storage policies through a single deduplication store. Using a global policy results in data blocks from multiple policies being stored only once on disk storage.

Efficient use of Network Bandwidth

Client-Side Deduplication is used to deduplicate block data before it leaves the client. From that point forward, only changed blocks are sent over the network. This greatly reduces network bandwidth requirements after the first successful full backup is complete.

Faster Synthetic Full

Using the Deduplication Accelerated Streaming Hash (DASH) full backup reduces the time needed to perform synthetic full and traditional full backup operations. A DASH full runs as a read-optimized synthetic full operation, which does not require traditional full backups to be performed. Once the first full backup has completed, blocks that have changed are protected during incremental or differential backups. A DASH full runs in place of a traditional or synthetic full, does not require any movement of data, and updates the index files and Deduplication Database (DDB) when the operation completes.
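
To make the read-optimized concept concrete, the following sketch (Python) assembles a new full image purely from block references recorded by the previous full and subsequent incremental jobs, so no data blocks are read or copied. The job and block-reference model shown here is illustrative, not Commvault's internal format.

  # Conceptual sketch of a DASH full: the new full image is assembled from
  # block references recorded by earlier jobs, so no data blocks are read from
  # the client or moved on disk.
  def dash_full(previous_full, incrementals):
      """Build the block-reference map for a new full without moving data."""
      synthetic = dict(previous_full)       # {object: [block references]}
      for incremental in incrementals:      # overlay newer references in order
          synthetic.update(incremental)
      return synthetic                      # recorded in index files and the DDB

  # Example: one object changed since the last full, one did not.
  full = {"docs/a.doc": [1, 2], "docs/b.doc": [3]}
  incr = [{"docs/a.doc": [1, 4]}]           # block 2 was rewritten as block 4
  print(dash_full(full, incr))              # {'docs/a.doc': [1, 4], 'docs/b.doc': [3]}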

Faster Auxiliary Copy Operations to Disk Storage

DASH Copy operations are optimized auxiliary copy jobs that send only changed blocks to a second disk target. Because secondary copies do not require high bandwidth, this is an ideal solution for sending off-site copies to a secondary disaster recovery facility.

Resilient Indexing and Restorability

Although the Deduplication Database (DDB) checks signature hashes for deduplication purposes, it is not required during restore operations. Instead, the standard indexing methodology is used, which includes the index directory and the index files written at the conclusion of each job. This resiliency ensures that deduplicated data can be restored even during unforeseen events, such as disaster recovery.





The Deduplication Process and Data Protection

The following steps provide a high-level overview of the deduplication process during a data protection job; a simplified sketch follows the list.

  1. Production data is read from the source location and written into a memory buffer. The buffer is filled based on the defined block size; this unit is referred to as a data block and defaults to 128KB.
  2. A signature is then generated on the data block. The signature uniquely represents the bit makeup of the block.
  3. The signature is compared in the DDB to determine if the data block already exists.
  4. If it does exist, the data block in the memory buffer is discarded and a reference (pointer) to the existing block in storage is recorded instead.
  5. If it does not exist, the data block is written to protected storage.
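
The following sketch (Python) walks through these five steps, with a plain dictionary standing in for the DDB and a list standing in for the disk library. The block size, hash algorithm, and data structures are illustrative assumptions, not Commvault internals.

  import hashlib

  BLOCK_SIZE = 128 * 1024     # default 128KB data block

  ddb = {}                    # signature -> location of the block already in storage
  disk_library = []           # stands in for protected storage

  def protect(stream):
      """Deduplicate a source stream block by block, yielding job metadata."""
      while True:
          block = stream.read(BLOCK_SIZE)                  # step 1: fill the buffer
          if not block:
              break
          signature = hashlib.sha512(block).hexdigest()    # step 2: generate a signature
          if signature in ddb:                             # step 3: compare in the DDB
              reference = ddb[signature]                   # step 4: reuse the existing block
          else:
              disk_library.append(block)                   # step 5: write the new block
              reference = len(disk_library) - 1
              ddb[signature] = reference
          yield signature, reference                       # metadata kept with the job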

Deduplication data movement process high-level overview





Components and Terminology

When using the CommCell® Console, several components make up the Commvault® deduplication architecture:

The Global Deduplication Policy – defines the rules for the Deduplication Engine. These rules include:

  • Deduplication Store location and configuration settings
  • The Deduplication Database (DDB) location and configuration settings

A Data Management Storage Policy – is configured like a traditional storage policy and manages subclient associations and retention. Storage policy copies defined within the Data Management policy are associated with Global Deduplication storage policies. This association of a Data Management Storage Policy copy with a Global Deduplication Policy determines in which Deduplication Store the protected data resides.

Deduplication Database (DDB) – is the database that maintains records of all signatures for data blocks in the Deduplication Store.

Deduplication Store – contains the protected storage used by Commvault deduplication. The store is a disk library that holds the unique (non-duplicate) blocks, along with block indexing information, job metadata, and job indexes.

Client – is the production client where data is being protected. The client has a file system and/or an application agent installed. The agent contains the functionality to conduct deduplication operations, such as creating data blocks and generating signatures.

MediaAgent – coordinates signature lookups in the DDB and writes data to protected storage. The signature lookup operation is performed using the DDB on the MediaAgent.
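
The following sketch (Python) models how these components relate to one another. The class and field names are illustrative and do not correspond to Commvault object names.

  from dataclasses import dataclass, field

  @dataclass
  class DeduplicationDatabase:
      location: str
      signatures: dict = field(default_factory=dict)   # signature -> block location

  @dataclass
  class DeduplicationStore:
      disk_library_path: str      # unique blocks, block indexes, job metadata, job indexes

  @dataclass
  class GlobalDeduplicationPolicy:
      ddb: DeduplicationDatabase                        # DDB location and settings
      store: DeduplicationStore                         # store location and settings

  @dataclass
  class StoragePolicyCopy:
      # Associating the copy with a global policy decides which store holds its data.
      global_policy: GlobalDeduplicationPolicy
      retention_days: int

  @dataclass
  class DataManagementStoragePolicy:
      subclient_associations: list      # clients and subclients protected by this policy
      copies: list                      # StoragePolicyCopy instances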

Deduplication Architecture high-level overview:





Content Aware Deduplication

The concept of content aware deduplication is to identify what type of data is being protected and adjust how deduplication is applied. Consider a deduplication appliance that receives data from a backup application: the appliance cannot distinguish the files, databases, or metadata generated by the backup application. Commvault deduplication is integrated into the agents, so it understands what is being protected. Content aware deduplication provides significant space-saving benefits and results in faster backup, restore, and synthetic full backup operations.

Object-Based Content Aware Deduplication

Since most file objects are not evenly divisible by a set block size, such as 128KB, Commvault® deduplication uses a content aware approach to generate signatures. If an object that is 272KB in size is deduplicated, it divides into two full 128KB blocks with a 16KB remainder. In this case, the two 128KB deduplication blocks are hashed and compared.

The remaining 16KB is hashed in its entirety; in other words, Commvault® deduplication does not add more data to the deduplication buffer to fill it out. The result is that if the object containing the three deduplication blocks never changes, all three blocks will always deduplicate against themselves.

The minimum fallback size to deduplicate the trailing block of an object is 4096 bytes (4 KB). Any trailing block smaller than 4096 bytes is protected but is not deduplicated.
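
A short sketch (Python) of the splitting rule described above, assuming the 128KB block size and the 4096-byte trailing-block minimum; the function name and structure are illustrative.

  BLOCK_SIZE = 128 * 1024
  MIN_TRAILING = 4096

  def split_object(data: bytes):
      """Split an object into 128KB blocks plus a trailing remainder block."""
      blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
      # A trailing block smaller than 4096 bytes is still protected,
      # but no signature is generated for it, so it is not deduplicated.
      hashed = [b for b in blocks if len(b) == BLOCK_SIZE or len(b) >= MIN_TRAILING]
      return blocks, hashed

  # A 272KB object yields two 128KB blocks and one 16KB trailing block;
  # all three are hashed and deduplicate against themselves on later backups.
  blocks, hashed = split_object(bytes(272 * 1024))
  print([len(b) for b in blocks])     # [131072, 131072, 16384]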

Database and Log Content Aware Deduplication

Database applications often provide built-in compression, which compresses blocks before Commvault generates signatures on them. This application-level compression can result in inconsistent blocks each time a backup runs, which leads to poor deduplication ratios.

When Commvault compression is used during backups instead of application compression, the application agent can be configured to detect the database backup and generate a signature on the uncompressed data. After the signature has been generated, the block is then compressed, which leads to improved deduplication ratios. By default, Commvault® software always compresses prior to signature generation; an additional setting can be added to the database client to generate the signature prior to compression.
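
The following sketch (Python, using hashlib and zlib as stand-ins for the actual hash and compression implementations) shows the two orderings: the default path compresses before generating the signature, while the optional setting hashes the uncompressed block first.

  import hashlib
  import zlib

  def prepare_block(block: bytes, signature_before_compression: bool = False):
      """Return (signature, payload) for one data block."""
      if signature_before_compression:
          # optional database-client setting: hash the raw block, then compress
          signature = hashlib.sha512(block).hexdigest()
          payload = zlib.compress(block)
      else:
          # default behavior: compress first, then generate the signature
          payload = zlib.compress(block)
          signature = hashlib.sha512(payload).hexdigest()
      return signature, payload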

Log files are constantly changing, with new information added and old information truncated. Since the state of the data is constantly changing, deduplication provides no space-saving benefit. During log backup jobs, the application agent detects the log backup and no signatures are generated. This saves CPU and memory resources on the production system and speeds up backups by eliminating signature lookups in the DDB.
 
Content aware deduplication concept:





Source and Target Side Deduplication

There are two types of deduplication that are performed:

  • Source-side (client-side) deduplication
  • Target-side deduplication

Source-Side Deduplication

Source-side deduplication, also referred to as 'client-side deduplication,' occurs when the client generates signatures on deduplication blocks and sends them to a MediaAgent hosting the DDB. The MediaAgent looks up each signature in the DDB. If the signature is unique, a message is sent back to the client to transmit the block to the MediaAgent, which then writes it to the disk library. The signature is logged in the DDB to signify that the deduplication block is now in storage.

If the signature already exists in the DDB then the block already exists in the disk library. The MediaAgent communicates back to the client agent to discard the block and only send metadata information.
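
The exchange can be sketched as follows (Python); the class and method names are illustrative, and an in-memory set stands in for the DDB.

  import hashlib

  class MediaAgent:
      def __init__(self):
          self.ddb = set()          # signatures of blocks already in the disk library
          self.disk_library = []

      def is_unique(self, signature):
          """DDB lookup: True means the client must transmit the block."""
          return signature not in self.ddb

      def write(self, signature, block):
          self.disk_library.append(block)
          self.ddb.add(signature)

  def client_backup(blocks, media_agent):
      """Source-side: only signatures cross the network for duplicate blocks."""
      sent = 0
      for block in blocks:
          signature = hashlib.sha512(block).hexdigest()   # generated on the client
          if media_agent.is_unique(signature):
              media_agent.write(signature, block)         # unique block is transmitted
              sent += 1
          # duplicate block: discarded on the client, only metadata is sent
      return sent

With Target-side deduplication, by contrast, every block is transmitted to the MediaAgent first and the lookup and discard happen there, as described in the next section.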

Target-Side Deduplication

Target-side deduplication requires all data to be transmitted to the MediaAgent. Signatures are generated on the client or on the MediaAgent. The MediaAgent checks each signature in the DDB. If the signature does not exist, it is registered in the database and the deduplication block is written to the disk library.

If the signature does exist in the DDB, then the block already exists in the library. The deduplication block is discarded and only metadata associated with the block is written to disk.

Source-Side or Target-Side Deduplication?

Commvault® software can be configured to perform deduplication either on the client or on the MediaAgent, but which is best? This depends on several environmental factors, including network bandwidth, client performance, and MediaAgent performance.

Which method is the best?

  • Both Source-side and Target-side deduplication reduce storage requirements.
  • Source-side deduplication also reduces network traffic by transmitting only the deduplication blocks that have changed since the last backup; Target-side deduplication does not.
  • Target-side deduplication can be used to reduce CPU processing on the client by generating signatures on the MediaAgent instead. With Source-side deduplication, the signatures must be generated on the client.
  • For most network-based clients, Source-side deduplication is the preferred method, since it reduces both network and storage requirements.

In certain situations, such as underpowered clients or high-transaction clients like production database servers, Target-side deduplication may be preferable. Keep in mind that if Target-side deduplication is used and the MediaAgent is generating signatures, adequate CPU power is required on the MediaAgent. If the MediaAgent is not scaled properly, performance will suffer.




Copyright © 2021 Commvault | All Rights Reserved.