Commvault
Data Movement of Deduplicated Data
During data protection jobs, processes on the client compress the data (if compression is enabled), fill the deduplication buffer (128KB by default), generate a signature on the buffered block, and then optionally encrypt the block. A simplified sketch of this flow appears below.
Deduplication technical processes during a data protection job:
- JobMgr on the CommServe® server initiates the job.
- CLBackup process uses the Commvault Communications (CVD) service to initiate communication with CVD process on MediaAgent.
- CVD process on MediaAgent launches the SIDB2 process to access the Deduplication Database (DDB).
- SIDB2 process communicates with CommServe server to retrieve deduplication parameters.
- CLBackup process begins processing by buffering data based on the deduplication block factor and generating signatures on each deduplication block.
- Signature is checked in DDB:
- If the signature exists, the primary record counter is increased and the secondary tables are updated with detailed job information for the block. The block metadata is sent to the MediaAgent, but the data block is discarded.
- If the signature does not exist, it is added to the primary table and detailed job information related to the block is added to the secondary table. Block data and metadata are sent to the MediaAgent.
Deduplicated data movement during a data protection job
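Below is a minimal sketch of this client-side flow, written in Python purely for illustration. The 128KB block size comes from the description above; the SHA-256 hash, the SimpleDDB class, and its in-memory dictionaries are assumptions standing in for signature generation and the DDB primary and secondary tables, not actual Commvault internals.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # default deduplication block factor (128KB)

class SimpleDDB:
    """Toy stand-in for the Deduplication Database (DDB)."""
    def __init__(self):
        self.primary = {}    # signature -> reference count (primary table)
        self.secondary = []  # per-block job detail records (secondary tables)

    def lookup_or_insert(self, signature, job_id):
        """Return True if the signature already existed, updating tables either way."""
        exists = signature in self.primary
        if exists:
            self.primary[signature] += 1             # bump primary record counter
        else:
            self.primary[signature] = 1              # register new unique block
        self.secondary.append((job_id, signature))   # job detail used for data aging
        return exists

def backup_stream(data, ddb, job_id):
    """Yield (signature, block_or_None); None means only metadata is sent."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        signature = hashlib.sha256(block).hexdigest()
        if ddb.lookup_or_insert(signature, job_id):
            yield signature, None    # duplicate: metadata only, block is discarded
        else:
            yield signature, block   # unique: block data and metadata are sent
```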
DASH Full Jobs
A read optimized synthetic DASH Full uses the Commvault® deduplication feature to logically perform synthesized full backups without moving any data. This is possible because Commvault deduplication tracks the location of all blocks on disk storage. After the initial base full and subsequent incremental jobs have run, all block data required for the synthetic full is already present in the deduplicated disk storage location. Since deduplication stores a unique block only once in storage, the DASH Full operation simply references the blocks in storage rather than actually copying them. The DASH Full operation generates a new index file signifying that a full backup was run and updates the Deduplication Database (DDB) with block record data that is used for data aging purposes. DASH Full backups are the preferred method of running full backup jobs and can dramatically reduce backup windows.
Note that when enabling Commvault deduplication for a primary copy, the ‘Enable DASH Full’ option is selected by default.
DASH Full process flow
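As a rough illustration of how a full backup can be synthesized from references alone, the hypothetical sketch below builds a new index from the previous full and subsequent incremental jobs and only adds block references in the toy SimpleDDB from the earlier sketch; no block data is read or copied. The dictionary shapes and the function name are assumptions made for illustration only.

```python
# Hypothetical sketch of a read optimized synthetic (DASH) full.
# 'previous_full' and each entry of 'incrementals' are assumed to be dicts
# mapping a file path to the list of block signatures recorded by that job;
# 'ddb' is the SimpleDDB stand-in from the earlier sketch.
def dash_full(previous_full, incrementals, ddb, job_id):
    index = dict(previous_full)               # start from the last full's view
    for incremental in incrementals:          # apply incrementals in order
        index.update(incremental)             # the newest version of each file wins
    for signatures in index.values():
        for signature in signatures:
            ddb.lookup_or_insert(signature, job_id)  # reference existing blocks only
    return index                              # new index representing a full backup
```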
Auxiliary Copy Jobs and Deduplication
An auxiliary copy job is a non-indexed, chunk-level copy operation. Chunks that belong to jobs required to be copied during the auxiliary copy job are flagged. As each chunk is copied successfully to the destination MediaAgent, the flag is removed. This means that if the auxiliary copy fails or is killed for any reason, only the chunks still flagged require copying when the job restarts.
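A simplified sketch of this flag-driven restart behavior follows, assuming each chunk is a plain dictionary with a 'to_copy' flag and that 'copy_chunk' is a caller-supplied transfer function; both names are illustrative, not Commvault APIs.

```python
# Illustrative sketch: only chunks still flagged are copied, so a restarted
# auxiliary copy naturally skips chunks that completed in an earlier attempt.
def run_auxiliary_copy(chunks, copy_chunk):
    for chunk in chunks:
        if not chunk.get("to_copy"):
            continue                  # already copied before the failure or kill
        copy_chunk(chunk)             # copy the chunk to the destination MediaAgent
        chunk["to_copy"] = False      # clear the flag only after a successful copy
```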
DASH Copy Jobs
A DASH Copy is an optimized auxiliary copy operation that only transmits unique blocks from the source library to the destination library. It can be thought of as an intelligent replication, which is ideal for consolidating data from remote sites to a central data center and for copying backups to DR sites.
DASH Copy has several advantages over traditional replication methods:
- DASH Copies are auxiliary copy operations, so they can be scheduled to run at optimal time periods when network bandwidth is readily available. Traditional replication would replicate data blocks as they arrive at the source.
- Not all data on the source disk needs to be copied to the target disk. Using the subclient associations of the secondary copy, only the data required to be copied would be selected. Traditional replication would require all data on the source to be replicated to the destination.
- Different retention values can be set to each copy. Traditional replication would use the same retention settings for both the source and target.
- DASH Copy is more resilient in that if the source disk data becomes corrupt, the target is still aware of all the data blocks it already holds. This means that after the source disk is repopulated with data blocks, only changed blocks are sent to the target, not duplicates. Traditional replication would require the entire replication process to start over if the source data became corrupt.
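The sketch below illustrates the core DASH Copy idea described above: only blocks the destination does not already hold cross the network. The 'destination_ddb' reuses the SimpleDDB stand-in from the earlier sketch, and 'read_block' and 'send' are hypothetical callables, not actual Commvault interfaces.

```python
# Illustrative per-block decision during a DASH Copy. If the destination DDB
# already knows the signature, only metadata is sent and the reference count
# is incremented; otherwise the block data itself is transmitted.
def dash_copy_block(signature, read_block, destination_ddb, job_id, send):
    if destination_ddb.lookup_or_insert(signature, job_id):
        send(signature, None)            # duplicate: metadata only
    else:
        send(signature, read_block())    # unique: transmit the block data
```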
Disk and Network Optimized DASH Copy
Disk optimized, which is the default setting, should always be used when the source library is using Commvault® deduplication. Network optimized should only be used if the source library is not using Commvault deduplication.
Disk optimized DASH Copy extracts signatures from chunk metadata during the auxiliary copy process, which reduces the load on the source disks and the MediaAgent, since blocks do not need to be read back to the MediaAgent to have signatures generated on them.
Network optimized DASH Copy reads all blocks required for the auxiliary copy job back to the MediaAgent, which generates signatures on each block.
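To make the difference concrete, here is a hedged sketch of where the signatures come from in each mode; the chunk layout, field names, and hash choice are assumptions. The point is only that the disk optimized path reuses signatures stored with the chunk metadata, while the network optimized path re-reads and re-hashes every block.

```python
import hashlib

def signatures_disk_optimized(chunk):
    # Signatures were written into the chunk metadata at backup time, so no
    # block reads or hashing are needed on the source MediaAgent.
    return chunk["metadata"]["signatures"]

def signatures_network_optimized(chunk, read_block):
    # Every block is read back and hashed again, costing source disk I/O and
    # MediaAgent CPU, but this works when the source copy is not deduplicated.
    return [hashlib.sha256(read_block(ref)).hexdigest()
            for ref in chunk["block_refs"]]
```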
To schedule an auxiliary copy job as a DASH Copy, first go to the Secondary Copy Properties Deduplication tab and, from the Advanced subtab, select the ‘Enable DASH Copy’ check box and ensure that 'Disk Optimized' is also checked.
Data Movement and Job Checkpoints
During primary data protection and auxiliary copy jobs, the completion of each chunk represents a checkpoint in the job. This checkpoint will do the following:
- Commit the chunk metadata to the CommServe® database.
- Commit signature records to the Deduplication Database (DDB).
These two steps are essential to ensure data integrity. If a job fails or is killed for any reason, committed chunks are reflected in both the CommServe database and the DDB. Any chunks that did not complete are not registered in the CommServe database and their records are not committed to the DDB. This results in two important points:
- No additional block data that generates the same signature will reference a block in an incomplete chunk.
- Once the chunk and signatures are committed, any signatures that match ones from the committed chunk can immediately start deduplicating against the blocks within the chunk.
Another way to look at this is that Commvault® software deduplicates on chunk boundaries. If multiple identical signatures appear in the same chunk, each signature is registered in the DDB and the blocks are written multiple times. Once the chunk is committed, duplicate signatures only increase the record counter on the first occurrence of the signature. All the other duplicate signatures registered in the DDB remain until the job is aged and pruned from storage.
It is also important to note that the chunk data is written as part of the job. Once the chunk is committed, SFiles that make up the chunk are no longer bound to the job since other jobs can reference blocks within the SFile.
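A rough sketch of deduplication on chunk boundaries follows, assuming the DDB is modeled as a simple signature-to-count dictionary that is only updated at the chunk checkpoint; the names and structures are illustrative, not Commvault internals.

```python
# Within an open (uncommitted) chunk, repeated signatures are written again;
# once the chunk checkpoint commits them, later duplicates only increment the
# reference counter of the first committed occurrence.
def write_chunk(blocks_with_signatures, committed_signatures):
    pending = {}       # signatures first seen in this open chunk
    written = []       # blocks physically written as part of this chunk
    for signature, block in blocks_with_signatures:
        if signature in committed_signatures:
            committed_signatures[signature] += 1     # dedupe against committed chunks
        else:
            pending[signature] = pending.get(signature, 0) + 1
            written.append(block)                    # in-chunk duplicates are rewritten
    # Checkpoint: commit chunk metadata and signature records together.
    for signature, count in pending.items():
        committed_signatures[signature] = committed_signatures.get(signature, 0) + count
    return written
```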
DASH Copy process for disk and network optimized auxiliary copy jobs
Source Side Disk Cache
During DASH Copy operations, a source side cache can be enabled on the source MediaAgent to hold all signatures locally for auxiliary copy jobs. When an auxiliary copy job runs, each signature is checked locally in the source cache to determine if the block exists on the destination MediaAgent. Using the source side disk cache is recommended to improve auxiliary copy performance over WAN links.
'Optimize for high latency network' is an optional setting that first checks the local MediaAgent disk cache. If the signature is not found in the local cache, the process assumes the block is unique and sends both the block and the signature to the destination MediaAgent.
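A minimal sketch of the source-side cache decision, assuming the cache is a set of signatures held on the source MediaAgent and that 'destination_has', 'read_block', and 'send' are hypothetical callables; with the high latency option enabled, a cache miss skips the destination lookup entirely.

```python
def dash_copy_with_source_cache(signature, read_block, source_cache,
                                destination_has, send,
                                optimize_for_high_latency=True):
    if signature in source_cache:
        send(signature, None)              # destination is assumed to have the block
        return
    if optimize_for_high_latency or not destination_has(signature):
        send(signature, read_block())      # assume unique: ship block data and signature
    else:
        send(signature, None)              # destination confirmed it holds the block
    source_cache.add(signature)            # remember the signature for later checks
```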
Copyright © 2021 Commvault | All Rights Reserved.