Commvault
Configuring Deduplication - CommCell® Console
Credits:
Great thanks to Mike Byrne for his hard work with the screen captures.
Storage Pool Wizard
The Storage Pool wizard is used to create a new global deduplication policy and associate it with storage. Options in the wizard include the name, library, MediaAgent, number of partitions and partition locations, and the network interfaces used by the MediaAgents. Once the Storage Pool is configured, it is displayed under the Storage Pools entity of the CommCell Browser. It can be expanded to provide a centralized view of the deduplication engine it created as well as the storage it is associated with. The components that make up a Storage Pool can also be viewed from their respective locations. The Global Deduplication Policy created by the wizard is displayed under the Storage Policies entity and is named after the Storage Pool. Advanced deduplication options can be configured from the Properties page of the Global Deduplication Policy.
To display the Storage Pool
1 - Expand Storage Resources | Storage | Storage Pool name.
2 - The deduplication engine is displayed and can be expanded to display its DDB partitions.
3 - The storage is displayed and can be expanded to display the mount paths.
To display the Global Deduplication Policy
1 - Expand Policies | Storage Policies.
2 - Global Deduplication Policies are represented by an icon showing a deduplication block over a globe.
To create a deduplicated storage pool
1 - Right-click Storage Resources | Storage Pools and select Add Storage Pool | Disk.
2 - Provide a name for the Storage Pool.
3 - You can select an existing disk library from the list, or…
4 - …create a new local or network-based library.
5 - Check the box to enable deduplication.
6 - If the disk storage is shared amongst multiple MediaAgents, several deduplication database partitions can be used. Define the number of partitions to use.
7 - Click the first Choose Path link to define the location of the first database partition.
8 - Select the MediaAgent hosting the first DDB partition.
9 - Browse to the dedicated high-performance volume.
10 - Repeat the same procedure to define the location of the additional partitions.
11 - Check the box if encryption is required.
12 - If more than one DDB partition was defined, the library is automatically shared amongst the MediaAgents hosting a partition. If the library also needs to be shared with additional MediaAgents that do not host a partition, create and select a client computer group containing these MediaAgents.
Streams and Deduplication
When implementing multiple MediaAgents and DDBs of different sizes, it is important to properly configure the streams. The number of streams that a MediaAgent or a DDB can receive is based on the hardware of the MediaAgent and the speed of the DDB volume.
There are two options to consider for stream parameters with deduplication:
- MediaAgent maximum number of parallel data transfer operations – This is the maximum number of streams that the MediaAgent can receive. It is configured in the General tab of the MediaAgent's Properties page.
- DDB maximum number of parallel data transfer operations – This is the maximum number of streams that any DDB of the CommCell® environment can receive. This is a governor option that is applied to all DDBs. It is configured in the Resource Manager of the Media Management applet.
Tip: Stream Parameters to Consider with Deduplication
If there are three MediaAgents, one extra-large and two medium, how should streams be adjusted? In this scenario, the DDB governor value should be set to the number of streams that the largest MediaAgent can receive, which in this case is 300 streams. This means that the extra-large MediaAgent's maximum number of streams must also be set to 300.
Lastly, the two medium MediaAgents can be set to 100, preventing them from receiving more streams than their hardware can handle.
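The following is a minimal Python sketch of the stream logic described in this tip. The MediaAgent names and stream capacities (300 and 100) are the hypothetical values from the scenario above, not sizing guidance; the actual limits are set in the CommCell Console as shown in the steps that follow.
```python
# Illustrative sketch (not Commvault code): choosing stream limits when
# MediaAgents of different sizes share deduplicated storage.
# The capacities below are the hypothetical values from the tip above.

mediaagent_capacity = {
    "MA-ExtraLarge": 300,  # streams this MediaAgent's hardware can handle
    "MA-Medium-1": 100,
    "MA-Medium-2": 100,
}

# Each MediaAgent's "maximum number of parallel data transfer operations"
# is simply its own hardware capacity.
per_mediaagent_limit = dict(mediaagent_capacity)

# The DDB governor applies to every DDB in the CommCell environment, so it
# is set to the capacity of the largest MediaAgent; the smaller MediaAgents
# are still protected by their individual limits.
ddb_governor = max(mediaagent_capacity.values())

print(f"DDB maximum parallel data transfer operations: {ddb_governor}")
for ma, limit in per_mediaagent_limit.items():
    print(f"{ma}: maximum {limit} parallel data transfer operations")
```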
For more information on MediaAgent sizing and guidelines, refer to the Commvault Online Documentation.
MediaAgent stream options
1 - Right-click the MediaAgent and choose Properties.
2 - Define the maximum number of parallel streams received by the MediaAgent.
To configure the maximum number of streams for DDB
1 - From the Storage menu | Click the Media Management tool.
2 - Define the maximum number of streams for deduplication databases.
Deduplication Configuration Options
After creating the global deduplication policy with the wizard, additional configuration options can be set.
Deduplication Block Size
The deduplication block size determines the size of a data block that is hashed during data protection operations. The default block size is 128KB and is configurable from 32 – 512 KB. A smaller block size provides a marginally better deduplication ratio but reduces the maximum size of the deduplication store. A higher block size may reduce deduplication ratios but allows the deduplication store to hold more data. The current block size recommendation is 128KB, which provides the best balance for deduplication ratio, performance and scalability.
Deduplication block size is configured in the global deduplication policy properties. All storage policy copies associated with the global deduplication policy always use the block size defined in the global deduplication policy.
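To illustrate why block size affects DDB scalability, here is a small, purely illustrative calculation (not a Commvault sizing tool). It assumes, as a worst case, that every block is unique and that one signature record is kept per block; the 120 TB figure is only an example store size.
```python
# Illustrative arithmetic: approximate number of unique-block records a DDB
# must hold for a given back-end store size.

TB = 1024**4
KB = 1024

def ddb_records(store_size_bytes: int, block_size_bytes: int) -> int:
    """One signature record per unique block; assumes every block is unique
    (worst case), so real counts are lower once deduplication takes effect."""
    return store_size_bytes // block_size_bytes

store = 120 * TB  # example amount of back-end data addressed by one DDB
for block_kb in (32, 128, 512):
    records = ddb_records(store, block_kb * KB)
    print(f"{block_kb:>3} KB blocks -> ~{records:,} signature records")
# Smaller blocks give a marginally better deduplication ratio but several
# times more records to store and look up in the DDB.
```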
To modify the deduplication block size
1 - Expand Storage Policies | Right-click the Global Deduplication Policy | Properties.
2 - Select the required block size from the list.
Block size for large databases
For databases, the recommended block size is from 128KB to 512KB depending on the size of the data, number of full backups, and retention requirements. The logic for a larger block size is based on the fact that initial protection of the database does not yield high deduplication ratios due to the uniqueness of block data, so block size does not have a significant impact on deduplication ratios. However, protecting the database redundantly over time can yield very high deduplication ratios, depending on the application and compression usage.
Block size for large static repositories of binary data
For large data repositories managing binary data types, such as static media repositories, larger deduplication block sizes can provide more scalability. For these data types the potential for duplicate data blocks is minimal. Using a higher block size allows more data to be stored with a smaller Deduplication Database (DDB). In this scenario, the advantage is not storage space savings; rather, using client-side deduplication and DASH Full backups results in block data being processed and backed up only once. This solution works very well for static data types, but if processing such as video editing will be performed on the data, deduplication may not be the best method to protect it.
In highly scalable environments managing large amounts of data using Commvault® solutions, such as the Software Defined Data Services (SDDS) scale-out architecture, block sizes can be increased beyond the CommCell® console limit of 512KB. In this case, contact Commvault Professional Services for guidance in designing a scale-out architecture.
Why is a Higher Block Size Better?
Why does Commvault recommend a 128KB block size? Competitors, who usually sell appliance-based deduplication solutions with steep price tags, use a lower block size, which results in only a marginal gain in space savings considering most data in modern datacenters is quite large. Unfortunately, there are some severe disadvantages to this. First, records for those blocks must be maintained in a database. A smaller block size results in a much larger database, which limits the size of disks that the appliance can support. Commvault software can scale significantly higher, up to 120TB per database.
Even more important is fragmentation. The nature of deduplication and referencing blocks in different areas of a disk leads to data fragmentation. This can significantly decrease restore and auxiliary copy performance. The higher block size recommended by Commvault makes restores and copying data much faster.
The final aspect is price and scalability. With relatively inexpensive enterprise-class disk arrays, you can save significant costs over dedicated deduplication appliances. If you start running out of disk space, simply add more; the space can be added to an existing library, so deduplication is preserved. Considering advanced deduplication features, such as DASH Full, DASH Copy, and SILO tape storage, the Commvault® Deduplication solution is a powerful tool.
Changing Deduplication Block Size
If the block size is changed on a deduplication store containing data, it results in a re-baseline of all data in the store since new signatures being generated do not match existing signatures. If the block size needs to be changed, ensure no jobs are currently running, seal the current Deduplication Database (DDB), and then change the block size.
Deduplication Database and Store Sealing Options
Under normal circumstances it is not recommended to set thresholds to seal the DDB and store, but in certain cases such as using SILO storage, sealing the store can be useful. Thresholds can be set in days, months and/or DDB size. If more than one threshold is used, the DDB is sealed as soon as either one of the thresholds is reached.
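The sketch below illustrates the "seal on the first threshold reached" behavior described above. The function and threshold values are hypothetical, shown only to make the either/or semantics explicit.
```python
# Minimal sketch (hypothetical values and field names): the DDB is sealed as
# soon as ANY configured threshold is reached, not when all of them are.

def should_seal(age_days, age_months, ddb_size_tb,
                max_days=None, max_months=None, max_size_tb=None):
    checks = [
        max_days is not None and age_days >= max_days,
        max_months is not None and age_months >= max_months,
        max_size_tb is not None and ddb_size_tb >= max_size_tb,
    ]
    return any(checks)

# Example: the size threshold is reached before either age threshold.
print(should_seal(age_days=40, age_months=1, ddb_size_tb=5.2,
                  max_days=90, max_months=6, max_size_tb=5))  # True
```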
To configure DDB creation thresholds
1 - Expand the Global Deduplication Policy | Right-click the primary copy | Properties.
2 - Set this option to create a new DDB after a number of days.
3 - Set this option to create a new DDB after the disk library reaches a certain size in TB.
4 - Set this option to create a new DDB after a specific number of months.
Compression
By default, compression is enabled in a global deduplication policy. Data is compressed prior to signature generation, which is optimal for files and virtual machines. For database backups, signatures are generated first and then compression takes place, providing better deduplication ratios for large databases. In some cases where application data is already compressed prior to backup, such as in certain Oracle RMAN configurations, it is best not to use Commvault compression.
The rule of thumb is to use one compression method or the other, but never both. For most data, Commvault compression results in the best deduplication ratios.
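The sketch below contrasts the two orderings described above. The hashing and compression calls are generic stand-ins, not Commvault's internal functions; the point is only that the signature is computed on whatever form of the block is compared in the DDB.
```python
# Conceptual sketch (generic stand-ins, not Commvault internals): the order of
# compression relative to signature generation changes what gets deduplicated.
import hashlib
import zlib

def protect_file_block(block: bytes):
    """Files/VMs: compress first, then hash the compressed block."""
    compressed = zlib.compress(block)
    signature = hashlib.sha512(compressed).hexdigest()
    return signature, compressed

def protect_database_block(block: bytes):
    """Databases: hash the raw block first, then compress it for storage,
    which yields better deduplication ratios for large databases."""
    signature = hashlib.sha512(block).hexdigest()
    compressed = zlib.compress(block)
    return signature, compressed
```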
Tip: Running auxiliary copy jobs to tape:
Compression is enabled by default for the tape data path. Most modern tape drives are advanced enough to attempt compression on a block and determine whether it is beneficial. If the block compresses well, it remains compressed. If not, the drive uncompresses the data on the fly.
To configure compression when using deduplication
1 - Expand the Storage Resources | Deduplication Engine | Right-click the primary copy | Properties.
2 - Check to enable compression when using deduplication.
DDB Priming
DDB priming is used only when a DDB and deduplication store have been sealed and data was protected over a slow WAN link. It works by locating the required blocks in the sealed store and copying them into volume folders in the new store.
This option should not be used for LAN-based backups.
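A conceptual sketch of the priming idea follows; the data structures are hypothetical and stand in for the sealed and new stores.
```python
# Conceptual sketch (hypothetical data structures): DDB priming copies blocks
# that already exist in the sealed store into the new store locally, so a
# client on a slow WAN link does not have to re-send them.

def prime_new_store(required_signatures, sealed_store: dict, new_store: dict):
    """Copy any block found in the sealed store into the new store and
    return the signatures that still have to come from the client."""
    missing = []
    for sig in required_signatures:
        if sig in sealed_store:
            new_store[sig] = sealed_store[sig]  # local disk-to-disk copy
        elif sig not in new_store:
            missing.append(sig)                 # must travel over the WAN
    return missing
```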
To configure DDB priming option
1 - Expand the Storage Resources | Right-click Deduplication Engine | Properties.
2 - Check to enable the DDB priming option.
Do Not Deduplicate Against Objects Older Than Option
This option should only be used when data verification jobs frequently identify bad blocks in the store. This is indicative of issues with the disks themselves, which may need to be replaced. Note that when this option is enabled, re-baseline of specific blocks occurs once they exceed the 'days' threshold.
To define a maximum number of days for deduplication
1 - Expand the Storage Resources | Deduplication Engine | Right-click the primary copy | Properties.
2 - Check the option and set a maximum number of days for deduplication.
Storage Policy Copy Options for Deduplication
Enable DASH Full Backups
Commvault recommends using DASH Full optimized synthetic full jobs for most agents.
To enable DASH Full backups
1 - Right-click the primary copy | Properties.
2 - In the Deduplication tab | Advanced subtab | Check the Enable DASH Full box.
Auxiliary Copy Jobs and Deduplication
An auxiliary copy job is a non-indexed, chunk-level copy operation. Chunks belonging to jobs that must be copied during the auxiliary copy job are flagged. As each chunk is copied successfully to the destination MediaAgent, its flag is removed. This means that if the auxiliary copy fails or is killed for any reason, only the chunks still flagged require copying when the job restarts.
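The following sketch illustrates the flag-and-clear restart behavior described above; the chunk structure and copy callable are hypothetical.
```python
# Conceptual sketch (hypothetical structures): chunks needing copy are flagged,
# and the flag is cleared only after a successful copy, so a restarted
# auxiliary copy job picks up exactly the chunks that are still flagged.

def run_auxiliary_copy(chunks, copy_chunk):
    """chunks: dict mapping chunk_id -> flagged (bool).
    copy_chunk: callable that copies one chunk and may raise on failure."""
    for chunk_id, flagged in chunks.items():
        if not flagged:
            continue                 # already copied in an earlier attempt
        copy_chunk(chunk_id)         # copy to the destination MediaAgent
        chunks[chunk_id] = False     # clear the flag only after success
```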
DASH Copy Jobs
A DASH Copy is an optimized auxiliary copy operation that only transmits unique blocks from the source library to the destination library. It can be thought of as intelligent replication, which is ideal for consolidating data from remote sites to a central data center and for copying backups to DR sites.
DASH Copy has several advantages over traditional replication methods:
- DASH Copies are auxiliary copy operations, so they can be scheduled to run at optimal time periods when network bandwidth is readily available. Traditional replication replicates data blocks as they arrive at the source.
- Not all data on the source disk needs to be copied to the target disk. Using the subclient associations of the secondary copy, only the data required to be copied is selected. Traditional replication requires all data on the source to be replicated to the destination.
- Different retention values can be set for each copy. Traditional replication uses the same retention settings for both the source and target.
- DASH Copy is more resilient in that, if the source disk data becomes corrupt, the target is still aware of all data blocks existing on its disk. This means that after the source disk is repopulated with data blocks, duplicate blocks are not sent to the target, only changed blocks. Traditional replication would require the entire replication process to start over if the source data became corrupt.
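As a rough illustration of how only unique blocks travel across the network, the hypothetical sketch below checks each source signature against the destination before deciding whether to send the block.
```python
# Conceptual sketch (hypothetical functions): a DASH Copy transmits only the
# blocks the destination does not already hold; for blocks it does hold, only
# a reference (the signature) needs to be recorded on the destination.

def dash_copy(source_blocks: dict, destination_has_signature) -> dict:
    """source_blocks: signature -> block data for the jobs being copied.
    destination_has_signature: callable answering whether the destination
    DDB already knows a signature. Returns only the blocks to transmit."""
    to_send = {}
    for signature, block in source_blocks.items():
        if not destination_has_signature(signature):
            to_send[signature] = block   # unique block: send over the network
        # else: destination already has it; only a reference is added there
    return to_send
```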
Disk and Network Optimized DASH Copy
Disk Optimized DASH Copy
Disk optimized, which is the default setting, should always be used when the source library is using Commvault deduplication.
Disk optimized DASH Copy extracts signatures from chunk metadata during the auxiliary copy process, which reduces the load on the source disks and the MediaAgent since blocks do not need to be read back to the MediaAgent and signatures generated on the blocks.
Network Optimized DASH Copy
Network optimized should only be used if the source library is not using Commvault deduplication. Network optimized DASH Copy reads all blocks required for the auxiliary copy job back to the MediaAgent, which generates signatures on each block.
DASH Copy operation using network and disk optimized
DASH Copy processes for disk and network optimized auxiliary copy jobs
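The difference between the two modes can be summarized in a short, hypothetical sketch: only the origin of the source-side signature changes. The hashing call is a generic stand-in, not Commvault's internal function.
```python
# Conceptual sketch (generic stand-ins): disk optimized and network optimized
# DASH Copies differ only in where the source-side signature comes from.
import hashlib

def signature_disk_optimized(chunk_metadata: dict, block_id: str) -> str:
    # Disk optimized: the signature was stored in chunk metadata when the
    # block was first written, so the block itself is never read back.
    return chunk_metadata[block_id]

def signature_network_optimized(read_block) -> str:
    # Network optimized: the block must be read back from the source library
    # and hashed, which adds load on the source disks and MediaAgent.
    return hashlib.sha512(read_block()).hexdigest()
```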
Enable DASH Copy
Right-click the secondary deduplicated copy | Click Properties | Deduplication tab | Advanced subtab
To schedule an auxiliary copy job as a DASH Copy, first go to the Secondary Copy Properties Deduplication tab and, from the Advanced sub tab, select the 'Enable DASH Copy' checkbox and ensure that 'Disk Optimized' is also checked.
Running a DASH Copy:
1 - Right-click the Storage Policy.
2 - Select All Tasks and then Run Auxiliary Copy.
3 - The auxiliary copy can be run immediately, scheduled, or set to run at automatic time intervals.
To set DASH Copy optimization
1 - Right-click the deduplicated secondary copy | Properties.
2 - Check to enable DASH Copy instead of a traditional auxiliary copy; this is the default when using Commvault® deduplication.
3 - Disk Read Optimized extracts signatures already stored in chunk metadata.
4 - Network Optimized generates signatures by reading and hashing blocks in chunk files.
Enable Source Side Disk Cache
Right-click the secondary deduplicated copy | Click Properties | Deduplication tab | Advanced subtab
During DASH Copy operations, a source-side cache can be enabled on the source MediaAgent to hold signatures locally for auxiliary copy jobs. When an auxiliary copy job runs, each signature is first checked against the local source cache to determine whether the block already exists on the destination MediaAgent. Only if a signature is not found in the local cache is the destination MediaAgent queried. Using the source-side disk cache is recommended to improve auxiliary copy performance over WAN links.
When using this option, the Job Results directory of the source MediaAgent can be moved to a fast disk volume for optimal performance.
It can be enabled for individual clients or for all clients associated with a specific storage policy.
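The lookup order described above can be sketched as follows; the cache structure and destination query are hypothetical placeholders.
```python
# Conceptual sketch (hypothetical structures): with a source-side disk cache,
# each signature is checked locally first; only cache misses generate a query
# to the destination MediaAgent over the WAN.

def block_needed_on_destination(signature, source_cache: set,
                                query_destination) -> bool:
    if signature in source_cache:
        return False                        # known to exist on the destination
    exists = query_destination(signature)   # WAN round trip only on a miss
    if exists:
        source_cache.add(signature)         # remember it for later lookups
    return not exists
```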
To enable source side disk cache
1 - Right-click the deduplicated secondary copy | Properties.
2 - Check this option to create a small cache used for initial lookups on the source MediaAgent, before querying the destination MediaAgent.
3 - Set a size limit for the source side cache.
Enable Source-Side Disk Cache from the Storage Policy Primary Copy
Source-side disk cache is useful when performing backups over WAN links. It can be enabled for individual clients or for all clients associated with a specific storage policy.
To enable source-side disk cache from the primary copy
1 - Right-click primary copy | Properties.
2 - In the Deduplication tab | Advanced subtab, check the ‘Enable source side disk cache’ option and set the cache size.
Copyright © 2021 Commvault | All Rights Reserved.