Commvault
Indexing Overview
- Carl Brault
- tdopko@commvault.com
Commvault® software uses a distributed indexing structure that provides enterprise-level scalability and automated index management. The CommServe database retains only job-based metadata, such as chunk information, which keeps the database relatively small. Detailed index information, such as the details of protected objects, is kept on the MediaAgent managing the job. When using Commvault deduplication, block and metadata indexes are maintained within volume folders in the disk library.
Commvault multi-tiered indexing structure
Job summary data maintained in the CommServe database keeps track of all data chunks being written to media. As each chunk completes, it is logged in the CommServe database. This information also tracks the media used to store the chunks.
Commvault® Version 11 introduces the new V2 indexing model, which has significant benefits over its predecessor. MediaAgents can host both V1 and V2 indexes in the index directory. The primary differences between these two indexing models, relative to index directory sizing, are as follows:
- V1 indexes are pruned from the directory based on the retention days and index cleanup percentage settings in the MediaAgent Catalog tab.
- V2 indexes are persistent and not pruned from the index directory unless the backup set associated with the V2 index is deleted.
Indexed and Non-Indexed Jobs
Commvault® software defines data protection jobs as indexed or non-indexed job types. Indexes are used when data protection jobs require indexing information for granular-level recovery. Non-indexed jobs are database jobs where recovery is performed only at the database level. Index-based operations require access to the index directory to create or update index files. Non-indexed jobs do not require index directory access, since those backup jobs use the CommServe database to update job summary information.
Index-Based Jobs:
- File system backup and archive operations
- Exchange mailbox level backup and archive operations
- SharePoint document level backup and archive operations
Non-Indexed Jobs:
- Database jobs protected at the database level
- Note that some database agents, including Oracle, as well as Exchange block-level backups, do use indexes
Traditional Indexing (V1)
Job summary data maintained in the CommServe database keeps track of all data chunks being written to media. As each chunk completes, it is logged in the CommServe database. This information also records the media on which the job was written, which can be used when recalling off-site media for restores. This data is held in the database for as long as the job exists, meaning that even if the data has exceeded defined retention rules, the summary information remains in the database until the job has been overwritten. The option to browse aged data can be used to browse and recover data on media that has exceeded retention but has not been overwritten.
Detailed index information for jobs is maintained in the MediaAgent's index directory. This information contains:
- Each protected object
- Which chunk the data is in
- The chunk offset defining the exact location of the data within the chunk
The index files are stored in the index directory, and after the data is protected to media, an archive index operation writes the index to media. This method automatically protects the index. The archived index is also used if the index directory is not available, when restoring data at alternate locations, or if the indexes have been pruned from the index directory.
One major distinction between Commvault® software and other backup products is that Commvault uses a distributed self-protecting index structure. The modular nature of the indexes allows the small index files to automatically be copied to media at the conclusion of data protection jobs.
Indexing Operations
The following steps provide a high-level overview of indexing operations during data protection and recovery operations.
Data Protection Operation and Indexing Processes
- A new data protection operation is initiated:
- A full backup generates a new index.
- An incremental or differential appends to an existing index.
- The index is located (incremental / differential) or a new index file is created (full) and the job begins.
- After each successful chunk is written to media:
- The chunk is logged in the CommServe SQL database.
- The index directory is updated.
- Once the protection phase of the job is completed:
- The index is finalized.
- The index file in the index directory is copied to media to automatically protect the index files.
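The protection-time steps above can be sketched in Python. This is an illustrative model only; the names (`run_backup`, the record layout) are hypothetical, not Commvault APIs.

```python
def run_backup(job_type, existing_index, chunks, commserve_log, media):
    """Model the V1 index handling around a data protection job."""
    # A full backup creates a new index; incremental/differential append.
    index = [] if job_type == "full" else list(existing_index)

    for chunk_id, objects in enumerate(chunks):
        # After each chunk is written, job summary goes to the CommServe DB...
        commserve_log.append({"chunk": chunk_id, "objects": len(objects)})
        # ...and detailed object records go to the index directory.
        for offset, obj in enumerate(objects):
            index.append({"object": obj, "chunk": chunk_id, "offset": offset})

    # Once the protection phase completes, the finalized index is
    # copied to media so the index is self-protecting.
    media.append(("archived_index", list(index)))
    return index
```

The key points modeled here are that the CommServe database sees only per-chunk summaries, while per-object detail stays with the index, which is archived to media at job completion.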
Data Recovery Operation and Indexing Process
- A browse or find operation is initiated. Restore by job operations do not use the index directory.
- The index file is accessed or retrieved:
- If the index is in the index directory it is accessed, and the operation continues.
- If the index is not in the index directory, it is automatically retrieved from media.
Backup and recovery process using V1 indexing
If the media is not in the library, the system prompts you to place the media in the library.
During a browse operation, if it is known that the media is not in the library, use the 'List Media' button to determine which media is required for the browse operation.
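The retrieval logic above reduces to a simple fallback, sketched below with dictionaries standing in for the index directory and the archived copies on media (all names hypothetical):

```python
def get_index(backup_set, index_directory, media_archive):
    """Return the index for a browse/find, restoring it if pruned."""
    if backup_set in index_directory:
        return index_directory[backup_set]       # fast path: index on disk
    # Index was pruned from the directory: restore the archived copy
    # from media into the index directory, then use it.
    index_directory[backup_set] = media_archive[backup_set]
    return index_directory[backup_set]
```

Note that the restored copy is cached back into the index directory, so a repeated browse does not touch media again.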
Self-Maintaining Indexing Structure
The index directory is self-maintaining based on two configurable parameters, 'Index Retention Time in Days' and 'Index Cleanup Percent'. Index files are kept in the index directory for a default of 15 days or until the directory disk reaches 90% capacity. A smaller index directory may result in index files being pruned before the 15-day period expires, if the cleanup percentage is reached first. Index files are pruned from the directory on a least-recently-accessed basis.
It is important to note that the 'Time in Days' and 'Index Cleanup Percent' settings use OR logic to determine how long indexes are maintained in the index directory. If either of these criteria is met, index files are pruned from the directory. When files are pruned, they are deleted based on access time, deleting the least recently accessed files first. This means that older index files that have been accessed recently may be kept in the directory, while newer index files that have not been accessed are deleted.
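A minimal sketch of this OR-logic pruning, assuming each index file is tracked as a (name, last-access time, size) tuple; the function and parameter names are illustrative, not Commvault internals:

```python
def prune_indexes(files, retention_days, cleanup_percent, disk_capacity, now):
    """Return names of index files to prune.

    OR logic: files older than retention_days go first; if disk usage is
    still above cleanup_percent, evict least-recently-accessed files.
    files: list of (name, last_access_epoch, size_bytes) tuples.
    """
    cutoff = now - retention_days * 86400
    # Criterion 1: anything not accessed within the retention window.
    pruned = [name for name, atime, _ in files if atime < cutoff]
    keep = [f for f in files if f[1] >= cutoff]
    # Criterion 2: while over the cleanup threshold, evict the
    # least recently accessed file.
    keep.sort(key=lambda f: f[1])                # oldest access first
    used = sum(size for _, _, size in keep)
    limit = disk_capacity * cleanup_percent / 100
    while keep and used > limit:
        name, _, size = keep.pop(0)
        pruned.append(name)
        used -= size
    return pruned
```

As in the text, a recently accessed old index can survive while a never-accessed newer one is evicted, because eviction order follows access time, not creation time.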
Indexing Service
The Indexing Service process on the MediaAgent is responsible for cleaning up the index directory location. This service runs every 24 hours. Any indexes older than 15 days are pruned from the index directory. If the directory location is above the 90% space threshold, additional index files are pruned.
V1 index cleanup process
V2 Indexing
Commvault® Version 11 introduces next-generation indexing, called V2 indexing. It provides improved performance and resiliency while shrinking the size of index files both in the index directory and in storage.
V2 indexing works by using a persistent index database maintained at the backup set level. During subclient data protection jobs, log files are generated with all protected objects and placed into the index database.
V2 indexing high level overview
Index Process for Data Protection Jobs
Indexing data is located in a persistent index database. One index database maintains records for all objects within a backup set, so all subclients within the same backup set write to the same index database. The database is created and maintained on the MediaAgent once the initial protection job of a subclient within the backup set completes. Index databases are located in the index directory on the MediaAgent.
During data protection jobs, log files are generated with records of protected objects. The maximum size of a log is 10,000 objects or one complete chunk. Once a log is filled or a new chunk is started, a new log file is created, and the closed log is written to the index database. By writing index logs to the database while the job is still running, the indexing operations run independently of the job itself, allowing a job to complete even if log operations are still committing information to the database.
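The log rotation rule (a log closes at 10,000 objects or at a chunk boundary) can be sketched as a generator; the names here are illustrative only:

```python
MAX_LOG_OBJECTS = 10_000   # V2 log size limit described above

def generate_index_logs(chunks, max_objects=MAX_LOG_OBJECTS):
    """Yield closed index logs as a job streams chunks of objects."""
    for chunk_id, objects in enumerate(chunks):
        log = []
        for obj in objects:
            log.append((chunk_id, obj))
            if len(log) == max_objects:
                yield log                 # full log: commit to the index DB
                log = []
        if log:
            yield log                     # chunk boundary closes the log
```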
At the conclusion of each job, the log files are written to storage along with the job. This is an important distinction from traditional indexing, which copies the entire index to storage. By copying just logs to storage, indexes require significantly less space in storage, which is a benefit when protecting large file servers. Since the index database is not copied to storage at the conclusion of each job, a special IndexBackup subclient is used to protect index databases.
V2 index process for data protection jobs
Index Checkpoint and Backup Process
During data protection jobs, logs are committed to the index database and are also kept in the index directory. If an index database is lost or becomes corrupt, a backup copy of the index database is restored from media and the log files in the index directory are replayed to the database. If the index directory location is lost, the database and logs are restored from media and the logs are replayed into the database. These recovery methods provide complete resiliency for index recovery.
The index databases are protected with system-created subclients, which are displayed under the Index Servers computer group in the CommCell® browser. An index server instance is created for each storage policy. An index backup operation is scheduled to run every twenty-four hours. During the backup operation, index databases are checked to determine whether they qualify for protection. The two primary criteria that determine whether a database qualifies are one million changes or seven days since the last backup.
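The qualification test is OR logic over the two criteria; this helper is a hypothetical illustration, not a Commvault API:

```python
def qualifies_for_index_backup(changes_since_backup, days_since_backup):
    """True if the index database should be checkpointed and backed up."""
    return changes_since_backup >= 1_000_000 or days_since_backup >= 7
```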
To access the index backup subclient properties
1 - Expand Client Computer Groups | The Storage Policy pseudo client | Big Data Apps | classicIndexInstance | Right-click the default subclient.
2 - The description field confirms that this is an index backup subclient.
To edit the index backup schedule
1 - Expand Policies | Schedule Policies | Right-click the System Created for IndexBackup subclients schedule policy | Edit.
2 - The description field confirms that this is the schedule policy used for Index backups.
3 - Highlight the schedule and click Edit.
4 - By default, the index backups are scheduled to run three times a day, but this can be modified as needed.
5 - Once modified, click OK to apply changes.
If the index database qualifies, three actions occur:
- A database checkpoint is taken
- The database is compacted
- The database is backed up to the storage policy associated with the index server subclient
Database Checkpoint
Checkpoints are used to indicate the point in time at which a database was backed up. Once the database is protected to storage, any logs older than the checkpoint can be deleted from the index directory.
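Assuming each log carries a timestamp, the post-checkpoint cleanup reduces to a filter (hypothetical names):

```python
def cleanup_after_checkpoint(index_dir_logs, checkpoint_time):
    """Keep only logs at or after the checkpoint; older logs are
    already covered by the database backup and can be deleted."""
    return [log for log in index_dir_logs if log["time"] >= checkpoint_time]
```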
Database Compaction
During data aging operations, deleted jobs are marked in the database as unrecoverable, but objects associated with the job remain in the database. The compaction operation deletes all aged objects and compacts the database.
Database Backup
Once the checkpoint and compaction occur, the database is backed up to the primary copy location of the storage policy. Three copies of the database are kept in storage and normal storage policy retention rules are ignored.
During the index backup process, the database is frozen and Browse or Find operations cannot be run against it. Each database that qualifies for backup is protected sequentially, minimizing the freeze time. Data protection jobs are not affected by the index backup.
V2 indexing checkpoint and backup process
Index Database Recovery Process
If an index database is lost or corrupt, or if the entire index directory location is lost, indexes are automatically recovered.
The index recovery process works as follows:
- The index database is restored from storage.
- If index logs in the index directory are more recent than the restored index database checkpoint, they are automatically replayed into the index database.
- If index logs are not in the index directory location, the logs are restored from storage and replayed into the index database.
V2 index recovery process
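The recovery sequence above can be modeled as restore-then-replay; dictionaries stand in for the database, checkpoint, and logs, and all names are illustrative:

```python
def recover_index_db(index_dir_logs, storage):
    """Restore the index DB from storage and replay newer logs.

    storage: {"db": {...}, "checkpoint": t, "logs": [...]}
    index_dir_logs: logs still present in the index directory (may be empty
    if the whole index directory was lost).
    """
    db = dict(storage["db"])                       # restored backup copy
    checkpoint = storage["checkpoint"]
    # Prefer logs in the index directory; fall back to logs in storage.
    logs = index_dir_logs if index_dir_logs else storage["logs"]
    for log in logs:
        if log["time"] > checkpoint:               # replay only newer logs
            db.update(log["records"])
    return db
```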
Index Process Using Multiple MediaAgents
When multiple MediaAgents are configured to use a shared library, the MediaAgent used for the first protection job of a backup set is designated as the database-hosting MediaAgent. During subsequent operations, if another MediaAgent is designated as the data mover, it does not copy the database to its local index directory. Instead, the data mover MediaAgent generates logs and ships them to the database-hosting MediaAgent, where they are committed to the index database. If the hosting MediaAgent is not available, data protection operations continue uninterrupted. Once the hosting MediaAgent is back online, the logs are shipped and committed to the index database.
V2 indexing process using multiple MediaAgents
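A sketch of this log-shipping behavior, with a local queue on the data mover absorbing logs while the hosting MediaAgent is offline (names hypothetical):

```python
def ship_logs(new_logs, pending_queue, index_db, hosting_ma_online):
    """Queue logs on the data mover; commit when the host is reachable."""
    pending_queue.extend(new_logs)
    if hosting_ma_online:
        index_db.extend(pending_queue)   # commit all queued logs in order
        pending_queue.clear()
```

Because shipping is decoupled from the backup itself, a host outage delays index commits but never the protection job.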
Index Directory Configuration
Right-click the MediaAgent | Click Properties | Catalog tab
The index directory can manage both V1 and V2 indexes. It should be located on dedicated high-speed disks, preferably solid-state drives. Index directory performance is critical when streaming a high number of jobs to a MediaAgent and when conducting DASH full operations. When MediaAgent software is first deployed to a server, the index directory is located on the system drive. It is recommended to change the location to a dedicated drive before any jobs run.
To access the Index Directory configuration
1 - Right-click the MediaAgent hosting the Index Directory | Properties.
2 - Click to change the location of the Index Directory.
3 - Set space thresholds – the days and percentage settings apply to V1 indexing clients only.
Changing the Index Directory Location
Right-click the MediaAgent | Click Properties | Catalog tab
The index directory location can be changed using the 'Index Directory' field in the Catalog tab of the MediaAgent properties. By default, the index directory is located in the Commvault® software installation folder. It is recommended to move it to a dedicated set of SSD disks.
To define the Index Directory location
1 - Right-click the MediaAgent | Properties.
2 - Click to change the location of the Index Directory.
3 - Browse to the location and click OK.
Index Directory Sizing Guidelines
The index directory can contain both V1 and V2 indexes. V2 indexes are persistent within the directory, whereas V1 indexes are pruned from the directory based on the days and disk usage settings. Over time, when using long-term retention, the index directory grows. It is important to note that the V2 indexing footprint is considerably smaller than that of V1 indexes.
It is recommended to configure alerts to notify administrators if the index directory is running low on space.
V1 Index Sizing Guidelines
The index directory should be sized based on how far back in time you need to browse for data to be recovered. The farther back in time you need to browse, the larger the index directory should be. If the index directory is undersized, index files are pruned sooner to maintain the default 90% disk capacity threshold. When you attempt a Browse or Find operation and the index file is not in the directory, it is automatically restored from media. If the index file is on magnetic storage, there is a short delay in recovering the index, but if it is on removable media, the time to recover the index can be much longer.
There are basic guidelines for sizing the index directory for V1 indexes:
- Job retention – Once a job ages and is pruned, all corresponding index files in the index directory are also deleted.
- Days Retention – Regardless of how long the job is being retained for, once the days retention time expires the indexes are deleted from the index directory.
- Index Cleanup Percent – Regardless of how long the job is being retained for, if disk usage reaches the 'Index Cleanup Percent' threshold, indexes are deleted from the index directory.
To properly size the index directory for V1 indexes, consider the following:
- The index file size is based on the number of objects being protected. Estimate 150 bytes per object; the more objects you protect, the larger the index files will be.
- Each subclient contains its own index files within the directory.
- The index directory should be on a dedicated physical disk with no other data being written to the disk.
- To reduce the probability of pulling an index file back from media, use a large index directory location.
Since the indexes are automatically written to media, if the index directory does not contain the correct index, it is read from media and restored to the index directory when needed. This may result in a delay before browse results are displayed. This delay is more noticeable when index files are on tape media.
As a general best practice Commvault recommends sizing the index directory location to be approximately 4% of the estimated size of all data being protected by the MediaAgent. However, the index size is determined by the number of objects being protected and not the total size of the data.
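The two guidelines above (150 bytes per V1 index object, and roughly 4% of all protected data for the directory as a whole) can be combined into a quick estimator. The figures come from the text; the function names are made up for illustration:

```python
def v1_index_size_bytes(object_count, bytes_per_object=150):
    """Estimate V1 index file size from the number of protected objects."""
    return object_count * bytes_per_object

def index_dir_size_bytes(total_protected_bytes, fraction=0.04):
    """Rule-of-thumb directory size: ~4% of all data the MediaAgent protects."""
    return int(total_protected_bytes * fraction)
```

For example, protecting 10 million objects implies roughly 1.5 GB of V1 index files, regardless of how large those objects are.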
Copyright © 2021 Commvault | All Rights Reserved.