Commvault

Troubleshooting Overview



A backup solution interacts with the entire IT infrastructure and is often more difficult to troubleshoot than other systems. If backup performance issues arise, daily tasks can prove challenging when the source of each problem is elusive.


Commvault® software integrates troubleshooting processes that let you effectively manage day-to-day operations. But to troubleshoot Commvault software, it is important to understand the CommCell® components that integrate with the applications and hardware to ensure a successful backup and recovery.


Troubleshooting Advice

When troubleshooting an incident, follow a sequence of steps each time an error occurs. This orderly approach will lead you closer to a solution when similar symptoms arise.

Gather Information


Collecting information is an important task when troubleshooting errors. Here are important details to guide you through the process:
  • Time and description of problem – Note when the error first occurred; the time is a key identifier for tracking down relevant information. Describe the problem both in general terms and in specifics as reported by the software. This includes the exact text of the error that alerted you to the problem.
  • Pertinent Job details – Note the job type, job ID, start time, phase, state, and any other pertinent information. Specific details, along with the time and problem description, help narrow the probable source of the problem.
  • Applicable Events – Find any and all events that may be applicable to the error.
  • Involved Resources/components – Identify all involved hardware and software components.
  • Relevant Log entries – Collect all log entries relevant to the job/task that had problems.
  • Other concurrent activity – Look at events/status from other jobs that were running at the same time.
  • Environment – Look for any external factors (e.g., storm, power failure, or other system issues) that may have contributed to the problem.
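
The checklist above can be kept as a structured record so that no item is forgotten. The following is a minimal sketch in Python, not part of Commvault software; the class name IncidentRecord and all of its field names are hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentRecord:
    """Hypothetical container for the details gathered about one error."""
    first_seen: datetime          # when the error first occurred
    description: str              # the problem in general terms
    error_text: str               # exact text reported by the software
    job_details: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    components: list = field(default_factory=list)
    log_entries: list = field(default_factory=list)
    concurrent_activity: list = field(default_factory=list)
    environment_notes: list = field(default_factory=list)

    def missing_items(self):
        """Return the names of checklist items that are still empty."""
        checked = ["job_details", "events", "components", "log_entries",
                   "concurrent_activity", "environment_notes"]
        return [name for name in checked if not getattr(self, name)]
```

Calling missing_items() before moving on shows which parts of the checklist have not been collected yet.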

Narrow Probable Source

Consider the following questions to sharpen your focus on the cause of the problem:
Is this a known problem? – Check your documentation, the product knowledge base, product release notes, and patches.


Is this a pattern/recurring problem? – Does this error occur repeatedly? Does it occur at the same time? In the same place? Is there a pattern? What is common to each occurrence? Elusive, sporadic errors are the most difficult to isolate.


Has it worked before? – Look at the job history. If the operation/procedure has worked successfully before, then something has changed to make it fail.  If it has never worked before, the problem is most likely in the installation or configuration of one of the components involved.


What's changed since it last worked? – Have you changed security (e.g., password/permissions); paths (IP address, hostname, data locations, MediaAgent, library, and media); job characteristics (content, filters, storage policy, encryption, compression, etc.)? If the working environment has changed since the last successful job, that change is the probable source of the problem. If not, the problem may stem from an unknown change or from a component fault.


Do other jobs work from/to the client? – If this problem has to do with data protection/restore, do similar jobs to/from the client work? If not, the client is the probable source of the problem. If other jobs do work, then the problem may be an agent or subclient, or it may be somewhere outside the client.


Do other jobs from/to that MediaAgent work? – If this problem involves a MediaAgent, do other similar jobs to/from that MediaAgent work? If not, the MediaAgent is the probable source of the problem. If other jobs do work, then the problem may be with a particular drive or media, or it may be somewhere outside the MediaAgent.
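
The questions above amount to a small decision procedure. The sketch below, in Python, only restates that reasoning; the function name narrow_probable_source and its parameters are hypothetical, and the answers still have to come from your own investigation.

```python
def narrow_probable_source(worked_before, known_change,
                           client_jobs_ok=None, mediaagent_jobs_ok=None):
    """Map answers to the questions above onto a list of probable sources.

    client_jobs_ok / mediaagent_jobs_ok: True, False, or None when the
    question does not apply to this incident.
    """
    suspects = []

    # Has it worked before, and what has changed since?
    if not worked_before:
        suspects.append("installation or configuration")
    elif known_change:
        suspects.append("known change since last successful job")
    else:
        suspects.append("unknown change or component fault")

    # Do other jobs work from/to the client?
    if client_jobs_ok is False:
        suspects.append("client")
    elif client_jobs_ok is True:
        suspects.append("agent, subclient, or outside the client")

    # Do other jobs from/to that MediaAgent work?
    if mediaagent_jobs_ok is False:
        suspects.append("MediaAgent")
    elif mediaagent_jobs_ok is True:
        suspects.append("drive, media, or outside the MediaAgent")

    return suspects
```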

Eliminate Simple Causes

  • Always check the simple causes first.  These usually take the least amount of time. 
  • Check that all devices have power and are in a ready operational state.
  • Check that all required services are started and running correctly.
  • Check cabling, connectors, controls, zoning, masking, and ports.  Can all devices see each other properly?
  • Check that both forward and reverse host names are resolved correctly.
  • Check that all drivers and firmware are up to date and compatible with each other.
  • Check OS and application access.
  • Check that sufficient spare media is present and that the spare media is in good condition.
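
The forward and reverse name resolution check in the list above can be scripted. The following Python sketch uses only the standard library socket module; the function name check_name_resolution is hypothetical, and the comparison is deliberately strict (a short name will not match a fully qualified name), which is an assumption you may need to relax for your environment.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, then resolve that IP back to names.

    Returns (ip, names, consistent). The resolver functions are
    parameters so the check can be exercised without a live DNS server.
    """
    try:
        ip = forward(hostname)
    except OSError:
        # Forward lookup failed (socket.gaierror is a subclass of OSError).
        return None, [], False
    try:
        primary, aliases, _addresses = reverse(ip)
    except OSError:
        # Reverse lookup failed (socket.herror is a subclass of OSError).
        return ip, [], False

    names = [primary] + list(aliases)
    wanted = hostname.lower().rstrip(".")
    consistent = any(n.lower().rstrip(".") == wanted for n in names)
    return ip, names, consistent
```

Run the check from each machine involved in the job, since the hosts may use different DNS servers or hosts files and can therefore return different results.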


Apply One Solution at a Time

Changing multiple settings/conditions at the same time may fix the problem, but it will not enable you to determine the cause and take preventive action against future errors of that nature.


Always record the current state. In some cases, an applied solution may cause additional errors. Without a record, it becomes a battle to get back to your previous state and your original problem.


Try the most probable solution first. If this is not an easy solution, you may want to try other solutions that are easier or less disruptive first in order to eliminate them.


Always record the effect of a solution. Even if it does not solve the problem, it may have either a positive or a negative effect on the application. If new errors are generated, you can then associate them with the current state.


If applicable, back out each solution before trying the next. Once all the simple causes have been eliminated and any software changes have been identified and addressed, an error will rarely be the result of multiple concurrent software conditions.
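
The discipline described in this section (record the current state, apply one change, record its effect, back it out if needed) can be illustrated with a small Python sketch. The ChangeLog class and the setting names in the usage note are hypothetical; they do not correspond to any Commvault interface.

```python
_MISSING = object()  # marks a setting that did not exist before the change


class ChangeLog:
    """Hypothetical helper: one change at a time, with the prior state kept."""

    def __init__(self, settings):
        self.settings = dict(settings)
        self.history = []

    def apply(self, key, new_value):
        # Refuse a second change until the effect of the first is recorded.
        if self.history and self.history[-1]["effect"] is None:
            raise RuntimeError("record the effect of the previous change first")
        old = self.settings.get(key, _MISSING)
        self.history.append({"key": key, "old": old, "new": new_value,
                             "effect": None, "backed_out": False})
        self.settings[key] = new_value

    def record_effect(self, effect):
        self.history[-1]["effect"] = effect

    def back_out(self):
        """Restore the state that existed before the most recent change."""
        entry = self.history[-1]
        if entry["old"] is _MISSING:
            self.settings.pop(entry["key"], None)
        else:
            self.settings[entry["key"]] = entry["old"]
        entry["backed_out"] = True
```

For example, after apply("compression", "off") and record_effect("job still fails"), calling back_out() returns the settings to their recorded state while the history keeps what was tried and what happened.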

Document

Documentation records the history of errors, troubleshooting efforts, and solutions. Some errors can only be resolved by recognizing patterns. For example, backups run successfully for a client except on the first Friday of each month. This is due to another application running a particular process on that day, which causes port conflicts. The offending process may not be found without recognizing that there is a pattern of errors.


Maintaining reliable troubleshooting records provides a better chance of resolving future errors more quickly.
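
With failure dates on record, a pattern such as the first-Friday example can be checked mechanically. The Python sketch below assumes the dates have already been collected from your own troubleshooting records; the function names are hypothetical and the dates in the usage note are illustrative only.

```python
from collections import Counter
from datetime import date

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def is_first_weekday_of_month(d, weekday):
    """True if d is the first occurrence of weekday (Mon=0 .. Sun=6) in its month."""
    return d.weekday() == weekday and d.day <= 7


def summarize_failures(failure_dates):
    """Count failures by weekday name and by week of month (1 = days 1-7)."""
    by_weekday = Counter(WEEKDAY_NAMES[d.weekday()] for d in failure_dates)
    by_week = Counter((d.day - 1) // 7 + 1 for d in failure_dates)
    return by_weekday, by_week
```

If every failure lands on the same weekday and in the same week of the month, as it would for date(2021, 1, 1), date(2021, 2, 5), and date(2021, 3, 5), that is a strong hint to look for a scheduled process on that day.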








Copyright © 2021 Commvault | All Rights Reserved.