Tech Note 002
Title: Troubleshooting guidelines (2) - Diagnosing hardware faults
Updated: April 2005
 
From time to time you may experience issues with backups, archives or restores. This usually results in a failed operation. Always make sure that you check the status and logs of completed processes. If an operation has failed, you should attempt to ascertain the cause of the failure and re-run the operation until it completes. In the case of backups and archives, your data may not be safe until the operation has successfully completed.
To begin troubleshooting the issue, refer to Tech Note 001: Troubleshooting guidelines (1) - General.
 
Hardware failures
The majority of failures in backup, archive and restore operations are caused by hardware defects. Although the log that you view is created by FlashNet, the data transfer itself passes through several components. Hardware is attached to the FlashNet server using SCSI or fibre channel, which relies on several individual components, such as HBAs (SCSI or fibre channel cards in the server), cables and terminators. Together these form the link between the server and the backup device, and the failure or malfunction of any one of them will cause data transfer to fail.
Unfortunately, if you are experiencing problems with components of the SCSI or fibre bus, it is often not easy to determine which component is faulty. Often, the only reporting that FlashNet is able to produce is 'write error', or 'I/O error' (input/output error). Although this does not indicate specifically which component of the bus has failed, I/O errors are nearly always generated as the result of a bad component. If you are seeing I/O errors, check the system's log to see if a SCSI bus reset or other SCSI error is reported (on Sun Solaris the file to check is /var/adm/messages, on SGI /var/adm/SYSLOG).
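The log check described above can be partly automated. The following is a minimal Python sketch, not a FlashNet tool: the function names and the search keywords are assumptions chosen for illustration, and the actual wording of SCSI messages varies by operating system, driver and HBA. Treat any lines it returns as candidates to read, not as a diagnosis.

```python
import re

# Assumed keywords only; real message text differs between systems and drivers.
SCSI_PATTERN = re.compile(r"scsi|bus reset|i/o error", re.IGNORECASE)


def find_scsi_lines(lines):
    """Return (line_number, text) pairs for lines that mention SCSI or I/O trouble."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SCSI_PATTERN.search(line)
    ]


def scan_system_log(path="/var/adm/messages"):
    """Scan a system log file; the default is the Solaris path named in this note."""
    with open(path, errors="replace") as handle:
        return find_scsi_lines(handle)
```

On SGI, pass /var/adm/SYSLOG as the path instead. Because the pattern also matches routine SCSI messages (for example, those logged at boot), compare the timestamps of any matches with the time of the failed job.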
Once you have determined that SCSI errors are occurring, you must try to determine which component is at fault. Unfortunately, this often involves a certain amount of trial and error: replace one component and re-try the job; if the job still fails, swap a different component on the bus and re-try, and so on. SCSI bus analyzers are very useful tools when diagnosing SCSI bus errors.
In addition, the drives that write the data, and the media to which they write, may sometimes fail. Magnetic tapes in particular are susceptible to damage, and though in the vast majority of cases tape is a safe backup medium, media failures can occur.
Wherever possible, the FlashNet log will indicate the cause of an error. For example, a medium error will be reported as such; when one occurs, you should re-try the job using a different piece of media (if the operation was writing to a group of volumes, you should remove the current volume from the group before re-trying the job so that a different piece of media is used).
 
FlashNet and I031 messages
Each message in a FlashNet log begins with a four-character code (for example, I031), which denotes the type of message. There are three basic types of message: I(nformation), W(arning) and F(atal). Information messages provide general information about the progress of the job. Warning messages highlight items that you should examine, but that are not sufficiently serious to cause a fatal error; for example, files that could not be backed up are highlighted with warning messages. Fatal messages indicate a catastrophic failure of the process, which is halted at the point of the failure.
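As an illustration of the three message types, the following minimal Python sketch classifies a log line by its leading code. It assumes that the code is the first item on the line and consists of one letter (I, W or F) followed by three digits, as in I031; the exact layout of a FlashNet log line is not specified in this note, so the pattern and the function name are hypothetical and may need adjusting.

```python
import re

# Assumption: the code starts the line and is a letter (I, W or F)
# followed by exactly three digits, as in "I031".
CODE_PATTERN = re.compile(r"^\s*([IWF])(\d{3})\b")

SEVERITY = {"I": "information", "W": "warning", "F": "fatal"}


def classify(line):
    """Return (code, severity) for a log line, or None if no code is found."""
    match = CODE_PATTERN.match(line)
    if match is None:
        return None
    letter, digits = match.groups()
    return letter + digits, SEVERITY[letter]
```

Counting the warning and fatal results for a completed job gives a quick indication of whether the log needs closer reading.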
Most logs contain I031 messages. These messages are generated by the hardware device(s) that FlashNet is instructing to carry out the process. I031 messages come directly from the hardware; they are not FlashNet messages. I031 messages may indicate normal progression of the job (e.g. they report changes in the state of the drive, such as media loading or drive calibration), or they may indicate problems that the drive or media are experiencing (e.g. a medium error or a basic hardware fault).
If a process has failed, check the FlashNet log. If the fatal error is preceded by an I031 message, there is a very good chance that the message will shed some light on the cause of the failure. Remember, I031 messages are generated directly by the hardware; if the drive has a problem, it will report it to FlashNet in an I031 message.
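The check described above can be sketched in Python as follows. This is an illustration only: it assumes that each message code appears at the start of a log line and that fatal codes consist of F followed by three digits, and the function name and the window of five lines are arbitrary choices, not FlashNet settings.

```python
import re

# Assumptions: codes start the line; fatal codes look like "F" plus three digits.
FATAL = re.compile(r"^\s*F\d{3}\b")
HARDWARE = re.compile(r"^\s*I031\b")


def hardware_context(lines, window=5):
    """Find the first fatal message and the I031 messages just before it.

    `lines` is a list of log lines. Returns (fatal_line, i031_lines), or
    None if the log contains no fatal message.
    """
    for index, line in enumerate(lines):
        if FATAL.match(line):
            start = max(0, index - window)
            preceding = [
                earlier.rstrip("\n")
                for earlier in lines[start:index]
                if HARDWARE.match(earlier)
            ]
            return line.rstrip("\n"), preceding
    return None
```

An empty list of I031 messages does not rule out a hardware fault; it only means that the device reported nothing in the lines immediately before the failure.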

Any hardware issues should be reported to your FlashNet vendor, and where applicable to the hardware vendor. They may ask you to send them 'sense' information. This is the information that is passed between the backup device(s) and the controlling software (e.g. FlashNet), in non-human-readable format. The information is used by engineers to diagnose hardware issues.
Sense information is included in a FlashNet debug log, or can be displayed in the standard FlashNet process log by adding the entry
SHOW_SENSE YES
to the FlashNet environment file /.dtool_env. You should only add this entry when instructed to do so by a support engineer or helpdesk staff member.

 

