Thursday 20 August 2009

Advice about data corruption events

Detection – Treat the following as warning signs of possible data corruption.
• Unexpected exits from normally stable work jobs; any process or database abnormalities should be cause for concern.
• Failure of routine processes that scan or search data volumes and file systems (e.g. fsck, backups, catalog processes).
• Hardware I/O errors, whether recoverable or not, especially on the SAN, I/O paths and memory.

Awareness – These activities are high risk for any storage environment.
• Frequent reconfiguration or replacement of storage objects.
• Microcode/firmware upgrades (HBA, switch, array).
• SAN, switch and storage reconfigurations.

Prevention – These activities can help with early detection of corruption problems.
• Perform test restores of backups and verify the integrity of the restored data.
• Schedule regular file system metadata and database integrity checks.
• Keep spare, reliable, tested storage available for immediate deployment in a crisis.
• Maintain records of all I/O path and system device configurations (e.g. SAN and network diagrams, configuration details).
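
As an illustration of the test-restore point above, here is a minimal sketch of one way to check restored data against its source using SHA-256 digests. The function names (sha256_of, build_manifest, compare_manifests) are hypothetical, and the sketch assumes both the source and the restored copy are reachable as ordinary directory trees and that the source has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source, restored):
    """Return (missing, extra, changed) lists of relative paths, each sorted."""
    missing = sorted(set(source) - set(restored))
    extra = sorted(set(restored) - set(source))
    changed = sorted(
        name for name in source.keys() & restored.keys()
        if source[name] != restored[name]
    )
    return missing, extra, changed
```

Saving the manifest at backup time and comparing against it later avoids the need to keep the source unchanged; a non-empty "changed" list is the case that most directly suggests corruption rather than a simple omission.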

Investigation – Review these points in the event of a data corruption investigation.
• Retain all log files. Do not let normal daily rollover wipe the evidence (e.g. syslog, alert log, engine log, SAN switch logs).
• Facilitate early engagement between vendors of all interconnect components.
• Retain or capture copies of corrupted storage objects. The content, size and location of the corruption will be used to form theories during investigations.
• Proactively scan the wider estate for similar issues.
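
To illustrate how the size and location of corruption might be recorded from a retained copy, here is a minimal sketch that compares a suspect file against a known-good copy block by block. It assumes a good copy exists (for example from a backup); the function name find_corrupt_ranges is hypothetical and the 4096-byte block size is an arbitrary choice, so reported ranges are only accurate to block granularity.

```python
def find_corrupt_ranges(good_path, suspect_path, block_size=4096):
    """Return a list of (offset, length) byte ranges where the files differ.

    Adjacent differing blocks are merged into a single range. If one file
    is longer than the other, the extra tail is reported as a range too.
    """
    ranges = []
    offset = 0
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        while True:
            a = good.read(block_size)
            b = suspect.read(block_size)
            if not a and not b:
                break  # both files exhausted
            length = max(len(a), len(b))
            if a != b:
                if ranges and ranges[-1][0] + ranges[-1][1] == offset:
                    # Extend the previous range when this block follows it directly.
                    ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
                else:
                    ranges.append((offset, length))
            offset += length
    return ranges
```

Recording these offsets alongside the retained copy gives vendors something concrete to reason about, for example whether the damaged ranges line up with a stripe, page or sector boundary.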

The above is very general, high-level advice. For specific implementation advice on a per-platform, per-process, per-product or per-environment basis, please contact your technical support provider.
