One of the disks in one of our old RHEL5 SunFire x4150 boxes wouldn’t spin up after a controlled reboot. Oracle Support showed a friendly face and sent a new disk after a few hours. Down at the data center, having found the right box, I saw both disk lamps blinking green. One of them blinked a bit more than the other, but was that the broken one or the healthy one? Pulling the wrong disk would bring a production system down.
How do I find which disk is the broken one?
The iLO web interface shows … nothing interesting about the broken disk at all. Now, these SunFire workhorses are usually equipped with LSI SAS1068E Fusion-MPT entry-level RAID cards, using the mptscsi driver, so we can poll status with mpt-status or lsiutil. mpt-status says that the broken disk is “phy 1 scsi_id 2”. lsiutil says the broken disk is “Bus 0 Target 2”. The Sun/Oracle docs showing the disk drawers enumerate the disks, but do not indicate SCSI IDs. What to do, what to do? The clock is ticking away, and at home, the dinner is ready.
Finally, the mother of all disk status tools, S.M.A.R.T., comes to the rescue. The mptscsi driver adds generic SCSI devices for all physical disks, as well as for the logical RAID device, so we can use smartmontools to poll the status of each physical disk. On a typical system disk with a RAID1 mirror, sg0 is the first physical disk, sg1 is the second, and sg2 is the logical LUN provided by the mirror. What is so magical about smartmontools? It provides the actual serial number of the disk, and that serial number is visible on the front panel of the disk drawer.
smartctl -a /dev/sg0
smartctl -a /dev/sg1
The broken disk should report (or fail to report) its status, and may be located by its serial number. Now change the disk and get home before the dinner gets cold.
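If there are more than a couple of disks, the serial lookup can be wrapped in a small loop. This is only a sketch, not something from the original troubleshooting session: the function names extract_serial and list_serials are made up for illustration, and it assumes smartctl -i prints a "Serial number:" line for each healthy disk, which may vary with the smartmontools version and disk type.

```shell
#!/bin/sh
# Sketch only: extract_serial and list_serials are hypothetical helper names.

# Pull the serial number out of `smartctl -i` output read on stdin.
# Matches both "Serial number:" and "Serial Number:" spellings.
extract_serial() {
    sed -n 's/^[Ss]erial [Nn]umber:[[:space:]]*//p'
}

# Print "device serial" for each generic SCSI device given as an argument.
# A disk that does not answer gets a marker instead of a serial number.
list_serials() {
    for dev in "$@"; do
        serial=$(smartctl -i "$dev" 2>/dev/null | extract_serial)
        echo "$dev ${serial:-NO-ANSWER}"
    done
}

# Does nothing when called without arguments.
list_serials "$@"
```

Run as root with the physical devices as arguments, for example /dev/sg0 /dev/sg1. A disk that shows NO-ANSWER, or whose serial is the one you cannot find among the healthy drawers, is the candidate to pull.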
Btw, smartctl reports different SCSI IDs for the physical disks than mpt-status and lsiutil did. Go figure.