smartd temperature error messages

I have two Western Digital 2500KS hard disks and noticed some time ago that their SMART implementation is not correct when it comes to temperatures. As far as I know, they report the temperature in °F while claiming to report it in °C. That wouldn’t be a problem if the thresholds weren’t defined in °C as well. On Windows, the user is never bothered by this shortcoming, because Windows simply has no clue about SMART. On Linux, however, the user is typically notified by mail or a desktop popup when a hard drive exceeds its maximum temperature, which is quite annoying if it is a false alarm. 😉

Plenty of third-party SMART tools exist for Windows, though, and that is how I originally ran into the problem. Since it shows up there too, it really is a firmware bug and not a Linux quirk.

So I started looking for a way to either

1. add some offset to the limits, or
2. disable temperature monitoring, and only on those disks

Option 2 seemed the way to go.
First you need to know what is failing. Check your /var/log/messages:

smartd[3916]: Device: /dev/sda, Failed SMART usage Attribute: 190 Temperature_Celsius.
smartd[3916]: Device: /dev/sdb, Failed SMART usage Attribute: 190 Temperature_Celsius.
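
To pull just these failures out of a busy log, a small filter helps. The following is a minimal sketch: the function name failed_attrs is made up for this post, and it assumes the log lines look exactly like the two above (other smartd versions or syslog setups may format them differently, and the log may live somewhere else on your distribution).

```shell
# Hypothetical helper: read syslog lines on stdin and print each
# "device attribute-id" pair that smartd reported as failed, once.
failed_attrs() {
    sed -n 's/.*Device: \([^,]*\), Failed SMART usage Attribute: \([0-9][0-9]*\) .*/\1 \2/p' | sort -u
}

# Usage (the log path is the one from this post; reading it may need root):
#   failed_attrs < /var/log/messages
```

Because of the sort -u, a disk that has been complaining every half hour for weeks still shows up only once per attribute.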

It seems attribute 190 is the problem. You can always verify this with “smartctl -a” followed by the device name (for example /dev/sda), which prints, among other things, the vendor-specific attributes:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   227   201   021    Pre-fail  Always       -       3650
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       487
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1683
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       487
190 Temperature_Celsius     0x0022   034   021   045    Old_age   Always   FAILING_NOW 66
194 Temperature_Celsius     0x0022   084   071   000    Old_age   Always       -       66
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
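
With that many rows it is easy to miss the one that matters. As a rough sketch, the filter below keeps only the rows whose WHEN_FAILED column is not "-". The name failing_rows is hypothetical, and the column positions are assumed to match the table above.

```shell
# Hypothetical helper: read a smartctl attribute table on stdin and print
# ID, name, normalized value, threshold and WHEN_FAILED for every attribute
# whose WHEN_FAILED column (field 9 in the table above) is not "-".
failing_rows() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" {
        print $1, $2, "value=" $4, "thresh=" $6, $9
    }'
}

# Usage (assumes smartmontools is installed; -A prints only the attribute table):
#   smartctl -A /dev/sda | failing_rows
```

For attribute 190 this shows a normalized value of 034 against a threshold of 045. An attribute counts as failed once its normalized value drops to or below the threshold, which is why smartd flags it, while attribute 194 carries the same raw value of 66 but has a threshold of 000 and so never trips.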

Next, we need to adjust the smartd config.
There is a DEVICESCAN directive, which is an easy way to configure all detected devices automatically. If you want all your devices configured the same way anyway, you can simply append the options you want to that line.
If, like me, you want to give each device its own treatment ;), you can add one line per device. As soon as you do that, you can no longer use DEVICESCAN. In my case I did something like this:

/dev/sda -a -d sat -i 190 -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
/dev/sdb -a -d sat -i 190 -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
/dev/sdc -a -d sat -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
/dev/sdd -a -d sat -m root@localhost -M exec /usr/lib/smartmontools/smart-notify

sda and sdb need the special ignore for attribute 190 (the key option is -i). I also specify -d sat for all hard drives, to tell smartd that they are actually (S)ATA drives behind the SCSI interface (fewer warnings in the log ;)).
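
One thing to watch: the comments in the stock smartd.conf say that DEVICESCAN causes the remaining lines of the file to be ignored, so per-device lines placed after an active DEVICESCAN line would not take effect. Here is a quick sketch to check for that. The function name devicescan_active is hypothetical, and the config path in the usage line is an assumption (it varies per distribution).

```shell
# Hypothetical helper: succeed (exit 0) and print the line number if the
# given config file contains a DEVICESCAN line that is not commented out.
devicescan_active() {
    awk '/^[ \t]*DEVICESCAN([ \t]|$)/ { print "active DEVICESCAN on line " NR; found = 1 }
         END { exit !found }' "$1"
}

# Usage (path is an assumption; adjust to your distribution):
#   devicescan_active /etc/smartd.conf && echo "comment it out first"
```

If it reports a line, put a # in front of that DEVICESCAN entry before relying on the per-device lines.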

To make the config changes take effect, we let smartd reread its config by sending the process a SIGHUP signal:

killall -HUP smartd

Done! 🙂
