I have two Western Digital WD2500KS hard disks and noticed some time ago that their SMART implementation gets temperatures wrong. AFAIK the drives report the temperature in °F while claiming to report it in °C. This wouldn't be a problem if the thresholds weren't also defined in °C. On Windows the user is never bothered by this shortcoming, because Windows itself simply pays no attention to SMART. On Linux, however, the user is typically notified by mail or desktop popup when a hard drive exceeds its maximum temperature, which is quite annoying when it is a false alarm.
Plenty of third-party SMART tools exist for Windows, and that is how I originally ran into the problem. Since it shows up under both operating systems, it really is a firmware bug.
So I started looking for a way to either
- add some offset to the limits, or
- disable temperature monitoring, and nothing else, on just those disks.
The second option seemed the way to go.
First you need to know what is failing. Check your /var/log/messages:
smartd: Device: /dev/sda, Failed SMART usage Attribute: 190 Temperature_Celsius.
smartd: Device: /dev/sdb, Failed SMART usage Attribute: 190 Temperature_Celsius.
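If the log is long, the relevant lines can be boiled down with a small filter. This is only a sketch: the function name failed_smart_attrs is made up, and it assumes your smartd messages have the same shape as the two lines above.

```shell
# Hypothetical helper: read syslog lines on stdin and print
# "device attribute-id attribute-name" once per failing pair.
failed_smart_attrs() {
    # Keep only smartd's "Failed SMART usage Attribute" messages,
    # extract device, ID and name, then drop the duplicates.
    sed -n 's/.*Device: \([^ ,]*\)[^,]*, Failed SMART usage Attribute: \([0-9][0-9]*\) \([A-Za-z0-9_]*\).*/\1 \2 \3/p' |
        sort -u
}
```

On a system that logs to /var/log/messages it would be called as failed_smart_attrs < /var/log/messages.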
Attribute 190 seems to be the problem. You can always verify this with “smartctl -a” on the device (e.g. smartctl -a /dev/sda), whose output includes, among other things, the vendor-specific attribute table:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   227   201   021    Pre-fail  Always       -       3650
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       487
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1683
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       487
190 Temperature_Celsius     0x0022   034   021   045    Old_age   Always   FAILING_NOW 66
194 Temperature_Celsius     0x0022   084   071   000    Old_age   Always       -       66
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
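In the output above, attribute 190 has a normalized VALUE of 34, which is below its THRESH of 45, hence the FAILING_NOW flag. Attribute 194 reports the same raw 66 but has a threshold of 000, so it never trips. To pick out only the failing rows, something like the following sketch could be used; failing_attrs is a made-up name, and it assumes the ten-column layout shown above.

```shell
# Hypothetical helper: read a smartctl attribute table on stdin and
# print every attribute whose WHEN_FAILED column says FAILING_NOW.
failing_attrs() {
    awk '
        # Attribute rows start with a numeric ID; column 9 is WHEN_FAILED.
        $1 ~ /^[0-9]+$/ && $9 == "FAILING_NOW" {
            # %d strips the leading zeros from VALUE and THRESH.
            printf "%s %s value=%d thresh=%d\n", $1, $2, $4, $6
        }
    '
}
```

A typical call would be smartctl -A /dev/sda | failing_attrs, where -A limits smartctl's output to the attribute table.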
Next, we need to adjust the smartd config.
There is a DEVICESCAN directive, which is an easy way to have smartd automatically configure every device it finds. If you want to configure all your devices the same way anyway, you can simply append the options you want to that line.
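As a sketch of that route, assuming you are happy to ignore attribute 190 on every disk and that mail to root@localhost works on your machine, a single line like this in smartd.conf (usually /etc/smartd.conf, but check your distribution) should be enough:

```
# Hypothetical one-line setup, not the config I ended up using.
# -a: the default set of checks, -i 190: ignore attribute 190,
# -m: send warnings to this mail address
DEVICESCAN -a -i 190 -m root@localhost
```

The downside is that attribute 190 is then also ignored on disks that report it correctly.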
If you, like me, want to give each device its own treatment ;), you can add a line per device. As soon as you do that, you can't use DEVICESCAN anymore. In my case I did something like this:
/dev/sda -a -d sat -i 190 -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
/dev/sdb -a -d sat -i 190 -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
/dev/sdc -a -d sat -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
/dev/sdd -a -d sat -m root@localhost -M exec /usr/lib/smartmontools/smart-notify
sda and sdb get the special ignore for attribute 190 (the key option is -i). With -d sat I also tell smartd that all four hard drives are actually (S)ATA drives behind the SCSI interface (fewer warnings in the log ;)).
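To double-check which disks actually carry an ignore, the config can be run through a small filter. Again just a sketch: ignored_attrs is a made-up name, and it assumes one device per line with whitespace-separated directives, as in the lines above.

```shell
# Hypothetical helper: read smartd.conf-style lines on stdin and
# print "device attribute-id" for every -i directive found.
ignored_attrs() {
    awk '
        # Skip blank lines and comment lines.
        NF == 0 || $1 ~ /^#/ { next }
        {
            # The first field is the device; look for "-i ID" pairs after it.
            for (i = 2; i < NF; i++)
                if ($i == "-i")
                    print $1, $(i + 1)
        }
    '
}
```

With the config above, ignored_attrs < /etc/smartd.conf (path assumed) should list only sda and sdb, each with 190.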
To make the changes in the config effective, we let smartd reread its config by sending a SIGHUP signal to the process:
killall -HUP smartd