HDD going belly up
Predrag Punosevac
punosevac72 at gmail.com
Tue Oct 17 06:14:45 PDT 2017
dfly# uname -a
DragonFly dfly.bagdala2.net 5.0-RELEASE DragonFly v5.0.0.2.ga9d62-RELEASE #10: Tue Oct 17 07:25:14 EDT 2017 root@dfly.bagdala2.net:/usr/obj/usr/src/sys/X86_64_GENERIC x86_64
dfly# mount
ROOT on / (hammer, noatime, local)
devfs on /dev (devfs, nosymfollow, local)
/dev/serno/B620550018.s1a on /boot (ufs, local)
/pfs/@@-1:00001 on /var (null, local)
/pfs/@@-1:00002 on /tmp (null, local)
/pfs/@@-1:00003 on /home (null, local)
/pfs/@@-1:00004 on /usr/obj (null, local)
/pfs/@@-1:00005 on /var/crash (null, local)
/pfs/@@-1:00006 on /var/tmp (null, local)
procfs on /proc (procfs, local)
DATA on /data (hammer, noatime, local)
BACKUP on /backup (hammer, noatime, local)
/data/pfs/@@-1:00001 on /data/backups (null, local)
/data/pfs/@@-1:00002 on /data/nfs (null, NFS exported, local)
/dev/da3s1e@DATA on /test-hammer2 (hammer2, local)
dfly# smartctl -d sat -l selftest /dev/da1
smartctl 6.5 2016-05-07 r4318 [DragonFly 5.0-RELEASE x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%         14053       1060611176
# 2  Short offline       Completed: read failure       90%         14029       1060611176
# 3  Short offline       Completed: read failure       90%         14005       1060611176
# 4  Extended offline    Completed: read failure       90%         13982       1060611176
# 5  Short offline       Completed: read failure       90%         13981       1060611176
# 6  Short offline       Completed: read failure       90%         13957       1060611176
# 7  Short offline       Completed: read failure       90%         13933       1060611176
# 8  Short offline       Completed: read failure       90%         13909       1060611176
# 9  Short offline       Completed: read failure       90%         13885       1060611176
#10  Short offline       Completed: read failure       90%         13861       1060611176
#11  Short offline       Completed: read failure       90%         13837       1060611176
#12  Extended offline    Completed: read failure       90%         13814       1060611176
#13  Short offline       Completed: read failure       90%         13813       1060611176
#14  Short offline       Completed without error       00%         13789       -
#15  Short offline       Completed without error       00%         13765       -
#16  Short offline       Completed without error       00%         13741       -
#17  Short offline       Completed without error       00%         13717       -
#18  Short offline       Completed without error       00%         13693       -
#19  Short offline       Completed without error       00%         13669       -
#20  Extended offline    Completed without error       00%         13654       -
#21  Short offline       Completed without error       00%         13645       -
I am also seeing lots of
ahci0.2: TFES slot 28 ci_saved = 10000000
ahci0.2: read NCQ error page slot=28
ahci0.2: DONE log page target 0 err_slot=28
ahci0.2: disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768
(da1:ahci0:2:0:0): READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0
(da1:ahci0:2:0:0): CAM Status: SCSI Status Error
(da1:ahci0:2:0:0): SCSI Status: Check Condition
(da1:ahci0:2:0:0): MEDIUM ERROR asc:0,0
(da1:ahci0:2:0:0): No additional sense information
(da1:ahci0:2:0:0): Retrying Command (per Sense Data)
ahci0.2: TFES slot 7 ci_saved = 00000080
ahci0.2: read NCQ error page slot=7
ahci0.2: DONE log page target 0 err_slot=7
ahci0.2: disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768
(da1:ahci0:2:0:0): READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0
(da1:ahci0:2:0:0): CAM Status: SCSI Status Error
(da1:ahci0:2:0:0): SCSI Status: Check Condition
(da1:ahci0:2:0:0): MEDIUM ERROR asc:0,0
(da1:ahci0:2:0:0): No additional sense information
(da1:ahci0:2:0:0): Retrying Command (per Sense Data)
ahci0.2: TFES slot 8 ci_saved = 00000100
ahci0.2: read NCQ error page slot=8
ahci0.2: DONE log page target 0 err_slot=8
ahci0.2: disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768
(da1:ahci0:2:0:0): READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0
(da1:ahci0:2:0:0): CAM Status: SCSI Status Error
(da1:ahci0:2:0:0): SCSI Status: Check Condition
(da1:ahci0:2:0:0): MEDIUM ERROR asc:0,0
(da1:ahci0:2:0:0): No additional sense information
(da1:ahci0:2:0:0): Retries Exhausted
in my dmesg
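
The failing read in the dmesg output can be cross-checked against the SMART self-test log. The following is a minimal Python sketch, not anything the tools themselves provide: it assumes 512-byte logical sectors (not stated above), and the helper name decode_read10 is hypothetical. It decodes the READ(10) CDB from the log, converts the byte offset reported by disk_rw into an LBA, and checks whether the LBA_of_first_error from smartctl falls inside the failing request.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_read10(cdb_hex):
    """Return (start_lba, sector_count) for a READ(10) CDB given as hex bytes.

    READ(10) layout: byte 0 is the opcode (0x28), bytes 2-5 hold the
    big-endian LBA, bytes 7-8 hold the big-endian transfer length.
    """
    cdb = [int(b, 16) for b in cdb_hex.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    count = int.from_bytes(bytes(cdb[7:9]), "big")
    return lba, count


# Values copied from the dmesg and smartctl output above.
cdb_lba, cdb_count = decode_read10("28 0 3f 37 a4 60 0 0 40 0")
byte_offset = 0x0000007E6F48C000   # from "disk_rw: error ... @off="
byte_length = 32768                # length reported on the same line
smart_lba = 1060611176             # LBA_of_first_error

offset_lba = byte_offset // SECTOR_SIZE
in_range = cdb_lba <= smart_lba < cdb_lba + cdb_count

print("CDB start LBA:     ", cdb_lba)
print("CDB sector count:  ", cdb_count)
print("LBA from @off:     ", offset_lba)
print("SMART LBA in range:", in_range)
```

With the values above, the CDB decodes to a 64-sector read (32768 bytes) starting at LBA 1060611168, which matches the @off value, and the SMART-reported LBA 1060611176 lies 8 sectors into that request. Under the sector-size assumption, this suggests that the self-tests and the kernel are tripping over the same bad sector rather than over errors scattered across the disk.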
What is the correct way to recover from a dying HDD? Should I stop
mirroring immediately and promote the slave to master before putting
in a new drive and making it the slave? How can I tell whether the
data on the current master is corrupted?
Cheers,
Predrag