HDD going belly up

Predrag Punosevac punosevac72 at gmail.com
Tue Oct 17 06:14:45 PDT 2017


dfly# uname -a
DragonFly dfly.bagdala2.net 5.0-RELEASE DragonFly v5.0.0.2.ga9d62-RELEASE #10: Tue Oct 17 07:25:14 EDT 2017 root@dfly.bagdala2.net:/usr/obj/usr/src/sys/X86_64_GENERIC  x86_64

dfly# mount
ROOT on / (hammer, noatime, local)
devfs on /dev (devfs, nosymfollow, local)
/dev/serno/B620550018.s1a on /boot (ufs, local)
/pfs/@@-1:00001 on /var (null, local)
/pfs/@@-1:00002 on /tmp (null, local)
/pfs/@@-1:00003 on /home (null, local)
/pfs/@@-1:00004 on /usr/obj (null, local)
/pfs/@@-1:00005 on /var/crash (null, local)
/pfs/@@-1:00006 on /var/tmp (null, local)
procfs on /proc (procfs, local)
DATA on /data (hammer, noatime, local)
BACKUP on /backup (hammer, noatime, local)
/data/pfs/@@-1:00001 on /data/backups (null, local)
/data/pfs/@@-1:00002 on /data/nfs (null, NFS exported, local)
/dev/da3s1e at DATA on /test-hammer2 (hammer2, local)

dfly# smartctl -d sat -l selftest /dev/da1
smartctl 6.5 2016-05-07 r4318 [DragonFly 5.0-RELEASE x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     14053         1060611176
# 2  Short offline       Completed: read failure       90%     14029         1060611176
# 3  Short offline       Completed: read failure       90%     14005         1060611176
# 4  Extended offline    Completed: read failure       90%     13982         1060611176
# 5  Short offline       Completed: read failure       90%     13981         1060611176
# 6  Short offline       Completed: read failure       90%     13957         1060611176
# 7  Short offline       Completed: read failure       90%     13933         1060611176
# 8  Short offline       Completed: read failure       90%     13909         1060611176
# 9  Short offline       Completed: read failure       90%     13885         1060611176
#10  Short offline       Completed: read failure       90%     13861         1060611176
#11  Short offline       Completed: read failure       90%     13837         1060611176
#12  Extended offline    Completed: read failure       90%     13814         1060611176
#13  Short offline       Completed: read failure       90%     13813         1060611176
#14  Short offline       Completed without error       00%     13789         -
#15  Short offline       Completed without error       00%     13765         -
#16  Short offline       Completed without error       00%     13741         -
#17  Short offline       Completed without error       00%     13717         -
#18  Short offline       Completed without error       00%     13693         -
#19  Short offline       Completed without error       00%     13669         -
#20  Extended offline    Completed without error       00%     13654         -
#21  Short offline       Completed without error       00%     13645         -

I am also seeing lots of

ahci0.2: TFES slot 28 ci_saved = 10000000
ahci0.2: read NCQ error page slot=28
ahci0.2: DONE log page target 0 err_slot=28
ahci0.2: disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768
(da1:ahci0:2:0:0): READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0 
(da1:ahci0:2:0:0): CAM Status: SCSI Status Error
(da1:ahci0:2:0:0): SCSI Status: Check Condition
(da1:ahci0:2:0:0): MEDIUM ERROR asc:0,0
(da1:ahci0:2:0:0): No additional sense information
(da1:ahci0:2:0:0): Retrying Command (per Sense Data)
ahci0.2: TFES slot 7 ci_saved = 00000080
ahci0.2: read NCQ error page slot=7
ahci0.2: DONE log page target 0 err_slot=7
ahci0.2: disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768
(da1:ahci0:2:0:0): READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0 
(da1:ahci0:2:0:0): CAM Status: SCSI Status Error
(da1:ahci0:2:0:0): SCSI Status: Check Condition
(da1:ahci0:2:0:0): MEDIUM ERROR asc:0,0
(da1:ahci0:2:0:0): No additional sense information
(da1:ahci0:2:0:0): Retrying Command (per Sense Data)
ahci0.2: TFES slot 8 ci_saved = 00000100
ahci0.2: read NCQ error page slot=8
ahci0.2: DONE log page target 0 err_slot=8
ahci0.2: disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768
(da1:ahci0:2:0:0): READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0 
(da1:ahci0:2:0:0): CAM Status: SCSI Status Error
(da1:ahci0:2:0:0): SCSI Status: Check Condition
(da1:ahci0:2:0:0): MEDIUM ERROR asc:0,0
(da1:ahci0:2:0:0): No additional sense information
(da1:ahci0:2:0:0): Retries Exhausted


in my dmesg
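
The failing read in the dmesg output and the LBA reported by the SMART self-tests can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not a diagnostic tool: it assumes 512-byte logical sectors (not confirmed anywhere above) and that the disk_rw offset is relative to the start of da1. All variable names are made up for illustration; the numbers are copied from the logs above.

```python
# Cross-check the dmesg read error against the SMART self-test LBA.
# ASSUMPTION: the drive uses 512-byte logical sectors.
SECTOR_SIZE = 512

# From "disk_rw: error fiscmd=0x60 @off=0x0000007e6f48c000, 32768"
byte_offset = 0x0000007E6F48C000
byte_length = 32768

# From "READ(10). CDB: 28 0 3f 37 a4 60 0 0 40 0"
cdb = bytes([0x28, 0x00, 0x3F, 0x37, 0xA4, 0x60, 0x00, 0x00, 0x40, 0x00])

# From the LBA_of_first_error column of the self-test log
smart_lba = 1060611176

# Convert the byte range of the failed transfer into a sector range.
first_lba = byte_offset // SECTOR_SIZE
sector_count = byte_length // SECTOR_SIZE
last_lba = first_lba + sector_count - 1

# READ(10) layout: bytes 2-5 hold the big-endian LBA,
# bytes 7-8 hold the big-endian transfer length in blocks.
cdb_lba = int.from_bytes(cdb[2:6], "big")
cdb_blocks = int.from_bytes(cdb[7:9], "big")

print("failed transfer covers LBA", first_lba, "to", last_lba)
print("CDB asks for", cdb_blocks, "blocks starting at LBA", cdb_lba)
print("SMART LBA inside failed transfer:", first_lba <= smart_lba <= last_lba)
```

Under that assumption the driver offset and the CDB describe the same 64-sector read starting at LBA 1060611168, and the SMART LBA 1060611176 falls inside it, so both logs appear to be pointing at the same spot on the disk.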


What is the correct way to recover from a dying HDD? Should I stop
mirroring immediately and promote the slave to master before putting
in a new drive and making it the slave? How can I tell whether the
data on the current master is corrupted?

Cheers,
Predrag


