Wenn das Backup streikt… Networker troubleshooting 2

Ursachenforschung

Die Preisfrage lautet jetzt: ” Was ist mit der Tape Library los? Liegt das Problem auf Seite des Hosts oder Jukebox?”

Dazu erst einmal das Networker-Log in ein brauchbares Format konvertieren:

root # nsr_render_log /nsr/logs/daemon.raw > /tmp/daemon.log

Nun mit dem Editor der Wahl o.ä. zum Ende des Protokolls und dann zum Beginn des Fehlers zurückblättern:

Ende des Logs:

7224 08/15/08 19:05:12  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 3547]. Jukebox `senlib3′ is not ready or it is busy.
39078 08/15/08 19:05:13  0 0 2 3 18486 0 senbackup1 nsrlcpd#1 Jukebox error: Jukebox:senlib3 access:/dev/scsi/changer/c3t500104F0007E9B64d0 failed:MOVE MEDIUM key:2 status:CHECK CONDITION No Additional Sense, Logical Unit Not Ready, Manual Intervention Required
0 08/15/08 19:05:14  2 0 0 1 18471 0 senbackup1 nsrmmgd lcpd 1 at host senbackup1 reported error ‘Jukebox `senlib3′ is not ready or it is busy.’ for the command `4′.
7224 08/15/08 19:05:14  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 3548]. Jukebox `senlib3′ is not ready or it is busy.
4690 08/15/08 19:05:14  2 0 0 1 18471 0 senbackup1 nsrmmgd Jukebox `senlib3′ is not ready or it is busy.
7224 08/15/08 19:05:14  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 3548]. Jukebox `senlib3′ is not ready or it is busy.
12361 08/15/08 19:05:15  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 3547]. Finished with status: retryable

Beginn des Fehlers:

42506 08/14/08 22:33:05  2 0 0 1 17996 0 senbackup1 nsrd write completion notice: Writing to volume 000061 complete
0 08/14/08 22:34:10  2 0 0 1 17996 0 senbackup1 nsrd Operation 277 started : Unload jukebox device `/dev/rmt/2cbn’..
38752 08/14/08 22:34:12  2 0 0 1 17996 0 senbackup1 nsrd /dev/rmt/2cbn Eject operation in progress
39078 08/14/08 22:36:55  0 0 2 3 18486 0 senbackup1 nsrlcpd#1 Jukebox error: Jukebox:senlib3 access:/dev/scsi/changer/c3t500104F0007E9B64d0 failed:MOVE MEDIUM key:4 status:CHECK CONDITION No Additional Sense, Diagnostic Failure on Component 0×01
0 08/14/08 22:36:56  2 0 0 1 18471 0 senbackup1 nsrmmgd lcpd 1 at host senbackup1 reported error ‘Jukebox:senlib3 access:/dev/scsi/changer/c3t500104F0007E9B6
4d0 failed:MOVE MEDIUM key:4 status:CHECK CONDITION No Additional Sense, Diagnostic Failure on Component 0×01′ for the command `5′.
7224 08/14/08 22:36:56  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 277]. Jukebox:senlib3 access:/dev/scsi/changer/c3t500104F0007E9B64d0
failed:MOVE MEDIUM key:4 status:CHECK CONDITION No Additional Sense, Diagnostic Failure on Component 0×01
4690 08/14/08 22:36:56  2 0 0 1 18471 0 senbackup1 nsrmmgd Jukebox:senlib3 access:/dev/scsi/changer/c3t500104F0007E9B64d0 failed:MOVE MEDIUM key:4 status:CHECK CONDITION No Additional Sense, Diagnostic Failure on Component 0×01
7224 08/14/08 22:36:56  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 277]. Jukebox `senlib3′ failed to unload the requested volume from de
vice `/dev/rmt/2cbn’.
7224 08/14/08 22:36:56  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 277]. Jukebox:senlib3 access:/dev/scsi/changer/c3t500104F0007E9B64d0
failed:MOVE MEDIUM key:4 status:CHECK CONDITION No Additional Sense, Diagnostic Failure on Component 0×01
12361 08/14/08 22:37:00  2 0 0 1 17996 0 senbackup1 nsrd [Jukebox `senlib3', operation # 277]. Finished with status: retryable
42506 08/14/08 22:37:28  2 0 0 1 17996 0 senbackup1 nsrd media warning: Jukebox `senlib3′ Hardware status of jukebox ‘senlib3′ changed from ‘ready’ to ‘needs
attention’

Aha. Scheinbar ist das Entladen eines Tapes fehlgeschlagen. Seitdem reagiert die Jukebox nicht mehr.

Es handelt sich hier um eine Sun/STK SL500 Tape-Library, die per Fibrechannel-Fabric angebunden ist. Jetzt also erst mal prüfen, ob es ggf. Probleme im SAN gibt.

Zunächst einmal gilt es Infos zur Jukebox-Konfiguration in Erfahrung zu bringen. Aus den obigen Fehlermeldungen wissen wir bereits, dass das Jukebox-Changer-Device /dev/scsi/changer/c3t500104F0007E9B64d0 heißt und somit am Controller 3 angeschlossen ist.

root@senbackup1 # ls -l /dev/scsi/changer
total 2
lrwxrwxrwx   1 root     root          89 Sep 11  2007 c3t500104F0007E9B64d0 -> ../../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0/sgen@w500104f0007e9b64,0:changer

root # ls -l /dev/rmt/*cbn
lrwxrwxrwx   1 root     root          80 Sep 13  2007 /dev/rmt/0cbn -> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0/st@w500104f0007e9b71,0:cbn
lrwxrwxrwx   1 root     root          80 Sep 13  2007 /dev/rmt/1cbn -> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0/st@w500104f0007e9b6e,0:cbn
lrwxrwxrwx   1 root     root          80 Sep 13  2007 /dev/rmt/2cbn -> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0/st@w500104f0007e9b74,0:cbn
lrwxrwxrwx   1 root     root          80 Sep 13  2007 /dev/rmt/3cbn -> ../../devices/pci@7c0/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0/st@w500104f0007e9b77,0:cbn
lrwxrwxrwx   1 root     root          80 Sep 13  2007 /dev/rmt/4cbn -> ../../devices/pci@7c0/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0/st@w500104f0007e9b7d,0:cbn
lrwxrwxrwx   1 root     root          80 Apr 30 13:34 /dev/rmt/5cbn -> ../../devices/pci@7c0/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0/st@w500104f0007e9b7a,0:cbn

Aus dem Hardwarepfad ist ersichtlich, daß je 3 Laufwerke an 2 seperaten FC-HBAs angeschlossen sind.

root # ls -l /dev/cfg/ | grep devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0
lrwxrwxrwx   1 root     root          56 Aug 24  2007 c3 -> ../../devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0:fc
root # ls -l /dev/cfg/ | grep devices/pci@7c0/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0
lrwxrwxrwx   1 root     root          56 Aug 24  2007 c5 -> ../../devices/pci@7c0/pci@0/pci@9/SUNW,qlc@0,1/fp@0,0:fc

Tape 0,1,2 und die Robotik hängen also am Controller 3, Tape 3,4,5 am Controller 5.

root # fcinfo hba-port -l

HBA Port WWN: 2101001b323f0a1d
OS Device Name: /dev/cfg/c3
Manufacturer: QLogic Corp.
Model: 375-3356-01
Firmware Version: 4.2.2
FCode/BIOS Version:  BIOS: 1.24; fcode: 1.26; EFI: 1.8;
Type: N-port
State: online
Supported Speeds: 1Gb 2Gb 4Gb
Current Speed: 4Gb
Node WWN: 2001001b323f0a1d

HBA Port WWN: 210100e08bb2d018
OS Device Name: /dev/cfg/c5
Manufacturer: QLogic Corp.
Model: 375-3356-01
Firmware Version: 4.2.2
FCode/BIOS Version:  BIOS: 1.24; fcode: 1.26; EFI: 1.8;
Type: N-port
State: online
Supported Speeds: 1Gb 2Gb 4Gb
Current Speed: 4Gb
Node WWN: 200100e08bb2d018

Nun sind also die WWPNs der Controller bekannt. Damit lässt sich abfragen, welche LUNs der Server auf diesen Ports sieht:

root # fcinfo remote-port -p 2101001b323f0a1d -s
Remote Port WWN: 500104f0007e9b64
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b63
LUN: 0
Vendor: STK
Product: SL500
OS Device Name: /devices/pci@7c0/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0/sgen@w500104f0007e9b64,0

Remote Port WWN: 500104f0007e9b6e
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b6d
LUN: 0
Vendor: HP
Product: Ultrium 3-SCSI
OS Device Name: /dev/rmt/1n
Remote Port WWN: 500104f0007e9b71
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b70
LUN: 0
Vendor: HP
Product: Ultrium 3-SCSI
OS Device Name: /dev/rmt/0n
Remote Port WWN: 500104f0007e9b74
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b73
LUN: 0
Vendor: HP
Product: Ultrium 3-SCSI
OS Device Name: /dev/rmt/2n

Am Controller 3 sind 3 HP LTO-3 Tape Drives und der Changer der SL500 (über sgen-Treiber) sichtbar.

root # fcinfo remote-port -p 210100e08bb2d018 -s

Remote Port WWN: 500104f0007e9b77
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b76
LUN: 0
Vendor: HP
Product: Ultrium 3-SCSI
OS Device Name: /dev/rmt/3n
Remote Port WWN: 500104f0007e9b7a
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b79
LUN: 0
Vendor: HP
Product: Ultrium 3-SCSI
OS Device Name: /dev/rmt/5n
Remote Port WWN: 500104f0007e9b7d
Active FC4 Types: SCSI
SCSI Target: yes
Node WWN: 500104f0007e9b7c
LUN: 0
Vendor: HP
Product: Ultrium 3-SCSI
OS Device Name: /dev/rmt/4n

Am Controller 5 sind 3 HP LTO-3 Tape Drives sichtbar.

Fazit: Im SAN scheint es zumindest auf den ersten Blick kein Problem zu geben.

Der Networker bietet auch noch Befehle zum Thema an:

root # inquire -l

-l flag found: searching all LUNs, which may take over 10 minutes per adapter
for some fibre channel adapters.  Please be patient.

scsidev@3.3.0:HP      Ultrium 3-SCSI  M21S|Tape, /dev/rmt/1cbn
S/N: HU107120CY
ATNN=HP      Ultrium 3-SCSI  HU107120CY
WWNN=50060B000036F6EA
WWPN=500104F0007E9B6E
WW2N=500104F0007E9B6D
PORT=00000001
scsidev@3.4.0:HP      Ultrium 3-SCSI  M21S|Tape, /dev/rmt/0cbn
S/N: HU107120BG
ATNN=HP      Ultrium 3-SCSI  HU107120BG
WWNN=50060B000036F686
WWPN=500104F0007E9B71
WW2N=500104F0007E9B70
PORT=00000001
scsidev@3.7.0:HP      Ultrium 3-SCSI  L58S|Tape, /dev/rmt/2cbn
S/N: HU10704J3U
ATNN=HP      Ultrium 3-SCSI  HU10704J3U
WWNN=500110A00058B456
WWPN=500104F0007E9B74
WW2N=500104F0007E9B73
PORT=00000001

scsidev@5.1.0:HP      Ultrium 3-SCSI  L58S|Tape, /dev/rmt/3cbn
S/N: HU107070AD
ATNN=HP      Ultrium 3-SCSI  HU107070AD
WWNN=500110A00058D182
WWPN=500104F0007E9B77
WW2N=500104F0007E9B76
PORT=00000001
scsidev@5.3.0:HP      Ultrium 3-SCSI  L58S|Tape, /dev/rmt/5cbn
S/N: HU10704J4A
ATNN=HP      Ultrium 3-SCSI  HU10704J4A
WWNN=500110A00058B446
WWPN=500104F0007E9B7A
WW2N=500104F0007E9B79
PORT=00000001
scsidev@5.6.0:HP      Ultrium 3-SCSI  L58S|Tape, /dev/rmt/4cbn
S/N: HU10704J4R
ATNN=HP      Ultrium 3-SCSI  HU10704J4R
WWNN=500110A00058B42A
WWPN=500104F0007E9B7D
WW2N=500104F0007E9B7C
PORT=00000001

Verdächtig: inquire findet die Robotik nicht…

root # jbverify
14866:jbverify:
Jbverify is running on host senbackup1, SunOS 5.10
….
14917:jbverify:
Finished processing stand-alone devices.
14918:jbverify:
Processing jukebox devices…
14920:jbverify:
Processing jukebox senlib3:
39078:jbverify: SYSTEM error: Device busy
39077:jbverify: error, Cannot initialize jukebox! Driver may not be loaded!
14926:jbverify:
Failed while processing jukebox senlib3 on senbackup1
14929:jbverify:
Finished processing jukebox devices.
**********************************************************************

Summary report of jbverify
======= ====== == ========

Hostname   Device Handle    Blocksize  Jukebox  Drv No. Status
——–   ————-    ———  ——-  ——- ——
senbackup1 /AFTD/Exchange/_AF_readonly 131072 N/A N/A   Pass
senbackup1 /AFTD/Arriba/_AF_readonly 131072 N/A N/A     Pass
senbackup1 /AFTD/Windows/_AF_readonly 131072 N/A N/A     Pass
senbackup1 /AFTD/UNIX/_AF_readonly 131072 N/A   N/A     Pass
senbackup1 /AFTD/SQL/_AF_readonly 131072 N/A    N/A     Pass
senbackup1 /AFTD/Exchange   131072     N/A      N/A     Pass
senbackup1 /AFTD/default/_AF_readonly 131072 N/A N/A     Pass
senbackup1 /AFTD/Arriba     131072     N/A      N/A     Pass
senbackup1 /AFTD/UNIX       131072     N/A      N/A     Pass
senbackup1 /AFTD/SQL        131072     N/A      N/A     Pass
senbackup1 /AFTD/default    131072     N/A      N/A     Pass
senbackup1 /AFTD/Windows    131072     N/A      N/A     Pass
senbackup1 Unknown          N/A        senlib3  N/A     Fail

**********************************************************************

14938:jbverify:
Exiting jbverify with failures.

root # jbverify -j -v -v
14866:jbverify:
Jbverify is running on host senbackup1, SunOS 5.10
14890:jbverify:
turning verbose mode on.
14900:jbverify:
No. of retries set by default to 10.
14902:jbverify:
Using senbackup1 as NetWorker server.
14918:jbverify:
Processing jukebox devices…
14920:jbverify:
Processing jukebox senlib3:
39078:jbverify: SYSTEM error: Device busy
39077:jbverify: error, Cannot initialize jukebox! Driver may not be loaded!
14926:jbverify:
Failed while processing jukebox senlib3 on senbackup1
14929:jbverify:
Finished processing jukebox devices.
**********************************************************************

Summary report of jbverify
======= ====== == ========

Hostname   Device Handle    Blocksize  Jukebox  Drv No. Status
——–   ————-    ———  ——-  ——- ——
senbackup1 Unknown          N/A        senlib3  N/A     Fail

**********************************************************************

14938:jbverify:
Exiting jbverify with failures.

Damit wird es nun offensichtlich, dass wohl ein Problem in der Tape-Library selbst besteht. Deshalb nun der Griff zur Library Console:

Oh je, das sieht verdächtig nach Hardware-Problem aus…

Dieser Beitrag wurde unter Backup, EMC, SAN, Solaris, Sun abgelegt und mit , , , , , verschlagwortet. Setze ein Lesezeichen auf den Permalink.

Hinterlasse eine Antwort

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind markiert *

*

Du kannst folgende HTML-Tags benutzen: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>