Hey,
We have 15 Dell XS23-TY3 nodes hosted across 5 Dell C6100 Quad Nodes. They all run under a vCenter Server Essentials Kit, which provides fairly basic functionality (no HA).
These devices are in two different datacenters, but all have identical, fairly standard setups:
- BMC 1.3, System BIOS 1.69 through 1.71 (hangs happen on hosts across all of these BIOS versions; I update the BIOS after a failure, but the hosts keep hanging)
- iSCSI software HBAs per host
- Connection to iSCSI datastore that supports ESXi
- 2 x Intel Xeon E5540 2.53GHz
- 32-48GB RAM
- Redundant 1Gb iSCSI connections and redundant vmnics for network traffic (management network also redundant)
- Internal boot storage (some SSDs and some HDDs. Flash drives were disconnecting too often for our setup)
- ESXi 6.0 connected to vCenter Server 6.0
- IPMI 2.0 with remote access via DRAC (saving grace here to reboot the hosts)
The hosts randomly hang and require a hard reset via the DRAC. After a reboot everything runs normally for a few weeks to a few months or more, but eventually some of the hosts hang again. Has anyone experienced this issue, on Dell or non-Dell hardware?
Side note/question:
Have IPMI devices caused issues in the past? I noticed twice that I had to reset/pull the CMOS battery to get the ipmi_srv screen to load and allow ESXi to boot normally.
Added log output:
2015-07-29T11:20:28.440Z cpu13:32816)StorageApdHandler: 1204: APD start for 0x4305f958f000 [2c891b0c-e7814de8]
2015-07-29T11:20:28.440Z cpu0:32980)StorageApdHandler: 421: APD start event for 0x4305f958f000 [2c891b0c-e7814de8]
2015-07-29T11:20:28.440Z cpu0:32980)StorageApdHandlerEv: 110: Device or filesystem with identifier [2c891b0c-e7814de8] has entered the All Paths Down state.
2015-07-29T11:20:44.912Z cpu0:33186)WARNING: LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic2: transmit timed out
2015-07-29T11:20:44.912Z cpu0:33186)WARNING: at vmkdrivers/src_92/vmklinux_92/vmware/linux_net.c:3707/netdev_watchdog() (inside vmklinux)
2015-07-29T11:20:44.912Z cpu0:33186)Backtrace for current CPU #0, worldID=33186, rbp=0x43037edb8e70
2015-07-29T11:20:44.912Z cpu0:33186)0x4390cd11be10:[0x418037296b4e]vmk_LogBacktraceMessage@vmkernel#nover+0x22 stack: 0x0, 0x41803791e7
2015-07-29T11:20:44.912Z cpu0:33186)0x4390cd11be30:[0x41803791e7b7]watchdog_work_cb@com.vmware.driverAPI#9.2+0x27f stack: 0x43037eda0ae
2015-07-29T11:20:44.912Z cpu0:33186)0x4390cd11bea0:[0x418037944a5f]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0xd7 stack: 0x4303
2015-07-29T11:20:44.912Z cpu0:33186)0x4390cd11bf30:[0x41803724f872]helpFunc@vmkernel#nover+0x4e6 stack: 0x0, 0x43037eda0ae0, 0x27, 0x0,
2015-07-29T11:20:44.912Z cpu0:33186)0x4390cd11bfd0:[0x41803741231e]CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0, 0x0, 0x0, 0x0, 0
2015-07-29T11:20:44.912Z cpu0:33186)<3>e1000e 0000:03:00.0: vmnic2: Reset adapter unexpectedly
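For anyone who wants to scan their own vmkernel.log for the same pattern, here is a rough Python sketch. It is not a VMware tool: the labels in PATTERNS and the function names are made up for illustration, and the match strings are simply copied from the log output above. It splits on the timestamp rather than on newlines, so it should also cope with log text pasted as one long line.

```python
import re

# vmkernel entries begin with a UTC timestamp such as 2015-07-29T11:20:28.440Z
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z")

# Hypothetical labels mapped to substrings copied from the log output in this post
PATTERNS = {
    "apd_start": "has entered the All Paths Down state",
    "nic_tx_timeout": "NETDEV WATCHDOG",
    "nic_reset": "Reset adapter unexpectedly",
}

# Three entries taken verbatim from the log output, joined on one line
SAMPLE = (
    "2015-07-29T11:20:28.440Z cpu0:32980)StorageApdHandlerEv: 110: Device or "
    "filesystem with identifier [2c891b0c-e7814de8] has entered the All Paths "
    "Down state. "
    "2015-07-29T11:20:44.912Z cpu0:33186)WARNING: LinNet: netdev_watchdog:3678: "
    "NETDEV WATCHDOG: vmnic2: transmit timed out "
    "2015-07-29T11:20:44.912Z cpu0:33186)<3>e1000e 0000:03:00.0: vmnic2: "
    "Reset adapter unexpectedly"
)


def split_entries(text):
    """Cut the text into entries, each starting at a timestamp."""
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


def find_events(text):
    """Return (timestamp, label) pairs for entries matching a known pattern."""
    events = []
    for entry in split_entries(text):
        stamp = TIMESTAMP.match(entry).group()
        for label, needle in PATTERNS.items():
            if needle in entry:
                events.append((stamp, label))
    return events


if __name__ == "__main__":
    for stamp, label in find_events(SAMPLE):
        print(stamp, label)
```

In the log output above, the APD start (11:20:28.440) is logged about 16 seconds before the vmnic2 transmit timeout (11:20:44.912), so listing the events in order like this may help show whether storage or the NIC complains first on the other hosts.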