Question

Trouble with EdgeCore Switch

  • 21 November 2018
  • 3 replies

Hi everyone,

My name is Axel. I run an infrastructure with 2 EdgeCore switches in a server bay, along with 2 pfSense firewalls and some physical servers (ESXi, backup, JBOD...).
We ran into trouble recently: one of the switches (EdgeCore2) went down for no apparent reason, knocking out our production infrastructure.
Only a hard reboot, done by unplugging the power cords, fixed the problem.

By the way, we hit the same problem last June, so it is recurring.

The logs show these 2 lines, repeated on both switches:

code:
2018-11-16T10:16:11.447573+01:00 sw-cl-2 sudo:     snmp : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/cumulus/bin/cl-resource-query -k
2018-11-16T10:16:13.880065+01:00 sw-cl-2 jdoo[4033]: 'localhost' sysloadavg(15min) of 130.0 matches resource limit [sysloadavg(15min)>110.0]
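
For anyone who wants to track how often this load alert fires, here is a minimal Python sketch that pulls the measured value and the limit out of the jdoo line above. The regular expression is an assumption based on that single sample line, not on jdoo documentation, and `parse_load_alert` is a hypothetical helper name.

```python
import re

# Layout assumed from the one jdoo alert line quoted in this post;
# other jdoo messages may well look different.
ALERT_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) jdoo\[\d+\]: .*"
    r"sysloadavg\((?P<window>\d+)min\) of (?P<value>[\d.]+) "
    r"matches resource limit \[sysloadavg\(\d+min\)>(?P<limit>[\d.]+)\]"
)

def parse_load_alert(line):
    """Return the alert fields as a dict, or None if the line is not a load alert."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": m.group("ts"),
        "host": m.group("host"),
        "window_min": int(m.group("window")),
        "value": float(m.group("value")),
        "limit": float(m.group("limit")),
    }

sample = (
    "2018-11-16T10:16:13.880065+01:00 sw-cl-2 jdoo[4033]: 'localhost' "
    "sysloadavg(15min) of 130.0 matches resource limit [sysloadavg(15min)>110.0]"
)
alert = parse_load_alert(sample)
print(alert)
```

Running this over the full syslog (one call per line) would give a timeline of when the 15-minute load average crossed the limit on each switch.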

So we ran these commands:
On EdgeCore1
code:
Tue Nov 20 11:04:58 CET 2018
root@sw-cl-1 # cl-resource-query
Host entries: 143, 1% of maximum value 12288
IPv4 neighbors: 17
IPv6 neighbors: 63
IPv4/IPv6 entries: 13, 0% of maximum value 7892
Long IPv6 entries: 4, 0% of maximum value 2048
IPv4 Routes: 9
IPv6 Routes: 6
Total Routes: 15, 0% of maximum value 32768
ECMP nexthops: 0, 0% of maximum value 4044
MAC entries: 192, 0% of maximum value 24576
Ingress ACL entries: 317, 17% of maximum value 1792
Ingress ACL counters: 574, 2% of maximum value 28672
Ingress ACL meters: 21, 0% of maximum value 4096
Ingress ACL slices: 2, 28% of maximum value 7
Egress ACL entries: 25, 9% of maximum value 256
Egress ACL counters: 50, 4% of maximum value 1024
Egress ACL meters: 25, 9% of maximum value 256
Egress ACL slices: 1, 100% of maximum value 1


On EdgeCore2
code:
root@sw-cl-2 # date
Tue Nov 20 11:04:56 CET 2018
root@sw-cl-2 # cl-resource-query
Host entries: 140, 1% of maximum value 12288
IPv4 neighbors: 16
IPv6 neighbors: 62
IPv4/IPv6 entries: 13, 0% of maximum value 7892
Long IPv6 entries: 3, 0% of maximum value 2048
IPv4 Routes: 9
IPv6 Routes: 5
Total Routes: 14, 0% of maximum value 32768
ECMP nexthops: 0, 0% of maximum value 4044
MAC entries: 193, 0% of maximum value 24576
Ingress ACL entries: 317, 17% of maximum value 1792
Ingress ACL counters: 574, 2% of maximum value 28672
Ingress ACL meters: 21, 0% of maximum value 4096
Ingress ACL slices: 2, 28% of maximum value 7
Egress ACL entries: 25, 9% of maximum value 256
Egress ACL counters: 50, 4% of maximum value 1024
Egress ACL meters: 25, 9% of maximum value 256
Egress ACL slices: 1, 100% of maximum value 1


The last lines were displayed in red.
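
To compare the two switches over time, the `cl-resource-query` output above can be parsed and filtered for resources close to their maximum. This is only a sketch in Python: the line format is assumed from the output pasted in this post, and `parse_resources`, `near_limit` and the 90% threshold are hypothetical choices. A resource at 100% is not necessarily the cause of the outage.

```python
import re

# Matches "Name: used" and "Name: used, P% of maximum value MAX",
# as seen in the pasted output (format assumed from that output only).
LINE_RE = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<used>\d+)"
    r"(?:,\s*\d+% of maximum value\s+(?P<max>\d+))?\s*$"
)

def parse_resources(text):
    """Parse cl-resource-query style text into a list of dicts."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip prompts, dates and blank lines
        rows.append({
            "name": m.group("name"),
            "used": int(m.group("used")),
            "max": int(m.group("max")) if m.group("max") else None,
        })
    return rows

def near_limit(rows, threshold_pct=90):
    """Return rows whose usage is at or above threshold_pct of the maximum."""
    return [
        r for r in rows
        if r["max"] is not None and 100 * r["used"] >= threshold_pct * r["max"]
    ]

sample = """\
Tue Nov 20 11:04:56 CET 2018
Host entries: 140, 1% of maximum value 12288
IPv4 neighbors: 16
Ingress ACL slices: 2, 28% of maximum value 7
Egress ACL entries: 25, 9% of maximum value 256
Egress ACL slices: 1, 100% of maximum value 1
"""
rows = parse_resources(sample)
flagged = [r["name"] for r in near_limit(rows)]
print(flagged)
```

On the sample lines, only the egress ACL slices entry is flagged, which matches the line that stood out in the output.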

I need somebody to explain the issue with the switches to me.
I can send additional logs recorded during the event if needed, or our EdgeCore config files.


Thank you all.
Best Regards

3 replies

Here are the additional logs from the incident, in case someone has tips or a hypothesis:

code:
2018-11-16T11:22:36.336891+01:00 sw-cl-2 switchd[2242]: sync.c:3759 Neighbor Summary : 0 Added, 2 Deleted, 0 Updated, 0 Skipped in 7859 usecs
2018-11-16T11:23:43.036391+01:00 sw-cl-2 switchd[2242]: sync.c:3759 Neighbor Summary : 2 Added, 0 Deleted, 0 Updated, 0 Skipped in 8451 usecs
2018-11-16T11:24:25.881228+01:00 sw-cl-2 switchd[2242]: sync.c:3759 Neighbor Summary : 0 Added, 1 Deleted, 0 Updated, 0 Skipped in 7534 usecs
2018-11-16T11:24:44.563661+01:00 sw-cl-2 switchd[2242]: sync.c:3759 Neighbor Summary : 1 Added, 0 Deleted, 0 Updated, 0 Skipped in 6685 usecs
2018-11-16T11:25:39.176449+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN soc_dma_done_chain: chan 1 dv_dcnt != dv_vcnt (log message replaces assert)
2018-11-16T11:25:39.176503+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN DMA STAT: 0
2018-11-16T11:25:39.176521+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: dv@0x3cba208 unit 0 dcbtype-23 op=RX vcnt=16 dcnt=1 cnt=16
2018-11-16T11:25:39.176930+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: chan=1 chain=(nil) flags=0x0-->notify-pkt
2018-11-16T11:25:39.176961+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: user1 0x3cba1e0. user2 (nil). user3 (nil). user4 (nil)
2018-11-16T11:25:39.176979+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: dcb[0] @0xb2c17600:
2018-11-16T11:25:39.176994+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #01165165c00 00014000 00000000 00400000 00720600 80004e34 fb100072 01140000 08000000 00720200 00080000 00000004 00000000 000c0082 00000000 8003004e
2018-11-16T11:25:39.177017+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: HIGIG2 Frame: len=16 (header=16 payload=0)
2018-11-16T11:25:39.177034+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: 0xfb100072
2018-11-16T11:25:39.177067+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: 0x01140000
2018-11-16T11:25:39.177084+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: 0x08000000
2018-11-16T11:25:39.177150+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: 0x00720200
2018-11-16T11:25:39.177198+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: 802.3 Ether-II VLAN-Tagged Payload (0 bytes)
2018-11-16T11:25:39.178084+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011type 23 chain !sg !reload !hg !stat !pause !purge
2018-11-16T11:25:39.178119+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011addr 0xb28c5c00 reqcount 16384 xfercount 78
2018-11-16T11:25:39.178139+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011done !error start end
2018-11-16T11:25:39.178156+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011reason bit 22: VlanFilterMatch
2018-11-16T11:25:39.178175+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011!chg_tos !regen_crc !chg_ecn !vfi_valid !dvp_nhi_sel
2018-11-16T11:25:39.178193+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011!service_tag switch_pkt !hg_type !src_hg
2018-11-16T11:25:39.178211+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011!l3routed !l3only !replicated !do_not_change_ttl !bpdu
2018-11-16T11:25:39.178231+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011!hg2_ext_hdr
2018-11-16T11:25:39.178247+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011reason_type=0 reason=00000000_00400000 ts_type=3 timestamp=00000000_00000000Dma descr: #011srcport=1536 cpu_cos=2 hgi=0 lb_pkt_type=52 repl_nhi=04000
2018-11-16T11:25:39.178268+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011outer_vid=114 outer_cfi=0 outer_pri=0 otag_action=0 vntag_action=0
2018-11-16T11:25:39.178286+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011inner_vid=0 inner_cfi=0 inner_pri=0 itag_action=0 itag_status=2
2018-11-16T11:25:39.179451+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011dscp=4 ecn=0 !special_pkt !all_switch_drop
2018-11-16T11:25:39.179481+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: #011decap_tunnel_type=0 vfi=0 match_rule=0 mtp_ind=0
2018-11-16T11:25:39.179502+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN Dma descr: data[0000]: {01005e000100} {5e0000120000} 5e00 016c
2018-11-16T11:25:39.179519+01:00 sw-cl-2 switchd[2242]: hal_bcm_console.c:149 WARN




Thank you

Regards
What software version are you running here? This output looks like it is from the 2.x train. I see a bug for this message, "soc_dma_done_chain", in the 2.0 timeframe. It has since been fixed as part of defect CM-2049.
Indeed, here is the output of cat /etc/lsb-release:

DISTRIB_ID="Cumulus Linux"
DISTRIB_RELEASE=2.5.7
DISTRIB_DESCRIPTION=2.5.7-753304d-201603071654-build


Thank you, Eric, for your time.
