Size Overflow in _decode_session6 - Kernel panic

Posted: Sat Nov 15, 2014 7:36 pm
by deagol
Hello,

A PaX-enabled kernel crashes my system every night when the DSL provider resets the IPv4 connection.
I have an IPv6 tunnel to Hurricane Electric terminating inside a virtual machine running Gentoo Hardened. When the IPv4 connection is reset, the system with the IPv6 tunnel panics. If I start a ping to any IPv6 address and restart the PPPoE daemon on the IPv4 internet firewall, I can reproduce an instant panic every time.

I found the same problem analyzed here: http://forums.gentoo.org/viewtopic-t-1003804.html with instructions to report it to the PaX Team, but no evidence that this has happened so far.

Digging around a bit, I was able to pin the panic on the size overflow protection for skb_network_header_len in _decode_session6, and I could even avoid the panic by disabling the size overflow protection for skb_network_header_len.

(This is probably related to thread https://forums.grsecurity.net/viewtopic.php?f=1&t=4033 and I used the instructions there for debugging.)

So what happens when the IPv4 connection is reset while a ping6 is running over the IPv6 tunnel?
There is no PaX or other message at all; the system simply panics:

Code:
Kernel panic - not syncing: Aiee, killing interrupt handler!
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.2-hardened-r1 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 0000000000000000 ffff88011fc437e8 ffffffff91489ba4
 ffffffff915ff982 ffff88011fc43868 ffffffff914860df 0000000000000008
 ffff88011fc43878 ffff88011fc43810 0000000000000000 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff91489ba4>] dump_stack+0x45/0x5c
 [<ffffffff914860df>] panic+0xc8/0x211
 [<ffffffff9103ce98>] do_exit+0xad/0x911
 [<ffffffff91005047>] ? show_stack_log_lvl+0x104/0x119
 [<ffffffff9103e594>] do_group_exit+0x4b/0x10d
 [<ffffffff91100776>] report_size_overflow+0x41/0x41
 [<ffffffff9146c43d>] _decode_session6+0x198/0x30e
 [<ffffffff914310cf>] __xfrm_decode_session+0x41/0x58
 [<ffffffff9145d71e>] icmpv6_route_lookup+0xcc/0x152
 [<ffffffff91452255>] ? fib6_rule_lookup+0x37/0x45
 [<ffffffff9145e0c1>] icmp6_send+0x5b1/0x7a7
 [<ffffffff9148e7d1>] ? _raw_read_unlock_bh+0x28/0x31
 [<ffffffff9144d73d>] ? ip6_pol_route_lookup+0x19d/0x1b5
 [<ffffffff91452255>] ? fib6_rule_lookup+0x37/0x45
 [<ffffffff9147693a>] icmpv6_send+0x40/0x4d
 [<ffffffff91474409>] ipip6_err+0x20c/0x27c
 [<ffffffff9142d1d7>] tunnel64_err+0x36/0x4f
 [<ffffffff914150bf>] icmp_socket_deliver+0xc1/0xce
 [<ffffffff91415329>] icmp_unreach+0x1cf/0x1ee
 [<ffffffff91415f5f>] icmp_rcv+0x1c4/0x374
 [<ffffffff913e6e79>] ip_local_deliver_finish+0x11b/0x1f2
 [<ffffffff913e70bf>] ip_local_deliver+0x7c/0x86
 [<ffffffff913e6d16>] ip_rcv_finish+0x28f/0x2d7
 [<ffffffff913e73b8>] ip_rcv+0x2ef/0x361
 [<ffffffff913af501>] __netif_receive_skb_core+0x628/0x677
 [<ffffffff913af56c>] __netif_receive_skb+0x1c/0x71
 [<ffffffff913af5fe>] netif_receive_skb_internal+0x3d/0x78
 [<ffffffff913af64b>] netif_receive_skb+0x12/0x1a
 [<ffffffff91365642>] virtnet_receive+0x646/0x6a3
 [<ffffffff913656c7>] virtnet_poll+0x28/0x98
 [<ffffffff913afca1>] net_rx_action+0x120/0x231
 [<ffffffff9103f0f0>] __do_softirq+0x10e/0x213
 [<ffffffff9103f39b>] irq_exit+0x40/0x8a
 [<ffffffff91004581>] do_IRQ+0xc1/0xe0
 [<ffffffff91490386>] common_interrupt+0x86/0x86
 <EOI>  [<ffffffff9100ad8c>] ? sched_clock+0x9/0x13
 [<ffffffff9100ba5f>] ? hard_enable_TSC+0x21/0x21
 [<ffffffff9102ef00>] ? native_safe_halt+0x6/0xe
 [<ffffffff9100ba68>] default_idle+0x9/0x13
 [<ffffffff9100c1ea>] arch_cpu_idle+0x17/0x1f
 [<ffffffff9106b684>] cpu_startup_entry+0x107/0x201
 [<ffffffff910284ed>] ? lapic_resume+0x2ae/0x2ae
 [<ffffffff910267c5>] start_secondary+0x22f/0x23a
Kernel Offset: 0x10000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Aiee, killing interrupt handler!


I'm running Gentoo Hardened and have reproduced the issue with "linux-3.15.10-hardened-r1" and "linux-3.17.2-hardened-r1_fixed". The latter uses 4420_grsecurity-3.0-3.17.2-201411062034.patch.

With the debug code from ephox, and with the size overflow protection for skb_network_header_len disabled (to prevent the kernel panic so that I can get the debug printk messages), I get this in dmesg instead of the kernel panic:

Code:
[  541.338918] PAX _decode_session6: transport_header: 76, network_header: 4e
[  541.385364] PAX _decode_session6: transport_header: 62, network_header: 7e
[  542.387031] PAX _decode_session6: transport_header: 62, network_header: 7e
[  543.389107] PAX _decode_session6: transport_header: 62, network_header: 7e
[  544.390073] PAX _decode_session6: transport_header: 62, network_header: 7e
[  545.391771] PAX _decode_session6: transport_header: 62, network_header: 7e
[  546.393425] PAX _decode_session6: transport_header: 62, network_header: 7e
[  547.394839] PAX _decode_session6: transport_header: 62, network_header: 7e

(To prevent the kernel panic I removed "skb_network_header_len" from "size_overflow_hash.data".)
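
The wrap-around behind these numbers can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel source: `toy_skb` and `toy_network_header_len` are made-up names, and it assumes that both offsets are 16-bit values and that the length is handed back as an unsigned 32-bit number. With the second pair of values from the dmesg output (0x62 and 0x7e) the subtraction is negative and wraps to a huge unsigned value, which is presumably what the size overflow check trips over.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields printed by the debug code.
 * Assumption: both are 16-bit offsets. */
struct toy_skb {
    uint16_t transport_header;
    uint16_t network_header;
};

/* Models "transport_header - network_header" returned as an unsigned value.
 * The operands are promoted to int, so the difference can be negative; the
 * conversion to uint32_t then wraps it around modulo 2^32. */
static uint32_t toy_network_header_len(const struct toy_skb *skb)
{
    return (uint32_t)(skb->transport_header - skb->network_header);
}
```

For the first logged pair (0x76, 0x4e) this gives 40, the size of a fixed IPv6 header. For the second pair (0x62, 0x7e) the difference is -28, which becomes 4294967268 once it is treated as unsigned.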

It looks like _decode_session6 is called for each ping or, more likely, for each response:
Running tcpdump, I see IPv4 packets in reply to the IPv6 pings. The first one is an ICMP type 3 code 13 (Communication administratively filtered) with a length of 94 bytes. For every following IPv6 ping sent out by the tunnel interface there is an ICMP type 3 code 3 (Port unreachable) with a length of 166 bytes (counted including the Ethernet headers).
Looking at the "translated" ICMPv6 messages, the first IPv4 ICMP reply is indeed lost in translation: there is no corresponding IPv6 message. But the first IPv6 ping that gets a port unreachable response gets two ICMPv6 type 1 code 3 (Address unreachable) replies, about 45 ms apart in the capture I have. (The sequence number makes that quite clear.)

I now assume that there is a bug in the Linux kernel that somehow mangles the "communication administratively filtered" ICMPv4 packet and thereby triggers the PaX check.

So any idea how to debug that further or get it fixed? I'm way out of my depth here and surprised I got so far at all...

Re: Size Overflow in _decode_session6 - Kernel panic

Posted: Sat Nov 15, 2014 8:43 pm
by PaX Team
thanks a lot for the debugging, this does look like a real issue in the upstream kernel so the best course of action would be to report it to lkml/netdev and have some ipv6/tunneling expert tell what's going on. it's entirely possible that the mismatch between transport_header and network_header is normal and the kernel is supposed to recover from it gracefully but in that case we need confirmation before we exclude skb_network_header_len from the overflow check. also for reference, this is also tracked as gentoo bug https://bugs.gentoo.org/show_bug.cgi?id=529352 .

Re: Size Overflow in _decode_session6 - Kernel panic

Posted: Sun Nov 16, 2014 8:47 am
by deagol
I think I got much closer to the real issue.

It looks like the culprit lies within the function "ipip6_err_gen_icmpv6_unreach" in sit.c.
Without deep Linux kernel or even C knowledge I'm not sure whether this is the correct fix, but it works for me:

Code:
--- linux-3.17.2-hardened-r1_orig/net/ipv6/sit.c        2014-11-16 11:27:12.100000000 +0100
+++ linux-3.17.2-hardened-r1/net/ipv6/sit.c     2014-11-16 13:32:05.180000000 +0100
@@ -500,6 +500,7 @@
        skb_dst_drop(skb2);
        skb_pull(skb2, ihl);
        skb_reset_network_header(skb2);
+       skb_reset_transport_header(skb2);
 
        rt = rt6_lookup(dev_net(skb->dev), &ipv6_hdr(skb2)->saddr, NULL, 0, 0);



** edit: some interesting additional remarks to the conclusion above**

The problem is triggered by any IPv4 ICMP destination unreachable packet for the tunnel endpoint. I'm able to trigger the same error without resetting the PPPoE connection; a reject rule (tested with ICMP port unreachable) for the tunnel destination is sufficient.
As mentioned, I do not really understand the code in sit.c, but it looks like the skb from the IPv4 packet is copied to skb2 and later sent out as the ICMPv6 packet. With some printk's I could see that skb at first had sane values for network_header and transport_header (transport_header > network_header), but just before the call to icmpv6_send we had the same strange values as seen in _decode_session6. So I looked for a function that also resets the transport header and added it. (With the patch and the debug code, transport_header == network_header for me, so there is no underflow in skb_network_header_len, and I also get basically the same ICMPv6 packet as without the patch.)
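
To make the effect of the one-line patch easier to follow, here is a small user-space model of the three calls involved. It is only a sketch under assumptions: `toy_buf` and all `toy_*` helpers are invented for illustration and merely mimic what skb_pull, skb_reset_network_header and skb_reset_transport_header do to the offsets. The starting offsets (78, 98, 106) are chosen so that the result matches the 0x62/0x7e pair from the dmesg output; that transport_header still points at the outer ICMP header at that moment is my reading of the numbers, not something confirmed in the thread.

```c
#include <stdint.h>

/* Hypothetical model of the sk_buff offsets that matter here. */
struct toy_buf {
    uint16_t data;             /* current start of packet data */
    uint16_t network_header;   /* offset of the network header */
    uint16_t transport_header; /* offset of the transport header */
};

/* Mimics skb_pull(): advance the data offset by len bytes. */
static void toy_pull(struct toy_buf *b, uint16_t len) { b->data += len; }

/* Mimic the reset helpers: point the header offset at the current data. */
static void toy_reset_network_header(struct toy_buf *b) { b->network_header = b->data; }
static void toy_reset_transport_header(struct toy_buf *b) { b->transport_header = b->data; }

/* Signed difference, so a stale transport_header shows up as a negative gap. */
static int32_t toy_header_gap(const struct toy_buf *b)
{
    return (int32_t)b->transport_header - (int32_t)b->network_header;
}

/* Replays the sequence from the diff on a copy; with_fix adds the patched call. */
static struct toy_buf toy_prepare_skb2(struct toy_buf b, uint16_t ihl, int with_fix)
{
    toy_pull(&b, ihl);
    toy_reset_network_header(&b);
    if (with_fix)
        toy_reset_transport_header(&b);
    return b;
}
```

Starting from data = 106, network_header = 78, transport_header = 98 and pulling ihl = 20 bytes, the unpatched sequence leaves transport_header at 98 (0x62) and network_header at 126 (0x7e), a gap of -28. With the extra reset both offsets are 126 and the gap is 0.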

Re: Size Overflow in _decode_session6 - Kernel panic

Posted: Wed Apr 01, 2015 6:35 am
by PaX Team
update for future searches: while the originally reported problem was fixed at the time it seems that it was either incomplete or there's another upstream bug lurking in the same code. you can track the progress on that one at https://bugs.gentoo.org/show_bug.cgi?id=545192 .