Consul triggering Kernel crash

Postby coredumb » Thu Feb 19, 2015 5:30 am

Code:
Feb 19 07:11:04 kernel: list_del corruption. prev->next should be ffff88007801e580, but was fefefefefefefefe
Feb 19 07:11:04 kernel: ------------[ cut here ]------------
Feb 19 07:11:04 kernel: kernel BUG at lib/list_debug.c:87!
Feb 19 07:11:04 kernel: invalid opcode: 0000 [#1] SMP
Feb 19 07:11:04 kernel: Modules linked in: netconsole configfs ipv6 ppdev microcode pcspkr e1000 parport_pc parport i2c_piix4 i2c_core sg shpchp ext4 jbd2 mbcache sd_mod crc_t10dif crct10dif_common sr_mod cdrom floppy mptsas mptscsih mptbase scsi_transport_sas pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
Feb 19 07:11:04 kernel: CPU: 1 PID: 30919 Comm: consul Not tainted 3.14.32-100.el6.x86_64 #1
Feb 19 07:11:04 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
Feb 19 07:11:04 kernel: task: ffff88007803c360 ti: ffff88007803cad8 task.ti: ffff88007803cad8
Feb 19 07:11:04 kernel: RIP: 0010:[<ffffffff812f44ac>]  [<ffffffff812f44ac>] __list_del_entry_debug+0x7c/0xa0
Feb 19 07:11:04 kernel: RSP: 0018:ffffc9000e5a3d80  EFLAGS: 00010083
Feb 19 07:11:04 kernel: RAX: 0000000000000054 RBX: ffff88007801e580 RCX: 0000000000000000
Feb 19 07:11:04 kernel: RDX: ffff88007fd0cf00 RSI: ffff88007fd0b508 RDI: 0000000000000046
Feb 19 07:11:04 kernel: RBP: ffffc9000e5a3d88 R08: 0000000000000092 R09: 0000000002000000
Feb 19 07:11:04 kernel: R10: 00000000000004c7 R11: ffffea0000db9f40 R12: ffff88007801e568
Feb 19 07:11:04 kernel: R13: 0000000000000282 R14: ffff88007961a280 R15: 0000000000000003
Feb 19 07:11:04 kernel: FS:  000002d2b1f2b700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
Feb 19 07:11:04 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 19 07:11:04 kernel: CR2: ffffffffff600400 CR3: 0000000001647000 CR4: 00000000000007f0
Feb 19 07:11:04 kernel: Stack:
Feb 19 07:11:04 kernel: ffff88007801e580 ffffc9000e5a3da0 ffffffff812f4580 ffff88007801e580
Feb 19 07:11:04 kernel: ffffc9000e5a3db8 ffffffff812f45b0 ffff88007814be80 ffffc9000e5a3de0
Feb 19 07:11:04 kernel: ffffffff810a7b09 ffff88007801e550 ffff880036ebfb40 ffff880036ebfb80
Feb 19 07:11:04 kernel: Call Trace:
Feb 19 07:11:04 kernel: [<ffffffff812f4580>] __list_del_entry+0x10/0x30
Feb 19 07:11:04 kernel: [<ffffffff812f45b0>] list_del+0x10/0x30
Feb 19 07:11:04 kernel: [<ffffffff810a7b09>] remove_wait_queue+0x29/0x50
Feb 19 07:11:04 kernel: [<ffffffff811fefba>] ep_unregister_pollwait.isra.8+0x3a/0x60
Feb 19 07:11:04 kernel: [<ffffffff811ff002>] ep_remove+0x22/0xd0
Feb 19 07:11:04 kernel: [<ffffffff8120046c>] SyS_epoll_ctl+0x41c/0xa30
Feb 19 07:11:04 kernel: [<ffffffff816389ff>] system_call_fastpath+0x16/0x1b
Feb 19 07:11:04 kernel: [<ffffffff81638a36>] ? sysret_check+0x2d/0x6c
Feb 19 07:11:04 kernel: [<ffffffff8162f4cc>] ? retint_swapgs+0x13/0x16
Feb 19 07:11:04 kernel: Code: e8 9f b6 32 00 0f 0b 48 89 de 48 c7 c7 88 6b db 81 31 c0 e8 8c b6 32 00 0f 0b 48 89 de 48 c7 c7 48 6b db 81 31 c0 e8 79 b6 32 00 <0f> 0b 48 89 de 48 c7 c7 10 6b db 81 31 c0 e8 66 b6 32 00 0f 0b
Feb 19 07:11:04 kernel: RIP  [<ffffffff812f44ac>] __list_del_entry_debug+0x7c/0xa0
Feb 19 07:11:04 kernel: RSP <ffffc9000e5a3d80>
Feb 19 07:11:04 kernel: ---[ end trace 6a8fa99325ee51c4 ]---
Feb 19 07:11:04 kernel: grsec: banning user with uid 515 until system restart for suspicious kernel crash         


In this case, uid 515 is the user running the consul daemon on those servers. The crash is reproducible roughly every 24 hours.
coredumb
 
Posts: 14
Joined: Mon Aug 25, 2014 10:11 am

Re: Consul triggering Kernel crash

Postby minipli » Thu Feb 19, 2015 6:07 am

This seems to be a known upstream bug (https://lkml.org/lkml/2014/5/15/532, https://lkml.org/lkml/2013/10/14/424) but without a fix.

I think the issue here is that the epi_cache (fs/eventpoll.c) is trying to use RCU operations on a SLAB not marked with SLAB_DESTROY_BY_RCU. Without that flag, MEMORY_SANITIZE assumes it is free to sanitize the objects at kmem_cache_free() time. Since the free is already deferred through RCU for the epi_cache (it uses call_rcu()), MEMORY_SANITIZE should not sanitize the object in the epi_rcu_free() hook. But for MEMORY_SANITIZE to know this, the SLAB must be marked accordingly.
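
For readers following along, here is a minimal userspace sketch in Python of how a sanitized (or otherwise freed and overwritten) list node could lead to the "list_del corruption" message in the log above. It is a toy model under stated assumptions, not kernel code: the `Node` class, the `sanitize()` helper, and the second address are made up for illustration, the 0xfe fill pattern is simply taken from the `fefefefefefefefe` value in the log, and the check is a simplified version of what lib/list_debug.c does.

```python
# Fill pattern assumed from the log ("but was fefefefefefefefe").
POISON = 0xFEFEFEFEFEFEFEFE


class ListCorruption(Exception):
    """Raised where the kernel would hit BUG() in lib/list_debug.c."""


class Node:
    """Stand-in for a struct list_head; `addr` is only a label for messages."""

    def __init__(self, addr):
        self.addr = addr
        self.next = self
        self.prev = self


def list_add(new, head):
    """Link `new` directly after `head`."""
    new.prev, new.next = head, head.next
    head.next.prev = new
    head.next = new


def sanitize(node):
    """Model a freed object whose memory was overwritten with the pattern."""
    node.next = node.prev = POISON


def _fmt(value):
    """Format either a poison integer or a node label as 16 hex digits."""
    return f"{value:016x}" if isinstance(value, int) else f"{value.addr:016x}"


def list_del_checked(entry):
    """Unlink `entry`, refusing if its neighbours no longer point back at it."""
    if entry.prev.next is not entry:
        raise ListCorruption(
            f"list_del corruption. prev->next should be {_fmt(entry)}, "
            f"but was {_fmt(entry.prev.next)}"
        )
    if entry.next.prev is not entry:
        raise ListCorruption(
            f"list_del corruption. next->prev should be {_fmt(entry)}, "
            f"but was {_fmt(entry.next.prev)}"
        )
    entry.prev.next = entry.next
    entry.next.prev = entry.prev


def demo():
    """Free a neighbouring node while `entry` is still linked to it."""
    neighbour = Node(0xFFFF88007801E568)  # hypothetical address
    entry = Node(0xFFFF88007801E580)      # address taken from the first log
    list_add(entry, neighbour)
    sanitize(neighbour)                   # neighbour is "freed" too early
    try:
        list_del_checked(entry)
    except ListCorruption as exc:
        return str(exc)
    return "no corruption detected"
```

Calling `demo()` returns a message of the same shape as the first line of the oops. The point of the sketch is only that the entry being removed is intact while the memory it is still linked to has already been overwritten.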

Can you try the following patch, please?: http://r00tworld.net/~minipli/grsec/pax ... l_rcu.diff
minipli
 
Posts: 21
Joined: Mon Jan 03, 2011 6:39 pm

Re: Consul triggering Kernel crash

Postby coredumb » Mon Feb 23, 2015 2:26 am

Still crashing :(

Code:
list_del corruption. prev->next should be ffff880036e41710, but was fefefefefefefefe
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:87!
invalid opcode: 0000 [#1] SMP
Modules linked in: netconsole configfs ipv6 ppdev microcode pcspkr e1000 parport_pc parport sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif crct10dif_common sr_mod cdrom floppy mptsas mptscsih mptbase scsi_transport_sas pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 1637 Comm: consul Not tainted 3.14.33-101.el6.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
task: ffff8800797289a0 ti: ffff880079728d08 task.ti: ffff880079728d08
RIP: 0010:[<ffffffff8130dff9>]  [<ffffffff8130dff9>] __list_del_entry_debug+0x89/0xb0
RSP: 0018:ffffc9000bbabd58  EFLAGS: 00010097
RAX: 0000000000000054 RBX: ffff880036e41710 RCX: 0000000000000000
RDX: ffff88007fc0d140 RSI: ffff88007fc0b648 RDI: 0000000000000046
RBP: ffffc9000bbabd60 R08: 0000000000000096 R09: 0000000000000000
R10: 000002cb0ce1570c R11: ffffea0001e22c40 R12: ffff88003734b080
R13: 0000000000000282 R14: ffffffffffffffea R15: 0000000000000003
FS:  000002cb0bf54700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 000000000166f000 CR4: 00000000000007f0
Stack:
 ffff880036e41710 ffffc9000bbabd78 ffffffff8130e0d0 ffff880036e41710
 ffffc9000bbabd90 ffffffff8130e100 ffff880036e416f8 ffffc9000bbabdb8
 ffffffff810ad018 ffff880036e416e0 ffff880036e40a80 ffff880036e40ac0
Call Trace:
 [<ffffffff8130e0d0>] __list_del_entry+0x10/0x30
 [<ffffffff8130e100>] list_del+0x10/0x30
 [<ffffffff810ad018>] remove_wait_queue+0x28/0x40
 [<ffffffff8121190a>] ep_unregister_pollwait.isra.8+0x3a/0x60
 [<ffffffff81211952>] ep_remove+0x22/0xd0
 [<ffffffff81213390>] SyS_epoll_ctl+0x680/0xb00
 [<ffffffff8166297f>] system_call_fastpath+0x16/0x1b
 [<ffffffff816629b6>] ? sysret_check+0x2d/0x6c
Code: 66 90 48 89 de 48 c7 c7 38 b3 de 81 31 c0 e8 b7 02 34 00 0f 0b 0f 1f 44 00 00 48 89 de 48 c7 c7 70 b3 de 81 31 c0 e8 9f 02 34 00 <0f> 0b 0f 1f 44 00 00 48 89 de 48 c7 c7 b0 b3 de 81 31 c0 e8 87
RIP  [<ffffffff8130dff9>] __list_del_entry_debug+0x89/0xb0
 RSP <ffffc9000bbabd58>
---[ end trace 678558703460c6b0 ]---
grsec: banning user with uid 515 until system restart for suspicious kernel crash

Re: Consul triggering Kernel crash

Postby minipli » Mon Feb 23, 2015 5:18 am

Seems to be a real use-after-free. I'll try to create a reproducer locally and will report back when I have something.

Re: Consul triggering Kernel crash

Postby minipli » Tue Mar 03, 2015 5:29 am

Could you please revert the above patch and apply the following one instead?: http://r00tworld.net/~minipli/grsec/af_unix-no_dead_poll.diff
It withstood a 12 hour stress test with my local reproducer that otherwise triggered the bug within seconds.
If it fixes the bug for you too, it should probably go upstream as it's a vanilla bug. If it does not, well, then there're other bugs still waiting to be found ;)

Re: Consul triggering Kernel crash

Postby coredumb » Tue Mar 03, 2015 9:52 am

Testing right now. I'll let you know in 3 to 4 days.

Re: Consul triggering Kernel crash

Postby coredumb » Fri Mar 06, 2015 1:38 am

Code:
list_del corruption. prev->next should be ffff880037e44df0, but was fefefefefefefefe
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:87!
invalid opcode: 0000 [#1] SMP
Modules linked in: netconsole configfs ipv6 ppdev microcode pcspkr e1000 parport_pc parport sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif crct10dif_common sr_mod cdrom floppy mptsas mptscsih mptbase scsi_transport_sas pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
CPU: 1 PID: 1658 Comm: consul Not tainted 3.14.34-100.el6.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
task: ffff88007c219320 ti: ffff88007c219720 task.ti: ffff88007c219720
RIP: 0010:[<ffffffff8130db19>]  [<ffffffff8130db19>] __list_del_entry_debug+0x89/0xb0
RSP: 0018:ffffc9000bc53d48  EFLAGS: 00010093
RAX: 0000000000000054 RBX: ffff880037e44df0 RCX: 0000000000000000
RDX: ffff88007fd0d140 RSI: ffff88007fd0b648 RDI: 0000000000000046
RBP: ffffc9000bc53d50 R08: 0000000000000092 R09: 0000000000000000
R10: 000002b015372e0c R11: ffffea0001e04400 R12: ffff8800373ccd80
R13: 0000000000000282 R14: ffffffffffffffea R15: 0000000000000003
FS:  000002b015373700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 0000000001671000 CR4: 00000000000007f0
Stack:
 ffff880037e44df0 ffffc9000bc53d68 ffffffff8130dbf0 ffff880037e44df0
 ffffc9000bc53d80 ffffffff8130dc20 ffff880037e44dd8 ffffc9000bc53da8
 ffffffff810ad168 ffff880037e44dc0 ffff880037e43780 ffff880037e437c0
Call Trace:
 [<ffffffff8130dbf0>] __list_del_entry+0x10/0x30
 [<ffffffff8130dc20>] list_del+0x10/0x30
 [<ffffffff810ad168>] remove_wait_queue+0x28/0x40
 [<ffffffff8121150a>] ep_unregister_pollwait.isra.8+0x3a/0x60
 [<ffffffff81211552>] ep_remove+0x22/0xd0
 [<ffffffff81212f6a>] SyS_epoll_ctl+0x65a/0xac0
 [<ffffffff81662aff>] system_call_fastpath+0x16/0x1b
 [<ffffffff81662b36>] ? sysret_check+0x2d/0x6c
Code: 66 90 48 89 de 48 c7 c7 28 b4 de 81 31 c0 e8 77 08 34 00 0f 0b 0f 1f 44 00 00 48 89 de 48 c7 c7 60 b4 de 81 31 c0 e8 5f 08 34 00 <0f> 0b 0f 1f 44 00 00 48 89 de 48 c7 c7 a0 b4 de 81 31 c0 e8 47
RIP  [<ffffffff8130db19>] __list_del_entry_debug+0x89/0xb0
 RSP <ffffc9000bc53d48>
---[ end trace ed184eb9311a5d7c ]---
grsec: banning user with uid 515 until system restart for suspicious kernel crash

Still the same :(

Re: Consul triggering Kernel crash

Postby minipli » Fri Mar 06, 2015 6:12 am

...so it's option 2: other bugs still waiting to be found :(

Can you tell me what kind of file descriptors consul is putting into epoll? lsof output may be helpful here.
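
If lsof is not installed on the affected box, a rough equivalent can be scripted against /proc. The following is a small Python sketch, assuming a Linux system with /proc mounted and permission to read the target process's fd directory; the function names `classify_target` and `classify_fds` are made up for this example. It only gives coarse types (socket, pipe, eventpoll, regular file, ...); unlike lsof it does not say whether a socket is TCP, UDP or AF_UNIX.

```python
import os
from collections import Counter


def classify_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        # e.g. "anon_inode:[eventpoll]" -> "eventpoll"
        return target[len("anon_inode:"):].strip("[]")
    return "file"


def classify_fds(pid):
    """Count the open descriptors of `pid` by coarse type."""
    counts = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        counts[classify_target(target)] += 1
    return counts
```

Usage would be along the lines of `classify_fds(1658)` with the consul PID from the latest oops (run as root or as uid 515). It shows which descriptor types the process holds overall, not which of them are actually registered with epoll.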

