KVM guests and KERNEXEC


Postby morpho » Sat Aug 06, 2011 8:10 am

Hi.

I built a kernel with the latest stable patch (2.2.2-2.6.32.43) and KERNEXEC enabled to be used as a KVM guest.
The kernel builds fine, but when I start it up under KVM the kernel hits a BUG_ON in native_pax_open_kernel while trying to initialize the (virtual) CPU.

The exact location is arch/x86/include/asm/pgtable.h:92:
BUG_ON(unlikely(cr0 & X86_CR0_WP));

The host system is an EPT-capable x86_64 machine running kernel 2.6.38, so it should allow the guest to change the WP bit in CR0.
Other PaX options (e.g. UDEREF) work fine as long as KERNEXEC is disabled.
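
For readers who have not met this check before, here is a minimal user-space sketch of what the assertion tests. It assumes only that X86_CR0_WP is bit 16 of CR0 (0x00010000). The helper names are made up for illustration and are not taken from the PaX patch, and the code works on a plain integer rather than the real register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CR0.WP (write protect) is bit 16 of the x86 CR0 register. */
#define X86_CR0_WP UINT64_C(0x00010000)

/* Hypothetical helper: is write protection on in this CR0 value? */
bool wp_is_set(uint64_t cr0)
{
    return (cr0 & X86_CR0_WP) != 0;
}

/*
 * Hypothetical helper: the same CR0 value with WP cleared.
 * The assert stands in for a BUG_ON: clearing WP only makes
 * sense if it is currently set.
 */
uint64_t wp_cleared(uint64_t cr0)
{
    assert(wp_is_set(cr0));
    return cr0 & ~X86_CR0_WP;
}
```

Calling wp_cleared() on a value whose WP bit is already clear aborts, which is roughly the situation the BUG_ON is there to catch.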

Re: KVM guests and KERNEXEC

Postby PaX Team » Mon Aug 08, 2011 4:21 am

morpho wrote:The exact location is arch/x86/include/asm/pgtable.h:92:
BUG_ON(unlikely(cr0 & X86_CR0_WP));

The host system is an EPT-capable x86_64 machine running kernel 2.6.38, so it should allow the guest to change the WP bit in CR0.
Other PaX options (e.g. UDEREF) work fine as long as KERNEXEC is disabled.
i'm not sure kvm's cr0.wp emulation allows the trickery that PaX needs. can you try a newer host kernel perhaps to see if anything's changed?

Re: KVM guests and KERNEXEC

Postby morpho » Wed Aug 17, 2011 7:43 am

PaX Team wrote:i'm not sure kvm's cr0.wp emulation allows the trickery that PaX needs.

kvm_intel code is pretty much the same in all 2.6.38+ releases as far as cr0.wp is concerned: it is forced to 1 unless the host is EPT capable, in which case the guest is allowed to change it at its leisure regardless of other CR0 bits (IIRC this was implemented to allow QNX to run as a guest OS, since it fiddles with cr0.wp during boot pretty much in the same way as KERNEXEC does).

PaX Team wrote:can you try a newer host kernel perhaps to see if anything's changed?

I can't change the host kernel on that machine, unfortunately, since other VMs are running production services.
I'll try to add a bit of debug output in native_pax_open_kernel() to see if cr0.wp is updated at all.

Re: KVM guests and KERNEXEC

Postby PaX Team » Wed Aug 17, 2011 11:15 am

morpho wrote:kvm_intel code is pretty much the same in all 2.6.38+ releases as far as cr0.wp is concerned: it is forced to 1 unless the host is EPT capable, in which case the guest is allowed to change it at its leisure regardless of other CR0 bits (IIRC this was implemented to allow QNX to run as a guest OS, since it fiddles with cr0.wp during boot pretty much in the same way as KERNEXEC does).
in that case it may be a genuine bug in the PaX side where somehow cr0.wp is no longer in sync with what the code expects (the BUG_ONs are meant to catch exactly these kinds of oversights). can you post the full BUG_ON report? maybe the backtrace will help me figure out where it went wrong.

Re: KVM guests and KERNEXEC

Postby morpho » Wed Aug 17, 2011 5:49 pm

Here is the trace:
[    0.000000] Initializing CPU#0
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at /home/morpheus/backports/grsec/linux-2.6.32.43-grsec/arch/x86/include/asm/pgtable.h:92!
[    0.000000] invalid opcode: 0000 [#1] SMP
[    0.000000] last sysfs file:
[    0.000000] CPU 0
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.32.43-grsec #1 Bochs
[    0.000000] RIP: 0010:[<ffffffff8101e6ee>]  [<ffffffff8101e6ee>] native_pax_open_kernel+0x21/0x32
[    0.000000] RSP: 0018:ffffffff81601e40  EFLAGS: 00010006
[    0.000000] RAX: 0000000080040033 RBX: 0000000080040033 RCX: 0000000000000000
[    0.000000] RDX: ffffffff81606890 RSI: 0000000000000000 RDI: 0000000080050033
[    0.000000] RBP: ffffffff81601e48 R08: 0000000000000002 R09: 0000000000000000
[    0.000000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff813b8040
[    0.000000] R13: 0000000000000008 R14: 0000000000000000 R15: ffff880001a04560
[    0.000000] FS:  0000000000000000(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0038 DS: 0018 ES: 0018 CR0: 0000000080050033
[    0.000000] CR2: 0000000000000000 CR3: 00000000013b7000 CR4: 00000000000000b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff81604528, task ffffffff81604100)
[    0.000000] Stack:
[    0.000000]  0000000000000000 ffffffff81601e68 ffffffff8101e41b ffffffff813b8040
[    0.000000] <0> ffffffff81604880 ffffffff81601ee8 ffffffff813a0e5f ffffffff81601f08
[    0.000000] <0> ffffffff811d8d97 ffffffff81601ea0 ffff880001a00000 0000000081601ed8
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8101e41b>] native_set_pgd+0x14/0x24
[    0.000000]  [<ffffffff813a0e5f>] cpu_init+0x205/0x305
[    0.000000]  [<ffffffff811d8d97>] ? sort+0x10b/0x1a4
[    0.000000]  [<ffffffff8101ffff>] ? kern_addr_valid+0x82/0xec
[    0.000000]  [<ffffffff81815192>] trap_init+0x249/0x258
[    0.000000]  [<ffffffff81812b22>] start_kernel+0x1ff/0x3cc
[    0.000000]  [<ffffffff8181225d>] x86_64_start_reservations+0xac/0xb0
[    0.000000]  [<ffffffff81812359>] x86_64_start_kernel+0xf8/0x107
[    0.000000] Code: 20 b1 4d 81 48 89 d8 5b c9 c3 55 48 89 e5 53 ff 14 25 18 b1 4d 81 48 89 c7 48 89 c3 48 81 f7 00 00 01 00 f7 c7 00 00 01 00 74 04 <0f> 0b eb fe ff 14 25 20 b1 4d 81 48 89 d8 5b c9 c3 55 48 89 e5
[    0.000000] RIP  [<ffffffff8101e6ee>] native_pax_open_kernel+0x21/0x32
[    0.000000]  RSP <ffffffff81601e40>
[    0.000000] ---[ end trace 9865ac5b5ee90b51 ]---
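
As a cross-check of the dump above, the register values can be run through the same arithmetic the faulting code performs. The sketch below is user-space C with hypothetical names. The instruction sequence in the comment is my reading of the bytes on the Code: line, and the constants are copied from the RAX and RDI values in the trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_WP UINT64_C(0x00010000) /* CR0 bit 16 */

/* Values copied from the register dump. */
#define TRACE_RAX UINT64_C(0x0000000080040033) /* CR0 as read on entry */
#define TRACE_RDI UINT64_C(0x0000000080050033) /* value after the xor  */

/*
 * Models the sequence around the faulting address:
 *   xor  $0x10000, %rdi
 *   test $0x10000, %edi
 *   je   <continue>
 *   ud2                 <- the BUG_ON
 * Returns true when the ud2 would be reached.
 */
bool trace_bug_on_fires(uint64_t cr0_read)
{
    uint64_t toggled = cr0_read ^ X86_CR0_WP;
    return (toggled & X86_CR0_WP) != 0;
}

/* Self-check against the values in the dump. */
void trace_selfcheck(void)
{
    assert((TRACE_RAX ^ X86_CR0_WP) == TRACE_RDI);
    assert(trace_bug_on_fires(TRACE_RAX));
}
```

In other words, WP was already clear in the value that was read (RAX), so toggling it produced a value with WP set (RDI), which is what the check rejects.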


After a painful search through vmlinux with the help of gdb (gcc inlined the whole bloody thing), I found that the bug is caused by enter_lazy_tlb() and set_pgd trying to clear cr0.wp twice:
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{

#if defined(CONFIG_X86_64) && defined(CONFIG_PAX_MEMORY_UDEREF)
        unsigned int i;
        pgd_t *pgd;
                       
        pax_open_kernel();
        pgd = get_cpu_pgd(smp_processor_id());
        for (i = USER_PGD_PTRS; i < 2 * USER_PGD_PTRS; ++i)
                if (paravirt_enabled())
                        set_pgd(pgd+i, native_make_pgd(0));
                else
                        pgd[i] = native_make_pgd(0);
        pax_close_kernel();
#endif

The problem here is that the KVM paravirt-ops do not replace set_pgd(), so the call ends up in native_set_pgd(), which in turn calls pax_open_kernel() again. Since cr0.wp is already clear at that point, the second toggle would flip it back to 1, which trips the aforementioned BUG_ON(unlikely(cr0 & X86_CR0_WP)) assertion.
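
The nesting can be reproduced in a small user-space model. This is only a sketch: CR0 is a plain variable, the BUG_ON is replaced by a flag, and all nest_* names are hypothetical. The one thing taken from the trace is that the set_pgd path brackets its own write with an open/close pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_WP UINT64_C(0x00010000) /* CR0 bit 16 */

static uint64_t nest_cr0; /* simulated CR0 */
static bool nest_bug;     /* set when a modelled BUG_ON would fire */

/* Model of the open step: toggle WP, expecting it to end up clear. */
static void nest_open_kernel(void)
{
    uint64_t cr0 = nest_cr0 ^ X86_CR0_WP;
    if (cr0 & X86_CR0_WP) { /* WP was already clear: nested open */
        nest_bug = true;
        return;
    }
    nest_cr0 = cr0;
}

/* Model of the close step: toggle WP, expecting it to end up set. */
static void nest_close_kernel(void)
{
    uint64_t cr0 = nest_cr0 ^ X86_CR0_WP;
    if (!(cr0 & X86_CR0_WP)) {
        nest_bug = true;
        return;
    }
    nest_cr0 = cr0;
}

/* Model of a set_pgd() that brackets its own store. */
static void nest_set_pgd(uint64_t *pgdp, uint64_t val)
{
    nest_open_kernel();
    *pgdp = val;
    nest_close_kernel();
}

/*
 * Performs one pgd update, optionally inside an outer open/close
 * pair as enter_lazy_tlb() does. Returns true if a check fired.
 */
bool nest_update_trips_bug(bool caller_opens_first)
{
    uint64_t slot = 1;

    nest_cr0 = UINT64_C(0x80050033); /* start with WP set */
    nest_bug = false;
    if (caller_opens_first)
        nest_open_kernel();
    nest_set_pgd(&slot, 0);
    if (caller_opens_first)
        nest_close_kernel();
    assert(slot == 0); /* the store itself always happens in this model */
    return nest_bug;
}
```

nest_update_trips_bug(false) models the direct pgd[i] = ... style of caller and stays quiet; nest_update_trips_bug(true) models the paravirt branch and trips the check on the inner open.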

HTH :)

Re: KVM guests and KERNEXEC

Postby PaX Team » Wed Aug 17, 2011 7:03 pm

morpho wrote:After a painful search through vmlinux with the help of gdb (gcc inlined the whole bloody thing), I found that the bug is caused by enter_lazy_tlb() and set_pgd trying to clear cr0.wp twice:
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{

#if defined(CONFIG_X86_64) && defined(CONFIG_PAX_MEMORY_UDEREF)
        unsigned int i;
        pgd_t *pgd;
                       
        pax_open_kernel();
        pgd = get_cpu_pgd(smp_processor_id());
        for (i = USER_PGD_PTRS; i < 2 * USER_PGD_PTRS; ++i)
                if (paravirt_enabled())
                        set_pgd(pgd+i, native_make_pgd(0));
                else
                        pgd[i] = native_make_pgd(0);
        pax_close_kernel();
#endif

The problem here is that KVM paravirt-ops does not replace set_pgd(), so paravirt will call native_set_pgd(), which in turn will try to call pax_open_kernel() again, causing cr0.wp to flip to 1 and ultimately hitting the aforementioned BUG_ON(unlikely(cr0 & X86_CR0_WP)) assertion.
this is weird, trap_init runs after setup_arch which sets up the kvm paravirt ops as well. so it seems that the host kernel either doesn't support paravirtualization or KVM_FEATURE_MMU_OP. can you patch the above code to read like this:
if (paravirt_enabled() && pv_mmu_ops.set_pgd != native_set_pgd)
and see if it helps? you may need to add includes to get the declarations.
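
The idea behind that check, comparing the installed hook against the default implementation, can be sketched in user space as follows. The table and function names are hypothetical stand-ins, not the kernel's pv_mmu_ops, and everything lives in a single translation unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ops_pgd_t;

/* Hypothetical, cut-down stand-in for an MMU operations table. */
struct ops_mmu {
    void (*set_pgd)(ops_pgd_t *pgdp, ops_pgd_t pgd);
};

static unsigned ops_hypercalls; /* calls made to the fake hypervisor hook */

/* Default implementation: a plain store. */
void ops_native_set_pgd(ops_pgd_t *pgdp, ops_pgd_t pgd)
{
    *pgdp = pgd;
}

/* Fake hypervisor replacement: a store plus bookkeeping. */
void ops_hv_set_pgd(ops_pgd_t *pgdp, ops_pgd_t pgd)
{
    ops_hypercalls++;
    *pgdp = pgd;
}

unsigned ops_hypercall_count(void)
{
    return ops_hypercalls;
}

/* Model of the suggested test: has the default hook been replaced? */
bool ops_set_pgd_overridden(const struct ops_mmu *ops)
{
    assert(ops != NULL && ops->set_pgd != NULL);
    return ops->set_pgd != ops_native_set_pgd;
}
```

A table initialised with ops_native_set_pgd reports false, and one initialised with ops_hv_set_pgd reports true.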

Re: KVM guests and KERNEXEC

Postby morpho » Thu Aug 18, 2011 8:40 am

PaX Team wrote:this is weird, trap_init runs after setup_arch which sets up the kvm paravirt ops as well. so it seems that the host kernel either doesn't support paravirtualization or KVM_FEATURE_MMU_OP.

It seems that paravirtualized MMU support in KVM on the host side was removed in March 2010 with commit a68a6a7282373bedba8a2ed751b6384edb983a64, so under KVM all MMU-related operations are executed through their native_* variants.

I'll try to add the check for native_set_pgd as soon as I can, but my hunch is that it'll hit some other paravirt_ops-related problem sooner or later.

Re: KVM guests and KERNEXEC

Postby morpho » Thu Aug 18, 2011 6:23 pm

I tried the patch you suggested, but it doesn't work, and probably never will, since native_set_pgd() is declared as static inline in arch/x86/include/asm/pgtable_64.h.

As a workaround I changed the code to look like this (including <linux/kvm_para.h> at the top of mmu_context.h):
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{

#if defined(CONFIG_X86_64) && defined(CONFIG_PAX_MEMORY_UDEREF)
        unsigned int i;
        pgd_t *pgd;
                       
        pax_open_kernel();
        pgd = get_cpu_pgd(smp_processor_id());
        for (i = USER_PGD_PTRS; i < 2 * USER_PGD_PTRS; ++i)
#ifdef CONFIG_PARAVIRT
                if (paravirt_enabled() && (!kvm_para_available() || kvm_para_has_feature(KVM_FEATURE_MMU_OP)))
                        set_pgd(pgd+i, native_make_pgd(0));
                else
#endif /* CONFIG_PARAVIRT */
                        pgd[i] = native_make_pgd(0);
        pax_close_kernel();
#endif
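
The condition in that workaround is easier to follow as a truth table. The sketch below restates it as a pure function in user-space C. The function and parameter names are hypothetical, and the three booleans stand for the results of paravirt_enabled(), kvm_para_available() and kvm_para_has_feature(KVM_FEATURE_MMU_OP) respectively.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when the loop should go through set_pgd(), false
 * when it should store to pgd[i] directly.
 */
bool model_use_set_pgd(bool paravirt, bool kvm_guest, bool kvm_mmu_op)
{
    return paravirt && (!kvm_guest || kvm_mmu_op);
}

/* Walks the four interesting cases. */
void model_selfcheck(void)
{
    assert(!model_use_set_pgd(false, false, false)); /* bare metal         */
    assert(model_use_set_pgd(true, false, false));   /* non-KVM hypervisor */
    assert(!model_use_set_pgd(true, true, false));   /* KVM, no MMU_OP     */
    assert(model_use_set_pgd(true, true, true));     /* KVM with MMU_OP    */
}
```

The third case is the one that matters here: a KVM guest without the MMU op feature takes the direct store, so the inner pax_open_kernel() is never reached from this loop.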


With this change the kernel gets as far as trying to relinquish control to userspace, where it hits yet another bug while trying to call native_set_pgd():
[    0.415025] Freeing unused kernel memory: 476k freed
[    0.499640] invalid opcode: 0000 [#1] SMP
[    0.500214] last sysfs file:
[    0.500583] CPU 0
[    0.500882] Pid: 1, comm: init Not tainted 2.6.32.43-grsec #1 Bochs
[    0.501669] RIP: 0010:[<ffffffff8101e6ee>]  [<ffffffff8101e6ee>] native_pax_open_kernel+0x21/0x32
[    0.502784] RSP: 0018:ffff88001f845de0  EFLAGS: 00010006
[    0.503473] RAX: 000000008004003b RBX: 000000008004003b RCX: ffff88001f848428
[    0.503561] RDX: 0000000000000000 RSI: 000000001f0f6067 RDI: 000000008005003b
[    0.503561] RBP: ffff88001f845de8 R08: ffff88001f846000 R09: ffff88001f0cd0c0
[    0.503561] R10: 000000004401353a R11: 0000000055607b1c R12: ffffffff813b8000
[    0.503561] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    0.503561] FS:  0000000000000000(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
[    0.503561] CS:  0038 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.503561] CR2: 000007ff47072f99 CR3: 00000000013b8000 CR4: 00000000000006f0
[    0.503561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.503561] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.503561] Process init (pid: 1, threadinfo ffff88001f848428, task ffff88001f848000)
[    0.503561] Stack:
[    0.503561]  000000001f0f6067 ffff88001f845e08 ffffffff8101e419 ffffffff813b8000
[    0.503561] <0> 0000000000000000 0000000000000000 ffffffff81002849 0000000055607b1c
[    0.503561] <0> 000000004401353a ffff88001f0cd0c0 ffff88001f846000 ffffea00006e8b40
[    0.503561] Call Trace:
[    0.503561]  [<ffffffff8101e419>] native_set_pgd+0x12/0x24
[    0.503561]  [<ffffffff81002849>] pax_exit_kernel_user+0x59/0x100
[    0.503561]  [<ffffffff810034a0>] ? retint_swapgs+0xb/0x13
[    0.503561]  [<ffffffff810002ee>] ? run_init_process+0x1e/0x20
[    0.503561]  [<ffffffff810003ad>] ? init_post+0xbd/0xf0
[    0.503561]  [<ffffffff818126ea>] ? 0xffffffff818126ea
[    0.503561]  [<ffffffff81003dda>] ? child_rip+0xa/0x20
[    0.503561]  [<ffffffff810034b3>] ? restore_args+0x0/0x30
[    0.503561]  [<ffffffff8181253b>] ? 0xffffffff8181253b
[    0.503561]  [<ffffffff81003dd0>] ? child_rip+0x0/0x20
[    0.503561] Code: ff ff 66 90 48 89 d8 5b c9 c3 55 48 89 e5 53 e8 8a fb ff ff 66 90 48 89 c7 48 89 c3 48 81 f7 00 00 01 00 f7 c7 00 00 01 00 74 04 <0f> 0b eb fe e8 73 fb ff ff 66 90 48 89 d8 5b c9 c3 55 48 89 e5
[    0.503561] RIP  [<ffffffff8101e6ee>] native_pax_open_kernel+0x21/0x32
[    0.503561]  RSP <ffff88001f845de0>
[    0.503561] ---[ end trace 0566b61f1cab0730 ]---
[    0.528307] usb 1-1: new full speed USB device using uhci_hcd and address 2
[    0.529304] Kernel panic - not syncing: Attempted to kill init!
[    0.530022] Pid: 1, comm: init Tainted: G      D    2.6.32.43-grsec #1
[    0.530847] Call Trace:
[    0.531171]  [<ffffffff813a688b>] panic+0x75/0x129
[    0.531779]  [<ffffffff81043968>] do_exit+0x81/0x712
[    0.532458]  [<ffffffff813a697b>] ? printk+0x3c/0x41
[    0.533050]  [<ffffffff81044074>] do_group_exit+0x7b/0xa5
[    0.533761]  [<ffffffff81006f6a>] oops_end+0x90/0x95
[    0.534379]  [<ffffffff81007140>] die+0x55/0x5e
[    0.534938]  [<ffffffff810048a6>] do_trap+0x116/0x133
[    0.535569]  [<ffffffff81004c5e>] do_invalid_op+0x97/0xa0
[    0.536252]  [<ffffffff8101e6ee>] ? native_pax_open_kernel+0x21/0x32
[    0.537033]  [<ffffffff81002688>] ? pax_enter_kernel+0x38/0x50
[    0.537777]  [<ffffffff81003a81>] invalid_op+0x31/0x40
[    0.538404]  [<ffffffff8101e6ee>] ? native_pax_open_kernel+0x21/0x32
[    0.539190]  [<ffffffff8101e419>] native_set_pgd+0x12/0x24
[    0.539864]  [<ffffffff81002849>] pax_exit_kernel_user+0x59/0x100
[    0.540633]  [<ffffffff810034a0>] ? retint_swapgs+0xb/0x13
[    0.541308]  [<ffffffff810002ee>] ? run_init_process+0x1e/0x20
[    0.542032]  [<ffffffff810003ad>] ? init_post+0xbd/0xf0
[    0.542688]  [<ffffffff818126ea>] ? 0xffffffff818126ea
[    0.543328]  [<ffffffff81003dda>] ? child_rip+0xa/0x20
[    0.543979]  [<ffffffff810034b3>] ? restore_args+0x0/0x30
[    0.544680]  [<ffffffff8181253b>] ? 0xffffffff8181253b
[    0.545309]  [<ffffffff81003dd0>] ? child_rip+0x0/0x20

I'll try to hunt this one down tomorrow; right now I'm way too tired to start up gdb and check what happened.

Re: KVM guests and KERNEXEC

Postby PaX Team » Thu Aug 18, 2011 7:13 pm

morpho wrote:Tried to patch as you suggested, it doesn't work - and probably never will since native_set_pgd() is declared as static inline in arch/x86/include/asm/pgtable_64.h.
doh, i should have at least grepped it ;).
morpho wrote:This allows the kernel to go as far as trying to relinquish control to userspace, then it hits yet another bug while trying to call native_set_pgd():
[    0.503561]  [<ffffffff8101e419>] native_set_pgd+0x12/0x24
[    0.503561]  [<ffffffff81002849>] pax_exit_kernel_user+0x59/0x100

I'll try to hunt down this one tomorrow, right now I'm way too tired to start up gdb and check what happened.
this is again a nesting issue, the outer cr0.wp access is in pax_exit_kernel_user and then we end up in the native_set_pgd function again due to the pvops. i'm beginning to think that i'll need to fork native_set_pgd into two versions...
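
One way to picture the "two versions" idea is the user-space model below. It is a sketch under stated assumptions, not the eventual PaX change: CR0 is a plain variable, the BUG_ON is a flag, and every fork_* name (in particular fork_set_pgd_opened) is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define X86_CR0_WP UINT64_C(0x00010000) /* CR0 bit 16 */

static uint64_t fork_cr0; /* simulated CR0 */
static bool fork_bug;     /* set when a modelled BUG_ON would fire */

static void fork_open_kernel(void)
{
    uint64_t cr0 = fork_cr0 ^ X86_CR0_WP;
    if (cr0 & X86_CR0_WP) { fork_bug = true; return; }
    fork_cr0 = cr0;
}

static void fork_close_kernel(void)
{
    uint64_t cr0 = fork_cr0 ^ X86_CR0_WP;
    if (!(cr0 & X86_CR0_WP)) { fork_bug = true; return; }
    fork_cr0 = cr0;
}

/* Variant 1: for callers running with WP set; brackets its own store. */
void fork_set_pgd(uint64_t *pgdp, uint64_t pgd)
{
    fork_open_kernel();
    *pgdp = pgd;
    fork_close_kernel();
}

/* Variant 2 (hypothetical): for callers already inside an open/close pair. */
void fork_set_pgd_opened(uint64_t *pgdp, uint64_t pgd)
{
    assert(!(fork_cr0 & X86_CR0_WP)); /* caller must have cleared WP */
    *pgdp = pgd;
}

/* Clears a few slots inside one outer open/close pair. */
bool fork_batch_clear_trips_bug(bool use_opened_variant)
{
    uint64_t pgd[4] = { 1, 2, 3, 4 };
    size_t i;

    fork_cr0 = UINT64_C(0x80050033); /* start with WP set */
    fork_bug = false;
    fork_open_kernel();
    for (i = 0; i < 4; ++i) {
        if (use_opened_variant)
            fork_set_pgd_opened(&pgd[i], 0);
        else
            fork_set_pgd(&pgd[i], 0);
    }
    fork_close_kernel();
    return fork_bug;
}
```

With the self-bracketing variant the batch trips the check; with the "already opened" variant it does not.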

Re: KVM guests and KERNEXEC

Postby morpho » Fri Aug 19, 2011 10:38 am

I grepped the current stable patch and there are only three places where set_pgd() is called with cr0.wp cleared:
arch/x86/include/asm/mmu_context.h:enter_lazy_tlb()
arch/x86/kernel/entry_64.S:pax_enter_kernel_user()
arch/x86/kernel/entry_64.S:pax_exit_kernel_user()

This might be fixed by defining a new pv_mmu_ops entry (rather hackish, I know) which does a straight assignment to the pgd pointer without flipping cr0.wp when running under KVM and falls back to standard set_pgd() for lguest, Xen and VMI (its value won't matter on bare metal since pv_info.paravirt_enabled would be 0 and set_pgd() would never be called in those contexts).
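
A rough user-space sketch of that proposal follows. The extra table entry (set_pgd_opened), the platform enum and all hook_* names are hypothetical. The model only shows the dispatch: under KVM the new entry is a straight store, while other hypervisors fall back to their regular set_pgd hook.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hook_pgd_t;

enum hook_platform { HOOK_BARE_METAL, HOOK_KVM, HOOK_OTHER_HV };

/* Hypothetical ops table with one extra entry for "WP already clear" callers. */
struct hook_mmu_ops {
    void (*set_pgd)(hook_pgd_t *pgdp, hook_pgd_t pgd);
    void (*set_pgd_opened)(hook_pgd_t *pgdp, hook_pgd_t pgd);
};

static unsigned hook_wp_toggles; /* modelled cr0.wp flips */
static unsigned hook_hypercalls; /* modelled hypervisor calls */

/* Native path: flips WP off and on around the store (counted, not simulated). */
static void hook_native_set_pgd(hook_pgd_t *pgdp, hook_pgd_t pgd)
{
    hook_wp_toggles += 2;
    *pgdp = pgd;
}

/* Straight assignment, no cr0.wp involvement. */
static void hook_direct_set_pgd(hook_pgd_t *pgdp, hook_pgd_t pgd)
{
    *pgdp = pgd;
}

/* Stand-in for a hypervisor that really replaces set_pgd(). */
static void hook_hv_set_pgd(hook_pgd_t *pgdp, hook_pgd_t pgd)
{
    hook_hypercalls++;
    *pgdp = pgd;
}

unsigned hook_toggle_count(void) { return hook_wp_toggles; }
unsigned hook_hypercall_count(void) { return hook_hypercalls; }

void hook_init(struct hook_mmu_ops *ops, enum hook_platform plat)
{
    assert(ops != 0);
    if (plat == HOOK_OTHER_HV) {
        ops->set_pgd = hook_hv_set_pgd;
        ops->set_pgd_opened = hook_hv_set_pgd; /* fall back to set_pgd */
    } else {
        /* KVM keeps the native set_pgd; bare metal never uses the new entry. */
        ops->set_pgd = hook_native_set_pgd;
        ops->set_pgd_opened = hook_direct_set_pgd;
    }
}
```

A caller that has already cleared cr0.wp would then use ops->set_pgd_opened and never re-enter the open/close pair on KVM.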
Now for the fun part: arch/x86/kernel/kvmclock.c sets pv_info.paravirt_enabled to 1 even if CONFIG_KVM_GUEST=N, so patching the KVM paravirt guest driver (arch/x86/kernel/kvm.c) would not be enough.

