Hi,
I am using grsecurity on Gentoo Linux via the already patched hardened-sources kernels (stable releases). From kernel 3.14 up to 4.1.4, I have run into some interesting problems. My server is an HP ProLiant SE316M1-R2 (aka DL360), a system with two NUMA nodes. Each Xeon L5520 CPU has 24GB of RAM attached across all three of its memory channels, so the two CPUs have 48GB of RAM in total.
Months ago I tried the numad userspace daemon. At that time I was using this server mainly for monitoring, so I had also installed ntop-ng, which consumes a lot of memory. From time to time, as free memory got lower and lower (caches!), processes suddenly seemed to hang. Typing a command on the command line could lead to a complete freeze, as if the connection had been cut. Seconds or even minutes later, the process would continue. This was so damn problematic that I disabled numad and even uninstalled ntop-ng.
Currently the server runs Zabbix monitoring and 7 KVM guests. Until 3 weeks ago it only had 24GB of RAM, and Zabbix warned me that I was running out of memory. I did not believe it, since nearly 6GB were cached data, so I ignored the warning. Days later I got a call in the morning that people could not deliver mail. I investigated and found that the virtual machine running my mail system was no longer running. So I thought: well, maybe Zabbix was right and I should buy more memory. I did, and now the server has 48GB of RAM.
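The "nearly 6GB cached" reasoning assumes that all page cache can be reclaimed, which is not always true: dirty pages and shared memory/tmpfs pages (Shmem, which is counted inside Cached) cannot simply be dropped. Kernels since 3.14 export a `MemAvailable` estimate in `/proc/meminfo` for exactly this question. A minimal Python sketch comparing the naive "free + cache" figure with `MemAvailable`; the helper names are hypothetical and it only reads `/proc/meminfo`:

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB} (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def summarize(info: dict) -> dict:
    """Compare a naive 'free + cache' figure with the kernel's MemAvailable (all kB)."""
    naive = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "naive_free_plus_cache_kb": naive,
        # MemAvailable only exists on kernels >= 3.14; -1 marks "not reported".
        "mem_available_kb": info.get("MemAvailable", -1),
        # Shmem (tmpfs/shared memory) is included in Cached but cannot be dropped.
        "shmem_kb": info.get("Shmem", 0),
        # Dirty pages must be written back before they can be reclaimed.
        "dirty_kb": info.get("Dirty", 0),
    }


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as fh:
            for key, kb in summarize(parse_meminfo(fh.read())).items():
                shown = "n/a" if kb < 0 else f"{kb / 1024 / 1024:.2f} GiB"
                print(f"{key:26s} {shown}")
```

If `mem_available_kb` is far below the naive figure, the "cached" memory was not really available to the guests.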
Yesterday I wrote a script that does live backups with libvirt. It creates an external snapshot and copies the base disk image to a different location. Afterwards it merges the changes from the snapshot back into the base image and removes the snapshot. While the mail server's disk was being merged, the guest died. And again I found that all 48GB of RAM on the host were in use. I guess the copy command used up all the memory for caching.
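The backup steps described above roughly correspond to the following `virsh` sequence. This is a sketch, not the actual script: the domain name `mail`, the disk target `vda`, the snapshot name and all file paths are hypothetical placeholders, and it assumes a libvirt/QEMU version that supports active block commit. By default it only prints the commands (dry run):

```python
import shlex
import subprocess


def build_backup_commands(domain: str, disk: str, base_image: str,
                          overlay: str, destination: str) -> list:
    """Return the command sequence for one external-snapshot backup cycle (hypothetical helper)."""
    return [
        # 1. Redirect guest writes to a new overlay; the base image is no longer written.
        ["virsh", "snapshot-create-as", "--domain", domain, "--name", "backup",
         "--disk-only", "--atomic", "--no-metadata",
         "--diskspec", f"{disk},file={overlay}"],
        # 2. Copy the base image to the backup location.
        ["cp", "--sparse=always", base_image, destination],
        # 3. Merge the overlay back into the base image and pivot the guest back to it.
        ["virsh", "blockcommit", domain, disk, "--active", "--wait",
         "--verbose", "--pivot"],
        # 4. The guest no longer references the overlay, so it can be deleted.
        ["rm", "-f", overlay],
    ]


def run(commands: list, dry_run: bool = True) -> None:
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises on the first failing step


if __name__ == "__main__":
    run(build_backup_commands(
        domain="mail",                                        # hypothetical
        disk="vda",                                           # hypothetical
        base_image="/var/lib/libvirt/images/mail.qcow2",      # hypothetical
        overlay="/var/lib/libvirt/images/mail-backup.qcow2",  # hypothetical
        destination="/backup/mail.qcow2",                     # hypothetical
    ))
```

With `dry_run=False`, `check=True` stops the sequence at the first non-zero exit status, so the overlay is not deleted if the commit fails.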
Also, the kernel I was running had numa_balancing enabled automatically at boot.
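To test whether automatic NUMA balancing is involved, it can be checked and switched off at runtime through `/proc/sys/kernel/numa_balancing` (0 means off), or at boot with the `numa_balancing=disable` kernel parameter. A minimal Python sketch that reports the current setting; the helper names are hypothetical, and `disable()` requires root:

```python
from pathlib import Path
from typing import Optional

KNOB = Path("/proc/sys/kernel/numa_balancing")


def interpret(raw: str) -> bool:
    """'0' means disabled; any other value means balancing is on (hypothetical helper)."""
    return raw.strip() != "0"


def read_state(knob: Path = KNOB) -> Optional[bool]:
    """Return True/False, or None if the knob is missing (typically no CONFIG_NUMA_BALANCING)."""
    try:
        return interpret(knob.read_text())
    except FileNotFoundError:
        return None


def disable(knob: Path = KNOB) -> None:
    """Turn automatic NUMA balancing off until the next reboot (needs root)."""
    knob.write_text("0\n")


if __name__ == "__main__":
    state = read_state()
    print({None: "not available", True: "enabled", False: "disabled"}[state])
```

Running the same backup once with balancing disabled would show whether the guest still dies without it.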
Could some grsecurity feature have a problem when a process is moved from one NUMA node to another?
Which options could cause such a problem? Is this related to grsecurity, to the hardware, or is it some other Linux problem? Is my reasoning on the right track, or am I heading in a totally wrong direction?
Thanks in advance
Christian