| Documentation for /proc/sys/kernel/* kernel version 2.2.10 |
| (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> |
| (c) 2009, Shen Feng<shen@cn.fujitsu.com> |
| |
| For general info and legal blurb, please look in README. |
| |
| ============================================================== |
| |
| This file contains documentation for the sysctl files in |
| /proc/sys/kernel/ and is valid for Linux kernel version 2.2. |
| |
| The files in this directory can be used to tune and monitor |
| miscellaneous and general things in the operation of the Linux |
| kernel. Since some of the files _can_ be used to screw up your |
| system, it is advisable to read both documentation and source |
| before actually making adjustments. |
| |
| Currently, these files might (depending on your configuration) |
| show up in /proc/sys/kernel: |
| |
| - acct |
| - acpi_video_flags |
| - auto_msgmni |
| - bootloader_type [ X86 only ] |
| - bootloader_version [ X86 only ] |
| - callhome [ S390 only ] |
| - cap_last_cap |
| - core_pattern |
| - core_pipe_limit |
| - core_uses_pid |
| - ctrl-alt-del |
| - dmesg_restrict |
| - domainname |
| - hostname |
| - hotplug |
| - hardlockup_all_cpu_backtrace |
| - hardlockup_panic |
| - hung_task_panic |
| - hung_task_check_count |
| - hung_task_timeout_secs |
| - hung_task_check_interval_secs |
| - hung_task_warnings |
| - hyperv_record_panic_msg |
| - kexec_load_disabled |
| - kptr_restrict |
| - l2cr [ PPC only ] |
| - modprobe ==> Documentation/debugging-modules.txt |
| - modules_disabled |
| - msg_next_id [ sysv ipc ] |
| - msgmax |
| - msgmnb |
| - msgmni |
| - nmi_watchdog |
| - osrelease |
| - ostype |
| - overflowgid |
| - overflowuid |
| - panic |
| - panic_on_oops |
| - panic_on_stackoverflow |
| - panic_on_unrecovered_nmi |
| - panic_on_warn |
| - panic_on_rcu_stall |
| - perf_cpu_time_max_percent |
| - perf_event_paranoid |
| - perf_event_max_stack |
| - perf_event_mlock_kb |
| - perf_event_max_contexts_per_stack |
| - pid_max |
| - powersave-nap [ PPC only ] |
| - printk |
| - printk_delay |
| - printk_ratelimit |
| - printk_ratelimit_burst |
| - pty ==> Documentation/filesystems/devpts.txt |
| - randomize_va_space |
| - real-root-dev ==> Documentation/admin-guide/initrd.rst |
| - reboot-cmd [ SPARC only ] |
| - rtsig-max |
| - rtsig-nr |
| - seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst |
| - sem |
| - sem_next_id [ sysv ipc ] |
| - sg-big-buff [ generic SCSI device (sg) ] |
| - shm_next_id [ sysv ipc ] |
| - shm_rmid_forced |
| - shmall |
| - shmmax [ sysv ipc ] |
| - shmmni |
| - softlockup_all_cpu_backtrace |
| - soft_watchdog |
| - stack_erasing |
| - stop-a [ SPARC only ] |
| - sysrq ==> Documentation/admin-guide/sysrq.rst |
| - sysctl_writes_strict |
| - tainted |
| - threads-max |
| - unknown_nmi_panic |
| - watchdog |
| - watchdog_thresh |
| - version |
| |
| ============================================================== |
| |
| acct: |
| |
| highwater lowwater frequency |
| |
| If BSD-style process accounting is enabled these values control |
| its behaviour. If free space on filesystem where the log lives |
| goes below <lowwater>% accounting suspends. If free space gets |
| above <highwater>% accounting resumes. <Frequency> determines |
| how often do we check the amount of free space (value is in |
| seconds). Default: |
| 4 2 30 |
| That is, suspend accounting if there left <= 2% free; resume it |
| if we got >=4%; consider information about amount of free space |
| valid for 30 seconds. |
| |
| ============================================================== |
| |
| acpi_video_flags: |
| |
| flags |
| |
| See Doc*/kernel/power/video.txt, it allows mode of video boot to be |
| set during run time. |
| |
| ============================================================== |
| |
| auto_msgmni: |
| |
| This variable has no effect and may be removed in future kernel |
| releases. Reading it always returns 0. |
| Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni |
| upon memory add/remove or upon ipc namespace creation/removal. |
| Echoing "1" into this file enabled msgmni automatic recomputing. |
| Echoing "0" turned it off. auto_msgmni default value was 1. |
| |
| |
| ============================================================== |
| |
| bootloader_type: |
| |
| x86 bootloader identification |
| |
| This gives the bootloader type number as indicated by the bootloader, |
| shifted left by 4, and OR'd with the low four bits of the bootloader |
| version. The reason for this encoding is that this used to match the |
| type_of_loader field in the kernel header; the encoding is kept for |
| backwards compatibility. That is, if the full bootloader type number |
| is 0x15 and the full version number is 0x234, this file will contain |
| the value 340 = 0x154. |
| |
| See the type_of_loader and ext_loader_type fields in |
| Documentation/x86/boot.txt for additional information. |
| |
| ============================================================== |
| |
| bootloader_version: |
| |
| x86 bootloader version |
| |
| The complete bootloader version number. In the example above, this |
| file will contain the value 564 = 0x234. |
| |
| See the type_of_loader and ext_loader_ver fields in |
| Documentation/x86/boot.txt for additional information. |
| |
| ============================================================== |
| |
| callhome: |
| |
| Controls the kernel's callhome behavior in case of a kernel panic. |
| |
| The s390 hardware allows an operating system to send a notification |
| to a service organization (callhome) in case of an operating system panic. |
| |
| When the value in this file is 0 (which is the default behavior) |
| nothing happens in case of a kernel panic. If this value is set to "1" |
| the complete kernel oops message is send to the IBM customer service |
| organization in case the mainframe the Linux operating system is running |
| on has a service contract with IBM. |
| |
| ============================================================== |
| |
| cap_last_cap |
| |
| Highest valid capability of the running kernel. Exports |
| CAP_LAST_CAP from the kernel. |
| |
| ============================================================== |
| |
| core_pattern: |
| |
| core_pattern is used to specify a core dumpfile pattern name. |
| . max length 128 characters; default value is "core" |
| . core_pattern is used as a pattern template for the output filename; |
| certain string patterns (beginning with '%') are substituted with |
| their actual values. |
| . backward compatibility with core_uses_pid: |
| If core_pattern does not include "%p" (default does not) |
| and core_uses_pid is set, then .PID will be appended to |
| the filename. |
| . corename format specifiers: |
| %<NUL> '%' is dropped |
| %% output one '%' |
| %p pid |
| %P global pid (init PID namespace) |
| %i tid |
| %I global tid (init PID namespace) |
| %u uid (in initial user namespace) |
| %g gid (in initial user namespace) |
| %d dump mode, matches PR_SET_DUMPABLE and |
| /proc/sys/fs/suid_dumpable |
| %s signal number |
| %t UNIX time of dump |
| %h hostname |
| %e executable filename (may be shortened) |
| %E executable path |
| %<OTHER> both are dropped |
| . If the first character of the pattern is a '|', the kernel will treat |
| the rest of the pattern as a command to run. The core dump will be |
| written to the standard input of that program instead of to a file. |
| |
| ============================================================== |
| |
| core_pipe_limit: |
| |
| This sysctl is only applicable when core_pattern is configured to pipe |
| core files to a user space helper (when the first character of |
| core_pattern is a '|', see above). When collecting cores via a pipe |
| to an application, it is occasionally useful for the collecting |
| application to gather data about the crashing process from its |
| /proc/pid directory. In order to do this safely, the kernel must wait |
| for the collecting process to exit, so as not to remove the crashing |
| processes proc files prematurely. This in turn creates the |
| possibility that a misbehaving userspace collecting process can block |
| the reaping of a crashed process simply by never exiting. This sysctl |
| defends against that. It defines how many concurrent crashing |
| processes may be piped to user space applications in parallel. If |
| this value is exceeded, then those crashing processes above that value |
| are noted via the kernel log and their cores are skipped. 0 is a |
| special value, indicating that unlimited processes may be captured in |
| parallel, but that no waiting will take place (i.e. the collecting |
| process is not guaranteed access to /proc/<crashing pid>/). This |
| value defaults to 0. |
| |
| ============================================================== |
| |
| core_uses_pid: |
| |
| The default coredump filename is "core". By setting |
| core_uses_pid to 1, the coredump filename becomes core.PID. |
| If core_pattern does not include "%p" (default does not) |
| and core_uses_pid is set, then .PID will be appended to |
| the filename. |
| |
| ============================================================== |
| |
| ctrl-alt-del: |
| |
| When the value in this file is 0, ctrl-alt-del is trapped and |
| sent to the init(1) program to handle a graceful restart. |
| When, however, the value is > 0, Linux's reaction to a Vulcan |
| Nerve Pinch (tm) will be an immediate reboot, without even |
| syncing its dirty buffers. |
| |
| Note: when a program (like dosemu) has the keyboard in 'raw' |
| mode, the ctrl-alt-del is intercepted by the program before it |
| ever reaches the kernel tty layer, and it's up to the program |
| to decide what to do with it. |
| |
| ============================================================== |
| |
| dmesg_restrict: |
| |
| This toggle indicates whether unprivileged users are prevented |
| from using dmesg(8) to view messages from the kernel's log buffer. |
| When dmesg_restrict is set to (0) there are no restrictions. When |
| dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use |
| dmesg(8). |
| |
| The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the |
| default value of dmesg_restrict. |
| |
| ============================================================== |
| |
| domainname & hostname: |
| |
| These files can be used to set the NIS/YP domainname and the |
| hostname of your box in exactly the same way as the commands |
| domainname and hostname, i.e.: |
| # echo "darkstar" > /proc/sys/kernel/hostname |
| # echo "mydomain" > /proc/sys/kernel/domainname |
| has the same effect as |
| # hostname "darkstar" |
| # domainname "mydomain" |
| |
| Note, however, that the classic darkstar.frop.org has the |
| hostname "darkstar" and DNS (Internet Domain Name Server) |
| domainname "frop.org", not to be confused with the NIS (Network |
| Information Service) or YP (Yellow Pages) domainname. These two |
| domain names are in general different. For a detailed discussion |
| see the hostname(1) man page. |
| |
| ============================================================== |
| hardlockup_all_cpu_backtrace: |
| |
| This value controls the hard lockup detector behavior when a hard |
| lockup condition is detected as to whether or not to gather further |
| debug information. If enabled, arch-specific all-CPU stack dumping |
| will be initiated. |
| |
| 0: do nothing. This is the default behavior. |
| |
| 1: on detection capture more debug information. |
| ============================================================== |
| |
| hardlockup_panic: |
| |
| This parameter can be used to control whether the kernel panics |
| when a hard lockup is detected. |
| |
| 0 - don't panic on hard lockup |
| 1 - panic on hard lockup |
| |
| See Documentation/lockup-watchdogs.txt for more information. This can |
| also be set using the nmi_watchdog kernel parameter. |
| |
| ============================================================== |
| |
| hotplug: |
| |
| Path for the hotplug policy agent. |
| Default value is "/sbin/hotplug". |
| |
| ============================================================== |
| |
| hung_task_panic: |
| |
| Controls the kernel's behavior when a hung task is detected. |
| This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. |
| |
| 0: continue operation. This is the default behavior. |
| |
| 1: panic immediately. |
| |
| ============================================================== |
| |
| hung_task_check_count: |
| |
| The upper bound on the number of tasks that are checked. |
| This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. |
| |
| ============================================================== |
| |
| hung_task_timeout_secs: |
| |
| When a task in D state did not get scheduled |
| for more than this value report a warning. |
| This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. |
| |
| 0: means infinite timeout - no checking done. |
| Possible values to set are in range {0..LONG_MAX/HZ}. |
| |
| ============================================================== |
| |
| hung_task_check_interval_secs: |
| |
| Hung task check interval. If hung task checking is enabled |
| (see hung_task_timeout_secs), the check is done every |
| hung_task_check_interval_secs seconds. |
| This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. |
| |
| 0 (default): means use hung_task_timeout_secs as checking interval. |
| Possible values to set are in range {0..LONG_MAX/HZ}. |
| |
| ============================================================== |
| |
| hung_task_warnings: |
| |
| The maximum number of warnings to report. During a check interval |
| if a hung task is detected, this value is decreased by 1. |
| When this value reaches 0, no more warnings will be reported. |
| This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. |
| |
| -1: report an infinite number of warnings. |
| |
| ============================================================== |
| |
| hyperv_record_panic_msg: |
| |
| Controls whether the panic kmsg data should be reported to Hyper-V. |
| |
| 0: do not report panic kmsg data. |
| |
| 1: report the panic kmsg data. This is the default behavior. |
| |
| ============================================================== |
| |
| kexec_load_disabled: |
| |
| A toggle indicating if the kexec_load syscall has been disabled. This |
| value defaults to 0 (false: kexec_load enabled), but can be set to 1 |
| (true: kexec_load disabled). Once true, kexec can no longer be used, and |
| the toggle cannot be set back to false. This allows a kexec image to be |
| loaded before disabling the syscall, allowing a system to set up (and |
| later use) an image without it being altered. Generally used together |
| with the "modules_disabled" sysctl. |
| |
| ============================================================== |
| |
| kptr_restrict: |
| |
| This toggle indicates whether restrictions are placed on |
| exposing kernel addresses via /proc and other interfaces. |
| |
| When kptr_restrict is set to 0 (the default) the address is hashed before |
| printing. (This is the equivalent to %p.) |
| |
| When kptr_restrict is set to (1), kernel pointers printed using the %pK |
| format specifier will be replaced with 0's unless the user has CAP_SYSLOG |
| and effective user and group ids are equal to the real ids. This is |
| because %pK checks are done at read() time rather than open() time, so |
| if permissions are elevated between the open() and the read() (e.g via |
| a setuid binary) then %pK will not leak kernel pointers to unprivileged |
| users. Note, this is a temporary solution only. The correct long-term |
| solution is to do the permission checks at open() time. Consider removing |
| world read permissions from files that use %pK, and using dmesg_restrict |
| to protect against uses of %pK in dmesg(8) if leaking kernel pointer |
| values to unprivileged users is a concern. |
| |
| When kptr_restrict is set to (2), kernel pointers printed using |
| %pK will be replaced with 0's regardless of privileges. |
| |
| ============================================================== |
| |
| l2cr: (PPC only) |
| |
| This flag controls the L2 cache of G3 processor boards. If |
| 0, the cache is disabled. Enabled if nonzero. |
| |
| ============================================================== |
| |
| modules_disabled: |
| |
| A toggle value indicating if modules are allowed to be loaded |
| in an otherwise modular kernel. This toggle defaults to off |
| (0), but can be set true (1). Once true, modules can be |
| neither loaded nor unloaded, and the toggle cannot be set back |
| to false. Generally used with the "kexec_load_disabled" toggle. |
| |
| ============================================================== |
| |
| msg_next_id, sem_next_id, and shm_next_id: |
| |
| These three toggles allows to specify desired id for next allocated IPC |
| object: message, semaphore or shared memory respectively. |
| |
| By default they are equal to -1, which means generic allocation logic. |
| Possible values to set are in range {0..INT_MAX}. |
| |
| Notes: |
| 1) kernel doesn't guarantee, that new object will have desired id. So, |
| it's up to userspace, how to handle an object with "wrong" id. |
| 2) Toggle with non-default value will be set back to -1 by kernel after |
| successful IPC object allocation. If an IPC object allocation syscall |
| fails, it is undefined if the value remains unmodified or is reset to -1. |
| |
| ============================================================== |
| |
| nmi_watchdog: |
| |
| This parameter can be used to control the NMI watchdog |
| (i.e. the hard lockup detector) on x86 systems. |
| |
| 0 - disable the hard lockup detector |
| 1 - enable the hard lockup detector |
| |
| The hard lockup detector monitors each CPU for its ability to respond to |
| timer interrupts. The mechanism utilizes CPU performance counter registers |
| that are programmed to generate Non-Maskable Interrupts (NMIs) periodically |
| while a CPU is busy. Hence, the alternative name 'NMI watchdog'. |
| |
| The NMI watchdog is disabled by default if the kernel is running as a guest |
| in a KVM virtual machine. This default can be overridden by adding |
| |
| nmi_watchdog=1 |
| |
| to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). |
| |
| ============================================================== |
| |
| numa_balancing |
| |
| Enables/disables automatic page fault based NUMA memory |
| balancing. Memory is moved automatically to nodes |
| that access it often. |
| |
| Enables/disables automatic NUMA memory balancing. On NUMA machines, there |
| is a performance penalty if remote memory is accessed by a CPU. When this |
| feature is enabled the kernel samples what task thread is accessing memory |
| by periodically unmapping pages and later trapping a page fault. At the |
| time of the page fault, it is determined if the data being accessed should |
| be migrated to a local memory node. |
| |
| The unmapping of pages and trapping faults incur additional overhead that |
| ideally is offset by improved memory locality but there is no universal |
| guarantee. If the target workload is already bound to NUMA nodes then this |
| feature should be disabled. Otherwise, if the system overhead from the |
| feature is too high then the rate the kernel samples for NUMA hinting |
| faults may be controlled by the numa_balancing_scan_period_min_ms, |
| numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, |
| numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. |
| |
| ============================================================== |
| |
| numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, |
| numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb |
| |
| Automatic NUMA balancing scans tasks address space and unmaps pages to |
| detect if pages are properly placed or if the data should be migrated to a |
| memory node local to where the task is running. Every "scan delay" the task |
| scans the next "scan size" number of pages in its address space. When the |
| end of the address space is reached the scanner restarts from the beginning. |
| |
| In combination, the "scan delay" and "scan size" determine the scan rate. |
| When "scan delay" decreases, the scan rate increases. The scan delay and |
| hence the scan rate of every task is adaptive and depends on historical |
| behaviour. If pages are properly placed then the scan delay increases, |
| otherwise the scan delay decreases. The "scan size" is not adaptive but |
| the higher the "scan size", the higher the scan rate. |
| |
| Higher scan rates incur higher system overhead as page faults must be |
| trapped and potentially data must be migrated. However, the higher the scan |
| rate, the more quickly a tasks memory is migrated to a local node if the |
| workload pattern changes and minimises performance impact due to remote |
| memory accesses. These sysctls control the thresholds for scan delays and |
| the number of pages scanned. |
| |
| numa_balancing_scan_period_min_ms is the minimum time in milliseconds to |
| scan a tasks virtual memory. It effectively controls the maximum scanning |
| rate for each task. |
| |
| numa_balancing_scan_delay_ms is the starting "scan delay" used for a task |
| when it initially forks. |
| |
| numa_balancing_scan_period_max_ms is the maximum time in milliseconds to |
| scan a tasks virtual memory. It effectively controls the minimum scanning |
| rate for each task. |
| |
| numa_balancing_scan_size_mb is how many megabytes worth of pages are |
| scanned for a given scan. |
| |
| ============================================================== |
| |
| osrelease, ostype & version: |
| |
| # cat osrelease |
| 2.1.88 |
| # cat ostype |
| Linux |
| # cat version |
| #5 Wed Feb 25 21:49:24 MET 1998 |
| |
| The files osrelease and ostype should be clear enough. Version |
| needs a little more clarification however. The '#5' means that |
| this is the fifth kernel built from this source base and the |
| date behind it indicates the time the kernel was built. |
| The only way to tune these values is to rebuild the kernel :-) |
| |
| ============================================================== |
| |
| overflowgid & overflowuid: |
| |
| if your architecture did not always support 32-bit UIDs (i.e. arm, |
| i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to |
| applications that use the old 16-bit UID/GID system calls, if the |
| actual UID or GID would exceed 65535. |
| |
| These sysctls allow you to change the value of the fixed UID and GID. |
| The default is 65534. |
| |
| ============================================================== |
| |
| panic: |
| |
| The value in this file represents the number of seconds the kernel |
| waits before rebooting on a panic. When you use the software watchdog, |
| the recommended setting is 60. |
| |
| ============================================================== |
| |
| panic_on_io_nmi: |
| |
| Controls the kernel's behavior when a CPU receives an NMI caused by |
| an IO error. |
| |
| 0: try to continue operation (default) |
| |
| 1: panic immediately. The IO error triggered an NMI. This indicates a |
| serious system condition which could result in IO data corruption. |
| Rather than continuing, panicking might be a better choice. Some |
| servers issue this sort of NMI when the dump button is pushed, |
| and you can use this option to take a crash dump. |
| |
| ============================================================== |
| |
| panic_on_oops: |
| |
| Controls the kernel's behaviour when an oops or BUG is encountered. |
| |
| 0: try to continue operation |
| |
| 1: panic immediately. If the `panic' sysctl is also non-zero then the |
| machine will be rebooted. |
| |
| ============================================================== |
| |
| panic_on_stackoverflow: |
| |
| Controls the kernel's behavior when detecting the overflows of |
| kernel, IRQ and exception stacks except a user stack. |
| This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. |
| |
| 0: try to continue operation. |
| |
| 1: panic immediately. |
| |
| ============================================================== |
| |
| panic_on_unrecovered_nmi: |
| |
| The default Linux behaviour on an NMI of either memory or unknown is |
| to continue operation. For many environments such as scientific |
| computing it is preferable that the box is taken out and the error |
| dealt with than an uncorrected parity/ECC error get propagated. |
| |
| A small number of systems do generate NMI's for bizarre random reasons |
| such as power management so the default is off. That sysctl works like |
| the existing panic controls already in that directory. |
| |
| ============================================================== |
| |
| panic_on_warn: |
| |
| Calls panic() in the WARN() path when set to 1. This is useful to avoid |
| a kernel rebuild when attempting to kdump at the location of a WARN(). |
| |
| 0: only WARN(), default behaviour. |
| |
| 1: call panic() after printing out WARN() location. |
| |
| ============================================================== |
| |
| panic_on_rcu_stall: |
| |
| When set to 1, calls panic() after RCU stall detection messages. This |
| is useful to define the root cause of RCU stalls using a vmcore. |
| |
| 0: do not panic() when RCU stall takes place, default behavior. |
| |
| 1: panic() after printing RCU stall messages. |
| |
| ============================================================== |
| |
| perf_cpu_time_max_percent: |
| |
| Hints to the kernel how much CPU time it should be allowed to |
| use to handle perf sampling events. If the perf subsystem |
| is informed that its samples are exceeding this limit, it |
| will drop its sampling frequency to attempt to reduce its CPU |
| usage. |
| |
| Some perf sampling happens in NMIs. If these samples |
| unexpectedly take too long to execute, the NMIs can become |
| stacked up next to each other so much that nothing else is |
| allowed to execute. |
| |
| 0: disable the mechanism. Do not monitor or correct perf's |
| sampling rate no matter how CPU time it takes. |
| |
| 1-100: attempt to throttle perf's sample rate to this |
| percentage of CPU. Note: the kernel calculates an |
| "expected" length of each sample event. 100 here means |
| 100% of that expected length. Even if this is set to |
| 100, you may still see sample throttling if this |
| length is exceeded. Set to 0 if you truly do not care |
| how much CPU is consumed. |
| |
| ============================================================== |
| |
| perf_event_paranoid: |
| |
| Controls use of the performance events system by unprivileged |
| users (without CAP_SYS_ADMIN). The default value is 2. |
| |
| -1: Allow use of (almost) all events by all users |
| Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK |
| >=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN |
| Disallow raw tracepoint access by users without CAP_SYS_ADMIN |
| >=1: Disallow CPU event access by users without CAP_SYS_ADMIN |
| >=2: Disallow kernel profiling by users without CAP_SYS_ADMIN |
| |
| ============================================================== |
| |
| perf_event_max_stack: |
| |
| Controls maximum number of stack frames to copy for (attr.sample_type & |
| PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using |
| 'perf record -g' or 'perf trace --call-graph fp'. |
| |
| This can only be done when no events are in use that have callchains |
| enabled, otherwise writing to this file will return -EBUSY. |
| |
| The default value is 127. |
| |
| ============================================================== |
| |
| perf_event_mlock_kb: |
| |
| Control size of per-cpu ring buffer not counted agains mlock limit. |
| |
| The default value is 512 + 1 page |
| |
| ============================================================== |
| |
| perf_event_max_contexts_per_stack: |
| |
| Controls maximum number of stack frame context entries for |
| (attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for |
| instance, when using 'perf record -g' or 'perf trace --call-graph fp'. |
| |
| This can only be done when no events are in use that have callchains |
| enabled, otherwise writing to this file will return -EBUSY. |
| |
| The default value is 8. |
| |
| ============================================================== |
| |
| pid_max: |
| |
| PID allocation wrap value. When the kernel's next PID value |
| reaches this value, it wraps back to a minimum PID value. |
| PIDs of value pid_max or larger are not allocated. |
| |
| ============================================================== |
| |
| ns_last_pid: |
| |
| The last pid allocated in the current (the one task using this sysctl |
| lives in) pid namespace. When selecting a pid for a next task on fork |
| kernel tries to allocate a number starting from this one. |
| |
| ============================================================== |
| |
| powersave-nap: (PPC only) |
| |
| If set, Linux-PPC will use the 'nap' mode of powersaving, |
| otherwise the 'doze' mode will be used. |
| |
| ============================================================== |
| |
| printk: |
| |
| The four values in printk denote: console_loglevel, |
| default_message_loglevel, minimum_console_loglevel and |
| default_console_loglevel respectively. |
| |
| These values influence printk() behavior when printing or |
| logging error messages. See 'man 2 syslog' for more info on |
| the different loglevels. |
| |
| - console_loglevel: messages with a higher priority than |
| this will be printed to the console |
| - default_message_loglevel: messages without an explicit priority |
| will be printed with this priority |
| - minimum_console_loglevel: minimum (highest) value to which |
| console_loglevel can be set |
| - default_console_loglevel: default value for console_loglevel |
| |
| ============================================================== |
| |
| printk_delay: |
| |
| Delay each printk message in printk_delay milliseconds |
| |
| Value from 0 - 10000 is allowed. |
| |
| ============================================================== |
| |
| printk_ratelimit: |
| |
| Some warning messages are rate limited. printk_ratelimit specifies |
| the minimum length of time between these messages (in jiffies), by |
| default we allow one every 5 seconds. |
| |
| A value of 0 will disable rate limiting. |
| |
| ============================================================== |
| |
| printk_ratelimit_burst: |
| |
| While long term we enforce one message per printk_ratelimit |
| seconds, we do allow a burst of messages to pass through. |
| printk_ratelimit_burst specifies the number of messages we can |
| send before ratelimiting kicks in. |
| |
| ============================================================== |
| |
| printk_devkmsg: |
| |
| Control the logging to /dev/kmsg from userspace: |
| |
| ratelimit: default, ratelimited |
| on: unlimited logging to /dev/kmsg from userspace |
| off: logging to /dev/kmsg disabled |
| |
| The kernel command line parameter printk.devkmsg= overrides this and is |
| a one-time setting until next reboot: once set, it cannot be changed by |
| this sysctl interface anymore. |
| |
| ============================================================== |
| |
| randomize_va_space: |
| |
| This option can be used to select the type of process address |
| space randomization that is used in the system, for architectures |
| that support this feature. |
| |
| 0 - Turn the process address space randomization off. This is the |
| default for architectures that do not support this feature anyways, |
| and kernels that are booted with the "norandmaps" parameter. |
| |
| 1 - Make the addresses of mmap base, stack and VDSO page randomized. |
| This, among other things, implies that shared libraries will be |
| loaded to random addresses. Also for PIE-linked binaries, the |
| location of code start is randomized. This is the default if the |
| CONFIG_COMPAT_BRK option is enabled. |
| |
| 2 - Additionally enable heap randomization. This is the default if |
| CONFIG_COMPAT_BRK is disabled. |
| |
| There are a few legacy applications out there (such as some ancient |
| versions of libc.so.5 from 1996) that assume that brk area starts |
| just after the end of the code+bss. These applications break when |
| start of the brk area is randomized. There are however no known |
| non-legacy applications that would be broken this way, so for most |
| systems it is safe to choose full randomization. |
| |
| Systems with ancient and/or broken binaries should be configured |
| with CONFIG_COMPAT_BRK enabled, which excludes the heap from process |
| address space randomization. |
| |
| ============================================================== |
| |
| reboot-cmd: (Sparc only) |
| |
| ??? This seems to be a way to give an argument to the Sparc |
| ROM/Flash boot loader. Maybe to tell it what to do after |
| rebooting. ??? |
| |
| ============================================================== |
| |
| rtsig-max & rtsig-nr: |
| |
| The file rtsig-max can be used to tune the maximum number |
| of POSIX realtime (queued) signals that can be outstanding |
| in the system. |
| |
| rtsig-nr shows the number of RT signals currently queued. |
| |
| ============================================================== |
| |
| sched_schedstats: |
| |
| Enables/disables scheduler statistics. Enabling this feature |
| incurs a small amount of overhead in the scheduler but is |
| useful for debugging and performance tuning. |
| |
| ============================================================== |
| |
| sg-big-buff: |
| |
| This file shows the size of the generic SCSI (sg) buffer. |
| You can't tune it just yet, but you could change it on |
| compile time by editing include/scsi/sg.h and changing |
| the value of SG_BIG_BUFF. |
| |
| There shouldn't be any reason to change this value. If |
| you can come up with one, you probably know what you |
| are doing anyway :) |
| |
| ============================================================== |
| |
| shmall: |
| |
| This parameter sets the total amount of shared memory pages that |
| can be used system wide. Hence, SHMALL should always be at least |
| ceil(shmmax/PAGE_SIZE). |
| |
| If you are not sure what the default PAGE_SIZE is on your Linux |
| system, you can run the following command: |
| |
| # getconf PAGE_SIZE |
| |
| ============================================================== |
| |
| shmmax: |
| |
| This value can be used to query and set the run time limit |
| on the maximum shared memory segment size that can be created. |
| Shared memory segments up to 1Gb are now supported in the |
| kernel. This value defaults to SHMMAX. |
| |
| ============================================================== |
| |
| shm_rmid_forced: |
| |
| Linux lets you set resource limits, including how much memory one |
| process can consume, via setrlimit(2). Unfortunately, shared memory |
| segments are allowed to exist without association with any process, and |
| thus might not be counted against any resource limits. If enabled, |
| shared memory segments are automatically destroyed when their attach |
| count becomes zero after a detach or a process termination. It will |
| also destroy segments that were created, but never attached to, on exit |
| from the process. The only use left for IPC_RMID is to immediately |
| destroy an unattached segment. Of course, this breaks the way things are |
| defined, so some applications might stop working. Note that this |
| feature will do you no good unless you also configure your resource |
| limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't |
| need this. |
| |
| Note that if you change this from 0 to 1, already created segments |
| without users and with a dead originative process will be destroyed. |
| |
| ============================================================== |
| |
| sysctl_writes_strict: |
| |
| Control how file position affects the behavior of updating sysctl values |
| via the /proc/sys interface: |
| |
| -1 - Legacy per-write sysctl value handling, with no printk warnings. |
| Each write syscall must fully contain the sysctl value to be |
| written, and multiple writes on the same sysctl file descriptor |
| will rewrite the sysctl value, regardless of file position. |
| 0 - Same behavior as above, but warn about processes that perform writes |
| to a sysctl file descriptor when the file position is not 0. |
| 1 - (default) Respect file position when writing sysctl strings. Multiple |
| writes will append to the sysctl value buffer. Anything past the max |
| length of the sysctl value buffer will be ignored. Writes to numeric |
| sysctl entries must always be at file position 0 and the value must |
| be fully contained in the buffer sent in the write syscall. |
| |
| ============================================================== |
| |
| softlockup_all_cpu_backtrace: |
| |
| This value controls the soft lockup detector thread's behavior |
| when a soft lockup condition is detected as to whether or not |
| to gather further debug information. If enabled, each cpu will |
| be issued an NMI and instructed to capture stack trace. |
| |
| This feature is only applicable for architectures which support |
| NMI. |
| |
| 0: do nothing. This is the default behavior. |
| |
| 1: on detection capture more debug information. |
| |
| ============================================================== |
| |
| soft_watchdog |
| |
| This parameter can be used to control the soft lockup detector. |
| |
| 0 - disable the soft lockup detector |
| 1 - enable the soft lockup detector |
| |
| The soft lockup detector monitors CPUs for threads that are hogging the CPUs |
| without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads |
| from running. The mechanism depends on the CPUs ability to respond to timer |
| interrupts which are needed for the 'watchdog/N' threads to be woken up by |
| the watchdog timer function, otherwise the NMI watchdog - if enabled - can |
| detect a hard lockup condition. |
| |
| ============================================================== |
| |
| stack_erasing |
| |
| This parameter can be used to control kernel stack erasing at the end |
| of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. |
| |
| That erasing reduces the information which kernel stack leak bugs |
| can reveal and blocks some uninitialized stack variable attacks. |
| The tradeoff is the performance impact: on a single CPU system kernel |
| compilation sees a 1% slowdown, other systems and workloads may vary. |
| |
| 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. |
| |
| 1: kernel stack erasing is enabled (default), it is performed before |
| returning to the userspace at the end of syscalls. |
| |
| ============================================================== |
| |
| tainted: |
| |
| Non-zero if the kernel has been tainted. Numeric values, which can be |
| ORed together. The letters are seen in "Tainted" line of Oops reports. |
| |
| 1 (P): A module with a non-GPL license has been loaded, this |
| includes modules with no license. |
| Set by modutils >= 2.4.9 and module-init-tools. |
| 2 (F): A module was force loaded by insmod -f. |
| Set by modutils >= 2.4.9 and module-init-tools. |
| 4 (S): Unsafe SMP processors: SMP with CPUs not designed for SMP. |
| 8 (R): A module was forcibly unloaded from the system by rmmod -f. |
| 16 (M): A hardware machine check error occurred on the system. |
| 32 (B): A bad page was discovered on the system. |
| 64 (U): The user has asked that the system be marked "tainted". This |
| could be because they are running software that directly modifies |
| the hardware, or for other reasons. |
| 128 (D): The system has died. |
| 256 (A): The ACPI DSDT has been overridden with one supplied by the user |
| instead of using the one provided by the hardware. |
| 512 (W): A kernel warning has occurred. |
| 1024 (C): A module from drivers/staging was loaded. |
| 2048 (I): The system is working around a severe firmware bug. |
| 4096 (O): An out-of-tree module has been loaded. |
| 8192 (E): An unsigned module has been loaded in a kernel supporting module |
| signature. |
| 16384 (L): A soft lockup has previously occurred on the system. |
| 32768 (K): The kernel has been live patched. |
| 65536 (X): Auxiliary taint, defined and used by for distros. |
| 131072 (T): The kernel was built with the struct randomization plugin. |
| |
| ============================================================== |
| |
| threads-max |
| |
| This value controls the maximum number of threads that can be created |
| using fork(). |
| |
| During initialization the kernel sets this value such that even if the |
| maximum number of threads is created, the thread structures occupy only |
| a part (1/8th) of the available RAM pages. |
| |
| The minimum value that can be written to threads-max is 20. |
| The maximum value that can be written to threads-max is given by the |
| constant FUTEX_TID_MASK (0x3fffffff). |
| If a value outside of this range is written to threads-max an error |
| EINVAL occurs. |
| |
| The value written is checked against the available RAM pages. If the |
| thread structures would occupy too much (more than 1/8th) of the |
| available RAM pages threads-max is reduced accordingly. |
| |
| ============================================================== |
| |
| unknown_nmi_panic: |
| |
| The value in this file affects behavior of handling NMI. When the |
| value is non-zero, unknown NMI is trapped and then panic occurs. At |
| that time, kernel debugging information is displayed on console. |
| |
| NMI switch that most IA32 servers have fires unknown NMI up, for |
| example. If a system hangs up, try pressing the NMI switch. |
| |
| ============================================================== |
| |
| watchdog: |
| |
| This parameter can be used to disable or enable the soft lockup detector |
| _and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. |
| |
| 0 - disable both lockup detectors |
| 1 - enable both lockup detectors |
| |
| The soft lockup detector and the NMI watchdog can also be disabled or |
| enabled individually, using the soft_watchdog and nmi_watchdog parameters. |
| If the watchdog parameter is read, for example by executing |
| |
| cat /proc/sys/kernel/watchdog |
| |
| the output of this command (0 or 1) shows the logical OR of soft_watchdog |
| and nmi_watchdog. |
| |
| ============================================================== |
| |
| watchdog_cpumask: |
| |
| This value can be used to control on which cpus the watchdog may run. |
| The default cpumask is all possible cores, but if NO_HZ_FULL is |
| enabled in the kernel config, and cores are specified with the |
| nohz_full= boot argument, those cores are excluded by default. |
| Offline cores can be included in this mask, and if the core is later |
| brought online, the watchdog will be started based on the mask value. |
| |
| Typically this value would only be touched in the nohz_full case |
| to re-enable cores that by default were not running the watchdog, |
| if a kernel lockup was suspected on those cores. |
| |
| The argument value is the standard cpulist format for cpumasks, |
| so for example to enable the watchdog on cores 0, 2, 3, and 4 you |
| might say: |
| |
| echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask |
| |
| ============================================================== |
| |
| watchdog_thresh: |
| |
| This value can be used to control the frequency of hrtimer and NMI |
| events and the soft and hard lockup thresholds. The default threshold |
| is 10 seconds. |
| |
| The softlockup threshold is (2 * watchdog_thresh). Setting this |
| tunable to zero will disable lockup detection altogether. |
| |
| ============================================================== |