aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* kernel/audit_tree.c: tree will leak memory when failure occurs in ↵Chen Gang2013-05-111-1/+1
| | | | | | | | | | | | | | | | | | | audit_trim_trees() commit 12b2f117f3bf738c1a00a6f64393f1953a740bd4 upstream. audit_trim_trees() calls get_tree(). If a failure occurs we must call put_tree(). [akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement] Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix ftrace_dump()Steven Rostedt (Red Hat)2013-05-112-40/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7fe70b579c9e3daba71635e31b6189394e7b79d3 upstream. ftrace_dump() had a lot of issues. What ftrace_dump() does, is when ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it will dump out the ftrace buffers to the console when either a oops, panic, or a sysrq-z occurs. This was written a long time ago when ftrace was fragile to recursion. But it wasn't written well even for that. There's a possible deadlock that can occur if a ftrace_dump() is happening and an NMI triggers another dump. This is because it grabs a lock before checking if the dump ran. It also totally disables ftrace, and tracing for no good reasons. As the ring_buffer now checks if it is read via a oops or NMI, where there's a chance that the buffer gets corrupted, it will disable itself. No need to have ftrace_dump() do the same. ftrace_dump() is now cleaned up where it uses an atomic counter to make sure only one dump happens at a time. A simple atomic_inc_return() is enough that is needed for both other CPUs and NMIs. No need for a spinlock, as if one CPU is running the dump, no other CPU needs to do it too. The tracing_on variable is turned off and not turned on. The original code did this, but it wasn't pretty. By just disabling this variable we get the result of not seeing traces that happen between crashes. For sysrq-z, it doesn't get turned on, but the user can always write a '1' to the tracing_on file. If they are using sysrq-z, then they should know about tracing_on. The new code is much easier to read and less error prone. No more deadlock possibility when an NMI triggers here. Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rcutrace: single_open() leaksAl Viro2013-05-071-4/+4
| | | | | | | | commit 7ee2b9e56495c56dcaffa2bab19b39451d9fdc8a upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* clockevents: Set dummy handler on CPU_DEAD shutdownThomas Gleixner2013-05-072-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6f7a05d7018de222e40ca003721037a530979974 upstream. Vitaliy reported that a per cpu HPET timer interrupt crashes the system during hibernation. What happens is that the per cpu HPET timer gets shut down when the nonboot cpus are stopped. When the nonboot cpus are onlined again the HPET code sets up the MSI interrupt which fires before the clock event device is registered. The event handler is still set to hrtimer_interrupt, which then crashes the machine due to highres mode not being active. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333 There is no real good way to avoid that in the HPET code. The HPET code alrady has a mechanism to detect spurious interrupts when event handler == NULL for a similar reason. We can handle that in the clockevent/tick layer and replace the previous functional handler with a dummy handler like we do in tick_setup_new_device(). The original clockevents code did this in clockevents_exchange_device(), but that got removed by commit 7c1e76897 (clockevents: prevent clockevent event_handler ending up handler_noop) which forgot to fix it up in tick_shutdown(). Same issue with the broadcast device. Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: 700333@bugs.debian.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup: fix broken file xattrsLi Zefan2013-05-071-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 712317ad97f41e738e1a19aa0a6392a78a84094e upstream. We should store file xattrs in struct cfent instead of struct cftype, because cftype is a type while cfent is object instance of cftype. For example each cgroup has a tasks file, and each tasks file is associated with a uniq cfent, but all those files share the same struct cftype. Alexey Kodanev reported a crash, which can be reproduced: # mount -t cgroup -o xattr /sys/fs/cgroup # mkdir /sys/fs/cgroup/test # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks # rmdir /sys/fs/cgroup/test # umount /sys/fs/cgroup oops! In this case, simple_xattrs_free() will free the same struct simple_xattrs twice. tj: Dropped unused local variable @cft from cgroup_diput(). Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup: fix an off-by-one bug which may trigger BUG_ON()Li Zefan2013-05-071-1/+1
| | | | | | | | | | | | | | | | | commit 3ac1707a13a3da9cfc8f242a15b2fae6df2c5f88 upstream. The 3rd parameter of flex_array_prealloc() is the number of elements, not the index of the last element. The effect of the bug is, when opening cgroup.procs, a flex array will be allocated and all elements of the array is allocated with GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to allocate memory for it, it'll trigger a BUG_ON(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hrtimer: Add expiry time overflow check in hrtimer_interruptPrarit Bhargava2013-05-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8f294b5a139ee4b75e890ad5b443c93d1e558a8b upstream. The settimeofday01 test in the LTP testsuite effectively does gettimeofday(current time); settimeofday(Jan 1, 1970 + 100 seconds); settimeofday(current time); This test causes a stack trace to be displayed on the console during the setting of timeofday to Jan 1, 1970 + 100 seconds: [ 131.066751] ------------[ cut here ]------------ [ 131.096448] WARNING: at kernel/time/clockevents.c:209 clockevents_program_event+0x135/0x140() [ 131.104935] Hardware name: Dinar [ 131.108150] Modules linked in: sg nfsv3 nfs_acl nfsv4 auth_rpcgss nfs dns_resolver fscache lockd sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables kvm_amd kvm sp5100_tco bnx2 i2c_piix4 crc32c_intel k10temp fam15h_power ghash_clmulni_intel amd64_edac_mod pcspkr serio_raw edac_mce_amd edac_core microcode xfs libcrc32c sr_mod sd_mod cdrom ata_generic crc_t10dif pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ahci pata_atiixp libahci libata usb_storage i2c_core dm_mirror dm_region_hash dm_log dm_mod [ 131.176784] Pid: 0, comm: swapper/28 Not tainted 3.8.0+ #6 [ 131.182248] Call Trace: [ 131.184684] <IRQ> [<ffffffff810612af>] warn_slowpath_common+0x7f/0xc0 [ 131.191312] [<ffffffff8106130a>] warn_slowpath_null+0x1a/0x20 [ 131.197131] [<ffffffff810b9fd5>] clockevents_program_event+0x135/0x140 [ 131.203721] [<ffffffff810bb584>] tick_program_event+0x24/0x30 [ 131.209534] [<ffffffff81089ab1>] hrtimer_interrupt+0x131/0x230 [ 131.215437] [<ffffffff814b9600>] ? cpufreq_p4_target+0x130/0x130 [ 131.221509] [<ffffffff81619119>] smp_apic_timer_interrupt+0x69/0x99 [ 131.227839] [<ffffffff8161805d>] apic_timer_interrupt+0x6d/0x80 [ 131.233816] <EOI> [<ffffffff81099745>] ? sched_clock_cpu+0xc5/0x120 [ 131.240267] [<ffffffff814b9ff0>] ? cpuidle_wrap_enter+0x50/0xa0 [ 131.246252] [<ffffffff814b9fe9>] ? cpuidle_wrap_enter+0x49/0xa0 [ 131.252238] [<ffffffff814ba050>] cpuidle_enter_tk+0x10/0x20 [ 131.257877] [<ffffffff814b9c89>] cpuidle_idle_call+0xa9/0x260 [ 131.263692] [<ffffffff8101c42f>] cpu_idle+0xaf/0x120 [ 131.268727] [<ffffffff815f8971>] start_secondary+0x255/0x257 [ 131.274449] ---[ end trace 1151a50552231615 ]--- When we change the system time to a low value like this, the value of timekeeper->offs_real will be a negative value. It seems that the WARN occurs because an hrtimer has been started in the time between the releasing of the timekeeper lock and the IPI call (via a call to on_each_cpu) in clock_was_set() in the do_settimeofday() code. The end result is that a REALTIME_CLOCK timer has been added with softexpires = expires = KTIME_MAX. The hrtimer_interrupt() fires/is called and the loop at kernel/hrtimer.c:1289 is executed. In this loop the code subtracts the clock base's offset (which was set to timekeeper->offs_real in do_settimeofday()) from the current hrtimer_cpu_base->expiry value (which was KTIME_MAX): KTIME_MAX - (a negative value) = overflow A simple check for an overflow can resolve this problem. Using KTIME_MAX instead of the overflow value will result in the hrtimer function being run, and the reprogramming of the timer after that. Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Prarit Bhargava <prarit@redhat.com> [jstultz: Tweaked commit subject] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hrtimer: Fix ktime_add_ns() overflow on 32bit architecturesDavid Engraf2013-05-071-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 51fd36f3fad8447c487137ae26b9d0b3ce77bb25 upstream. One can trigger an overflow when using ktime_add_ns() on a 32bit architecture not supporting CONFIG_KTIME_SCALAR. When passing a very high value for u64 nsec, e.g. 7881299347898368000 the do_div() function converts this value to seconds (7881299347) which is still to high to pass to the ktime_set() function as long. The result in is a negative value. The problem on my system occurs in the tick-sched.c, tick_nohz_stop_sched_tick() when time_delta is set to timekeeping_max_deferment(). The check for time_delta < KTIME_MAX is valid, thus ktime_add_ns() is called with a too large value resulting in a negative expire value. This leads to an endless loop in the ticker code: time_delta: 7881299347898368000 expires = ktime_add_ns(last_update, time_delta) expires: negative value This fix caps the value to KTIME_MAX. This error doesn't occurs on 64bit or architectures supporting CONFIG_KTIME_SCALAR (e.g. ARM, x86-32). Signed-off-by: David Engraf <david.engraf@sysgo.com> [jstultz: Minor tweaks to commit message & header] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Reset ftrace_graph_filter_enabled if count is zeroNamhyung Kim2013-05-071-1/+2
| | | | | | | | | | | | | | | | | commit 9f50afccfdc15d95d7331acddcb0f7703df089ae upstream. The ftrace_graph_count can be decreased with a "!" pattern, so that the enabled flag should be updated too. Link: http://lkml.kernel.org/r/1365663698-2413-1-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Check return value of tracing_init_dentry()Namhyung Kim2013-05-073-0/+6
| | | | | | | | | | | | | | | commit ed6f1c996bfe4b6e520cf7a74b51cd6988d84420 upstream. Check return value and bail out if it's NULL. Link: http://lkml.kernel.org/r/1365553093-10180-2-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix off-by-one on allocating stat->pagesNamhyung Kim2013-05-071-1/+1
| | | | | | | | | | | | | | | commit 39e30cd1537937d3c00ef87e865324e981434e5b upstream. The first page was allocated separately, so no need to start from 0. Link: http://lkml.kernel.org/r/1364820385-32027-2-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Remove most or all of stack tracer stack size from stack_max_sizeSteven Rostedt (Red Hat)2013-05-071-21/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4df297129f622bdc18935c856f42b9ddd18f9f28 upstream. Currently, the depth reported in the stack tracer stack_trace file does not match the stack_max_size file. This is because the stack_max_size includes the overhead of stack tracer itself while the depth does not. The first time a max is triggered, a calculation is not performed that figures out the overhead of the stack tracer and subtracts it from the stack_max_size variable. The overhead is stored and is subtracted from the reported stack size for comparing for a new max. Now the stack_max_size corresponds to the reported depth: # cat stack_max_size 4640 # cat stack_trace Depth Size Location (48 entries) ----- ---- -------- 0) 4640 32 _raw_spin_lock+0x18/0x24 1) 4608 112 ____cache_alloc+0xb7/0x22d 2) 4496 80 kmem_cache_alloc+0x63/0x12f 3) 4416 16 mempool_alloc_slab+0x15/0x17 [...] While testing against and older gcc on x86 that uses mcount instead of fentry, I found that pasing in ip + MCOUNT_INSN_SIZE let the stack trace show one more function deep which was missing before. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix stack tracer with fentry useSteven Rostedt (Red Hat)2013-05-071-4/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d4ecbfc49b4b1d4b597fb5ba9e4fa25d62f105c5 upstream. When gcc 4.6 on x86 is used, the function tracer will use the new option -mfentry which does a call to "fentry" at every function instead of "mcount". The significance of this is that fentry is called as the first operation of the function instead of the mcount usage of being called after the stack. This causes the stack tracer to show some bogus results for the size of the last function traced, as well as showing "ftrace_call" instead of the function. This is due to the stack frame not being set up by the function that is about to be traced. # cat stack_trace Depth Size Location (48 entries) ----- ---- -------- 0) 4824 216 ftrace_call+0x5/0x2f 1) 4608 112 ____cache_alloc+0xb7/0x22d 2) 4496 80 kmem_cache_alloc+0x63/0x12f The 216 size for ftrace_call includes both the ftrace_call stack (which includes the saving of registers it does), as well as the stack size of the parent. To fix this, if CC_USING_FENTRY is defined, then the stack_tracer will reserve the first item in stack_dump_trace[] array when calling save_stack_trace(), and it will fill it in with the parent ip. Then the code will look for the parent pointer on the stack and give the real size of the parent's stack pointer: # cat stack_trace Depth Size Location (14 entries) ----- ---- -------- 0) 2640 48 update_group_power+0x26/0x187 1) 2592 224 update_sd_lb_stats+0x2a5/0x4ac 2) 2368 160 find_busiest_group+0x31/0x1f1 3) 2208 256 load_balance+0xd9/0x662 I'm Cc'ing stable, although it's not urgent, as it only shows bogus size for item #0, the rest of the trace is legit. It should still be corrected in previous stable releases. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Use stack of calling function for stack tracerSteven Rostedt (Red Hat)2013-05-071-5/+7
| | | | | | | | | | | | | | commit 87889501d0adfae10e3b0f0e6f2d7536eed9ae84 upstream. Use the stack of stack_trace_call() instead of check_stack() as the test pointer for max stack size. It makes it a bit cleaner and a little more accurate. Adding stable, as a later fix depends on this patch. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix selftest function recursion accountingSteven Rostedt2013-05-011-13/+3
| | | | | | | | | | | | | | | | commit 05cbbf643b8eea1be21082c53cdb856d1dc6d765 upstream. The test that checks function recursion does things differently if the arch does not support all ftrace features. But that really doesn't make a difference with how the test runs, and either way the count variable should be 2 at the end. Currently the test wrongly fails for archs that don't support all the ftrace features. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Add file_ns_capable() helper function for open-time capability checkingLinus Torvalds2013-04-261-0/+24
| | | | | | | | | | | | | | | commit 935d8aabd4331f47a89c3e1daa5779d23cf244ee upstream. Nothing is using it yet, but this will allow us to delay the open-time checks to use time, without breaking the normal UNIX permission semantics where permissions are determined by the opener (and the file descriptor can then be passed to a different process, or the process can drop capabilities). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shea Levy <shea@shealevy.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Changing any namespace id mappings should require privilegesAndy Lutomirski2013-04-251-3/+3
| | | | | | | | | | | | | | | | commit 41c21e351e79004dbb4efa4bc14a53a7e0af38c5 upstream. Changing uid/gid/projid mappings doesn't change your id within the namespace; it reconfigures the namespace. Unprivileged programs should *not* be able to write these files. (We're also checking the privileges on the wrong task.) Given the write-once nature of these files and the other security checks, this is likely impossible to usefully exploit. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Check uid_map's opener's fsuid, not the current fsuidAndy Lutomirski2013-04-251-2/+2
| | | | | | | | commit e3211c120a85b792978bcb4be7b2886df18d27f0 upstream. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Don't let unprivileged users trick privileged users into setting the ↵Eric W. Biederman2013-04-251-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | id_map commit 6708075f104c3c9b04b23336bb0366ca30c3931b upstream. When we require privilege for setting /proc/<pid>/uid_map or /proc/<pid>/gid_map no longer allow an unprivileged user to open the file and pass it to a privileged program to write to the file. Instead when privilege is required require both the opener and the writer to have the necessary capabilities. I have tested this code and verified that setting /proc/<pid>/uid_map fails when an unprivileged user opens the file and a privielged user attempts to set the mapping, that unprivileged users can still map their own id, and that a privileged users can still setup an arbitrary mapping. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf: Treat attr.config as u64 in perf_swevent_init()Tommi Rantala2013-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | commit 8176cced706b5e5d15887584150764894e94e02f upstream. Trinity discovered that we fail to check all 64 bits of attr.config passed by user space, resulting to out-of-bounds access of the perf_swevent_enabled array in sw_perf_event_destroy(). Introduced in commit b0a873ebb ("perf: Register PMU implementations"). Signed-off-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/debug: Fix sd->*_idx limit range avoiding overflowlibin2013-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | commit fd9b86d37a600488dbd80fe60cca46b822bff1cd upstream. Commit 201c373e8e ("sched/debug: Limit sd->*_idx range on sysctl") was an incomplete bug fix. This patch fixes sd->*_idx limit range to [0 ~ CPU_LOAD_IDX_MAX-1] avoiding array overflow caused by setting sd->*_idx to CPU_LOAD_IDX_MAX on sysctl. Signed-off-by: Libin <huawei.libin@huawei.com> Cc: <jiang.liu@huawei.com> Cc: <guohanjun@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51626610.2040607@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()sTejun Heo2013-04-251-2/+4
| | | | | | | | | | | | | | | | | | | | | | commit 383efcd00053ec40023010ce5034bd702e7ab373 upstream. try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kernel/signal.c: stop info leak via the tkill and the tgkill syscallsEmese Revfy2013-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b9e146d8eb3b9ecae5086d373b50fa0c1f3e7f0f upstream. This fixes a kernel memory contents leak via the tkill and tgkill syscalls for compat processes. This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field when handling signals delivered from tkill. The place of the infoleak: int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) { ... put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr); ... } Signed-off-by: Emese Revfy <re.emese@gmail.com> Reviewed-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hrtimer: Don't reinitialize a cpu_base lock on CPU_UPMichael Bohan2013-04-251-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 84cc8fd2fe65866e49d70b38b3fdf7219dd92fe0 upstream. The current code makes the assumption that a cpu_base lock won't be held if the CPU corresponding to that cpu_base is offline, which isn't always true. If a hrtimer is not queued, then it will not be migrated by migrate_hrtimers() when a CPU is offlined. Therefore, the hrtimer's cpu_base may still point to a CPU which has subsequently gone offline if the timer wasn't enqueued at the time the CPU went down. Normally this wouldn't be a problem, but a cpu_base's lock is blindly reinitialized each time a CPU is brought up. If a CPU is brought online during the period that another thread is performing a hrtimer operation on a stale hrtimer, then the lock will be reinitialized under its feet, and a SPIN_BUG() like the following will be observed: <0>[ 28.082085] BUG: spinlock already unlocked on CPU#0, swapper/0/0 <0>[ 28.087078] lock: 0xc4780b40, value 0x0 .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1 <4>[ 42.451150] [<c0014398>] (unwind_backtrace+0x0/0x120) from [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) <4>[ 42.460430] [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) from [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) <4>[ 42.469632] [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) from [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) <4>[ 42.479521] [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) from [<c00aa014>] (hrtimer_start+0x20/0x28) <4>[ 42.489247] [<c00aa014>] (hrtimer_start+0x20/0x28) from [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) <4>[ 42.498709] [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) from [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) <4>[ 42.508259] [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) from [<c000f268>] (cpu_idle+0x24/0xf0) <4>[ 42.516503] [<c000f268>] (cpu_idle+0x24/0xf0) from [<c06ed3c0>] (rest_init+0x88/0xa0) <4>[ 42.524319] [<c06ed3c0>] (rest_init+0x88/0xa0) from [<c0c00978>] (start_kernel+0x3d0/0x434) As an example, this particular crash occurred when hrtimer_start() was executed on CPU #0. The code locked the hrtimer's current cpu_base corresponding to CPU #1. CPU #0 then tried to switch the hrtimer's cpu_base to an optimal CPU which was online. In this case, it selected the cpu_base corresponding to CPU #3. Before it could proceed, CPU #1 came online and reinitialized the spinlock corresponding to its cpu_base. Thus now CPU #0 held a lock which was reinitialized. When CPU #0 finally ended up unlocking the old cpu_base corresponding to CPU #1 so that it could switch to CPU #3, we hit this SPIN_BUG() above while in switch_hrtimer_base(). CPU #0 CPU #1 ---- ---- ... <offline> hrtimer_start() lock_hrtimer_base(base #1) ... init_hrtimers_cpu() switch_hrtimer_base() ... ... raw_spin_lock_init(&cpu_base->lock) raw_spin_unlock(&cpu_base->lock) ... <spin_bug> Solve this by statically initializing the lock. Signed-off-by: Michael Bohan <mbohan@codeaurora.org> Link: http://lkml.kernel.org/r/1363745965-23475-1-git-send-email-mbohan@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kthread: Prevent unpark race which puts threads on the wrong cpuThomas Gleixner2013-04-251-24/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f2530dc71cf0822f90bb63ea4600caaef33a66bb upstream. The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy: CPU0 CPU1 CPU2 unpark(T) wake_up_process(T) clear(SHOULD_PARK) T runs leave parkme() due to !SHOULD_PARK bind_to(CPU2) BUG_ON(wrong CPU) We cannot let the tasks move themself to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup. Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, its a good hint in ps and friends that these tasks aren't really there for the moment. The migration thread has another related issue. CPU0 CPU1 Bring up CPU2 create_thread(T) park(T) wait_for_completion() parkme() complete() sched_set_stop_task() schedule(TASK_PARKED) The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronizaion before sched_set_stop_task(). Reported-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Dave Hansen <dave@sr71.net> Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Acked-by: Peter Ziljstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: dhillf@gmail.com Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched_clock: Prevent 64bit inatomicity on 32bit systemsThomas Gleixner2013-04-161-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 upstream. The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE sectionSteven Rostedt (Red Hat)2013-04-161-14/+14
| | | | | | | | | | | | | commit 7f49ef69db6bbf756c0abca7e9b65b32e999eec8 upstream. As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the ftrace_pid_fops is defined when DYNAMIC_FTRACE is not. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix possible NULL pointer dereferencesNamhyung Kim2013-04-162-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6a76f8c0ab19f215af2a3442870eeb5f0e81998d upstream. Currently set_ftrace_pid and set_graph_function files use seq_lseek for their fops. However seq_open() is called only for FMODE_READ in the fops->open() so that if an user tries to seek one of those file when she open it for writing, it sees NULL seq_file and then panic. It can be easily reproduced with following command: $ cd /sys/kernel/debug/tracing $ echo 1234 | sudo tee -a set_ftrace_pid In this example, GNU coreutils' tee opens the file with fopen(, "a") and then the fopen() internally calls lseek(). Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: stable@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()Huacai Chen2013-04-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | commit 6f389a8f1dd22a24f3d9afc2812b30d639e94625 upstream. As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core subsystems PM) say, syscore_ops operations should be carried with one CPU on-line and interrupts disabled. However, after commit f96972f2d (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()), syscore_shutdown() is called before disable_nonboot_cpus(), so break the rules. We have a MIPS machine with a 8259A PIC, and there is an external timer (HPET) linked at 8259A. Since 8259A has been shutdown too early (by syscore_shutdown()), disable_nonboot_cpus() runs without timer interrupt, so it hangs and reboot fails. This patch call syscore_shutdown() a little later (after disable_nonboot_cpus()) to avoid reboot failure, this is the same way as poweroff does. For consistency, add disable_nonboot_cpus() to kernel_halt(). Signed-off-by: Huacai Chen <chenhc@lemote.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix double free when function profile init failedNamhyung Kim2013-04-161-1/+0
| | | | | | | | | | | | | | | | commit 83e03b3fe4daffdebbb42151d5410d730ae50bd1 upstream. On the failure path, stat->start and stat->pages will refer same page. So it'll attempt to free the same page again and get kernel panic. Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Consistently restore trace function on sysctl enablingJan Kiszka2013-04-121-6/+2
| | | | | | | | | | | | | | | commit 5000c418840b309251c5887f0b56503aae30f84c upstream. If we reenable ftrace via syctl, we currently set ftrace_trace_function based on the previous simplistic algorithm. This is inconsistent with what update_ftrace_function does. So better call that helper instead. Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Prevent buffer overwrite disabled for latency tracersSteven Rostedt (Red Hat)2013-04-054-16/+65
| | | | | | | | | | | | | | | | | | commit 613f04a0f51e6e68ac6fe571ab79da3c0a5eb4da upstream. The latency tracers require the buffers to be in overwrite mode, otherwise they get screwed up. Force the buffers to stay in overwrite mode when latency tracers are enabled. Added a flag_changed() method to the tracer structure to allow the tracers to see what flags are being changed, and also be able to prevent the change from happing. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Reviewed-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kernel/signal.c: use __ARCH_HAS_SA_RESTORER instead of SA_RESTORERAndrew Morton2013-04-051-1/+1
| | | | | | | | | | | | | | | | | | | | | commit 522cff142d7d2f9230839c9e1f21a4d8bcc22a4a upstream. __ARCH_HAS_SA_RESTORER is the preferred conditional for use in 3.9 and later kernels, per Kees. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Emese Revfy <re.emese@gmail.com> Cc: Emese Revfy <re.emese@gmail.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Restrict when proc and sysfs can be mountedEric W. Biederman2013-04-052-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | commit 87a8ebd637dafc255070f503909a053cf0d98d3f upstream. Only allow unprivileged mounts of proc and sysfs if they are already mounted when the user namespace is created. proc and sysfs are interesting because they have content that is per namespace, and so fresh mounts are needed when new namespaces are created while at the same time proc and sysfs have content that is shared between every instance. Respect the policy of who may see the shared content of proc and sysfs by only allowing new mounts if there was an existing mount at the time the user namespace was created. In practice there are only two interesting cases: proc and sysfs are mounted at their usual places, proc and sysfs are not mounted at all (some form of mount namespace jail). Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Don't allow creation if the user is chrootedEric W. Biederman2013-04-051-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3151527ee007b73a0ebd296010f1c0454a919c7d upstream. Guarantee that the policy of which files may be access that is established by setting the root directory will not be violated by user namespaces by verifying that the root directory points to the root of the mount namespace at the time of user namespace creation. Changing the root is a privileged operation, and as a matter of policy it serves to limit unprivileged processes to files below the current root directory. For reasons of simplicity and comprehensibility the privilege to change the root directory is gated solely on the CAP_SYS_CHROOT capability in the user namespace. Therefore when creating a user namespace we must ensure that the policy of which files may be access can not be violated by changing the root directory. Anyone who runs a processes in a chroot and would like to use user namespace can setup the same view of filesystems with a mount namespace instead. With this result that this is not a practical limitation for using user namespaces. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* pid: Handle the exit of a multi-threaded init.Eric W. Biederman2013-04-051-1/+2
| | | | | | | | | | | | | | | | | | | commit 751c644b95bb48aaa8825f0c66abbcc184d92051 upstream. When a multi-threaded init exits and the initial thread is not the last thread to exit the initial thread hangs around as a zombie until the last thread exits. In that case zap_pid_ns_processes needs to wait until there are only 2 hashed pids in the pid namespace not one. v2. Replace thread_pid_vnr(me) == 1 with the test thread_group_leader(me) as suggested by Oleg. Reported-by: Caj Larsson <caj@omnicloud.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* clockevents: Don't allow dummy broadcast timersMark Rutland2013-03-281-1/+2
| | | | | | | | | | | | | | | | | | | | | commit a7dc19b8652c862d5b7c4d2339bd3c428bd29c4a upstream. Currently tick_check_broadcast_device doesn't reject clock_event_devices with CLOCK_EVT_FEAT_DUMMY, and may select them in preference to real hardware if they have a higher rating value. In this situation, the dummy timer is responsible for broadcasting to itself, and the core clockevents code may attempt to call non-existent callbacks for programming the dummy, eventually leading to a panic. This patch makes tick_check_broadcast_device always reject dummy timers, preventing this problem. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Jon Medhurst (Tixy) <tixy@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Keep overwrite in sync between regular and snapshot buffersSteven Rostedt (Red Hat)2013-03-281-1/+5
| | | | | | | | | | | | | | | | commit 80902822658aab18330569587cdb69ac1dfdcea8 upstream. Changing the overwrite mode for the ring buffer via the trace option only sets the normal buffer. But the snapshot buffer could swap with it, and then the snapshot would be in non overwrite mode and the normal buffer would be in overwrite mode, even though the option flag states otherwise. Keep the two buffers overwrite modes in sync. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Protect tracer flags with trace_types_lockSteven Rostedt (Red Hat)2013-03-281-4/+8
| | | | | | | | | | | | | | | | commit 69d34da2984c95b33ea21518227e1f9470f11d95 upstream. Seems that the tracer flags have never been protected from synchronous writes. Luckily, admins don't usually modify the tracing flags via two different tasks. But if scripts were to be used to modify them, then they could get corrupted. Move the trace_types_lock that protects against tracers changing to also protect the flags being set. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix free of probe entry by calling call_rcu_sched()Steven Rostedt (Red Hat)2013-03-281-2/+2
| | | | | | | | | | | | | | | | | | | | commit 740466bc89ad8bd5afcc8de220f715f62b21e365 upstream. Because function tracing is very invasive, and can even trace calls to rcu_read_lock(), RCU access in function tracing is done with preempt_disable_notrace(). This requires a synchronize_sched() for updates and not a synchronize_rcu(). Function probes (traceon, traceoff, etc) must be freed after a synchronize_sched() after its entry has been removed from the hash. But call_rcu() is used. Fix this by using call_rcu_sched(). Also fix the usage to use hlist_del_rcu() instead of hlist_del(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix race in snapshot swappingSteven Rostedt (Red Hat)2013-03-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | commit 2721e72dd10f71a3ba90f59781becf02638aa0d9 upstream. Although the swap is wrapped with a spin_lock, the assignment of the temp buffer used to swap is not within that lock. It needs to be moved into that lock, otherwise two swaps happening on two different CPUs, can end up using the wrong temp buffer to assign in the swap. Luckily, all current callers of the swap function appear to have their own locks. But in case something is added that allows two different callers to call the swap, then there's a chance that this race can trigger and corrupt the buffers. New code is coming soon that will allow for this race to trigger. I've Cc'd stable, so this bug will not show up if someone backports one of the changes that can trigger this bug. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* signal: always clear sa_restorer on execveKees Cook2013-03-201-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2ca39528c01a933f6689cd6505ce65bd6d68a530 upstream. When the new signal handlers are set up, the location of sa_restorer is not cleared, leaking a parent process's address space location to children. This allows for a potential bypass of the parent's ASLR by examining the sa_restorer value returned when calling sigaction(). Based on what should be considered "secret" about addresses, it only matters across the exec not the fork (since the VMAs haven't changed until the exec). But since exec sets SIG_DFL and keeps sa_restorer, this is where it should be fixed. Given the few uses of sa_restorer, a "set" function was not written since this would be the only use. Instead, we use __ARCH_HAS_SA_RESTORER, as already done in other places. Example of the leak before applying this patch: $ cat /proc/$$/maps ... 7fb9f3083000-7fb9f3238000 r-xp 00000000 fd:01 404469 .../libc-2.15.so ... $ ./leak ... 7f278bc74000-7f278be29000 r-xp 00000000 fd:01 404469 .../libc-2.15.so ... 1 0 (nil) 0x7fb9f30b94a0 2 4000000 (nil) 0x7f278bcaa4a0 3 4000000 (nil) 0x7f278bcaa4a0 4 0 (nil) 0x7fb9f30b94a0 ... [akpm@linux-foundation.org: use SA_RESTORER for backportability] Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Emese Revfy <re.emese@gmail.com> Cc: Emese Revfy <re.emese@gmail.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: Don't allow CLONE_NEWUSER | CLONE_FSEric W. Biederman2013-03-142-1/+8
| | | | | | | | | | | | | | | | | | | | | | | commit e66eded8309ebf679d3d3c1f5820d1f2ca332c71 upstream. Don't allowing sharing the root directory with processes in a different user namespace. There doesn't seem to be any point, and to allow it would require the overhead of putting a user namespace reference in fs_struct (for permission checks) and incrementing that reference count on practically every call to fork. So just perform the inexpensive test of forbidding sharing fs_struct acrosss processes in different user namespaces. We already disallow other forms of threading when unsharing a user namespace so this should be no real burden in practice. This updates setns, clone, and unshare to disallow multiple user namespaces sharing an fs_struct. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Update the kconfig for DYNAMIC_FTRACESteven Rostedt2013-03-141-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | commit db05021d49a994ee40a9735d9c3cb0060c9babb8 upstream. The prompt to enable DYNAMIC_FTRACE (the ability to nop and enable function tracing at run time) had a confusing statement: "enable/disable ftrace tracepoints dynamically" This was written before tracepoints were added to the kernel, but now that tracepoints have been added, this is very confusing and has confused people enough to give wrong information during presentations. Not only that, I looked at the help text, and it still references that dreaded daemon that use to wake up once a second to update the nop locations and brick NICs, that hasn't been around for over five years. Time to bring the text up to the current decade. Reported-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nohz: Make tick_nohz_irq_exit() irq safeFrederic Weisbecker2013-03-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e5ab012c3271990e8457055c25cafddc1ae8aa6b upstream. As it stands, irq_exit() may or may not be called with irqs disabled, depending on __ARCH_IRQ_EXIT_IRQS_DISABLED that the arch can define. It makes tick_nohz_irq_exit() unsafe. For example two interrupts can race in tick_nohz_stop_sched_tick(): the inner most one computes the expiring time on top of the timer list, then it's interrupted right before reprogramming the clock. The new interrupt enqueues a new timer list timer, it reprogram the clock to take it into account and it exits. The CPUs resumes the inner most interrupt and performs the clock reprogramming without considering the new timer list timer. This regression has been introduced by: 280f06774afedf849f0b34248ed6aff57d0f6908 ("nohz: Separate out irq exit and idle loop dyntick logic") Let's fix it right now with the appropriate protections. A saner long term solution will be to remove __ARCH_IRQ_EXIT_IRQS_DISABLED and mandate that irq_exit() is called with interrupts disabled. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Link: http://lkml.kernel.org/r/1361373336-11337-1-git-send-email-fweisbec@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup: fix exit() vs rmdir() raceLi Zefan2013-03-031-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | commit 71b5707e119653039e6e95213f00479668c79b75 upstream. In cgroup_exit() put_css_set_taskexit() is called without any lock, which might lead to accessing a freed cgroup: thread1 thread2 --------------------------------------------- exit() cgroup_exit() put_css_set_taskexit() atomic_dec(cgrp->count); rmdir(); /* not safe !! */ check_for_release(cgrp); rcu_read_lock() can be used to make sure the cgroup is alive. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cpuset: fix cpuset_print_task_mems_allowed() vs rename() raceLi Zefan2013-03-031-2/+10
| | | | | | | | | | | | | | | | | | commit 63f43f55c9bbc14f76b582644019b8a07dc8219a upstream. rename() will change dentry->d_name. The result of this race can be worse than seeing partially rewritten name, but we might access a stale pointer because rename() will re-allocate memory to hold a longer name. It's safe in the protection of dentry->d_lock. v2: check NULL dentry before acquiring dentry lock. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: consider work function when searching for busy work itemsTejun Heo2013-03-031-7/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a2c1c57be8d9fd5b716113c8991d3d702eeacf77 upstream. To avoid executing the same work item concurrenlty, workqueue hashes currently busy workers according to their current work items and looks up the the table when it wants to execute a new work item. If there already is a worker which is executing the new work item, the new item is queued to the found worker so that it gets executed only after the current execution finishes. Unfortunately, a work item may be freed while being executed and thus recycled for different purposes. If it gets recycled for a different work item and queued while the previous execution is still in progress, workqueue may make the new work item wait for the old one although the two aren't really related in any way. In extreme cases, this false dependency may lead to deadlock although it's extremely unlikely given that there aren't too many self-freeing work item users and they usually don't wait for other work items. To alleviate the problem, record the current work function in each busy worker and match it together with the work item address in find_worker_executing_work(). While this isn't complete, it ensures that unrelated work items don't interact with each other and in the very unlikely case where a twisted wq user triggers it, it's always onto itself making the culprit easy to spot. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrey Isakov <andy51@gmx.ru> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51701 Cc: stable@vger.kernel.org
* sysctl: fix null checking in bin_dn_node_address()Xi Wang2013-03-031-1/+2
| | | | | | | | | | | | | | commit df1778be1a33edffa51d094eeda87c858ded6560 upstream. The null check of `strchr() + 1' is broken, which is always non-null, leading to OOB read. Instead, check the result of strchr(). Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Call ftrace cleanup module notifier after all other notifiersSteven Rostedt (Red Hat)2013-03-031-14/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8c189ea64eea01ca20d102ddb74d6936dd16c579 upstream. Commit: c1bf08ac "ftrace: Be first to run code modification on modules" changed ftrace module notifier's priority to INT_MAX in order to process the ftrace nops before anything else could touch them (namely kprobes). This was the correct thing to do. Unfortunately, the ftrace module notifier also contains the ftrace clean up code. As opposed to the set up code, this code should be run *after* all the module notifiers have run in case a module is doing correct clean-up and unregisters its ftrace hooks. Basically, ftrace needs to do clean up on module removal, as it needs to know about code being removed so that it doesn't try to modify that code. But after it removes the module from its records, if a ftrace user tries to remove a probe, that removal will fail due as the record of that code segment no longer exists. Nothing really bad happens if the probe removal is called after ftrace did the clean up, but the ftrace removal function will return an error. Correct code (such as kprobes) will produce a WARN_ON() if it fails to remove the probe. As people get annoyed by frivolous warnings, it's best to do the ftrace clean up after everything else. By splitting the ftrace_module_notifier into two notifiers, one that does the module load setup that is run at high priority, and the other that is called for module clean up that is run at low priority, the problem is solved. Reported-by: Frank Ch. Eigler <fche@redhat.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>