path: root/include
Age  Commit message  Author
2010-07-20  sched: _cpu_down(): Don't play with current->cpus_allowed  (Oleg Nesterov)
_cpu_down() changes the current task's affinity and then recovers it at the end. The problems are well known: we can't restore old_allowed if it was bound to the now-dead cpu, and we can race with userspace, which can change cpu affinity during unplug. _cpu_down() should not play with current->cpus_allowed at all. Instead, take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable() removes the dying cpu from cpu_online_mask. [ upstream commit: 6a1bdc1b577ebcb65f6603c57f8347309bc4ab13 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091023.GA9148@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
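A rough sketch of the approach (struct and helper names approximate, borrowed from kernel/cpu.c and kernel/sched.c of that era; not the literal upstream diff): the stop-machine callback migrates the hotplug initiator itself once the cpu is offline.

    struct take_cpu_down_param {
        struct task_struct *caller;   /* the task running _cpu_down() */
        unsigned long mod;
        void *hcpu;
    };

    static int take_cpu_down(void *_param)
    {
        struct take_cpu_down_param *param = _param;
        unsigned int cpu = (unsigned long)param->hcpu;
        int err;

        /* Ensure this CPU doesn't handle any more interrupts and
         * drops out of cpu_online_mask. */
        err = __cpu_disable();
        if (err < 0)
            return err;

        /* The dying cpu is no longer online, so the caller can be
         * kicked off it here instead of _cpu_down() rewriting
         * current->cpus_allowed beforehand. */
        if (task_cpu(param->caller) == cpu)
            move_task_off_dead_cpu(cpu, param->caller);
        return 0;
    }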
2010-07-20  sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code  (Oleg Nesterov)
This patch just states the fact that the cpusets/cpuhotplug interaction is broken and removes the deadlockable code which only pretends to work.
- cpuset_lock() doesn't really work. It is needed for cpuset_cpus_allowed_locked(), but we can't take this lock in the try_to_wake_up()->select_fallback_rq() path.
- cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU takes callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex, stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take cpuset_lock() and hangs forever, because the CPU is already dead and thus T can't be scheduled.
- cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock(), which is not irq-safe, but try_to_wake_up() can be called from irq context.
Kill them, and change select_fallback_rq() to use cpu_possible_mask, like we currently do without CONFIG_CPUSETS. Also, with or without this patch, with or without CONFIG_CPUSETS, the callers of select_fallback_rq() can race with each other or with set_cpus_allowed() paths. The subsequent patches try to fix these problems. [ upstream: 897f0b3c3ff40b443c84e271bef19bd6ae885195 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091003.GA9123@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.y  (Thomas Gleixner)
Conflicts: Makefile Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-13  vfs: Revert the scalability patches  (Thomas Gleixner)
We still have sporadic and hard to debug problems. Revert it for now and revisit with Nick's new version. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-05  KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)  (Shane Wang)
Per the Intel documentation, for the feature control MSR: Bit 1 enables VMXON in SMX operation. If the bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. This patch enables this SMX-aware check for VMXON in KVM. Signed-off-by: Shane Wang <shane.wang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit cafd66595d92591e4bd25c3904e004fc6f897e2d)
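Concretely, the bios-disabled check becomes SMX-aware, roughly like this (a sketch, not the verbatim patch; constant names per the 2.6.3x-era msr-index.h and tboot headers):

    #define FEATURE_CONTROL_LOCKED                     (1<<0)
    #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
    #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX  (1<<2)

    static int vmx_disabled_by_bios(void)
    {
        u64 msr;

        rdmsrl(MSR_IA32_FEATURE_CONTROL, msr);
        if (!(msr & FEATURE_CONTROL_LOCKED))
            return 0;   /* unlocked: kvm may still enable VMX itself */
        /* In SMX operation (tboot) bit 1 must be set; outside SMX
         * operation bit 2 must be set, else VMXON would #GP. */
        if (tboot_enabled())
            return !(msr & FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX);
        return !(msr & FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX);
    }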
2010-07-05  vfs: add NOFOLLOW flag to umount(2)  (Miklos Szeredi)
commit db1f05bb85d7966b9176e293f3ceead1cb8b5d79 upstream. Add a new UMOUNT_NOFOLLOW flag to umount(2). This is needed to prevent symlink attacks in unprivileged unmounts (fuse, samba, ncpfs). Additionally, return -EINVAL if an unknown flag is used (and specify an explicitly unused flag: UMOUNT_UNUSED). This makes it possible for the caller to determine if a flag is supported or not. CC: Eugene Teo <eugene@redhat.com> CC: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
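From userspace the new flag is passed like any other umount(2) flag; a minimal sketch (the fallback define matches the value added to include/linux/fs.h by this patch):

    #include <stdio.h>
    #include <errno.h>
    #include <sys/mount.h>

    #ifndef UMOUNT_NOFOLLOW
    #define UMOUNT_NOFOLLOW 0x00000008  /* don't follow symlinks on umount */
    #endif

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
            return 2;
        }
        /* On kernels that validate flags but predate UMOUNT_NOFOLLOW
         * this fails with EINVAL -- which is exactly the probing that
         * the explicitly unused UMOUNT_UNUSED flag enables. */
        if (umount2(argv[1], UMOUNT_NOFOLLOW) != 0) {
            perror("umount2");
            return 1;
        }
        return 0;
    }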
2010-07-05  sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4)  (Neil Horman)
commit 5fa782c2f5ef6c2e4f04d3e228412c9b4a4c8809 upstream. Ok, version 4. Change Notes: 1) Minor cleanups, from Vlad's notes. Summary: Hey- Recently, it was reported to me that the kernel could oops in the following way:

<5> kernel BUG at net/core/skbuff.c:91!
<5> invalid operand: 0000 [#1]
<5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U) vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5 ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod
<5> CPU: 0
<5> EIP: 0060:[<c02bff27>] Not tainted VLI
<5> EFLAGS: 00010216 (2.6.9-89.0.25.EL)
<5> EIP is at skb_over_panic+0x1f/0x2d
<5> eax: 0000002c ebx: c033f461 ecx: c0357d96 edx: c040fd44
<5> esi: c033f461 edi: df653280 ebp: 00000000 esp: c040fd40
<5> ds: 007b es: 007b ss: 0068
<5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
<5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180 e0c2947d
<5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004 df653490
<5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e 00000004
<5> Call Trace:
<5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
<5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
<5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
<5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
<5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
<5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
<5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
<5>  [<c01555a4>] cache_grow+0x140/0x233
<5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
<5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
<5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
<5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
<5>  [<c02d005e>] nf_iterate+0x40/0x81
<5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
<5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
<5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
<5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5>  [<c02e103e>] ip_rcv+0x334/0x3b4
<5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
<5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
<5>  [<c02c67a4>] process_backlog+0x6c/0xd9
<5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
<5>  [<c012a7b1>] __do_softirq+0x35/0x79
<5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
<5>  [<c01094de>] do_softirq+0x46/0x4d

It's an skb_over_panic BUG halt that results from processing an init chunk in which too many of its variable-length parameters are in some way malformed. The problem is in sctp_process_unk_param:

    if (NULL == *errp)
        *errp = sctp_make_op_error_space(asoc, chunk,
                                         ntohs(chunk->chunk_hdr->length));

    if (*errp) {
        sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
                        WORD_ROUND(ntohs(param.p->length)));
        sctp_addto_chunk(*errp,
                         WORD_ROUND(ntohs(param.p->length)),
                         param.v);

When we allocate an error chunk, we assume that the worst case scenario requires that we have chunk_hdr->length data allocated, which would be correct nominally, given that we call sctp_addto_chunk for the violating parameter. Unfortunately, in sctp_init_cause we also insert a sctp_errhdr_t structure into the error chunk, so the worst case situation, in which all parameters are in violation, requires chunk_hdr->length + (sizeof(sctp_errhdr_t) * param_count) bytes of data.
The result of this error is that a deliberately malformed packet sent to a listening host can cause a remote DOS, described in CVE-2010-1173: http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173 I've tested the below fix and confirmed that it fixes the issue. We move to a strategy whereby we allocate a fixed size error chunk and ignore errors we don't have space to report. Tested by me successfully Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
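The accounting described above can be illustrated with a self-contained sketch (type layout illustrative, names from the message):

    #include <stddef.h>

    /* illustrative stand-in for the kernel's sctp_errhdr_t */
    typedef struct { unsigned short cause; unsigned short length; } sctp_errhdr_t;

    size_t worst_case(size_t init_params_len, size_t param_count)
    {
        /* the old code reserved only the INIT's own parameter bytes... */
        size_t reserved = init_params_len;
        /* ...but each reported violation also prepends an error header,
         * so reporting every parameter needs strictly more room: */
        size_t needed = reserved + sizeof(sctp_errhdr_t) * param_count;
        return needed;   /* > reserved whenever param_count > 0 */
    }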
2010-07-05  tracing: Fix null pointer deref with SEND_SIG_FORCED  (Oleg Nesterov)
commit b9b76dfaac6fa2c289ee8a005be637afd2da7e2f upstream.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
IP: [<ffffffff8107bd37>] ftrace_raw_event_signal_generate+0x87/0x140

TP_STORE_SIGINFO() forgets about SEND_SIG_FORCED; fix that. We should probably export is_si_special() and change TP_STORE_SIGINFO() to use it in the longer term. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> LKML-Reference: <20100603213409.GA8307@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
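The faulting address fits the "special" siginfo cookies of that era: (struct siginfo *)2 dereferenced for a member at offset 4 lands at address 6. The relevant definitions (sched.h defines; the signal.c helper the message suggests exporting) look roughly like:

    #define SEND_SIG_NOINFO ((struct siginfo *) 0)
    #define SEND_SIG_PRIV   ((struct siginfo *) 1)
    #define SEND_SIG_FORCED ((struct siginfo *) 2)

    /* kernel/signal.c's private helper: true for all three cookies,
     * i.e. "info" is not a real pointer and must not be dereferenced */
    static inline int is_si_special(const struct siginfo *info)
    {
        return info <= SEND_SIG_FORCED;
    }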
2010-07-05  wrong type for 'magic' argument in simple_fill_super()  (Roberto Sassu)
commit 7d683a09990ff095a91b6e724ecee0ff8733274a upstream. It's used to set the superblock's ->s_magic, which is unsigned long. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Reviewed-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
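The corrected declaration in include/linux/fs.h then reads, to the best of our reconstruction (the second parameter was previously a plain int):

    extern int simple_fill_super(struct super_block *, unsigned long,
                                 struct tree_descr *);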
2010-07-05  ahci: add pci quirk for JMB362  (Tejun Heo)
commit 4daedcfe8c6851aa01cc1997220f2577f4039c13 upstream. JMB362 is a new variant of jmicron controller which is similar to JMB360 but has two SATA ports instead of one. As there is no PATA port, single function AHCI mode can be used as in JMB360. Add pci quirk for JMB362. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Aries Lee <arieslee@jmicron.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05  tmpfs: insert tmpfs cache pages to inactive list at first  (KOSAKI Motohiro)
commit e9d6c157385e4efa61cb8293e425c9d8beba70d3 upstream. Shaohua Li reported that parallel file copy on tmpfs can lead to the OOM killer. This is a regression caused by commit 9ff473b9a7 ("vmscan: evict streaming IO first"). Wow, it is a 2-year-old patch! Currently, tmpfs file cache is inserted into the active list at first. This means that the insertion not only increases the number of pages on the anon LRU, it also reduces the anon scanning ratio. Therefore, vmscan gets totally confused: it scans almost only the file LRU even though the system has plenty of unused tmpfs pages. Historically, lru_cache_add_active_anon() was used for two reasons: 1) to prioritize shmem pages over regular file cache, and 2) to avoid reclaim priority inversion of used-once pages. But we've lost both motivations, because (1) now that we have separate anon and file LRU lists, inserting into the active list no longer provides such prioritization, and (2) in the past, one pte access bit would cause page activation, so inserting into the inactive list with the access bit set meant higher priority than inserting into the active list; that priority inversion could lead to unintended LRU churn, but it was already solved by commit 645747462 ("vmscan: detect mapped file pages used only once"). (Thanks Hannes, you are great!) Thus, we can now use lru_cache_add_anon() instead. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
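The resulting change is essentially a one-liner in the page-cache insertion path, sketched below (the patch also drops the now-unused lru_cache_add_active_anon() helper from swap.h):

    if (page_is_file_cache(page))
        lru_cache_add_file(page);
    else
        lru_cache_add_anon(page);   /* was: lru_cache_add_active_anon(page) */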
2010-06-08  dcache: Prevent d_genocide() from decrementing d_count more than once  (Steven Rostedt)
When d_genocide() is called, it traverses the dentries, decrementing the d_count. After the loop, a check is made to see if a rename or other change has been made; if so, it traverses the tree again. Unfortunately, this will decrement the same dentries that it decremented the first time. To avoid this multiple decrement, I added a DCACHE_GENOCIDE flag that is set on a dentry when d_genocide() has decremented its counter. It will only decrement the counter if this flag is not set. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <npiggin@suse.de> Cc: John Kacur <jkacur@redhat.com> Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org> Cc: Clark Williams <williams@redhat.com> Acked-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
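The guard in d_genocide()'s walk then looks roughly like this (flag value illustrative):

    #define DCACHE_GENOCIDE 0x0200  /* value illustrative */

        if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
            dentry->d_flags |= DCACHE_GENOCIDE;
            dentry->d_count--;   /* decrement at most once per dentry */
        }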
2010-05-31  Merge stable/linux-2.6.33.y into rt/2.6.33  (Thomas Gleixner)
Conflicts: Makefile Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-26  revert "procfs: provide stack information for threads" and its fixup commits  (Robin Holt)
commit 34441427aab4bdb3069a4ffcda69a99357abcb2e upstream. Originally, commit d899bf7b ("procfs: provide stack information for threads") attempted to introduce a new feature for showing where the thread stack was located and how many pages are being utilized by the stack. Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was applied to fix the NO_MMU case. Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") was applied to fix a bug in ia32 executables being loaded. Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel threads") was applied to fix a bug which had kernel threads printing a userland stack address. Commit 1306d603f ('proc: partially revert "procfs: provide stack information for threads"') was then applied to revert the stack pages being used to solve a significant performance regression. This patch nearly undoes the effect of all these patches. The reason for reverting these is that the feature provides an unusable value in field 28. For x86_64, a fork will result in the task->stack_start value being updated to the current user top of stack and not the stack start address. This unpredictability of the stack_start value makes it worthless, including for its intended use of showing how much stack space a thread has. Other architectures will get different values. As an example, ia64 gets 0. The do_fork() and copy_process() functions appear to treat the stack_start and stack_size parameters as architecture specific. I only partially reverted c44972f1 ("procfs: disable per-task stack usage on NOMMU"). If I had completely reverted it, I would have had to change mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the builds without significant effort, I decided to not change mm/Makefile. I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on 64-bit"). I left the KSTK_ESP() change in place as that seemed worthwhile. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-26  dma-mapping: fix dma_sync_single_range_*  (FUJITA Tomonori)
commit f33d7e2d2d113a63772bbc993cdec3b5327f0ef1 upstream. dma_sync_single_range_for_cpu() and dma_sync_single_range_for_device() use a wrong address with a partial synchronization. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
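For reference, the affected API pair brackets partial CPU access to a streaming DMA mapping; the variables here (dev, handle, buf, dst, off, len) are assumed context:

    /* Hand bytes [off, off + len) of the mapping back to the CPU,
     * touch them, then give that window back to the device. */
    dma_sync_single_range_for_cpu(dev, handle, off, len, DMA_FROM_DEVICE);
    memcpy(dst, buf + off, len);
    dma_sync_single_range_for_device(dev, handle, off, len, DMA_FROM_DEVICE);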
2010-05-17  net: Make [dis/en]able_irq_*_lockdep() RT safe  (Thomas Gleixner)
The lockdep irqoff protection which is used to prevent lockdep false positives leads to "scheduling while atomic" and "might sleep" bug floods. Make the irq disabling depend on !RT. Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
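The change amounts to making the lockdep-only irq fiddling in include/linux/interrupt.h conditional on !RT, roughly (one of the affected helpers, sketched):

    static inline void disable_irq_nosync_lockdep(unsigned int irq)
    {
        disable_irq_nosync(irq);
    #if defined(CONFIG_LOCKDEP) && !defined(CONFIG_PREEMPT_RT)
        /* lockdep-only hard irq disable; under RT the spinlocks taken
         * afterwards can sleep, which would splat with irqs off */
        local_irq_disable();
    #endif
    }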
2010-05-13  Merge branch '2.6.33.4' into rt/2.6.33  (Thomas Gleixner)
Conflicts: Makefile Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-12  sctp: Fix oops when sending queued ASCONF chunks  (Vlad Yasevich)
[ Upstream commit c0786693404cffd80ca3cb6e75ee7b35186b2825 ] When we finish processing an ASCONF_ACK chunk, we try to send the next queued ASCONF. This action runs the sctp state machine recursively, and it's not prepared to do so.

kernel BUG at kernel/timer.c:790!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/module/ipv6/initstate
Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs
EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0
EIP is at add_timer+0xd/0x1b
EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4
ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000)
Stack: c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004
<0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14 00000004
<0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14 000000d0
Call Trace:
 [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp]
 [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp]
 [<d1863386>] ? sctp_pname+0x0/0x1d [sctp]
 [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp]
 [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp]
 [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp]
 [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp]
 [<d1863334>] ? sctp_cname+0x0/0x52 [sctp]
 [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp]
 [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp]
 [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp]

Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie> Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12  sctp: avoid irq lock inversion while calling sk->sk_data_ready()  (Wei Yongjun)
[ Upstream commit 561b1733a465cf9677356b40c27653dd45f1ac56 ] sk->sk_data_ready() of an sctp socket can be called from both BH and non-BH contexts, but the default sk->sk_data_ready(), sock_def_readable(), cannot be used in this case. Therefore, we have to make a new function sctp_data_ready() that grabs sk->sk_callback_lock with BH disabled.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.33-rc6 #129
---------------------------------------------------------
sctp_darn/1517 just changed the state of lock:
 (clock-AF_INET){++.?..}, at: [<c06aab60>] sock_def_readable+0x20/0x80
but this lock took another, SOFTIRQ-unsafe lock in the past:
 (slock-AF_INET){+.-...}
and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
1 lock held by sctp_darn/1517:
 #0: (sk_lock-AF_INET){+.+.+.}, at: [<cdfe363d>] sctp_sendmsg+0x23d/0xc00 [sctp]

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
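The replacement callback is essentially sock_def_readable() with the BH-safe lock variant; a sketch against the 2.6.33-era socket API:

    void sctp_data_ready(struct sock *sk, int len)
    {
        read_lock_bh(&sk->sk_callback_lock);   /* vs. plain read_lock() */
        if (sk_has_sleeper(sk))
            wake_up_interruptible_sync_poll(sk->sk_sleep,
                            POLLIN | POLLRDNORM | POLLRDBAND);
        sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
        read_unlock_bh(&sk->sk_callback_lock);
    }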
2010-05-12  libata: Fix accesses at LBA28 boundary (old bug, but nasty) (v2)  (Mark Lord)
commit 45c4d015a92f72ec47acd0c7557abdc0c8a6499d upstream. Most drives from Seagate, Hitachi, and possibly other brands, do not allow LBA28 access to sector number 0x0fffffff (2^28 - 1). So instead use LBA48 for such accesses. This bug could bite a lot of systems, especially when the user has taken care to align partitions to 4KB boundaries. On misaligned systems, it is less likely to be encountered, since a 4KB read would end at 0x10000000 rather than at 0x0fffffff. Signed-off-by: Mark Lord <mlord@pobox.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
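The fix lands in include/linux/ata.h's boundary check, roughly as reconstructed here:

    static inline int lba_28_ok(u64 block, u32 n_block)
    {
        /* the ending block number must stay strictly below
         * 0x0fffffff, which some drives reject for LBA28 */
        return ((block + n_block) < ((u64)1 << 28) - 1) && (n_block <= 256);
    }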
2010-05-12  hugetlb: fix infinite loop in get_futex_key() when backed by huge pages  (Mel Gorman)
commit 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703 upstream. If a futex key happens to be located within a huge page mapped MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a page->mapping that will never exist. See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details about the problem. This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON for huge pages mapped MAP_PRIVATE. This is enough for futex to continue, but because of PAGE_MAPPING_ANON the poisoned value is not dereferenced or used by futex. No other part of the VM should be dereferencing the page->mapping of a hugetlbfs page, as its page cache is not on the LRU. This patch fixes the problem with the test case described in the bugzilla. [akpm@linux-foundation.org: mel cant spel] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <darren@dvhart.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
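The poison cookie added to include/linux/poison.h carries PAGE_MAPPING_ANON in its low bits, so futex's anon test passes while any dereference of the pointer itself would fault (reconstructed from the upstream commit):

    #define HUGETLB_POISON ((void *)(0x00300300 + POISON_POINTER_DELTA + \
                                     PAGE_MAPPING_ANON))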
2010-04-29  fs: Fix namespace related hangs  (John Stultz)
Nick converted the dentry->d_mounted counter to a flag; however, with namespaces, dentries can be mounted multiple times (and, more importantly, unmounted multiple times). If a namespace was created and then released, unmount_tree would remove the DCACHE_MOUNTED flag, making d_mountpoint fail and causing the mounts to be lost. This patch converts it back to a counter, and adds some extra WARN_ONs to make sure things are accounted properly. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org> Cc: Nick Piggin <npiggin@suse.de> LKML-Reference: <1272522942.1967.12.camel@work-vm> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-28  hugetlb: fix infinite loop in get_futex_key() when backed by huge pages  (Mel Gorman)
If a futex key happens to be located within a huge page mapped MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a page->mapping that will never exist. See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details about the problem. This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON for huge pages mapped MAP_PRIVATE. This is enough for futex to continue, but because of PAGE_MAPPING_ANON the poisoned value is not dereferenced or used by futex. No other part of the VM should be dereferencing the page->mapping of a hugetlbfs page, as its page cache is not on the LRU. This patch fixes the problem with the test case described in the bugzilla. [ upstream commit: 23be7468e8802a2ac1de6ee3eecb3ec7f14dc703 ] [akpm@linux-foundation.org: mel cant spel] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <darren@dvhart.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  Fixup some compilation warnings and errors  (John Stultz)
Amit Arora noticed some compile issues with coda, and an fs.h include issue, so this patch fixes those along with btrfs warnings. Thanks to Amit for the testing! Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  Revert d_count back to an atomic_t  (John Stultz)
This patch reverts the portion of Nick's vfs scalability patch that converts the dentry d_count from an atomic_t to an int protected by the d_lock. This greatly improves vfs scalability with the -rt kernel, as the extra lock contention on the d_lock hurts very badly when CONFIG_PREEMPT_RT is enabled and the spinlocks become rtmutexes. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  Fix vfsmount_read_lock to work with -rt  (John Stultz)
Because vfsmount_read_lock acquires the vfsmount spinlock for the current cpu, it causes problems with -rt, as you might migrate between cpus between a lock and unlock. This patch fixes the issue by having the caller pick a cpu, then consistently use that cpu between the lock and unlock. We may migrate in between lock and unlock, but that's ok because we're not doing anything cpu specific, other than avoiding contention on the read side across the cpus. It's not pretty, but it works and statistically shouldn't hurt performance. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
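Usage after the change looks roughly like this (function names per this patch set; the cpu argument is only a slot index, not an affinity requirement):

    int cpu = raw_smp_processor_id();  /* just a starting hint; any cpu works */

    vfsmount_read_lock(cpu);
    /* ...we may migrate here; nothing below is cpu-local, we only
     * need to unlock the same per-cpu lock we actually took... */
    vfsmount_read_unlock(cpu);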
2010-04-27  Fixups from 09102009.patch.gz  (Nick Piggin)
This patch is just the delta from Nick's 06102009 and his 09102009 megapatches Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-sb-inodes-percpu  (Nick Piggin)
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-nr_inodes-percpu  (Eric Dumazet)
fs: inode per-cpu nr_inodes counter Avoids cache line ping-pong between cpus and prepares the next patch, because updates of nr_inodes don't need inode_lock anymore. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
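A minimal sketch of the idea (helper names assumed, not verbatim from the patch): each cpu bumps its own counter locklessly, and only slow paths such as /proc sum across cpus.

    static DEFINE_PER_CPU(int, nr_inodes);

    static inline void nr_inodes_inc(void)
    {
        per_cpu(nr_inodes, get_cpu())++;   /* no inode_lock needed */
        put_cpu();
    }

    static int get_nr_inodes(void)         /* slow path, e.g. /proc */
    {
        int cpu, sum = 0;

        for_each_possible_cpu(cpu)
            sum += per_cpu(nr_inodes, cpu);
        return sum < 0 ? 0 : sum;          /* per-cpu counts can skew */
    }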
2010-04-27  fs-inode-nr_inodes  (Nick Piggin)
XXX: this should be folded back into the individual locking patches Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_rcu  (Nick Piggin)
RCU free the struct inode. This will allow:
- sb_inode_list_lock to be moved inside i_lock, because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking.
- eventually, completely write-free RCU path walking. The inode must be consulted for permissions when walking, so a write-free reference (i.e. RCU) is helpful.
- potentially simplifying things a bit in VM land: we may not need to take the page lock to get back to the page->mapping.
- removing some nested trylock loops in dcache code.
todo: convert all filesystems
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
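The mechanical part is small; a sketch assuming an i_rcu head is added to struct inode (field placement assumed):

    static void inode_free_rcu(struct rcu_head *head)
    {
        struct inode *inode = container_of(head, struct inode, i_rcu);

        kmem_cache_free(inode_cachep, inode);
    }

    static void destroy_inode(struct inode *inode)
    {
        /* ...teardown that need not wait for a grace period... */
        call_rcu(&inode->i_rcu, inode_free_rcu);  /* free once readers drain */
    }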
2010-04-27  fs-inode_lock-scale-10  (Nick Piggin)
Implement lazy inode LRU similarly to dcache. This should reduce inode list lock acquisition (todo: measure). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_lock-scale-8  (Nick Piggin)
Make inode_hash_lock private by adding a function __remove_inode_hash that can be used by filesystems defining their own drop_inode functions. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_lock-scale-7  (Nick Piggin)
Remove the global inode_lock Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_lock-scale-5  (Nick Piggin)
Protect inodes_stat statistics with atomic ops rather than inode_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_lock-scale-6  (Nick Piggin)
Add a new lock, wb_inode_list_lock, to protect i_list and various lists which the inode can be put onto. XXX: haven't audited ocfs2 Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_lock-scale-4  (Nick Piggin)
Protect inode->i_count with i_lock, rather than having it atomic. Next step should also be to move things together (eg. the refcount increment into d_instantiate, which will remove a lock/unlock cycle on i_lock). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
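Each atomic_inc(&inode->i_count) then becomes a short i_lock section:

    spin_lock(&inode->i_lock);
    inode->i_count++;
    spin_unlock(&inode->i_lock);

which is why the message suggests folding the increment into callers such as d_instantiate that will manipulate the inode under i_lock anyway, saving a lock/unlock cycle.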
2010-04-27  fs-inode_lock-scale-2  (Nick Piggin)
Add a new lock, inode_hash_lock, to protect the inode hash table lists. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-inode_lock-scale  (Nick Piggin)
Protect sb->s_inodes with a new lock, sb_inode_list_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  dcache-percpu-nr_dentry  (Nick Piggin)
The nr_dentry stat is a globally touched cacheline and atomic operation twice over the lifetime of a dentry. It is used for the benefit of userspace only. We could make a per-cpu counter or something for it, but it is only accessed via proc, so we could use slab stats. XXX: must implement slab routines to return stats for a single cache, and implement the proc handler. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  dcache-split-inode_lock  (Nick Piggin)
dcache_inode_lock can be replaced with per-inode locking. Use existing inode->i_lock for this. This is slightly non-trivial because we sometimes need to find the inode from the dentry, which requires d_inode to be stabilised (either with refcount or d_lock). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  dcache-chain-hashlock  (Nick Piggin)
We can turn the dcache hash locking from a global dcache_hash_lock into per-bucket locking. XXX: should probably use a bit lock in the first bit of the hash pointers to avoid any space bloating (non-atomic unlock means no extra atomics either) Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
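The XXX's bit-lock idea can be sketched with bit_spin_lock() on bit 0 of each chain's head pointer, as the later mainline hlist_bl lists do (helper names hypothetical):

    #include <linux/bit_spinlock.h>

    static inline void hash_bucket_lock(struct hlist_head *b)
    {
        bit_spin_lock(0, (unsigned long *)&b->first);
    }

    static inline void hash_bucket_unlock(struct hlist_head *b)
    {
        /* non-atomic unlock: no extra atomic op on the way out */
        __bit_spin_unlock(0, (unsigned long *)&b->first);
    }

Chain walkers must then mask off bit 0 before dereferencing the head pointer, which is the space-for-masking trade the XXX alludes to.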
2010-04-27  fs-dcache_lock-remove  (Nick Piggin)
dcache_lock no longer protects anything (I hope). Remove it. This breaks a lot of the tree where I haven't thought about the problem, but it simplifies the dcache.c code quite a bit (and it's also probably a good thing to break unconverted code). So I include this here before making further changes to the locking. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-dcache-scale-i_dentry  (Nick Piggin)
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list from concurrent modification. d_alias is also protected by d_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-dcache-scale-d_subdirs  (Nick Piggin)
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). XXX: probably don't need parent lock in inotify (because child lock should stabilize parent). Also, possibly some filesystems don't need so much locking (eg. of child dentry when modifying d_child, so long as parent is locked)... but be on the safe side. Hmm, maybe we should just say d_child list is protected by d_parent->d_lock. d_parent could remain protected with d_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-dcache-scale-d_count  (Nick Piggin)
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a 0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when we start protecting many other dentry members with d_lock. XXX: This patch does not boot on its own Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-dcache-scale-nr_dentry  (Nick Piggin)
Make dentry_stat_t.nr_dentry an atomic_t type, and move it from under dcache_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-dcache-scale-d_hash  (Nick Piggin)
Add a new lock, dcache_hash_lock, to protect the dcache hash table from concurrent modification. d_hash is also protected by d_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27  fs-mntget-scale  (Nick Piggin)
Improve scalability of mntget/mntput by using per-cpu counters protected by the reader side of the brlock vfsmount_lock. mnt_mounted keeps track of whether the vfsmount is actually attached to the tree so we can shortcut expensive checks in mntput. XXX: count_mnt_count needs write lock. Document this and/or revisit locking (eg. look at writers count) Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
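A rough sketch of the fast path under the brlock's read side (the mnt_count percpu field and lock signatures are hypothetical, reconstructed from the message):

    static inline void mntget(struct vfsmount *mnt)
    {
        int cpu = get_cpu();

        vfsmount_read_lock(cpu);                 /* brlock, read side */
        (*per_cpu_ptr(mnt->mnt_count, cpu))++;   /* int __percpu *mnt_count */
        vfsmount_read_unlock(cpu);
        put_cpu();
    }

Summing the per-cpu counts (count_mnt_count in the message) needs the write side of the brlock to get a stable total, which is the XXX's point.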
2010-04-27  fs-vfsmount_lock-scale  (Nick Piggin)
Use a brlock for the vfsmount lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>