<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/kernel, branch v2.6.35.9</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v2.6.35.9</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v2.6.35.9'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2010-11-22T18:59:45+00:00</updated>
<entry>
<title>futex: Fix errors in nested key ref-counting</title>
<updated>2010-11-22T18:59:45+00:00</updated>
<author>
<name>Darren Hart</name>
<email>dvhart@linux.intel.com</email>
</author>
<published>2010-10-17T15:35:04+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=52c974f34e857dc648f013cc24f6ad4febc26040'/>
<id>urn:sha1:52c974f34e857dc648f013cc24f6ad4febc26040</id>
<content type='text'>
commit 7ada876a8703f23befbb20a7465a702ee39b1704 upstream.

futex_wait() is leaking key references due to futex_wait_setup()
acquiring an additional reference via the queue_lock() routine. The
nested key ref-counting has been masking bugs and complicating code
analysis. queue_lock() is only called with a previously ref-counted
key, so remove the additional ref-counting from the queue_(un)lock()
functions.

Also futex_wait_requeue_pi() drops one key reference too many in
unqueue_me_pi(). Remove the key reference handling from
unqueue_me_pi(). This was paired with a queue_lock() in
futex_lock_pi(), so the count remains unchanged.

Document remaining nested key ref-counting sites.

Signed-off-by: Darren Hart &lt;dvhart@linux.intel.com&gt;
Reported-and-tested-by: Matthieu Fertré&lt;matthieu.fertre@kerlabs.com&gt;
Reported-by: Louis Rilling&lt;louis.rilling@kerlabs.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: John Kacur &lt;jkacur@redhat.com&gt;
Cc: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
LKML-Reference: &lt;4CBB17A8.70401@linux.intel.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>sched: Fix string comparison in /proc/sched_features</title>
<updated>2010-11-22T18:59:44+00:00</updated>
<author>
<name>Mathieu Desnoyers</name>
<email>mathieu.desnoyers@efficios.com</email>
</author>
<published>2010-09-13T21:47:00+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=ee6158ba96148f198a4f1b78a24770633a568b95'/>
<id>urn:sha1:ee6158ba96148f198a4f1b78a24770633a568b95</id>
<content type='text'>
commit 7740191cd909b75d75685fb08a5d1f54b8a9d28b upstream.

Fix incorrect handling of the following case:

 INTERACTIVE
 INTERACTIVE_SOMETHING_ELSE

The comparison only checks up to each element's length.

Changelog since v1:
 - Embellish using some Rostedtisms.
  [ mingo:                 ^^ == smaller and cleaner ]

Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Reviewed-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Tony Lindgren &lt;tony@atomide.com&gt;
LKML-Reference: &lt;20100913214700.GB16118@Krystal&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>sched: Drop all load weight manipulation for RT tasks</title>
<updated>2010-11-22T18:59:43+00:00</updated>
<author>
<name>Linus Walleij</name>
<email>linus.walleij@stericsson.com</email>
</author>
<published>2010-10-11T14:36:51+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=19b0be98067611d07f68faf396041da6e95eb60e'/>
<id>urn:sha1:19b0be98067611d07f68faf396041da6e95eb60e</id>
<content type='text'>
commit 17bdcf949d03306b308c5fb694849cd35f119807 upstream.

Load weights are for the CFS, they do not belong in the RT task. This makes all
RT scheduling classes leave the CFS weights alone.

This fixes a real bug as well: I noticed the following phonomena: a process
elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
inserted by set_load_weight() remains at 0, giving the task insignificat
priority.

With this fix, the weight is reset to what the task had before being elevated
to SCHED_RR/SCHED_FIFO.

Cc: Lennart Poettering &lt;lennart@poettering.net&gt;
Signed-off-by: Linus Walleij &lt;linus.walleij@stericsson.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
LKML-Reference: &lt;1286807811-10568-1-git-send-email-linus.walleij@stericsson.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>perf_events: Fix bogus context time tracking</title>
<updated>2010-11-22T18:59:42+00:00</updated>
<author>
<name>Stephane Eranian</name>
<email>eranian@google.com</email>
</author>
<published>2010-10-15T13:26:01+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=c119599ec7d46c1b31d09a7968566b76e54da8e8'/>
<id>urn:sha1:c119599ec7d46c1b31d09a7968566b76e54da8e8</id>
<content type='text'>
commit c530ccd9a1864a44a7ff35826681229ce9f2357a upstream.

You can only call update_context_time() when the context
is active, i.e., the thread it is attached to is still running.

However, perf_event_read() can be called even when the context
is inactive, e.g., user read() the counters. The call to
update_context_time() must be conditioned on the status of
the context, otherwise, bogus time_enabled, time_running may
be returned. Here is an example on AMD64. The task program
is an example from libpfm4. The -p prints deltas every 1s.

$ task -p -e cpu_clk_unhalted sleep 5
    2,266,610 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
5,242,358,071 cpu_clk_unhalted (99.95% scaling, ena=5,000,359,984, run=2,319,270)

Whereas if you don't read deltas, e.g., no call to perf_event_read() until
the process terminates:

$ task -e cpu_clk_unhalted sleep 5
    2,497,783 cpu_clk_unhalted (0.00% scaling, ena=2,376,899, run=2,376,899)

Notice that time_enable, time_running are bogus in the first example
causing bogus scaling.

This patch fixes the problem, by conditionally calling update_context_time()
in perf_event_read().

Signed-off-by: Stephane Eranian &lt;eranian@google.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
LKML-Reference: &lt;4cb856dc.51edd80a.5ae0.38fb@mx.google.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>hrtimer: Preserve timer state in remove_hrtimer()</title>
<updated>2010-10-29T04:51:24+00:00</updated>
<author>
<name>Salman Qazi</name>
<email>sqazi@google.com</email>
</author>
<published>2010-10-12T14:25:19+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=eafe9c354fde7597834d45a04f79e9057d24ae93'/>
<id>urn:sha1:eafe9c354fde7597834d45a04f79e9057d24ae93</id>
<content type='text'>
commit f13d4f979c518119bba5439dd2364d76d31dcd3f upstream.

The race is described as follows:

CPU X                                 CPU Y
remove_hrtimer
// state &amp; QUEUED == 0
timer-&gt;state = CALLBACK
unlock timer base
timer-&gt;f(n) //very long
                                  hrtimer_start
                                    lock timer base
                                    remove_hrtimer // no effect
                                    hrtimer_enqueue
                                    timer-&gt;state = CALLBACK |
                                                   QUEUED
                                    unlock timer base
                                  hrtimer_start
                                    lock timer base
                                    remove_hrtimer
                                        mode = INACTIVE
                                        // CALLBACK bit lost!
                                    switch_hrtimer_base
                                            CALLBACK bit not set:
                                                    timer-&gt;base
                                                    changes to a
                                                    different CPU.
lock this CPU's timer base

The bug was introduced with commit ca109491f (hrtimer: removing all ur
callback modes) in 2.6.29

[ tglx: Feed new state via local variable and add a comment. ]

Signed-off-by: Salman Qazi &lt;sqazi@google.com&gt;
Cc: akpm@linux-foundation.org
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
LKML-Reference: &lt;20101012142351.8485.21823.stgit@dungbeetle.mtv.corp.google.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>ring-buffer: Fix typo of time extends per page</title>
<updated>2010-10-29T04:51:23+00:00</updated>
<author>
<name>Steven Rostedt</name>
<email>srostedt@redhat.com</email>
</author>
<published>2010-10-12T16:06:43+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=02ee32edab40e81b2af0149a8fb86986fc5ac9e7'/>
<id>urn:sha1:02ee32edab40e81b2af0149a8fb86986fc5ac9e7</id>
<content type='text'>
commit d01343244abdedd18303d0323b518ed9cdcb1988 upstream.

Time stamps for the ring buffer are created by the difference between
two events. Each page of the ring buffer holds a full 64 bit timestamp.
Each event has a 27 bit delta stamp from the last event. The unit of time
is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
happen more than 134 milliseconds apart, a time extend is inserted
to add more bits for the delta. The time extend has 59 bits, which
is good for ~18 years.

Currently the time extend is committed separately from the event.
If an event is discarded before it is committed, due to filtering,
the time extend still exists. If all events are being filtered, then
after ~134 milliseconds a new time extend will be added to the buffer.

This can only happen till the end of the page. Since each page holds
a full timestamp, there is no reason to add a time extend to the
beginning of a page. Time extends can only fill a page that has actual
data at the beginning, so there is no fear that time extends will fill
more than a page without any data.

When reading an event, a loop is made to skip over time extends
since they are only used to maintain the time stamp and are never
given to the caller. As a paranoid check to prevent the loop running
forever, with the knowledge that time extends may only fill a page,
a check is made that tests the iteration of the loop, and if the
iteration is more than the number of time extends that can fit in a page
a warning is printed and the ring buffer is disabled (all of ftrace
is also disabled with it).

There is another event type that is called a TIMESTAMP which can
hold 64 bits of data in the theoretical case that two events happen
18 years apart. This code has not been implemented, but the name
of this event exists, as well as the structure for it. The
size of a TIMESTAMP is 16 bytes, where as a time extend is only
8 bytes. The macro used to calculate how many time extends can fit on
a page used the TIMESTAMP size instead of the time extend size
cutting the amount in half.

The following test case can easily trigger the warning since we only
need to have half the page filled with time extends to trigger the
warning:

 # cd /sys/kernel/debug/tracing/
 # echo function &gt; current_tracer
 # echo 'common_pid &lt; 0' &gt; events/ftrace/function/filter
 # echo &gt; trace
 # echo 1 &gt; trace_marker
 # sleep 120
 # cat trace

Enabling the function tracer and then setting the filter to only trace
functions where the process id is negative (no events), then clearing
the trace buffer to ensure that we have nothing in the buffer,
then write to trace_marker to add an event to the beginning of a page,
sleep for 2 minutes (only 35 seconds is probably needed, but this
guarantees the bug), and then finally reading the trace which will
trigger the bug.

This patch fixes the typo and prevents the false positive of that warning.

Reported-by: Hans J. Koch &lt;hjk@linutronix.de&gt;
Tested-by: Hans J. Koch &lt;hjk@linutronix.de&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>Fix unprotected access to task credentials in waitid()</title>
<updated>2010-09-27T00:18:41+00:00</updated>
<author>
<name>Daniel J Blueman</name>
<email>daniel.blueman@gmail.com</email>
</author>
<published>2010-08-17T22:56:55+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=2178c26dbdca2ac63c0e4183aa9e570e2ffe593e'/>
<id>urn:sha1:2178c26dbdca2ac63c0e4183aa9e570e2ffe593e</id>
<content type='text'>
commit f362b73244fb16ea4ae127ced1467dd8adaa7733 upstream.

Using a program like the following:

	#include &lt;stdlib.h&gt;
	#include &lt;unistd.h&gt;
	#include &lt;sys/types.h&gt;
	#include &lt;sys/wait.h&gt;

	int main() {
		id_t id;
		siginfo_t infop;
		pid_t res;

		id = fork();
		if (id == 0) { sleep(1); exit(0); }
		kill(id, SIGSTOP);
		alarm(1);
		waitid(P_PID, id, &amp;infop, WCONTINUED);
		return 0;
	}

to call waitid() on a stopped process results in access to the child task's
credentials without the RCU read lock being held - which may be replaced in the
meantime - eliciting the following warning:

	===================================================
	[ INFO: suspicious rcu_dereference_check() usage. ]
	---------------------------------------------------
	kernel/exit.c:1460 invoked rcu_dereference_check() without protection!

	other info that might help us debug this:

	rcu_scheduler_active = 1, debug_locks = 1
	2 locks held by waitid02/22252:
	 #0:  (tasklist_lock){.?.?..}, at: [&lt;ffffffff81061ce5&gt;] do_wait+0xc5/0x310
	 #1:  (&amp;(&amp;sighand-&gt;siglock)-&gt;rlock){-.-...}, at: [&lt;ffffffff810611da&gt;]
	wait_consider_task+0x19a/0xbe0

	stack backtrace:
	Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
	Call Trace:
	 [&lt;ffffffff81095da4&gt;] lockdep_rcu_dereference+0xa4/0xc0
	 [&lt;ffffffff81061b31&gt;] wait_consider_task+0xaf1/0xbe0
	 [&lt;ffffffff81061d15&gt;] do_wait+0xf5/0x310
	 [&lt;ffffffff810620b6&gt;] sys_waitid+0x86/0x1f0
	 [&lt;ffffffff8105fce0&gt;] ? child_wait_callback+0x0/0x70
	 [&lt;ffffffff81003282&gt;] system_call_fastpath+0x16/0x1b

This is fixed by holding the RCU read lock in wait_task_continued() to ensure
that the task's current credentials aren't destroyed between us reading the
cred pointer and us reading the UID from those credentials.

Furthermore, protect wait_task_stopped() in the same way.

We don't need to keep holding the RCU read lock once we've read the UID from
the credentials as holding the RCU read lock doesn't stop the target task from
changing its creds under us - so the credentials may be outdated immediately
after we've read the pointer, lock or no lock.

Signed-off-by: Daniel J Blueman &lt;daniel.blueman@gmail.com&gt;
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>PM / Hibernate: Avoid hitting OOM during preallocation of memory</title>
<updated>2010-09-27T00:18:38+00:00</updated>
<author>
<name>Rafael J. Wysocki</name>
<email>rjw@sisk.pl</email>
</author>
<published>2010-09-11T18:58:27+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=ff278af4dcda812b4a3f8ef0d0260232ed1352ac'/>
<id>urn:sha1:ff278af4dcda812b4a3f8ef0d0260232ed1352ac</id>
<content type='text'>
commit 6715045ddc7472a22be5e49d4047d2d89b391f45 upstream.

There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the total number of available non-highmem memory pages.  If that's
the case, the OOM condition is guaranteed to trigger, which in turn
can cause significant slowdown to occur during hibernation.

To avoid that, make preallocate_image_memory() adjust its argument
before calling preallocate_image_pages(), so that the total number of
saveable non-highem pages left is not less than the minimum size of
a hibernation image.  Change hibernate_preallocate_memory() to try to
allocate from highmem if the number of pages allocated by
preallocate_image_memory() is too low.

Modify free_unnecessary_pages() to take all possible memory
allocation patterns into account.

Reported-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Rafael J. Wysocki &lt;rjw@sisk.pl&gt;
Tested-by: M. Vefa Bicakci &lt;bicave@superonline.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>sched: Fix user time incorrectly accounted as system time on 32-bit</title>
<updated>2010-09-27T00:18:23+00:00</updated>
<author>
<name>Stanislaw Gruszka</name>
<email>sgruszka@redhat.com</email>
</author>
<published>2010-09-14T14:35:14+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=22bad8bbb944a2973d3fa3581198fd67c2f40eba'/>
<id>urn:sha1:22bad8bbb944a2973d3fa3581198fd67c2f40eba</id>
<content type='text'>
commit e75e863dd5c7d96b91ebbd241da5328fc38a78cc upstream.

We have 32-bit variable overflow possibility when multiply in
task_times() and thread_group_times() functions. When the
overflow happens then the scaled utime value becomes erroneously
small and the scaled stime becomes i erroneously big.

Reported here:

 https://bugzilla.redhat.com/show_bug.cgi?id=633037
 https://bugzilla.kernel.org/show_bug.cgi?id=16559

Reported-by: Michael Chapman &lt;redhat-bugzilla@very.puzzling.org&gt;
Reported-by: Ciriaco Garcia de Celis &lt;sysman@etherpilot.com&gt;
Signed-off-by: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Hidetoshi Seto &lt;seto.hidetoshi@jp.fujitsu.com&gt;
LKML-Reference: &lt;20100914143513.GB8415@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>pid: make setpgid() system call use RCU read-side critical section</title>
<updated>2010-09-27T00:18:22+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.vnet.ibm.com</email>
</author>
<published>2010-09-01T00:00:18+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=7495973bc945e7f4962f676ad631492b2ba4c061'/>
<id>urn:sha1:7495973bc945e7f4962f676ad631492b2ba4c061</id>
<content type='text'>
commit 950eaaca681c44aab87a46225c9e44f902c080aa upstream.

[   23.584719]
[   23.584720] ===================================================
[   23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
[   23.585176] ---------------------------------------------------
[   23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
[   23.585176]
[   23.585176] other info that might help us debug this:
[   23.585176]
[   23.585176]
[   23.585176] rcu_scheduler_active = 1, debug_locks = 1
[   23.585176] 1 lock held by rc.sysinit/728:
[   23.585176]  #0:  (tasklist_lock){.+.+..}, at: [&lt;ffffffff8104771f&gt;] sys_setpgid+0x5f/0x193
[   23.585176]
[   23.585176] stack backtrace:
[   23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
[   23.585176] Call Trace:
[   23.585176]  [&lt;ffffffff8105b436&gt;] lockdep_rcu_dereference+0x99/0xa2
[   23.585176]  [&lt;ffffffff8104c324&gt;] find_task_by_pid_ns+0x50/0x6a
[   23.585176]  [&lt;ffffffff8104c35b&gt;] find_task_by_vpid+0x1d/0x1f
[   23.585176]  [&lt;ffffffff81047727&gt;] sys_setpgid+0x67/0x193
[   23.585176]  [&lt;ffffffff810029eb&gt;] system_call_fastpath+0x16/0x1b
[   24.959669] type=1400 audit(1282938522.956:4): avc:  denied  { module_request } for  pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas

It turns out that the setpgid() system call fails to enter an RCU
read-side critical section before doing a PID-to-task_struct translation.
This commit therefore does rcu_read_lock() before the translation, and
also does rcu_read_unlock() after the last use of the returned pointer.

Reported-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Acked-by: David Howells &lt;dhowells@redhat.com&gt;
Cc: Jiri Slaby &lt;jslaby@suse.cz&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
</feed>
