<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/fs/direct-io.c, branch 765</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=765</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=765'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2008-02-05T17:44:13+00:00</updated>
<entry>
<title>Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user</title>
<updated>2008-02-05T17:44:13+00:00</updated>
<author>
<name>Christoph Lameter</name>
<email>clameter@sgi.com</email>
</author>
<published>2008-02-05T06:28:29+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=eebd2aa355692afaf9906f62118620f1a1c19dbb'/>
<id>urn:sha1:eebd2aa355692afaf9906f62118620f1a1c19dbb</id>
<content type='text'>
Simplify page cache zeroing of segments of pages through 3 functions

zero_user_segments(page, start1, end1, start2, end2)

        Zeros two segments of the page. It takes the position where to
        start and end the zeroing which avoids length calculations and
	makes code clearer.

zero_user_segment(page, start, end)

        Same for a single segment.

zero_user(page, start, length)

        Length variant for the case where we know the length.

We remove the zero_user_page macro. Issues:

1. Its a macro. Inline functions are preferable.

2. The KM_USER0 macro is only defined for HIGHMEM.

   Having to treat this special case everywhere makes the
   code needlessly complex. The parameter for zeroing is always
   KM_USER0 except in one single case that we open code.

Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.

Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.

Also extract the flushing of the caches to be outside of the kmap.

[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Cc: Steven French &lt;sfrench@us.ibm.com&gt;
Cc: Michael Halcrow &lt;mhalcrow@us.ibm.com&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Cc: Trond Myklebust &lt;trond.myklebust@fys.uio.no&gt;
Cc: "J. Bruce Fields" &lt;bfields@fieldses.org&gt;
Cc: Anton Altaparmakov &lt;aia21@cantab.net&gt;
Cc: Mark Fasheh &lt;mark.fasheh@oracle.com&gt;
Cc: David Chinner &lt;dgc@sgi.com&gt;
Cc: Michael Halcrow &lt;mhalcrow@us.ibm.com&gt;
Cc: Steven French &lt;sfrench@us.ibm.com&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Cc: Trond Myklebust &lt;trond.myklebust@fys.uio.no&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>remove ZERO_PAGE</title>
<updated>2007-10-16T16:42:53+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2007-10-16T08:24:40+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=557ed1fa2620dc119adb86b34c614e152a629a80'/>
<id>urn:sha1:557ed1fa2620dc119adb86b34c614e152a629a80</id>
<content type='text'>
The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus "snif" Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Drop 'size' argument from bio_endio and bi_end_io</title>
<updated>2007-10-10T07:25:57+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2007-09-27T10:47:43+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=6712ecf8f648118c3363c142196418f89a510b90'/>
<id>urn:sha1:6712ecf8f648118c3363c142196418f89a510b90</id>
<content type='text'>
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Jens Axboe &lt;jens.axboe@oracle.com&gt;
</content>
</entry>
<entry>
<title>dio: zero struct dio with kzalloc instead of manually</title>
<updated>2007-08-21T05:50:25+00:00</updated>
<author>
<name>Zach Brown</name>
<email>zach.brown@oracle.com</email>
</author>
<published>2007-08-21T00:12:01+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=848c4dd5153c7a0de55470ce99a8e13a63b4703f'/>
<id>urn:sha1:848c4dd5153c7a0de55470ce99a8e13a63b4703f</id>
<content type='text'>
This patch uses kzalloc to zero all of struct dio rather than manually
trying to track which fields we rely on being zero.  It passed aio+dio
stress testing and some bug regression testing on ext3.

This patch was introduced by Linus in the conversation that lead up to
Badari's minimal fix to manually zero .map_bh.b_state in commit:

  6a648fa72161d1f6468dabd96c5d3c0db04f598a

It makes the code a bit smaller.  Maybe a couple fewer cachelines to
load, if we're lucky:

   text    data     bss     dec     hex filename
3285925  568506 1304616 5159047  4eb887 vmlinux
3285797  568506 1304616 5158919  4eb807 vmlinux.patched

I was unable to measure a stable difference in the number of cpu cycles
spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
~340MB/s.

So the resulting intent of the patch isn't a performance gain but to
avoid exposing ourselves to the risk of finding another field like
.map_bh.b_state where we rely on zeroing but don't enforce it in the
code.

Signed-off-by: Zach Brown &lt;zach.brown@oracle.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>direct-io: fix error-path crashes</title>
<updated>2007-08-11T22:47:40+00:00</updated>
<author>
<name>Badari Pulavarty</name>
<email>pbadari@us.ibm.com</email>
</author>
<published>2007-08-10T20:00:44+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=6a648fa72161d1f6468dabd96c5d3c0db04f598a'/>
<id>urn:sha1:6a648fa72161d1f6468dabd96c5d3c0db04f598a</id>
<content type='text'>
Need to initialize map_bh.b_state to zero.  Otherwise, in case of a faulty
user-buffer its possible to go into dio_zero_block() and submit a page by
mistake - since it checks for buffer_new().

http://marc.info/?l=linux-kernel&amp;m=118551339032528&amp;w=2

akpm: Linus had a (better) patch to just do a kzalloc() in there, but it got
lost.  Probably this version is better for -stable anwyay.

Signed-off-by: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Acked-by: Joe Jin &lt;joe.jin@oracle.com&gt;
Acked-by: Zach Brown &lt;zach.brown@oracle.com&gt;
Cc: gurudas pai &lt;gurudas.pai@oracle.com&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>dio: remove bogus refcounting BUG_ON</title>
<updated>2007-07-04T01:23:23+00:00</updated>
<author>
<name>Zach Brown</name>
<email>zach.brown@oracle.com</email>
</author>
<published>2007-07-03T22:28:55+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=fcb82f8835c1d71b4fe5de1d9894f45370f80dab'/>
<id>urn:sha1:fcb82f8835c1d71b4fe5de1d9894f45370f80dab</id>
<content type='text'>
Badari Pulavarty reported a case of this BUG_ON is triggering during
testing.  It's completely bogus and should be removed.

It's trying to notice if we left references to the dio hanging around in
the sync case.  They should have been dropped as IO completed while this
path was in dio_await_completion().  This condition will also be
checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
lines lower.  So to start this BUG_ON() is redundant.

More fatally, it's dereferencing dio-&gt; after having dropped its
reference.  It's only safe to dereference the dio after releasing the
lock if the final reference was just dropped.  Another CPU might free
the dio in bio completion and reuse the memory after this path drops the
dio lock but before the BUG_ON() is evaluated.

This patch passed aio+dio regression unit tests and aio-stress on ext3.

Signed-off-by: Zach Brown &lt;zach.brown@oracle.com&gt;
Cc: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial</title>
<updated>2007-05-09T19:54:17+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@woody.linux-foundation.org</email>
</author>
<published>2007-05-09T19:54:17+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=9a9136e270af14da506f66bcafcc506b86a86498'/>
<id>urn:sha1:9a9136e270af14da506f66bcafcc506b86a86498</id>
<content type='text'>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.
</content>
</entry>
<entry>
<title>fs: convert core functions to zero_user_page</title>
<updated>2007-05-09T19:30:55+00:00</updated>
<author>
<name>Nate Diller</name>
<email>nate.diller@gmail.com</email>
</author>
<published>2007-05-09T09:35:07+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=01f2705daf5a36208e69d7cf95db9c330f843af6'/>
<id>urn:sha1:01f2705daf5a36208e69d7cf95db9c330f843af6</id>
<content type='text'>
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset().  There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.

This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones.  Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call.  The diffstat below shows the entire
patchset.

[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller &lt;nate.diller@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Fix misspellings collected by members of KJ list.</title>
<updated>2007-05-09T05:14:03+00:00</updated>
<author>
<name>Robert P. J. Day</name>
<email>rpjday@mindspring.com</email>
</author>
<published>2007-05-09T05:14:03+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=beb7dd86a101263bf63a78c7c6d4da3849b35bd6'/>
<id>urn:sha1:beb7dd86a101263bf63a78c7c6d4da3849b35bd6</id>
<content type='text'>
Fix the misspellings of "propogate", "writting" and (oh, the shame
:-) "kenrel" in the source tree.

Signed-off-by: Robert P. J. Day &lt;rpjday@mindspring.com&gt;
Signed-off-by: Adrian Bunk &lt;bunk@stusta.de&gt;
</content>
</entry>
<entry>
<title>[PATCH] dio: lock refcount operations</title>
<updated>2006-12-10T17:57:21+00:00</updated>
<author>
<name>Zach Brown</name>
<email>zach.brown@oracle.com</email>
</author>
<published>2006-12-10T10:21:07+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=5eb6c7a2ab413dea1ee6c08dd58263a1c2c2efa3'/>
<id>urn:sha1:5eb6c7a2ab413dea1ee6c08dd58263a1c2c2efa3</id>
<content type='text'>
The wait_for_more_bios() function name was poorly chosen.  While looking to
clean it up it I noticed that the dio struct refcounting between the bio
completion and dio submission paths was racey.

The bio submission path was simply freeing the dio struct if
atomic_dec_and_test() indicated that it dropped the final reference.

The aio bio completion path was dereferencing its dio struct pointer *after
dropping its reference* based on the remaining number of references.

These two paths could race and result in the aio bio completion path
dereferencing a freed dio, though this was not observed in the wild.

This moves the refcount under the bio lock so that bio completion can drop
its reference and decide to wake all in one atomic step.

Once testing and waking is locked dio_await_one() can test its sleeping
condition and mark itself uninterruptible under the lock.  It gets simpler
and wait_for_more_bios() disappears.

The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
looks alarming.  This lock acquiry existed in that path before the recent
dio completion patch set.  We shouldn't expect significant performance
regression from returning to the behaviour that existed before the
completion clean up work.

This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.

Signed-off-by: Zach Brown &lt;zach.brown@oracle.com&gt;
Cc: Badari Pulavarty &lt;pbadari@us.ibm.com&gt;
Cc: Suparna Bhattacharya &lt;suparna@in.ibm.com&gt;
Cc: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Cc: &lt;xfs-masters@oss.sgi.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@osdl.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@osdl.org&gt;
</content>
</entry>
</feed>
