summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/bcachefs/SubmittingPatches.rst43
-rw-r--r--Documentation/filesystems/bcachefs/casefolding.rst90
-rw-r--r--Documentation/filesystems/bcachefs/index.rst20
-rw-r--r--Documentation/filesystems/dax.rst1
-rw-r--r--Documentation/filesystems/f2fs.rst3
-rw-r--r--Documentation/filesystems/journalling.rst4
-rw-r--r--Documentation/filesystems/nfs/reexport.rst10
-rw-r--r--Documentation/filesystems/proc.rst53
8 files changed, 193 insertions, 31 deletions
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
index 4b79ca58faf2..18c79d548391 100644
--- a/Documentation/filesystems/bcachefs/SubmittingPatches.rst
+++ b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
@@ -1,8 +1,13 @@
-Submitting patches to bcachefs:
-===============================
+Submitting patches to bcachefs
+==============================
+
+Here are suggestions for submitting patches to bcachefs subsystem.
+
+Submission checklist
+--------------------
Patches must be tested before being submitted, either with the xfstests suite
-[0], or the full bcachefs test suite in ktest [1], depending on what's being
+[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
touched. Note that ktest wraps xfstests and will be an easier method to running
it for most users; it includes single-command wrappers for all the mainstream
in-kernel local filesystems.
@@ -26,21 +31,21 @@ considered out of date), but try not to deviate too much without reason.
Focus on writing code that reads well and is organized well; code should be
aesthetically pleasing.
-CI:
-===
+CI
+--
Instead of running your tests locally, when running the full test suite it's
preferable to let a server farm do it in parallel, and then have the results
in a nice test dashboard (which can tell you which failures are new, and
presents results in a git log view, avoiding the need for most bisecting).
-That exists [2], and community members may request an account. If you work for
+That exists [2]_, and community members may request an account. If you work for
a big tech company, you'll need to help out with server costs to get access -
but the CI is not restricted to running bcachefs tests: it runs any ktest test
(which generally makes it easy to wrap other tests that can run in qemu).
-Other things to think about:
-============================
+Other things to think about
+---------------------------
- How will we debug this code? Is there sufficient introspection to diagnose
when something starts acting wonky on a user machine?
@@ -79,20 +84,22 @@ Other things to think about:
tested? (Automated tests exists but aren't in the CI, due to the hassle of
disk image management; coordinate to have them run.)
-Mailing list, IRC:
-==================
+Mailing list, IRC
+-----------------
-Patches should hit the list [3], but much discussion and code review happens on
-IRC as well [4]; many people appreciate the more conversational approach and
-quicker feedback.
+Patches should hit the list [3]_, but much discussion and code review happens
+on IRC as well [4]_; many people appreciate the more conversational approach
+and quicker feedback.
Additionally, we have a lively user community doing excellent QA work, which
exists primarily on IRC. Please make use of that resource; user feedback is
important for any nontrivial feature, and documenting it in commit messages
would be a good idea.
-[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-[1]: https://evilpiepirate.org/git/ktest.git/
-[2]: https://evilpiepirate.org/~testdashboard/ci/
-[3]: linux-bcachefs@vger.kernel.org
-[4]: irc.oftc.net#bcache, #bcachefs-dev
+.. rubric:: References
+
+.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+.. [1] https://evilpiepirate.org/git/ktest.git/
+.. [2] https://evilpiepirate.org/~testdashboard/ci/
+.. [3] linux-bcachefs@vger.kernel.org
+.. [4] irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
new file mode 100644
index 000000000000..ba5de97d155f
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+* Regular: [dirent data][regular name][nul][nul]...
+* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of NULs here is merely for illustration; their count can
+vary per-key, and they may not even be present if the key is aligned to
+`sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free-bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes the name block as lengths.
+You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, but some allow users who have no need for the feature to still use bcachefs as
+`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
+which may be decider between using bcachefs for eg. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentry's.
+
+This is because currently doing so presents a problem in the following scenario:
+
+ - Lookup file "blAH" in a casefolded directory
+ - Creation of file "BLAH" in a casefolded directory
+ - Lookup file "blAH" in a casefolded directory
+
+This would fail if negative dentry's were cached.
+
+This is slightly suboptimal, but could be fixed in future with some vfs work.
+
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
index 7db4d7ceab58..3864d0ae89c1 100644
--- a/Documentation/filesystems/bcachefs/index.rst
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -4,10 +4,28 @@
bcachefs Documentation
======================
+Subsystem-specific development process notes
+--------------------------------------------
+
+Development notes specific to bcachefs. These are intended to supplement
+:doc:`general kernel development handbook </process/index>`.
+
.. toctree::
- :maxdepth: 2
+ :maxdepth: 1
:numbered:
CodingStyle
SubmittingPatches
+
+Filesystem implementation
+-------------------------
+
+Documentation for filesystem features and their implementation details.
+At this moment, only a few of these are described here.
+
+.. toctree::
+ :maxdepth: 1
+ :numbered:
+
+ casefolding
errorcodes
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
index 719e90f1988e..08dd5e254cc5 100644
--- a/Documentation/filesystems/dax.rst
+++ b/Documentation/filesystems/dax.rst
@@ -207,7 +207,6 @@ implement direct_access.
These block devices may be used for inspiration:
- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index fb7d2ee022bc..e15c4275862a 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -206,6 +206,7 @@ fault_type=%d Support configuring fault injection type, should be
FAULT_BLKADDR_VALIDITY 0x000040000
FAULT_BLKADDR_CONSISTENCE 0x000080000
FAULT_NO_SEGMENT 0x000100000
+ FAULT_INCONSISTENT_FOOTER 0x000200000
=========================== ===========
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
@@ -365,6 +366,8 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes:
pending node write drop keep N/A
pending meta write keep keep N/A
====================== =============== =============== ========
+nat_bits Enable nat_bits feature to enhance full/empty nat blocks access,
+ by default it's disabled.
======================== ============================================================
Debugfs Entries
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 0254f7d57429..863e93e623f7 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -111,9 +111,7 @@ a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer for calling the callback by simply setting
``journal->j_commit_callback`` function pointer and that function is
-called after each transaction commit. You can also use
-``transaction->t_private_list`` for attaching entries to a transaction
-that need processing when the transaction commits.
+called after each transaction commit.
JBD2 also provides a way to block all transaction updates via
jbd2_journal_lock_updates() /
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
index ff9ae4a46530..044be965d75e 100644
--- a/Documentation/filesystems/nfs/reexport.rst
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -26,9 +26,13 @@ Reboot recovery
---------------
The NFS protocol's normal reboot recovery mechanisms don't work for the
-case when the reexport server reboots. Clients will lose any locks
-they held before the reboot, and further IO will result in errors.
-Closing and reopening files should clear the errors.
+case when the reexport server reboots because the source server has not
+rebooted, and so it is not in grace. Since the source server is not in
+grace, it cannot offer any guarantees that the file won't have been
+changed between the locks getting lost and any attempt to recover them.
+The same applies to delegations and any associated locks. Clients are
+not allowed to get file locks or delegations from a reexport server, any
+attempts will fail with operation not supported.
Filehandle limits
-----------------
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 09f0aed5a08b..2a17865dfe39 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -128,6 +128,16 @@ process running on the system, which is named after the process ID (PID).
The link 'self' points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
+A process can read its own information from /proc/PID/* with no extra
+permissions. When reading /proc/PID/* information for other processes, reading
+process is required to have either CAP_SYS_PTRACE capability with
+PTRACE_MODE_READ access permissions, or, alternatively, CAP_PERFMON
+capability. This applies to all read-only information like `maps`, `environ`,
+`pagemap`, etc. The only exception is `mem` file due to its read-write nature,
+which requires CAP_SYS_PTRACE capabilities with more elevated
+PTRACE_MODE_ATTACH permissions; CAP_PERFMON capability does not grant access
+to /proc/PID/mem for other processes.
+
Note that an open file descriptor to /proc/<pid> or to any of its
contained files or subdirectories does not prevent <pid> being reused
for some other process in the event that <pid> exits. Operations on
@@ -502,9 +512,25 @@ process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which
consists of dirty pages. ("Pss_Clean" is not included, but it can be
calculated by subtracting "Pss_Dirty" from "Pss".)
-Note that even a page which is part of a MAP_SHARED mapping, but has only
-a single pte mapped, i.e. is currently used by only one process, is accounted
-as private and not as shared.
+Traditionally, a page is accounted as "private" if it is mapped exactly once,
+and a page is accounted as "shared" when mapped multiple times, even when
+mapped in the same process multiple times. Note that this accounting is
+independent of MAP_SHARED.
+
+In some kernel configurations, the semantics of pages part of a larger
+allocation (e.g., THP) can differ: a page is accounted as "private" if all
+pages part of the corresponding large allocation are *certainly* mapped in the
+same process, even if the page is mapped multiple times in that process. A
+page is accounted as "shared" if any page page of the larger allocation
+is *maybe* mapped in a different process. In some cases, a large allocation
+might be treated as "maybe mapped by multiple processes" even though this
+is no longer the case.
+
+Some kernel configurations do not track the precise number of times a page part
+of a larger allocation is mapped. In this case, when calculating the PSS, the
+average number of mappings per page in this larger allocation might be used
+as an approximation for the number of mappings of a page. The PSS calculation
+will be imprecise in this case.
"Referenced" indicates the amount of memory currently marked as referenced or
accessed.
@@ -686,6 +712,11 @@ Where:
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
+Note that some kernel configurations do not track the precise number of times
+a page part of a larger allocation (e.g., THP) is mapped. In these
+configurations, "mapmax" might corresponds to the average number of mappings
+per page in such a larger allocation instead.
+
1.2 Kernel data
---------------
@@ -1060,6 +1091,8 @@ Example output. You may not have all of these fields.
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
+ Unaccepted: 0 kB
+ Balloon: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
@@ -1132,9 +1165,15 @@ Dirty
Writeback
Memory which is actively being written back to the disk
AnonPages
- Non-file backed pages mapped into userspace page tables
+ Non-file backed pages mapped into userspace page tables. Note that
+ some kernel configurations might consider all pages part of a
+ larger allocation (e.g., THP) as "mapped", as soon as a single
+ page is mapped.
Mapped
- files which have been mmapped, such as libraries
+ files which have been mmapped, such as libraries. Note that some
+ kernel configurations might consider all pages part of a larger
+ allocation (e.g., THP) as "mapped", as soon as a single page is
+ mapped.
Shmem
Total memory used by shared memory (shmem) and tmpfs
KReclaimable
@@ -1228,6 +1267,10 @@ CmaTotal
Memory reserved for the Contiguous Memory Allocator (CMA)
CmaFree
Free remaining memory in the CMA reserves
+Unaccepted
+ Memory that has not been accepted by the guest
+Balloon
+ Memory returned to Host by VM Balloon Drivers
HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
See Documentation/admin-guide/mm/hugetlbpage.rst.
DirectMap4k, DirectMap2M, DirectMap1G