.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc.), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with the file, then load the inode to find the metadata for
that file. For performance reasons, ext4 cheats a little by also storing
a copy of the file type (normally stored in the inode) in the directory
entry. (Compare all this to FAT, which stores all the file information
directly in the directory entry, but does not support hard links and is
in general more seek-happy than ext4 due to its simpler block allocator
and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - i\_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - \_\_le16
     - i\_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - \_\_le32
     - i\_size\_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - \_\_le32
     - i\_atime
     - Last access time, in seconds since the epoch. However, if the EA\_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - \_\_le32
     - i\_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - \_\_le32
     - i\_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - \_\_le32
     - i\_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - \_\_le16
     - i\_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - \_\_le16
     - i\_links\_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR\_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - \_\_le32
     - i\_blocks\_lo
     - Lower 32-bits of “block” count. If the huge\_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on disk. If
       huge\_file is set and EXT4\_HUGE\_FILE\_FL IS set in
       ``inode.i_flags``, then this file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on disk.
   * - 0x20
     - \_\_le32
     - i\_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i\_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i\_block[EXT4\_N\_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i\_block”.
   * - 0x64
     - \_\_le32
     - i\_generation
     - File version (for NFS).
   * - 0x68
     - \_\_le32
     - i\_file\_acl\_lo
     - Lower 32-bits of extended attribute block. ACLs are only one of
       many possible extended attributes; the field name probably dates from
       ACLs being the first use of extended attributes.
   * - 0x6C
     - \_\_le32
     - i\_size\_high / i\_dir\_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i\_dir\_acl, though it was usually set to zero and never used.
   * - 0x70
     - \_\_le32
     - i\_obso\_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i\_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - \_\_le16
     - i\_extra\_isize
     - Size of this inode - 128. Alternatively, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - \_\_le16
     - i\_checksum\_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - \_\_le32
     - i\_ctime\_extra
     - Extra change time bits. This provides sub-second precision. See the
       Inode Timestamps section.
   * - 0x88
     - \_\_le32
     - i\_mtime\_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - \_\_le32
     - i\_atime\_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - \_\_le32
     - i\_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - \_\_le32
     - i\_crtime\_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - \_\_le32
     - i\_version\_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - \_\_le32
     - i\_projid
     - Project ID.

.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S\_IXOTH (Others may execute)
   * - 0x2
     - S\_IWOTH (Others may write)
   * - 0x4
     - S\_IROTH (Others may read)
   * - 0x8
     - S\_IXGRP (Group members may execute)
   * - 0x10
     - S\_IWGRP (Group members may write)
   * - 0x20
     - S\_IRGRP (Group members may read)
   * - 0x40
     - S\_IXUSR (Owner may execute)
   * - 0x80
     - S\_IWUSR (Owner may write)
   * - 0x100
     - S\_IRUSR (Owner may read)
   * - 0x200
     - S\_ISVTX (Sticky bit)
   * - 0x400
     - S\_ISGID (Set GID)
   * - 0x800
     - S\_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S\_IFIFO (FIFO)
   * - 0x2000
     - S\_IFCHR (Character device)
   * - 0x4000
     - S\_IFDIR (Directory)
   * - 0x6000
     - S\_IFBLK (Block device)
   * - 0x8000
     - S\_IFREG (Regular file)
   * - 0xA000
     - S\_IFLNK (Symbolic link)
   * - 0xC000
     - S\_IFSOCK (Socket)
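
Because the file types are mutually exclusive values rather than independent
bits (``S_IFBLK`` is numerically ``S_IFDIR | S_IFCHR``), a decoder has to mask
the top four bits and compare the result as a whole. A minimal sketch, in
which ``decode_mode`` and ``FILE_TYPES`` are hypothetical names used only for
illustration:

```python
# File type values from the table above; compared after masking, never
# tested bit by bit.
FILE_TYPES = {
    0x1000: "S_IFIFO",
    0x2000: "S_IFCHR",
    0x4000: "S_IFDIR",
    0x6000: "S_IFBLK",
    0x8000: "S_IFREG",
    0xA000: "S_IFLNK",
    0xC000: "S_IFSOCK",
}


def decode_mode(i_mode):
    """Split i_mode into (file type name, permission and special bits)."""
    file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")
    permissions = i_mode & 0x0FFF  # rwx bits plus sticky/setgid/setuid
    return file_type, permissions
```

For example, ``decode_mode(0x81A4)`` yields ``("S_IFREG", 0o644)``, a regular
file that is readable by everyone and writable by its owner.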

.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4\_UNRM\_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
   * - 0x10
     - File is immutable (EXT4\_IMMUTABLE\_FL).
   * - 0x20
     - File can only be appended (EXT4\_APPEND\_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
   * - 0x80
     - Do not update access time (EXT4\_NOATIME\_FL).
   * - 0x100
     - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
       EXT4\_ECOMPR\_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4\_INDEX\_FL).
   * - 0x2000
     - AFS magic directory (EXT4\_IMAGIC\_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4\_JOURNAL\_DATA\_FL).
   * - 0x8000
     - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4\_DIRSYNC\_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
   * - 0x40000
     - This is a huge file (EXT4\_HUGE\_FILE\_FL).
   * - 0x80000
     - Inode uses extents (EXT4\_EXTENTS\_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4\_EA\_INODE\_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4\_RESERVED\_FL).
   * -
     - Aggregate flags:
   * - 0x4BDFFF
     - User-visible flags.
   * - 0x4B80FF
     - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
       EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
       EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
       these flags in a special manner and they are masked out of the set of
       flags that are saved directly to i\_flags.
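
The table above lends itself to a simple lookup. The sketch below turns an
``i_flags`` value into a list of flag names and reports any bits that the
table does not describe. ``decode_flags`` and ``I_FLAGS`` are hypothetical
names for this example; the values and flag names are copied from the table.

```python
# Bit values and names as listed in the i_flags table.
I_FLAGS = {
    0x1: "EXT4_SECRM_FL",
    0x2: "EXT4_UNRM_FL",
    0x4: "EXT4_COMPR_FL",
    0x8: "EXT4_SYNC_FL",
    0x10: "EXT4_IMMUTABLE_FL",
    0x20: "EXT4_APPEND_FL",
    0x40: "EXT4_NODUMP_FL",
    0x80: "EXT4_NOATIME_FL",
    0x100: "EXT4_DIRTY_FL",
    0x200: "EXT4_COMPRBLK_FL",
    0x400: "EXT4_NOCOMPR_FL",
    0x800: "EXT4_ENCRYPT_FL",
    0x1000: "EXT4_INDEX_FL",
    0x2000: "EXT4_IMAGIC_FL",
    0x4000: "EXT4_JOURNAL_DATA_FL",
    0x8000: "EXT4_NOTAIL_FL",
    0x10000: "EXT4_DIRSYNC_FL",
    0x20000: "EXT4_TOPDIR_FL",
    0x40000: "EXT4_HUGE_FILE_FL",
    0x80000: "EXT4_EXTENTS_FL",
    0x200000: "EXT4_EA_INODE_FL",
    0x400000: "EXT4_EOFBLOCKS_FL",
    0x01000000: "EXT4_SNAPFILE_FL",
    0x04000000: "EXT4_SNAPFILE_DELETED_FL",
    0x08000000: "EXT4_SNAPFILE_SHRUNK_FL",
    0x10000000: "EXT4_INLINE_DATA_FL",
    0x20000000: "EXT4_PROJINHERIT_FL",
    0x80000000: "EXT4_RESERVED_FL",
}


def decode_flags(i_flags):
    """Return (names of the set flags, bits not listed in the table)."""
    names = []
    unknown = i_flags
    for bit in sorted(I_FLAGS):
        if i_flags & bit:
            names.append(I_FLAGS[bit])
            unknown &= ~bit  # this bit is accounted for
    return names, unknown
```

Returning the leftover bits separately makes it easy to notice flags that
postdate this table instead of silently dropping them.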

.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - l\_i\_version
     - Inode version. However, if the EA\_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - h\_i\_translator
     - ??

Masix:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - m\_i\_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - l\_i\_blocks\_high
     - Upper 16-bits of the block count. Please see the note attached to
       i\_blocks\_lo.
   * - 0x2
     - \_\_le16
     - l\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - \_\_le16
     - l\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - l\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_le16
     - l\_i\_checksum\_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - \_\_le16
     - l\_i\_reserved
     - Unused.
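
The Linux ``osd2`` fields mostly hold the upper halves of values whose lower
halves live in the main inode structure. The following sketch shows how they
might be recombined, including the three-way ``i_blocks`` rule described under
``i_blocks_lo``. The helper names and the ``has_huge_file`` and
``fs_block_size`` parameters are assumptions made for this example; a real
reader would derive them from the superblock.

```python
import struct

EXT4_HUGE_FILE_FL = 0x40000  # from the i_flags table

_OSD2_FIELDS = (
    "l_i_blocks_high", "l_i_file_acl_high", "l_i_uid_high",
    "l_i_gid_high", "l_i_checksum_lo", "l_i_reserved",
)


def parse_linux_osd2(osd2):
    """Decode the 12-byte Linux osd2 area into a dict of six __le16 values."""
    return dict(zip(_OSD2_FIELDS, struct.unpack("<6H", osd2)))


def full_id(low16, high16):
    """Join the lower and upper 16 bits of a UID or GID."""
    return low16 | (high16 << 16)


def bytes_on_disk(i_blocks_lo, i_blocks_high, i_flags,
                  has_huge_file, fs_block_size):
    """Apply the i_blocks rule to get the space consumed, in bytes."""
    if not has_huge_file:
        return i_blocks_lo * 512
    blocks = i_blocks_lo + (i_blocks_high << 32)
    if i_flags & EXT4_HUGE_FILE_FL:
        return blocks * fs_block_size  # counted in filesystem blocks
    return blocks * 512  # counted in 512-byte units
```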

Hurd:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - h\_i\_mode\_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - \_\_le16
     - h\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - h\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_u32
     - h\_i\_author
     - Author code?

Masix:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - m\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - \_\_u32
     - m\_i\_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by struct ext4\_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows struct ext4\_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of October 2013) the inode structure is 156 bytes
(``i_extra_isize = 28``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.
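
The bounds check described above can be sketched as follows. The function
names are hypothetical; the rule is simply that a field is usable only if it
ends at or before ``128 + i_extra_isize`` bytes into the inode record.

```python
EXT2_GOOD_OLD_INODE_SIZE = 128


def field_is_present(i_extra_isize, field_offset, field_size):
    """True if a field lies entirely within the in-use part of the inode."""
    end_of_field = field_offset + field_size
    return end_of_field <= EXT2_GOOD_OLD_INODE_SIZE + i_extra_isize


def space_after_inode(s_inode_size, i_extra_isize):
    """Bytes between the end of the inode structure and the end of the record."""
    return s_inode_size - (EXT2_GOOD_OLD_INODE_SIZE + i_extra_isize)
```

With ``i_extra_isize = 28``, ``i_version_hi`` at offset 0x98 is present but
``i_projid`` at offset 0x9C is not, and a 256-byte record leaves 100 bytes
after the inode structure.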

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.
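
Put together, the three formulas look like this. ``locate_inode`` is a
hypothetical helper; its parameters stand in for ``sb->s_inodes_per_group``
and ``sb->s_inode_size``.

```python
def locate_inode(inode_num, inodes_per_group, inode_size):
    """Return (block group, index in group's table, byte offset in table)."""
    if inode_num < 1:
        raise ValueError("there is no inode 0")
    bg = (inode_num - 1) // inodes_per_group
    index = (inode_num - 1) % inodes_per_group
    offset = index * inode_size
    return bg, index, offset
```

The offset is relative to the start of that block group's inode table, whose
location must still be looked up in the group descriptor. For instance, with
8192 inodes per group and 256-byte records, inode 12 lives in group 0 at
index 11, byte offset 2816.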

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. For inodes that are not linked from any directory but are
still open (orphan inodes), the dtime field is overloaded for use with
the orphan list. The superblock field ``s_last_orphan`` points to the
first inode in the orphan list; dtime is then the number of the next
orphaned inode, or zero if there are no more orphans.
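
The orphan list is therefore a singly linked list threaded through the dtime
fields. A minimal sketch of walking it, where ``read_dtime`` is an assumed
callback that returns the raw dtime field for a given inode number; the cycle
check is a defensive addition for this example, not something the text above
describes.

```python
def walk_orphan_list(s_last_orphan, read_dtime):
    """Return the inode numbers on the orphan list, in list order."""
    orphans = []
    seen = set()
    inode_num = s_last_orphan
    while inode_num != 0:  # zero terminates the list
        if inode_num in seen:
            raise ValueError("orphan list contains a cycle")
        seen.add(inode_num)
        orphans.append(inode_num)
        inode_num = read_dtime(inode_num)  # dtime holds the next orphan
    return orphans
```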

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64 bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv\_sec
     - Decoded 64-bit tv\_sec
     - valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.
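
The decoding rule from the table above can be sketched as follows. This
follows the documented encoding (signed 32-bit seconds plus
``2^32 * epoch bits``), not the behaviour of any particular kernel or
e2fsprogs release, which may differ because of the bugs just mentioned.
``decode_timestamp`` is a hypothetical helper name.

```python
def decode_timestamp(seconds, extra):
    """Decode a 32-bit seconds field and its 32-bit *_extra companion.

    Both arguments are the raw unsigned values read from disk.
    Returns (64-bit tv_sec, nanoseconds).
    """
    if seconds & 0x80000000:
        seconds -= 1 << 32  # reinterpret as a signed 32-bit value
    epoch_bits = extra & 0x3  # lower two bits extend the seconds field
    nanoseconds = extra >> 2  # upper 30 bits hold the nanoseconds
    return seconds + (epoch_bits << 32), nanoseconds
```

Checking against the table: epoch bits ``0 1`` with the MSB set and all other
seconds bits clear decodes to ``0x080000000``, the first second of the
2038-01-19 to 2106-02-07 range.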