LEAN file system specification 1.0.3
Version 1.0.3, last update: 2022 December 28
Copyright © 2004-2010 by Salvatore ISAJA (original Author through version 0.6.1)
Copyright © 2021-2025 by Forever Young Software (Author starting with version 0.7.0)
Permission to make and distribute verbatim copies of this specification, translations of this specification into another language, and derivative works that comment on or otherwise explain it or assist in its implementation, for any purpose and without fee or royalty is hereby granted, provided that the present copyright notice, licensing terms and disclaimer are preserved on all copies.
In addition, the Author covenants not to assert any claims on implementations of this specification, unless such implementations contain derivative work of implementations by the Author, for which specific licensing terms may apply.
THIS SPECIFICATION IS PROVIDED "AS IS," AND THE AUTHOR MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE CONTENTS OF THIS SPECIFICATION ARE SUITABLE FOR ANY PURPOSE; NOR THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. TO THE FULLEST EXTENT PERMITTED BY APPLICABLE LAW, THE AUTHOR WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THIS SPECIFICATION OR THE PERFORMANCE OR IMPLEMENTATION OF THE CONTENTS THEREOF.
No other rights are granted by implication, estoppel or otherwise.
This document describes version 1.0.3 of the LEAN file system (where the word LEAN is a recursive backronym for Lean yet Effective Allocation and Naming, based on the ironical contrary of FAT, as an English word): a free, simple, portable, personal, fully featured file system for embedded tasks. Minor changes made to this document (e.g. wording) that do not affect the file system format are tracked by the third number in the document version number.
The LEAN file system is in the release development stage: this document supersedes any previous version of the LEAN file system specification with no care for backward compatibility.
Since this is a new file system, and some aspects are still to be defined, suggestions or corrections are welcome, for either the file system or this document. Please contact the original author at: salvois at users.sourceforge.net
or for later releases: fys at fysnet.net
. The authors wish to thank those who have submitted comments and criticism in order to improve the LEAN file system.
Table of contents
- Differences from the previous versions and future plans
- Definitions
- Partition identifier
- Structure identification and checksum
- Bands
- Character encoding
- The Superblock
- The bitmap
- The data area and file allocation
- The inode structure
- Directories
- Symbolic links
- Extended attributes
- Extended Capabilities
- Undelete Capabilities
- Transactional Updates
Differences from the previous versions and future plans
- version 1.0
- Official Release: Official release version 1.0.
- version 1.0-rc1
- Transactional Updates: Added a section about Transactional Updates.
- Bitmap Checksum: Added a Bitmap Checksum to the Superblock. This addition means the complete Volume can now be Checksum'ed.
- Encoding: Removed the Character Encoding option. All text and strings are UTF-8 as originally specified.
- version 1.0-rc0
- First non-alpha/non-beta release: First release with a Major version greater than zero.
- Version field: Added clarification to the version field in the Superblock.
- Next Free: Added a hint to the next free block field the Superblock.
- Capabilities field: Added a capabilities field to the Superblock.
- Character Encoding: Added a character encoding field to the Superblock.
- Inode size: Changed the size of an Inode.
inodeSize = 200
- Extents per inode: Changed the
extentsPerInode
constant to 8 for standard extents (remains 6 for extended extents). - Extended attributes: Changed the
header
dword to include an 'Implementation Specific' field. This restricts the size of an attribute value from 16 Meg to 1 Meg (minus 1), though Linux restricts them to 64k, so I think we are still good.
- version 0.8
- Hidden Attribute: Removed
iaHidden
from theenum InodeAttributes
and placed it in the directory entry'stype
field. This allows multiple hard links to the same inode to be hidden independently of each other. - Undelete: Added the
ftDeleted
file type and comments about undelete capabilities.
- Hidden Attribute: Removed
- version 0.7
- Block size: 512-byte sectors are no longer assumed. A block size can be specified.
- Extended lengths: The length of some items have been extended to accommodate for larger block sizes.
- Band size: The minimum size of a band has been extended to accommodate for larger block sizes.
- Journal inode: An entry in the Superblock was added to accommodate this arbitrary block size and a Journal.
- Misc: A field in the Indirect structure was extended to a 16-bit value to accommodate for larger block sizes.
- Superblock location: Tightened the constraints on the Superblock location when large block sizes are used.
- Bad Block inode: Added section to clarify the Bad Block inode.
- version 0.6
- License: the GNU Free Documentation License has been replaced with a custom license that makes more sense for a specification.
- Preallocation: added support for drivers that may want to preallocate more disk space than needed for a file to reduce fragmentation and improve locality.
- Creation time stamp: creation and last status modification time stamps have been splitted.
- Extended attribute names: introduced constrains on names of extended attributes for compatibility with existing extended attribute APIs.
- Inline extended attributes: there are only two options for inline extended attributes now: do not use them, or reserve the whole space after the inode structure in the first sector, instead of the old
inodeExtra
parameter. - Naming convention: removed identifiers in
ALL_CAPITALS
to avoid clash with the conventional namespace of C and C++ macros.
- version 0.5
- Directory entry restrictions: directory entries are no longer required to fit within a single sector. This increases the maximum file name length, and makes storage of directory entries and extended attributes consistent.
- version 0.4
- File allocation: new definitions for some fields of the inode structure to simplify traversal of indirect sectors.
- Extended attributes: extended attributes can now be stored in an optional extra space after the inode structure, or in a separate fork for each file.
- Access control: POSIX permissions, although optional, have been included. Access Control Lists may be added in a future version using an extended attribute.
- version 0.3
- Sector size: sector size is now fixed to 512 bytes. This simplifies the layout of several structures and implementation of drivers. A driver should convert actual disk sectors to conventional sectors in case they differ.
- 64-bit addressing: sector addressing has been extended from 32-bit to 64-bit. Extent sizes are still 32-bit to reduce bookkeeping space. The layout of all data structures has been modified to accommodate the new sizes.
- version 0.2
- Multiband layout: a volume is split in bands, each with its portion of the bitmap.
- Extent-based allocation: file allocation is expressed in terms of extents instead of the old clusters.
- File name restrictions: removed constrains on characters not allowed in file names.
Definitions
The following terms and conventions will be used throughout this specification:
- must, must not, should, should not, may: these words are to be interpreted as described in RFC 2119;
- reserved for future use: a data field that has not yet been defined. When creating a new entry that field must be written as zeros. When reading an existing entry, a driver must not make assumption about the content of that field. When writing that field to an existing entry, a driver must preserve the value in that field;
- byte: a group of eight adjacent bits, an octet;
- driver: a file system driver, or any other part of the operating system, or any application implementing this specification to access a LEAN file system;
- disk: a block device hosting a LEAN file system. This wording is used for simplicity, although that device does not need to be an actual disk memory (e.g. disk partition, memory card, RAM disk). Drivers must limit disk sizes to 263-1 blocks, that is the maximum number that can be represented with a signed 64-bit integer;
- volume: a self-contained LEAN file system on a disk;
- sector: an allocation unit, an arbitrary number of bytes, singly accessible, read or written, by a device controlling the media of the volume. This is the smallest number of bytes accessible per physical access of the media, by default usually being a 512-byte block. This must be a power of two equal to or greater than 256. It is the goal of this specification to eventually target a sector size of 4096 bytes with a matching block size;
- block: by default, a conventional sector of 512 bytes, though not limited to this size. A block must be an array of contiguous sectors and the disk must be an array of contiguous blocks, numbered starting from zero. The block size may or may not equal the media sector size;
- Superblock: a structure containing fundamental information about a volume;
- bitmap: a structure tracking free blocks of a volume;
- band: a portion of the disk made up of several contiguous blocks. Each band stores a part of the bitmap, in order to take advantage of locality in space;
- data area: part of a volume storing files, directories and other file system objects;
- directory: a special file storing a list of directory entries;
- directory entry: a structure containing a file name and location;
- inode structure: a structure containing fundamental information about a file. The inode structure of a file is stored in the first block of the file;
- inode number: the block number of the first block of a file, that is the block containing the inode structure of the file. The inode number must uniquely identify a file on a volume;
- extent: a contiguous addressable portion of a file, specified by a starting block and a size;
- fork: a special file, associated with another non-fork file, storing a list of extended attributes;
- extended attribute: a structure containing an arbitrary name-value pair of metadata;
- The C++ syntax is used for code snippets and structure formats, with stdint.h integral types. The
char
type is assumed to be 8-bit. Hex numbers are prefixed with0x
; - Unless otherwise stated, all numbers must be stored in little endian format (least significant byte first).
Partition identifier
On the i386 platform, each file system is identified by an 8-bit number in an entry of an MBR partition table. The value 0xEA
(after LEAN) has been proposed as it seems unassigned.
For GUID partition tables the following GUID is provided to identify a LEAN data partition (placed in the GUID Type field):
bb5a91b0-977e-11db-b606-0800200c9a66
Block size
This file system uses an arbitrary sized allocation block, not necessarily fixed to the size of the disk's sector size, though with larger sector-sized disks becoming available, larger block sizes equal to the sector size will have a large advantage.
The Block Size
is specified in the Superblock and must have a size of a power of two equal to or greater than 256. It is recommended, but not mandated, that a block size match the sector size of the disk. A default size of 512 bytes is recommended for most disks.
One such advantage of larger block sizes is that since an Indirect Block
uses all remaining space within the block for extents, a larger block size will use less Indirect Blocks
on heavily fragmented files. With a 4096-byte sector and block size, many Indirect extents can be read with one read operation.
If the block size does not match the sector size, the implementation must use sector buffering, and its implementation is not defined by this specification.
For simplicity and for less prone to corruption, it is highly recommended that a block size does not exceed the sector size of the disk.
Structure identification and checksum
Several structures of LEAN include fields to make the file system more robust.
Sensitive structures, such as the Superblock
, Inode
, or Indirect Block
, each store a checksum
field in their first 32 bits. The checksum is computed on any data following the checksum
field itself, up to the end of the sensitive structure and/or block. Other areas might also contain checksums. The area to check must be cast to an array of 32-bit unsigned integers, then each element of this array must be summed, rotating the sum one bit to the right at every iteration. The size of the area to check must be a multiple of 4 bytes. The checksum must be recomputed at least every time a sensitive structure or data area is written back to disk.
The following functions shows how to calculate the checksum. The data
parameter is a pointer to the sensitive structure (including the checksum field in its first 4 bytes), or the data area to be checked. The size
parameter is the size in bytes of the structure pointed to by data
, including the 4 bytes for the checksum field if it is a sensitive structure. It must be a multiple of 4 bytes. The function skips the first 4 bytes when computing the checksum if the isSensitive
parameter is non-zero.
When calculating the checksum on the bitmap, either read in each band's bitmap, consolidating to one contiguous buffer and then call computeChecksum
, or keep a running total, initially setting res
to zero, read in each band's bitmap and call computeChecksumPartial
, sending res
each remaining time.
/* Compute the Checksum of an area. * data -> Sensitive stucture or data area to be checked. * size = size in bytes of area. * isSensitive = non-zero to skip first 32-bit entry (is a Sensitive structure) */ uint32_t computeChecksum(const void *data, const size_t size, const bool isSensitive) { uint32_t res = 0; if (isSensitive) { assert(size >= sizeof(uint32_t)); res = computeChecksumPartial(res, static_cast<const uint8_t *>(data) + sizeof(uint32_t), size - sizeof(uint32_t)); } else { res = computeChecksumPartial(res, data, size); } return res; } /* Compute the Checksum of a partial area. * res = running checksum of last checked area. * data -> Sensitive stucture or data area to be checked. * size = size in bytes of area. */ uint32_t computeChecksumPartial(uint32_t res, const void *data, size_t size) { const uint32_t *d = static_cast<const uint32_t *>(data); assert((size & (sizeof(uint32_t) - 1)) == 0); size /= sizeof(uint32_t); for (size_t i = 0; i != size; ++i) res = (res << 31) + (res >> 1) + d[i]; return res; }
A sensitive structure also contains a magic
field storing a 32-bit constant signature identifying the structure. This can be used as a first test to validate a sensitive structure.
A field specifying the address of the block storing the structure itself is sometimes provided. This is another robustness measure against repair utilities, in case they are scanning a disk for sensitive data structure, and an image file containing a LEAN file system is stored on that disk.
Figure 1: Layout of a LEAN volume.
Bands
To improve locality in space and to help dynamically change the size of a volume, the disk is logically subdivided into bands, made up of several contiguous blocks. Bands are equally sized, perhaps except the last one, and must be made up of a power of two worth of blocks. A volume contains at least one band. Bands are numbered starting from 0, where block 0 is the first block of band 0.
Each band contains a portion of the bitmap representing the allocation state of the blocks of that band. That is, each bitmap portion represents the state of closely located blocks. The bitmap portion for each band must start at the first block of the band. Band 0 is an exception in that it contains blocks reserved for the boot loader and the Superblock, thus a specific field is used to tell where the bitmap of band 0 starts.
The size of a band closely correlates with the size of a block. For more information, see the Bitmap
section.
Character encoding
All character strings are stored in UTF-8 format. This includes the volumeLabel
field in the superblock, all filenames in any directory entry, the name
field in any extended attributes, and all symbolic links, both name and content.
The Superblock
The Superblock is a structure containing fundamental information about a LEAN volume. Its name comes from the ext2 file system, and it is used to differentiate it from the boot sector, that on i386 is sector 0, containing the code of the boot loader.
A driver must use the Superblock to know how to access every other information on the volume. Thus, the Superblock must be stored in a well known location. The Superblock must be stored in one of any block in range from 1 to 32, inclusive. This range should allow to scan for a Superblock taking advantage of track buffering from the disk driver, while being flexible enough to overcome the limits of a fixed location. With a 512-byte block size, this leaves up to 16 KiB for the boot loader. A driver must use the magic
, checksum
and primarySuper
fields of the Superblock to identify a valid Superblock.
Since the size of a block is unknown until you find and verify the Superblock, the Superblock must be aligned on a multiple of a 512-byte offset from the start of the volume, as well as being block aligned. However, this creates a problem when block sizes are unreasonably large. Therefore, no matter the block size, along with the constraints specified here, the Superblock must start in the range of at least 512 bytes from the start of the volume, and not to exceed a starting point of 131072 bytes from the start of the volume, this being block 32 on a 4k block size. This allows the implementer to read at most 131072 + max_block_size_supported
bytes from the start of the volume, then starting at offset 512, check every 512-byte block until the end of the buffer. Remember that the checksum
is calculated on a (1 << logBlockSize)
sized buffer.
A copy of the Superblock must be present in a different location for backup purposes. A driver must ensure that the content of the backup copy is always exactly the same as the Superblock. The location of the Superblock backup is stored in the Superblock to allow a driver to update the backup copy. The backup copy of the Superblock should be placed in the last block of the first band, in order to help drivers locate it in the event the primary copy gets lost or corrupted (being band sizes constrained to powers of two). The magic
, checksum
and backupSuper
fields must be used in this case.
Blocks before the Superblock are reserved for a boot loader, they must not be used by a LEAN driver, and their content is undefined. The Superblock may be shifted forward, up to the limit specified above, if more blocks are needed to store a boot loader. Block 0 is always reserved for the optional boot code.
Once the Superblock position is chosen, within the range specified above, though it is not mandated, the first bitmap and root inode fall directly after the Superblock to make it easier for a repair utility to find and repair a corrupt system.
In the FAT file system the boot sector embeds, besides the boot loader, volume status information. Thus, if a boot sector containing apparently valid FAT information is used on a LEAN volume, FAT drivers may erroneously detect the volume as FAT. Implementers and/or users may remove FAT related information from the boot sector (for example, zeroing every byte but the actual boot code), or try FAT detection at the end of the autodetection process. The LEAN design avoids storage of volume information in the boot sector for several reasons: platform independence, to use a compact, free format, where the fields are properly aligned and ordered, for robustness against a damaged sector 0, and to make room for boot loader code.
The format of the Superblock is the following:
uint32_t checksum | The checksum value for the Superblock. All 32-bit dwords in this block are included in the calculation. |
uint32_t magic | This must be equal to 0x4E41454C (the 'LEAN' characters in ASCII), and it must be used to identify a valid LEAN file system. It should be used to find the Superblock backup in case the primary copy gets lost or damaged. |
uint16_t fsVersion | This field identifies the version of the LEAN file system, and it is provided for future development. The high byte identifies the major version number and the low byte the minor version number (for example 0x0110 would mean version 1.16). At present, in the first non-beta development stage, it must be set to 0x0100 (that is version 1.0) and drivers must not try to access an unknown file system version, backward compatibility making no sense.The Major number will only change on significant and breaking changes to the specification. The Minor number will change on changes to Extended Capabilities. It is the intent of this version system to allow a driver to be Major version compliant only, the driver checking the capabilities field and refusing to mount a volume if an unknown Extended Capability is found. For example, a driver written for version 1.0 should be able to mount a volume with a version of 1.16 as long as this newer volume doesn't have an offending Extended Capability for version 1.0. See the Extended Capabilities section for more information.Note that there is no third version number, as in a '4' of version 1.8.4, for example. The third version number is only for a specification clarification and should have no affect on the driver. i.e.: A properly written driver for specification version 1.8.0 should be able to properly mount a volume created with specification version 1.8.4, for example. The third version number of the specification should not affect the ability of the driver to correctly mount a volume. |
uint8_t preallocCount | The count minus one of contiguous blocks that the driver should try to preallocate when a new block is to be added to a file. A value of zero means that the driver should append a single block. Values greater than zero are recommended, especially for slowly-growing files such as directories, to reduce fragmentation and improve locality. A value of 7 (8 additional blocks) is recommended for a block size of 512 bytes. A value of 3 (4 additional blocks) is recommended for a block size of 4096 bytes. See also the inode's iaPrealloc attribute. |
uint8_t logBlocksPerBand | The base-2 logarithm of the number of blocks per band. The latter is easily computed as 1 << logBlocksPerBand . For 512-byte blocks, it must be at least equal to 12 (4096 bits per band, that is 512 bytes of a block), so that a bitmap portion is always made up of an integral amount of blocks. This field must be equal to 13 for 1024-byte blocks, 14 for 2048-byte blocks, 15 for 4096-byte blocks, and so on. This means, when using 512-byte blocks, the smallest band size is 4096 sectors, that is 2 MiB. |
uint32_t state | This field identifies the state of the file system. Bit 0 (the LSB) is the clean unmount bit: it must be set to '0' while the volume is mounted, and it must set to '1' while the volume is unmounted. Thus, if a driver finds this bit set to '0' when mounting a volume, it means the volume was not unmounted properly. Bit 1 is the error bit: it must be normally '0'. Drivers must set it to '1' in case of major errors, like when file system corruption is detected. File system checkup programs may use this bit to know if some major problem happened with the volume. Bits 2 to 31 are reserved for future use. |
uint8_t uuid[16] | A 128-bit universally unique identifier (UUID) that should uniquely identify the volume. A driver should use it to check for media change in a removable media drive. A driver may use it for automatic volume identification. Drivers must create a sequence of 16 bytes as unique as possible, for example using a good random number generator or one of the canonical algorithms for UUID generation. |
char volumeLabel[64] | A descriptive string, encoded in UTF-8, and null-terminated, to help the user identify the volume. It may be empty. |
uint64_t blockCount | The total number of blocks that form the volume. |
uint64_t freeBlockCount | The number of free (or available, or unallocated) blocks in the volume. A driver must keep it consistent to the rest of the file system to measure the free volume space to store files, directories and bookkeeping. A value of zero must mean disk full. Drivers may choose to not update this field every time there is a change if performance is a constraint or the disk has limited write capabilities (e.g. flash memories). However, this field must be updated when dismounting the volume. |
uint64_t nextFree | The block number to start looking for free blocks. At mount time, a driver should use this to hint at the next block that might be free. This field should be updated before dismount. This is used to help traverse through all blocks in the volume when looking for free blocks. A driver should start looking for free blocks at this position in the volume, incrementing to and rolling over at the end of the volume. This value might not point to an actual free block, just a position to start looking. |
uint64_t primarySuper | The address of the block containing the primary copy of the Superblock. A driver must use this number to know where to write changes made to the Superblock back to disk. This field is provided for robustness, for example against repair utilities operating on a disk containing a LEAN file system image file. |
uint64_t backupSuper | The address of the block containing the backup copy of the Superblock. A driver must use this number to know where to write the backup copy when changes are made to the Superblock. In the event the primary copy of the Superblock gets lost or damaged, recovery utilities should search for the backup copy of the Superblock looking for valid magic and checksum fields, and may also use the backupSuper field itself for further validation. |
uint64_t bitmapStart | The address of the block where the bitmap of the first band (that is, band 0) starts. This is, usually but not mandated, the block right after the Superblock. For any other band, the bitmap must start in the first block of the band. The size of the bitmap in each band is related to the logBlocksPerBand field. |
uint32_t bitmapChecksum | This is the 32-bit checksum of the complete bitmap of the volume. Each band's bitmap including any bits after blockCount - 1 in the last bitmap block is to be checked. This checksum should be done as a premount operation and updated before unmounting the volume. |
uint32_t reserved0 | Padding, reserved for future use. See the definition for Reserved |
uint64_t rootInode | The inode number (that is, the address of the first block) of the root directory. This is, usually but not mandated, the block right after the bitmap for band 0. |
uint64_t badInode | The inode number (that is, the address of the first block) of a pseudo-file to track bad blocks. This pseudo-file should be generated and managed by a file system checkup program, and should be made up of any blocks that the program detected as bad. This ensures that a driver can not allocate them. If the volume has no bad blocks, badInode must be zero. See a later section for the format of this pseudo-file. |
uint64_t journalInode | The inode number of the "file" representing the Journal data. This field may be zero indicating no Journal. An implementation must verify the Journal Data before using it. It is outside this specification on how this Journal is to be formatted/used. See a later section for more information. |
uint32_t capabilities | A bitmap of the Extended Capabilities in use on this volume. A set bit means that capability is in use. See the Extended Capabilities section for a list of capabilities and their corresponding bit position. |
uint8_t logBlockSize | The base-2 logarithm of the size of a block in bytes when this volume was formatted. A value of 9 is 512, 10 = 1024, 11 = 2048, 12 = 4096, and so on. It is calculated as 1 << logBlockSize . This value must be at least 8. |
uint8_t reserved1[3] | Padding, reserved for future use. See the definition for Reserved |
uint8_t reserved2[328]1 | Reserved for future use. These bytes pad the size of the Superblock structure to the size of the block used and must be included in checksum calculation. A value of 328 is used for 512-byte blocks. See the definition for Reserved .1 328 for 512-byte blocks, 840 for 1024-byte blocks, 1864 for 2048-byte blocks, 3912 for 4096-byte blocks, etc. |
The bitmap
The complete bitmap is a non-contiguous data structure, made up of blocks that, as a whole, form an array of bits, each of them indicating the allocated/unallocated state of each block of a volume. The bitmap is not contiguous because its blocks are subdivided into bands, instead of being stored in a single chunk in a fixed location.
The total number of blocks in a volume is specified by the blockCount
field of the Superblock. Thus, the bitmap as a whole must contain blockCount
bits, one for each block. The number of blocks required to store the bitmap (that is, the total bitmap size) is given by ceil(blockCount / ((1 << logBlockSize) * 8))
, where ceil
means round up. Any bits after the first blockCount
are unspecified. Each band must store a portion of the bitmap, according to the band size in blocks, that is the size of each bitmap portion must be blocksPerBand / ((1 << logBlockSize) * 8)
blocks, where blocksPerBand
can be easily computed from the Superblock. The size of each bitmap portion will result in an integral number of blocks due to the constrains specified for logblocksPerBand
.
A bit of the bitmap must be set to '1' if its corresponding block is used/allocated, whereas it must be set to '0' if its corresponding block is free/unallocated. Block 0 is identified by bit 0 (the LSB) of the first byte of the bitmap, block 1 by bit 1 of the first byte, block 7 by bit 7 (the MSB) of the first byte, block 8 by bit 0 of the second byte, and so on.
The blocks occupied by the Superblock, the Superblock backup, the bitmap itself and any reserved blocks containing boot loader code are not free for data storage, thus must be marked as allocated in the bitmap.
The following code snippet shows how to locate a bit for the block specified by block
.
/* The following input values and precalculated values are assumed: * uin64_t block = the block to locate the bitmap entry for * Superblock sb = a structure containing the Superblock * uint64_t blocksPerBand = 1 << sb.logBlocksPerBand */ uint64_t band = block >> sb.logBlocksPerBand; uint64_t blockInBand = block & (blocksPerBand - 1); uint64_t bitmapBlock = (blockInBand >> (3 + sb.logBlockSize)) + (band << sb.logBlocksPerBand); if (band == 0) bitmapBlock += sb.bitmapStart; unsigned bitOfBlock = blockInBand & ((1 << sb.logBlockSize) * 8 - 1);
Band/Bitmap size
You must take the size of a block into consideration when calculating the size of a band which corresponds with the size of a bitmap section. A bitmap section, possibly not counting the last section, must occupy a whole block. For example, with a 512-byte block size, a band must be at least 4096 blocks in size.
4096 blocks / 8 bits per byte = 512 bytes
However, with a 4096-byte block size, a band must be at least 32768 blocks.
32768 blocks / 8 bits per byte = 4096 bytes
A bitmap section, possibly not counting the last, must contain an integral amount of bits to correlate with the block-size it occupies.
Figure 2: File Allocation in LEAN (Assuming
512-byte blocks with 38 extents).
The data area and file allocation
The data area is made up of any block not used by the bitmap, the Superblock, the Superblock backup and blocks reserved for boot code. It is in general not contiguous due to the band-based layout of the bitmap. It stores actual file system contents: files, directories and their related bookkeeping (that is the list of their allocated blocks).
Besides being unused, each block in the data area can be either a data block or an indirect block. The former contains bookkeeping and actual data, the latter contains only bookkeeping. Data blocks are grouped together to form extents when they are contiguous. An extent is specified by a starting address and a size in blocks.
Each file is stored in one or more extents, thus file allocation can be described as an array of extent specifications. The first block of a file stores the inode structure of the file. Actual file data begins after the inode structure or after the optional extra space after the inode structure reserved for inline extended attributes. This introduces an offset when converting between file data position and data blocks. The inode structure and indirect blocks are used to know what extents belong to a file. This is done in two ways: direct addressing and indirect addressing.
Extent definition
An Extent is defined as a Starting Block and a count of contiguous blocks used. By default, a standard extent is formatted as shown below. Please note that, being an extent specification is a 12-byte quantity (64-bit address + 32-bit size), the two members, extent starting blocks and sizes, have been split in two distinct arrays to keep proper alignment. A standard extent with a count equal to 1, occupies 12 bytes. (extentSize = 12
)
uint64_t extentStarts[count] | The array of starting block addresses of the extent specifications. |
uint32_t extentSizes[count] | The array of extent sizes of the extent specifications. Each element represents the size in blocks of the corresponding extent whose first block is specified in extentStarts . |
Direct addressing
To speed up file access on small files, a limited number of extent specifications is stored in the inode structure itself. A driver must use this information to directly access any of the first extents of the file in constant time. The constant extentsPerInode
, currently defined to 8, specifies how many extent specifications are saved in the inode structure.
Indirect addressing
If a file needs more than extentsPerInode
extents, indirect addressing is used for subsequent extents. An indirect block is a block storing an array of extentsPerIndirect
extent specifications instead of file data. Indirect blocks are linked together to form a doubly-linked list in order to support arbitrary file lengths.
The number of extents in an Indirect block, extentsPerIndirect
, correlates with the size of a block, so that all of the remaining space in a block is used. For example, when a 512-byte block is used, there is enough room for 38 standard extents before the end of the block. For 1024-byte blocks, there is enough room for 80 standard extents. This number jumps to 336 for 4096-byte block sizes. Note that there may be a few bytes after the last extent and these remaining bytes are considered reserved.
The constant extentsPerIndirect
is calculated as (((1 << sb.logBlockSize) - 56) / extentSize)
truncated to an integer.
The following is the layout of an indirect block.
uint32_t checksum | The checksum value for the indirect block. All bytes in the block do count for checksum calculation. |
uint32_t magic | This must be equal to 0x58444E49 (the 'INDX' characters in ASCII). |
uint64_t blockCount | The total number of blocks addressable using this indirect block. A driver should use this information to quickly traverse the list of indirect blocks to find the desired data block. It must be equal to the sum of the used extentSizes (see below). |
uint64_t inode | The inode number of the file this indirect block belongs to. This is provided for robustness. |
uint64_t thisblock | The address of the block storing this indirect block. This is provided for robustness. |
uint64_t prevIndirect | The address of the previous indirect block. If there is no previous indirect block, this field must be zero. A driver may use this value to traverse the list of indirect blocks in reverse order. |
uint64_t nextIndirect | The address of the next indirect block. If there is no next indirect block, this field must be zero. This is the primary means for a driver to traverse the list of indirect blocks to seek into a file. |
uint16_t extentCount | The number of valid extent specifications stored in this indirect block, that is the count of the first elements in the extentStarts and extentSizes arrays that refer to actual file extents. If the indirect block is not the last indirect block of a file, all extent specifications must be in use, that is extentCount must be equal to extentsPerIndirect . |
uint8_t reserved0[2] | Padding, reserved for future use. See the definition for Reserved . |
uint32_t reserved1 | Padding, reserved for future use. See the definition for Reserved . |
Extents[extentsPerIndirect] | The Extents. |
uint8_t reserved2[n] | Zero to extentSize - 1 bytes. A 512-byte block might have zero bytes here. A 4096-byte block might have 8 bytes here. The value of these bytes are included in the checksum calculation. See the definition for Reserved . |
Rationale
The extent-based approach allows efficient and simple access to the file system. It is efficient because, once the driver knows an extent specification, it can access any block of that extent in constant time. It is also simple because, for each extent, only an offset must be added to its starting block to find any other block.
This works best if the file is not badly fragmented, that is unless the file is made up of several extents. Being the first extentsPerInode
extent specifications saved in the inode structure, it is expected that most small files would need no indirect blocks, thus any block can be accessed in constant time with no additional reads. Large files can benefit from this too, if they are mostly contiguous, although this is not as likely.
Indirect blocks store a much greater amount of extent specifications than the inode structure, thus it is expected that with a few indirect blocks a file is completely specified. For example, when using a 512-byte block size, using just 5 indirect blocks, it is possible to specify a file made up of 198 fragments. They may be buffered for faster operation. The number of indirect blocks, if any are present, is determined only by the count of fragments, not by the file size (at least not directly).
This approach has been chosen in order to advantage random access on small files (e.g. configuration files), but sequential access for bigger files (e.g. multimedia files), as well as keeping the file system implementation as simple as possible.
Allocation strategies
In order to use the file system as efficiently as possible, drivers may implement allocation strategies when creating or extending files. This specification does not define or mandate such allocation strategies, but a few suggestions are presented.
When extending a file, a driver should try to grow its last extent. The last extent of a file can be easily found using the extentCount
and/or lastIndirect
fields of the inode structure. The following algorithm for adding one block to a file may be used:
if the block right after the end of the last extent is free { grow the last extent increasing its size by 1; } else { find a free block as close as possible to the last extent; if a new indirect block is needed { write the indirect block in that free block; find a free block as close as possible to the indirect block just written; } create a new extent starting at that free block and with size 1; }
Drivers may use a modified version of the algorithm above, by trying to allocate and add more than one block at a time. This is expected to reduce fragmentation and improve locality, especially on slowly-growing files such as directories.
To improve locality in space between directories and their files, and to help reduce fragmentation, a driver may use the following algorithm when creating a new file. It assumes that most files will have a single hard link, thus a single parent directory:
locate the band where the parent directory of the new file ends; if the new file is not a directory { find a free block as close as possible to the end of the parent directory; } else { generate a band number different from the one of the parent directory; find a free block in that band; } create the new file in that block and store there its inode structure;
To help implementing these algorithms, a driver may provide a parametric procedure for searching for free blocks, specifying a suggested block where to start to search from and a suggested block count.
The inode structure
All fundamental metadata of a file are stored in a structure called inode structure. Such file metadata include file size, attributes and time stamps, as well as the means to access directly- and indirectly-addressable extents of a file.
The inode structure must be stored at offset zero of the first data block of a file. The address of that first data block must be used as inode number, that is a serial number to uniquely identify the file in a volume. It is not supposed to uniquely identify the file on the whole system: the operating system or a driver may use the inode number together with some way to identify the volume (for example, its UUID or an opaque identifier) to uniquely identify the file system-wise.
Putting the inode structure in a file data block allows to avoid a separate inode table, allowing an arbitrary number of files per volume and simplifying inode management, at the price of some disk space (especially for very small files that would otherwise fit in a single block or zero-length files) and possible performance degradation when doing a detailed directory listing including file metadata.
The inode structure is 200 bytes (inodeSize
) long, and its format is defined in the following table. Drivers may opt to keep the whole structure in memory for open files, or use a larger inode cache, or, if memory is a scarce resource, keep only a minimal amount of metadata and read the inode structure from disk whenever needed. Actual file data starts right after the inode structure or, if the iaInlineExtAttr
attribute is set, at the beginning of the subsequent data block (perhaps in the same extent).
It is recommended that if the file's content is to exceed the size of the remaining space after the inode, that the iaInlineExtAttr
attribute is to be set, and the file start in the subsequent data block. Example: iaInlineExtAttr = (fileSize > ((logBlockSize << 1) - inodeSize));
Field | Description |
---|---|
uint32_t checksum | The checksum value for the inode structure. Any optional inline extended attributes or file data following the inode structure are not part of the inode structure itself, thus must not be included in checksum calculation. |
uint32_t magic | This must be equal to 0x45444F4E (the 'NODE' characters in ASCII). |
uint8_t extentCount | The number of valid extent specifications stored in the inode structure, that is the count of the first elements in the extentStarts and extentSizes arrays that refer to actual file extents, from 1 (the first extent contains the inode structure itself) to extentsPerInode = 8 . Directly-addressable extents must be used from the first to the last before using indirect blocks. If extentCount < extentsPerInode , unused extent specification are undefined. If extentCount < extentsPerInode and any of indirectCount , firstIndirect , or lastIndirect is not zero, this inode is considered corrupt. |
uint8_t reserved[3] | Padding, reserved for future use. See the definition for Reserved . |
uint32_t indirectCount | The number of indirect blocks owned by the file. |
uint32_t linkCount | If this file is not a fork, this field must be equal to the number of hard links (the count of directory entries) referring to this file. If this file is a fork, this field must be equal to the number of non-fork files referring to this fork. Drivers may implement fork sharing and copy-on-write. In any case, for non-deleted files this field must be at least 1. |
uint32_t uid | The numeric identifier of the user that owns the file. Interpretation of this field is operating system dependent. A mono-user driver must use zero. |
uint32_t gid | The numeric identifier of the group that owns of the file. Interpretation of this field is operating system dependent. A mono-user driver must use zero. |
uint32_t attributes | The attributes mask for the file, see enum InodeAttributes . It includes POSIX permissions, behavior flags and the file type specification. Interpretation of permissions is operating system dependent. A mono-user driver must use the owner user permission bits only (ia?USR ). Note that the "file type" values are mutually exclusive enumerated values, they are not single bits that can be combined. The iaFmtMask mask can be used to retrieve the file type. |
uint64_t fileSize | The size in bytes of the file data, thus excluding the size of the inode structure, the optional extra space for inline extended attributes, and indirect blocks. |
uint64_t blockCount | The number of data blocks (thus excluding indirect blocks, and including the first block with the inode structure) allocated for the file. Must always be at least 1. A driver may preallocate more blocks than required for a specified fileSize to implement strategies to reduce fragmentation. |
int64_t accessTime | Last access time stamp. This field is computed as the signed number of microseconds since 1970-01-01 00:00:00.000000 UTC, not including leap seconds. Drivers may choose to not update this field if performance is a constraint or the disk has limited write capabilities (e.g. flash memories). A mount-time flag or an open-time flag may be implemented to disable last access time stamps, and the implementation may want this to be the default. Initially it must have the same value as creationTime . If the operating system semantics allows it, drivers may delay update of this time stamp instead of updating it at every access. |
int64_t statusChangeTime | Status modification time stamp. This field is computed as the signed number of microseconds since 1970-01-01 00:00:00.000000 UTC, not including leap seconds. Initially it must have the same value as creationTime . If the operating system semantics allows it, drivers may delay update of this time stamp instead of updating it at every modification to the inode structure. |
int64_t modificationTime | Last data modification time stamp. This field is computed as the signed number of microseconds since 1970-01-01 00:00:00.000000 UTC, not including leap seconds. Initially it must have the same value as creationTime . If the operating system semantics allows it, drivers may delay update of this time stamp instead of updating it at every write to the file. |
int64_t creationTime | File creation time stamp. This field is computed as the signed number of microseconds since 1970-01-01 00:00:00.000000 UTC, not including leap seconds. Depending on the operating system semantics, this field may be greater than the other time stamps if this file is created as a copy of an existing file. |
uint64_t firstIndirect | The address of the first indirect block of the file. If the file has no indirect blocks the value of this field must be zero. Example: If there is only 1 indirect block currently being used, this will have the same value as lastIndirect . |
uint64_t lastIndirect | The address of the last indirect block of the file. If the file has no indirect blocks the value of this field must be zero. Example: If there is only 1 indirect block currently being used, this will have the same value as firstIndirect . |
uint64_t fork | If this file is not a fork, this field is the address of the first block of the optional fork associated with this file, or zero if this file has not a fork. If this file is a fork, this field must be zero. |
Extents[extentsPerInode] | The Extents. The count of valid extent specifications, starting from index 0, is determined by the extentCount field. Any unused extent specification is unspecified. Note that the first element of the extentStarts array is also -by definition- the address of the block containing this inode structure. |
Symbolic name | Value | Description |
---|---|---|
iaXOTH | 1 << 0 | POSIX permissions: others can execute/search in directory (same as S_IXOTH) |
iaWOTH | 1 << 1 | POSIX permissions: others can write (same as S_IWOTH) |
iaROTH | 1 << 2 | POSIX permissions: others can read (same as S_IROTH) |
iaXGRP | 1 << 3 | POSIX permissions: owner group can execute/search in directory (same as S_IXGRP) |
iaWGRP | 1 << 4 | POSIX permissions: owner group can write (same as S_IWGRP) |
iaRGRP | 1 << 5 | POSIX permissions: owner group can read (same as S_IRGRP) |
iaXUSR | 1 << 6 | POSIX permissions: owner user can execute/search in directory (same as S_IXUSR) |
iaWUSR | 1 << 7 | POSIX permissions: owner user can write (same as S_IWUSR) |
iaRUSR | 1 << 8 | POSIX permissions: owner user can read (same as S_IRUSR) |
iaSVTX | 1 << 9 | POSIX permissions: restrict rename/delete to owner (same as S_ISVTX) |
iaSGID | 1 << 10 | POSIX permissions: execute as group ID (same as S_ISGID) |
iaSUID | 1 << 11 | POSIX permissions: execute as user ID (same as S_ISUID) |
1 << 12 | Reserved. See the definition for Reserved . | |
iaSystem | 1 << 13 | Warn that this is a system file. |
iaArchive | 1 << 14 | File changed since last backup (must be set on any write, cleared on a backup) |
iaSync | 1 << 15 | Writes must be committed immediately (besides open- or mount-time flags) |
iaNoAccessTime | 1 << 16 | Do not update last access time (besides open- or mount-time flags) |
iaImmutable | 1 << 17 | Do not move file blocks (e.g. for defraggers) |
iaPrealloc | 1 << 18 | Keep any preallocated blocks beyond fileSize when the file is closed (recommended for directories) |
iaInlineExtAttr | 1 << 19 | The remaining bytes after the inode structure in the first block are reserved for inline extended attributes |
0x1FF << 20 | Reserved. See the definition for Reserved . | |
iaFmtMask | 7 << 29 | Bit mask to extract the file type |
iaFmtRegular | 1 << 29 | File type: regular file (see enum FileType) |
iaFmtDirectory | 2 << 29 | File type: directory (see enum FileType) |
iaFmtSymlink | 3 << 29 | File type: symbolic link (see enum FileType) |
iaFmtFork | 4 << 29 | File type: fork (see Extended attributes) |
Figure 3: Directory format in LEAN.
Directories
A directory is a file containing an unordered list of directory entries. Each directory entry is a structure which binds a file name with an inode structure, thus each directory entry is a hard link to a file. Directory entries do not store file metadata, except what is specified below.
A directory entry is a variable length structure, containing a fixed 12-byte header and a variable length part containing the file name. Every directory entry must be aligned to a 16-byte boundary, thus the size of the whole structure must be a multiple of 16 bytes. This helps to reuse space from deleted directory entries. A single directory entry may span across different blocks.
File name storage must be case sensitive in order to simplify name comparison, avoiding case folding to upper or lower case, that is not trivial in Unicode and may depend on the locale. This means that filename.ext
and FileName.Ext
may reside in the same directory and refer to two different files. For this reason, drivers need not to be Unicode-aware, and file name matching can be as simple as using memcmp
. Drivers may implement case insensitive file name matching in an unspecified way, for example using first-match or best-match policies. Every file name in a directory must be unique. In environments lacking support for long file names, or requiring short file name aliases, a driver may fabricate short file names on the fly, that is, without having them physically stored in the directory. The algorithm used to fabricate short file names is unspecified, but a proposed procedure is shown below.
A single file, identified by its inode number, may be represented by one or more directory entries, residing in any directory: these are the hard links to the file. A file must have at least one hard link. When a file has no more hard links, it must be deleted from a volume. Forks are an exception because, by definition, they are associated with another file and have no own directory entries, and their lifetime is specified by the files they are associated with. Directories must have exactly one parent, that is one hard link besides the ones from the special "." (dot) and ".." (dot-dot) entries, in order to prevent cycles.
This is the format of a directory entry:
Field | Description |
---|---|
uint64_t inode | The inode number of the file linked by this directory entry, that is the address of the first block of the file. Drivers must use it to locate the inode structure to do any subsequent processing on the file. |
uint8_t type | This is a shortcut to the file type, see enum FileType . A driver must keep it consistent with the file type bits in the attributes of the inode structure. |
uint8_t recLen | This is the total size (thus including the fixed length header) of the directory entry, in 16-byte units. It must be at least 1. This field must be valid even for deleted or empty directory entries, as it must be used as a link to the next directory entry. |
uint16_t nameLen | The count in bytes used to store the file name in the name field. Except for new, unused entries, this field must be greater than zero. |
char name[] | The file name. This string must not be null-terminated and must be UTF-8. Its length must be determined solely by the nameLen field, and any padding bytes optionally present after nameLen bytes are undefined. All code points are allowed. It is up to the driver to disallow the separator character from the filename. For example, a driver must disallow the forward slash when this character is used to separate directory and file names. |
The following table shows the possible values for the type
field of a directory entry. The low 7 bits hold the file type
. A value of zero must indicate that the directory entry is empty (available or presumed deleted). A file may be marked for undelete with file type = ftDeleted
which can later be undeleted by setting file type
back to the corresponding value in the inode structure, until its blocks are physically overwritten (all magic
and checksum
fields should be checked for robustness in this case). Note that values 0 to 5 are mutually exclusive enumerated values and they are not single bits that can be combined, and that -by definition- there is no type for forks. See the Undelete Capabilities section for more on undeleting files.
The high bit is the faHidden
attribute. When set, this entry is hidden from default listings. This bit must be combined with the file type
value. This bit does not represent the inode itself, it only represents this hard link to the inode. Another possible hard link to the inode may have an opposite setting.
Symbolic name | Value | Description |
---|---|---|
ftNone | = 0 | No file type, the directory entry is empty |
ftRegular | iaFmtRegular >> 29 = 1 | Regular file |
ftDirectory | iaFmtDirectory >> 29 = 2 | Directory |
ftSymlink | iaFmtSymlink >> 29 = 3 | Symbolic link |
= 4 | Reserved | |
ftDeleted | = 5 | Deleted but preserved |
faHidden | = 128 | Entry is hidden from default listings |
When ftRegular
, ftDirectory
, or ftSymlink
is used, a driver must keep the file type
field consistent with the attributes
of the inode structure. The faHidden
bit is independent of the inode structure allowing multiple hard links to the same inode to be independently marked as hidden. An appropriate mask (ftFmtMask = 0x07
) should be used when accessing the file type.
"." and ".." entries
The two special entries "." (dot) and ".." (dot-dot) must be the first and the second entry of every directory, respectively. The former must be a hard link to the directory itself, the latter must be a hard link to the parent directory. In the root directory, which has no parent, ".." must be a hard link to itself.
A driver may use fictitious "." and ".." entries ignoring the actual entries in the directory, for example to assign ".." to the parent directory of a mount point. If showing "." and ".." to the user is a concern, a driver may choose to make them hidden entries, for example those of the root directory.
Looking up a directory
This algorithm must be used to search for an entry in a directory, using appropriate file name matching criteria:
while not EOF { read a 12-byte directory header; if (((DirEntry.type & ftFmtMask) >= ftRegular) && ((DirEntry.type & ftFmtMask) <= ftSymlink)) { read the file name and compare it against the required file name; if the required entry is found return this directory entry; } advance using the information provided in recLen; } return "file not found";
Deleting an entry
To delete a hard link to a file, a driver must set the type
field of the directory entry to zero (that is ftNone
) and decrease the linkCount
field of its inode structure. The space occupied by the deleted directory entry (recLen*16
bytes) is made available for reuse (see the 4th entry in figure 3). If some degree of security is needed, a driver may zero all bytes of a directory entry, except recLen
.
If the linkCount
of the inode structure of the file reaches zero, the blocks of the file (both data and indirect) must be deleted, and, if the file has a fork, its linkCount
must be decremented, deleting the fork (both data and indirect blocks) if it reaches zero. To do this, drivers must set the corresponding bits in the bitmap to '0'. This is efficient and allows to recover data, until physically overwritten. If some degree of security is needed, a driver may also zero or fill with garbage the actual content of the deleted blocks.
See the section on Undelete Capabilities for when an entry should not be cleared, the linkCount should not be decremented, and the bitmap should not be modified.
Creating an entry
This procedure must be used to create a new hard link to a file.
Scan the directory for one or more contiguous available directory entries, enough to store the new directory entry. This is because available entries may not be coalesced together. The required size of the directory entry, in 16-byte units, must be computed as recLen = ceil((12 + nameLen) / 16)
. Available directory entries must be identified for having type == ftNone
(see above).
If enough space to store the new directory entry is found, the new directory entry should be stored in that place, otherwise the new directory entry must be appended to the directory, thus extending it. If the new directory entry is shorter than the free space found, a driver should fabricate and append a new directory entry header (with type
equal to ftNone
) to keep the link with the next directory entry consistent (see the 7th entry in figure 3). (Should being used instead of must so in the rare occasion that the remaining space is only 1 record in length (a single 16-byte block), leaving only 4 file name bytes for that entry, a driver may include that wasted space with the current entry to help speed up entry enumeration, being that a filename is rarely as short as four bytes.)
It is not required, but is recommended that if a directory contains many unused entries (type == ftNone
), due to mass deletions or other processes, these unused entries be coalesced or eliminated from the directory. An efficient directory listing has few to no free entries and is assumed to append any new entries to the listing.
Fabricating a short file name
In environments lacking support for long file names, or requiring short file name aliases, a driver may fabricate short file names. Fabricated short file names are not supposed to be stored on the file system as new hard links, but rather generated on the fly to the user when they are needed. How to fabricate short file names in such cases is unspecified. The following is an algorithm that drivers may use if appropriate:
- find the last dot, if any, in the file name, and assume it to be the separator between the name prefix and the name extension;
- remove any invalid character from the original file name;
- convert the resulting name to the required character set, if any;
- truncate the name prefix and the name extension to the required lengths;
- take the offset where the directory entry starts inside the directory containing it, and divide that offset by 16 (all directory entries are multiple of 16 bytes);
- append this number to the name prefix, to make sure it is unique in the directory (hex numbers may be used to shorten the numeric tail), truncating the name prefix further, if required, to fit a short file name.
Bad blocks
If the badInode
field in the Superblock is non-zero, it will point to a normal system file, in which the extents, both direct and indirect (if any), encompass any bad blocks on the media from block 0
to blockCount - 1
.
This is considered a pseudo file, in that all bad blocks are allocated for the file's content, including marking the bitmap accordingly, yet the actual content of the file is undefined. By doing so, this pseudo file's extents will allocate and occupy all bad blocks in the file system.
The function and structure of this file is identical to all other files, except for the fact that only the bookkeeping part of the file is stored on the disk. This being the inode structure
and its direct extents, and any optional indirect blocks
as needed. As with any other inode, this inode may contain extended attributes
, both inline and via a fork.
As for the file system, the use and function of this inode is identical to any other inode on the system, except that the actual file data is considered undefined, therefore not read or written. The inode should not be recorded in any directory entry and an implementation should not allow the inode to be deleted. If the Extended Extents option is used, the checksum
portion of any extents within this inode is undefined.
Journal block
If the journalInode
field in the Superblock is non-zero, it will point to a normal system file, treated exactly like any other data file on the volume, whose content contain information needed to create and maintain a journaling system for this volume.
This specification does not define or mandate the format or use of a Journal. The use of this file is implementation specific and outside this specification.
When a Journal is used, the Lean volume must not be modified in any way other than the file content of this Inode, therefore allowing any other implementation to mount and use this volume without relying upon this journal.
With removable media, such as USB thumb drives, a media device could be unmounted and then inserted into a non-journaling system. This new system must be able to access the whole volume without any knowledge of the journal.
With this said, that same media device can once again be re-inserted back into the original device. If this is the case, the original implementation must assume the journal data is no longer valid, and must start anew.
It is up to the implementer of this journal and outside this specification on how this is handled.
Symbolic links
A symbolic link (symlink, soft link) is a special file storing the pathname of another file. Its file type must be ftSymlink
.
The string representing the pathname must be encoded as UTF-8 and must not be null-terminated. The length of the buffer used must be equal to the file size of the symbolic link. For example, if the count of bytes used to represent the UTF-8 string is 16, the fileSize
field must be 16.
The represented pathname may be either absolute or relative. A driver or other appropriate parts of an implementation should provide the means to follow the symbolic link. In a system where pathnames may include a drive or device specification, the pathname stored by the symbolic link may contain a drive or device specification too. The way the implementation behaves in the event the drive or device referred by the symbolic link changes (for example a removable or network drive, or when the volume is used in a system not supporting drive or device specifications) is implementation specific.
The meaning of attributes for symbolic links is unspecified. A driver may apply their regular meaning to the symbolic link itself (for example, a hidden symbolic link may be not showed in a directory listing, whether the target is hidden or not, or the permissions of a symbolic link may refer to what can be done with the target when accessed via the symbolic link), or may use any conventions normally used by the system for symbolic links.
Extended attributes
Extended attributes are optional metadata, not directly interpreted by the file system, that can be attached to a file system object. They are basically name-value pairs, each of it called an extended attribute record.
The LEAN file system allows to store extended attributes inline, right after the inode structure of each file, before file data, or in a separate fork of the file. In LEAN a fork is just a file containing extended attribute records, associated with another file system object, and with no directory entries. Inline extended attributes allow to store only a limited amount of metadata, but offer better performance. A fork allows to store much more metadata (theoretically forks can be as big as regular files), but requires additional disk access.
Any kind of name-value pairs can be stored as extended attributes. The freedesktop.org project provides recommendations for a portable set of extended attributes.
Whether an extended attribute record is stored in the extra space after the inode structure or in a fork, it is a variable-length structure with a 32-bit header, with size constrained to multiples of 4 bytes, and the following format:
Field | Description |
---|---|
uint32_t header | The fixed header of the record. It must be interpreted as 0xNNRVVVVV , where the highest 8 bits (31..24), NN , represent the length in bytes of the attribute name, the next 4 bits (23..20) represent a Reserved field, and the lowest 20 bits (19..0), VVVVV , represent the size in bytes of the attribute value. If NN is 0, this is an empty record used for padding (perhaps a deleted record), and VVVVV indicates the size of the whole record, excluding the 32-bit header. The Reserved field is Implementation Specific and a driver must preserve this field when not in use. |
char name[] | The attribute name in UTF-8, is a case-sensitive string that must uniquely identify what this attribute means, possibly formatted as a dot-separated name-spaced name. The null character is not allowed for compatibility with existing extended attribute APIs. |
uint8_t value[] | The attribute value. It may have any format that makes sense to the application that manages the attribute. If it is text, it should default to UTF-8, though is not manditory since the attribute itself may instruct otherwise. If NN + VVVVV is not a multiple of 4, undefined padding bytes must be added after the value to reach a multiple of 4 bytes, zeros are recommended. |
If the file has the iaInlineExtAttr
attribute set, the extra space after the inode structure to the end of the block must be completely filled with contiguous extended attribute records, perhaps padding ones. This means that, at the very least, a single record whose header has NN = 0
and VVVVV = blockSize - inodeSize - sizeof(uint32_t)
must be present, meaning that the whole extra space is unused. Drivers should prefer inlining extended attributes that are priority, frequently used and small.
If the fork
field of the inode structure is nonzero, that inode has an associated fork whose first block is specified by fork
. That fork is a file system object, thus it must have its own inode structure. That inode structure must have the iaInlineExtAttr
attribute not set, file type iaFmtFork
and the fork
field must be set to zero. The fileSize
field of the inode structure of the fork must indicate how many bytes of extended attribute records are present in the fork. The fork must be made of a contiguous sequence of extended attribute records, perhaps padding ones.
Drivers are not required to understand extended attributes, and must not rely on them being present. Fundamental metadata must be stored in the appropriate fields of the inode structure. A driver not supporting extended attributes must leave them untouched, but must be able to delete them when the inode they are associated with is deleted. This means, at least, marking any fork blocks (data and indirect) as free in the bitmap. Drivers should not make any modification to the content of extended attributes they do not know about.
If the size of all extended attributes of a file is reasonably small, drivers may opt to cache all of them in memory, and write them back to disk as a whole. Drivers may rearrange the sequence of extended attribute records, provided that the constraints above are enforced. If using a fork, drivers may want to store all extended attributes in a single extent whenever possible. Drivers should try to allocate space for the fork as close as possible to the inode structure of the file system object the fork is associated with.
When enumerating extended attributes, drivers should start from the inline ones, if any, then continue with those in the fork, if any. When saving extended attributes, drivers should provide to applications a way (such as a "priority" flag) to specify whether the attribute should be saved inline if possible, or in a fork. Drivers may implement policies to restrict editing of certain extended attributes to system services only (for example, attribute names starting with "system.
"). An application of this could be using one extended attribute record to implement Access Control Lists: in this case, drivers may want to place that record inline and forbid their editing to unprivileged users.
A single attribute must not span from Inode to fork. If an attribute does not fit within the space after the inode, it must be placed in the fork. i.e.: The first four bytes of the fork's data stream is a 32-bit header
, the start of a new attribute, while the last few bytes of the space after the inode (if used) is a padding entry, and in the rare case that there is only four bytes remaining, a header
value of 0x00r00000
must be used.
Extended capabilities
The capabilities
field in the Superblock is included to allow extended capabilities to a Lean volume, though still allowing a simple Lean volume to be implemented. If no extended capabilities are included on the volume, the volume can still be mounted with a simple driver.
A driver should check to see that it supports all extended capabilities indicated by the capabilities
field before mounting the volume. If any bit is set that a driver does not support, it should refuse to mount the volume.
Currently, only one extended capability is defined.
Symbolic name | Value | Description |
---|---|---|
ecExtExtents | 1 << 0 | Extended Extents are used. |
Extended Extents
An extended extent adds a single member to the standard extent and is formatted as shown below. This extended extent with a count equal to 1, occupies 16 bytes. (extentSize = 16
)
uint64_t extentStarts[count] | The array of starting block addresses of the extent specifications. |
uint32_t extentSizes[count] | The array of extent sizes of the extent specifications. Each element represents the size in blocks of the corresponding extent whose first block is specified in extentStarts . |
uint32_t extentChecksum[count] | The 32-bit checksum of all bytes in this extent with exception of the first extent (see below). All bytes of the extent are calculated in the checksum even if these bytes exceed the end of the file. |
All standard extents, both in the inode and in any used indirect block, are changed to use this format. To keep the inode the same size (inodeSize = 200
), extentsPerInode
has been changed to 6 instead of 8 when this Extended Capability is used. A driver must also calculate the extentsPerIndirect
with the formula already given above, since the extentSize
value has changed.
Since these extent checksums are stored in a block that is part of the extent list, a race condition takes place when calculating the checksum of a list of checksums. Therefore, and since the inode has its own checksum, the inode (inodeSize
bytes) in the first block of the first extent is not included in the first extent's checksum. All bytes after the inode in the first block and all remaining blocks (if any) in this first extent are included in the checksum of the first extent. All bytes in any remaining extents, are included in their respected extent's checksum, including any bytes past end of file
. Since the indirect block
is not part of an extent definition, there is no race condition within the indirect block
's list of checksums.
Note that using this type of extent decreases the count of extents available in the inode and in any indirect blocks used, possibly requiring more bookkeeping blocks to be used.
A driver may check the checksum field to ensure that the extent was read correctly. A driver must update the checksum when an extent is written to the disk.
Undelete Capabilities
By definition, a driver can maintain the ability to undelete files from a Lean volume. A driver simply has to mark the directory entry as deleted (file type = ftDeleted
), and not free the used extents within the respected bitmap(s), in essence, leaving the file completely intact. Note that the inode's linkCount is not decremented and the entry should not be listed in a default listing.
As long as a driver preserves this now deleted directory entry, at any time this deleted entry can be restored by simply reading in the Inode (specified in the deleted entry) to assess the original file type, and the file can be completely restored. Note that the checksum
and magic
fields must be verified as well as other aspects of validating an inode and its content.
However, this process means that when deleting files, no space is ever freed from the volume. Meaning that the volume, depending on its size, can be used up in a fair amount of time. With this in mind, a driver can use a process to find dated deleted files, and free them from the system. This specification does not specify when and how this is done.
This feature should not be used by default. i.e.: By default, a file should be marked deleted (file type = 0
). Only sensitive files should be marked for undelete (file type = ftDeleted
). It is outside this specification on defining the type of file to mark as undelete capable.
Transactional Updates
A driver can make sure a disk's volume will never be in an inconsistent state, even if a power loss is encountered, as long as the media device guarantees that it will complete a successful write to the current block being written, here being a suggested block size equals sector size. To do so, when a file is opened for write access, a driver can maintain a list of the extents in memory throughout the life of the opened file. When a previously existing extent is modified, it can write this extent to new free blocks on the volume. At a safe time, file-close for example, it can update the inode
's or indirect block
's extent list, being sure to mark the old extent as free, repeating the process for any continued modification to the file until the file is closed.
With this technique, with a power loss or other fatal event, the worst-case scenario is that you have some orphaned extents (old or new) marked used in a bitmap. However, all on-disk files and their extent list(s) are still intact. A check-up utility could easily find these orphaned extents and mark them as free.