Additional Information about the Journaling File System

This page contains more details about how the journaling file system for Assignment 8 is implemented, including the overall disk layout, the structure of the log, and how the file system ensures the integrity of log entries. This material is not strictly required to complete Assignment 8, but you may find it useful and/or interesting.

We also recommend that you read some or all of the code that implements the V6 file system; it's a great example of a professional-quality C++ project and it uses many advanced C++ features.

Journaling V6 File System Layout

The on-disk layout of the journaling V6 file system consists of the following sections:

  1. The boot block (1 sector)
  2. The superblock (1 sector)
  3. Inodes (variable-length)
  4. Data blocks (variable-length)
  5. The log header (1 sector)
  6. The free-block bitmap or freemap (variable-length)
  7. The log (variable-length)

Sections 1-4 are identical or nearly identical to the original V6 file system layout, which you used in Assignment 7. Their layout is defined by data structures and constants in the header file layout.hh.

Sections 5-7 were added for this assignment. Their layout is defined by the logentry.hh header file. Since nobody runs the V6 file system on raw hard disks any more (V6 file system images exist only as files on other file systems), we simply extend the file containing a V6 file system image to accommodate the new areas required by journaling.

The boot block

This block is essentially unused, but the first two bytes must be 4 and 7 or mountv6 will not recognize the disk image.

The superblock

The superblock is essentially the same as in 1975, except that we've coopted two of the unused bytes to add two more fields:

struct filsys {
    ...
    uint8_t  s_uselog;        // Use log
    uint8_t  s_dirty;         // File system was not cleanly shut down
    ...
};

The s_uselog byte, when non-zero, indicates that our new 21st-century logging functionality has been enabled for this particular file system volume.

The s_dirty field is set to 1 whenever the file system is opened and set to 0 when it is closed cleanly. If your file system crashes, s_dirty will still be 1, and mountv6 will refuse to mount it again until you have repaired the damage. (If you want to see a corrupt file system in action, you can forcibly mount a dirty volume with the --force flag, as in ./mountv6 --force v6.img mnt.)

There's one more change to the superblock. Originally, V6 stored free blocks in a linked list of 100-entry chunks. The superblock stored s_nfree free block numbers (0-100) in an array s_free. If s_free[0] was non-zero, it pointed to a disk block holding 100 more free block numbers, and so on. This format is not convenient for journaling, so we set s_nfree to zero and instead dedicate a portion of the disk to a bitmap containing one bit per data block, where 1 indicates the block is free and 0 indicates it is allocated. For brevity, we refer to the "free-block bitmap" as the freemap and store it in a structure field called freemap_.

V6 also caches up to 100 free inode numbers in the superblock, in an array called s_inode. The superblock is kept in memory during normal operation, so this speeds up inode allocation: as long as free inodes are cached in the superblock, there is no need to search the on-disk inode array for a free one. The number of valid entries is s_ninode. When it reaches zero, the V6 code simply re-scans the inode area to find up to 100 unallocated inodes. When repairing an unclean volume, we can avoid inconsistencies in inode allocation by setting s_ninode to 0. As long as we have properly repaired the inodes and the IALLOC flags in the i_mode fields are all correct, the file system will re-scan the inode area the next time it needs to allocate an inode.
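The cache-and-rescan behavior described above can be sketched as follows. This is a hypothetical model, not the V6 code. InodeCache stands in for the two superblock fields, kInodeCacheSize for the 100-entry limit, and a std::vector<bool> for the IALLOC flags of inodes 1..n. The sketch shows why zeroing s_ninode during repair is safe: the next allocation rebuilds the cache from the IALLOC flags alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache capacity, mirroring the 100-entry s_inode array.
constexpr uint16_t kInodeCacheSize = 100;

// Hypothetical stand-in for the two superblock fields involved.
struct InodeCache {
    uint16_t s_ninode = 0;                   // Number of valid cached entries
    uint16_t s_inode[kInodeCacheSize] = {};  // Cached free inode numbers
};

// ialloc[i] models the IALLOC flag of inode i + 1.
// Returns a newly allocated inode number, or 0 if no inode is free.
uint16_t alloc_inode(InodeCache &sb, std::vector<bool> &ialloc) {
    if (sb.s_ninode == 0) {
        // Cache empty: rescan the inode area for up to kInodeCacheSize free inodes.
        for (size_t i = 0; i < ialloc.size() && sb.s_ninode < kInodeCacheSize; ++i)
            if (!ialloc[i])
                sb.s_inode[sb.s_ninode++] = uint16_t(i + 1);
        if (sb.s_ninode == 0)
            return 0;  // No free inodes anywhere
    }
    uint16_t inum = sb.s_inode[--sb.s_ninode];
    ialloc[inum - 1] = true;  // Mark allocated (real code also initializes the inode)
    return inum;
}
```

Setting s_ninode to 0, as the repair procedure does, simply forces the rescan branch on the next call.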

Inodes

There is no change to the inode section. It's a giant array of inode structures, where entry 0 of block INODE_START_SECTOR (2) of the file system corresponds to inode number 1.
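The following sketch maps an inode number to the disk block and slot that hold it. It assumes the 32-byte V6 inode, which gives 16 inodes per 512-byte sector. The names InodeLocation, locate_inode, and INODES_PER_SECTOR are hypothetical and are not taken from layout.hh.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values: INODE_START_SECTOR is 2, per the text; a V6 inode is
// 32 bytes, so one 512-byte sector holds 16 of them.
constexpr uint32_t INODE_START_SECTOR = 2;
constexpr uint32_t INODES_PER_SECTOR = 512 / 32;

struct InodeLocation {
    uint32_t sector;  // Disk block holding the inode
    uint32_t slot;    // Index of the inode within that block
};

// Inode numbers start at 1, so inode 1 is slot 0 of INODE_START_SECTOR.
InodeLocation locate_inode(uint32_t inum) {
    assert(inum >= 1);  // There is no inode 0
    uint32_t index = inum - 1;
    return {INODE_START_SECTOR + index / INODES_PER_SECTOR,
            index % INODES_PER_SECTOR};
}
```

For example, inode 17 is the first inode in block 3.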

Data blocks

There is no change to this section of the file system.

Log header

The log header is the first block of the log area of the disk. Its contents are defined by the loghdr structure in logentry.hh:

struct loghdr {
    uint32_t l_magic;       // LOG_MAGIC_NUM
    uint32_t l_hdrblock;    // Block number containing loghdr structure
    uint16_t l_logsize;     // Total size (in blocks) of loghdr, freemap, and log area
    uint16_t l_mapsize;     // Number of blocks in freemap
    uint32_t l_checkpoint;  // Disk offset of log checkpoint
    lsn_t l_sequence;       // Expected LSN at checkpoint

    uint32_t mapstart() const { return l_hdrblock + 1; }
    uint32_t logstart() const { return mapstart() + l_mapsize; }
    uint32_t logend() const { return l_hdrblock + l_logsize; }  // l_logsize includes loghdr and freemap
    uint32_t logbytes() const {
        return SECTOR_SIZE * (l_logsize - l_mapsize - 1);
    }
};

The first two fields are sanity checks to ensure you really have a loghdr. The l_magic field contains the constant LOG_MAGIC_NUM (0x474c0636, a randomly generated value). The l_hdrblock field contains the block number where the loghdr structure is actually stored. Because the magic number was randomly generated, it is unlikely to appear by chance in ordinary files. The l_hdrblock check ensures that a valid loghdr copied into a file still won't look like a valid loghdr, because it will be at the wrong disk location.
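A check of both fields might look like this sketch, which reads l_magic and l_hdrblock directly from a raw sector buffer. It assumes the fields sit at byte offsets 0 and 4 (as the declaration order of loghdr implies) and are stored in host byte order. looks_like_loghdr is a hypothetical helper, not part of the starter code.

```cpp
#include <cstdint>
#include <cstring>

constexpr uint32_t LOG_MAGIC_NUM = 0x474c0636;

// Hypothetical helper: `sector` points to a 512-byte buffer read from disk
// block `blockno`. Assumes l_magic is at byte offset 0 and l_hdrblock at
// offset 4, both in host byte order.
bool looks_like_loghdr(const uint8_t *sector, uint32_t blockno) {
    uint32_t magic, hdrblock;
    std::memcpy(&magic, sector, sizeof magic);
    std::memcpy(&hdrblock, sector + 4, sizeof hdrblock);
    // Both checks must pass. A random block rarely holds the magic number, and a
    // copied loghdr records a block number that differs from where it now sits.
    return magic == LOG_MAGIC_NUM && hdrblock == blockno;
}
```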

l_logsize specifies how many blocks have been added to the V6 file system for the new feature set. This size includes sections 5-7 of the file system (the log header, freemap, and journal). With the new feature set enabled, the total number of 512-byte blocks of the disk image should now be s_fsize (from the superblock) plus l_logsize.

l_mapsize specifies how many blocks are used for the freemap. This must be large enough to contain at least 1 bit for each block number from the start of data blocks (at INODE_START_SECTOR+s_isize) to the end of the data blocks (at s_fsize).
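The sizing rules above reduce to simple arithmetic. In this sketch, min_mapsize and image_blocks are hypothetical helper names, and the constants are assumed to match layout.hh. Each 512-byte freemap block holds 4,096 bits, so the minimum l_mapsize is the number of data blocks divided by 4,096, rounded up.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t SECTOR_SIZE = 512;
constexpr uint32_t INODE_START_SECTOR = 2;
constexpr uint32_t BITS_PER_MAP_BLOCK = 8 * SECTOR_SIZE;  // 4096 bits per block

// Smallest l_mapsize holding one bit per data block, where data blocks run
// from INODE_START_SECTOR + s_isize up to (but not including) s_fsize.
uint32_t min_mapsize(uint32_t s_fsize, uint32_t s_isize) {
    uint32_t datastart = INODE_START_SECTOR + s_isize;
    assert(datastart <= s_fsize);
    uint32_t ndata = s_fsize - datastart;
    return (ndata + BITS_PER_MAP_BLOCK - 1) / BITS_PER_MAP_BLOCK;  // Round up
}

// Expected total image size, in 512-byte blocks, once logging is enabled.
uint32_t image_blocks(uint32_t s_fsize, uint32_t l_logsize) {
    return s_fsize + l_logsize;
}
```

For instance, with s_fsize = 65535 and s_isize = 1024, data blocks start at 1026, there are 64,509 of them, and at least 16 freemap blocks are required.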

Finally, l_checkpoint and l_sequence tell you where to start reading the log and what log sequence number to expect. l_checkpoint is the byte offset of the first log record you should read. (Because it's a byte offset, it should lie somewhere in the half-open interval [logstart()*SECTOR_SIZE, logend()*SECTOR_SIZE).) Each log entry has a consecutively assigned 32-bit log sequence number, or LSN, of type lsn_t. l_sequence tells you the LSN of the first log record you should expect to read at offset l_checkpoint. If you see the wrong LSN, even in what appears to be a valid log record, you must stop processing the log, because you may be looking at an old log entry.
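The bounds check on l_checkpoint can be written as follows. The function name and parameter types are assumptions. logstart and logend are block numbers as returned by the loghdr methods, and the byte bounds are computed in 64 bits as a precaution against overflow.

```cpp
#include <cstdint>

constexpr uint64_t SECTOR_SIZE = 512;

// Hypothetical helper: true if byte offset `checkpoint` lies in the half-open
// interval [logstart * SECTOR_SIZE, logend * SECTOR_SIZE), i.e. in the log area.
bool checkpoint_in_log(uint32_t checkpoint, uint32_t logstart, uint32_t logend) {
    uint64_t lo = uint64_t(logstart) * SECTOR_SIZE;
    uint64_t hi = uint64_t(logend) * SECTOR_SIZE;
    return checkpoint >= lo && checkpoint < hi;
}
```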

Freemap (free-block bitmap)

The free-block bitmap has one bit for every block in section 4 of the file system (data blocks). If the bit is 1, the block is free. If the bit is 0, the block is in use.
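Testing and updating a freemap bit might look like the sketch below. Freemap is a hypothetical class, not the starter code's freemap_ field. It assumes bit 0 describes the first data block (INODE_START_SECTOR + s_isize) and that bits are packed least-significant-bit first within each byte. The real layout may index bits differently, so check the starter code before relying on it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical freemap: 1 = free, 0 = allocated. Starts with every block allocated.
class Freemap {
public:
    Freemap(uint32_t datastart, uint32_t fsize)
        : datastart_(datastart), fsize_(fsize),
          bits_((fsize - datastart + 7) / 8, 0) {}

    bool is_free(uint32_t blockno) const {
        uint32_t i = index(blockno);
        return bits_[i / 8] & (1u << (i % 8));
    }

    void set_free(uint32_t blockno, bool free) {
        uint32_t i = index(blockno);
        uint8_t mask = uint8_t(1u << (i % 8));
        if (free)
            bits_[i / 8] |= mask;
        else
            bits_[i / 8] &= uint8_t(~mask);
    }

private:
    // Bit index of a data block (assumed relative to the first data block).
    uint32_t index(uint32_t blockno) const {
        assert(blockno >= datastart_ && blockno < fsize_);  // Data blocks only
        return blockno - datastart_;
    }

    uint32_t datastart_, fsize_;
    std::vector<uint8_t> bits_;
};
```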

The log

The log consists of a series of log entries, each of which begins with a header, followed by a payload and then a footer. Conceptually, the three parts look like this:

struct Header {
    lsn_t sequence;         // First copy of the LSN
    uint8_t type;           // What type this is (entry_type index)
} header;

entry_type payload;         // One of the Log Entries structures, chosen by header.type

struct Footer {
    uint32_t checksum;      // CRC-32 of header and payload
    lsn_t sequence;         // Another copy of the LSN
} footer;

It's important to note that, while the beginning of the log will generally be valid, there isn't a clean end. The log just turns to garbage after the last successful write; the information after the last good record might not even be recognizable as a log record.

To help identify the end of the log, the header contains a log sequence number, or LSN. This is a consecutively assigned counter that is unique for each log entry. LSNs are very important because the log area is repeatedly recycled. Hence, whatever bytes are left over from the last iteration may be old log entries or may look like valid log entries. These old entries must not be applied to the file system; doing so could damage the file system. Even a valid old entry might overwrite file data in a current file data block that was previously a directory or indirect block.
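The stopping rule can be sketched over records that have already been decoded and integrity-checked. DecodedRecord and valid_prefix are hypothetical names. lsn_t is assumed to be a 32-bit unsigned integer, as described for the log header.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using lsn_t = uint32_t;  // Assumed: 32-bit LSN

// Hypothetical decoded record: `intact` means its footer LSN and checksum
// verified; `lsn` is the LSN from its header.
struct DecodedRecord {
    lsn_t lsn;
    bool intact;
};

// Counts how many records, starting at the checkpoint, belong to the current
// log. Each must be intact and carry exactly the next expected LSN. The first
// record failing either test is the end of the log, even if later bytes
// happen to look valid.
size_t valid_prefix(const std::vector<DecodedRecord> &records, lsn_t l_sequence) {
    lsn_t expected = l_sequence;
    size_t n = 0;
    for (const DecodedRecord &r : records) {
        if (!r.intact || r.lsn != expected)
            break;
        ++n;
        ++expected;  // LSNs are assigned consecutively
    }
    return n;
}
```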

The header also contains a type byte, which specifies the type of log entry and hence how to parse the payload. The payload depends on the type and is one of the structures defined later in this writeup. The payload immediately follows the type.

Finally, the log entry ends with a footer. The footer contains another copy of the sequence number. The second copy helps detect situations in which a log entry spans a sector boundary and the part of the log entry in the first sector got written while the part in the second contains information from an old log record. There's still the possibility that the "right" log sequence number for the footer could appear in arbitrary payload data from an older log record. Hence, as an additional check, the footer also includes a checksum over the header and payload. The check reduces the probability of interpreting garbage as a log record by an additional factor of 2^-32.
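Below is a sketch of these footer checks. It assumes, hypothetically, that an entry is serialized with no padding as a 4-byte LSN, a 1-byte type, the payload, a 4-byte checksum, and a second 4-byte LSN, all in host byte order. It also assumes the checksum is the common CRC-32 (reflected polynomial 0xEDB88320, as in zlib). The real encoding in logentry.hh may differ. The sketch further assumes the caller already knows the entry's total length, which in practice comes from parsing the payload according to its type. crc32_ieee and entry_ok are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant).
uint32_t crc32_ieee(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Assumed layout: [lsn:4][type:1][payload:n][checksum:4][lsn:4].
// True if both LSN copies equal `expected` and the checksum over
// header + payload is correct.
bool entry_ok(const std::vector<uint8_t> &rec, uint32_t expected) {
    constexpr size_t HDR = 5, FTR = 8;
    if (rec.size() < HDR + FTR)
        return false;
    size_t body = rec.size() - FTR;  // Bytes covered by the checksum
    uint32_t lsn1, sum, lsn2;
    std::memcpy(&lsn1, rec.data(), 4);
    std::memcpy(&sum, rec.data() + body, 4);
    std::memcpy(&lsn2, rec.data() + body + 4, 4);
    return lsn1 == expected && lsn2 == expected &&
           sum == crc32_ieee(rec.data(), body);
}
```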

Log Entries

There are 6 different types of log entry that you must process. Three of them, LogPatch, LogBlockAlloc, and LogBlockFree, are discussed in the main assignment writeup. The others are described below.

Transaction begin

struct LogBegin {
};

Ensuring file system consistency often requires writing to multiple disk locations atomically. For example, creating a file requires writing a directory entry and initializing the inode. Appending a block to a file requires writing a block pointer in the inode or indirect block and marking the block as no longer free in the freemap. Doing only one of these writes would leave the file system in an inconsistent state. For example, if an inode points to a block that is still marked free, the block may get cross-allocated to another inode. If a directory entry is written but the inode is not initialized, the file system will assume that garbage data in the inode is meaningful, which is likely to cause misbehavior (e.g. the inode could appear to contain random disk blocks).

To avoid such problems, log entries that modify file system state are grouped into atomic transactions, for which either all the log entries are applied or none are. Transactions are bracketed by a LogBegin entry at the beginning and a LogCommit entry at the end. The LogBegin payload itself contains no data. However, you must not apply any subsequent records unless there is also a matching LogCommit entry. If the LogCommit entry is missing, it means the system crashed in the middle of writing to the log, and applying the subsequent records would likely leave the file system in an inconsistent state.

Note that even though the LogBegin payload contains zero bytes, the entry still has a header and footer. The header contains an LSN and a type byte signifying that this is a LogBegin, while the footer ensures the integrity of the log entry as usual.

In the assignment, we do the check for a LogCommit for you and don't even call your apply methods unless the log entries are part of a complete transaction.

Transaction commit

struct LogCommit {
    uint32_t sequence;          // LSN of LogBegin
};

The LogCommit entry indicates the end of an atomic transaction, consisting of all log entries since the previous LogBegin. Its sequence field must contain the LSN of the corresponding LogBegin entry. If it does not, the log has been corrupted and you must stop processing it.
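The Begin/Commit matching rule, which the starter code already performs for you, can be modeled on a simplified record stream. Kind, Rec, and committed_ops are hypothetical. "Op" records stand in for LogPatch, LogBlockAlloc, and LogBlockFree entries, and other entry types such as LogRewind are omitted.

```cpp
#include <cstdint>
#include <vector>

using lsn_t = uint32_t;  // Assumed: 32-bit LSN

enum class Kind { Begin, Op, Commit };

// Hypothetical record: its kind, its header LSN, and (for commits only)
// the LSN of the LogBegin it claims to close.
struct Rec {
    Kind kind;
    lsn_t lsn;
    lsn_t begin_lsn;  // Meaningful only when kind == Kind::Commit
};

// Returns the LSNs of operations in committed transactions, in log order.
// Stops at a commit that does not match the open LogBegin (corrupted log).
// Operations of a transaction with no LogCommit are discarded, and in this
// sketch operations outside any transaction are ignored.
std::vector<lsn_t> committed_ops(const std::vector<Rec> &log) {
    std::vector<lsn_t> applied, pending;
    bool open = false;
    lsn_t begin = 0;
    for (const Rec &r : log) {
        switch (r.kind) {
        case Kind::Begin:
            open = true;
            begin = r.lsn;
            pending.clear();  // Drop any unterminated earlier transaction
            break;
        case Kind::Op:
            if (open)
                pending.push_back(r.lsn);
            break;
        case Kind::Commit:
            if (!open || r.begin_lsn != begin)
                return applied;  // Mismatched commit: stop processing the log
            applied.insert(applied.end(), pending.begin(), pending.end());
            pending.clear();
            open = false;
            break;
        }
    }
    return applied;  // Anything still pending lacked a LogCommit
}
```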

Log rewind

struct LogRewind {
};

This indicates that the log has wrapped around and the next entry is back at the beginning of the log area. (An alternative would be to always write up to the very last byte of the log area before wrapping around, but then some log entries would be split across the end and the beginning of the log area and would not be contiguous on disk. Contiguous entries facilitate debugging: you can always start examining the log from the beginning of the log area and see some valid operations that happened, whether before or after the latest checkpoint.)
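On the writing side, the decision to emit a LogRewind might be sketched as follows. Placement and place_entry are hypothetical. Offsets are byte offsets into the image, [start, end) is the log area, and rewind_size is the assumed on-disk size of a LogRewind entry (header, empty payload, footer). The writer always keeps enough room to write a LogRewind.

```cpp
#include <cassert>
#include <cstdint>

struct Placement {
    bool rewind;      // Write a LogRewind at the current position first
    uint64_t offset;  // Where the new entry itself goes
};

// Places an entry of `size` bytes, given current write position `pos`.
Placement place_entry(uint64_t pos, uint64_t size, uint64_t start,
                      uint64_t end, uint64_t rewind_size) {
    assert(start + size + rewind_size <= end);  // Entry must fit at all
    if (pos + size + rewind_size <= end)
        return {false, pos};  // Fits, still leaving room for a later LogRewind
    assert(pos + rewind_size <= end);  // Room was reserved for the LogRewind
    return {true, start};  // Wrap: LogRewind at pos, entry at start of log area
}
```

A reader mirrors this: on seeing a LogRewind, it continues at the start of the log area. The sketch deliberately ignores whether wrapping would overwrite entries that the checkpoint still needs.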

Tools Reference

Here is a full list of the tools included in this assignment's starter code. Not all are required to be used for this assignment, but they offer cool ways to explore logging file systems!

  • mkfsv6 image-file [num-blocks [num-inodes [num-journalblocks]]] -- This tool creates a new file system. It uses the maximum size of 65,535 blocks (~32 MiB) and one inode per 2KiB by default, but you can specify different parameters if you prefer. By default it does not create a log area, expecting you to do this separately with mountv6 -j. However, you can specify a number of journal blocks, and it will create the journal area. If you specify 0, it will use a default size for the journal.

  • dumplog image-file [offset | c ] -- Pretty-prints the log from the beginning, or from a specified numeric file offset. If you use the letter c instead of a numeric offset, prints from the checkpoint.

  • v6 -- A tool with various subcommands for examining the file system state. It expects your disk image to be called v6.img. You can change this with the V6IMG environment variable (e.g., V6IMG=test.img ./v6 ls / for a single invocation or export V6IMG=test.img for all future invocations). Some useful subcommands:

    • v6 dump -- look at the superblock and log header.

    • v6 stat {path | #i-number}... -- shows the contents of particular inodes. You can specify inodes as pathnames or as # prefixed to the i-number. Be aware that # is a comment character for most shells, so you may need to quote it. (Example: ./v6 stat / or ./v6 stat '#1' to see the root directory inode.)

    • v6 ls [-a] path... -- lists a file or directory in a format similar to ls -alni. With the -a flag, shows time of last access instead of modification (similar to ls -alniu).

    • v6 cat path... -- look at the contents of a file.

    • v6 iblock block-number... -- dumps the contents of one or more indirect blocks, showing the non-zero block pointers.

    • v6 block block-number... -- dumps the raw contents of one or more file system blocks in a side-by-side hex and ASCII format with 16 bytes per line. This is convenient for directories because the direntv6 structure is exactly 16 bytes. (If you find this format generally useful, you can achieve something similar with the Unix shell command od --endian=big -t x4z -N 512 -j block-numberb, where the b suffix tells od to count block-number in 512-byte units. However, v6 block replaces unprintable characters with a space instead of a dot, making it easier to identify the . and .. entries in directories.)

    • v6 usedblocks -- lists all allocated block numbers. If you are leaking blocks, this can help you identify which blocks you are failing to mark free.

    • v6 usedinodes -- like usedblocks, but lists all allocated inodes. Useful if you are leaking inodes.

  • fsckv6 [-y] image-file -- Reports what's wrong with the file system. If you use the -y flag, it will bring the file system back to a consistent state. However, it doesn't have any support for logging and will disable the log. (You can always recreate the log by supplying the -j flag to mountv6.)