File System Crash Recovery

Optional readings for this topic from Operating Systems: Principles and Practice: Chapter 14 up through Section 14.1.

The problem: crashes can happen anywhere, even in the middle of critical sections:

Lost data: information cached in main memory may not have been written to disk yet.
- E.g., original Unix: up to 30 seconds' worth of changes.
Inconsistency:
- If a modification affects multiple blocks, a crash could occur when some of the blocks have been written to disk but not the others.
- Examples:
  - Adding block to file: free list was updated to indicate block in use, but inode wasn't yet written to point to block.
  - Creating link to a file: new directory entry refers to inode, but reference count wasn't updated in inode.
- The block cache may reorder writes.
Ideally, we'd like something like an atomic operation where multi-block operations happen either in their entirety or not at all.

Approach #1: check consistency during reboot, repair problems

Example: Unix fsck ("file system check")

fsck runs during every system boot.
It first checks whether the system was shut down cleanly; if so, there is no more work to do.
If the system didn't shut down cleanly (e.g., system crash or power failure), fsck scans the disk contents, identifies inconsistencies, and repairs them.
Example: block in file and also in free list
Example: reference count for an inode doesn't match the number of links in directories
Example: block in two different files
Example: inode has a reference count > 0 but is not referenced in any directory.
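
The four example checks above can be sketched on a toy in-memory model. This is only an illustration, not how the real fsck is implemented: the dictionary layout and the name check_consistency are assumptions made for this example, and the sketch only reports problems rather than repairing them.

```python
from collections import Counter


def check_consistency(inodes, directories, free_blocks):
    """Return a list of inconsistencies found in a toy file system.

    inodes:      {inode_number: {"refcount": int, "blocks": [block numbers]}}
    directories: {directory_name: {entry_name: inode_number}}
    free_blocks: set of block numbers on the free list
    (All three structures are hypothetical stand-ins for on-disk data.)
    """
    problems = []

    # Count the directory entries that refer to each inode.
    links = Counter(
        inum for entries in directories.values() for inum in entries.values()
    )

    # Record which inodes claim each block.
    owners = {}
    for inum, inode in inodes.items():
        for block in inode["blocks"]:
            owners.setdefault(block, set()).add(inum)

    for block in sorted(owners):
        if block in free_blocks:
            problems.append(f"block {block} is in a file and also in the free list")
        if len(owners[block]) > 1:
            problems.append(
                f"block {block} is in more than one file: inodes {sorted(owners[block])}"
            )

    for inum in sorted(inodes):
        refcount = inodes[inum]["refcount"]
        found = links[inum]
        if refcount > 0 and found == 0:
            problems.append(f"inode {inum} has refcount {refcount} but no directory entry")
        elif refcount != found:
            problems.append(f"inode {inum} has refcount {refcount} but {found} link(s)")

    return problems


example_inodes = {
    1: {"refcount": 1, "blocks": [10, 11]},
    2: {"refcount": 2, "blocks": [11, 12]},
    3: {"refcount": 1, "blocks": []},
}
example_directories = {"/": {"a": 1, "b": 2}}
example_free = {12, 13}

for problem in check_consistency(example_inodes, example_directories, example_free):
    print(problem)
```

Note that the check has to visit every inode and every directory entry, which is why recovery time grows with the size of the disk.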

Limitations of fsck:

Restores disk to consistency, but doesn't prevent loss of information; system could end up unusable.
Security issues: a block could migrate from the password file to some other random file.
Can take a long time: can't restart system until fsck completes. As disks get larger, recovery time increases.

Approach #2: ordered writes

Prevent certain kinds of inconsistencies by making updates in a particular order.

For example, when adding a block to a file, first write back the free list so that it no longer contains the file's new block.
Then write the inode, referring to the new block.
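What can we say about the system state after a crash? At worst the block is leaked (no longer in the free list, but not referenced by any file); it can never be both free and part of a file.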
Some general rules:
- Make sure a block is properly initialized (e.g., indirect block) before storing a pointer to it.
- Nullify existing pointers to a resource before reusing the resource.
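
To make the ordering argument concrete, here is a small crash simulation of the "add a block to a file" example. It is a sketch under simplifying assumptions: ToyDisk, Crash, and state_after_crash are hypothetical names, each write is treated as atomic, and the "disk" holds just one free list and one inode.

```python
class Crash(Exception):
    """Raised to simulate the machine crashing before a write reaches disk."""


class ToyDisk:
    """Hypothetical disk holding one free list and one inode's block pointers."""

    def __init__(self, free_blocks, crash_after=None):
        self.free = set(free_blocks)
        self.inode_blocks = []
        self.writes = 0
        # Number of writes that succeed before the simulated crash.
        self.crash_after = crash_after

    def _write(self):
        if self.crash_after is not None and self.writes >= self.crash_after:
            raise Crash()
        self.writes += 1

    def write_free_list(self, free_blocks):
        self._write()
        self.free = set(free_blocks)

    def write_inode(self, blocks):
        self._write()
        self.inode_blocks = list(blocks)


def add_block_ordered(disk, block):
    # Step 1: take the block off the on-disk free list first.
    disk.write_free_list(disk.free - {block})
    # Step 2: only then make the inode point at the block.
    disk.write_inode(disk.inode_blocks + [block])


def state_after_crash(crash_after):
    """Return (block is in the file, block is in the free list) after a crash."""
    disk = ToyDisk({7, 8, 9}, crash_after)
    try:
        add_block_ordered(disk, 7)
    except Crash:
        pass
    return 7 in disk.inode_blocks, 7 in disk.free


for n in range(3):
    print(n, state_after_crash(n))
```

With no writes completed the block is still free, with one write it is leaked (neither free nor in the file), and with both writes it belongs to the file. No crash point leaves it both free and in the file, which is the dangerous case.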

Result: no need to wait for fsck when rebooting

Problems:

Improvement:

Don't actually write the blocks synchronously, but record dependencies in the buffer cache.
For example, after adding a block to a file, add dependency between inode block and free list block.
- When it's time to write the inode back to disk, make sure that the free list block has been written first.
Tricky to get right: potentially end up with circular dependencies between blocks.
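
A minimal sketch of dependency tracking in a buffer cache follows. The class and method names are assumptions for illustration only. In particular, this sketch simply refuses to record a dependency that would create a cycle; a real file system needs a more careful way of handling circular dependencies.

```python
class BufferCache:
    """Toy buffer cache that records write-ordering dependencies."""

    def __init__(self):
        self.dirty = set()        # blocks modified in memory but not yet on disk
        self.must_precede = {}    # block -> blocks that must reach disk before it
        self.disk_order = []      # order in which blocks were actually written

    def mark_dirty(self, block):
        self.dirty.add(block)

    def _depends_on(self, start, target):
        """True if `start` (directly or transitively) must be preceded by `target`."""
        stack, seen = [start], set()
        while stack:
            block = stack.pop()
            if block == target:
                return True
            if block in seen:
                continue
            seen.add(block)
            stack.extend(self.must_precede.get(block, ()))
        return False

    def add_dependency(self, block, prerequisite):
        """Require `prerequisite` to be written to disk before `block`."""
        if self._depends_on(prerequisite, block):
            raise ValueError(f"circular dependency between {block} and {prerequisite}")
        self.must_precede.setdefault(block, set()).add(prerequisite)

    def write_back(self, block):
        # Write every dirty prerequisite first, then the block itself.
        for dep in sorted(self.must_precede.pop(block, ())):
            if dep in self.dirty:
                self.write_back(dep)
        if block in self.dirty:
            self.dirty.discard(block)
            self.disk_order.append(block)


cache = BufferCache()
cache.mark_dirty("free_list_block")
cache.mark_dirty("inode_block")
cache.add_dependency("inode_block", "free_list_block")
cache.write_back("inode_block")
print(cache.disk_order)
```

Writing back the inode block forces the free list block out first, so the on-disk order matches the ordering rule without any synchronous write at the time of the update.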

Approach #3: write-ahead logging

Also called journaling file systems.

Implemented in Linux ext3 and NTFS (Windows).

Similar in function to logs in database systems; allows inconsistencies to be corrected quickly during reboots.

Before performing an operation, record information about the operation in a special append-only log file; flush this info to disk before modifying any other blocks.
Example: log entries to add a block to a file
- "Mark block 99421 allocated"
- "Store 99421 as the location of block index 93 in inode 862"
Then the actual block updates can be carried out later, in any order.
If a crash occurs, replay the log to make sure all updates are completed on disk.
Guarantees that once an operation's log entries have reached disk, the operation will eventually complete, even if a crash intervenes.
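
The log-then-replay idea can be sketched as follows, using the two log entries from the example above. The tuple format and the names LoggedDisk, log_add_block, and replay are assumptions for illustration; the sketch also assumes that an entry is durable as soon as it is appended to the log.

```python
class LoggedDisk:
    """Toy disk with an append-only log plus the state the log describes."""

    def __init__(self):
        self.log = []            # append-only; assumed durable once appended
        self.allocated = set()   # on-disk allocation state
        self.inodes = {}         # inode number -> {block index: block number}


def log_add_block(disk, inum, index, block):
    """Record the intent to add a block to a file; no other state is touched yet."""
    disk.log.append(("alloc", block))
    disk.log.append(("set_ptr", inum, index, block))


def apply_entry(disk, entry):
    """Apply one log entry. Applying the same entry twice is harmless."""
    if entry[0] == "alloc":
        disk.allocated.add(entry[1])
    elif entry[0] == "set_ptr":
        _, inum, index, block = entry
        disk.inodes.setdefault(inum, {})[index] = block


def replay(disk):
    """Crash recovery: re-apply every logged entry in order."""
    for entry in disk.log:
        apply_entry(disk, entry)


disk = LoggedDisk()
log_add_block(disk, inum=862, index=93, block=99421)

# Simulate a crash after only the first of the two block updates was done.
disk.allocated.add(99421)

replay(disk)
print(disk.allocated, disk.inodes)
```

Replay works here because each entry is idempotent: re-applying an update that already reached disk leaves the state unchanged, so recovery does not need to know how far the original updates got.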

How do log entries describe changes?

One possibility: log entries describe logical operations such as "add block 99421 to inode 862 as block index 93"
Another approach: log physical disk updates, such as "patch 4 bytes at offset 324 in block 6159972 with the value 99421"
Assignment 8 uses a hybrid:
- Patch disk blocks
- Mark blocks free/allocated
It's important to identify consistent groups (e.g., all the changes required to add a block to a file)
- Don't process any of the entries in a group unless all are present
- Assignment 8 uses a transaction mechanism to group related entries
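
One way to sketch the grouping rule is to tag each entry with a transaction id and have recovery skip any transaction that has no commit record. The entry format below (patch, alloc, free, commit), the 512-byte block size, and the function names are all hypothetical; this is not the actual format used in Assignment 8.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


def committed_entries(log):
    """Return the entries belonging to complete transactions, in log order."""
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    return [e for e in log if e[0] != "commit" and e[1] in committed]


def recover(log, blocks, allocated):
    """Apply only the entries of transactions whose commit record is in the log.

    blocks:    {block number: bytearray} standing in for disk blocks
    allocated: set of block numbers currently marked allocated
    """
    for entry in committed_entries(log):
        kind = entry[0]
        if kind == "patch":
            # Physical entry: overwrite bytes at an offset within a block.
            _, _, block, offset, data = entry
            buf = blocks.setdefault(block, bytearray(BLOCK_SIZE))
            buf[offset:offset + len(data)] = data
        elif kind == "alloc":
            allocated.add(entry[2])
        elif kind == "free":
            allocated.discard(entry[2])


log = [
    ("alloc", 1, 99421),
    ("patch", 1, 6159972, 324, (99421).to_bytes(4, "little")),
    ("commit", 1),
    ("alloc", 2, 99422),   # transaction 2 never committed: crash happened here
]

blocks, allocated = {}, set()
recover(log, blocks, allocated)
print(allocated, int.from_bytes(blocks[6159972][324:328], "little"))
```

Transaction 2 is ignored because its commit record never reached the log, so block 99422 is not marked allocated without a matching pointer update.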

Problem: log grows over time, so recovery could be slow.

How much to log?

Typically the log is used only for metadata (free list, inodes, indirect blocks), not for actual file data.
Logging all file data is much more expensive.

Logging advantages?

Logging disadvantages?

Solution: delay log writes

Safety requirement: log entry must be written before any other disk blocks related to the log entry
Buffer log in memory
Before writing back a cache block, flush log
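This separates durability from consistency: the disk is always consistent after recovery, even though an operation is not durable until its log entries are flushed.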
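
The buffering rule can be sketched as a cache that accumulates log entries in memory and flushes them just before any cache block is written back. The class LoggingCache and its fields are assumptions made for illustration.

```python
class LoggingCache:
    """Toy block cache with a delayed (in-memory) log."""

    def __init__(self):
        self.log_buffer = []    # log entries that exist only in memory
        self.disk_log = []      # log entries that have reached disk
        self.dirty = {}         # block number -> new contents, not yet on disk
        self.disk_blocks = {}   # block number -> contents on disk
        self.log_flushes = 0    # number of log writes actually issued

    def update(self, block, contents, log_entry):
        """Modify a block in the cache and buffer its log entry; no disk I/O."""
        self.log_buffer.append(log_entry)
        self.dirty[block] = contents

    def flush_log(self):
        if self.log_buffer:
            self.disk_log.extend(self.log_buffer)
            self.log_buffer.clear()
            self.log_flushes += 1

    def write_back(self, block):
        # Safety requirement: the log must reach disk before the block does.
        self.flush_log()
        self.disk_blocks[block] = self.dirty.pop(block)


cache = LoggingCache()
cache.update(10, "free list v2", "mark block 99421 allocated")
cache.update(20, "inode 862 v2", "store 99421 at index 93 in inode 862")
cache.update(10, "free list v3", "mark block 99422 allocated")

cache.write_back(20)
print(cache.log_flushes, len(cache.disk_log), sorted(cache.disk_blocks))
```

Three operations cost a single log write here, and no block reaches disk ahead of its log entries. The price is that a crash before the flush loses the buffered operations entirely, although the disk remains consistent.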

Crashes can still lose recently-written data if it hasn't been flushed to disk.

Disks fail

Conclusions: