
6.S081 2020 Lecture 16: Linux ext3 crash recovery

This lecture: a case study of logging in Linux's ext3 file system,
with an eye to performance and real-world design details

Why do we care about logging?


it's a successful and important idea
it magically adds crash recovery to almost any storage system
widely used
many forms of persistent storage (DBs, file systems, &c)
fault tolerance in distributed systems
lots to chew on for performance, correctness, flexibility

Review of xv6 logging


[ on-disk FS, in-memory cache, on-disk log ]
each system call is a transaction: begin_op(), writes, end_op()
initially writes affect only cached blocks in memory
the overall goal:
make a system call's writes atomic w.r.t. crashes
apply all of the writes to the FS, or none of them
e.g. mark block as free AND erase block # from i-node
at end of system call:
1. write all modified blocks to log on disk
2. write block #s to log header on disk -- the commit point
3. install modified blocks in "home" locations in FS on disk
4. zero out log header on disk
so logged blocks from next xaction don't appear committed
if crash (e.g. computer loses power, but disk OK):
reboot, then run log recovery code (before running any processes!)
if log header contains block numbers
replay all the logged writes to home locations
zero out the log header
write-ahead rule:
don't start home FS writes until all writes committed in log on disk
otherwise, if crash, recovery cannot complete the xaction
freeing rule:
don't erase/overwrite log until all home FS writes are done
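
a sketch of the commit path, modeled on xv6's kernel/log.c
(close to the real xv6 code; locking and error handling omitted):

  static void
  commit(void)
  {
    if (log.lh.n > 0) {
      write_log();      // 1. copy modified blocks from cache to log on disk
      write_head();     // 2. write header w/ block #s -- the commit point
      install_trans(0); // 3. copy blocks from log to home locations in FS
      log.lh.n = 0;
      write_head();     // 4. zero the header: erase transaction from log
    }
  }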

What's wrong with xv6's logging?


it is slow!
each syscall has to wait for all disk I/O -- "synchronous update"
new syscalls have to wait for commit to finish
prevents concurrent execution of multiple processes
every block written twice to disk, once to log, once to FS

Linux's ext3 design


case study of what it takes to make logging efficient
ext3 adds a log to ext2, a previous log-less file system
(the reading calls it ext2, but it's now known as ext3)
(the reading's "journal" means the same thing as "log")

ext3 data structures:


in memory:
write-back block cache
per-transaction info:
sequence #
set of block #s to be logged
set of outstanding "handles" -- one per syscall
on disk:
FS
circular log

what's in the ext3 log (== journal)?


log superblock: log offset and seq # of earliest valid transaction
(this is not the FS superblock; it's a block at start of log file)
descriptor blocks: magic, seq, block #s
data blocks (as described by descriptor)
commit blocks: magic, seq

the log can hold multiple transactions


|super: offset+seq #| ... |Descriptor 4|...blocks...|Commit 4| |Descriptor 5|
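
a sketch of the log block headers, loosely following Linux's JBD
(fs/jbd), which implements ext3's journal; illustrative, not the
complete on-disk format:

  #define JBD_MAGIC_NUMBER 0xc03b3998U  /* starts every descriptor/commit block */

  typedef struct journal_header_s {
    __be32 h_magic;       /* JBD_MAGIC_NUMBER */
    __be32 h_blocktype;   /* descriptor, commit, superblock, or revoke */
    __be32 h_sequence;    /* this transaction's seq # */
  } journal_header_t;

  /* a descriptor block = header + one tag per logged data block: */
  typedef struct journal_block_tag_s {
    __be32 t_blocknr;     /* the block's home location in the FS */
    __be32 t_flags;       /* e.g. "magic was escaped", "last tag" */
  } journal_block_tag_t;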

how does ext3 get good performance?


asynchronous disk update
batching
concurrency

asynchronous disk updates


system calls only modify blocks in cache
they do not update the disk (and thus don't wait for disk writes)
why does async update help performance?
1. system calls return quickly
2. I/O concurrency; app can do other things while disk writes
3. allows batching

asynchronous disk updates affect ext3's crash semantics


a crash may "forget" the last few completed system calls
this is a significant difference from xv6 and many non-logging file systems
an application may tell a user "I finished writing the file"
but the file may be gone after a crash and recovery
careful applications need to be aware (e.g. DB or mail server or editor)
fsync(fd) forces updates to log (and commit) before returning
databases, text editors, &c call fsync()
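
the careful-application pattern, as ordinary POSIX calls
(a self-contained sketch; error handling abbreviated):

  #include <fcntl.h>
  #include <unistd.h>

  // force the writes to the (committed) log before claiming success
  int
  save_file(const char *path, const char *buf, size_t n)
  {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
      return -1;
    if (write(fd, buf, n) != (ssize_t)n || fsync(fd) < 0) {
      close(fd);
      return -1;  // don't tell the user "saved": it may not survive a crash
    }
    return close(fd);  // only now is "I finished writing the file" safe
  }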

ext3 batching


ext3 commits every few seconds
so each transaction typically includes many syscalls
why does batching help performance?
1. spread fixed transaction costs over many system calls
fixed costs: descriptor and commit blocks, seek/rotate, syscall barrier
2. "write absorbtion"
many syscalls in the batch may modify the same block (bitmap, dirent)
thus one disk write for many syscalls' updates
3. disk scheduling: sort writes for fewer seeks, sequential runs

ext3 allows two kinds of concurrency


1. multiple system calls can add to the current transaction
2. multiple transactions may be at various stages:
* one "open" transaction that's accepting new syscalls
* committing transactions doing their log writes
* committed transactions writing to home locations in FS on disk
* old transactions being freed
why does concurrency help performance?
CPU concurrency: many processes can use FS in parallel on multi-core
I/O concurrency: processes can operate on FS in cache while
previous transactions are writing the disk

ext3 sys call code:


sys_unlink() {
  h = start()
  get(h, block #) ...   // one get() per block the syscall will modify
  modify the block(s) in the cache
  stop(h)
}
start():
tells the logging system about a new system call
can't commit until all start()ed system calls have called stop()
start() can block this sys call if needed
get():
tells logging system we'll modify cached block
added to list of blocks to be logged in current transaction
stop():
stop() does *not* cause a commit
transaction can commit iff all included syscalls have called stop()
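
in the Linux kernel these appear as JBD calls; roughly
(simplified from fs/jbd's interface; arguments abbreviated):

  handle_t *h = journal_start(journal, nblocks); /* start(): reserve log space */
  journal_get_write_access(h, bh);  /* get(): we will modify this cached block */
  /* ... modify bh->b_data ... */
  journal_dirty_metadata(h, bh);    /* add block to transaction's to-log list */
  journal_stop(h);                  /* stop(): syscall done; no commit yet */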

committing a transaction to disk


1. block new syscalls
2. wait for in-progress syscalls to stop()
3. open a new transaction, unblock new syscalls
4. write descriptor to on-disk log w/ list of block #s
5. write modified blocks from cache to on-disk log
6. wait for descriptor and data log disk writes to finish
7. write the commit record to on-disk log
8. wait for the commit disk write to finish
this is the "commit point"
9. now modified blocks allowed to go to homes on disk (but not forced)
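
the same sequence as C-style pseudocode (names are illustrative,
not ext3's actual functions):

  void
  commit_transaction(struct transaction *t)
  {
    block_new_syscalls();            // 1
    wait_for_outstanding_handles(t); // 2
    open_new_transaction();          // 3: unblocks new syscalls
    write_descriptor_blocks(t);      // 4: list of home block #s
    write_logged_blocks(t);          // 5: block contents, from cache
    wait_for_log_writes(t);          // 6
    write_commit_block(t);           // 7
    wait_for_commit_write(t);        // 8: the commit point
    allow_checkpoint(t);             // 9: blocks may now go home (not forced)
  }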

when can ext3 re-use log space?


log is circular: SB T4 T1 T2 T3
T2's log space can be reused if:
T2 has committed
and T2's blocks have all been written to home locations in FS on disk
==> no longer any need to replay from log after a crash
and all transactions prior to T2 have been freed in the log
freeing writes on-disk log superblock with
offset/seq of resulting oldest transaction in log

what if a crash?


crash causes RAM (and thus cached disk blocks) to be lost
but disk is assumed to be OK
crash may interrupt writing last xaction to on-disk log
so on-disk log may have a bunch of complete xactions, then one partial
may also have written some of block cache to disk
but only for fully committed xactions, not partial last one

how ext3 recovery works


SB T8 xT5 T6 T7
SB points to T6
T5 is free, T8 has overwritten start of T5
0. reboot with intact disk
1. look in log "superblock" for offset and seq# of oldest transaction
2. find the end of the log
scan until bad magic (missing commit) or unexpected seq # (old entry)
go back to last valid commit block, to ignore final partial transaction
in case the crash occurred midway through writing a transaction
3. replay all blocks through last valid commit, in log order
what if block after last valid commit block looks like a log descriptor?
i.e. looks like the start of a new transaction?
perhaps a descriptor block left over from previous use of log?
no: seq # will be too low
perhaps some file data happens to look like a descriptor?
ext3 prevents this from happening
ext3 replaces magic number in data blocks in journal with zero
and sets a flag for that block in the descriptor
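
a sketch of the recovery scan (illustrative names, not ext3's code;
assumes one descriptor per transaction, circular wrap-around omitted):

  void
  recover(struct log *log)
  {
    uint32_t seq = log->sb->oldest_seq;     // 1. from the log superblock
    uint32_t off = log->sb->oldest_offset;
    uint32_t end = off;

    for (;;) {                              // 2. find the end of the log
      struct desc *d = read_log_block(log, off);
      if (d->magic != LOG_MAGIC || d->seq != seq)
        break;                              // bad magic or old seq #: log ends
      struct commit *c = read_log_block(log, off + 1 + d->nblocks);
      if (c->magic != LOG_MAGIC || c->seq != seq)
        break;                              // missing commit: partial xaction
      end = off + 2 + d->nblocks;           // this transaction is complete
      off = end;
      seq++;
    }

    // 3. replay all blocks through the last valid commit, in log order
    replay(log, log->sb->oldest_offset, end);
  }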

what if another crash during log replay?


after reboot, recovery will replay exactly the same log writes

that was the straightforward part of ext3.


there are also a bunch of tricky details!

why does ext3 delay start of T2's syscalls until all of T1's syscalls finish?
i.e. why this:
T1: |-syscalls-|
T2: |-syscalls-|
---time--->
this barrier sacrifices some performance
example problem that this barrier prevents:
T1: |-create(x)-|
T2: |-unlink(y)-|
X crash
---time--->
if we allow unlink to start before create finishes
unlink(y) frees y's i-node -- i.e. marks it free in block cache
create(x) may allocate that same i-node -- i.e. read unlink's write
if T1 commits, but crash before T2 commits, then
y will exist (the unlink did not commit)
x will exist but use the same i-node as y!
we can't let T1 write T2's updates to the log!
so: don't start T2's syscalls until T1's syscalls are done
(note ext3's copy-on-write doesn't fix this: it gives T1
a copy of all blocks as of the end of T1, but the unlink
executed before T1 ended.)
(note we can't give T2 its own copies of the blocks, separate from
T1's copies, since generally we do want later system calls to
see the effects of earlier system calls.)

how about this:


T1: |--create(d/f)--|
T2: |--create(d/g)--|
both create()s write d's directory content block
suppose T1 commits (writes log on disk) *after* T2's create()
we can't let T1 write T2's update to the log!
then a crash plus log replay would replay some but not all of T2
e.g. make the directory entry d/g but without initializing g's i-node
ext3's solution:
give T1 a private copy of the block cache as it existed when T1 closed
T1 commits from this snapshot of the cache
it's efficient using copy-on-write
the copies allow syscalls in T2 to proceed while T1 is committing

so far we've been talking about ext3's "journaled data" mode


in which file content blocks are written to log
as well as meta-data (i-nodes, directory content, free bitmaps)
so file content is included in atomicity guarantee
e.g. a write() that grows a file results in these log entries:
[descriptor], i-node, block bitmap, content, [commit]

logging file content is slow, every data block written twice


data is usually much bigger than meta-data
can we entirely omit file content from the log?
if we did, when would we write file content to the FS?

can we write file content blocks at any time at all?


no: if metadata is committed first, crash may leave i-node pointing
to blocks from someone else's deleted file, with previous
file's content!

ext3 "ordered data" mode:


write file content blocks to disk *before* committing log
(and don't write file content to the log)
thus, for a write():
write file content blocks to FS on disk, wait, then
commit updated i-node (w/ block number and increased size) to the log
if crash after data write, but before commit:
block on disk has new data
but not visible, since i-node size and block list not updated
no metadata inconsistencies
neither i-node nor free bitmap were updated, so blocks still free
most people use ext3 ordered data mode
it's much faster: avoids writing data twice
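
a sketch of ordered-mode handling of a write() that grows a file,
in the same pseudocode style as sys_unlink() above (illustrative):

  sys_write() {
    write new content blocks to home locations in FS on disk
    wait for those disk writes to finish  // must be done *before* commit
    h = start()
    get(h, i-node block)        // log only metadata
    get(h, block bitmap block)
    update i-node (block #s, size) and bitmap in cache
    stop(h)                     // commit happens later, after data is on disk
  }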

why does it make sense for the log to ignore file content?
the real point of the log is to protect the FS internal invariants
e.g. a block cannot be both "free" and used in an i-node
in order to have fast recovery after crash + reboot.
from this point of view, file content is not relevant.
in any case, the file system doesn't know enough about application
writes to understand which groups of writes need to be atomic.
instead, applications can call fsync() to ensure data is
safe on the disk.

correctness challenges w/ ordered data mode:


A. rmdir, re-use directory content block for write() to some file,
crash before rmdir or write committed
after recovery, as if rmdir never happened,
but directory content block has been overwritten!
fix: no re-use of freed block until freeing syscall committed
B. rmdir, commit, re-use block in file, ordered file write, commit,
crash+recover, replay rmdir
but no replay of file content write!
file is left w/ directory content e.g. . and ..
fix: put "revoke" records into log, prevent log replay of a given block
note: both problems due to changing the type of a block (content vs meta-data)
so another solution might be to never change a block's type

Summary of ext3 rules


The basic write-ahead logging rule:
Don't write meta-data block to on-disk FS until committed in on-disk log
Wait for all syscalls in T1 to finish before starting T2
Don't overwrite a block in buffer cache before it is in the log
Don't free on-disk log transaction until all blocks have been written to FS
Ordered data mode:
Write data blocks to FS before commit
Don't reuse free block until freeing syscall is committed
Don't replay revoked syscalls

another corner case: open fd and unlink


open a file, then unlink it
file is open, so unlink removes dir entry but doesn't free inode or blocks
unlink commits (just removal of dir entry)
crash
log doesn't contain writes that free the i-node or blocks
inode and blocks not on free list, also not reachable by any name
will never be freed! oops
solution: add inode to linked list starting from FS superblock
commit that along with removal of the dir entry
recovery looks at that list, completes deletions
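
ext3's mechanism is the "orphan list": the FS superblock's
s_last_orphan field heads a list threaded through the on-disk
i-nodes (each orphan's otherwise-unused dtime field holds the next
i-number); a simplified, illustrative sketch:

  /* illustrative; simplified from ext3's orphan handling */
  void
  orphan_add(struct superblock *sb, struct inode *ip)
  {
    ip->i_dtime = sb->s_last_orphan;  /* link in at the head */
    sb->s_last_orphan = ip->i_num;
    /* both writes commit in the same transaction as the dir entry
       removal; recovery walks the list and finishes the deletions */
  }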

what if there's not enough free log space for a transaction's blocks?
free oldest transaction(s)
i.e. install its writes in the on-disk FS

what if so many syscalls have started that the entire log isn't
big enough to hold their writes, even if all older transactions
have been freed?
(unlikely, since each syscall generates few writes compared to log size)
syscall passes max number of blocks needed to start()
start() waits if total for current transaction is too high
and works on freeing old transactions
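
a sketch of start() with reservation (illustrative names and
locking, not ext3's code):

  handle_t *
  start(struct journal *j, int nblocks)  // nblocks = max this syscall may log
  {
    acquire(&j->lock);
    while (j->reserved + nblocks > j->log_size - j->in_use) {
      free_oldest_transactions(j);  // install their writes in the on-disk FS
      sleep(&j->space, &j->lock);   // wait for log space
    }
    j->reserved += nblocks;
    handle_t *h = new_handle(j, nblocks);
    release(&j->lock);
    return h;
  }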

another (complex) reason for reservations:


T1 writes block 17
T1 commits
T1 finishes writing its log and commit record
but has not yet written block 17 to home location
T2 writes block 17 in the cache
ext3 does *not* make a copy of block 17
ext3 only copies if needed to write T1's log
a T2 syscall is executing, does a write for which no log space
can ext3 install T1's blocks to home locations on disk,
and free T1's log space?
no: the cache no longer holds T1's version of block 17
(ext3 could read the block from the log, but it doesn't)
reservations detect the potential log space exhaustion,
prevent that T2 syscall from starting

checksums


recall: transaction's log blocks must be on disk before writing commit block
ext3 waits for disk to say "done" before starting commit block write
risk: disks usually have write caches and re-order writes, for performance
sometimes hard to turn off (the disk lies)
people often leave re-ordering enabled for speed, out of ignorance
bad news if disk writes commit block before the rest of the transaction
solution: commit block contains checksum of all data blocks
on recovery: compute checksum of data blocks
if matches checksum in commit block: install transaction
if no match: don't install transaction
ext4 has log checksumming, but ext3 does not
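
the recovery-side check, sketched (illustrative; ext4's real
mechanism is JBD2 journal checksumming, with its own CRC and layout):

  #include <stdint.h>
  #include <stddef.h>

  // recompute a checksum over the transaction's logged blocks and
  // compare with the value stored in its commit block
  int
  transaction_valid(uint8_t (*blocks)[4096], int nblocks, uint32_t stored)
  {
    uint32_t sum = 0;
    for (int i = 0; i < nblocks; i++)
      for (size_t j = 0; j < 4096; j++)
        sum = (sum << 1 | sum >> 31) ^ blocks[i][j];  // toy rotate-xor, not a real CRC
    return sum == stored;  // mismatch => partial transaction: don't replay it
  }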

take-home points about logging and ext3


logs make multi-step disk updates atomic w.r.t. crashes
write-ahead rule: first commit in log, then write to disk
recovery relies on this rule -- and is fast
batching and concurrency give ext3 much better performance than xv6
at the cost of many corner cases to be dealt with

References:
Stephen Tweedie 2000 talk transcript "EXT3, Journaling Filesystem"
http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
log format details
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout
https://www.usenix.org/system/files/login/articles/04_tso_018-021_final.pdf
