Commit 3f6984e7 authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs super updates from Christian Brauner:
 "This contains the super work for this cycle including the long-awaited
  series by Jan to make it possible to prevent writing to mounted block
  devices:

   - Writing to mounted devices is dangerous and can lead to filesystem
     corruption as well as crashes. Furthermore syzbot comes with more
     and more involved examples how to corrupt block device under a
     mounted filesystem leading to kernel crashes and reports we can do
     nothing about. Add tracking of writers to each block device and a
     kernel cmdline argument which controls whether other writeable
     opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are
     allowed.

     Note that this effectively only prevents modification of the
     particular block device's page cache by other writers. The actual
     device content can still be modified by other means - e.g. by
     issuing direct scsi commands, by doing writes through devices lower
     in the storage stack (e.g. in case loop devices, DM, or MD are
     involved) etc. But blocking direct modifications of the block
     device page cache is enough to give filesystems a chance to perform
     data validation when loading data from the underlying storage and
     thus prevent kernel crashes.

     Syzbot can use this cmdline argument option to avoid uninteresting
     crashes. Also users whose userspace setup does not need writing to
     mounted block devices can set this option for hardening. We expect
     that this will be interesting to quite a few workloads.

     Btrfs is currently opted out of this because they still haven't
     merged patches we require for this to work from three kernel
     releases ago.

   - Reimplement block device freezing and thawing as holder operations
     on the block device.

     This allows us to extend block device freezing to all devices
     associated with a superblock and not just the main device. It also
     allows us to remove get_active_super() and thus another function
     that scans the global list of superblocks.

     Freezing via additional block devices only works if the filesystem
     chooses to use @fs_holder_ops for these additional devices as well.
     That currently only includes ext4 and xfs.

     Earlier releases switched get_tree_bdev() and mount_bdev() to use
     @fs_holder_ops. The remaining nilfs2 open-coded version of
     mount_bdev() has been converted to rely on @fs_holder_ops as well.
     So block device freezing for the main block device will continue to
     work as before.

     There should be no regressions in functionality. The only special
     case is btrfs where block device freezing for the main block device
     never worked because sb->s_bdev isn't set. Block device freezing
     for btrfs can be fixed once they can switch to @fs_holder_ops but
     that can happen whenever they're ready"

* tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  block: Fix a memory leak in bdev_open_by_dev()
  super: don't bother with WARN_ON_ONCE()
  super: massage wait event mechanism
  ext4: Block writes to journal device
  xfs: Block writes to log device
  fs: Block writes to mounted block devices
  btrfs: Do not restrict writes to btrfs devices
  block: Add config option to not allow writing to mounted devices
  block: Remove blkdev_get_by_*() functions
  bcachefs: Convert to bdev_open_by_path()
  fs: handle freezing from multiple devices
  fs: remove dead check
  nilfs2: simplify device handling
  fs: streamline thaw_super_locked
  ext4: simplify device handling
  xfs: simplify device handling
  fs: simplify setup_bdev_super() calls
  blkdev: comment fs_holder_ops
  porting: document block device freeze and thaw changes
  fs: remove unused helper
  ...
parents c604110e 8ff363ad
......@@ -1061,3 +1061,15 @@ export_operations ->encode_fh() no longer has a default implementation to
encode FILEID_INO32_GEN* file handles.
Filesystems that used the default implementation may use the generic helper
generic_encode_ino32_fh() explicitly.
---
**recommended**
Block device freezing and thawing have been moved to holder operations.
Before this change, get_active_super() would only be able to find the
superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
device freezing now works for any block device owned by a given superblock, not
just the main block device. The get_active_super() helper and bd_fsfreeze_sb
pointer are gone.
......@@ -78,6 +78,26 @@ config BLK_DEV_INTEGRITY_T10
select CRC_T10DIF
select CRC64_ROCKSOFT
config BLK_DEV_WRITE_MOUNTED
bool "Allow writing to mounted block devices"
default y
help
When a block device is mounted, writing to its buffer cache is very
likely going to cause filesystem corruption. It is also rather easy to
crash the kernel in this way since the filesystem has no practical way
of detecting these writes to buffer cache and verifying its metadata
integrity. However there are some setups that need this capability
like running fsck on read-only mounted root device, modifying some
features on mounted ext4 filesystem, and similar. If you say N, the
kernel will prevent processes from writing to block devices that are
mounted by filesystems which provides some more protection from runaway
privileged processes and generally makes it much harder to crash
filesystem drivers. Note however that this does not prevent
underlying device(s) from being modified by other means, e.g. by
directly submitting SCSI commands or through access to lower layers of
storage stack. If in doubt, say Y. The configuration can be overridden
with the bdev_allow_write_mounted boot option.
config BLK_DEV_ZONED
bool "Zoned block device support"
select MQ_IOSCHED_DEADLINE
......
This diff is collapsed.
......@@ -2675,7 +2675,7 @@ static int lock_fs(struct mapped_device *md)
WARN_ON(test_bit(DMF_FROZEN, &md->flags));
r = freeze_bdev(md->disk->part0);
r = bdev_freeze(md->disk->part0);
if (!r)
set_bit(DMF_FROZEN, &md->flags);
return r;
......@@ -2685,7 +2685,7 @@ static void unlock_fs(struct mapped_device *md)
{
if (!test_bit(DMF_FROZEN, &md->flags))
return;
thaw_bdev(md->disk->part0);
bdev_thaw(md->disk->part0);
clear_bit(DMF_FROZEN, &md->flags);
}
......
......@@ -289,14 +289,14 @@ static int bch2_ioc_goingdown(struct bch_fs *c, u32 __user *arg)
switch (flags) {
case FSOP_GOING_FLAGS_DEFAULT:
ret = freeze_bdev(c->vfs_sb->s_bdev);
ret = bdev_freeze(c->vfs_sb->s_bdev);
if (ret)
goto err;
bch2_journal_flush(&c->journal);
c->vfs_sb->s_flags |= SB_RDONLY;
bch2_fs_emergency_read_only(c);
thaw_bdev(c->vfs_sb->s_bdev);
bdev_thaw(c->vfs_sb->s_bdev);
break;
case FSOP_GOING_FLAGS_LOGFLUSH:
......
......@@ -164,8 +164,8 @@ void bch2_sb_field_delete(struct bch_sb_handle *sb,
void bch2_free_super(struct bch_sb_handle *sb)
{
kfree(sb->bio);
if (!IS_ERR_OR_NULL(sb->bdev))
blkdev_put(sb->bdev, sb->holder);
if (!IS_ERR_OR_NULL(sb->bdev_handle))
bdev_release(sb->bdev_handle);
kfree(sb->holder);
kfree(sb->sb_name);
......@@ -725,21 +725,22 @@ int bch2_read_super(const char *path, struct bch_opts *opts,
if (!opt_get(*opts, nochanges))
sb->mode |= BLK_OPEN_WRITE;
sb->bdev = blkdev_get_by_path(path, sb->mode, sb->holder, &bch2_sb_handle_bdev_ops);
if (IS_ERR(sb->bdev) &&
PTR_ERR(sb->bdev) == -EACCES &&
sb->bdev_handle = bdev_open_by_path(path, sb->mode, sb->holder, &bch2_sb_handle_bdev_ops);
if (IS_ERR(sb->bdev_handle) &&
PTR_ERR(sb->bdev_handle) == -EACCES &&
opt_get(*opts, read_only)) {
sb->mode &= ~BLK_OPEN_WRITE;
sb->bdev = blkdev_get_by_path(path, sb->mode, sb->holder, &bch2_sb_handle_bdev_ops);
if (!IS_ERR(sb->bdev))
sb->bdev_handle = bdev_open_by_path(path, sb->mode, sb->holder, &bch2_sb_handle_bdev_ops);
if (!IS_ERR(sb->bdev_handle))
opt_set(*opts, nochanges, true);
}
if (IS_ERR(sb->bdev)) {
ret = PTR_ERR(sb->bdev);
if (IS_ERR(sb->bdev_handle)) {
ret = PTR_ERR(sb->bdev_handle);
goto out;
}
sb->bdev = sb->bdev_handle->bdev;
ret = bch2_sb_realloc(sb, 0);
if (ret) {
......
......@@ -4,6 +4,7 @@
struct bch_sb_handle {
struct bch_sb *sb;
struct bdev_handle *bdev_handle;
struct block_device *bdev;
char *sb_name;
struct bio *bio;
......
......@@ -1406,6 +1406,8 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
return ERR_PTR(error);
}
/* No support for restricting writes to btrfs devices yet... */
mode &= ~BLK_OPEN_RESTRICT_WRITES;
/*
* Setup a dummy root and fs_info for test/set super. This is because
* we don't actually fill this stuff out until open_ctree, but we need
......
......@@ -819,11 +819,11 @@ int ext4_force_shutdown(struct super_block *sb, u32 flags)
switch (flags) {
case EXT4_GOING_FLAGS_DEFAULT:
ret = freeze_bdev(sb->s_bdev);
ret = bdev_freeze(sb->s_bdev);
if (ret)
return ret;
set_bit(EXT4_FLAGS_SHUTDOWN, &sbi->s_ext4_flags);
thaw_bdev(sb->s_bdev);
bdev_thaw(sb->s_bdev);
break;
case EXT4_GOING_FLAGS_LOGFLUSH:
set_bit(EXT4_FLAGS_SHUTDOWN, &sbi->s_ext4_flags);
......
......@@ -5864,11 +5864,9 @@ static struct bdev_handle *ext4_get_journal_blkdev(struct super_block *sb,
struct ext4_super_block *es;
int errno;
/* see get_tree_bdev why this is needed and safe */
up_write(&sb->s_umount);
bdev_handle = bdev_open_by_dev(j_dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
bdev_handle = bdev_open_by_dev(j_dev,
BLK_OPEN_READ | BLK_OPEN_WRITE | BLK_OPEN_RESTRICT_WRITES,
sb, &fs_holder_ops);
down_write(&sb->s_umount);
if (IS_ERR(bdev_handle)) {
ext4_msg(sb, KERN_ERR,
"failed to open journal device unknown-block(%u,%u) %ld",
......
......@@ -2239,11 +2239,11 @@ static int f2fs_ioc_shutdown(struct file *filp, unsigned long arg)
switch (in) {
case F2FS_GOING_DOWN_FULLSYNC:
ret = freeze_bdev(sb->s_bdev);
ret = bdev_freeze(sb->s_bdev);
if (ret)
goto out;
f2fs_stop_checkpoint(sbi, false, STOP_CP_REASON_SHUTDOWN);
thaw_bdev(sb->s_bdev);
bdev_thaw(sb->s_bdev);
break;
case F2FS_GOING_DOWN_METASYNC:
/* do checkpoint only */
......
......@@ -1314,15 +1314,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
return ERR_CAST(s);
if (!s->s_root) {
/*
* We drop s_umount here because we need to open the bdev and
* bdev->open_mutex ranks above s_umount (blkdev_put() ->
* __invalidate_device()). It is safe because we have active sb
* reference and SB_BORN is not set yet.
*/
up_write(&s->s_umount);
err = setup_bdev_super(s, flags, NULL);
down_write(&s->s_umount);
if (!err)
err = nilfs_fill_super(s, data,
flags & SB_SILENT ? 1 : 0);
......
This diff is collapsed.
......@@ -482,9 +482,9 @@ xfs_fs_goingdown(
{
switch (inflags) {
case XFS_FSOP_GOING_FLAGS_DEFAULT: {
if (!freeze_bdev(mp->m_super->s_bdev)) {
if (!bdev_freeze(mp->m_super->s_bdev)) {
xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
thaw_bdev(mp->m_super->s_bdev);
bdev_thaw(mp->m_super->s_bdev);
}
break;
}
......
......@@ -366,7 +366,8 @@ xfs_blkdev_get(
{
int error = 0;
*handlep = bdev_open_by_path(name, BLK_OPEN_READ | BLK_OPEN_WRITE,
*handlep = bdev_open_by_path(name,
BLK_OPEN_READ | BLK_OPEN_WRITE | BLK_OPEN_RESTRICT_WRITES,
mp->m_super, &fs_holder_ops);
if (IS_ERR(*handlep)) {
error = PTR_ERR(*handlep);
......@@ -438,19 +439,13 @@ xfs_open_devices(
struct bdev_handle *logdev_handle = NULL, *rtdev_handle = NULL;
int error;
/*
* blkdev_put() can't be called under s_umount, see the comment
* in get_tree_bdev() for more details
*/
up_write(&sb->s_umount);
/*
* Open real time and log devices - order is important.
*/
if (mp->m_logname) {
error = xfs_blkdev_get(mp, mp->m_logname, &logdev_handle);
if (error)
goto out_relock;
return error;
}
if (mp->m_rtname) {
......@@ -493,10 +488,7 @@ xfs_open_devices(
bdev_release(logdev_handle);
}
error = 0;
out_relock:
down_write(&sb->s_umount);
return error;
return 0;
out_free_rtdev_targ:
if (mp->m_rtdev_targp)
......@@ -509,7 +501,7 @@ xfs_open_devices(
out_close_logdev:
if (logdev_handle)
bdev_release(logdev_handle);
goto out_relock;
return error;
}
/*
......@@ -759,10 +751,6 @@ static void
xfs_mount_free(
struct xfs_mount *mp)
{
/*
* Free the buftargs here because blkdev_put needs to be called outside
* of sb->s_umount, which is held around the call to ->put_super.
*/
if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
xfs_free_buftarg(mp->m_logdev_targp);
if (mp->m_rtdev_targp)
......
......@@ -57,20 +57,18 @@ struct block_device {
void * bd_holder;
const struct blk_holder_ops *bd_holder_ops;
struct mutex bd_holder_lock;
/* The counter of freeze processes */
int bd_fsfreeze_count;
int bd_holders;
struct kobject *bd_holder_dir;
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
struct super_block *bd_fsfreeze_sb;
atomic_t bd_fsfreeze_count; /* number of freeze requests */
struct mutex bd_fsfreeze_mutex; /* serialize freeze/thaw */
struct partition_meta_info *bd_meta_info;
#ifdef CONFIG_FAIL_MAKE_REQUEST
bool bd_make_it_fail;
#endif
bool bd_ro_warned;
int bd_writers;
/*
* keep this out-of-line as it's both big and not needed in the fast
* path
......
......@@ -124,6 +124,8 @@ typedef unsigned int __bitwise blk_mode_t;
#define BLK_OPEN_NDELAY ((__force blk_mode_t)(1 << 3))
/* open for "writes" only for ioctls (specialy hack for floppy.c) */
#define BLK_OPEN_WRITE_IOCTL ((__force blk_mode_t)(1 << 4))
/* open is exclusive wrt all other BLK_OPEN_WRITE opens to the device */
#define BLK_OPEN_RESTRICT_WRITES ((__force blk_mode_t)(1 << 5))
struct gendisk {
/*
......@@ -1468,8 +1470,23 @@ struct blk_holder_ops {
* Sync the file system mounted on the block device.
*/
void (*sync)(struct block_device *bdev);
/*
* Freeze the file system mounted on the block device.
*/
int (*freeze)(struct block_device *bdev);
/*
* Thaw the file system mounted on the block device.
*/
int (*thaw)(struct block_device *bdev);
};
/*
* For filesystems using @fs_holder_ops, the @holder argument passed to
* helpers used to open and claim block devices via
* bd_prepare_to_claim() must point to a superblock.
*/
extern const struct blk_holder_ops fs_holder_ops;
/*
......@@ -1477,7 +1494,8 @@ extern const struct blk_holder_ops fs_holder_ops;
* as stored in sb->s_flags.
*/
#define sb_open_mode(flags) \
(BLK_OPEN_READ | (((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))
(BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \
(((flags) & SB_RDONLY) ? 0 : BLK_OPEN_WRITE))
struct bdev_handle {
struct block_device *bdev;
......@@ -1485,10 +1503,6 @@ struct bdev_handle {
blk_mode_t mode;
};
struct block_device *blkdev_get_by_dev(dev_t dev, blk_mode_t mode, void *holder,
const struct blk_holder_ops *hops);
struct block_device *blkdev_get_by_path(const char *path, blk_mode_t mode,
void *holder, const struct blk_holder_ops *hops);
struct bdev_handle *bdev_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
const struct blk_holder_ops *hops);
struct bdev_handle *bdev_open_by_path(const char *path, blk_mode_t mode,
......@@ -1496,7 +1510,6 @@ struct bdev_handle *bdev_open_by_path(const char *path, blk_mode_t mode,
int bd_prepare_to_claim(struct block_device *bdev, void *holder,
const struct blk_holder_ops *hops);
void bd_abort_claiming(struct block_device *bdev, void *holder);
void blkdev_put(struct block_device *bdev, void *holder);
void bdev_release(struct bdev_handle *handle);
/* just for blk-cgroup, don't use elsewhere */
......@@ -1541,8 +1554,8 @@ static inline int early_lookup_bdev(const char *pathname, dev_t *dev)
}
#endif /* CONFIG_BLOCK */
int freeze_bdev(struct block_device *bdev);
int thaw_bdev(struct block_device *bdev);
int bdev_freeze(struct block_device *bdev);
int bdev_thaw(struct block_device *bdev);
struct io_comp_batch {
struct request *req_list;
......
......@@ -1187,7 +1187,8 @@ enum {
struct sb_writers {
unsigned short frozen; /* Is sb frozen? */
unsigned short freeze_holders; /* Who froze fs? */
int freeze_kcount; /* How many kernel freeze requests? */
int freeze_ucount; /* How many userspace freeze requests? */
struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
};
......@@ -2053,9 +2054,24 @@ extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
struct file *dst_file, loff_t dst_pos,
loff_t len, unsigned int remap_flags);
/**
* enum freeze_holder - holder of the freeze
* @FREEZE_HOLDER_KERNEL: kernel wants to freeze or thaw filesystem
* @FREEZE_HOLDER_USERSPACE: userspace wants to freeze or thaw filesystem
* @FREEZE_MAY_NEST: whether nesting freeze and thaw requests is allowed
*
* Indicate who the owner of the freeze or thaw request is and whether
* the freeze needs to be exclusive or can nest.
* Without @FREEZE_MAY_NEST, multiple freeze and thaw requests from the
* same holder aren't allowed. It is however allowed to hold a single
* @FREEZE_HOLDER_USERSPACE and a single @FREEZE_HOLDER_KERNEL freeze at
* the same time. This is relied upon by some filesystems during online
* repair or similar.
*/
enum freeze_holder {
FREEZE_HOLDER_KERNEL = (1U << 0),
FREEZE_HOLDER_USERSPACE = (1U << 1),
FREEZE_MAY_NEST = (1U << 2),
};
struct super_operations {
......@@ -3131,7 +3147,6 @@ extern int vfs_readlink(struct dentry *, char __user *, int);
extern struct file_system_type *get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
extern struct file_system_type *get_fs_type(const char *name);
extern struct super_block *get_active_super(struct block_device *bdev);
extern void drop_super(struct super_block *sb);
extern void drop_super_exclusive(struct super_block *sb);
extern void iterate_supers(void (*)(struct super_block *, void *), void *);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment