I would like to know how the atomicity of writev() is guaranteed

Question

I have a multi-threaded Linux x86_64 user program that writes to an SCTP socket with the writev() system call. I want to confirm the atomicity of the writev() system call.

The man page for writev() states:

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

The data transfers performed by readv() and writev() are atomic: the data written by writev()
is written as a single block that is not intermingled with output from writes in other processes
(but see pipe(7) for an exception); analogously, readv() is guaranteed to read a contiguous
block of data from the file, regardless of read operations performed in other threads or processes
that have file descriptors referring to the same open file description (see open(2)).
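For readers who have not used the call, here is a minimal sketch of a gather write with writev(). The helper name send_record() and its header/body split are assumptions made for illustration; they are not taken from the program described above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: hand a header and a body to the kernel in a
 * single writev() call instead of two separate write() calls. */
ssize_t send_record(int fd, const void *hdr, size_t hdr_len,
                    const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(hdr != NULL && body != NULL);

    /* iov_base is not const-qualified, so the casts drop const;
     * writev() only reads from these buffers. */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;

    /* Returns the number of bytes written, which may be less than
     * hdr_len + body_len, or -1 with errno set. */
    return writev(fd, iov, 2);
}
```

A caller still has to check the return value: a short count means only part of the vector was accepted.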

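So when I looked at the writev() implementation, I expected to see a lock right away. When I did not see one there, I started tracing the calls. Here is what I found. This is my first time browsing the Linux kernel source, so please forgive any misunderstandings.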

The kernel analyzed is Linux 4.4.0 on x86.

The writev() implementation starts at fs/read_write.c:896:

SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec, unsigned long, vlen)

and calls vfs_writev(), defined in the same file at fs/read_write.c:863:

ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
           unsigned long vlen, loff_t *pos)
{
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE))
        return -EINVAL;

    return do_readv_writev(WRITE, file, vec, vlen, pos);
}

do_readv_writev() is also in fs/read_write.c, at line 798, and for type WRITE it runs:

fn = (io_fn_t)file->f_op->write;
iter_fn = file->f_op->write_iter;
file_start_write(file);

file_start_write() is an inline function in include/linux/fs.h:2512:

static inline void file_start_write(struct file *file)
{
    if (!S_ISREG(file_inode(file)->i_mode))
        return;
    __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}

S_ISREG() is defined in include/uapi/linux/stat.h:20 and checks whether the descriptor refers to a regular file.

__sb_start_write() is defined in fs/super.c:1252:

/*
 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
 * instead.
 */
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
    bool force_trylock = false;
    int ret = 1;

#ifdef CONFIG_LOCKDEP
    /*
     * We want lockdep to tell us about possible deadlocks with freezing
     * but it's it bit tricky to properly instrument it. Getting a freeze
     * protection works as getting a read lock but there are subtle
     * problems. XFS for example gets freeze protection on internal level
     * twice in some cases, which is OK only because we already hold a
     * freeze protection also on higher level. Due to these cases we have
     * to use wait == F (trylock mode) which must not fail.
     */
    if (wait) {
        int i;

        for (i = 0; i < level - 1; i++)
            if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
                force_trylock = true;
                break;
            }
    }
#endif
    if (wait && !force_trylock)
        percpu_down_read(sb->s_writers.rw_sem + level-1);
    else
        ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);

    WARN_ON(force_trylock & !ret);
    return ret;
}
EXPORT_SYMBOL(__sb_start_write);

I do not believe my kernel was compiled with CONFIG_LOCKDEP.

Filesystem locking is described in the comment that starts at fs/super.c:1322:

/**
 * freeze_super - lock the filesystem and force it into a consistent state
 * @sb: the super to lock
 *
 * Syncs the super to make sure the filesystem is consistent and calls the fs's
 * freeze_fs.  Subsequent calls to this without first thawing the fs will return
 * -EBUSY.
 *
 * During this function, sb->s_writers.frozen goes through these values:
 *
 * SB_UNFROZEN: File system is normal, all writes progress as usual.
 *
 * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
 * writes should be blocked, though page faults are still allowed. We wait for
 * all writes to complete and then proceed to the next stage.
 *
 * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
 * but internal fs threads can still modify the filesystem (although they
 * should not dirty new pages or inodes), writeback can run etc. After waiting
 * for all running page faults we sync the filesystem which will clean all
 * dirty pages and inodes (no new dirty pages or inodes can be created when
 * sync is running).
 *
 * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
 * modification are blocked (e.g. XFS preallocation truncation on inode
 * reclaim). This is usually implemented by blocking new transactions for
 * filesystems that have them and need this additional guard. After all
 * internal writers are finished we call ->freeze_fs() to finish filesystem
 * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
 * mostly auxiliary for filesystems to verify they do not modify frozen fs.
 *
 * sb->s_writers.frozen is protected by sb->s_umount.
 */

Finally, in kernel/locking/percpu-rwsem.c:70:

/*
 * Like the normal down_read() this is not recursive, the writer can
 * come after the first percpu_down_read() and create the deadlock.
 *
 * Note: returns with lock_is_held(brw->rw_sem) == T for lockdep,
 * percpu_up_read() does rwsem_release(). This pairs with the usage
 * of ->rw_sem in percpu_down/up_write().
 */
void percpu_down_read(struct percpu_rw_semaphore *brw)
{
    might_sleep();
    rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);

    if (likely(update_fast_ctr(brw, +1)))
        return;

    /* Avoid rwsem_acquire_read() and rwsem_release() */
    __down_read(&brw->rw_sem);
    atomic_inc(&brw->slow_read_ctr);
    __up_read(&brw->rw_sem);
}
EXPORT_SYMBOL_GPL(percpu_down_read);

So, this is the lock.

Answer

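Locks and atomicity have nothing to do with each other. Locks are used to guarantee mutual exclusion between threads that access shared data, whereas atomicity guarantees that an operation is performed in an all-or-nothing manner.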

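As C6Up1bQ73STi29cA mentioned, the atomicity of writev() is guaranteed by preempt_disable(). In fact, the VFS layer does not guarantee mutual exclusion for writev(). Instead, the file system (or one of the generic_file* functions, if the file system uses the generic layer) has to handle multiple writev() calls writing to the same part of a file.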
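To make the mutual-exclusion half of that distinction concrete, here is a minimal user-space sketch. It assumes every writer thread in the process goes through the same wrapper; locked_writev() and writer_lock are hypothetical names. The mutex only serializes callers. It does not make a short write all-or-nothing, so the return value still needs checking.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/uio.h>

/* One process-wide lock shared by all writers (an assumption of this
 * sketch; a real program might keep one lock per descriptor). */
static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: only one thread at a time is inside writev(). */
ssize_t locked_writev(int fd, const struct iovec *iov, int iovcnt)
{
    ssize_t n;

    pthread_mutex_lock(&writer_lock);
    n = writev(fd, iov, iovcnt);
    pthread_mutex_unlock(&writer_lock);

    /* n may be smaller than the total length of the vector, or -1. */
    return n;
}
```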

Answer

By the way, writev() is not handled any more specially than write().

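It does not guarantee atomicity for all types of file. Look up PIPE_BUF. If you write more than that amount to a pipe, the data may be interleaved with other writes.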

For regular files, f_pos is currently protected by f_pos_lock. Think of this case as atomically reading and updating f_pos, followed by a call to pwritev().
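For comparison, this is what calling pwritev() directly looks like from user space: the caller supplies the offset and the shared file offset is left alone. It is only a sketch of the second half of that description, not of what the kernel does internally, and write_pair_at() is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes pwritev() in glibc */
#endif
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: gather-write two buffers at an explicit offset. */
ssize_t write_pair_at(int fd, off_t offset,
                      const void *a, size_t a_len,
                      const void *b, size_t b_len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = a_len },
        { .iov_base = (void *)b, .iov_len = b_len },
    };

    /* pwritev() writes at 'offset' and does not change the file
     * offset, so concurrent callers do not race on f_pos. */
    return pwritev(fd, iov, 2, offset);
}
```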

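This protection is a relatively recent "fix" (2014). Before that, there was a period during which Linux violated POSIX and "nobody cared". It looks as though, if you rely on this guarantee in a Linux program, you are doing something fairly unusual :).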

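It sounds like POSIX may not give any guarantee for sockets. To me, the mailing list discussion reads as though Linux may also provide this guarantee for seekable device files. I am not sure we get any guarantee for non-seekable things like ttys.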
