Commit 816284a3 authored by Axel Rasmussen's avatar Axel Rasmussen Committed by Andrew Morton

userfaultfd: update documentation to describe /dev/userfaultfd

Explain the different ways to create a new userfaultfd, and how access
control works for each way.

[axelrasmussen@google.com: improve wording in documentation, per Mike]
  Link: https://lkml.kernel.org/r/20220819205201.658693-5-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-5-axelrasmussen@google.comSigned-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
Acked-by: default avatarPeter Xu <peterx@redhat.com>
Reviewed-by: default avatarShuah Khan <skhan@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
parent 77c07f7c
...@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick. ...@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design Design
====== ======
Userfaults are delivered and resolved through the ``userfaultfd`` syscall. Userspace creates a new userfaultfd, initializes it, and registers one or more
regions of virtual memory with it. Then, any page faults which occur within the
region(s) result in a message being delivered to the userfaultfd, notifying
userspace of the fault.
The ``userfaultfd`` (aside from registering and unregistering virtual The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities: memory ranges) provides two primary functionalities:
...@@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory ...@@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing). ``userfaultfd`` runtime load never takes the mmap_lock for writing).
Vmas are not suitable for page- (or hugepage) granular fault tracking Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that. Terabytes. Too many vmas would be needed for that.
The ``userfaultfd`` once opened by invoking the syscall, can also be The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on different processes without them being aware about what is going on
...@@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``). ...@@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``).
API API
=== ===
Creating a userfaultfd
----------------------
There are two ways to create a new userfaultfd, each of which provide ways to
restrict access to this functionality (since historically userfaultfds which
handle kernel page faults have been a useful tool for exploiting the kernel).
The first way, supported since userfaultfd was introduced, is the
userfaultfd(2) syscall. Access to this is controlled in several ways:
- Any user can always create a userfaultfd which traps userspace page faults
only. Such a userfaultfd can be created using the userfaultfd(2) syscall
with the flag UFFD_USER_MODE_ONLY.
- In order to also trap kernel page faults for the address space, either the
process needs the CAP_SYS_PTRACE capability, or the system must have
vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
is set to 0.
The second way, added to the kernel more recently, is by opening
/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
yields equivalent userfaultfds to the userfaultfd(2) syscall.
Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
filesystem permissions (user/group/mode), which gives fine grained access to
userfaultfd specifically, without also granting other unrelated privileges at
the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
vm.unprivileged_userfaultfd is not considered.
Initializing a userfaultfd
--------------------------
When first opened the ``userfaultfd`` must be enabled invoking the When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol a later API version) which will specify the ``read/POLLIN`` protocol
......
...@@ -926,6 +926,9 @@ calls without any restrictions. ...@@ -926,6 +926,9 @@ calls without any restrictions.
The default value is 0. The default value is 0.
Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.
user_reserve_kbytes user_reserve_kbytes
=================== ===================
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment