• Linus Torvalds's avatar
    Merge tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of... · 02bf43c7
    Linus Torvalds authored
    Merge tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
    
    Pull simple-xattr updates from Christian Brauner:
     "This ports the simple xattr infrastucture to rely on a simple rbtree
      protected by a read-write lock instead of a linked list protected by a
      spinlock.
    
      A while ago we received reports about scaling issues for filesystems
      using the simple xattr infrastructure that also support setting a
      larger number of xattrs. Specifically, cgroups and tmpfs.
    
      Both cgroupfs and tmpfs can be mounted by unprivileged users in
      unprivileged containers and root in an unprivileged container can set
      an unrestricted number of security.* xattrs and privileged users can
      also set unlimited trusted.* xattrs. A few more words on further that
      below. Other xattrs such as user.* are restricted for kernfs-based
      instances to a fairly limited number.
    
      As there are apparently users that have a fairly large number of
      xattrs we should scale a bit better. Using a simple linked list
      protected by a spinlock used for set, get, and list operations doesn't
      scale well if users use a lot of xattrs even if it's not a crazy
      number.
    
      Let's switch to a simple rbtree protected by a rwlock. It scales way
      better and gets rid of the perf issues some people reported. We
      originally had fancier solutions even using an rcu+seqlock protected
      rbtree but we had concerns about being to clever and also that
      deletion from an rbtree with rcu+seqlock isn't entirely safe.
    
      The rbtree plus rwlock is perfectly fine. By far the most common
      operation is getting an xattr. While setting an xattr is not and
      should be comparatively rare. And listxattr() often only happens when
      copying xattrs between files or together with the contents to a new
      file.
    
      Holding a lock across listxattr() is unproblematic because it doesn't
      list the values of xattrs. It can only be used to list the names of
      all xattrs set on a file. And the number of xattr names that can be
      listed with listxattr() is limited to XATTR_LIST_MAX aka 65536 bytes.
      If a larger buffer is passed then vfs_listxattr() caps it to
      XATTR_LIST_MAX and if more xattr names are found it will return
      -E2BIG. In short, the maximum amount of memory that can be retrieved
      via listxattr() is limited and thus listxattr() bounded.
    
      Of course, the API is broken as documented on xattr(7) already. While
      I have no idea how the xattr api ended up in this state we should
      probably try to come up with something here at some point. An iterator
      pattern similar to readdir() as an alternative to listxattr() or
      something else.
    
      Right now it is extremly strange that users can set millions of xattrs
      but then can't use listxattr() to know which xattrs are actually set.
      And it's really trivial to do:
    
    	for i in {1..1000000}; do setfattr -n security.$i -v $i ./file1; done
    
      And around 5000 xattrs it's impossible to use listxattr() to figure
      out which xattrs are actually set. So I have suggested that we try to
      limit the number of xattrs for simple xattrs at least. But that's a
      future patch and I don't consider it very urgent.
    
      A bonus of this port to rbtree+rwlock is that we shrink the memory
      consumption for users of the simple xattr infrastructure.
    
      This also adds kernel documentation to all the functions"
    
    * tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
      xattr: use rbtree for simple_xattrs
    02bf43c7
shmem.c 112 KB