1. 11 Oct, 2023 2 commits
    • binfmt_misc: enable sandboxed mounts · 21ca59b3
      Christian Brauner authored
      Enable unprivileged sandboxes to create their own binfmt_misc mounts.
      This is based on Laurent's work in [1] but has been significantly
      reworked to fix various issues we identified in earlier versions.
      
      While binfmt_misc can currently only be mounted in the initial user
      namespace, binary types registered in this binfmt_misc instance are
      available to all sandboxes (either by having them installed in the
      sandbox or by registering the binary type with the F flag causing the
      interpreter to be opened right away). So binfmt_misc binary types are
      already delegated to sandboxes implicitly.
      
      However, while a sandbox has access to all registered binary types in
      binfmt_misc, a sandbox cannot currently register its own binary types
      in binfmt_misc. This has prevented various use-cases, some of which were
      already outlined in [1] but we have a range of issues associated with
      this (cf. [3]-[5] below which are just a small sample).
      
      Extend binfmt_misc to be mountable in non-initial user namespaces.
      Similar to other filesystems such as nfsd, mqueue, and sunrpc, we use
      keyed superblock management. The key determines whether we need to
      create a new superblock or can reuse an already existing one. We use the
      user namespace of the mount as key. This means a new binfmt_misc
      superblock is created once per user namespace creation. Subsequent
      mounts of binfmt_misc in the same user namespace will mount the same
      binfmt_misc instance. We explicitly do not create a new binfmt_misc
      superblock on every binfmt_misc mount as the semantics for
      load_misc_binary() line up with the keying model. This also allows us to
      retrieve the relevant binfmt_misc instance based on the caller's user
      namespace which can be done in a simple (bounded to 32 levels) loop.
      
      Similar to the current binfmt_misc semantics allowing access to the
      binary types in the initial binfmt_misc instance we do allow sandboxes
      access to their parent's binfmt_misc mounts if they have not created
      a separate binfmt_misc instance.
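
The keying and parent-fallback rules described above can be sketched as a toy userspace model in Python. This is not the kernel implementation: the class names, `mount_binfmt_misc()` and `lookup_binfmt_misc()` are hypothetical, and only the ideas (one instance per user namespace, a bounded walk towards the initial namespace) come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_USERNS_DEPTH = 32  # nesting bound mentioned in the commit message


@dataclass
class BinfmtMisc:
    """Toy stand-in for one binfmt_misc instance (one superblock)."""
    entries: list = field(default_factory=list)


@dataclass
class UserNamespace:
    """Toy stand-in for a user namespace; parent is None for the initial one."""
    name: str
    parent: Optional["UserNamespace"] = None
    binfmt_misc: Optional[BinfmtMisc] = None


def mount_binfmt_misc(ns: UserNamespace) -> BinfmtMisc:
    """Keyed lookup: reuse this namespace's instance, or create it once."""
    if ns.binfmt_misc is None:
        ns.binfmt_misc = BinfmtMisc()
    return ns.binfmt_misc


def lookup_binfmt_misc(ns: UserNamespace) -> Optional[BinfmtMisc]:
    """Walk towards the initial namespace until an instance is found."""
    current = ns
    for _ in range(MAX_USERNS_DEPTH + 1):
        if current is None:
            break
        if current.binfmt_misc is not None:
            return current.binfmt_misc
        current = current.parent
    return None


# Hypothetical hierarchy: host -> container -> nested container.
init_ns = UserNamespace("init")
container = UserNamespace("container", parent=init_ns)
nested = UserNamespace("nested", parent=container)
host_instance = mount_binfmt_misc(init_ns)
```

In this model a sandbox without its own mount resolves to the nearest ancestor's instance, and mounting twice in the same namespace returns the same object.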
      
      Overall, this will unblock the use-cases mentioned below and in general
      will also make it possible to support and harden execution of another
      architecture's binaries in tight sandboxes. For instance, using the
      unshare binary it is possible to start a chroot of another architecture and
      configure the binfmt_misc interpreter without being root to run the
      binaries in this chroot and without requiring the host to modify its
      binary type handlers.
      
      Henning had already posted a few experiments in the cover letter at [1].
      But here's an additional example where an unprivileged container
      registers qemu-user-static binary handlers for various binary types in
      its separate binfmt_misc mount and is then seamlessly able to start
      containers with a different architecture without affecting the host:
      
      root    [lxc monitor] /var/snap/lxd/common/lxd/containers f1
      1000000  \_ /sbin/init
      1000000      \_ /lib/systemd/systemd-journald
      1000000      \_ /lib/systemd/systemd-udevd
      1000100      \_ /lib/systemd/systemd-networkd
      1000101      \_ /lib/systemd/systemd-resolved
      1000000      \_ /usr/sbin/cron -f
      1000103      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
      1000000      \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
      1000104      \_ /usr/sbin/rsyslogd -n -iNONE
      1000000      \_ /lib/systemd/systemd-logind
      1000000      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
      1000107      \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
      1000000      \_ [lxc monitor] /var/lib/lxc f1-s390x
      1100000          \_ /usr/bin/qemu-s390x-static /sbin/init
      1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
      1100000              \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
      1100103              \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
      1100000              \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
      1100104              \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
      1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
      1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
      1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
      1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
      1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
      1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
      1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd
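
For readers unfamiliar with how such a handler is described, here is a minimal Python sketch of the documented `:name:type:offset:magic:mask:interpreter:flags` registration line and of magic/mask matching. It is an illustration only: the helper names are hypothetical, the magic is derived from generic ELF header fields (64-bit, big-endian, ET_EXEC, EM_S390), and the all-ones mask is an assumption that is stricter than what distributions typically register.

```python
# ELF identification for a 64-bit, big-endian executable (ELF spec values).
E_IDENT = b"\x7fELF" + bytes([2, 2, 1]) + bytes(9)  # class, data, version, padding
E_TYPE_EXEC = (2).to_bytes(2, "big")                # ET_EXEC
E_MACHINE_S390 = (22).to_bytes(2, "big")            # EM_S390

MAGIC = E_IDENT + E_TYPE_EXEC + E_MACHINE_S390
MASK = bytes([0xFF]) * len(MAGIC)                   # assumption: compare every bit


def build_register_string(name, magic, mask, interpreter, flags="", offset=0):
    """Build a type-M line as it would be written to the 'register' file."""
    def esc(data: bytes) -> str:
        # Hex-escape every byte so that ':' or NUL cannot break the format.
        return "".join(f"\\x{b:02x}" for b in data)
    return f":{name}:M:{offset}:{esc(magic)}:{esc(mask)}:{interpreter}:{flags}"


def matches(header: bytes, magic: bytes, mask: bytes, offset: int = 0) -> bool:
    """Return True if header matches magic under mask, starting at offset."""
    window = header[offset:offset + len(magic)]
    if len(window) < len(magic):
        return False
    return all((h ^ m) & k == 0 for h, m, k in zip(window, magic, mask))


line = build_register_string(
    "qemu-s390x", MAGIC, MASK, "/usr/bin/qemu-s390x-static", flags="F"
)
```

Inside a sandbox with its own binfmt_misc mount, a line like this would be written to that mount's `register` file rather than to the host's.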
      
      [1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
      [2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
      [3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
      [4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
      [5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11
      
      Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
      Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Henning Schild <henning.schild@siemens.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Laurent Vivier <laurent@vivier.eu>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Laurent Vivier <laurent@vivier.eu>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      ---
      /* v2 */
      - Serge Hallyn <serge@hallyn.com>:
        - Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
          new binary type handler is registered.
      - Christian Brauner <christian.brauner@ubuntu.com>:
        - Switch authorship to me. I refused to do that earlier even though
          Laurent said I should do so because I think it's genuinely bad form.
          But by now I have changed so many things that it'd be unfair to
          blame Laurent for any potential bugs in here.
        - Add more comments that explain what's going on.
        - Rename functions while changing them to better reflect what they are
          doing to make the code easier to understand.
        - In the first version, when a specific binary type handler was
          removed through a write to the entry's file, or all binary type
          handlers were removed by a write to the binfmt_misc mount's status
          file, all cleanup work happened during inode eviction.
          That includes removal of the relevant entries from the entry list.
          While that works fine, I disliked that model after thinking about
          it for a bit, because it means that there was a window where
          someone had already removed one or all binary handlers but they
          could still be safely reached from load_misc_binary() when it had
          managed to take the read_lock() on the entries list while inode
          eviction was already happening. Again, that's perfectly benign, but
          it's cleaner to remove the binary handler from the list immediately,
          meaning that once the write to the entry's file or the binfmt_misc
          status file returns, the binary type cannot be executed anymore.
          That gives stronger guarantees to the user.
    • binfmt_misc: cleanup on filesystem umount · 1c5976ef
      Christian Brauner authored
      Currently, registering a new binary type pins the binfmt_misc
      filesystem. Specifically, this means that as long as there is at least
      one binary type registered the binfmt_misc filesystem survives all
      umounts, i.e. the superblock is not destroyed. Meaning that a umount
      followed by another mount will end up with the same superblock and the
      same binary type handlers. This is a behavior we tend to discourage for
      any new filesystems (apart from a few special filesystems such as e.g.
      configfs or debugfs). A umount operation without the filesystem being
      pinned - by e.g. someone holding a file descriptor to an open file -
      should usually result in the destruction of the superblock and all
      associated resources. This makes introspection easier and leads to
      clearly defined, simple and clean semantics. An administrator can rely
      on the fact that a umount will guarantee a clean slate making it
      possible to reinitialize a filesystem. Right now all binary types would
      need to be explicitly deleted before that can happen.
      
      This allows us to remove the heavy-handed calls to simple_pin_fs() and
      simple_release_fs() when creating and deleting binary types. This in
      turn allows us to replace the current brittle pinning mechanism abusing
      dget() which has caused a range of bugs judging from prior fixes in [2]
      and [3]. The additional dget() in load_misc_binary() pins the dentry but
      only does so to prevent ->evict_inode() from freeing the node when a
      user removes the binary type and kill_node() is run, which would mean
      ->interpreter and ->interp_file would be freed, causing a UAF.
      
      This isn't really nicely documented nor is it very clean because it
      relies on simple_pin_fs() pinning the filesystem as long as at least one
      binary type exists. Otherwise it would cause load_misc_binary() to hold
      on to a dentry belonging to a superblock that has been shut down.
      Replace that implicit pinning with a clean and simple per-node refcount
      and get rid of the ugly dget() pinning. A similar mechanism exists for
      e.g. binderfs (cf. [4]). All the cleanup work can now be done in
      ->evict_inode().
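
The per-node refcount idea can be illustrated with a small Python model. This is a sketch under stated assumptions, not the kernel code: `Node`, `Registry`, and their methods are hypothetical names, and a `threading.Lock` stands in for the entries lock.

```python
import threading


class Node:
    """Toy binary type handler with its own reference count."""

    def __init__(self, name, interpreter):
        self.name = name
        self.interpreter = interpreter
        self.refcount = 1      # reference held by the entry list
        self.freed = False


class Registry:
    """Toy entry list protected by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def register(self, node):
        with self._lock:
            self._entries.append(node)

    def get(self, name):
        """Lookup path: find a handler and take a reference under the lock."""
        with self._lock:
            for node in self._entries:
                if node.name == name:
                    node.refcount += 1
                    return node
        return None

    def put(self, node):
        """Drop a reference; release resources when the last one is gone."""
        with self._lock:
            node.refcount -= 1
            if node.refcount == 0:
                node.interpreter = None
                node.freed = True

    def remove(self, name):
        """Unlink immediately, then drop the list's reference."""
        with self._lock:
            node = next((n for n in self._entries if n.name == name), None)
            if node is None:
                return False
            self._entries.remove(node)
        self.put(node)
        return True
```

A caller that obtained the node before removal keeps a valid interpreter until it calls `put()`, while new lookups fail as soon as `remove()` returns.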
      
      In a follow-up patch we will make it possible to use binfmt_misc in
      sandboxes. We will use the cleaner semantics where a umount for the
      filesystem will cause the superblock and all resources to be
      deallocated. In preparation for this apply the same semantics to the
      initial binfmt_misc mount. Note that this is a user-visible change and
      as such a uapi change but one that we can reasonably risk. We've
      discussed this in earlier versions of this patchset (cf. [1]).
      
      The main user and provider of binfmt_misc is systemd. Systemd provides
      binfmt_misc via autofs since it is configurable as a kernel module and
      is used by a few exotic packages and users. As such a binfmt_misc mount
      is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
      provided on demand. Other autofs on demand filesystems include EFI ESP
      which systemd umounts if the mountpoint stays idle for a certain amount
      of time. This doesn't apply to the binfmt_misc autofs mount which isn't
      touched once it is mounted meaning this change can't accidentally wipe
      binary type handlers without someone having explicitly unmounted
      binfmt_misc. After speaking to systemd folks they don't expect this
      change to affect them.
      
      In line with our general policy, if we see a regression for systemd or
      other users with this change we will switch back to the old behavior for
      the initial binfmt_misc mount and have binary types pin the filesystem
      again. But while we touch this code let's take the chance and let's
      improve on the status quo.
      
      [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
      [2]: commit 43a4f261 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()")
      [3]: commit 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
      [4]: commit f0fe2c0f ("binder: prevent UAF for binderfs devices II")
      
      Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Henning Schild <henning.schild@siemens.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Laurent Vivier <laurent@vivier.eu>
      Cc: linux-fsdevel@vger.kernel.org
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      ---
      /* v2 */
      - Christian Brauner <christian.brauner@ubuntu.com>:
        - Add more comments that explain what's going on.
        - Rename functions while changing them to better reflect what they are
          doing to make the code easier to understand.
        - In the first version, when a specific binary type handler was
          removed through a write to the entry's file, or all binary type
          handlers were removed by a write to the binfmt_misc mount's status
          file, all cleanup work happened during inode eviction.
          That includes removal of the relevant entries from the entry list.
          While that works fine, I disliked that model after thinking about
          it for a bit, because it means that there was a window where
          someone had already removed one or all binary handlers but they
          could still be safely reached from load_misc_binary() when it had
          managed to take the read_lock() on the entries list while inode
          eviction was already happening. Again, that's perfectly benign, but
          it's cleaner to remove the binary handler from the list immediately,
          meaning that once the write to the entry's file or the binfmt_misc
          status file returns, the binary type cannot be executed anymore.
          That gives stronger guarantees to the user.
  2. 04 Oct, 2023 4 commits
  3. 29 Sep, 2023 3 commits
  4. 25 Sep, 2023 1 commit
  5. 17 Sep, 2023 11 commits
  6. 16 Sep, 2023 12 commits
    • Merge tag 'kbuild-fixes-v6.6' of... · f0b0d403
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Fix kernel-devel RPM and linux-headers Deb package
      
       - Fix too long argument list error in 'make modules_install'
      
      * tag 'kbuild-fixes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: avoid long argument lists in make modules_install
        kbuild: fix kernel-devel RPM package and linux-headers Deb package
    • vm: fix move_vma() memory accounting being off · 3cec5049
      Linus Torvalds authored
      Commit 408579cd ("mm: Update do_vmi_align_munmap() return
      semantics") seems to have updated one of the callers of do_vmi_munmap()
      incorrectly: it used to check for the error case (which didn't
      change: negative means error).
      
      That commit changed the check to the success case (which did change:
      before that commit, 0 was success, and 1 was "success and lock
      downgraded".  After the change, it's always 0 for success, and the lock
      will have been released if requested).
      
      This didn't change any actual VM behavior _except_ for memory accounting
      when 'VM_ACCOUNT' was set on the vma.  Which made the wrong return value
      test fairly subtle, since everything continues to work.
      
      Or rather - it continues to work but the "Committed memory" accounting
      goes all wonky (Committed_AS value in /proc/meminfo), and depending on
      settings that then causes problems much much later as the VM relies on
      bogus statistics for its heuristics.
      
      Revert that one line of the change back to the original logic.
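
The difference between the two checks can be shown with a toy Python model. The function names are hypothetical and the return conventions are taken only from the description above (negative means error; success was 0 or 1 before the referenced commit and is always 0 after it).

```python
ENOMEM = 12


def munmap_old(fail: bool, downgraded: bool = False) -> int:
    """Old convention: negative on error, 0 on success, 1 on success + downgrade."""
    if fail:
        return -ENOMEM
    return 1 if downgraded else 0


def munmap_new(fail: bool) -> int:
    """New convention: negative on error, 0 on success."""
    return -ENOMEM if fail else 0


def fixup_on_error(ret: int) -> bool:
    """Original (and restored) logic: adjust accounting only when unmapping failed."""
    return ret < 0


def fixup_on_success(ret: int) -> bool:
    """Mistaken logic: tests the success case instead of the error case."""
    return ret == 0


# With the mistaken test the accounting fixup runs on every successful call
# and is skipped on the one path (failure) it was written for.
results = {
    "ok": (fixup_on_error(munmap_new(False)), fixup_on_success(munmap_new(False))),
    "fail": (fixup_on_error(munmap_new(True)), fixup_on_success(munmap_new(True))),
}
```

Because the error test `ret < 0` is valid under both conventions, it never needed to change.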
      
      Fixes: 408579cd ("mm: Update do_vmi_align_munmap() return semantics")
      Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Reported-bisected-and-tested-by: Michael Labiuk <michael.labiuk@virtuozzo.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Link: https://lore.kernel.org/all/1694366957@msgid.manchmal.in-ulm.de/
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · ad8a69f3
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "16 small(ish) fixes all in drivers.
      
        The major fixes are in pm8001 (fixes MSI-X issue going back to its
        origin), the qla2xxx endianness fix, which fixes a bug on big endian
        and the lpfc ones which can cause an oops on module removal without
        them"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: lpfc: Prevent use-after-free during rmmod with mapped NVMe rports
        scsi: lpfc: Early return after marking final NLP_DROPPED flag in dev_loss_tmo
        scsi: lpfc: Fix the NULL vs IS_ERR() bug for debugfs_create_file()
        scsi: target: core: Fix target_cmd_counter leak
        scsi: pm8001: Setup IRQs on resume
        scsi: pm80xx: Avoid leaking tags when processing OPC_INB_SET_CONTROLLER_CONFIG command
        scsi: pm80xx: Use phy-specific SAS address when sending PHY_START command
        scsi: ufs: core: Poll HCS.UCRDY before issuing a UIC command
        scsi: ufs: core: Move __ufshcd_send_uic_cmd() outside host_lock
        scsi: qedf: Add synchronization between I/O completions and abort
        scsi: target: Replace strlcpy() with strscpy()
        scsi: qla2xxx: Fix NULL vs IS_ERR() bug for debugfs_create_dir()
        scsi: qla2xxx: Use raw_smp_processor_id() instead of smp_processor_id()
        scsi: qla2xxx: Correct endianness for rqstlen and rsplen
        scsi: ppa: Fix accidentally reversed conditions for 16-bit and 32-bit EPP
        scsi: megaraid_sas: Fix deadlock on firmware crashdump
    • Merge tag 'ata-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata · cc3e5afc
      Linus Torvalds authored
      Pull ata fixes from Damien Le Moal:
      
       - Fix link power management transitions to disallow unsupported states
         (Niklas)
      
       - A small string handling fix for the sata_mv driver (Christophe)
      
       - Clear port pending interrupts before reset, as per AHCI
         specifications (Szuying).
      
         Followup fixes for this one are to not clear ATA_PFLAG_EH_PENDING in
         ata_eh_reset() to allow EH to continue on with other actions recorded
         with error interrupts triggered before EH completes. And an
         additional fix to avoid thawing a port twice in EH (Niklas)
      
       - Small code style fixes in the pata_parport driver to silence the
         build bot as it keeps complaining about bad indentation (me)
      
       - A fix for the recent CDL code to avoid fetching sense data for
         successful commands when not necessary for correct operation (Niklas)
      
      * tag 'ata-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
        ata: libata-core: fetch sense data for successful commands iff CDL enabled
        ata: libata-eh: do not thaw the port twice in ata_eh_reset()
        ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()
        ata: pata_parport: Fix code style issues
        ata: libahci: clear pending interrupt status
        ata: sata_mv: Fix incorrect string length computation in mv_dump_mem()
        ata: libata: disallow dev-initiated LPM transitions to unsupported states
    • Merge tag 'usb-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · cce67b6b
      Linus Torvalds authored
      Pull USB fix from Greg KH:
       "Here is a single USB fix for a much-reported regression for 6.6-rc1.
      
        It resolves a crash in the typec debugfs code for many systems. It's
        been in linux-next with no reported issues, and many people have
        reported it resolving their problem with 6.6-rc1"
      
      * tag 'usb-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: ucsi: Fix NULL pointer dereference
    • Merge tag 'driver-core-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · 205d0494
      Linus Torvalds authored
      Pull driver core fixes from Greg KH:
       "Here is a single driver core fix for a much-reported-by-sysbot issue
        that showed up in 6.6-rc1. It's been submitted by many people, all in
        the same way, so it obviously fixes things for them all.
      
        Also in here is a single documentation update adding riscv to the
        embargoed hardware document in case there are any future issues with
        that processor family.
      
        Both of these have been in linux-next with no reported problems"
      
      * tag 'driver-core-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        Documentation: embargoed-hardware-issues.rst: Add myself for RISC-V
        driver core: return an error when dev_set_name() hasn't happened
    • Merge tag 'char-misc-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · fd455e77
      Linus Torvalds authored
      Pull char/misc fix from Greg KH:
       "Here is a single patch for 6.6-rc2 that reverts a 6.5 change for the
        comedi subsystem that has ended up being incorrect and caused drivers
        that were working for people to become impossible to select for
        building at all.
      
        To fix this, the Kconfig change needs to be reverted and a future set
        of fixes for the ioport dependencies will show up in 6.7-rc1 (there's
        no rush for them.)
      
        This has been in linux-next with no reported issues"
      
      * tag 'char-misc-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        Revert "comedi: add HAS_IOPORT dependencies"
    • Merge tag 'i2c-for-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · c37f8efc
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "The main thing is the removal of 'probe_new' because all i2c client
        drivers are converted now. Thanks Uwe, this marks the end of a long
        conversion process.
      
        Other than that, we have a few Kconfig updates and driver bugfixes"
      
      * tag 'i2c-for-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: cadence: Fix the kernel-doc warnings
        i2c: aspeed: Reset the i2c controller when timeout occurs
        i2c: I2C_MLXCPLD on ARM64 should depend on ACPI
        i2c: Make I2C_ATR invisible
        i2c: Drop legacy callback .probe_new()
        w1: ds2482: Switch back to use struct i2c_driver's .probe()
    • ata: libata-core: fetch sense data for successful commands iff CDL enabled · 5e35a9ac
      Niklas Cassel authored
      Currently, we fetch sense data for a _successful_ command if either:
      1) Command was NCQ and ATA_DFLAG_CDL_ENABLED flag set (flag
         ATA_DFLAG_CDL_ENABLED will only be set if the Successful NCQ command
         sense data supported bit is set); or
      2) Command was non-NCQ and regular sense data reporting is enabled.
      
      This means that case 2) will trigger for a non-NCQ command which has
      ATA_SENSE bit set, regardless of whether CDL is enabled or not.

      This decision was by design. If the device reports that it has sense data
      available, it makes sense to fetch that sense data, since the sk/asc/ascq
      could be important information regardless of whether CDL is enabled or not.
      
      However, the fetching of sense data for a successful command is done via
      ATA EH. Considering how intricate the ATA EH is, we really do not want to
      invoke ATA EH unless absolutely needed.
      
      Before commit 18bd7718 ("scsi: ata: libata: Handle completion of CDL
      commands using policy 0xD") we never fetched sense data for successful
      commands.
      
      In order to not invoke the ATA EH unless absolutely necessary, even if the
      device claims support for sense data reporting, only fetch sense data for
      successful commands (NCQ and non-NCQ) that are using CDL.
      
      [Damien] Modified the check to test the qc flag ATA_QCFLAG_HAS_CDL
      instead of the device support for CDL, which is implied for commands
      using CDL.
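
The change in policy can be summarized as two predicates. This is a hedged Python sketch, not the libata code: the function and parameter names are hypothetical, and requiring the ATA_SENSE bit in the NCQ case of the old policy is an assumption made for symmetry with case 2).

```python
def fetch_sense_before(is_ncq: bool, cdl_enabled: bool,
                       sense_reporting_enabled: bool, ata_sense_set: bool) -> bool:
    """Old policy for a successful command, as described in cases 1) and 2)."""
    if not ata_sense_set:          # assumption: nothing to fetch without ATA_SENSE
        return False
    if is_ncq:
        return cdl_enabled         # case 1)
    return sense_reporting_enabled  # case 2)


def fetch_sense_after(uses_cdl: bool, ata_sense_set: bool) -> bool:
    """New policy: only commands that actually use CDL trigger the fetch."""
    return uses_cdl and ata_sense_set


# A plain non-NCQ command on a device with sense data reporting enabled:
# the old policy invokes EH for it, the new policy does not.
plain_command = dict(is_ncq=False, cdl_enabled=False,
                     sense_reporting_enabled=True, ata_sense_set=True)
old_decision = fetch_sense_before(**plain_command)
new_decision = fetch_sense_after(uses_cdl=False, ata_sense_set=True)
```

Keying the decision on the command rather than on device capabilities is what keeps ATA EH out of the path for ordinary successful commands.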
      
      Fixes: 3ac873c7 ("ata: libata-core: fix when to fetch sense data for successful commands")
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    • ata: libata-eh: do not thaw the port twice in ata_eh_reset() · 7a3bc2b3
      Niklas Cassel authored
      commit 1e641060 ("libata: clear eh_info on reset completion") added
      a workaround that broke the retry mechanism in ATA EH.
      
      Tejun himself suggested removing this workaround when it was identified
      to cause additional problems:
      https://lore.kernel.org/linux-ide/20110426135027.GI878@htj.dyndns.org/
      
      He even said:
      "Hmm... it seems I wasn't thinking straight when I added that work around."
      https://lore.kernel.org/linux-ide/20110426155229.GM878@htj.dyndns.org/
      
      Although removing the workaround solved the issue, the workaround was
      kept to avoid "spurious hotplug events during reset", and instead another
      workaround was added on top of the existing workaround in commit
      8c56cacc ("libata: fix unexpectedly frozen port after ata_eh_reset()").
      
      Because these IRQs happened when the port was frozen, we know that they
      were actually a side effect of PxIS and IS.IPS(x) not being cleared before
      the COMRESET. This is now done in commit 94152042eaa9 ("ata: libahci: clear
      pending interrupt status"), so these workarounds can now be removed.
      
      Since commit 1e641060 ("libata: clear eh_info on reset completion") has
      now been reverted, the ATA EH retry mechanism is functional again, so there
      is once again no need to thaw the port more than once in ata_eh_reset().
      
      This reverts "the workaround on top of the workaround" introduced in commit
      8c56cacc ("libata: fix unexpectedly frozen port after ata_eh_reset()").
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    • ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset() · 80cc944e
      Niklas Cassel authored
      ata_scsi_port_error_handler() starts off by clearing ATA_PFLAG_EH_PENDING,
      before calling ap->ops->error_handler() (without holding the ap->lock).
      
      If an error IRQ is received while ap->ops->error_handler() is running,
      the irq handler will set ATA_PFLAG_EH_PENDING.
      
      Once ap->ops->error_handler() returns, ata_scsi_port_error_handler()
      checks if ATA_PFLAG_EH_PENDING is set, and if it is, another iteration
      of ATA EH is performed.
      
      The problem is that ATA_PFLAG_EH_PENDING is not only cleared by
      ata_scsi_port_error_handler(), it is also cleared by ata_eh_reset().
      
      ata_eh_reset() is called by ap->ops->error_handler(). This additional
      clearing done by ata_eh_reset() breaks the whole retry logic in
      ata_scsi_port_error_handler(). Thus, if an error IRQ is received while
      ap->ops->error_handler() is running, the port will currently remain
      frozen and will never get re-enabled.
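
The broken retry can be reproduced in a toy Python model. All names here are hypothetical, `max_tries` is an arbitrary bound for the sketch, and the ordering of events inside the handler is simplified; only the flag handling follows the description above.

```python
class Port:
    """Toy port state: a pending-EH flag and a frozen flag."""

    def __init__(self):
        self.eh_pending = False
        self.frozen = True   # EH starts with the port frozen


def error_irq(port):
    """An error interrupt freezes the port and requests another EH run."""
    port.frozen = True
    port.eh_pending = True


def make_error_handler(reset_clears_pending: bool, irqs: int):
    """Build a handler during which `irqs` error interrupts arrive, one per run."""
    state = {"remaining": irqs}

    def handler(port):
        port.frozen = False                 # the reset thaws the port
        if state["remaining"] > 0:
            state["remaining"] -= 1
            error_irq(port)                 # IRQ arrives while the handler runs
            if reset_clears_pending:
                port.eh_pending = False     # the extra clearing in the reset path

    return handler


def port_error_handler(port, handler, max_tries=5):
    """Outer loop: clear the flag, run the handler, repeat while it is set again."""
    iterations = 0
    while iterations < max_tries:
        iterations += 1
        port.eh_pending = False
        handler(port)
        if not port.eh_pending:
            break
    return iterations


buggy_port, fixed_port = Port(), Port()
buggy_runs = port_error_handler(buggy_port, make_error_handler(True, irqs=1))
fixed_runs = port_error_handler(fixed_port, make_error_handler(False, irqs=1))
```

With the extra clearing, the loop exits after one run and the port stays frozen; without it, a second run thaws the port.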
      
      The additional clearing in ata_eh_reset() was introduced in commit
      1e641060 ("libata: clear eh_info on reset completion").
      
      Looking at the original error report:
      https://marc.info/?l=linux-ide&m=124765325828495&w=2
      
      We can see the following happening:
      [    1.074659] ata3: XXX port freeze
      [    1.074700] ata3: XXX hardresetting link, stopping engine
      [    1.074746] ata3: XXX flipping SControl
      
      [    1.411471] ata3: XXX irq_stat=400040 CONN|PHY
      [    1.411475] ata3: XXX port freeze
      
      [    1.420049] ata3: XXX starting engine
      [    1.420096] ata3: XXX rc=0, class=1
      [    1.420142] ata3: XXX clearing IRQs for thawing
      [    1.420188] ata3: XXX port thawed
      [    1.420234] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      
      We are not supposed to be able to receive an error IRQ while the port is
      frozen (PxIE is set to 0, i.e. all IRQs for the port are disabled).
      
      AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) states:
      "Each bit location can be thought of as reporting a '1' if the virtual
      "interrupt line" for that port is indicating it wishes to generate an
      interrupt. That is, if a port has one or more interrupt status bit set,
      and the enables for those status bits are set, then this bit shall be set."
      
      Additionally, AHCI state P:ComInit clearly shows that the state machine
      will only jump to P:ComInitSetIS (which sets IS.IPS(x) to '1'), if PxIE.PCE
      is set to '1'. In our case, PxIE is set to 0, so IS.IPS(x) won't get set.
      
      So IS.IPS(x) only gets set if both PxIS and PxIE are set.
      
      AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) also states:
      "The bits in this register are read/write clear. It is set by the level of
      the virtual interrupt line being a set, and cleared by a write of '1' from
      the software."
      
      So if IS.IPS(x) is set, you need to explicitly clear it by writing a 1 to
      IS.IPS(x) for that port.
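
The register behaviour quoted above can be modelled in a few lines of Python. This is a simplified sketch with hypothetical names, not a description of real hardware: it only captures that IS.IPS(x) is latched when an enabled status bit is present and that it stays set until software writes a 1 to it.

```python
class AhciModel:
    """Toy model of PxIS, PxIE and the global IS register."""

    def __init__(self, ports: int = 4):
        self.pxis = [0] * ports   # per-port interrupt status
        self.pxie = [0] * ports   # per-port interrupt enable
        self.is_reg = 0           # global IS register, one bit per port

    def _latch(self, port: int) -> None:
        # IS.IPS(port) is set only if an enabled status bit is pending.
        if self.pxis[port] & self.pxie[port]:
            self.is_reg |= 1 << port

    def raise_port_event(self, port: int, bits: int) -> None:
        self.pxis[port] |= bits
        self._latch(port)

    def set_pxie(self, port: int, mask: int) -> None:
        # Clearing the enables (freezing) does not clear an already latched bit.
        self.pxie[port] = mask
        self._latch(port)

    def write_pxis(self, port: int, value: int) -> None:
        self.pxis[port] &= ~value      # write 1 to clear

    def write_is(self, value: int) -> None:
        self.is_reg &= ~value          # write 1 to clear, writing 0 is a no-op


hba = AhciModel()
hba.set_pxie(0, 0xFF)
hba.raise_port_event(0, 0x40)   # latched while interrupts were enabled
hba.set_pxie(0, 0)              # freeze the port: the stale bit survives
```

The stale bit in this model is the situation the commit message describes: an interrupt seen while frozen must have been latched before the freeze.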
      
      Since PxIE is cleared, the only way to get an interrupt while the port is
      frozen, is if IS.IPS(x) is set, and the only way IS.IPS(x) can be set when
      the port is frozen, is if it was set before the port was frozen.
      
      However, since commit 737dd811 ("ata: libahci: clear pending interrupt
      status"), we clear both PxIS and IS.IPS(x) after freezing the port, but
      before the COMRESET, so the problem that commit 1e641060 ("libata:
      clear eh_info on reset completion") fixed can no longer happen.
      
      Thus, revert commit 1e641060 ("libata: clear eh_info on reset
      completion"), so that the retry logic in ata_scsi_port_error_handler()
      works once again. (The retry logic is still needed, since we can still
      get an error IRQ _after_ the port has been thawed, but before
      ata_scsi_port_error_handler() takes the ap->lock in order to check
      if ATA_PFLAG_EH_PENDING is set.)
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    • Merge tag 'linux-kselftest-fixes-6.6-rc2' of... · 57d88e8a
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull more kselftest fixes from Shuah Khan:
       "Fixes to user_events test and ftrace test.
      
        The user_events test was enabled by default in Linux 6.6-rc1. The
        following fixes are for bugs found since then:
      
         - add checks for dependencies and skip the test if they aren't met.
      
           The user_events test requires root access, and tracefs and
           user_events enabled. It leaves tracefs mounted and a fix is in
           progress for that missing piece.
      
         - create user_events test-specific Kconfig fragments
      
        ftrace test fixes:
      
         - unmount tracefs for recovering environment. Fix identified during
           the above mentioned user_events dependencies fix.
      
         - adds softlink to latest log directory improving usage"
      
      * tag 'linux-kselftest-fixes-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests: tracing: Fix to unmount tracefs for recovering environment
        selftests: user_events: create test-specific Kconfig fragments
        ftrace/selftests: Add softlink to latest log directory
        selftests/user_events: Fix failures when user_events is not installed
  7. 15 Sep, 2023 7 commits