1. 28 Sep, 2023 9 commits
    • scsi: sd: Do not issue commands to suspended disks on shutdown · 99398d20
      Damien Le Moal authored
      If an error occurs when resuming a host adapter before the devices
      attached to the adapter are resumed, the adapter low level driver may
      remove the scsi host, resulting in a call to sd_remove() for the
      disks of the host. This in turn results in a call to sd_shutdown() which
      will issue a synchronize cache command and a start stop unit command to
      spindown the disk. sd_shutdown() issues the commands only if the device
      is not already runtime suspended but does not check the power state for
      system-wide suspend/resume. That is, the commands may be issued with the
      device in a suspended state, which causes PM resume to hang, forcing a
      reset of the machine to recover.
      
      Fix this by tracking the suspended state of a disk by introducing the
      suspended boolean field in the scsi_disk structure. This flag is set to
      true when the disk is suspended in sd_suspend_common() and cleared when
      the disk is resumed in sd_resume(). When suspended is true, sd_shutdown()
      is not executed from sd_remove().
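
      The flag handling described above can be sketched as a small userspace
      model. This is only an illustration under stated assumptions: the
      disk_model structure and the model_*() functions are hypothetical
      stand-ins and are not the kernel's scsi_disk structure or sd functions.

```c
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct scsi_disk. */
struct disk_model {
	bool suspended;      /* true between suspend and resume */
	int shutdown_calls;  /* how often the shutdown path ran */
};

/* Stands in for the suspend path: remember that the disk is suspended. */
static void model_suspend(struct disk_model *d)
{
	d->suspended = true;
}

/* Stands in for the resume path: the disk can accept commands again. */
static void model_resume(struct disk_model *d)
{
	d->suspended = false;
}

/* Stands in for the shutdown path that would issue commands to the disk. */
static void model_shutdown(struct disk_model *d)
{
	d->shutdown_calls++;
}

/* Stands in for the remove path: never send commands to a suspended disk. */
static void model_remove(struct disk_model *d)
{
	if (!d->suspended)
		model_shutdown(d);
}
```

      In this model, removing a disk while it is suspended leaves
      shutdown_calls untouched, while removing it after a resume runs the
      shutdown path once.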
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • ata: libata-core: Do not register PM operations for SAS ports · 75e2bd5f
      Damien Le Moal authored
      libsas does its own domain based power management of ports. For such
      ports, libata should not use a device type defining power management
      operations as executing these operations for suspend/resume in addition
      to libsas calls to ata_sas_port_suspend() and ata_sas_port_resume() is
      not necessary (and likely dangerous to do, even though problems are not
      seen currently).
      
      Introduce the new ata_port_sas_type device_type for ports managed by
      libsas. This new device type is used in ata_tport_add() and is defined
      without power management operations.
      
      Fixes: 2fcbdcb4 ("[SCSI] libata: export ata_port suspend/resume infrastructure for sas")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • ata: libata-scsi: Fix delayed scsi_rescan_device() execution · 8b4d9469
      Damien Le Moal authored
      Commit 6aa0365a ("ata: libata-scsi: Avoid deadlock on rescan after
      device resume") modified ata_scsi_dev_rescan() to check the scsi device
      "is_suspended" power field to ensure that the scsi device associated
      with an ATA device is fully resumed when scsi_rescan_device() is
      executed. However, this fix is problematic as:
      1) It relies on a PM internal field that should not be used without PM
         device locking protection.
      2) The check for is_suspended and the call to scsi_rescan_device() are
         not atomic and a suspend PM event may be triggered between them,
         causing scsi_rescan_device() to be called on a suspended device and
         in that function blocking while holding the scsi device lock. This
         would deadlock a following resume operation.
      These problems can trigger PM deadlocks on resume, especially with
      resume operations triggered quickly after or during suspend operations.
      E.g., a simple bash script like:
      
      for (( i=0; i<10; i++ )); do
      	echo "+2" > /sys/class/rtc/rtc0/wakealarm
      	echo mem > /sys/power/state
      done
      
      that triggers a resume 2 seconds after starting suspending a system can
      quickly lead to a PM deadlock preventing the system from correctly
      resuming.
      
      Fix this by replacing the check on is_suspended with a check on the
      return value given by scsi_rescan_device() as that function will fail if
      called against a suspended device. Also make sure rescan tasks already
      scheduled are first cancelled before suspending an ata port.
      
      Fixes: 6aa0365a ("ata: libata-scsi: Avoid deadlock on rescan after device resume")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: Do not attempt to rescan suspended devices · ff48b378
      Damien Le Moal authored
      scsi_rescan_device() takes a scsi device lock before executing a device
      handler and device driver rescan methods. Waiting for the completion of
      any command issued to the device by these methods will thus be done with
      the device lock held. As a result, there is a risk of deadlocking within
      the power management code if scsi_rescan_device() is called to handle a
      device resume with the associated scsi device not yet resumed.
      
      Avoid such a situation by checking that the target scsi device is in the
      running state, that is, fully capable of executing commands, before
      proceeding with the rescan, and bail out returning -EWOULDBLOCK otherwise.
      With this error return, the caller can retry rescanning the device after
      a delay.
      
      The state check is done with the device lock held and is thus safe
      against incoming suspend power management operations.
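
      The check-and-retry contract described above can be sketched as a
      userspace model. This is a hedged illustration only: model_sdev,
      model_rescan() and model_rescan_needs_retry() are hypothetical names,
      and the real check is done with the scsi device lock held, which this
      model does not reproduce.

```c
#include <errno.h>

/* Hypothetical device states; the kernel has more than two. */
enum model_state { MODEL_RUNNING, MODEL_SUSPENDED };

struct model_sdev {
	enum model_state state;
	int rescans;  /* number of rescans that actually ran */
};

/*
 * Stands in for the rescan entry point: refuse to touch a device that is
 * not running instead of blocking on it.
 */
static int model_rescan(struct model_sdev *sdev)
{
	if (sdev->state != MODEL_RUNNING)
		return -EWOULDBLOCK;

	sdev->rescans++;
	return 0;
}

/*
 * Caller side: returns 1 when the rescan must be rescheduled after a
 * delay, 0 when it completed.
 */
static int model_rescan_needs_retry(struct model_sdev *sdev)
{
	return model_rescan(sdev) == -EWOULDBLOCK;
}
```

      A caller built this way never waits on a suspended device: it only
      learns from the return value that it has to try again later.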
      
      Fixes: 6aa0365a ("ata: libata-scsi: Avoid deadlock on rescan after device resume")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    • ata: libata-scsi: Disable scsi device manage_system_start_stop · aa3998db
      Damien Le Moal authored
      The introduction of a device link to create a consumer/supplier
      relationship between the scsi device of an ATA device and the ATA port
      of that ATA device fixes the ordering of system suspend and resume
      operations. For suspend, the scsi device is suspended first and the ata
      port after it. This is fine as this allows the synchronize cache and
      START STOP UNIT commands issued by the scsi disk driver to be executed
      before the ata port is disabled.
      
      For resume operations, the ata port is resumed first, followed
      by the scsi device. This allows having the request queue of the scsi
      device to be unfrozen after the ata port resume is scheduled in EH,
      thus avoiding to see new requests prematurely issued to the ATA device.
      Since libata sets manage_system_start_stop to 1, the scsi disk resume
      operation also results in issuing a START STOP UNIT command to the
      device being resumed so that the device exits standby power mode.
      
      However, restoring the ATA device to the active power mode must be
      synchronized with libata EH processing of the port resume operation to
      avoid either 1) seeing the start stop unit command being received too
      early when the port is not yet resumed and ready to accept commands, or
      2) seeing it received after the port resume process issues commands such
      as IDENTIFY to revalidate the device. In this last case, the risk is that
      the device revalidation fails with timeout errors as the drive is still
      spun down.
      
      Commit 0a858905 ("ata,scsi: do not issue START STOP UNIT on resume")
      disabled issuing the START STOP UNIT command to avoid issues with it.
      But this is incorrect as transitioning a device to the active power
      mode from the standby power mode set on suspend requires a media access
      command. The IDENTIFY, READ LOG and SET FEATURES commands executed in
      libata EH context triggered by the ata port resume operation may thus
      fail.
      
      Fix these synchronization issues by handling device power mode
      transitions for system suspend and resume directly in libata EH context,
      without relying on the scsi disk driver management triggered with the
      manage_system_start_stop flag.
      
      To do this, the following libata helper functions are introduced:
      
      1) ata_dev_power_set_standby():
      
      This function issues a STANDBY IMMEDIATE command to transition a device
      to the standby power mode. For HDDs, this spins down the disks. This
      function applies only to ATA and ZAC devices and does nothing otherwise.
      This function also does nothing for devices that have the
      ATA_FLAG_NO_POWEROFF_SPINDOWN or ATA_FLAG_NO_HIBERNATE_SPINDOWN flag
      set.
      
      For suspend, call ata_dev_power_set_standby() in
      ata_eh_handle_port_suspend() before the port is disabled and frozen.
      ata_eh_unload() is also modified to transition all enabled devices to
      the standby power mode when the system is shutdown or devices removed.
      
      2) ata_dev_power_set_active():
      
      This function applies to ATA or ZAC devices and issues a VERIFY command
      for 1 sector at LBA 0 to transition the device to the active power mode.
      For HDDs, since this function will complete only once the disk has spun
      up, its execution uses the same timeouts as for reset, to give the drive
      enough time to complete spinup without triggering a command timeout.
      
      For resume, call ata_dev_power_set_active() in
      ata_eh_revalidate_and_attach() after the port has been enabled and
      before any other command is issued to the device.
      
      With these changes, the manage_system_start_stop and no_start_on_resume
      scsi device flags do not need to be set in ata_scsi_dev_config(). The
      flag manage_runtime_start_stop is still set to allow the sd driver to
      spinup/spindown a disk through the sd runtime operations.
      
      Fixes: 0a858905 ("ata,scsi: do not issue START STOP UNIT on resume")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: sd: Differentiate system and runtime start/stop management · 3cc2ffe5
      Damien Le Moal authored
      The underlying device and driver of a SCSI disk may have different
      system and runtime power mode control requirements. This is because
      runtime power management affects only the SCSI disk, while system level
      power management affects all devices, including the controller for the
      SCSI disk.
      
      For instance, issuing a START STOP UNIT command when a SCSI disk is
      runtime suspended and resumed is fine: the command is translated to a
      STANDBY IMMEDIATE command to spin down the ATA disk and to a VERIFY
      command to wake it up. The SCSI disk runtime operations have no effect
      on the ata port device used to connect the ATA disk. However, for
      system suspend/resume operations, the ATA port used to connect the
      device will also be suspended and resumed, with the resume operation
      requiring re-validating the device link and the device itself. In this
      case, issuing a VERIFY command to spinup the disk must be done before
      starting to revalidate the device, when the ata port is being resumed.
      In such case, we must not allow the SCSI disk driver to issue START STOP
      UNIT commands.
      
      Allow a low level driver to refine the SCSI disk start/stop management
      by differentiating system and runtime cases with two new SCSI device
      flags: manage_system_start_stop and manage_runtime_start_stop. These new
      flags replace the current manage_start_stop flag. Drivers setting the
      manage_start_stop are modified to set both new flags, thus preserving the
      existing start/stop management behavior. For backward compatibility, the
      old manage_start_stop sysfs device attribute is kept as a read-only
      attribute showing a value of 1 for devices enabling both new flags and 0
      otherwise.
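
      The backward-compatible attribute value described above can be
      sketched as follows. This is a userspace model under stated
      assumptions: model_sdev_flags and model_manage_start_stop_value() are
      hypothetical names, and only the two flag names come from this commit
      message.

```c
#include <stdbool.h>

/* Hypothetical container for the two flags named in the commit message. */
struct model_sdev_flags {
	bool manage_system_start_stop;
	bool manage_runtime_start_stop;
};

/*
 * Value reported by the legacy read-only attribute in this model:
 * 1 only when both new flags are enabled, 0 otherwise.
 */
static int model_manage_start_stop_value(const struct model_sdev_flags *f)
{
	return (f->manage_system_start_stop &&
		f->manage_runtime_start_stop) ? 1 : 0;
}
```

      A device that enables only one of the two flags therefore reads back
      as 0 through the old attribute in this model.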
      
      Fixes: 0a858905 ("ata,scsi: do not issue START STOP UNIT on resume")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • ata: libata-scsi: link ata port and scsi device · fb99ef17
      Damien Le Moal authored
      There is no direct device ancestry defined between an ata_device and
      its scsi device which prevents the power management code from correctly
      ordering suspend and resume operations. Create such ancestry with the
      ata device as the parent to ensure that the scsi device (child) is
      suspended before the ata device and that resume handles the ata device
      before the scsi device.
      
      The parent-child (supplier-consumer) relationship is established between
      the ata_port (parent) and the scsi device (child) with the function
      device_link_add(). The parent used is not the ata_device as the PM
      operations are defined per port and the status of all devices connected
      through that port is controlled from the port operations.
      
      The device link is established with the new function
      ata_scsi_slave_alloc(), and this function is used to define the
      ->slave_alloc callback of the scsi host template of all ata drivers.
      
      Fixes: a19a93e4 ("scsi: core: pm: Rely on the device driver core for async power management")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: John Garry <john.g.garry@oracle.com>
    • ata: libata-core: Fix port and device removal · 84d76529
      Damien Le Moal authored
      Whenever an ATA adapter driver is removed (e.g. rmmod),
      ata_port_detach() is called repeatedly for all the adapter ports to
      remove (unload) the devices attached to the port and delete the port
      device itself. Removing of devices is done using libata EH with the
      ATA_PFLAG_UNLOADING port flag set. This causes libata EH to execute
      ata_eh_unload() which disables all devices attached to the port.
      
      ata_port_detach() finishes by calling scsi_remove_host() to remove the
      scsi host associated with the port. This function will trigger the
      removal of all scsi devices attached to the host and in the case of
      disks, calls to sd_shutdown() which will flush the device write cache
      and stop the device. However, given that the devices were already
      disabled by ata_eh_unload(), the synchronize write cache command and
      start stop unit commands fail. E.g. running "rmmod ahci" without first
      removing sd_mod results in error messages like:
      
      ata13.00: disable device
      sd 0:0:0:0: [sda] Synchronizing SCSI cache
      sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
      sd 0:0:0:0: [sda] Stopping disk
      sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
      
      Fix this by removing all scsi devices of the ata devices connected to
      the port before scheduling libata EH to disable the ATA devices.
      
      Fixes: 720ba126 ("[PATCH] libata-hp: update unload-unplug")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
      Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • ata: libata-core: Fix ata_port_request_pm() locking · 3b8e0af4
      Damien Le Moal authored
      The function ata_port_request_pm() checks the port flag
      ATA_PFLAG_PM_PENDING and calls ata_port_wait_eh() if this flag is set to
      ensure that power management operations for a port are not scheduled
      simultaneously. However, this flag check is done without holding the
      port lock.
      
      Fix this by taking the port lock on entry to the function and checking
      the flag under this lock. The lock is released and re-taken if
      ata_port_wait_eh() needs to be called. The two WARN_ON() macros checking
      that the ATA_PFLAG_PM_PENDING flag was cleared are removed as the first
      call is racy and the second one done without holding the port lock.
      
      Fixes: 5ef41082 ("ata: add ata port system PM callbacks")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
  2. 25 Sep, 2023 3 commits
    • ata: libata-sata: increase PMP SRST timeout to 10s · 753a4d53
      Matthias Schiffer authored
      On certain SATA controllers, softreset fails after wakeup from S2RAM with
      the message "softreset failed (1st FIS failed)", sometimes resulting in
      drives not being detected again. With the increased timeout, this issue
      is avoided. Instead, "softreset failed (device not ready)" is now
      logged 1-2 times; this latter failure seems to cause fewer problems
      however, and the drives are detected reliably once they've spun up and
      the probe is retried.
      
      The issue was observed with the primary SATA controller of the QNAP
      TS-453B, which is an "Intel Corporation Celeron/Pentium Silver Processor
      SATA Controller [8086:31e3] (rev 06)" integrated in the Celeron J4125 CPU,
      and the following drives:
      
      - Seagate IronWolf ST12000VN0008
      - Seagate IronWolf ST8000NE0004
      
      The SATA controller seems to be more relevant to this issue than the
      drives, as the same drives are always detected reliably on the secondary
      SATA controller on the same board (an ASMedia 106x) without any "softreset
      failed" errors even without the increased timeout.
      
      Fixes: e7d3ef13 ("libata: change drive ready wait after hard reset to 5s")
      Cc: stable@vger.kernel.org
      Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    • ata: libata-scsi: ignore reserved bits for REPORT SUPPORTED OPERATION CODES · 3ef60092
      Niklas Cassel authored
      For REPORT SUPPORTED OPERATION CODES command, the service action field is
      defined as bits 0-4 in the second byte in the CDB. Bits 5-7 in the second
      byte are reserved.
      
      Only look at the service action field in the second byte when determining
      if the MAINTENANCE IN opcode is a REPORT SUPPORTED OPERATION CODES command.
      
      This matches how we only look at the service action field in the second
      byte when determining if the SERVICE ACTION IN(16) opcode is a READ
      CAPACITY(16) command (reserved bits 5-7 in the second byte are ignored).
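
      The masking described above can be sketched as a standalone check.
      This is an illustration, not the libata code: the helper name
      is_report_supported_opcodes() is hypothetical, the 0x1f mask is the
      service action field (bits 0-4) described above, and the two opcode
      values are defined locally on the assumption that they match the
      kernel's SCSI protocol header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Defined locally for this sketch; assumed to match the kernel header. */
#define MAINTENANCE_IN                       0xa3
#define MI_REPORT_SUPPORTED_OPERATION_CODES  0x0c

/* Service action field: bits 0-4 of CDB byte 1; bits 5-7 are reserved. */
#define SERVICE_ACTION_MASK                  0x1f

/*
 * Hypothetical helper: decide whether a CDB is REPORT SUPPORTED OPERATION
 * CODES while ignoring the reserved bits of the second byte.
 */
static bool is_report_supported_opcodes(const uint8_t *cdb)
{
	return cdb[0] == MAINTENANCE_IN &&
	       (cdb[1] & SERVICE_ACTION_MASK) ==
			MI_REPORT_SUPPORTED_OPERATION_CODES;
}
```

      With this check, a CDB whose second byte has reserved bits set is
      still recognized as long as the service action field itself matches.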
      
      Fixes: 7b203094 ("libata: Add support for SCT Write Same")
      Cc: stable@vger.kernel.org
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    • dt-bindings: ata: pata-common: Add missing additionalProperties on child nodes · 52bb69be
      Rob Herring authored
      The PATA child node schema is missing constraints to prevent unknown
      properties. As none of the users of this common binding extend the child
      nodes with additional properties, adding "additionalProperties: false"
      here is sufficient.
      Signed-off-by: Rob Herring <robh@kernel.org>
      Acked-by: Conor Dooley <conor.dooley@microchip.com>
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
  3. 24 Sep, 2023 4 commits
    • Linux 6.6-rc3 · 6465e260
      Linus Torvalds authored
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 8a511e7e
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
      "ARM:
      
         - Fix EL2 Stage-1 MMIO mappings where a random address was used
      
         - Fix SMCCC function number comparison when the SVE hint is set
      
        RISC-V:
      
         - Fix KVM_GET_REG_LIST API for ISA_EXT registers
      
         - Fix reading ISA_EXT register of a missing extension
      
         - Fix ISA_EXT register handling in get-reg-list test
      
         - Fix filtering of AIA registers in get-reg-list test
      
        x86:
      
         - Fixes for TSC_AUX virtualization
      
         - Stop zapping page tables asynchronously, since we don't zap them as
           often as before"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX
        KVM: SVM: Fix TSC_AUX virtualization setup
        KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
        KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
        KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()
        KVM: x86/mmu: Open code leaf invalidation from mmu_notifier
        KVM: riscv: selftests: Selectively filter-out AIA registers
        KVM: riscv: selftests: Fix ISA_EXT register handling in get-reg-list
        RISC-V: KVM: Fix riscv_vcpu_get_isa_ext_single() for missing extensions
        RISC-V: KVM: Fix KVM_GET_REG_LIST API for ISA_EXT registers
        KVM: selftests: Assert that vasprintf() is successful
        KVM: arm64: nvhe: Ignore SVE hint in SMCCC function ID
        KVM: arm64: Properly return allocated EL2 VA from hyp_alloc_private_va_range()
    • Merge tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 5edc6bb3
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix the "bytes" output of the per_cpu stat file
      
         The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
         accounting was not accurate. It is supposed to show how many used
         bytes are still in the ring buffer, but even when the ring buffer was
         empty it would still show there were bytes used.
      
       - Fix a bug in eventfs where reading a dynamic event directory (open)
         and then creating a dynamic event that goes into that directory screws
         up the accounting.
      
         On close, the newly created event dentry will get a "dput" without
         ever having a "dget" done for it. The fix is to allocate an array on
         dir open to save what dentries were actually "dget" on, and what ones
         to "dput" on close.
      
      * tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        eventfs: Remember what dentries were created on dir open
        ring-buffer: Fix bytes info in per_cpu buffer stats
    • Merge tag 'cxl-fixes-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 2ad78f8c
      Linus Torvalds authored
      Pull cxl fixes from Dan Williams:
       "A collection of regression fixes, bug fixes, and some small cleanups
        to the Compute Express Link code.
      
        The regressions arrived in the v6.5 dev cycle and missed the v6.6
        merge window due to my personal absences this cycle. The most
        important fixes are for scenarios where the CXL subsystem fails to
        parse valid region configurations established by platform firmware.
        This is important because agreement between OS and BIOS on the CXL
        configuration is fundamental to implementing "OS native" error
        handling, i.e. address translation and component failure
        identification.
      
        Other important fixes are a driver load error when the BIOS lets the
        Linux PCI core handle AER events, but not CXL memory errors.
      
        The other fixes might have end user impact, but for now are only known
        to trigger in our test/emulation environment.
      
        Summary:
      
         - Fix multiple scenarios where platform firmware defined regions fail
           to be assembled by the CXL core.
      
         - Fix a spurious driver-load failure on platforms that enable OS
           native AER, but not OS native CXL error handling.
      
         - Fix a regression detecting "poison" commands when "security"
           commands are also defined.
      
         - Fix a cxl_test regression with the move to centralize CXL port
           register enumeration in the CXL core.
      
         - Miscellaneous small fixes and cleanups"
      
      * tag 'cxl-fixes-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/acpi: Annotate struct cxl_cxims_data with __counted_by
        cxl/port: Fix cxl_test register enumeration regression
        cxl/region: Refactor granularity select in cxl_port_setup_targets()
        cxl/region: Match auto-discovered region decoders by HPA range
        cxl/mbox: Fix CEL logic for poison and security commands
        cxl/pci: Replace host_bridge->native_aer with pcie_aer_is_native()
        PCI/AER: Export pcie_aer_is_native()
        cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers
  4. 23 Sep, 2023 14 commits
    • Merge tag 'gpio-fixes-for-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 3aba70ae
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
      
       - fix an invalid usage of __free(kfree) leading to kfreeing an
         ERR_PTR()
      
       - fix an irq domain leak in gpio-tb10x
      
       - MAINTAINERS update
      
      * tag 'gpio-fixes-for-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpio: sim: fix an invalid __free() usage
        gpio: tb10x: Fix an error handling path in tb10x_gpio_probe()
        MAINTAINERS: gpio-regmap: make myself a maintainer of it
    • Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of... · 85eba5f1
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull misc fixes from Andrew Morton:
       "13 hotfixes, 10 of which pertain to post-6.5 issues. The other three
        are cc:stable"
      
      * tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        proc: nommu: fix empty /proc/<pid>/maps
        filemap: add filemap_map_order0_folio() to handle order0 folio
        proc: nommu: /proc/<pid>/maps: release mmap read lock
        mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
        pidfd: prevent a kernel-doc warning
        argv_split: fix kernel-doc warnings
        scatterlist: add missing function params to kernel-doc
        selftests/proc: fixup proc-empty-vm test after KSM changes
        revert "scripts/gdb/symbols: add specific ko module load command"
        selftests: link libasan statically for tests with -fsanitize=address
        task_work: add kerneldoc annotation for 'data' argument
        mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
        sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
    • Merge tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 8565bdf8
      Linus Torvalds authored
      Pull smb client fixes from Steve French:
       "Six smb3 client fixes, including three for stable, from the SMB
        plugfest (testing event) this week:
      
         - Reparse point handling fix (found when investigating dir
           enumeration when fifo in dir)
      
         - Fix excessive thread creation for dir lease cleanup
      
         - UAF fix in negotiate path
      
         - remove duplicate error message mapping and fix confusing warning
           message
      
         - add dynamic trace point to improve debugging RDMA connection
           attempts"
      
      * tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: fix confusing debug message
        smb: client: handle STATUS_IO_REPARSE_TAG_NOT_HANDLED
        smb3: remove duplicate error mapping
        cifs: Fix UAF in cifs_demultiplex_thread()
        smb3: do not start laundromat thread when dir leases  disabled
        smb3: Add dynamic trace points for RDMA (smbdirect) reconnect
    • Merge tag 'i2c-for-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 5a4de7dc
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "A set of I2C driver fixes. Mostly fixing resource leaks or sanity
        checks"
      
      * tag 'i2c-for-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: xiic: Correct return value check for xiic_reinit()
        i2c: mux: gpio: Add missing fwnode_handle_put()
        i2c: mux: demux-pinctrl: check the return value of devm_kstrdup()
        i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low
        i2c: i801: unregister tco_pdev in i801_probe() error path
      5a4de7dc
    • Charles Keepax's avatar
      mfd: cs42l43: Use correct macro for new-style PM runtime ops · eb72d520
      Charles Keepax authored
      The code was accidentally mixing new- and old-style macros. Update the
      macros used, to remove an unused function warning when building with
      no PM enabled in the config.
      
      Fixes: ace6d144 ("mfd: cs42l43: Add support for cs42l43 core driver")
      Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
      Link: https://lore.kernel.org/all/20230822114914.340359-1-ckeepax@opensource.cirrus.com/
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Lee Jones <lee@kernel.org>
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb72d520
    • Linus Torvalds's avatar
      Merge tag 'loongarch-fixes-6.6-1' of... · 93397d3a
      Linus Torvalds authored
      Merge tag 'loongarch-fixes-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch fixes from Huacai Chen:
       "Fix lockdep, fix a boot failure, fix some build warnings, fix document
        links, and some cleanups"
      
      * tag 'loongarch-fixes-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        docs/zh_CN/LoongArch: Update the links of ABI
        docs/LoongArch: Update the links of ABI
        LoongArch: Don't inline kasan_mem_to_shadow()/kasan_shadow_to_mem()
        kasan: Cleanup the __HAVE_ARCH_SHADOW_MAP usage
        LoongArch: Set all reserved memblocks on Node#0 at initialization
        LoongArch: Remove dead code in relocate_new_kernel
        LoongArch: Use _UL() and _ULL()
        LoongArch: Fix some build warnings with W=1
        LoongArch: Fix lockdep static memory detection
      93397d3a
    • Linus Torvalds's avatar
      Merge tag 's390-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 2e3d3911
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix potential string buffer overflow in hypervisor user-defined
         certificates handling
      
       - Update defconfigs
      
      * tag 's390-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/cert_store: fix string length handling
        s390: update defconfigs
      2e3d3911
    • Linus Torvalds's avatar
      Merge tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 59c376d6
      Linus Torvalds authored
      Pull iomap fixes from Darrick Wong:
      
       - Return EIO on bad inputs to iomap_to_bh instead of BUGging, to deal
         less poorly with block device io racing with block device resizing
      
       - Fix a stale page data exposure bug introduced in 6.6-rc1 when
         unsharing a file range that is not in the page cache
      
      * tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: convert iomap_unshare_iter to use large folios
        iomap: don't skip reading in !uptodate folios when unsharing a range
        iomap: handle error conditions more gracefully in iomap_to_bh
      59c376d6
    • Paolo Bonzini's avatar
      Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into HEAD · 5804c19b
      Paolo Bonzini authored
      KVM/riscv fixes for 6.6, take #1
      
      - Fix KVM_GET_REG_LIST API for ISA_EXT registers
      - Fix reading ISA_EXT register of a missing extension
      - Fix ISA_EXT register handling in get-reg-list test
      - Fix filtering of AIA registers in get-reg-list test
      5804c19b
    • Tom Lendacky's avatar
      KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX · 916e3e5f
      Tom Lendacky authored
      When the TSC_AUX MSR is virtualized, the TSC_AUX value is swap type "B"
      within the VMSA. This means that the guest value is loaded on VMRUN and
      the host value is restored from the host save area on #VMEXIT.
      
      Since the value is restored on #VMEXIT, the KVM user return MSR support
      for TSC_AUX can be replaced by populating the host save area with the
      current host value of TSC_AUX. And, since TSC_AUX is not changed by Linux
      post-boot, the host save area can be set once in svm_hardware_enable().
      This eliminates the two WRMSR instructions associated with the user return
      MSR support.

      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <d381de38eb0ab6c9c93dda8503b72b72546053d7.1694811272.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      916e3e5f
    • Tom Lendacky's avatar
      KVM: SVM: Fix TSC_AUX virtualization setup · e0096d01
      Tom Lendacky authored
      The checks for virtualizing TSC_AUX occur during the vCPU reset processing
      path. However, at the time of initial vCPU reset processing, when the vCPU
      is first created, not all of the guest CPUID information has been set. In
      this case the RDTSCP and RDPID feature support for the guest is not in
      place and so TSC_AUX virtualization is not established.
      
      This continues for each vCPU created for the guest. On the first boot of
      an AP, vCPU reset processing is executed as a result of an APIC INIT
      event, this time with all of the guest CPUID information set, resulting
      in TSC_AUX virtualization being enabled, but only for the APs. The BSP
      always sees a TSC_AUX value of 0 which probably went unnoticed because,
      at least for Linux, the BSP TSC_AUX value is 0.
      
      Move the TSC_AUX virtualization enablement out of the init_vmcb() path and
      into the vcpu_after_set_cpuid() path to allow for proper initialization of
      the support after the guest CPUID information has been set.
      
      With the TSC_AUX virtualization support now in the vcpu_after_set_cpuid()
      path, the intercepts must be either cleared or set based on the guest
      CPUID input.
      
      Fixes: 296d5a17 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <4137fbcb9008951ab5f0befa74a0399d2cce809a.1694811272.git.thomas.lendacky@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0096d01
    • Paolo Bonzini's avatar
      KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway · e8d93d5d
      Paolo Bonzini authored
      svm_recalc_instruction_intercepts() is always called at least once
      before the vCPU is started, so the setting or clearing of the RDTSCP
      intercept can be dropped from the TSC_AUX virtualization support.
      
      Extracted from a patch by Tom Lendacky.
      
      Cc: stable@vger.kernel.org
      Fixes: 296d5a17 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e8d93d5d
    • Sean Christopherson's avatar
      KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously · 0df9dab8
      Sean Christopherson authored
      Stop zapping invalidated TDP MMU roots via work queue now that KVM
      preserves TDP MMU roots until they are explicitly invalidated.  Zapping
      roots asynchronously was effectively a workaround to avoid stalling a vCPU
      for an extended duration if a vCPU unloaded a root, which at the time
      happened whenever the guest toggled CR0.WP (a frequent operation for some
      guest kernels).
      
      While a clever hack, zapping roots via an unbound worker had subtle,
      unintended consequences on host scheduling, especially when zapping
      multiple roots, e.g. as part of a memslot.  Because the work of zapping a
      root is no longer bound to the task that initiated the zap, things like
      the CPU affinity and priority of the original task get lost.  Losing the
      affinity and priority can be especially problematic if unbound workqueues
      aren't affined to a small number of CPUs, as zapping multiple roots can
      cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
      the CPUs KVM is already using to run vCPUs.
      
      When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
      zap can result in KVM occupying all logical CPUs for ~8ms, and result in
      high priority tasks not being scheduled in in a timely manner.  In v5.15,
      which doesn't preserve unloaded roots, the issues were even more noticeable
      as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
      
      Consuming all CPUs for an extended duration can lead to significant jitter
      throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
      is a semi-frequent operation as memslots are deleted and recreated with
      different host virtual addresses to react to host GPU drivers allocating
      and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
      during games due to the audio server's tasks not getting scheduled in
      promptly, despite the tasks having a high realtime priority.
      
      Deleting memslots isn't exactly a fast path and should be avoided when
      possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
      memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
      that removing the async zapping eliminates a non-trivial amount of
      complexity.
      
      Note, one of the subtle behaviors hidden behind the async zapping is that
      KVM would zap invalidated roots only once (ignoring partial zaps from
      things like mmu_notifier events).  Preserve this behavior by adding a flag
      to identify roots that are scheduled to be zapped versus roots that have
      already been zapped but not yet freed.
      
      Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
      encounter invalid roots, as it's not at all obvious why zapping
      invalidated roots shouldn't simply zap all invalid roots.

      Reported-by: Pattara Teerapong <pteerapong@google.com>
      Cc: David Stevens <stevensd@google.com>
      Cc: Yiwei Zhang <zzyiwei@google.com>
      Cc: Paul Hsia <paulhsia@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230916003916.2545000-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0df9dab8
    • Paolo Bonzini's avatar
      KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() · 441a5dfc
      Paolo Bonzini authored
      All callers except the MMU notifier want to process all address spaces.
      Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe()
      and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe().
      
      Extracted out of a patch by Sean Christopherson <seanjc@google.com>
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      441a5dfc
  5. 22 Sep, 2023 10 commits
    • Linus Torvalds's avatar
      Merge tag 'hardening-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · d90b0276
      Linus Torvalds authored
      Pull hardening fixes from Kees Cook:
      
       - Fix UAPI stddef.h to avoid C++-ism (Alexey Dobriyan)
      
       - Fix harmless UAPI stddef.h header guard endif (Alexey Dobriyan)
      
      * tag 'hardening-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        uapi: stddef.h: Fix __DECLARE_FLEX_ARRAY for C++
        uapi: stddef.h: Fix header guard location
      d90b0276
    • Linus Torvalds's avatar
      Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 3abc79dc
      Linus Torvalds authored
      Pull xfs fixes from Chandan Babu:
      
       - Fix an integer overflow bug when processing an fsmap call
      
       - Fix crash due to CPU hot remove event racing with filesystem mount
         operation
      
       - During read-only mount, XFS does not allow the contents of the log to
         be recovered when there are one or more unrecognized rocompat features
         in the primary superblock, since the log might have intent items
         which the kernel does not know how to process
      
       - During recovery of log intent items, XFS now reserves log space
         sufficient for one cycle of a permanent transaction to execute.
         Otherwise, this could lead to livelocks due to non-availability of
         log space
      
       - On an fs which has an ondisk unlinked inode list, trying to delete a
         file or allocate an O_TMPFILE file can cause the fs to shut down
         if the first inode in the ondisk inode list is not present in the
         inode cache. The bug is solved by explicitly loading the first inode
         in the ondisk unlinked inode list into the inode cache if it is not
         already cached
      
         A similar problem arises when the uncached inode is present in the
         middle of the ondisk unlinked inode list. This second bug is
         triggered when executing operations like quotacheck and bulkstat. In
         this case, XFS now reads in the entire ondisk unlinked inode list
      
       - Enable LARP mode only on recent v5 filesystems
      
       - Fix an out-of-bounds memory access in scrub
      
       - Fix a performance bug when locating the tail of the log while
         mounting a filesystem
      
      * tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
        xfs: only call xchk_stats_merge after validating scrub inputs
        xfs: require a relatively recent V5 filesystem for LARP mode
        xfs: make inode unlinked bucket recovery work with quotacheck
        xfs: load uncached unlinked inodes into memory on demand
        xfs: reserve less log space when recovering log intent items
        xfs: fix log recovery when unknown rocompat bits are set
        xfs: reload entire unlinked bucket lists
        xfs: allow inode inactivation during a ro mount log recovery
        xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
        xfs: remove CPU hotplug infrastructure
        xfs: remove the all-mounts list
        xfs: use per-mount cpumask to track nonempty percpu inodegc lists
        xfs: fix an agbno overflow in __xfs_getfsmap_datadev
        xfs: fix per-cpu CIL structure aggregation racing with dying cpus
        xfs: fix select in config XFS_ONLINE_SCRUB_STATS
      3abc79dc
    • Kees Cook's avatar
      cxl/acpi: Annotate struct cxl_cxims_data with __counted_by · c66650d2
      Kees Cook authored
      Prepare for the coming implementation by GCC and Clang of the __counted_by
      attribute. Flexible array members annotated with __counted_by can have
      their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
      (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
      functions).
      
      As found with Coccinelle[1], add __counted_by for struct cxl_cxims_data.
      Additionally, since the element count member must be set before accessing
      the annotated flexible array member, move its initialization earlier.
      
      [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
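
      The ordering rule described above (set the element count before touching
      the annotated flexible array) can be sketched in userspace C. This is
      only an illustration, not the cxl/acpi change itself: COUNTED_BY,
      struct sample_data, sample_alloc() and sample_get() are hypothetical
      names, and the macro expands to nothing on compilers that do not
      provide the counted_by attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the kernel's __counted_by() macro: use the
 * compiler attribute when it is available, otherwise expand to nothing.
 */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

/* Hypothetical structure: a counter followed by a flexible array. */
struct sample_data {
	size_t nr_items;			/* number of elements in items[] */
	unsigned long items[] COUNTED_BY(nr_items);
};

/* Allocate room for n elements and fill them with their own index. */
struct sample_data *sample_alloc(size_t n)
{
	struct sample_data *d;

	d = calloc(1, sizeof(*d) + n * sizeof(d->items[0]));
	if (!d)
		return NULL;

	/* Set the count first, so every later access sees a valid bound. */
	d->nr_items = n;

	for (size_t i = 0; i < n; i++)
		d->items[i] = i;

	return d;
}

/* Manual bounds check, mirroring what an instrumented build would enforce. */
unsigned long sample_get(const struct sample_data *d, size_t i)
{
	assert(i < d->nr_items);
	return d->items[i];
}
```

      If the assignment to nr_items came after the loop, a bounds-checking
      build could see a count of zero while items[] is being written and flag
      every store, which is why the initialization is moved earlier.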
      
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-cxl@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/20230922175319.work.096-kees@kernel.org
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      c66650d2
    • Dan Williams's avatar
      cxl/port: Fix cxl_test register enumeration regression · a76b6251
      Dan Williams authored
      The cxl_test unit test environment models a CXL topology for
      sysfs/user-ABI regression testing. It uses interface mocking via the
      "--wrap=" linker option to redirect cxl_core routines that parse
      hardware registers with versions that just publish objects, like
      devm_cxl_enumerate_decoders().
      
      Starting with:
      
      Commit 19ab69a6 ("cxl/port: Store the port's Component Register mappings in struct cxl_port")
      
      ...port register enumeration is moved into devm_cxl_add_port(). This
      conflicts with the "cxl_test avoids emulating registers" stance, so
      either the port code needs to be refactored (too violent), or modified
      so that register enumeration is skipped on "fake" cxl_test ports
      (annoying, but straightforward).
      
      This conflict has happened previously and the "check for platform
      device" workaround to avoid intrusive refactoring was deployed in those
      scenarios. In general, refactoring should only benefit production code;
      test code needs to remain minimally intrusive to the greatest extent
      possible.
      
      This was missed previously because it may sometimes just cause warning
      messages to be emitted, but it can also cause test failures. The
      backport to -stable is only nice to have for clean cxl_test runs.
      
      Fixes: 19ab69a6 ("cxl/port: Store the port's Component Register mappings in struct cxl_port")
      Cc: stable@vger.kernel.org
      Reported-by: Alison Schofield <alison.schofield@intel.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/169476525052.1013896.6235102957693675187.stgit@dwillia2-xfh.jf.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      a76b6251
    • Steven Rostedt (Google)'s avatar
      eventfs: Remember what dentries were created on dir open · ef36b4f9
      Steven Rostedt (Google) authored
      Using the following code with libtracefs:
      
      	int dfd;
      
      	// create the directory events/kprobes/kp1
      	tracefs_kprobe_raw(NULL, "kp1", "schedule_timeout", "time=$arg1");
      
      	// Open the kprobes directory
      	dfd = tracefs_instance_file_open(NULL, "events/kprobes", O_RDONLY);
      
      	// Do a lookup of the kprobes/kp1 directory (by looking at enable)
      	tracefs_file_exists(NULL, "events/kprobes/kp1/enable");
      
      	// Now create a new entry in the kprobes directory
      	tracefs_kprobe_raw(NULL, "kp2", "schedule_hrtimeout", "expires=$arg1");
      
      	// Do another lookup to create the dentries
      	tracefs_file_exists(NULL, "events/kprobes/kp2/enable");
      
      	// Close the directory
      	close(dfd);
      
      In the sequence above, the first open (dfd) calls
      dcache_dir_open_wrapper(), which creates the dentries and ups their ref
      counts.
      
      Now the creation of "kp2" will add another dentry within the kprobes
      directory.
      
      Upon the close of dfd, eventfs_release() will now do a dput for all the
      entries in kprobes. But this is where the problem lies. The open only
      upped the dentry of kp1 and not kp2. Now the close is decrementing both
      kp1 and kp2, which causes kp2 to get a negative count.
      
      Doing a "trace-cmd reset", which deletes all the kprobes, causes the
      kernel to crash! (due to the messed up accounting of the ref counts).
      
      To solve this, save all the dentries that are opened in the
      dcache_dir_open_wrapper() into an array, and use this array to know what
      dentries to do a dput on in eventfs_release().
      
      Since the dcache_dir_open_wrapper() calls dcache_dir_open() which uses the
      file->private_data, we need to also add a wrapper around dcache_readdir()
      that uses the cursor assigned to the file->private_data. This is because
      the dentries need to also be saved in the file->private_data. To do this
      create the structure:
      
        struct dentry_list {
      	void		*cursor;
      	struct dentry	**dentries;
        };
      
      Which will hold both the cursor and the dentries. Some shuffling around is
      needed to make sure that dcache_dir_open() and dcache_readdir() only see
      the cursor.
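
      The bookkeeping described above can be modelled in plain userspace C.
      This is a simplified sketch, not the eventfs code: struct fake_dentry,
      dir_open() and dir_release() are hypothetical stand-ins for the real
      dentry handling, and the NULL-terminated array is an assumption of this
      sketch. The point it shows is that release drops references only on the
      entries recorded at open time.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dentry: just a reference count. */
struct fake_dentry {
	int refcount;
};

/* Same shape as the structure in the commit message. */
struct dentry_list {
	void			*cursor;	/* unused in this sketch */
	struct fake_dentry	**dentries;	/* NULL-terminated (assumption) */
};

/* Open: take a reference on each existing child and remember it. */
struct dentry_list *dir_open(struct fake_dentry **children, size_t n)
{
	struct dentry_list *dlist = calloc(1, sizeof(*dlist));

	if (!dlist)
		return NULL;

	/* One extra zeroed slot acts as the NULL terminator. */
	dlist->dentries = calloc(n + 1, sizeof(*dlist->dentries));
	if (!dlist->dentries) {
		free(dlist);
		return NULL;
	}

	for (size_t i = 0; i < n; i++) {
		children[i]->refcount++;
		dlist->dentries[i] = children[i];
	}
	return dlist;
}

/* Release: drop references only on what was saved at open time. */
void dir_release(struct dentry_list *dlist)
{
	for (struct fake_dentry **d = dlist->dentries; *d; d++)
		(*d)->refcount--;

	free(dlist->dentries);
	free(dlist);
}
```

      An entry created after dir_open() (kp2 in the scenario above) never
      appears in the saved array, so dir_release() leaves its count alone
      instead of driving it negative.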
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Ajay Kaher <akaher@vmware.com>
      Fixes: 63940449 ("eventfs: Implement eventfs lookup, read, open functions")
      Reported-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      ef36b4f9
    • Zheng Yejian's avatar
      ring-buffer: Fix bytes info in per_cpu buffer stats · 45d99ea4
      Zheng Yejian authored
      The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
      bytes in cpu buffer that have not been consumed. However, currently
      after consuming data by reading file 'trace_pipe', the 'bytes' info
      was not changed as expected.
      
        # cat per_cpu/cpu0/stats
        entries: 0
        overrun: 0
        commit overrun: 0
        bytes: 568             <--- 'bytes' is problematical !!!
        oldest event ts:  8651.371479
        now ts:  8653.912224
        dropped events: 0
        read events: 8
      
      The root cause is incorrect accounting of cpu_buffer->read_bytes. To fix it:
        1. When accounting 'read_bytes', count the consumed event in
           rb_advance_reader();
        2. When accounting 'entries_bytes', exclude the discarded padding event
           that is smaller than the minimum size, because it is invisible to the
           reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE when
           accounting for page-based read/remove/overrun.
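
      As a rough model of the first point, assume the reported 'bytes' value
      is the difference between the bytes written and the bytes consumed. The
      sketch below uses hypothetical names (struct buf_stats and the stats_*
      helpers) and is not the ring buffer code; it only shows why the consume
      path has to add each event's length to the read counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU counters, loosely named after the commit text. */
struct buf_stats {
	size_t entries_bytes;	/* bytes of events written to the buffer */
	size_t read_bytes;	/* bytes of events already consumed */
};

/* Writer side: account for a newly committed event. */
void stats_write(struct buf_stats *s, size_t len)
{
	s->entries_bytes += len;
}

/* Reader side: account for a consumed event. */
void stats_consume(struct buf_stats *s, size_t len)
{
	assert(s->read_bytes + len <= s->entries_bytes);
	s->read_bytes += len;
}

/* Bytes still waiting to be consumed (the assumed meaning of 'bytes'). */
size_t stats_unconsumed(const struct buf_stats *s)
{
	return s->entries_bytes - s->read_bytes;
}
```

      If stats_consume() were skipped, stats_unconsumed() would keep
      reporting a non-zero value after every event had been read, which
      matches the symptom shown in the stats output above.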
      
      Also correct the comments of ring_buffer_bytes_cpu() in this patch.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com
      
      Cc: stable@vger.kernel.org
      Fixes: c64e148a ("trace: Add ring buffer stats to measure rate of events")
      Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      45d99ea4
    • Linus Torvalds's avatar
      Merge tag 'thermal-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 8018e02a
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Unbreak the trip point update sysfs interface that has been broken
        since the 6.3 cycle (Rafael Wysocki)"
      
      * tag 'thermal-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: sysfs: Fix trip_point_hyst_store()
      8018e02a
    • Linus Torvalds's avatar
      Merge tag 'acpi-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · b184c040
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These fix a general ACPI processor driver regression and an ia64 build
        issue, both introduced recently.
      
        Specifics:
      
         - Fix recently introduced uninitialized memory access issue in the
           ACPI processor driver (Michal Wilczynski)
      
         - Fix ia64 build inadvertently broken by recent ACPI processor driver
           changes, which is prudent to do for 6.6 even though ia64 support is
           slated for removal in 6.7 (Ard Biesheuvel)"
      
      * tag 'acpi-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: processor: Fix uninitialized access of buf in acpi_set_pdc_bits()
        acpi: Provide ia64 dummy implementation of acpi_proc_quirk_mwait_check()
      b184c040
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 36fcf381
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Small crop of relatively boring arm64 fixes for -rc3.
      
        That's not to say we don't have any juicy bugs, however, it's just
        that fixes for those are likely to come via -mm and -tip for a hugetlb
        and an atomics issue respectively. I get left with the
        documentation...
      
         - Fix detection of "ClearBHB" and "Hinted Conditional Branch" features
      
         - Fix broken wildcarding for Arm PMU MAINTAINERS entry
      
         - Add missing documentation for userspace-visible ID register fields"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Document missing userspace visible fields in ID_AA64ISAR2_EL1
        arm64/hbc: Document HWCAP2_HBC
        arm64/sme: Include ID_AA64PFR1_EL1.SME in cpu-feature-registers.rst
        arm64: cpufeature: Fix CLRBHB and BC detection
        MAINTAINERS: Use wildcard pattern for ARM PMU headers
      36fcf381
    • Linus Torvalds's avatar
      Merge tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b61ec8d0
      Linus Torvalds authored
      Pull x86 rethunk fixes from Borislav Petkov:
       "Fix the patching ordering between static calls and return thunks"
      
      * tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86,static_call: Fix static-call vs return-thunk
        x86/alternatives: Remove faulty optimization
      b61ec8d0