1. 03 Aug, 2021 6 commits
  2. 02 Aug, 2021 18 commits
    • Vladimir Oltean's avatar
      net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry · 0541a629
      Vladimir Oltean authored
      Currently it is possible to add broken extern_learn FDB entries to the
      bridge in two ways:
      
      1. Entries pointing towards the bridge device that are not local/permanent:
      
      ip link add br0 type bridge
      bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static
      
      2. Entries pointing towards the bridge device or towards a port that
      are marked as local/permanent, however the bridge does not process the
      'permanent' bit in any way, therefore they are recorded as though they
      aren't permanent:
      
      ip link add br0 type bridge
      bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent
      
      Since commit 52e4bec1 ("net: bridge: switchdev: treat local FDBs the
      same as entries towards the bridge"), these incorrect FDB entries can
      even trigger NULL pointer dereferences inside the kernel.
      
      This is because that commit made the assumption that all FDB entries
      that are not local/permanent have a valid destination port. For context,
      local / permanent FDB entries either have fdb->dst == NULL, and these
      point towards the bridge device and are therefore local and not to be
      used for forwarding, or have fdb->dst == a net_bridge_port structure
      (but are to be treated in the same way, i.e. not for forwarding).
      
      That assumption _is_ correct as long as things are working correctly in
      the bridge driver, i.e. we cannot logically have fdb->dst == NULL under
      any circumstance for FDB entries that are not local. However, the
      extern_learn code path where FDB entries are managed by a user space
      controller show that it is possible for the bridge kernel driver to
      misinterpret the NUD flags of an entry transmitted by user space, and
      end up having fdb->dst == NULL while not being a local entry. This is
      invalid and should be rejected.
      
      Before, the two commands listed above both crashed the kernel in this
      check from br_switchdev_fdb_notify:
      
      	struct net_device *dev = info.is_local ? br->dev : dst->dev;
      
      info.is_local == false, dst == NULL.
      
      After this patch, the invalid entry added by the first command is
      rejected:
      
      ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static; ip link del br0
      Error: bridge: FDB entry towards bridge must be permanent.
      
      and the valid entry added by the second command is properly treated as a
      local address and does not crash br_switchdev_fdb_notify anymore:
      
      ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent; ip link del br0
      
      Fixes: eb100e0e ("net: bridge: allow to add externally learned entries from user-space")
      Reported-by: syzbot+9ba1174359adba5a5b7c@syzkaller.appspotmail.com
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: default avatarNikolay Aleksandrov <nikolay@nvidia.com>
      Link: https://lore.kernel.org/r/20210801231730.7493-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      0541a629
    • Jakub Kicinski's avatar
      Revert "mhi: Fix networking tree build." · 1c69d7cf
      Jakub Kicinski authored
      This reverts commit 40e15940.
      
      Looks like this commit breaks the build for me.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1c69d7cf
    • Jakub Kicinski's avatar
      docs: operstates: document IF_OPER_TESTING · 7a7b8635
      Jakub Kicinski authored
      IF_OPER_TESTING is in fact used today.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7a7b8635
    • Jakub Kicinski's avatar
      docs: operstates: fix typo · 66e0da21
      Jakub Kicinski authored
      TVL -> TLV
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66e0da21
    • Jakub Kicinski's avatar
      net: sparx5: fix compiletime_assert for GCC 4.9 · 6387f65e
      Jakub Kicinski authored
      Stephen reports sparx5 broke GCC 4.9 build.
      Move the compiletime_assert() out of the static function.
      Compile-tested only, no object code changes.
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Fixes: f3cad261 ("net: sparx5: add hostmode with phylink support")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6387f65e
    • Wang Hai's avatar
      net: natsemi: Fix missing pci_disable_device() in probe and remove · 7fe74dfd
      Wang Hai authored
      Replace pci_enable_device() with pcim_enable_device(),
      pci_disable_device() and pci_release_regions() will be
      called in release automatically.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarWang Hai <wanghai38@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7fe74dfd
    • Steve Bennett's avatar
      net: phy: micrel: Fix detection of ksz87xx switch · a5e63c7d
      Steve Bennett authored
      The logic for discerning between KSZ8051 and KSZ87XX PHYs is incorrect
      such that the that KSZ87XX switch is not identified correctly.
      
      ksz8051_ksz8795_match_phy_device() uses the parameter ksz_phy_id
      to discriminate whether it was called from ksz8051_match_phy_device()
      or from ksz8795_match_phy_device() but since PHY_ID_KSZ87XX is the
      same value as PHY_ID_KSZ8051, this doesn't work.
      
      Instead use a bool to discriminate the caller.
      
      Without this patch, the KSZ8795 switch port identifies as:
      
      ksz8795-switch spi3.1 ade1 (uninitialized): PHY [dsa-0.1:03] driver [Generic PHY]
      
      With the patch, it identifies correctly:
      
      ksz8795-switch spi3.1 ade1 (uninitialized): PHY [dsa-0.1:03] driver [Micrel KSZ87XX Switch]
      
      Fixes: 8b95599c ("net: phy: micrel: Discern KSZ8051 and KSZ8795 PHYs")
      Signed-off-by: default avatarSteve Bennett <steveb@workware.net.au>
      Reviewed-by: default avatarMarek Vasut <marex@denx.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a5e63c7d
    • David S. Miller's avatar
      Merge branch 'sja1105-fdb-fixes' · cebb5103
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      FDB fixes for NXP SJA1105
      
      I have some upcoming patches that make heavy use of statically installed
      FDB entries, and when testing them on SJA1105P/Q/R/S and SJA1110, it
      became clear that these switches do not behave reliably at all.
      
      - On SJA1110, a static FDB entry cannot be installed at all
      - On SJA1105P/Q/R/S, it is very picky about the inner/outer VLAN type
      - Dynamically learned entries will make us not install static ones, or
        even if we do, they might not take effect
      
      Patch 5/6 has a conflict with net-next (sorry), the commit message of
      that patch describes how to deal with it. Thanks.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cebb5103
    • Vladimir Oltean's avatar
      net: dsa: sja1105: match FDB entries regardless of inner/outer VLAN tag · 47c2c0c2
      Vladimir Oltean authored
      On SJA1105P/Q/R/S and SJA1110, the L2 Lookup Table entries contain a
      maskable "inner/outer tag" bit which means:
      - when set to 1: match single-outer and double tagged frames
      - when set to 0: match untagged and single-inner tagged frames
      - when masked off: match all frames regardless of the type of tag
      
      This driver does not make any meaningful distinction between inner tags
      (matches on TPID) and outer tags (matches on TPID2). In fact, all VLAN
      table entries are installed as SJA1110_VLAN_D_TAG, which means that they
      match on both inner and outer tags.
      
      So it does not make sense that we install FDB entries with the IOTAG bit
      set to 1.
      
      In VLAN-unaware mode, we set both TPID and TPID2 to 0xdadb, so the
      switch will see frames as outer-tagged or double-tagged (never inner).
      So the FDB entries will match if IOTAG is set to 1.
      
      In VLAN-aware mode, we set TPID to 0x8100 and TPID2 to 0x88a8. So the
      switch will see untagged and 802.1Q-tagged packets as inner-tagged, and
      802.1ad-tagged packets as outer-tagged. So untagged and 802.1Q-tagged
      packets will not match FDB entries if IOTAG is set to 1, but 802.1ad
      tagged packets will. Strange.
      
      To fix this, simply mask off the IOTAG bit from FDB entries, and make
      them match regardless of whether the VLAN tag is inner or outer.
      
      Fixes: 1da73821 ("net: dsa: sja1105: Add FDB operations for P/Q/R/S series")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      47c2c0c2
    • Vladimir Oltean's avatar
      net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too · 589918df
      Vladimir Oltean authored
      Similar but not quite the same with what was done in commit b11f0a4c
      ("net: dsa: sja1105: be stateless when installing FDB entries") for
      SJA1105E/T, it is desirable to drop the priv->vlan_aware check and
      simply go ahead and install FDB entries in the VLAN that was given by
      the bridge.
      
      As opposed to SJA1105E/T, in SJA1105P/Q/R/S and SJA1110, the FDB is a
      maskable TCAM, and we are installing VLAN-unaware FDB entries with the
      VLAN ID masked off. However, such FDB entries might completely obscure
      VLAN-aware entries where the VLAN ID is included in the search mask,
      because the switch looks up the FDB from left to right and picks the
      first entry which results in a masked match. So it depends on whether
      the bridge installs first the VLAN-unaware or the VLAN-aware FDB entries.
      
      Anyway, if we had a VLAN-unaware FDB entry towards one set of DESTPORTS
      and a VLAN-aware one towards other set of DESTPORTS, the result is that
      the packets in VLAN-aware mode will be forwarded towards the DESTPORTS
      specified by the VLAN-unaware entry.
      
      To solve this, simply do not use the masked matching ability of the FDB
      for VLAN ID, and always match precisely on it. In VLAN-unaware mode, we
      configure the switch for shared VLAN learning, so the VLAN ID will be
      ignored anyway during lookup, so it is redundant to mask it off in the
      TCAM.
      
      This patch conflicts with net-next commit 0fac6aa0 ("net: dsa: sja1105:
      delete the best_effort_vlan_filtering mode") which changed this line:
      	if (priv->vlan_state != SJA1105_VLAN_UNAWARE) {
      into:
      	if (priv->vlan_aware) {
      
      When merging with net-next, the lines added by this patch should take
      precedence in the conflict resolution (i.e. the "if" condition should be
      deleted in both cases).
      
      Fixes: 1da73821 ("net: dsa: sja1105: Add FDB operations for P/Q/R/S series")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      589918df
    • Vladimir Oltean's avatar
      net: dsa: sja1105: ignore the FDB entry for unknown multicast when adding a new address · 728db843
      Vladimir Oltean authored
      Currently, when sja1105pqrs_fdb_add() is called for a host-joined IPv6
      MDB entry such as 33:33:00:00:00:6a, the search for that address will
      return the FDB entry for SJA1105_UNKNOWN_MULTICAST, which has a
      destination MAC of 01:00:00:00:00:00 and a mask of 01:00:00:00:00:00.
      It returns that entry because, well, it matches, in the sense that
      unknown multicast is supposed by design to match it...
      
      But the issue is that we then proceed to overwrite this entry with the
      one for our precise host-joined multicast address, and the unknown
      multicast entry is no longer there - unknown multicast is now flooded to
      the same group of ports as broadcast, which does not look up the FDB.
      
      To solve this problem, we should ignore searches that return the unknown
      multicast address as the match, and treat them as "no match" which will
      result in the entry being installed to hardware.
      
      For this to work properly, we need to put the result of the FDB search
      in a temporary variable in order to avoid overwriting the l2_lookup
      entry we want to program. The l2_lookup entry returned by the search
      might not have the same set of DESTPORTS and not even the same MACADDR
      as the entry we're trying to add.
      
      Fixes: 4d942354 ("net: dsa: sja1105: offload bridge port flags to device")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      728db843
    • Vladimir Oltean's avatar
      net: dsa: sja1105: invalidate dynamic FDB entries learned concurrently with statically added ones · 6c5fc159
      Vladimir Oltean authored
      The procedure to add a static FDB entry in sja1105 is concurrent with
      dynamic learning performed on all bridge ports and the CPU port.
      
      The switch looks up the FDB from left to right, and also learns
      dynamically from left to right, so it is possible that between the
      moment when we pick up a free slot to install an FDB entry, another slot
      to the left of that one becomes free due to an address ageing out, and
      that other slot is then immediately used by the switch to learn
      dynamically the same address as we're trying to add statically.
      
      The result is that we succeeded to add our static FDB entry, but it is
      being shadowed by a dynamic FDB entry to its left, and the switch will
      behave as if our static FDB entry did not exist.
      
      We cannot really prevent this from happening unless we make the entire
      process to add a static FDB entry a huge critical section where address
      learning is temporarily disabled on _all_ ports, and then re-enabled
      according to the configuration done by sja1105_port_set_learning.
      However, that is kind of disruptive for the operation of the network.
      
      What we can do alternatively is to simply read back the FDB for dynamic
      entries located before our newly added static one, and delete them.
      This will guarantee that our static FDB entry is now operational. It
      will still not guarantee that there aren't dynamic FDB entries to the
      _right_ of that static FDB entry, but at least those entries will age
      out by themselves since they aren't hit, and won't bother anyone.
      
      Fixes: 291d1e72 ("net: dsa: sja1105: Add support for FDB and MDB management")
      Fixes: 1da73821 ("net: dsa: sja1105: Add FDB operations for P/Q/R/S series")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6c5fc159
    • Vladimir Oltean's avatar
      net: dsa: sja1105: overwrite dynamic FDB entries with static ones in .port_fdb_add · e11e865b
      Vladimir Oltean authored
      The SJA1105 switch family leaves it up to software to decide where
      within the FDB to install a static entry, and to concatenate destination
      ports for already existing entries (the FDB is also used for multicast
      entries), it is not as simple as just saying "please add this entry".
      
      This means we first need to search for an existing FDB entry before
      adding a new one. The driver currently manages to fool itself into
      thinking that if an FDB entry already exists, there is nothing to be
      done. But that FDB entry might be dynamically learned, case in which it
      should be replaced with a static entry, but instead it is left alone.
      
      This patch checks the LOCKEDS ("locked/static") bit from found FDB
      entries, and lets the code "goto skip_finding_an_index;" if the FDB
      entry was not static. So we also need to move the place where we set
      LOCKEDS = true, to cover the new case where a dynamic FDB entry existed
      but was dynamic.
      
      Fixes: 291d1e72 ("net: dsa: sja1105: Add support for FDB and MDB management")
      Fixes: 1da73821 ("net: dsa: sja1105: Add FDB operations for P/Q/R/S series")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e11e865b
    • Vladimir Oltean's avatar
      net: dsa: sja1105: fix static FDB writes for SJA1110 · cb81698f
      Vladimir Oltean authored
      The blamed commit made FDB access on SJA1110 functional only as far as
      dumping the existing entries goes, but anything having to do with an
      entry's index (adding, deleting) is still broken.
      
      There are in fact 2 problems, all caused by improperly inheriting the
      code from SJA1105P/Q/R/S:
      - An entry size is SJA1110_SIZE_L2_LOOKUP_ENTRY (24) bytes and not
        SJA1105PQRS_SIZE_L2_LOOKUP_ENTRY (20) bytes
      - The "index" field within an FDB entry is at bits 10:1 for SJA1110 and
        not 15:6 as in SJA1105P/Q/R/S
      
      This patch moves the packing function for the cmd->index outside of
      sja1105pqrs_common_l2_lookup_cmd_packing() and into the device specific
      functions sja1105pqrs_l2_lookup_cmd_packing and
      sja1110_l2_lookup_cmd_packing.
      
      Fixes: 74e7feff ("net: dsa: sja1105: fix dynamic access to L2 Address Lookup table for SJA1110")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cb81698f
    • David S. Miller's avatar
      mhi: Fix networking tree build. · 40e15940
      David S. Miller authored
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      40e15940
    • Yannick Vignon's avatar
      net/sched: taprio: Fix init procedure · ebca25ea
      Yannick Vignon authored
      Commit 13511704 ("net: taprio offload: enforce qdisc to netdev queue mapping")
      resulted in duplicate entries in the qdisc hash.
      While this did not impact the overall operation of the qdisc and taprio
      code paths, it did result in an infinite loop when dumping the qdisc
      properties, at least on one target (NXP LS1028 ARDB).
      Removing the duplicate call to qdisc_hash_add() solves the problem.
      
      Fixes: 13511704 ("net: taprio offload: enforce qdisc to netdev queue mapping")
      Signed-off-by: default avatarYannick Vignon <yannick.vignon@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ebca25ea
    • Jakub Sitnicki's avatar
      net, gro: Set inner transport header offset in tcp/udp GRO hook · d51c5907
      Jakub Sitnicki authored
      GSO expects inner transport header offset to be valid when
      skb->encapsulation flag is set. GSO uses this value to calculate the length
      of an individual segment of a GSO packet in skb_gso_transport_seglen().
      
      However, tcp/udp gro_complete callbacks don't update the
      skb->inner_transport_header when processing an encapsulated TCP/UDP
      segment. As a result a GRO skb has ->inner_transport_header set to a value
      carried over from earlier skb processing.
      
      This can have mild to tragic consequences. From miscalculating the GSO
      segment length to triggering a page fault [1], when trying to read TCP/UDP
      header at an address past the skb->data page.
      
      The latter scenario leads to an oops report like so:
      
        BUG: unable to handle page fault for address: ffff9fa7ec00d008
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 123f201067 P4D 123f201067 PUD 123f209067 PMD 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 44 PID: 0 Comm: swapper/44 Not tainted 5.4.53-cloudflare-2020.7.21 #1
        Hardware name: HYVE EDGE-METAL-GEN10/HS-1811DLite1, BIOS V2.15 02/21/2020
        RIP: 0010:skb_gso_transport_seglen+0x44/0xa0
        Code: c0 41 83 e0 11 f6 87 81 00 00 00 20 74 30 0f b7 87 aa 00 00 00 0f [...]
        RSP: 0018:ffffad8640bacbb8 EFLAGS: 00010202
        RAX: 000000000000feda RBX: ffff9fcc8d31bc00 RCX: ffff9fa7ec00cffc
        RDX: ffff9fa7ebffdec0 RSI: 000000000000feda RDI: 0000000000000122
        RBP: 00000000000005c4 R08: 0000000000000001 R09: 0000000000000000
        R10: ffff9fe588ae3800 R11: ffff9fe011fc92f0 R12: ffff9fcc8d31bc00
        R13: ffff9fe0119d4300 R14: 00000000000005c4 R15: ffff9fba57d70900
        FS:  0000000000000000(0000) GS:ffff9fe68df00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff9fa7ec00d008 CR3: 0000003e99b1c000 CR4: 0000000000340ee0
        Call Trace:
         <IRQ>
         skb_gso_validate_network_len+0x11/0x70
         __ip_finish_output+0x109/0x1c0
         ip_sublist_rcv_finish+0x57/0x70
         ip_sublist_rcv+0x2aa/0x2d0
         ? ip_rcv_finish_core.constprop.0+0x390/0x390
         ip_list_rcv+0x12b/0x14f
         __netif_receive_skb_list_core+0x2a9/0x2d0
         netif_receive_skb_list_internal+0x1b5/0x2e0
         napi_complete_done+0x93/0x140
         veth_poll+0xc0/0x19f [veth]
         ? mlx5e_napi_poll+0x221/0x610 [mlx5_core]
         net_rx_action+0x1f8/0x790
         __do_softirq+0xe1/0x2bf
         irq_exit+0x8e/0xc0
         do_IRQ+0x58/0xe0
         common_interrupt+0xf/0xf
         </IRQ>
      
      The bug can be observed in a simple setup where we send IP/GRE/IP/TCP
      packets into a netns over a veth pair. Inside the netns, packets are
      forwarded to dummy device:
      
        trafgen -> [veth A]--[veth B] -forward-> [dummy]
      
      For veth B to GRO aggregate packets on receive, it needs to have an XDP
      program attached (for example, a trivial XDP_PASS). Additionally, for UDP,
      we need to enable GSO_UDP_L4 feature on the device:
      
        ip netns exec A ethtool -K AB rx-udp-gro-forwarding on
      
      The last component is an artificial delay to increase the chances of GRO
      batching happening:
      
        ip netns exec A tc qdisc add dev AB root \
           netem delay 200us slot 5ms 10ms packets 2 bytes 64k
      
      With such a setup in place, the bug can be observed by tracing the skb
      outer and inner offsets when GSO skb is transmitted from the dummy device:
      
      tcp:
      
      FUNC              DEV   SKB_LEN  NH  TH ENC INH ITH GSO_SIZE GSO_TYPE
      ip_finish_output  dumB     2830 270 290   1 294 254     1383 (tcpv4,gre,)
                                                      ^^^
      udp:
      
      FUNC              DEV   SKB_LEN  NH  TH ENC INH ITH GSO_SIZE GSO_TYPE
      ip_finish_output  dumB     2818 270 290   1 294 254     1383 (gre,udp_l4,)
                                                      ^^^
      
      Fix it by updating the inner transport header offset in tcp/udp
      gro_complete callbacks, similar to how {inet,ipv6}_gro_complete callbacks
      update the inner network header offset, when skb->encapsulation flag is
      set.
      
      [1] https://lore.kernel.org/netdev/CAKxSbF01cLpZem2GFaUaifh0S-5WYViZemTicAg7FCHOnh6kug@mail.gmail.com/
      
      Fixes: bf296b12 ("tcp: Add GRO support")
      Fixes: f993bc25 ("net: core: handle encapsulation offloads when computing segment lengths")
      Fixes: e20cf8d3 ("udp: implement GRO for plain UDP sockets.")
      Reported-by: default avatarAlex Forster <aforster@cloudflare.com>
      Signed-off-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d51c5907
    • Prabhakar Kushwaha's avatar
      qede: fix crash in rmmod qede while automatic debug collection · 1159e25c
      Prabhakar Kushwaha authored
      A crash has been observed if rmmod is done while automatic debug
      collection in progress. It is due to a race  condition between
      both of them.
      
      To fix stop the sp_task during unload to avoid running qede_sp_task
      even if they are schedule during removal process.
      Signed-off-by: default avatarAlok Prasad <palok@marvell.com>
      Signed-off-by: default avatarShai Malin <smalin@marvell.com>
      Signed-off-by: default avatarAriel Elior <aelior@marvell.com>
      Signed-off-by: default avatarPrabhakar Kushwaha <pkushwaha@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1159e25c
  3. 30 Jul, 2021 16 commits
    • Linus Torvalds's avatar
      Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c7d10223
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
        (mac80211) and netfilter trees.
      
        Current release - regressions:
      
         - mac80211: fix starting aggregation sessions on mesh interfaces
      
        Current release - new code bugs:
      
         - sctp: send pmtu probe only if packet loss in Search Complete state
      
         - bnxt_en: add missing periodic PHC overflow check
      
         - devlink: fix phys_port_name of virtual port and merge error
      
         - hns3: change the method of obtaining default ptp cycle
      
         - can: mcba_usb_start(): add missing urb->transfer_dma initialization
      
        Previous releases - regressions:
      
         - set true network header for ECN decapsulation
      
         - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
           LRO
      
         - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
           PHY
      
         - sctp: fix return value check in __sctp_rcv_asconf_lookup
      
        Previous releases - always broken:
      
         - bpf:
             - more spectre corner case fixes, introduce a BPF nospec
               instruction for mitigating Spectre v4
             - fix OOB read when printing XDP link fdinfo
             - sockmap: fix cleanup related races
      
         - mac80211: fix enabling 4-address mode on a sta vif after assoc
      
         - can:
             - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
             - j1939: j1939_session_deactivate(): clarify lifetime of session
               object, avoid UAF
             - fix number of identical memory leaks in USB drivers
      
         - tipc:
             - do not blindly write skb_shinfo frags when doing decryption
             - fix sleeping in tipc accept routine"
      
      * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
        gve: Update MAINTAINERS list
        can: esd_usb2: fix memory leak
        can: ems_usb: fix memory leak
        can: usb_8dev: fix memory leak
        can: mcba_usb_start(): add missing urb->transfer_dma initialization
        can: hi311x: fix a signedness bug in hi3110_cmd()
        MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
        bpf: Fix leakage due to insufficient speculative store bypass mitigation
        bpf: Introduce BPF nospec instruction for mitigating Spectre v4
        sis900: Fix missing pci_disable_device() in probe and remove
        net: let flow have same hash in two directions
        nfc: nfcsim: fix use after free during module unload
        tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
        sctp: fix return value check in __sctp_rcv_asconf_lookup
        nfc: s3fwrn5: fix undefined parameter values in dev_err()
        net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
        net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
        net/mlx5: Unload device upon firmware fatal error
        net/mlx5e: Fix page allocation failure for ptp-RQ over SF
        net/mlx5e: Fix page allocation failure for trap-RQ over SF
        ...
      c7d10223
    • Linus Torvalds's avatar
      Merge tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · e1dab4c0
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These revert a recent IRQ resources handling modification that turned
        out to be problematic, fix suspend-to-idle handling on AMD platforms
        to take upcoming systems into account properly and fix the retrieval
        of the DPTF attributes of the PCH FIVR.
      
        Specifics:
      
         - Revert recent change of the ACPI IRQ resources handling that
           attempted to improve the ACPI IRQ override selection logic, but
           introduced serious regressions on some systems (Hui Wang).
      
         - Fix up quirks for AMD platforms in the suspend-to-idle support code
           so as to take upcoming systems using uPEP HID AMDI007 into account
           as appropriate (Mario Limonciello).
      
         - Fix the code retrieving DPTF attributes of the PCH FIVR so that it
           agrees on the return data type with the ACPI control method
           evaluated for this purpose (Srinivas Pandruvada)"
      
      * tag 'acpi-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: DPTF: Fix reading of attributes
        Revert "ACPI: resources: Add checks for ACPI IRQ override"
        ACPI: PM: Add support for upcoming AMD uPEP HID AMDI007
      e1dab4c0
    • Linus Torvalds's avatar
      pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds authored
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the EPOLL
      interface, and expected "edge triggered" be to "any new write will
      trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4] by
      Michal Kerrisk about a test for this epoll behavior - but at that point
      we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/Reported-by: default avatarSandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a34b13a
    • Rafael J. Wysocki's avatar
      Merge branches 'acpi-resources' and 'acpi-dptf' · e83f54ea
      Rafael J. Wysocki authored
      * acpi-resources:
        Revert "ACPI: resources: Add checks for ACPI IRQ override"
      
      * acpi-dptf:
        ACPI: DPTF: Fix reading of attributes
      e83f54ea
    • Linus Torvalds's avatar
      Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block · 4669e13c
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - gendisk freeing fix (Christoph)
      
       - blk-iocost wake ordering fix (Tejun)
      
       - tag allocation error handling fix (John)
      
       - loop locking fix. While this isn't the prettiest fix in the world,
         nobody has any good alternatives for 5.14. Something to likely
         revisit for 5.15. (Tetsuo)
      
      * tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
        block: delay freeing the gendisk
        blk-iocost: fix operation ordering in iocg_wake_fn()
        blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
        loop: reintroduce global lock for safe loop_validate_file() traversal
      4669e13c
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block · 27eb687b
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - A fix for block backed reissue (me)
      
       - Reissue context hardening (me)
      
       - Async link locking fix (Pavel)
      
      * tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
        io_uring: fix poll requests leaking second poll entries
        io_uring: don't block level reissue off completion path
        io_uring: always reissue from task_work context
        io_uring: fix race in unified task_work running
        io_uring: fix io_prep_async_link locking
      27eb687b
    • Linus Torvalds's avatar
      Merge tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block · f6c5971b
      Linus Torvalds authored
      Pull libata fixlets from Jens Axboe:
      
       - A fix for PIO highmem (Christoph)
      
       - Kill HAVE_IDE as it's now unused (Lukas)
      
      * tag 'libata-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
        arch: Kconfig: clean up obsolete use of HAVE_IDE
        libata: fix ata_pio_sector for CONFIG_HIGHMEM
      f6c5971b
    • Linus Torvalds's avatar
      Merge tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 051df241
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fix -Warray-bounds warning, to help external patchset to make it
         default treewide
      
       - fix writeable device accounting (syzbot report)
      
       - fix fsync and log replay after a rename and inode eviction
      
       - fix potentially lost error code when submitting multiple bios for
         compressed range
      
      * tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: calculate number of eb pages properly in csum_tree_block
        btrfs: fix rw device counting in __btrfs_free_extra_devids
        btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
        btrfs: mark compressed range uptodate only if all bio succeed
      051df241
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · 8723bc8f
      Linus Torvalds authored
      Pull HID fixes from Jiri Kosina:
      
       - resume timing fix for intel-ish driver (Ye Xiang)
      
       - fix for using incorrect MMIO register in amd_sfh driver (Dylan
         MacKenzie)
      
       - Cintiq 24HDT / 27QHDT regression fix and touch processing fix for
         Wacom driver (Jason Gerecke)
      
       - device removal bugfix for ft260 driver (Michael Zaidman)
      
       - other small assorted fixes
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        HID: ft260: fix device removal due to USB disconnect
        HID: wacom: Skip processing of touches with negative slot values
        HID: wacom: Re-enable touch by default for Cintiq 24HDT / 27QHDT
        HID: Kconfig: Fix spelling mistake "Uninterruptable" -> "Uninterruptible"
        HID: apple: Add support for Keychron K1 wireless keyboard
        HID: fix typo in Kconfig
        HID: ft260: fix format type warning in ft260_word_show()
        HID: amd_sfh: Use correct MMIO register for DMA address
        HID: asus: Remove check for same LED brightness on set
        HID: intel-ish-hid: use async resume function
      8723bc8f
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · ad6ec09d
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "7 patches.
      
        Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
        migration, and memcg)"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
        slub: fix unreclaimable slab stat for bulk free
        mm/migrate: fix NR_ISOLATED corruption on 64-bit
        mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
        ocfs2: issue zeroout to EOF blocks
        ocfs2: fix zero out valid data
        lib/test_string.c: move string selftest in the Runtime Testing menu
      ad6ec09d
    • Jakub Kicinski's avatar
      Merge tag 'linux-can-fixes-for-5.14-20210730' of... · 8d670412
      Jakub Kicinski authored
      Merge tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can 2021-07-30
      
      The first patch is by me and adds Yasushi SHOJI as a reviewer for the
      Microchip CAN BUS Analyzer Tool driver.
      
      Dan Carpenter's patch fixes a signedness bug in the hi311x driver.
      
      Pavel Skripkin provides 4 patches, the first targets the mcba_usb
      driver by adding the missing urb->transfer_dma initialization, which
      was broken in a previous commit. The last 3 patches fix a memory leak
      in the usb_8dev, ems_usb and esd_usb2 driver.
      
      * tag 'linux-can-fixes-for-5.14-20210730' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
        can: esd_usb2: fix memory leak
        can: ems_usb: fix memory leak
        can: usb_8dev: fix memory leak
        can: mcba_usb_start(): add missing urb->transfer_dma initialization
        can: hi311x: fix a signedness bug in hi3110_cmd()
        MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
      ====================
      
      Link: https://lore.kernel.org/r/20210730070526.1699867-1-mkl@pengutronix.deSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      8d670412
    • Wang Hai's avatar
      mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook() · 121dffe2
      Wang Hai authored
      When I use kfree_rcu() to free a large memory allocated by kmalloc_node(),
      the following dump occurs.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000020
        [...]
        Oops: 0000 [#1] SMP
        [...]
        Workqueue: events kfree_rcu_work
        RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
        RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
        RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
        [...]
        Call Trace:
          kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
          kfree_bulk include/linux/slab.h:413 [inline]
          kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
          process_one_work+0x207/0x530 kernel/workqueue.c:2276
          worker_thread+0x320/0x610 kernel/workqueue.c:2422
          kthread+0x13d/0x160 kernel/kthread.c:313
          ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
      When kmalloc_node() a large memory, page is allocated, not slab, so when
      freeing memory via kfree_rcu(), this large memory should not be used by
      memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for
      slab.
      
      Using page_objcgs_check() instead of page_objcgs() in
      memcg_slab_free_hook() to fix this bug.
      
      Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
      Fixes: 270c6a71 ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
      Signed-off-by: default avatarWang Hai <wanghai38@huawei.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      121dffe2
    • Shakeel Butt's avatar
      slub: fix unreclaimable slab stat for bulk free · f227f0fa
      Shakeel Butt authored
      SLUB uses page allocator for higher order allocations and update
      unreclaimable slab stat for such allocations.  At the moment, the bulk
      free for SLUB does not share code with normal free code path for these
      type of allocations and have missed the stat update.  So, fix the stat
      update by common code.  The user visible impact of the bug is the
      potential of inconsistent unreclaimable slab stat visible through
      meminfo and vmstat.
      
      Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
      Fixes: 6a486c0a ("mm, sl[ou]b: improve memory accounting")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f227f0fa
    • Aneesh Kumar K.V's avatar
      mm/migrate: fix NR_ISOLATED corruption on 64-bit · b5916c02
      Aneesh Kumar K.V authored
      Similar to commit 2da9f630 ("mm/vmscan: fix NR_ISOLATED_FILE
      corruption on 64-bit") avoid using unsigned int for nr_pages.  With
      unsigned int type the large unsigned int converts to a large positive
      signed long.
      
      Symptoms include CMA allocations hanging forever due to
      alloc_contig_range->...->isolate_migratepages_block waiting forever in
      "while (unlikely(too_many_isolated(pgdat)))".
      
      Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
      Fixes: c5fc5c3a ("mm: migrate: account THP NUMA migration counters correctly")
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reported-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5916c02
    • Johannes Weiner's avatar
      mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code · 30def935
      Johannes Weiner authored
      Dan Carpenter reports:
      
          The patch 2d146aa3: "mm: memcontrol: switch to rstat" from Apr
          29, 2021, leads to the following static checker warning:
      
      	    kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
      	    warn: sleeping in atomic context
      
          mm/memcontrol.c
            3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
            3573  {
            3574          unsigned long val;
            3575
            3576          if (mem_cgroup_is_root(memcg)) {
            3577                  cgroup_rstat_flush(memcg->css.cgroup);
      			    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
          This is from static analysis and potentially a false positive.  The
          problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
          which holds an rcu_read_lock().  And the cgroup_rstat_flush() function
          can sleep.
      
            3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
            3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
            3580                  if (swap)
            3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
            3582          } else {
            3583                  if (!swap)
            3584                          val = page_counter_read(&memcg->memory);
            3585                  else
            3586                          val = page_counter_read(&memcg->memsw);
            3587          }
            3588          return val;
            3589  }
      
      __mem_cgroup_threshold() indeed holds the rcu lock.  In addition, the
      thresholding code is invoked during stat changes, and those contexts
      have irqs disabled as well.  If the lock breaking occurs inside the
      flush function, it will result in a sleep from an atomic context.
      
      Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.
      
      Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
      Fixes: 2d146aa3 ("mm: memcontrol: switch to rstat")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarChris Down <chris@chrisdown.name>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      30def935
    • Junxiao Bi's avatar
      ocfs2: issue zeroout to EOF blocks · 9449ad33
      Junxiao Bi authored
      For punch holes in EOF blocks, fallocate used buffer write to zero the
      EOF blocks in last cluster.  But since ->writepage will ignore EOF
      pages, those zeros will not be flushed.
      
      This "looks" ok as commit 6bba4471 ("ocfs2: fix data corruption by
      fallocate") will zero the EOF blocks when extend the file size, but it
      isn't.  The problem happened on those EOF pages, before writeback, those
      pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
      set, when writeback run by write_cache_pages(), DIRTY flag on the page
      was cleared, but DIRTY flag on the buffer_head not.
      
      When next write happened to those EOF pages, since buffer_head already
      had DIRTY flag set, it would not mark page DIRTY again.  That made
      writeback ignore them forever.  That will cause data corruption.  Even
      directio write can't work because it will fail when trying to drop pages
      caches before direct io, as it found the buffer_head for those pages
      still had DIRTY flag set, then it will fall back to buffer io mode.
      
      To make a summary of the issue, as writeback ingores EOF pages, once any
      EOF page is generated, any write to it will only go to the page cache,
      it will never be flushed to disk even file size extends and that page is
      not EOF page any more.  The fix is to avoid zero EOF blocks with buffer
      write.
      
      The following code snippet from qemu-img could trigger the corruption.
      
        656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
        ...
        660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
        660   fallocate(11, 0, 2275868672, 327680) = 0
        658   pwrite64(11, "
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.comSigned-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9449ad33