1. 14 Jul, 2014 5 commits
    • Christoph Schulz's avatar
      net: pppoe: use correct channel MTU when using Multilink PPP · a8a3e41c
      Christoph Schulz authored
      The PPP channel MTU is used with Multilink PPP when ppp_mp_explode() (see
      ppp_generic module) tries to determine how big a fragment might be. According
      to RFC 1661, the MTU excludes the 2-byte PPP protocol field, see the
      corresponding comment and code in ppp_mp_explode():
      
      		/*
      		 * hdrlen includes the 2-byte PPP protocol field, but the
      		 * MTU counts only the payload excluding the protocol field.
      		 * (RFC1661 Section 2)
      		 */
      		mtu = pch->chan->mtu - (hdrlen - 2);
      
      However, the pppoe module *does* include the PPP protocol field in the channel
      MTU, which is wrong as it causes the PPP payload to be 1-2 bytes too big under
      certain circumstances (one byte if PPP protocol compression is used, two
      otherwise), causing the generated Ethernet packets to be dropped. So the pppoe
      module has to subtract two bytes from the channel MTU. This error only
      manifests itself when using Multilink PPP, as otherwise the channel MTU is not
      used anywhere.
      
      In the following, I will describe how to reproduce this bug. We configure two
      pppd instances for multilink PPP over two PPPoE links, say eth2 and eth3, with
      a MTU of 1492 bytes for each link and a MRRU of 2976 bytes. (This MRRU is
      computed by adding the two link MTUs and subtracting the MP header twice, which
      is 4 bytes long.) The necessary pppd statements on both sides are "multilink
      mtu 1492 mru 1492 mrru 2976". On the client side, we additionally need "plugin
      rp-pppoe.so eth2" and "plugin rp-pppoe.so eth3", respectively; on the server
      side, we additionally need to start two pppoe-server instances to be able to
      establish two PPPoE sessions, one over eth2 and one over eth3. We set the MTU
      of the PPP network interface to the MRRU (2976) on both sides of the connection
      in order to make use of the higher bandwidth. (If we didn't do that, IP
      fragmentation would kick in, which we want to avoid.)
      
      Now we send a ICMPv4 echo request with a payload of 2948 bytes from client to
      server over the PPP link. This results in the following network packet:
      
         2948 (echo payload)
       +    8 (ICMPv4 header)
       +   20 (IPv4 header)
      ---------------------
         2976 (PPP payload)
      
      These 2976 bytes do not exceed the MTU of the PPP network interface, so the
      IP packet is not fragmented. Now the multilink PPP code in ppp_mp_explode()
      prepends one protocol byte (0x21 for IPv4), making the packet one byte bigger
      than the negotiated MRRU. So this packet would have to be divided in three
      fragments. But this does not happen as each link MTU is assumed to be two bytes
      larger. So this packet is diveded into two fragments only, one of size 1489 and
      one of size 1488. Now we have for that bigger fragment:
      
         1489 (PPP payload)
       +    4 (MP header)
       +    2 (PPP protocol field for the MP payload (0x3d))
       +    6 (PPPoE header)
      --------------------------
         1501 (Ethernet payload)
      
      This packet exceeds the link MTU and is discarded.
      
      If one configures the link MTU on the client side to 1501, one can see the
      discarded Ethernet frames with tcpdump running on the client. A
      
      ping -s 2948 -c 1 192.168.15.254
      
      leads to the smaller fragment that is correctly received on the server side:
      
      (tcpdump -vvvne -i eth3 pppoes and ppp proto 0x3d)
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x3] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [end], length 1492
      
      and to the bigger fragment that is not received on the server side:
      
      (tcpdump -vvvne -i eth2 pppoes and ppp proto 0x3d)
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1515: PPPoE  [ses 0x5] MLPPP (0x003d), length 1495: seq 0x000,
        Flags [begin], length 1493
      
      With the patch below, we correctly obtain three fragments:
      
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [begin], length 1492
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [none], length 1492
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 27: PPPoE  [ses 0x1] MLPPP (0x003d), length 7: seq 0x000,
        Flags [end], length 5
      
      And the ICMPv4 echo request is successfully received at the server side:
      
      IP (tos 0x0, ttl 64, id 21925, offset 0, flags [DF], proto ICMP (1),
        length 2976)
          192.168.222.2 > 192.168.15.254: ICMP echo request, id 30530, seq 0,
            length 2956
      
      The bug was introduced in commit c9aa6895
      ("[PPPOE]: Advertise PPPoE MTU") from the very beginning. This patch applies
      to 3.10 upwards but the fix can be applied (with minor modifications) to
      kernels as old as 2.6.32.
      Signed-off-by: default avatarChristoph Schulz <develop@kristov.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a8a3e41c
    • Mathias Krause's avatar
      neigh: sysctl - simplify address calculation of gc_* variables · 9ecf07a1
      Mathias Krause authored
      The code in neigh_sysctl_register() relies on a specific layout of
      struct neigh_table, namely that the 'gc_*' variables are directly
      following the 'parms' member in a specific order. The code, though,
      expresses this in the most ugly way.
      
      Get rid of the ugly casts and use the 'tbl' pointer to get a handle to
      the table. This way we can refer to the 'gc_*' variables directly.
      
      Similarly seen in the grsecurity patch, written by Brad Spengler.
      Signed-off-by: default avatarMathias Krause <minipli@googlemail.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9ecf07a1
    • Daniel Borkmann's avatar
      net: sctp: fix information leaks in ulpevent layer · 8f2e5ae4
      Daniel Borkmann authored
      While working on some other SCTP code, I noticed that some
      structures shared with user space are leaking uninitialized
      stack or heap buffer. In particular, struct sctp_sndrcvinfo
      has a 2 bytes hole between .sinfo_flags and .sinfo_ppid that
      remains unfilled by us in sctp_ulpevent_read_sndrcvinfo() when
      putting this into cmsg. But also struct sctp_remote_error
      contains a 2 bytes hole that we don't fill but place into a skb
      through skb_copy_expand() via sctp_ulpevent_make_remote_error().
      
      Both structures are defined by the IETF in RFC6458:
      
      * Section 5.3.2. SCTP Header Information Structure:
      
        The sctp_sndrcvinfo structure is defined below:
      
        struct sctp_sndrcvinfo {
          uint16_t sinfo_stream;
          uint16_t sinfo_ssn;
          uint16_t sinfo_flags;
          <-- 2 bytes hole  -->
          uint32_t sinfo_ppid;
          uint32_t sinfo_context;
          uint32_t sinfo_timetolive;
          uint32_t sinfo_tsn;
          uint32_t sinfo_cumtsn;
          sctp_assoc_t sinfo_assoc_id;
        };
      
      * 6.1.3. SCTP_REMOTE_ERROR:
      
        A remote peer may send an Operation Error message to its peer.
        This message indicates a variety of error conditions on an
        association. The entire ERROR chunk as it appears on the wire
        is included in an SCTP_REMOTE_ERROR event. Please refer to the
        SCTP specification [RFC4960] and any extensions for a list of
        possible error formats. An SCTP error notification has the
        following format:
      
        struct sctp_remote_error {
          uint16_t sre_type;
          uint16_t sre_flags;
          uint32_t sre_length;
          uint16_t sre_error;
          <-- 2 bytes hole  -->
          sctp_assoc_t sre_assoc_id;
          uint8_t  sre_data[];
        };
      
      Fix this by setting both to 0 before filling them out. We also
      have other structures shared between user and kernel space in
      SCTP that contains holes (e.g. struct sctp_paddrthlds), but we
      copy that buffer over from user space first and thus don't need
      to care about it in that cases.
      
      While at it, we can also remove lengthy comments copied from
      the draft, instead, we update the comment with the correct RFC
      number where one can look it up.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8f2e5ae4
    • françois romieu's avatar
      845d6fcc
    • Florian Fainelli's avatar
      net: bcmgenet: fix RGMII_MODE_EN bit · 5a680fad
      Florian Fainelli authored
      RGMII_MODE_EN bit was defined to 0, while it is actually 6. It was not
      much of a problem on older designs where this was a no-op, and the RGMII
      data-path would always be enabled, but newer GENET controllers need to
      explicitely enable their RGMII data-pad using this bit.
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5a680fad
  2. 11 Jul, 2014 5 commits
  3. 10 Jul, 2014 6 commits
  4. 09 Jul, 2014 8 commits
    • hayeswang's avatar
      r8169: disable L23 · b51ecea8
      hayeswang authored
      For RTL8411, RTL8111G, RTL8402, RTL8105, and RTL8106, disable the feature
      of entering the L2/L3 link state of the PCIe. When the nic starts the process
      of entering the L2/L3 link state and the PCI reset occurs before the work
      is finished, the work would be queued and continue after the next the PCI
      reset occurs. This causes the device stays in L2/L3 link state, and the system
      couldn't find the device.
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Acked-by: default avatarFrancois Romieu <romieu@fr.zoreil.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b51ecea8
    • Ben Pfaff's avatar
      netlink: Fix handling of error from netlink_dump(). · ac30ef83
      Ben Pfaff authored
      netlink_dump() returns a negative errno value on error.  Until now,
      netlink_recvmsg() directly recorded that negative value in sk->sk_err, but
      that's wrong since sk_err takes positive errno values.  (This manifests as
      userspace receiving a positive return value from the recv() system call,
      falsely indicating success.) This bug was introduced in the commit that
      started checking the netlink_dump() return value, commit b44d211e (netlink:
      handle errors from netlink_dump()).
      
      Multithreaded Netlink dumps are one way to trigger this behavior in
      practice, as described in the commit message for the userspace workaround
      posted here:
          http://openvswitch.org/pipermail/dev/2014-June/042339.html
      
      This commit also fixes the same bug in netlink_poll(), introduced in commit
      cd1df525 (netlink: add flow control for memory mapped I/O).
      Signed-off-by: default avatarBen Pfaff <blp@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ac30ef83
    • Thomas Fitzsimmons's avatar
      net: mvneta: Fix big endian issue in mvneta_txq_desc_csum() · 0a198587
      Thomas Fitzsimmons authored
      This commit fixes the command value generated for CSUM calculation
      when running in big endian mode.  The Ethernet protocol ID for IP was
      being unconditionally byte-swapped in the layer 3 protocol check (with
      swab16), which caused the mvneta driver to not function correctly in
      big endian mode.  This patch byte-swaps the ID conditionally with
      htons.
      
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: default avatarThomas Fitzsimmons <fitzsim@fitzsim.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a198587
    • Thomas Petazzoni's avatar
      net: mvneta: fix operation in 10 Mbit/s mode · 4d12bc63
      Thomas Petazzoni authored
      As reported by Maggie Mae Roxas, the mvneta driver doesn't behave
      properly in 10 Mbit/s mode. This is due to a misconfiguration of the
      MVNETA_GMAC_AUTONEG_CONFIG register: bit MVNETA_GMAC_CONFIG_MII_SPEED
      must be set for a 100 Mbit/s speed, but cleared for a 10 Mbit/s speed,
      which the driver was not properly doing. This commit adjusts that by
      setting the MVNETA_GMAC_CONFIG_MII_SPEED bit only in 100 Mbit/s mode,
      and relying on the fact that all the speed related bits of this
      register are cleared at the beginning of the mvneta_adjust_link()
      function.
      
      This problem exists since c5aff182 ("net: mvneta: driver for
      Marvell Armada 370/XP network unit") which is the commit that
      introduced the mvneta driver in the kernel.
      
      Cc: <stable@vger.kernel.org> # v3.8+
      Fixes: c5aff182 ("net: mvneta: driver for Marvell Armada 370/XP network unit")
      Reported-by: default avatarMaggie Mae Roxas <maggie.mae.roxas@gmail.com>
      Cc: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
      Signed-off-by: default avatarThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4d12bc63
    • Amir Vadai's avatar
      net/mlx4_en: Ignore budget on TX napi polling · fbc6daf1
      Amir Vadai authored
      It is recommended that TX work not count against the quota.
      The cost of TX packet liberation is a minute percentage of what it costs to
      process an RX frame. Furthermore, that SKB freeing makes memory available for
      other paths in the stack.
      
      Give the TX a larger budget and be more aggressive about cleaning up the Tx
      descriptors this budget could be changed using ethtool:
      $ ethtool -C eth1 tx-frames-irq <budget>
      Signed-off-by: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fbc6daf1
    • hayeswang's avatar
      r8152: wake up the device before dumping the hw counter · 0b030244
      hayeswang authored
      The device should be waked up from runtime suspend before dumping
      the hw counter.
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0b030244
    • Andrey Utkin's avatar
      appletalk: Fix socket referencing in skb · 36beddc2
      Andrey Utkin authored
      Setting just skb->sk without taking its reference and setting a
      destructor is invalid. However, in the places where this was done, skb
      is used in a way not requiring skb->sk setting. So dropping the setting
      of skb->sk.
      Thanks to Eric Dumazet <eric.dumazet@gmail.com> for correct solution.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=79441Reported-by: default avatarEd Martin <edman007@edman007.com>
      Signed-off-by: default avatarAndrey Utkin <andrey.krieger.utkin@gmail.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      36beddc2
    • Dmitry Popov's avatar
      ip_tunnel: fix ip_tunnel_lookup · e0056593
      Dmitry Popov authored
      This patch fixes 3 similar bugs where incoming packets might be routed into
      wrong non-wildcard tunnels:
      
      1) Consider the following setup:
          ip address add 1.1.1.1/24 dev eth0
          ip address add 1.1.1.2/24 dev eth0
          ip tunnel add ipip1 remote 2.2.2.2 local 1.1.1.1 mode ipip dev eth0
          ip link set ipip1 up
      
      Incoming ipip packets from 2.2.2.2 were routed into ipip1 even if it has dst =
      1.1.1.2. Moreover even if there was wildcard tunnel like
         ip tunnel add ipip0 remote 2.2.2.2 local any mode ipip dev eth0
      but it was created before explicit one (with local 1.1.1.1), incoming ipip
      packets with src = 2.2.2.2 and dst = 1.1.1.2 were still routed into ipip1.
      
      Same issue existed with all tunnels that use ip_tunnel_lookup (gre, vti)
      
      2)  ip address add 1.1.1.1/24 dev eth0
          ip tunnel add ipip1 remote 2.2.146.85 local 1.1.1.1 mode ipip dev eth0
          ip link set ipip1 up
      
      Incoming ipip packets with dst = 1.1.1.1 were routed into ipip1, no matter what
      src address is. Any remote ip address which has ip_tunnel_hash = 0 raised this
      issue, 2.2.146.85 is just an example, there are more than 4 million of them.
      And again, wildcard tunnel like
         ip tunnel add ipip0 remote any local 1.1.1.1 mode ipip dev eth0
      wouldn't be ever matched if it was created before explicit tunnel like above.
      
      Gre & vti tunnels had the same issue.
      
      3)  ip address add 1.1.1.1/24 dev eth0
          ip tunnel add gre1 remote 2.2.146.84 local 1.1.1.1 key 1 mode gre dev eth0
          ip link set gre1 up
      
      Any incoming gre packet with key = 1 were routed into gre1, no matter what
      src/dst addresses are. Any remote ip address which has ip_tunnel_hash = 0 raised
      the issue, 2.2.146.84 is just an example, there are more than 4 million of them.
      Wildcard tunnel like
         ip tunnel add gre2 remote any local any key 1 mode gre dev eth0
      wouldn't be ever matched if it was created before explicit tunnel like above.
      
      All this stuff happened because while looking for a wildcard tunnel we didn't
      check that matched tunnel is a wildcard one. Fixed.
      Signed-off-by: default avatarDmitry Popov <ixaphire@qrator.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e0056593
  5. 08 Jul, 2014 16 commits
    • Rickard Strandqvist's avatar
      isdn: hisax: l3ni1.c: Fix for possible null pointer dereference · 85b722d7
      Rickard Strandqvist authored
      There is otherwise a risk of a possible null pointer dereference.
      
      Was largely found by using a static code analysis program called cppcheck.
      Signed-off-by: default avatarRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      85b722d7
    • Jon Paul Maloy's avatar
      tipc: fix bug in multicast/broadcast message reassembly · 29322d0d
      Jon Paul Maloy authored
      Since commit 37e22164 ("tipc: rename and
      move message reassembly function") reassembly of long broadcast messages
      has been broken. This is because we test for a non-NULL return value
      of the *buf parameter as criteria for succesful reassembly. However, this
      parameter is left defined even after reception of the first fragment,
      when reassebly is still incomplete. This leads to a kernel crash as soon
      as a the first fragment of a long broadcast message is received.
      
      We fix this with this commit, by implementing a stricter behavior of the
      function and its return values.
      
      This commit should be applied to both net and net-next.
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      29322d0d
    • Alexander Aring's avatar
      MAINTAINERS: change IEEE 802.15.4 maintainer · b6e195fd
      Alexander Aring authored
      This patch changes the IEEE 802.15.4 subsystem maintainer to Alexander Aring.
      We discussed this change before via e-mail and I collected the acks from the
      current maintainers.
      Signed-off-by: default avatarAlexander Aring <alex.aring@gmail.com>
      Acked-by: default avatarDmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Acked-by: default avatarAlexander Smirnov <alex.bluesman.sminov@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b6e195fd
    • David S. Miller's avatar
      Merge branch 'xen-netfront' · ff883aeb
      David S. Miller authored
      David Vrabel says:
      
      ====================
      xen-netfront: multi-queue related locking fixes
      
      Two fixes to the per-queue locking bugs in xen-netfront that were
      introduced in 3.16-rc1 with the multi-queue support.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ff883aeb
    • David Vrabel's avatar
      xen-netfront: call netif_carrier_off() only once when disconnecting · f9feb1e6
      David Vrabel authored
      In xennet_disconnect_backend(), netif_carrier_off() was called once
      per queue when it needs to only be called once.
      
      The queue locking around the netif_carrier_off() call looked very
      odd. I think they were supposed to synchronize any NAPI instances with
      the expectation that no further NAPI instances would be scheduled
      because of the carrier being off (see the check in
      xennet_rx_interrupt()).  But I can't easily tell if this works
      correctly.
      
      Instead, add a napi_synchronize() call after disabling the interrupts.
      This is obviously correct as with no Rx interrupts, no further NAPI
      instances will be scheduled.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f9feb1e6
    • David Vrabel's avatar
      xen-netfront: don't nest queue locks in xennet_connect() · f50b4076
      David Vrabel authored
      The nesting of the per-queue rx_lock and tx_lock in xennet_connect()
      is confusing to both humans and lockdep.  The locking is safe because
      this is the only place where the locks are nested in this way but
      lockdep still warns.
      
      Instead of adding the missing lockdep annotations, refactor the
      locking to avoid the confusing nesting.  This is still safe, because
      the xenbus connection state changes are all serialized by the xenwatch
      thread.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reported-by: default avatarSander Eikelenboom <linux@eikelenboom.it>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f50b4076
    • Yuchung Cheng's avatar
      tcp: fix false undo corner cases · 6e08d5e3
      Yuchung Cheng authored
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmit something
      2) the retransmission never fails locally (e.g., qdisc drop)
      
      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.
      
      When the assumption is broken because TCP's cwnd is too small to
      retransmit or the retransmit fails locally. The next (DUP)ACK
      would incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may enter the recovery state. The sender repeatedly enter and
      (incorrectly) exit recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6e08d5e3
    • Or Gerlitz's avatar
      net/mlx4_en: Don't configure the HW vxlan parser when vxlan offloading isn't set · e326f2f1
      Or Gerlitz authored
      The add_vxlan_port ndo driver code was wrongly testing whether HW vxlan offloads
      are supported by the device instead of checking if they are currently enabled.
      
      This causes the driver to configure the HW parser to conduct matching for vxlan
      packets but since no steering rules were set, vxlan packets are dropped on RX.
      
      Fix that by doing the right test, as done in the del_vxlan_port ndo handler.
      
      Fixes: 1b136de1 ('net/mlx4: Implement vxlan ndo calls')
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e326f2f1
    • dingtianhong's avatar
      igmp: fix the problem when mc leave group · 52ad353a
      dingtianhong authored
      The problem was triggered by these steps:
      
      1) create socket, bind and then setsockopt for add mc group.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
         setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
      
      2) drop the mc group for this socket.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
         setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
      
      3) and then drop the socket, I found the mc group was still used by the dev:
      
         netstat -g
      
         Interface       RefCnt Group
         --------------- ------ ---------------------
         eth2		   1	  255.0.0.37
      
      Normally even though the IP_DROP_MEMBERSHIP return error, the mc group still need
      to be released for the netdev when drop the socket, but this process was broken when
      route default is NULL, the reason is that:
      
      The ip_mc_leave_group() will choose the in_dev by the imr_interface.s_addr, if input addr
      is NULL, the default route dev will be chosen, then the ifindex is got from the dev,
      then polling the inet->mc_list and return -ENODEV, but if the default route dev is NULL,
      the in_dev and ifIndex is both NULL, when polling the inet->mc_list, the mc group will be
      released from the mc_list, but the dev didn't dec the refcnt for this mc group, so
      when dropping the socket, the mc_list is NULL and the dev still keep this group.
      
      v1->v2: According Hideaki's suggestion, we should align with IPv6 (RFC3493) and BSDs,
      	so I add the checking for the in_dev before polling the mc_list, make sure when
      	we remove the mc group, dec the refcnt to the real dev which was using the mc address.
      	The problem would never happened again.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      52ad353a
    • Loic Prylli's avatar
      net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush · 54951194
      Loic Prylli authored
      A bug was introduced in NETDEV_CHANGE notifier sequence causing the
      arp table to be sometimes spuriously cleared (including manual arp
      entries marked permanent), upon network link carrier changes.
      
      The changed argument for the notifier was applied only to a single
      caller of NETDEV_CHANGE, missing among others netdev_state_change().
      So upon net_carrier events induced by the network, which are
      triggering a call to netdev_state_change(), arp_netdev_event() would
      decide whether to clear or not arp cache based on random/junk stack
      values (a kind of read buffer overflow).
      
      Fixes: be9efd36 ("net: pass changed flags along with NETDEV_CHANGE event")
      Fixes: 6c8b4e3f ("arp: flush arp cache on IFF_NOARP change")
      Signed-off-by: default avatarLoic Prylli <loicp@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      54951194
    • Bernd Wachter's avatar
      net: qmi_wwan: Add ID for Telewell TW-LTE 4G v2 · 8dcb4b15
      Bernd Wachter authored
      There's a new version of the Telewell 4G modem working with, but not
      recognized by this driver.
      Signed-off-by: default avatarBernd Wachter <bernd.wachter@jolla.com>
      Acked-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8dcb4b15
    • David S. Miller's avatar
      Revert "net: stmmac: add platform init/exit for Altera's ARM socfpga" · 26a9ebca
      David S. Miller authored
      This reverts commit 0acf1676.
      
      Breaks the build due to missing reference to phy_resume in
      the resulting dwmac-socfpga.o object.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26a9ebca
    • Zhao Qiang's avatar
      powerpc/ucc_geth: deal with a compile warning · 8844a006
      Zhao Qiang authored
      deal with a compile warning: comparison between
      'enum qe_fltr_largest_external_tbl_lookup_key_size'
      and 'enum qe_fltr_tbl_lookup_key_size'
      
      the code:
      	"if (ug_info->largestexternallookupkeysize ==
      	     QE_FLTR_TABLE_LOOKUP_KEY_SIZE_8_BYTES)"
      is warned because different enum, so modify it.
      
      	"enum qe_fltr_largest_external_tbl_lookup_key_size
      	             largestexternallookupkeysize;
      
      	enum qe_fltr_tbl_lookup_key_size {
      		 QE_FLTR_TABLE_LOOKUP_KEY_SIZE_8_BYTES
      			 = 0x3f,         /* LookupKey parsed by the Generate LookupKey
      					    CMD is truncated to 8 bytes */
      		 QE_FLTR_TABLE_LOOKUP_KEY_SIZE_16_BYTES
      			 = 0x5f,         /* LookupKey parsed by the Generate LookupKey
      					    CMD is truncated to 16 bytes */
      	 };
      
      	 /* QE FLTR extended filtering Largest External Table Lookup Key Size */
      	 enum qe_fltr_largest_external_tbl_lookup_key_size {
      		 QE_FLTR_LARGEST_EXTERNAL_TABLE_LOOKUP_KEY_SIZE_NONE
      			 = 0x0,/* not used */
      		 QE_FLTR_LARGEST_EXTERNAL_TABLE_LOOKUP_KEY_SIZE_8_BYTES
      			 = QE_FLTR_TABLE_LOOKUP_KEY_SIZE_8_BYTES,        /* 8 bytes */
      		 QE_FLTR_LARGEST_EXTERNAL_TABLE_LOOKUP_KEY_SIZE_16_BYTES
      			 = QE_FLTR_TABLE_LOOKUP_KEY_SIZE_16_BYTES,       /* 16 bytes */
      	 };"
      Signed-off-by: default avatarZhao Qiang <B45475@freescale.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8844a006
    • David S. Miller's avatar
      Merge branch 'net_ovs_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/pshelar/openvswitch · edc1bb0b
      David S. Miller authored
      Pravin B Shelar says:
      
      ====================
      Open vSwitch
      
      A set of fixes for net.
      First bug is related flow-table management.  Second one is in sample
      action. Third is related flow stats and last one add gre-err handler for ovs.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      edc1bb0b
    • Tom Herbert's avatar
      net: Performance fix for process_backlog · 11ef7a89
      Tom Herbert authored
      In process_backlog the input_pkt_queue is only checked once for new
      packets and quota is artificially reduced to reflect precisely the
      number of packets on the input_pkt_queue so that the loop exits
      appropriately.
      
      This patches changes the behavior to be more straightforward and
      less convoluted. Packets are processed until either the quota
      is met or there are no more packets to process.
      
      This patch seems to provide a small, but noticeable performance
      improvement. The performance improvement is a result of staying
      in the process_backlog loop longer which can reduce number of IPI's.
      
      Performance data using super_netperf TCP_RR with 200 flows:
      
      Before fix:
      
      88.06% CPU utilization
      125/190/309 90/95/99% latencies
      1.46808e+06 tps
      1145382 intrs.sec.
      
      With fix:
      
      87.73% CPU utilization
      122/183/296 90/95/99% latencies
      1.4921e+06 tps
      1021674.30 intrs./sec.
      Signed-off-by: default avatarTom Herbert <therbert@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11ef7a89
    • Edward Allcutt's avatar
      ipv4: icmp: Fix pMTU handling for rare case · 68b7107b
      Edward Allcutt authored
      Some older router implementations still send Fragmentation Needed
      errors with the Next-Hop MTU field set to zero. This is explicitly
      described as an eventuality that hosts must deal with by the
      standard (RFC 1191) since older standards specified that those
      bits must be zero.
      
      Linux had a generic (for all of IPv4) implementation of the algorithm
      described in the RFC for searching a list of MTU plateaus for a good
      value. Commit 46517008 ("ipv4: Kill ip_rt_frag_needed().")
      removed this as part of the changes to remove the routing cache.
      Subsequently any Fragmentation Needed packet with a zero Next-Hop
      MTU has been discarded without being passed to the per-protocol
      handlers or notifying userspace for raw sockets.
      
      When there is a router which does not implement RFC 1191 on an
      MTU limited path then this results in stalled connections since
      large packets are discarded and the local protocols are not
      notified so they never attempt to lower the pMTU.
      
      One example I have seen is an OpenBSD router terminating IPSec
      tunnels. It's worth pointing out that this case is distinct from
      the BSD 4.2 bug which incorrectly calculated the Next-Hop MTU
      since the commit in question dismissed that as a valid concern.
      
      All of the per-protocols handlers implement the simple approach from
      RFC 1191 of immediately falling back to the minimum value. Although
      this is sub-optimal it is vastly preferable to connections hanging
      indefinitely.
      
      Remove the Next-Hop MTU != 0 check and allow such packets
      to follow the normal path.
      
      Fixes: 46517008 ("ipv4: Kill ip_rt_frag_needed().")
      Signed-off-by: default avatarEdward Allcutt <edward.allcutt@openmarket.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      68b7107b