1. 04 Aug, 2015 20 commits
  2. 26 Jun, 2015 5 commits
  3. 24 Jun, 2015 15 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next · e0456717
      Linus Torvalds authored
      Pull networking updates from David Miller:
      
       1) Add TX fast path in mac80211, from Johannes Berg.
      
       2) Add TSO/GRO support to ibmveth, from Thomas Falcon
      
       3) Move away from cached routes in ipv6, just like ipv4, from Martin
          KaFai Lau.
      
       4) Lots of new rhashtable tests, from Thomas Graf.
      
       5) Run ingress qdisc lockless, from Alexei Starovoitov.
      
       6) Allow servers to fetch TCP packet headers for SYN packets of new
          connections, for fingerprinting.  From Eric Dumazet.
      
       7) Add mode parameter to pktgen, for testing receive.  From Alexei
          Starovoitov.
      
       8) Cache access optimizations via simplifications of build_skb(), from
          Alexander Duyck.
      
       9) Move page frag allocator under mm/, also from Alexander.
      
      10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
      
      11) Add a counter guard in case we try to perform endless reclassify
          loops in the packet scheduler.
      
      12) Extend flow dissector to be programmable and use it in new "Flower"
          classifier.  From Jiri Pirko.
      
      13) AF_PACKET fanout rollover fixes, performance improvements, and new
          statistics.  From Willem de Bruijn.
      
      14) Add netdev driver for GENEVE tunnels, from John W Linville.
      
      15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
      
      16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
      
      17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
          Borkmann.
      
      18) Add tail call support to BPF, from Alexei Starovoitov.
      
      19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
      
      20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
      
      21) Favor even port numbers for allocation to connect() requests, and
          odd port numbers for bind(0), in an effort to help avoid
          ip_local_port_range exhaustion.  From Eric Dumazet.
      
      22) Add Cavium ThunderX driver, from Sunil Goutham.
      
      23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
          from Alexei Starovoitov.
      
      24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
      
      25) Double TCP Small Queues default to 256K to accommodate situations
          like the XEN driver and wireless aggregation.  From Wei Liu.
      
      26) Add more entropy inputs to flow dissector, from Tom Herbert.
      
      27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
          Jonassen.
      
      28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
      
      29) Track and act upon link status of ipv4 route nexthops, from Andy
          Gospodarek.
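
Item 21 above (even ports for connect(), odd ports for bind(0)) can be illustrated with a minimal sketch. This is not the kernel's algorithm: `candidate_ports` and `pick_port` are hypothetical names, the range is scanned in order rather than from a randomized offset, and falling back to the other parity once the preferred one is exhausted is an assumption made for illustration.

```python
def candidate_ports(low, high, prefer_even):
    """Return the ports in [low, high], preferred parity first, then the rest."""
    want = 0 if prefer_even else 1
    first = [p for p in range(low, high + 1) if p % 2 == want]
    rest = [p for p in range(low, high + 1) if p % 2 != want]
    return first + rest


def pick_port(low, high, in_use, for_connect):
    """Return the first free port, or None when the whole range is in use.

    connect() callers prefer even ports, bind(0) callers prefer odd ports,
    so the two kinds of request compete less for the same port numbers.
    """
    for port in candidate_ports(low, high, prefer_even=for_connect):
        if port not in in_use:
            return port
    return None
```

With an empty `in_use` set, a connect() request on the range 32768..32771 gets the first even port and a bind(0) request gets the first odd one.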
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
        bridge: vlan: flush the dynamically learned entries on port vlan delete
        bridge: multicast: add a comment to br_port_state_selection about blocking state
        net: inet_diag: export IPV6_V6ONLY sockopt
        stmmac: troubleshoot unexpected bits in des0 & des1
        net: ipv4 sysctl option to ignore routes when nexthop link is down
        net: track link-status of ipv4 nexthops
        net: switchdev: ignore unsupported bridge flags
        net: Cavium: Fix MAC address setting in shutdown state
        drivers: net: xgene: fix for ACPI support without ACPI
        ip: report the original address of ICMP messages
        net/mlx5e: Prefetch skb data on RX
        net/mlx5e: Pop cq outside mlx5e_get_cqe
        net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
        net/mlx5e: Remove extra spaces
        net/mlx5e: Avoid TX CQE generation if more xmit packets expected
        net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
        net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
        net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
        net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
        net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
        ...
    • Merge branch 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 98ec21a0
      Linus Torvalds authored
      Pull scheduler updates from Thomas Gleixner:
       "This series of scheduler updates depends on sched/core and timers/core
        branches, which are already in your tree:
      
         - Scheduler balancing overhaul to plug a hard to trigger race which
           causes an oops in the balancer (Peter Zijlstra)
      
         - Lockdep updates which are related to the balancing updates (Peter
           Zijlstra)"
      
      * 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched,lockdep: Employ lock pinning
        lockdep: Implement lock pinning
        lockdep: Simplify lock_release()
        sched: Streamline the task migration locking a little
        sched: Move code around
        sched,dl: Fix sched class hopping CBS hole
        sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
        sched,dl: Remove return value from pull_dl_task()
        sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
        sched,rt: Remove return value from pull_rt_task()
        sched: Allow balance callbacks for check_class_changed()
        sched: Use replace normalize_task() with __sched_setscheduler()
        sched: Replace post_schedule with a balance callback list
    • Merge branch 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a2629483
      Linus Torvalds authored
      Pull locking updates from Thomas Gleixner:
       "These locking updates depend on the already merged sched/core branch:
      
         - Lockless top waiter wakeup for rtmutex (Davidlohr)
      
         - Reduce hash bucket lock contention for PI futexes (Sebastian)
      
         - Documentation update (Davidlohr)"
      
      * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/rtmutex: Update stale plist comments
        futex: Lower the lock contention on the HB lock during wake up
        locking/rtmutex: Implement lockless top-waiter wakeup
    • Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · e3d8238d
      Linus Torvalds authored
      Pull arm64 updates from Catalin Marinas:
       "Mostly refactoring/clean-up:
      
         - CPU ops and PSCI (Power State Coordination Interface) refactoring
           following the merging of the arm64 ACPI support, together with
           handling of Trusted (secure) OS instances
      
         - Using fixmap for permanent FDT mapping, removing the initial dtb
           placement requirements (within 512MB from the start of the kernel
           image).  This required moving the FDT self reservation out of the
           memreserve processing
      
         - Idmap (1:1 mapping used for MMU on/off) handling clean-up
      
         - Removing flush_cache_all() - not safe on ARM unless the MMU is off.
           Last stages of CPU power down/up are handled by firmware already
      
         - "Alternatives" (run-time code patching) refactoring and support for
           immediate branch patching, GICv3 CPU interface access
      
         - User faults handling clean-up
      
        And some fixes:
      
         - Fix for VDSO building with broken ELF toolchains
      
         - Fix another case of init_mm.pgd usage for user mappings (during
           ASID roll-over broadcasting)
      
         - Fix for FPSIMD reloading after CPU hotplug
      
         - Fix for missing syscall trace exit
      
         - Workaround for .inst asm bug
      
         - Compat fix for switching the user tls tpidr_el0 register"
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (42 commits)
        arm64: use private ratelimit state along with show_unhandled_signals
        arm64: show unhandled SP/PC alignment faults
        arm64: vdso: work-around broken ELF toolchains in Makefile
        arm64: kernel: rename __cpu_suspend to keep it aligned with arm
        arm64: compat: print compat_sp instead of sp
        arm64: mm: Fix freeing of the wrong memmap entries with !SPARSEMEM_VMEMMAP
        arm64: entry: fix context tracking for el0_sp_pc
        arm64: defconfig: enable memtest
        arm64: mm: remove reference to tlb.S from comment block
        arm64: Do not attempt to use init_mm in reset_context()
        arm64: KVM: Switch vgic save/restore to alternative_insn
        arm64: alternative: Introduce feature for GICv3 CPU interface
        arm64: psci: fix !CONFIG_HOTPLUG_CPU build warning
        arm64: fix bug for reloading FPSIMD state after CPU hotplug.
        arm64: kernel thread don't need to save fpsimd context.
        arm64: fix missing syscall trace exit
        arm64: alternative: Work around .inst assembler bugs
        arm64: alternative: Merge alternative-asm.h into alternative.h
        arm64: alternative: Allow immediate branch as alternative instruction
        arm64: Rework alternate sequence for ARM erratum 845719
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 4e241557
      Linus Torvalds authored
      Pull first batch of KVM updates from Paolo Bonzini:
       "The bulk of the changes here is for x86.  And for once it's not for
        silicon that no one owns: these are really new features for everyone.
      
        Details:
      
         - ARM:
              several features are in progress but missed the 4.2 deadline.
              So here is just a smattering of bug fixes, plus enabling the
              VFIO integration.
      
         - s390:
              Some fixes/refactorings/optimizations, plus support for 2GB
              pages.
      
         - x86:
              * host and guest support for marking kvmclock as a stable
                scheduler clock.
              * support for write combining.
              * support for system management mode, needed for secure boot in
                guests.
              * a bunch of cleanups required for the above
              * support for virtualized performance counters on AMD
              * legacy PCI device assignment is deprecated and defaults to "n"
                in Kconfig; VFIO replaces it
      
              On top of this there are also bug fixes and eager FPU context
              loading for FPU-heavy guests.
      
         - Common code:
              Support for multiple address spaces; for now it is used only for
              x86 SMM but the s390 folks also have plans"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
        KVM: s390: clear floating interrupt bitmap and parameters
        KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
        KVM: x86/vPMU: Implement AMD vPMU code for KVM
        KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
        KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
        KVM: x86/vPMU: reorder PMU functions
        KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
        KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
        KVM: x86/vPMU: introduce pmu.h header
        KVM: x86/vPMU: rename a few PMU functions
        KVM: MTRR: do not map huge page for non-consistent range
        KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
        KVM: MTRR: introduce mtrr_for_each_mem_type
        KVM: MTRR: introduce fixed_mtrr_addr_* functions
        KVM: MTRR: sort variable MTRRs
        KVM: MTRR: introduce var_mtrr_range
        KVM: MTRR: introduce fixed_mtrr_segment table
        KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
        KVM: MTRR: do not split 64 bits MSR content
        KVM: MTRR: clean up mtrr default type
        ...
    • Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux · 08d183e3
      Linus Torvalds authored
      Pull powerpc updates from Michael Ellerman:
      
       - disable the 32-bit vdso when building LE, so we can build with a
         64-bit only toolchain.
      
       - EEH fixes from Gavin & Richard.
      
       - enable the sys_kcmp syscall from Laurent.
      
       - sysfs control for fastsleep workaround from Shreyas.
      
       - expose OPAL events as an irq chip by Alistair.
      
       - MSI ops moved to pci_controller_ops by Daniel.
      
       - fix for kernel to userspace backtraces for perf from Anton.
      
       - merge pseries and pseries_le defconfigs from Cyril.
      
       - CXL in-kernel API from Mikey.
      
       - OPAL prd driver from Jeremy.
      
       - fix for DSCR handling & tests from Anshuman.
      
       - Powernv flash mtd driver from Cyril.
      
       - dynamic DMA Window support on powernv from Alexey.
      
       - LLVM clang fixes & workarounds from Anton.
      
       - reworked version of the patch to abort syscalls when transactional.
      
       - fix the swap encoding to support 4TB, from Aneesh.
      
       - various fixes as usual.
      
       - Freescale updates from Scott: Highlights include more 8xx
         optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
         t1024/t1023 support, and various fixes and cleanup.
      
      * tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
        cxl: Fix typo in debug print
        cxl: Add CXL_KERNEL_API config option
        powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
        powerpc/mm: Change the swap encoding in pte.
        powerpc/mm: PTE_RPN_MAX is not used, remove the same
        powerpc/tm: Abort syscalls in active transactions
        powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
        powerpc/include: Add opal-prd to installed uapi headers
        powerpc/powernv: fix construction of opal PRD messages
        powerpc/powernv: Increase opal-irqchip initcall priority
        powerpc: Make doorbell check preemption safe
        powerpc/powernv: pnv_init_idle_states() should only run on powernv
        macintosh/nvram: Remove as unused
        powerpc: Don't use gcc specific options on clang
        powerpc: Don't use -mno-strict-align on clang
        powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
        powerpc: Only use -mabi=altivec if toolchain supports it
        powerpc: Fix duplicate const clang warning in user access code
        vfio: powerpc/spapr: Support Dynamic DMA windows
        vfio: powerpc/spapr: Register memory and define IOMMU v2
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 4b1f2af6
      Linus Torvalds authored
      Pull s390 updates from Martin Schwidefsky:
       "Pretty boring for a merge window pull.
      
        One change in behaviour is the patch for the dasd driver: the module
        which provides the diagnose discipline is now loaded automatically.
      
        The SCLP code got a nice cleanup, a new global structure replaces a
        bunch of accessor functions.
      
        And a couple of random, small improvements"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/pci: improve handling of hotplug event 0x301
        s390/setup: fix DMA_API_DEBUG warnings
        s390/zcrypt: remove obsolete __constant
        s390/keyboard: avoid off-by-one when using strnlen_user()
        s390/sclp: pass timeout as HZ independent value
        s390/mm: s/specifiation/specification/, s/an specification/a specification/
        s390/sclp: Use DECLARE_BITMAP
        s390/dasd: Enable automatic loading of dasd_diag_mod
        s390/sclp: move sclp_facilities into "struct sclp"
        s390/sclp: get rid of sclp_get_mtid() and sclp_get_mtid_max()
        s390/sclp: unify basic sclp access by exposing "struct sclp"
        s390/sclp: prepare smp_fill_possible_mask for global "struct sclp"
    • Merge tag 'microblaze-4.2-rc1' of git://git.monstr.eu/linux-2.6-microblaze · aaa64485
      Linus Torvalds authored
      Pull Microblaze updates from Michal Simek:
      
       - some PCI fixups
      
       - add new MB versions
      
       - sparse fixups
      
      * tag 'microblaze-4.2-rc1' of git://git.monstr.eu/linux-2.6-microblaze:
        microblaze/PCI: Remove unnecessary struct pci_dev declaration
        microblaze/PCI: Remove unnecessary pci_bus_find_capability() declaration
        microblaze/PCI: Remove unused declarations
        microblaze: Label local function static
        microblaze: Add missing release version code
    • bridge: vlan: flush the dynamically learned entries on port vlan delete · 1ea2d020
      Nikolay Aleksandrov authored
      Add a new argument to br_fdb_delete_by_port() that specifies a vid to
      match when flushing entries, and use it in nbp_vlan_delete() to flush
      the dynamically learned entries of the vlan/port pair when removing a
      vlan from a port. Before this patch only the local MAC was removed and
      the dynamically learned entries were left to expire.
      Note that the do_all argument is still respected: if it is specified,
      the vid is ignored.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
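
The flush rule described in commit 1ea2d020 can be sketched as follows. This is a minimal model, not the bridge code: `FdbEntry` and `delete_by_port` are hypothetical names, and two details are assumptions made for illustration, namely that a vid of 0 matches every vlan and that static entries survive unless `do_all` is set.

```python
from dataclasses import dataclass


@dataclass
class FdbEntry:
    mac: str
    port: str
    vid: int
    is_static: bool = False  # False means dynamically learned


def delete_by_port(fdb, port, vid, do_all):
    """Return the forwarding database with the flushed entries removed."""
    kept = []
    for entry in fdb:
        if entry.port != port:
            kept.append(entry)      # other ports are never touched
        elif do_all:
            continue                # flush everything on the port, vid ignored
        elif entry.is_static or (vid and entry.vid != vid):
            kept.append(entry)      # static entry or non-matching vlan stays
        # otherwise: dynamic entry on the matching vlan, so it is flushed
    return kept
```

Removing vlan 10 from a port would then call `delete_by_port(fdb, port, 10, False)`, which drops only the dynamic entries learned on that vlan/port pair.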
    • bridge: multicast: add a comment to br_port_state_selection about blocking state · 9aa66382
      Nikolay Aleksandrov authored
      Add a comment to explain why we're not disabling the port's multicast
      when it enters the blocking state. Since there's a check in the timer's
      function which bypasses the timer if the port is in blocking/disabled
      state, the timer will simply expire and stop without sending more queries.
      Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 3a07bd6f
      David S. Miller authored
      Conflicts:
      	drivers/net/ethernet/mellanox/mlx4/main.c
      	net/packet/af_packet.c
      
      Both conflicts were cases of simple overlapping changes.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: inet_diag: export IPV6_V6ONLY sockopt · 20462155
      Phil Sutter authored
      For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
      exported to userspace. It indicates whether a socket bound to in6addr_any
      listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
      listed by e.g. 'ss -l -4'.
      
      This patch is accompanied by an appropriate one for iproute2 to enable
      the additional information in 'ss -e'.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
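
The socket option that commit 20462155 exports can be inspected from userspace on the socket itself. The sketch below sets IPV6_V6ONLY on a fresh AF_INET6 socket and reads it back; `v6only_flag` is a hypothetical helper name, and returning None on hosts without IPv6 support is a choice made for this example. It does not use inet_diag or `ss`.

```python
import socket


def v6only_flag(enable):
    """Set IPV6_V6ONLY on a new AF_INET6 TCP socket and read the value back.

    Returns None when the host cannot create IPv6 sockets.
    """
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return None
    with sock:
        sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, int(enable))
        return bool(sock.getsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY))
```

When the flag is off, a socket bound to in6addr_any also accepts IPv4 connections, which is the case the commit message says is invisible to 'ss -l -4'.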
    • stmmac: troubleshoot unexpected bits in des0 & des1 · f1590670
      Alexey Brodkin authored
      The current implementation of the descriptor init procedure only takes
      care of setting/clearing the ownership flag in the "des0"/"des1"
      fields, while it is perfectly possible to get unexpected bits
      set because of the following factors:
      
       [1] On driver probe underlying memory allocated with
           dma_alloc_coherent() might not be zeroed and so
           it will be filled with garbage.
      
       [2] During driver operation some bits could be set by SD/MMC
           controller (for example error flags etc).
      
      And unexpected and/or randomly set flags in "des0"/"des1"
      fields may lead to unpredictable behavior of GMAC DMA block.
      
      This change addresses both items above with:
      
       [1] Use of dma_zalloc_coherent() instead of simple
           dma_alloc_coherent() to make sure allocated memory is
           zeroed. That shouldn't affect performance because
           this allocation only happens once on driver probe.
      
       [2] Do explicit zeroing of both "des0" and "des1" fields
           of all buffer descriptors during initialization of
           DMA transfer.
      
      While at it, fix the indentation of the dma_free_coherent()
      counterpart as well.
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: arc-linux-dev@synopsys.com
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
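
The second fix in commit f1590670 (explicitly zeroing "des0" and "des1" before a transfer) can be modelled in a few lines. This is only a sketch of the idea, not driver code: `Desc`, `init_ring` and the `OWN` bit position are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass

OWN = 1 << 31  # assumed position of the ownership flag, for illustration only


@dataclass
class Desc:
    des0: int = 0
    des1: int = 0


def init_ring(ring, owned_by_dma):
    """Clear both words of every descriptor, then set only the ownership flag.

    Zeroing first discards stale bits left by an unzeroed allocation or by
    earlier hardware activity, so the flag is the only bit that can be set.
    """
    for desc in ring:
        desc.des0 = 0
        desc.des1 = 0
        if owned_by_dma:
            desc.des0 |= OWN
```

The contrast is with an init routine that only toggles the ownership flag and leaves every other bit as it found it.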
    • Merge branch 'ipv4-nexthop-link-status' · f389a40e
      David S. Miller authored
      Andy Gospodarek says:
      
      ====================
      changes to make ipv4 routing table aware of next-hop link status
      
      This series adds the ability to have the Linux kernel track whether or
      not a particular route should be used based on the link-status of the
      interface associated with the next-hop.
      
      Before this patch any link-failure on an interface that was serving as a
      gateway for some systems could result in those systems being isolated
      from the rest of the network as the stack would continue to attempt to
      send frames out of an interface that is actually linked-down.  When the
      kernel is responsible for all forwarding, it should also be responsible
      for taking action when the traffic can no longer be forwarded -- there
      is no real need to outsource link-monitoring to userspace anymore.
      
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, the kernel will not only report to
      userspace that the link is down, but it will also report to userspace
      that a route is dead.  This will signal to userspace that the route will
      not be selected.
      
      With the new sysctls set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the global or interface sysctl is not set, the following output would
      be expected when p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      If the dead flag does not appear there should be no expectation that the
      kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches: first to add linkdown flag and
      second to add new sysctl settings.  Also took suggestion from Alex to
      simplify code by only checking sysctl during fib lookup and suggestion
      from Scott to add a per-interface sysctl.  Added iproute2 patch to
      recognize and print linkdown flag.
      
      v3: Code cleanups along with reverse-path checks suggested by Alex and
      small fixes related to problems found when multipath was disabled.
      
      v4: Drop binary sysctls
      
      v5: Whitespace and variable declaration fixups suggested by Dave
      
      v6: Style changes noticed by Dave and checkpatch suggestions.
      
      v7: Last checkpatch fixup.
      
      Some preferred not to have a configuration option and to make this
      behavior the default when it was discussed in Ottawa earlier this year,
      since "it was time to do this."  I wanted to propose the config option
      anyway to preserve the current behavior for those that desire it.  I'll
      happily remove it if Dave and Linus approve.
      
      An IPv6 implementation is also needed (DECnet too!), but I wanted to
      start with the IPv4 implementation to get people comfortable with the
      idea before moving forward.  If this is accepted the IPv6 implementation
      can be posted shortly.
      
      There was also a request for switchdev support for this, but that will
      be posted as a followup as switchdev does not currently handle dead
      next-hops in a multi-path case and I felt that infra needed to be added
      first.
      
      FWIW, we have been running the original version of this series with a
      global sysctl and our customers have been happily using a backported
      version for IPv4 and IPv6 for >6 months.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
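
The route listings in the cover letter above mark affected routes with the words "dead" and "linkdown". A script that wants to act on them could pick the flags out of each line as sketched below. `parse_route` and `kernel_skips` are hypothetical helpers, and the parser only handles plain route lines like the ones shown, not the indented "cache" lines or entries that start with a route type such as "local".

```python
FLAGS = {"dead", "linkdown"}


def parse_route(line):
    """Extract destination, device and status flags from one route line."""
    tokens = line.split()
    dev = tokens[tokens.index("dev") + 1] if "dev" in tokens else None
    return {
        "dest": tokens[0],
        "dev": dev,
        "flags": {t for t in tokens if t in FLAGS},
    }


def kernel_skips(route):
    """Per the cover letter, only the dead flag means the route is skipped."""
    return "dead" in route["flags"]
```

A line carrying "linkdown" without "dead" therefore reports link state only, matching the listing for the case where the sysctl is not set.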
    • net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek authored
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, the kernel will report to userspace that
      a route is dead and will no longer resolve to this nexthop when
      performing a fib lookup.  This will signal to userspace that the route
      will not be selected.  RTNH_F_DEAD is only signalled to userspace if the
      sysctl is enabled and the link is down.  This was done because without it
      netlink listeners would have no idea whether or not a nexthop would be
      selected.  Internally, the kernel only sets RTNH_F_DEAD if the interface
      has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches; this one actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
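
The behavior described in commit 0eeb075f can be modelled with a small sketch using the 90.0.0.0/24 routes from the listings. It is a simplification, not the fib lookup code: `NextHop`, `is_dead` and `select` are hypothetical names, and reducing the lookup to "lowest metric among usable nexthops for one prefix" is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NextHop:
    via: str
    dev: str
    metric: int
    link_up: bool = True   # carrier state of the interface
    admin_up: bool = True  # IFF_UP


def is_dead(nh, ignore_routes_with_linkdown):
    """A nexthop is dead if IFF_UP is cleared, or if the link is down
    and the ignore_routes_with_linkdown sysctl is enabled."""
    if not nh.admin_up:
        return True
    return ignore_routes_with_linkdown and not nh.link_up


def select(nexthops, ignore_routes_with_linkdown):
    """Return the usable nexthop with the lowest metric, or None."""
    usable = [nh for nh in nexthops
              if not is_dead(nh, ignore_routes_with_linkdown)]
    return min(usable, key=lambda nh: nh.metric, default=None)
```

With p8p1 link-down, enabling the sysctl moves traffic to the metric 2 route via 70.0.0.2, while leaving it disabled keeps the metric 1 route via 80.0.0.2 selected.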