14 May, 2015 (38 commits)
      userfaultfd: avoid mmap_sem read recursion in mcopy_atomic · 517183a7
      Andrea Arcangeli authored
      If the rwsem starves writers it wasn't strictly a bug, but lockdep
      doesn't like it, and this avoids depending on low-level implementation
      details of the lock.
      517183a7
      userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation · 3f65604a
      Andrea Arcangeli authored
      This implements mcopy_atomic and mfill_zeropage, the low-level VM
      methods invoked respectively by the UFFDIO_COPY and UFFDIO_ZEROPAGE
      userfaultfd commands.
      3f65604a
      userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI · 8e71cd35
      Andrea Arcangeli authored
      This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.
      8e71cd35
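      From userland, the two ioctls above are driven through the uffdio_copy and uffdio_zeropage structures of the uAPI header. A minimal sketch, assuming a uffd already created and registered over the destination range (the resolve_* helper names are illustrative, not from the patch):

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Copy a prepared page into the faulting range and wake the faulter. */
static long resolve_copy(int uffd, unsigned long dst_page, void *src,
			 unsigned long len)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = dst_page;		/* must be page aligned */
	copy.src = (unsigned long)src;
	copy.len = len;
	copy.mode = 0;			/* 0: wake blocked faulters on completion */
	if (ioctl(uffd, UFFDIO_COPY, &copy))
		return -errno;
	return copy.copy;		/* bytes copied, filled in by the kernel */
}

/* Map zero-filled memory instead, for ranges known to contain zeroes. */
static long resolve_zeropage(int uffd, unsigned long dst_page, unsigned long len)
{
	struct uffdio_zeropage zp;

	memset(&zp, 0, sizeof(zp));
	zp.range.start = dst_page;
	zp.range.len = len;
	if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp))
		return -errno;
	return zp.zeropage;		/* bytes zeroed */
}
```

      Both helpers return a negative errno on failure, mirroring the kernel convention.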
      userfaultfd: activate syscall · c20828d5
      Andrea Arcangeli authored
      This activates the userfaultfd syscall.
      c20828d5
      userfaultfd: buildsystem activation · f2d31f1d
      Andrea Arcangeli authored
      This allows userfaultfd to be selected at configuration time so that it gets built.
      f2d31f1d
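      Once this lands, enabling it is a one-line config change (option name as introduced by this series):

```
CONFIG_USERFAULTFD=y
```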
      userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read · 5a2b3614
      Andrea Arcangeli authored
      Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
      userfaultfd_read if they are run on different threads simultaneously.
      
      Until now qemu solved the race in userland: the race was explicitly
      and intentionally left for userland to solve. However, we can also
      solve it in the kernel.
      
      Requiring all users to solve this race if they use two threads (one
      for the background transfer and one for the userfault reads) isn't
      very attractive from an API perspective; furthermore, handling it
      in-kernel allows a whole bunch of mutex and bitmap code to be removed
      from qemu, making it faster. The cost of __get_user_pages_fast should
      be insignificant compared to the overhead of maintaining those
      structures in userland, considering it scales perfectly and the
      pagetables are already hot in the CPU cache.
      
      Applying this patch is backwards compatible with respect to the
      userfaultfd userland API, however reverting this change wouldn't be
      backwards compatible anymore.
      
      Without this patch, qemu's background transfer thread has to read the
      old state and issue UFFDIO_WAKE if old_state was MISSING but became
      REQUESTED by the time it tried to set it to RECEIVED (signaling that
      the other side received a userfault).
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
      			postcopy_place_page
      			read old_state -> MISSING
       			UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
       					poll() -> POLLIN
       					read() -> 0x7fb76a139000
       					postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
      
       			tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
      			/* check that no userfault raced with UFFDIO_COPY */
      			if (old_state == MISSING && tmp_state == REQUESTED)
      				UFFDIO_WAKE from background thread
      
      And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
      			postcopy_place_page
      			read old_state -> MISSING
       			UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
       			tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
       					poll() -> POLLIN
       					read() -> 0x7fb76a139000
      
       					if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
      						UFFDIO_WAKE from userfault thread
      
      This patch removes the need for both UFFDIO_WAKE and the associated
      per-page tristate.
      5a2b3614
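      For reference, the userland protocol made unnecessary here (the traces above) can be modeled with C11 atomics. pmi_change_state mirrors qemu's postcopy_pmi_change_state helper, but the names and this compressed model are illustrative:

```c
#include <stdatomic.h>

enum pmi_state { MISSING, REQUESTED, RECEIVED };

/* CAS on the per-page tristate; returns the state actually found. */
static int pmi_change_state(_Atomic int *st, int from, int to)
{
	int expected = from;

	atomic_compare_exchange_strong(st, &expected, to);
	return expected;	/* == from on success, the racing value otherwise */
}

/* Background thread, pre-patch: after UFFDIO_COPY (issued without waking),
 * detect whether a userfault raced in and a UFFDIO_WAKE is still owed. */
static int need_wake_after_copy(_Atomic int *st)
{
	int old = MISSING;	/* state read before the UFFDIO_COPY */
	int tmp = pmi_change_state(st, old, RECEIVED);

	return old == MISSING && tmp == REQUESTED;
}
```

      With the in-kernel solution the whole state machine disappears: the UFFDIO_COPY itself wakes any fault that raced with it.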
      userfaultfd: allocate the userfaultfd_ctx cacheline aligned · bd0a30cd
      Andrea Arcangeli authored
      Use a dedicated slab cache to guarantee the alignment.
      bd0a30cd
      userfaultfd: optimize read() and poll() to be O(1) · 18c5b6c4
      Andrea Arcangeli authored
      This makes read() O(1), and poll(), which was already O(1), becomes lockless.
      18c5b6c4
      userfaultfd: wake pending userfaults · a1837777
      Andrea Arcangeli authored
      This is an optimization, but it's a userland-visible one and it
      affects the API.

      The downside of this optimization is that if you call poll() and you
      get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
      may be woken by a different thread before read(ufd) comes around. In
      short this means that poll() isn't really usable if the userfaultfd
      is opened in blocking mode.
      
      Userfaults won't wait in a "pending" state to be read anymore, and
      any UFFDIO_WAKE or similar operation whose objective is to wake
      userfaults after their resolution will wake all blocked userfaults
      for the resolved range, including those that haven't been read() by
      userland yet.

      The behavior of poll() becomes non-standard, but this obviates the
      need for "spurious" UFFDIO_WAKE calls and lets the userland threads
      restart immediately without requiring an UFFDIO_WAKE. This is even
      more significant in case of repeated faults on the same address from
      multiple threads.
      
      This optimization is justified by the measurement that spurious
      UFFDIO_WAKE calls account for between 5% and 10% of the total
      userfaults in heavy workloads, so it's worth optimizing them away.
      a1837777
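      With this change a POLLIN from poll() is only a hint. A minimal sketch of an EAGAIN-tolerant reader, assuming the uffd was opened non-blocking (the helper name is made up):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Returns 1 if a message was read, 0 on a spurious wakeup (the fault was
 * already woken, e.g. by a UFFDIO_COPY in another thread), -1 on error. */
static int read_fault_msg(int ufd, struct uffd_msg *msg)
{
	ssize_t n = read(ufd, msg, sizeof(*msg));

	if (n == (ssize_t)sizeof(*msg))
		return 1;
	if (n < 0 && errno == EAGAIN)
		return 0;	/* POLLIN was spurious: go back to poll() */
	return -1;
}
```

      The caller simply loops back to poll() on a zero return instead of treating -EAGAIN as an error.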
      userfaultfd: change the read API to return a uffd_msg · a18d6e1c
      Andrea Arcangeli authored
      I had requests to return the full address (not the page-aligned one)
      to userland.

      It's not entirely clear how the page offset could be relevant,
      because userfaults aren't like SIGBUS, which can sigjump to a
      different place and actually skip resolving the fault depending on a
      page offset. There's currently no real way to skip the fault,
      especially because after a UFFDIO_COPY|ZEROPAGE the fault is
      optimized to be retried within the kernel without having to return to
      userland first (not even self-modifying code replacing the .text that
      touched the faulting address would prevent the fault from being
      repeated). Userland cannot skip repeating the fault even more so if
      the fault was triggered by a KVM secondary page fault, or any
      get_user_pages, or any copy-user inside some syscall which will
      return to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't
      be set, leading to a SIGBUS being raised, because the userfault can't
      wait if it cannot release the mmap_sem first (and
      FAULT_FLAG_RETRY_NOWAIT is required for that).
      
      Still, returning a proper structure to userland during the read() on
      the uffd allows the current UFFD_API to be used for the future
      non-cooperative extensions too, and it looks cleaner as well. Once we
      gain additional fields there's no point in returning the fault
      address page-aligned anymore just to reuse the bits below PAGE_SHIFT.

      The only downside is that the read() syscall will read 32 bytes
      instead of 8 bytes, but that's not going to be measurable overhead.
      
      The total number of new events, or of new future bits for
      already-shipped events, is limited to 64 by the features field of the
      uffdio_api structure. If more are needed, a bump of UFFD_API will be
      required.
      a18d6e1c
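      A sketch of consuming the structured message (the struct and constants are from the uapi header this series adds; the handler name is made up):

```c
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Each read() now returns one fixed-size uffd_msg instead of a bare u64. */
static int handle_one_msg(int ufd)
{
	struct uffd_msg msg;

	if (read(ufd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;	/* future non-cooperative events land here */
	/* Full fault address: the bits below PAGE_SHIFT are no longer reused. */
	unsigned long addr = msg.arg.pagefault.address;
	int is_write = !!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE);
	/* ... resolve the fault at addr, e.g. via UFFDIO_COPY ... */
	(void)addr;
	(void)is_write;
	return 1;
}
```

      The 32-byte figure mentioned above is simply sizeof(struct uffd_msg).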
      userfaultfd: Rename uffd_api.bits into .features fixup · f050ac8e
      Andrea Arcangeli authored
      Update comment.
      f050ac8e
      userfaultfd: Rename uffd_api.bits into .features · b9ca6f1f
      Pavel Emelyanov authored
      This is (or seems to be) the minimal change required to unblock
      standard uffd usage from the non-cooperative one. More bits can now
      be added to the features field indicating e.g. UFFD_FEATURE_FORK and
      others needed for the latter use-case.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      b9ca6f1f
      userfaultfd: add new syscall to provide memory externalization · 2f73ffa8
      Andrea Arcangeli authored
      Once a userfaultfd has been created and certain regions of the
      process virtual address space have been registered into it, the
      thread responsible for doing the memory externalization can manage
      the page faults in userland by talking to the kernel using the
      userfaultfd protocol.
      
      poll() can be used to know when there are new pending userfaults to be
      read (POLLIN).
      2f73ffa8
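      The create/handshake/register flow described above can be compressed into a sketch (error handling trimmed; assumes this series is applied and <linux/userfaultfd.h> is installed; arm_range is a made-up name):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Create a uffd, negotiate the API, and arm missing-fault tracking on a
 * fresh anonymous mapping.  Returns the uffd, or -1 if unavailable. */
static int arm_range(void **region, size_t len)
{
	struct uffdio_api api;
	struct uffdio_register reg;
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0)
		return -1;
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;		/* handshake before any other ioctl */
	if (ioctl(uffd, UFFDIO_API, &api))
		goto fail;
	*region = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (*region == MAP_FAILED)
		goto fail;
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)*region;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		goto fail;
	return uffd;	/* poll(uffd) now reports POLLIN on pending faults */
fail:
	close(uffd);
	return -1;
}
```

      A first touch of *region would then block in handle_userfault() until another thread resolves it.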
      userfaultfd: prevent khugepaged to merge if userfaultfd is armed · 33c24f63
      Andrea Arcangeli authored
      If userfaultfd is armed on a certain vma we can't "fill" the holes
      with zeroes or we'll break userland on-demand paging. When the
      userfault is armed, the holes are really missing information (not
      zeroes) that userland has to load from the network or elsewhere.
      
      The same issue happens for wrprotected ptes that we can't just convert
      into a single writable pmd_trans_huge.
      
      We could however in theory still merge across zeropages if only
      VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
      slightly improved but it'd be much more complex code for a tiny corner
      case.
      33c24f63
      userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx · 868f0d8c
      Andrea Arcangeli authored
      vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
      must be aware about so that we can merge vmas back like they were
      originally before arming the userfaultfd on some memory range.
      868f0d8c
      userfaultfd: call handle_userfault() for userfaultfd_missing() faults · d574e5aa
      Andrea Arcangeli authored
      This is where the page faults must be modified to call
      handle_userfault() if userfaultfd_missing() is true (so if the
      vma->vm_flags had VM_UFFD_MISSING set).
      
      handle_userfault() then takes care of blocking the page fault and
      delivering it to userland.
      
      The fault flags must also be passed as parameter so the "read|write"
      kind of fault can be passed to userland.
      d574e5aa
      userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP · 3de85438
      Andrea Arcangeli authored
      These two flags get set in vma->vm_flags to tell the VM common code
      whether the userfaultfd is armed and in which mode (tracking only
      missing faults, tracking only wrprotect faults, or both). If neither
      flag is set, the userfaultfd is not armed on the vma.
      3de85438
      userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct · 1ec419d8
      Andrea Arcangeli authored
      This adds the vm_userfaultfd_ctx to the vm_area_struct.
      1ec419d8
      userfaultfd: linux/userfaultfd_k.h · c6bb4e14
      Andrea Arcangeli authored
      Kernel header defining the methods needed by the VM common code to
      interact with the userfaultfd.
      c6bb4e14
      userfaultfd: uAPI · c90748b0
      Andrea Arcangeli authored
      Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.
      c90748b0
      userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key · 34de35c8
      Andrea Arcangeli authored
      userfaultfd needs to wake all waiters (passing 0 as the nr
      parameter), instead of the current hardcoded 1 (which would wake just
      the first waiter in the head list).
      34de35c8
      userfaultfd: linux/Documentation/vm/userfaultfd.txt · 87d7a171
      Andrea Arcangeli authored
      Add documentation.
      87d7a171
      kvm: fix crash in kvm_vcpu_reload_apic_access_page · 978c6891
      Andrea Arcangeli authored
      memslot->userfault_addr is set by the kernel with an mmap executed
      from the kernel, but userland can still munmap it, leading to the
      below oops once memslot->userfault_addr points to a host virtual
      address that has no vma or mapping.
      
      [  327.538306] BUG: unable to handle kernel paging request at fffffffffffffffe
      [  327.538407] IP: [<ffffffff811a7b55>] put_page+0x5/0x50
      [  327.538474] PGD 1a01067 PUD 1a03067 PMD 0
      [  327.538529] Oops: 0000 [#1] SMP
      [  327.538574] Modules linked in: macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter ip_tables tun bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ipmi_devintf iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp dcdbas intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter lpc_ich mfd_core mei_me
      [  327.539488]  mei shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc mlx4_ib ib_sa ib_mad ib_core mlx4_en vxlan ib_addr ip_tunnel xfs libcrc32c sd_mod crc_t10dif crct10dif_common crc32c_intel mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ahci i2c_core libahci mlx4_core libata tg3 ptp pps_core megaraid_sas ntb dm_mirror dm_region_hash dm_log dm_mod
      [  327.539956] CPU: 3 PID: 3161 Comm: qemu-kvm Not tainted 3.10.0-240.el7.userfault19.4ca4011.x86_64.debug #1
      [  327.540045] Hardware name: Dell Inc. PowerEdge R420/0CN7CM, BIOS 2.1.2 01/20/2014
      [  327.540115] task: ffff8803280ccf00 ti: ffff880317c58000 task.ti: ffff880317c58000
      [  327.540184] RIP: 0010:[<ffffffff811a7b55>]  [<ffffffff811a7b55>] put_page+0x5/0x50
      [  327.540261] RSP: 0018:ffff880317c5bcf8  EFLAGS: 00010246
      [  327.540313] RAX: 00057ffffffff000 RBX: ffff880616a20000 RCX: 0000000000000000
      [  327.540379] RDX: 0000000000002014 RSI: 00057ffffffff000 RDI: fffffffffffffffe
      [  327.540445] RBP: ffff880317c5bd10 R08: 0000000000000103 R09: 0000000000000000
      [  327.540511] R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffffe
      [  327.540576] R13: 0000000000000000 R14: ffff880317c5bd70 R15: ffff880317c5bd50
      [  327.540643] FS:  00007fd230b7f700(0000) GS:ffff880630800000(0000) knlGS:0000000000000000
      [  327.540717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  327.540771] CR2: fffffffffffffffe CR3: 000000062a2c3000 CR4: 00000000000427e0
      [  327.540837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  327.540904] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  327.540974] Stack:
      [  327.541008]  ffffffffa05d6d0c ffff880616a20000 0000000000000000 ffff880317c5bdc0
      [  327.541093]  ffffffffa05ddaa2 0000000000000000 00000000002191bf 00000042f3feab2d
      [  327.541177]  00000042f3feab2d 0000000000000002 0000000000000001 0321000000000000
      [  327.541261] Call Trace:
      [  327.541321]  [<ffffffffa05d6d0c>] ? kvm_vcpu_reload_apic_access_page+0x6c/0x80 [kvm]
      [  327.543615]  [<ffffffffa05ddaa2>] vcpu_enter_guest+0x3f2/0x10f0 [kvm]
      [  327.545918]  [<ffffffffa05e2f10>] kvm_arch_vcpu_ioctl_run+0x2b0/0x5a0 [kvm]
      [  327.548211]  [<ffffffffa05e2d02>] ? kvm_arch_vcpu_ioctl_run+0xa2/0x5a0 [kvm]
      [  327.550500]  [<ffffffffa05ca845>] kvm_vcpu_ioctl+0x2b5/0x680 [kvm]
      [  327.552768]  [<ffffffff810b8d12>] ? creds_are_invalid.part.1+0x12/0x50
      [  327.555069]  [<ffffffff810b8d71>] ? creds_are_invalid+0x21/0x30
      [  327.557373]  [<ffffffff812d6066>] ? inode_has_perm.isra.49.constprop.65+0x26/0x80
      [  327.559663]  [<ffffffff8122d985>] do_vfs_ioctl+0x305/0x530
      [  327.561917]  [<ffffffff8122dc51>] SyS_ioctl+0xa1/0xc0
      [  327.564185]  [<ffffffff816de829>] system_call_fastpath+0x16/0x1b
      [  327.566480] Code: 0b 31 f6 4c 89 e7 e8 4b 7f ff ff 0f 0b e8 24 fd ff ff e9 a9 fd ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 2a 8b 47 1c 85 c0 74 1e f0
      978c6891
      mm: gup: use get_user_pages_fast instead of get_user_pages_unlocked · bc26cb8c
      Andrea Arcangeli authored
      Just an optimization: use get_user_pages_fast where possible.
      bc26cb8c
      mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious · 051bd3c6
      Andrea Arcangeli authored
      This teaches gup_fast and __gup_fast to re-enable irqs and
      cond_resched() if possible every BATCH_PAGES.

      This must be implemented by other archs as well, and it's a
      requirement before converting more get_user_pages() calls to
      get_user_pages_fast() as an optimization (instead of using
      get_user_pages_unlocked, which would be slower).
      051bd3c6
      mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode · 123cb69c
      Andrea Arcangeli authored
      This adds compaction to zone_reclaim so that enabling THP won't
      decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.

      It is important to boot with numa_zonelist_order=n (n means nodes) to
      get more accurate NUMA locality if there are multiple zones per node.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      123cb69c
      mm: zone_reclaim: after a successful zone_reclaim check the min watermark · ea160218
      Andrea Arcangeli authored
      If we're in the fast path and zone_reclaim() succeeded, it means we
      freed enough memory and we can use the min watermark to have some
      margin against concurrent allocations from other CPUs or interrupts.
      ea160218
      mm: zone_reclaim: compaction: export compact_zone_order() · b6c2abee
      Andrea Arcangeli authored
      Needed by zone_reclaim_mode compaction-awareness.
      b6c2abee
      mm: zone_reclaim: compaction: increase the high order pages in the watermarks · b14cae31
      Andrea Arcangeli authored
      Prevent the scaling down from reducing the watermarks too much.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      b14cae31
      mm: compaction: don't require high order pages below min wmark · 7a271558
      Andrea Arcangeli authored
      The min wmark should be satisfiable with just 1 hugepage, and the
      other wmarks should be adjusted accordingly. We need the low wmark
      check to succeed if there's a significant amount of order-0 pages,
      but we don't need plenty of high-order pages, because the PF_MEMALLOC
      paths don't require those. Creating a ton of high-order pages that
      cannot be allocated by the high-order allocation paths (no
      PF_MEMALLOC) is quite wasteful, because they can be split into
      lower-order pages before anybody has a chance to allocate them.
      7a271558
      mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable · af24516e
      Andrea Arcangeli authored
      If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and
      plenty of free memory), compaction is otherwise crippled and stops
      running for a while after the free/isolation cursors meet. After
      that, allocation can fail for a full cycle of compaction_deferred,
      until compaction_restarting finally resets it again.

      Stopping compaction for a full cycle after the cursors meet, even if
      it never failed and it's not going to fail, doesn't make sense.

      We already throttle compaction CPU utilization using
      defer_compaction. We shouldn't prevent compaction from running after
      each pass completes when the cursors meet, unless it failed.

      This makes direct compaction functional again. The throttling of
      direct compaction is still controlled by the defer_compaction
      logic.

      kswapd still won't risk resetting compaction, and it will wait for
      direct compaction to do so. Not sure if this is ideal, but it at
      least decreases the risk of kswapd doing too much work. kswapd will
      only run one pass of compaction until some allocation invokes
      compaction again.
      
      This decreased reliability of compaction was introduced in commit
      62997027.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      af24516e
      mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory · 02eaa78b
      Andrea Arcangeli authored
      Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      02eaa78b
      mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED · 43dc77b0
      Andrea Arcangeli authored
      ZONE_RECLAIM_LOCKED breaks zone_reclaim_mode=1: if more than one
      thread allocates memory at the same time, it forces a premature
      allocation into remote NUMA nodes even when there's plenty of clean
      cache to reclaim in the local nodes.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      43dc77b0
      mm: fix the theoretical compound_lock() vs prep_new_page() race · 93585352
      Oleg Nesterov authored
      get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
      compound_lock(). In theory this page_head can be already freed and
      reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
      get_page_unless_zero() can succeed right after set_page_refcounted(),
      and compound_lock() can race with the non-atomic __SetPageHead().
      
      Perhaps we should rework the thp locking (under discussion), but
      until then this patch moves set_page_refcounted() and adds a wmb()
      to ensure that page->_count != 0 comes as the last change.
      
      I am not sure about other callers of set_page_refcounted(), but at
      first glance they look fine to me.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      93585352
      oom: allow !__GFP_FS allocations access emergency reserves like __GFP_NOFAIL · fa175d10
      Andrea Arcangeli authored
      With the previous two commits I cannot reproduce any ext4-related
      livelocks anymore; however I hit ext4 memory corruption. ext4 thinks
      it can handle alloc_pages failures and doesn't use __GFP_NOFAIL in
      some places, but it actually cannot. No surprise, as those error
      paths could never run before, so they're likely untested.

      I logged all the stack traces of all ext4 failures that led to the
      final ext4 corruption; at least one of them should be the culprit
      (the last ones are more probable). The actual bug in the error paths
      should be found by code review (or the error paths should be deleted
      and __GFP_NOFAIL should be added to the gfp_mask).

      Until ext4 is fixed, it is safer to treat !__GFP_FS like __GFP_NOFAIL
      if TIF_MEMDIE is not set (so we cannot exercise any new allocation
      error path in kernel threads, because they're never picked as OOM
      killer victims and TIF_MEMDIE never gets set on them).

      I assume other filesystems may have become complacent about this
      accommodating allocator behavior, which cannot fail an allocation
      invoked by a kernel thread either; but the longer we keep the
      __GFP_NOFAIL behavior in should_alloc_retry for small-order
      allocations, the less robust these error paths will become and the
      harder it will be to remove this livelock-prone assumption from
      should_alloc_retry. In fact we should remove that assumption not just
      for !__GFP_FS allocations.

      In practice with this fix there's no regression and all livelocks are
      still gone. The only risk in this approach is exhausting the
      emergency reserves earlier than before, but only during OOM (during
      normal runtime, GFP_ATOMIC or other __GFP_MEMALLOC allocation
      reliability is not affected). Clearly this actually reduces the
      livelock risk (verified in practice too), so it is a low-risk net
      improvement to the OOM handling with no risk of regression, because
      this way no new allocation error paths are exercised.
      fa175d10
      oom: fix ext4 __GFP_NOFAIL livelock · 47fb3887
      Andrea Arcangeli authored
      The previous commit fixed an ext4 livelock by not making !__GFP_FS
      allocations behave similarly to __GFP_NOFAIL, and I mentioned how
      __GFP_NOFAIL is livelock-prone.

      After letting the trinity load run for a while I actually hit the
      very __GFP_NOFAIL livelock too:
      
       #0  get_page_from_freelist (gfp_mask=0x20858, nodemask=0x0 <irq_stack_union>, order=0x0, zonelist=0xffff88007fffc100, hi
      gh_zoneidx=0x2, alloc_flags=0xc0, preferred_zone=0xffff88007fffa840, classzone_idx=classzone_idx@entry=0x1, migratetype=
      migratetype@entry=0x2) at mm/page_alloc.c:1953
       #1  0xffffffff81178e88 in __alloc_pages_slowpath (migratetype=0x2, classzone_idx=0x1, preferred_zone=0xffff88007fffa840,
       nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x20858) at
      mm/page_alloc.c:2597
       #2  __alloc_pages_nodemask (gfp_mask=<optimized out>, order=0x0, zonelist=0xffff87fffffffffa, nodemask=0x0 <irq_stack_un
      ion>) at mm/page_alloc.c:2832
       #3  0xffffffff811becab in alloc_pages_current (gfp=0x20858, order=0x0) at mm/mempolicy.c:2100
       #4  0xffffffff8116e450 in alloc_pages (order=0x0, gfp_mask=0x20858) at include/linux/gfp.h:336
       #5  __page_cache_alloc (gfp=0x20858) at mm/filemap.c:663
       #6  0xffffffff8116f03c in pagecache_get_page (mapping=0xffff88007cc03908, offset=0xc920f, fgp_flags=0x7, cache_gfp_mask=
      0x20858, radix_gfp_mask=0x850) at mm/filemap.c:1096
       #7  0xffffffff812160f4 in find_or_create_page (mapping=<optimized out>, gfp_mask=<optimized out>, offset=0xc920f) at inc
      lude/linux/pagemap.h:336
       #8  grow_dev_page (sizebits=0x0, size=0x1000, index=0xc920f, block=0xc920f, bdev=0xffff88007cc03580) at fs/buffer.c:1022
       #9  grow_buffers (size=<optimized out>, block=<optimized out>, bdev=<optimized out>) at fs/buffer.c:1095
       #10 __getblk_slow (size=0x1000, block=0xc920f, bdev=0xffff88007cc03580) at fs/buffer.c:1121
       #11 __getblk (bdev=0xffff88007cc03580, block=0xc920f, size=0x1000) at fs/buffer.c:1395
       #12 0xffffffff8125c8ed in sb_getblk (block=0xc920f, sb=<optimized out>) at include/linux/buffer_head.h:310
       #13 ext4_read_block_bitmap_nowait (sb=0xffff88007c579000, block_group=0x2f) at fs/ext4/balloc.c:407
       #14 0xffffffff8125ced4 in ext4_read_block_bitmap (sb=0xffff88007c579000, block_group=0x2f) at fs/ext4/balloc.c:489
       #15 0xffffffff8167963b in ext4_mb_discard_group_preallocations (sb=0xffff88007c579000, group=0x2f, needed=0x38) at fs/ex
      t4/mballoc.c:3798
       #16 0xffffffff8129ddbd in ext4_mb_discard_preallocations (needed=0x38, sb=0xffff88007c579000) at fs/ext4/mballoc.c:4346
       #17 ext4_mb_new_blocks (handle=0xffff88003305ee98, ar=0xffff88001f50b890, errp=0xffff88001f50b880) at fs/ext4/mballoc.c:4479
       #18 0xffffffff81290fd3 in ext4_ext_map_blocks (handle=0xffff88003305ee98, inode=0xffff88007b85b178, map=0xffff88001f50ba50, flags=0x25) at fs/ext4/extents.c:4453
       #19 0xffffffff81265688 in ext4_map_blocks (handle=0xffff88003305ee98, inode=0xffff88007b85b178, map=0xffff88001f50ba50, flags=0x25) at fs/ext4/inode.c:648
       #20 0xffffffff8126af77 in mpage_map_one_extent (mpd=0xffff88001f50ba28, handle=0xffff88003305ee98) at fs/ext4/inode.c:2164
       #21 mpage_map_and_submit_extent (give_up_on_write=<synthetic pointer>, mpd=0xffff88001f50ba28, handle=0xffff88003305ee98) at fs/ext4/inode.c:2219
       #22 ext4_writepages (mapping=0xffff88007b85b350, wbc=0xffff88001f50bb60) at fs/ext4/inode.c:2557
       #23 0xffffffff8117ce81 in do_writepages (mapping=0xffff88007b85b350, wbc=0xffff88001f50bb60) at mm/page-writeback.c:2046
       #24 0xffffffff812096c0 in __writeback_single_inode (inode=0xffff88007b85b178, wbc=0xffff88001f50bb60) at fs/fs-writeback.c:460
       #25 0xffffffff8120b311 in writeback_sb_inodes (sb=0xffff88007c579000, wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:687
       #26 0xffffffff8120b68f in __writeback_inodes_wb (wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:732
       #27 0xffffffff8120b94b in wb_writeback (wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:863
       #28 0xffffffff8120befc in wb_do_writeback (wb=0xffff88007bceb060) at fs/fs-writeback.c:998
       #29 bdi_writeback_workfn (work=0xffff88007bceb078) at fs/fs-writeback.c:1043
       #30 0xffffffff81092cf5 in process_one_work (worker=0xffff88002c555e80, work=0xffff88007bceb078) at kernel/workqueue.c:2081
       #31 0xffffffff8109376b in worker_thread (__worker=0xffff88002c555e80) at kernel/workqueue.c:2212
       #32 0xffffffff8109ba54 in kthread (_create=0xffff88007bf2e2c0) at kernel/kthread.c:207
       #33 <signal handler called>
       #34 0x0000000000000000 in irq_stack_union ()
       #35 0x0000000000000000 in ?? ()
      
      To resolve this I manually set ALLOC_NO_WATERMARKS in alloc_flags
      with gdb, and the livelock resolved itself.

      The fix simply allows a __GFP_NOFAIL allocation to get access to the
      emergency reserves in the buddy allocator if __GFP_NOFAIL triggers a
      reclaim failure signaling an out-of-memory condition. Worst case
      it'll deadlock because we run out of emergency reserves, but not
      giving it access to the emergency reserves after __GFP_NOFAIL hits an
      out-of-memory condition may actually result in a livelock despite
      there still being ~50MB free! So this is safer. After applying this
      OOM livelock fix I cannot reproduce the __GFP_NOFAIL livelock
      anymore.
      47fb3887
    • Andrea Arcangeli's avatar
      mm: ext4 livelock during OOM · 7636db0a
      Andrea Arcangeli authored
      I can easily reproduce a livelock with some trinity load on a 2GB
      guest running in parallel:
      
      	./trinity -X -c remap_anon_pages -q
      	./trinity -X -c userfaultfd -q
      
      The last OOM killer invocation selects this task:
      
      Out of memory: Kill process 6537 (trinity-c6) score 106 or sacrifice child
      Killed process 6537 (trinity-c6) total-vm:414772kB, anon-rss:186744kB, file-rss:560kB
      
       Shortly afterwards the hung task detector flags the victim task
       for having been in uninterruptible state too long:
      
      INFO: task trinity-c6:6537 blocked for more than 120 seconds.
            Not tainted 3.16.0-rc1+ #4
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      trinity-c6      D ffff88004ec37b50 11080  6537   5530 0x00100004
       ffff88004ec37aa8 0000000000000082 0000000000000000 ffff880039174910
       ffff88004ec37fd8 0000000000004000 ffff88007c8de3d0 ffff880039174910
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
      Call Trace:
       [<ffffffff8167c759>] ? schedule+0x29/0x70
       [<ffffffff8112ae42>] ? __delayacct_blkio_start+0x22/0x30
       [<ffffffff8116e290>] ? __lock_page+0x70/0x70
       [<ffffffff8167c759>] schedule+0x29/0x70
       [<ffffffff8167c82f>] io_schedule+0x8f/0xd0
       [<ffffffff8116e29e>] sleep_on_page+0xe/0x20
       [<ffffffff8167cd33>] __wait_on_bit_lock+0x73/0xb0
       [<ffffffff8116e287>] __lock_page+0x67/0x70
       [<ffffffff810c34d0>] ? wake_atomic_t_function+0x40/0x40
       [<ffffffff8116f125>] pagecache_get_page+0x165/0x1f0
       [<ffffffff8116f3d4>] grab_cache_page_write_begin+0x34/0x50
       [<ffffffff81268f82>] ext4_da_write_begin+0x92/0x380
       [<ffffffff8116d717>] generic_perform_write+0xc7/0x1d0
       [<ffffffff8116fee3>] __generic_file_write_iter+0x173/0x350
       [<ffffffff8125e6ad>] ext4_file_write_iter+0x10d/0x3c0
       [<ffffffff811db72b>] ? vfs_write+0x1bb/0x1f0
       [<ffffffff811da581>] new_sync_write+0x81/0xb0
       [<ffffffff811db62f>] vfs_write+0xbf/0x1f0
       [<ffffffff811dbb72>] SyS_write+0x52/0xc0
       [<ffffffff816816d2>] system_call_fastpath+0x16/0x1b
      3 locks held by trinity-c6/6537:
       #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811fc0ee>] __fdget_pos+0x3e/0x50
       #1:  (sb_writers#3){.+.+.+}, at: [<ffffffff811db72b>] vfs_write+0x1bb/0x1f0
       #2:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8125e62d>] ext4_file_write_iter+0x8d/0x3c0
      
       The task holding the page lock is likely the one below, which
       never returns from __alloc_pages_slowpath.
      
       (gdb backtrace of the stuck allocation; the left side of each
       frame was truncated in this log. The surviving frame locations
       are mm/page_alloc.c:2661, mm/page_alloc.c:2821,
       mm/filemap.c:1096, ...emap.h:336, fs/ext4/mballoc.c:4442,
       fs/ext4/extents.c:4453 and fs/ext4/inode.c:648, with
       gfp_mask=0x50 and order=0x0.)
      
      (full stack trace of the !__GFP_FS allocation in the kernel thread
      holding the page lock below)
      
       gfp_mask in __alloc_pages_slowpath is 0x50, i.e.
       ___GFP_IO|___GFP_WAIT. ext4_writepages, run from a kworker
       kernel thread, is holding the page lock that the OOM victim task
       is waiting on.
      
       If alloc_pages returned NULL the whole livelock would resolve
       itself: -ENOMEM would be returned all the way up (ext4 thinks it
       can handle it; in reality it cannot, but that's for a later
       patch in this series), ext4_writepages would return, and the
       kworker would try again later to flush the dirty pages in the
       dirty inodes.
      
       To verify this, I set a breakpoint in should_alloc_retry and
       added __GFP_NORETRY to the gfp_mask just before the
       __GFP_NORETRY check.
      
      gdb> b mm/page_alloc.c:2185
      Breakpoint 8 at 0xffffffff81179122: file mm/page_alloc.c, line 2185.
      gdb> c
      Continuing.
      [Switching to Thread 1]
      _______________________________________________________________________________
           eax:00000000 ebx:00000050  ecx:00000001  edx:00000001     eflags:00000206
           esi:0000196A edi:2963AA50  esp:4EDEF448  ebp:4EDEF568     eip:Error while running hook_stop:
      Value can't be converted to integer.
      
      Breakpoint 8, __alloc_pages_slowpath (migratetype=0x0, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2713
      
       I set the breakpoint at line 2185 and it stopped at 2713, but
       2713 falls in the middle of a comment; I assume that's an
       addr2line imperfection and it's not relevant.
      
      Then I simply added __GFP_NORETRY to the gfp_mask in the stack:
      
      gdb> print gfp_mask
      $1 = 0x50
      gdb> set gfp_mask = 0x1050
      gdb> p gfp_mask
      $2 = 0x1050
      gdb> c
      Continuing.
      
      After that the livelock resolved itself immediately, the OOM victim
      quit and the workload continued without errors.
      
       The problem was probably introduced in commit 11e33f6a.
      
      	/*
      	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
      	 * means __GFP_NOFAIL, but that may not be true in other
      	 * implementations.
      	 */
      	if (order <= PAGE_ALLOC_COSTLY_ORDER)
      		return 1;
      
       Retrying forever and depending on the OOM killer to send a
       SIGKILL is only ok if the victim task isn't sleeping in
       uninterruptible state waiting, in this case, for a kernel thread
       to release a page lock.
      
       In this case the kernel thread holding the lock would never be
       picked by the OOM killer in the first place, so this is an error
       path in ext4 that is probably never exercised.
      
       The objective of the implicit __GFP_NOFAIL behavior (which,
       unlike a real __GFP_NOFAIL that never fails, does fail the
       allocation if TIF_MEMDIE is set on the task) is, I assume, to
       avoid spurious VM_FAULT_OOM and make the OOM killer more
       reliable, but I don't think an almost-implicit __GFP_NOFAIL is
       safe in general. __GFP_NOFAIL is unsafe too, in fact, but at
       least it is very rarely used.
      
       For now we can start by letting the allocations that hold
       lowlevel filesystem locks (__GFP_FS clear) fail, so they can
       release those locks. Those locks tend to be uninterruptible too.
      
       This will reduce the scope of the problem, but I would rather
       drop the entire check quoted above from should_alloc_retry. If a
       kernel thread called use_mm(), took the mmap_sem for writing,
       and then ran a GFP_KERNEL allocation that invoked the OOM killer
       to kill a process waiting in down_read in __do_page_fault, the
       same problem would emerge.
      
      In short it's a tradeoff between the accuracy of the OOM killer (not
      causing spurious allocation failures in addition to killing the task)
      and the risk of livelock.
      
       Furthermore, the longer we hold off on this change, the longer
       those ext4 allocations done by kernel threads (which are never
       picked by the OOM killer, and so never get TIF_MEMDIE set to let
       alloc_pages fail once in a while) will remain unexercised
       failure paths.
      
       After the fix, the identical trinity load that reliably
       reproduces the problem completes, and in addition to the OOM
       killer info I get, as expected, the following in the kernel log
       (before the fix I would hit the livelock earlier, instead of
       this allocation error).
      
      kworker/u16:0: page allocation failure: order:0, mode:0x50
      CPU: 2 PID: 6006 Comm: kworker/u16:0 Tainted: G        W     3.16.0-rc1+ #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Workqueue: writeback bdi_writeback_workfn (flush-254:0)
       0000000000000000 ffff880006d3f408 ffffffff81679b51 ffff88007fc8f068
       0000000000000050 ffff880006d3f498 ffffffff811748c2 0000000000000010
       ffffffffffffffff ffff88007fffc128 ffffffff810d0bed ffff880006d3f468
      Call Trace:
       [<ffffffff81679b51>] dump_stack+0x4e/0x68
       [<ffffffff811748c2>] warn_alloc_failed+0xe2/0x130
       [<ffffffff810d0bed>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff811791d0>] __alloc_pages_nodemask+0x910/0xb10
       [<ffffffff811becab>] alloc_pages_current+0x8b/0x120
       [<ffffffff8116e450>] __page_cache_alloc+0x10/0x20
       [<ffffffff8116f03c>] pagecache_get_page+0x7c/0x1f0
       [<ffffffff81299004>] ext4_mb_load_buddy+0x274/0x3b0
       [<ffffffff8129a3f2>] ext4_mb_regular_allocator+0x1e2/0x480
       [<ffffffff81296931>] ? ext4_mb_use_preallocated+0x31/0x600
       [<ffffffff8129de28>] ext4_mb_new_blocks+0x568/0x7f0
       [<ffffffff81290fd3>] ext4_ext_map_blocks+0x683/0x1970
       [<ffffffff81265688>] ext4_map_blocks+0x168/0x4d0
       [<ffffffff8126af77>] ext4_writepages+0x6e7/0x1030
       [<ffffffff8117ce81>] do_writepages+0x21/0x50
       [<ffffffff812096c0>] __writeback_single_inode+0x40/0x550
       [<ffffffff8120b311>] writeback_sb_inodes+0x281/0x560
       [<ffffffff8120b68f>] __writeback_inodes_wb+0x9f/0xd0
       [<ffffffff8120b94b>] wb_writeback+0x28b/0x510
       [<ffffffff8120befc>] bdi_writeback_workfn+0x11c/0x6a0
       [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
       [<ffffffff81092cf5>] process_one_work+0x1c5/0x620
       [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
       [<ffffffff8109376b>] worker_thread+0x11b/0x4f0
       [<ffffffff81093650>] ? init_pwq+0x190/0x190
       [<ffffffff8109ba54>] kthread+0xe4/0x100
       [<ffffffff8109b970>] ? __init_kthread_worker+0x70/0x70
       [<ffffffff8168162c>] ret_from_fork+0x7c/0xb0
       Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      7636db0a
    • Andrea Arcangeli's avatar
      Revert "Merge tag 'usb-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb" · fb684033
      Andrea Arcangeli authored
      This reverts commit 42e3a58b, reversing
      changes made to 4fd48b45.
      fb684033
  2. 13 May, 2015 2 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 110bc767
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
          Avri Altman.
      
       2) Use the correct FW API for scan completions in iwlwifi, from Avraham
          Stern.
      
       3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad
          Kaufman.
      
       4) rhashtable conversion of mac80211 station table was buggy, the
          virtual interface was not taken into account.  Fix from Johannes
          Berg.
      
       5) Fix deadlock in rtlwifi by not using a zero timeout for
          usb_control_msg(), from Larry Finger.
      
       6) Update reordering state before calculating loss detection, from
          Yuchung Cheng.
      
        7) Fix off by one in bluetooth firmware parsing, from Dan Carpenter.
      
        8) Fix extended frame handling in xilinx_can driver, from Jeppe
           Ledet-Pedersen.
      
       9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
          from Eric Dumazet.
      
      10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.
      
       11) macvlan needs to propagate promisc settings down to the lower
           device, from Vlad Yasevich.
      
      12) igb driver can oops when changing number of rings, from Toshiaki
          Makita.
      
      13) Source specific default routes not handled properly in ipv6, from
          Markus Stenberg.
      
      14) Use after free in tc_ctl_tfilter(), from WANG Cong.
      
      15) Use softirq spinlocking in netxen driver, from Tony Camuso.
      
      16) Two ARM bpf JIT fixes from Nicolas Schichan.
      
      17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
          Mathias Kretschmer.
      
      18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
          Starovoitov.
      
      19) ll_temac driver DMA maps TX packet header with incorrect length, fix
          from Michal Simek.
      
      20) We removed pm_qos bits from netdevice.h, but some indirect
          references remained.  Kill them.  From David Ahern.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
        net: Remove remaining remnants of pm_qos from netdevice.h
        e1000e: Add pm_qos header
        net: phy: micrel: Fix regression in kszphy_probe
        net: ll_temac: Fix DMA map size bug
        x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
        netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
        Update be2net maintainers' email addresses
        net_sched: gred: use correct backlog value in WRED mode
        pppoe: drop pppoe device in pppoe_unbind_sock_work
        net: qca_spi: Fix possible race during probe
        net: mdio-gpio: Allow for unspecified bus id
        af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
        bnx2x: limit fw delay in kdump to 5s after boot
        ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
        ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
        mpls: Change reserved label names to be consistent with netbsd
        usbnet: avoid integer overflow in start_xmit
        netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
        net: xgene_enet: Set hardware dependency
        net: amd-xgbe: Add hardware dependency
        ...
      110bc767
    • David Ahern's avatar
      net: Remove remaining remnants of pm_qos from netdevice.h · 01d460dd
      David Ahern authored
      Commit e2c65448 removed pm_qos from struct net_device but left the
      comment and header file. Remove those.
       Signed-off-by: David Ahern <dsahern@gmail.com>
      Cc: Thomas Graf <tgraf@suug.ch>
       Signed-off-by: David S. Miller <davem@davemloft.net>
      01d460dd