1. 18 Mar, 2015 35 commits
    • Vlastimil Babka's avatar
      mm: when stealing freepages, also take pages created by splitting buddy page · cdf47668
      Vlastimil Babka authored
      commit 99592d59 upstream.
      
      When studying page stealing, I noticed some weird looking decisions in
      try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
      RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
      so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always clear win and Joonsoo questioned the
      results.  Here I used different baseline which includes RFC compaction
      improvements from [1].  I found that the compaction improvements reduce
      variability of stress-highalloc, so there's less noise in the data.
      
      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  First
      column is after fresh reboot, other two are reiterations of test without
      reboot.  That was all accumulater over 5 re-iterations (so the benchmark
      was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we've seen apparent regression with Patch 1 for unmovable events,
      this is now gone, suggesting it was indeed noise.  Here, each patch
      improves the situation for unmovable events.  Reclaimable is improved by
      patch 1 and then either the same modulo noise, or perhaps sligtly worse -
      a small price for unmovable improvements, IMHO.  The number of movable
      allocations falling back to other migratetypes is most noisy, but it's
      reduced to half at Patch 2 nevertheless.  These are least critical as
      compaction can move them around.
      
      If we look at success rates, the patches don't affect them, that didn't change.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider reduced fragmentation events
      to be worth on its own.  Patch 2 also seems to reduce scanning for free
      pages, and migrations in compaction, suggesting it has somewhat less work
      to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like THP page fault (i.e.  no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given how [1] offers lower improvements
      in this scenario due to less restarts after deferred compaction which
      would change compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for first iteration are clear, the rest is much noisier
      and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, decision for 1) has implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_allo.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it makes thus sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdf47668
    • Andrey Ryabinin's avatar
      mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? · 89316deb
      Andrey Ryabinin authored
      commit 3cd7645d upstream.
      
      Commit ed4d4902 ("mm, hugetlb: remove hugetlb_zero and
      hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
      leading to out-of-bounds access in proc_doulongvec_minmax().  Use
      '.extra1 = NULL' instead of '.extra1 = &zero'.  Passing NULL is
      equivalent to passing minimal value, which is 0 for unsigned types.
      
      Fixes: ed4d4902 ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Suggested-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      89316deb
    • Naoya Horiguchi's avatar
      mm/hugetlb: add migration entry check in __unmap_hugepage_range · 3ffc797a
      Naoya Horiguchi authored
      commit 9fbc1f63 upstream.
      
      If __unmap_hugepage_range() tries to unmap the address range over which
      hugepage migration is on the way, we get the wrong page because pte_page()
      doesn't work for migration entries.  This patch simply clears the pte for
      migration entries as we do for hwpoison entries.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ffc797a
    • Naoya Horiguchi's avatar
      mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection · 2c90c58c
      Naoya Horiguchi authored
      commit a8bda28d upstream.
      
      There is a race condition between hugepage migration and
      change_protection(), where hugetlb_change_protection() doesn't care about
      migration entries and wrongly overwrites them.  That causes unexpected
      results like kernel crash.  HWPoison entries also can cause the same
      problem.
      
      This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
      function to do proper actions.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c90c58c
    • Naoya Horiguchi's avatar
      mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() · 75809873
      Naoya Horiguchi authored
      commit 0f792cf9 upstream.
      
      When running the test which causes the race as shown in the previous patch,
      we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
      
      This race happens when pte turns into migration entry just after the first
      check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
      To fix this, we need to check pte_present() again after huge_ptep_get().
      
      This patch also reorders taking ptl and doing pte_page(), because
      pte_page() should be done in ptl.  Due to this reordering, we need use
      trylock_page() in page != pagecache_page case to respect locking order.
      
      Fixes: 66aebce7 ("hugetlb: fix race condition in hugetlb_fault()")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      75809873
    • Jiri Pirko's avatar
      team: don't traverse port list using rcu in team_set_mac_address · dde0b1d5
      Jiri Pirko authored
      [ Upstream commit 9215f437 ]
      
      Currently the list is traversed using rcu variant. That is not correct
      since dev_set_mac_address can be called which eventually calls
      rtmsg_ifinfo_build_skb and there, skb allocation can sleep. So fix this
      by remove the rcu usage here.
      
      Fixes: 3d249d4c "net: introduce ethernet teaming device"
      Signed-off-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dde0b1d5
    • Lorenzo Colitti's avatar
      net: ping: Return EAFNOSUPPORT when appropriate. · 2391f6b4
      Lorenzo Colitti authored
      [ Upstream commit 9145736d ]
      
      1. For an IPv4 ping socket, ping_check_bind_addr does not check
         the family of the socket address that's passed in. Instead,
         make it behave like inet_bind, which enforces either that the
         address family is AF_INET, or that the family is AF_UNSPEC and
         the address is 0.0.0.0.
      2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
         if the socket family is not AF_INET6. Return EAFNOSUPPORT
         instead, for consistency with inet6_bind.
      3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
         instead of EINVAL if an incorrect socket address structure is
         passed in.
      4. Make IPv6 ping sockets be IPv6-only. The code does not support
         IPv4, and it cannot easily be made to support IPv4 because
         the protocol numbers for ICMP and ICMPv6 are different. This
         makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
         of making the socket unusable.
      
      Among other things, this fixes an oops that can be triggered by:
      
          int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
          struct sockaddr_in6 sin6 = {
              .sin6_family = AF_INET6,
              .sin6_addr = in6addr_any,
          };
          bind(s, (struct sockaddr *) &sin6, sizeof(sin6));
      
      Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
      Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2391f6b4
    • Michal Kubeček's avatar
      udp: only allow UFO for packets from SOCK_DGRAM sockets · 123303d4
      Michal Kubeček authored
      [ Upstream commit acf8dd0a ]
      
      If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
      UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
      CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
      checksum is to be computed on segmentation. However, in this case,
      skb->csum_start and skb->csum_offset are never set as raw socket
      transmit path bypasses udp_send_skb() where they are usually set. As a
      result, driver may access invalid memory when trying to calculate the
      checksum and store the result (as observed in virtio_net driver).
      
      Moreover, the very idea of modifying the userspace provided UDP header
      is IMHO against raw socket semantics (I wasn't able to find a document
      clearly stating this or the opposite, though). And while allowing
      CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
      too intrusive change just to handle a corner case like this. Therefore
      disallowing UFO for packets from SOCK_DGRAM seems to be the best option.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      123303d4
    • Ben Shelton's avatar
      usb: plusb: Add support for National Instruments host-to-host cable · da122a6b
      Ben Shelton authored
      [ Upstream commit 42c972a1 ]
      
      The National Instruments USB Host-to-Host Cable is based on the Prolific
      PL-25A1 chipset.  Add its VID/PID so the plusb driver will recognize it.
      Signed-off-by: default avatarBen Shelton <ben.shelton@ni.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      da122a6b
    • Eric Dumazet's avatar
      net: do not use rcu in rtnl_dump_ifinfo() · 25ba5bb0
      Eric Dumazet authored
      [ Upstream commit cac5e65e ]
      
      We did a failed attempt in the past to only use rcu in rtnl dump
      operations (commit e67f88dd "net: dont hold rtnl mutex during
      netlink dump callbacks")
      
      Now that dumps are holding RTNL anyway, there is no need to also
      use rcu locking, as it forbids any scheduling ability, like
      GFP_KERNEL allocations that controlling path should use instead
      of GFP_ATOMIC whenever possible.
      
      This should fix following splat Cong Wang reported :
      
       [ INFO: suspicious RCU usage. ]
       3.19.0+ #805 Tainted: G        W
      
       include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 0
       2 locks held by ip/771:
        #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
        #1:  (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e
      
       stack backtrace:
       CPU: 3 PID: 771 Comm: ip Tainted: G        W       3.19.0+ #805
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
        ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
        00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
       Call Trace:
        [<ffffffff81a27457>] dump_stack+0x4c/0x65
        [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
        [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
        [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
        [<ffffffff8109e67d>] __might_sleep+0x78/0x80
        [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
        [<ffffffff817c1383>] alloc_netid+0x61/0x69
        [<ffffffff817c14c3>] __peernet2id+0x79/0x8d
        [<ffffffff817c1ab7>] peernet2id+0x13/0x1f
        [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
        [<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
        [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
        [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
        [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
        [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
        [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
        [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
        [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
        [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
        [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
        [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
        [<ffffffff811ac556>] ? __fget_light+0x2d/0x52
        [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
        [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reported-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25ba5bb0
    • Geert Uytterhoeven's avatar
      sh_eth: Fix lost MAC address on kexec · d97191b0
      Geert Uytterhoeven authored
      [ Upstream commit a14c7d15 ]
      
      Commit 740c7f31 ("sh_eth: Ensure DMA engines are stopped before
      freeing buffers") added a call to sh_eth_reset() to the
      sh_eth_set_ringparam() and sh_eth_close() paths.
      
      However, setting the software reset bit(s) in the EDMR register resets
      the MAC Address Registers to zero. Hence after kexec, the new kernel
      doesn't detect a valid MAC address and assigns a random MAC address,
      breaking DHCP.
      
      Set the MAC address again after the reset in sh_eth_dev_exit() to fix
      this.
      
      Tested on r8a7740/armadillo (GETHER) and r8a7791/koelsch (FAST_RCAR).
      
      Fixes: 740c7f31 ("sh_eth: Ensure DMA engines are stopped before freeing buffers")
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d97191b0
    • Florian Fainelli's avatar
      net: bcmgenet: fix software maintained statistics · 8e0f5ee1
      Florian Fainelli authored
      [ Upstream commit f62ba9c1 ]
      
      Commit 44c8bc3c ("net: bcmgenet: log RX buffer allocation and RX/TX dma
      failures") added a few software maintained statistics using
      BCMGENET_STAT_MIB_RX and BCMGENET_STAT_MIB_TX. These statistics are read from
      the hardware MIB counters, such that bcmgenet_update_mib_counters() was trying
      to read from a non-existing MIB offset for these counters.
      
      Fix this by introducing a special type: BCMGENET_STAT_SOFT, similar to
      BCMGENET_STAT_NETDEV, such that bcmgenet_get_ethtool_stats will read from the
      software mib.
      
      Fixes: 44c8bc3c ("net: bcmgenet: log RX buffer allocation and RX/TX dma failures")
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e0f5ee1
    • Jaedon Shin's avatar
      net: bcmgenet: fix throughtput regression · 001f8cee
      Jaedon Shin authored
      [ Upstream commit 4092e6ac ]
      
      This patch adds bcmgenet_tx_poll for the tx_rings. This can reduce the
      interrupt load and send xmit in network stack on time. This also
      separated for the completion of tx_ring16 from bcmgenet_poll.
      
      The bcmgenet_tx_reclaim of tx_ring[{0,1,2,3}] operative by an interrupt
      is to be not more than a certain number TxBDs. It is caused by too
      slowly reclaiming the transmitted skb. Therefore, performance
      degradation of xmit after 605ad7f1 ("tcp: refine TSO autosizing").
      Signed-off-by: default avatarJaedon Shin <jaedon.shin@gmail.com>
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      001f8cee
    • Eric Dumazet's avatar
      macvtap: make sure neighbour code can push ethernet header · 72e72674
      Eric Dumazet authored
      [ Upstream commit 2f1d8b9e ]
      
      Brian reported crashes using IPv6 traffic with macvtap/veth combo.
      
      I tracked the crashes in neigh_hh_output()
      
      -> memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);
      
      Neighbour code assumes headroom to push Ethernet header is
      at least 16 bytes.
      
      It appears macvtap has only 14 bytes available on arches
      where NET_IP_ALIGN is 0 (like x86)
      
      Effect is a corruption of 2 bytes right before skb->head,
      and possible crashes if accessing non existing memory.
      
      This fix should also increase IPv4 performance, as paranoid code
      in ip_finish_output2() wont have to call skb_realloc_headroom()
      Reported-by: default avatarBrian Rak <brak@vultr.com>
      Tested-by: default avatarBrian Rak <brak@vultr.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      72e72674
    • Catalin Marinas's avatar
      net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg · c9a44034
      Catalin Marinas authored
      [ Upstream commit d720d8ce ]
      
      With commit a7526eb5 (net: Unbreak compat_sys_{send,recv}msg), the
      MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points,
      changing the kernel compat behaviour from the one before the commit it
      was trying to fix (1be374a0, net: Block MSG_CMSG_COMPAT in
      send(m)msg and recv(m)msg).
      
      On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native
      32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by
      the kernel). However, on a 64-bit kernel, the compat ABI is different
      with commit a7526eb5.
      
      This patch changes the compat_sys_{send,recv}msg behaviour to the one
      prior to commit 1be374a0.
      
      The problem was found running 32-bit LTP (sendmsg01) binary on an arm64
      kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg()
      but the general rule is not to break user ABI (even when the user
      behaviour is not entirely sane).
      
      Fixes: a7526eb5 (net: Unbreak compat_sys_{send,recv}msg)
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9a44034
    • Jiri Pirko's avatar
      team: fix possible null pointer dereference in team_handle_frame · 82b92857
      Jiri Pirko authored
      [ Upstream commit 57e59563 ]
      
      Currently following race is possible in team:
      
      CPU0                                        CPU1
                                                  team_port_del
                                                    team_upper_dev_unlink
                                                      priv_flags &= ~IFF_TEAM_PORT
      team_handle_frame
        team_port_get_rcu
          team_port_exists
            priv_flags & IFF_TEAM_PORT == 0
          return NULL (instead of port got
                       from rx_handler_data)
                                                    netdev_rx_handler_unregister
      
      The thing is that the flag is removed before rx_handler is unregistered.
      If team_handle_frame is called in between, team_port_exists returns 0
      and team_port_get_rcu will return NULL.
      So do not check the flag here. It is guaranteed by netdev_rx_handler_unregister
      that team_handle_frame will always see valid rx_handler_data pointer.
      Signed-off-by: default avatarJiri Pirko <jiri@resnulli.us>
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82b92857
    • Eric Dumazet's avatar
      net: pktgen: disable xmit_clone on virtual devices · 37c63bf5
      Eric Dumazet authored
      [ Upstream commit 52d6c8c6 ]
      
      Trying to use burst capability (aka xmit_more) on a virtual device
      like bonding is not supported.
      
      For example, skb might be queued multiple times on a qdisc, with
      various list corruptions.
      
      Fixes: 38b2cf29 ("net: pktgen: packet bursting via skb->xmit_more")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      37c63bf5
    • David S. Miller's avatar
      Revert "r8169: add support for Byte Queue Limits" · 9ae56d8b
      David S. Miller authored
      This reverts commit 1e918876.
      
      Revert BQL support in r8169 driver as several regressions
      point to this commit and we cannot figure out the real
      cause yet.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ae56d8b
    • Matthew Thode's avatar
      net: reject creation of netdev names with colons · 465146d2
      Matthew Thode authored
      [ Upstream commit a4176a93 ]
      
      colons are used as a separator in netdev device lookup in dev_ioctl.c
      
      Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME
      Signed-off-by: default avatarMatthew Thode <mthode@mthode.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      465146d2
    • Eric Dumazet's avatar
      sock: sock_dequeue_err_skb() needs hard irq safety · 9d0ba3cf
      Eric Dumazet authored
      [ Upstream commit 997d5c3f ]
      
      Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
      from hard IRQ context.
      
      Therefore, sock_dequeue_err_skb() needs to block hard irq or
      corruptions or hangs can happen.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 364a9e93 ("sock: deduplicate errqueue dequeue")
      Fixes: cb820f8e ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d0ba3cf
    • Pravin B Shelar's avatar
      openvswitch: Fix net exit. · 488da593
      Pravin B Shelar authored
      [ Upstream commit 7b4577a9 ]
      
      Open vSwitch allows moving internal vport to different namespace
      while still connected to the bridge. But when namespace deleted
      OVS does not detach these vports, that results in dangling
      pointer to netdevice which causes kernel panic as follows.
      This issue is fixed by detaching all ovs ports from the deleted
      namespace at net-exit.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      IP: [<ffffffffa0aadaa5>] ovs_vport_locate+0x35/0x80 [openvswitch]
      Oops: 0000 [#1] SMP
      Call Trace:
       [<ffffffffa0aa6391>] lookup_vport+0x21/0xd0 [openvswitch]
       [<ffffffffa0aa65f9>] ovs_vport_cmd_get+0x59/0xf0 [openvswitch]
       [<ffffffff8167e07c>] genl_family_rcv_msg+0x1bc/0x3e0
       [<ffffffff8167e319>] genl_rcv_msg+0x79/0xc0
       [<ffffffff8167d919>] netlink_rcv_skb+0xb9/0xe0
       [<ffffffff8167deac>] genl_rcv+0x2c/0x40
       [<ffffffff8167cffd>] netlink_unicast+0x12d/0x1c0
       [<ffffffff8167d3da>] netlink_sendmsg+0x34a/0x6b0
       [<ffffffff8162e140>] sock_sendmsg+0xa0/0xe0
       [<ffffffff8162e5e8>] ___sys_sendmsg+0x408/0x420
       [<ffffffff8162f541>] __sys_sendmsg+0x51/0x90
       [<ffffffff8162f592>] SyS_sendmsg+0x12/0x20
       [<ffffffff81764ee9>] system_call_fastpath+0x12/0x17
      Reported-by: default avatarAssaf Muller <amuller@redhat.com>
      Fixes: 46df7b81("openvswitch: Add support for network namespaces.")
      Signed-off-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Reviewed-by: default avatarThomas Graf <tgraf@noironetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      488da593
    • Ignacy Gawędzki's avatar
      ematch: Fix auto-loading of ematch modules. · 7f77f6c8
      Ignacy Gawędzki authored
      [ Upstream commit 34eea79e ]
      
      In tcf_em_validate(), after calling request_module() to load the
      kind-specific module, set em->ops to NULL before returning -EAGAIN, so
      that module_put() is not called again by tcf_em_tree_destroy().
      Signed-off-by: default avatarIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f77f6c8
    • Guenter Roeck's avatar
      net: phy: Fix verification of EEE support in phy_init_eee · 1dd8e324
      Guenter Roeck authored
      [ Upstream commit 54da5a8b ]
      
      phy_init_eee uses phy_find_setting(phydev->speed, phydev->duplex)
      to find a valid entry in the settings array for the given speed
      and duplex value. For full duplex 1000baseT, this will return
      the first matching entry, which is the entry for 1000baseKX_Full.
      
      If the phy eee does not support 1000baseKX_Full, this entry will not
      match, causing phy_init_eee to fail for no good reason.
      
      Fixes: 9a9c56cb ("net: phy: fix a bug when verify the EEE support")
      Fixes: 3e707706 ("phy: Expand phy speed/duplex settings array")
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Acked-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1dd8e324
    • Alexander Drozdov's avatar
      ipv4: ip_check_defrag should not assume that skb_network_offset is zero · 8959499c
      Alexander Drozdov authored
      [ Upstream commit 3e32e733 ]
      
      ip_check_defrag() may be used by af_packet to defragment outgoing packets.
      skb_network_offset() of af_packet's outgoing packets is not zero.
      Signed-off-by: default avatarAlexander Drozdov <al.drozdov@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8959499c
    • Alexander Drozdov's avatar
      ipv4: ip_check_defrag should correctly check return value of skb_copy_bits · 1ccd26c9
      Alexander Drozdov authored
      [ Upstream commit fba04a9e ]
      
      skb_copy_bits() returns zero on success and negative value on error,
      so it is needed to invert the condition in ip_check_defrag().
      
      Fixes: 1bf3751e ("ipv4: ip_check_defrag must not modify skb before unsharing")
      Signed-off-by: default avatarAlexander Drozdov <al.drozdov@gmail.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ccd26c9
    • Ignacy Gawędzki's avatar
      gen_stats.c: Duplicate xstats buffer for later use · e67e4522
      Ignacy Gawędzki authored
      [ Upstream commit 1c4cff0c ]
      
      The gnet_stats_copy_app() function gets called, more often than not, with its
      second argument a pointer to an automatic variable in the caller's stack.
      Therefore, to avoid copying garbage afterwards when calling
      gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
      memory that gets freed after use.
      
      [xiyou.wangcong@gmail.com: remove a useless kfree()]
      Signed-off-by: default avatarIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e67e4522
    • WANG Cong's avatar
      rtnetlink: call ->dellink on failure when ->newlink exists · 2f3b6173
      WANG Cong authored
      [ Upstream commit 7afb8886 ]
      
      Ignacy reported that when eth0 is down and add a vlan device
      on top of it like:
      
        ip link add link eth0 name eth0.1 up type vlan id 1
      
      We will get a refcount leak:
      
        unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2
      
      The problem is when rtnl_configure_link() fails in rtnl_newlink(),
      we simply call unregister_device(), but for stacked device like vlan,
      we almost do nothing when we unregister the upper device, more work
      is done when we unregister the lower device, so call its ->dellink().
      Reported-by: default avatarIgnacy Gawedzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f3b6173
    • Martin KaFai Lau's avatar
      ipv6: fix ipv6_cow_metrics for non DST_HOST case · 29ec7670
      Martin KaFai Lau authored
      [ Upstream commit 3b471175 ]
      
      ipv6_cow_metrics() currently assumes only DST_HOST routes require
      dynamic metrics allocation from inetpeer.  The assumption breaks
      when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric.
      Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
      is called after the route is created.
      
      This patch creates the metrics array (by calling dst_cow_metrics_generic) in
      ipv6_cow_metrics().
      
      Test:
      radvd.conf:
      interface qemubr0
      {
      	AdvLinkMTU 1300;
      	AdvCurHopLimit 30;
      
      	prefix fd00:face:face:face::/64
      	{
      		AdvOnLink on;
      		AdvAutonomous on;
      		AdvRouterAddr off;
      	};
      };
      
      Before:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec
      fe80::/64 dev eth0  proto kernel  metric 256
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec
      
      After:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec mtu 1300
      fe80::/64 dev eth0  proto kernel  metric 256  mtu 1300
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec mtu 1300 hoplimit 30
      
      Fixes: 8e2ec639 (ipv6: don't use inetpeer to store metrics for routes.)
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29ec7670
    • Eric Dumazet's avatar
      tcp: make sure skb is not shared before using skb_get() · b73f6e47
      Eric Dumazet authored
      [ Upstream commit ba34e6d9 ]
      
      IPv6 can keep a copy of SYN message using skb_get() in
      tcp_v6_conn_request() so that caller wont free the skb when calling
      kfree_skb() later.
      
      Therefore TCP fast open has to clone the skb it is queuing in
      child->sk_receive_queue, as all skbs consumed from receive_queue are
      freed using __kfree_skb() (ie assuming skb->users == 1)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Fixes: 5b7ed089 ("tcp: move fastopen functions to tcp_fastopen.c")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b73f6e47
    • Vlad Yasevich's avatar
      ipv6: Make __ipv6_select_ident static · 9519ba74
      Vlad Yasevich authored
      [ Upstream commit 8381eacf ]
      
      Make __ipv6_select_ident() static as it isn't used outside
      the file.
      
      Fixes: 0508c07f (ipv6: Select fragment id during UFO segmentation if not set.)
      Signed-off-by: default avatarVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9519ba74
    • Vlad Yasevich's avatar
      ipv6: Fix fragment id assignment on LE arches. · aee61469
      Vlad Yasevich authored
      [ Upstream commit 51f30770 ]
      
      Recent commit:
      0508c07f
      Author: Vlad Yasevich <vyasevich@gmail.com>
      Date:   Tue Feb 3 16:36:15 2015 -0500
      
          ipv6: Select fragment id during UFO segmentation if not set.
      
      Introduced a bug on LE in how ipv6 fragment id is assigned.
      This was cought by nightly sparce check:
      
      Resolve the following sparce error:
       net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment
       (different base types)
         net/ipv6/output_core.c:57:38:    expected restricted __be32
      [usertype] ip6_frag_id
         net/ipv6/output_core.c:57:38:    got unsigned int [unsigned]
      [assigned] [usertype] id
      
      Fixes: 0508c07f (ipv6: Select fragment id during UFO segmentation if not set.)
      Signed-off-by: default avatarVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aee61469
    • Daniel Borkmann's avatar
      rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY · 4f630d42
      Daniel Borkmann authored
      [ Upstream commit 364d5716 ]
      
      ifla_vf_policy[] is wrong in advertising its individual member types as
      NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
      len member as *max* attribute length [0, len].
      
      The issue is that when do_setvfinfo() is being called to set up a VF
      through ndo handler, we could set corrupted data if the attribute length
      is less than the size of the related structure itself.
      
      The intent is exactly the opposite, namely to make sure to pass at least
      data of minimum size of len.
      
      Fixes: ebc08a6f ("rtnetlink: Add VF config code to rtnetlink")
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f630d42
    • Sabrina Dubroca's avatar
      pktgen: fix UDP checksum computation · 18bb5d30
      Sabrina Dubroca authored
      [ Upstream commit 7744b5f3 ]
      
      This patch fixes two issues in UDP checksum computation in pktgen.
      
      First, the pseudo-header uses the source and destination IP
      addresses. Currently, the ports are used for IPv4.
      
      Second, the UDP checksum covers both header and data.  So we need to
      generate the data earlier (move pktgen_finalize_skb up), and compute
      the checksum for UDP header + data.
      
      Fixes: c26bf4a5 ("pktgen: Add UDPCSUM flag to support UDP checksums")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18bb5d30
    • Daniel Borkmann's avatar
      ipv6: addrconf: add missing validate_link_af handler · 5484abd9
      Daniel Borkmann authored
      [ Upstream commit 11b1f828 ]
      
      We still need a validate_link_af() handler with an appropriate nla policy,
      similarly as we have in IPv4 case, otherwise size validations are not being
      done properly in that case.
      
      Fixes: f53adae4 ("net: ipv6: add tokenized interface identifier support")
      Fixes: bc91b0f0 ("ipv6: addrconf: implement address generation modes")
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5484abd9
    • Miroslav Urbanek's avatar
      flowcache: Fix kernel panic in flow_cache_flush_task · 3ec37c5e
      Miroslav Urbanek authored
      [ Upstream commit 233c96fc ]
      
      flow_cache_flush_task references a structure member flow_cache_gc_work
      where it should reference flow_cache_flush_task instead.
      
      Kernel panic occurs on kernels using IPsec during XFRM garbage
      collection. The garbage collection interval can be shortened using the
      following sysctl settings:
      
      net.ipv4.xfrm4_gc_thresh=4
      net.ipv6.xfrm6_gc_thresh=4
      
      With the default settings, our productions servers crash approximately
      once a week. With the settings above, they crash immediately.
      
      Fixes: ca925cf1 ("flowcache: Make flow cache name space aware")
      Reported-by: default avatarTomáš Charvát <tc@excello.cz>
      Tested-by: default avatarJan Hejl <jh@excello.cz>
      Signed-off-by: default avatarMiroslav Urbanek <mu@miroslavurbanek.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ec37c5e
  2. 06 Mar, 2015 5 commits