1. 25 Mar, 2015 17 commits
    • mm: numa: group related processes based on VMA flags instead of page table flags · bea66fbd
      Mel Gorman authored
      These are three follow-on patches based on the xfsrepair workload Dave
      Chinner reported was problematic in 4.0-rc1 due to changes in page table
      management -- https://lkml.org/lkml/2015/3/1/226.
      
      Much of the problem was reduced by commit 53da3bc2 ("mm: fix up numa
      read-only thread grouping logic") and commit ba68bc01 ("mm: thp:
      Return the correct value for change_huge_pmd").  It was known that the
      performance in 3.19 was still better even if it is far less safe.  This
      series aims to restore the performance without compromising on safety.
      
      For the rest of this mail, I'm comparing 3.19 against 4.0-rc4 and the
      three patches applied on top.
      
        autonumabench
                                                      3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                                     vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
        Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
        Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
        Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
        Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
        Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
        Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
        Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)
      
                      3.19.0     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
                     vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
        User        46334.60      46391.94      44383.95      43971.89      44372.12
        System        252.84        284.66        252.61        201.24        249.00
        Elapsed      1062.14       1050.96        998.68       1000.94       1026.78
      
      Overall the system CPU usage is comparable and the test is naturally a
      bit variable.  The slowing of the scanner hurts numa01 but on this
      machine it is an adverse workload and patches that dramatically help it
      often hurt absolutely everything else.
      
      Due to patch 2, the fault activity is interesting
      
                                        3.19.0     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
                                       vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
        Minor Faults                   2097811       2656646       2597249       1981230       1636841
        Major Faults                       362           450           365           364           365
      
      Note that preserving the write bit across protection updates and
      faults reduces the number of minor faults.
      
        NUMA alloc hit                 1229008       1217015       1191660       1178322       1199681
        NUMA alloc miss                      0             0             0             0             0
        NUMA interleave hit                  0             0             0             0             0
        NUMA alloc local               1228514       1216317       1190871       1177448       1199021
        NUMA base PTE updates        245706197     240041607     238195516     244704842     115012800
        NUMA huge PMD updates           479530        468448        464868        477573        224487
        NUMA page range updates      491225557     479886983     476207932     489222218     229950144
        NUMA hint faults                659753        656503        641678        656926        294842
        NUMA hint local faults          381604        373963        360478        337585        186249
        NUMA hint local percent             57            56            56            51            63
        NUMA pages migrated            5412140       6374899       6266530       5277468       5755096
        AutoNUMA cost                    5121%         5083%         4994%         5097%         2388%
      
      Here the impact of slowing the PTE scanner on migration failures is
      obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
      massively reduced even though the headline performance is very similar.
      
      As xfsrepair was the reported workload, here is the impact of the series
      on it.
      
        xfsrepair
                                               3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                              vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
        Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
        Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
        Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
        Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
        Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
        Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
        Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
        Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
        Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
        Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
        Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
        CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
        CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
        CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
        CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
        Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
        Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
        Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
        Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)
      
      The really relevant lines are syst-xfsrepair, which is the system CPU
      usage when running xfsrepair.  Note that on my machine the overhead was
      45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
      preserve the write bit across faults, it's only 2.51% higher on average.
      With the full series applied, system CPU usage is 24.6% lower on
      average.
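
      The bracketed percentages in the tables can be reproduced with a small
      helper.  This is a sketch that assumes each figure is the change
      relative to the 3.19 baseline, with negative values meaning more time or
      overhead; the function name is hypothetical and is not taken from the
      benchmark tooling.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage gain of `value` over `baseline`.
 * Positive means lower (better) than the baseline, negative means higher.
 */
static double gain_percent_sketch(double baseline, double value)
{
    assert(baseline != 0.0);
    return (baseline - value) / baseline * 100.0;
}
```

      With the Amean syst-xfsrepair row, a baseline of 199.66 and a value of
      290.60 give roughly -45.55, and a value of 150.55 gives roughly 24.60,
      matching the figures quoted above.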
      
      Again, the impact of preserving the write bit on minor faults is obvious
      and the impact of slowing scanning after migration failures is obvious
      on the PTE updates.  Note also that the number of pages migrated is much
      reduced even though the headline performance is comparable.
      
                                        3.19.0     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
                                       vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
        Minor Faults                 153466827     254507978     249163829     153501373     105737890
        Major Faults                       610           702           690           649           724
        NUMA base PTE updates        217735049     210756527     217729596     216937111     144344993
        NUMA huge PMD updates           129294         85044        106921        127246         79887
        NUMA pages migrated           21938995      29705270      28594162      22687324      16258075
      
                              3.19.0     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
                             vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
        Mean sdb-avgqusz       13.47          2.54          2.55          2.47          2.49
        Mean sdb-avgrqsz      202.32        140.22        139.50        139.02        138.12
        Mean sdb-await         25.92          5.09          5.33          5.02          5.22
        Mean sdb-r_await        4.71          0.19          0.83          0.51          0.11
        Mean sdb-w_await      104.13          5.21          5.38          5.05          5.32
        Mean sdb-svctm          0.59          0.13          0.14          0.13          0.14
        Mean sdb-rrqm           0.16          0.00          0.00          0.00          0.00
        Mean sdb-wrqm           3.59       1799.43       1826.84       1812.21       1785.67
        Max  sdb-avgqusz      111.06         12.13         14.05         11.66         15.60
        Max  sdb-avgrqsz      255.60        190.34        190.01        187.33        191.78
        Max  sdb-await        168.24         39.28         49.22         44.64         65.62
        Max  sdb-r_await      660.00         52.00        280.00         76.00         12.00
        Max  sdb-w_await     7804.00         39.28         49.22         44.64         65.62
        Max  sdb-svctm          4.00          2.82          2.86          1.98          2.84
        Max  sdb-rrqm           8.30          0.00          0.00          0.00          0.00
        Max  sdb-wrqm          34.20       5372.80       5278.60       5386.60       5546.15
      
      FWIW, I also checked SPECjbb in different configurations but the
      observations are similar -- minor faults lower, PTE update activity lower
      and performance roughly comparable against 3.19.
      
      This patch (of 3):
      
      Threads that share writable data within pages are grouped together as
      related tasks.  This decision is based on whether the PTE is marked
      dirty which is subject to timing races between the PTE scanner update
      and when the application writes the page.  If the page is file-backed,
      then background flushes and sync also affect placement.  This is
      unpredictable behaviour which is impossible to reason about so this
      patch makes grouping decisions based on the VMA flags.
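
      The idea can be sketched in user-space C.  This is a minimal
      illustration, not the kernel code: the structure, the function and both
      flag values are hypothetical stand-ins, and the only point is that the
      decision depends on the mapping's permissions rather than on the state
      of an individual PTE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; the kernel defines its own. */
#define VM_WRITE_SKETCH     0x2u   /* mapping permits writes */
#define TNF_NO_GROUP_SKETCH 0x1u   /* do not use this fault for task grouping */

struct vma_flags_sketch {
    unsigned int vm_flags;
};

/*
 * Decide the grouping hint from the mapping's permissions only.  The PTE
 * dirty bit is deliberately not consulted, so the result does not depend
 * on when the scanner ran or whether writeback cleaned the page.
 */
static unsigned int numa_fault_flags_sketch(const struct vma_flags_sketch *vma)
{
    unsigned int flags = 0;

    assert(vma != NULL);
    if (!(vma->vm_flags & VM_WRITE_SKETCH))
        flags |= TNF_NO_GROUP_SKETCH;
    return flags;
}
```

      A fault in a read-only mapping gets the no-group hint in this sketch,
      while a fault in a writable mapping is eligible for grouping regardless
      of whether the page has been written yet.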
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hfsplus: fix B-tree corruption after insertion at position 0 · 98cf21c6
      Sergei Antonov authored
      Fix B-tree corruption when a new record is inserted at position 0 in the
      node in hfs_brec_insert().  In this case hfs_brec_update_parent() is
      called to update the parent index node (if it exists), and it is passed
      an hfs_find_data with a search_key containing the newly inserted key
      instead of the key to be updated.  This results in an inconsistent index
      node.  The bug reproduces on my machine after an extents overflow record
      for the catalog file (CNID=4) is inserted into the extents overflow
      B-tree.  Because of the low (reserved) value of CNID=4, it has to become
      the first record in the first leaf node.
      
      The resulting first leaf node is correct:
      
        ----------------------------------------------------
        | key0.CNID=4 | key1.CNID=123 | key2.CNID=456, ... |
        ----------------------------------------------------
      
      But the parent index key0 still contains the previous key CNID=123:
      
        -----------------------
        | key0.CNID=123 | ... |
        -----------------------
      
      A change in hfs_brec_insert() makes hfs_brec_update_parent() work
      correctly by preventing it from getting fd->record=-1 value from
      __hfs_brec_find().
      
      Along the way, I removed duplicate code by unifying the if
      condition.  The resulting code is equivalent to the original code
      because node is never 0.
      
      Also hfs_brec_update_parent() will now return an error after getting a
      negative fd->record value.  However, the return value of
      hfs_brec_update_parent() is not checked anywhere in the file and I'm
      leaving it unchanged by this patch.  brec.c lacks error checking after
      some other calls too, but this issue is of less importance than the one
      being fixed by this patch.
      Signed-off-by: Sergei Antonov <saproj@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
      Cc: Anton Altaparmakov <aia21@cam.ac.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: add Jan as DMI/SMBIOS support maintainer · 1f31e1b1
      Jean Delvare authored
      I am familiar with these drivers and I care about them so let me add
      myself as their maintainer.
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Acked-by: Matt Fleming <matt.fleming@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/affs/file.c: unlock/release page on error · 3d5d472c
      Taesoo Kim authored
      When affs_bread_ino() fails, correctly unlock the page and release the
      page-cache reference, returning a proper error value.  Every write_end()
      should unlock/release the page that was locked by write_begin().
      Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate · cfa86943
      Laura Abbott authored
      Commit 3c605096 ("mm/page_alloc: restrict max order of merging on
      isolated pageblock") changed the logic of unset_migratetype_isolate to
      check the buddy allocator and explicitly call __free_pages to merge.
      
      The page that is being freed in this path never had prep_new_page called
      so set_page_refcounted is called explicitly but there is no call to
      kernel_map_pages.  With the default kernel_map_pages this is mostly
      harmless but if kernel_map_pages does any manipulation of the page
      tables (unmapping or setting pages to read only) this may trigger a
      fault:
      
          alloc_contig_range test_pages_isolated(ceb00, ced00) failed
          Unable to handle kernel paging request at virtual address ffffffc0cec00000
          pgd = ffffffc045fc4000
          [ffffffc0cec00000] *pgd=0000000000000000
          Internal error: Oops: 9600004f [#1] PREEMPT SMP
          Modules linked in: exfatfs
          CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
          task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
          PC is at memset+0xc8/0x1c0
          LR is at kernel_map_pages+0x1ec/0x244
      
      Fix this by calling kernel_map_pages to ensure the page is set in the
      page table properly.
      
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: fix lockups on PREEMPT && !SMP kernels · 859b7a0e
      Mark Rutland authored
      Commit 9aabf810 ("mm/slub: optimize alloc/free fastpath by removing
      preemption on/off") introduced an occasional hang for kernels built with
      CONFIG_PREEMPT && !CONFIG_SMP.
      
      The problem is the following loop the patch introduced to
      slab_alloc_node and slab_free:
      
          do {
              tid = this_cpu_read(s->cpu_slab->tid);
              c = raw_cpu_ptr(s->cpu_slab);
          } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
      
      GCC 4.9 has been observed to hoist the load of c and c->tid above the
      loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
      constant and does not force a reload).  On arm64 the generated assembly
      looks like:
      
               ldr     x4, [x0,#8]
        loop:
               ldr     x1, [x0,#8]
               cmp     x1, x4
               b.ne    loop
      
      If the thread is preempted between the load of c->tid (into x4) and tid
      (into x1), and an allocation or free occurs in another thread (bumping
      the cpu_slab's tid), the thread will be stuck in the loop until
      s->cpu_slab->tid wraps, which may be forever in the absence of
      allocations/frees on the same CPU.
      
      This patch changes the loop condition to access c->tid with READ_ONCE.
      This ensures that the value is reloaded even when the compiler would
      otherwise assume it could cache the value, and also ensures that the
      load will not be torn.
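
      The effect of the change can be sketched in user-space C.  This is an
      illustration only: the macro is a simplified stand-in for the kernel's
      READ_ONCE(), the structure and function names are hypothetical, and the
      per-CPU accessors and the CONFIG_PREEMPT check are left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for READ_ONCE() on an unsigned long: the volatile
 * access forces the compiler to emit a fresh load every time the
 * expression is evaluated, so the value cannot be hoisted out of a loop.
 */
#define READ_ONCE_UL_SKETCH(x) (*(const volatile unsigned long *)&(x))

struct cpu_slab_sketch {
    unsigned long tid;   /* transaction id, bumped on every alloc/free */
};

/*
 * Hypothetical helper: take a snapshot of the tid and retry until a
 * second, forced load observes the same value.
 */
static unsigned long snapshot_tid_sketch(struct cpu_slab_sketch *c)
{
    unsigned long tid;

    assert(c != NULL);
    do {
        tid = c->tid;
    } while (tid != READ_ONCE_UL_SKETCH(c->tid));

    return tid;
}
```

      Because the comparison reloads c->tid on every iteration, a stale copy
      held in a register cannot keep the loop spinning once the two values
      agree in memory.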
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory hotplug: postpone the reset of obsolete pgdat · b0dc3a34
      Gu Zheng authored
      Qiu Xishi reported the following BUG when testing node hot-add/hot-remove
      under stress conditions:
      
        BUG: unable to handle kernel paging request at 0000000000025f60
        IP: next_online_pgdat+0x1/0x50
        PGD 0
        Oops: 0000 [#1] SMP
        ACPI: Device does not support D3cold
        Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
        CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
        Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
        Workqueue: events vmstat_update
        task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
        RIP: 0010: next_online_pgdat+0x1/0x50
        RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
        RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
        RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
        RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
        R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
        R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
        FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          refresh_cpu_vm_stats+0xd0/0x140
          vmstat_update+0x11/0x50
          process_one_work+0x194/0x3d0
          worker_thread+0x12b/0x410
          kthread+0xc6/0xd0
          ret_from_fork+0x7c/0xb0
      
      The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
      try_offline_node(), which resets the whole pgdat to 0.  Because the
      pgdat is accessed lock-free, users that are still using it, such as the
      vmstat_update routine, will panic.
      
      process A:				offline node XX:
      
      vmstat_update()
         refresh_cpu_vm_stats()
           for_each_populated_zone()
             find online node XX
           cond_resched()
      					offline cpu and memory, then try_offline_node()
      					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
             zone = next_zone(zone)
               pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
                 next_online_pgdat(pgdat)
                   next_online_node(pgdat->node_id);  // NULL pointer access
      
      So the solution here is to postpone the reset of the obsolete pgdat from
      try_offline_node() to hotadd_new_pgdat(), and to reset only
      pgdat->nr_zones and pgdat->classzone_idx to 0 rather than memset the
      whole structure, which avoids breaking pointer information in pgdat.
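
      The difference between the two kinds of reset can be sketched in
      user-space C.  The structures below are hypothetical, heavily reduced
      stand-ins for the kernel's pgdat and zone, and the function name is
      made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct pgdat_sketch;

struct zone_sketch {
    /* Back-pointer that lock-free readers may still follow. */
    struct pgdat_sketch *zone_pgdat;
};

struct pgdat_sketch {
    struct zone_sketch zones[2];
    int nr_zones;
    int classzone_idx;
    int node_id;
};

/*
 * Reset only the two counters named in the commit message.  Unlike a
 * memset() of the whole structure, this leaves the embedded back-pointers
 * and the node id intact for readers that still hold a reference.
 */
static void reset_obsolete_pgdat_sketch(struct pgdat_sketch *pgdat)
{
    assert(pgdat != NULL);
    pgdat->nr_zones = 0;
    pgdat->classzone_idx = 0;
}
```

      After the call in this sketch, zones[i].zone_pgdat still points at the
      pgdat, so a reader walking from a zone back to its node does not hit a
      zeroed pointer.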
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Xishi Qiu <qiuxishi@huawei.com>
      Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: correct rtc armada38x pattern entry · 59ec9671
      Joe Perches authored
      Commit c6a95dbe ("MAINTAINERS: add the RTC driver for the
      Armada38x") typoed the pattern, fix it.
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers · f6837395
      Naoya Horiguchi authored
      walk_page_test() is purely pagewalk's internal stuff, and its positive
      return values are not intended to be passed to the callers of pagewalk.
      
      However, in the current code, if the last vma in the do-while loop in
      walk_page_range() happens to return a positive value, it leaks outside
      walk_page_range().  So the user-visible effect is an invalid/unexpected
      return value (according to the reporter, mbind() triggers it).
      
      This patch fixes it simply by reinitializing the return value after it
      has been checked.
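
      The pattern can be sketched in user-space C.  This is a simplified
      model, not the pagewalk code: the structure and both functions are
      hypothetical, and the internal test result is simply stored per vma.

```c
#include <assert.h>
#include <stddef.h>

struct walk_vma_sketch {
    int test_result;   /* what the internal test reports for this vma */
};

/* Internal test: > 0 means "skip this vma", < 0 is an error, 0 is "walk". */
static int walk_test_sketch(const struct walk_vma_sketch *vma)
{
    return vma->test_result;
}

static int walk_range_sketch(const struct walk_vma_sketch *vmas, size_t n)
{
    int err = 0;
    size_t i;

    assert(vmas != NULL || n == 0);
    for (i = 0; i < n; i++) {
        err = walk_test_sketch(&vmas[i]);
        if (err > 0) {
            /*
             * The positive value is internal only: reset it right after
             * the check so it cannot leak out if this is the last vma.
             */
            err = 0;
            continue;
        }
        if (err < 0)
            break;
        /* err == 0: a real walker would process the vma here. */
    }
    return err;
}
```

      With the reset in place, a range whose last vma is skipped returns 0 in
      this sketch, while a genuine negative error is still propagated.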
      
      Another exposed interface, walk_page_vma(), already returns 0 for such
      cases, so it does not have this problem.
      
      Fixes: fafaa426 ("pagewalk: improve vma handling")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
      Reported-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix anon_vma->degree underflow in anon_vma endless growing prevention · 3fe89b3e
      Leon Yu authored
      I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
      upgrading to 3.19 and had no luck with 4.0-rc1 either.
      
      So, after looking into new logic introduced by commit 7a3ef208 ("mm:
      prevent endless growth of anon_vma hierarchy"), I found chances are that
      unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
      in anon_vma_clone() due to allocation failure.  If dst->anon_vma is not
      NULL in error path, its degree will be incorrectly decremented in
      unlink_anon_vmas() and eventually underflow when exiting as a result of
      another call to unlink_anon_vmas().  That's how "kernel BUG at
      mm/rmap.c:399!" is triggered for me.
      
      This patch fixes the underflow by dropping dst->anon_vma when allocation
      fails.  It's safe to do so regardless of the original value of
      dst->anon_vma because dst->anon_vma doesn't have a valid meaning if
      anon_vma_clone() fails.  Besides, callers don't care about dst->anon_vma
      in such a case either.
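
      The accounting problem and the fix can be sketched in user-space C.
      This is a reduced model under stated assumptions: the structures and
      function names are hypothetical, the allocation failure is simulated
      with a flag, and the success path is simplified to a single increment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct anon_vma_sketch {
    int degree;   /* count that must never drop below zero */
};

struct rmap_vma_sketch {
    struct anon_vma_sketch *anon_vma;
};

/* Model of the unlink step: drops one degree if an anon_vma is attached. */
static void unlink_anon_vmas_sketch(struct rmap_vma_sketch *vma)
{
    assert(vma != NULL);
    if (vma->anon_vma)
        vma->anon_vma->degree--;
}

/* Model of the clone step; `alloc_ok` simulates the allocation result. */
static int anon_vma_clone_sketch(struct rmap_vma_sketch *dst, bool alloc_ok)
{
    assert(dst != NULL);
    if (!alloc_ok) {
        /*
         * The degree was never incremented on behalf of dst, so detach
         * the anon_vma before unlinking to avoid an unpaired decrement.
         */
        dst->anon_vma = NULL;
        unlink_anon_vmas_sketch(dst);
        return -ENOMEM;
    }
    if (dst->anon_vma)
        dst->anon_vma->degree++;
    return 0;
}
```

      In this sketch a failed clone leaves the degree untouched, whereas
      without the assignment to NULL the error path would decrement a count
      that had never been incremented.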
      
      Also, as suggested by Michal Hocko, we can clean up vma_adjust() a bit
      as anon_vma_clone() now does the work.
      
      [akpm@linux-foundation.org: tweak comment]
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Signed-off-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/rtc/rtc-mrst: fix suspend/resume · ddd2a30d
      Lars-Peter Clausen authored
      The Moorestown RTC driver implements suspend and resume callbacks and
      assigns them to the suspend and resume fields of the device_driver
      struct.  These callbacks are never actually called by anything though.
      
      Modify the driver to properly use dev_pm_ops so that the suspend and
      resume functions are actually executed upon suspend/resume.
      
      [akpm@linux-foundation.org: device_driver.name is const char *]
      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aoe: update aoe maintainer information · fb903811
      Ed Cashin authored
      The coraid.com email address is defunct.  The old aoe support area hosted
      at coraid.com is no longer up.  These changes update the email and website
      to current ones.
      Signed-off-by: Ed Cashin <ed.cashin@acm.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · c875f421
      Linus Torvalds authored
      Pull two arm64 fixes from Catalin Marinas:
      
       - switch_mm() fix where init_mm.pgd ends up in the user TTBR0;
         swapper_pg_dir is not suitable for user mappings
      
       - this_cpu accessors fix for preemption safety
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: percpu: Make this_cpu accessors pre-empt safe
        arm64: Use the reserved TTBR0 if context switching to the init_mm
    • Merge tag 'powerpc-4.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux · a55feeb1
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix the MCE code to use CONFIG_KVM_BOOK3S_64_HANDLER
      
       - Little endian fixes for post mobility device tree update
      
       - Add PVR for POWER8NVL processor
      
       - Fixes for hypervisor doorbell handling
      
      * tag 'powerpc-4.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
        powerpc/book3s: Fix the MCE code to use CONFIG_KVM_BOOK3S_64_HANDLER
        powerpc/pseries: Little endian fixes for post mobility device tree update
        powerpc: Add PVR for POWER8NVL processor
        powerpc/powernv: Fixes for hypervisor doorbell handling
    • Merge git://git.kernel.org/pub/scm/virt/kvm/kvm · 0d33cd0a
      Linus Torvalds authored
      Pull kvm fixes from Marcelo Tosatti:
       "Fix for higher-order page allocation failures, fix Xen-on-KVM with
        x2apic, L1 crash with unrestricted guest mode (nested VMX)"
      
      * git://git.kernel.org/pub/scm/virt/kvm/kvm:
        kvm: avoid page allocation failure in kvm_set_memory_region()
        KVM: x86: call irq notifiers with directed EOI
        KVM: nVMX: mask unrestricted_guest if disabled on L0
      0d33cd0a
    • Linus Torvalds's avatar
      Merge branch 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · 1401b7c3
      Linus Torvalds authored
      Pull libata fix from Tejun Heo:
       "One patch to fix a regression from the recent switch to blk-mq tag
        allocation which can cause oops on SAS-attached SATA drives"
      
      * 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        ata: Add a new flag to destinguish sas controller
      1401b7c3
    • Linus Torvalds's avatar
      Merge tag 'mfd-fixes-4.0' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · 5cf955e0
      Linus Torvalds authored
      Pull MFD fixes from Lee Jones:
       - Use DMA'able addresses for DMA; rtsx_usb
       - Use return value in the correct way; kempld-core
      
      * tag 'mfd-fixes-4.0' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
        mfd: kempld-core: Fix callback return value check
        mfd: rtsx_usb: Prevent DMA from stack
      5cf955e0
  2. 24 Mar, 2015 6 commits
  3. 23 Mar, 2015 7 commits
    • Radim Krčmář's avatar
      KVM: x86: call irq notifiers with directed EOI · c806a6ad
      Radim Krčmář authored
      kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled.
      We need to do that for irq notifiers.  (Like with edge interrupts.)
      
      Fix it by skipping EOI broadcast only.
      
      Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Tested-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      c806a6ad
    • Catalin Marinas's avatar
      arm64: Use the reserved TTBR0 if context switching to the init_mm · e53f21bc
      Catalin Marinas authored
      The idle_task_exit() function may call switch_mm() with next ==
      &init_mm. On arm64, init_mm.pgd cannot be used for user mappings, so
      this patch simply sets the reserved TTBR0.
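
The special case described above can be pictured with a small user-space mock. This is only a sketch under stated assumptions: `struct mm`, `init_mm_mock`, `ttbr0_mock`, `RESERVED_TTBR0` and `switch_mm_sketch()` are hypothetical stand-ins invented for illustration, not the arm64 kernel's types or code.

```c
/* Hypothetical mock of an address space: only the page table root. */
struct mm {
    unsigned long pgd;
};

/* Stand-in for init_mm; its pgd must never be used for user mappings. */
struct mm init_mm_mock = { 0x1000UL };

/* Assumed placeholder value for the reserved (empty) user page table. */
#define RESERVED_TTBR0 0xdead000UL

/* Stand-in for the user translation table base register. */
unsigned long ttbr0_mock;

void switch_mm_sketch(struct mm *prev, struct mm *next)
{
    (void)prev;

    /* Switching to init_mm: install the reserved table instead of its pgd. */
    if (next == &init_mm_mock) {
        ttbr0_mock = RESERVED_TTBR0;
        return;
    }

    /* Ordinary case: install the next task's page table root. */
    ttbr0_mock = next->pgd;
}
```

The point of the early return is that the init_mm page table root never reaches the user register, whichever path leads to the switch.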
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Jon Medhurst (Tixy) <tixy@linaro.org>
      Tested-by: Jon Medhurst (Tixy) <tixy@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      e53f21bc
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 90a5a895
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Validate iov ranges before feeding them into iov_iter_init(), from
          Al Viro.
      
       2) We changed copy_msghdr_from_user() to zero out the msg_namelen
          if a NULL pointer is given for the msg_name.  Do the same in the
          compat code too.  From Catalin Marinas.
      
       3) Fix partially initialized tuples in netfilter conntrack helper, from
          Ian Wilson.
      
       4) Missing continue; statement in nft_hash walker can lead to crashes,
          from Herbert Xu.
      
       5) tproxy_tg6_check looks for IP6T_INV_PROTO in ->flags instead of
          ->invflags, fix from Pablo Neira Ayuso.
      
       6) Incorrect memory accounting of TCP FINs can result in negative
          socket memory accounting values.  Fix from Josh Hunt.
      
       7) Don't allow virtual functions to enable VLAN promiscuous mode in
          be2net driver, from Vasundhara Volam.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        netfilter: nft_compat: set IP6T_F_PROTO flag if protocol is set
        cx82310_eth: wait for firmware to become ready
        net: validate the range we feed to iov_iter_init() in sys_sendto/sys_recvfrom
        net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour
        be2net: use PCI MMIO read instead of config read for errors
        be2net: restrict MODIFY_EQ_DELAY cmd to a max of 8 EQs
        be2net: Prevent VFs from enabling VLAN promiscuous mode
        tcp: fix tcp fin memory accounting
        ipv6: fix backtracking for throw routes
        net: ethernet: pcnet32: Setup the SRAM and NOUFLO on Am79C97{3, 5}
        ipv6: call ipv6_proxy_select_ident instead of ipv6_select_ident in udp6_ufo_fragment
        netfilter: xt_TPROXY: fix invflags check in tproxy_tg6_check()
        netfilter: restore rule tracing via nfnetlink_log
        netfilter: nf_tables: allow to change chain policy without hook if it exists
        netfilter: Fix potential crash in nft_hash walker
        netfilter: Zero the tuple in nfnl_cthelper_parse_tuple()
      90a5a895
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · d5049617
      Linus Torvalds authored
      Pull sparc fixes from David Miller:
       "Some perf bug fixes from David Ahern, and the fix for that nasty
        memmove() bug"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: Fix several bugs in memmove().
        sparc: Touch NMI watchdog when walking cpus and calling printk
        sparc: perf: Add support M7 processor
        sparc: perf: Make counting mode actually work
        sparc: perf: Remove redundant perf_pmu_{en|dis}able calls
      d5049617
    • David S. Miller's avatar
      sparc64: Fix several bugs in memmove(). · 2077cef4
      David S. Miller authored
      Firstly, handle zero length calls properly.  Believe it or not there
      are a few of these happening during early boot.
      
      Next, we can't just drop to a memcpy() call in the forward copy case
      where dst <= src.  The reason is that the cache initializing stores
      used in the Niagara memcpy() implementations can end up clearing out
      cache lines before we've sourced their original contents completely.
      
      For example, considering NG4memcpy, the main unrolled loop begins like
      this:
      
           load   src + 0x00
           load   src + 0x08
           load   src + 0x10
           load   src + 0x18
           load   src + 0x20
           store  dst + 0x00
      
      Assume dst is 64 byte aligned and let's say that dst is src - 8 for
      this memcpy() call.  The store at the end is to the first word of
      the cache line, which clears the whole line and thus clobbers
      "src + 0x28" before it even gets loaded.
      
      To avoid this, just fall through to a simple copy only mildly
      optimized for the case where src and dst are 8 byte aligned and the
      length is a multiple of 8 as well.  We could get fancy and call
      GENmemcpy() but this is good enough for how this thing is actually
      used.
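
The approach described above can be sketched in portable C. This is a minimal illustration, not the sparc64 assembly from the patch: the name `sketch_memmove` is hypothetical, and treating "src, dst and length are all multiples of 8" as the only fast path is an assumption taken from the description.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a portable sketch of the strategy, not the real code. */
void *sketch_memmove(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Zero-length calls must be handled: nothing to copy. */
    if (len == 0)
        return dst;

    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Forward copy: each unit is fully loaded before it is stored. */
        if ((((uintptr_t)d | (uintptr_t)s | len) & 7) == 0) {
            /* Mildly optimized case: 8-byte aligned, length multiple of 8. */
            for (size_t i = 0; i < len; i += 8) {
                uint64_t word;
                memcpy(&word, s + i, sizeof(word)); /* load  */
                memcpy(d + i, &word, sizeof(word)); /* store */
            }
        } else {
            for (size_t i = 0; i < len; i++)
                d[i] = s[i];
        }
    } else {
        /* dst > src: copy backward so overlapping bytes are read first. */
        for (size_t i = len; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

Because every load completes before the matching store, a call such as `sketch_memmove(buf, buf + 8, 48)` never overwrites source bytes it has not yet read, which is the property the cache-initializing stores broke.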
      Reported-by: David Ahern <david.ahern@oracle.com>
      Reported-by: Bob Picco <bpicco@meloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2077cef4
    • Mahesh Salgaonkar's avatar
      powerpc/book3s: Fix the MCE code to use CONFIG_KVM_BOOK3S_64_HANDLER · 44d5f6f5
      Mahesh Salgaonkar authored
      Commit 2ba9f0d8 changed CONFIG_KVM_BOOK3S_64_HV to tristate to allow the
      HV/PR bits to be built as modules. But the MCE code still depends on
      CONFIG_KVM_BOOK3S_64_HV, which is wrong. When the user selects
      CONFIG_KVM_BOOK3S_64_HV=m to build the HV/PR bits as a separate module,
      the relevant MCE code gets excluded.
      
      This patch fixes the MCE code to use CONFIG_KVM_BOOK3S_64_HANDLER. This
      makes sure that the relevant MCE code is included when the HV/PR bits
      are built as separate modules.
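
Why the code drops out can be shown with a few lines of preprocessor logic. A tristate option set to "m" makes Kconfig define CONFIG_<name>_MODULE rather than CONFIG_<name>, so a plain #ifdef on the option is false. The sketch below simulates that by hand; the two helper functions are hypothetical, and it assumes CONFIG_KVM_BOOK3S_64_HANDLER is a built-in (bool) option that is set whenever HV or PR support is enabled.

```c
/* Simulated Kconfig output for CONFIG_KVM_BOOK3S_64_HV=m:
 * only the _MODULE variant is defined for a tristate set to "m". */
#define CONFIG_KVM_BOOK3S_64_HV_MODULE 1

/* Assumption: the handler option is a bool and therefore built in. */
#define CONFIG_KVM_BOOK3S_64_HANDLER 1

/* Hypothetical helper: is code guarded by the old symbol compiled in? */
int mce_code_built_with_hv_guard(void)
{
#ifdef CONFIG_KVM_BOOK3S_64_HV
    return 1;
#else
    return 0; /* excluded: the plain symbol is undefined when =m */
#endif
}

/* Hypothetical helper: is code guarded by the new symbol compiled in? */
int mce_code_built_with_handler_guard(void)
{
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
    return 1; /* included regardless of how HV/PR themselves are built */
#else
    return 0;
#endif
}
```

With the old guard the MCE path silently vanishes from the built-in image in a modular configuration, while the new guard keeps it.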
      
      Fixes: 2ba9f0d8 ("kvm: powerpc: book3s: Support building HV and PR KVM as module")
      Cc: stable@vger.kernel.org  # v3.14+
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      44d5f6f5
  4. 22 Mar, 2015 9 commits
  5. 21 Mar, 2015 1 commit
    • Ondrej Zary's avatar
      cx82310_eth: wait for firmware to become ready · f40bff42
      Ondrej Zary authored
      When the device is powered up, some (older) firmware versions fail to work
      properly if we send commands before the boot is complete (everything is OK
      when the device is hot-plugged). The firmware indicates its ready status by
      putting the link up.
      Newer firmwares delay the first command so they don't suffer from this problem.
      They also report the link being always up.
      
      Wait for firmware to become ready (link up) before sending any commands and/or
      data.
      
      This also allows lowering CMD_TIMEOUT value to a reasonable time.
      
      Tested with 4.1.0.9 (old) and 4.1.0.30 (new) firmware versions.
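
The wait described above amounts to polling a readiness indicator with a bounded timeout before issuing any command. The following is a generic sketch of that pattern, not the driver's code: `wait_for_firmware_ready()`, the callback types, and the `fake_device` mock are all hypothetical names introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical callbacks: query link state, and sleep between polls. */
typedef bool (*link_up_fn)(void *ctx);
typedef void (*sleep_ms_fn)(unsigned int ms);

/* Poll until the link is up. Returns 0 when ready, -1 on timeout. */
int wait_for_firmware_ready(link_up_fn link_up, void *ctx,
                            sleep_ms_fn sleep_ms,
                            unsigned int poll_ms, unsigned int timeout_ms)
{
    unsigned int waited = 0;

    if (poll_ms == 0)
        poll_ms = 1; /* guard against a loop that never advances */

    for (;;) {
        if (link_up(ctx))
            return 0;          /* firmware signalled readiness */
        if (waited >= timeout_ms)
            return -1;         /* give up: do not send commands */
        sleep_ms(poll_ms);
        waited += poll_ms;
    }
}

/* Mock device that reports link up only after a number of polls. */
struct fake_device {
    unsigned int polls_until_ready;
    unsigned int polls_seen;
};

bool fake_link_up(void *ctx)
{
    struct fake_device *dev = ctx;

    dev->polls_seen++;
    return dev->polls_seen > dev->polls_until_ready;
}

/* Mock sleep that returns immediately. */
void fake_sleep_ms(unsigned int ms)
{
    (void)ms;
}
```

A device that always reports link up, as the newer firmware does, returns on the first poll, so the wait costs nothing in that case.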
      Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f40bff42