07 Apr, 2014 (40 commits)
    • kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Guillaume Morin authored
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns children
      and wants to monitor them through an epoll-able interface.

      Unfortunately, there is a small window between when the event is
      delivered and when the child becomes wait()-able.

      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.:
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0) {
      		/* We might have raced with exit() */
      	}
      	return;
      }

      /* Child */
      execve(...)
      
      register_interest_for_pid() would tell the connector socket reader to
      pay attention to events related to pid.

      Though this is not a bug, I think it would make the connector a bit
      more usable if this race were closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after it.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: move check_stack_usage() to the end of do_exit() · 4bcb8232
      Oleg Nesterov authored
      It is not clear why check_stack_usage() is called so early; as a
      result, it never checks the stack usage in, say, exit_notify() or
      flush_ptrace_hw_breakpoint() or other functions which are only called
      by do_exit().

      Move the call site down to the last preempt_disable/schedule.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: call disassociate_ctty() before exit_task_namespaces() · c39df5fa
      Oleg Nesterov authored
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
            tty_ldisc_open.isra.2+0x27/0x60
            tty_ldisc_hangup+0x1e3/0x220
            __tty_hangup+0x2c4/0x440
            disassociate_ctty+0x61/0x270
            do_exit+0x7f2/0xa50
      
      ppp_register_channel() needs ->net_ns, but current->nsproxy is already
      NULL at that point.

      Move disassociate_ctty() before exit_task_namespaces(); it doesn't
      make sense to delay it until after perf_event_exit_task() or
      cgroup_exit().

      This also makes it possible to use task_work_add() inside the
      (nontrivial) code paths in disassociate_ctty().

      Investigated by Peter Hurley.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: remove unnecessary parentheses · 5d2d42de
      SeongJae Park authored
      Fix the following trivial checkpatch error:
      
        ERROR: return is not a function, parentheses are not required
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: support multiple swap devices · 60105e12
      Minchan Kim authored
      Cai Liu reported that zbud pool page counting has a problem when
      multiple swap devices are used: it counts only one swap device instead
      of all of them, so zswap cannot control writeback properly.  The
      result is unnecessary writeback, or no writeback when we really should
      write back.

      In other words, it made zswap behave erratically.

      Another problem in zswap:

      For example, assume we use two swap devices, A and B, with different
      priorities.  A was charged to 19% a long time ago; A is now full, so
      the VM starts to use B, which has recently been charged to 1%.  That
      means zswap's charge (19% + 1%) hits the default limit.  Then, if the
      VM wants to swap out more pages to B, zbud_reclaim_page() would evict
      one of the pages in B's pool, and this would repeat continuously.
      That is a complete LRU inversion, and swap thrashing on B would
      follow.

      This patch makes zswap handle multiple swap devices by creating *a
      single* zbud pool which is shared by all of them, so all zswap pages
      across swap devices stay in LRU order, preventing both problems above.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Cai Liu <cai.liu@samsung.com>
      Suggested-by: Weijie Yang <weijie.yang.kh@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: update zsmalloc in comment to zbud · 6335b193
      SeongJae Park authored
      zswap used zsmalloc before and now uses zbud, but some comments still
      say it uses zsmalloc.  Fix these trivial problems.
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: fix trivial typo and arrange indentation · 6b452516
      SeongJae Park authored
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: support REQ_DISCARD · f4659d8e
      Joonsoo Kim authored
      zram is a RAM-based block device and can be used as the backing device
      of a filesystem.  When a filesystem deletes a file, it normally does
      nothing to the file's data blocks; it only marks the file's metadata.
      This behavior is no problem on a disk-based block device, but is a
      problem on a RAM-based one, since we can't free the memory used for
      the data blocks.  REQ_DISCARD exists to overcome this disadvantage: if
      the block device supports REQ_DISCARD and the filesystem is mounted
      with the discard option, the filesystem sends REQ_DISCARD to the block
      device whenever data blocks are discarded.  All we have to do is
      handle this request.

      This patch sets the QUEUE_FLAG_DISCARD flag and handles REQ_DISCARD
      requests.  With it, we can free memory used by zram when it is no
      longer needed.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use scnprintf() in attrs show() methods · 56b4e8cb
      Sergey Senozhatsky authored
      sysfs.txt documentation lists the following requirements:
      
       - The buffer will always be PAGE_SIZE bytes in length. On i386, this
         is 4096.
      
       - show() methods should return the number of bytes printed into the
         buffer. This is the return value of scnprintf().
      
       - show() should always use scnprintf().
      
      Use scnprintf() in show() functions.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: propagate error to user · 60a726e3
      Minchan Kim authored
      When zcomp is initialised with a single stream, max_comp_streams
      cannot be changed without a zram reset, but the current interface
      shows no error to the user and even updates max_comp_streams's value
      without any effect, which is very confusing.

      This patch prevents changing max_comp_streams when zcomp was
      initialised as a single-stream zcomp, and reports the error to the
      user (e.g. via echo).
      
      [akpm@linux-foundation.org: don't return with the lock held, per Sergey]
      [fengguang.wu@intel.com: fix coccinelle warnings]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: return error-valued pointer from zcomp_create() · fcfa8d95
      Sergey Senozhatsky authored
      Instead of returning just NULL, return an ERR_PTR from zcomp_create()
      if creating the compressing backend has failed: ERR_PTR(-EINVAL) for
      an unsupported compression algorithm request, ERR_PTR(-ENOMEM) for an
      allocation (zcomp or compression stream) error.

      Perform an IS_ERR() check on the value returned from zcomp_create()
      in disksize_store() and set the return code to PTR_ERR().
      
      Change suggested by Jerome Marchand.
      
      [akpm@linux-foundation.org: clean up error recovery flow]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: move comp allocation out of init_lock · d61f98c7
      Sergey Senozhatsky authored
      While fixing lockdep spew of ->init_lock reported by Sasha Levin [1],
      Minchan Kim noted [2] that it's better to move compression backend
      allocation (using GPF_KERNEL) out of the ->init_lock lock, same way as
      with zram_meta_alloc(), in order to prevent the same lockdep spew.
      
      [1] https://lkml.org/lkml/2014/2/27/337
      [2] https://lkml.org/lkml/2014/3/3/32
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: add lz4 algorithm backend · 6e76668e
      Sergey Senozhatsky authored
      Introduce an LZ4 compression backend and make it available for
      selection.  LZ4 support is optional and requires the user to set the
      ZRAM_LZ4_COMPRESS config option.  The default compression backend is
      LZO.
      
      TEST
      
      (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
      ext4 file system, 3 compression streams)
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             Test           LZO           LZ4
      ----------------------------------------------
        Initial write   1642744.62    1317005.09
              Rewrite   2498980.88    1800645.16
                 Read   3957026.38    5877043.75
              Re-read   3950997.38    5861847.00
         Reverse Read   2937114.56    5047384.00
          Stride read   2948163.19    4929587.38
          Random read   3292692.69    4880793.62
       Mixed workload   1545602.62    3502940.38
         Random write   2448039.75    1758786.25
               Pwrite   1670051.03    1338329.69
                Pread   2530682.00    5097177.62
               Fwrite   3232085.62    3275942.56
                Fread   6306880.25    6645271.12
      
      So on my system LZ4 is slower in write-only tests, while it performs
      better in read-only and mixed (reads + writes) tests.
      
      Official LZ4 benchmarks are available at http://code.google.com/p/lz4/
      (the Linux kernel uses revision r90).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: make compression algorithm selection possible · e46b8a03
      Sergey Senozhatsky authored
      Add and document the `comp_algorithm' device attribute.  This
      attribute shows the supported compression algorithms and the currently
      selected one:

      	cat /sys/block/zram0/comp_algorithm
      	[lzo] lz4

      and allows the selected compression algorithm to be changed:

      	echo lzo > /sys/block/zram0/comp_algorithm
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: add set_max_streams knob · fe8eb122
      Sergey Senozhatsky authored
      This patch makes it possible to change max_comp_streams on an
      initialised zcomp.

      Introduce a zcomp set_max_streams() knob, with
      zcomp_strm_multi_set_max_streams() and
      zcomp_strm_single_set_max_streams() callbacks to change the streams
      limit for zcomp_strm_multi and zcomp_strm_single, respectively.
      set_max_streams for a single-stream zcomp does nothing.

      If the user has lowered the limit, zcomp_strm_multi_set_max_streams()
      attempts to immediately free the extra streams (as many as it can,
      depending on idle stream availability).

      Note that this patch does not allow changing the stream 'policy' from
      single to multi stream (or vice versa) on an already initialised
      compression backend.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: add multi stream functionality · beca3ec7
      Sergey Senozhatsky authored
      The existing zram (zcomp) implementation has only one compression
      stream (buffer and algorithm private part), so in order to prevent
      data corruption only one write (compress operation) can use this
      compression stream, forcing all concurrent write operations to wait
      for the stream lock to be released.  This patch changes zcomp to keep
      a list of compression streams of user-defined size (set via a sysfs
      device attribute).  Each write operation still holds a compression
      stream exclusively; the difference is that we can have N write
      operations (depending on the size of the streams list) executing in
      parallel.  See the TEST section later in this commit message for
      performance data.
      
      Introduce struct zcomp_strm_multi and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_multi has a list of idle
      zcomp_strm structs, a spinlock to protect the idle list, and a wait
      queue, making it possible to perform parallel compressions.
      
      The following functions are added:
      - zcomp_strm_multi_find()/zcomp_strm_multi_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
        create and destroy zcomp_strm_multi
      
      zcomp ->strm_find() and ->strm_release() callbacks are set during
      initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
      correspondingly.
      
      Each time zcomp issues a zcomp_strm_multi_find() call, the following
      operations are performed:

      - spin-lock strm_lock
      - if the idle list is not empty, remove a zcomp_strm from the idle
        list, spin-unlock and return the zcomp stream pointer to the caller
      - if the idle list is empty, current adds itself to the wait queue; it
        will be woken by a zcomp_strm_multi_release() caller
      
      zcomp_strm_multi_release():
      - spin lock strm_lock
      - add zcomp stream to idle list
      - spin unlock, wake up sleeper
      
      Minchan Kim reported that the spinlock-based locking scheme
      demonstrates a severe performance regression for the single
      compression stream case, compared to the mutex-based one (see
      https://lkml.org/lkml/2014/2/18/16):
      
      base                      spinlock                    mutex
      
      ==Initial write           ==Initial write             ==Initial  write
      records:  5               records:  5                 records:   5
      avg:      1642424.35      avg:      699610.40         avg:       1655583.71
      std:      39890.95(2.43%) std:      232014.19(33.16%) std:       52293.96
      max:      1690170.94      max:      1163473.45        max:       1697164.75
      min:      1568669.52      min:      573429.88         min:       1553410.23
      ==Rewrite                 ==Rewrite                   ==Rewrite
      records:  5               records:  5                 records:   5
      avg:      1611775.39      avg:      501406.64         avg:       1684419.11
      std:      17144.58(1.06%) std:      15354.41(3.06%)   std:       18367.42
      max:      1641800.95      max:      531356.78         max:       1706445.84
      min:      1593515.27      min:      488817.78         min:       1655335.73
      
      When only one compression stream is available, a mutex with
      spin-on-owner tends to perform much better than frequent
      wait_event()/wake_up() calls.  This is why the single-stream case is
      implemented as a special case with mutex locking.
      
      Introduce and document the zram device attribute max_comp_streams.
      This attribute shows and stores zcomp's current maximum number of
      zcomp streams (max_strm).  Extend zcomp's zcomp_create() with a
      `max_strm' parameter; `max_strm' limits the number of zcomp_strm
      structs on the compression backend's idle list (max_comp_streams).

      max_comp_streams is used during initialisation as follows:
      -- passing max_strm == 1 to zcomp_create() initialises zcomp with the
      single compression stream zcomp_strm_single (mutex-based locking);
      -- passing max_strm > 1 to zcomp_create() initialises zcomp with the
      multi compression stream zcomp_strm_multi (spinlock-based locking).

      The default max_comp_streams value is 1, meaning that zram is
      initialised with a single stream.

      A later patch will introduce a configuration knob to change
      max_comp_streams on an already initialised and in-use zcomp.
      
      TEST
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test           base       1 strm (mutex)     3 strm (spinlock)
      -----------------------------------------------------------------------
       Initial write      589286.78       583518.39          718011.05
             Rewrite      604837.97       596776.38         1515125.72
        Random write      584120.11       595714.58         1388850.25
              Pwrite      535731.17       541117.38          739295.27
              Fwrite     1418083.88      1478612.72         1484927.06
      
      Usage example:
      set max_comp_streams to 4
              echo 4 > /sys/block/zram0/max_comp_streams
      
      show current max_comp_streams (default value is 1).
              cat /sys/block/zram0/max_comp_streams
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: factor out single stream compression · 9cc97529
      Sergey Senozhatsky authored
      This is a preparation patch for adding multi-stream support to zcomp.

      Introduce struct zcomp_strm_single and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_single implements a single
      compression stream, the same way as the current zcomp implementation.
      This moves zcomp_strm stream control and locking out of zcomp, so the
      compressing backend zcomp is not aware of the required locking.

      Single and multi streams require different locking schemes.  Minchan
      Kim reported that the spinlock-based locking scheme (which is used in
      the multi-stream implementation) demonstrates a severe performance
      regression for the single compression stream case, compared to the
      mutex-based one; see https://lkml.org/lkml/2014/2/18/16
      
      The following functions are added:
      - zcomp_strm_single_find()/zcomp_strm_single_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_single_create()/zcomp_strm_single_destroy()
        create and destroy zcomp_strm_single
      
      New ->strm_find() and ->strm_release() callbacks are added to zcomp
      and set to zcomp_strm_single_find() and zcomp_strm_single_release()
      during initialisation.  Instead of direct locking and zcomp_strm
      access from zcomp_strm_find() and zcomp_strm_release(), zcomp now
      calls ->strm_find() and ->strm_release() respectively.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use zcomp compressing backends · b7ca232e
      Sergey Senozhatsky authored
      Do not perform direct LZO compress/decompress calls; initialise and
      use the zcomp LZO backend (single compression stream) instead.
      
      [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: introduce compressing backend abstraction · e7e1ef43
      Sergey Senozhatsky authored
      zram performs direct LZO compression algorithm calls, making it the
      one and only option.  While LZO generally performs well, the LZ4
      algorithm tends to have faster decompression (see
      http://code.google.com/p/lz4/ for a full report):
      
      	Name            Ratio  C.speed D.speed
      	                        MB/s    MB/s
      	LZ4 (r101)      2.084    422    1820
      	LZO 2.06        2.106    414     600
      
      Thus, users with mostly-read (decompress) usage scenarios or a mixed
      workload (writes with a relatively high number of read ops) will
      benefit from using the LZ4 compression backend.
      
      Introduce compressing backend abstraction zcomp in order to support
      multiple compression algorithms with the following set of operations:
      
              .create
              .destroy
              .compress
              .decompress
      
      Schematically, zram write() usually contains the following steps:
      0) preparation (decompression of partial IO, etc.)
      1) lock the buffer_lock mutex (protects meta compress buffers)
      2) compress (using meta compress buffers)
      3) allocate and map a zs_pool object
      4) copy the compressed data (from meta compress buffers) to the object allocated in 3)
      5) free the previous pool page, assign a new one
      6) unlock the buffer_lock mutex
      
      As we can see, the compressing buffers must remain untouched from 1)
      to 4), because otherwise a concurrent write() could overwrite the
      data.  At the same time, zram_meta must be aware of a) the specific
      compression algorithm's memory requirements and b) the locking
      necessary to protect the compression buffers.  To remove requirement
      a), a new struct zcomp_strm is introduced, which contains a
      compress/decompress `buffer' and the compression algorithm's `private'
      part, while struct zcomp implements zcomp_strm stream handling and
      locking, removing requirement b) from zram meta.  zcomp ->create() and
      ->destroy(), respectively, allocate and deallocate the
      algorithm-specific zcomp_strm `private' part.
      
      Every zcomp has a zcomp stream and a mutex to protect its compression
      stream.  Stream usage semantics remain the same -- only one write can
      hold the stream lock and use its buffers.  zcomp_strm_find() turns the
      caller into the exclusive user of a stream (holding the stream mutex
      until zram releases the stream), and zcomp_strm_release() makes the
      zcomp stream available again (unlocks the stream mutex).  Hence no
      concurrent write (compression) operations are possible at the moment.
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test            base           patched
      --------------------------------------------------
        Initial write      597992.91       591660.58
              Rewrite      609674.34       616054.97
                 Read     2404771.75      2452909.12
              Re-read     2459216.81      2470074.44
         Reverse Read     1652769.66      1589128.66
          Stride read     2202441.81      2202173.31
          Random read     2236311.47      2276565.31
       Mixed workload     1423760.41      1709760.06
         Random write      579584.08       615933.86
               Pwrite      597550.02       594933.70
                Pread     1703672.53      1718126.72
               Fwrite     1330497.06      1461054.00
                Fread     3922851.00      3957242.62
      
      Usage examples:
      
      	comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
      
      which initialises the compressing backend if the requested algorithm is supported.
      
      Compress:
      	zstrm = zcomp_strm_find(comp)
      	zcomp_compress(comp, zstrm, src, &dst_len)
      	[..] /* copy compressed data */
      	zcomp_strm_release(comp, zstrm)
      
      Decompress:
      	zcomp_decompress(comp, src, src_len, dst);
      
      Free the compressing backend and its zcomp stream:
      	zcomp_destroy(comp)
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: delete zram_init_device() · b67d1ec1
      Sergey Senozhatsky authored
      Allocate a new `zram_meta' in disksize_store() only for an
      uninitialised zram device, saving a number of allocations and
      deallocations in the case where disksize_store() is called on a
      currently used device.  At the same time, the zram_meta stack variable
      is not necessary, because we can set ->meta directly.  There is also
      no need to set the QUEUE_FLAG_NONROT queue flag on every
      disksize_store(); set it once during device creation.
      
      [minchan@kernel.org: handle zram->meta alloc fail case]
      [minchan@kernel.org: prevent lockdep spew of init_lock]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: document failed_reads, failed_writes stats · 8dd1d324
      Sergey Senozhatsky authored
      Document `failed_reads' and `failed_writes' device attributes.
      Remove info about `discard' - there is no such zram attr.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8dd1d324
    • Sergey Senozhatsky's avatar
      zram: move zram size warning to documentation · e64cd51d
      Sergey Senozhatsky authored
      Move zram warning about disksize and size of memory correlation to zram
      documentation.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e64cd51d
    • Sergey Senozhatsky's avatar
      zram: drop not used table `count' member · 59fc86a4
      Sergey Senozhatsky authored
The struct table `count' member is not used.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59fc86a4
    • Sergey Senozhatsky's avatar
      zram: report failed read and write stats · 64447249
      Sergey Senozhatsky authored
zram accounted but did not report the number of failed read and write
queries.  Make these stats available as failed_reads and failed_writes
attrs.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64447249
    • Sergey Senozhatsky's avatar
      zram: remove zram stats code duplication · a68eb3b6
      Sergey Senozhatsky authored
Introduce a ZRAM_ATTR_RO macro that generates a device_attribute and a
default ATTR show() function for the existing atomic64_t zram stats.
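A userspace approximation of what such a macro buys. The macro name mirrors the commit; the stats struct, show() signature, and buffer-based interface are simplifications invented here, not the real sysfs device_attribute machinery:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stats block; the kernel uses atomic64_t fields. */
struct zram_stats {
	int64_t failed_reads;
	int64_t failed_writes;
};

static struct zram_stats stats;

/* One macro invocation stamps out a read-only show() helper per stat,
 * which is exactly the duplication the patch removes from the
 * hand-written sysfs attribute code. */
#define ZRAM_ATTR_RO(name)						\
static int name##_show(char *buf, size_t len)				\
{									\
	return snprintf(buf, len, "%lld\n", (long long)stats.name);	\
}

ZRAM_ATTR_RO(failed_reads)
ZRAM_ATTR_RO(failed_writes)
```

Each generated `<stat>_show()` formats one counter, so adding a new stat attribute becomes a single macro line instead of a new hand-written function.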
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a68eb3b6
    • Sergey Senozhatsky's avatar
      zram: use atomic64_t for all zram stats · 90a7806e
      Sergey Senozhatsky authored
      This is a preparation patch for stats code duplication removal.
      
      1) use atomic64_t for `pages_zero' and `pages_stored' zram stats.
      
2) The `compr_size' and `pages_zero' struct zram_stats members did not
   follow the existing device attr naming scheme (zram_stats.ATTR has
   an ATTR_show() function).  Rename them:
      
         -- compr_size -> compr_data_size
         -- pages_zero -> zero_pages
      
      Minchan Kim's note:
       If we really have trouble with atomic stat operation, we could
       change it with percpu_counter so that it could solve atomic overhead and
       unnecessary memory space by introducing unsigned long instead of 64bit
       atomic_t.
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90a7806e
    • Sergey Senozhatsky's avatar
      zram: remove good and bad compress stats · b7cccf8b
      Sergey Senozhatsky authored
Remove the `good' and `bad' compressed sub-request stats.  A RW request
may cause a number of RW sub-requests.  zram used to account `good'
compressed sub-requests (with compressed size less than 50% of the
original size) and `bad' compressed sub-requests (with compressed size
greater than 75% of the original size), leaving sub-requests with a
compressed size between 50% and 75% of the original size neither
accounted nor reported.  zram already accounts each sub-request's
compressed size, so the real device compression ratio can be calculated.
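The blind spot described above can be made concrete with a small sketch. The thresholds come from the commit text; the enum and function names are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative version of the removed classification: note the gap
 * between 50% and 75%, which was neither accounted nor reported. */
enum compress_class { COMPRESS_GOOD, COMPRESS_BAD, COMPRESS_UNCOUNTED };

static enum compress_class classify(size_t clen, size_t orig_len)
{
	if (clen * 2 < orig_len)	/* below 50% of original: "good" */
		return COMPRESS_GOOD;
	if (clen * 4 > orig_len * 3)	/* above 75% of original: "bad" */
		return COMPRESS_BAD;
	return COMPRESS_UNCOUNTED;	/* 50%..75%: the unreported gap */
}
```

For a 4 KiB page, a 2.5 KiB result (about 61% of the original) fell in the gap, which is one reason the two counters were misleading as a compression-ratio measure.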
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b7cccf8b
    • Sergey Senozhatsky's avatar
      zram: do not pass rw argument to __zram_make_request() · be257c61
      Sergey Senozhatsky authored
Do not pass the rw argument down the __zram_make_request() ->
zram_bvec_rw() chain; decode it in zram_bvec_rw() instead.  Besides,
this is the place where we distinguish READ and WRITE bio data
directions, so account zram RW stats here instead of in
__zram_make_request().  This also allows accounting of the real number
of zram READ/WRITE operations, not just requests (a single RW request
may cause a number of zram RW ops with separate locking,
compression/decompression, etc).
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be257c61
    • Sergey Senozhatsky's avatar
      zram: drop `init_done' struct zram member · be2d1d56
      Sergey Senozhatsky authored
Introduce an init_done() helper function which allows us to drop the
`init_done' struct zram member.  init_done() uses the fact that
->init_done == 1 is equivalent to ->meta != NULL.
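A minimal sketch of the idea, assuming (as the patch does) that ->meta is allocated exactly when the device is initialised; the struct is reduced to the one relevant field:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct zram: only ->meta matters here. */
struct zram_meta;
struct zram { struct zram_meta *meta; };

/* The helper in spirit: device initialisation and ->meta allocation
 * happen together, so a non-NULL ->meta means "initialised" and the
 * separate flag becomes redundant state that can drift out of sync. */
static int init_done(struct zram *zram)
{
	return zram->meta != NULL;
}
```

Deriving the answer from existing state rather than mirroring it in a second field is the whole point: there is one less invariant to maintain.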
      Signed-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be2d1d56
    • John Hubbard's avatar
      mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL · ed12d845
      John Hubbard authored
      A new dump_page() routine was recently added, and marked
      EXPORT_SYMBOL_GPL.  dump_page() was also added to the VM_BUG_ON_PAGE()
      macro, and so the end result is that non-GPL code can no longer call
      get_page() and a few other routines.
      
      This only happens if the kernel was compiled with CONFIG_DEBUG_VM.
      
      Change dump_page() to be EXPORT_SYMBOL.
      
      Longer explanation:
      
Prior to commit 309381fe ("mm: dump page when hitting a VM_BUG_ON
using VM_BUG_ON_PAGE"), it was possible to build MIT-licensed (non-GPL)
drivers on Fedora.  Fedora is semi-unique, in that it sets
CONFIG_DEBUG_VM.
      
Because Fedora sets CONFIG_DEBUG_VM, they end up pulling in dump_page(),
via VM_BUG_ON_PAGE, via get_page().  As one of the authors of NVIDIA's
new, open source, "UVM-Lite" kernel module, I originally chose to use
the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
because get_page() has always seemed to be very clearly intended for use
by non-GPL driver code.
      
      So I'm hoping that making get_page() widely accessible again will not be
      too controversial.  We did check with Fedora first, and they responded
      (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
      try to get upstream changed, before asking Fedora to change.  Their
      reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
      Fedora to help catch mm bugs.
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed12d845
    • Srikar Dronamraju's avatar
      numa: use LAST_CPUPID_SHIFT to calculate LAST_CPUPID_MASK · 834a964a
      Srikar Dronamraju authored
      LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH.  However
      LAST_CPUPID_WIDTH itself can be 0.  (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
      set).  In such a case LAST_CPUPID_MASK turns out to be 0.
      
But with recent commit 1ae71d03 ("mm: numa: bugfix for
LAST_CPUPID_NOT_IN_PAGE_FLAGS"), if LAST_CPUPID_MASK is 0,
page_cpupid_xchg_last() and page_cpupid_reset_last() cause
page->_last_cpupid to be set to 0.

This causes a performance regression.  It is almost as if numa_balancing
were off.
      
      Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
      LAST_CPUPID_WIDTH.
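The difference is easy to demonstrate with stand-in values (the widths here are illustrative, not the real page-flags layout):

```c
#include <assert.h>

/* With LAST_CPUPID_NOT_IN_PAGE_FLAGS, WIDTH is defined as 0 while
 * SHIFT keeps its real value; 8 is an illustrative width. */
#define LAST_CPUPID_SHIFT 8
#define LAST_CPUPID_WIDTH 0

/* Before the fix: the mask collapses to 0 whenever WIDTH is 0,
 * zeroing every value stored through it. */
#define LAST_CPUPID_MASK_OLD ((1UL << LAST_CPUPID_WIDTH) - 1)

/* After the fix: derive the mask from SHIFT, which never vanishes. */
#define LAST_CPUPID_MASK_NEW ((1UL << LAST_CPUPID_SHIFT) - 1)
```

With the old definition, masking any cpupid with 0 erases it, which matches the observed behaviour of page->_last_cpupid always being set to 0.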
      
      Some performance numbers and perf stats with and without the fix.
      
      (3.14-rc6)
      ----------
      numa01
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
      
               12,27,462 cs                                                           [100.00%]
                2,41,957 migrations                                                   [100.00%]
             1,68,01,713 faults                                                       [100.00%]
          7,99,35,29,041 cache-misses
                  98,808 migrate:mm_migrate_pages                                     [100.00%]
      
          1407.690148814 seconds time elapsed
      
      numa02
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
      
                  63,065 cs                                                           [100.00%]
                  14,364 migrations                                                   [100.00%]
                2,08,118 faults                                                       [100.00%]
            25,32,59,404 cache-misses
                      12 migrate:mm_migrate_pages                                     [100.00%]
      
            63.840827219 seconds time elapsed
      
      (3.14-rc6 with fix)
      -------------------
      numa01
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
      
                9,68,911 cs                                                           [100.00%]
                1,01,414 migrations                                                   [100.00%]
               88,38,697 faults                                                       [100.00%]
          4,42,92,51,042 cache-misses
                4,25,060 migrate:mm_migrate_pages                                     [100.00%]
      
           685.965331189 seconds time elapsed
      
      numa02
      
       Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
      
                  17,543 cs                                                           [100.00%]
                   2,962 migrations                                                   [100.00%]
                1,17,843 faults                                                       [100.00%]
            11,80,61,644 cache-misses
                  12,358 migrate:mm_migrate_pages                                     [100.00%]
      
            20.380132343 seconds time elapsed
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      834a964a
    • Zhang Yanfei's avatar
      madvise: correct the comment of MADV_DODUMP flag · 85892f19
      Zhang Yanfei authored
      s/MADV_NODUMP/MADV_DONTDUMP/
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85892f19
    • Fabian Frederick's avatar
      mm/readahead.c: inline ra_submit · 29f175d1
      Fabian Frederick authored
      Commit f9acc8c7 ("readahead: sanify file_ra_state names") left
      ra_submit with a single function call.
      
      Move ra_submit to internal.h and inline it to save some stack.  Thanks
      to Andrew Morton for commenting different versions.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29f175d1
    • Mizuma, Masayoshi's avatar
      mm: hugetlb: fix softlockup when a large number of hugepages are freed. · 55f67141
      Mizuma, Masayoshi authored
When I decrease the value of nr_hugepages in procfs by a large amount,
a softlockup happens, because there is no chance of a context switch
during this process.

On the other hand, when I allocate a large number of hugepages, there
is some chance of a context switch, so the softlockup doesn't happen
during that process.  It is therefore necessary to add a context switch
to the freeing process, as in the allocating process, to avoid the
softlockup.

When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
process occupied a CPU for over 150 seconds and the following softlockup
message appeared twice or more.
      
      $ echo 6000000 > /proc/sys/vm/nr_hugepages
      $ cat /proc/sys/vm/nr_hugepages
      6000000
      $ grep ^Huge /proc/meminfo
      HugePages_Total:   6000000
      HugePages_Free:    6000000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:       2048 kB
      $ echo 0 > /proc/sys/vm/nr_hugepages
      
      BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
      Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
      Call Trace:
        free_pool_huge_page+0xb8/0xd0
        set_max_huge_pages+0x128/0x190
        hugetlb_sysctl_handler_common+0x113/0x140
        hugetlb_sysctl_handler+0x1e/0x20
        proc_sys_call_handler+0x97/0xd0
        proc_sys_write+0x14/0x20
        vfs_write+0xb8/0x1a0
        sys_write+0x51/0x90
        __audit_syscall_exit+0x265/0x290
        system_call_fastpath+0x16/0x1b
      
I have not confirmed this problem with upstream kernels because I am
not able to prepare a machine equipped with 12 TB of memory at the
moment.  However, I confirmed that the time required was directly
proportional to the number of hugepages freed.

I measured the required times on a smaller machine.  They showed 130-145
hugepages freed per millisecond.
      
        Amount of decreasing     Required time      Decreasing rate
        hugepages                     (msec)         (pages/msec)
        ------------------------------------------------------------
        10,000 pages == 20GB         70 -  74          135-142
        30,000 pages == 60GB        208 - 229          131-144
      
It means that, at this rate, freeing 6 TB of hugepages would trigger a
softlockup with the default threshold of 20 sec.
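The shape of the fix can be sketched in userspace C: a cond_resched() call (stubbed here as a counter) inside the freeing loop gives the scheduler a chance to run, so the watchdog threshold is never hit. The loop body and names are illustrative, not the actual hugetlb code:

```c
#include <assert.h>

static unsigned long resched_calls;

/* Stand-in for the kernel's cond_resched(): here it only counts how
 * often the loop offered to yield the CPU. */
static void cond_resched(void)
{
	resched_calls++;
}

/* Modelled freeing loop: the patch's point is the cond_resched() call
 * inside the loop, so a multi-terabyte free no longer pins a CPU past
 * the soft-lockup threshold. */
static unsigned long free_pool_pages(unsigned long count)
{
	unsigned long freed = 0;

	while (count--) {
		/* ... free one hugepage ... */
		freed++;
		cond_resched();
	}
	return freed;
}
```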
      Signed-off-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      55f67141
    • Fabian Frederick's avatar
      mm/memblock.c: use PFN_PHYS() · 16763230
      Fabian Frederick authored
Replace ((phys_addr_t)(x) << PAGE_SHIFT) with the PFN_PHYS() macro.
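For reference, a self-contained rendering of the macro involved; PAGE_SHIFT is fixed at 12 (4 KiB pages) here for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

#define PAGE_SHIFT 12	/* 4 KiB pages, the common x86 value */

/* Page frame number to physical address; the patch replaces the
 * open-coded shift with this named macro. */
#define PFN_PHYS(x) ((phys_addr_t)(x) << PAGE_SHIFT)
```

The cast matters: without widening to phys_addr_t first, a 32-bit pfn shifted left by PAGE_SHIFT could overflow on systems with more physical address bits than `unsigned long`.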
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16763230
    • Emil Medve's avatar
      memblock: use for_each_memblock() · 136199f0
      Emil Medve authored
      This is a small cleanup.
      Signed-off-by: default avatarEmil Medve <Emilian.Medve@Freescale.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      136199f0
    • Miklos Szeredi's avatar
      mm: remove unused arg of set_page_dirty_balance() · ed6d7c8e
      Miklos Szeredi authored
      There's only one caller of set_page_dirty_balance() and that will call it
      with page_mkwrite == 0.
      
      The page_mkwrite argument was unused since commit b827e496 "mm: close
      page_mkwrite races".
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed6d7c8e
    • Vlastimil Babka's avatar
      mm: try_to_unmap_cluster() should lock_page() before mlocking · 57e68e9c
      Vlastimil Babka authored
      A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
      fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
      the pages other than its check_page parameter (which is already locked).
      
      The BUG_ON in mlock_vma_page() is not documented and its purpose is
      somewhat unclear, but apparently it serializes against page migration,
      which could otherwise fail to transfer the PG_mlocked flag.  This would
      not be fatal, as the page would be eventually encountered again, but
      NR_MLOCK accounting would become distorted nevertheless.  This patch adds
      a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
      effect.
      
      The call site try_to_unmap_cluster() is fixed so that for page !=
      check_page, trylock_page() is attempted (to avoid possible deadlocks as we
      already have check_page locked) and mlock_vma_page() is performed only
      upon success.  If the page lock cannot be obtained, the page is left
      without PG_mlocked, which is again not a problem in the whole unevictable
      memory design.
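The locking rule can be modelled with stand-in types (none of this is the kernel API; it only captures the check_page-vs-trylock distinction described above):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model: check_page is already locked by the caller; every
 * other page is mlocked only if a non-blocking trylock succeeds. */
struct page { bool locked; bool mlocked; };

static bool trylock_page(struct page *p)
{
	if (p->locked)
		return false;	/* never block: avoids the deadlock */
	p->locked = true;
	return true;
}

static void unlock_page(struct page *p)
{
	p->locked = false;
}

static void mlock_vma_page(struct page *p)
{
	assert(p->locked);	/* the BUG_ON(!PageLocked) this patch documents */
	p->mlocked = true;
}

static void try_mlock(struct page *page, struct page *check_page)
{
	if (page == check_page) {	/* already locked by the caller */
		mlock_vma_page(page);
		return;
	}
	if (trylock_page(page)) {
		mlock_vma_page(page);
		unlock_page(page);
	}
	/* on failure the page simply stays without PG_mlocked */
}
```

A page whose lock cannot be taken is skipped rather than waited on, which is safe because, as the commit notes, a page left without PG_mlocked will be encountered again later.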
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      57e68e9c
    • Johannes Weiner's avatar
      mm: page_alloc: spill to remote nodes before waking kswapd · 3a025760
      Johannes Weiner authored
      On NUMA systems, a node may start thrashing cache or even swap anonymous
      pages while there are still free pages on remote nodes.
      
      This is a result of commits 81c0a2bb ("mm: page_alloc: fair zone
      allocator policy") and fff4068c ("mm: page_alloc: revert NUMA aspect
      of fair allocation policy").
      
      Before those changes, the allocator would first try all allowed zones,
      including those on remote nodes, before waking any kswapds.  But now,
      the allocator fastpath doubles as the fairness pass, which in turn can
      only consider the local node to prevent remote spilling based on
      exhausted fairness batches alone.  Remote nodes are only considered in
      the slowpath, after the kswapds are woken up.  But if remote nodes still
      have free memory, kswapd should not be woken to rebalance the local node
or it may thrash cache or swap prematurely.
      
      Fix this by adding one more unfair pass over the zonelist that is
      allowed to spill to remote nodes after the local fairness pass fails but
      before entering the slowpath and waking the kswapds.
      
      This also gets rid of the GFP_THISNODE exemption from the fairness
      protocol because the unfair pass is no longer tied to kswapd, which
      GFP_THISNODE is not allowed to wake up.
      
      However, because remote spills can be more frequent now - we prefer them
      over local kswapd reclaim - the allocation batches on remote nodes could
      underflow more heavily.  When resetting the batches, use
      atomic_long_read() directly instead of zone_page_state() to calculate the
      delta as the latter filters negative counter values.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a025760
    • Michal Hocko's avatar
      memcg: rename high level charging functions · d715ae08
      Michal Hocko authored
      mem_cgroup_newpage_charge is used only for charging anonymous memory so
      it is better to rename it to mem_cgroup_charge_anon.
      
      mem_cgroup_cache_charge is used for file backed memory so rename it to
      mem_cgroup_charge_file.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d715ae08