1. 03 Apr, 2009 40 commits
    • Sukadev Bhattiprolu's avatar
      signals: protect cinit from unblocked SIG_DFL signals · 921cf9f6
      Sukadev Bhattiprolu authored
      Drop early any SIG_DFL or SIG_IGN signals to container-init from within
      the same container.  But queue SIGSTOP and SIGKILL to the container-init
      if they are from an ancestor container.
      
      Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the
      container can still terminate the container-init.  That will be addressed
      in the next patch.
      
      Note:	To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits
         	in a follow-on patch. Until then, this patch is just a preparatory
      	step.
      Signed-off-by: default avatarSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      921cf9f6
    • Sukadev Bhattiprolu's avatar
      signals: add from_ancestor_ns parameter to send_signal() · 7978b567
      Sukadev Bhattiprolu authored
      send_signal() (or its helper) needs to determine the pid namespace of the
      sender.  But a signal sent via kill_pid_info_as_uid() comes from within
      the kernel and send_signal() does not need to determine the pid namespace
      of the sender.  So define a helper for send_signal() which takes an
      additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid()
      use that helper directly.
      
      The 'from_ancestor_ns' parameter will be used in a follow-on patch.
      Signed-off-by: default avatarSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7978b567
    • Oleg Nesterov's avatar
      signals: protect init from unwanted signals more · f008faff
      Oleg Nesterov authored
      (This is a modified version of the patch submitted by Oleg Nesterov
      http://lkml.org/lkml/2008/11/18/249 and tries to address comments that
      came up in that discussion)
      
      init ignores the SIG_DFL signals but we queue them anyway, including
      SIGKILL.  This is mostly OK, the signal will be dropped silently when
      dequeued, but the pending SIGKILL has 2 bad implications:
      
              - it implies fatal_signal_pending(), so we confuse things
                like wait_for_completion_killable/lock_page_killable.
      
              - for the sub-namespace inits, the pending SIGKILL can
                mask (legacy_queue) the subsequent SIGKILL from the
                parent namespace which must kill cinit reliably.
                (preparation, cinits don't have SIGNAL_UNKILLABLE yet)
      
      The patch can't help when init is ptraced, but ptracing of init is not
      "safe" anyway.
      Signed-off-by: default avatarSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Acked-by: default avatarRoland McGrath <roland@redhat.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f008faff
    • Oleg Nesterov's avatar
      signals: remove 'handler' parameter to tracehook functions · 43918f2b
      Oleg Nesterov authored
      Container-init must behave like global-init to processes within the
      container and hence it must be immune to unhandled fatal signals from
      within the container (i.e SIG_DFL signals that terminate the process).
      
      But the same container-init must behave like a normal process to processes
      in ancestor namespaces and so if it receives the same fatal signal from a
      process in ancestor namespace, the signal must be processed.
      
      Implementing these semantics requires that send_signal() determine pid
      namespace of the sender but since signals can originate from workqueues/
      interrupt-handlers, determining pid namespace of sender may not always be
      possible or safe.
      
      This patchset implements the design/simplified semantics suggested by
      Oleg Nesterov.  The simplified semantics for container-init are:
      
      	- container-init must never be terminated by a signal from a
      	  descendant process.
      
      	- container-init must never be immune to SIGKILL from an ancestor
      	  namespace (so a process in parent namespace must always be able
      	  to terminate a descendant container).
      
      	- container-init may be immune to unhandled fatal signals (like
      	  SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP
      	  are the only reliable signals to a container-init from ancestor
      	  namespace.
      
      This patch:
      
      Based on an earlier patch submitted by Oleg Nesterov and comments from
      Roland McGrath (http://lkml.org/lkml/2008/11/19/258).
      
      The handler parameter is currently unused in the tracehook functions.
      Besides, the tracehook functions are called with siglock held, so the
      functions can check the handler if they later need to.
      
      Removing the parameter simiplifies changes to sig_ignored() in a follow-on
      patch.
      Signed-off-by: default avatarSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Acked-by: default avatarRoland McGrath <roland@redhat.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43918f2b
    • Oleg Nesterov's avatar
      do_wait: fix waiting for the group stop with the dead leader · 90bc8d8b
      Oleg Nesterov authored
      do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is
      not true if the leader is already dead.  Check SIGNAL_STOP_STOPPED instead
      and use signal->group_exit_code.
      
      Trivial test-case:
      
      	void *tfunc(void *arg)
      	{
      		pause();
      		return NULL;
      	}
      
      	int main(void)
      	{
      		pthread_t thr;
      		pthread_create(&thr, NULL, tfunc, NULL);
      		pthread_exit(NULL);
      		return 0;
      	}
      
      It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but
      bash can't see this.
      
      The bug is very old, and it was reported multiple times. This patch was sent
      more than a year ago (http://marc.info/?t=119713920000003) but it was ignored.
      
      This change also fixes other oddities (but not all) in this area.  For
      example, before this patch:
      
      	$ sleep 100
      	^Z
      	[1]+  Stopped                 sleep 100
      	$ strace -p `pidof sleep`
      	Process 11442 attached - interrupt to quit
      
      strace hangs in do_wait(), because ->exit_code was already consumed by
      bash.  After this patch, strace happily proceeds:
      
      	--- SIGTSTP (Stopped) @ 0 (0) ---
      	restart_syscall(<... resuming interrupted call ...>
      
      To me, this looks much more "natural" and correct.
      
      Another example.  Let's suppose we have the main thread M and sub-thread
      T, the process is stopped, and its parent did wait(WSTOPPED).  Now we can
      ptrace T but not M.  This looks at least strange to me.
      
      Imho, do_wait() should not confuse the per-thread ptrace stops with the
      per-process job control stops.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Kaz Kylheku <kkylheku@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90bc8d8b
    • David Rientjes's avatar
      cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusets · 6d7b2f5f
      David Rientjes authored
      Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a
      specific cpu.  Thus, their set of allowed cpus shall not change.
      
      This patch prevents such threads from attaching to non-root cpusets.  They do
      not have mempolicies that restrict them to a subset of system nodes and, since
      their cpumask may never change, they cannot use any of the features of
      cpusets.
      
      The tasks will forever be a member of the root cpuset and will be returned
      when listing the tasks attached to that cpuset.
      
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d7b2f5f
    • Paul Menage's avatar
      cpusets: allow cpusets to be configured/built on non-SMP systems · db7f47cf
      Paul Menage authored
      Allow cpusets to be configured/built on non-SMP systems
      
      Currently it's impossible to build cpusets under UML on x86-64, since
      cpusets depends on SMP and x86-64 UML doesn't support SMP.
      
      There's code in cpusets that doesn't depend on SMP.  This patch surrounds
      the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to
      allow cpusets to build/run on UP systems (for testing purposes under UML).
      Reviewed-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: default avatarPaul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db7f47cf
    • David Rientjes's avatar
      cpusets: replace zone allowed functions with node allowed · a1bc5a4e
      David Rientjes authored
      The cpuset_zone_allowed() variants are actually only a function of the
      zone's node.
      
      Cc: Paul Menage <menage@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1bc5a4e
    • Li Zefan's avatar
      cpuset: remove struct cpuset_hotplug_scanner · 7f81b1ae
      Li Zefan authored
      Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f81b1ae
    • Li Zefan's avatar
      cpuset: avoid changing cpuset's mems when errno returned · 010cfac4
      Li Zefan authored
      When writing to cpuset.mems, cpuset has to update its mems_allowed before
      calling update_tasks_nodemask(), but this function might return -ENOMEM.
      
      To avoid this rare case, we allocate the memory before changing
      mems_allowed, and then pass to update_tasks_nodemask().  Similar to what
      update_cpumask() does.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      010cfac4
    • Li Zefan's avatar
      cpuset: rewrite update_tasks_nodemask() · 3b6766fe
      Li Zefan authored
      This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's
      mems_allowed.
      
      Not only simplify the code largely, but also avoid allocating an array to
      hold mm pointers of all the tasks in the cpuset.  This array can be big
      (size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a
      chance to fail the allocation when under memory stress.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b6766fe
    • Li Zefan's avatar
      cgroups: add 'data' field to struct cgroup_scanner · bd1a8ab7
      Li Zefan authored
      We need to pass some data to test_task() or process_task() in some cases.
      Will be used later.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bd1a8ab7
    • Li Zefan's avatar
      cpuset: fix possible races in cpu/memory hotplug · 0b4217b3
      Li Zefan authored
      Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected
      by callback_mutex, otherwise the reader may read wrong cpus/mems. This is
      cpuset's lock rule.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b4217b3
    • Daisuke Nishimura's avatar
      memcg: cleanup cache_charge · 83aae4c7
      Daisuke Nishimura authored
      Current mem_cgroup_cache_charge is a bit complicated especially
      in the case of shmem's swap-in.
      
      This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83aae4c7
    • KAMEZAWA Hiroyuki's avatar
      memcg: remove redundant message at swapon · 627991a2
      KAMEZAWA Hiroyuki authored
      It's pointed out that swap_cgroup's message at swapon() is nonsense.
      Because
      
        * It can be calculated very easily if all necessary information is
          written in Kconfig.
      
        * It's not necessary to annoying people at every swapon().
      
      In other view, now, memory usage per swp_entry is reduced to 2bytes from
      8bytes(64bit) and I think it's reasonably small.
      Reported-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      627991a2
    • KAMEZAWA Hiroyuki's avatar
      cgroups: use css id in swap cgroup for saving memory v5 · a3b2d692
      KAMEZAWA Hiroyuki authored
      Try to use CSS ID for records in swap_cgroup.  By this, on 64bit machine,
      size of swap_cgroup goes down to 2 bytes from 8bytes.
      
      This means, when 2GB of swap is equipped, (assume the page size is 4096bytes)
      
      	From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
      	To   size of swap_cgroup = 2G/4k * 2 = 1Mbytes.
      
      Reduction is large.  Of course, there are trade-offs.  This CSS ID will
      add overhead to swap-in/swap-out/swap-free.
      
      But in general,
        - swap is a resource which the user tend to avoid use.
        - If swap is never used, swap_cgroup area is not used.
        - Reading traditional manuals, size of swap should be proportional to
          size of memory. Memory size of machine is increasing now.
      
      I think reducing size of swap_cgroup makes sense.
      
      Note:
        - ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
        - memcg can be obsolete at rmdir() but not freed while refcnt from
          swap_cgroup is available.
      
      Changelog v4->v5:
       - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
      Changlog ->v4:
       - fixed not configured case.
       - deleted unnecessary comments.
       - fixed NULL pointer bug.
       - fixed message in dmesg.
      
      [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3b2d692
    • Daisuke Nishimura's avatar
      memcg: charge swapcache to proper memcg · 3c776e64
      Daisuke Nishimura authored
      memcg_test.txt says at 4.1:
      
      	This swap-in is one of the most complicated work. In do_swap_page(),
      	following events occur when pte is unchanged.
      
      	(1) the page (SwapCache) is looked up.
      	(2) lock_page()
      	(3) try_charge_swapin()
      	(4) reuse_swap_page() (may call delete_swap_cache())
      	(5) commit_charge_swapin()
      	(6) swap_free().
      
      	Considering following situation for example.
      
      	(A) The page has not been charged before (2) and reuse_swap_page()
      	    doesn't call delete_from_swap_cache().
      	(B) The page has not been charged before (2) and reuse_swap_page()
      	    calls delete_from_swap_cache().
      	(C) The page has been charged before (2) and reuse_swap_page() doesn't
      	    call delete_from_swap_cache().
      	(D) The page has been charged before (2) and reuse_swap_page() calls
      	    delete_from_swap_cache().
      
      	    memory.usage/memsw.usage changes to this page/swp_entry will be
      	 Case          (A)      (B)       (C)     (D)
               Event
             Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
                ===========================================
                (3)        +1/+1    +1/+1     +1/+1   +1/+1
                (4)          -       0/ 0       -     -1/ 0
                (5)         0/-1     0/ 0     -1/-1    0/ 0
                (6)          -       0/-1       -      0/-1
                ===========================================
             Result         1/ 1     1/ 1      1/ 1    1/ 1
      
             In any cases, charges to this page should be 1/ 1.
      
      In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
      (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
      charges to the memcg("foo") to which the "current" belongs.
      OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa")
      to which the page has been charged.
      
      So, if the "foo" and "baa" is different(for example because of task move),
      this charge will be moved from "baa" to "foo".
      
      I think this is an unexpected behavior.
      
      This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
      to return the memcg to which the swapcache has been charged if PCG_USED bit
      is set.
      IIUC, checking PCG_USED bit of swapcache is safe under page lock.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c776e64
    • KOSAKI Motohiro's avatar
      memcg: remove mem_cgroup_reclaim_imbalance() remnants · 3918b96e
      KOSAKI Motohiro authored
      commit 4f98a2fe (vmscan: split LRU lists
      into anon & file sets) removed mem_cgroup_reclaim_imbalance(), but there
      are some leftovers in memcontrol.h.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3918b96e
    • KOSAKI Motohiro's avatar
      memcg: remove mem_cgroup_calc_mapped_ratio() · c137b5ec
      KOSAKI Motohiro authored
      Currently, mem_cgroup_calc_mapped_ratio() is unused at all.  it can be
      removed and KAMEZAWA-san suggested it.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c137b5ec
    • Balbir Singh's avatar
      memcg: show memcg information during OOM · e222432b
      Balbir Singh authored
      Add RSS and swap to OOM output from memcg
      
      Display memcg values like failcnt, usage and limit when an OOM occurs due
      to memcg.
      
      Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
      Daisuke Nishimura and KOSAKI Motohiro for review.
      
      Sample output
      -------------
      
      Task in /a/x killed as a result of limit of /a
      memory: usage 1048576kB, limit 1048576kB, failcnt 4183
      memory+swap: usage 1400964akB, limit 9007199254740991kB, failcnt 0
      
      [akpm@linux-foundation.org: compilation fix]
      [akpm@linux-foundation.org: fix kerneldoc and whitespace]
      [akpm@linux-foundation.org: add printk facility level]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e222432b
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix OOM killer under memcg · 0b7f569e
      KAMEZAWA Hiroyuki authored
      This patch tries to fix OOM Killer problems caused by hierarchy.
      Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
      kill a task in memcg.
      
      But, when hierarchy is used, it's broken and correct task cannot
      be killed. For example, in following cgroup
      
      	/groupA/	hierarchy=1, limit=1G,
      		01	nolimit
      		02	nolimit
      All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
      groupA's 1Gbytes but OOM Killer just kills tasks in groupA.
      
      This patch provides makes the bad process be selected from all tasks
      under hierarchy. BTW, currently, oom_jiffies is updated against groupA
      in above case. oom_jiffies of tree should be updated.
      
      To see how oom_jiffies is used, please check mem_cgroup_oom_called()
      callers.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: const fix]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b7f569e
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm · 81d39c20
      KAMEZAWA Hiroyuki authored
      As pointed out, shrinking memcg's limit should return -EBUSY after
      reasonable retries.  This patch tries to fix the current behavior of
      shrink_usage.
      
      Before looking into "shrink should return -EBUSY" problem, we should fix
      hierarchical reclaim code.  It compares current usage and current limit,
      but it only makes sense when the kernel reclaims memory because hit
      limits.  This is also a problem.
      
      What this patch does are.
      
        1. add new argument "shrink" to hierarchical reclaim. If "shrink==true",
           hierarchical reclaim returns immediately and the caller checks the kernel
           should shrink more or not.
           (At shrinking memory, usage is always smaller than limit. So check for
            usage < limit is useless.)
      
        2. For adjusting to above change, 2 changes in "shrink"'s retry path.
           2-a. retry_count depends on # of children because the kernel visits
      	  the children under hierarchy one by one.
           2-b. rather than checking return value of hierarchical_reclaim's progress,
      	  compares usage-before-shrink and usage-after-shrink.
      	  If usage-before-shrink <= usage-after-shrink, retry_count is
      	  decremented.
      Reported-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81d39c20
    • KAMEZAWA Hiroyuki's avatar
      memcg: hierarchical stat · 14067bb3
      KAMEZAWA Hiroyuki authored
      Clean up memory.stat file routine and show "total" hierarchical stat.
      
      This patch does
        - renamed get_all_zonestat to be get_local_zonestat.
        - remove old mem_cgroup_stat_desc, which is only for per-cpu stat.
        - add mcs_stat to cover both of per-cpu/per-lru stat.
        - add "total" stat of hierarchy (*)
        - add a callback system to scan all memcg under a root.
      == "total" is added.
      [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
      cache 0
      rss 0
      pgpgin 0
      pgpgout 0
      inactive_anon 0
      active_anon 0
      inactive_file 0
      active_file 0
      unevictable 0
      hierarchical_memory_limit 50331648
      hierarchical_memsw_limit 9223372036854775807
      total_cache 65536
      total_rss 192512
      total_pgpgin 218
      total_pgpgout 155
      total_inactive_anon 0
      total_active_anon 135168
      total_inactive_file 61440
      total_active_file 4096
      total_unevictable 0
      ==
      (*) maybe the user can do calc hierarchical stat by his own program
         in userland but if it can be written in clean way, it's worth to be
         shown, I think.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      14067bb3
    • KAMEZAWA Hiroyuki's avatar
      memcg: use CSS ID · 04046e1a
      KAMEZAWA Hiroyuki authored
      Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy.
      
      	Assume folloing tree.
      
      	group_A (ID=3)
      		/01 (ID=4)
      		   /0A (ID=7)
      		/02 (ID=10)
      	group_B (ID=5)
      	and task in group_A/01/0A hits limit at group_A.
      
      	reclaim will be done in following order (round-robin).
      	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
      	-> group_A -> .....
      
      	Round robin by ID. The last visited cgroup is recorded and restart
      	from it when it start reclaim again.
      	(More smart algorithm can be implemented..)
      
      	No cgroup_mutex or hierarchy_mutex is required.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04046e1a
    • Li Zefan's avatar
      devcgroup: avoid using cgroup_lock · b4046f00
      Li Zefan authored
      There is nothing special that has to be protected by cgroup_lock,
      so introduce devcgroup_mtuex for it's own use.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4046f00
    • Li Zefan's avatar
      debug cgroup: remove unneeded cgroup_lock · d969fbe6
      Li Zefan authored
      Since we are in cgroup write handler, so the cgrp is valid, so we don't
      have to hold cgroup_mutex when calling cgroup_task_count().  One similar
      example is in cgroup_tasks_open().
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: default avatarPaul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d969fbe6
    • Li Zefan's avatar
      cgroups: don't change release_agent when remount failed · 0670e08b
      Li Zefan authored
      Remount can fail in either case:
        - wrong mount options is specified, or option 'noprefix' is changed.
        - a to-be-added subsys is already mounted/active.
      
      When using remount to change 'release_agent', for the above former failure
      case, remount will return errno with release_agent unchanged, but for the
      latter case, remount will return EBUSY with relase_agent changed, which is
      unexpected I think:
      
       # mount -t cgroup -o cpu xxx /cgrp1
       # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
       # cat /cgrp2/release_agent
       agent1
       # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
       mount: /cgrp2 not mounted already, or bad option
       # cat /cgrp2/release_agent
       agent1     <-- ok
       # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
       mount: /cgrp2 is busy
       # cat /cgrp2/release_agent
       agent2     <-- unexpected!
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0670e08b
    • Li Zefan's avatar
      cgroups: show correct file mode · 099fca32
      Li Zefan authored
      We have some read-only files and write-only files, but currently they are
      all set to 0644, which is counter-intuitive and cause trouble for some
      cgroup tools like libcgroup.
      
      This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's
      own files' file mode, and for the most cases cft->mode can be default to 0
      and cgroup will figure out proper mode.
      Acked-by: default avatarPaul Menage <menage@google.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      099fca32
    • Li Zefan's avatar
      cgroups: more documentation for remount and release_agent · b6719ec1
      Li Zefan authored
      This won't remove cpuacct from the mounted hierachy:
       # mount -t cgroup -o cpu,cpuacct xxx /mnt
       # mount -o remount,cpu /mnt
      
      Because for this usage mount(8) will append the new options to the original
      options.
      
      And this will get you right:
       # mount [-t cgroup] -o remount,cpu xxx /mnt
      
      Also document how to specify or change release_agent.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Reviewd-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6719ec1
    • Jesper Juhl's avatar
      kernel/cgroup.c: kfree(NULL) is legal · 66bdc9cf
      Jesper Juhl authored
      Reduces object file size a bit:
      
      Before:
      $ size kernel/cgroup.o
         text    data     bss     dec     hex filename
        21593    7804    4924   34321    8611 kernel/cgroup.o
      After:
      $ size kernel/cgroup.o
         text    data     bss     dec     hex filename
        21537    7744    4924   34205    859d kernel/cgroup.o
      Signed-off-by: default avatarJesper Juhl <jj@chaosbits.net>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66bdc9cf
    • KAMEZAWA Hiroyuki's avatar
      cgroup: fix frequent -EBUSY at rmdir · ec64f515
      KAMEZAWA Hiroyuki authored
      In following situation, with memory subsystem,
      
      	/groupA use_hierarchy==1
      		/01 some tasks
      		/02 some tasks
      		/03 some tasks
      		/04 empty
      
      When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
      is triggered and the kernel walks tree under groupA. In this case,
      rmdir /groupA/04 fails with -EBUSY frequently because of temporal
      refcnt from the kernel.
      
      In general. cgroup can be rmdir'd if there are no children groups and
      no tasks. Frequent fails of rmdir() is not useful to users.
      (And the reason for -EBUSY is unknown to users.....in most cases)
      
      This patch tries to modify above behavior, by
      	- retries if css_refcnt is got by someone.
      	- add "return value" to pre_destroy() and allows subsystem to
      	  say "we're really busy!"
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec64f515
    • KAMEZAWA Hiroyuki's avatar
      cgroup: CSS ID support · 38460b48
      KAMEZAWA Hiroyuki authored
      Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
      
      This patch attaches unique ID to each css and provides following.
      
       - css_lookup(subsys, id)
         returns pointer to struct cgroup_subysys_state of id.
       - css_get_next(subsys, id, rootid, depth, foundid)
         returns the next css under "root" by scanning
      
      When cgroup_subsys->use_id is set, an id for css is maintained.
      
      The cgroup framework only parepares
      	- css_id of root css for subsys
      	- id is automatically attached at creation of css.
      	- id is *not* freed automatically. Because the cgroup framework
      	  don't know lifetime of cgroup_subsys_state.
      	  free_css_id() function is provided. This must be called by subsys.
      
      There are several reasons to develop this.
      	- Saving space .... For example, memcg's swap_cgroup is array of
      	  pointers to cgroup. But it is not necessary to be very fast.
      	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
      	  reduce much amount of memory usage.
      
      	- Scanning without lock.
      	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
      	  css under root can be written without locks.
      	  ex)
      	  do {
      		rcu_read_lock();
      		next = cgroup_get_next(subsys, id, root, &found);
      		/* check sanity of next here */
      		css_tryget();
      		rcu_read_unlock();
      		id = found + 1
      	 } while(...)
      
      Characteristics:
      	- Each css has unique ID under subsys.
      	- Lifetime of ID is controlled by subsys.
      	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
      	- Allowed ID is 1-65535, ID 0 is UNUSED ID.
      
      Design Choices:
      	- scan-by-ID v.s. scan-by-tree-walk.
      	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
      	  by following kind of routine.
      	  scan -> rest a while(release a lock) -> conitunue from interrupted
      	  memcg's hierarchical reclaim does this.
      
      	- When subsys->use_id is set, # of css in the system is limited to
      	  65535.
      
      [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarPaul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarBharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      38460b48
    • Grzegorz Nosek's avatar
      cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups · 313e924c
      Grzegorz Nosek authored
      The ns_proxy cgroup allows moving processes to child cgroups only one
      level deep at a time.  This commit relaxes this restriction and makes it
      possible to attach tasks directly to grandchild cgroups, e.g.:
      
      ($pid is in the root cgroup)
      echo $pid > /cgroup/CG1/CG2/tasks
      
      Previously this operation would fail with -EPERM and would have to be
      performed as two steps:
      echo $pid > /cgroup/CG1/tasks
      echo $pid > /cgroup/CG1/CG2/tasks
      
      Also, the target cgroup no longer needs to be empty to move a task there.
      Signed-off-by: default avatarGrzegorz Nosek <root@localdomain.pl>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Reviewed-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      313e924c
    • Paul Menage's avatar
      cgroups: fix cgroup.h comments · d20a390a
      Paul Menage authored
      Fix the style of some multi-line comments in cgroup.h to match
      Documentation/CodingStyle
      Signed-off-by: default avatarPaul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d20a390a
    • Li Xiaodong's avatar
      documentation: fix unix_dgram_qlen description · 45dad7bd
      Li Xiaodong authored
      Previous description about system parameter in /proc/sys/net/unix/ is
      wrong (or missed).  Simply add a new description about unix_dgram_qlen
      according to latest kernel.
      Signed-off-by: default avatarLi Xiaodong <lixd@cn.fujitsu.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45dad7bd
    • Shen Feng's avatar
      documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls · 760df93e
      Shen Feng authored
      Now /proc/sys is described in many places and much information is
      redundant.  This patch updates the proc.txt and move the /proc/sys
      desciption out to the files in Documentation/sysctls.
      
      Details are:
      
      merge
      -  2.1  /proc/sys/fs - File system data
      -  2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
      -  2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
      with Documentation/sysctls/fs.txt.
      
      remove
      -  2.2  /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
      since it's not better then the Documentation/binfmt_misc.txt.
      
      merge
      -  2.3  /proc/sys/kernel - general kernel parameters
      with Documentation/sysctls/kernel.txt
      
      remove
      -  2.5  /proc/sys/dev - Device specific parameters
      since it's obsolete the sysfs is used now.
      
      remove
      -  2.6  /proc/sys/sunrpc - Remote procedure calls
      since it's not better then the Documentation/sysctls/sunrpc.txt
      
      move
      -  2.7  /proc/sys/net - Networking stuff
      -  2.9  Appletalk
      -  2.10 IPX
      to newly created Documentation/sysctls/net.txt.
      
      remove
      -  2.8  /proc/sys/net/ipv4 - IPV4 settings
      since it's not better then the Documentation/networking/ip-sysctl.txt.
      
      add
      - Chapter 3 Per-Process Parameters
      to descibe /proc/<pid>/xxx parameters.
      Signed-off-by: default avatarShen Feng <shen@cn.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      760df93e
    • Henrik Austad's avatar
      documentation: ignore byproducts from latex · 70eed8d0
      Henrik Austad authored
      When using 'make pdfdocs', auto-generated files should be ignored
      Signed-off-by: default avatarHenrik Austad <henrik@austad.us>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70eed8d0
    • Roel Kluin's avatar
      hppfs: hppfs_read_file() may return -ERROR · 880fe76e
      Roel Kluin authored
      hppfs_read_file() may return (ssize_t) -ENOMEM, or -EFAULT.  When stored
      in size_t 'count', these errors will not be noticed, a large value will be
      added to *ppos.
      Signed-off-by: default avatarRoel Kluin <roel.kluin@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      880fe76e
    • Jan Kara's avatar
      ext3: avoid false EIO errors · 695f6ae0
      Jan Kara authored
      Sometimes block_write_begin() can map buffers in a page but later we
      fail to copy data into those buffers (because the source page has been
      paged out in the mean time).  We then end up with !uptodate mapped
      buffers.  To add a bit more to the confusion, block_write_end() does
      not commit any data (and thus does not any mark buffers as uptodate) if
      we didn't succeed with copying all the data.
      
      Commit f4fc66a8 (ext3: convert to new
      aops) missed these cases and thus we were inserting non-uptodate
      buffers to transaction's list which confuses JBD code and it reports IO
      errors, aborts a transaction and generally makes users afraid about
      their data ;-P.
      
      This patch fixes the problem by reorganizing ext3_..._write_end() code
      to first call block_write_end() to mark buffers with valid data
      uptodate and after that we file only uptodate buffers to transaction's
      lists.
      
      We also fix a problem where we could leave blocks allocated beyond i_size
      (i_disksize in fact) because of failed write. We now add inode to orphan
      list when write fails (to be safe in case we crash) and then truncate blocks
      beyond i_size in a separate transaction.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      695f6ae0
    • Bryan Donlan's avatar
      ext3: return -EIO not -ESTALE on directory traversal through deleted inode · de18f3b2
      Bryan Donlan authored
      ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
      report errors to NFS properly.  However, in ext[234]_lookup(), this
      -ESTALE can be propagated to userspace if the filesystem is corrupted such
      that a directory entry references a deleted inode.  This leads to a
      misleading error message - "Stale NFS file handle" - and confusion on the
      part of the admin.
      
      The bug can be easily reproduced by creating a new filesystem, making a
      link to an unused inode using debugfs, then mounting and attempting to ls
      -l said link.
      
      This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
      from ext3_iget(), as ext3 does for other filesystem metadata corruption;
      and also invokes the appropriate ext*_error functions when this case is
      detected.
      Signed-off-by: default avatarBryan Donlan <bdonlan@gmail.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de18f3b2