1. 06 May, 2011 13 commits
    • rcu: add tracing for RCU's kthread run states. · d71df90e
      Paul E. McKenney authored
      Add tracing to help debug situations in which RCU's kthreads are not
      running but are supposed to be.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: add callback-queue information to rcudata output · 0ac3d136
      Paul E. McKenney authored
      This commit adds an indication of the state of the callback queue using
      a string of four characters following the "ql=" integer queue length.
      The first character is "N" if there are callbacks that have been
      queued that are not yet ready to be handled by the next grace period, or
      "." otherwise.  The second character is "R" if there are callbacks queued
      that are ready to be handled by the next grace period, or "." otherwise.
      The third character is "W" if there are callbacks waiting for the current
      grace period, or "." otherwise.  Finally, the fourth character is "D"
      if there are callbacks that have been handled by a prior grace period
      and are waiting to be invoked, or "." otherwise.
      
      Note that callbacks that are in the process of being invoked are
      not shown.  These callbacks would have been removed from the rcu_data
      structure's list by rcu_do_batch() prior to being executed.  (These
      callbacks are also not reflected in the "ql=" total, FWIW.)
      
      Also, document the new callback-queue trace information.
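
The four-character encoding described above can be sketched in C as follows. This is a minimal illustration, not the kernel's code: `struct cbq_state` and `cbq_flags()` are hypothetical names, and the booleans stand in for whatever checks the real rcu_data callback list requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one CPU's callback queue (not the rcu_data layout). */
struct cbq_state {
	bool not_ready;    /* queued, not yet ready for the next grace period */
	bool ready_next;   /* ready to be handled by the next grace period */
	bool waiting_cur;  /* waiting for the current grace period */
	bool done;         /* grace period elapsed, waiting to be invoked */
};

/* Fill out[0..3] with the N/R/W/D flags, NUL-terminate, and return out. */
static char *cbq_flags(const struct cbq_state *s, char out[5])
{
	assert(s != NULL && out != NULL);
	out[0] = s->not_ready   ? 'N' : '.';
	out[1] = s->ready_next  ? 'R' : '.';
	out[2] = s->waiting_cur ? 'W' : '.';
	out[3] = s->done        ? 'D' : '.';
	out[4] = '\0';
	return out;
}
```

Under this sketch, a queue holding only not-yet-ready callbacks and callbacks awaiting invocation would be summarized by a string with "N" in the first position and "D" in the last.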
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Update RCU's trace.txt documentation for new format · 2fa218d8
      Paul E. McKenney authored
      The trace.txt file had obsolete output for the debugfs rcu/rcudata
      file, so update it.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Add boosting to TREE_PREEMPT_RCU tracing · 0ea1f2eb
      Paul E. McKenney authored
      Includes total number of tasks boosted, number boosted on behalf of each
      of normal and expedited grace periods, and statistics on attempts to
      initiate boosting that failed for various reasons.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: eliminate unused boosting statistics · 67b98dba
      Paul E. McKenney authored
      The n_rcu_torture_boost_allocerror and n_rcu_torture_boost_afferror
      statistics are not actually incremented anymore, so eliminate them.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: avoid hammering sched with yet another bound RT kthread · 3acf4a9a
      Paul E. McKenney authored
      The scheduler does not appear to take kindly to having multiple
      real-time threads bound to a CPU that is going offline.  So this
      commit is a temporary hack-around to avoid that happening.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: put per-CPU kthread at non-RT priority during CPU hotplug operations · e3995a25
      Paul E. McKenney authored
      If you are doing CPU hotplug operations, it is best not to have
      CPU-bound realtime tasks running on the outgoing CPU.
      So this commit makes per-CPU kthreads run at non-realtime priority
      during that time.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Force per-rcu_node kthreads off of the outgoing CPU · 0f962a5e
      Paul E. McKenney authored
      The scheduler has had some heartburn in the past when too many real-time
      kthreads were affinitied to the outgoing CPU.  So, this commit lightens
      the load by forcing the per-rcu_node and the boost kthreads off of the
      outgoing CPU.  Note that RCU's per-CPU kthread remains on the outgoing
      CPU until the bitter end, as it must in order to preserve correctness.
      
      Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(),
      given that this function can block.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: priority boosting for TREE_PREEMPT_RCU · 27f4d280
      Paul E. McKenney authored
      Add priority boosting for TREE_PREEMPT_RCU, similar to that for
      TINY_PREEMPT_RCU.  This is enabled by the default-off RCU_BOOST
      kernel parameter.  The priority to which to boost preempted
      RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
      (defaulting to real-time priority 1) and the time to wait before
      boosting the readers who are blocking a given grace period is
      controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
      500 milliseconds).
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: move TREE_RCU from softirq to kthread · a26ac245
      Paul E. McKenney authored
      If RCU priority boosting is to be meaningful, callback invocation must
      be boosted in addition to preempted RCU readers.  Otherwise, in the presence
      of CPU-bound real-time threads, the grace period ends, but the callbacks don't
      get invoked.  If the callbacks don't get invoked, the associated memory
      doesn't get freed, so the system is still subject to OOM.
      
      But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
      moves the callback invocations to a kthread, which can be boosted easily.
      
      Also add comments and properly synchronize all accesses to
      rcu_cpu_kthread_task, as suggested by Lai Jiangshan.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: merge TREE_PREEMPT_RCU blocked_tasks[] lists · 12f5f524
      Paul E. McKenney authored
      Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
      rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
      and ->exp_tasks tail pointers.  This is in preparation for RCU priority
      boosting, which will add a third dimension to the combinatorial explosion
      in the ->blocked_tasks[] case, but simply a third pointer in the new
      ->blkd_tasks case.
      
      Also update documentation to reflect the blocked_tasks[] merge.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Decrease memory-barrier usage based on semi-formal proof · e59fb312
      Paul E. McKenney authored
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed and that
      were retained only out of sheer paranoia.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
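
Steps 1 through 4 of the proof can be modeled with a small single-threaded C sketch. Everything here is an assumption made for illustration: `struct node`, `report_qs()`, and the integer "lock" are hypothetical stand-ins for the rcu_node hierarchy and its ->lock, and a C11 fence stands in for the kernel's memory barrier. The model only checks the structural claims (one lock held at a time, barrier before the update of the completed counter); it does not demonstrate memory ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified model of an rcu_node; not the kernel structure. */
struct node {
	struct node *parent;    /* NULL for the root */
	unsigned long qsmask;   /* children (or CPUs) still owing a quiescent state */
	unsigned long grpmask;  /* this node's bit in parent->qsmask */
	int locked;             /* stand-in for ->lock */
};

static int locks_held;          /* verifies the "one lock at a time" claim */
static unsigned long completed; /* stand-in for the ->completed field */

static void node_lock(struct node *n)
{
	assert(!n->locked && locks_held == 0);
	n->locked = 1;
	locks_held++;
}

static void node_unlock(struct node *n)
{
	assert(n->locked);
	n->locked = 0;
	locks_held--;
}

/* Report a quiescent state for bit `mask` of leaf `n` in grace period `gpnum`. */
static void report_qs(struct node *n, unsigned long mask, unsigned long gpnum)
{
	for (;;) {
		node_lock(n);                 /* steps 1-3: one level at a time */
		n->qsmask &= ~mask;
		if (n->qsmask != 0 || n->parent == NULL)
			break;                /* others still pending, or at the root */
		mask = n->grpmask;
		node_unlock(n);               /* release before acquiring the parent */
		n = n->parent;
	}
	if (n->qsmask == 0 && n->parent == NULL) {
		/* Step 4: full barrier immediately before updating completed. */
		atomic_thread_fence(memory_order_seq_cst);
		completed = gpnum;
	}
	node_unlock(n);
}
```

With a root and two leaves, the counter changes only when the last outstanding bit is cleared, and `locks_held` never exceeds one along the way.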
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Remove conditional compilation for RCU CPU stall warnings · a00e0d71
      Paul E. McKenney authored
      The RCU CPU stall warnings can now be controlled using the
      rcu_cpu_stall_suppress boot-time parameter or via the same parameter
      from sysfs.  There is therefore no longer any reason to have
      kernel config parameters for this feature.  This commit therefore
      removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
      kernel config parameters.  The RCU_CPU_STALL_TIMEOUT parameter remains
      to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
      remains to allow task-stall information to be suppressed if desired.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
  2. 04 May, 2011 4 commits
  3. 03 May, 2011 8 commits
  4. 02 May, 2011 15 commits
    • sysctl: net: call unregister_net_sysctl_table where needed · ff538818
      Lucian Adrian Grijincu authored
      ctl_table_headers registered with register_net_sysctl_table should
      have been unregistered with the equivalent unregister_net_sysctl_table.
      Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert: veth: remove unneeded ifname code from veth_newlink() · 6c8c4446
      Jiri Pirko authored
      Commit 84c49d8c ("veth: remove unneeded ifname code from
      veth_newlink()") caused a regression on veth creation.  This patch
      reverts it.
      Reported-by: Michał Mirosław <mirqus@gmail.com>
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • smsc95xx: fix reset check · d9460920
      Rabin Vincent authored
      The reset loop check should check the MII_BMCR register value for
      BMCR_RESET rather than for MII_BMCR (the register address, which also
      happens to be zero).
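
The difference between the two masks can be seen in a small stand-alone C sketch. `MII_BMCR` and `BMCR_RESET` carry the values they have in the kernel's MII definitions; `struct fake_phy`, `fake_mdio_read()`, and `poll_count()` are hypothetical helpers invented for this illustration and are not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

#define MII_BMCR   0x00    /* address of the basic mode control register */
#define BMCR_RESET 0x8000  /* self-clearing reset bit inside that register */

/* Hypothetical PHY model: the reset bit clears itself after a few reads. */
struct fake_phy {
	uint16_t bmcr;
	int reads_until_clear;
};

static uint16_t fake_mdio_read(struct fake_phy *phy, int reg)
{
	assert(reg == MII_BMCR);
	if (phy->reads_until_clear > 0 && --phy->reads_until_clear == 0)
		phy->bmcr &= (uint16_t)~BMCR_RESET;
	return phy->bmcr;
}

/*
 * Poll the control register while (value & mask) is nonzero and return the
 * number of reads performed.  Passing MII_BMCR (zero) as the mask reproduces
 * the bug: the condition is always false, so the loop exits after one read.
 */
static int poll_count(struct fake_phy *phy, uint16_t mask, int max_polls)
{
	uint16_t bmcr;
	int polls = 0;

	do {
		bmcr = fake_mdio_read(phy, MII_BMCR);
		polls++;
	} while ((bmcr & mask) && polls < max_polls);

	return polls;
}
```

Called with `BMCR_RESET` as the mask, the loop keeps polling until the simulated PHY clears the bit; called with `MII_BMCR`, it returns while the reset is still in progress.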
      Signed-off-by: Rabin Vincent <rabin@rab.in>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tg3: Fix failure to enable WoL by default when possible · 6fdbab9d
      Rafael J. Wysocki authored
      tg3 is supposed to enable WoL by default on adapters which support
      that, but it fails to do so unless the adapter's
      /sys/devices/.../power/wakeup file contains 'enabled' during the
      initialization of the adapter.  Fix that by making tg3 use
      device_set_wakeup_enable() to enable wakeup automatically whenever
      WoL should be enabled by default.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: inappropriate ioctl operation should return ENOTTY · 41c31f31
      Lifeng Sun authored
      ioctl() calls against a socket with an inappropriate ioctl operation
      are incorrectly returning EINVAL rather than ENOTTY:
      
        [ENOTTY]
            Inappropriate I/O control operation.
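
The distinction between the two error codes can be sketched with a toy dispatcher in C. The command numbers and `demo_ioctl()` are hypothetical and exist only for this example; the point is that an unknown command yields ENOTTY, while EINVAL is kept for a known command with a bad argument.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers, for illustration only. */
enum { DEMO_GET_MTU = 1, DEMO_SET_MTU = 2 };

static int demo_mtu = 1500;

/* Returns 0 on success or a negative errno value, kernel style. */
static int demo_ioctl(unsigned int cmd, int *arg)
{
	assert(arg != NULL);

	switch (cmd) {
	case DEMO_GET_MTU:
		*arg = demo_mtu;
		return 0;
	case DEMO_SET_MTU:
		if (*arg <= 0)
			return -EINVAL;  /* known command, invalid argument */
		demo_mtu = *arg;
		return 0;
	default:
		return -ENOTTY;          /* command does not apply to this object */
	}
}
```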
      
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=33992
      Signed-off-by: Lifeng Sun <lifongsun@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • x86, reboot: Fix relocations in reboot_32.S · 7806a49a
      H. Peter Anvin authored
      The use of base for %ebx in this file is arbitrary, *except* that we
      also use it to compute the real-mode segment.  Therefore, make it so
      that r_base really is the true address to which %ebx points.
      
      This resolves kernel bugzilla 33302.
      Reported-and-tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/n/tip-08os5wi3yq1no0y4i5m4z7he@git.kernel.org
    • xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top · b9269dc7
      Stefano Stabellini authored
      mask_rw_pte currently treats a pfn as a pagetable page if it falls in
      the range pgt_buf_start - pgt_buf_end, but that is incorrect because
      pgt_buf_end is a moving target: pgt_buf_top is the real boundary.
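
The change in the range check can be sketched as follows. This is an illustration under assumptions, not the code in arch/x86/xen/mmu.c: `struct pgt_buf` and both helper functions are hypothetical, and the ranges are treated as half-open.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the early page-table buffer, in page frame numbers. */
struct pgt_buf {
	unsigned long start;  /* first pfn of the buffer */
	unsigned long end;    /* next pfn to hand out; grows while mapping */
	unsigned long top;    /* one past the last pfn of the buffer; fixed */
};

/* Old check: only pages already handed out count as pagetable pages. */
static bool is_pagetable_pfn_old(const struct pgt_buf *b, unsigned long pfn)
{
	assert(b->start <= b->end && b->end <= b->top);
	return pfn >= b->start && pfn < b->end;
}

/* New check: every page up to top may become a pagetable page later. */
static bool is_pagetable_pfn_new(const struct pgt_buf *b, unsigned long pfn)
{
	assert(b->start <= b->end && b->end <= b->top);
	return pfn >= b->start && pfn < b->top;
}
```

A pfn between `end` and `top` is the interesting case: the old check would let it be mapped RW, the new check marks it RO.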
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen/mmu: Add workaround "x86-64, mm: Put early page table high" · a3864783
      Konrad Rzeszutek Wilk authored
      As a consequence of the commit:
      
      commit 4b239f45
      Author: Yinghai Lu <yinghai@kernel.org>
      Date:   Fri Dec 17 16:58:28 2010 -0800
      
          x86-64, mm: Put early page table high
      
      the Linux kernel crashes under Xen:
      
      mapping kernel into physical memory
      Xen: setup ISA identity maps
      about to get started...
      (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
      (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
      (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
      (XEN) domain_crash_sync called from entry.S
      (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
      ...
      
      The reason is that at some point init_memory_mapping is going to reach
      the pagetable pages area and map those pages too (mapping them as normal
      memory that falls in the range of addresses passed to init_memory_mapping
      as argument). Some of those pages are already pagetable pages (they are
      in the range pgt_buf_start-pgt_buf_end) therefore they are going to be
      mapped RO and everything is fine.
      Some of these pages are not pagetable pages yet (they fall in the range
      pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
      are going to be mapped RW.  When these pages become pagetable pages and
      are hooked into the pagetable, Xen will find that the guest already has
      a RW mapping of them somewhere and fail the operation.
      The reason Xen requires pagetables to be RO is that the hypervisor needs
      to verify that the pagetables are valid before using them. The validation
      operations are called "pinning" (more details in arch/x86/xen/mmu.c).
      
      In order to fix the issue we mark all the pages in the entire range
      pgt_buf_start-pgt_buf_top as RO.  However, when the pagetable allocation
      is completed only the range pgt_buf_start-pgt_buf_end is reserved by
      init_memory_mapping.  Hence the kernel is going to crash as soon as one
      of the pages in the range pgt_buf_end-pgt_buf_top is reused (because
      those pages are still RO).
      
      For this reason, this function is introduced which is called _after_
      the init_memory_mapping has completed (in a perfect world we would
      call this function from init_memory_mapping, but let's ignore that).
      
      Because we are called _after_ init_memory_mapping the pgt_buf_[start,
      end,top] have all changed to new values (b/c another init_memory_mapping
      is called). Hence, the first time we enter this function, we save
      away the pgt_buf_start value and update the pgt_buf_[end,top].
      
      When we detect that the "old" pgt_buf_start through pgt_buf_end
      PFNs have been reserved (so memblock_x86_reserve_range has been called),
      we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
      
      And then we update those "old" pgt_buf_[end|top] with the new ones
      so that we can redo this on the next pagetable.
      Acked-by: "H. Peter Anvin" <hpa@zytor.com>
      Reviewed-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      [v1: Updated with Jeremy's comments]
      [v2: Added the crash output]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6 · adadfe48
      Linus Torvalds authored
      * 'for-linus' of git://git.infradead.org/ubifs-2.6:
        UBIFS: seek journal heads to the latest bud in replay
        UBIFS: do not free write-buffers when in R/O mode
    • Merge branch 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm · 625a3b60
      Linus Torvalds authored
      * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm: (47 commits)
        CLKDEV: Fix clkdev return value for NULL clk case
        ARM: 6891/1: prevent heap corruption in OABI semtimedop
        ARM: kprobes: Tidy-up kprobes-decode.c
        ARM: kprobes: Add emulation of hint instructions like NOP and WFI
        ARM: kprobes: Add emulation of SBFX, UBFX, BFI and BFC instructions
        ARM: kprobes: Add emulation of MOVW and MOVT instructions
        ARM: kprobes: Reject probing of undefined data processing instructions
        ARM: kprobes: Remove redundant code in space_1111
        ARM: kprobes: Fix emulation of PLD instructions
        ARM: kprobes: Reject probing of SETEND instructions
        ARM: kprobes: Consolidate stub decoding functions
        ARM: kprobes: Reject probing of all coprocessor instructions
        ARM: kprobes: Fix emulation of USAD8 instructions
        ARM: kprobes: Fix emulation of SMUAD, SMUSD and SMMUL instructions
        ARM: kprobes: Fix emulation of SXTB16, SXTB, SXTH, UXTB16, UXTB and UXTH instructions
        ARM: kprobes: Reject probing of undefined media instructions
        ARM: kprobes: Add emulation of RBIT instruction
        ARM: kprobes: Reject probing of LDRB instructions which load PC
        ARM: kprobes: Fix emulation of LDRD and STRD instructions
        ARM: kprobes: Reject probing of LDR/STR instructions which update PC unpredictably
        ...
    • genirq: Fix typo CONFIG_GENIRC_IRQ_SHOW_LEVEL · 94b2c363
      Geert Uytterhoeven authored
      Commit ab7798ff ("genirq: Expand generic
      show_interrupts()") added the Kconfig option GENERIC_IRQ_SHOW_LEVEL to
      accommodate PowerPC, but this doesn't actually enable the functionality due
      to a typo in the #ifdef check.
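
Why such a typo goes unnoticed can be shown with a short C sketch. The two function names are hypothetical; the macro is defined by hand here, standing in for what the build system would provide when the option is enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Defined by hand for this example, as if the Kconfig option were enabled. */
#define CONFIG_GENERIC_IRQ_SHOW_LEVEL 1

/* Misspelled guard: the macro is never defined, so the branch is dropped
 * silently.  The preprocessor gives no warning for an unknown name. */
static bool level_shown_with_typo(void)
{
#ifdef CONFIG_GENIRC_IRQ_SHOW_LEVEL
	return true;
#else
	return false;
#endif
}

/* Correctly spelled guard: the branch is compiled in. */
static bool level_shown_fixed(void)
{
#ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
	return true;
#else
	return false;
#endif
}
```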
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Linux/PPC Development <linuxppc-dev@lists.ozlabs.org>
      Link: http://lkml.kernel.org/r/%3Calpine.DEB.2.00.1104302251370.19068%40ayla.of.borg%3E
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • UBIFS: seek journal heads to the latest bud in replay · 52c6e6f9
      Artem Bityutskiy authored
      This is the second fix of the following symptom:
      
      UBIFS error (pid 34456): could not find an empty LEB
      
      which sometimes happens after power cuts when we mount the file-system - UBIFS
      refuses it with the above error message which comes from the
      'ubifs_rcvry_gc_commit()' function. I can reproduce this using the integck test
      with the UBIFS power cut emulation enabled.
      
      Analysis of the problem.
      
      Currently UBIFS replay seeks the journal heads to the last _replayed_ bud.
      But the buds are replayed out-of-order, so the replay basically seeks journal
      heads to the "random" bud belonging to this head, and not to the _last_ one.
      
      The result of this is that the GC head may be seeked to a full LEB with no free
      space, or very little free space. And 'ubifs_rcvry_gc_commit()' tries to find a
      fully or mostly dirty LEB to match the current GC head (because we need to
      garbage-collect that dirty LEB at one go, because we do not have @c->gc_lnum).
      So 'ubifs_find_dirty_leb()' fails and we fall back to finding an empty LEB and
      also fail. As a result - recovery fails and mounting fails.
      
      This patch teaches the replay to initialize the GC heads exactly to the latest
      buds, i.e. the buds which have the largest sequence number in corresponding
      log reference nodes.
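
The selection rule can be sketched in C as follows. This is a simplified model, not UBIFS code: `struct bud`, `pick_latest_buds()`, and the assumption of three journal heads are made up for the illustration. For each head it keeps the bud with the largest sequence number, whatever order the buds are visited in.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_HEADS 3  /* assumed number of journal heads for this sketch */

/* Hypothetical, simplified bud record; not the UBIFS data structure. */
struct bud {
	int jhead;                 /* journal head this bud belongs to */
	int lnum;                  /* LEB number of the bud */
	unsigned long long sqnum;  /* sequence number of its log reference node */
};

/*
 * For each journal head, store the LEB number of the bud with the largest
 * sequence number.  Heads that own no bud are left at -1.
 */
static void pick_latest_buds(const struct bud *buds, size_t n,
			     int latest_lnum[NUM_HEADS])
{
	unsigned long long best[NUM_HEADS] = { 0 };
	size_t i;
	int h;

	for (h = 0; h < NUM_HEADS; h++)
		latest_lnum[h] = -1;

	for (i = 0; i < n; i++) {
		h = buds[i].jhead;
		assert(h >= 0 && h < NUM_HEADS);
		if (latest_lnum[h] == -1 || buds[i].sqnum > best[h]) {
			best[h] = buds[i].sqnum;
			latest_lnum[h] = buds[i].lnum;
		}
	}
}
```

Seeking to the last bud visited instead would give a result that depends on replay order, which is the behavior the patch description calls "random".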
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Cc: stable@kernel.org
    • UBIFS: do not free write-buffers when in R/O mode · b50b9f40
      Artem Bityutskiy authored
      Currently UBIFS has a small optimization - it frees write-buffers when it is
      re-mounted from R/W mode to R/O mode. Of course, when it is mounted R/O, it
      does not allocate write-buffers either.
      
      This optimization is nice but it leads to subtle problems and complications
      in recovery, which I can reproduce using the integck test. The symptoms are
      that after a power cut the file-system cannot be mounted if we first mount
      it R/O, and then re-mount R/W - 'ubifs_rcvry_gc_commit()' prints:
      
      UBIFS error (pid 34456): could not find an empty LEB
      
      Analysis of the problem.

      When mounting R/W, the replay process sets journal heads to buds [1], but
      when mounting R/O - it does not do this, because the write-buffers are not
      allocated. So 'ubifs_rcvry_gc_commit()' works completely differently for the
      same file-system but for the following 2 cases:
      
      1. mounting R/W after a power cut and recovering
      2. mounting R/O after a power cut, re-mounting R/W and running deferred recovery

      In the former case, we have journal heads seeked to a bud, in the latter
      case, they are non-seeked (wbuf->lnum == -1). So in the latter case we do not
      try to recover the GC LEB by garbage-collecting to the GC head, but we just
      try to find an empty LEB, and there may be no empty LEBs, so we just fail.
      On the other hand, in the former case (mount R/W), we are able to make a GC LEB
      (@c->gc_lnum) by garbage-collecting.
      
      Thus, let's remove this small nice optimization and always allocate
      write-buffers. This should not make too big a difference - we have only 3
      of them, each of max. write unit size, which is usually 2KiB. So this is
      about 6KiB of RAM for the typical case, and only when mounted R/O.
      
      [1]: Note, currently the replay process is setting (seeking) the journal heads
      to _some_ buds, not necessarily to the buds which had been the journal heads
      before the power cut happened. This will be fixed separately.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Cc: stable@kernel.org