1. 21 Apr, 2015 27 commits
    • md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab
      NeilBrown authored
      The default setting of 256 stripe_heads is probably
      much too small for many configurations.  So it is best to make it
      auto-configure.
      
      Shrinking the cache under memory pressure is easy.  The only
      interesting part here is that we put a fairly high cost
      ('seeks') on shrinking the cache, as the cost is greater than
      just having to read more data: it also reduces parallelism.
      
      Growing the cache on demand needs to be done carefully.  If we allow
      fast growth, that can upset memory balance as lots of dirty memory can
      quickly turn into lots of memory queued in the stripe_cache.
      It is important for the raid5 block device to appear congested to
      allow write-throttling to work.
      
      So we only add stripes slowly. We set a flag when an allocation
      fails because all stripes are in use, allocate at a convenient
      time when that flag is set, and don't allow it to be set again
      until at least one stripe_head has been released for re-use.
      
      This means that a spurt of requests will only cause one stripe_head
      to be allocated, but a steady stream of requests will slowly
      increase the cache size - until memory pressure puts it back again.
      
      It could take hours to reach a steady state.
      
      The value written to, and displayed in, stripe_cache_size is
      used as a minimum.  The cache can grow above this and shrink back
      down to it.  The actual size is not directly visible, though it can
      be deduced to some extent by watching stripe_cache_active.
      Signed-off-by: NeilBrown <neilb@suse.de>
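The grow/shrink policy described in this commit can be modelled in a few lines. This is only a sketch of the behaviour the message describes, not the kernel code: the class and method names are hypothetical, and the real implementation keeps its state in the raid5 configuration structure and shrinks through the kernel's shrinker interface.

```python
class StripeCacheModel:
    """Toy model of the slow-growth policy (all names are hypothetical)."""

    def __init__(self, min_stripes):
        self.min_stripes = min_stripes  # value written to stripe_cache_size
        self.size = min_stripes         # actual size, may exceed the minimum
        self.active = 0                 # stripes currently in use
        self.grow_needed = False        # set when an allocation fails
        self.grow_allowed = True        # re-armed only after a release

    def get_stripe(self):
        if self.active < self.size:
            self.active += 1
            return True
        # All stripes are in use: request growth, but only once per release.
        if self.grow_allowed:
            self.grow_needed = True
            self.grow_allowed = False
        return False

    def release_stripe(self):
        self.active -= 1
        self.grow_allowed = True

    def grow_if_needed(self):
        """Run at a 'convenient time': adds at most one stripe."""
        if self.grow_needed:
            self.size += 1
            self.grow_needed = False

    def shrink(self, count):
        """Memory pressure: drop idle stripes, never below the minimum."""
        idle = self.size - self.active
        drop = min(count, idle, self.size - self.min_stripes)
        self.size -= drop
        return drop


cache = StripeCacheModel(2)
cache.get_stripe()
cache.get_stripe()
for _ in range(5):          # a spurt of requests while the cache is full
    cache.get_stripe()
cache.grow_if_needed()
print(cache.size)           # the spurt grew the cache by a single stripe
```

In this model a burst of failed allocations adds one stripe, while repeated fail/release cycles add one stripe per cycle, which matches the "slow growth" described above.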
    • md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      NeilBrown authored
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644
      NeilBrown authored
      Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
      succeeds, do it inside the functions.
      
      Also choose the correct hash to handle next inside the functions.
      
      This removes duplication and will help with future new uses of
      {grow,drop}_one_stripe.
      
      This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
      message always said "0kB".
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79
      NeilBrown authored
      This is needed for future improvement to stripe cache management.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: introduce configuration option rmw_level · d06f191f
      Markus Stockhausen authored
      Depending on the available coding we allow optimized rmw logic for write
      operations. To support easier testing this patch allows manual control
      of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome. Enforcing this level can be beneficial for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
      equal. We might have higher CPU consumption because of calculating the
      parity twice but it can be beneficial otherwise. E.g. RAID4 with fast
      dedicated parity disk/SSD. The option is implemented just to be
      forward-looking and will ONLY work with this patch!
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
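The three levels can be read as a small decision function. The sketch below follows only the wording of this commit message; the function name and the way IO counts are passed in are assumptions, and the kernel's actual check may include further conditions.

```python
def use_rmw(rmw_level: int, rmw_ios: int, rcw_ios: int) -> bool:
    """Return True if read-modify-write should be chosen over reconstruct-write.

    rmw_ios / rcw_ios are the estimated IO counts for each strategy
    (hypothetical parameters, named here for illustration only).
    """
    if rmw_level == 0:
        return False                 # rmw disabled for all RAID types
    if rmw_level == 1:
        return rmw_ios < rcw_ios     # rmw only if it saves IOs (the default)
    if rmw_level == 2:
        return rmw_ios <= rcw_ios    # rmw even when the estimates are equal
    raise ValueError(f"unsupported rmw_level: {rmw_level}")


for level in (0, 1, 2):
    print(level, use_rmw(level, rmw_ios=3, rcw_ios=3))
```

The only difference between levels 1 and 2 is the tie case, which is why level 2 may cost extra CPU time without saving any IO.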
    • md/raid5: activate raid6 rmw feature · 584acdd4
      Markus Stockhausen authored
      Glue it all together. The raid6 rmw path should work the same as the
      already existing raid5 logic. So emulate the prexor handling/flags
      and split functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of a rmw run as we did it before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
      run. The lower layers will calculate start & end pages from that and
      call the xor_syndrome() correspondingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks. 300 seconds random write with 8 threads onto a
      3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
       512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6 algorithms: xor_syndrome() for SSE2 · a582564b
      Markus Stockhausen authored
      The second (and last) optimized XOR syndrome calculation. This version
      supports right and left side optimization. All CPUs with architecture
      older than Haswell will benefit from it.
      
      It should be noted that SSE2 movntdq kills performance for memory areas
      that are read and written simultaneously in chunks smaller than cache
      line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR
      functions.
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6 algorithms: xor_syndrome() for generic int · 9a5ce91d
      Markus Stockhausen authored
      Start the algorithms with the very basic one. It is left and right
      optimized. That means we can avoid all calculations for unneeded pages
      above the right stop offset. For pages below the left start offset we
      still need the syndrome multiplication but without reading data pages.
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6 algorithms: improve test program · 7e92e1d7
      Markus Stockhausen authored
      It is always helpful to have a test tool in place if we implement
      new data critical algorithms. So add some test routines to the raid6
      checker that can prove if the new xor_syndrome() works as expected.
      
      Run through all permutations of start/stop pages per algorithm and
      simulate a xor_syndrome() assisted rmw run. After each rmw check if
      the recovery algorithm still confirms that the stripe is fine.
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6 algorithms: delta syndrome functions · fe5cbc6e
      Markus Stockhausen authored
      v3: s-o-b comment, explanation of performance and decision for
      the start/stop implementation
      
      Implementing rmw functionality for RAID6 requires optimized syndrome
      calculation. Up to now we can only generate a complete syndrome. The
      target P/Q pages are always overwritten. With this patch we provide
      a framework for inplace P/Q modification. In the first place simply
      fill those functions with NULL values.
      
      xor_syndrome() has two additional parameters: start & stop. These
      will indicate the first and last page that are changing during a
      rmw run. That makes it possible to avoid several unnecessary loops
      and speed up calculation. The caller needs to implement the following
      logic to make the functions work.
      
      1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source
      blocks inside P/Q between (and including) start and stop.
      
      2) modify any block with start <= block <= stop
      
      3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of
      source blocks into P/Q between (and including) start and stop.
      
      Pages between start and stop that won't be changed should be filled
      with a pointer to the kernel zero page. The reasons for not taking NULL
      pages are:
      
      1) Algorithms cross the whole source data line by line. Thus avoid
      additional branches.
      
      2) Having a NULL page avoids calculating the XOR P parity but still
      needs calculation steps for the Q parity. Depending on the algorithm
      unrolling that might be only a difference of 2 instructions per loop.
      
      The benchmark numbers of the gen_syndrome() functions are displayed in
      the kernel log. Do the same for the xor_syndrome() functions. This
      will help to analyze performance problems and give a rough estimate of
      how well the algorithm works. The choice of the fastest algorithm will
      still depend on the gen_syndrome() performance.
      
      With the start/stop page implementation the speed can vary a lot in real
      life. E.g. a change of page 0 & page 15 on a stripe will be harder to
      compute than the case where page 0 & page 1 are XOR candidates. So as not
      to be too enthusiastic about the expected speeds, we will run a worst-case
      test that simulates a change on the upper half of the stripe. So we do:
      
      1) calculation of P/Q for the upper pages
      
      2) continuation of Q for the lower (empty) pages
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
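The remove / modify / reinsert sequence described in this commit can be checked with a small byte-wise model. This is a sketch under stated assumptions: it uses the usual RAID6 field GF(2^8) with polynomial 0x11d and generator 2, the Python function names only mirror the kernel names for readability, and unchanged blocks inside start..stop are passed as real data rather than as a zero page (both give the same result because the two XOR passes cancel).

```python
def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8) with polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def gen_syndrome(data):
    """Compute full P/Q over all data blocks (equal-length bytes objects)."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for block in reversed(data):            # Horner scheme, highest block first
        for i in range(n):
            p[i] ^= block[i]
            q[i] = gf_mul2(q[i]) ^ block[i]
    return bytes(p), bytes(q)


def xor_syndrome(data, start, stop, p, q):
    """XOR the P/Q contribution of blocks start..stop into existing P/Q."""
    n = len(p)
    dp, dq = bytearray(n), bytearray(n)
    for z in range(stop, start - 1, -1):    # blocks above stop are skipped
        for i in range(n):
            dp[i] ^= data[z][i]
            dq[i] = gf_mul2(dq[i]) ^ data[z][i]
    for _ in range(start):                  # below start: multiply only, no reads
        for i in range(n):
            dq[i] = gf_mul2(dq[i])
    return (bytes(a ^ b for a, b in zip(p, dp)),
            bytes(a ^ b for a, b in zip(q, dq)))


data = [bytes([i * 17 + j for j in range(4)]) for i in range(6)]
p, q = gen_syndrome(data)
p, q = xor_syndrome(data, 2, 3, p, q)       # 1) "remove" blocks 2..3
data[2] = b"\xde\xad\xbe\xef"               # 2) modify a block in the range
p, q = xor_syndrome(data, 2, 3, p, q)       # 3) "reinsert" blocks 2..3
assert (p, q) == gen_syndrome(data)         # delta update matches full recompute
```

The two loops in xor_syndrome() correspond to the right-side and left-side optimisations: nothing is computed for blocks above stop, and blocks below start only cost a multiplication step for Q.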
    • raid5: handle expansion/resync case with stripe batching · dabc4ec6
      shli@kernel.org authored
      Expansion/resync can grab a stripe while the stripe is in a batch list. Since
      all stripes in a batch list must be in the same state, we can't allow some
      stripes to run into expansion/resync. So we delay expansion/resync for stripes
      in a batch list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: handle io error of batch list · 72ac7330
      shli@kernel.org authored
      If an IO error happens in any stripe of a batch list, the batch list is
      split, and normal processing then runs for the stripes in the list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org authored
      The stripe cache works in 4k units. Even adjacent full stripe writes are
      handled in 4k units. Ideally we should use a bigger size for adjacent full
      stripe writes. A bigger stripe cache size means fewer stripes running in the
      state machine, which reduces CPU overhead. A bigger size also causes bigger
      IOs to be dispatched to the underlying disks.
      
      With this patch, we automatically batch adjacent full stripe writes
      together. Such stripes are added to a batch list. Only the first stripe
      of the list is put on handle_list and so runs handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes
      running in handle_stripe() and we send the IO of the whole stripe list together
      to increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, to make sure the
      stripes have the same parity disks. Stripes in a batch list must be in the same
      state (no written, toread and so on). If a stripe is in a batch list, all new
      reads/writes passed to add_stripe_bio will be blocked as overlap conflicts until
      the batch list is handled. The limitations make sure stripes in a batch list
      stay in exactly the same state throughout their life cycle.
      
      I tested 160k randwrite on a RAID5 array with 32k chunk size and 6
      PCIe SSDs. This patch improves performance by around 30%, and the IO size to
      the underlying disks is exactly 32k. I also ran a 4k randwrite test on the same
      array to make sure the performance isn't changed by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
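The batching limitations listed in this commit can be expressed as a simple eligibility check. The sketch below is an illustration only: the function names, the per-device sector arguments and the 8-sector stripe unit (4k in 512-byte sectors) are assumptions, and the kernel additionally checks stripe state flags that are not modelled here.

```python
STRIPE_SECTORS = 8  # one 4k stripe unit in 512-byte sectors (assumption)


def is_full_stripe_write(overwrite_disks: int, data_disks: int) -> bool:
    """A stripe is a full stripe write when every data disk is overwritten."""
    return overwrite_disks == data_disks


def can_batch(head_sector: int, next_sector: int, chunk_sectors: int,
              head_full: bool, next_full: bool) -> bool:
    """Return True if the next stripe may join the batch started by head."""
    if not (head_full and next_full):
        return False                      # only full stripe writes are batched
    if next_sector != head_sector + STRIPE_SECTORS:
        return False                      # stripes must be adjacent
    # Same chunk, so both stripes use the same parity disk.
    return head_sector // chunk_sectors == next_sector // chunk_sectors


chunk_sectors = 32 * 1024 // 512          # the 32k chunk from the test above
full = is_full_stripe_write(overwrite_disks=5, data_disks=5)
print(can_batch(0, 8, chunk_sectors, full, full))    # adjacent, same chunk
print(can_batch(56, 64, chunk_sectors, full, full))  # would cross a chunk boundary
```

With a 32k chunk and 4k stripes, at most eight stripes fit in one batch, which is consistent with the 32k IO size reported for the underlying disks.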
    • raid5: track overwrite disk count · 7a87f434
      shli@kernel.org authored
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org authored
      A fresh stripe with a write request can be batched. Any time the stripe is
      handled or a new read is queued, the flag is cleared.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org authored
      Use flex_array for scribble data. The next patch will batch several stripes
      together, so scribble data must be able to cover several stripes. This
      patch therefore also allocates scribble data for stripes across a chunk.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid · 753f2856
      Heinz Mauelshagen authored
      
      The patch makes 3 references to mddev->queue in the raid0 personality
      conditional so that the personality can be used from dm-raid.
      This is mandatory because md instances underneath dm-raid don't manage
      a request queue of their own, which would lead to oopses without the patch.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow resync to go faster when there is competing IO. · ac8fa419
      NeilBrown authored
      When md notices non-sync IO happening while it is trying
      to resync (or reshape or recover) it slows down to the
      set minimum.
      
      The default minimum might have made sense many years ago
      but the drives have become faster.  Changing the default
      to match the times isn't really a long term solution.
      
      This patch changes the code so that instead of waiting until the speed
      has dropped to the target, it just waits until pending requests
      have completed.
      This means that the delay inserted is a function of the speed
      of the devices.
      
      Testing shows that:
       - for some loads, the resync speed is unchanged.  For those loads
         increasing the minimum doesn't change the speed either.
         So this is a good result.  To increase resync speed under such
         loads we would probably need to increase the resync window
         size.
      
       - for other loads, resync speed does increase to a reasonable
         fraction (e.g. 20%) of maximum possible, and throughput of
         the load only drops a little bit (e.g. 10%)
      
       - for other loads, throughput of the non-sync load drops quite a bit
         more.  These seem to be latency-sensitive loads.
      
      So it isn't a perfect solution, but it is mostly an improvement.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown authored
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: don't require sync_min to be a multiple of chunk_size. · 50c37b13
      NeilBrown authored
      There is really no need for sync_min to be a multiple of
      chunk_size, and values read from here often aren't.
      That means you cannot read a value and expect to be able
      to write it back later.
      
      So remove the chunk_size check, and round down to a multiple
      of 4K, to be sure everything works with 4K-sector devices.
      Signed-off-by: NeilBrown <neilb@suse.de>
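The rounding rule in this commit is simple arithmetic. The sketch below assumes sync_min is expressed in 512-byte sectors, so a 4K boundary is every 8 sectors; the function name is hypothetical.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed unit of sync_min)


def round_sync_min(sectors: int) -> int:
    """Round a sync_min value down to a multiple of 4K."""
    sectors_per_4k = 4096 // SECTOR_SIZE   # 8 sectors
    return sectors - sectors % sectors_per_4k


print(round_sync_min(1003))   # rounds down to the previous 4K boundary
```

Because the result no longer depends on chunk_size, a value read back from the array only needs 4K alignment to be accepted when written again.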
    • Merge branch 'cluster' into for-next · d51e4fe6
      NeilBrown authored
    • md-cluster: re-add capabilities · 97f6cd39
      Goldwyn Rodrigues authored
      When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
      the clustered md:
      
      1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
         clear the Faulty bit in their respective rdev->flags.
      2. The node initiating re-add, gathers the bitmaps of all nodes
         and copies them into the local bitmap. It does not clear the bitmap
         from which it is copying.
      3. Initiating node schedules a md recovery to sync the devices.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: re-add a failed disk · a6da4ef8
      Goldwyn Rodrigues authored
      This adds the capability of re-adding a failed disk by
      writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
      
      This facilitates adding disks which have encountered a temporary
      error such as a network disconnection/hiccup in an iSCSI device,
      or a SAN cable disconnection which has been restored. In such
      a situation, you do not need to remove and re-add the device.
      Writing re-add to the failed device's state would add it again
      to the array and perform the recovery of only the blocks which
      were written after the device failed.
      
      This works for generic md, and is not related to clustering. However,
      this patch is to ease re-add operations listed above in clustering
      environments.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
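Triggering the re-add from a script amounts to writing the string "re-add" to the device's state attribute. The helper below is a minimal sketch: the function names and the sysfs_root parameter are hypothetical (the parameter exists only so the path logic can be exercised without a real array), and on a real system the write needs root privileges and a kernel that carries this patch.

```python
from pathlib import Path


def state_path(array: str, dev: str, sysfs_root: str = "/sys") -> Path:
    """Build /sys/block/<array>/md/dev-<dev>/state."""
    return Path(sysfs_root) / "block" / array / "md" / f"dev-{dev}" / "state"


def re_add(array: str, dev: str, sysfs_root: str = "/sys") -> None:
    """Ask md to re-add a failed member device, e.g. re_add('md0', 'sdb1')."""
    state_path(array, dev, sysfs_root).write_text("re-add")


print(state_path("md0", "sdb1"))
```

The array and device names in the usage line are placeholders; substitute the mdXX and dev-YYY names of the affected array.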
    • md-cluster: remove capabilities · 88bcfef7
      Goldwyn Rodrigues authored
      This adds "remove" capabilities for the clustered environment.
      When a user initiates removal of a device from the array, a
      REMOVE message with disk number in the array is sent to all
      the nodes which kick the respective device in their own array.
      
      This facilitates the removal of failed devices.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Export and rename find_rdev_nr_rcu · 57d051dc
      Goldwyn Rodrigues authored
      This is required by the clustering module (patches to follow) to
      find the device to remove or re-add.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Export and rename kick_rdev_from_array · fb56dfef
      Goldwyn Rodrigues authored
      This export is required for the clustering module in order to
      co-ordinate removing/re-adding an rdev on all nodes.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-cluster: correct the num for comparison · 8c58f02e
      Guoqing Jiang authored
      Since md-cluster node numbers start from zero, while
      cinfo->slot_number is the slot number of dlm,
      there is no need to check for equality.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 10 Apr, 2015 1 commit
  3. 08 Apr, 2015 1 commit
    • md: fix md io stats accounting broken · 74672d06
      Gu Zheng authored
      Simon reported the md io stats accounting issue:
      "
      I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
      It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
      the other two being write-mostly:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
      md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
      md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
      "
      The cause is that commit "18c0b223" uses
      generic_start_io_acct to account the disk stats instead of the open-coded
      accounting, but it also introduced an increment of .in_flight[rw] which md
      does not need. So we re-use the open-coded accounting here to fix it.
      Reported-by: Simon Kirby <sim@hostway.ca>
      Cc: <stable@vger.kernel.org> 3.19
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 06 Apr, 2015 8 commits
    • Linux 4.0-rc7 · f22e6e84
      Linus Torvalds authored
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 442bb4ba
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
          previously SACK'd, from Neal Cardwell.
      
       2) Need to hold RTNL mutex in ipv4 multicast code namespace cleanup,
          from Cong WANG.
      
       3) Similarly we have to hold RTNL mutex for fib_rules_unregister(), also
          from Cong WANG.
      
       4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.
      
       5) When we encapsulate for a tunnel device, skb->sk still points to the
          user socket.  So this leads to cases where we retraverse the
          ipv4/ipv6 output path with skb->sk being of some other address
          family (f.e. AF_PACKET).  This can cause things to crash since the
          ipv4 output path is dereferencing an AF_PACKET socket as if it were
          an ipv4 one.
      
          The short term fix for 'net' and -stable is to elide these socket
          checks once we've entered an encapsulation sequence by testing
          xmit_recursion.
      
          Longer term we have a better solution wherein we pass the tunnel's
          socket down through the output paths, but that is way too invasive
          for 'net' and -stable.
      
          From Hannes Frederic Sowa.
      
       6) l2tp_init() failure path forgets to unregister per-net ops, from
          Cong WANG.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
        net: dsa: fix filling routing table from OF description
        l2tp: unregister l2tp_net_ops on failure path
        mvneta: dont call mvneta_adjust_link() manually
        ipv6: protect skb->sk accesses from recursive dereference inside the stack
        netns: don't allocate an id for dead netns
        Revert "netns: don't clear nsid too early on removal"
        ip6mr: call del_timer_sync() in ip6mr_free_table()
        net: move fib_rules_unregister() under rtnl lock
        ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
        tcp: fix FRTO undo on cumulative ACK of SACKed range
        xen-netfront: transmit fully GSO-sized packets
    • net/mlx4_core: Fix error message deprecation for ConnectX-2 cards · fde913e2
      Jack Morgenstein authored
      Commit 1daa4303 ("net/mlx4_core: Deprecate error message at
      ConnectX-2 cards startup to debug") did the deprecation only for port 1
      of the card. Need to deprecate for port 2 as well.
      
      Fixes: 1daa4303 ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: fix filling routing table from OF description · 30303813
      Pavel Nakonechny authored
      According to the description in 'include/net/dsa.h', in cascaded switch
      configurations with more than one interconnected device, the
      'rtable' array in the 'dsa_chip_data' structure indicates which
      port on this switch should be used to send packets that are destined
      for the corresponding switch.
      
      However, dsa_of_setup_routing_table() fills 'rtable' with port numbers
      of the _target_ switch, not the current one.
      
      This commit removes redundant devicetree parsing and adds needed port
      number as a function argument. So dsa_of_setup_routing_table() now just
      looks for target switch number by parsing parent of 'link' device node.
      
      To remove possible misunderstandings with the way of determining target
      switch number, a corresponding comment was added to the source code and
      to the DSA device tree bindings documentation file.
      
      This was tested on a custom board with two Marvell 88E6095 switches with
      following corresponding routing tables: { -1, 10 } and { 8, -1 }.
      Signed-off-by: Pavel Nakonechny <pavel.nakonechny@skitlab.ru>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
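The intended meaning of 'rtable' can be illustrated with the two tables quoted in this commit. The sketch reads them as "switch 0 reaches switch 1 through its own port 10" and "switch 1 reaches switch 0 through its own port 8", and treats -1 as "no route" (the entry for the switch itself). That reading, the helper name and the dictionary layout are assumptions made for illustration.

```python
def port_towards(rtable, target_switch):
    """Return the local port used to reach target_switch."""
    port = rtable[target_switch]
    if port < 0:
        raise ValueError("no route: target is this switch or is unreachable")
    return port


# Routing tables from the test setup above, indexed by target switch number.
RTABLES = {
    0: [-1, 10],   # table held by switch 0
    1: [8, -1],    # table held by switch 1
}

print(port_towards(RTABLES[0], 1))
print(port_towards(RTABLES[1], 0))
```

Each table is indexed by the target switch but holds ports of the switch that owns the table, which is the distinction the fix restores.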
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 9e441639
      Linus Torvalds authored
      Pull input fixes from Dmitry Torokhov:
       "Updates for the input subsystem - two more tweaks for ALPS driver to
        work out kinks after splitting the touchpad, trackstick, and potential
        external PS/2 mouse into separate input devices.
      
        Changes to support ALPS SS4 devices (protocol V8) will be coming in
        4.1..."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: alps - document stick behavior for protocol V2
        Input: alps - report V2 Dualpoint Stick events via the right evdev node
        Input: alps - report interleaved bare PS/2 packets via dev3
    • WANG Cong's avatar
      67e04c29
    • mvneta: dont call mvneta_adjust_link() manually · ecf7b361
      Stas Sergeev authored
      mvneta_adjust_link() is a callback for of_phy_connect() and should
      not be called directly. The result of calling it directly is as below:
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org authored
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 05 Apr, 2015 3 commits