1. 21 Apr, 2015 15 commits
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org authored
      The stripe cache is 4k in size. Even adjacent full stripe writes are handled
      in 4k units. Ideally we should use a bigger size for adjacent full stripe
      writes. A bigger stripe cache size means fewer stripes running in the state
      machine, which reduces CPU overhead. A bigger size also causes bigger IOs to
      be dispatched to the underlying disks.
      
      With this patch, we automatically batch adjacent full stripe writes
      together. Such stripes are added to a batch list. Only the first stripe
      of the list is put on handle_list and so runs handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, fewer stripes run
      in handle_stripe() and we send the IO of the whole stripe list together to
      increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, to make sure the
      stripes have the same parity disks. Stripes in a batch list must be in the
      same state (no written, toread and so on). If a stripe is in a batch list,
      all new reads/writes passed to add_stripe_bio will be blocked as overlap
      conflicts until the batch list is handled. The limitations make sure stripes
      in a batch list are in exactly the same state throughout their life cycle.
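
The batching limitations above can be sketched as a small predicate. This is only an illustration under stated assumptions: `struct stripe`, `can_batch_after()` and `is_full_stripe_write()` as written here are hypothetical simplifications, not the kernel's data structures or functions, and sectors are assumed to be 512 bytes so that one 4k stripe unit is 8 sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_SECTORS 8u  /* one 4k stripe-cache unit in 512-byte sectors */

/* Hypothetical, simplified stripe; NOT the kernel's struct stripe_head. */
struct stripe {
    uint64_t sector;      /* start sector of the stripe on each member disk */
    int overwrite_disks;  /* data disks that are completely overwritten */
    bool batch_ready;     /* fresh write-only stripe, not yet handled */
};

/* A full stripe write overwrites every data disk of the stripe. */
static bool is_full_stripe_write(const struct stripe *sh, int data_disks)
{
    return sh->overwrite_disks == data_disks;
}

/* May 'sh' join a batch directly behind 'prev'? */
static bool can_batch_after(const struct stripe *prev, const struct stripe *sh,
                            int data_disks, uint64_t chunk_sectors)
{
    assert(data_disks > 0 && chunk_sectors >= STRIPE_SECTORS);

    /* Both stripes must still be in the fresh, write-only state. */
    if (!prev->batch_ready || !sh->batch_ready)
        return false;
    /* Only full stripe writes may be batched. */
    if (!is_full_stripe_write(prev, data_disks) ||
        !is_full_stripe_write(sh, data_disks))
        return false;
    /* Only the directly adjacent 4k stripe qualifies. */
    if (sh->sector != prev->sector + STRIPE_SECTORS)
        return false;
    /* Staying inside one chunk keeps the parity disk the same. */
    return prev->sector / chunk_sectors == sh->sector / chunk_sectors;
}
```

With a 32k chunk (64 sectors) and 5 data disks, the stripes at sectors 0 and 8 may be batched, while the stripes at sectors 56 and 64 may not, because they sit on opposite sides of a chunk boundary.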
      
      I tested 160k randwrite on a RAID5 array with a 32k chunk size and 6 PCIe
      SSDs. This patch improves performance by around 30%, and the IO size to the
      underlying disks is exactly 32k. I also ran a 4k randwrite test on the same
      array to make sure the performance isn't changed by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: track overwrite disk count · 7a87f434
      shli@kernel.org authored
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org authored
      A fresh stripe with a write request can be batched. Any time the stripe is
      handled or a new read is queued, the flag is cleared.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org authored
      Use flex_array for scribble data. Next patch will batch several stripes
      together, so scribble data should be able to cover several stripes, so this
      patch also allocates scribble data for stripes across a chunk.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md raid0: access mddev->queue (request queue member) conditionally because it... · 753f2856
      Heinz Mauelshagen authored
      md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
      
      The patch makes 3 references to mddev->queue in the raid0 personality
      conditional so that the personality can be used from dm-raid.
      This is mandatory because md instances underneath dm-raid don't manage
      a request queue of their own, which would lead to oopses without the patch.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow resync to go faster when there is competing IO. · ac8fa419
      NeilBrown authored
      When md notices non-sync IO happening while it is trying
      to resync (or reshape or recover) it slows down to the
      set minimum.
      
      The default minimum might have made sense many years ago
      but the drives have become faster.  Changing the default
      to match the times isn't really a long term solution.
      
      This patch changes the code so that instead of waiting until the speed
      has dropped to the target, it just waits until pending requests
      have completed.
      This means that the delay inserted is a function of the speed
      of the devices.
      
      Testing shows that:
       - for some loads, the resync speed is unchanged.  For those loads
         increasing the minimum doesn't change the speed either.
         So this is a good result.  To increase resync speed under such
         loads we would probably need to increase the resync window
         size.
      
       - for other loads, resync speed does increase to a reasonable
         fraction (e.g. 20%) of maximum possible, and throughput of
         the load only drops a little bit (e.g. 10%)
      
       - for other loads, throughput of the non-sync load drops quite a bit
         more.  These seem to be latency-sensitive loads.
      
      So it isn't a perfect solution, but it is mostly an improvement.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown authored
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: don't require sync_min to be a multiple of chunk_size. · 50c37b13
      NeilBrown authored
      There is really no need for sync_min to be a multiple of
      chunk_size, and values read from here often aren't.
      That means you cannot read a value and expect to be able
      to write it back later.
      
      So remove the chunk_size check, and round down to a multiple
      of 4K, to be sure everything works with 4K-sector devices.
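
The rounding described above can be sketched as follows. This is an illustration rather than the patch itself: `round_down_to_4k()` is a hypothetical helper, and it assumes the value is expressed in 512-byte sectors, so that 4K corresponds to 8 sectors.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES   512u                    /* assumed sector unit */
#define SECTORS_PER_4K (4096u / SECTOR_BYTES)  /* 8 sectors */

static_assert(4096u % SECTOR_BYTES == 0, "4K must be a whole number of sectors");

/* Round a sector count down to the nearest multiple of 4K. */
static uint64_t round_down_to_4k(uint64_t sectors)
{
    return sectors - (sectors % SECTORS_PER_4K);
}
```

Because any value is accepted and rounding is idempotent, a value produced this way can be written back unchanged, which is the read-then-write-back property the commit message asks for.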
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Merge branch 'cluster' into for-next · d51e4fe6
      NeilBrown authored
    • md-cluster: re-add capabilities · 97f6cd39
      Goldwyn Rodrigues authored
      When "re-add" is written to /sys/block/mdXX/md/dev-YYY/state,
      the clustered md:
      
      1. Sends RE_ADD message with the desc_nr. Nodes receiving the message
         clear the Faulty bit in their respective rdev->flags.
      2. The node initiating re-add gathers the bitmaps of all nodes
         and copies them into the local bitmap. It does not clear the bitmap
         from which it is copying.
      3. The initiating node schedules an md recovery to sync the devices.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: re-add a failed disk · a6da4ef8
      Goldwyn Rodrigues authored
      This adds the capability of re-adding a failed disk by
      writing "re-add" to /sys/block/mdXX/md/dev-YYY/state.
      
      This facilitates adding disks which have encountered a temporary
      error such as a network disconnection/hiccup in an iSCSI device,
      or a SAN cable disconnection which has been restored. In such
      a situation, you do not need to remove and re-add the device.
      Writing re-add to the failed device's state would add it again
      to the array and perform the recovery of only the blocks which
      were written after the device failed.
      
      This works for generic md, and is not related to clustering. However,
      this patch is to ease re-add operations listed above in clustering
      environments.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-cluster: remove capabilities · 88bcfef7
      Goldwyn Rodrigues authored
      This adds "remove" capabilities for the clustered environment.
      When a user initiates removal of a device from the array, a
      REMOVE message with disk number in the array is sent to all
      the nodes which kick the respective device in their own array.
      
      This facilitates the removal of failed devices.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Export and rename find_rdev_nr_rcu · 57d051dc
      Goldwyn Rodrigues authored
      This is required by the clustering module (patches to follow) to
      find the device to remove or re-add.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Export and rename kick_rdev_from_array · fb56dfef
      Goldwyn Rodrigues authored
      This export is required by the clustering module in order to
      coordinate removing or re-adding an rdev across all nodes.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-cluster: correct the num for comparison · 8c58f02e
      Guoqing Jiang authored
      
      Since md-cluster node numbers start from zero, and
      cinfo->slot_number represents the dlm slot number,
      there is no need to check for equality.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 10 Apr, 2015 1 commit
  3. 08 Apr, 2015 1 commit
    • md: fix md io stats accounting broken · 74672d06
      Gu Zheng authored
      Simon reported the md io stats accounting issue:
      "
      I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
      It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
      the other two being write-mostly:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00   345.00    0.00    0.00    0.00   0.00 100.00
      md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00 58779.00    0.00    0.00    0.00   0.00 100.00
      md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00    12.00    0.00    0.00    0.00   0.00 100.00
      "
      The cause is that commit "18c0b223" uses generic_start_io_acct to
      account the disk stats rather than the open-coded accounting, but this
      also introduced an increment of .in_flight[rw], which md does not need.
      So we go back to the open-coded accounting here to fix it.
      Reported-by: Simon Kirby <sim@hostway.ca>
      Cc: <stable@vger.kernel.org> 3.19
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 06 Apr, 2015 8 commits
    • Linux 4.0-rc7 · f22e6e84
      Linus Torvalds authored
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 442bb4ba
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) In TCP, don't register an FRTO for cumulatively ACK'd data that was
          previously SACK'd, from Neal Cardwell.
      
       2) Need to hold RTNL mutex in ipv4 multicast code namespace cleanup,
          from Cong WANG.
      
       3) Similarly we have to hold RTNL mutex for fib_rules_unregister(), also
          from Cong WANG.
      
       4) Revert and rework netns nsid allocation fix, from Nicolas Dichtel.
      
       5) When we encapsulate for a tunnel device, skb->sk still points to the
          user socket.  So this leads to cases where we retraverse the
          ipv4/ipv6 output path with skb->sk being of some other address
          family (e.g. AF_PACKET).  This can cause things to crash since the
          ipv4 output path is dereferencing an AF_PACKET socket as if it were
          an ipv4 one.
      
          The short term fix for 'net' and -stable is to elide these socket
          checks once we've entered an encapsulation sequence by testing
          xmit_recursion.
      
          Longer term we have a better solution wherein we pass the tunnel's
          socket down through the output paths, but that is way too invasive
          for 'net' and -stable.
      
          From Hannes Frederic Sowa.
      
       6) l2tp_init() failure path forgets to unregister per-net ops, from
          Cong WANG.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net/mlx4_core: Fix error message deprecation for ConnectX-2 cards
        net: dsa: fix filling routing table from OF description
        l2tp: unregister l2tp_net_ops on failure path
        mvneta: dont call mvneta_adjust_link() manually
        ipv6: protect skb->sk accesses from recursive dereference inside the stack
        netns: don't allocate an id for dead netns
        Revert "netns: don't clear nsid too early on removal"
        ip6mr: call del_timer_sync() in ip6mr_free_table()
        net: move fib_rules_unregister() under rtnl lock
        ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
        tcp: fix FRTO undo on cumulative ACK of SACKed range
        xen-netfront: transmit fully GSO-sized packets
    • net/mlx4_core: Fix error message deprecation for ConnectX-2 cards · fde913e2
      Jack Morgenstein authored
      Commit 1daa4303 ("net/mlx4_core: Deprecate error message at
      ConnectX-2 cards startup to debug") did the deprecation only for port 1
      of the card. Need to deprecate for port 2 as well.
      
      Fixes: 1daa4303 ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: fix filling routing table from OF description · 30303813
      Pavel Nakonechny authored
      According to the description in 'include/net/dsa.h', in cascaded switch
      configurations where there is more than one interconnected device, the
      'rtable' array in the 'dsa_chip_data' structure indicates which port on
      this switch should be used to send packets that are destined for the
      corresponding switch.
      
      However, dsa_of_setup_routing_table() fills 'rtable' with port numbers
      of the _target_ switch, not of the current one.
      
      This commit removes redundant devicetree parsing and adds needed port
      number as a function argument. So dsa_of_setup_routing_table() now just
      looks for target switch number by parsing parent of 'link' device node.
      
      To remove possible misunderstandings with the way of determining target
      switch number, a corresponding comment was added to the source code and
      to the DSA device tree bindings documentation file.
      
      This was tested on a custom board with two Marvell 88E6095 switches with
      following corresponding routing tables: { -1, 10 } and { 8, -1 }.
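
How such a routing table is meant to be read can be sketched as below, using the two tables quoted above. This is a hedged illustration: `struct chip`, `MAX_SWITCHES` and `port_towards()` are hypothetical names rather than the DSA core's definitions, and reading -1 as "this switch itself" is an assumption based on where -1 appears in the quoted tables.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SWITCHES 4  /* hypothetical limit for this sketch */

/* Hypothetical stand-in for per-switch data; not 'struct dsa_chip_data'. */
struct chip {
    /* rtable[t] = port on THIS switch that leads towards switch t */
    int8_t rtable[MAX_SWITCHES];
};

/* Local port on 'self' to use for packets destined for 'target_switch'. */
static int port_towards(const struct chip *self, int target_switch)
{
    assert(target_switch >= 0 && target_switch < MAX_SWITCHES);
    return self->rtable[target_switch];
}
```

With the tables { -1, 10 } for switch 0 and { 8, -1 } for switch 1, switch 0 reaches switch 1 through its own port 10, and switch 1 reaches switch 0 through its own port 8. Storing the peer's port number in these slots instead is the mistake the commit message describes.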
      Signed-off-by: Pavel Nakonechny <pavel.nakonechny@skitlab.ru>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 9e441639
      Linus Torvalds authored
      Pull input fixes from Dmitry Torokhov:
       "Updates for the input subsystem - two more tweaks for ALPS driver to
        work out kinks after splitting the touchpad, trackstick, and potential
        external PS/2 mouse into separate input devices.
      
        Changes to support ALPS SS4 devices (protocol V8) will be coming in
        4.1..."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: alps - document stick behavior for protocol V2
        Input: alps - report V2 Dualpoint Stick events via the right evdev node
        Input: alps - report interleaved bare PS/2 packets via dev3
    • WANG Cong's avatar
      67e04c29
    • mvneta: dont call mvneta_adjust_link() manually · ecf7b361
      Stas Sergeev authored
      mvneta_adjust_link() is a callback for of_phy_connect() and should
      not be called directly. The result of calling it directly is as below:
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org authored
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
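
The idea of ignoring skb->sk at recursion levels above zero can be sketched for the fragment-size case (item 1 above). This is a minimal illustration and not the actual patch: `struct user_sock`, `struct packet` and `effective_frag_size()` are hypothetical stand-ins, and the recursion level is passed in as a plain integer instead of being read from per-CPU state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct sock / struct sk_buff. */
struct user_sock { unsigned int frag_size; };  /* 0 = no per-socket limit */
struct packet    { const struct user_sock *sk; };

/* Fragment size to use for 'pkt' given the path MTU and recursion level. */
static unsigned int effective_frag_size(const struct packet *pkt,
                                        unsigned int mtu,
                                        int xmit_recursion)
{
    assert(pkt != NULL);
    assert(xmit_recursion >= 0);

    /* In a nested transmit the socket belongs to the inner (user) flow,
     * so treat the packet as if it had no socket at all. */
    const struct user_sock *sk = (xmit_recursion == 0) ? pkt->sk : NULL;

    if (sk != NULL && sk->frag_size != 0 && sk->frag_size < mtu)
        return sk->frag_size;
    return mtu;
}
```

At recursion level 0 a socket limit of 1280 wins over an MTU of 1500, while at level 1 the same socket setting is ignored and the MTU is used, so local socket settings no longer leak into the encapsulated path.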
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 05 Apr, 2015 3 commits
  6. 04 Apr, 2015 3 commits
    • Merge tag 'usb-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · f8b3d8a5
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes and new device ids for 4.0-rc6.  Nothing
        major, some xhci fixes for reported problems, and some usb-serial
        device ids.
      
        All have been in linux-next for a while"
      
      * tag 'usb-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        USB: ftdi_sio: Use jtag quirk for SNAP Connect E10
        usb: isp1760: fix spin unlock in the error path of isp1760_udc_start
        usb: xhci: apply XHCI_AVOID_BEI quirk to all Intel xHCI controllers
        usb: xhci: handle Config Error Change (CEC) in xhci driver
        USB: keyspan_pda: add new device id
        USB: ftdi_sio: Added custom PID for Synapse Wireless product
    • Merge tag 'staging-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 8eb6dcf9
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are some staging driver fixes, well, really all just IIO driver
        fixes, for 4.0-rc6.  They fix issues that have been reported with
        these drivers.
      
        All of these patches have been in linux-next for a while"
      
      * tag 'staging-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: imu: Use iio_trigger_get for indio_dev->trig assignment
        iio: adc: vf610: use ADC clock within specification
        iio/adc/cc10001_adc.c: Fix !HAS_IOMEM build
        iio: core: Fix double free.
        iio:inv-mpu6050: Fix inconsistency for the scale channel
        staging: iio: dummy: Fix undefined symbol build error
        iio: inv_mpu6050: Clear timestamps fifo while resetting hardware fifo
        staging: iio: hmc5843: Set iio name property in sysfs
        iio: bmc150: change sampling frequency
        iio: fix drivers that check buffer->scan_mask
    • Merge tag 'tty-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · eca8258b
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are 3 serial driver fixes for 4.0-rc6.  They fix some reported
        issues with the samsung and fsl_lpuart drivers.
      
        All have been in linux-next for a while"
      
      * tag 'tty-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: serial: fsl_lpuart: clear receive flag on FIFO flush
        tty: serial: fsl_lpuart: specify transmit FIFO size
        serial: samsung: Clear operation mode on UART shutdown
  7. 03 Apr, 2015 9 commits