- 26 Nov, 2009 7 commits
-
-
Corrado Zoccolo authored
An incoming no-idle queue should preempt the active no-idle queue only if the active queue is idling because its service tree is empty. The previous code was buggy in two ways:

* it relied on the service_tree field being set on the active queue, while it is not set when the code is idling for a new request;
* it didn't check for the service-tree-empty condition, so it could lead to LIFO behaviour if multiple queues with depth > 1 were preempting each other on a non-NCQ device.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
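A minimal sketch of the corrected condition, assuming 2.6.32-era CFQ names (cfq_cfqq_wait_request, struct cfq_rb_root); illustrative only, not the literal hunk:

    /*
     * Allow an incoming no-idle queue to preempt only while the active
     * no-idle queue is idling because the serving service tree is empty.
     * 'st' is looked up from cfqd, so we do not rely on the service_tree
     * field being set on the active queue.
     */
    static bool no_idle_preempt_allowed(struct cfq_queue *active,
                                        struct cfq_rb_root *st)
    {
            return cfq_cfqq_wait_request(active) && RB_EMPTY_ROOT(&st->rb);
    }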
-
Corrado Zoccolo authored
CFQ's detection of queuing devices initially assumes a queuing device and detects if the queue depth reaches a certain threshold. However, it will reconsider this choice periodically.

Unfortunately, if the device is considered non-queuing, CFQ will force a unit queue depth for some workloads, thus defeating the detection logic. This leads to poor performance on queuing hardware, since the idle window remains enabled.

Given this premise, switching to hw_tag = 0 after we have proved at least once that the device is NCQ capable is not a good choice.

The new detection code starts in an indeterminate state, in which CFQ behaves as if hw_tag = 1. Then, if for a long observation period we never see a large depth, we switch to hw_tag = 0; otherwise we stick to hw_tag = 1 without reconsidering it again.

Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
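A rough sketch of that one-way detection; the field names (hw_tag_samples, hw_tag_est_depth) and the sample threshold are assumptions for illustration, not necessarily the exact code:

    /* cfqd->hw_tag: -1 = undetermined (behave as 1), 0/1 = final decision */
    static void update_hw_tag(struct cfq_data *cfqd, int depth_now)
    {
            if (depth_now > cfqd->hw_tag_est_depth)
                    cfqd->hw_tag_est_depth = depth_now;

            if (cfqd->hw_tag != -1)                 /* decided once, never revisit */
                    return;

            if (++cfqd->hw_tag_samples < 50)        /* keep observing */
                    return;

            /* observation window over: did we ever see real queue depth? */
            cfqd->hw_tag = (cfqd->hw_tag_est_depth > CFQ_HW_QUEUE_MIN) ? 1 : 0;
    }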
-
-
Vivek Goyal authored
There seems to be a regression in the direct write path due to the following commit in the for-2.6.33 branch of the block tree:

    commit 1af60fbd
    Author: Jeff Moyer <jmoyer@redhat.com>
    Date:   Fri Oct 2 18:56:53 2009 -0400

        block: get rid of the WRITE_ODIRECT flag

Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT sets the NOIDLE flag in the bio, and hence in the request. This tells CFQ not to expect more requests from the queue and not to idle on it (despite the fact that the queue's think time is low and it is not seeky). So direct writers lose big time when competing with sequential readers.

Using fio, I have run one direct writer and two sequential readers, and the following are the results with the 2.6.32-rc7 kernel and with the for-2.6.33 branch.

Test
====
1 direct writer and 2 sequential readers running simultaneously.

    [global]
    directory=/mnt/sdc/fio/
    runtime=10

    [seqwrite]
    rw=write
    size=4G
    direct=1

    [seqread]
    rw=read
    size=2G
    numjobs=2

2.6.32-rc7
==========
direct writes: aggrb=2,968KB/s
readers      : aggrb=101MB/s

for-2.6.33 branch
=================
direct write: aggrb=19KB/s
readers     : aggrb=137MB/s

This patch brings back the WRITE_ODIRECT flag, with the difference that we don't set the BIO_RW_UNPLUG flag, so the device is not unplugged after submission of the request and an explicit unplug from the submitter is required. That way we fix Jeff's issue of not enough merging taking place in the AIO path, as well as make sure direct writes get their fair share.

After the fix
=============
for-2.6.33 + fix
----------------
direct writes: aggrb=2,728KB/s
reads        : aggrb=103MB/s

Thanks
Vivek

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
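As a hedged sketch of how the flag combinations described above relate (the exact macro names and bit layout in fs.h are assumptions based on this description, not a quote):

    /* sync writes that also ask CFQ not to idle on the queue */
    #define WRITE_SYNC_PLUG  (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
    /* same, plus an implicit unplug after submission */
    #define WRITE_SYNC       (WRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))

    /* direct writes: sync, but neither NOIDLE nor an implicit unplug --
     * the submitter unplugs explicitly after queueing the whole batch */
    #define WRITE_ODIRECT    (WRITE | (1 << BIO_RW_SYNCIO))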
-
Corrado Zoccolo authored
cfq_should_idle returns false for no-idle queues that are not the last, so the control flow will never reach the removed code in a state that satisfies the if condition.

The unreachable code was added to emulate previous cfq behaviour for non-NCQ rotational devices. My tests show that even without it, performance and fairness are comparable with previous cfq, thanks to the fact that all seeky queues are grouped together, and that we idle at the end of the tree.

Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Ilya Loginov authored
The mtdblock driver doesn't call flush_dcache_page for pages in a request, which causes problems on architectures where the icache doesn't fill from the dcache, or with dcache aliases. This patch fixes that.

The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid pointless empty cache-thrashing loops on architectures for which flush_dcache_page() is a no-op. Every architecture defines this symbol: the pages are flushed on architectures where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is equal to 1, and nothing is done otherwise.

See the "fix mtd_blkdevs problem with caches on some architectures" discussion on LKML for more information.

Signed-off-by: Ilya Loginov <isloginov@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Peter Horton <phorton@bitbox.co.uk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
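A minimal sketch of such a helper, guarded by the new symbol and using the block layer's request segment iterator; illustrative rather than the exact helper introduced by the patch:

    #include <linux/blkdev.h>
    #include <linux/highmem.h>

    #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
    /* flush the dcache for every page backing the request's segments */
    static void flush_request_dcache_pages(struct request *rq)
    {
            struct req_iterator iter;
            struct bio_vec *bvec;

            rq_for_each_segment(bvec, rq, iter)
                    flush_dcache_page(bvec->bv_page);
    }
    #else
    static inline void flush_request_dcache_pages(struct request *rq)
    {
    }
    #endif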
-
Gui Jianfeng authored
For the moment, different workload cfq queues are put into different service trees. But CFQ still uses "busy_queues" to estimate the rb_key offset when inserting a cfq queue into a service tree. I think this isn't appropriate; it should make use of the service tree count to do this estimation. This patch is for the for-2.6.33 branch.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
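A hedged sketch of the estimation this changes; cfq_prio_slice() and the service tree's count field follow CFQ conventions, but the exact expression is illustrative:

    /* offset new queues by the number of queues already on *this*
     * service tree, instead of cfqd->busy_queues across all trees */
    static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
                                          struct cfq_queue *cfqq)
    {
            return (cfqq->service_tree->count - 1) *
                   (cfq_prio_slice(cfqd, 1, 0) -
                    cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
    }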
-
- 25 Nov, 2009 1 commit
-
-
Philipp Reisner authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
- 24 Nov, 2009 4 commits
-
-
Philipp Reisner authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Philipp Reisner authored
Since 8.3.3 we fail to do the resync when a partial resync is not possible but a full sync is necessary. This regression was introduced with commit 7101539930c0a89146959e7a39c09ad9c3516434. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Otherwise the 'state fixup' in the receiver will change the state to Unconnected, but the receiver will terminate itself, and any attempt at 'down'ing that drbd later will block forever. See also Bugz. #259. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
This is uncritical, as we still also serialize in userland, but to correctly serialize on the CONFIG_PENDING bit it must be wait_event(state_wait, !test_and_set_bit). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
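Spelled out, the serialization pattern referred to above looks roughly like this (the mdev->flags / mdev->state_wait names are drbd-style assumptions for illustration):

    /* sleep until we atomically win the CONFIG_PENDING bit; a lost race
     * re-checks instead of silently skipping the serialization */
    wait_event(mdev->state_wait,
               !test_and_set_bit(CONFIG_PENDING, &mdev->flags));

    /* ... configuration work ... */

    clear_bit(CONFIG_PENDING, &mdev->flags);
    wake_up(&mdev->state_wait);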
-
- 23 Nov, 2009 4 commits
-
-
Alex Chiang authored
A recent commit broke the ia64 build:

    Author: Don Brace <brace@beardog.cce.hp.com>
    Date:   Thu Nov 12 12:50:01 2009 -0600

        cciss: Add enhanced scatter-gather support.

because of this hunk:

    --- a/drivers/block/cciss.h
    +++ b/drivers/block/cciss.h
    +struct Cmd_sg_list {
    +	SGDescriptor_struct	*sgchain;
    +	dma64_addr_t		sg_chain_dma;
    +	int			chain_block_size;
    +};

The issue is that dma64_addr_t isn't #define'd on ia64. The way that we're using Cmd_sg_list.sg_chain_dma is to hold an address returned from pci_map_single():

    +	temp64.val = pci_map_single(h->pdev,
    +				    h->cmd_sg_list[c->cmdindex]->sgchain,
    +				    len, dir);
    +
    +	h->cmd_sg_list[c->cmdindex]->sg_chain_dma = temp64.val;

pci_map_single() returns a dma_addr_t too. This code will still work even on a 32-bit x86 build, where dma_addr_t is defined to be a u32, because it will simply be promoted to the __u64 that temp64.val is defined as. Thus, declaring Cmd_sg_list.sg_chain_dma as dma_addr_t is safe.

Cc: Don Brace <brace@beardog.cce.hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
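The resulting declaration, following the reasoning above (shown for illustration, with the same field layout as the quoted hunk):

    struct Cmd_sg_list {
    	SGDescriptor_struct	*sgchain;
    	dma_addr_t		sg_chain_dma;	/* handle from pci_map_single() */
    	int			chain_block_size;
    };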
-
Stephen M. Cameron authored
On driver unload, only free up the extra scatter gather data if they were allocated in the first place (the controller supports it) and don't forget to free up the sg_cmd_list array of pointers. Signed-off-by: Don Brace <brace@beardog.cce.hp.com> Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Karel Zak authored
The size of the EFI GPT header is not static; the whole sector is allocated for the header. The HeaderSize field must be greater than 92 (= sizeof(struct gpt_header)) and must be less than or equal to the logical block size.

It means we have to read the whole sector containing the header, because the header crc32 checksum is calculated according to HeaderSize.

For more details see the UEFI standard (version 2.3, May 2009):
  - 5.3.1 GUID Format overview, page 93
  - Table 13. GUID Partition Table Header, page 96

Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
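A minimal sketch of the validation this implies, loosely following fs/partitions/efi.c conventions (gpt_header fields, efi_crc32()); treat it as illustrative:

    u32 crc, origcrc;
    u32 hsz = le32_to_cpu(gpt->header_size);

    /* header must cover at least the 92 defined bytes and fit in one sector */
    if (hsz < 92 || hsz > bdev_logical_block_size(bdev))
            return 0;

    /* the crc32 covers exactly HeaderSize bytes, with the crc field
     * itself zeroed while checksumming */
    origcrc = le32_to_cpu(gpt->header_crc32);
    gpt->header_crc32 = 0;
    crc = efi_crc32((const unsigned char *)gpt, hsz);
    gpt->header_crc32 = cpu_to_le32(origcrc);

    if (crc != origcrc)
            return 0;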
-
Karel Zak authored
Currently, the kernel uses strictly 512-byte sectors for EFI GPT parsing. That's wrong.

The UEFI standard (version 2.3, May 2009, 5.3.1 GUID Format overview, page 95) defines that LBA is always based on the logical block size. That means bdev_logical_block_size() (aka BLKSSZGET) for Linux.

This patch removes the static sector size from the EFI GPT parser. The problem is reproducible with the latest GNU Parted:

 # modprobe scsi_debug dev_size_mb=50 sector_size=4096

 # ./parted /dev/sdb print
 Model: Linux scsi_debug (scsi)
 Disk /dev/sdb: 52.4MB
 Sector size (logical/physical): 4096B/4096B
 Partition Table: gpt

 Number  Start   End     Size    File system  Name     Flags
  1      24.6kB  3002kB  2978kB               primary
  2      3002kB  6001kB  2998kB               primary
  3      6001kB  9003kB  3002kB               primary

 # blockdev --rereadpt /dev/sdb
 # dmesg | tail -1
  sdb: unknown partition table      <---- !!!

with this patch:

 # blockdev --rereadpt /dev/sdb
 # dmesg | tail -1
  sdb: sdb1 sdb2 sdb3

Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
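A hedged sketch of the byte-offset computation implied above (the helper name is illustrative; bdev_logical_block_size() is the real accessor):

    /* GPT LBAs are in logical blocks, not fixed 512-byte units */
    static u64 gpt_lba_to_bytes(struct block_device *bdev, u64 lba)
    {
            return lba * (u64)bdev_logical_block_size(bdev);
    }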
-
- 13 Nov, 2009 11 commits
-
-
Stephen M. Cameron authored
cciss: Fix weird usage of ENXIO in cciss_scsi.c Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Don Brace authored
cciss: Add enhanced scatter-gather support. For controllers which support it, more than 512 scatter-gather elements per command may be used, and the max transfer size can be increased to 8192 blocks. Signed-off-by: Don Brace <brace@beardog.cce.hp.com> Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: Do not automatically rescan on UNIT ATTENTION/LUN DATA CHANGED There are problems with doing this. If, say, several logical drives are deleted at once, several such UNIT ATTENTIONS will be encountered, often during the rescan triggered by the first such UNIT ATTENTION. The block layer may be in the midst of trying to add logical drives which were just deleted (resulting in the subsequent UNIT ATTENTION(s).) Making the rescan code robust enough to tolerate this kind of thing is too complicated for the moment. So, for now, we just don't do it. Note: This UNIT ATTENTION/LUN DATA CHANGED situation only occurs on the MSA2012. Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: Remove unnecessary check in scan_thread Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: fix typo that causes scsi status to be lost. Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: remove sendcmd() as it is no longer used. Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: clean up code in cciss_shutdown. Send the flush cache command down with interrupts still enabled, and do not do DMA from the stack. Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: Remove the "withirq" parameter from various functions where possible Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: Retry driver initiated cmds with unit attention condition Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Stephen M. Cameron authored
cciss: Fix a problem in remove_from_scan_list: on driver unload, it doesn't remove the controller from the scan list correctly if the controller is currently being scanned for new devices. Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
Alex Chiang authored
cciss: Make device attributes static Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Alex Chiang <achiang@hp.com> Acked-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
- 11 Nov, 2009 1 commit
-
-
Randy Dunlap authored
Use HZ-independent calculation of milliseconds. Add jiffies.h where it was missing, since functions or macros from it are used. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
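For example, the HZ-independent form relies on the helpers from <linux/jiffies.h> (illustrative values):

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    /* before (HZ-dependent): 100 jiffies is 100 ms only when CONFIG_HZ=1000 */
    /*      schedule_timeout_interruptible(100);      */

    /* after (HZ-independent): always 100 ms, whatever CONFIG_HZ is */
    schedule_timeout_interruptible(msecs_to_jiffies(100));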
-
- 10 Nov, 2009 1 commit
-
-
Martin K. Petersen authored
While SSDs track block usage on a per-sector basis, RAID arrays often have allocation blocks that are bigger. Allow the discard granularity and alignment to be set and teach the topology stacking logic how to handle them. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
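A hedged sketch of what a driver exporting these limits might do (field names as in the block layer's struct queue_limits; the values are made up):

    /* e.g. a RAID LUN that deallocates in 256 KiB blocks, offset by
     * 64 KiB from the start of the device */
    q->limits.discard_granularity = 256 * 1024;   /* bytes */
    q->limits.discard_alignment   = 64 * 1024;    /* bytes */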
-
- 08 Nov, 2009 1 commit
-
-
Corrado Zoccolo authored
CFQ has a bug in the computation of next_rq that affects the transition between multiple sequential request streams in a single queue (e.g. two sequential buffered writers of the same priority), causing alternation between the two streams for a transient period.

  8,0    1    18737     0.260400660  5312  D   W 141653311 + 256
  8,0    1    20839     0.273239461  5400  D   W 141653567 + 256
  8,0    1    20841     0.276343885  5394  D   W 142803919 + 256
  8,0    1    20843     0.279490878  5394  D   W 141668927 + 256
  8,0    1    20845     0.292459993  5400  D   W 142804175 + 256
  8,0    1    20847     0.295537247  5400  D   W 141668671 + 256
  8,0    1    20849     0.298656337  5400  D   W 142804431 + 256
  8,0    1    20851     0.311481148  5394  D   W 141668415 + 256
  8,0    1    20853     0.314421305  5394  D   W 142804687 + 256
  8,0    1    20855     0.318960112  5400  D   W 142804943 + 256

The fix makes sure that the next_rq is computed from the last dispatched request, and is not affected by merging.

  8,0    1    37776     4.305161306     0  D   W 141738087 + 256
  8,0    1    37778     4.308298091     0  D   W 141738343 + 256
  8,0    1    37780     4.312885190     0  D   W 141738599 + 256
  8,0    1    37782     4.315933291     0  D   W 141738855 + 256
  8,0    1    37784     4.319064459     0  D   W 141739111 + 256
  8,0    1    37786     4.331918431  5672  D   W 142803007 + 256
  8,0    1    37788     4.334930332  5672  D   W 142803263 + 256
  8,0    1    37790     4.337902723  5672  D   W 142803519 + 256
  8,0    1    37792     4.342359774  5672  D   W 142803775 + 256
  8,0    1    37794     4.345318286     0  D   W 142804031 + 256

Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
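An illustrative sketch (not the literal hunk) of the idea: recompute the queue's next_rq from the request actually being dispatched, so that later merges cannot skew the choice:

    static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
    {
            struct cfq_data *cfqd = q->elevator->elevator_data;
            struct cfq_queue *cfqq = RQ_CFQQ(rq);

            /* pick the follow-up request relative to what we dispatch now */
            cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);

            cfq_remove_request(rq);
            cfqq->dispatched++;
            elv_dispatch_sort(q, rq);
    }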
-
- 04 Nov, 2009 10 commits
-
-
Philipp Reisner authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Philipp Reisner authored
When there are many blocks on the fly (ua), the AL gets into "starving" mode (random IO, scattered all over the device), and the connection gets interrupted, the receiver thread deadlocks in the drbd_disconnect() code path. Only nodes in the Primary role are affected. The bug most likely triggers on systems that mirror over "long distances".

This regression was introduced shortly before 8.3.3 with git commit 31e0f1250f174ac1ee317f360943a0159e19edc8.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Philipp Reisner authored
When IO gets frozen due to a broken fence-peer script, the user should be able to thaw IO by the resume-io command. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
To check whether we are truncating a very large device due to limited meta data space, we need to check the ll_dev size. Also improve the printk to suggest "flexible" or "internal". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
Lars Ellenberg authored
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
-
H Hartley Sweeten authored
Quiet sparse noise about symbols not being declared. The symbol blk_default_cmd_filter is only used locally and should be static. The function blk_scsi_ioctl_init() is an fs_initcall and should also be static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
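In concrete terms the change amounts to adding the storage-class qualifier (sketch; the surrounding declarations in block/scsi_ioctl.c are abridged):

    /* file-local: without 'static', sparse warns "symbol ... was not declared" */
    static struct blk_cmd_filter blk_default_cmd_filter;

    static int __init blk_scsi_ioctl_init(void)
    {
            /* set up the default command filter here */
            return 0;
    }
    fs_initcall(blk_scsi_ioctl_init);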
-
Changli Gao authored
sendfile(2) was reworked on top of the splice infrastructure, but it still wrongly checks f_op.sendpage() instead of f_op.splice_write(). Although currently f_op.splice_write() always exists whenever f_op.sendpage() exists, that assumption could silently break in the future. This patch also brings a side effect: sendfile(2) can work with any output file. Some security checks related to f_op are added too. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
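A hedged sketch of the corrected capability check in do_sendfile()-style code (the variable name is illustrative):

    /* sendfile(2) now runs on top of splice, so require the splice
     * method on the output file rather than the legacy sendpage hook */
    if (!out_file->f_op || !out_file->f_op->splice_write)
            return -EINVAL;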
-