- 22 Jun, 2009 40 commits
-
-
Jonthan Brassow authored
This patch contains a device-mapper mirror log module that forwards requests to userspace for processing. The structures used for communication between kernel and userspace are located in include/linux/dm-log-userspace.h. Due to the frequency, diversity, and 2-way communication nature of the exchanges between kernel and userspace, 'connector' was chosen as the interface for communication. The first log implementations written in userspace - "clustered-disk" and "clustered-core" - support clustered shared storage. A userspace daemon (in the LVM2 source code repository) uses openAIS/corosync to process requests in an ordered fashion with the rest of the nodes in the cluster so as to prevent log state corruption. Other implementations with no association to LVM or openAIS/corosync, are certainly possible. (Imagine if two machines are writing to the same region of a mirror. They would both mark the region dirty, but you need a cluster-aware entity that can handle properly marking the region clean when they are done. Otherwise, you might clear the region when the first machine is done, not the second.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mike Snitzer authored
Currently, device-mapper maintains a separate instance of 'struct queue_limits' for each table of each device. When the configuration of a device is to be changed, first its table is loaded and this structure is populated, then the device is 'resumed' and the calculated queue_limits are applied. This places restrictions on how userspace may process related devices, where it is often advantageous to 'load' tables for several devices at once before 'resuming' them together. As the new queue_limits only take effect after the 'resume', if they are changing and one device uses another, the latter must be 'resumed' before the former may be 'loaded'. This patch moves the calculation of these queue_limits out of the 'load' operation into 'resume'. Since we are no longer pre-calculating this struct, we no longer need to maintain copies within our dm structs. dm_set_device_limits() now passes the 'start' of the device's data area (aka pe_start) as the 'offset' to blk_stack_limits(). init_valid_queue_limits() is replaced by blk_set_default_limits(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: martin.petersen@oracle.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mike Snitzer authored
create_log_context() must use the logical_block_size from the log disk, where the I/O happens, not the target's logical_block_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mike Snitzer authored
Add .iterate_devices to 'struct target_type' to allow a function to be called for all devices in a DM target. Implemented it for all targets except those in dm-snap.c (origin and snapshot). (The raid1 version number jumps to 1.12 because we originally reserved 1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors' instead.) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: martin.petersen@oracle.com
-
Mike Snitzer authored
Copy the table's queue_limits to the DM device's request_queue. This properly initializes the queue's topology limits and also avoids having to track the evolution of 'struct queue_limits' in dm_table_set_restrictions() Also fixes a bug that was introduced in dm_table_set_restrictions() via commit ae03bf63. In addition to establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit() also performs an allocation to setup the ISA DMA pool. This allocation resulted in "sleeping function called from invalid context" when called from dm_table_set_restrictions(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mike Snitzer authored
Use blk_stack_limits() to stack block limits (including topology) rather than duplicate the equivalent within Device Mapper. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mike Snitzer authored
Impose necessary and sufficient conditions on a devices's table such that any incoming bio which respects its logical_block_size can be processed successfully. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mike Snitzer authored
Ensure I/O is aligned to the logical block size of target devices. Rename check_device_area() to device_area_is_valid() for clarity and establish the device limits including the logical block size prior to calling it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Milan Broz authored
Add support for passing a 32 bit "cookie" into the kernel with the DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls. The (unsigned) value of this cookie is returned to userspace alongside the uevents issued by these ioctls in the variable DM_COOKIE. This means the userspace process issuing these ioctls can be notified by udev after udev has completed any actions triggered. To minimise the interface extension, we pass the cookie into the kernel in the event_nr field which is otherwise unused when calling these ioctls. Incrementing the version number allows userspace to determine in advance whether or not the kernel supports the cookie. If the kernel does support this but userspace does not, there should be no impact as the new variable will just get ignored. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Peter Rajnoha authored
Add a file named 'suspended' to each device-mapper device directory in sysfs. It holds the value 1 while the device is suspended. Otherwise it holds 0. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Jonthan Brassow authored
Report any devices forgotten to be freed before a table is destroyed. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Kiyoshi Ueda authored
This patch adds a service time oriented dynamic load balancer, dm-service-time, which selects the path with the shortest estimated service time for the incoming I/O. The service time is estimated by dividing the in-flight I/O size by a performance value of each path. The performance value can be given as a table argument at the table loading time. If no performance value is given, all paths are considered equal. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Kiyoshi Ueda authored
This patch adds a dynamic load balancer, dm-queue-length, which balances the number of in-flight I/Os across the paths. The code is based on the patch posted by Stefan Bader: https://www.redhat.com/archives/dm-devel/2005-October/msg00050.htmlSigned-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Kiyoshi Ueda authored
This patch makes two additions to the dm path selector interface for dynamic load balancers: o a new hook, start_io() o a new parameter 'nr_bytes' to select_path()/start_io()/end_io() to pass the size of the I/O start_io() is called when a target driver actually submits I/O to the selected path. Path selectors can use it to start accounting of the I/O. (e.g. counting the number of in-flight I/Os.) The start_io hook is based on the patch posted by Stefan Bader: https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html nr_bytes, the size of the I/O, is so path selectors can take the size of the I/O into account when deciding which path to use. dm-service-time uses it to estimate service time, for example. (Added the nr_bytes member to dm_mpath_io instead of using existing details.bi_size, since request-based dm patch deletes it.) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Send barrier requests when updating the exception area. Exception area updates need to be ordered w.r.t. data writes, so that the writes are not reordered in hardware disk cache. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
If -EOPNOTSUPP was returned and the request was a barrier request, retry it without barrier. Retry all regions for now. Barriers are submitted only for one-region requests, so it doesn't matter. (In the future, retries can be limited to the actual regions that failed.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Add another field, eopnotsupp_bits. It is subset of error_bits, representing regions that returned -EOPNOTSUPP. (The bit is set in both error_bits and eopnotsupp_bits). This value will be used in further patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Flush support for dm-snapshot target. This patch just forwards the flush request to either the origin or the snapshot device. (It doesn't flush exception store metadata.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Flush support for dm-multipath target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Flush support for dm-delay target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Flush support for dm-crypt target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Flush support for the stripe target. This sets ti->num_flush_requests to the number of stripes and remaps individual flush requests to the appropriate stripe devices. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Flush support for the linear target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Pass empty barrier flushes to the targets in dm_flush(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Alasdair G Kergon authored
Move repeated dm_target_io initialisation inside alloc_tio(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Introduce num_flush_requests for a target to set to say how many flush instructions (empty barriers) it wants to receive. These are sent by __clone_and_map_empty_barrier with map_info->flush_request going from 0 to (num_flush_requests - 1). Old targets without flush support won't receive any flush requests. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Remove the check that the size of the cloned bio is not zero because a subsequent patch needs to send zero-sized barriers down this path. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
If the underlying device doesn't support barriers and dm receives a barrier, it waits until all requests on that device drain so it no longer needs to report -EOPNOTSUPP to the caller. This patch deals with the confusing situation when moving a volume from one physical device to another triggers an EOPNOTSUPP on a volume that didn't report it before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
With the following patches, more than one error can occur during processing. Change md->barrier_error so that only the first one is recorded and returned to the caller. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
If barrier request was returned with DM_ENDIO_REQUEUE, requeue it in dm_wq_work instead of dec_pending. This allows us to correctly handle a situation when some targets are asking for a requeue and other targets signal an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Make dm_flush return void. The first error during flush is stored in md->barrier_error instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Fix a potential deadlock when creating multiple snapshots by holding a reference to struct block_device for the whole lifecycle of every dm device instead of obtaining it independently at each point it is needed. bdget_disk() was called while the device was being suspended, in dm_suspend(). However there could be other devices already suspended, for example when creating additional snapshots of a device. bdget_disk() can wait for IO and allocate memory resulting in waiting for the already-suspended device - deadlock. This patch changes the code so that it gets the reference to struct block_device when struct mapped_device is allocated and initialized in alloc_dev() where it is always OK to allocate memory or wait for I/O. It drops the reference when it is destroyed in free_dev(). Thus there is no call to bdget_disk() while any device is suspended. Previously unlock_fs() was called only if bdev was held. Now it is called unconditionally, but the superfluous calls are harmless because it returns immediately if the filesystem was not previously frozen. This patch also now allows the device size to be changed in a noflush suspend because the bdev is held. This has no adverse effect. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Rename suspended_bdev to bdev. This patch doesn't change any functionality, just renames the variable. In the next patch, the variable will be used even for non-suspended device. (Pre-requisite for the per-target barrier support patches.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Jonathan Brassow authored
When snapshots are created using 'p' instead of 'P' as the exception store type, the device-mapper table loading fails. This patch makes the code case insensitive as intended and fixes some regressions reported with device-mapper snapshots. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
Use i_size_read() instead of reading i_size. If someone changes the size of the device simultaneously, i_size_read is guaranteed to return a valid value (either the old one or the new one). i_size can return some intermediate invalid value (on 32-bit computers with 64-bit i_size, the reads to both halves of i_size can be interleaved with updates to i_size, resulting in garbage being returned). Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
A bio that has two or more vector entries, size less than or equal to page size, that crosses a stripe boundary of an underlying md device is accepted by device mapper (it conforms to all its limits) but not by the underlying device. The fix is: If device mapper selects the one-page maximum request size, it also needs to set its own q->merge_bvec_fn to reject any bios with multiple vector entries that span more pages. The problem was discovered in the following scenario: * MD - RAID-0 * LV on the top of it (raid1, snapshot or striped with chunk size/stripe larger than RAID-0 stripe) * one of the logical volumes is exported to xen domU * inside xen domU it is partitioned, the key point is that the partition must be unaligned on page boundary (fdisk normally aligns the partition to 63 sectors which will trigger it) * install the system on the partitioned disk in domU This causes I/O failures in dom0. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Mikulas Patocka authored
The commit fe9cf30e moves dm table event submission from kmultipath queue to kernel kevent queue to avoid a deadlock. There is a possibility of race condition because kevent queue is not flushed in the multipath destructor. The scenario is: - some event happens and is queued to keventd - keventd thread is delayed due to scheuling latency or some other work - multipath device is destroyed - keventd now attempts to process work_struct that is residing in already released memory. The patch flushes the keventd queue in multipath constructor. I've already fixed similar bug in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
-
Mikulas Patocka authored
If the code can't handle allocation failures, use __GFP_NOFAIL so that in case of memory pressure the allocator will retry indefinitely and won't return NULL which would cause a crash in the function. This is still not a correct fix, it may cause a classic deadlock when memory manager waits for I/O being done and I/O waits for some free memory. I/O code shouldn't allocate any memory. But in this case it probably doesn't matter much in practice, people usually do not swap on RAID. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Chandra Seetharaman authored
Fixed a problem affecting reinstatement of passive paths. Before we moved the hardware handler from dm to SCSI, it performed a pg_init for a path group and didn't maintain any state about each path in hardware handler code. But in SCSI dh, such state is now maintained, as we want to fail I/O early on a path if it is not the active path. All the hardware handlers have a state now and set to active or some form of inactive. They have prep_fn() which uses this state to fail the I/O without it ever being sent to the device. So in effect when dm-multipath calls scsi_dh_activate(), activate is sent to only one path and the "state" of that path is changed appropriately to "active" while other paths in the same path group are never changed as they never got an "activate". In order make sure all the paths in a path group gets their state set properly when a pg_init happens, we need to call scsi_dh_activate() on all paths in a path group. Doing this at the hardware handler layer is not a good option as we want the multipath layer to define the relationship between path and path groups and not the hardware handler. Attached patch sends an "activate" on each path in a path group when a path group is switched. It also sends an activate when a path is reinstated. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-
Hannes Reinecke authored
When specifying a different hardware handler via multipath features we should be able to override the built-in defaults. The problem here is the hardware table from scsi_dh is compiled in and cannot be changed from userland. The multipath.conf OTOH is purely user-defined and, what's more, the user might have a valid reason for modifying it. (EG EMC Clariion can well be run in PNR mode even though ALUA is active, or the user might want to try ALUA on any as-of-yet unknown devices) So _not_ allowing multipath to override the device handler setting will just add to the confusion and makes error tracking even more difficult. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
-