Commit 486088bc authored by Linus Torvalds

Merge tag 'standardize-docs' of git://git.lwn.net/linux

Pull documentation format standardization from Jonathan Corbet:
 "This series converts a number of top-level documents to the RST format
  without incorporating them into the Sphinx tree. The hope is to bring
  some uniformity to kernel documentation and, perhaps more importantly,
  have our existing docs serve as an example of the desired formatting
  for those that will be added later.

  Mauro has gone through and fixed up a lot of top-level documentation
  files to make them conform to the RST format, but without moving or
  renaming them in any way. This will help when we incorporate the ones
  we want to keep into the Sphinx doctree, but the real purpose is to
  bring a bit of uniformity to our documentation and let the top-level
  docs serve as examples for those writing new ones"

* tag 'standardize-docs' of git://git.lwn.net/linux: (84 commits)
  docs: kprobes.txt: Fix whitespacing
  tee.txt: standardize document format
  cgroup-v2.txt: standardize document format
  dell_rbu.txt: standardize document format
  zorro.txt: standardize document format
  xz.txt: standardize document format
  xillybus.txt: standardize document format
  vfio.txt: standardize document format
  vfio-mediated-device.txt: standardize document format
  unaligned-memory-access.txt: standardize document format
  this_cpu_ops.txt: standardize document format
  svga.txt: standardize document format
  static-keys.txt: standardize document format
  smsc_ece1099.txt: standardize document format
  SM501.txt: standardize document format
  siphash.txt: standardize document format
  sgi-ioc4.txt: standardize document format
  SAK.txt: standardize document format
  rpmsg.txt: standardize document format
  robust-futexes.txt: standardize document format
  ...
parents 52f6c588 43e5f7e1
Dynamic DMA mapping Guide
=========================
=========================
Dynamic DMA mapping Guide
=========================
David S. Miller <davem@redhat.com>
Richard Henderson <rth@cygnus.com>
Jakub Jelinek <jakub@redhat.com>
:Author: David S. Miller <davem@redhat.com>
:Author: Richard Henderson <rth@cygnus.com>
:Author: Jakub Jelinek <jakub@redhat.com>
This is a guide to device driver writers on how to use the DMA API
with example pseudo-code. For a concise description of the API, see
DMA-API.txt.
CPU and DMA addresses
CPU and DMA addresses
=====================
There are several kinds of addresses involved in the DMA API, and it's
important to understand the differences.
The kernel normally uses virtual addresses. Any address returned by
kmalloc(), vmalloc(), and similar interfaces is a virtual address and can
be stored in a "void *".
be stored in a ``void *``.
The virtual memory system (TLB, page tables, etc.) translates virtual
addresses to CPU physical addresses, which are stored as "phys_addr_t" or
......@@ -37,7 +39,7 @@ be restricted to a subset of that space. For example, even if a system
supports 64-bit addresses for main memory and PCI BARs, it may use an IOMMU
so devices only need to use 32-bit DMA addresses.
Here's a picture and some examples:
Here's a picture and some examples::
CPU CPU Bus
Virtual Physical Address
......@@ -98,15 +100,16 @@ microprocessor architecture. You should use the DMA API rather than the
bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
pci_map_*() interfaces.
First of all, you should make sure
First of all, you should make sure::
#include <linux/dma-mapping.h>
#include <linux/dma-mapping.h>
is in your driver, which provides the definition of dma_addr_t. This type
can hold any valid DMA address for the platform and should be used
everywhere you hold a DMA address returned from the DMA mapping functions.
What memory is DMA'able?
What memory is DMA'able?
========================
The first piece of information you must know is what kernel memory can
be used with the DMA mapping facilities. There has been an unwritten
......@@ -143,7 +146,8 @@ What about block I/O and networking buffers? The block I/O and
networking subsystems make sure that the buffers they use are valid
for you to DMA from/to.
DMA addressing limitations
DMA addressing limitations
==========================
Does your device have any DMA addressing limitations? For example, is
your device only capable of driving the low order 24-bits of address?
......@@ -166,7 +170,7 @@ style to do this even if your device holds the default setting,
because this shows that you did think about these issues wrt. your
device.
The query is performed via a call to dma_set_mask_and_coherent():
The query is performed via a call to dma_set_mask_and_coherent()::
int dma_set_mask_and_coherent(struct device *dev, u64 mask);
......@@ -175,12 +179,12 @@ If you have some special requirements, then the following two separate
queries can be used instead:
The query for streaming mappings is performed via a call to
dma_set_mask():
dma_set_mask()::
int dma_set_mask(struct device *dev, u64 mask);
The query for consistent allocations is performed via a call
to dma_set_coherent_mask():
to dma_set_coherent_mask()::
int dma_set_coherent_mask(struct device *dev, u64 mask);
......@@ -209,7 +213,7 @@ of your driver reports that performance is bad or that the device is not
even detected, you can ask them for the kernel messages to find out
exactly why.
The standard 32-bit addressing device would do something like this:
The standard 32-bit addressing device would do something like this::
if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32))) {
dev_warn(dev, "mydev: No suitable DMA available\n");
......@@ -225,7 +229,7 @@ than 64-bit addressing. For example, Sparc64 PCI SAC addressing is
more efficient than DAC addressing.
Here is how you would handle a 64-bit capable device which can drive
all 64-bits when accessing streaming DMA:
all 64-bits when accessing streaming DMA::
int using_dac;
......@@ -239,7 +243,7 @@ all 64-bits when accessing streaming DMA:
}
If a card is capable of using 64-bit consistent allocations as well,
the case would look like this:
the case would look like this::
int using_dac, consistent_using_dac;
......@@ -260,7 +264,7 @@ uses consistent allocations, one would have to check the return value from
dma_set_coherent_mask().
Finally, if your device can only drive the low 24-bits of
address you might do something like:
address you might do something like::
if (dma_set_mask(dev, DMA_BIT_MASK(24))) {
dev_warn(dev, "mydev: 24-bit DMA addressing not available\n");
......@@ -280,7 +284,7 @@ only provide the functionality which the machine can handle. It
is important that the last call to dma_set_mask() be for the
most specific mask.
Here is pseudo-code showing how this might be done:
Here is pseudo-code showing how this might be done::
#define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32)
#define RECORD_ADDRESS_BITS DMA_BIT_MASK(24)
......@@ -308,7 +312,8 @@ A sound card was used as an example here because this genre of PCI
devices seems to be littered with ISA chips given a PCI front end,
and thus retaining the 16MB DMA addressing limitations of ISA.
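As a rough illustration of the masks used above, here is a userspace sketch of what DMA_BIT_MASK(n) evaluates to. The helper names dma_bit_mask() and dma_addr_fits() are hypothetical stand-ins for illustration, not kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for DMA_BIT_MASK(n): a value with the low n bits set. */
static uint64_t dma_bit_mask(unsigned int n)
{
	assert(n >= 1 && n <= 64);
	/* Shifting a 64-bit value by 64 is undefined in C, so handle n == 64 apart. */
	return (n == 64) ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

/* Hypothetical helper: true if addr has no bits set outside the mask. */
static bool dma_addr_fits(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

With a 24-bit mask the highest reachable address is 0xffffff, so the addressable window is 16MB, which is the ISA limitation mentioned above.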
Types of DMA mappings
Types of DMA mappings
=====================
There are two types of DMA mappings:
......@@ -336,12 +341,14 @@ There are two types of DMA mappings:
to memory is immediately visible to the device, and vice
versa. Consistent mappings guarantee this.
IMPORTANT: Consistent DMA memory does not preclude the usage of
proper memory barriers. The CPU may reorder stores to
.. important::
Consistent DMA memory does not preclude the usage of
proper memory barriers. The CPU may reorder stores to
consistent memory just as it may normal memory. Example:
if it is important for the device to see the first word
of a descriptor updated before the second, you must do
something like:
something like::
desc->word0 = address;
wmb();
......@@ -377,16 +384,17 @@ Also, systems with caches that aren't DMA-coherent will work better
when the underlying buffers don't share cache lines with other data.
Using Consistent DMA mappings.
Using Consistent DMA mappings
=============================
To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
you should do:
you should do::
dma_addr_t dma_handle;
cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp);
where dev is a struct device *. This may be called in interrupt
where dev is a ``struct device *``. This may be called in interrupt
context with the GFP_ATOMIC flag.
Size is the length of the region you want to allocate, in bytes.
......@@ -415,7 +423,7 @@ exists (for example) to guarantee that if you allocate a chunk
which is smaller than or equal to 64 kilobytes, the extent of the
buffer you receive will not cross a 64K boundary.
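To make the boundary guarantee concrete, the following userspace sketch checks whether a region crosses a power-of-two boundary such as 64K. The function crosses_boundary() is a hypothetical helper written for this illustration; it is not part of the DMA API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the region [start, start + size) touches more than one
 * boundary-aligned window. boundary must be a power of two.
 */
static bool crosses_boundary(uint64_t start, uint64_t size, uint64_t boundary)
{
	assert(size > 0);
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

	uint64_t last = start + size - 1;	/* last byte of the region */

	/* Compare the window base of the first and the last byte. */
	return (start & ~(boundary - 1)) != (last & ~(boundary - 1));
}
```

For example, a 64K region starting at 0x10000 stays inside one 64K window, while an 8K region starting at 0x1f000 spills into the next one.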
To unmap and free such a DMA region, you call:
To unmap and free such a DMA region, you call::
dma_free_coherent(dev, size, cpu_addr, dma_handle);
......@@ -430,7 +438,7 @@ a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages().
Also, it understands common hardware constraints for alignment,
like queue heads needing to be aligned on N byte boundaries.
Create a dma_pool like this:
Create a dma_pool like this::
struct dma_pool *pool;
......@@ -444,7 +452,7 @@ pass 0 for boundary; passing 4096 says memory allocated from this pool
must not cross 4KByte boundaries (but at that time it may be better to
use dma_alloc_coherent() directly instead).
Allocate memory from a DMA pool like this:
Allocate memory from a DMA pool like this::
cpu_addr = dma_pool_alloc(pool, flags, &dma_handle);
......@@ -452,7 +460,7 @@ flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor
holding SMP locks), GFP_ATOMIC otherwise. Like dma_alloc_coherent(),
this returns two values, cpu_addr and dma_handle.
Free memory that was allocated from a dma_pool like this:
Free memory that was allocated from a dma_pool like this::
dma_pool_free(pool, cpu_addr, dma_handle);
......@@ -460,7 +468,7 @@ where pool is what you passed to dma_pool_alloc(), and cpu_addr and
dma_handle are the values dma_pool_alloc() returned. This function
may be called in interrupt context.
Destroy a dma_pool by calling:
Destroy a dma_pool by calling::
dma_pool_destroy(pool);
......@@ -468,11 +476,12 @@ Make sure you've called dma_pool_free() for all memory allocated
from a pool before you destroy the pool. This function may not
be called in interrupt context.
DMA Direction
DMA Direction
=============
The interfaces described in subsequent portions of this document
take a DMA direction argument, which is an integer and takes on
one of the following values:
one of the following values::
DMA_BIDIRECTIONAL
DMA_TO_DEVICE
......@@ -521,14 +530,15 @@ packets, map/unmap them with the DMA_TO_DEVICE direction
specifier. For receive packets, just the opposite, map/unmap them
with the DMA_FROM_DEVICE direction specifier.
Using Streaming DMA mappings
Using Streaming DMA mappings
============================
The streaming DMA mapping routines can be called from interrupt
context. There are two versions of each map/unmap, one which will
map/unmap a single memory region, and one which will map/unmap a
scatterlist.
To map a single region, you do:
To map a single region, you do::
struct device *dev = &my_dev->dev;
dma_addr_t dma_handle;
......@@ -545,7 +555,7 @@ To map a single region, you do:
goto map_error_handling;
}
and to unmap it:
and to unmap it::
dma_unmap_single(dev, dma_handle, size, direction);
......@@ -563,7 +573,7 @@ Using CPU pointers like this for single mappings has a disadvantage:
you cannot reference HIGHMEM memory in this way. Thus, there is a
map/unmap interface pair akin to dma_{map,unmap}_single(). These
interfaces deal with page/offset pairs instead of CPU pointers.
Specifically:
Specifically::
struct device *dev = &my_dev->dev;
dma_addr_t dma_handle;
......@@ -593,7 +603,7 @@ error as outlined under the dma_map_single() discussion.
You should call dma_unmap_page() when the DMA activity is finished, e.g.,
from the interrupt which told you that the DMA transfer is done.
With scatterlists, you map a region gathered from several regions by:
With scatterlists, you map a region gathered from several regions by::
int i, count = dma_map_sg(dev, sglist, nents, direction);
struct scatterlist *sg;
......@@ -617,16 +627,18 @@ Then you should loop count times (note: this can be less than nents times)
and use sg_dma_address() and sg_dma_len() macros where you previously
accessed sg->address and sg->length as shown above.
To unmap a scatterlist, just call:
To unmap a scatterlist, just call::
dma_unmap_sg(dev, sglist, nents, direction);
Again, make sure DMA activity has already finished.
PLEASE NOTE: The 'nents' argument to the dma_unmap_sg call must be
the _same_ one you passed into the dma_map_sg call,
it should _NOT_ be the 'count' value _returned_ from the
dma_map_sg call.
.. note::
The 'nents' argument to the dma_unmap_sg call must be
the _same_ one you passed into the dma_map_sg call,
it should _NOT_ be the 'count' value _returned_ from the
dma_map_sg call.
Every dma_map_{single,sg}() call should have its dma_unmap_{single,sg}()
counterpart, because the DMA address space is a shared resource and
......@@ -638,11 +650,11 @@ properly in order for the CPU and device to see the most up-to-date and
correct copy of the DMA buffer.
So, firstly, just map it with dma_map_{single,sg}(), and after each DMA
transfer call either:
transfer call either::
dma_sync_single_for_cpu(dev, dma_handle, size, direction);
or:
or::
dma_sync_sg_for_cpu(dev, sglist, nents, direction);
......@@ -650,17 +662,19 @@ as appropriate.
Then, if you wish to let the device get at the DMA area again,
finish accessing the data with the CPU, and then before actually
giving the buffer to the hardware call either:
giving the buffer to the hardware call either::
dma_sync_single_for_device(dev, dma_handle, size, direction);
or:
or::
dma_sync_sg_for_device(dev, sglist, nents, direction);
as appropriate.
PLEASE NOTE: The 'nents' argument to dma_sync_sg_for_cpu() and
.. note::
The 'nents' argument to dma_sync_sg_for_cpu() and
dma_sync_sg_for_device() must be the same passed to
dma_map_sg(). It is _NOT_ the count returned by
dma_map_sg().
......@@ -671,7 +685,7 @@ dma_map_*() call till dma_unmap_*(), then you don't have to call the
dma_sync_*() routines at all.
Here is pseudo code which shows a situation in which you would need
to use the dma_sync_*() interfaces.
to use the dma_sync_*() interfaces::
my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
{
......@@ -747,7 +761,8 @@ is planned to completely remove virt_to_bus() and bus_to_virt() as
they are entirely deprecated. Some ports already do not provide these
as it is impossible to correctly support them.
Handling Errors
Handling Errors
===============
DMA address space is limited on some architectures and an allocation
failure can be determined by:
......@@ -755,7 +770,7 @@ failure can be determined by:
- checking if dma_alloc_coherent() returns NULL or dma_map_sg returns 0
- checking the dma_addr_t returned from dma_map_single() and dma_map_page()
by using dma_mapping_error():
by using dma_mapping_error()::
dma_addr_t dma_handle;
......@@ -773,7 +788,8 @@ failure can be determined by:
of a multiple page mapping attempt. These examples are applicable to
dma_map_page() as well.
Example 1:
Example 1::
dma_addr_t dma_handle1;
dma_addr_t dma_handle2;
......@@ -802,8 +818,12 @@ Example 1:
dma_unmap_single(dma_handle1);
map_error_handling1:
Example 2: (if buffers are allocated in a loop, unmap all mapped buffers when
mapping error is detected in the middle)
Example 2::
/*
* if buffers are allocated in a loop, unmap all mapped buffers when
* mapping error is detected in the middle
*/
dma_addr_t dma_addr;
dma_addr_t array[DMA_BUFFERS];
......@@ -846,7 +866,8 @@ SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping
fails in the queuecommand hook. This means that the SCSI subsystem
passes the command to the driver again later.
Optimizing Unmap State Space Consumption
Optimizing Unmap State Space Consumption
========================================
On many platforms, dma_unmap_{single,page}() is simply a nop.
Therefore, keeping track of the mapping address and length is a waste
......@@ -858,7 +879,7 @@ Actually, instead of describing the macros one by one, we'll
transform some example code.
1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures.
Example, before:
Example, before::
struct ring_state {
struct sk_buff *skb;
......@@ -866,7 +887,7 @@ transform some example code.
__u32 len;
};
after:
after::
struct ring_state {
struct sk_buff *skb;
......@@ -875,23 +896,23 @@ transform some example code.
};
2) Use dma_unmap_{addr,len}_set() to set these values.
Example, before:
Example, before::
ringp->mapping = FOO;
ringp->len = BAR;
after:
after::
dma_unmap_addr_set(ringp, mapping, FOO);
dma_unmap_len_set(ringp, len, BAR);
3) Use dma_unmap_{addr,len}() to access these values.
Example, before:
Example, before::
dma_unmap_single(dev, ringp->mapping, ringp->len,
DMA_FROM_DEVICE);
after:
after::
dma_unmap_single(dev,
dma_unmap_addr(ringp, mapping),
......@@ -902,7 +923,8 @@ It really should be self-explanatory. We treat the ADDR and LEN
separately, because it is possible for an implementation to only
need the address in order to perform the unmap operation.
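The following userspace sketch models how such macros could be defined so that the state disappears on platforms where unmap is a nop. It is an approximation for illustration only: the NEED_DMA_MAP_STATE switch, the dma_addr_t typedef and the ring_state_record() helper are assumptions made here, and the authoritative definitions live in the kernel headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* stand-in for the kernel type */

#define NEED_DMA_MAP_STATE 1	/* hypothetical switch; set to 0 to drop the state */

#if NEED_DMA_MAP_STATE
#define DEFINE_DMA_UNMAP_ADDR(name)		dma_addr_t name
#define DEFINE_DMA_UNMAP_LEN(name)		uint32_t name
#define dma_unmap_addr(ptr, name)		((ptr)->name)
#define dma_unmap_addr_set(ptr, name, val)	(((ptr)->name) = (val))
#define dma_unmap_len(ptr, name)		((ptr)->name)
#define dma_unmap_len_set(ptr, name, val)	(((ptr)->name) = (val))
#else
#define DEFINE_DMA_UNMAP_ADDR(name)
#define DEFINE_DMA_UNMAP_LEN(name)
#define dma_unmap_addr(ptr, name)		(0)
#define dma_unmap_addr_set(ptr, name, val)	do { } while (0)
#define dma_unmap_len(ptr, name)		(0)
#define dma_unmap_len_set(ptr, name, val)	do { } while (0)
#endif

struct ring_state {
	void *skb;			/* struct sk_buff * in a real driver */
	DEFINE_DMA_UNMAP_ADDR(mapping);
	DEFINE_DMA_UNMAP_LEN(len);
};

/* Hypothetical helper: remember what will be needed at unmap time. */
static void ring_state_record(struct ring_state *r, dma_addr_t addr,
			      uint32_t length)
{
	assert(r != NULL);
	dma_unmap_addr_set(r, mapping, addr);
	dma_unmap_len_set(r, len, length);
}
```

Because the driver only touches the fields through the accessor macros, the same source builds whether or not the fields exist.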
Platform Issues
Platform Issues
===============
If you are just writing drivers for Linux and do not maintain
an architecture port for the kernel, you can safely skip down
......@@ -928,12 +950,13 @@ to "Closing".
alignment constraints (e.g. the alignment constraints about 64-bit
objects).
Closing
Closing
=======
This document, and the API itself, would not be in its current
form without the feedback and suggestions from numerous individuals.
We would like to specifically mention, in no particular order, the
following people:
following people::
Russell King <rmk@arm.linux.org.uk>
Leo Dagum <dagum@barrel.engr.sgi.com>
......
This diff is collapsed.
DMA with ISA and LPC devices
============================
============================
DMA with ISA and LPC devices
============================
Pierre Ossman <drzeus@drzeus.cx>
:Author: Pierre Ossman <drzeus@drzeus.cx>
This document describes how to do DMA transfers using the old ISA DMA
controller. Even though ISA is more or less dead today the LPC bus
uses the same DMA system so it will be around for quite some time.
Part I - Headers and dependencies
---------------------------------
Headers and dependencies
------------------------
To do ISA style DMA you need to include two headers:
To do ISA style DMA you need to include two headers::
#include <linux/dma-mapping.h>
#include <asm/dma.h>
#include <linux/dma-mapping.h>
#include <asm/dma.h>
The first is the generic DMA API used to convert virtual addresses to
bus addresses (see Documentation/DMA-API.txt for details).
......@@ -23,8 +24,8 @@ this is not present on all platforms make sure you construct your
Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries
to build your driver on unsupported platforms.
Part II - Buffer allocation
---------------------------
Buffer allocation
-----------------
The ISA DMA controller has some very strict requirements on which
memory it can access so extra care must be taken when allocating
......@@ -47,8 +48,8 @@ __GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
(This scarcity also means that you should allocate the buffer as
early as possible and not release it until the driver is unloaded.)
Part III - Address translation
------------------------------
Address translation
-------------------
To translate the virtual address to a bus address, use the normal DMA
API. Do _not_ use isa_virt_to_phys() even though it does the same
......@@ -61,8 +62,8 @@ Note: x86_64 had a broken DMA API when it came to ISA but has since
been fixed. If your arch has problems then fix the DMA API instead of
reverting to the ISA functions.
Part IV - Channels
------------------
Channels
--------
A normal ISA DMA controller has 8 channels. The lower four are for
8-bit transfers and the upper four are for 16-bit transfers.
......@@ -80,8 +81,8 @@ The ability to use 16-bit or 8-bit transfers is _not_ up to you as a
driver author but depends on what the hardware supports. Check your
specs or test different channels.
Part V - Transfer data
----------------------
Transfer data
-------------
Now for the good stuff, the actual DMA transfer. :)
......@@ -112,37 +113,37 @@ Once the DMA transfer is finished (or timed out) you should disable
the channel again. You should also check get_dma_residue() to make
sure that all data has been transferred.
Example:
Example::
int flags, residue;
int flags, residue;
flags = claim_dma_lock();
flags = claim_dma_lock();
clear_dma_ff();
clear_dma_ff();
set_dma_mode(channel, DMA_MODE_WRITE);
set_dma_addr(channel, phys_addr);
set_dma_count(channel, num_bytes);
set_dma_mode(channel, DMA_MODE_WRITE);
set_dma_addr(channel, phys_addr);
set_dma_count(channel, num_bytes);
enable_dma(channel);
enable_dma(channel);
release_dma_lock(flags);
release_dma_lock(flags);
while (!device_done());
while (!device_done());
flags = claim_dma_lock();
flags = claim_dma_lock();
disable_dma(channel);
disable_dma(channel);
residue = get_dma_residue(channel);
if (residue != 0)
printk(KERN_ERR "driver: Incomplete DMA transfer!"
" %d bytes left!\n", residue);
residue = get_dma_residue(channel);
if (residue != 0)
printk(KERN_ERR "driver: Incomplete DMA transfer!"
" %d bytes left!\n", residue);
release_dma_lock(flags);
release_dma_lock(flags);
Part VI - Suspend/resume
------------------------
Suspend/resume
--------------
It is the driver's responsibility to make sure that the machine isn't
suspended while a DMA transfer is in progress. Also, all DMA settings
......
DMA attributes
==============
==============
DMA attributes
==============
This document describes the semantics of the DMA attributes that are
defined in linux/dma-mapping.h.
......@@ -108,6 +109,7 @@ This is a hint to the DMA-mapping subsystem that it's probably not worth
the time to try to allocate memory in a way that gives better TLB
efficiency (AKA it's not worth trying to build the mapping out of larger
pages). You might want to specify this if:
- You know that the accesses to this memory won't thrash the TLB.
You might know that the accesses are likely to be sequential or
that they aren't sequential but it's unlikely you'll ping-pong
......@@ -121,11 +123,12 @@ pages). You might want to specify this if:
the mapping to have a short lifetime then it may be worth it to
optimize allocation (avoid coming up with large pages) instead of
getting the slight performance win of larger pages.
Setting this hint doesn't guarantee that you won't get huge pages, but it
means that we won't try quite as hard to get them.
NOTE: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM,
though ARM64 patches will likely be posted soon.
.. note:: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM,
though ARM64 patches will likely be posted soon.
DMA_ATTR_NO_WARN
----------------
......@@ -142,10 +145,10 @@ problem at all, depending on the implementation of the retry mechanism.
So, this provides a way for drivers to avoid those error messages on calls
where allocation failures are not a problem, and shouldn't bother the logs.
NOTE: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC.
.. note:: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC.
DMA_ATTR_PRIVILEGED
------------------------------
-------------------
Some advanced peripherals such as remote processors and GPUs perform
accesses to DMA buffers in both privileged "supervisor" and unprivileged
......
=====================
The Linux IPMI Driver
=====================
The Linux IPMI Driver
---------------------
Corey Minyard
<minyard@mvista.com>
<minyard@acm.org>
:Author: Corey Minyard <minyard@mvista.com> / <minyard@acm.org>
The Intelligent Platform Management Interface, or IPMI, is a
standard for controlling intelligent devices that monitor a system.
......@@ -141,7 +140,7 @@ Addressing
----------
The IPMI addressing works much like IP addresses, you have an overlay
to handle the different address types. The overlay is:
to handle the different address types. The overlay is::
struct ipmi_addr
{
......@@ -153,7 +152,7 @@ to handle the different address types. The overlay is:
The addr_type determines what the address really is. The driver
currently understands two different types of addresses.
"System Interface" addresses are defined as:
"System Interface" addresses are defined as::
struct ipmi_system_interface_addr
{
......@@ -166,7 +165,7 @@ straight to the BMC on the current card. The channel must be
IPMI_BMC_CHANNEL.
Messages that are destined to go out on the IPMB bus use the
IPMI_IPMB_ADDR_TYPE address type. The format is
IPMI_IPMB_ADDR_TYPE address type. The format is::
struct ipmi_ipmb_addr
{
......@@ -184,16 +183,16 @@ spec.
Messages
--------
Messages are defined as:
Messages are defined as::
struct ipmi_msg
{
struct ipmi_msg
{
unsigned char netfn;
unsigned char lun;
unsigned char cmd;
unsigned char *data;
int data_len;
};
};
The driver takes care of adding/stripping the header information. The
data portion is just the data to be sent (do NOT put addressing info
......@@ -208,7 +207,7 @@ block of data, even when receiving messages. Otherwise the driver
will have no place to put the message.
Messages coming up from the message handler in kernelland will come in
as:
as::
struct ipmi_recv_msg
{
......@@ -246,6 +245,7 @@ and the user should not have to care what type of SMI is below them.
Watching For Interfaces
^^^^^^^^^^^^^^^^^^^^^^^
When your code comes up, the IPMI driver may or may not have detected
if IPMI devices exist. So you might have to defer your setup until
......@@ -256,6 +256,7 @@ and tell you when they come and go.
Creating the User
^^^^^^^^^^^^^^^^^
To use the message handler, you must first create a user using
ipmi_create_user. The interface number specifies which SMI you want
......@@ -272,6 +273,7 @@ closing the device automatically destroys the user.
Messaging
^^^^^^^^^
To send a message from kernel-land, the ipmi_request_settime() call does
pretty much all message handling. Most of the parameters are
......@@ -321,6 +323,7 @@ though, since it is tricky to manage your own buffers.
Events and Incoming Commands
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The driver takes care of polling for IPMI events and receiving
commands (commands are messages that are not responses, they are
......@@ -367,7 +370,7 @@ in the system. It discovers interfaces through a host of different
methods, depending on the system.
You can specify up to four interfaces on the module load line and
control some module parameters:
control some module parameters::
modprobe ipmi_si.o type=<type1>,<type2>....
ports=<port1>,<port2>... addrs=<addr1>,<addr2>...
......@@ -437,7 +440,7 @@ default is one. Setting to 0 is useful with the hotmod, but is
obviously only useful for modules.
When compiled into the kernel, the parameters can be specified on the
kernel command line as:
kernel command line as::
ipmi_si.type=<type1>,<type2>...
ipmi_si.ports=<port1>,<port2>... ipmi_si.addrs=<addr1>,<addr2>...
......@@ -474,16 +477,22 @@ The driver supports a hot add and remove of interfaces. This way,
interfaces can be added or removed after the kernel is up and running.
This is done using /sys/module/ipmi_si/parameters/hotmod, which is a
write-only parameter. You write a string to this interface. The string
has the format:
has the format::
<op1>[:op2[:op3...]]
The "op"s are:
The "op"s are::
add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]]
You can specify more than one interface on the line. The "opt"s are:
You can specify more than one interface on the line. The "opt"s are::
rsp=<regspacing>
rsi=<regsize>
rsh=<regshift>
irq=<irq>
ipmb=<ipmb slave addr>
and these have the same meanings as discussed above. Note that you
can also use this on the kernel command line for a more compact format
for specifying an interface. Note that when removing an interface,
......@@ -496,7 +505,7 @@ The SMBus Driver (SSIF)
The SMBus driver allows up to 4 SMBus devices to be configured in the
system. By default, the driver will only register with something it
finds in DMI or ACPI tables. You can change this
at module load time (for a module) with:
at module load time (for a module) with::
modprobe ipmi_ssif.o
addr=<i2caddr1>[,<i2caddr2>[,...]]
......@@ -535,7 +544,7 @@ the smb_addr parameter unless you have DMI or ACPI data to tell the
driver what to use.
When compiled into the kernel, the addresses can be specified on the
kernel command line as:
kernel command line as::
ipmi_ssif.addr=<i2caddr1>[,<i2caddr2>[...]]
ipmi_ssif.adapter=<adapter1>[,<adapter2>[...]]
......@@ -565,9 +574,9 @@ Some users need more detailed information about a device, like where
the address came from or the raw base device for the IPMI interface.
You can use the IPMI smi_watcher to catch the IPMI interfaces as they
come or go, and to grab the information, you can use the function
ipmi_get_smi_info(), which returns the following structure:
ipmi_get_smi_info(), which returns the following structure::
struct ipmi_smi_info {
struct ipmi_smi_info {
enum ipmi_addr_src addr_src;
struct device *dev;
union {
......@@ -575,7 +584,7 @@ struct ipmi_smi_info {
void *acpi_handle;
} acpi_info;
} addr_info;
};
};
Currently special info is returned only for SI_ACPI address sources.
Others may be added as necessary.
......@@ -590,7 +599,7 @@ Watchdog
A watchdog timer is provided that implements the Linux-standard
watchdog timer interface. It has three module parameters that can be
used to control it:
used to control it::
modprobe ipmi_watchdog timeout=<t> pretimeout=<t> action=<action type>
preaction=<preaction type> preop=<preop type> start_now=x
......@@ -635,7 +644,7 @@ watchdog device is closed. The default value of nowayout is true
if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not.
When compiled into the kernel, the kernel command line is available
for configuring the watchdog:
for configuring the watchdog::
ipmi_watchdog.timeout=<t> ipmi_watchdog.pretimeout=<t>
ipmi_watchdog.action=<action type>
......@@ -675,6 +684,7 @@ also get a bunch of OEM events holding the panic string.
The field settings of the events are:
* Generator ID: 0x21 (kernel)
* EvM Rev: 0x03 (this event is formatted in IPMI 1.0 format)
* Sensor Type: 0x20 (OS critical stop sensor)
......@@ -683,18 +693,20 @@ The field settings of the events are:
* Event Data 1: 0xa1 (Runtime stop in OEM bytes 2 and 3)
* Event data 2: second byte of panic string
* Event data 3: third byte of panic string
See the IPMI spec for the details of the event layout. This event is
always sent to the local management controller. It will handle routing
the message to the right place.
Other OEM events have the following format:
Record ID (bytes 0-1): Set by the SEL.
Record type (byte 2): 0xf0 (OEM non-timestamped)
byte 3: The slave address of the card saving the panic
byte 4: A sequence number (starting at zero)
The rest of the bytes (11 bytes) are the panic string. If the panic string
is longer than 11 bytes, multiple messages will be sent with increasing
sequence numbers.
* Record ID (bytes 0-1): Set by the SEL.
* Record type (byte 2): 0xf0 (OEM non-timestamped)
* byte 3: The slave address of the card saving the panic
* byte 4: A sequence number (starting at zero)
The rest of the bytes (11 bytes) are the panic string. If the panic string
is longer than 11 bytes, multiple messages will be sent with increasing
sequence numbers.
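The record layout above can be sketched in userspace as follows. This is only an illustration of the byte layout: build_panic_records() is a hypothetical helper, zero padding of the final record is an assumption, and so is the choice to chunk the whole string from its first byte:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OEM_RECORD_SIZE	16	/* 2 + 1 + 1 + 1 + 11 bytes, per the list above */
#define OEM_PANIC_CHUNK	11	/* panic string bytes carried per record */

/* Split a panic string into OEM records; returns the number written. */
static size_t build_panic_records(const char *panic, uint8_t slave_addr,
				  uint8_t out[][OEM_RECORD_SIZE],
				  size_t max_records)
{
	size_t len, off = 0, n = 0;

	assert(panic != NULL && out != NULL);
	len = strlen(panic);

	while (off < len && n < max_records) {
		size_t chunk = len - off;

		if (chunk > OEM_PANIC_CHUNK)
			chunk = OEM_PANIC_CHUNK;

		/* Bytes 0-1 (record ID) stay zero here; the SEL sets them. */
		memset(out[n], 0, OEM_RECORD_SIZE);
		out[n][2] = 0xf0;		/* OEM non-timestamped */
		out[n][3] = slave_addr;		/* card saving the panic */
		out[n][4] = (uint8_t)n;		/* sequence number from zero */
		memcpy(&out[n][5], panic + off, chunk);

		off += chunk;
		n++;
	}
	return n;
}
```

An 18-byte panic string therefore yields two records, carrying 11 and 7 bytes of the string with sequence numbers 0 and 1.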
Because you cannot send OEM events using the standard interface, this
function will attempt to find an SEL and add the events there. It
......
================
SMP IRQ affinity
================
ChangeLog:
Started by Ingo Molnar <mingo@redhat.com>
Update by Max Krasnyansky <maxk@qualcomm.com>
- Started by Ingo Molnar <mingo@redhat.com>
- Update by Max Krasnyansky <maxk@qualcomm.com>
SMP IRQ affinity
/proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify
which target CPUs are permitted for a given IRQ source. It's a bitmask
......@@ -16,50 +19,52 @@ will be set to the default mask. It can then be changed as described above.
Default mask is 0xffffffff.
Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
it to CPU4-7 (this is an 8-CPU SMP box):
it to CPU4-7 (this is an 8-CPU SMP box)::
[root@moon 44]# cd /proc/irq/44
[root@moon 44]# cat smp_affinity
ffffffff
[root@moon 44]# cd /proc/irq/44
[root@moon 44]# cat smp_affinity
ffffffff
[root@moon 44]# echo 0f > smp_affinity
[root@moon 44]# cat smp_affinity
0000000f
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
...
--- hell ping statistics ---
6029 packets transmitted, 6027 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.4 ms
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1
[root@moon 44]# echo 0f > smp_affinity
[root@moon 44]# cat smp_affinity
0000000f
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
...
--- hell ping statistics ---
6029 packets transmitted, 6027 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.4 ms
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1
As can be seen from the line above IRQ44 was delivered only to the first four
processors (0-3).
Now let's restrict that IRQ to CPU(4-7).
[root@moon 44]# echo f0 > smp_affinity
[root@moon 44]# cat smp_affinity
000000f0
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
..
--- hell ping statistics ---
2779 packets transmitted, 2777 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.5/585.4 ms
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1
::
[root@moon 44]# echo f0 > smp_affinity
[root@moon 44]# cat smp_affinity
000000f0
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
..
--- hell ping statistics ---
2779 packets transmitted, 2777 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.5/585.4 ms
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1
This time around IRQ44 was delivered only to the last four processors.
i.e. the counters for CPU0-3 did not change.
Here is an example of limiting that same irq (44) to cpus 1024 to 1031:
Here is an example of limiting that same irq (44) to cpus 1024 to 1031::
[root@moon 44]# echo 1024-1031 > smp_affinity_list
[root@moon 44]# cat smp_affinity_list
1024-1031
[root@moon 44]# echo 1024-1031 > smp_affinity_list
[root@moon 44]# cat smp_affinity_list
1024-1031
Note that to do this with a bitmask would require 32 bitmasks of zero
to follow the pertinent one.
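To make the relationship between a CPU list and the bitmask concrete, here is a minimal Python sketch that builds the hex string for a set of CPUs. The helper name `cpus_to_affinity_mask` is hypothetical, and the comma-separated 32-bit word layout (most significant word first) is an assumption about how wide masks are written, consistent with the note about 32 trailing zero bitmasks.

```python
def cpus_to_affinity_mask(cpus):
    """Return a hex bitmask string with one bit set per CPU number.

    Assumption: masks wider than 32 bits are written as comma-separated
    32-bit words, most significant word first.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu

    # Peel off 32-bit words, least significant first.
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break

    return ",".join(f"{word:08x}" for word in reversed(words))


if __name__ == "__main__":
    print(cpus_to_affinity_mask(range(0, 4)))   # CPU0-3
    print(cpus_to_affinity_mask(range(4, 8)))   # CPU4-7
    print(cpus_to_affinity_mask(range(1024, 1032)))
```

For CPUs 1024 to 1031 the result has 33 words, which is why smp_affinity_list is the more convenient interface for high CPU numbers.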
irq_domain interrupt number mapping library
===============================================
The irq_domain interrupt number mapping library
===============================================
The current design of the Linux kernel uses a single large number
space where each separate IRQ source is assigned a different number.
......@@ -36,7 +38,9 @@ irq_domain also implements translation from an abstract irq_fwspec
structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
be easily extended to support other IRQ topology data sources.
=== irq_domain usage ===
irq_domain usage
================
An interrupt controller driver creates and registers an irq_domain by
calling one of the irq_domain_add_*() functions (each mapping method
has a different allocator function, more on that later). The function
......@@ -62,15 +66,21 @@ If the driver has the Linux IRQ number or the irq_data pointer, and
needs to know the associated hwirq number (such as in the irq_chip
callbacks) then it can be directly obtained from irq_data->hwirq.
=== Types of irq_domain mappings ===
Types of irq_domain mappings
============================
There are several mechanisms available for reverse mapping from hwirq
to Linux irq, and each mechanism uses a different allocation function.
Which reverse map type should be used depends on the use case. Each
of the reverse map types is described below:
==== Linear ====
irq_domain_add_linear()
irq_domain_create_linear()
Linear
------
::
irq_domain_add_linear()
irq_domain_create_linear()
The linear reverse map maintains a fixed size table indexed by the
hwirq number. When a hwirq is mapped, an irq_desc is allocated for
......@@ -89,9 +99,13 @@ accepts a more general abstraction 'struct fwnode_handle'.
The majority of drivers should use the linear map.
==== Tree ====
irq_domain_add_tree()
irq_domain_create_tree()
Tree
----
::
irq_domain_add_tree()
irq_domain_create_tree()
The irq_domain maintains a radix tree map from hwirq numbers to Linux
IRQs. When an hwirq is mapped, an irq_desc is allocated and the
......@@ -109,8 +123,12 @@ accepts a more general abstraction 'struct fwnode_handle'.
Very few drivers should need this mapping.
==== No Map ===-
irq_domain_add_nomap()
No Map
------
::
irq_domain_add_nomap()
The No Map mapping is to be used when the hwirq number is
programmable in the hardware. In this case it is best to program the
......@@ -121,10 +139,14 @@ Linux IRQ number into the hardware.
Most drivers cannot use this mapping.
==== Legacy ====
irq_domain_add_simple()
irq_domain_add_legacy()
irq_domain_add_legacy_isa()
Legacy
------
::
irq_domain_add_simple()
irq_domain_add_legacy()
irq_domain_add_legacy_isa()
The Legacy mapping is a special case for drivers that already have a
range of irq_descs allocated for the hwirqs. It is used when the
......@@ -163,14 +185,17 @@ that the driver using the simple domain call irq_create_mapping()
before any irq_find_mapping() since the latter will actually work
for the static IRQ assignment case.
==== Hierarchy IRQ domain ====
Hierarchy IRQ domain
--------------------
On some architectures, there may be multiple interrupt controllers
involved in delivering an interrupt from the device to the target CPU.
Let's look at a typical interrupt delivering path on x86 platforms:
Let's look at a typical interrupt delivering path on x86 platforms::
Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
There are three interrupt controllers involved:
1) IOAPIC controller
2) Interrupt remapping controller
3) Local APIC controller
......@@ -180,7 +205,8 @@ hardware architecture, an irq_domain data structure is built for each
interrupt controller and those irq_domains are organized into hierarchy.
When building irq_domain hierarchy, the irq_domain near to the device is
child and the irq_domain near to CPU is parent. So a hierarchy structure
as below will be built for the example above.
as below will be built for the example above::
CPU Vector irq_domain (root irq_domain to manage CPU vectors)
^
|
......@@ -190,6 +216,7 @@ as below will be built for the example above.
IOAPIC irq_domain (manage IOAPIC delivery entries/pins)
There are four major interfaces to use hierarchy irq_domain:
1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt
controller related resources to deliver these interrupts.
2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller
......@@ -199,7 +226,8 @@ There are four major interfaces to use hierarchy irq_domain:
4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
to stop delivering the interrupt.
Following changes are needed to support hierarchy irq_domain.
Following changes are needed to support hierarchy irq_domain:
1) a new field 'parent' is added to struct irq_domain; it's used to
maintain irq_domain hierarchy information.
2) a new field 'parent_data' is added to struct irq_data; it's used to
......@@ -223,6 +251,7 @@ software architecture.
For an interrupt controller driver to support hierarchy irq_domain, it
needs to:
1) Implement irq_domain_ops.alloc and irq_domain_ops.free
2) Optionally implement irq_domain_ops.activate and
irq_domain_ops.deactivate.
......
===============
What is an IRQ?
===============
An IRQ is an interrupt request from a device.
Currently they can come in over a pin, or over a packet.
......
===================
Linux IOMMU Support
===================
......@@ -9,11 +10,11 @@ This guide gives a quick cheat sheet for some basic understanding.
Some Keywords
DMAR - DMA remapping
DRHD - DMA Remapping Hardware Unit Definition
RMRR - Reserved memory Region Reporting Structure
ZLR - Zero length reads from PCI devices
IOVA - IO Virtual address.
- DMAR - DMA remapping
- DRHD - DMA Remapping Hardware Unit Definition
- RMRR - Reserved memory Region Reporting Structure
- ZLR - Zero length reads from PCI devices
- IOVA - IO Virtual address.
Basic stuff
-----------
......@@ -33,7 +34,7 @@ devices that need to access these regions. OS is expected to setup
unity mappings for these regions for these devices to access these regions.
How is IOVA generated?
---------------------
----------------------
Well behaved drivers call pci_map_*() calls before sending command to device
that needs to perform DMA. Once DMA is completed and mapping is no longer
......@@ -82,14 +83,14 @@ in ACPI.
ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
When DMAR is being processed and initialized by ACPI, prints DMAR locations
and any RMRR's processed.
and any RMRR's processed::
ACPI DMAR:Host address width 36
ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
ACPI DMAR:Host address width 36
ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
When DMAR is enabled for use, you will notice..
......@@ -98,10 +99,12 @@ PCI-DMA: Using DMAR IOMMU
Fault reporting
---------------
DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
DMAR:[fault reason 05] PTE Write access is not set
DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
DMAR:[fault reason 05] PTE Write access is not set
::
DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
DMAR:[fault reason 05] PTE Write access is not set
DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
DMAR:[fault reason 05] PTE Write access is not set
TBD
----
......
Linux 2.4.2 Secure Attention Key (SAK) handling
18 March 2001, Andrew Morton
=========================================
Linux Secure Attention Key (SAK) handling
=========================================
:Date: 18 March 2001
:Author: Andrew Morton
An operating system's Secure Attention Key is a security tool which is
provided as protection against trojan password capturing programs. It
......@@ -13,7 +17,7 @@ this sequence. It is only available if the kernel was compiled with
sysrq support.
The proper way of generating a SAK is to define the key sequence using
`loadkeys'. This will work whether or not sysrq support is compiled
``loadkeys``. This will work whether or not sysrq support is compiled
into the kernel.
SAK works correctly when the keyboard is in raw mode. This means that
......@@ -25,64 +29,63 @@ What key sequence should you use? Well, CTRL-ALT-DEL is used to reboot
the machine. CTRL-ALT-BACKSPACE is magical to the X server. We'll
choose CTRL-ALT-PAUSE.
In your rc.sysinit (or rc.local) file, add the command
In your rc.sysinit (or rc.local) file, add the command::
echo "control alt keycode 101 = SAK" | /bin/loadkeys
And that's it! Only the superuser may reprogram the SAK key.
NOTES
=====
.. note::
1: Linux SAK is said to be not a "true SAK" as is required by
systems which implement C2 level security. This author does not
know why.
1. Linux SAK is said to be not a "true SAK" as is required by
systems which implement C2 level security. This author does not
know why.
2: On the PC keyboard, SAK kills all applications which have
/dev/console opened.
2. On the PC keyboard, SAK kills all applications which have
/dev/console opened.
Unfortunately this includes a number of things which you don't
actually want killed. This is because these applications are
incorrectly holding /dev/console open. Be sure to complain to your
Linux distributor about this!
Unfortunately this includes a number of things which you don't
actually want killed. This is because these applications are
incorrectly holding /dev/console open. Be sure to complain to your
Linux distributor about this!
You can identify processes which will be killed by SAK with the
command
You can identify processes which will be killed by SAK with the
command::
# ls -l /proc/[0-9]*/fd/* | grep console
l-wx------ 1 root root 64 Mar 18 00:46 /proc/579/fd/0 -> /dev/console
Then:
Then::
# ps aux|grep 579
root 579 0.0 0.1 1088 436 ? S 00:43 0:00 gpm -t ps/2
So `gpm' will be killed by SAK. This is a bug in gpm. It should
be closing standard input. You can work around this by finding the
initscript which launches gpm and changing it thusly:
So ``gpm`` will be killed by SAK. This is a bug in gpm. It should
be closing standard input. You can work around this by finding the
initscript which launches gpm and changing it thusly:
Old:
Old::
daemon gpm
New:
New::
daemon gpm < /dev/null
Vixie cron also seems to have this problem, and needs the same treatment.
Vixie cron also seems to have this problem, and needs the same treatment.
Also, one prominent Linux distribution has the following three
lines in its rc.sysinit and rc scripts:
Also, one prominent Linux distribution has the following three
lines in its rc.sysinit and rc scripts::
exec 3<&0
exec 4>&1
exec 5>&2
These commands cause *all* daemons which are launched by the
initscripts to have file descriptors 3, 4 and 5 attached to
/dev/console. So SAK kills them all. A workaround is to simply
delete these lines, but this may cause system management
applications to malfunction - test everything well.
These commands cause **all** daemons which are launched by the
initscripts to have file descriptors 3, 4 and 5 attached to
/dev/console. So SAK kills them all. A workaround is to simply
delete these lines, but this may cause system management
applications to malfunction - test everything well.
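The check from note 2 (finding processes that hold /dev/console open) can also be scripted. The following is a minimal Python sketch, not part of the original document; the function name `pids_holding` and its `proc_root` parameter are hypothetical, and it only looks at the symlinks under each process's fd directory, as the `ls -l /proc/[0-9]*/fd/*` command does.

```python
import os


def pids_holding(target, proc_root="/proc"):
    """Return sorted PIDs that have an fd symlink pointing at target."""
    pids = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were scanning
            if link == target:
                pids.add(int(entry))
                break
    return sorted(pids)


if __name__ == "__main__":
    # Needs root to see the descriptors of other users' processes.
    print(pids_holding("/dev/console"))
```

Each PID it reports can then be looked up with ps, as in the example above.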
SM501 Driver
============
.. include:: <isonum.txt>
Copyright 2006, 2007 Simtec Electronics
============
SM501 Driver
============
:Copyright: |copy| 2006, 2007 Simtec Electronics
The Silicon Motion SM501 multimedia companion chip is a multifunction device
which may provide numerous interfaces including USB host controller USB gadget,
......
============================
A block layer cache (bcache)
============================
Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be
nice if you could use them as cache... Hence bcache.
Wiki and git repositories are at:
http://bcache.evilpiepirate.org
http://evilpiepirate.org/git/linux-bcache.git
http://evilpiepirate.org/git/bcache-tools.git
- http://bcache.evilpiepirate.org
- http://evilpiepirate.org/git/linux-bcache.git
- http://evilpiepirate.org/git/bcache-tools.git
It's designed around the performance characteristics of SSDs - it only allocates
in erase block sized buckets, and it uses a hybrid btree/log to track cached
......@@ -37,17 +42,19 @@ to be flushed.
Getting started:
You'll need make-bcache from the bcache-tools repository. Both the cache device
and backing device must be formatted before use.
and backing device must be formatted before use::
make-bcache -B /dev/sdb
make-bcache -C /dev/sdc
make-bcache has the ability to format multiple devices at the same time - if
you format your backing devices and cache device at the same time, you won't
have to manually attach:
have to manually attach::
make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
bcache-tools now ships udev rules, and bcache devices are known to the kernel
immediately. Without udev, you can manually register devices like this:
immediately. Without udev, you can manually register devices like this::
echo /dev/sdb > /sys/fs/bcache/register
echo /dev/sdc > /sys/fs/bcache/register
......@@ -60,16 +67,16 @@ slow devices as bcache backing devices without a cache, and you can choose to ad
a caching device later.
See 'ATTACHING' section below.
The devices show up as:
The devices show up as::
/dev/bcache<N>
As well as (with udev):
As well as (with udev)::
/dev/bcache/by-uuid/<uuid>
/dev/bcache/by-label/<label>
To get started:
To get started::
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /mnt
......@@ -81,13 +88,13 @@ Cache devices are managed as sets; multiple caches per set isn't supported yet
but will allow for mirroring of metadata and dirty data in the future. Your new
cache set shows up as /sys/fs/bcache/<UUID>
ATTACHING
Attaching
---------
After your cache device and backing device are registered, the backing device
must be attached to your cache set to enable caching. Attaching a backing
device to a cache set is done thusly, with the UUID of the cache set in
/sys/fs/bcache:
/sys/fs/bcache::
echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
......@@ -97,7 +104,7 @@ your bcache devices. If a backing device has data in a cache somewhere, the
important if you have writeback caching turned on.
If you're booting up and your cache device is gone and never coming back, you
can force run the backing device:
can force run the backing device::
echo 1 > /sys/block/sdb/bcache/running
......@@ -110,7 +117,7 @@ but all the cached data will be invalidated. If there was dirty data in the
cache, don't expect the filesystem to be recoverable - you will have massive
filesystem corruption, though ext4's fsck does work miracles.
ERROR HANDLING
Error Handling
--------------
Bcache tries to transparently handle IO errors to/from the cache device without
......@@ -134,25 +141,27 @@ the backing devices to passthrough mode.
read some of the dirty data, though.
HOWTO/COOKBOOK
Howto/cookbook
--------------
A) Starting a bcache with a missing caching device
If registering the backing device doesn't help, it's already there, you just need
to force it to run without the cache:
to force it to run without the cache::
host:~# echo /dev/sdb1 > /sys/fs/bcache/register
[ 119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered
Next, you try to register your caching device if it's present. However
if it's absent, or registration fails for some reason, you can still
start your bcache without its cache, like so:
start your bcache without its cache, like so::
host:/sys/block/sdb/sdb1/bcache# echo 1 > running
Note that this may cause data loss if you were running in writeback mode.
B) Bcache does not find its cache
B) Bcache does not find its cache::
host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach
[ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set
......@@ -160,7 +169,8 @@ B) Bcache does not find its cache
[ 1933.478179] : cache set not found
In this case, the caching device was simply not registered at boot
or disappeared and came back, and needs to be (re-)registered:
or disappeared and came back, and needs to be (re-)registered::
host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register
......@@ -180,7 +190,8 @@ device is still available at an 8KiB offset. So either via a loopdev
of the backing device created with --offset 8K, or any value defined by
--data-offset when you originally formatted bcache with `make-bcache`.
For example:
For example::
losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev
This should present your unmodified backing device data in /dev/loop0
......@@ -191,33 +202,38 @@ cache device without loosing data.
E) Wiping a cache device
host:~# wipefs -a /dev/sdh2
16 bytes were erased at offset 0x1018 (bcache)
they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
::
host:~# wipefs -a /dev/sdh2
16 bytes were erased at offset 0x1018 (bcache)
they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
After you boot back with bcache enabled, you recreate the cache and attach it::
After you boot back with bcache enabled, you recreate the cache and attach it:
host:~# make-bcache -C /dev/sdh2
UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
version: 0
nbuckets: 106874
block_size: 1
bucket_size: 1024
nr_in_set: 1
nr_this_dev: 0
first_bucket: 1
[ 650.511912] bcache: run_cache_set() invalidating existing data
[ 650.549228] bcache: register_cache() registered cache device sdh2
host:~# make-bcache -C /dev/sdh2
UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
version: 0
nbuckets: 106874
block_size: 1
bucket_size: 1024
nr_in_set: 1
nr_this_dev: 0
first_bucket: 1
[ 650.511912] bcache: run_cache_set() invalidating existing data
[ 650.549228] bcache: register_cache() registered cache device sdh2
start backing device with missing cache:
host:/sys/block/md5/bcache# echo 1 > running
start backing device with missing cache::
attach new cache:
host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
[ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1
host:/sys/block/md5/bcache# echo 1 > running
attach new cache::
F) Remove or replace a caching device
host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach
[ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1
F) Remove or replace a caching device::
host:/sys/block/sda/sda7/bcache# echo 1 > detach
[ 695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7
......@@ -226,13 +242,15 @@ F) Remove or replace a caching device
wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy
Oops, it's disabled, but not unregistered, so it's still protected.
We need to go and unregister it:
We need to go and unregister it::
host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0
lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/
host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop
kernel: [ 917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered
Now we can wipe it:
Now we can wipe it::
host:~# wipefs -a /dev/nvme0n1p4
/dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81
......@@ -252,40 +270,44 @@ if there are any active backing or caching devices left on it:
1) Is it present in /dev/bcache* ? (there are times where it won't be)
If so, it's easy:
If so, it's easy::
host:/sys/block/bcache0/bcache# echo 1 > stop
2) But if your backing device is gone, this won't work:
2) But if your backing device is gone, this won't work::
host:/sys/block/bcache0# cd bcache
bash: cd: bcache: No such file or directory
In this case, you may have to unregister the dmcrypt block device that
references this bcache to free it up:
In this case, you may have to unregister the dmcrypt block device that
references this bcache to free it up::
host:~# dmsetup remove oldds1
bcache: bcache_device_free() bcache0 stopped
bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered
This causes the backing bcache to be removed from /sys/fs/bcache and
then it can be reused. This would be true of any block device stacking
where bcache is a lower device.
This causes the backing bcache to be removed from /sys/fs/bcache and
then it can be reused. This would be true of any block device stacking
where bcache is a lower device.
3) In other cases, you can also look in /sys/fs/bcache/::
3) In other cases, you can also look in /sys/fs/bcache/:
host:/sys/fs/bcache# ls -l */{cache?,bdev?}
lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/
host:/sys/fs/bcache# ls -l */{cache?,bdev?}
lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/
lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/
lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/
The device names will show which UUID is relevant, cd in that directory
and stop the cache::
The device names will show which UUID is relevant, cd in that directory
and stop the cache:
host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop
This will free up bcache references and let you reuse the partition for
other purposes.
This will free up bcache references and let you reuse the partition for
other purposes.
TROUBLESHOOTING PERFORMANCE
Troubleshooting performance
---------------------------
Bcache has a bunch of config options and tunables. The defaults are intended to
......@@ -301,11 +323,13 @@ want for getting the best possible numbers when benchmarking.
raid stripe size to get the disk multiples that you would like.
For example: If you have a 64k stripe size, then the following offset
would provide alignment for many common RAID5 data spindle counts:
would provide alignment for many common RAID5 data spindle counts::
64k * 2*2*2*3*3*5*7 bytes = 161280k
That space is wasted, but for only 157.5MB you can grow your RAID 5
volume to the following data-spindle counts without re-aligning:
volume to the following data-spindle counts without re-aligning::
3,4,5,6,7,8,9,10,12,14,15,18,20,21 ...
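The alignment arithmetic above can be checked with a short Python sketch. It assumes the offset is aligned for n data spindles exactly when n divides the multiplier 2*2*2*3*3*5*7, so that the offset is a whole number of full stripes; the names used are illustrative only.

```python
from math import prod

STRIPE_KIB = 64                   # stripe size from the example
FACTORS = [2, 2, 2, 3, 3, 5, 7]   # multiplier factors from the example

multiplier = prod(FACTORS)
offset_kib = STRIPE_KIB * multiplier
offset_mib = offset_kib / 1024


def aligned_spindle_counts(limit):
    """Data-spindle counts (3..limit) whose full stripe divides the offset."""
    return [n for n in range(3, limit + 1) if multiplier % n == 0]


if __name__ == "__main__":
    print(offset_kib, "KiB =", offset_mib, "MiB")
    print(aligned_spindle_counts(21))
```

Counts such as 11, 13 or 16 are missing from the list because they do not divide the multiplier.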
- Bad write performance
......@@ -313,9 +337,9 @@ want for getting the best possible numbers when benchmarking.
If write performance is not what you expected, you probably wanted to be
running in writeback mode, which isn't the default (not due to a lack of
maturity, but simply because in writeback mode you'll lose data if something
happens to your SSD)
happens to your SSD)::
# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo writeback > /sys/block/bcache0/bcache/cache_mode
- Bad performance, or traffic not going to the SSD that you'd expect
......@@ -325,13 +349,13 @@ want for getting the best possible numbers when benchmarking.
accessed data out of your cache.
But if you want to benchmark reads from cache, and you start out with fio
writing an 8 gigabyte test file - so you want to disable that.
writing an 8 gigabyte test file - so you want to disable that::
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
To set it back to the default (4 mb), do
To set it back to the default (4 mb), do::
# echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
# echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
- Traffic's still going to the spindle/still getting cache misses
......@@ -344,10 +368,10 @@ want for getting the best possible numbers when benchmarking.
throttles traffic if the latency exceeds a threshold (it does this by
cranking down the sequential bypass).
You can disable this if you need to by setting the thresholds to 0:
You can disable this if you need to by setting the thresholds to 0::
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
The default is 2000 us (2 milliseconds) for reads, and 20000 for writes.
......@@ -369,7 +393,7 @@ want for getting the best possible numbers when benchmarking.
a fix for the issue there).
SYSFS - BACKING DEVICE
Sysfs - backing device
----------------------
Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
......@@ -454,7 +478,8 @@ writeback_running
still be added to the cache until it is mostly full; only meant for
benchmarking. Defaults to on.
SYSFS - BACKING DEVICE STATS:
Sysfs - backing device stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are directories with these numbers for a running total, as well as
versions that decay over the past day, hour and 5 minutes; they're also
......@@ -463,14 +488,11 @@ aggregated in the cache set directory as well.
bypassed
Amount of IO (both reads and writes) that has bypassed the cache
cache_hits
cache_misses
cache_hit_ratio
cache_hits, cache_misses, cache_hit_ratio
Hits and misses are counted per individual IO as bcache sees them; a
partial hit is counted as a miss.
cache_bypass_hits
cache_bypass_misses
cache_bypass_hits, cache_bypass_misses
Hits and misses for IO that is intended to skip the cache are still counted,
but broken out here.
......@@ -482,7 +504,8 @@ cache_miss_collisions
cache_readaheads
Count of times readahead occurred.
SYSFS - CACHE SET:
Sysfs - cache set
~~~~~~~~~~~~~~~~~
Available at /sys/fs/bcache/<cset-uuid>
......@@ -520,8 +543,7 @@ flash_vol_create
Echoing a size to this file (in human readable units, k/M/G) creates a thinly
provisioned volume backed by the cache set.
io_error_halflife
io_error_limit
io_error_halflife, io_error_limit
These determine how many errors we accept before disabling the cache.
Each error is decayed by the half life (in # ios). If the decaying count
reaches io_error_limit dirty data is written out and the cache is disabled.
......@@ -545,7 +567,8 @@ unregister
Detaches all backing devices and closes the cache devices; if dirty data is
present it will disable writeback caching and wait for it to be flushed.
SYSFS - CACHE SET INTERNAL:
Sysfs - cache set internal
~~~~~~~~~~~~~~~~~~~~~~~~~~
This directory also exposes timings for a number of internal operations, with
separate files for average duration, average frequency, last occurrence and max
......@@ -574,7 +597,8 @@ cache_read_races
trigger_gc
Writing to this file forces garbage collection to run.
SYSFS - CACHE DEVICE:
Sysfs - Cache device
~~~~~~~~~~~~~~~~~~~~
Available at /sys/block/<cdev>/bcache
......
===============================================================
== BT8XXGPIO driver ==
== ==
== A driver for a selfmade cheap BT8xx based PCI GPIO-card ==
== ==
== For advanced documentation, see ==
== http://www.bu3sch.de/btgpio.php ==
===============================================================
===================================================================
A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio)
===================================================================
For advanced documentation, see http://www.bu3sch.de/btgpio.php
A generic digital 24-port PCI GPIO card can be built out of an ordinary
Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The
......@@ -17,9 +13,8 @@ The bt8xx chip does have 24 digital GPIO ports.
These ports are accessible via 24 pins on the SMD chip package.
==============================================
== How to physically access the GPIO pins ==
==============================================
How to physically access the GPIO pins
======================================
There are several ways to access these pins. One might unsolder the whole chip
and put it on a custom PCI board, or one might only unsolder each individual
......@@ -27,7 +22,7 @@ GPIO pin and solder that to some tiny wire. As the chip package really is tiny
there are some advanced soldering skills needed in any case.
The physical pinouts are drawn in the following ASCII art.
The GPIO pins are marked with G00-G23
The GPIO pins are marked with G00-G23::
G G G G G G G G G G G G G G G G G G
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
......
=======================================================================
README for btmrvl driver
=======================================================================
=============
btmrvl driver
=============
All commands are used via debugfs interface.
=====================
Set/get driver configurations:
Set/get driver configurations
=============================
Path: /debug/btmrvl/config/
gpiogap=[n]
hscfgcmd
These commands are used to configure the host sleep parameters.
gpiogap=[n], hscfgcmd
These commands are used to configure the host sleep parameters::
bit 8:0 -- Gap
bit 16:8 -- GPIO
......@@ -23,7 +21,8 @@ hscfgcmd
where Gap is the gap in milliseconds between wakeup signal and
wakeup event, or 0xff for special host sleep setting.
Usage:
Usage::
# Use SDIO interface to wake up the host and set GAP to 0x80:
echo 0xff80 > /debug/btmrvl/config/gpiogap
echo 1 > /debug/btmrvl/config/hscfgcmd
......@@ -32,15 +31,16 @@ hscfgcmd
echo 0x03ff > /debug/btmrvl/config/gpiogap
echo 1 > /debug/btmrvl/config/hscfgcmd
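The value written to gpiogap packs two fields into one 16-bit word. The sketch below shows one way to build it, assuming the low byte carries the gap and the next byte carries the GPIO field, which is what the 0xff80 example above implies (GPIO field 0xff, gap 0x80). The helper name btmrvl_pack_gpiogap() is hypothetical and is not part of the driver:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the btmrvl driver.
 * Assumes gap lives in bits 7:0 and the GPIO field in bits 15:8,
 * as the "echo 0xff80" example suggests.
 */
static inline uint16_t btmrvl_pack_gpiogap(uint8_t gpio, uint8_t gap)
{
	return (uint16_t)(((uint16_t)gpio << 8) | gap);
}
```

The result, printed in hex, is what would be echoed into /debug/btmrvl/config/gpiogap before writing 1 to hscfgcmd.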
psmode=[n]
pscmd
psmode=[n], pscmd
These commands are used to enable/disable auto sleep mode
where the option is:
where the option is::
1 -- Enable auto sleep mode
0 -- Disable auto sleep mode
Usage:
Usage::
# Enable auto sleep mode
echo 1 > /debug/btmrvl/config/psmode
echo 1 > /debug/btmrvl/config/pscmd
......@@ -50,15 +50,16 @@ pscmd
echo 1 > /debug/btmrvl/config/pscmd
hsmode=[n]
hscmd
hsmode=[n], hscmd
These commands are used to enable host sleep or wake up firmware
where the option is:
where the option is::
1 -- Enable host sleep
0 -- Wake up firmware
Usage:
Usage::
# Enable host sleep
echo 1 > /debug/btmrvl/config/hsmode
echo 1 > /debug/btmrvl/config/hscmd
......@@ -68,12 +69,13 @@ hscmd
echo 1 > /debug/btmrvl/config/hscmd
======================
Get driver status:
Get driver status
=================
Path: /debug/btmrvl/status/
Usage:
Usage::
cat /debug/btmrvl/status/<args>
where the args are:
......@@ -90,14 +92,17 @@ hsstate
txdnldrdy
This command displays the value of Tx download ready flag.
=====================
Issuing a raw hci command
=========================
Use hcitool to issue raw hci command, refer to hcitool manual
Usage: Hcitool cmd <ogf> <ocf> [Parameters]
Usage::
Hcitool cmd <ogf> <ocf> [Parameters]
Interface Control Command::
Interface Control Command
hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface
hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface
hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface
......@@ -105,13 +110,13 @@ Use hcitool to issue raw hci command, refer to hcitool manual
hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface
hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface
=======================================================================
SD8688 firmware
===============
SD8688 firmware:
Images:
/lib/firmware/sd8688_helper.bin
/lib/firmware/sd8688.bin
- /lib/firmware/sd8688_helper.bin
- /lib/firmware/sd8688.bin
The images can be downloaded from:
......
[ NOTE: The virt_to_bus() and bus_to_virt() functions have been
==========================================================
How to access I/O mapped memory from within device drivers
==========================================================
:Author: Linus
.. warning::
The virt_to_bus() and bus_to_virt() functions have been
superseded by the functionality provided by the PCI DMA interface
(see Documentation/DMA-API-HOWTO.txt). They continue
to be documented below for historical purposes, but new code
must not use them. --davidm 00/12/12 ]
must not use them. --davidm 00/12/12
[ This is a mail message in response to a query on IO mapping, thus the
strange format for a "document" ]
::
[ This is a mail message in response to a query on IO mapping, thus the
strange format for a "document" ]
The AHA-1542 is a bus-master device, and your patch makes the driver give the
controller the physical address of the buffers, which is correct on x86
(because all bus master devices see the physical memory mappings directly).
However, on many setups, there are actually _three_ different ways of looking
However, on many setups, there are actually **three** different ways of looking
at memory addresses, and in this case we actually want the third, the
so-called "bus address".
......@@ -38,7 +48,7 @@ because the memory and the devices share the same address space, and that is
not generally necessarily true on other PCI/ISA setups.
Now, just as an example, on the PReP (PowerPC Reference Platform), the
CPU sees a memory map something like this (this is from memory):
CPU sees a memory map something like this (this is from memory)::
0-2 GB "real memory"
2 GB-3 GB "system IO" (inb/out and similar accesses on x86)
......@@ -52,7 +62,7 @@ So when the CPU wants any bus master to write to physical memory 0, it
has to give the master address 0x80000000 as the memory address.
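As a rough illustration of the PReP case only, the translation can be thought of as a constant offset. This is a sketch under that assumption; the names PREP_BUS_OFFSET, prep_phys_to_bus() and prep_bus_to_phys() are made up here and are not kernel interfaces:

```c
#include <stdint.h>

/* Hypothetical constant: where a bus master sees physical address 0. */
#define PREP_BUS_OFFSET 0x80000000u

/* Physical RAM address (0-2 GB) to the address a bus master must use. */
static inline uint32_t prep_phys_to_bus(uint32_t phys)
{
	return phys + PREP_BUS_OFFSET;
}

/* Inverse mapping, for addresses handed back by the device. */
static inline uint32_t prep_bus_to_phys(uint32_t bus)
{
	return bus - PREP_BUS_OFFSET;
}
```

Other platforms use other mappings, or none at all, which is why drivers should never hard-code such an offset themselves.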
So, for example, depending on how the kernel is actually mapped on the
PPC, you can end up with a setup like this:
PPC, you can end up with a setup like this::
physical address: 0
virtual address: 0xC0000000
......@@ -61,7 +71,7 @@ PPC, you can end up with a setup like this:
where all the addresses actually point to the same thing. It's just seen
through different translations..
Similarly, on the Alpha, the normal translation is
Similarly, on the Alpha, the normal translation is::
physical address: 0
virtual address: 0xfffffc0000000000
......@@ -70,7 +80,7 @@ Similarly, on the Alpha, the normal translation is
(but there are also Alphas where the physical address and the bus address
are the same).
Anyway, the way to look up all these translations, you do
Anyway, the way to look up all these translations, you do::
#include <asm/io.h>
......@@ -81,8 +91,8 @@ Anyway, the way to look up all these translations, you do
Now, when do you need these?
You want the _virtual_ address when you are actually going to access that
pointer from the kernel. So you can have something like this:
You want the **virtual** address when you are actually going to access that
pointer from the kernel. So you can have something like this::
/*
* this is the hardware "mailbox" we use to communicate with
......@@ -104,7 +114,7 @@ pointer from the kernel. So you can have something like this:
...
on the other hand, you want the bus address when you have a buffer that
you want to give to the controller:
you want to give to the controller::
/* ask the controller to read the sense status into "sense_buffer" */
mbox.bufstart = virt_to_bus(&sense_buffer);
......@@ -112,7 +122,7 @@ you want to give to the controller:
mbox.status = 0;
notify_controller(&mbox);
And you generally _never_ want to use the physical address, because you can't
And you generally **never** want to use the physical address, because you can't
use that from the CPU (the CPU only uses translated virtual addresses), and
you can't use it from the bus master.
......@@ -124,8 +134,10 @@ be remapped as measured in units of pages, a.k.a. the pfn (the memory
management layer doesn't know about devices outside the CPU, so it
shouldn't need to know about "bus addresses" etc).
NOTE NOTE NOTE! The above is only one part of the whole equation. The above
only talks about "real memory", that is, CPU memory (RAM).
.. note::
The above is only one part of the whole equation. The above
only talks about "real memory", that is, CPU memory (RAM).
There is a completely different type of memory too, and that's the "shared
memory" on the PCI or ISA bus. That's generally not RAM (although in the case
......@@ -137,20 +149,22 @@ whatever, and there is only one way to access it: the readb/writeb and
related functions. You should never take the address of such memory, because
there is really nothing you can do with such an address: it's not
conceptually in the same memory space as "real memory" at all, so you cannot
just dereference a pointer. (Sadly, on x86 it _is_ in the same memory space,
just dereference a pointer. (Sadly, on x86 it **is** in the same memory space,
so on x86 it actually works to just dereference a pointer, but it's not
portable).
For such memory, you can do things like
For such memory, you can do things like:
- reading::
- reading:
/*
* read first 32 bits from ISA memory at 0xC0000, aka
* C000:0000 in DOS terms
*/
unsigned int signature = isa_readl(0xC0000);
- remapping and writing:
- remapping and writing::
/*
* remap framebuffer PCI memory area at 0xFC000000,
* size 1MB, so that we can access it: We can directly
......@@ -165,7 +179,8 @@ For such memory, you can do things like
/* unmap when we unload the driver */
iounmap(baseptr);
- copying and clearing:
- copying and clearing::
/* get the 6-byte Ethernet address at ISA address E000:0040 */
memcpy_fromio(kernel_buffer, 0xE0040, 6);
/* write a packet to the driver */
......@@ -181,10 +196,10 @@ happy that your driver works ;)
Note that kernel versions 2.0.x (and earlier) mistakenly called the
ioremap() function "vremap()". ioremap() is the proper name, but I
didn't think straight when I wrote it originally. People who have to
support both can do something like:
support both can do something like::
/* support old naming silliness */
#if LINUX_VERSION_CODE < 0x020100
#if LINUX_VERSION_CODE < 0x020100
#define ioremap vremap
#define iounmap vfree
#endif
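The constant 0x020100 is simply kernel version 2.1.0 in packed form. The sketch below spells that out; KVER() is a local stand-in written for this example that follows the classic (major << 16) + (minor << 8) + patch layout, and needs_vremap_compat() is a hypothetical helper:

```c
/* Local stand-in for the kernel's version packing; the name is made up. */
#define KVER(major, minor, patch) (((major) << 16) + ((minor) << 8) + (patch))

/* Nonzero when a kernel with this version code predates the ioremap() name. */
static inline int needs_vremap_compat(unsigned long version_code)
{
	return version_code < (unsigned long)KVER(2, 1, 0);
}
```

In real driver code the comparison is done by the preprocessor, as shown above, so that the old names never reach the compiler on newer kernels.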
......@@ -196,13 +211,10 @@ And the above sounds worse than it really is. Most real drivers really
don't do all that complex things (or rather: the complexity is not so
much in the actual IO accesses as in error handling and timeouts etc).
It's generally not hard to fix drivers, and in many cases the code
actually looks better afterwards:
actually looks better afterwards::
unsigned long signature = *(unsigned int *) 0xC0000;
vs
unsigned long signature = readl(0xC0000);
I think the second version actually is more readable, no?
Linus
Cache and TLB Flushing
Under Linux
==================================
Cache and TLB Flushing Under Linux
==================================
David S. Miller <davem@redhat.com>
:Author: David S. Miller <davem@redhat.com>
This document describes the cache/tlb flushing interfaces called
by the Linux VM subsystem. It enumerates over each interface,
......@@ -28,7 +29,7 @@ Therefore when software page table changes occur, the kernel will
invoke one of the following flush methods _after_ the page table
changes occur:
1) void flush_tlb_all(void)
1) ``void flush_tlb_all(void)``
The most severe flush of all. After this interface runs,
any previous page table modification whatsoever will be
......@@ -37,7 +38,7 @@ changes occur:
This is usually invoked when the kernel page tables are
changed, since such translations are "global" in nature.
2) void flush_tlb_mm(struct mm_struct *mm)
2) ``void flush_tlb_mm(struct mm_struct *mm)``
This interface flushes an entire user address space from
the TLB. After running, this interface must make sure that
......@@ -49,8 +50,8 @@ changes occur:
page table operations such as what happens during
fork, and exec.
3) void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
3) ``void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)``
Here we are flushing a specific range of (user) virtual
address translations from the TLB. After running, this
......@@ -69,7 +70,7 @@ changes occur:
call flush_tlb_page (see below) for each entry which may be
modified.
4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)``
This time we need to remove the PAGE_SIZE sized translation
from the TLB. The 'vma' is the backing structure used by
......@@ -87,8 +88,8 @@ changes occur:
This is used primarily during fault processing.
5) void update_mmu_cache(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
5) ``void update_mmu_cache(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)``
At the end of every page fault, this routine is invoked to
tell the architecture specific code that a translation
......@@ -100,7 +101,7 @@ changes occur:
translations for software managed TLB configurations.
The sparc64 port currently does this.
6) void tlb_migrate_finish(struct mm_struct *mm)
6) ``void tlb_migrate_finish(struct mm_struct *mm)``
This interface is called at the end of an explicit
process migration. This interface provides a hook
......@@ -112,7 +113,7 @@ changes occur:
Next, we have the cache flushing interfaces. In general, when Linux
is changing an existing virtual-->physical mapping to a new value,
the sequence will be in one of the following forms:
the sequence will be in one of the following forms::
1) flush_cache_mm(mm);
change_all_page_tables_of(mm);
......@@ -143,7 +144,7 @@ and have no dependency on translation information.
Here are the routines, one by one:
1) void flush_cache_mm(struct mm_struct *mm)
1) ``void flush_cache_mm(struct mm_struct *mm)``
This interface flushes an entire user address space from
the caches. That is, after running, there will be no cache
......@@ -152,7 +153,7 @@ Here are the routines, one by one:
This interface is used to handle whole address space
page table operations such as what happens during exit and exec.
2) void flush_cache_dup_mm(struct mm_struct *mm)
2) ``void flush_cache_dup_mm(struct mm_struct *mm)``
This interface flushes an entire user address space from
the caches. That is, after running, there will be no cache
......@@ -164,8 +165,8 @@ Here are the routines, one by one:
This option is separate from flush_cache_mm to allow some
optimizations for VIPT caches.
3) void flush_cache_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
3) ``void flush_cache_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)``
Here we are flushing a specific range of (user) virtual
addresses from the cache. After running, there will be no
......@@ -181,7 +182,7 @@ Here are the routines, one by one:
call flush_cache_page (see below) for each entry which may be
modified.
4) void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)``
This time we need to remove a PAGE_SIZE sized range
from the cache. The 'vma' is the backing structure used by
......@@ -202,7 +203,7 @@ Here are the routines, one by one:
This is used primarily during fault processing.
5) void flush_cache_kmaps(void)
5) ``void flush_cache_kmaps(void)``
This routine need only be implemented if the platform utilizes
highmem. It will be called right before all of the kmaps
......@@ -214,8 +215,8 @@ Here are the routines, one by one:
This routine should be implemented in asm/highmem.h
6) void flush_cache_vmap(unsigned long start, unsigned long end)
void flush_cache_vunmap(unsigned long start, unsigned long end)
6) ``void flush_cache_vmap(unsigned long start, unsigned long end)``
``void flush_cache_vunmap(unsigned long start, unsigned long end)``
Here in these two interfaces we are flushing a specific range
of (kernel) virtual addresses from the cache. After running,
......@@ -243,8 +244,10 @@ size). This setting will force the SYSv IPC layer to only allow user
processes to mmap shared memory at addresses which are a multiple of
this value.
NOTE: This does not fix shared mmaps, check out the sparc64 port for
one way to solve this (in particular SPARC_FLAG_MMAPSHARED).
.. note::
This does not fix shared mmaps, check out the sparc64 port for
one way to solve this (in particular SPARC_FLAG_MMAPSHARED).
Next, you have to solve the D-cache aliasing issue for all
other cases. Please keep in mind the fact that, for a given page
......@@ -255,8 +258,8 @@ physical page into its address space, by implication the D-cache
aliasing problem has the potential to exist since the kernel already
maps this page at its virtual address.
void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)
void clear_user_page(void *to, unsigned long addr, struct page *page)
``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)``
``void clear_user_page(void *to, unsigned long addr, struct page *page)``
These two routines store data in user anonymous or COW
pages. It allows a port to efficiently avoid D-cache alias
......@@ -276,14 +279,16 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, these two routines may
simply call memcpy/memset directly and do nothing more.
void flush_dcache_page(struct page *page)
``void flush_dcache_page(struct page *page)``
Any time the kernel writes to a page cache page, _OR_
the kernel is about to read from a page cache page and
user space shared/writable mappings of this page potentially
exist, this routine is called.
NOTE: This routine need only be called for page cache pages
.. note::
This routine need only be called for page cache pages
which can potentially ever be mapped into the address
space of a user process. So for example, VFS layer code
handling vfs symlinks in the page cache need not call
......@@ -322,18 +327,19 @@ maps this page at its virtual address.
made of this flag bit, and if set the flush is done and the flag
bit is cleared.
IMPORTANT NOTE: It is often important, if you defer the flush,
.. important::
It is often important, if you defer the flush,
that the actual flush occurs on the same CPU
as did the cpu stores into the page to make it
dirty. Again, see sparc64 for examples of how
to deal with this.
void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr,
void *dst, void *src, int len)
void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr,
void *dst, void *src, int len)
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
When the kernel needs to copy arbitrary data in and out
of arbitrary user pages (f.e. for ptrace()) it will use
these two routines.
......@@ -344,8 +350,9 @@ maps this page at its virtual address.
likely that you will need to flush the instruction cache
for copy_to_user_page().
void flush_anon_page(struct vm_area_struct *vma, struct page *page,
unsigned long vmaddr)
``void flush_anon_page(struct vm_area_struct *vma, struct page *page,
unsigned long vmaddr)``
When the kernel needs to access the contents of an anonymous
page, it calls this function (currently only
get_user_pages()). Note: flush_dcache_page() deliberately
......@@ -354,7 +361,8 @@ maps this page at its virtual address.
architectures). For incoherent architectures, it should flush
the cache of the page at vmaddr.
void flush_kernel_dcache_page(struct page *page)
``void flush_kernel_dcache_page(struct page *page)``
When the kernel needs to modify a user page it has obtained
with kmap, it calls this function after all modifications are
complete (but before kunmapping it) to bring the underlying
......@@ -366,14 +374,16 @@ maps this page at its virtual address.
the kernel cache for page (using page_address(page)).
void flush_icache_range(unsigned long start, unsigned long end)
``void flush_icache_range(unsigned long start, unsigned long end)``
When the kernel stores into addresses that it will execute
out of (eg when loading modules), this function is called.
If the icache does not snoop stores then this routine will need
to flush it.
void flush_icache_page(struct vm_area_struct *vma, struct page *page)
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
All the functionality of flush_icache_page can be implemented in
flush_dcache_page and update_mmu_cache. In the future, the hope
is to remove this interface completely.
......@@ -387,7 +397,8 @@ the kernel trying to do I/O to vmap areas must manually manage
coherency. It must do this by flushing the vmap range before doing
I/O and invalidating it after the I/O returns.
void flush_kernel_vmap_range(void *vaddr, int size)
``void flush_kernel_vmap_range(void *vaddr, int size)``
flushes the kernel cache for a given virtual address range in
the vmap area. This is to make sure that any data the kernel
modified in the vmap range is made visible to the physical
......@@ -395,7 +406,8 @@ I/O and invalidating it after the I/O returns.
Note that this API does *not* also flush the offset map alias
of the area.
void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates
``void invalidate_kernel_vmap_range(void *vaddr, int size)``
invalidates the cache for a given virtual address range in the vmap area
which prevents the processor from making the cache stale by
speculatively reading data while the I/O was occurring to the
......
================
CIRCULAR BUFFERS
================
================
Circular Buffers
================
By: David Howells <dhowells@redhat.com>
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
:Author: David Howells <dhowells@redhat.com>
:Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Linux provides a number of features that can be used to implement circular
......@@ -20,7 +20,7 @@ producer and just one consumer. It is possible to handle multiple producers by
serialising them, and to handle multiple consumers by serialising them.
Contents:
.. Contents:
(*) What is a circular buffer?
......@@ -31,8 +31,8 @@ Contents:
- The consumer.
==========================
WHAT IS A CIRCULAR BUFFER?
What is a circular buffer?
==========================
First of all, what is a circular buffer? A circular buffer is a buffer of
......@@ -60,9 +60,7 @@ buffer, provided that neither index overtakes the other. The implementer must
be careful, however, as a region more than one unit in size may wrap the end of
the buffer and be broken into two segments.
============================
MEASURING POWER-OF-2 BUFFERS
Measuring power-of-2 buffers
============================
Calculation of the occupancy or the remaining capacity of an arbitrarily sized
......@@ -71,13 +69,13 @@ modulus (divide) instruction. However, if the buffer is of a power-of-2 size,
then a much quicker bitwise-AND instruction can be used instead.
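To make the bitwise-AND trick concrete, here is a minimal userspace sketch. The functions circ_cnt() and circ_space() are written for this example and only mirror the idea behind the kernel macros; they assume that size is a power of 2 and that one slot is always left empty so a full buffer can be told apart from an empty one:

```c
#include <stddef.h>

/* Number of items in the buffer; size must be a power of 2. */
static inline size_t circ_cnt(size_t head, size_t tail, size_t size)
{
	/* Unsigned wraparound plus the mask replaces a modulus. */
	return (head - tail) & (size - 1);
}

/* Free slots left for the producer; one slot is always kept empty. */
static inline size_t circ_space(size_t head, size_t tail, size_t size)
{
	return circ_cnt(tail, head + 1, size);
}
```

With an 8-slot buffer, occupancy and free space always add up to 7, whether or not the head index has wrapped past the end of the buffer.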
Linux provides a set of macros for handling power-of-2 circular buffers. These
can be made use of by:
can be made use of by::
#include <linux/circ_buf.h>
The macros are:
(*) Measure the remaining capacity of a buffer:
(#) Measure the remaining capacity of a buffer::
CIRC_SPACE(head_index, tail_index, buffer_size);
......@@ -85,7 +83,7 @@ The macros are:
can be inserted.
(*) Measure the maximum consecutive immediate space in a buffer:
(#) Measure the maximum consecutive immediate space in a buffer::
CIRC_SPACE_TO_END(head_index, tail_index, buffer_size);
......@@ -94,14 +92,14 @@ The macros are:
beginning of the buffer.
(*) Measure the occupancy of a buffer:
(#) Measure the occupancy of a buffer::
CIRC_CNT(head_index, tail_index, buffer_size);
This returns the number of items currently occupying a buffer[2].
(*) Measure the non-wrapping occupancy of a buffer:
(#) Measure the non-wrapping occupancy of a buffer::
CIRC_CNT_TO_END(head_index, tail_index, buffer_size);
......@@ -112,7 +110,7 @@ The macros are:
Each of these macros will nominally return a value between 0 and buffer_size-1,
however:
[1] CIRC_SPACE*() are intended to be used in the producer. To the producer
(1) CIRC_SPACE*() are intended to be used in the producer. To the producer
they will return a lower bound as the producer controls the head index,
but the consumer may still be depleting the buffer on another CPU and
moving the tail index.
......@@ -120,7 +118,7 @@ however:
To the consumer it will show an upper bound as the producer may be busy
depleting the space.
[2] CIRC_CNT*() are intended to be used in the consumer. To the consumer they
(2) CIRC_CNT*() are intended to be used in the consumer. To the consumer they
will return a lower bound as the consumer controls the tail index, but the
producer may still be filling the buffer on another CPU and moving the
head index.
......@@ -128,14 +126,12 @@ however:
To the producer it will show an upper bound as the consumer may be busy
emptying the buffer.
[3] To a third party, the order in which the writes to the indices by the
(3) To a third party, the order in which the writes to the indices by the
producer and consumer become visible cannot be guaranteed as they are
independent and may be made on different CPUs - so the result in such a
situation will merely be a guess, and may even be negative.
===========================================
USING MEMORY BARRIERS WITH CIRCULAR BUFFERS
Using memory barriers with circular buffers
===========================================
By using memory barriers in conjunction with circular buffers, you can avoid
......@@ -152,10 +148,10 @@ time, and only one thing should be emptying a buffer at any one time, but the
two sides can operate simultaneously.
THE PRODUCER
The producer
------------
The producer will look something like this:
The producer will look something like this::
spin_lock(&producer_lock);
......@@ -193,10 +189,10 @@ ordering between the read of the index indicating that the consumer has
vacated a given element and the write by the producer to that same element.
THE CONSUMER
The Consumer
------------
The consumer will look something like this:
The consumer will look something like this::
spin_lock(&consumer_lock);
......@@ -235,8 +231,7 @@ prevents the compiler from tearing the store, and enforces ordering
against previous accesses.
===============
FURTHER READING
Further reading
===============
See also Documentation/memory-barriers.txt for a description of Linux's memory
......
The Common Clk Framework
Mike Turquette <mturquette@ti.com>
========================
The Common Clk Framework
========================
:Author: Mike Turquette <mturquette@ti.com>
This document endeavours to explain the common clk framework details,
and how to port a platform over to this framework. It is not yet a
detailed explanation of the clock api in include/linux/clk.h, but
perhaps someday it will include that information.
Part 1 - introduction and interface split
Introduction and interface split
================================
The common clk framework is an interface to control the clock nodes
available on various devices today. This may come in the form of clock
......@@ -35,10 +39,11 @@ is defined in struct clk_foo and pointed to within struct clk_core. This
allows for easy navigation between the two discrete halves of the common
clock interface.
Part 2 - common data structures and api
Common data structures and api
==============================
Below is the common struct clk_core definition from
drivers/clk/clk.c, modified for brevity:
drivers/clk/clk.c, modified for brevity::
struct clk_core {
const char *name;
......@@ -59,7 +64,7 @@ struct clk. That api is documented in include/linux/clk.h.
Platforms and devices utilizing the common struct clk_core use the struct
clk_ops pointer in struct clk_core to perform the hardware-specific parts of
the operations defined in clk-provider.h:
the operations defined in clk-provider.h::
struct clk_ops {
int (*prepare)(struct clk_hw *hw);
......@@ -95,19 +100,20 @@ the operations defined in clk-provider.h:
struct dentry *dentry);
};
Part 3 - hardware clk implementations
Hardware clk implementations
============================
The strength of the common struct clk_core comes from its .ops and .hw pointers
which abstract the details of struct clk from the hardware-specific bits, and
vice versa. To illustrate consider the simple gateable clk implementation in
drivers/clk/clk-gate.c:
drivers/clk/clk-gate.c::
struct clk_gate {
struct clk_hw hw;
void __iomem *reg;
u8 bit_idx;
...
};
struct clk_gate {
struct clk_hw hw;
void __iomem *reg;
u8 bit_idx;
...
};
struct clk_gate contains struct clk_hw hw as well as hardware-specific
knowledge about which register and bit controls this clk's gating.
......@@ -115,7 +121,7 @@ Nothing about clock topology or accounting, such as enable_count or
notifier_count, is needed here. That is all handled by the common
framework code and struct clk_core.
Let's walk through enabling this clk from driver code:
Let's walk through enabling this clk from driver code::
struct clk *clk;
clk = clk_get(NULL, "my_gateable_clk");
......@@ -123,70 +129,71 @@ Let's walk through enabling this clk from driver code:
clk_prepare(clk);
clk_enable(clk);
The call graph for clk_enable is very simple:
The call graph for clk_enable is very simple::
clk_enable(clk);
clk->ops->enable(clk->hw);
[resolves to...]
clk_gate_enable(hw);
[resolves struct clk gate with to_clk_gate(hw)]
clk_gate_set_bit(gate);
clk_enable(clk);
clk->ops->enable(clk->hw);
[resolves to...]
clk_gate_enable(hw);
[resolves struct clk gate with to_clk_gate(hw)]
clk_gate_set_bit(gate);
And the definition of clk_gate_set_bit:
And the definition of clk_gate_set_bit::
static void clk_gate_set_bit(struct clk_gate *gate)
{
u32 reg;
static void clk_gate_set_bit(struct clk_gate *gate)
{
u32 reg;
reg = __raw_readl(gate->reg);
reg |= BIT(gate->bit_idx);
writel(reg, gate->reg);
}
reg = __raw_readl(gate->reg);
reg |= BIT(gate->bit_idx);
writel(reg, gate->reg);
}
Note that to_clk_gate is defined as:
Note that to_clk_gate is defined as::
#define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw)
#define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw)
This pattern of abstraction is used for every clock hardware
representation.
Part 4 - supporting your own clk hardware
Supporting your own clk hardware
================================
When implementing support for a new type of clock it is only necessary to
include the following header:
include the following header::
#include <linux/clk-provider.h>
#include <linux/clk-provider.h>
To construct a clk hardware structure for your platform you must define
the following:
the following::
struct clk_foo {
struct clk_hw hw;
... hardware specific data goes here ...
};
struct clk_foo {
struct clk_hw hw;
... hardware specific data goes here ...
};
To take advantage of your data you'll need to support valid operations
for your clk:
for your clk::
struct clk_ops clk_foo_ops = {
.enable = &clk_foo_enable,
.disable = &clk_foo_disable,
};
struct clk_ops clk_foo_ops = {
.enable = &clk_foo_enable,
.disable = &clk_foo_disable,
};
Implement the above functions using container_of:
Implement the above functions using container_of::
#define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw)
#define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw)
int clk_foo_enable(struct clk_hw *hw)
{
struct clk_foo *foo;
int clk_foo_enable(struct clk_hw *hw)
{
struct clk_foo *foo;
foo = to_clk_foo(hw);
foo = to_clk_foo(hw);
... perform magic on foo ...
... perform magic on foo ...
return 0;
};
return 0;
};
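The fragments above can be tied together in a small userspace sketch that compiles on its own. Everything here is a simplified stand-in: struct clk_hw and struct clk_ops are stubs with only the fields needed for the example, container_of() is redefined locally, and the enabled field is a made-up piece of "hardware state":

```c
#include <stddef.h>

/* Local stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stub types; the real ones come from <linux/clk-provider.h>. */
struct clk_hw { int unused; };

struct clk_ops {
	int (*enable)(struct clk_hw *hw);
	void (*disable)(struct clk_hw *hw);
};

struct clk_foo {
	struct clk_hw hw;
	int enabled;		/* hypothetical hardware state */
};

#define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw)

static int clk_foo_enable(struct clk_hw *hw)
{
	struct clk_foo *foo = to_clk_foo(hw);	/* recover the outer struct */

	foo->enabled = 1;
	return 0;
}

static void clk_foo_disable(struct clk_hw *hw)
{
	to_clk_foo(hw)->enabled = 0;
}

static const struct clk_ops clk_foo_ops = {
	.enable = clk_foo_enable,
	.disable = clk_foo_disable,
};
```

The framework only ever sees the embedded struct clk_hw pointer; the callbacks use to_clk_foo() to get back to the driver's own structure, which is why hw must be a member of struct clk_foo rather than a pointer to a separate object.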
Below is a matrix detailing which clk_ops are mandatory based upon the
hardware capabilities of that clock. A cell marked as "y" means
......@@ -194,41 +201,56 @@ mandatory, a cell marked as "n" implies that either including that
callback is invalid or otherwise unnecessary. Empty cells are either
optional or must be evaluated on a case-by-case basis.
                              clock hardware characteristics
                -----------------------------------------------------------
                | gate | change rate | single parent | multiplexer | root |
                |------|-------------|---------------|-------------|------|
.prepare        |      |             |               |             |      |
.unprepare      |      |             |               |             |      |
                |      |             |               |             |      |
.enable         | y    |             |               |             |      |
.disable        | y    |             |               |             |      |
.is_enabled     | y    |             |               |             |      |
                |      |             |               |             |      |
.recalc_rate    |      | y           |               |             |      |
.round_rate     |      | y [1]       |               |             |      |
.determine_rate |      | y [1]       |               |             |      |
.set_rate       |      | y           |               |             |      |
                |      |             |               |             |      |
.set_parent     |      |             | n             | y           | n    |
.get_parent     |      |             | n             | y           | n    |
                |      |             |               |             |      |
.recalc_accuracy|      |             |               |             |      |
                |      |             |               |             |      |
.init           |      |             |               |             |      |
                -----------------------------------------------------------
[1] either one of round_rate or determine_rate is required.
.. table:: clock hardware characteristics

   +----------------+------+-------------+---------------+-------------+------+
   |                | gate | change rate | single parent | multiplexer | root |
   +================+======+=============+===============+=============+======+
   |.prepare        |      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.unprepare      |      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   +----------------+------+-------------+---------------+-------------+------+
   |.enable         | y    |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.disable        | y    |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.is_enabled     | y    |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   +----------------+------+-------------+---------------+-------------+------+
   |.recalc_rate    |      | y           |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.round_rate     |      | y [1]_      |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.determine_rate |      | y [1]_      |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   |.set_rate       |      | y           |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   +----------------+------+-------------+---------------+-------------+------+
   |.set_parent     |      |             | n             | y           | n    |
   +----------------+------+-------------+---------------+-------------+------+
   |.get_parent     |      |             | n             | y           | n    |
   +----------------+------+-------------+---------------+-------------+------+
   +----------------+------+-------------+---------------+-------------+------+
   |.recalc_accuracy|      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+
   +----------------+------+-------------+---------------+-------------+------+
   |.init           |      |             |               |             |      |
   +----------------+------+-------------+---------------+-------------+------+

.. [1] either one of round_rate or determine_rate is required.
Finally, register your clock at run-time with a hardware-specific
registration function. This function simply populates struct clk_foo's
data and then passes the common struct clk parameters to the framework
with a call to:
with a call to::
clk_register(...)
clk_register(...)
See the basic clock types in drivers/clk/clk-*.c for examples.
See the basic clock types in ``drivers/clk/clk-*.c`` for examples.
Part 5 - Disabling clock gating of unused clocks
Disabling clock gating of unused clocks
=======================================
Sometimes during development it can be useful to be able to bypass the
default disabling of unused clocks. For example, if drivers aren't enabling
......@@ -239,7 +261,8 @@ are sorted out.
To bypass this disabling, include "clk_ignore_unused" in the bootargs to the
kernel.
Part 6 - Locking
Locking
=======
The common clock framework uses two global locks, the prepare lock and the
enable lock.
......
========
CPU load
--------
========
Linux exports various bits of information via `/proc/stat' and
`/proc/uptime' that userland tools, such as top(1), use to calculate
the average time system spent in a particular state, for example:
Linux exports various bits of information via ``/proc/stat`` and
``/proc/uptime`` that userland tools, such as top(1), use to calculate
the average time system spent in a particular state, for example::
$ iostat
Linux 2.6.18.3-exp (linmac) 02/20/2007
......@@ -17,7 +18,7 @@ Here the system thinks that over the default sampling period the
system spent 10.01% of the time doing work in user space, 2.92% in the
kernel, and was overall 81.63% of the time idle.
In most cases the `/proc/stat' information reflects the reality quite
In most cases the ``/proc/stat`` information reflects the reality quite
closely, however due to the nature of how/when the kernel collects
this data sometimes it can not be trusted at all.
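As a rough illustration of how a userland tool might turn the ``/proc/stat`` counters into a percentage, here is a minimal sketch. It assumes the aggregate "cpu" line begins with the user, nice, system and idle counters in that order, and it deliberately ignores the later fields (iowait, irq and so on), so its figures will not match iostat or top exactly. The names cpu_times, parse_cpu_line and idle_percent are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* First four counters of the aggregate "cpu" line, in clock ticks. */
struct cpu_times {
	unsigned long long user, nice, system, idle;
};

/*
 * Parse the aggregate "cpu" line (not the per-CPU "cpuN" lines).
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_cpu_line(const char *line, struct cpu_times *t)
{
	if (strncmp(line, "cpu ", 4) != 0)
		return -1;
	if (sscanf(line + 4, "%llu %llu %llu %llu",
		   &t->user, &t->nice, &t->system, &t->idle) != 4)
		return -1;
	return 0;
}

/* Idle share, in percent, of the interval between samples a and b. */
double idle_percent(const struct cpu_times *a, const struct cpu_times *b)
{
	unsigned long long idle = b->idle - a->idle;
	unsigned long long total = (b->user - a->user) + (b->nice - a->nice) +
				   (b->system - a->system) + idle;

	return total ? 100.0 * (double)idle / (double)total : 0.0;
}
```

The counters accumulate since boot, so a tool reads the line twice, some interval apart, and works on the difference. Whatever the kernel failed to account for in that interval is invisible to this arithmetic.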
......@@ -33,78 +34,78 @@ Example
-------
If we imagine the system with one task that periodically burns cycles
in the following manner:
in the following manner::
 time line between two timer interrupts
|--------------------------------------|
 ^                                    ^
 |_ something begins working          |
                                      |_ something goes to sleep
                                     (only to be awaken quite soon)
 time line between two timer interrupts
|--------------------------------------|
 ^                                    ^
 |_ something begins working          |
                                      |_ something goes to sleep
                                     (only to be awaken quite soon)
In the above situation the system will be 0% loaded according to the
`/proc/stat' (since the timer interrupt will always happen when the
``/proc/stat`` (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.
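The effect can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not how the kernel implements its accounting: time is cut into periods of equal length, the timer interrupt fires at offset 0 of each period, and the whole period is charged to whatever was running at that instant. The names load and simulate, and the choice of 100 time units per period, are hypothetical.

```c
/* Load as seen by tick sampling versus the load that really occurred. */
struct load {
	double sampled_pct;
	double actual_pct;
};

/*
 * In every period the task runs for offsets in [start, end).
 * The tick fires at offset 0 and charges the whole period to the
 * state observed at that instant.
 */
struct load simulate(unsigned period, unsigned start, unsigned end,
		     unsigned nticks)
{
	struct load l = { 0.0, 0.0 };
	unsigned long busy_ticks = 0, busy_units = 0;
	unsigned t, off;

	if (period == 0 || nticks == 0)
		return l;

	for (t = 0; t < nticks; t++) {
		for (off = 0; off < period; off++) {
			int running = off >= start && off < end;

			if (off == 0 && running)
				busy_ticks++;	/* what the sampler sees */
			busy_units += running;	/* what really happened */
		}
	}

	l.sampled_pct = 100.0 * (double)busy_ticks / (double)nticks;
	l.actual_pct = 100.0 * (double)busy_units /
		       ((double)nticks * (double)period);
	return l;
}
```

Calling ``simulate(100, 1, 100, 100)`` models the task in the diagram: it starts just after each tick and sleeps just before the next one, so the sampled load is 0% while the actual load is 99%. Shifting the task so that it runs only at offset 0 (``simulate(100, 0, 1, 100)``) makes the error go the other way.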
One can imagine many more situations where this behavior of the kernel
will lead to quite erratic information inside `/proc/stat'.
/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>
#define HIST 10
static volatile sig_atomic_t stop;
static void sighandler (int signr)
{
(void) signr;
stop = 1;
}
static unsigned long hog (unsigned long niters)
{
stop = 0;
while (!stop && --niters);
return niters;
}
int main (void)
{
int i;
struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
.it_value = { .tv_sec = 0, .tv_usec = 1 } };
sigset_t set;
unsigned long v[HIST];
double tmp = 0.0;
unsigned long n;
signal (SIGALRM, &sighandler);
setitimer (ITIMER_REAL, &it, NULL);
hog (ULONG_MAX);
for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
for (i = 0; i < HIST; ++i) tmp += v[i];
tmp /= HIST;
n = tmp - (tmp / 3.0);
sigemptyset (&set);
sigaddset (&set, SIGALRM);
for (;;) {
hog (n);
sigwait (&set, &i);
}
return 0;
}
will lead to quite erratic information inside ``/proc/stat``::
/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>
#define HIST 10
static volatile sig_atomic_t stop;
static void sighandler (int signr)
{
(void) signr;
stop = 1;
}
static unsigned long hog (unsigned long niters)
{
stop = 0;
while (!stop && --niters);
return niters;
}
int main (void)
{
int i;
struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
.it_value = { .tv_sec = 0, .tv_usec = 1 } };
sigset_t set;
unsigned long v[HIST];
double tmp = 0.0;
unsigned long n;
signal (SIGALRM, &sighandler);
setitimer (ITIMER_REAL, &it, NULL);
hog (ULONG_MAX);
for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
for (i = 0; i < HIST; ++i) tmp += v[i];
tmp /= HIST;
n = tmp - (tmp / 3.0);
sigemptyset (&set);
sigaddset (&set, SIGALRM);
for (;;) {
hog (n);
sigwait (&set, &i);
}
return 0;
}
References
----------
http://lkml.org/lkml/2007/2/12/6
Documentation/filesystems/proc.txt (1.8)
- http://lkml.org/lkml/2007/2/12/6
- Documentation/filesystems/proc.txt (1.8)
Thanks
......