Commits · df849a1529c106f7460e51479ca78fe07b07dc8c · nexedi / linux

30 Jun, 2006 20 commits

[PATCH] zoned vm counters: conversion of nr_pagetables to per zone counter · df849a15

Christoph Lameter authored Jun 30, 2006

Conversion of nr_page_table_pages to a per zone counter

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

df849a15

[PATCH] zoned vm counters: conversion of nr_slab to per zone counter · 9a865ffa

Christoph Lameter authored Jun 30, 2006

- Allows reclaim to access counter without looping over processor counts.

- Allows accurate statistics on how many pages are used in a zone by
  the slab. This may become useful to balance slab allocations over
  various zones.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

9a865ffa

[PATCH] zoned vm counters: zone_reclaim: remove /proc/sys/vm/zone_reclaim_interval · 34aa1330

Christoph Lameter authored Jun 30, 2006

The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone.  Therefore we had to scan in
intervals to figure out if any pages were unmapped.

With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone.  So we can simply skip the
reclaim if there is an insufficient number of unmapped pages.  We use
SWAP_CLUSTER_MAX as the boundary.

Drop all support for /proc/sys/vm/zone_reclaim_interval.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

34aa1330

[PATCH] zoned vm counters: split NR_ANON_PAGES off from NR_FILE_MAPPED · f3dbd344

Christoph Lameter authored Jun 30, 2006

The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
calculation as the number of mapped pagecache pages.  However, that is not
true.  NR_FILE_MAPPED includes the mapped anonymous pages.  This patch
separates those and therefore allows an accurate tracking of the anonymous
pages per zone.

It then becomes possible to determine the number of unmapped pages per zone
and we can avoid scanning for unmapped pages if there are none.

Also it may now be possible to determine the mapped/unmapped ratio in
get_dirty_limit.  Isnt the number of anonymous pages irrelevant in that
calculation?

Note that this will change the meaning of the number of mapped pages reported
in /proc/vmstat /proc/meminfo and in the per node statistics.  This may affect
user space tools that monitor these counters!  NR_FILE_MAPPED works like
NR_FILE_DIRTY.  It is only valid for pagecache pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

f3dbd344

[PATCH] zoned vm counters: remove NR_FILE_MAPPED from scan control structure · bf02cf4b

Christoph Lameter authored Jun 30, 2006

We can now access the number of pages in a mapped state in an inexpensive way
in shrink_active_list.  So drop the nr_mapped field from scan_control.

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

bf02cf4b

[PATCH] zoned vm counters: conversion of nr_pagecache to per zone counter · 347ce434

Christoph Lameter authored Jun 30, 2006

Currently a single atomic variable is used to establish the size of the page
cache in the whole machine.  The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.

Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.

Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

347ce434

[PATCH] zoned vm counters: convert nr_mapped to per zone counter · 65ba55f5

Christoph Lameter authored Jun 30, 2006

nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.

We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).

We replace the use of nr_mapped in various kernel locations.  This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).

[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

65ba55f5

[PATCH] zoned vm counters: basic ZVC (zoned vm counter) implementation · 2244b95a

Christoph Lameter authored Jun 30, 2006

Per zone counter infrastructure

The counters that we currently have for the VM are split per processor.  The
processor however has not much to do with the zone these pages belong to.  We
cannot tell f.e.  how many ZONE_DMA pages are dirty.

So we are blind to potentially inbalances in the usage of memory in various
zones.  F.e.  in a NUMA system we cannot tell how many pages are dirty on a
particular node.  If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system.  For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.

Another example is zone reclaim.  We do not know how many unmapped pages exist
per zone.  So we just have to try to reclaim.  If it is not working then we
pause and try again later.  It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone.  This patchset allows the determination
of the number of unmapped pages per zone.  We can remove the zone reclaim
interval with the counters introduced here.

Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.

The counter framework here implements differential counters for each processor
in struct zone.  The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.

Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array.  VM functions can
access the counts by simply indexing a global or zone specific array.

The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.

Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state.  The second group of
functions can be called if it is known that interrupts are disabled.

Special optimized increment and decrement functions are provided.  These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.

We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated.  This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state().  In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

2244b95a

[PATCH] zoned vm counters: create vmstat.c/.h from page_alloc.c/.h · f6ac2354

Christoph Lameter authored Jun 30, 2006

NOTE: ZVC are *not* the lightweight event counters.  ZVCs are reliable whereas
event counters do not need to be.

Zone based VM statistics are necessary to be able to determine what the state
of memory in one zone is.  In a NUMA system this can be helpful for local
reclaim and other memory optimizations that may be able to shift VM load in
order to get more balanced memory use.

It is also useful to know how the computing load affects the memory
allocations on various zones.  This patchset allows the retrieval of that data
from userspace.

The patchset introduces a framework for counters that is a cross between the
existing page_stats --which are simply global counters split per cpu-- and the
approach of deferred incremental updates implemented for nr_pagecache.

Small per cpu 8 bit counters are added to struct zone.  If the counter exceeds
certain thresholds then the counters are accumulated in an array of
atomic_long in the zone and in a global array that sums up all zone values.
The small 8 bit counters are next to the per cpu page pointers and so they
will be in high in the cpu cache when pages are allocated and freed.

Access to VM counter information for a zone and for the whole machine is then
possible by simply indexing an array (Thanks to Nick Piggin for pointing out
that approach).  The access to the total number of pages of various types does
no longer require the summing up of all per cpu counters.

Benefits of this patchset right now:

- Ability for UP and SMP configuration to determine how memory
  is balanced between the DMA, NORMAL and HIGHMEM zones.

- loops over all processors are avoided in writeback and
  reclaim paths. We can avoid caching the writeback information
  because the needed information is directly accessible.

- Special handling for nr_pagecache removed.

- zone_reclaim_interval vanishes since VM stats can now determine
  when it is worth to do local reclaim.

- Fast inline per node page state determination.

- Accurate counters in /sys/devices/system/node/node*/meminfo. Current
  counters are counting simply which processor allocated a page somewhere
  and guestimate based on that. So the counters were not useful to show
  the actual distribution of page use on a specific zone.

- The swap_prefetch patch requires per node statistics in order to
  figure out when processors of a node can prefetch. This patch provides
  some of the needed numbers.

- Detailed VM counters available in more /proc and /sys status files.

References to earlier discussions:
V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2

Performance tests with AIM7 did not show any regressions.  Seems to be a tad
faster even.  Tested on ia64/NUMA.  Builds fine on i386, SMP / UP.  Includes
fixes for s390/arm/uml arch code.

This patch:

Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.

Create vmstat.c/vmstat.h by separating the counter code and the proc
functions.

Move the vm_stat_text array before zoneinfo_show.

[akpm@osdl.org: s390 build fix]
[akpm@osdl.org: HOTPLUG_CPU build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

f6ac2354

[PATCH] fix ISTALLION=y · 672b2714

Adrian Bunk authored Jun 30, 2006

drivers/char/istallion.c: In function âstli_initbrdsâ:
drivers/char/istallion.c:4150: error: implicit declaration of function âstli_parsebrdâ
drivers/char/istallion.c:4150: error: âstli_brdspâ undeclared (first use in this function)
drivers/char/istallion.c:4150: error: (Each undeclared identifier is reported only once
drivers/char/istallion.c:4150: error: for each function it appears in.)
drivers/char/istallion.c:4164: error: implicit declaration of function âstli_argbrdsâ

While I was at it, I also removed the #ifdef MODULE around the initialation
code to allow it to perhaps work when built into the kernel and made a
needlessly global function static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

672b2714

[PATCH] msr.c: use register_hotcpu_notifier() · e09793bb

Andrew Morton authored Jun 30, 2006

register_cpu_notifier() cannot do anything in a module, in a
!CONFIG_HOTPLUG_CPU kernel.

Cc: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

e09793bb

[PATCH] fix platform_device_put/del mishaps · 1017f6af

Ingo Molnar authored Jun 30, 2006

This fixes drivers/char/pc8736x_gpio.c and drivers/char/scx200_gpio.c to
use the platform_device_del/put ops correctly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

1017f6af

[PATCH] fix drivers/video/imacfb.c compilation · 491d525f

Ingo Molnar authored Jun 30, 2006

Fix build error on x86_64.  There's nothing even remotely close to
imacmp_seg in the kernel, so I removed the whole line.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Edgar Hucek <hostmaster@ed-soft.at>
Cc: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

491d525f

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 · 501b7c77

Linus Torvalds authored Jun 29, 2006

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: remove redundant NULL checks in ocfs2_direct_IO_get_blocks()
  ocfs2: clean up some osb fields
  ocfs2: fix init of uuid_net_key
  ocfs2: silence a debug print
  ocfs2: silence ENOENT during lookup of broken links
  ocfs2: Cleanup message prints
  ocfs2: silence -EEXIST from ocfs2_extent_map_insert/lookup
  [PATCH] fs/ocfs2/dlm/dlmrecovery.c: make dlm_lockres_master_requery() static
  ocfs2: warn the user on a dead timeout mismatch
  ocfs2: OCFS2_FS must depend on SYSFS
  ocfs2: Compile-time disabling of ocfs2 debugging output.
  configfs: Clear up a few extra spaces where there should be TABs.
  configfs: Release memory in configfs_example.

501b7c77

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 · 74e651f0

Linus Torvalds authored Jun 29, 2006

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
  [TIPC]: Initial activation message now includes TIPC version number
  [TIPC]: Improve response to requests for node/link information
  [TIPC]: Fixed skb_under_panic caused by tipc_link_bundle_buf
  [IrDA]: Fix the AU1000 FIR dependencies
  [IrDA]: Fix RCU lock pairing on error path
  [XFRM]: unexport xfrm_state_mtu
  [NET]: make skb_release_data() static
  [NETFILTE] ipv4: Fix typo (Bugzilla #6753)
  [IrDA]: MCS7780 usb_driver struct should be static
  [BNX2]: Turn off link during shutdown
  [BNX2]: Use dev_kfree_skb() instead of the _irq version
  [ATM]: basic sysfs support for ATM devices
  [ATM]: [suni] change suni_init to __devinit
  [ATM]: [iphase] should be __devinit not __init
  [ATM]: [idt77105] should be __devinit not __init
  [BNX2]: Add NETIF_F_TSO_ECN
  [NET]: Add ECN support for TSO
  [AF_UNIX]: Datagram getpeersec
  [NET]: Fix logical error in skb_gso_ok
  [PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000
  ...

74e651f0

[TIPC]: Initial activation message now includes TIPC version number · 0702056f

Allan Stephens authored Jun 29, 2006

Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Per Liden <per.liden@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0702056f

[TIPC]: Improve response to requests for node/link information · ea13847b

Allan Stephens authored Jun 29, 2006

Now allocates reply space for "get links" request based on number of actual
links, not number of potential links.  Also, limits reply to "get links" and
"get nodes" requests to 32KB to match capabilities of tipc-config utility
that issued request.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Per Liden <per.liden@ericsson.com>

ea13847b

[TIPC]: Fixed skb_under_panic caused by tipc_link_bundle_buf · e49060c7

Allan Stephens authored Jun 29, 2006

Now determines tailroom of bundle buffer by directly inspection of buffer.
Previously, buffer was assumed to have a max capacity equal to the link MTU,
but the addition of link MTU negotiation means that the link MTU can increase
after the bundle buffer is allocated.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Per Liden <per.liden@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e49060c7

[IrDA]: Fix the AU1000 FIR dependencies · caf430f3

Adrian Bunk authored Jun 29, 2006

AU1000 FIR is broken, it should depend on SOC_AU1000.

Spotted by Jean-Luc Leger.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

caf430f3

[IrDA]: Fix RCU lock pairing on error path · 1bc17311

Josh Triplett authored Jun 29, 2006

irlan_client_discovery_indication calls rcu_read_lock and rcu_read_unlock, but
returns without unlocking in an error case.  Fix that by replacing the return
with a goto so that the rcu_read_unlock always gets executed.
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Samuel Ortiz samuel@sortiz.org <>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bc17311

29 Jun, 2006 20 commits

[XFRM]: unexport xfrm_state_mtu · 244055fd

Adrian Bunk authored Jun 29, 2006

This patch removes the unused EXPORT_SYMBOL(xfrm_state_mtu).
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

244055fd

[NET]: make skb_release_data() static · 5bba1712

Adrian Bunk authored Jun 29, 2006

skb_release_data() no longer has any users in other files.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

5bba1712

[NETFILTE] ipv4: Fix typo (Bugzilla #6753) · c22751b7

Matt LaPlante authored Jun 29, 2006

This patch fixes bugzilla #6753, a typo in the netfilter Kconfig
Signed-off-by: David S. Miller <davem@davemloft.net>

c22751b7

[IrDA]: MCS7780 usb_driver struct should be static · 7263ade1

Adrian Bunk authored Jun 29, 2006

This patch makes a needlessly global struct static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

7263ade1

[BNX2]: Turn off link during shutdown · 6c4f095e

Michael Chan authored Jun 29, 2006

Minor change in shutdown logic to effect a link down.

Update version to 1.4.43.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c4f095e

[BNX2]: Use dev_kfree_skb() instead of the _irq version · 745720e5

Michael Chan authored Jun 29, 2006

Change all dev_kfree_skb_irq() and dev_kfree_skb_any() to
dev_kfree_skb().  These calls are never used in irq context.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

745720e5

[ATM]: basic sysfs support for ATM devices · 656d98b0

Roman Kagan authored Jun 29, 2006

Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>

656d98b0

[ATM]: [suni] change suni_init to __devinit · d17f0865

Chas Williams authored Jun 29, 2006

Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>

d17f0865

[ATM]: [iphase] should be __devinit not __init · 249c14b5

Chas Williams authored Jun 29, 2006

Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>

249c14b5

[ATM]: [idt77105] should be __devinit not __init · b47eb0eb

Chas Williams authored Jun 29, 2006

Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>

b47eb0eb

[BNX2]: Add NETIF_F_TSO_ECN · b11d6213

Michael Chan authored Jun 29, 2006

Add NETIF_F_TSO_ECN feature for all bnx2 hardware.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b11d6213

[NET]: Add ECN support for TSO · b0da8537

Michael Chan authored Jun 29, 2006

In the current TSO implementation, NETIF_F_TSO and ECN cannot be
turned on together in a TCP connection.  The problem is that most
hardware that supports TSO does not handle CWR correctly if it is set
in the TSO packet.  Correct handling requires CWR to be set in the
first packet only if it is set in the TSO header.

This patch adds the ability to turn on NETIF_F_TSO and ECN using
GSO if necessary to handle TSO packets with CWR set.  Hardware
that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->
features flag.

All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set.  If
the output device does not have the NETIF_F_TSO_ECN feature set, GSO
will split the packet up correctly with CWR only set in the first
segment.

With help from Herbert Xu <herbert@gondor.apana.org.au>.

Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock
flag is completely removed.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0da8537

[AF_UNIX]: Datagram getpeersec · 877ce7c1

Catherine Zhang authored Jun 29, 2006

This patch implements an API whereby an application can determine the
label of its peer's Unix datagram sockets via the auxiliary data mechanism of
recvmsg.

Patch purpose:

This patch enables a security-aware application to retrieve the
security context of the peer of a Unix datagram socket.  The application
can then use this security context to determine the security context for
processing on behalf of the peer who sent the packet.

Patch design and implementation:

The design and implementation is very similar to the UDP case for INET
sockets.  Basically we build upon the existing Unix domain socket API for
retrieving user credentials.  Linux offers the API for obtaining user
credentials via ancillary messages (i.e., out of band/control messages
that are bundled together with a normal message).  To retrieve the security
context, the application first indicates to the kernel such desire by
setting the SO_PASSSEC option via getsockopt.  Then the application
retrieves the security context using the auxiliary data mechanism.

An example server application for Unix datagram socket should look like this:

toggle = 1;
toggle_len = sizeof(toggle);

setsockopt(sockfd, SOL_SOCKET, SO_PASSSEC, &toggle, &toggle_len);
recvmsg(sockfd, &msg_hdr, 0);
if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
    cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
    if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
        cmsg_hdr->cmsg_level == SOL_SOCKET &&
        cmsg_hdr->cmsg_type == SCM_SECURITY) {
        memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
    }
}

sock_setsockopt is enhanced with a new socket option SOCK_PASSSEC to allow
a server socket to receive security context of the peer.

Testing:

We have tested the patch by setting up Unix datagram client and server
applications.  We verified that the server can retrieve the security context
using the auxiliary data mechanism of recvmsg.
Signed-off-by: Catherine Zhang <cxzhang@watson.ibm.com>
Acked-by: Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

877ce7c1

[NET]: Fix logical error in skb_gso_ok · d6b4991a

Herbert Xu authored Jun 29, 2006

The test in skb_gso_ok is backwards.  Noticed by Michael Chan
<mchan@broadcom.com>.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6b4991a

[PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000 · 4ee303df
Shuya MAEDA authored Jun 28, 2006
```
Signed-off-by: Shuya MAEDA <maeda-sxb@necst.nec.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
4ee303df

[NET]: Make illegal_highdma more anal · 3d3a8533

Herbert Xu authored Jun 27, 2006

Rather than having illegal_highdma as a macro when HIGHMEM is off, we
can turn it into an inline function that returns zero.  This will catch
callers that give it bad arguments.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d3a8533

[TCP]: Export accept queue len of a TCP listening socket via rx_queue · 47da8ee6

Sridhar Samudrala authored Jun 27, 2006

While debugging a TCP server hang issue, we noticed that currently there is
no way for a user to get the acceptq backlog value for a TCP listen socket.

All the standard networking utilities that display socket info like netstat,
ss and /proc/net/tcp have 2 fields called rx_queue and tx_queue. These
fields do not mean much for listening sockets. This patch uses one of these
unused fields(rx_queue) to export the accept queue len for listening sockets.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47da8ee6

[NETLINK]: Encapsulate eff_cap usage within security framework. · c7bdb545

Darrel Goeddel authored Jun 27, 2006

This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface. It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.
Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7bdb545

[NET]: Added GSO header verification · 576a30eb

Herbert Xu authored Jun 27, 2006

When GSO packets come from an untrusted source (e.g., a Xen guest domain),
we need to verify the header integrity before passing it to the hardware.

Since the first step in GSO is to verify the header, we can reuse that
code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this
bit set can only be fed directly to devices with the corresponding bit
NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb
is fed to the GSO engine which will allow the packet to be sent to the
hardware if it passes the header check.

This patch changes the sg flag to a full features flag. The same method
can be used to implement TSO ECN support. We simply have to mark packets
with CWR set with SKB_GSO_ECN so that only hardware with a corresponding
NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment
the packet, or segment the first MTU and pass the rest to the hardware for
further segmentation.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

576a30eb

[NETFILTER]: statistic match: add missing Kconfig help text · 68c1692e
Patrick McHardy authored Jun 27, 2006
```
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
68c1692e