1. 18 Jul, 2007 40 commits
    • Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 · 485cf925
      Linus Torvalds authored
      * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (24 commits)
        [NETFILTER]: xt_connlimit needs to depend on nf_conntrack
        [NETFILTER]: ipt_iprange.h must #include <linux/types.h>
        [IrDA]: Fix IrDA build failure
        [ATM]: nicstar needs virt_to_bus
        [NET]: move __dev_addr_discard adjacent to dev_addr_discard for readability
        [NET]: merge dev_unicast_discard and dev_mc_discard into one
        [NET]: move dev_mc_discard from dev_mcast.c to dev.c
        [NETLINK]: negative groups in netlink_setsockopt
        [PPPOL2TP]: Reset meta-data in xmit function
        [PPPOL2TP]: Fix use-after-free
        [PKT_SCHED]: Some typo fixes in net/sched/Kconfig
        [XFRM]: Fix crash introduced by struct dst_entry reordering
        [TCP]: remove unused argument to cong_avoid op
        [ATM]: [idt77252] Rename CONFIG_ATM_IDT77252_SEND_IDLE to not resemble a Kconfig variable
        [ATM]: [drivers] ioremap balanced with iounmap
        [ATM]: [lanai] sram_test_word() must be __devinit
        [ATM]: [nicstar] Replace C code with call to ARRAY_SIZE() macro.
        [ATM]: Eliminate dead config variable CONFIG_BR2684_FAST_TRANS.
        [ATM]: Replacing kmalloc/memset combination with kzalloc.
        [NET]: gen_estimator deadlock fix
        ...
    • Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6 · 31bdc5dc
      Linus Torvalds authored
      * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
        [SPARC64]: Set vio->desc_buf to NULL after freeing.
        [SPARC]: Mark sparc and sparc64 as not having virt_to_bus
        [SPARC64]: Fix reset handling in VNET driver.
        [SPARC64]: Handle reset events in vio_link_state_change().
        [SPARC64]: Handle LDC resets properly in domain-services driver.
        [SPARC64]: Massively simplify VIO device layer and support hot add/remove.
        [SPARC64]: Simplify VNET probing.
        [SPARC64]: Simplify VDC device probing.
        [SPARC64]: Add basic infrastructure for MD add/remove notification.
    • Merge branch 'xen-upstream' of ssh://master.kernel.org/pub/scm/linux/kernel/git/jeremy/xen · 5cc97bf2
      Linus Torvalds authored
      * 'xen-upstream' of ssh://master.kernel.org/pub/scm/linux/kernel/git/jeremy/xen: (44 commits)
        xen: disable all non-virtual drivers
        xen: use iret directly when possible
        xen: suppress abs symbol warnings for unused reloc pointers
        xen: Attempt to patch inline versions of common operations
        xen: Place vcpu_info structure into per-cpu memory
        xen: handle external requests for shutdown, reboot and sysrq
        xen: machine operations
        xen: add virtual network device driver
        xen: add virtual block device driver.
        xen: add the Xenbus sysfs and virtual device hotplug driver
        xen: Add grant table support
        xen: use the hvc console infrastructure for Xen console
        xen: hack to prevent bad segment register reload
        xen: lazy-mmu operations
        xen: Add support for preemption
        xen: SMP guest support
        xen: Implement sched_clock
        xen: Account for stolen time
        xen: ignore RW mapping of RO pages in pagetable_init
        xen: Complete pagetable pinning
        ...
    • Revert "[POWERPC] Do firmware feature fixups after features are initialised" · 826ea8f2
      Tony Breeds authored
      This reverts commit 5a26f6bb.
      
      The original patch causes boot failures when built with ppc64_defconfig.  The
      quickest fix is to revert it while alternatives are investigated.
      Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix compile failure in arch/powerpc/kernel/pci-common.c · 4f3731da
      Tony Breeds authored
      This fixes the fallout from the recent powerpc merge (commit
      489de302):
      
         CC      arch/powerpc/kernel/pci-common.o
        arch/powerpc/kernel/pci-common.c:160: error: conflicting types for 'pcibios_add_platform_entries'
        include/linux/pci.h:889: error: previous declaration of 'pcibios_add_platform_entries' was here
      Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
      Tested-by: Bret Towe <magnade@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xen: disable all non-virtual drivers · dfdcdd42
      Jeremy Fitzhardinge authored
      A domU Xen environment has no non-virtual drivers, so make sure
      they're all disabled at once.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
    • xen: use iret directly when possible · 9ec2b804
      Jeremy Fitzhardinge authored
      Most of the time we can simply use the iret instruction to exit the
      kernel, rather than having to use the iret hypercall - the only
      exception is if we're returning into vm86 mode, or from delivering an
      NMI (which we don't support yet).
      
      When running native, iret has the behaviour of testing for a pending
      interrupt atomically with re-enabling interrupts.  Unfortunately
      there's no way to do this with Xen, so there's a window in which we
      could get a recursive exception after enabling events but before
      actually returning to userspace.
      
      This causes a problem: if the nested interrupt causes one of the
      task's TIF_WORK_MASK flags to be set, they will not be checked again
      before returning to userspace.  This means that pending work may be
      left pending indefinitely, until the process enters and leaves the
      kernel again.  The net effect is that a pending signal or reschedule
      event could be delayed for an unbounded amount of time.
      
      To deal with this, the xen event upcall handler checks to see if the
      EIP is within the critical section of the iret code, after events
      are (potentially) enabled up to the iret itself.  If it's within this
      range, it calls the iret critical section fixup, which adjusts the
      stack to deal with any unrestored registers, and then shifts the
      stack frame up to replace the previous invocation.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
    • xen: suppress abs symbol warnings for unused reloc pointers · 600b2fc2
      Jeremy Fitzhardinge authored
      arch/i386/xen/xen-asm.S defines some small pieces of code which are
      used to implement a few paravirt_ops.  They're designed so they can be
      used either in-place, or be inline patched into their callsites if
      there's enough space.
      
      Some of those operations need to make calls out (specifically, if you
      re-enable events [interrupts], and there's a pending event at that
      time).  These calls need the call instruction to be relocated if the
      code is patched inline.  In this case xen_foo_reloc is a
      section-relative symbol which points to xen_foo's required relocation.
      
      Other operations have no need of a relocation, and so their
      corresponding xen_bar_reloc is absolute 0.  These are the cases which
      are triggering the warning.
      
      This patch adds those symbols to the list of safe abs symbols.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Adrian Bunk <bunk@stusta.de>
    • xen: Attempt to patch inline versions of common operations · 6487673b
      Jeremy Fitzhardinge authored
      This patch adds the mechanism to allow us to patch inline versions of
      common operations.
      
      The implementations of the direct-access versions save_fl, restore_fl,
      irq_enable and irq_disable are now in assembler, and the same code is
      used for both out of line and inline uses.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
    • xen: Place vcpu_info structure into per-cpu memory · 60223a32
      Jeremy Fitzhardinge authored
      An experimental patch for Xen allows guests to place their vcpu_info
      structs anywhere.  We try to use this to place the vcpu_info into the
      PDA, which allows direct access.
      
      If this works, then switch to using direct access operations for
      irq_enable, disable, save_fl and restore_fl.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
    • xen: handle external requests for shutdown, reboot and sysrq · 3e2b8fbe
      Jeremy Fitzhardinge authored
      The guest domain can be asked to shut down or reboot itself, or have a
      sysrq key injected, via xenbus.  This patch adds a watcher for those
      events, and does the appropriate action.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
    • xen: machine operations · fefa629a
      Jeremy Fitzhardinge authored
      Make the appropriate hypercalls to halt and reboot the virtual machine.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
    • xen: add virtual network device driver · 0d160211
      Jeremy Fitzhardinge authored
      The network device frontend driver allows the kernel to access network
      devices exported by a virtual machine containing a physical
      network device driver.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Acked-by: Jeff Garzik <jeff@garzik.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Stephen Hemminger <shemminger@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Keir Fraser <Keir.Fraser@cl.cam.ac.uk>
      Cc: netdev@vger.kernel.org
    • xen: add virtual block device driver. · 9f27ee59
      Jeremy Fitzhardinge authored
      The block device frontend driver allows the kernel to access block
      devices exported by a virtual machine containing a physical
      block device driver.
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
    • xen: add the Xenbus sysfs and virtual device hotplug driver · 4bac07c9
      Jeremy Fitzhardinge authored
      This communicates with the machine control software via a registry
      residing in a controlling virtual machine. This allows dynamic
      creation, destruction and modification of virtual device
      configurations (network devices, block devices and CPUs, to name some
      examples).
      
      [ Greg, would you mind giving this a review?  Thanks -J ]
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Greg KH <greg@kroah.com>
    • xen: Add grant table support · ad9a8612
      Jeremy Fitzhardinge authored
      Add Xen 'grant table' driver which allows granting of access to
      selected local memory pages by other virtual machines and,
      symmetrically, the mapping of remote memory pages which other virtual
      machines have granted access to.
      
      This driver is a prerequisite for many of the Xen virtual device
      drivers, which grant the 'device driver domain' restricted and
      temporary access to only those memory pages that are currently
      involved in I/O operations.
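
      As a loose illustration of the grant mechanism described above, the
      following C sketch models a grant table as an array of entries, each
      naming the domain that may access a page and the frame being shared.
      The structure layout, flag values and function names are assumptions
      made for this example; they are not taken from the patch or from the
      Xen headers.

```c
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_GRANT_PERMIT   0x01u  /* remote domain may map the frame */
#define SKETCH_GRANT_READONLY 0x04u  /* remote mapping must be read-only */
#define SKETCH_GRANT_IN_USE   0x18u  /* remote side currently reading/writing */

#define SKETCH_NR_GRANTS 16

struct sketch_grant_entry {
    uint16_t flags;  /* 0 means the slot is free */
    uint16_t domid;  /* domain being granted access */
    uint32_t frame;  /* page frame being shared */
};

static struct sketch_grant_entry sketch_grant_table[SKETCH_NR_GRANTS];

/* Grant `domid` access to `frame`; returns a grant reference or -1. */
static int sketch_grant_access(uint16_t domid, uint32_t frame, int readonly)
{
    for (int ref = 0; ref < SKETCH_NR_GRANTS; ref++) {
        if (sketch_grant_table[ref].flags != 0)
            continue;
        sketch_grant_table[ref].domid = domid;
        sketch_grant_table[ref].frame = frame;
        /* A real driver would need a write barrier here so that the
         * flags only become visible after domid and frame. */
        sketch_grant_table[ref].flags =
            SKETCH_GRANT_PERMIT | (readonly ? SKETCH_GRANT_READONLY : 0);
        return ref;
    }
    return -1;
}

/* Revoke a grant; returns 0 if the remote side is still using the page. */
static int sketch_end_access(int ref)
{
    if (sketch_grant_table[ref].flags & SKETCH_GRANT_IN_USE)
        return 0;
    sketch_grant_table[ref].flags = 0;
    return 1;
}
```

      The point of the sketch is that access is granted per page and can be
      withdrawn as soon as the I/O completes, which is what keeps the
      access "restricted and temporary".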
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • xen: use the hvc console infrastructure for Xen console · b536b4b9
      Jeremy Fitzhardinge authored
      Implement a Xen back-end for hvc console.
      
      * * *
      Add early printk support via hvc console, enable using
      "earlyprintk=xen" on the kernel command line.
      
      From: Gerd Hoffmann <kraxel@suse.de>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Olof Johansson <olof@lixom.net>
    • xen: hack to prevent bad segment register reload · 8b84ad94
      Jeremy Fitzhardinge authored
      The hypervisor saves and restores the segment registers as part of the
      state it saves while context switching.  If, during a context switch,
      the next process doesn't use the TLS segments, it invalidates the GDT
      entry, causing the segment register reload to fault.  This fault
      effectively doubles the cost of a context switch.
      
      This patch is a band-aid workaround which clears the usermode %gs
      after it has been saved for the previous process, but before it gets
      reloaded for the next, and it avoids having the hypervisor attempt to
      erroneously reload it.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • xen: lazy-mmu operations · d66bf8fc
      Jeremy Fitzhardinge authored
      This patch uses the lazy-mmu hooks to batch mmu operations where
      possible.  This is primarily useful for batching operations applied to
      active pagetables, which happens during mprotect, munmap, mremap and
      the like (mmap does not do bulk pagetable operations, so it isn't
      helped).
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
    • xen: Add support for preemption · f120f13e
      Jeremy Fitzhardinge authored
      Add Xen support for preemption.  This is mostly a cleanup of existing
      preempt_enable/disable calls, or just comments to explain the current
      usage.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • xen: SMP guest support · f87e4cac
      Jeremy Fitzhardinge authored
      This is a fairly straightforward Xen implementation of smp_ops.
      
      Xen has its own IPI mechanisms, and has no dependency on any
      APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
      allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
      operation is a single apic_read for the apic version number).
      
      One subtle point which needs to be addressed is unpinning pagetables
      when another cpu may have a lazy tlb reference to the pagetable. Xen
      will not allow an in-use pagetable to be unpinned, so we must find any
      other cpus with a reference to the pagetable and get them to shoot
      down their references.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
    • xen: Implement sched_clock · ab550288
      Jeremy Fitzhardinge authored
      Implement xen_sched_clock, which returns the number of ns the current
      vcpu has actually been in an unstolen state (i.e., running or blocked,
      vs runnable-but-not-running, or offline) since boot.
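
      The split between stolen and unstolen time can be sketched in C as
      below. This assumes the hypervisor exposes a per-vcpu count of
      nanoseconds spent in each of the four states; the type and function
      names are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/* Hypothetical per-vcpu accounting: ns spent in each state since boot. */
enum sketch_runstate {
    SKETCH_RUNNING,   /* executing on a physical CPU */
    SKETCH_RUNNABLE,  /* wants to run, but no physical CPU is available */
    SKETCH_BLOCKED,   /* idle by the guest's own choice */
    SKETCH_OFFLINE,   /* not available to run */
    SKETCH_NR_STATES
};

struct sketch_runstate_info {
    uint64_t time_ns[SKETCH_NR_STATES];
};

/* Time a sched_clock-style function would report: running or blocked. */
static uint64_t sketch_unstolen_ns(const struct sketch_runstate_info *rs)
{
    return rs->time_ns[SKETCH_RUNNING] + rs->time_ns[SKETCH_BLOCKED];
}

/* The complement: runnable-but-not-running, or offline. */
static uint64_t sketch_stolen_ns(const struct sketch_runstate_info *rs)
{
    return rs->time_ns[SKETCH_RUNNABLE] + rs->time_ns[SKETCH_OFFLINE];
}
```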
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Cc: john stultz <johnstul@us.ibm.com>
    • xen: Account for stolen time · f91a8b44
      Jeremy Fitzhardinge authored
      This patch accounts for the time stolen from our VCPUs.  Stolen time is
      time where a vcpu is runnable and could be running, but all available
      physical CPUs are being used for something else.
      
      This accounting gets run on each timer interrupt, just as a way to get
      it run relatively often, and when interesting things are going on.
      Stolen time is not really used by much in the kernel; it is reported
      in /proc/stat, and that's about it.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
    • xen: ignore RW mapping of RO pages in pagetable_init · 9a4029fd
      Jeremy Fitzhardinge authored
      When setting up the initial pagetable, which includes mappings of all
      low physical memory, ignore a mapping which tries to set the RW bit on
      an RO pte.  An RO pte indicates a page which is part of the current
      pagetable, and so it cannot be allowed to become RW.
      
      Once xen_pagetable_setup_done is called, set_pte reverts to its normal
      behaviour.
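
      A minimal sketch of the rule described above, written as standalone
      C rather than kernel code: if a slot already holds a present,
      read-only pte, the RW bit of the new value is dropped while every
      other bit is taken as given. The bit positions and the function name
      are assumptions for illustration only.

```c
#include <stdint.h>

/* Illustrative bit positions; treat them as assumptions of this sketch. */
#define SKETCH_PTE_PRESENT ((uint64_t)1 << 0)
#define SKETCH_PTE_RW      ((uint64_t)1 << 1)
#define SKETCH_PTE_NX      ((uint64_t)1 << 63)

/*
 * Install `pte` into *ptep during initial pagetable setup.  If the slot
 * already holds a present entry, the RW bit may only stay set when the
 * existing entry was itself RW; all other bits (such as NX) pass through.
 */
static void sketch_set_pte_init(uint64_t *ptep, uint64_t pte)
{
    if (*ptep & SKETCH_PTE_PRESENT)
        pte &= (*ptep & SKETCH_PTE_RW) | ~SKETCH_PTE_RW;
    *ptep = pte;
}
```

      Empty slots are written unchanged, so new mappings can still be
      created read-write.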
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Cc: ebiederm@xmission.com (Eric W. Biederman)
    • xen: Complete pagetable pinning · f4f97b3e
      Jeremy Fitzhardinge authored
      Xen requires all active pagetables to be marked read-only.  When the
      base of the pagetable is loaded into %cr3, the hypervisor validates
      the entire pagetable and only allows the load to proceed if it all
      checks out.
      
      This is pretty slow, so to mitigate this cost Xen has a notion of
      pinned pagetables.  Pinned pagetables are pagetables which are
      considered to be active even if no processor's cr3 is pointing to it.
      This means that it must remain read-only and all updates are validated
      by the hypervisor.  This makes context switches much cheaper, because
      the hypervisor doesn't need to revalidate the pagetable each time.
      
      This also adds a new paravirt hook which is called during setup once
      the zones and memory allocator have been initialized.  When the
      init_mm pagetable is first built, the struct page array does not yet
      exist, and so there's nowhere to put the init_mm pagetable's PG_pinned
      flags.  Once the zones are initialized and the struct page array
      exists, we can set the PG_pinned flags for those pages.
      
      This patch also adds the Xen support for pte pages allocated out of
      highmem (highpte) by implementing xen_kmap_atomic_pte.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Zach Amsden <zach@vmware.com>
    • xen: add pinned page flag · c85b04c3
      Jeremy Fitzhardinge authored
      Add a new definition for PG_owner_priv_1 to define PG_pinned on Xen
      pagetable pages.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • xen: configuration · e738fca8
      Jeremy Fitzhardinge authored
      Put config options for Xen after the core pieces are in place.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • xen: time implementation · 15c84731
      Jeremy Fitzhardinge authored
      Xen maintains a base clock which measures nanoseconds since system
      boot.  This is provided to guests via a shared page which contains a
      base time in ns, a tsc timestamp at that point and tsc frequency
      parameters.  Guests can compute the current time by reading the tsc
      and using it to extrapolate the current time from the basetime.  The
      hypervisor makes sure that the frequency parameters are updated
      regularly, particularly if the tsc changes rate or stops.
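
      The extrapolation described above might look roughly like the
      following C sketch. It assumes the frequency parameters take the
      form of a binary shift plus a 32.32 fixed-point multiplier; the
      struct, its field names and the function name are hypothetical and
      not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical snapshot of the shared time page. */
struct sketch_time_info {
    uint64_t base_ns;        /* base time in ns at the snapshot */
    uint64_t tsc_timestamp;  /* tsc value when base_ns was recorded */
    uint32_t tsc_to_ns_mul;  /* 32.32 fixed-point ns-per-tick multiplier */
    int8_t   tsc_shift;      /* shift applied to the tsc delta first */
};

/* Extrapolate the current time in ns from a fresh tsc reading. */
static uint64_t sketch_extrapolate_ns(const struct sketch_time_info *t,
                                      uint64_t tsc_now)
{
    uint64_t delta = tsc_now - t->tsc_timestamp;

    if (t->tsc_shift < 0)
        delta >>= -t->tsc_shift;
    else
        delta <<= t->tsc_shift;

    /* (delta * mul) >> 32, split in two halves to stay within 64 bits. */
    uint64_t ns = (delta >> 32) * t->tsc_to_ns_mul
                + (((delta & 0xffffffffu) * t->tsc_to_ns_mul) >> 32);

    return t->base_ns + ns;
}
```

      A guest would also have to make sure it reads a consistent snapshot
      of the shared page, since the hypervisor may update the parameters
      at any time; that part is left out of the sketch.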
      
      This is implemented as a clocksource, so the interface to the rest of
      the kernel is a simple clocksource which simply returns the current
      time directly in nanoseconds.
      
      Xen also provides a simple timer mechanism, which allows a timeout to
      be set in the future.  When that time arrives, a timer event is sent
      to the guest.  There are two timer interfaces:
       - An old one which also delivers a stream of (unused) ticks at 100Hz,
         and on the same event, the actual timer events.  The 100Hz ticks
         cause a lot of spurious wakeups, but are basically harmless.
       - The new timer interface doesn't have the 100Hz ticks, and can also
         fail if the specified time is in the past.
      
      This code presents the Xen timer as a clockevent driver, and uses the
      new interface by preference.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
    • xen: event channels · e46cdb66
      Jeremy Fitzhardinge authored
      Xen implements interrupts in terms of event channels.  Each guest
      domain gets 1024 event channels which can be used for a variety of
      purposes, such as Xen timer events, inter-domain events,
      inter-processor events (IPI) or for real hardware IRQs.
      
      Within the kernel, we map the event channels to IRQs, and implement
      the whole interrupt handling using a Xen irq_chip.
      
      Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
      accommodate Xen, we create a dynamic mapping between event channels and
      IRQs.  Ideally, Linux will eventually move towards dynamically
      allocating per-irq structures, and we can use a 1:1 mapping between
      event channels and irqs.
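
      One way to picture the dynamic mapping is a pair of lookup tables
      that bind an event channel to the first free IRQ on demand. The
      sketch below is only an illustration under that assumption; the
      table sizes other than the 1024 event channels, and all of the
      names, are hypothetical.

```c
#define SKETCH_NR_EVENT_CHANNELS 1024
#define SKETCH_NR_IRQS           256   /* hypothetical, smaller than 1024 */

static int sketch_evtchn_to_irq[SKETCH_NR_EVENT_CHANNELS];
static int sketch_irq_to_evtchn[SKETCH_NR_IRQS];

/* Mark every event channel and IRQ as unbound (-1). */
static void sketch_mapping_init(void)
{
    for (int i = 0; i < SKETCH_NR_EVENT_CHANNELS; i++)
        sketch_evtchn_to_irq[i] = -1;
    for (int i = 0; i < SKETCH_NR_IRQS; i++)
        sketch_irq_to_evtchn[i] = -1;
}

/* Return the IRQ bound to `evtchn`, allocating a free one if needed. */
static int sketch_bind_evtchn_to_irq(int evtchn)
{
    if (evtchn < 0 || evtchn >= SKETCH_NR_EVENT_CHANNELS)
        return -1;
    if (sketch_evtchn_to_irq[evtchn] != -1)
        return sketch_evtchn_to_irq[evtchn];

    for (int irq = 0; irq < SKETCH_NR_IRQS; irq++) {
        if (sketch_irq_to_evtchn[irq] == -1) {
            sketch_irq_to_evtchn[irq] = evtchn;
            sketch_evtchn_to_irq[evtchn] = irq;
            return irq;
        }
    }
    return -1;  /* no free IRQ left */
}

/* Release an IRQ so it can be reused by another event channel. */
static void sketch_unbind_irq(int irq)
{
    if (irq < 0 || irq >= SKETCH_NR_IRQS)
        return;
    int evtchn = sketch_irq_to_evtchn[irq];
    if (evtchn != -1) {
        sketch_evtchn_to_irq[evtchn] = -1;
        sketch_irq_to_evtchn[irq] = -1;
    }
}
```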
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
    • xen: virtual mmu · 3b827c1b
      Jeremy Fitzhardinge authored
      Xen pagetable handling, including the machinery to implement direct
      pagetables.
      
      Xen presents the real CPU's pagetables directly to guests, with no
      added shadowing or other layer of abstraction.  Naturally this means
      the hypervisor must maintain close control over what the guest can put
      into the pagetable.
      
      When the guest modifies the pte/pmd/pgd, it must convert its
      domain-specific notion of a "physical" pfn into a global machine frame
      number (mfn) before inserting the entry into the pagetable.  Xen will
      check to make sure the domain is allowed to create a mapping of the
      given mfn.
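
      The pfn-to-mfn step can be illustrated with a small C sketch. It
      assumes a per-domain lookup table indexed by pfn and a pte format
      with the frame number stored above a 12-bit flags field; the table
      contents and all names are made up for the example.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12

/* Hypothetical physical-to-machine table: index is pfn, value is mfn. */
static const uint64_t sketch_p2m[] = { 0x100, 0x207, 0x1a2b, 0x55 };

static uint64_t sketch_pfn_to_mfn(unsigned long pfn)
{
    return sketch_p2m[pfn];
}

/*
 * Build a pte for the guest's `pfn`.  The entry written into the
 * pagetable carries the machine frame number, not the guest's pfn.
 */
static uint64_t sketch_make_pte(unsigned long pfn, uint64_t flags)
{
    return (sketch_pfn_to_mfn(pfn) << SKETCH_PAGE_SHIFT) | flags;
}
```

      Reading a pte back requires the inverse lookup (mfn to pfn), which
      the sketch omits.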
      
      Xen also requires that all mappings the guest has of its own active
      pagetable are read-only.  This is relatively easy to implement in
      Linux because all pagetables share the same pte pages for kernel
      mappings, so updating the pte in one pagetable will implicitly update
      the mapping in all pagetables.
      
      Normally a pagetable becomes active when you point to it with cr3 (or
      the Xen equivalent), but when you do so, Xen must check the whole
      pagetable for correctness, which is clearly a performance problem.
      
      Xen solves this with pinning which keeps a pagetable effectively
      active even if it's currently unused, which means that all the normal
      update rules are enforced.  This means that it need not revalidate the
      pagetable when loading cr3.
      
      This patch has a first-cut implementation of pinning, but it is more
      fully implemented in a later patch.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • xen: Core Xen implementation · 5ead97c8
      Jeremy Fitzhardinge authored
      This patch is a rollup of all the core pieces of the Xen
      implementation, including:
       - booting and setup
       - pagetable setup
       - privileged instructions
       - segmentation
       - interrupt flags
       - upcalls
       - multicall batching
      
      BOOTING AND SETUP
      
      The vmlinux image is decorated with ELF notes which tell the Xen
      domain builder what the kernel's requirements are; the domain builder
      then constructs the address space accordingly and starts the kernel.
      
      Xen has its own entrypoint for the kernel (contained in an ELF note).
      The ELF notes are set up by xen-head.S, which is included into head.S.
      In principle it could be linked separately, but it seems to provoke
      lots of binutils bugs.
      
      Because the domain builder starts the kernel in a fairly sane state
      (32-bit protected mode, paging enabled, flat segments set up), there's
      not a lot of setup needed before starting the kernel proper.  The main
      steps are:
        1. Install the Xen paravirt_ops, which is simply a matter of a
           structure assignment.
        2. Set init_mm to use the Xen-supplied pagetables (analogous to the
           head.S generated pagetables in a native boot).
        3. Reserve address space for Xen, since it takes a chunk at the top
           of the address space for its own use.
        4. Call start_kernel()
      
      PAGETABLE SETUP
      
      Once we hit the main kernel boot sequence, it will end up calling back
      via paravirt_ops to set up various pieces of Xen specific state.  One
      of the critical things which requires a bit of extra care is the
      construction of the initial init_mm pagetable.  Because Xen places
      tight constraints on pagetables (an active pagetable must always be
      valid, and must always be mapped read-only to the guest domain), we
      need to be careful when constructing the new pagetable to keep these
      constraints in mind.  It turns out that the easiest way to do this is
      use the initial Xen-provided pagetable as a template, and then just
      insert new mappings for memory where a mapping doesn't already exist.
      
      This means that during pagetable setup, it uses a special version of
      xen_set_pte which ignores any attempt to remap a read-only page as
      read-write (since Xen will map its own initial pagetable as RO), but
      lets other changes to the ptes happen, so that things like NX are set
      properly.
      
      PRIVILEGED INSTRUCTIONS AND SEGMENTATION
      
      When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
      This means that it is more privileged than user-mode in ring 3, but it
      still can't run privileged instructions directly.  Non-performance
      critical instructions are dealt with by taking a privilege exception
      and trapping into the hypervisor and emulating the instruction, but
      more performance-critical instructions have their own specific
      paravirt_ops.  In many cases we can avoid having to do any hypercalls
      for these instructions, or the Xen implementation is quite different
      from the normal native version.
      
      The privileged instructions fall into the broad classes of:
        Segmentation: setting up the GDT and the GDT entries, LDT,
           TLS and so on.  Xen doesn't allow the GDT to be directly
           modified; all GDT updates are done via hypercalls where the new
           entries can be validated.  This is important because Xen uses
           segment limits to prevent the guest kernel from damaging the
           hypervisor itself.
        Traps and exceptions: Xen uses a special format for trap entrypoints,
           so when the kernel wants to set an IDT entry, it needs to be
           converted to the form Xen expects.  Xen sets int 0x80 up specially
           so that the trap goes straight from userspace into the guest kernel
           without going via the hypervisor.  sysenter isn't supported.
        Kernel stack: The esp0 entry is extracted from the tss and provided to
           Xen.
        TLB operations: the various TLB calls are mapped into corresponding
           Xen hypercalls.
        Control registers: all the control registers are privileged.  The most
           important is cr3, which points to the base of the current pagetable,
           and we handle it specially.
      
      Another instruction we treat specially is CPUID, even though it's not
      privileged.  We want to control what CPU features are visible to the
      rest of the kernel, and so CPUID ends up going into a paravirt_op.
      Xen implements this mainly to disable the ACPI and APIC subsystems.
      
      INTERRUPT FLAGS
      
      Xen maintains its own separate flag for masking events, which is
      contained within the per-cpu vcpu_info structure.  Because the guest
      kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
      ignored (and must be, because even if a guest domain disables
      interrupts for itself, it can't disable them overall).
      
      (A note on terminology: "events" and interrupts are effectively
      synonymous.  However, rather than using an "enable flag", Xen uses a
      "mask flag", which blocks event delivery when it is non-zero.)
      
      There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
      are implemented to manage the Xen event mask state.  The only thing
      worth noting is that when events are unmasked, we need to explicitly
      see if there's a pending event and call into the hypervisor to make
      sure it gets delivered.
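
      The mask-flag behaviour described above can be sketched in plain C.
      This is a userspace illustration only, not the code from the patch:
      the struct, the sketch_* function names and the delivery counter are
      all assumptions made for the example, and the hypercall is replaced
      by an ordinary function.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EFLAGS_IF 0x200UL  /* position of IF in EFLAGS */

/* Hypothetical stand-in for the per-cpu vcpu_info fields. */
struct sketch_vcpu_info {
    uint8_t pending;  /* non-zero: an event is waiting for delivery */
    uint8_t mask;     /* non-zero: event delivery is blocked */
};

static struct sketch_vcpu_info sketch_vcpu;
static unsigned sketch_delivered;  /* counts simulated deliveries */

/* Stands in for the hypercall that forces a pending event through. */
static void sketch_force_delivery(void)
{
    assert(!sketch_vcpu.mask);
    sketch_vcpu.pending = 0;
    sketch_delivered++;
}

/* cli equivalent: set the mask, EFLAGS.IF is never touched. */
static void sketch_irq_disable(void)
{
    sketch_vcpu.mask = 1;
}

/* sti equivalent: clear the mask, then check for a pending event. */
static void sketch_irq_enable(void)
{
    sketch_vcpu.mask = 0;
    if (sketch_vcpu.pending)
        sketch_force_delivery();
}

/* Report the mask in the "enable flag" form the kernel expects. */
static unsigned long sketch_save_fl(void)
{
    return sketch_vcpu.mask ? 0 : SKETCH_EFLAGS_IF;
}

static void sketch_restore_fl(unsigned long flags)
{
    if (flags & SKETCH_EFLAGS_IF)
        sketch_irq_enable();
    else
        sketch_irq_disable();
}

/* Simulates the hypervisor raising an event for this vcpu. */
static void sketch_raise_event(void)
{
    sketch_vcpu.pending = 1;
    if (!sketch_vcpu.mask)
        sketch_force_delivery();
}
```

      The point of the sketch is the check in sketch_irq_enable(): an
      event raised while masked stays pending until the unmask path
      notices it.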
      
      UPCALLS
      
      Xen needs a couple of upcall (or callback) functions to be implemented
      by each guest.  One is the event upcall, which is how events
      (interrupts, effectively) are delivered to the guests.  The other is
      the failsafe callback, which is used to report errors in either
      reloading a segment register, or caused by iret.  These are
      implemented in i386/kernel/entry.S so they can jump into the normal
      iret_exc path when necessary.
      
      MULTICALL BATCHING
      
      Xen provides a multicall mechanism, which allows multiple hypercalls
      to be issued at once in order to mitigate the cost of trapping into
      the hypervisor.  This is particularly useful for context switches,
      since the 4-5 hypercalls they would normally need (reload cr3, update
      TLS, maybe update LDT) can be reduced to one.  This patch implements a
      generic batching mechanism for hypercalls, which gets used in many
      places in the Xen code.
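
      A minimal sketch of the batching idea, assuming a fixed-size queue
      that is flushed with one simulated trap.  The sketch_* names, the
      queue size and the call layout are hypothetical and do not reflect
      the interface added by the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH_MAX 8  /* arbitrary queue size for the example */

/* One queued hypercall: an operation number plus a single argument. */
struct sketch_call {
    unsigned long op;
    unsigned long arg;
};

struct sketch_batch {
    struct sketch_call calls[SKETCH_BATCH_MAX];
    size_t count;
};

static unsigned sketch_traps;      /* simulated entries into the hypervisor */
static unsigned sketch_calls_run;  /* hypercalls executed in total */

/* Stands in for the multicall hypercall: one trap runs n calls. */
static void sketch_multicall(const struct sketch_call *calls, size_t n)
{
    (void)calls;
    sketch_traps++;
    sketch_calls_run += (unsigned)n;
}

/* Issue everything queued so far as a single multicall. */
static void sketch_batch_flush(struct sketch_batch *b)
{
    if (b->count == 0)
        return;
    sketch_multicall(b->calls, b->count);
    b->count = 0;
}

/* Queue a call instead of trapping, flushing first if the queue is full. */
static void sketch_batch_add(struct sketch_batch *b,
                             unsigned long op, unsigned long arg)
{
    if (b->count == SKETCH_BATCH_MAX)
        sketch_batch_flush(b);
    assert(b->count < SKETCH_BATCH_MAX);
    b->calls[b->count].op = op;
    b->calls[b->count].arg = arg;
    b->count++;
}
```

      Queuing the context-switch operations and flushing once means the
      cost of entering the hypervisor is paid a single time for the whole
      group.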
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Adrian Bunk <bunk@stusta.de>
      5ead97c8
    • xen: Add Xen interface header files · a42089dd
      Jeremy Fitzhardinge authored
      Add Xen interface header files. These are taken fairly directly from
      the Xen tree, but somewhat rearranged to suit the kernel's conventions.
      
      Define macros and inline functions for doing hypercalls into the
      hypervisor.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      a42089dd
    • Add nosegneg capability to the vsyscall page notes · 24037a8b
      Jeremy Fitzhardinge authored
      Add the "nosegneg" fake capability to the vsyscall page notes. This is
      used by the runtime linker to select a glibc version which then
      disables negative-offset accesses to the thread-local segment via
      %gs. These accesses require emulation in Xen (because segments are
      truncated to protect the hypervisor address space) and avoiding them
      provides a measurable performance boost.
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Zachary Amsden <zach@vmware.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      24037a8b
    • Add a sched_clock paravirt_op · 688340ea
      Jeremy Fitzhardinge authored
      The tsc-based get_scheduled_cycles interface is not a good match for
      Xen's runstate accounting, which reports everything in nanoseconds.
      
      This patch replaces this interface with a sched_clock interface, which
      matches both Xen and VMI's requirements.
      
      In order to do this, we:
         1. replace get_scheduled_cycles with sched_clock
         2. hoist cycles_2_ns into a common header
         3. update vmi accordingly
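
      For step 2, a cycles-to-nanoseconds helper is commonly written as a
      fixed-point multiply and shift so that no division is needed per
      call.  The sketch below illustrates that technique under the
      assumption of a 10-bit scale factor; the sketch_* names are
      hypothetical and this is not necessarily the exact code that was
      hoisted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SCALE_SHIFT 10  /* assumed fixed-point precision */

static uint64_t sketch_cyc2ns_scale;

/* Precompute ns-per-cycle, scaled by 2^SKETCH_SCALE_SHIFT. */
static void sketch_set_cyc2ns_scale(uint64_t cpu_khz)
{
    assert(cpu_khz != 0);
    sketch_cyc2ns_scale =
        (UINT64_C(1000000) << SKETCH_SCALE_SHIFT) / cpu_khz;
}

/* Convert a cycle count to nanoseconds with one multiply and one shift. */
static uint64_t sketch_cycles_2_ns(uint64_t cycles)
{
    return (cycles * sketch_cyc2ns_scale) >> SKETCH_SCALE_SHIFT;
}
```

      With a 2 GHz clock (cpu_khz of 2000000) the scale is 512, so 5000
      cycles convert to 2500 ns.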
      
      One thing to note: because sched_clock is implemented as a weak
      function in kernel/sched.c, we must define a real function in order to
      override this weak binding.  This means the usual paravirt_ops
      technique of using an inline function won't work in this case.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Dan Hecht <dhecht@vmware.com>
      Cc: john stultz <johnstul@us.ibm.com>
      688340ea
    • paravirt: helper to disable all IO space · d572929c
      Jeremy Fitzhardinge authored
      In a virtual environment, device drivers such as legacy IDE will waste
      quite a lot of time probing for their devices which will never appear.
      This helper function allows a paravirt implementation to lay claim to
      the whole iomem and ioport space, thereby disabling all device drivers
      trying to claim IO resources.
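
      The effect can be illustrated with a toy resource table: once one
      owner holds the entire range, every later request overlaps it and
      is refused.  This is a simplified userspace sketch with
      hypothetical sketch_* names and a flat array; the kernel's real
      resource bookkeeping is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_REGIONS 16
#define SKETCH_IOPORT_END  0xffffUL  /* top of the x86 port space */

struct sketch_region {
    unsigned long start, end;  /* inclusive bounds */
    const char *owner;
};

static struct sketch_region sketch_regions[SKETCH_MAX_REGIONS];
static size_t sketch_nregions;

/* Claim [start, end]; fails if it overlaps any region already claimed. */
static bool sketch_request_region(unsigned long start, unsigned long end,
                                  const char *owner)
{
    assert(owner != NULL);
    if (start > end || sketch_nregions == SKETCH_MAX_REGIONS)
        return false;
    for (size_t i = 0; i < sketch_nregions; i++) {
        const struct sketch_region *r = &sketch_regions[i];
        if (start <= r->end && r->start <= end)
            return false;  /* busy */
    }
    sketch_regions[sketch_nregions].start = start;
    sketch_regions[sketch_nregions].end = end;
    sketch_regions[sketch_nregions].owner = owner;
    sketch_nregions++;
    return true;
}

/* Claim the whole port space so no driver can claim any part of it. */
static bool sketch_disable_io_space(void)
{
    return sketch_request_region(0, SKETCH_IOPORT_END, "paravirt");
}
```

      A driver probing the legacy IDE ports after this call fails at the
      resource request, before it spends any time touching hardware that
      does not exist.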
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      d572929c
    • Allocate and free vmalloc areas · 5f4352fb
      Jeremy Fitzhardinge authored
      Allocate/release a chunk of vmalloc address space:
       alloc_vm_area reserves a chunk of address space, and makes sure all
       the pagetables are constructed for that address range - but no pages.
      
       free_vm_area releases the address space range.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: "Jan Beulich" <JBeulich@novell.com>
      Cc: "Andi Kleen" <ak@muc.de>
      5f4352fb
    • paravirt: export __supported_pte_mask · bdef40a6
      Jeremy Fitzhardinge authored
      __supported_pte_mask is needed when constructing pte values.  Xen
      device drivers need to do this to make mappings of foreign pages (i.e.,
      pages granted to us by other domains).
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      bdef40a6
    • paravirt: make siblingmap functions visible · c70df743
      Jeremy Fitzhardinge authored
      Paravirt implementations need to set the sibling map on new cpus.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      c70df743
    • paravirt: unstatic smp_store_cpu_info · 724faa89
      Jeremy Fitzhardinge authored
      Paravirt implementations need to store cpu info when bringing up cpus.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      724faa89
    • paravirt: unstatic leave_mm · 53787013
      Jeremy Fitzhardinge authored
      Make leave_mm globally visible, specifically so that Xen can use it to
      shoot down lazy uses of cr3.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      53787013