Commit 0d681009 authored by Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest

* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: (45 commits)
  Use "struct boot_params" in example launcher
  Loading bzImage directly.
  Revert lguest magic and use hook in head.S
  Update lguest documentation to reflect the new virtual block device name.
  generalize lgread_u32/lgwrite_u32.
  Example launcher handle guests not being ready for input
  Update example launcher for virtio
  Lguest support for Virtio
  Remove old lguest I/O infrastructure.
  Remove old lguest bus and drivers.
  Virtio helper routines for a descriptor ringbuffer implementation
  Module autoprobing support for virtio drivers.
  Virtio console driver
  Block driver using virtio.
  Net driver using virtio
  Virtio interface
  Boot with virtual == physical to get closer to native Linux.
  Allow guest to specify syscall vector to use.
  Rename "cr3" to "gpgdir" to avoid x86-specific naming.
  Pagetables to use normal kernel types
  ...
parents a98ce5c6 43d33b21
# This creates the demonstration utility "lguest" which runs a Linux guest.
# For those people that have a separate object dir, look there for .config
KBUILD_OUTPUT := ../..
ifdef O
ifeq ("$(origin O)", "command line")
KBUILD_OUTPUT := $(O)
endif
endif
# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
include $(KBUILD_OUTPUT)/.config
LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000)
CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -Wl,-T,lguest.lds
CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include
LDLIBS:=-lz
# Removing this works for some versions of ld.so (eg. Ubuntu Feisty) and
# not others (eg. FC7).
LDFLAGS+=-static
all: lguest.lds lguest
# The linker script on x86 is so complex the only way of creating one
# which will link our binary in the right place is to mangle the
# default one.
lguest.lds:
$(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@
all: lguest
clean:
rm -f lguest.lds lguest
rm -f lguest
......@@ -6,7 +6,7 @@ Lguest is designed to be a minimal hypervisor for the Linux kernel, for
Linux developers and users to experiment with virtualization with the
minimum of complexity. Nonetheless, it should have sufficient
features to make it useful for specific tasks, and, of course, you are
encouraged to fork and enhance it.
encouraged to fork and enhance it (see drivers/lguest/README).
Features:
......@@ -23,19 +23,30 @@ Developer features:
Running Lguest:
- Lguest runs the same kernel as guest and host. You can configure
them differently, but usually it's easiest not to.
- The easiest way to run lguest is to use the same kernel as guest and host.
You can configure them differently, but usually it's easiest not to.
You will need to configure your kernel with the following options:
CONFIG_HIGHMEM64G=n ("High Memory Support" "64GB")[1]
CONFIG_TUN=y/m ("Universal TUN/TAP device driver support")
CONFIG_EXPERIMENTAL=y ("Prompt for development and/or incomplete code/drivers")
CONFIG_PARAVIRT=y ("Paravirtualization support (EXPERIMENTAL)")
CONFIG_LGUEST=y/m ("Linux hypervisor example code")
and I recommend:
CONFIG_HZ=100 ("Timer frequency")[2]
"General setup":
"Prompt for development and/or incomplete code/drivers" = Y
(CONFIG_EXPERIMENTAL=y)
"Processor type and features":
"Paravirtualized guest support" = Y
"Lguest guest support" = Y
"High Memory Support" = off/4GB
"Alignment value to which kernel should be aligned" = 0x100000
(CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and
CONFIG_PHYSICAL_ALIGN=0x100000)
"Device Drivers":
"Network device support"
"Universal TUN/TAP device driver support" = M/Y
(CONFIG_TUN=m)
"Virtualization"
"Linux hypervisor example code" = M/Y
(CONFIG_LGUEST=m)
- A tool called "lguest" is available in this directory: type "make"
to build it. If you didn't build your kernel in-tree, use "make
......@@ -51,14 +62,17 @@ Running Lguest:
dd if=/dev/zero of=rootfile bs=1M count=2048
qemu -cdrom image.iso -hda rootfile -net user -net nic -boot d
Make sure that you install a getty on /dev/hvc0 if you want to log in on the
console!
- "modprobe lg" if you built it as a module.
- Run an lguest as root:
Documentation/lguest/lguest 64m vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/lgba
Documentation/lguest/lguest 64 vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/vda
Explanation:
64m: the amount of memory to use.
64: the amount of memory to use, in MB.
vmlinux: the kernel image found in the top of your build directory. You
can also use a standard bzImage.
......@@ -66,10 +80,10 @@ Running Lguest:
--tunnet=192.168.19.1: configures a "tap" device for networking with this
IP address.
--block=rootfile: a file or block device which becomes /dev/lgba
--block=rootfile: a file or block device which becomes /dev/vda
inside the guest.
root=/dev/lgba: this (and anything else on the command line) are
root=/dev/vda: this (and anything else on the command line) are
kernel boot parameters.
- Configuring networking. I usually have the host masquerade, using
......@@ -99,31 +113,7 @@ Running Lguest:
"--sharenet=<filename>": any two guests using the same file are on
the same network. This file is created if it does not exist.
Lguest I/O model:
Lguest uses a simplified DMA model plus shared memory for I/O. Guests
can communicate with each other if they share underlying memory
(usually by the lguest program mmaping the same file), but they can
use any non-shared memory to communicate with the lguest process.
Guests can register DMA buffers at any key (must be a valid physical
address) using the LHCALL_BIND_DMA(key, dmabufs, num<<8|irq)
hypercall. "dmabufs" is the physical address of an array of "num"
"struct lguest_dma": each contains a used_len, and an array of
physical addresses and lengths. When a transfer occurs, the
"used_len" field of one of the buffers which has used_len 0 will be
set to the length transferred and the irq will fire.
There is a helpful mailing list at http://ozlabs.org/mailman/listinfo/lguest
Using an irq value of 0 unbinds the dma buffers.
To send DMA, the LHCALL_SEND_DMA(key, dma_physaddr) hypercall is used,
and the number of bytes used is written to the used_len field. This can be 0 if
no one else has bound a DMA buffer to that key, or if some other error occurred.
DMA buffers bound by the same guest are ignored.
Cheers!
Good luck!
Rusty Russell rusty@rustcorp.com.au.
[1] These are on various places on the TODO list, waiting for you to
get annoyed enough at the limitation to fix it.
[2] Lguest is not yet tickless when idle. See [1].
......@@ -227,28 +227,40 @@ config SCHED_NO_NO_OMIT_FRAME_POINTER
If in doubt, say "Y".
config PARAVIRT
bool "Paravirtualization support (EXPERIMENTAL)"
depends on EXPERIMENTAL
bool
depends on !(X86_VISWS || X86_VOYAGER)
help
Paravirtualization is a way of running multiple instances of
Linux on the same machine, under a hypervisor. This option
changes the kernel so it can modify itself when it is run
under a hypervisor, improving performance significantly.
However, when run without a hypervisor the kernel is
theoretically slower. If in doubt, say N.
This changes the kernel so it can modify itself when it is run
under a hypervisor, potentially improving performance significantly
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.
menuconfig PARAVIRT_GUEST
bool "Paravirtualized guest support"
help
Say Y here to get to see options related to running Linux under
various hypervisors. This option alone does not add any kernel code.
If you say N, all options in this submenu will be skipped and disabled.
if PARAVIRT_GUEST
source "arch/x86/xen/Kconfig"
config VMI
bool "VMI Paravirt-ops support"
depends on PARAVIRT
bool "VMI Guest support"
select PARAVIRT
depends on !(X86_VISWS || X86_VOYAGER)
help
VMI provides a paravirtualized interface to the VMware ESX server
(it could be used by other hypervisors in theory too, but is not
at the moment), by linking the kernel to a GPL-ed ROM module
provided by the hypervisor.
source "arch/x86/lguest/Kconfig"
endif
config ACPI_SRAT
bool
default y
......
......@@ -99,6 +99,9 @@ core-$(CONFIG_X86_ES7000) := arch/x86/mach-es7000/
# Xen paravirtualization support
core-$(CONFIG_XEN) += arch/x86/xen/
# lguest paravirtualization support
core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/
# default subarch .h files
mflags-y += -Iinclude/asm-x86/mach-default
......
......@@ -136,6 +136,7 @@ void foo(void)
#ifdef CONFIG_LGUEST_GUEST
BLANK();
OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
OFFSET(LGUEST_DATA_pgdir, lguest_data, pgdir);
OFFSET(LGUEST_PAGES_host_gdt_desc, lguest_pages, state.host_gdt_desc);
OFFSET(LGUEST_PAGES_host_idt_desc, lguest_pages, state.host_idt_desc);
OFFSET(LGUEST_PAGES_host_cr3, lguest_pages, state.host_cr3);
......
config LGUEST_GUEST
bool "Lguest guest support"
select PARAVIRT
depends on !X86_PAE
select VIRTIO
select VIRTIO_RING
select VIRTIO_CONSOLE
help
Lguest is a tiny in-kernel hypervisor. Selecting this will
allow your kernel to boot under lguest. This option will increase
your kernel size by about 6k. If in doubt, say N.
If you say Y here, make sure you say Y (or M) to the virtio block
and net drivers which lguest needs.
obj-y := i386_head.o boot.o
......@@ -55,7 +55,7 @@
#include <linux/clockchips.h>
#include <linux/lguest.h>
#include <linux/lguest_launcher.h>
#include <linux/lguest_bus.h>
#include <linux/virtio_console.h>
#include <asm/paravirt.h>
#include <asm/param.h>
#include <asm/page.h>
......@@ -65,6 +65,7 @@
#include <asm/e820.h>
#include <asm/mce.h>
#include <asm/io.h>
#include <asm/i387.h>
/*G:010 Welcome to the Guest!
*
......@@ -85,9 +86,10 @@ struct lguest_data lguest_data = {
.hcall_status = { [0 ... LHCALL_RING_SIZE-1] = 0xFF },
.noirq_start = (u32)lguest_noirq_start,
.noirq_end = (u32)lguest_noirq_end,
.kernel_address = PAGE_OFFSET,
.blocked_interrupts = { 1 }, /* Block timer interrupts */
.syscall_vec = SYSCALL_VECTOR,
};
struct lguest_device_desc *lguest_devices;
static cycle_t clock_base;
/*G:035 Notice the lazy_hcall() above, rather than hcall(). This is our first
......@@ -146,10 +148,10 @@ void async_hcall(unsigned long call,
/* Table full, so do normal hcall which will flush table. */
hcall(call, arg1, arg2, arg3);
} else {
lguest_data.hcalls[next_call].eax = call;
lguest_data.hcalls[next_call].edx = arg1;
lguest_data.hcalls[next_call].ebx = arg2;
lguest_data.hcalls[next_call].ecx = arg3;
lguest_data.hcalls[next_call].arg0 = call;
lguest_data.hcalls[next_call].arg1 = arg1;
lguest_data.hcalls[next_call].arg2 = arg2;
lguest_data.hcalls[next_call].arg3 = arg3;
/* Arguments must all be written before we mark it to go */
wmb();
lguest_data.hcall_status[next_call] = 0;
......@@ -160,46 +162,6 @@ void async_hcall(unsigned long call,
}
/*:*/
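The ring logic in async_hcall() can be sketched in userspace C. This is only a minimal sketch under stated assumptions: the ring size of 4, the `flush_ring()` stand-in for the Host, and the names `hcall_ring` and `queue_hcall` are hypothetical. Only the 0xFF "slot is free" marker, the arg0..arg3 layout, and the "fall back to a normal hypercall when the table is full" behaviour come from the code above.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 4      /* hypothetical; the kernel uses LHCALL_RING_SIZE */
#define SLOT_FREE 0xFF   /* same "slot is free" marker as hcall_status[] */

struct hcall_args { unsigned long arg0, arg1, arg2, arg3; };

struct hcall_ring {
	struct hcall_args hcalls[RING_SIZE];
	uint8_t status[RING_SIZE];
	unsigned int next_call;   /* next slot the guest tries to fill */
	unsigned int sync_calls;  /* times we fell back to a synchronous call */
};

void ring_init(struct hcall_ring *r)
{
	for (unsigned int i = 0; i < RING_SIZE; i++)
		r->status[i] = SLOT_FREE;
	r->next_call = 0;
	r->sync_calls = 0;
}

/* Stand-in for the Host: consume every pending slot and mark it free. */
unsigned int flush_ring(struct hcall_ring *r)
{
	unsigned int done = 0;

	for (unsigned int i = 0; i < RING_SIZE; i++) {
		if (r->status[i] == 0) {
			r->status[i] = SLOT_FREE;
			done++;
		}
	}
	return done;
}

/* Queue a call if the next slot is free; otherwise do a synchronous call,
 * which flushes the ring. Returns 1 if queued, 0 if done synchronously. */
int queue_hcall(struct hcall_ring *r, unsigned long call,
		unsigned long a1, unsigned long a2, unsigned long a3)
{
	unsigned int n = r->next_call;

	assert(n < RING_SIZE);
	if (r->status[n] != SLOT_FREE) {
		r->sync_calls++;
		flush_ring(r);
		return 0;
	}
	r->hcalls[n] = (struct hcall_args){ call, a1, a2, a3 };
	/* The kernel issues wmb() here so the arguments land before the mark. */
	r->status[n] = 0;
	r->next_call = (n + 1) % RING_SIZE;
	return 1;
}
```

Filling all RING_SIZE slots and then queueing once more takes the synchronous path, after which slot 0 is free again.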
/* Wrappers for the SEND_DMA and BIND_DMA hypercalls. This is mainly because
* Jeff Garzik complained that __pa() should never appear in drivers, and this
* helps remove most of them. But also, it wraps some ugliness. */
void lguest_send_dma(unsigned long key, struct lguest_dma *dma)
{
/* The hcall might not write this if something goes wrong */
dma->used_len = 0;
hcall(LHCALL_SEND_DMA, key, __pa(dma), 0);
}
int lguest_bind_dma(unsigned long key, struct lguest_dma *dmas,
unsigned int num, u8 irq)
{
/* This is the only hypercall which actually wants 5 arguments, and we
* only support 4. Fortunately the interrupt number is always less
* than 256, so we can pack it with the number of dmas in the final
* argument. */
if (!hcall(LHCALL_BIND_DMA, key, __pa(dmas), (num << 8) | irq))
return -ENOMEM;
return 0;
}
/* Unbinding is the same hypercall as binding, but with 0 num & irq. */
void lguest_unbind_dma(unsigned long key, struct lguest_dma *dmas)
{
hcall(LHCALL_BIND_DMA, key, __pa(dmas), 0);
}
/* For guests, device memory can be used as normal memory, so we cast away the
* __iomem to quieten sparse. */
void *lguest_map(unsigned long phys_addr, unsigned long pages)
{
return (__force void *)ioremap(phys_addr, PAGE_SIZE*pages);
}
void lguest_unmap(void *addr)
{
iounmap((__force void __iomem *)addr);
}
/*G:033
* Here are our first native-instruction replacements: four functions for
* interrupt control.
......@@ -680,6 +642,7 @@ static struct clocksource lguest_clock = {
.mask = CLOCKSOURCE_MASK(64),
.mult = 1 << 22,
.shift = 22,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
/* The "scheduler clock" is just our real clock, adjusted to start at zero */
......@@ -761,11 +724,9 @@ static void lguest_time_init(void)
* the TSC, otherwise it's a dumb nanosecond-resolution clock. Either
* way, the "rating" is initialized so high that it's always chosen
* over any other clocksource. */
if (lguest_data.tsc_khz) {
if (lguest_data.tsc_khz)
lguest_clock.mult = clocksource_khz2mult(lguest_data.tsc_khz,
lguest_clock.shift);
lguest_clock.flags = CLOCK_SOURCE_IS_CONTINUOUS;
}
clock_base = lguest_clock_read();
clocksource_register(&lguest_clock);
......@@ -889,6 +850,23 @@ static __init char *lguest_memory_setup(void)
return "LGUEST";
}
/* Before virtqueues are set up, we use LHCALL_NOTIFY on normal memory to
* produce console output. */
static __init int early_put_chars(u32 vtermno, const char *buf, int count)
{
char scratch[17];
unsigned int len = count;
if (len > sizeof(scratch) - 1)
len = sizeof(scratch) - 1;
scratch[len] = '\0';
memcpy(scratch, buf, len);
hcall(LHCALL_NOTIFY, __pa(scratch), 0, 0);
/* This routine returns the number of bytes actually written. */
return len;
}
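Because early_put_chars() returns the number of bytes it actually wrote (at most 16 per call), a caller has to loop to emit a longer string. The following is a rough userspace sketch of that contract, not the kernel's console code: `copy_chunk` mirrors the truncation logic above, while `write_all` and its `out` buffer are hypothetical stand-ins for the console layer and the LHCALL_NOTIFY hypercall.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_SIZE 17  /* same size as scratch[] in early_put_chars() */

/* Copy at most SCRATCH_SIZE-1 bytes into scratch, NUL-terminate it, and
 * return how many bytes were consumed. */
size_t copy_chunk(char scratch[SCRATCH_SIZE], const char *buf, size_t count)
{
	size_t len = count;

	if (len > SCRATCH_SIZE - 1)
		len = SCRATCH_SIZE - 1;
	memcpy(scratch, buf, len);
	scratch[len] = '\0';
	return len;
}

/* Hypothetical caller: keep writing until the whole buffer has been emitted.
 * Returns the number of chunks used; out must hold count + 1 bytes. */
size_t write_all(const char *buf, size_t count, char *out)
{
	char scratch[SCRATCH_SIZE];
	size_t chunks = 0;

	assert(buf != NULL && out != NULL);
	while (count > 0) {
		size_t n = copy_chunk(scratch, buf, count);

		memcpy(out, scratch, n);  /* stands in for the hypercall */
		out += n;
		buf += n;
		count -= n;
		chunks++;
	}
	*out = '\0';
	return chunks;
}
```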
/*G:050
* Patching (Powerfully Placating Performance Pedants)
*
......@@ -950,18 +928,8 @@ static unsigned lguest_patch(u8 type, u16 clobber, void *ibuf,
/*G:030 Once we get to lguest_init(), we know we're a Guest. The pv_ops
* structures in the kernel provide points for (almost) every routine we have
* to override to avoid privileged instructions. */
__init void lguest_init(void *boot)
__init void lguest_init(void)
{
/* Copy boot parameters first: the Launcher put the physical location
* in %esi, and head.S converted that to a virtual address and handed
* it to us. We use "__memcpy" because "memcpy" sometimes tries to do
* tricky things to go faster, and we're not ready for that. */
__memcpy(&boot_params, boot, PARAM_SIZE);
/* The boot parameters also tell us where the command-line is: save
* that, too. */
__memcpy(boot_command_line, __va(boot_params.hdr.cmd_line_ptr),
COMMAND_LINE_SIZE);
/* We're under lguest, paravirt is enabled, and we're running at
* privilege level 1, not 0 as normal. */
pv_info.name = "lguest";
......@@ -1033,11 +1001,7 @@ __init void lguest_init(void *boot)
/*G:070 Now we've seen all the paravirt_ops, we return to
* lguest_init() where the rest of the fairly chaotic boot setup
* occurs.
*
* The Host expects our first hypercall to tell it where our "struct
* lguest_data" is, so we do that first. */
hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);
* occurs. */
/* The native boot code sets up initial page tables immediately after
* the kernel itself, and sets init_pg_tables_end so they're not
......@@ -1050,11 +1014,6 @@ __init void lguest_init(void *boot)
* the normal data segment to get through booting. */
asm volatile ("mov %0, %%fs" : : "r" (__KERNEL_DS) : "memory");
/* Clear the part of the kernel data which is expected to be zero.
* Normally it will be anyway, but if we're loading from a bzImage with
* CONFIG_RELOCATABLE=y, the relocations will be sitting here. */
memset(__bss_start, 0, __bss_stop - __bss_start);
/* The Host uses the top of the Guest's virtual address space for the
* Host<->Guest Switcher, and it tells us how much it needs in
* lguest_data.reserve_mem, set up on the LGUEST_INIT hypercall. */
......@@ -1092,6 +1051,9 @@ __init void lguest_init(void *boot)
* adapted for lguest's use. */
add_preferred_console("hvc", 0, NULL);
/* Register our very early console. */
virtio_cons_early_init(early_put_chars);
/* Last of all, we set the power management poweroff hook to point to
* the Guest routine to power off. */
pm_power_off = lguest_power_off;
......
#include <linux/linkage.h>
#include <linux/lguest.h>
#include <asm/lguest_hcall.h>
#include <asm/asm-offsets.h>
#include <asm/thread_info.h>
#include <asm/processor-flags.h>
/*G:020 This is where we begin: we have a magic signature which the launcher
* looks for. The plan is that the Linux boot protocol will be extended with a
* "platform type" field which will guide us here from the normal entry point,
* but for the moment this suffices. The normal boot code uses %esi for the
* boot header, so we do too. We convert it to a virtual address by adding
* PAGE_OFFSET, and hand it to lguest_init() as its argument (ie. %eax).
/*G:020 This is where we begin: head.S notes that the boot header's platform
* type field is "1" (lguest), so calls us here. The boot header is in %esi.
*
* WARNING: be very careful here! We're running at addresses equal to physical
* addresses (around 0), not above PAGE_OFFSET as most code expects
* (eg. 0xC0000000). Jumps are relative, so they're OK, but we can't touch any
* data.
*
* The .section line puts this code in .init.text so it will be discarded after
* boot. */
.section .init.text, "ax", @progbits
.ascii "GenuineLguest"
/* Set up initial stack. */
movl $(init_thread_union+THREAD_SIZE),%esp
movl %esi, %eax
addl $__PAGE_OFFSET, %eax
jmp lguest_init
ENTRY(lguest_entry)
/* Make initial hypercall now, so we can set up the pagetables. */
movl $LHCALL_LGUEST_INIT, %eax
movl $lguest_data - __PAGE_OFFSET, %edx
int $LGUEST_TRAP_ENTRY
/* The Host put the toplevel pagetable in lguest_data.pgdir. The movsl
* instruction uses %esi implicitly. */
movl lguest_data - __PAGE_OFFSET + LGUEST_DATA_pgdir, %esi
/* Copy first 32 entries of page directory to __PAGE_OFFSET entries.
* This means the first 128M of kernel memory will be mapped at
* PAGE_OFFSET where the kernel expects to run. This will get it far
* enough through boot to switch to its own pagetables. */
movl $32, %ecx
movl %esi, %edi
addl $((__PAGE_OFFSET >> 22) * 4), %edi
rep
movsl
/* Set up the initial stack so we can run C code. */
movl $(init_thread_union+THREAD_SIZE),%esp
/* Jumps are relative, and we're running __PAGE_OFFSET too low at the
* moment. */
jmp lguest_init+__PAGE_OFFSET
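The arithmetic behind the "rep movsl" copy above can be checked with a small C sketch. It assumes non-PAE i386 paging (1024 four-byte page directory entries, each mapping 4MB) and uses the 0xC0000000 PAGE_OFFSET that the comment gives as an example; the function names are hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT 22  /* each non-PAE page directory entry maps 4MB */
#define PDE_SIZE    4   /* bytes per page directory entry */

/* Index of the page directory entry covering virtual address va. */
uint32_t pde_index(uint32_t va)
{
	return va >> PGDIR_SHIFT;
}

/* Byte offset added to %edi in the assembly: (__PAGE_OFFSET >> 22) * 4. */
uint32_t pde_byte_offset(uint32_t page_offset)
{
	return pde_index(page_offset) * PDE_SIZE;
}

/* Memory covered by `entries` page directory entries, in MB. */
uint32_t mapped_mb(uint32_t entries)
{
	return (entries << PGDIR_SHIFT) >> 20;
}

/* Mimic the copy: alias the first `count` entries at the PAGE_OFFSET slot. */
void alias_low_entries(uint32_t pgdir[1024], uint32_t page_offset,
		       uint32_t count)
{
	uint32_t dst = pde_index(page_offset);

	assert(dst + count <= 1024);
	for (uint32_t i = 0; i < count; i++)
		pgdir[dst + i] = pgdir[i];
}
```

With PAGE_OFFSET at 0xC0000000 the copy targets entry 768 (byte offset 3072), and 32 entries cover the 128M mentioned in the comment.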
/*G:055 We create a macro which puts the assembler code between lgstart_ and
* lgend_ markers. These templates are put in the .text section: they can't be
......
......@@ -3,8 +3,9 @@
#
config XEN
bool "Enable support for Xen hypervisor"
depends on PARAVIRT && X86_CMPXCHG && X86_TSC && !NEED_MULTIPLE_NODES
bool "Xen guest support"
select PARAVIRT
depends on X86_CMPXCHG && X86_TSC && !NEED_MULTIPLE_NODES && !(X86_VISWS || X86_VOYAGER)
help
This is the Linux Xen port. Enabling this will allow the
kernel to boot in a paravirtualized environment under the
......
......@@ -94,5 +94,5 @@ source "drivers/kvm/Kconfig"
source "drivers/uio/Kconfig"
source "drivers/lguest/Kconfig"
source "drivers/virtio/Kconfig"
endmenu
......@@ -91,3 +91,4 @@ obj-$(CONFIG_HID) += hid/
obj-$(CONFIG_PPC_PS3) += ps3/
obj-$(CONFIG_OF) += of/
obj-$(CONFIG_SSB) += ssb/
obj-$(CONFIG_VIRTIO) += virtio/
......@@ -425,4 +425,10 @@ config XEN_BLKDEV_FRONTEND
block device driver. It communicates with a back-end driver
in another domain which drives the actual block device.
config VIRTIO_BLK
tristate "Virtio block driver (EXPERIMENTAL)"
depends on EXPERIMENTAL && VIRTIO
---help---
This is the virtual block driver for lguest. Say Y or M.
endif # BLK_DEV
......@@ -25,10 +25,10 @@ obj-$(CONFIG_SUNVDC) += sunvdc.o
obj-$(CONFIG_BLK_DEV_UMEM) += umem.o
obj-$(CONFIG_BLK_DEV_NBD) += nbd.o
obj-$(CONFIG_BLK_DEV_CRYPTOLOOP) += cryptoloop.o
obj-$(CONFIG_VIRTIO_BLK) += virtio_blk.o
obj-$(CONFIG_VIODASD) += viodasd.o
obj-$(CONFIG_BLK_DEV_SX8) += sx8.o
obj-$(CONFIG_BLK_DEV_UB) += ub.o
obj-$(CONFIG_XEN_BLKDEV_FRONTEND) += xen-blkfront.o
obj-$(CONFIG_LGUEST_BLOCK) += lguest_blk.o
//#define DEBUG
#include <linux/spinlock.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>
#include <linux/virtio.h>
#include <linux/virtio_blk.h>
static unsigned char virtblk_index = 'a';
struct virtio_blk
{
spinlock_t lock;
struct virtio_device *vdev;
struct virtqueue *vq;
/* The disk structure for the kernel. */
struct gendisk *disk;
/* Request tracking. */
struct list_head reqs;
mempool_t *pool;
/* Scatterlist: can be too big for stack. */
struct scatterlist sg[3+MAX_PHYS_SEGMENTS];
};
struct virtblk_req
{
struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
struct virtio_blk_inhdr in_hdr;
};
static bool blk_done(struct virtqueue *vq)
{
struct virtio_blk *vblk = vq->vdev->priv;
struct virtblk_req *vbr;
unsigned int len;
unsigned long flags;
spin_lock_irqsave(&vblk->lock, flags);
while ((vbr = vblk->vq->vq_ops->get_buf(vblk->vq, &len)) != NULL) {
int uptodate;
switch (vbr->in_hdr.status) {
case VIRTIO_BLK_S_OK:
uptodate = 1;
break;
case VIRTIO_BLK_S_UNSUPP:
uptodate = -ENOTTY;
break;
default:
uptodate = 0;
break;
}
end_dequeued_request(vbr->req, uptodate);
list_del(&vbr->list);
mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(&vblk->lock, flags);
return true;
}
static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
struct request *req)
{
unsigned long num, out, in;
struct virtblk_req *vbr;
vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
if (!vbr)
/* When another request finishes we'll try again. */
return false;
vbr->req = req;
if (blk_fs_request(vbr->req)) {
vbr->out_hdr.type = 0;
vbr->out_hdr.sector = vbr->req->sector;
vbr->out_hdr.ioprio = vbr->req->ioprio;
} else if (blk_pc_request(vbr->req)) {
vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
vbr->out_hdr.sector = 0;
vbr->out_hdr.ioprio = vbr->req->ioprio;
} else {
/* We don't put anything else in the queue. */
BUG();
}
if (blk_barrier_rq(vbr->req))
vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
/* We have to zero this, otherwise blk_rq_map_sg gets upset. */
memset(vblk->sg, 0, sizeof(vblk->sg));
sg_set_buf(&vblk->sg[0], &vbr->out_hdr, sizeof(vbr->out_hdr));
num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
sg_set_buf(&vblk->sg[num+1], &vbr->in_hdr, sizeof(vbr->in_hdr));
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
out = 1 + num;
in = 1;
} else {
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
out = 1;
in = 1 + num;
}
if (vblk->vq->vq_ops->add_buf(vblk->vq, vblk->sg, out, in, vbr)) {
mempool_free(vbr, vblk->pool);
return false;
}
list_add_tail(&vbr->list, &vblk->reqs);
return true;
}
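The scatterlist layout that do_req() builds is: one out header, then `num` data segments, then one in (status) header. How those entries are split into device-readable ("out") and device-writable ("in") counts depends only on the request direction. A minimal sketch of that split follows; `enum dir`, `struct sg_counts` and `split_sg` are hypothetical names used for illustration only.

```c
#include <assert.h>

enum dir { DIR_READ, DIR_WRITE };  /* hypothetical stand-in for rq_data_dir() */

struct sg_counts {
	unsigned long out;  /* entries the device reads */
	unsigned long in;   /* entries the device writes */
};

/* sg[0] is the out header, sg[1..num] the data, sg[num+1] the in header. */
struct sg_counts split_sg(enum dir d, unsigned long num)
{
	struct sg_counts c;

	if (d == DIR_WRITE) {
		/* Header and data go to the device; only status comes back. */
		c.out = 1 + num;
		c.in = 1;
	} else {
		/* Only the header goes out; data and status come back. */
		c.out = 1;
		c.in = 1 + num;
	}
	/* Either way every entry is accounted for exactly once. */
	assert(c.out + c.in == num + 2);
	return c;
}
```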
static void do_virtblk_request(struct request_queue *q)
{
struct virtio_blk *vblk = NULL;
struct request *req;
unsigned int issued = 0;
while ((req = elv_next_request(q)) != NULL) {
vblk = req->rq_disk->private_data;
BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
/* If this request fails, stop queue and wait for something to
finish to restart it. */
if (!do_req(q, vblk, req)) {
blk_stop_queue(q);
break;
}
blkdev_dequeue_request(req);
issued++;
}
if (issued)
vblk->vq->vq_ops->kick(vblk->vq);
}
static int virtblk_ioctl(struct inode *inode, struct file *filp,
unsigned cmd, unsigned long data)
{
return scsi_cmd_ioctl(filp, inode->i_bdev->bd_disk->queue,
inode->i_bdev->bd_disk, cmd,
(void __user *)data);
}
static struct block_device_operations virtblk_fops = {
.ioctl = virtblk_ioctl,
.owner = THIS_MODULE,
};
static int virtblk_probe(struct virtio_device *vdev)
{
struct virtio_blk *vblk;
int err, major;
void *token;
unsigned int len;
u64 cap;
u32 v;
vdev->priv = vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
if (!vblk) {
err = -ENOMEM;
goto out;
}
INIT_LIST_HEAD(&vblk->reqs);
spin_lock_init(&vblk->lock);
vblk->vdev = vdev;
/* We expect one virtqueue, for output. */
vblk->vq = vdev->config->find_vq(vdev, blk_done);
if (IS_ERR(vblk->vq)) {
err = PTR_ERR(vblk->vq);
goto out_free_vblk;
}
vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
if (!vblk->pool) {
err = -ENOMEM;
goto out_free_vq;
}
major = register_blkdev(0, "virtblk");
if (major < 0) {
err = major;
goto out_mempool;
}
/* FIXME: How many partitions? How long is a piece of string? */
vblk->disk = alloc_disk(1 << 4);
if (!vblk->disk) {
err = -ENOMEM;
goto out_unregister_blkdev;
}
vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
if (!vblk->disk->queue) {
err = -ENOMEM;
goto out_put_disk;
}
sprintf(vblk->disk->disk_name, "vd%c", virtblk_index++);
vblk->disk->major = major;
vblk->disk->first_minor = 0;
vblk->disk->private_data = vblk;
vblk->disk->fops = &virtblk_fops;
/* If barriers are supported, tell block layer that queue is ordered */
token = vdev->config->find(vdev, VIRTIO_CONFIG_BLK_F, &len);
if (virtio_use_bit(vdev, token, len, VIRTIO_BLK_F_BARRIER))
blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
err = virtio_config_val(vdev, VIRTIO_CONFIG_BLK_F_CAPACITY, &cap);
if (err) {
dev_err(&vdev->dev, "Bad/missing capacity in config\n");
goto out_put_disk;
}
/* If capacity is too big, truncate with warning. */
if ((sector_t)cap != cap) {
dev_warn(&vdev->dev, "Capacity %llu too large: truncating\n",
(unsigned long long)cap);
cap = (sector_t)-1;
}
set_capacity(vblk->disk, cap);
err = virtio_config_val(vdev, VIRTIO_CONFIG_BLK_F_SIZE_MAX, &v);
if (!err)
blk_queue_max_segment_size(vblk->disk->queue, v);
else if (err != -ENOENT) {
dev_err(&vdev->dev, "Bad SIZE_MAX in config\n");
goto out_put_disk;
}
err = virtio_config_val(vdev, VIRTIO_CONFIG_BLK_F_SEG_MAX, &v);
if (!err)
blk_queue_max_hw_segments(vblk->disk->queue, v);
else if (err != -ENOENT) {
dev_err(&vdev->dev, "Bad SEG_MAX in config\n");
goto out_put_disk;
}
add_disk(vblk->disk);
return 0;
out_put_disk:
put_disk(vblk->disk);
out_unregister_blkdev:
unregister_blkdev(major, "virtblk");
out_mempool:
mempool_destroy(vblk->pool);
out_free_vq:
vdev->config->del_vq(vblk->vq);
out_free_vblk:
kfree(vblk);
out:
return err;
}
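The capacity check in virtblk_probe() relies on a cast round trip: if narrowing the 64-bit config value to sector_t and widening it again changes the value, the capacity does not fit. The sketch below illustrates this under the assumption of a 32-bit sector_t (modelled by the hypothetical `sector32_t`); `clamp_capacity` is likewise a made-up helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t sector32_t;  /* hypothetical: a 32-bit sector_t */

/* Return the capacity to report; set *truncated if cap did not fit. */
sector32_t clamp_capacity(uint64_t cap, int *truncated)
{
	assert(truncated != 0);
	/* The narrowed value is promoted back to 64 bits for the comparison,
	 * so any lost high bits make the two sides differ. */
	*truncated = ((sector32_t)cap != cap);
	if (*truncated)
		return (sector32_t)-1;  /* largest representable capacity */
	return (sector32_t)cap;
}
```

For example, the 2048MB root file created in the README is 4194304 sectors of 512 bytes, which fits easily; a capacity of 2^32 sectors would be clamped.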
static void virtblk_remove(struct virtio_device *vdev)
{
struct virtio_blk *vblk = vdev->priv;
int major = vblk->disk->major;
BUG_ON(!list_empty(&vblk->reqs));
blk_cleanup_queue(vblk->disk->queue);
put_disk(vblk->disk);
unregister_blkdev(major, "virtblk");
mempool_destroy(vblk->pool);
kfree(vblk);
}
static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
{ 0 },
};
static struct virtio_driver virtio_blk = {
.driver.name = KBUILD_MODNAME,
.driver.owner = THIS_MODULE,
.id_table = id_table,
.probe = virtblk_probe,
.remove = __devexit_p(virtblk_remove),
};
static int __init init(void)
{
return register_virtio_driver(&virtio_blk);
}
static void __exit fini(void)
{
unregister_virtio_driver(&virtio_blk);
}
module_init(init);
module_exit(fini);
MODULE_DEVICE_TABLE(virtio, id_table);
MODULE_DESCRIPTION("Virtio block driver");
MODULE_LICENSE("GPL");
......@@ -613,6 +613,10 @@ config HVC_XEN
help
Xen virtual console device driver
config VIRTIO_CONSOLE
bool
select HVC_DRIVER
config HVCS
tristate "IBM Hypervisor Virtual Console Server support"
depends on PPC_PSERIES
......
......@@ -42,7 +42,6 @@ obj-$(CONFIG_SYNCLINK_GT) += synclink_gt.o
obj-$(CONFIG_N_HDLC) += n_hdlc.o
obj-$(CONFIG_AMIGA_BUILTIN_SERIAL) += amiserial.o
obj-$(CONFIG_SX) += sx.o generic_serial.o
obj-$(CONFIG_LGUEST_GUEST) += hvc_lguest.o
obj-$(CONFIG_RIO) += rio/ generic_serial.o
obj-$(CONFIG_HVC_CONSOLE) += hvc_vio.o hvsi.o
obj-$(CONFIG_HVC_ISERIES) += hvc_iseries.o
......@@ -50,6 +49,7 @@ obj-$(CONFIG_HVC_RTAS) += hvc_rtas.o
obj-$(CONFIG_HVC_BEAT) += hvc_beat.o
obj-$(CONFIG_HVC_DRIVER) += hvc_console.o
obj-$(CONFIG_HVC_XEN) += hvc_xen.o
obj-$(CONFIG_VIRTIO_CONSOLE) += virtio_console.o
obj-$(CONFIG_RAW_DRIVER) += raw.o
obj-$(CONFIG_SGI_SNSC) += snsc.o snsc_event.o
obj-$(CONFIG_MSPEC) += mspec.o
......
......@@ -47,4 +47,8 @@ config KVM_AMD
Provides support for KVM on AMD processors equipped with the AMD-V
(SVM) extensions.
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source drivers/lguest/Kconfig
endif # VIRTUALIZATION
config LGUEST
tristate "Linux hypervisor example code"
depends on X86 && PARAVIRT && EXPERIMENTAL && !X86_PAE && FUTEX
select LGUEST_GUEST
depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX && !(X86_VISWS || X86_VOYAGER)
select HVC_DRIVER
---help---
This is a very simple module which allows you to run
......@@ -18,13 +17,3 @@ config LGUEST_GUEST
The guest needs code built-in, even if the host has lguest
support as a module. The drivers are tiny, so we build them
in too.
config LGUEST_NET
tristate
default y
depends on LGUEST_GUEST && NET
config LGUEST_BLOCK
tristate
default y
depends on LGUEST_GUEST && BLOCK
# Guest requires the paravirt_ops replacement and the bus driver.
obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_asm.o lguest_bus.o
# Guest requires the device configuration and probing code.
obj-$(CONFIG_LGUEST_GUEST) += lguest_device.o
# Host requires the other files, which can be a module.
obj-$(CONFIG_LGUEST) += lg.o
lg-y := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
segments.o io.o lguest_user.o switcher.o
lg-y = core.o hypercalls.o page_tables.o interrupts_and_traps.o \
segments.o lguest_user.o
lg-$(CONFIG_X86_32) += x86/switcher_32.o x86/core.o
Preparation Preparation!: PREFIX=P
Guest: PREFIX=G
......
......@@ -12,8 +12,14 @@
* them first, so we also have a way of "reflecting" them into the Guest as if
* they had been delivered to it directly. :*/
#include <linux/uaccess.h>
#include <linux/interrupt.h>
#include <linux/module.h>
#include "lg.h"
/* Allow Guests to use a non-128 (ie. non-Linux) syscall trap. */
static unsigned int syscall_vector = SYSCALL_VECTOR;
module_param(syscall_vector, uint, 0444);
/* The address of the interrupt handler is split into two bits: */
static unsigned long idt_address(u32 lo, u32 hi)
{
......@@ -39,7 +45,7 @@ static void push_guest_stack(struct lguest *lg, unsigned long *gstack, u32 val)
{
/* Stack grows downwards: move stack then write value. */
*gstack -= 4;
lgwrite_u32(lg, *gstack, val);
lgwrite(lg, *gstack, u32, val);
}
/*H:210 The set_guest_interrupt() routine actually delivers the interrupt or
......@@ -56,8 +62,9 @@ static void push_guest_stack(struct lguest *lg, unsigned long *gstack, u32 val)
* it). */
static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
{
unsigned long gstack;
unsigned long gstack, origstack;
u32 eflags, ss, irq_enable;
unsigned long virtstack;
/* There are two cases for interrupts: one where the Guest is already
* in the kernel, and a more complex one where the Guest is in
......@@ -65,8 +72,10 @@ static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
if ((lg->regs->ss&0x3) != GUEST_PL) {
/* The Guest told us their kernel stack with the SET_STACK
* hypercall: both the virtual address and the segment */
gstack = guest_pa(lg, lg->esp1);
virtstack = lg->esp1;
ss = lg->ss1;
origstack = gstack = guest_pa(lg, virtstack);
/* We push the old stack segment and pointer onto the new
* stack: when the Guest does an "iret" back from the interrupt
* handler the CPU will notice they're dropping privilege
......@@ -75,8 +84,10 @@ static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
push_guest_stack(lg, &gstack, lg->regs->esp);
} else {
/* We're staying on the same Guest (kernel) stack. */
gstack = guest_pa(lg, lg->regs->esp);
virtstack = lg->regs->esp;
ss = lg->regs->ss;
origstack = gstack = guest_pa(lg, virtstack);
}
/* Remember that we never let the Guest actually disable interrupts, so
......@@ -102,7 +113,7 @@ static void set_guest_interrupt(struct lguest *lg, u32 lo, u32 hi, int has_err)
/* Now we've pushed all the old state, we change the stack, the code
* segment and the address to execute. */
lg->regs->ss = ss;
lg->regs->esp = gstack + lg->page_offset;
lg->regs->esp = virtstack + (gstack - origstack);
lg->regs->cs = (__KERNEL_CS|GUEST_PL);
lg->regs->eip = idt_address(lo, hi);
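The change above no longer derives the Guest's virtual stack pointer by adding `page_offset` to a physical address. It keeps the original virtual address and applies the distance the physical pointer moved while values were pushed. A minimal userspace sketch of that arithmetic follows; `fake_push` and `new_guest_esp` are hypothetical names and the addresses in the usage note are made up.

```c
#include <assert.h>

/* Hypothetical stand-in for push_guest_stack(): only the pointer
 * arithmetic is modelled, nothing is written to Guest memory. */
static void fake_push(unsigned long *gstack)
{
	/* The x86 stack grows down, so a push moves the pointer down by 4. */
	assert(*gstack >= 4);
	*gstack -= 4;
}

/* Recompute the Guest-virtual stack pointer after `pushes` 32-bit pushes
 * made through the Guest-physical alias of the same stack. */
static unsigned long new_guest_esp(unsigned long virtstack,
				   unsigned long physstack,
				   unsigned int pushes)
{
	unsigned long origstack = physstack, gstack = physstack;
	unsigned int i;

	for (i = 0; i < pushes; i++)
		fake_push(&gstack);

	/* gstack - origstack wraps around as an unsigned value, but adding
	 * it to virtstack still yields virtstack minus the bytes pushed. */
	return virtstack + (gstack - origstack);
}
```

With a virtual stack at `0xC0100000` mapped to physical `0x00100000`, five pushes (old ss, esp, eflags, cs, eip) leave the virtual pointer 20 bytes lower, whatever the offset between the two address spaces is.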
......@@ -165,7 +176,7 @@ void maybe_do_interrupt(struct lguest *lg)
/* Look at the IDT entry the Guest gave us for this interrupt. The
* first 32 (FIRST_EXTERNAL_VECTOR) entries are for traps, so we skip
* over them. */
idt = &lg->idt[FIRST_EXTERNAL_VECTOR+irq];
idt = &lg->arch.idt[FIRST_EXTERNAL_VECTOR+irq];
/* If they don't have a handler (yet?), we just ignore it */
if (idt_present(idt->a, idt->b)) {
/* OK, mark it no longer pending and deliver it. */
......@@ -183,6 +194,47 @@ void maybe_do_interrupt(struct lguest *lg)
* timer interrupt. */
write_timestamp(lg);
}
/*:*/
/* Linux uses trap 128 for system calls. Plan9 uses 64, and Ron Minnich sent
* me a patch, so we support that too. It'd be a big step for lguest if half
* the Plan 9 user base were to start using it.
*
* Actually now I think of it, it's possible that Ron *is* half the Plan 9
* userbase. Oh well. */
static bool could_be_syscall(unsigned int num)
{
/* Normal Linux SYSCALL_VECTOR or reserved vector? */
return num == SYSCALL_VECTOR || num == syscall_vector;
}
/* The syscall vector it wants must be unused by Host. */
bool check_syscall_vector(struct lguest *lg)
{
u32 vector;
if (get_user(vector, &lg->lguest_data->syscall_vec))
return false;
return could_be_syscall(vector);
}
int init_interrupts(void)
{
/* If they want some strange system call vector, reserve it now */
if (syscall_vector != SYSCALL_VECTOR
&& test_and_set_bit(syscall_vector, used_vectors)) {
printk("lg: couldn't reserve syscall %u\n", syscall_vector);
return -EBUSY;
}
return 0;
}
void free_interrupts(void)
{
if (syscall_vector != SYSCALL_VECTOR)
clear_bit(syscall_vector, used_vectors);
}
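The reserve/release pairing in `init_interrupts()` and `free_interrupts()` relies on a test-and-set over a bitmap of vectors: reservation fails if the bit was already set. The sketch below is a non-atomic userspace analogue under stated assumptions; `vectors`, `sketch_test_and_set`, `sketch_clear` and `reserve_vector` are hypothetical names, and the kernel's `test_and_set_bit()` is atomic where this is not.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_VECTORS 256
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the kernel's used_vectors bitmap. */
static unsigned long vectors[(NR_VECTORS + BITS_PER_WORD - 1) / BITS_PER_WORD];

/* Non-atomic analogue of test_and_set_bit(): set the bit and report
 * whether it was already set. */
static bool sketch_test_and_set(unsigned int nr)
{
	unsigned long mask, *word;
	bool was_set;

	assert(nr < NR_VECTORS);
	mask = 1UL << (nr % BITS_PER_WORD);
	word = &vectors[nr / BITS_PER_WORD];
	was_set = (*word & mask) != 0;
	*word |= mask;
	return was_set;
}

/* Non-atomic analogue of clear_bit(). */
static void sketch_clear(unsigned int nr)
{
	assert(nr < NR_VECTORS);
	vectors[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

/* Returns 0 on success and -1 if the vector is already taken
 * (the code above returns -EBUSY in that case). */
static int reserve_vector(unsigned int nr)
{
	return sketch_test_and_set(nr) ? -1 : 0;
}
```

Reserving the same vector twice fails the second time. After `sketch_clear()` it can be reserved again, which is what `free_interrupts()` makes possible on module unload.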
/*H:220 Now we've got the routines to deliver interrupts, delivering traps
* like page fault is easy. The only trick is that Intel decided that some
......@@ -197,14 +249,14 @@ int deliver_trap(struct lguest *lg, unsigned int num)
{
/* Trap numbers are always 8 bit, but we set an impossible trap number
* for traps inside the Switcher, so check that here. */
if (num >= ARRAY_SIZE(lg->idt))
if (num >= ARRAY_SIZE(lg->arch.idt))
return 0;
/* Early on the Guest hasn't set the IDT entries (or maybe it put a
* bogus one in): if we fail here, the Guest will be killed. */
if (!idt_present(lg->idt[num].a, lg->idt[num].b))
if (!idt_present(lg->arch.idt[num].a, lg->arch.idt[num].b))
return 0;
set_guest_interrupt(lg, lg->idt[num].a, lg->idt[num].b, has_err(num));
set_guest_interrupt(lg, lg->arch.idt[num].a, lg->arch.idt[num].b, has_err(num));
return 1;
}
......@@ -218,28 +270,20 @@ int deliver_trap(struct lguest *lg, unsigned int num)
* system calls down from 1750ns to 270ns. Plus, if lguest didn't do it, all
* the other hypervisors would tease it.
*
* This routine determines if a trap can be delivered directly. */
static int direct_trap(const struct lguest *lg,
const struct desc_struct *trap,
unsigned int num)
* This routine indicates if a particular trap number could be delivered
* directly. */
static int direct_trap(unsigned int num)
{
/* Hardware interrupts don't go to the Guest at all (except system
* call). */
if (num >= FIRST_EXTERNAL_VECTOR && num != SYSCALL_VECTOR)
if (num >= FIRST_EXTERNAL_VECTOR && !could_be_syscall(num))
return 0;
/* The Host needs to see page faults (for shadow paging and to save the
* fault address), general protection faults (in/out emulation) and
* device not available (TS handling), and of course, the hypercall
* trap. */
if (num == 14 || num == 13 || num == 7 || num == LGUEST_TRAP_ENTRY)
return 0;
/* Only trap gates (type 15) can go direct to the Guest. Interrupt
* gates (type 14) disable interrupts as they are entered, which we
* never let the Guest do. Not present entries (type 0x0) also can't
* go direct, of course 8) */
return idt_type(trap->a, trap->b) == 0xF;
return num != 14 && num != 13 && num != 7 && num != LGUEST_TRAP_ENTRY;
}
/*:*/
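After this change `direct_trap()` depends only on the trap number, so it can be read as a small predicate. The sketch below restates it for userspace. The `sk_` names are hypothetical, and the constant values are assumptions for illustration: 32 and 0x80 are the usual i386 values, and the hypercall trap is assumed to be 0x1F.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, see the note above. */
#define SK_FIRST_EXTERNAL_VECTOR 32
#define SK_SYSCALL_VECTOR        0x80
#define SK_LGUEST_TRAP_ENTRY     0x1F

/* Stand-in for the syscall_vector module parameter. */
static unsigned int sk_syscall_vector = SK_SYSCALL_VECTOR;

static bool sk_could_be_syscall(unsigned int num)
{
	/* Either the normal Linux vector or the reserved alternative. */
	return num == SK_SYSCALL_VECTOR || num == sk_syscall_vector;
}

static bool sk_direct_trap(unsigned int num)
{
	/* Trap numbers are 8 bit in this sketch. */
	assert(num < 256);

	/* External interrupts never go direct, except a system call. */
	if (num >= SK_FIRST_EXTERNAL_VECTOR && !sk_could_be_syscall(num))
		return false;

	/* The Host must see page faults (14), general protection
	 * faults (13), device not available (7) and hypercalls. */
	return num != 14 && num != 13 && num != 7
		&& num != SK_LGUEST_TRAP_ENTRY;
}
```

Under these assumptions a breakpoint (trap 3) and vector 0x80 may go direct, while a page fault or an ordinary external interrupt such as vector 33 may not. Setting `sk_syscall_vector` to 64 additionally lets vector 64 through.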
......@@ -348,15 +392,11 @@ void load_guest_idt_entry(struct lguest *lg, unsigned int num, u32 lo, u32 hi)
* to copy this again. */
lg->changed |= CHANGED_IDT;
/* The IDT which we keep in "struct lguest" only contains 32 entries
* for the traps and LGUEST_IRQS (32) entries for interrupts. We
* ignore attempts to set handlers for higher interrupt numbers, except
* for the system call "interrupt" at 128: we have a special IDT entry
* for that. */
if (num < ARRAY_SIZE(lg->idt))
set_trap(lg, &lg->idt[num], num, lo, hi);
else if (num == SYSCALL_VECTOR)
set_trap(lg, &lg->syscall_idt, num, lo, hi);
/* Check that the Guest doesn't try to step outside the bounds. */
if (num >= ARRAY_SIZE(lg->arch.idt))
kill_guest(lg, "Setting idt entry %u", num);
else
set_trap(lg, &lg->arch.idt[num], num, lo, hi);
}
/* The default entry for each interrupt points into the Switcher routines which
......@@ -399,20 +439,21 @@ void copy_traps(const struct lguest *lg, struct desc_struct *idt,
/* We can simply copy the direct traps, otherwise we use the default
* ones in the Switcher: they will return to the Host. */
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) {
if (direct_trap(lg, &lg->idt[i], i))
idt[i] = lg->idt[i];
for (i = 0; i < ARRAY_SIZE(lg->arch.idt); i++) {
/* If no Guest can ever override this trap, leave it alone. */
if (!direct_trap(i))
continue;
/* Only trap gates (type 15) can go direct to the Guest.
* Interrupt gates (type 14) disable interrupts as they are
* entered, which we never let the Guest do. Not present
* entries (type 0x0) also can't go direct, of course. */
if (idt_type(lg->arch.idt[i].a, lg->arch.idt[i].b) == 0xF)
idt[i] = lg->arch.idt[i];
else
/* Reset it to the default. */
default_idt_entry(&idt[i], i, def[i]);
}
/* Don't forget the system call trap! The IDT entries for other
* interrupts never change, so no need to copy them. */
i = SYSCALL_VECTOR;
if (direct_trap(lg, &lg->syscall_idt, i))
idt[i] = lg->syscall_idt;
else
default_idt_entry(&idt[i], i, def[i]);
}
void guest_set_clockevent(struct lguest *lg, unsigned long delta)
......
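The gate-type check that moved into `copy_traps()` reads the type field of an IDT entry. The bodies of `idt_type()` and `idt_present()` are not shown in this diff, so the sketch below relies only on the architectural x86 gate layout: in the high 32-bit word, bits 8 to 11 hold the type, bits 13 to 14 the DPL, and bit 15 the present flag. All `sk_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the high word of a present x86 gate descriptor. */
static uint32_t sk_gate_hi(uint32_t offset, unsigned int dpl,
			   unsigned int type)
{
	assert(dpl < 4 && type < 16);
	return (offset & 0xFFFF0000u)   /* handler offset, bits 31..16 */
		| 0x8000u               /* present bit */
		| ((uint32_t)dpl << 13) /* descriptor privilege level */
		| ((uint32_t)type << 8);/* gate type */
}

/* Extract the 4-bit gate type; the low word is not needed for this. */
static unsigned int sk_idt_type(uint32_t lo, uint32_t hi)
{
	(void)lo;
	return (hi >> 8) & 0xF;
}

/* Only 32-bit trap gates (type 0xF) qualify. Interrupt gates (0xE)
 * disable interrupts on entry, and an all-zero entry has type 0. */
static bool sk_goes_direct(uint32_t lo, uint32_t hi)
{
	return sk_idt_type(lo, hi) == 0xF;
}
```

A trap gate built with `sk_gate_hi(0xC0123456, 0, 0xF)` passes the check. The same handler installed as an interrupt gate (type `0xE`) does not, and neither does an empty entry, so both fall back to the default Switcher entry.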
This diff is collapsed.
......@@ -48,7 +48,8 @@
#include <linux/linkage.h>
#include <asm/asm-offsets.h>
#include <asm/page.h>
#include "lg.h"
#include <asm/segment.h>
#include <asm/lguest.h>
// We mark the start of the code to copy
// It's placed in .text tho it's never run here
......@@ -132,6 +133,7 @@ ENTRY(switch_to_guest)
// The Guest's register page has been mapped
// Writable onto our %esp (stack) --
// We can simply pop off all Guest regs.
popl %eax
popl %ebx
popl %ecx
popl %edx
......@@ -139,7 +141,6 @@ ENTRY(switch_to_guest)
popl %edi
popl %ebp
popl %gs
popl %eax
popl %fs
popl %ds
popl %es
......@@ -167,7 +168,6 @@ ENTRY(switch_to_guest)
pushl %es; \
pushl %ds; \
pushl %fs; \
pushl %eax; \
pushl %gs; \
pushl %ebp; \
pushl %edi; \
......@@ -175,6 +175,7 @@ ENTRY(switch_to_guest)
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
pushl %eax; \
/* Our stack and our code are using segments \
* Set in the TSS and IDT \
* Yet if we were to touch data we'd use \
......
......@@ -3100,4 +3100,10 @@ config NETPOLL_TRAP
config NET_POLL_CONTROLLER
def_bool NETPOLL
config VIRTIO_NET
tristate "Virtio network driver (EXPERIMENTAL)"
depends on EXPERIMENTAL && VIRTIO
---help---
This is the virtual network driver for lguest. Say Y or M.
endif # NETDEVICES
......@@ -183,7 +183,6 @@ obj-$(CONFIG_ZORRO8390) += zorro8390.o
obj-$(CONFIG_HPLANCE) += hplance.o 7990.o
obj-$(CONFIG_MVME147_NET) += mvme147.o 7990.o
obj-$(CONFIG_EQUALIZER) += eql.o
obj-$(CONFIG_LGUEST_NET) += lguest_net.o
obj-$(CONFIG_MIPS_JAZZ_SONIC) += jazzsonic.o
obj-$(CONFIG_MIPS_AU1X00_ENET) += au1000_eth.o
obj-$(CONFIG_MIPS_SIM_NET) += mipsnet.o
......@@ -243,3 +242,4 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
obj-$(CONFIG_NETXEN_NIC) += netxen/
obj-$(CONFIG_NIU) += niu.o
obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
This diff is collapsed.
# Virtio always gets selected by whoever wants it.
config VIRTIO
bool
# Similarly the virtio ring implementation.
config VIRTIO_RING
bool
depends on VIRTIO
obj-$(CONFIG_VIRTIO) += virtio.o
obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
This diff is collapsed.