1. 14 Jun, 2012 1 commit
    • Stefano Stabellini's avatar
      xen: mark local pages as FOREIGN in the m2p_override · b9e0d95c
      Stefano Stabellini authored
      When the frontend and the backend reside on the same domain, even if we
      add pages to the m2p_override, these pages will never be returned by
      mfn_to_pfn because the check "get_phys_to_machine(pfn) != mfn" will
      always fail, so the pfn of the frontend will be returned instead
      (resulting in a deadlock because the frontend pages are already locked).
      
      INFO: task qemu-system-i38:1085 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      qemu-system-i38 D ffff8800cfc137c0     0  1085      1 0x00000000
       ffff8800c47ed898 0000000000000282 ffff8800be4596b0 00000000000137c0
       ffff8800c47edfd8 ffff8800c47ec010 00000000000137c0 00000000000137c0
       ffff8800c47edfd8 00000000000137c0 ffffffff82213020 ffff8800be4596b0
      Call Trace:
       [<ffffffff81101ee0>] ? __lock_page+0x70/0x70
       [<ffffffff81a0fdd9>] schedule+0x29/0x70
       [<ffffffff81a0fe80>] io_schedule+0x60/0x80
       [<ffffffff81101eee>] sleep_on_page+0xe/0x20
       [<ffffffff81a0e1ca>] __wait_on_bit_lock+0x5a/0xc0
       [<ffffffff81101ed7>] __lock_page+0x67/0x70
       [<ffffffff8106f750>] ? autoremove_wake_function+0x40/0x40
       [<ffffffff811867e6>] ? bio_add_page+0x36/0x40
       [<ffffffff8110b692>] set_page_dirty_lock+0x52/0x60
       [<ffffffff81186021>] bio_set_pages_dirty+0x51/0x70
       [<ffffffff8118c6b4>] do_blockdev_direct_IO+0xb24/0xeb0
       [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
       [<ffffffff8118ca95>] __blockdev_direct_IO+0x55/0x60
       [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
       [<ffffffff811e91c8>] ext3_direct_IO+0xf8/0x390
       [<ffffffff811e71a0>] ? ext3_get_blocks_handle+0xe00/0xe00
       [<ffffffff81004b60>] ? xen_mc_flush+0xb0/0x1b0
       [<ffffffff81104027>] generic_file_aio_read+0x737/0x780
       [<ffffffff813bedeb>] ? gnttab_map_refs+0x15b/0x1e0
       [<ffffffff811038f0>] ? find_get_pages+0x150/0x150
       [<ffffffff8119736c>] aio_rw_vect_retry+0x7c/0x1d0
       [<ffffffff811972f0>] ? lookup_ioctx+0x90/0x90
       [<ffffffff81198856>] aio_run_iocb+0x66/0x1a0
       [<ffffffff811998b8>] do_io_submit+0x708/0xb90
       [<ffffffff81199d50>] sys_io_submit+0x10/0x20
       [<ffffffff81a18d69>] system_call_fastpath+0x16/0x1b
      
      The explanation is in the comment within the code:
      
      We need to do this because the pages shared by the frontend
      (xen-blkfront) can be already locked (lock_page, called by
      do_read_cache_page); when the userspace backend tries to use them
      with direct_IO, mfn_to_pfn returns the pfn of the frontend, so
      do_blockdev_direct_IO is going to try to lock the same pages
      again resulting in a deadlock.
      
      A simplified call graph looks like this:
      
      pygrub                          QEMU
      -----------------------------------------------
      do_read_cache_page              io_submit
        |                              |
      lock_page                       ext3_direct_IO
                                       |
                                      bio_add_page
                                       |
                                      lock_page
      
      Internally the xen-blkback uses m2p_add_override to swizzle (temporarily)
      a 'struct page' to have a different MFN (so that it can point to another
      guest). It also can easily find out whether another pfn corresponding
      to the mfn exists in the m2p, and can set the FOREIGN bit
      in the p2m, making sure that mfn_to_pfn returns the pfn of the backend.
      
      This allows the backend to perform direct_IO on these pages, but as a
      side effect prevents the frontend from using get_user_pages_fast on
      them while they are being shared with the backend.
      Signed-off-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      b9e0d95c
  2. 31 May, 2012 1 commit
    • Andre Przywara's avatar
      xen/setup: filter APERFMPERF cpuid feature out · 5e626254
      Andre Przywara authored
      Xen PV kernels allow access to the APERF/MPERF registers to read the
      effective frequency. Access to the MSRs is however redirected to the
      currently scheduled physical CPU, making consecutive read and
      compares unreliable. In addition each rdmsr traps into the hypervisor.
      So to avoid bogus readouts and expensive traps, disable the kernel
      internal feature flag for APERF/MPERF if running under Xen.
      This will
      a) remove the aperfmperf flag from /proc/cpuinfo
      b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to
         use the feature to improve scheduling (by default disabled)
      c) not mislead the cpufreq driver to use the MSRs
      
      This does not cover userland programs which access the MSRs via the
      device file interface, but this will be addressed separately.
      Signed-off-by: default avatarAndre Przywara <andre.przywara@amd.com>
      Cc: stable@vger.kernel.org # v3.0+
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      5e626254
  3. 30 May, 2012 3 commits
  4. 24 May, 2012 3 commits
  5. 21 May, 2012 5 commits
  6. 07 May, 2012 9 commits
    • Konrad Rzeszutek Wilk's avatar
      Merge branch 'stable/autoballoon.v5.2' into stable/for-linus-3.5 · 4b3451ad
      Konrad Rzeszutek Wilk authored
      * stable/autoballoon.v5.2:
        xen/setup: update VA mapping when releasing memory during setup
        xen/setup: Combine the two hypercall functions - since they are quite similar.
        xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
        xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
        xen/p2m: An early bootup variant of set_phys_to_machine
        xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
        xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
        xen/p2m: Move code around to allow for better re-usage.
      4b3451ad
    • Stefano Stabellini's avatar
      xen: enter/exit lazy_mmu_mode around m2p_override calls · f62805f1
      Stefano Stabellini authored
      This patch is a significant performance improvement for the
      m2p_override: about 6% using the gntdev device.
      
      Each m2p_add/remove_override call issues a MULTI_grant_table_op and a
      __flush_tlb_single if kmap_op != NULL.  Batching all the calls together
      is a great performance benefit because it means issuing one hypercall
      total rather than two hypercall per page.
      If paravirt_lazy_mode is set PARAVIRT_LAZY_MMU, all these calls are
      going to be batched together, otherwise they are issued one at a time.
      
      Adding arch_enter_lazy_mmu_mode/arch_leave_lazy_mmu_mode around the
      m2p_add/remove_override calls forces paravirt_lazy_mode to
      PARAVIRT_LAZY_MMU, therefore makes sure that they are always batched.
      
      However it is not safe to call arch_enter_lazy_mmu_mode if we are in
      interrupt context or if we are already in PARAVIRT_LAZY_MMU mode, so
      check for both conditions before doing so.
      
      Changes in v4:
      - rebased on 3.4-rc4: all the m2p_override users call gnttab_unmap_refs
      and gnttab_map_refs;
      - check whether we are in interrupt context and the lazy_mode we are in
      before calling arch_enter/leave_lazy_mmu_mode.
      
      Changes in v3:
      - do not call arch_enter/leave_lazy_mmu_mode in xen_blkbk_unmap, that
      can be called in interrupt context.
      Signed-off-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      [v5: s/int lazy/bool lazy/]
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      f62805f1
    • Konrad Rzeszutek Wilk's avatar
      xen/acpi/sleep: Enable ACPI sleep via the __acpi_os_prepare_sleep · 211063dc
      Konrad Rzeszutek Wilk authored
      Provide the registration callback to call in the Xen's
      ACPI sleep functionality. This means that during S3/S5
      we make a hypercall XENPF_enter_acpi_sleep with the
      proper PM1A/PM1B registers.
      
      Based of Ke Yu's <ke.yu@intel.com> initial idea.
      [ From http://xenbits.xensource.com/linux-2.6.18-xen.hg
      change c68699484a65 ]
      
      [v1: Added Copyright and license]
      [v2: Added check if PM1A/B the 16-bits MSB contain something. The spec
           only uses 16-bits but might have more in future]
      Signed-off-by: default avatarLiang Tang <liang.tang@oracle.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      211063dc
    • Lin Ming's avatar
      1ff2b0c3
    • Ben Guthro's avatar
      xen: implement apic ipi interface · f447d56d
      Ben Guthro authored
      Map native ipi vector to xen vector.
      Implement apic ipi interface with xen_send_IPI_one.
      Tested-by: default avatarSteven Noonan <steven@uplinklabs.net>
      Signed-off-by: default avatarBen Guthro <ben@guthro.net>
      Signed-off-by: default avatarLin Ming <mlin@ss.pku.edu.cn>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      f447d56d
    • David Vrabel's avatar
      xen/setup: update VA mapping when releasing memory during setup · 83d51ab4
      David Vrabel authored
      In xen_memory_setup(), if a page that is being released has a VA
      mapping this must also be updated.  Otherwise, the page will be not
      released completely -- it will still be referenced in Xen and won't be
      freed util the mapping is removed and this prevents it from being
      reallocated at a different PFN.
      
      This was already being done for the ISA memory region in
      xen_ident_map_ISA() but on many systems this was omitting a few pages
      as many systems marked a few pages below the ISA memory region as
      reserved in the e820 map.
      
      This fixes errors such as:
      
      (XEN) page_alloc.c:1148:d0 Over-allocation for domain 0: 2097153 > 2097152
      (XEN) memory.c:133:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 17)
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      83d51ab4
    • Konrad Rzeszutek Wilk's avatar
      xen/setup: Combine the two hypercall functions - since they are quite similar. · 96dc08b3
      Konrad Rzeszutek Wilk authored
      They use the same set of arguments, so it is just the matter
      of using the proper hypercall.
      Acked-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      96dc08b3
    • Konrad Rzeszutek Wilk's avatar
      xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM · 2e2fb754
      Konrad Rzeszutek Wilk authored
      When the Xen hypervisor boots a PV kernel it hands it two pieces
      of information: nr_pages and a made up E820 entry.
      
      The nr_pages value defines the range from zero to nr_pages of PFNs
      which have a valid Machine Frame Number (MFN) underneath it. The
      E820 mirrors that (with the VGA hole):
      BIOS-provided physical RAM map:
       Xen: 0000000000000000 - 00000000000a0000 (usable)
       Xen: 00000000000a0000 - 0000000000100000 (reserved)
       Xen: 0000000000100000 - 0000000080800000 (usable)
      
      The fun comes when a PV guest that is run with a machine E820 - that
      can either be the initial domain or a PCI PV guest, where the E820
      looks like the normal thing:
      
      BIOS-provided physical RAM map:
       Xen: 0000000000000000 - 000000000009e000 (usable)
       Xen: 000000000009ec00 - 0000000000100000 (reserved)
       Xen: 0000000000100000 - 0000000020000000 (usable)
       Xen: 0000000020000000 - 0000000020200000 (reserved)
       Xen: 0000000020200000 - 0000000040000000 (usable)
       Xen: 0000000040000000 - 0000000040200000 (reserved)
       Xen: 0000000040200000 - 00000000bad80000 (usable)
       Xen: 00000000bad80000 - 00000000badc9000 (ACPI NVS)
      ..
      With that overlaying the nr_pages directly on the E820 does not
      work as there are gaps and non-RAM regions that won't be used
      by the memory allocator. The 'xen_release_chunk' helps with that
      by punching holes in the P2M (PFN to MFN lookup tree) for those
      regions and tells us that:
      
      Freeing  20000-20200 pfn range: 512 pages freed
      Freeing  40000-40200 pfn range: 512 pages freed
      Freeing  bad80-badf4 pfn range: 116 pages freed
      Freeing  badf6-bae7f pfn range: 137 pages freed
      Freeing  bb000-100000 pfn range: 282624 pages freed
      Released 283999 pages of unused memory
      
      Those 283999 pages are subtracted from the nr_pages and are returned
      to the hypervisor. The end result is that the initial domain
      boots with 1GB less memory as the nr_pages has been subtracted by
      the amount of pages residing within the PCI hole. It can balloon up
      to that if desired using 'xl mem-set 0 8092', but the balloon driver
      is not always compiled in for the initial domain.
      
      This patch, implements the populate hypercall (XENMEM_populate_physmap)
      which increases the the domain with the same amount of pages that
      were released.
      
      The other solution (that did not work) was to transplant the MFN in
      the P2M tree - the ones that were going to be freed were put in
      the E820_RAM regions past the nr_pages. But the modifications to the
      M2P array (the other side of creating PTEs) were not carried away.
      As the hypervisor is the only one capable of modifying that and the
      only two hypercalls that would do this are: the update_va_mapping
      (which won't work, as during initial bootup only PFNs up to nr_pages
      are mapped in the guest) or via the populate hypercall.
      
      The end result is that the kernel can now boot with the
      nr_pages without having to subtract the 283999 pages.
      
      On a 8GB machine, with various dom0_mem= parameters this is what we get:
      
      no dom0_mem
      -Memory: 6485264k/9435136k available (5817k kernel code, 1136060k absent, 1813812k reserved, 2899k data, 696k init)
      +Memory: 7619036k/9435136k available (5817k kernel code, 1136060k absent, 680040k reserved, 2899k data, 696k init)
      
      dom0_mem=3G
      -Memory: 2616536k/9435136k available (5817k kernel code, 1136060k absent, 5682540k reserved, 2899k data, 696k init)
      +Memory: 2703776k/9435136k available (5817k kernel code, 1136060k absent, 5595300k reserved, 2899k data, 696k init)
      
      dom0_mem=max:3G
      -Memory: 2696732k/4281724k available (5817k kernel code, 1136060k absent, 448932k reserved, 2899k data, 696k init)
      +Memory: 2702204k/4281724k available (5817k kernel code, 1136060k absent, 443460k reserved, 2899k data, 696k init)
      
      And the 'xm list' or 'xl list' now reflect what the dom0_mem=
      argument is.
      Acked-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      [v2: Use populate hypercall]
      [v3: Remove debug printks]
      [v4: Simplify code]
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      2e2fb754
    • Konrad Rzeszutek Wilk's avatar
      xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0 · ca118238
      Konrad Rzeszutek Wilk authored
      Otherwise we can get these meaningless:
      Freeing  bad80-badf4 pfn range: 0 pages freed
      
      We also can do this for the summary ones - no point of printing
      "Set 0 page(s) to 1-1 mapping"
      Acked-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      [v1: Extended to the summary printks]
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ca118238
  7. 17 Apr, 2012 2 commits
  8. 16 Apr, 2012 3 commits
    • Linus Torvalds's avatar
      Linux 3.4-rc3 · e816b57a
      Linus Torvalds authored
      e816b57a
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm · 9a8e5d41
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
       "Nothing too disasterous, the biggest thing being the removal of the
        regulator support for vcore in the AMBA driver; only one SoC was using
        this and it got broken during the last merge window, which then
        started causing problems for other people.  Mutual agreement was
        reached for it to be removed."
      
      * 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
        ARM: 7386/1: jump_label: fixup for rename to static_key
        ARM: 7384/1: ThumbEE: Disable userspace TEEHBR access for !CONFIG_ARM_THUMBEE
        ARM: 7382/1: mm: truncate memory banks to fit in 4GB space for classic MMU
        ARM: 7359/2: smp_twd: Only wait for reprogramming on active cpus
        ARM: 7383/1: nommu: populate vectors page from paging_init
        ARM: 7381/1: nommu: fix typo in mm/Kconfig
        ARM: 7380/1: DT: do not add a zero-sized memory property
        ARM: 7379/1: DT: fix atags_to_fdt() second call site
        ARM: 7366/3: amba: Remove AMBA level regulator support
        ARM: 7377/1: vic: re-read status register before dispatching each IRQ handler
        ARM: 7368/1: fault.c: correct how the tsk->[maj|min]_flt gets incremented
      9a8e5d41
    • Linus Torvalds's avatar
      x86-32: fix up strncpy_from_user() sign error · 12e993b8
      Linus Torvalds authored
      The 'max' range needs to be unsigned, since the size of the user address
      space is bigger than 2GB.
      
      We know that 'count' is positive in 'long' (that is checked in the
      caller), so we will truncate 'max' down to something that fits in a
      signed long, but before we actually do that, that comparison needs to be
      done in unsigned.
      
      Bug introduced in commit 92ae03f2 ("x86: merge 32/64-bit versions of
      'strncpy_from_user()' and speed it up").  On x86-64 you can't trigger
      this, since the user address space is much smaller than 63 bits, and on
      x86-32 it works in practice, since you would seldom hit the strncpy
      limits anyway.
      
      I had actually tested the corner-cases, I had only tested them on
      x86-64.  Besides, I had only worried about the case of a pointer *close*
      to the end of the address space, rather than really far away from it ;)
      
      This also changes the "we hit the user-specified maximum" to return
      'res', for the trivial reason that gcc seems to generate better code
      that way.  'res' and 'count' are the same in that case, so it really
      doesn't matter which one we return.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      12e993b8
  9. 15 Apr, 2012 12 commits
  10. 14 Apr, 2012 1 commit