Commit b0c79f49 authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm updates from Ingo Molnar:

 - Introduce the ORC unwinder, which can be enabled via
   CONFIG_ORC_UNWINDER=y.

   The ORC unwinder is a lightweight, Linux kernel specific debuginfo
   implementation, which aims to be DWARF done right for unwinding.
   Objtool is used to generate the ORC unwinder tables during build, so
   the data format is flexible and kernel internal: there's no
   dependency on debuginfo created by an external toolchain.

   The ORC unwinder is almost two orders of magnitude faster than the
   (out of tree) DWARF unwinder - which is important for perf call graph
   profiling. It is also significantly simpler and is coded defensively:
   there has not been a single ORC related kernel crash so far, even
   with early versions. (knock on wood!)

   But the main advantage is that enabling the ORC unwinder allows
   CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel
   measurably:

   With frame pointers disabled, GCC does not have to add frame pointer
   instrumentation code to every function in the kernel. The kernel's
   .text size decreases by about 3.2%, resulting in better cache
   utilization and fewer instructions executed, resulting in a broad
   kernel-wide speedup. Average speedup of system calls should be
   roughly in the 1-3% range - measurements by Mel Gorman [1] have shown
   a speedup of 5-10% for some function execution intense workloads.

   The main cost of the unwinder is that the unwinder data has to be
   stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
   config - which is a modest cost on modern x86 systems.

   Given how young the ORC unwinder code is it's not enabled by default
   - but given the performance advantages the plan is to eventually make
   it the default unwinder on x86.

   See Documentation/x86/orc-unwinder.txt for more details.

 - Remove lguest support: its intended role was that of a temporary
   proof of concept for virtualization, plus its removal will enable the
   reduction (removal) of the paravirt API as well, so Rusty agreed to
   its removal. (Juergen Gross)

 - Clean up and fix FSGS related functionality (Andy Lutomirski)

 - Clean up IO access APIs (Andy Shevchenko)

 - Enhance the symbol namespace (Jiri Slaby)

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
  objtool: Handle GCC stack pointer adjustment bug
  x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
  x86/fpu/math-emu: Add ENDPROC to functions
  x86/boot/64: Extract efi_pe_entry() from startup_64()
  x86/boot/32: Extract efi_pe_entry() from startup_32()
  x86/lguest: Remove lguest support
  x86/paravirt/xen: Remove xen_patch()
  objtool: Fix objtool fallthrough detection with function padding
  x86/xen/64: Fix the reported SS and CS in SYSCALL
  objtool: Track DRAP separately from callee-saved registers
  objtool: Fix validate_branch() return codes
  x86: Clarify/fix no-op barriers for text_poke_bp()
  x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
  selftests/x86/fsgsbase: Test selectors 1, 2, and 3
  x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
  x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
  x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
  x86/asm/32: Fix regs_get_register() on segment registers
  x86/xen/64: Rearrange the SYSCALL entries
  x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
  ...
parents f213a6c8 dd88a0a0
ORC unwinder
============
Overview
--------
The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder. The difference is that the
format of the ORC data is much simpler than DWARF, which in turn allows
the ORC unwinder to be much simpler and faster.
The ORC data consists of unwind tables which are generated by objtool.
They contain out-of-band data which is used by the in-kernel ORC
unwinder. Objtool generates the ORC data by first doing compile-time
stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing
all the code paths of a .o file, it determines information about the
stack state at each instruction address in the file and outputs that
information to the .orc_unwind and .orc_unwind_ip sections.
The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time. The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
ORC vs frame pointers
---------------------
With frame pointers enabled, GCC adds instrumentation code to every
function in the kernel. The kernel's .text size increases by about
3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel
Gorman [1] have shown a slowdown of 5-10% for some workloads.
In contrast, the ORC unwinder has no effect on text size or runtime
performance, because the debuginfo is out of band. So if you disable
frame pointers and enable the ORC unwinder, you get a nice performance
improvement across the board, and still have reliable stack traces.
Ingo Molnar says:
"Note that it's not just a performance improvement, but also an
instruction cache locality improvement: 3.2% .text savings almost
directly transform into a similarly sized reduction in cache
footprint. That can transform to even higher speedups for workloads
whose cache locality is borderline."
Another benefit of ORC compared to frame pointers is that it can
reliably unwind across interrupts and exceptions. Frame pointer based
unwinds can sometimes skip the caller of the interrupted function, if it
was a leaf function or if the interrupt hit before the frame pointer was
saved.
The main disadvantage of the ORC unwinder compared to frame pointers is
that it needs more memory to store the ORC unwind tables: roughly 2-4MB
depending on the kernel config.
ORC vs DWARF
------------
ORC debuginfo's advantage over DWARF itself is that it's much simpler.
It gets rid of the complex DWARF CFI state machine and also gets rid of
the tracking of unnecessary registers. This allows the unwinder to be
much simpler, meaning fewer bugs, which is especially important for
mission critical oops code.
The simpler debuginfo format also enables the unwinder to be much faster
than DWARF, which is important for perf and lockdep. In a basic
performance test by Jiri Slaby [2], the ORC unwinder was about 20x
faster than an out-of-tree DWARF unwinder. (Note: That measurement was
taken before some performance tweaks were added, which doubled
performance, so the speedup over DWARF may be closer to 40x.)
The ORC data format does have a few downsides compared to DWARF. ORC
unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
than DWARF-based eh_frame tables.
Another potential downside is that, as GCC evolves, it's conceivable
that the ORC data may end up being *too* simple to describe the state of
the stack for certain optimizations. But IMO this is unlikely because
GCC saves the frame pointer for any unusual stack adjustments it does,
so I suspect we'll really only ever need to keep track of the stack
pointer and the frame pointer between call frames. But even if we do
end up having to track all the registers DWARF tracks, at least we will
still be able to control the format, e.g. no complex state machines.
ORC unwind table generation
---------------------------
The ORC data is generated by objtool. With the existing compile-time
stack metadata validation feature, objtool already follows all code
paths, and so it already has all the information it needs to be able to
generate ORC data from scratch. So it's an easy step to go from stack
validation to ORC data generation.
It should be possible to instead generate the ORC data with a simple
tool which converts DWARF to ORC data. However, such a solution would
be incomplete due to the kernel's extensive use of asm, inline asm, and
special sections like exception tables.
That could be rectified by manually annotating those special code paths
using GNU assembler .cfi annotations in .S files, and homegrown
annotations for inline asm in .c files. But asm annotations were tried
in the past and were found to be unmaintainable. They were often
incorrect/incomplete and made the code harder to read and keep updated.
And based on looking at glibc code, annotating inline asm in .c files
might be even worse.
Objtool still needs a few annotations, but only in code which does
unusual things to the stack like entry code. And even then, far fewer
annotations are needed than what DWARF would need, so they're much more
maintainable than DWARF CFI annotations.
So the advantages of using objtool to generate ORC data are that it
gives more accurate debuginfo, with very few annotations. It also
insulates the kernel from toolchain bugs which can be very painful to
deal with in the kernel since we often have to workaround issues in
older versions of the toolchain for years.
The downside is that the unwinder now becomes dependent on objtool's
ability to reverse engineer GCC code flow. If GCC optimizations become
too complicated for objtool to follow, the ORC data generation might
stop working or become incomplete. (It's worth noting that livepatch
already has such a dependency on objtool's ability to follow GCC code
flow.)
If newer versions of GCC come up with some optimizations which break
objtool, we may need to revisit the current implementation. Some
possible solutions would be asking GCC to make the optimizations more
palatable, or having objtool use DWARF as an additional input, or
creating a GCC plugin to assist objtool with its analysis. But for now,
objtool follows GCC code quite well.
Unwinder implementation details
-------------------------------
Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
tools/objtool/Documentation/stack-validation.txt. After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
respectively.
The ORC data is split into the two arrays for performance reasons, to
make the searchable part of the data (.orc_unwind_ip) more compact. The
arrays are sorted in parallel at boot time.
Performance is further improved by the use of a fast lookup table which
is created at runtime. The fast lookup table associates a given address
with a range of indices for the .orc_unwind table, so that only a small
subset of the table needs to be searched.
Etymology
---------
Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
enemies. Similarly, the ORC unwinder was created in opposition to the
complexity and slowness of DWARF.
"Although Orcs rarely consider multiple solutions to a problem, they do
excel at getting things done because they are creatures of action, not
thought." [3] Similarly, unlike the esoteric DWARF unwinder, the
veracious ORC unwinder wastes no time or siloconic effort decoding
variable-length zero-extended unsigned-integer byte-coded
state-machine-based debug information entries.
Similar to how Orcs frequently unravel the well-intentioned plans of
their adversaries, the ORC unwinder frequently unravels stacks with
brutal, unyielding efficiency.
ORC stands for Oops Rewind Capability.
[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
[3] http://dustin.wikidot.com/half-orcs-and-orcs
......@@ -7660,17 +7660,6 @@ T: git git://linuxtv.org/mkrufky/tuners.git
S: Maintained
F: drivers/media/dvb-frontends/lgdt3305.*
LGUEST
M: Rusty Russell <rusty@rustcorp.com.au>
L: lguest@lists.ozlabs.org
W: http://lguest.ozlabs.org/
S: Odd Fixes
F: arch/x86/include/asm/lguest*.h
F: arch/x86/lguest/
F: drivers/lguest/
F: include/linux/lguest*.h
F: tools/lguest/
LIBATA PATA ARASAN COMPACT FLASH CONTROLLER
M: Viresh Kumar <vireshk@kernel.org>
L: linux-ide@vger.kernel.org
......
#ifndef _ASM_UML_UNWIND_H
#define _ASM_UML_UNWIND_H
static inline void
unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size) {}
#endif /* _ASM_UML_UNWIND_H */
......@@ -10,9 +10,6 @@ obj-$(CONFIG_XEN) += xen/
# Hyper-V paravirtualization support
obj-$(CONFIG_HYPERVISOR_GUEST) += hyperv/
# lguest paravirtualization support
obj-$(CONFIG_LGUEST_GUEST) += lguest/
obj-y += realmode/
obj-y += kernel/
obj-y += mm/
......
......@@ -73,7 +73,6 @@ config X86
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_THP_SWAP if X86_64
select BUILDTIME_EXTABLE_SORT
......@@ -158,6 +157,7 @@ config X86
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
......@@ -168,7 +168,7 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER_UNWINDER && STACK_VALIDATION
select HAVE_STACK_VALIDATION if X86_64
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
......@@ -778,8 +778,6 @@ config KVM_DEBUG_FS
Statistics are displayed in debugfs filesystem. Enabling this option
may incur significant overhead.
source "arch/x86/lguest/Kconfig"
config PARAVIRT_TIME_ACCOUNTING
bool "Paravirtual steal time accounting"
depends on PARAVIRT
......
......@@ -305,8 +305,6 @@ config DEBUG_ENTRY
Some of these sanity checks may slow down kernel entries and
exits or otherwise impact performance.
This is currently used to help test NMI code.
If unsure, say N.
config DEBUG_NMI_SELFTEST
......@@ -358,4 +356,61 @@ config PUNIT_ATOM_DEBUG
The current power state can be read from
/sys/kernel/debug/punit_atom/dev_power_state
choice
prompt "Choose kernel unwinder"
default FRAME_POINTER_UNWINDER
---help---
This determines which method will be used for unwinding kernel stack
traces for panics, oopses, bugs, warnings, perf, /proc/<pid>/stack,
livepatch, lockdep, and more.
config FRAME_POINTER_UNWINDER
bool "Frame pointer unwinder"
select FRAME_POINTER
---help---
This option enables the frame pointer unwinder for unwinding kernel
stack traces.
The unwinder itself is fast and it uses less RAM than the ORC
unwinder, but the kernel text size will grow by ~3% and the kernel's
overall performance will degrade by roughly 5-10%.
This option is recommended if you want to use the livepatch
consistency model, as this is currently the only way to get a
reliable stack trace (CONFIG_HAVE_RELIABLE_STACKTRACE).
config ORC_UNWINDER
bool "ORC unwinder"
depends on X86_64
select STACK_VALIDATION
---help---
This option enables the ORC (Oops Rewind Capability) unwinder for
unwinding kernel stack traces. It uses a custom data format which is
a simplified version of the DWARF Call Frame Information standard.
This unwinder is more accurate across interrupt entry frames than the
frame pointer unwinder. It also enables a 5-10% performance
improvement across the entire kernel compared to frame pointers.
Enabling this option will increase the kernel's runtime memory usage
by roughly 2-4MB, depending on your kernel config.
config GUESS_UNWINDER
bool "Guess unwinder"
depends on EXPERT
---help---
This option enables the "guess" unwinder for unwinding kernel stack
traces. It scans the stack and reports every kernel text address it
finds. Some of the addresses it reports may be incorrect.
While this option often produces false positives, it can still be
useful in many cases. Unlike the other unwinders, it has no runtime
overhead.
endchoice
config FRAME_POINTER
depends on !ORC_UNWINDER && !GUESS_UNWINDER
bool
endmenu
......@@ -61,71 +61,6 @@
__HEAD
ENTRY(startup_32)
#ifdef CONFIG_EFI_STUB
jmp preferred_addr
/*
* We don't need the return address, so set up the stack so
* efi_main() can find its arguments.
*/
ENTRY(efi_pe_entry)
add $0x4, %esp
call 1f
1: popl %esi
subl $1b, %esi
popl %ecx
movl %ecx, efi32_config(%esi) /* Handle */
popl %ecx
movl %ecx, efi32_config+8(%esi) /* EFI System table pointer */
/* Relocate efi_config->call() */
leal efi32_config(%esi), %eax
add %esi, 40(%eax)
pushl %eax
call make_boot_params
cmpl $0, %eax
je fail
movl %esi, BP_code32_start(%eax)
popl %ecx
pushl %eax
pushl %ecx
jmp 2f /* Skip efi_config initialization */
ENTRY(efi32_stub_entry)
add $0x4, %esp
popl %ecx
popl %edx
call 1f
1: popl %esi
subl $1b, %esi
movl %ecx, efi32_config(%esi) /* Handle */
movl %edx, efi32_config+8(%esi) /* EFI System table pointer */
/* Relocate efi_config->call() */
leal efi32_config(%esi), %eax
add %esi, 40(%eax)
pushl %eax
2:
call efi_main
cmpl $0, %eax
movl %eax, %esi
jne 2f
fail:
/* EFI init failed, so hang. */
hlt
jmp fail
2:
movl BP_code32_start(%esi), %eax
leal preferred_addr(%eax), %eax
jmp *%eax
preferred_addr:
#endif
cld
/*
* Test KEEP_SEGMENTS flag to see if the bootloader is asking
......@@ -208,6 +143,70 @@ preferred_addr:
jmp *%eax
ENDPROC(startup_32)
#ifdef CONFIG_EFI_STUB
/*
* We don't need the return address, so set up the stack so efi_main() can find
* its arguments.
*/
ENTRY(efi_pe_entry)
add $0x4, %esp
call 1f
1: popl %esi
subl $1b, %esi
popl %ecx
movl %ecx, efi32_config(%esi) /* Handle */
popl %ecx
movl %ecx, efi32_config+8(%esi) /* EFI System table pointer */
/* Relocate efi_config->call() */
leal efi32_config(%esi), %eax
add %esi, 40(%eax)
pushl %eax
call make_boot_params
cmpl $0, %eax
je fail
movl %esi, BP_code32_start(%eax)
popl %ecx
pushl %eax
pushl %ecx
jmp 2f /* Skip efi_config initialization */
ENDPROC(efi_pe_entry)
ENTRY(efi32_stub_entry)
add $0x4, %esp
popl %ecx
popl %edx
call 1f
1: popl %esi
subl $1b, %esi
movl %ecx, efi32_config(%esi) /* Handle */
movl %edx, efi32_config+8(%esi) /* EFI System table pointer */
/* Relocate efi_config->call() */
leal efi32_config(%esi), %eax
add %esi, 40(%eax)
pushl %eax
2:
call efi_main
cmpl $0, %eax
movl %eax, %esi
jne 2f
fail:
/* EFI init failed, so hang. */
hlt
jmp fail
2:
movl BP_code32_start(%esi), %eax
leal startup_32(%eax), %eax
jmp *%eax
ENDPROC(efi32_stub_entry)
#endif
.text
relocated:
......
......@@ -243,65 +243,6 @@ ENTRY(startup_64)
* that maps our entire kernel(text+data+bss+brk), zero page
* and command line.
*/
#ifdef CONFIG_EFI_STUB
/*
* The entry point for the PE/COFF executable is efi_pe_entry, so
* only legacy boot loaders will execute this jmp.
*/
jmp preferred_addr
ENTRY(efi_pe_entry)
movq %rcx, efi64_config(%rip) /* Handle */
movq %rdx, efi64_config+8(%rip) /* EFI System table pointer */
leaq efi64_config(%rip), %rax
movq %rax, efi_config(%rip)
call 1f
1: popq %rbp
subq $1b, %rbp
/*
* Relocate efi_config->call().
*/
addq %rbp, efi64_config+40(%rip)
movq %rax, %rdi
call make_boot_params
cmpq $0,%rax
je fail
mov %rax, %rsi
leaq startup_32(%rip), %rax
movl %eax, BP_code32_start(%rsi)
jmp 2f /* Skip the relocation */
handover_entry:
call 1f
1: popq %rbp
subq $1b, %rbp
/*
* Relocate efi_config->call().
*/
movq efi_config(%rip), %rax
addq %rbp, 40(%rax)
2:
movq efi_config(%rip), %rdi
call efi_main
movq %rax,%rsi
cmpq $0,%rax
jne 2f
fail:
/* EFI init failed, so hang. */
hlt
jmp fail
2:
movl BP_code32_start(%esi), %eax
leaq preferred_addr(%rax), %rax
jmp *%rax
preferred_addr:
#endif
/* Setup data segments. */
xorl %eax, %eax
......@@ -413,6 +354,59 @@ lvl5:
jmp *%rax
#ifdef CONFIG_EFI_STUB
/* The entry point for the PE/COFF executable is efi_pe_entry. */
ENTRY(efi_pe_entry)
movq %rcx, efi64_config(%rip) /* Handle */
movq %rdx, efi64_config+8(%rip) /* EFI System table pointer */
leaq efi64_config(%rip), %rax
movq %rax, efi_config(%rip)
call 1f
1: popq %rbp
subq $1b, %rbp
/*
* Relocate efi_config->call().
*/
addq %rbp, efi64_config+40(%rip)
movq %rax, %rdi
call make_boot_params
cmpq $0,%rax
je fail
mov %rax, %rsi
leaq startup_32(%rip), %rax
movl %eax, BP_code32_start(%rsi)
jmp 2f /* Skip the relocation */
handover_entry:
call 1f
1: popq %rbp
subq $1b, %rbp
/*
* Relocate efi_config->call().
*/
movq efi_config(%rip), %rax
addq %rbp, 40(%rax)
2:
movq efi_config(%rip), %rdi
call efi_main
movq %rax,%rsi
cmpq $0,%rax
jne 2f
fail:
/* EFI init failed, so hang. */
hlt
jmp fail
2:
movl BP_code32_start(%esi), %eax
leaq startup_64(%rax), %rax
jmp *%rax
ENDPROC(efi_pe_entry)
.org 0x390
ENTRY(efi64_stub_entry)
movq %rdi, efi64_config(%rip) /* Handle */
......
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
CONFIG_GUESS_UNWINDER=y
# CONFIG_FRAME_POINTER_UNWINDER is not set
......@@ -2,7 +2,6 @@
# Makefile for the x86 low level entry code
#
OBJECT_FILES_NON_STANDARD_entry_$(BITS).o := y
OBJECT_FILES_NON_STANDARD_entry_64_compat.o := y
CFLAGS_syscall_64.o += $(call cc-option,-Wno-override-init,)
......
#include <linux/jump_label.h>
#include <asm/unwind_hints.h>
/*
......@@ -112,6 +113,7 @@ For 32-bit we have the following conventions - kernel is built with
movq %rdx, 12*8+\offset(%rsp)
movq %rsi, 13*8+\offset(%rsp)
movq %rdi, 14*8+\offset(%rsp)
UNWIND_HINT_REGS offset=\offset extra=0
.endm
.macro SAVE_C_REGS offset=0
SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
......@@ -136,6 +138,7 @@ For 32-bit we have the following conventions - kernel is built with
movq %r12, 3*8+\offset(%rsp)
movq %rbp, 4*8+\offset(%rsp)
movq %rbx, 5*8+\offset(%rsp)
UNWIND_HINT_REGS offset=\offset
.endm
.macro RESTORE_EXTRA_REGS offset=0
......@@ -145,6 +148,7 @@ For 32-bit we have the following conventions - kernel is built with
movq 3*8+\offset(%rsp), %r12
movq 4*8+\offset(%rsp), %rbp
movq 5*8+\offset(%rsp), %rbx
UNWIND_HINT_REGS offset=\offset extra=0
.endm
.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
......@@ -167,6 +171,7 @@ For 32-bit we have the following conventions - kernel is built with
.endif
movq 13*8(%rsp), %rsi
movq 14*8(%rsp), %rdi
UNWIND_HINT_IRET_REGS offset=16*8
.endm
.macro RESTORE_C_REGS
RESTORE_C_REGS_HELPER 1,1,1,1,1
......
This diff is collapsed.
......@@ -183,21 +183,20 @@ ENDPROC(entry_SYSENTER_compat)
*/
ENTRY(entry_SYSCALL_compat)
/* Interrupts are off on entry. */
SWAPGS_UNSAFE_STACK
swapgs
/* Stash user ESP and switch to the kernel stack. */
movl %esp, %r8d
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
/* Zero-extending 32-bit regs, do not remove */
movl %eax, %eax
/* Construct struct pt_regs on stack */
pushq $__USER32_DS /* pt_regs->ss */
pushq %r8 /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER32_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
GLOBAL(entry_SYSCALL_compat_after_hwframe)
movl %eax, %eax /* discard orig_ax high bits */
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */
pushq %rsi /* pt_regs->si */
......@@ -342,8 +341,7 @@ ENTRY(entry_INT80_compat)
jmp restore_regs_and_iret
END(entry_INT80_compat)
ALIGN
GLOBAL(stub32_clone)
ENTRY(stub32_clone)
/*
* The 32-bit clone ABI is: clone(..., int tls_val, int *child_tidptr).
* The 64-bit clone ABI is: clone(..., int *child_tidptr, int tls_val).
......@@ -353,3 +351,4 @@ GLOBAL(stub32_clone)
*/
xchg %r8, %rcx
jmp sys_clone
ENDPROC(stub32_clone)
......@@ -226,7 +226,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
if (ksig->ka.sa.sa_flags & SA_ONSTACK)
sp = sigsp(sp, ksig);
/* This is the legacy signal stack switching. */
else if ((regs->ss & 0xffff) != __USER32_DS &&
else if (regs->ss != __USER32_DS &&
!(ksig->ka.sa.sa_flags & SA_RESTORER) &&
ksig->ka.sa.sa_restorer)
sp = (unsigned long) ksig->ka.sa.sa_restorer;
......
......@@ -126,15 +126,15 @@ do { \
pr_reg[4] = regs->di; \
pr_reg[5] = regs->bp; \
pr_reg[6] = regs->ax; \
pr_reg[7] = regs->ds & 0xffff; \
pr_reg[8] = regs->es & 0xffff; \
pr_reg[9] = regs->fs & 0xffff; \
pr_reg[7] = regs->ds; \
pr_reg[8] = regs->es; \
pr_reg[9] = regs->fs; \
pr_reg[11] = regs->orig_ax; \
pr_reg[12] = regs->ip; \
pr_reg[13] = regs->cs & 0xffff; \
pr_reg[13] = regs->cs; \
pr_reg[14] = regs->flags; \
pr_reg[15] = regs->sp; \
pr_reg[16] = regs->ss & 0xffff; \
pr_reg[16] = regs->ss; \
} while (0);
#define ELF_CORE_COPY_REGS(pr_reg, regs) \
......@@ -204,6 +204,7 @@ void set_personality_ia32(bool);
#define ELF_CORE_COPY_REGS(pr_reg, regs) \
do { \
unsigned long base; \
unsigned v; \
(pr_reg)[0] = (regs)->r15; \
(pr_reg)[1] = (regs)->r14; \
......@@ -226,8 +227,8 @@ do { \
(pr_reg)[18] = (regs)->flags; \
(pr_reg)[19] = (regs)->sp; \
(pr_reg)[20] = (regs)->ss; \
(pr_reg)[21] = current->thread.fsbase; \
(pr_reg)[22] = current->thread.gsbase; \
rdmsrl(MSR_FS_BASE, base); (pr_reg)[21] = base; \
rdmsrl(MSR_KERNEL_GS_BASE, base); (pr_reg)[22] = base; \
asm("movl %%ds,%0" : "=r" (v)); (pr_reg)[23] = v; \
asm("movl %%es,%0" : "=r" (v)); (pr_reg)[24] = v; \
asm("movl %%fs,%0" : "=r" (v)); (pr_reg)[25] = v; \
......
......@@ -69,6 +69,9 @@ build_mmio_write(__writeb, "b", unsigned char, "q", )
build_mmio_write(__writew, "w", unsigned short, "r", )
build_mmio_write(__writel, "l", unsigned int, "r", )
#define readb readb
#define readw readw
#define readl readl
#define readb_relaxed(a) __readb(a)
#define readw_relaxed(a) __readw(a)
#define readl_relaxed(a) __readl(a)
......@@ -76,6 +79,9 @@ build_mmio_write(__writel, "l", unsigned int, "r", )
#define __raw_readw __readw
#define __raw_readl __readl
#define writeb writeb
#define writew writew
#define writel writel
#define writeb_relaxed(v, a) __writeb(v, a)
#define writew_relaxed(v, a) __writew(v, a)
#define writel_relaxed(v, a) __writel(v, a)
......@@ -88,13 +94,15 @@ build_mmio_write(__writel, "l", unsigned int, "r", )
#ifdef CONFIG_X86_64
build_mmio_read(readq, "q", unsigned long, "=r", :"memory")
build_mmio_read(__readq, "q", unsigned long, "=r", )
build_mmio_write(writeq, "q", unsigned long, "r", :"memory")
build_mmio_write(__writeq, "q", unsigned long, "r", )
#define readq_relaxed(a) readq(a)
#define writeq_relaxed(v, a) writeq(v, a)
#define readq_relaxed(a) __readq(a)
#define writeq_relaxed(v, a) __writeq(v, a)
#define __raw_readq(a) readq(a)
#define __raw_writeq(val, addr) writeq(val, addr)
#define __raw_readq __readq
#define __raw_writeq __writeq
/* Let people know that we have them */
#define readq readq
......@@ -119,6 +127,7 @@ static inline phys_addr_t virt_to_phys(volatile void *address)
{
return __pa(address);
}
#define virt_to_phys virt_to_phys
/**
* phys_to_virt - map physical address to virtual
......@@ -137,6 +146,7 @@ static inline void *phys_to_virt(phys_addr_t address)
{
return __va(address);
}
#define phys_to_virt phys_to_virt
/*
* Change "struct page" to physical address.
......@@ -169,11 +179,14 @@ static inline unsigned int isa_virt_to_bus(volatile void *address)
* else, you probably want one of the following.
*/
extern void __iomem *ioremap_nocache(resource_size_t offset, unsigned long size);
#define ioremap_nocache ioremap_nocache
extern void __iomem *ioremap_uc(resource_size_t offset, unsigned long size);
#define ioremap_uc ioremap_uc
extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size);
#define ioremap_cache ioremap_cache
extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val);
#define ioremap_prot ioremap_prot
/**
* ioremap - map bus memory into CPU space
......@@ -193,8 +206,10 @@ static inline void __iomem *ioremap(resource_size_t offset, unsigned long size)
{
return ioremap_nocache(offset, size);
}
#define ioremap ioremap
extern void iounmap(volatile void __iomem *addr);
#define iounmap iounmap
extern void set_iounmap_nonlazy(void);
......@@ -202,53 +217,6 @@ extern void set_iounmap_nonlazy(void);
#include <asm-generic/iomap.h>
/*
* Convert a virtual cached pointer to an uncached pointer
*/
#define xlate_dev_kmem_ptr(p) p
/**
* memset_io Set a range of I/O memory to a constant value
* @addr: The beginning of the I/O-memory range to set
* @val: The value to set the memory to
* @count: The number of bytes to set
*
* Set a range of I/O memory to a given value.
*/
static inline void
memset_io(volatile void __iomem *addr, unsigned char val, size_t count)
{
memset((void __force *)addr, val, count);
}
/**
* memcpy_fromio Copy a block of data from I/O memory
* @dst: The (RAM) destination for the copy
* @src: The (I/O memory) source for the data
* @count: The number of bytes to copy
*
* Copy a block of data from I/O memory.
*/
static inline void
memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
{
memcpy(dst, (const void __force *)src, count);
}
/**
* memcpy_toio Copy a block of data into I/O memory
* @dst: The (I/O memory) destination for the copy
* @src: The (RAM) source for the data
* @count: The number of bytes to copy
*
* Copy a block of data to I/O memory.
*/
static inline void
memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
{
memcpy((void __force *)dst, src, count);
}
/*
* ISA space is 'always mapped' on a typical x86 system, no need to
* explicitly ioremap() it. The fact that the ISA IO space is mapped
......@@ -341,13 +309,38 @@ BUILDIO(b, b, char)
BUILDIO(w, w, short)
BUILDIO(l, , int)
#define inb inb
#define inw inw
#define inl inl
#define inb_p inb_p
#define inw_p inw_p
#define inl_p inl_p
#define insb insb
#define insw insw
#define insl insl
#define outb outb
#define outw outw
#define outl outl
#define outb_p outb_p
#define outw_p outw_p
#define outl_p outl_p
#define outsb outsb
#define outsw outsw
#define outsl outsl
extern void *xlate_dev_mem_ptr(phys_addr_t phys);
extern void unxlate_dev_mem_ptr(phys_addr_t phys, void *addr);
#define xlate_dev_mem_ptr xlate_dev_mem_ptr
#define unxlate_dev_mem_ptr unxlate_dev_mem_ptr
extern int ioremap_change_attr(unsigned long vaddr, unsigned long size,
enum page_cache_mode pcm);
extern void __iomem *ioremap_wc(resource_size_t offset, unsigned long size);
#define ioremap_wc ioremap_wc
extern void __iomem *ioremap_wt(resource_size_t offset, unsigned long size);
#define ioremap_wt ioremap_wt
extern bool is_early_ioremap_ptep(pte_t *ptep);
......@@ -365,6 +358,9 @@ extern bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
#define IO_SPACE_LIMIT 0xffff
#include <asm-generic/io.h>
#undef PCI_IOBASE
#ifdef CONFIG_MTRR
extern int __must_check arch_phys_wc_index(int handle);
#define arch_phys_wc_index arch_phys_wc_index
......
#ifndef _ASM_X86_LGUEST_H
#define _ASM_X86_LGUEST_H
#define GDT_ENTRY_LGUEST_CS 10
#define GDT_ENTRY_LGUEST_DS 11
#define LGUEST_CS (GDT_ENTRY_LGUEST_CS * 8)
#define LGUEST_DS (GDT_ENTRY_LGUEST_DS * 8)
#ifndef __ASSEMBLY__
#include <asm/desc.h>
#define GUEST_PL 1
/* Page for Switcher text itself, then two pages per cpu */
#define SWITCHER_TEXT_PAGES (1)
#define SWITCHER_STACK_PAGES (2 * nr_cpu_ids)
#define TOTAL_SWITCHER_PAGES (SWITCHER_TEXT_PAGES + SWITCHER_STACK_PAGES)
/* Where we map the Switcher, in both Host and Guest. */
extern unsigned long switcher_addr;
/* Found in switcher.S */
extern unsigned long default_idt_entries[];
/* Declarations for definitions in arch/x86/lguest/head_32.S */
extern char lguest_noirq_iret[];
extern const char lgstart_cli[], lgend_cli[];
extern const char lgstart_pushf[], lgend_pushf[];
extern void lguest_iret(void);
extern void lguest_init(void);
struct lguest_regs {
/* Manually saved part. */
unsigned long eax, ebx, ecx, edx;
unsigned long esi, edi, ebp;
unsigned long gs;
unsigned long fs, ds, es;
unsigned long trapnum, errcode;
/* Trap pushed part */
unsigned long eip;
unsigned long cs;
unsigned long eflags;
unsigned long esp;
unsigned long ss;
};
/* This is a guest-specific page (mapped ro) into the guest. */
struct lguest_ro_state {
/* Host information we need to restore when we switch back. */
u32 host_cr3;
struct desc_ptr host_idt_desc;
struct desc_ptr host_gdt_desc;
u32 host_sp;
/* Fields which are used when guest is running. */
struct desc_ptr guest_idt_desc;
struct desc_ptr guest_gdt_desc;
struct x86_hw_tss guest_tss;
struct desc_struct guest_idt[IDT_ENTRIES];
struct desc_struct guest_gdt[GDT_ENTRIES];
};
struct lg_cpu_arch {
/* The GDT entries copied into lguest_ro_state when running. */
struct desc_struct gdt[GDT_ENTRIES];
/* The IDT entries: some copied into lguest_ro_state when running. */
struct desc_struct idt[IDT_ENTRIES];
/* The address of the last guest-visible pagefault (ie. cr2). */
unsigned long last_pagefault;
};
static inline void lguest_set_ts(void)
{
u32 cr0;
cr0 = read_cr0();
if (!(cr0 & 8))
write_cr0(cr0 | 8);
}
/* Full 4G segment descriptors, suitable for CS and DS. */
#define FULL_EXEC_SEGMENT \
((struct desc_struct)GDT_ENTRY_INIT(0xc09b, 0, 0xfffff))
#define FULL_SEGMENT ((struct desc_struct)GDT_ENTRY_INIT(0xc093, 0, 0xfffff))
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_LGUEST_H */
/* Architecture specific portion of the lguest hypercalls */
#ifndef _ASM_X86_LGUEST_HCALL_H
#define _ASM_X86_LGUEST_HCALL_H
#define LHCALL_FLUSH_ASYNC 0
#define LHCALL_LGUEST_INIT 1
#define LHCALL_SHUTDOWN 2
#define LHCALL_NEW_PGTABLE 4
#define LHCALL_FLUSH_TLB 5
#define LHCALL_LOAD_IDT_ENTRY 6
#define LHCALL_SET_STACK 7
#define LHCALL_SET_CLOCKEVENT 9
#define LHCALL_HALT 10
#define LHCALL_SET_PMD 13
#define LHCALL_SET_PTE 14
#define LHCALL_SET_PGD 15
#define LHCALL_LOAD_TLS 16
#define LHCALL_LOAD_GDT_ENTRY 18
#define LHCALL_SEND_INTERRUPTS 19
#define LGUEST_TRAP_ENTRY 0x1F
/* Argument number 3 to LHCALL_LGUEST_SHUTDOWN */
#define LGUEST_SHUTDOWN_POWEROFF 1
#define LGUEST_SHUTDOWN_RESTART 2
#ifndef __ASSEMBLY__
#include <asm/hw_irq.h>
/*G:030
* But first, how does our Guest contact the Host to ask for privileged
* operations? There are two ways: the direct way is to make a "hypercall",
* to make requests of the Host Itself.
*
* Our hypercall mechanism uses the highest unused trap code (traps 32 and
* above are used by real hardware interrupts). Seventeen hypercalls are
* available: the hypercall number is put in the %eax register, and the
* arguments (when required) are placed in %ebx, %ecx, %edx and %esi.
* If a return value makes sense, it's returned in %eax.
*
* Grossly invalid calls result in Sudden Death at the hands of the vengeful
* Host, rather than returning failure. This reflects Winston Churchill's
* definition of a gentleman: "someone who is only rude intentionally".
*/
static inline unsigned long
hcall(unsigned long call,
unsigned long arg1, unsigned long arg2, unsigned long arg3,
unsigned long arg4)
{
/* "int" is the Intel instruction to trigger a trap. */
asm volatile("int $" __stringify(LGUEST_TRAP_ENTRY)
/* The call in %eax (aka "a") might be overwritten */
: "=a"(call)
/* The arguments are in %eax, %ebx, %ecx, %edx & %esi */
: "a"(call), "b"(arg1), "c"(arg2), "d"(arg3), "S"(arg4)
/* "memory" means this might write somewhere in memory.
* This isn't true for all calls, but it's safe to tell
* gcc that it might happen so it doesn't get clever. */
: "memory");
return call;
}
/*:*/
/* Can't use our min() macro here: needs to be a constant */
#define LGUEST_IRQS (NR_IRQS < 32 ? NR_IRQS: 32)
#define LHCALL_RING_SIZE 64
struct hcall_args {
/* These map directly onto eax/ebx/ecx/edx/esi in struct lguest_regs */
unsigned long arg0, arg1, arg2, arg3, arg4;
};
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_LGUEST_HCALL_H */
......@@ -2,6 +2,15 @@
#define _ASM_X86_MODULE_H
#include <asm-generic/module.h>
#include <asm/orc_types.h>
struct mod_arch_specific {
#ifdef CONFIG_ORC_UNWINDER
unsigned int num_orcs;
int *orc_unwind_ip;
struct orc_entry *orc_unwind;
#endif
};
#ifdef CONFIG_X86_64
/* X86_64 does not define MODULE_PROC_FAMILY */
......
/*
* Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, see <http://www.gnu.org/licenses/>.
*/
#ifndef _ORC_LOOKUP_H
#define _ORC_LOOKUP_H
/*
* This is a lookup table for speeding up access to the .orc_unwind table.
* Given an input address offset, the corresponding lookup table entry
* specifies a subset of the .orc_unwind table to search.
*
* Each block represents the end of the previous range and the start of the
* next range. An extra block is added to give the last range an end.
*
* The block size should be a power of 2 to avoid a costly 'div' instruction.
*
* A block size of 256 was chosen because it roughly doubles unwinder
* performance while only adding ~5% to the ORC data footprint.
*/
#define LOOKUP_BLOCK_ORDER 8
#define LOOKUP_BLOCK_SIZE (1 << LOOKUP_BLOCK_ORDER)
#ifndef LINKER_SCRIPT
extern unsigned int orc_lookup[];
extern unsigned int orc_lookup_end[];
#define LOOKUP_START_IP (unsigned long)_stext
#define LOOKUP_STOP_IP (unsigned long)_etext
#endif /* LINKER_SCRIPT */
#endif /* _ORC_LOOKUP_H */
/*
* Copyright (C) 2017 Josh Poimboeuf <jpoimboe@redhat.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, see <http://www.gnu.org/licenses/>.
*/
#ifndef _ORC_TYPES_H
#define _ORC_TYPES_H
#include <linux/types.h>
#include <linux/compiler.h>
/*
* The ORC_REG_* registers are base registers which are used to find other
* registers on the stack.
*
* ORC_REG_PREV_SP, also known as DWARF Call Frame Address (CFA), is the
* address of the previous frame: the caller's SP before it called the current
* function.
*
* ORC_REG_UNDEFINED means the corresponding register's value didn't change in
* the current frame.
*
* The most commonly used base registers are SP and BP -- which the previous SP
* is usually based on -- and PREV_SP and UNDEFINED -- which the previous BP is
* usually based on.
*
* The rest of the base registers are needed for special cases like entry code
* and GCC realigned stacks.
*/
#define ORC_REG_UNDEFINED 0
#define ORC_REG_PREV_SP 1
#define ORC_REG_DX 2
#define ORC_REG_DI 3
#define ORC_REG_BP 4
#define ORC_REG_SP 5
#define ORC_REG_R10 6
#define ORC_REG_R13 7
#define ORC_REG_BP_INDIRECT 8
#define ORC_REG_SP_INDIRECT 9
#define ORC_REG_MAX 15
/*
* ORC_TYPE_CALL: Indicates that sp_reg+sp_offset resolves to PREV_SP (the
* caller's SP right before it made the call). Used for all callable
* functions, i.e. all C code and all callable asm functions.
*
* ORC_TYPE_REGS: Used in entry code to indicate that sp_reg+sp_offset points
* to a fully populated pt_regs from a syscall, interrupt, or exception.
*
* ORC_TYPE_REGS_IRET: Used in entry code to indicate that sp_reg+sp_offset
* points to the iret return frame.
*
* The UNWIND_HINT macros are used only for the unwind_hint struct. They
* aren't used in struct orc_entry due to size and complexity constraints.
* Objtool converts them to real types when it converts the hints to orc
* entries.
*/
#define ORC_TYPE_CALL 0
#define ORC_TYPE_REGS 1
#define ORC_TYPE_REGS_IRET 2
#define UNWIND_HINT_TYPE_SAVE 3
#define UNWIND_HINT_TYPE_RESTORE 4
#ifndef __ASSEMBLY__
/*
* This struct is more or less a vastly simplified version of the DWARF Call
* Frame Information standard. It contains only the necessary parts of DWARF
* CFI, simplified for ease of access by the in-kernel unwinder. It tells the
* unwinder how to find the previous SP and BP (and sometimes entry regs) on
* the stack for a given code address. Each instance of the struct corresponds
* to one or more code locations.
*/
struct orc_entry {
s16 sp_offset;
s16 bp_offset;
unsigned sp_reg:4;
unsigned bp_reg:4;
unsigned type:2;
} __packed;
/*
* This struct is used by asm and inline asm code to manually annotate the
* location of registers on the stack for the ORC unwinder.
*
* Type can be either ORC_TYPE_* or UNWIND_HINT_TYPE_*.
*/
struct unwind_hint {
u32 ip;
s16 sp_offset;
u8 sp_reg;
u8 type;
};
#endif /* __ASSEMBLY__ */
#endif /* _ORC_TYPES_H */
......@@ -22,6 +22,7 @@ struct vm86;
#include <asm/nops.h>
#include <asm/special_insns.h>
#include <asm/fpu/types.h>
#include <asm/unwind_hints.h>
#include <linux/personality.h>
#include <linux/cache.h>
......@@ -661,7 +662,7 @@ static inline void sync_core(void)
* In case NMI unmasking or performance ever becomes a problem,
* the next best option appears to be MOV-to-CR2 and an
* unconditional jump. That sequence also works on all CPUs,
* but it will fault at CPL3 (i.e. Xen PV and lguest).
* but it will fault at CPL3 (i.e. Xen PV).
*
* CPUID is the conventional way, but it's nasty: it doesn't
* exist on some 486-like CPUs, and it usually exits to a
......@@ -684,6 +685,7 @@ static inline void sync_core(void)
unsigned int tmp;
asm volatile (
UNWIND_HINT_SAVE
"mov %%ss, %0\n\t"
"pushq %q0\n\t"
"pushq %%rsp\n\t"
......@@ -693,6 +695,7 @@ static inline void sync_core(void)
"pushq %q0\n\t"
"pushq $1f\n\t"
"iretq\n\t"
UNWIND_HINT_RESTORE
"1:"
: "=&r" (tmp), "+r" (__sp) : : "cc", "memory");
#endif
......
......@@ -9,6 +9,20 @@
#ifdef __i386__
struct pt_regs {
/*
* NB: 32-bit x86 CPUs are inconsistent as what happens in the
* following cases (where %seg represents a segment register):
*
* - pushl %seg: some do a 16-bit write and leave the high
* bits alone
* - movl %seg, [mem]: some do a 16-bit write despite the movl
* - IDT entry: some (e.g. 486) will leave the high bits of CS
* and (if applicable) SS undefined.
*
* Fortunately, x86-32 doesn't read the high bits on POP or IRET,
* so we can just treat all of the segment registers as 16-bit
* values.
*/
unsigned long bx;
unsigned long cx;
unsigned long dx;
......@@ -16,16 +30,22 @@ struct pt_regs {
unsigned long di;
unsigned long bp;
unsigned long ax;
unsigned long ds;
unsigned long es;
unsigned long fs;
unsigned long gs;
unsigned short ds;
unsigned short __dsh;
unsigned short es;
unsigned short __esh;
unsigned short fs;
unsigned short __fsh;
unsigned short gs;
unsigned short __gsh;
unsigned long orig_ax;
unsigned long ip;
unsigned long cs;
unsigned short cs;
unsigned short __csh;
unsigned long flags;
unsigned long sp;
unsigned long ss;
unsigned short ss;
unsigned short __ssh;
};
#else /* __i386__ */
......@@ -176,6 +196,17 @@ static inline unsigned long regs_get_register(struct pt_regs *regs,
if (offset == offsetof(struct pt_regs, sp) &&
regs->cs == __KERNEL_CS)
return kernel_stack_pointer(regs);
/* The selector fields are 16-bit. */
if (offset == offsetof(struct pt_regs, cs) ||
offset == offsetof(struct pt_regs, ss) ||
offset == offsetof(struct pt_regs, ds) ||
offset == offsetof(struct pt_regs, es) ||
offset == offsetof(struct pt_regs, fs) ||
offset == offsetof(struct pt_regs, gs)) {
return *(u16 *)((unsigned long)regs + offset);
}
#endif
return *(unsigned long *)((unsigned long)regs + offset);
}
......
#ifndef _ASM_X86_RMWcc
#define _ASM_X86_RMWcc
#define __CLOBBERS_MEM "memory"
#define __CLOBBERS_MEM_CC_CX "memory", "cc", "cx"
#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO)
/* Use asm goto */
#define __GEN_RMWcc(fullop, var, cc, ...) \
#define __GEN_RMWcc(fullop, var, cc, clobbers, ...) \
do { \
asm_volatile_goto (fullop "; j" #cc " %l[cc_label]" \
: : "m" (var), ## __VA_ARGS__ \
: "memory" : cc_label); \
: : [counter] "m" (var), ## __VA_ARGS__ \
: clobbers : cc_label); \
return 0; \
cc_label: \
return 1; \
} while (0)
#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
__GEN_RMWcc(op " " arg0, var, cc)
#define __BINARY_RMWcc_ARG " %1, "
#define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc) \
__GEN_RMWcc(op " %1, " arg0, var, cc, vcon (val))
#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
/* Use flags output or a set instruction */
#define __GEN_RMWcc(fullop, var, cc, ...) \
#define __GEN_RMWcc(fullop, var, cc, clobbers, ...) \
do { \
bool c; \
asm volatile (fullop ";" CC_SET(cc) \
: "+m" (var), CC_OUT(cc) (c) \
: __VA_ARGS__ : "memory"); \
: [counter] "+m" (var), CC_OUT(cc) (c) \
: __VA_ARGS__ : clobbers); \
return c; \
} while (0)
#define __BINARY_RMWcc_ARG " %2, "
#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
__GEN_RMWcc(op " " arg0, var, cc)
__GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM)
#define GEN_UNARY_SUFFIXED_RMWcc(op, suffix, var, arg0, cc) \
__GEN_RMWcc(op " " arg0 "\n\t" suffix, var, cc, \
__CLOBBERS_MEM_CC_CX)
#define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc) \
__GEN_RMWcc(op " %2, " arg0, var, cc, vcon (val))
__GEN_RMWcc(op __BINARY_RMWcc_ARG arg0, var, cc, \
__CLOBBERS_MEM, vcon (val))
#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
#define GEN_BINARY_SUFFIXED_RMWcc(op, suffix, var, vcon, val, arg0, cc) \
__GEN_RMWcc(op __BINARY_RMWcc_ARG arg0 "\n\t" suffix, var, cc, \
__CLOBBERS_MEM_CC_CX, vcon (val))
#endif /* _ASM_X86_RMWcc */
......@@ -12,11 +12,14 @@ struct unwind_state {
struct task_struct *task;
int graph_idx;
bool error;
#ifdef CONFIG_FRAME_POINTER
#if defined(CONFIG_ORC_UNWINDER)
bool signal, full_regs;
unsigned long sp, bp, ip;
struct pt_regs *regs;
#elif defined(CONFIG_FRAME_POINTER_UNWINDER)
bool got_irq;
unsigned long *bp, *orig_sp;
unsigned long *bp, *orig_sp, ip;
struct pt_regs *regs;
unsigned long ip;
#else
unsigned long *sp;
#endif
......@@ -24,41 +27,30 @@ struct unwind_state {
void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame);
bool unwind_next_frame(struct unwind_state *state);
unsigned long unwind_get_return_address(struct unwind_state *state);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state);
static inline bool unwind_done(struct unwind_state *state)
{
return state->stack_info.type == STACK_TYPE_UNKNOWN;
}
static inline
void unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
{
first_frame = first_frame ? : get_stack_pointer(task, regs);
__unwind_start(state, task, regs, first_frame);
}
static inline bool unwind_error(struct unwind_state *state)
{
return state->error;
}
#ifdef CONFIG_FRAME_POINTER
static inline
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
void unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
{
if (unwind_done(state))
return NULL;
first_frame = first_frame ? : get_stack_pointer(task, regs);
return state->regs ? &state->regs->ip : state->bp + 1;
__unwind_start(state, task, regs, first_frame);
}
#if defined(CONFIG_ORC_UNWINDER) || defined(CONFIG_FRAME_POINTER_UNWINDER)
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
{
if (unwind_done(state))
......@@ -66,20 +58,46 @@ static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
return state->regs;
}
#else /* !CONFIG_FRAME_POINTER */
static inline
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
#else
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
{
return NULL;
}
#endif
static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
#ifdef CONFIG_ORC_UNWINDER
void unwind_init(void);
void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size);
#else
static inline void unwind_init(void) {}
static inline
void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
void *orc, size_t orc_size) {}
#endif
/*
* This disables KASAN checking when reading a value from another task's stack,
* since the other task could be running on another CPU and could have poisoned
* the stack in the meantime.
*/
#define READ_ONCE_TASK_STACK(task, x) \
({ \
unsigned long val; \
if (task == current) \
val = READ_ONCE(x); \
else \
val = READ_ONCE_NOCHECK(x); \
val; \
})
static inline bool task_on_another_cpu(struct task_struct *task)
{
return NULL;
#ifdef CONFIG_SMP
return task != current && task->on_cpu;
#else
return false;
#endif
}
#endif /* CONFIG_FRAME_POINTER */
#endif /* _ASM_X86_UNWIND_H */
#ifndef _ASM_X86_UNWIND_HINTS_H
#define _ASM_X86_UNWIND_HINTS_H
#include "orc_types.h"
#ifdef __ASSEMBLY__
/*
* In asm, there are two kinds of code: normal C-type callable functions and
* the rest. The normal callable functions can be called by other code, and
* don't do anything unusual with the stack. Such normal callable functions
* are annotated with the ENTRY/ENDPROC macros. Most asm code falls in this
* category. In this case, no special debugging annotations are needed because
* objtool can automatically generate the ORC data for the ORC unwinder to read
* at runtime.
*
* Anything which doesn't fall into the above category, such as syscall and
* interrupt handlers, tends to not be called directly by other functions, and
* often does unusual non-C-function-type things with the stack pointer. Such
* code needs to be annotated such that objtool can understand it. The
* following CFI hint macros are for this type of code.
*
* These macros provide hints to objtool about the state of the stack at each
* instruction. Objtool starts from the hints and follows the code flow,
* making automatic CFI adjustments when it sees pushes and pops, filling out
* the debuginfo as necessary. It will also warn if it sees any
* inconsistencies.
*/
.macro UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=0 type=ORC_TYPE_CALL
#ifdef CONFIG_STACK_VALIDATION
.Lunwind_hint_ip_\@:
.pushsection .discard.unwind_hints
/* struct unwind_hint */
.long .Lunwind_hint_ip_\@ - .
.short \sp_offset
.byte \sp_reg
.byte \type
.popsection
#endif
.endm
.macro UNWIND_HINT_EMPTY
UNWIND_HINT sp_reg=ORC_REG_UNDEFINED
.endm
.macro UNWIND_HINT_REGS base=%rsp offset=0 indirect=0 extra=1 iret=0
.if \base == %rsp
.if \indirect
.set sp_reg, ORC_REG_SP_INDIRECT
.else
.set sp_reg, ORC_REG_SP
.endif
.elseif \base == %rbp
.set sp_reg, ORC_REG_BP
.elseif \base == %rdi
.set sp_reg, ORC_REG_DI
.elseif \base == %rdx
.set sp_reg, ORC_REG_DX
.elseif \base == %r10
.set sp_reg, ORC_REG_R10
.else
.error "UNWIND_HINT_REGS: bad base register"
.endif
.set sp_offset, \offset
.if \iret
.set type, ORC_TYPE_REGS_IRET
.elseif \extra == 0
.set type, ORC_TYPE_REGS_IRET
.set sp_offset, \offset + (16*8)
.else
.set type, ORC_TYPE_REGS
.endif
UNWIND_HINT sp_reg=sp_reg sp_offset=sp_offset type=type
.endm
.macro UNWIND_HINT_IRET_REGS base=%rsp offset=0
UNWIND_HINT_REGS base=\base offset=\offset iret=1
.endm
.macro UNWIND_HINT_FUNC sp_offset=8
UNWIND_HINT sp_offset=\sp_offset
.endm
#else /* !__ASSEMBLY__ */
#define UNWIND_HINT(sp_reg, sp_offset, type) \
"987: \n\t" \
".pushsection .discard.unwind_hints\n\t" \
/* struct unwind_hint */ \
".long 987b - .\n\t" \
".short " __stringify(sp_offset) "\n\t" \
".byte " __stringify(sp_reg) "\n\t" \
".byte " __stringify(type) "\n\t" \
".popsection\n\t"
#define UNWIND_HINT_SAVE UNWIND_HINT(0, 0, UNWIND_HINT_TYPE_SAVE)
#define UNWIND_HINT_RESTORE UNWIND_HINT(0, 0, UNWIND_HINT_TYPE_RESTORE)
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_UNWIND_HINTS_H */
......@@ -201,7 +201,7 @@ struct boot_params {
*
* @X86_SUBARCH_PC: Should be used if the hardware is enumerable using standard
* PC mechanisms (PCI, ACPI) and doesn't need a special boot flow.
* @X86_SUBARCH_LGUEST: Used for x86 hypervisor demo, lguest
* @X86_SUBARCH_LGUEST: Used for x86 hypervisor demo, lguest, deprecated
* @X86_SUBARCH_XEN: Used for Xen guest types which follow the PV boot path,
* which start at asm startup_xen() entry point and later jump to the C
* xen_start_kernel() entry point. Both domU and dom0 type of guests are
......
......@@ -126,11 +126,9 @@ obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
obj-$(CONFIG_TRACING) += tracepoint.o
obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o
ifdef CONFIG_FRAME_POINTER
obj-y += unwind_frame.o
else
obj-y += unwind_guess.o
endif
obj-$(CONFIG_ORC_UNWINDER) += unwind_orc.o
obj-$(CONFIG_FRAME_POINTER_UNWINDER) += unwind_frame.o
obj-$(CONFIG_GUESS_UNWINDER) += unwind_guess.o
###
# 64 bit specific files
......
......@@ -742,7 +742,16 @@ static void *bp_int3_handler, *bp_int3_addr;
int poke_int3_handler(struct pt_regs *regs)
{
/* bp_patching_in_progress */
/*
* Having observed our INT3 instruction, we now must observe
* bp_patching_in_progress.
*
* in_progress = TRUE INT3
* WMB RMB
* write INT3 if (in_progress)
*
* Idem for bp_int3_handler.
*/
smp_rmb();
if (likely(!bp_patching_in_progress))
......@@ -788,9 +797,8 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
bp_int3_addr = (u8 *)addr + sizeof(int3);
bp_patching_in_progress = true;
/*
* Corresponding read barrier in int3 notifier for
* making sure the in_progress flags is correctly ordered wrt.
* patching
* Corresponding read barrier in int3 notifier for making sure the
* in_progress and handler are correctly ordered wrt. patching.
*/
smp_wmb();
......@@ -815,9 +823,11 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
text_poke(addr, opcode, sizeof(int3));
on_each_cpu(do_sync_core, NULL, 1);
/*
* sync_core() implies an smp_mb() and orders this store against
* the writing of the new instruction.
*/
bp_patching_in_progress = false;
smp_wmb();
return addr;
}
......
......@@ -4,9 +4,6 @@
#include <asm/ucontext.h>
#include <linux/lguest.h>
#include "../../../drivers/lguest/lg.h"
#define __SYSCALL_I386(nr, sym, qual) [nr] = 1,
static char syscalls[] = {
#include <asm/syscalls_32.h>
......@@ -62,23 +59,6 @@ void foo(void)
OFFSET(stack_canary_offset, stack_canary, canary);
#endif
#if defined(CONFIG_LGUEST) || defined(CONFIG_LGUEST_GUEST) || defined(CONFIG_LGUEST_MODULE)
BLANK();
OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
OFFSET(LGUEST_DATA_irq_pending, lguest_data, irq_pending);
BLANK();
OFFSET(LGUEST_PAGES_host_gdt_desc, lguest_pages, state.host_gdt_desc);
OFFSET(LGUEST_PAGES_host_idt_desc, lguest_pages, state.host_idt_desc);
OFFSET(LGUEST_PAGES_host_cr3, lguest_pages, state.host_cr3);
OFFSET(LGUEST_PAGES_host_sp, lguest_pages, state.host_sp);
OFFSET(LGUEST_PAGES_guest_gdt_desc, lguest_pages,state.guest_gdt_desc);
OFFSET(LGUEST_PAGES_guest_idt_desc, lguest_pages,state.guest_idt_desc);
OFFSET(LGUEST_PAGES_guest_gdt, lguest_pages, state.guest_gdt);
OFFSET(LGUEST_PAGES_regs_trapnum, lguest_pages, regs.trapnum);
OFFSET(LGUEST_PAGES_regs_errcode, lguest_pages, regs.errcode);
OFFSET(LGUEST_PAGES_regs, lguest_pages, regs);
#endif
BLANK();
DEFINE(__NR_syscall_max, sizeof(syscalls) - 1);
DEFINE(NR_syscalls, sizeof(syscalls));
......
......@@ -94,6 +94,9 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
if (stack_name)
printk("%s <%s>\n", log_lvl, stack_name);
if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
__show_regs(regs, 0);
/*
* Scan the stack, printing any text addresses we find. At the
* same time, follow proper stack frames with the unwinder.
......@@ -118,10 +121,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
* Don't print regs->ip again if it was already printed
* by __show_regs() below.
*/
if (regs && stack == &regs->ip) {
unwind_next_frame(&state);
continue;
}
if (regs && stack == &regs->ip)
goto next;
if (stack == ret_addr_p)
reliable = 1;
......@@ -144,6 +145,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
if (!reliable)
continue;
next:
/*
* Get the next frame from the unwinder. No need to
* check for an error: if anything goes wrong, the rest
......@@ -153,7 +155,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
/* if the frame has entry regs, print them */
regs = unwind_get_entry_regs(&state);
if (regs)
if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
__show_regs(regs, 0);
}
......@@ -265,7 +267,7 @@ int __die(const char *str, struct pt_regs *regs, long err)
#ifdef CONFIG_X86_32
if (user_mode(regs)) {
sp = regs->sp;
ss = regs->ss & 0xffff;
ss = regs->ss;
} else {
sp = kernel_stack_pointer(regs);
savesegment(ss, ss);
......
......@@ -37,7 +37,7 @@ static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info)
* This is a software stack, so 'end' can be a valid stack pointer.
* It just means the stack is empty.
*/
if (stack < begin || stack > end)
if (stack <= begin || stack > end)
return false;
info->type = STACK_TYPE_IRQ;
......@@ -62,7 +62,7 @@ static bool in_softirq_stack(unsigned long *stack, struct stack_info *info)
* This is a software stack, so 'end' can be a valid stack pointer.
* It just means the stack is empty.
*/
if (stack < begin || stack > end)
if (stack <= begin || stack > end)
return false;
info->type = STACK_TYPE_SOFTIRQ;
......
......@@ -55,7 +55,7 @@ static bool in_exception_stack(unsigned long *stack, struct stack_info *info)
begin = end - (exception_stack_sizes[k] / sizeof(long));
regs = (struct pt_regs *)end - 1;
if (stack < begin || stack >= end)
if (stack <= begin || stack >= end)
continue;
info->type = STACK_TYPE_EXCEPTION + k;
......@@ -78,7 +78,7 @@ static bool in_irq_stack(unsigned long *stack, struct stack_info *info)
* This is a software stack, so 'end' can be a valid stack pointer.
* It just means the stack is empty.
*/
if (stack < begin || stack > end)
if (stack <= begin || stack > end)
return false;
info->type = STACK_TYPE_IRQ;
......
......@@ -155,7 +155,6 @@ ENTRY(startup_32)
jmp *%eax
.Lbad_subarch:
WEAK(lguest_entry)
WEAK(xen_entry)
/* Unknown implementation; there's really
nothing we can do at this point. */
......@@ -165,7 +164,6 @@ WEAK(xen_entry)
subarch_entries:
.long .Ldefault_entry /* normal x86/PC */
.long lguest_entry /* lguest hypervisor */
.long xen_entry /* Xen hypervisor */
.long .Ldefault_entry /* Moorestown MID */
num_subarch_entries = (. - subarch_entries) / 4
......@@ -457,12 +455,9 @@ early_idt_handler_common:
/* The vector number is in pt_regs->gs */
cld
pushl %fs /* pt_regs->fs */
movw $0, 2(%esp) /* clear high bits (some CPUs leave garbage) */
pushl %es /* pt_regs->es */
movw $0, 2(%esp) /* clear high bits (some CPUs leave garbage) */
pushl %ds /* pt_regs->ds */
movw $0, 2(%esp) /* clear high bits (some CPUs leave garbage) */
pushl %fs /* pt_regs->fs (__fsh varies by model) */
pushl %es /* pt_regs->es (__esh varies by model) */
pushl %ds /* pt_regs->ds (__dsh varies by model) */
pushl %eax /* pt_regs->ax */
pushl %ebp /* pt_regs->bp */
pushl %edi /* pt_regs->di */
......@@ -479,9 +474,8 @@ early_idt_handler_common:
/* Load the vector number into EDX */
movl PT_GS(%esp), %edx
/* Load GS into pt_regs->gs and clear high bits */
/* Load GS into pt_regs->gs (and maybe clobber __gsh) */
movw %gs, PT_GS(%esp)
movw $0, PT_GS+2(%esp)
movl %esp, %eax /* args are pt_regs (EAX), trapnr (EDX) */
call early_fixup_exception
......@@ -493,10 +487,10 @@ early_idt_handler_common:
popl %edi /* pt_regs->di */
popl %ebp /* pt_regs->bp */
popl %eax /* pt_regs->ax */
popl %ds /* pt_regs->ds */
popl %es /* pt_regs->es */
popl %fs /* pt_regs->fs */
popl %gs /* pt_regs->gs */
popl %ds /* pt_regs->ds (always ignores __dsh) */
popl %es /* pt_regs->es (always ignores __esh) */
popl %fs /* pt_regs->fs (always ignores __fsh) */
popl %gs /* pt_regs->gs (always ignores __gsh) */
decl %ss:early_recursion_flag
addl $4, %esp /* pop pt_regs->orig_ax */
iret
......
......@@ -21,6 +21,25 @@
#include <asm/mmu_context.h>
#include <asm/syscalls.h>
static void refresh_ldt_segments(void)
{
#ifdef CONFIG_X86_64
unsigned short sel;
/*
* Make sure that the cached DS and ES descriptors match the updated
* LDT.
*/
savesegment(ds, sel);
if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
loadsegment(ds, sel);
savesegment(es, sel);
if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
loadsegment(es, sel);
#endif
}
/* context.lock is held for us, so we don't need any locking. */
static void flush_ldt(void *__mm)
{
......@@ -32,6 +51,8 @@ static void flush_ldt(void *__mm)
pc = &mm->context;
set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
refresh_ldt_segments();
}
/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
......
......@@ -35,6 +35,7 @@
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/setup.h>
#include <asm/unwind.h>
#if 0
#define DEBUGP(fmt, ...) \
......@@ -213,7 +214,7 @@ int module_finalize(const Elf_Ehdr *hdr,
struct module *me)
{
const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
*para = NULL;
*para = NULL, *orc = NULL, *orc_ip = NULL;
char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
......@@ -225,6 +226,10 @@ int module_finalize(const Elf_Ehdr *hdr,
locks = s;
if (!strcmp(".parainstructions", secstrings + s->sh_name))
para = s;
if (!strcmp(".orc_unwind", secstrings + s->sh_name))
orc = s;
if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
orc_ip = s;
}
if (alt) {
......@@ -248,6 +253,10 @@ int module_finalize(const Elf_Ehdr *hdr,
/* make jump label nops */
jump_label_apply_nops(me);
if (orc && orc_ip)
unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
(void *)orc->sh_addr, orc->sh_size);
return 0;
}
......
......@@ -16,7 +16,6 @@ void __init x86_early_init_platform_quirks(void)
x86_platform.legacy.reserve_bios_regions = 1;
break;
case X86_SUBARCH_XEN:
case X86_SUBARCH_LGUEST:
x86_platform.legacy.devices.pnpbios = 0;
x86_platform.legacy.rtc = 0;
break;
......
......@@ -68,7 +68,7 @@ void __show_regs(struct pt_regs *regs, int all)
if (user_mode(regs)) {
sp = regs->sp;
ss = regs->ss & 0xffff;
ss = regs->ss;
gs = get_user_gs(regs);
} else {
sp = kernel_stack_pointer(regs);
......
......@@ -69,8 +69,7 @@ void __show_regs(struct pt_regs *regs, int all)
unsigned int fsindex, gsindex;
unsigned int ds, cs, es;
printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs & 0xffff,
(void *)regs->ip);
printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs, (void *)regs->ip);
printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx", regs->ss,
regs->sp, regs->flags);
if (regs->orig_ax != -1)
......@@ -149,6 +148,123 @@ void release_thread(struct task_struct *dead_task)
}
}
enum which_selector {
FS,
GS
};
/*
* Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are
* not available. The goal is to be reasonably fast on non-FSGSBASE systems.
* It's forcibly inlined because it'll generate better code and this function
* is hot.
*/
static __always_inline void save_base_legacy(struct task_struct *prev_p,
unsigned short selector,
enum which_selector which)
{
if (likely(selector == 0)) {
/*
* On Intel (without X86_BUG_NULL_SEG), the segment base could
* be the pre-existing saved base or it could be zero. On AMD
* (with X86_BUG_NULL_SEG), the segment base could be almost
* anything.
*
* This branch is very hot (it's hit twice on almost every
* context switch between 64-bit programs), and avoiding
* the RDMSR helps a lot, so we just assume that whatever
* value is already saved is correct. This matches historical
* Linux behavior, so it won't break existing applications.
*
* To avoid leaking state, on non-X86_BUG_NULL_SEG CPUs, if we
* report that the base is zero, it needs to actually be zero:
* see the corresponding logic in load_seg_legacy.
*/
} else {
/*
* If the selector is 1, 2, or 3, then the base is zero on
* !X86_BUG_NULL_SEG CPUs and could be anything on
* X86_BUG_NULL_SEG CPUs. In the latter case, Linux
* has never attempted to preserve the base across context
* switches.
*
* If selector > 3, then it refers to a real segment, and
* saving the base isn't necessary.
*/
if (which == FS)
prev_p->thread.fsbase = 0;
else
prev_p->thread.gsbase = 0;
}
}
static __always_inline void save_fsgs(struct task_struct *task)
{
savesegment(fs, task->thread.fsindex);
savesegment(gs, task->thread.gsindex);
save_base_legacy(task, task->thread.fsindex, FS);
save_base_legacy(task, task->thread.gsindex, GS);
}
static __always_inline void loadseg(enum which_selector which,
unsigned short sel)
{
if (which == FS)
loadsegment(fs, sel);
else
load_gs_index(sel);
}
static __always_inline void load_seg_legacy(unsigned short prev_index,
unsigned long prev_base,
unsigned short next_index,
unsigned long next_base,
enum which_selector which)
{
if (likely(next_index <= 3)) {
/*
* The next task is using 64-bit TLS, is not using this
* segment at all, or is having fun with arcane CPU features.
*/
if (next_base == 0) {
/*
* Nasty case: on AMD CPUs, we need to forcibly zero
* the base.
*/
if (static_cpu_has_bug(X86_BUG_NULL_SEG)) {
loadseg(which, __USER_DS);
loadseg(which, next_index);
} else {
/*
* We could try to exhaustively detect cases
* under which we can skip the segment load,
* but there's really only one case that matters
* for performance: if both the previous and
* next states are fully zeroed, we can skip
* the load.
*
* (This assumes that prev_base == 0 has no
* false positives. This is the case on
* Intel-style CPUs.)
*/
if (likely(prev_index | next_index | prev_base))
loadseg(which, next_index);
}
} else {
if (prev_index != next_index)
loadseg(which, next_index);
wrmsrl(which == FS ? MSR_FS_BASE : MSR_KERNEL_GS_BASE,
next_base);
}
} else {
/*
* The next task is using a real segment. Loading the selector
* is sufficient.
*/
loadseg(which, next_index);
}
}
int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p, unsigned long tls)
{
......@@ -229,10 +345,19 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
unsigned long new_sp,
unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
WARN_ON_ONCE(regs != current_pt_regs());
if (static_cpu_has(X86_BUG_NULL_SEG)) {
/* Loading zero below won't clear the base. */
loadsegment(fs, __USER_DS);
load_gs_index(__USER_DS);
}
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
load_gs_index(0);
regs->ip = new_ip;
regs->sp = new_sp;
regs->cs = _cs;
......@@ -277,7 +402,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
struct fpu *next_fpu = &next->fpu;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
unsigned prev_fsindex, prev_gsindex;
WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
this_cpu_read(irq_count) != -1);
switch_fpu_prepare(prev_fpu, cpu);
......@@ -286,8 +413,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
*
* (e.g. xen_load_tls())
*/
savesegment(fs, prev_fsindex);
savesegment(gs, prev_gsindex);
save_fsgs(prev_p);
/*
* Load TLS before restoring any segments so that segment loads
......@@ -326,108 +452,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
/*
* Switch FS and GS.
*
* These are even more complicated than DS and ES: they have
* 64-bit bases are that controlled by arch_prctl. The bases
* don't necessarily match the selectors, as user code can do
* any number of things to cause them to be inconsistent.
*
* We don't promise to preserve the bases if the selectors are
* nonzero. We also don't promise to preserve the base if the
* selector is zero and the base doesn't match whatever was
* most recently passed to ARCH_SET_FS/GS. (If/when the
* FSGSBASE instructions are enabled, we'll need to offer
* stronger guarantees.)
*
* As an invariant,
* (fsbase != 0 && fsindex != 0) || (gsbase != 0 && gsindex != 0) is
* impossible.
*/
if (next->fsindex) {
/* Loading a nonzero value into FS sets the index and base. */
loadsegment(fs, next->fsindex);
} else {
if (next->fsbase) {
/* Next index is zero but next base is nonzero. */
if (prev_fsindex)
loadsegment(fs, 0);
wrmsrl(MSR_FS_BASE, next->fsbase);
} else {
/* Next base and index are both zero. */
if (static_cpu_has_bug(X86_BUG_NULL_SEG)) {
/*
* We don't know the previous base and can't
* find out without RDMSR. Forcibly clear it.
*/
loadsegment(fs, __USER_DS);
loadsegment(fs, 0);
} else {
/*
* If the previous index is zero and ARCH_SET_FS
* didn't change the base, then the base is
* also zero and we don't need to do anything.
*/
if (prev->fsbase || prev_fsindex)
loadsegment(fs, 0);
}
}
}
/*
* Save the old state and preserve the invariant.
* NB: if prev_fsindex == 0, then we can't reliably learn the base
* without RDMSR because Intel user code can zero it without telling
* us and AMD user code can program any 32-bit value without telling
* us.
*/
if (prev_fsindex)
prev->fsbase = 0;
prev->fsindex = prev_fsindex;
if (next->gsindex) {
/* Loading a nonzero value into GS sets the index and base. */
load_gs_index(next->gsindex);
} else {
if (next->gsbase) {
/* Next index is zero but next base is nonzero. */
if (prev_gsindex)
load_gs_index(0);
wrmsrl(MSR_KERNEL_GS_BASE, next->gsbase);
} else {
/* Next base and index are both zero. */
if (static_cpu_has_bug(X86_BUG_NULL_SEG)) {
/*
* We don't know the previous base and can't
* find out without RDMSR. Forcibly clear it.
*
* This contains a pointless SWAPGS pair.
* Fixing it would involve an explicit check
* for Xen or a new pvop.
*/
load_gs_index(__USER_DS);
load_gs_index(0);
} else {
/*
* If the previous index is zero and ARCH_SET_GS
* didn't change the base, then the base is
* also zero and we don't need to do anything.
*/
if (prev->gsbase || prev_gsindex)
load_gs_index(0);
}
}
}
/*
* Save the old state and preserve the invariant.
* NB: if prev_gsindex == 0, then we can't reliably learn the base
* without RDMSR because Intel user code can zero it without telling
* us and AMD user code can program any 32-bit value without telling
* us.
*/
if (prev_gsindex)
prev->gsbase = 0;
prev->gsindex = prev_gsindex;
load_seg_legacy(prev->fsindex, prev->fsbase,
next->fsindex, next->fsbase, FS);
load_seg_legacy(prev->gsindex, prev->gsbase,
next->gsindex, next->gsbase, GS);
switch_fpu_finish(next_fpu, cpu);
......
......@@ -115,6 +115,7 @@
#include <asm/microcode.h>
#include <asm/mmu_context.h>
#include <asm/kaslr.h>
#include <asm/unwind.h>
/*
* max_low_pfn_mapped: highest direct mapped pfn under 4GB
......@@ -1310,6 +1311,8 @@ void __init setup_arch(char **cmdline_p)
if (efi_enabled(EFI_BOOT))
efi_apply_memmap_quirks();
#endif
unwind_init();
}
#ifdef CONFIG_X86_32
......
......@@ -256,7 +256,7 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
sp = current->sas_ss_sp + current->sas_ss_size;
} else if (IS_ENABLED(CONFIG_X86_32) &&
!onsigstack &&
(regs->ss & 0xffff) != __USER_DS &&
regs->ss != __USER_DS &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer) {
/* This is the legacy signal stack switching. */
......
......@@ -13,7 +13,7 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
unsigned long addr, seg;
addr = regs->ip;
seg = regs->cs & 0xffff;
seg = regs->cs;
if (v8086_mode(regs)) {
addr = (addr & 0xffff) + (seg << 4);
return addr;
......
......@@ -10,20 +10,22 @@
#define FRAME_HEADER_SIZE (sizeof(long) * 2)
/*
* This disables KASAN checking when reading a value from another task's stack,
* since the other task could be running on another CPU and could have poisoned
* the stack in the meantime.
*/
#define READ_ONCE_TASK_STACK(task, x) \
({ \
unsigned long val; \
if (task == current) \
val = READ_ONCE(x); \
else \
val = READ_ONCE_NOCHECK(x); \
val; \
})
unsigned long unwind_get_return_address(struct unwind_state *state)
{
if (unwind_done(state))
return 0;
return __kernel_text_address(state->ip) ? state->ip : 0;
}
EXPORT_SYMBOL_GPL(unwind_get_return_address);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
{
if (unwind_done(state))
return NULL;
return state->regs ? &state->regs->ip : state->bp + 1;
}
static void unwind_dump(struct unwind_state *state)
{
......@@ -66,15 +68,6 @@ static void unwind_dump(struct unwind_state *state)
}
}
unsigned long unwind_get_return_address(struct unwind_state *state)
{
if (unwind_done(state))
return 0;
return __kernel_text_address(state->ip) ? state->ip : 0;
}
EXPORT_SYMBOL_GPL(unwind_get_return_address);
static size_t regs_size(struct pt_regs *regs)
{
/* x86_32 regs from kernel mode are two words shorter: */
......
......@@ -19,6 +19,11 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
}
EXPORT_SYMBOL_GPL(unwind_get_return_address);
unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
{
return NULL;
}
bool unwind_next_frame(struct unwind_state *state)
{
struct stack_info *info = &state->stack_info;
......
This diff is collapsed.
......@@ -24,6 +24,7 @@
#include <asm/asm-offsets.h>
#include <asm/thread_info.h>
#include <asm/page_types.h>
#include <asm/orc_lookup.h>
#include <asm/cache.h>
#include <asm/boot.h>
......@@ -148,6 +149,8 @@ SECTIONS
BUG_TABLE
ORC_UNWIND_TABLE
. = ALIGN(PAGE_SIZE);
__vvar_page = .;
......
......@@ -89,6 +89,5 @@ config KVM_MMU_AUDIT
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source drivers/vhost/Kconfig
source drivers/lguest/Kconfig
endif # VIRTUALIZATION
config LGUEST_GUEST
bool "Lguest guest support"
depends on X86_32 && PARAVIRT && PCI
select TTY
select VIRTUALIZATION
select VIRTIO
select VIRTIO_CONSOLE
help
Lguest is a tiny in-kernel hypervisor. Selecting this will
allow your kernel to boot under lguest. This option will increase
your kernel size by about 10k. If in doubt, say N.
If you say Y here, make sure you say Y (or M) to the virtio block
and net drivers which lguest needs.
obj-y := head_32.o boot.o
CFLAGS_boot.o := $(call cc-option, -fno-stack-protector)
This diff is collapsed.
This diff is collapsed.
......@@ -363,3 +363,4 @@ L_bugged_2:
pop %ebx
jmp L_exit
#endif /* PARANOID */
ENDPROC(div_Xsig)
......@@ -44,4 +44,4 @@ ENTRY(FPU_div_small)
leave
ret
ENDPROC(FPU_div_small)
......@@ -62,6 +62,7 @@ ENTRY(mul32_Xsig)
popl %esi
leave
ret
ENDPROC(mul32_Xsig)
ENTRY(mul64_Xsig)
......@@ -114,6 +115,7 @@ ENTRY(mul64_Xsig)
popl %esi
leave
ret
ENDPROC(mul64_Xsig)
......@@ -173,4 +175,4 @@ ENTRY(mul_Xsig_Xsig)
popl %esi
leave
ret
ENDPROC(mul_Xsig_Xsig)
......@@ -133,3 +133,4 @@ L_accum_done:
popl %esi
leave
ret
ENDPROC(polynomial_Xsig)
......@@ -94,6 +94,7 @@ L_overflow:
call arith_overflow
pop %ebx
jmp L_exit
ENDPROC(FPU_normalize)
......@@ -145,3 +146,4 @@ L_exit_nuo_zero:
popl %ebx
leave
ret
ENDPROC(FPU_normalize_nuo)
......@@ -706,3 +706,5 @@ L_exception_exit:
mov $-1,%eax
jmp fpu_reg_round_special_exit
#endif /* PARANOID */
ENDPROC(FPU_round)
......@@ -165,3 +165,4 @@ L_exit:
leave
ret
#endif /* PARANOID */
ENDPROC(FPU_u_add)
......@@ -469,3 +469,5 @@ L_exit:
leave
ret
#endif /* PARANOID */
ENDPROC(FPU_u_div)
......@@ -146,3 +146,4 @@ L_exit:
ret
#endif /* PARANOID */
ENDPROC(FPU_u_mul)
......@@ -270,3 +270,4 @@ L_exit:
popl %esi
leave
ret
ENDPROC(FPU_u_sub)
......@@ -78,7 +78,7 @@ L_exit:
popl %ebx
leave
ret
ENDPROC(round_Xsig)
......@@ -138,4 +138,4 @@ L_n_exit:
popl %ebx
leave
ret
ENDPROC(norm_Xsig)
......@@ -85,3 +85,4 @@ L_more_than_95:
popl %esi
leave
ret
ENDPROC(shr_Xsig)
......@@ -92,6 +92,7 @@ L_more_than_95:
popl %esi
leave
ret
ENDPROC(FPU_shrx)
/*---------------------------------------------------------------------------+
......@@ -202,3 +203,4 @@ Ls_more_than_95:
popl %esi
leave
ret
ENDPROC(FPU_shrxs)
......@@ -468,3 +468,4 @@ sqrt_more_prec_large:
/* Our estimate is too large */
movl $0x7fffff00,%eax
jmp sqrt_round_result
ENDPROC(wm_sqrt)
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment