Commit 88a61892 authored by Linus Torvalds

Merge tag 'docs-5.19' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It was a moderately busy cycle for documentation; highlights include:

   - After a long period of inactivity, the Japanese translations are
     seeing some much-needed maintenance and updating.

   - Reworked IOMMU documentation

   - Some new documentation for static-analysis tools

   - A new overall structure for the memory-management documentation.
     This is an LSFMM outcome that, it is hoped, will help encourage
     developers to fill in the many gaps. Optimism is eternal...but
     hopefully it will work.

   - More Chinese translations.

  Plus the usual typo fixes, updates, etc"

* tag 'docs-5.19' of git://git.lwn.net/linux: (70 commits)
  docs: pdfdocs: Add space for chapter counts >= 100 in TOC
  docs/zh_CN: Add dev-tools/gdb-kernel-debugging.rst Chinese translation
  input: Docs: correct ntrig.rst typo
  input: Docs: correct atarikbd.rst typos
  MAINTAINERS: Become the docs/zh_CN maintainer
  docs/zh_CN: fix devicetree usage-model translation
  mm,doc: Add new documentation structure
  Documentation: drop more IDE boot options and ide-cd.rst
  Documentation/process: use scripts/get_maintainer.pl on patches
  MAINTAINERS: Add entry for DOCUMENTATION/JAPANESE
  docs/trans/ja_JP/howto: Don't mention specific kernel versions
  docs/ja_JP/SubmittingPatches: Request summaries for commit references
  docs/ja_JP/SubmittingPatches: Add Suggested-by as a standard signature
  docs/ja_JP/SubmittingPatches: Randy has moved
  docs/ja_JP/SubmittingPatches: Suggest the use of scripts/get_maintainer.pl
  docs/ja_JP/SubmittingPatches: Update GregKH links
  Documentation/sysctl: document max_rcu_stall_to_panic
  Documentation: add missing angle bracket in cgroup-v2 doc
  Documentation: dev-tools: use literal block instead of code-block
  docs/zh_CN: add vm numa translation
  ...
parents 537e62c8 b86f46d5
...@@ -1881,7 +1881,7 @@ IO Latency Interface Files ...@@ -1881,7 +1881,7 @@ IO Latency Interface Files
io.latency io.latency
This takes a similar format as the other controllers. This takes a similar format as the other controllers.
"MAJOR:MINOR target=<target time in microseconds" "MAJOR:MINOR target=<target time in microseconds>"
io.stat io.stat
If the controller is enabled you will see extra stats in io.stat in If the controller is enabled you will see extra stats in io.stat in
......
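As an illustration of the corrected io.latency format in the hunk above, here is a minimal Python sketch that builds and parses a line of the form "MAJOR:MINOR target=<target time in microseconds>". The helper names and the device numbers used in the usage note are hypothetical, chosen only for this example; nothing here is part of the kernel or of the patch.

```python
import re

# Hypothetical helpers for illustration only; these are not kernel APIs.
# Matches "MAJOR:MINOR target=<microseconds>" with decimal fields.
_IO_LATENCY_RE = re.compile(r"^(\d+):(\d+) target=(\d+)$")


def format_io_latency(major, minor, target_us):
    """Build a 'MAJOR:MINOR target=<microseconds>' line."""
    return f"{major}:{minor} target={target_us}"


def parse_io_latency(line):
    """Return (major, minor, target_us); raise ValueError on bad input."""
    m = _IO_LATENCY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an io.latency line: {line!r}")
    return tuple(int(g) for g in m.groups())
```

For example, format_io_latency(8, 16, 10000) yields the string "8:16 target=10000", and parse_io_latency() turns it back into three integers. The sketch only covers the single target= key shown in the hunk.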
...@@ -99,6 +99,7 @@ parameter is applicable:: ...@@ -99,6 +99,7 @@ parameter is applicable::
ALSA ALSA sound support is enabled. ALSA ALSA sound support is enabled.
APIC APIC support is enabled. APIC APIC support is enabled.
APM Advanced Power Management support is enabled. APM Advanced Power Management support is enabled.
APPARMOR AppArmor support is enabled.
ARM ARM architecture is enabled. ARM ARM architecture is enabled.
ARM64 ARM64 architecture is enabled. ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled. AX25 Appropriate AX.25 support is enabled.
...@@ -108,15 +109,15 @@ parameter is applicable:: ...@@ -108,15 +109,15 @@ parameter is applicable::
DYNAMIC_DEBUG Build in debug messages and enable them at runtime DYNAMIC_DEBUG Build in debug messages and enable them at runtime
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled EFI EFI Partitioning (GPT) is enabled
EIDE EIDE/ATAPI support is enabled.
EVM Extended Verification Module EVM Extended Verification Module
FB The frame buffer device is enabled. FB The frame buffer device is enabled.
FTRACE Function tracing enabled. FTRACE Function tracing enabled.
GCOV GCOV profiling is enabled. GCOV GCOV profiling is enabled.
HIBERNATION HIBERNATION is enabled.
HW Appropriate hardware is enabled. HW Appropriate hardware is enabled.
HYPER_V HYPERV support is enabled.
IA-64 IA-64 architecture is enabled. IA-64 IA-64 architecture is enabled.
IMA Integrity measurement architecture is enabled. IMA Integrity measurement architecture is enabled.
IOSCHED More than one I/O scheduler is enabled.
IP_PNP IP DHCP, BOOTP, or RARP is enabled. IP_PNP IP DHCP, BOOTP, or RARP is enabled.
IPV6 IPv6 support is enabled. IPV6 IPv6 support is enabled.
ISAPNP ISA PnP code is enabled. ISAPNP ISA PnP code is enabled.
...@@ -140,7 +141,6 @@ parameter is applicable:: ...@@ -140,7 +141,6 @@ parameter is applicable::
NUMA NUMA support is enabled. NUMA NUMA support is enabled.
NFS Appropriate NFS support is enabled. NFS Appropriate NFS support is enabled.
OF Devicetree is enabled. OF Devicetree is enabled.
OSS OSS sound support is enabled.
PV_OPS A paravirtualized kernel is enabled. PV_OPS A paravirtualized kernel is enabled.
PARIDE The ParIDE (parallel port IDE) subsystem is enabled. PARIDE The ParIDE (parallel port IDE) subsystem is enabled.
PARISC The PA-RISC architecture is enabled. PARISC The PA-RISC architecture is enabled.
...@@ -160,7 +160,6 @@ parameter is applicable:: ...@@ -160,7 +160,6 @@ parameter is applicable::
the Documentation/scsi/ sub-directory. the Documentation/scsi/ sub-directory.
SECURITY Different security models are enabled. SECURITY Different security models are enabled.
SELINUX SELinux support is enabled. SELINUX SELinux support is enabled.
APPARMOR AppArmor support is enabled.
SERIAL Serial support is enabled. SERIAL Serial support is enabled.
SH SuperH architecture is enabled. SH SuperH architecture is enabled.
SMP The kernel is an SMP kernel. SMP The kernel is an SMP kernel.
...@@ -168,7 +167,6 @@ parameter is applicable:: ...@@ -168,7 +167,6 @@ parameter is applicable::
SWSUSP Software suspend (hibernation) is enabled. SWSUSP Software suspend (hibernation) is enabled.
SUSPEND System suspend states are enabled. SUSPEND System suspend states are enabled.
TPM TPM drivers are enabled. TPM TPM drivers are enabled.
TS Appropriate touchscreen support is enabled.
UMS USB Mass Storage support is enabled. UMS USB Mass Storage support is enabled.
USB USB support is enabled. USB USB support is enabled.
USBHID USB Human Interface Device support is enabled. USBHID USB Human Interface Device support is enabled.
...@@ -177,7 +175,6 @@ parameter is applicable:: ...@@ -177,7 +175,6 @@ parameter is applicable::
VGA The VGA console has been enabled. VGA The VGA console has been enabled.
VT Virtual terminal support is enabled. VT Virtual terminal support is enabled.
WDT Watchdog support is enabled. WDT Watchdog support is enabled.
XT IBM PC/XT MFM hard disk support is enabled.
X86-32 X86-32, aka i386 architecture is enabled. X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled. X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in More X86-64 boot options can be found in
...@@ -211,7 +208,7 @@ The number of kernel parameters is not limited, but the length of the ...@@ -211,7 +208,7 @@ The number of kernel parameters is not limited, but the length of the
complete command line (parameters including spaces etc.) is limited to complete command line (parameters including spaces etc.) is limited to
a fixed number of characters. This limit depends on the architecture a fixed number of characters. This limit depends on the architecture
and is between 256 and 4096 characters. It is defined in the file and is between 256 and 4096 characters. It is defined in the file
./include/asm/setup.h as COMMAND_LINE_SIZE. ./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE.
Finally, the [KMG] suffix is commonly described after a number of kernel Finally, the [KMG] suffix is commonly described after a number of kernel
parameter values. These 'K', 'M', and 'G' letters represent the _binary_ parameter values. These 'K', 'M', and 'G' letters represent the _binary_
......
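The [KMG] suffix mentioned at the end of the hunk above denotes binary multipliers (2^10, 2^20 and 2^30). The following Python sketch shows one way such a value could be converted to bytes. The function name is hypothetical, and the sketch is a simplification: it accepts only decimal digits, and its choice to accept lower-case suffixes is an assumption of this example, not a statement about the kernel's own parser.

```python
# Binary multipliers for the K, M and G suffixes.
_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Hypothetical helper: convert a value such as '512M' to bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    # Look at the last character to see whether a suffix is present.
    mult = _SUFFIXES.get(text[-1].upper())
    digits = text[:-1] if mult else text
    return int(digits) * (mult or 1)
```

With this helper, a value written as 64K corresponds to 65536 bytes and 1G to 1073741824 bytes; a bare number is returned unchanged.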
...@@ -461,6 +461,12 @@ ...@@ -461,6 +461,12 @@
Format: <io>,<irq>,<mode> Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c. See header of drivers/net/hamradio/baycom_ser_hdx.c.
bert_disable [ACPI]
Disable BERT OS support on buggy BIOSes.
bgrt_disable [ACPI][X86]
Disable BGRT to avoid flickering OEM logo.
blkdevparts= Manual partition parsing of block device(s) for blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input. embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst See Documentation/block/cmdline-partition.rst
...@@ -476,12 +482,6 @@ ...@@ -476,12 +482,6 @@
See Documentation/admin-guide/bootconfig.rst See Documentation/admin-guide/bootconfig.rst
bert_disable [ACPI]
Disable BERT OS support on buggy BIOSes.
bgrt_disable [ACPI][X86]
Disable BGRT to avoid flickering OEM logo.
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as bttv.radio= Most important insmod options are available as
kernel args too. kernel args too.
...@@ -563,6 +563,25 @@ ...@@ -563,6 +563,25 @@
cio_ignore= [S390] cio_ignore= [S390]
See Documentation/s390/common_io.rst for details. See Documentation/s390/common_io.rst for details.
clearcpuid=X[,X...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
numbers X. Note the Linux-specific bits are not necessarily
stable over kernel options, but the vendor-specific
ones should be.
X can also be a string as appearing in the flags: line
in /proc/cpuinfo which does not have the above
instability issue. However, not all features have names
in /proc/cpuinfo.
Note that using this option will taint your kernel.
Also note that user programs calling CPUID directly
or using the feature without checking anything
will still see it. This just prevents it from
being used by the kernel or shown in /proc/cpuinfo.
Also note the kernel might malfunction if you disable
some critical bits.
clk_ignore_unused clk_ignore_unused
[CLK] [CLK]
Prevents the clock framework from automatically gating Prevents the clock framework from automatically gating
...@@ -631,24 +650,6 @@ ...@@ -631,24 +650,6 @@
Defaults to zero when built as a module and to Defaults to zero when built as a module and to
10 seconds when built into the kernel. 10 seconds when built into the kernel.
clearcpuid=X[,X...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
numbers X. Note the Linux-specific bits are not necessarily
stable over kernel options, but the vendor-specific
ones should be.
X can also be a string as appearing in the flags: line
in /proc/cpuinfo which does not have the above
instability issue. However, not all features have names
in /proc/cpuinfo.
Note that using this option will taint your kernel.
Also note that user programs calling CPUID directly
or using the feature without checking anything
will still see it. This just prevents it from
being used by the kernel or shown in /proc/cpuinfo.
Also note the kernel might malfunction if you disable
some critical bits.
cma=nn[MG]@[start[MG][-end[MG]]] cma=nn[MG]@[start[MG][-end[MG]]]
[KNL,CMA] [KNL,CMA]
Sets the size of kernel global memory area for Sets the size of kernel global memory area for
...@@ -770,6 +771,24 @@ ...@@ -770,6 +771,24 @@
0: default value, disable debugging 0: default value, disable debugging
1: enable debugging at boot time 1: enable debugging at boot time
cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
Format:
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
1. Resume from suspend/hibernate depends on CPU0.
Suspend/hibernate will fail if CPU0 is offline and you
need to online CPU0 before suspend/hibernate.
2. PIC interrupts also depend on CPU0. CPU0 can't be
removed if a PIC interrupt is detected.
It's said poweroff/reboot may depend on CPU0 on some
machines although I haven't seen such issues so far
after CPU0 is offline on a few tested machines.
If the dependencies are under your control, you can
turn on cpu0_hotplug.
cpuidle.off=1 [CPU_IDLE] cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system disable the cpuidle sub-system
...@@ -790,9 +809,13 @@ ...@@ -790,9 +809,13 @@
on every CPU online, such as boot, and resume from suspend. on every CPU online, such as boot, and resume from suspend.
Default: 10000 Default: 10000
cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver crash_kexec_post_notifiers
Format: Run kdump after running panic-notifiers and dumping
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] kmsg. This only for the users who doubt kdump always
succeeds in any situation.
Note that this also increases risks of kdump failure,
because some panic notifiers can make the crashed
kernel more unstable.
crashkernel=size[KMG][@offset[KMG]] crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a 'crash kernel' [KNL] Using kexec, Linux can switch to a 'crash kernel'
...@@ -961,6 +984,8 @@ ...@@ -961,6 +984,8 @@
dump out devices still on the deferred probe list after dump out devices still on the deferred probe list after
retrying. retrying.
delayacct [KNL] Enable per-task delay accounting
dell_smm_hwmon.ignore_dmi= dell_smm_hwmon.ignore_dmi=
[HW] Continue probing hardware even if DMI data [HW] Continue probing hardware even if DMI data
indicates that the driver is running on unsupported indicates that the driver is running on unsupported
...@@ -1014,17 +1039,6 @@ ...@@ -1014,17 +1039,6 @@
disable= [IPV6] disable= [IPV6]
See Documentation/networking/ipv6.rst. See Documentation/networking/ipv6.rst.
hardened_usercopy=
[KNL] Under CONFIG_HARDENED_USERCOPY, whether
hardening is enabled for this boot. Hardened
usercopy checking is used to protect the kernel
from reading or writing beyond known memory
allocation boundaries as a proactive defense
against bounds-checking flaws in the kernel's
copy_to_user()/copy_from_user() interface.
on Perform hardened usercopy checks (default).
off Disable hardened usercopy checks.
disable_radix [PPC] disable_radix [PPC]
Disable RADIX MMU mode on POWER9 Disable RADIX MMU mode on POWER9
...@@ -1293,7 +1307,7 @@ ...@@ -1293,7 +1307,7 @@
Append ",keep" to not disable it when the real console Append ",keep" to not disable it when the real console
takes over. takes over.
Only one of vga, efi, serial, or usb debug port can Only one of vga, serial, or usb debug port can
be used at a time. be used at a time.
Currently only ttyS0 and ttyS1 may be specified by Currently only ttyS0 and ttyS1 may be specified by
...@@ -1308,7 +1322,7 @@ ...@@ -1308,7 +1322,7 @@
Interaction with the standard serial driver is not Interaction with the standard serial driver is not
very good. very good.
The VGA and EFI output is eventually overwritten by The VGA output is eventually overwritten by
the real console. the real console.
The xen option can only be used in Xen domains. The xen option can only be used in Xen domains.
...@@ -1327,17 +1341,6 @@ ...@@ -1327,17 +1341,6 @@
force: enforce the use of EDAC to report H/W event. force: enforce the use of EDAC to report H/W event.
default: on. default: on.
ekgdboc= [X86,KGDB] Allow early kernel console debugging
ekgdboc=kbd
This is designed to be used in conjunction with
the boot argument: earlyprintk=vga
This parameter works in place of the kgdboc parameter
but can only be used if the backing tty is available
very early in the boot process. For early debugging
via a serial port see kgdboc_earlycon instead.
edd= [EDD] edd= [EDD]
Format: {"off" | "on" | "skip[mbr]"} Format: {"off" | "on" | "skip[mbr]"}
...@@ -1399,6 +1402,17 @@ ...@@ -1399,6 +1402,17 @@
eisa_irq_edge= [PARISC,HW] eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c. See header of drivers/parisc/eisa.c.
ekgdboc= [X86,KGDB] Allow early kernel console debugging
Format: ekgdboc=kbd
This is designed to be used in conjunction with
the boot argument: earlyprintk=vga
This parameter works in place of the kgdboc parameter
but can only be used if the backing tty is available
very early in the boot process. For early debugging
via a serial port see kgdboc_earlycon instead.
elanfreq= [X86-32] elanfreq= [X86-32]
See comment before function elanfreq_setup() in See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c. arch/x86/kernel/cpu/cpufreq/elanfreq.c.
...@@ -1597,6 +1611,17 @@ ...@@ -1597,6 +1611,17 @@
Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0. Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
Default: 1024 Default: 1024
hardened_usercopy=
[KNL] Under CONFIG_HARDENED_USERCOPY, whether
hardening is enabled for this boot. Hardened
usercopy checking is used to protect the kernel
from reading or writing beyond known memory
allocation boundaries as a proactive defense
against bounds-checking flaws in the kernel's
copy_to_user()/copy_from_user() interface.
on Perform hardened usercopy checks (default).
off Disable hardened usercopy checks.
hardlockup_all_cpu_backtrace= hardlockup_all_cpu_backtrace=
[KNL] Should the hard-lockup detector generate [KNL] Should the hard-lockup detector generate
backtraces on all cpus. backtraces on all cpus.
...@@ -1617,6 +1642,15 @@ ...@@ -1617,6 +1642,15 @@
corresponding firmware-first mode error processing corresponding firmware-first mode error processing
logic will be disabled. logic will be disabled.
hibernate= [HIBERNATION]
noresume Don't check if there's a hibernation image
present during boot.
nocompress Don't compress/decompress hibernation images.
no Disable hibernation and resume.
protect_image Turn on image protection during restoration
(that will set all pages holding image data
during restoration read-only).
highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem highmem otherwise. This also works to reduce highmem
...@@ -1639,16 +1673,6 @@ ...@@ -1639,16 +1673,6 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT. registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
Format: nn[KMGTPE] or (node format)
<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
Reserve a CMA area of given size and allocate gigantic
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
hugepages= [HW] Number of HugeTLB pages to allocate at boot. hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated. the number of pages of hugepagesz to be allocated.
...@@ -1670,6 +1694,16 @@ ...@@ -1670,6 +1694,16 @@
Documentation/admin-guide/mm/hugetlbpage.rst. Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG] Format: size[KMG]
hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
Format: nn[KMGTPE] or (node format)
<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
Reserve a CMA area of given size and allocate gigantic
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
hugetlb_free_vmemmap= hugetlb_free_vmemmap=
[KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP [KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
enabled. enabled.
...@@ -1769,26 +1803,6 @@ ...@@ -1769,26 +1803,6 @@
icn= [HW,ISDN] icn= [HW,ISDN]
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]] Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
ide-core.nodma= [HW] (E)IDE subsystem
Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
.vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
.cdrom .chs .ignore_cable are additional options
See Documentation/ide/ide.rst.
ide-generic.probe-mask= [HW] (E)IDE subsystem
Format: <int>
Probe mask for legacy ISA IDE ports. Depending on
platform up to 6 ports are supported, enabled by
setting corresponding bits in the mask to 1. The
default value is 0x0, which has a special meaning.
On systems that have PCI, it triggers scanning the
PCI bus for the first and the second port, which
are then probed. On systems without PCI the value
of 0x0 enables probing the two first ports as if it
was 0x3.
ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem
Claim all unknown PCI IDE storage controllers.
idle= [X86] idle= [X86]
Format: idle=poll, idle=halt, idle=nomwait Format: idle=poll, idle=halt, idle=nomwait
...@@ -2722,8 +2736,6 @@ ...@@ -2722,8 +2736,6 @@
If there are multiple matching configurations changing If there are multiple matching configurations changing
the same attribute, the last one is used. the same attribute, the last one is used.
memblock=debug [KNL] Enable memblock debug messages.
load_ramdisk= [RAM] [Deprecated] load_ramdisk= [RAM] [Deprecated]
lockd.nlm_grace_period=P [NFS] Assign grace period. lockd.nlm_grace_period=P [NFS] Assign grace period.
...@@ -2865,7 +2877,7 @@ ...@@ -2865,7 +2877,7 @@
different yeeloong laptops. different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch Example: machtype=lemote-yeeloong-2f-7inch
max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
than or equal to this physical address is ignored. than or equal to this physical address is ignored.
maxcpus= [SMP] Maximum number of processors that an SMP kernel maxcpus= [SMP] Maximum number of processors that an SMP kernel
...@@ -2965,6 +2977,8 @@ ...@@ -2965,6 +2977,8 @@
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory. memory.
memblock=debug [KNL] Enable memblock debug messages.
memchunk=nn[KMG] memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for [KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers. per-device physically contiguous DMA buffers.
...@@ -3108,7 +3122,7 @@ ...@@ -3108,7 +3122,7 @@
mga= [HW,DRM] mga= [HW,DRM]
min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored. physical address is ignored.
mini2440= [ARM,HW,KNL] mini2440= [ARM,HW,KNL]
...@@ -3233,20 +3247,6 @@ ...@@ -3233,20 +3247,6 @@
mtdparts= [MTD] mtdparts= [MTD]
See drivers/mtd/parsers/cmdlinepart.c See drivers/mtd/parsers/cmdlinepart.c
multitce=off [PPC] This parameter disables the use of the pSeries
firmware feature for updating multiple TCE entries
at a time.
onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
boundary - index of last SLC block on Flex-OneNAND.
The remaining blocks are configured as MLC blocks.
lock - Configure if Flex-OneNAND boundary should be locked.
Once locked, the boundary cannot be changed.
1 indicates lock status, 0 indicates unlock status.
mtdset= [ARM] mtdset= [ARM]
ARM/S3C2412 JIVE boot control ARM/S3C2412 JIVE boot control
...@@ -3273,6 +3273,10 @@ ...@@ -3273,6 +3273,10 @@
Used for mtrr cleanup. It is spare mtrr entries number. Used for mtrr cleanup. It is spare mtrr entries number.
Set to 2 or more if your graphical card needs more. Set to 2 or more if your graphical card needs more.
multitce=off [PPC] This parameter disables the use of the pSeries
firmware feature for updating multiple TCE entries
at a time.
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
netdev= [NET] Network devices parameters netdev= [NET] Network devices parameters
...@@ -3282,6 +3286,11 @@ ...@@ -3282,6 +3286,11 @@
This usage is only documented in each driver source This usage is only documented in each driver source
file if at all. file if at all.
netpoll.carrier_timeout=
[NET] Specifies amount of time (in seconds) that
netpoll should wait for a carrier. By default netpoll
waits 4 seconds.
nf_conntrack.acct= nf_conntrack.acct=
[NETFILTER] Enable connection tracking flow accounting [NETFILTER] Enable connection tracking flow accounting
0 to disable accounting 0 to disable accounting
...@@ -3432,11 +3441,6 @@ ...@@ -3432,11 +3441,6 @@
These settings can be accessed at runtime via These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls. the nmi_watchdog and hardlockup_panic sysctls.
netpoll.carrier_timeout=
[NET] Specifies amount of time (in seconds) that
netpoll should wait for a carrier. By default netpoll
waits 4 seconds.
no387 [BUGS=X86-32] Tells the kernel to use the 387 maths no387 [BUGS=X86-32] Tells the kernel to use the 387 maths
emulation library even if a 387 maths coprocessor emulation library even if a 387 maths coprocessor
is present. is present.
...@@ -3491,8 +3495,6 @@ ...@@ -3491,8 +3495,6 @@
nocache [ARM] nocache [ARM]
delayacct [KNL] Enable per-task delay accounting
nodsp [SH] Disable hardware DSP at boot time. nodsp [SH] Disable hardware DSP at boot time.
noefi Disable EFI runtime services support. noefi Disable EFI runtime services support.
...@@ -3721,20 +3723,6 @@ ...@@ -3721,20 +3723,6 @@
nox2apic [X86-64,APIC] Do not enable x2APIC mode. nox2apic [X86-64,APIC] Do not enable x2APIC mode.
cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
1. Resume from suspend/hibernate depends on CPU0.
Suspend/hibernate will fail if CPU0 is offline and you
need to online CPU0 before suspend/hibernate.
2. PIC interrupts also depend on CPU0. CPU0 can't be
removed if a PIC interrupt is detected.
It's said poweroff/reboot may depend on CPU0 on some
machines although I haven't seen such issues so far
after CPU0 is offline on a few tested machines.
If the dependencies are under your control, you can
turn on cpu0_hotplug.
nps_mtm_hs_ctr= [KNL,ARC] nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in This parameter sets the maximum duration, in
cycles, each HW thread of the CTOP can run cycles, each HW thread of the CTOP can run
...@@ -3787,6 +3775,16 @@ ...@@ -3787,6 +3775,16 @@
For example, to override I2C bus2: For example, to override I2C bus2:
omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100 omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
boundary - index of last SLC block on Flex-OneNAND.
The remaining blocks are configured as MLC blocks.
lock - Configure if Flex-OneNAND boundary should be locked.
Once locked, the boundary cannot be changed.
1 indicates lock status, 0 indicates unlock status.
oops=panic Always panic on oopses. Default is to just kill the oops=panic Always panic on oopses. Default is to just kill the
process, but there is a small probability of process, but there is a small probability of
deadlocking the machine. deadlocking the machine.
...@@ -3857,14 +3855,6 @@ ...@@ -3857,14 +3855,6 @@
panic_on_warn panic() instead of WARN(). Useful to cause kdump panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN(). on a WARN().
crash_kexec_post_notifiers
Run kdump after running panic-notifiers and dumping
kmsg. This only for the users who doubt kdump always
succeeds in any situation.
Note that this also increases risks of kdump failure,
because some panic notifiers can make the crashed
kernel more unstable.
parkbd.port= [HW] Parallel port number the keyboard adapter is parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0. connected to, default is 0.
Format: <parport#> Format: <parport#>
...@@ -5156,15 +5146,6 @@ ...@@ -5156,15 +5146,6 @@
Useful for devices that are detected asynchronously Useful for devices that are detected asynchronously
(e.g. USB and MMC devices). (e.g. USB and MMC devices).
hibernate= [HIBERNATION]
noresume Don't check if there's a hibernation image
present during boot.
nocompress Don't compress/decompress hibernation images.
no Disable hibernation and resume.
protect_image Turn on image protection during restoration
(that will set all pages holding image data
during restoration read-only).
retain_initrd [RAM] Keep initrd memory after extraction retain_initrd [RAM] Keep initrd memory after extraction
rfkill.default_state= rfkill.default_state=
...@@ -5480,7 +5461,7 @@ ...@@ -5480,7 +5461,7 @@
1: Fast pin select (default) 1: Fast pin select (default)
2: ATC IRMode 2: ATC IRMode
smt [KNL,S390] Set the maximum number of threads (logical smt= [KNL,S390] Set the maximum number of threads (logical
CPUs) to use per physical CPU on systems capable of CPUs) to use per physical CPU on systems capable of
symmetric multithreading (SMT). Will be capped to the symmetric multithreading (SMT). Will be capped to the
actual hardware limit. actual hardware limit.
...@@ -5867,8 +5848,9 @@ ...@@ -5867,8 +5848,9 @@
This parameter controls use of the Protected This parameter controls use of the Protected
Execution Facility on pSeries. Execution Facility on pSeries.
swapaccount=[0|1] swapaccount= [KNL]
[KNL] Enable accounting of swap in memory resource Format: [0|1]
Enable accounting of swap in memory resource
controller if no parameter or 1 is given or disable controller if no parameter or 1 is given or disable
it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst) it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
...@@ -5914,7 +5896,8 @@ ...@@ -5914,7 +5896,8 @@
tdfx= [HW,DRM] tdfx= [HW,DRM]
test_suspend= [SUSPEND][,N] test_suspend= [SUSPEND]
Format: { "mem" | "standby" | "freeze" }[,N]
Specify "mem" (for Suspend-to-RAM) or "standby" (for Specify "mem" (for Suspend-to-RAM) or "standby" (for
standby suspend) or "freeze" (for suspend type freeze) standby suspend) or "freeze" (for suspend type freeze)
as the system sleep state during system startup with as the system sleep state during system startup with
...@@ -5998,32 +5981,7 @@ ...@@ -5998,32 +5981,7 @@
This will guarantee that all the other pcrs This will guarantee that all the other pcrs
are saved. are saved.
trace_buf_size=nn[KMG] tp_printk [FTRACE]
[FTRACE] will set tracing buffer size on each cpu.
trace_event=[event-list]
[FTRACE] Set and start specified trace events in order
to facilitate early boot debugging. The event-list is a
comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
that can be enabled or disabled just as if you were
to echo the option name into
/sys/kernel/debug/tracing/trace_options
For example, to enable stacktrace option (to dump the
stack trace of each event), add to the command line:
trace_options=stacktrace
See also Documentation/trace/ftrace.rst "trace options"
section.
tp_printk[FTRACE]
Have the tracepoints sent to printk as well as the Have the tracepoints sent to printk as well as the
tracing ring buffer. This is useful for early boot up tracing ring buffer. This is useful for early boot up
where the system hangs or reboots and does not give the where the system hangs or reboots and does not give the
...@@ -6045,7 +6003,7 @@ ...@@ -6045,7 +6003,7 @@
frequency tracepoints such as irq or sched, can cause frequency tracepoints such as irq or sched, can cause
the system to live lock. the system to live lock.
tp_printk_stop_on_boot[FTRACE] tp_printk_stop_on_boot [FTRACE]
When tp_printk (above) is set, it can cause a lot of noise When tp_printk (above) is set, it can cause a lot of noise
on the console. It may be useful to only include the on the console. It may be useful to only include the
printing of events during boot up, as user space may printing of events during boot up, as user space may
...@@ -6054,6 +6012,53 @@ ...@@ -6054,6 +6012,53 @@
This command line option will stop the printing of events This command line option will stop the printing of events
to console at the late_initcall_sync() time frame. to console at the late_initcall_sync() time frame.
trace_buf_size=nn[KMG]
[FTRACE] will set tracing buffer size on each cpu.
trace_clock= [FTRACE] Set the clock used for tracing events
at boot up.
local - Use the per CPU time stamp counter
(converted into nanoseconds). Fast, but
depending on the architecture, may not be
in sync between CPUs.
global - Event time stamps are synchronized across
CPUs. May be slower than the local clock,
but better for some race conditions.
counter - Simple counting of events (1, 2, ..)
note, some counts may be skipped due to the
infrastructure grabbing the clock more than
once per event.
uptime - Use jiffies as the time stamp.
perf - Use the same clock that perf uses.
mono - Use ktime_get_mono_fast_ns() for time stamps.
mono_raw - Use ktime_get_raw_fast_ns() for time
stamps.
boot - Use ktime_get_boot_fast_ns() for time stamps.
Architectures may add more clocks. See
Documentation/trace/ftrace.rst for more details.
trace_event=[event-list]
[FTRACE] Set and start specified trace events in order
to facilitate early boot debugging. The event-list is a
comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
that can be enabled or disabled just as if you were
to echo the option name into
/sys/kernel/debug/tracing/trace_options
For example, to enable stacktrace option (to dump the
stack trace of each event), add to the command line:
trace_options=stacktrace
See also Documentation/trace/ftrace.rst "trace options"
section.
traceoff_on_warning traceoff_on_warning
[FTRACE] enable this option to disable tracing when a [FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can warning is hit. This turns off "tracing_on". Tracing can
...@@ -6405,7 +6410,7 @@ ...@@ -6405,7 +6410,7 @@
HIGHMEM regardless of setting HIGHMEM regardless of setting
of CONFIG_HIGHPTE. of CONFIG_HIGHPTE.
vdso= [X86,SH] vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise: On X86_32, this is an alias for vdso32=. Otherwise:
vdso=1: enable VDSO (the default) vdso=1: enable VDSO (the default)
...@@ -6431,11 +6436,12 @@ ...@@ -6431,11 +6436,12 @@
video= [FB] Frame buffer configuration video= [FB] Frame buffer configuration
See Documentation/fb/modedb.rst. See Documentation/fb/modedb.rst.
video.brightness_switch_enabled= [0,1] video.brightness_switch_enabled= [ACPI]
Format: [0|1]
If set to 1, on receiving an ACPI notify event If set to 1, on receiving an ACPI notify event
generated by hotkey, video driver will adjust brightness generated by hotkey, video driver will adjust brightness
level and then send out the event to user space through level and then send out the event to user space through
the allocated input device; If set to 0, video driver the allocated input device. If set to 0, video driver
will only send out the event without touching backlight will only send out the event without touching backlight
brightness level. brightness level.
default: 1 default: 1
......
...@@ -783,6 +783,13 @@ is useful to define the root cause of RCU stalls using a vmcore. ...@@ -783,6 +783,13 @@ is useful to define the root cause of RCU stalls using a vmcore.
1 panic() after printing RCU stall messages. 1 panic() after printing RCU stall messages.
= ============================================================ = ============================================================
max_rcu_stall_to_panic
======================
When ``panic_on_rcu_stall`` is set to 1, this value determines the
number of times that RCU can stall before panic() is called.
When ``panic_on_rcu_stall`` is set to 0, this value has no effect.
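The interaction between the two sysctls can be sketched as a small user-space model. This is not the kernel's implementation: the struct and function names are hypothetical, and it assumes that panic() fires on the stall that reaches the configured count.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the two sysctls; none of these
 * names are kernel symbols.
 */
struct rcu_stall_policy {
	int panic_on_rcu_stall;		/* 0: never panic, 1: panic allowed */
	int max_rcu_stall_to_panic;	/* stalls tolerated before panic() */
	int stalls_seen;		/* running count of observed stalls */
};

/*
 * Record one RCU stall. Returns true when the modeled kernel would
 * call panic(). Assumption: the panic happens on the stall that
 * brings the count up to max_rcu_stall_to_panic.
 */
bool rcu_stall_should_panic(struct rcu_stall_policy *p)
{
	p->stalls_seen++;

	/* With panic_on_rcu_stall == 0 the limit has no effect. */
	if (!p->panic_on_rcu_stall)
		return false;

	return p->stalls_seen >= p->max_rcu_stall_to_panic;
}
```

With ``panic_on_rcu_stall`` set to 1 and ``max_rcu_stall_to_panic`` set to 3, the model reports a panic on the third stall and not before.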
perf_cpu_time_max_percent perf_cpu_time_max_percent
========================= =========================
......
IDE-CD driver documentation
===========================
:Originally by: scott snyder <snyder@fnald0.fnal.gov> (19 May 1996)
:Carrying on the torch is: Erik Andersen <andersee@debian.org>
:New maintainers (19 Oct 1998): Jens Axboe <axboe@image.dk>
1. Introduction
---------------
The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant
CDROM drives which attach to an IDE interface. Note that some CDROM vendors
(including Mitsumi, Sony, Creative, Aztech, and Goldstar) have made
both ATAPI-compliant drives and drives which use a proprietary
interface. If your drive uses one of those proprietary interfaces,
this driver will not work with it (but one of the other CDROM drivers
probably will). This driver will not work with `ATAPI` drives which
attach to the parallel port. In addition, there is at least one drive
(CyCDROM CR520ie) which attaches to the IDE port but is not ATAPI;
this driver will not work with drives like that either (but see the
aztcd driver).
This driver provides the following features:
- Reading from data tracks, and mounting ISO 9660 filesystems.
- Playing audio tracks. Most of the CDROM player programs floating
around should work; I usually use Workman.
- Multisession support.
- On drives which support it, reading digital audio data directly
from audio tracks. The program cdda2wav can be used for this.
Note, however, that only some drives actually support this.
- There is now support for CDROM changers which comply with the
ATAPI 2.6 draft standard (such as the NEC CDR-251). This additional
functionality includes a function call to query which slot is the
currently selected slot, a function call to query which slots contain
CDs, etc. A sample program which demonstrates this functionality is
appended to the end of this file. The Sanyo 3-disc changer
(which does not conform to the standard) is also now supported.
Please note the driver refers to the first CD as slot # 0.
2. Installation
---------------
0. The ide-cd relies on the ide disk driver. See
Documentation/ide/ide.rst for up-to-date information on the ide
driver.
1. Make sure that the ide and ide-cd drivers are compiled into the
kernel you're using. When configuring the kernel, in the section
entitled "Floppy, IDE, and other block devices", say either `Y`
(which will compile the support directly into the kernel) or `M`
(to compile support as a module which can be loaded and unloaded)
to the options::
ATA/ATAPI/MFM/RLL support
Include IDE/ATAPI CDROM support
Depending on what type of IDE interface you have, you may need to
specify additional configuration options. See
Documentation/ide/ide.rst.
2. You should also ensure that the iso9660 filesystem is either
compiled into the kernel or available as a loadable module. You
can see if a filesystem is known to the kernel by catting
/proc/filesystems.
3. The CDROM drive should be connected to the host on an IDE
interface. Each interface on a system is defined by an I/O port
address and an IRQ number, the standard assignments being
0x1f0 and 14 for the primary interface and 0x170 and 15 for the
secondary interface. Each interface can control up to two devices,
where each device can be a hard drive, a CDROM drive, a floppy drive,
or a tape drive. The two devices on an interface are called `master`
and `slave`; this is usually selectable via a jumper on the drive.
Linux names these devices as follows. The master and slave devices
on the primary IDE interface are called `hda` and `hdb`,
respectively. The drives on the secondary interface are called
`hdc` and `hdd`. (Interfaces at other locations get other letters
in the third position; see Documentation/ide/ide.rst.)
If you want your CDROM drive to be found automatically by the
driver, you should make sure your IDE interface uses either the
primary or secondary addresses mentioned above. In addition, if
the CDROM drive is the only device on the IDE interface, it should
be jumpered as `master`. (If for some reason you cannot configure
your system in this manner, you can probably still use the driver.
You may have to pass extra configuration information to the kernel
when you boot, however. See Documentation/ide/ide.rst for more
information.)
4. Boot the system. If the drive is recognized, you should see a
message which looks like::
hdb: NEC CD-ROM DRIVE:260, ATAPI CDROM drive
If you do not see this, see section 5 below.
5. You may want to create a symbolic link /dev/cdrom pointing to the
actual device. You can do this with the command::
ln -s /dev/hdX /dev/cdrom
where X should be replaced by the letter indicating where your
drive is installed.
6. You should be able to see any error messages from the driver with
the `dmesg` command.
3. Basic usage
--------------
An ISO 9660 CDROM can be mounted by putting the disc in the drive and
typing (as root)::
mount -t iso9660 /dev/cdrom /mnt/cdrom
where it is assumed that /dev/cdrom is a link pointing to the actual
device (as described in step 5 of the last section) and /mnt/cdrom is
an empty directory. You should now be able to see the contents of the
CDROM under the /mnt/cdrom directory. If you want to eject the CDROM,
you must first dismount it with a command like::
umount /mnt/cdrom
Note that audio CDs cannot be mounted.
Some distributions set up /etc/fstab to always try to mount a CDROM
filesystem on bootup. It is not required to mount the CDROM in this
manner, though, and it may be a nuisance if you change CDROMs often.
You should feel free to remove the cdrom line from /etc/fstab and
mount CDROMs manually if that suits you better.
Multisession and photocd discs should work with no special handling.
The hpcdtoppm package (ftp.gwdg.de:/pub/linux/hpcdtoppm/) may be
useful for reading photocds.
To play an audio CD, you should first unmount and remove any data
CDROM. Any of the CDROM player programs should then work (workman,
workbone, cdplayer, etc.).
On a few drives, you can read digital audio directly using a program
such as cdda2wav. The only types of drive which I've heard support
this are Sony and Toshiba drives. You will get errors if you try to
use this function on a drive which does not support it.
For supported changers, you can use the `cdchange` program (appended to
the end of this file) to switch between changer slots. Note that the
drive should be unmounted before attempting this. The program takes
two arguments: the CDROM device, and the slot number to which you wish
to change. If the slot number is -1, the drive is unloaded.
4. Common problems
------------------
This section discusses some common problems encountered when trying to
use the driver, and some possible solutions. Note that if you are
experiencing problems, you should probably also review
Documentation/ide/ide.rst for current information about the underlying
IDE support code. Some of these items apply only to earlier versions
of the driver, but are mentioned here for completeness.
In most cases, you should probably check with `dmesg` for any errors
from the driver.
a. Drive is not detected during booting.
- Review the configuration instructions above and in
Documentation/ide/ide.rst, and check how your hardware is
configured.
- If your drive is the only device on an IDE interface, it should
be jumpered as master, if at all possible.
- If your IDE interface is not at the standard addresses of 0x170
or 0x1f0, you'll need to explicitly inform the driver using a
lilo option. See Documentation/ide/ide.rst. (This feature was
added around kernel version 1.3.30.)
- If the autoprobing is not finding your drive, you can tell the
driver to assume that one exists by using a lilo option of the
form `hdX=cdrom`, where X is the drive letter corresponding to
where your drive is installed. Note that if you do this and you
see a boot message like::
hdX: ATAPI cdrom (?)
this does _not_ mean that the driver has successfully detected
the drive; rather, it means that the driver has not detected a
drive, but is assuming there's one there anyway because you told
it so. If you actually try to do I/O to a drive defined at a
nonexistent or nonresponding I/O address, you'll probably get
errors with a status value of 0xff.
- Some IDE adapters require a nonstandard initialization sequence
before they'll function properly. (If this is the case, there
will often be a separate MS-DOS driver just for the controller.)
IDE interfaces on sound cards often fall into this category.
Support for some interfaces needing extra initialization is
provided in later 1.3.x kernels. You may need to turn on
additional kernel configuration options to get them to work;
see Documentation/ide/ide.rst.
Even if support is not available for your interface, you may be
able to get it to work with the following procedure. First boot
MS-DOS and load the appropriate drivers. Then warm-boot linux
(i.e., without powering off). If this works, it can be automated
by running loadlin from the MS-DOS autoexec.
b. Timeout/IRQ errors.
- If you always get timeout errors, interrupts from the drive are
probably not making it to the host.
- IRQ problems may also be indicated by the message
`IRQ probe failed (<n>)` while booting. If <n> is zero, that
means that the system did not see an interrupt from the drive when
it was expecting one (on any feasible IRQ). If <n> is negative,
that means the system saw interrupts on multiple IRQ lines, when
it was expecting to receive just one from the CDROM drive.
- Double-check your hardware configuration to make sure that the IRQ
number of your IDE interface matches what the driver expects.
(The usual assignments are 14 for the primary (0x1f0) interface
and 15 for the secondary (0x170) interface.) Also be sure that
you don't have some other hardware which might be conflicting with
the IRQ you're using. Also check the BIOS setup for your system;
some have the ability to disable individual IRQ levels, and I've
had one report of a system which was shipped with IRQ 15 disabled
by default.
- Note that many MS-DOS CDROM drivers will still function even if
there are hardware problems with the interrupt setup; they
apparently don't use interrupts.
- If you own a Pioneer DR-A24X, you _will_ get nasty error messages
on boot such as "irq timeout: status=0x50 { DriveReady SeekComplete }"
The Pioneer DR-A24X CDROM drives are fairly popular these days.
Unfortunately, these drives seem to become very confused when we perform
the standard Linux ATA disk drive probe. If you own one of these drives,
you can bypass the ATA probing which confuses these CDROM drives, by
adding `append="hdX=noprobe hdX=cdrom"` to your lilo.conf file and running
lilo (again where X is the drive letter corresponding to where your drive
is installed.)
c. System hangups.
- If the system locks up when you try to access the CDROM, the most
likely cause is that you have a buggy IDE adapter which doesn't
properly handle simultaneous transactions on multiple interfaces.
The most notorious of these is the CMD640B chip. This problem can
be worked around by specifying the `serialize` option when
booting. Recent kernels should be able to detect the need for
this automatically in most cases, but the detection is not
foolproof. See Documentation/ide/ide.rst for more information
about the `serialize` option and the CMD640B.
- Note that many MS-DOS CDROM drivers will work with such buggy
hardware, apparently because they never attempt to overlap CDROM
operations with other disk activity.
d. Can't mount a CDROM.
- If you get errors from mount, it may help to check `dmesg` to see
if there are any more specific errors from the driver or from the
filesystem.
- Make sure there's a CDROM loaded in the drive, and that it's an
ISO 9660 disc. You can't mount an audio CD.
- With the CDROM in the drive and unmounted, try something like::
cat /dev/cdrom | od | more
If you see a dump, then the drive and driver are probably working
OK, and the problem is at the filesystem level (i.e., the CDROM is
not ISO 9660 or has errors in the filesystem structure).
- If you see `not a block device` errors, check that the definitions
of the device special files are correct. They should be as
follows::
brw-rw---- 1 root disk 3, 0 Nov 11 18:48 /dev/hda
brw-rw---- 1 root disk 3, 64 Nov 11 18:48 /dev/hdb
brw-rw---- 1 root disk 22, 0 Nov 11 18:48 /dev/hdc
brw-rw---- 1 root disk 22, 64 Nov 11 18:48 /dev/hdd
Some early Slackware releases had these defined incorrectly. If
these are wrong, you can remake them by running the script
scripts/MAKEDEV.ide. (You may have to make it executable
with chmod first.)
If you have a /dev/cdrom symbolic link, check that it is pointing
to the correct device file.
If you hear people talking of the devices `hd1a` and `hd1b`, these
were old names for what are now called hdc and hdd. Those names
should be considered obsolete.
- If mount is complaining that the iso9660 filesystem is not
available, but you know it is (check /proc/filesystems), you
probably need a newer version of mount. Early versions would not
always give meaningful error messages.
e. Directory listings are unpredictably truncated, and `dmesg` shows
`buffer botch` error messages from the driver.
- There was a bug in the version of the driver in 1.2.x kernels
which could cause this. It was fixed in 1.3.0. If you can't
upgrade, you can probably work around the problem by specifying a
blocksize of 2048 when mounting. (Note that you won't be able to
directly execute binaries off the CDROM in that case.)
If you see this in kernels later than 1.3.0, please report it as a
bug.
f. Data corruption.
- Random data corruption was occasionally observed with the Hitachi
CDR-7730 CDROM. If you experience data corruption, using "hdx=slow"
as a command line parameter may work around the problem, at the
expense of low system performance.
5. cdchange.c
-------------
::
/*
* cdchange.c [-v] <device> [<slot>]
*
* This loads a CDROM from a specified slot in a changer, and displays
* information about the changer status. The drive should be unmounted before
* using this program.
*
* Changer information is displayed if either the -v flag is specified
* or no slot was specified.
*
* Based on code originally from Gerhard Zuber <zuber@berlin.snafu.de>.
* Changer status information, and rewrite for the new Uniform CDROM driver
* interface by Erik Andersen <andersee@debian.org>.
*/
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/cdrom.h>
int
main (int argc, char **argv)
{
char *program;
char *device;
int fd; /* file descriptor for CD-ROM device */
int status; /* return status for system calls */
int verbose = 0;
int slot=-1, x_slot;
int total_slots_available;
program = argv[0];
++argv;
--argc;
if (argc < 1 || argc > 3) {
fprintf (stderr, "usage: %s [-v] <device> [<slot>]\n",
program);
fprintf (stderr, " Slots are numbered 1 -- n.\n");
exit (1);
}
if (strcmp (argv[0], "-v") == 0) {
verbose = 1;
++argv;
--argc;
}
device = argv[0];
if (argc == 2)
slot = atoi (argv[1]) - 1;
/* open device */
fd = open(device, O_RDONLY | O_NONBLOCK);
if (fd < 0) {
fprintf (stderr, "%s: open failed for `%s`: %s\n",
program, device, strerror (errno));
exit (1);
}
/* Check CD player status */
total_slots_available = ioctl (fd, CDROM_CHANGER_NSLOTS);
if (total_slots_available <= 1 ) {
fprintf (stderr, "%s: Device `%s` is not an ATAPI "
"compliant CD changer.\n", program, device);
exit (1);
}
if (slot >= 0) {
if (slot >= total_slots_available) {
fprintf (stderr, "Bad slot number. "
"Should be 1 -- %d.\n",
total_slots_available);
exit (1);
}
/* load */
slot=ioctl (fd, CDROM_SELECT_DISC, slot);
if (slot<0) {
fflush(stdout);
perror ("CDROM_SELECT_DISC ");
exit(1);
}
}
if (slot < 0 || verbose) {
status=ioctl (fd, CDROM_SELECT_DISC, CDSL_CURRENT);
if (status<0) {
fflush(stdout);
perror (" CDROM_SELECT_DISC");
exit(1);
}
slot=status;
printf ("Current slot: %d\n", slot+1);
printf ("Total slots available: %d\n",
total_slots_available);
printf ("Drive status: ");
status = ioctl (fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
if (status<0) {
perror(" CDROM_DRIVE_STATUS");
} else switch(status) {
case CDS_DISC_OK:
printf ("Ready.\n");
break;
case CDS_TRAY_OPEN:
printf ("Tray Open.\n");
break;
case CDS_DRIVE_NOT_READY:
printf ("Drive Not Ready.\n");
break;
default:
printf ("This Should not happen!\n");
break;
}
for (x_slot=0; x_slot<total_slots_available; x_slot++) {
printf ("Slot %2d: ", x_slot+1);
status = ioctl (fd, CDROM_DRIVE_STATUS, x_slot);
if (status<0) {
perror(" CDROM_DRIVE_STATUS");
} else switch(status) {
case CDS_DISC_OK:
printf ("Disc present.");
break;
case CDS_NO_DISC:
printf ("Empty slot.");
break;
case CDS_TRAY_OPEN:
printf ("CD-ROM tray open.\n");
break;
case CDS_DRIVE_NOT_READY:
printf ("CD-ROM drive not ready.\n");
break;
case CDS_NO_INFO:
printf ("No Information available.");
break;
default:
printf ("This Should not happen!\n");
break;
}
if (slot == x_slot) {
status = ioctl (fd, CDROM_DISC_STATUS);
if (status<0) {
perror(" CDROM_DISC_STATUS");
}
switch (status) {
case CDS_AUDIO:
printf ("\tAudio disc.\t");
break;
case CDS_DATA_1:
case CDS_DATA_2:
printf ("\tData disc type %d.\t", status-CDS_DATA_1+1);
break;
case CDS_XA_2_1:
case CDS_XA_2_2:
printf ("\tXA data disc type %d.\t", status-CDS_XA_2_1+1);
break;
default:
printf ("\tUnknown disc type 0x%x!\t", status);
break;
}
}
status = ioctl (fd, CDROM_MEDIA_CHANGED, x_slot);
if (status<0) {
perror(" CDROM_MEDIA_CHANGED");
}
switch (status) {
case 1:
printf ("Changed.\n");
break;
default:
printf ("\n");
break;
}
}
}
/* close device */
status = close (fd);
if (status != 0) {
fprintf (stderr, "%s: close failed for `%s`: %s\n",
program, device, strerror (errno));
exit (1);
}
exit (0);
}
...@@ -8,7 +8,6 @@ cdrom ...@@ -8,7 +8,6 @@ cdrom
:maxdepth: 1 :maxdepth: 1
cdrom-standard cdrom-standard
ide-cd
packet-writing packet-writing
.. only:: subproject and html .. only:: subproject and html
......
...@@ -18,6 +18,7 @@ it. ...@@ -18,6 +18,7 @@ it.
kernel-api kernel-api
workqueue workqueue
watch_queue
printk-basics printk-basics
printk-formats printk-formats
printk-index printk-index
......
...@@ -115,34 +115,32 @@ The diagnostic data field is optional, and results which have neither a ...@@ -115,34 +115,32 @@ The diagnostic data field is optional, and results which have neither a
directive nor any diagnostic data do not need to include the "#" field directive nor any diagnostic data do not need to include the "#" field
separator. separator.
Example result lines include: Example result lines include::
.. code-block:: none
ok 1 test_case_name ok 1 test_case_name
The test "test_case_name" passed. The test "test_case_name" passed.
.. code-block:: none ::
not ok 1 test_case_name not ok 1 test_case_name
The test "test_case_name" failed. The test "test_case_name" failed.
.. code-block:: none ::
ok 1 test # SKIP necessary dependency unavailable ok 1 test # SKIP necessary dependency unavailable
The test "test" was SKIPPED with the diagnostic message "necessary dependency The test "test" was SKIPPED with the diagnostic message "necessary dependency
unavailable". unavailable".
.. code-block:: none ::
not ok 1 test # TIMEOUT 30 seconds not ok 1 test # TIMEOUT 30 seconds
The test "test" timed out, with diagnostic data "30 seconds". The test "test" timed out, with diagnostic data "30 seconds".
.. code-block:: none ::
ok 5 check return code # rcode=0 ok 5 check return code # rcode=0
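The result lines above could be split into their fields with a few lines of C. This is only a sketch: ``struct ktap_result`` and ``ktap_parse_result()`` are hypothetical names, not part of any kernel tool. It assumes the description contains no "#" character, and it keeps the directive and diagnostic data together as one string rather than separating them.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed KTAP result line. */
struct ktap_result {
	bool ok;		/* true for "ok", false for "not ok" */
	int number;		/* test number */
	char name[64];		/* description, may be empty */
	char trailer[64];	/* text after "#": directive and/or diagnostic data */
};

/* Copy len bytes of src into dst, dropping trailing spaces and newlines. */
static void copy_trimmed(char *dst, size_t dst_size, const char *src, size_t len)
{
	while (len > 0 && (src[len - 1] == ' ' || src[len - 1] == '\n'))
		len--;
	if (len >= dst_size)
		len = dst_size - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
}

/* Returns 0 on success, -1 if the line is not a result line. */
int ktap_parse_result(const char *line, struct ktap_result *r)
{
	const char *p = line;
	const char *hash;
	int consumed = 0;

	memset(r, 0, sizeof(*r));

	/* Nested results are indented; skip the indentation. */
	while (*p == ' ')
		p++;

	if (strncmp(p, "not ok ", 7) == 0) {
		p += 7;
	} else if (strncmp(p, "ok ", 3) == 0) {
		r->ok = true;
		p += 3;
	} else {
		return -1;
	}

	if (sscanf(p, "%d%n", &r->number, &consumed) != 1)
		return -1;
	p += consumed;
	while (*p == ' ')
		p++;

	/* Everything up to the optional "#" is the description. */
	hash = strchr(p, '#');
	copy_trimmed(r->name, sizeof(r->name), p,
		     hash ? (size_t)(hash - p) : strlen(p));

	if (hash) {
		hash++;
		while (*hash == ' ')
			hash++;
		copy_trimmed(r->trailer, sizeof(r->trailer), hash, strlen(hash));
	}
	return 0;
}
```

Lines that do not start with "ok" or "not ok", such as the version line or the plan line, are rejected with -1, so a caller would handle those separately.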
...@@ -202,7 +200,7 @@ allowed to be either indented or not indented. ...@@ -202,7 +200,7 @@ allowed to be either indented or not indented.
An example of a test with two nested subtests: An example of a test with two nested subtests:
.. code-block:: none ::
KTAP version 1 KTAP version 1
1..1 1..1
...@@ -215,7 +213,7 @@ An example of a test with two nested subtests: ...@@ -215,7 +213,7 @@ An example of a test with two nested subtests:
An example format with multiple levels of nested testing: An example format with multiple levels of nested testing:
.. code-block:: none ::
KTAP version 1 KTAP version 1
1..2 1..2
...@@ -250,7 +248,7 @@ nested version line, uses a line of the form ...@@ -250,7 +248,7 @@ nested version line, uses a line of the form
Example KTAP output Example KTAP output
-------------------- --------------------
.. code-block:: none ::
KTAP version 1 KTAP version 1
1..1 1..1
......
...@@ -125,7 +125,7 @@ All expectations/assertions are formatted as: ...@@ -125,7 +125,7 @@ All expectations/assertions are formatted as:
``void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)``. ``void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)``.
- ``kunit_try_catch_throw`` calls function: - ``kunit_try_catch_throw`` calls function:
``void complete_and_exit(struct completion *, long) __noreturn;`` ``void kthread_complete_and_exit(struct completion *, long) __noreturn;``
and terminates the special thread context. and terminates the special thread context.
- ``<op>`` denotes a check with options: ``TRUE`` (supplied property - ``<op>`` denotes a check with options: ``TRUE`` (supplied property
......
...@@ -115,3 +115,66 @@ that none of these errors are occurring during the test. ...@@ -115,3 +115,66 @@ that none of these errors are occurring during the test.
Some of these tools integrate with KUnit or kselftest and will Some of these tools integrate with KUnit or kselftest and will
automatically fail tests if an issue is detected. automatically fail tests if an issue is detected.
Static Analysis Tools
=====================
In addition to testing a running kernel, one can also analyze kernel source code
directly (**at compile time**) using **static analysis** tools. The tools
commonly used in the kernel allow one to inspect the whole source tree or just
specific files within it. They make it easier to detect and fix problems during
the development process.
Sparse can help test the kernel by performing type-checking, lock checking, and
value range checking, in addition to reporting various errors and warnings while
examining the code. See the Documentation/dev-tools/sparse.rst documentation
page for details on how to use it.
Smatch extends Sparse and provides additional checks for programming logic
mistakes such as missing breaks in switch statements, unused return values on
error checking, forgetting to set an error code in the return of an error path,
etc. Smatch also has tests against more serious issues such as integer
overflows, null pointer dereferences, and memory leaks. See the project page at
http://smatch.sourceforge.net/.
Coccinelle is another static analyzer at our disposal. Coccinelle is often used
to aid refactoring and collateral evolution of source code, but it can also help
to avoid certain bugs that occur in common code patterns. The types of tests
available include API tests, tests for correct usage of kernel iterators, checks
for the soundness of free operations, analysis of locking behavior, and further
tests known to help keep consistent kernel usage. See the
Documentation/dev-tools/coccinelle.rst documentation page for details.
Beware, though, that static analysis tools suffer from **false positives**.
Errors and warnings need to be evaluated carefully before attempting to fix them.
When to use Sparse and Smatch
-----------------------------
Sparse does type checking, such as verifying that annotated variables do not
cause endianness bugs, detecting places that use ``__user`` pointers improperly,
and analyzing the compatibility of symbol initializers.
Smatch does flow analysis and, if allowed to build the function database, it
also does cross function analysis. Smatch tries to answer questions like where
is this buffer allocated? How big is it? Can this index be controlled by the
user? Is this variable larger than that variable?
It's generally easier to write checks in Smatch than it is to write checks in
Sparse. Nevertheless, there are some overlaps between Sparse and Smatch checks.
Strong points of Smatch and Coccinelle
--------------------------------------
Coccinelle is probably the easiest for writing checks. It works before the
pre-processor so it's easier to check for bugs in macros using Coccinelle.
Coccinelle also creates patches for you, which no other tool does.
For example, with Coccinelle you can do a mass conversion from
``kmalloc(x * size, GFP_KERNEL)`` to ``kmalloc_array(x, size, GFP_KERNEL)``, and
that's really useful. If you just created a Smatch warning and tried to push the
work of converting on to the maintainers, they would be annoyed. You'd have to
argue about whether each warning can really overflow or not.
Coccinelle does no analysis of variable values, which is the strong point of
Smatch. On the other hand, Coccinelle allows you to do simple things in a simple
way.
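To see why the ``kmalloc(x * size, ...)`` to ``kmalloc_array(x, size, ...)`` conversion matters, consider what happens when the multiplication wraps around. The following is a user-space sketch only: ``alloc_array()`` is a hypothetical stand-in built on ``malloc()``, not the kernel's ``kmalloc_array()``, and it relies on the ``__builtin_mul_overflow()`` builtin provided by GCC and Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical user-space stand-in for an overflow-checked array
 * allocation. A plain malloc(n * size) would silently wrap around for
 * large n and return a buffer that is far too small.
 */
void *alloc_array(size_t n, size_t size)
{
	size_t bytes;

	/* Refuse the allocation if n * size does not fit in size_t. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;

	return malloc(bytes);
}
```

A request such as ``alloc_array(SIZE_MAX, 2)`` fails cleanly here, whereas the unchecked product would wrap to ``SIZE_MAX - 1``.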
...@@ -79,8 +79,9 @@ simplistic idea of what C comment blocks look like. This problem had been ...@@ -79,8 +79,9 @@ simplistic idea of what C comment blocks look like. This problem had been
present since that comment was added in 2016 — a full four years. Fixing present since that comment was added in 2016 — a full four years. Fixing
it was a matter of adding the missing asterisks. A quick look at the it was a matter of adding the missing asterisks. A quick look at the
history for that file showed what the normal format for subject lines is, history for that file showed what the normal format for subject lines is,
and ``scripts/get_maintainer.pl`` told me who should receive it. The and ``scripts/get_maintainer.pl`` told me who should receive it (pass paths to
resulting patch looked like this:: your patches as arguments to scripts/get_maintainer.pl). The resulting patch
looked like this::
[PATCH] PM / devfreq: Fix two malformed kerneldoc comments [PATCH] PM / devfreq: Fix two malformed kerneldoc comments
......
===========================
Writing kernel-doc comments Writing kernel-doc comments
=========================== ===========================
...@@ -436,6 +437,7 @@ The title following ``DOC:`` acts as a heading within the source file, but also ...@@ -436,6 +437,7 @@ The title following ``DOC:`` acts as a heading within the source file, but also
as an identifier for extracting the documentation comment. Thus, the title must as an identifier for extracting the documentation comment. Thus, the title must
be unique within the file. be unique within the file.
=============================
Including kernel-doc comments Including kernel-doc comments
============================= =============================
......
.. _sphinxdoc: .. _sphinxdoc:
Introduction =====================================
============ Using Sphinx for kernel documentation
=====================================
The Linux kernel uses `Sphinx`_ to generate pretty documentation from The Linux kernel uses `Sphinx`_ to generate pretty documentation from
`reStructuredText`_ files under ``Documentation``. To build the documentation in `reStructuredText`_ files under ``Documentation``. To build the documentation in
......
...@@ -249,7 +249,7 @@ CLOCK ...@@ -249,7 +249,7 @@ CLOCK
devm_clk_bulk_get() devm_clk_bulk_get()
devm_clk_bulk_get_all() devm_clk_bulk_get_all()
devm_clk_bulk_get_optional() devm_clk_bulk_get_optional()
devm_get_clk_from_childl() devm_get_clk_from_child()
devm_clk_hw_register() devm_clk_hw_register()
devm_of_clk_add_hw_provider() devm_of_clk_add_hw_provider()
devm_clk_hw_register_clkdev() devm_clk_hw_register_clkdev()
......
@@ -4,7 +4,7 @@
 Intel(R) Dynamic Platform and Thermal Framework Sysfs Interface
 ===============================================================
-:Copyright: |copy| 2022 Intel Corporation
+:Copyright: © 2022 Intel Corporation
 :Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
......
@@ -132,16 +132,16 @@ configuration of fault-injection capabilities.
 Format: { 'Y' | 'N' }
-default is 'N', setting it to 'Y' won't inject failures into
-highmem/user allocations.
+default is 'Y', setting it to 'N' will also inject failures into
+highmem/user allocations (__GFP_HIGHMEM allocations).
 - /sys/kernel/debug/failslab/ignore-gfp-wait:
 - /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait:
 Format: { 'Y' | 'N' }
-default is 'N', setting it to 'Y' will inject failures
-only into non-sleep allocations (GFP_ATOMIC allocations).
+default is 'Y', setting it to 'N' will also inject failures
+into allocations that can sleep (__GFP_DIRECT_RECLAIM allocations).
 - /sys/kernel/debug/fail_page_alloc/min-order:
@@ -280,7 +280,7 @@ Application Examples
 printf %#x -1 > /sys/kernel/debug/$FAILTYPE/times
 echo 0 > /sys/kernel/debug/$FAILTYPE/space
 echo 2 > /sys/kernel/debug/$FAILTYPE/verbose
-echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
+echo Y > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
 faulty_system()
 {
@@ -334,8 +334,8 @@ Application Examples
 printf %#x -1 > /sys/kernel/debug/$FAILTYPE/times
 echo 0 > /sys/kernel/debug/$FAILTYPE/space
 echo 2 > /sys/kernel/debug/$FAILTYPE/verbose
-echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
-echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-highmem
+echo Y > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
+echo Y > /sys/kernel/debug/$FAILTYPE/ignore-gfp-highmem
 echo 10 > /sys/kernel/debug/$FAILTYPE/stacktrace-depth
 trap "echo 0 > /sys/kernel/debug/$FAILTYPE/probability" SIGINT SIGTERM EXIT
......
/*
* 1.00 Oct 31, 1994 -- Initial version.
* 1.01 Nov 2, 1994 -- Fixed problem with starting request in
* cdrom_check_status.
* 1.03 Nov 25, 1994 -- leaving unmask_intr[] as a user-setting (as for disks)
* (from mlord) -- minor changes to cdrom_setup()
* -- renamed ide_dev_s to ide_drive_t, enable irq on command
* 2.00 Nov 27, 1994 -- Generalize packet command interface;
* add audio ioctls.
* 2.01 Dec 3, 1994 -- Rework packet command interface to handle devices
* which send an interrupt when ready for a command.
* 2.02 Dec 11, 1994 -- Cache the TOC in the driver.
* Don't use SCMD_PLAYAUDIO_TI; it's not included
* in the current version of ATAPI.
* Try to use LBA instead of track or MSF addressing
* when possible.
* Don't wait for READY_STAT.
* 2.03 Jan 10, 1995 -- Rewrite block read routines to handle block sizes
* other than 2k and to move multiple sectors in a
* single transaction.
* 2.04 Apr 21, 1995 -- Add work-around for Creative Labs CD220E drives.
* Thanks to Nick Saw <cwsaw@pts7.pts.mot.com> for
* help in figuring this out. Ditto for Acer and
* Aztech drives, which seem to have the same problem.
* 2.04b May 30, 1995 -- Fix to match changes in ide.c version 3.16 -ml
* 2.05 Jun 8, 1995 -- Don't attempt to retry after an illegal request
* or data protect error.
* Use HWIF and DEV_HWIF macros as in ide.c.
* Always try to do a request_sense after
* a failed command.
* Include an option to give textual descriptions
* of ATAPI errors.
* Fix a bug in handling the sector cache which
* showed up if the drive returned data in 512 byte
* blocks (like Pioneer drives). Thanks to
* Richard Hirst <srh@gpt.co.uk> for diagnosing this.
* Properly supply the page number field in the
* MODE_SELECT command.
* PLAYAUDIO12 is broken on the Aztech; work around it.
* 2.05x Aug 11, 1995 -- lots of data structure renaming/restructuring in ide.c
* (my apologies to Scott, but now ide-cd.c is independent)
* 3.00 Aug 22, 1995 -- Implement CDROMMULTISESSION ioctl.
* Implement CDROMREADAUDIO ioctl (UNTESTED).
* Use input_ide_data() and output_ide_data().
* Add door locking.
* Fix usage count leak in cdrom_open, which happened
* when a read-write mount was attempted.
* Try to load the disk on open.
* Implement CDROMEJECT_SW ioctl (off by default).
* Read total cdrom capacity during open.
* Rearrange logic in cdrom_decode_status. Issue
* request sense commands for failed packet commands
* from here instead of from cdrom_queue_packet_command.
* Fix a race condition in retrieving error information.
* Suppress printing normal unit attention errors and
* some drive not ready errors.
* Implement CDROMVOLREAD ioctl.
* Implement CDROMREADMODE1/2 ioctls.
* Fix race condition in setting up interrupt handlers
* when the `serialize' option is used.
* 3.01 Sep 2, 1995 -- Fix ordering of reenabling interrupts in
* cdrom_queue_request.
* Another try at using ide_[input,output]_data.
* 3.02 Sep 16, 1995 -- Stick total disk capacity in partition table as well.
* Make VERBOSE_IDE_CD_ERRORS dump failed command again.
* Dump out more information for ILLEGAL REQUEST errs.
* Fix handling of errors occurring before the
* packet command is transferred.
* Fix transfers with odd bytelengths.
* 3.03 Oct 27, 1995 -- Some Creative drives have an id of just `CD'.
* `DCI-2S10' drives are broken too.
* 3.04 Nov 20, 1995 -- So are Vertos drives.
* 3.05 Dec 1, 1995 -- Changes to go with overhaul of ide.c and ide-tape.c
* 3.06 Dec 16, 1995 -- Add support needed for partitions.
* More workarounds for Vertos bugs (based on patches
* from Holger Dietze <dietze@aix520.informatik.uni-leipzig.de>).
* Try to eliminate byteorder assumptions.
* Use atapi_cdrom_subchnl struct definition.
* Add STANDARD_ATAPI compilation option.
* 3.07 Jan 29, 1996 -- More twiddling for broken drives: Sony 55D,
* Vertos 300.
* Add NO_DOOR_LOCKING configuration option.
* Handle drive_cmd requests w/NULL args (for hdparm -t).
* Work around sporadic Sony55e audio play problem.
* 3.07a Feb 11, 1996 -- check drive->id for NULL before dereferencing, to fix
* problem with "hde=cdrom" with no drive present. -ml
* 3.08 Mar 6, 1996 -- More Vertos workarounds.
* 3.09 Apr 5, 1996 -- Add CDROMCLOSETRAY ioctl.
* Switch to using MSF addressing for audio commands.
* Reformat to match kernel tabbing style.
* Add CDROM_GET_UPC ioctl.
* 3.10 Apr 10, 1996 -- Fix compilation error with STANDARD_ATAPI.
* 3.11 Apr 29, 1996 -- Patch from Heiko Eißfeldt <heiko@colossus.escape.de>
* to remove redundant verify_area calls.
* 3.12 May 7, 1996 -- Rudimentary changer support. Based on patches
* from Gerhard Zuber <zuber@berlin.snafu.de>.
* Let open succeed even if there's no loaded disc.
* 3.13 May 19, 1996 -- Fixes for changer code.
* 3.14 May 29, 1996 -- Add work-around for Vertos 600.
* (From Hennus Bergman <hennus@sky.ow.nl>.)
* 3.15 July 2, 1996 -- Added support for Sanyo 3 CD changers
* from Ben Galliart <bgallia@luc.edu> with
* special help from Jeff Lightfoot
* <jeffml@pobox.com>
* 3.15a July 9, 1996 -- Improved Sanyo 3 CD changer identification
* 3.16 Jul 28, 1996 -- Fix from Gadi to reduce kernel stack usage for ioctl.
* 3.17 Sep 17, 1996 -- Tweak audio reads for some drives.
* Start changing CDROMLOADFROMSLOT to CDROM_SELECT_DISC.
* 3.18 Oct 31, 1996 -- Added module and DMA support.
*
* 4.00 Nov 5, 1996 -- New ide-cd maintainer,
* Erik B. Andersen <andersee@debian.org>
* -- Newer Creative drives don't always set the error
* register correctly. Make sure we see media changes
* regardless.
* -- Integrate with generic cdrom driver.
* -- CDROMGETSPINDOWN and CDROMSETSPINDOWN ioctls, based on
* a patch from Ciro Cattuto <>.
* -- Call set_device_ro.
* -- Implement CDROMMECHANISMSTATUS and CDROMSLOTTABLE
* ioctls, based on patch by Erik Andersen
* -- Add some probes of drive capability during setup.
*
* 4.01 Nov 11, 1996 -- Split into ide-cd.c and ide-cd.h
* -- Removed CDROMMECHANISMSTATUS and CDROMSLOTTABLE
* ioctls in favor of a generalized approach
* using the generic cdrom driver.
* -- Fully integrated with the 2.1.X kernel.
* -- Other stuff that I forgot (lots of changes)
*
* 4.02 Dec 01, 1996 -- Applied patch from Gadi Oxman <gadio@netvision.net.il>
* to fix the drive door locking problems.
*
* 4.03 Dec 04, 1996 -- Added DSC overlap support.
* 4.04 Dec 29, 1996 -- Added CDROMREADRAW ioctl based on patch
* by Ales Makarov (xmakarov@sun.felk.cvut.cz)
*
* 4.05 Nov 20, 1997 -- Modified to print more drive info on init
* Minor other changes
* Fix errors on CDROMSTOP (If you have a "Dolphin",
* you must define IHAVEADOLPHIN)
* Added identifier so new Sanyo CD-changer works
* Better detection if door locking isn't supported
*
* 4.06 Dec 17, 1997 -- fixed endless "tray open" messages -ml
* 4.07 Dec 17, 1997 -- fallback to set pc->stat on "tray open"
* 4.08 Dec 18, 1997 -- spew less noise when tray is empty
* -- fix speed display for ACER 24X, 18X
* 4.09 Jan 04, 1998 -- fix handling of the last block so we return
* an end of file instead of an I/O error (Gadi)
* 4.10 Jan 24, 1998 -- fixed a bug so now changers can change to a new
* slot when there is no disc in the current slot.
* -- Fixed a memory leak where info->changer_info was
* malloc'ed but never free'd when closing the device.
* -- Cleaned up the global namespace a bit by making more
* functions static that should already have been.
* 4.11 Mar 12, 1998 -- Added support for the CDROM_SELECT_SPEED ioctl
* based on a patch for 2.0.33 by Jelle Foks
* <jelle@scintilla.utwente.nl>, a patch for 2.0.33
* by Toni Giorgino <toni@pcape2.pi.infn.it>, the SCSI
* version, and my own efforts. -erik
* -- Fixed a stupid bug which egcs was kind enough to
* inform me of where "Illegal mode for this track"
* was never returned due to a comparison on data
* types of limited range.
* 4.12 Mar 29, 1998 -- Fixed bug in CDROM_SELECT_SPEED so write speed is
* now set only for CD-R and CD-RW drives. I had
* removed this support because it produced errors.
* It produced errors _only_ for non-writers. duh.
* 4.13 May 05, 1998 -- Suppress useless "in progress of becoming ready"
* messages, since this is not an error.
* -- Change error messages to be const
* -- Remove a "\t" which looks ugly in the syslogs
* 4.14 July 17, 1998 -- Change to pointing to .ps version of ATAPI spec
* since the .pdf version doesn't seem to work...
* -- Updated the TODO list to something more current.
*
* 4.15 Aug 25, 1998 -- Updated ide-cd.h to respect machine endianness,
* patch thanks to "Eddie C. Dost" <ecd@skynet.be>
*
* 4.50 Oct 19, 1998 -- New maintainers!
* Jens Axboe <axboe@image.dk>
* Chris Zwilling <chris@cloudnet.com>
*
* 4.51 Dec 23, 1998 -- Jens Axboe <axboe@image.dk>
* - ide_cdrom_reset enabled since the ide subsystem
* handles resets fine now. <axboe@image.dk>
* - Transfer size fix for Samsung CD-ROMs, thanks to
* "Ville Hallik" <ville.hallik@mail.ee>.
* - other minor stuff.
*
* 4.52 Jan 19, 1999 -- Jens Axboe <axboe@image.dk>
* - Detect DVD-ROM/RAM drives
*
* 4.53 Feb 22, 1999 - Include other model Samsung and one Goldstar
* drive in transfer size limit.
* - Fix the I/O error when doing eject without a medium
* loaded on some drives.
* - CDROMREADMODE2 is now implemented through
* CDROMREADRAW, since many drives don't support
* MODE2 (even though ATAPI 2.6 says they must).
* - Added ignore parameter to ide-cd (as a module), eg
* insmod ide-cd ignore='hda hdb'
* Useful when using ide-cd in conjunction with
* ide-scsi. TODO: non-modular way of doing the
* same.
*
* 4.54 Aug 5, 1999 - Support for MMC2 class commands through the generic
* packet interface to cdrom.c.
* - Unified audio ioctl support, most of it.
* - cleaned up various deprecated verify_area().
* - Added ide_cdrom_packet() as the interface for
* the Uniform generic_packet().
* - bunch of other stuff, will fill in logs later.
* - report 1 slot for non-changers, like the other
* cd-rom drivers. don't report select disc for
* non-changers as well.
* - mask out audio playing, if the device can't do it.
*
* 4.55 Sep 1, 1999 - Eliminated the rest of the audio ioctls, except
* for CDROMREADTOC[ENTRY|HEADER]. Some of the drivers
* use this independently of the actual audio handling.
* They will disappear later when I get the time to
* do it cleanly.
* - Minimize the TOC reading - only do it when we
* know a media change has occurred.
* - Moved all the CDROMREADx ioctls to the Uniform layer.
* - Heiko Eißfeldt <heiko@colossus.escape.de> supplied
* some fixes for CDI.
* - CD-ROM leaving door locked fix from Andries
* Brouwer <Andries.Brouwer@cwi.nl>
* - Erik Andersen <andersen@xmission.com> unified
* commands across the various drivers and how
* sense errors are handled.
*
* 4.56 Sep 12, 1999 - Removed changer support - it is now in the
* Uniform layer.
* - Added partition based multisession handling.
* - Mode sense and mode select moved to the
* Uniform layer.
* - Fixed a problem with WPI CDS-32X drive - it
* failed the capabilities
*
* 4.57 Apr 7, 2000 - Fixed sense reporting.
* - Fixed possible oops in ide_cdrom_get_last_session()
* - Fix locking mania and make ide_cdrom_reset relock
* tray if it was locked prior to the reset.
* - Stop spewing errors to log when magicdev polls with
* TEST_UNIT_READY on some drives.
* - Various fixes from Tobias Ringstrom:
* - cdrom_read_capacity returns one frame too little.
* - Fix real capacity reporting.
*
* 4.58 May 1, 2000 - Clean up ACER50 stuff.
* - Fix small problem with ide_cdrom_capacity
*
* 4.59 Aug 11, 2000 - Fix changer problem in cdrom_read_toc, we weren't
* correctly sensing a disc change.
* - Rearranged some code
* - Use extended sense on drives that support it for
* correctly reporting tray status -- from
* Michael D Johnson <johnsom@orst.edu>
* 4.60 Dec 17, 2003 - Add mt rainier support
* - Bump timeout for packet commands, matches sr
* - Odd stuff
* 4.61 Jan 22, 2004 - support hardware sector sizes other than 2kB,
* Pascal Schmidt <der.eremit@email.de>
*/
/*
* Many thanks to Lode Leroy <Lode.Leroy@www.ibase.be>, who tested so many
* ALPHA patches to this driver on an EASYSTOR LS-120 ATAPI floppy drive.
*
* Ver 0.1 Oct 17 96 Initial test version, mostly based on ide-tape.c.
* Ver 0.2 Oct 31 96 Minor changes.
* Ver 0.3 Dec 2 96 Fixed error recovery bug.
* Ver 0.4 Jan 26 97 Add support for the HDIO_GETGEO ioctl.
* Ver 0.5 Feb 21 97 Add partitions support.
* Use the minimum of the LBA and CHS capacities.
* Avoid hwgroup->rq == NULL on the last irq.
* Fix potential null dereferencing with DEBUG_LOG.
* Ver 0.8 Dec 7 97 Increase irq timeout from 10 to 50 seconds.
* Add media write-protect detection.
* Issue START command only if TEST UNIT READY fails.
* Add work-around for IOMEGA ZIP revision 21.D.
* Remove idefloppy_get_capabilities().
* Ver 0.9 Jul 4 99 Fix a bug which might have caused the number of
* bytes requested on each interrupt to be zero.
* Thanks to <shanos@es.co.nz> for pointing this out.
* Ver 0.9.sv Jan 6 01 Sam Varshavchik <mrsam@courier-mta.com>
* Implement low level formatting. Reimplemented
* IDEFLOPPY_CAPABILITIES_PAGE, since we need the srfp
* bit. My LS-120 drive barfs on
* IDEFLOPPY_CAPABILITIES_PAGE, but maybe it's just me.
* Compromise by not reporting a failure to get this
* mode page. Implemented four IOCTLs in order to
* implement formatting. IOCTLs begin with 0x4600,
* 0x46 is 'F' as in Format.
* Jan 9 01 Userland option to select format verify.
* Added PC_SUPPRESS_ERROR flag - some idefloppy drives
* do not implement IDEFLOPPY_CAPABILITIES_PAGE, and
* return a sense error. Suppress error reporting in
* this particular case in order to avoid spurious
* errors in syslog. The culprit is
* idefloppy_get_capability_page(), so move it to
* idefloppy_begin_format() so that it's not used
* unless absolutely necessary.
* If drive does not support format progress indication
* monitor the dsc bit in the status register.
* Also, O_NDELAY on open will allow the device to be
* opened without a disk available. This can be used to
* open an unformatted disk, or get the device capacity.
* Ver 0.91 Dec 11 99 Added IOMEGA Clik! drive support by
* <paul@paulbristow.net>
* Ver 0.92 Oct 22 00 Paul Bristow became official maintainer for this
* driver. Included Powerbook internal zip kludge.
* Ver 0.93 Oct 24 00 Fixed bugs for Clik! drive
* no disk on insert and disk change now works
* Ver 0.94 Oct 27 00 Tidied up to remove strstr(Clik) everywhere
* Ver 0.95 Nov 7 00 Brought across to kernel 2.4
* Ver 0.96 Jan 7 01 Actually in line with release version of 2.4.0
* including set_bit patch from Rusty Russell
* Ver 0.97 Jul 22 01 Merge 0.91-0.96 onto 0.9.sv for ac series
* Ver 0.97.sv Aug 3 01 Backported from 2.4.7-ac3
* Ver 0.98 Oct 26 01 Split idefloppy_transfer_pc into two pieces to
* fix a lost interrupt problem. It appears the busy
* bit was being deasserted by my IOMEGA ATAPI ZIP 100
* drive before the drive was actually ready.
* Ver 0.98a Oct 29 01 Expose delay value so we can play.
* Ver 0.99 Feb 24 02 Remove duplicate code, modify clik! detection code
* to support new PocketZip drives
*/
/*
* Ver 0.1 Nov 1 95 Pre-working code :-)
* Ver 0.2 Nov 23 95 A short backup (few megabytes) and restore procedure
* was successful ! (Using tar cvf ... on the block
* device interface).
* A longer backup resulted in major swapping, bad
* overall Linux performance and eventually failed as
* we received non serial read-ahead requests from the
* buffer cache.
* Ver 0.3 Nov 28 95 Long backups are now possible, thanks to the
* character device interface. Linux's responsiveness
* and performance doesn't seem to be much affected
* from the background backup procedure.
* Some general mtio.h magnetic tape operations are
* now supported by our character device. As a result,
* popular tape utilities are starting to work with
* ide tapes :-)
* The following configurations were tested:
* 1. An IDE ATAPI TAPE shares the same interface
* and irq with an IDE ATAPI CDROM.
* 2. An IDE ATAPI TAPE shares the same interface
* and irq with a normal IDE disk.
* Both configurations seemed to work just fine !
* However, to be on the safe side, it is meanwhile
* recommended to give the IDE TAPE its own interface
* and irq.
* The one thing which needs to be done here is to
* add a "request postpone" feature to ide.c,
* so that we won't have to wait for the tape to finish
* performing a long media access (DSC) request (such
* as a rewind) before we can access the other device
* on the same interface. This effect doesn't disturb
* normal operation most of the time because read/write
* requests are relatively fast, and once we are
* performing one tape r/w request, a lot of requests
* from the other device can be queued and ide.c will
* service all of them after this single tape request.
* Ver 1.0 Dec 11 95 Integrated into Linux 1.3.46 development tree.
* On each read / write request, we now ask the drive
* if we can transfer a constant number of bytes
* (a parameter of the drive) only to its buffers,
* without causing actual media access. If we can't,
* we just wait until we can by polling the DSC bit.
* This ensures that while we are not transferring
* more bytes than the constant referred to above, the
* interrupt latency will not become too high and
* we won't cause an interrupt timeout, as happened
* occasionally in the previous version.
* While polling for DSC, the current request is
* postponed and ide.c is free to handle requests from
* the other device. This is handled transparently to
* ide.c. The hwgroup locking method which was used
* in the previous version was removed.
* Use of new general features which are provided by
* ide.c for use with atapi devices.
* (Programming done by Mark Lord)
* Few potential bug fixes (Again, suggested by Mark)
* Single character device data transfers are now
* not limited in size, as they were before.
* We are asking the tape about its recommended
* transfer unit and send a larger data transfer
* as several transfers of the above size.
* For best results, use an integral number of this
* basic unit (which is shown during driver
* initialization). I will soon add an ioctl to get
* this important parameter.
* Our data transfer buffer is allocated on startup,
* rather than before each data transfer. This should
* ensure that we will indeed have a data buffer.
* Ver 1.1 Dec 14 95 Fixed random problems which occurred when the tape
* shared an interface with another device.
* (poll_for_dsc was a complete mess).
* Removed some old (non-active) code which had
* to do with supporting buffer cache originated
* requests.
* The block device interface can now be opened, so
* that general ide driver features like the unmask
* interrupts flag can be selected with an ioctl.
* This is the only use of the block device interface.
* New fast pipelined operation mode (currently only on
* writes). When using the pipelined mode, the
* throughput can potentially reach the maximum
* tape supported throughput, regardless of the
* user backup program. On my tape drive, it sometimes
* boosted performance by a factor of 2. Pipelined
* mode is enabled by default, but since it has a few
* downfalls as well, you may want to disable it.
* A short explanation of the pipelined operation mode
* is available below.
* Ver 1.2 Jan 1 96 Eliminated pipelined mode race condition.
* Added pipeline read mode. As a result, restores
* are now as fast as backups.
* Optimized shared interface behavior. The new behavior
* typically results in better IDE bus efficiency and
* higher tape throughput.
* Pre-calculation of the expected read/write request
* service time, based on the tape's parameters. In
* the pipelined operation mode, this allows us to
* adjust our polling frequency to a much lower value,
* and thus to dramatically reduce our load on Linux,
* without any decrease in performance.
* Implemented additional mtio.h operations.
* The recommended user block size is returned by
* the MTIOCGET ioctl.
* Additional minor changes.
* Ver 1.3 Feb 9 96 Fixed pipelined read mode bug which prevented the
* use of some block sizes during a restore procedure.
* The character device interface will now present a
* continuous view of the media - any mix of block sizes
* during a backup/restore procedure is supported. The
* driver will buffer the requests internally and
* convert them to the tape's recommended transfer
* unit, making performance almost independent of the
* chosen user block size.
* Some improvements in error recovery.
* By cooperating with ide-dma.c, bus mastering DMA can
* now sometimes be used with IDE tape drives as well.
* Bus mastering DMA has the potential to dramatically
* reduce the CPU's overhead when accessing the device,
* and can be enabled by using hdparm -d1 on the tape's
* block device interface. For more info, read the
* comments in ide-dma.c.
* Ver 1.4 Mar 13 96 Fixed serialize support.
* Ver 1.5 Apr 12 96 Fixed shared interface operation, broken in 1.3.85.
* Fixed pipelined read mode inefficiency.
* Fixed nasty null dereferencing bug.
* Ver 1.6 Aug 16 96 Fixed FPU usage in the driver.
* Fixed end of media bug.
* Ver 1.7 Sep 10 96 Minor changes for the CONNER CTT8000-A model.
* Ver 1.8 Sep 26 96 Attempt to find a better balance between good
* interactive response and high system throughput.
* Ver 1.9 Nov 5 96 Automatically cross encountered filemarks rather
* than requiring an explicit FSF command.
* Abort pending requests at end of media.
* MTTELL was sometimes returning incorrect results.
* Return the real block size in the MTIOCGET ioctl.
* Some error recovery bug fixes.
* Ver 1.10 Nov 5 96 Major reorganization.
* Reduced CPU overhead a bit by eliminating internal
* bounce buffers.
* Added module support.
* Added multiple tape drives support.
* Added partition support.
* Rewrote DSC handling.
* Some portability fixes.
* Removed ide-tape.h.
* Additional minor changes.
* Ver 1.11 Dec 2 96 Bug fix in previous DSC timeout handling.
* Use ide_stall_queue() for DSC overlap.
* Use the maximum speed rather than the current speed
* to compute the request service time.
* Ver 1.12 Dec 7 97 Fix random memory overwriting and/or last block data
* corruption, which could occur if the total number
* of bytes written to the tape was not an integral
* number of tape blocks.
* Add support for INTERRUPT DRQ devices.
* Ver 1.13 Jan 2 98 Add "speed == 0" work-around for HP COLORADO 5GB
* Ver 1.14 Dec 30 98 Partial fixes for the Sony/AIWA tape drives.
* Replace cli()/sti() with hwgroup spinlocks.
* Ver 1.15 Mar 25 99 Fix SMP race condition by replacing hwgroup
* spinlock with private per-tape spinlock.
* Ver 1.16 Sep 1 99 Add OnStream tape support.
* Abort read pipeline on EOD.
* Wait for the tape to become ready in case it returns
* "in the process of becoming ready" on open().
* Fix zero padding of the last written block in
* case the tape block size is larger than PAGE_SIZE.
* Decrease the default disconnection time to tn.
* Ver 1.16e Oct 3 99 Minor fixes.
* Ver 1.16e1 Oct 13 99 Patches by Arnold Niessen,
* niessen@iae.nl / arnold.niessen@philips.com
* GO-1) Undefined code in idetape_read_position
* according to Gadi's email
* AJN-1) Minor fix asc == 11 should be asc == 0x11
* in idetape_issue_packet_command (this affected
* debugging output only)
* AJN-2) Added more debugging output, and
* added ide-tape: where missing. I would also
* like to add tape->name where possible
* AJN-3) Added different debug_level's
* via /proc/ide/hdc/settings
* "debug_level" determines amount of debugging output;
* can be changed using /proc/ide/hdx/settings
* 0 : almost no debugging output
* 1 : 0+output errors only
* 2 : 1+output all sensekey/asc
* 3 : 2+follow all chrdev related procedures
* 4 : 3+follow all procedures
* 5 : 4+include pc_stack rq_stack info
* 6 : 5+USE_COUNT updates
* AJN-4) Fixed timeout for retension in idetape_queue_pc_tail
* from 5 to 10 minutes
* AJN-5) Changed maximum number of blocks to skip when
* reading tapes with multiple consecutive write
* errors from 100 to 1000 in idetape_get_logical_blk
* Proposed changes to code:
* 1) output "logical_blk_num" via /proc
* 2) output "current_operation" via /proc
* 3) Either solve or document the fact that `mt rewind' is
* required after reading from /dev/nhtx to be
* able to rmmod the idetape module;
* Also, sometimes an application finishes but the
* device remains `busy' for some time. Same cause ?
* Proposed changes to release-notes:
* 4) write a simple `quickstart' section in the
* release notes; I volunteer if you don't want to
* 5) include a pointer to video4linux in the doc
* to stimulate video applications
* 6) release notes lines 331 and 362: explain what happens
* if the application data rate is higher than 1100 KB/s;
* similar approach to lower-than-500 kB/s ?
* 7) 6.6 Comparison; wouldn't it be better to allow different
* strategies for read and write ?
* Wouldn't it be better to control the tape buffer
* contents instead of the bandwidth ?
* 8) line 536: replace will by would (if I understand
* this section correctly, a hypothetical and unwanted situation
* is being described)
* Ver 1.16f Dec 15 99 Change place of the secondary OnStream header frames.
* Ver 1.17 Nov 2000 / Jan 2001 Marcel Mol, marcel@mesa.nl
* - Add idetape_onstream_mode_sense_tape_parameter_page
* function to get tape capacity in frames: tape->capacity.
* - Add support for DI-50 drives( or any DI- drive).
* - 'workaround' for read error/blank block around block 3000.
* - Implement Early warning for end of media for Onstream.
* - Cosmetic code changes for readability.
* - Idetape_position_tape should not use SKIP bit during
* Onstream read recovery.
* - Add capacity, logical_blk_num and first/last_frame_position
* to /proc/ide/hd?/settings.
* - Module use count was gone in the Linux 2.4 driver.
* Ver 1.17a Apr 2001 Willem Riede osst@riede.org
* - Get drive's actual block size from mode sense block descriptor
* - Limit size of pipeline
* Ver 1.17b Oct 2002 Alan Stern <stern@rowland.harvard.edu>
* Changed IDETAPE_MIN_PIPELINE_STAGES to 1 and actually used
* it in the code!
* Actually removed aborted stages in idetape_abort_pipeline
* instead of just changing the command code.
* Made the transfer byte count for Request Sense equal to the
* actual length of the data transfer.
* Changed handling of partial data transfers: they do not
* cause DMA errors.
* Moved initiation of DMA transfers to the correct place.
* Removed reference to unallocated memory.
* Made __idetape_discard_read_pipeline return the number of
* sectors skipped, not the number of stages.
* Replaced errant kfree() calls with __idetape_kfree_stage().
* Fixed off-by-one error in testing the pipeline length.
* Fixed handling of filemarks in the read pipeline.
* Small code optimization for MTBSF and MTBSFM ioctls.
* Don't try to unlock the door during device close if it is
* already unlocked!
* Cosmetic fixes to miscellaneous debugging output messages.
* Set the minimum /proc/ide/hd?/settings values for "pipeline",
* "pipeline_min", and "pipeline_max" to 1.
*/
Changelog for ide cd
--------------------
.. include:: ChangeLog.ide-cd.1994-2004
:literal:
Changelog for ide floppy
------------------------
.. include:: ChangeLog.ide-floppy.1996-2002
:literal:
Changelog for ide tape
----------------------
.. include:: ChangeLog.ide-tape.1995-2002
:literal:
===============================
IDE ATAPI streaming tape driver
===============================
This driver is a part of the Linux ide driver.
The driver, in co-operation with ide.c, basically traverses the
request-list for the block device interface. The character device
interface, on the other hand, creates new requests, adds them
to the request-list of the block device, and waits for their completion.
The block device major and minor numbers are determined from the
tape's relative position in the ide interfaces, as explained in ide.c.
The character device interface consists of the following devices::
ht0 major 37, minor 0 first IDE tape, rewind on close.
ht1 major 37, minor 1 second IDE tape, rewind on close.
...
nht0 major 37, minor 128 first IDE tape, no rewind on close.
nht1 major 37, minor 129 second IDE tape, no rewind on close.
...
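The numbering in the table above can be sketched as a small shell function. ``ide_tape_minor`` is a hypothetical name, and the rule it encodes (minor N for ``htN``, 128 + N for ``nhtN``) is extrapolated from the four entries listed, so treat it as an assumption for units beyond those shown:

```shell
#!/bin/sh
# Hypothetical helper: print the character-device minor number for an
# IDE tape device name, following the pattern in the table above.
ide_tape_minor() {
    # $1 = device name such as ht0 or nht1
    case "$1" in
        nht*) base=128; unit=${1#nht} ;;   # no rewind on close
        ht*)  base=0;   unit=${1#ht}  ;;   # rewind on close
        *)
            echo "ide_tape_minor: unrecognised name '$1'" >&2
            return 1
            ;;
    esac
    # Reject an empty or non-numeric unit before doing arithmetic.
    case "$unit" in
        ''|*[!0-9]*)
            echo "ide_tape_minor: bad unit in '$1'" >&2
            return 1
            ;;
    esac
    echo $(( base + unit ))
}
```

A device node could then be created with something like ``mknod /dev/nht0 c 37 "$(ide_tape_minor nht0)"``, using the major number 37 given above.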
The general magnetic tape commands compatible interface, as defined by
include/linux/mtio.h, is accessible through the character device.
General ide driver configuration options, such as the interrupt-unmask
flag, can be configured by issuing an ioctl to the block device interface,
as with any other ide device.
Our own ide-tape ioctls can be issued to either the block device or
the character device interface.
Maximal throughput with minimal bus load will usually be achieved in the
following scenario:
1. ide-tape is operating in the pipelined operation mode.
2. No buffering is performed by the user backup program.
Testing was done with a 2 GB CONNER CTMA 4000 IDE ATAPI Streaming Tape Drive.
Here are some words from the first releases of hd.c, which are quoted
in ide.c and apply here as well:
* Special care is recommended. Have Fun!
Possible improvements
=====================
1. Support for the ATAPI overlap protocol.
In order to maximize bus throughput, we currently use the DSC
overlap method which enables ide.c to service requests from the
other device while the tape is busy executing a command. The
DSC overlap method involves polling the tape's status register
for the DSC bit, and servicing the other device while the tape
isn't ready.
In the current QIC development standard (December 1995),
it is recommended that new tape drives *in addition*
implement the ATAPI overlap protocol, which serves the
same purpose (efficient use of the IDE bus) but is interrupt
driven and thus has much less CPU overhead.
ATAPI overlap is likely to be supported in most new ATAPI
devices, including new ATAPI cdroms, and thus provides us
a method by which we can achieve higher throughput when
sharing a (fast) ATA-2 disk with any (slow) new ATAPI device.
============================================
Information regarding the Enhanced IDE drive
============================================
The hdparm utility can be used to control various IDE features on a
running system. It is packaged separately. Please look for it on popular
Linux FTP sites.
-------------------------------------------------------------------------------
.. important::
BUGGY IDE CHIPSETS CAN CORRUPT DATA!!
PCI versions of the CMD640 and RZ1000 interfaces are now detected
automatically at startup when PCI BIOS support is configured.
Linux disables the "prefetch" ("readahead") mode of the RZ1000
to prevent the data corruption that its hardware design flaws can cause.
For the CMD640, Linux disables "IRQ unmasking" (hdparm -u1) on any
drive for which the "prefetch" mode of the CMD640 is turned on.
If "prefetch" is disabled (hdparm -p8), then "IRQ unmasking" can be
used again.
For the CMD640, Linux disables "32bit I/O" (hdparm -c1) on any drive
for which the "prefetch" mode of the CMD640 is turned off.
If "prefetch" is enabled (hdparm -p9), then "32bit I/O" can be
used again.
The CMD640 is also used on some Vesa Local Bus (VLB) cards, and is *NOT*
automatically detected by Linux. For safe, reliable operation with such
interfaces, one *MUST* use the "cmd640.probe_vlb" kernel option.
Use of the "serialize" option is no longer necessary.
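The two CMD640 rules above are mutually exclusive, which a small Python sketch makes explicit. The function name and dictionary keys are hypothetical; the code only restates the rules given in this section and does not touch any hardware.

```python
def cmd640_allowed_features(prefetch_enabled: bool) -> dict:
    """Restate the CMD640 rules from the text above.

    With prefetch on, IRQ unmasking (hdparm -u1) is disabled.
    With prefetch off, 32-bit I/O (hdparm -c1) is disabled.
    """
    return {
        "irq_unmasking": not prefetch_enabled,
        "io_32bit": prefetch_enabled,
    }
```

Whichever way "prefetch" is set, exactly one of the two features remains usable on a given drive.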
-------------------------------------------------------------------------------
Common pitfalls
===============
- 40-conductor IDE cables are capable of transferring data in DMA modes up to
udma2, but no faster.
- If possible, attach devices to separate channels when they are
available, typically the disk on the first and the CD-ROM on the second.
- If you mix devices on the same cable, please consider using devices
that support similar data transfer modes.
- Even better, try to stick to the same vendor and device type on the same
cable.
This is the multiple IDE interface driver, as evolved from hd.c
===============================================================
It supports up to 9 IDE interfaces by default, on one or more IRQs (usually
14 & 15). There can be up to two drives per interface, as per the ATA-6 spec.::
Primary: ide0, port 0x1f0; major=3; hda is minor=0; hdb is minor=64
Secondary: ide1, port 0x170; major=22; hdc is minor=0; hdd is minor=64
Tertiary: ide2, port 0x1e8; major=33; hde is minor=0; hdf is minor=64
Quaternary: ide3, port 0x168; major=34; hdg is minor=0; hdh is minor=64
fifth.. ide4, usually PCI, probed
sixth.. ide5, usually PCI, probed
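As a rough illustration of the table above, the following Python sketch maps a drive name to its interface and device numbers. It covers only the four interfaces whose major numbers are listed (ide0 to ide3); the helper name ``drive_numbers`` is hypothetical.

```python
# (interface name, block major) for the four interfaces listed above
INTERFACES = [("ide0", 3), ("ide1", 22), ("ide2", 33), ("ide3", 34)]
SECOND_DRIVE_MINOR = 64  # minor of the second drive on an interface, per the table


def drive_numbers(name: str):
    """Map 'hda'..'hdh' to (interface, major, minor) using the table above."""
    if len(name) != 3 or not name.startswith("hd") or not "a" <= name[2] <= "h":
        raise ValueError(f"unsupported drive name: {name!r}")
    index = ord(name[2]) - ord("a")           # hda -> 0, hdb -> 1, ...
    interface, major = INTERFACES[index // 2]  # two drives per interface
    minor = SECOND_DRIVE_MINOR * (index % 2)   # 0 for first drive, 64 for second
    return interface, major, minor
```

For example, ``drive_numbers("hdd")`` returns ``("ide1", 22, 64)``, the second drive on the secondary interface.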
To access devices on interfaces beyond ide0, please make sure that
device files for them are present in /dev. If not, please create such
entries using /dev/MAKEDEV.
This driver automatically probes for most IDE interfaces (including all PCI
ones), for the drives/geometries attached to those interfaces, and for the IRQ
lines being used by the interfaces (normally 14, 15 for ide0/ide1).
Any number of interfaces may share a single IRQ if necessary, at a slight
performance penalty, whether on separate cards or a single VLB card.
The IDE driver automatically detects and handles this. However, this may
or may not be harmful to your hardware: two or more cards driving the same IRQ
can potentially burn each other's bus driver, though in practice this
seldom occurs. Be careful, and if in doubt, don't do it!
Drives are normally found by auto-probing and/or examining the CMOS/BIOS data.
For really weird situations, the apparent (fdisk) geometry can also be specified
on the kernel "command line" using LILO. The format of such lines is::
ide_core.chs=[interface_number.device_number]:cyls,heads,sects
or::
ide_core.cdrom=[interface_number.device_number]
For example::
ide_core.chs=1.0:1050,32,64 ide_core.cdrom=1.1
The results of successful auto-probing may override the physical geometry/irq
specified, though the "original" geometry may be retained as the "logical"
geometry for partitioning purposes (fdisk).
If the auto-probing during boot time confuses a drive (i.e. the drive works
with hd.c but not with ide.c), then a command line option may be specified
for each drive that should skip the hardware
probe/identification sequence. For example::
ide_core.noprobe=0.1
or::
ide_core.chs=1.0:768,16,32
ide_core.noprobe=1.0
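To make the ``ide_core.chs`` format above concrete, here is a minimal Python sketch that parses such an option and estimates the resulting capacity. The names ``ChsOption``, ``parse_chs`` and ``capacity_bytes`` are hypothetical, and the 512-byte sector size is an assumption that this text does not state.

```python
from typing import NamedTuple


class ChsOption(NamedTuple):
    interface: int
    device: int
    cylinders: int
    heads: int
    sectors: int


def parse_chs(option: str) -> ChsOption:
    """Parse 'ide_core.chs=<interface>.<device>:<cyls>,<heads>,<sects>'."""
    key, sep, value = option.partition("=")
    if key != "ide_core.chs" or not sep:
        raise ValueError(f"not an ide_core.chs option: {option!r}")
    unit, sep, geometry = value.partition(":")
    if not sep:
        raise ValueError(f"missing geometry in {option!r}")
    interface, device = (int(part) for part in unit.split("."))
    cyls, heads, sects = (int(part) for part in geometry.split(","))
    return ChsOption(interface, device, cyls, heads, sects)


def capacity_bytes(chs: ChsOption, sector_size: int = 512) -> int:
    """Capacity implied by the geometry; sector_size=512 is an assumption."""
    return chs.cylinders * chs.heads * chs.sectors * sector_size
```

Under that assumption, the earlier example ``ide_core.chs=1.0:1050,32,64`` describes the first device on ide1 with a capacity of 1050 * 32 * 64 * 512 = 1,101,004,800 bytes.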
Note that when only one IDE device is attached to an interface, it should be
jumpered as "single" or "master", *not* "slave". Many folks have had
"trouble" with cdroms because of this requirement, so the driver now probes
for both units, though success is more likely when the drive is jumpered
correctly.
Courtesy of Scott Snyder and others, the driver supports ATAPI cdrom drives
such as the NEC-260 and the new MITSUMI triple/quad speed drives.
Such drives will be identified at boot time, just like a hard disk.
If for some reason your cdrom drive is *not* found at boot time, you can force
the probe to look harder by supplying a kernel command line parameter
via LILO, such as::
ide_core.cdrom=1.0 /* "master" on second interface (hdc) */
or::
ide_core.cdrom=1.1 /* "slave" on second interface (hdd) */
For example, a GW2000 system might have a hard drive on the primary
interface (/dev/hda) and an IDE cdrom drive on the secondary interface
(/dev/hdc). To mount a CD in the cdrom drive, one would use something like::
ln -sf /dev/hdc /dev/cdrom
mkdir /mnt/cdrom
mount /dev/cdrom /mnt/cdrom -t iso9660 -o ro
If, after doing all of the above, mount doesn't work and you see
errors from the driver (with dmesg) complaining about `status=0xff`,
this means that the hardware is not responding to the driver's attempts
to read it. One of the following is probably the problem:
- Your hardware is broken.
- You are using the wrong address for the device, or you have the
drive jumpered wrong. Review the configuration instructions above.
- Your IDE controller requires some nonstandard initialization sequence
before it will work properly. If this is the case, there will often
be a separate MS-DOS driver just for the controller. IDE interfaces
on sound cards usually fall into this category. Such configurations
can often be made to work by first booting MS-DOS, loading the
appropriate drivers, and then warm-booting linux (without powering
off). This can be automated using loadlin in the MS-DOS autoexec.
If you always get timeout errors, interrupts from the drive are probably
not making it to the host. Check how you have the hardware jumpered
and make sure it matches what the driver expects (see the configuration
instructions above). If you have a PCI system, also check the BIOS
setup; I've had one report of a system which was shipped with IRQ 15
disabled by the BIOS.
The kernel is able to execute binaries directly off of the cdrom,
provided it is mounted with the default block size of 1024 (as above).
Please pass on any feedback on any of this stuff to the maintainer,
whose address can be found in linux/MAINTAINERS.
The IDE driver is modularized. The high level disk/CD-ROM/tape/floppy
drivers can always be compiled as loadable modules, the chipset drivers
can only be compiled into the kernel, and the core code (ide.c) can be
compiled as a loadable module provided no chipset support is needed.
When using ide.c as a module in combination with kmod, add::
alias block-major-3 ide-probe
to a configuration file in /etc/modprobe.d/.
When ide.c is used as a module, you can pass command line parameters to the
driver using the "options=" keyword to insmod, while replacing any ',' with
';'.
Summary of ide driver parameters for kernel command line
========================================================
For legacy IDE VLB host drivers (ali14xx/dtc2278/ht6560b/qd65xx/umc8672)
you need to explicitly enable probing by using the "probe" kernel parameter,
e.g. to enable probing for ALI M14xx chipsets (ali14xx host driver) use:
* "ali14xx.probe" boot option when the ali14xx driver is built into the kernel
* "probe" module parameter when the ali14xx driver is compiled as a module
("modprobe ali14xx probe")
Also for the legacy CMD640 host driver (cmd640) you need to use the "probe_vlb"
kernel parameter to enable probing for the VLB version of the chipset (PCI ones
are detected automatically).
You also need to use the "probe" kernel parameter for the ide-4drives driver
(support for IDE generic chipset with four drives on one port).
To enable support for IDE doublers on Amiga use the "doubler" kernel parameter
for the gayle host driver (i.e. "gayle.doubler" if the driver is built-in).
To force ignoring cable detection (this should be needed only if you're using
a short 40-wire cable which cannot be automatically detected; if this is not
the case, please report it as a bug instead) use the "ignore_cable" kernel parameter:
* "ide_core.ignore_cable=[interface_number]" boot option if IDE is built-in
(e.g. "ide_core.ignore_cable=1" to force ignoring cable for "ide1")
* "ignore_cable=[interface_number]" module parameter (for ide_core module)
if IDE is compiled as a module
Other kernel parameters for ide_core are:
* "nodma=[interface_number.device_number]" to disallow DMA for a device
* "noflush=[interface_number.device_number]" to disable flush requests
* "nohpa=[interface_number.device_number]" to disable Host Protected Area
* "noprobe=[interface_number.device_number]" to skip probing
* "nowerr=[interface_number.device_number]" to ignore the WRERR_STAT bit
* "cdrom=[interface_number.device_number]" to force device as a CD-ROM
* "chs=[interface_number.device_number]" to force device as a disk (using CHS)
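The per-device parameters listed above all share the ``[interface_number.device_number]`` form, so composing them can be sketched as follows. The helper ``ide_core_option`` is hypothetical. It assumes that the ``ide_core.`` prefix applies when IDE is built-in and is dropped for module parameters, as described for ``ignore_cable`` above. ``chs`` is left out because it also takes a geometry.

```python
# Parameters from the list above that take only interface.device
UNIT_PARAMS = ("nodma", "noflush", "nohpa", "noprobe", "nowerr", "cdrom")


def ide_core_option(param: str, interface: int, device: int,
                    built_in: bool = True) -> str:
    """Format one ide_core parameter for a single device.

    Assumption: built-in IDE uses the 'ide_core.' prefix on the kernel
    command line, while the module form omits it.
    """
    if param not in UNIT_PARAMS:
        raise ValueError(f"unknown per-device parameter: {param!r}")
    if device not in (0, 1):
        raise ValueError("an interface holds at most two drives (0 or 1)")
    prefix = "ide_core." if built_in else ""
    return f"{prefix}{param}={interface}.{device}"
```

Several options can then be joined with spaces to form part of a kernel command line, for example ``" ".join([ide_core_option("nodma", 0, 0), ide_core_option("cdrom", 1, 1)])``.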
Some Terminology
================
IDE
Integrated Drive Electronics, meaning that each drive has a built-in
controller, which is why an "IDE interface card" is not a "controller card".
ATA
AT (the old IBM 286 computer) Attachment Interface, a draft American
National Standard for connecting hard drives to PCs. This is the official
name for "IDE".
The latest standards define some enhancements, known as the ATA-6 spec,
which grew out of vendor-specific "Enhanced IDE" (EIDE) implementations.
ATAPI
ATA Packet Interface, a new protocol for controlling the drives,
similar to SCSI protocols, created at the same time as the ATA2 standard.
ATAPI is currently used for controlling CDROM, TAPE and FLOPPY (ZIP or
LS120/240) devices, removable R/W cartridges, and for high capacity hard disk
drives.
mlord@pobox.com
Wed Apr 17 22:52:44 CEST 2002 edited by Marcin Dalecki, the current
maintainer.
Wed Aug 20 22:31:29 CEST 2003 updated ide boot options to current ide.c
comments at 2.6.0-test4 time. Maciej Soltysiak <solt@dns.toxicfilms.tv>
.. SPDX-License-Identifier: GPL-2.0
==================================
Integrated Drive Electronics (IDE)
==================================
.. toctree::
:maxdepth: 1
ide
ide-tape
warm-plug-howto
changelogs
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
===================
IDE warm-plug HOWTO
===================
To warm-plug devices on a port 'idex'::
# echo -n "1" > /sys/class/ide_port/idex/delete_devices
unplug old device(s) and plug new device(s)::
# echo -n "1" > /sys/class/ide_port/idex/scan
done
NOTE: please make sure that partitions are unmounted and that there are
no other active references to the devices before doing the "delete_devices"
step. Also do not attempt the "scan" step on devices currently in use;
otherwise results may be unpredictable and may lead to data loss.
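The sequence above could be scripted roughly as follows. This Python sketch is an assumption-laden illustration: the function ``warm_plug`` and its ``swap_devices`` callback are hypothetical, and the ``sysfs_root`` parameter exists only so the logic can be exercised against a scratch directory. It performs none of the safety checks from the note above.

```python
from pathlib import Path
from typing import Callable


def warm_plug(port: str, swap_devices: Callable[[], None],
              sysfs_root: str = "/sys/class/ide_port") -> None:
    """Run the delete_devices / swap / scan sequence for one IDE port.

    The caller must ensure that nothing on the port is mounted or in use.
    """
    port_dir = Path(sysfs_root) / port
    # Equivalent of: echo -n "1" > .../delete_devices
    (port_dir / "delete_devices").write_text("1")
    # Physically unplug the old device(s) and plug in the new one(s)
    swap_devices()
    # Equivalent of: echo -n "1" > .../scan
    (port_dir / "scan").write_text("1")
```

A caller might pass a function that waits for operator confirmation as ``swap_devices``, for example ``warm_plug("ide1", lambda: input("Swap devices, then press Enter"))``.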
...@@ -103,7 +103,6 @@ needed). ...@@ -103,7 +103,6 @@ needed).
block/index block/index
cdrom/index cdrom/index
cpu-freq/index cpu-freq/index
ide/index
fb/index fb/index
fpga/index fpga/index
hid/index hid/index
...@@ -169,7 +168,6 @@ to ReStructured Text format, or are simply too old. ...@@ -169,7 +168,6 @@ to ReStructured Text format, or are simply too old.
tools/index tools/index
staging/index staging/index
watch_queue
Translations Translations
......
...@@ -288,7 +288,7 @@ between 0 and large positive numbers. Excess motion below 0 is ignored. The ...@@ -288,7 +288,7 @@ between 0 and large positive numbers. Excess motion below 0 is ignored. The
command sets the maximum positive value that can be attained in the scaled command sets the maximum positive value that can be attained in the scaled
coordinate system. Motion beyond that value is also ignored. coordinate system. Motion beyond that value is also ignored.
SET MOUSE KEYCODE MOSE SET MOUSE KEYCODE MODE
---------------------- ----------------------
:: ::
...@@ -333,7 +333,7 @@ occur before the internally maintained coordinate is changed by one ...@@ -333,7 +333,7 @@ occur before the internally maintained coordinate is changed by one
(independently scaled for each axis). Remember that the mouse position (independently scaled for each axis). Remember that the mouse position
information is available only by interrogating the ikbd in the ABSOLUTE MOUSE information is available only by interrogating the ikbd in the ABSOLUTE MOUSE
POSITIONING mode unless the ikbd has been commanded to report on button press POSITIONING mode unless the ikbd has been commanded to report on button press
or release (see SET MOSE BUTTON ACTION). or release (see SET MOUSE BUTTON ACTION).
INTERROGATE MOUSE POSITION INTERROGATE MOUSE POSITION
-------------------------- --------------------------
......
...@@ -32,7 +32,7 @@ The following parameters are used to configure filters to reduce noise: ...@@ -32,7 +32,7 @@ The following parameters are used to configure filters to reduce noise:
|activation_height, |size threshold to activate immediately | |activation_height, |size threshold to activate immediately |
|activation_width | | |activation_width | |
+-----------------------+-----------------------------------------------------+ +-----------------------+-----------------------------------------------------+
|min_height, |size threshold bellow which fingers are ignored | |min_height, |size threshold below which fingers are ignored |
|min_width |both to decide activation and during activity | |min_width |both to decide activation and during activity |
+-----------------------+-----------------------------------------------------+ +-----------------------+-----------------------------------------------------+
|deactivate_slack |the number of "no contact" frames to ignore before | |deactivate_slack |the number of "no contact" frames to ignore before |
......
...@@ -112,8 +112,7 @@ time, although different tasklets can run simultaneously. ...@@ -112,8 +112,7 @@ time, although different tasklets can run simultaneously.
.. warning:: .. warning::
The name 'tasklet' is misleading: they have nothing to do with The name 'tasklet' is misleading: they have nothing to do with
'tasks', and probably more to do with some bad vodka Alexey 'tasks'.
Kuznetsov had at the time.
You can tell you are in a softirq (or tasklet) using the You can tell you are in a softirq (or tasklet) using the
:c:func:`in_softirq()` macro (``include/linux/preempt.h``). :c:func:`in_softirq()` macro (``include/linux/preempt.h``).
...@@ -290,8 +289,8 @@ userspace. ...@@ -290,8 +289,8 @@ userspace.
Unlike :c:func:`put_user()` and :c:func:`get_user()`, they Unlike :c:func:`put_user()` and :c:func:`get_user()`, they
return the amount of uncopied data (ie. 0 still means success). return the amount of uncopied data (ie. 0 still means success).
[Yes, this moronic interface makes me cringe. The flamewar comes up [Yes, this objectionable interface makes me cringe. The flamewar comes
every year or so. --RR.] up every year or so. --RR.]
The functions may sleep implicitly. This should never be called outside The functions may sleep implicitly. This should never be called outside
user context (it makes no sense), with interrupts disabled, or a user context (it makes no sense), with interrupts disabled, or a
...@@ -645,8 +644,9 @@ names in development kernels; this is not done just to keep everyone on ...@@ -645,8 +644,9 @@ names in development kernels; this is not done just to keep everyone on
their toes: it reflects a fundamental change (eg. can no longer be their toes: it reflects a fundamental change (eg. can no longer be
called with interrupts on, or does extra checks, or doesn't do checks called with interrupts on, or does extra checks, or doesn't do checks
which were caught before). Usually this is accompanied by a fairly which were caught before). Usually this is accompanied by a fairly
complete note to the linux-kernel mailing list; search the archive. complete note to the appropriate kernel development mailing list; search
Simply doing a global replace on the file usually makes things **worse**. the archives. Simply doing a global replace on the file usually makes
things **worse**.
Initializing structure members Initializing structure members
------------------------------ ------------------------------
...@@ -723,14 +723,14 @@ Putting Your Stuff in the Kernel ...@@ -723,14 +723,14 @@ Putting Your Stuff in the Kernel
In order to get your stuff into shape for official inclusion, or even to In order to get your stuff into shape for official inclusion, or even to
make a neat patch, there's administrative work to be done: make a neat patch, there's administrative work to be done:
- Figure out whose pond you've been pissing in. Look at the top of the - Figure out who are the owners of the code you've been modifying. Look
source files, inside the ``MAINTAINERS`` file, and last of all in the at the top of the source files, inside the ``MAINTAINERS`` file, and
``CREDITS`` file. You should coordinate with this person to make sure last of all in the ``CREDITS`` file. You should coordinate with these
you're not duplicating effort, or trying something that's already people to make sure you're not duplicating effort, or trying something
been rejected. that's already been rejected.
Make sure you put your name and EMail address at the top of any files Make sure you put your name and email address at the top of any files
you create or mangle significantly. This is the first place people you create or modify significantly. This is the first place people
will look when they find a bug, or when **they** want to make a change. will look when they find a bug, or when **they** want to make a change.
- Usually you want a configuration option for your kernel hack. Edit - Usually you want a configuration option for your kernel hack. Edit
...@@ -748,11 +748,11 @@ make a neat patch, there's administrative work to be done: ...@@ -748,11 +748,11 @@ make a neat patch, there's administrative work to be done:
can usually just add a "obj-$(CONFIG_xxx) += xxx.o" line. The syntax can usually just add a "obj-$(CONFIG_xxx) += xxx.o" line. The syntax
is documented in ``Documentation/kbuild/makefiles.rst``. is documented in ``Documentation/kbuild/makefiles.rst``.
- Put yourself in ``CREDITS`` if you've done something noteworthy, - Put yourself in ``CREDITS`` if you consider what you've done
usually beyond a single file (your name should be at the top of the noteworthy, usually beyond a single file (your name should be at the
source files anyway). ``MAINTAINERS`` means you want to be consulted top of the source files anyway). ``MAINTAINERS`` means you want to be
when changes are made to a subsystem, and hear about bugs; it implies consulted when changes are made to a subsystem, and hear about bugs;
a more-than-passing commitment to some part of the code. it implies a more-than-passing commitment to some part of the code.
- Finally, don't forget to read - Finally, don't forget to read
``Documentation/process/submitting-patches.rst`` and possibly ``Documentation/process/submitting-patches.rst`` and possibly
......
...@@ -941,8 +941,7 @@ lock. ...@@ -941,8 +941,7 @@ lock.
A classic problem here is when you provide callbacks or hooks: if you A classic problem here is when you provide callbacks or hooks: if you
call these with the lock held, you risk simple deadlock, or a deadly call these with the lock held, you risk simple deadlock, or a deadly
embrace (who knows what the callback will do?). Remember, the other embrace (who knows what the callback will do?).
programmers are out to get you, so don't do this.
Overzealous Prevention Of Deadlocks Overzealous Prevention Of Deadlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -952,8 +951,6 @@ grabs a read lock, searches a list, fails to find what it wants, drops ...@@ -952,8 +951,6 @@ grabs a read lock, searches a list, fails to find what it wants, drops
the read lock, grabs a write lock and inserts the object has a race the read lock, grabs a write lock and inserts the object has a race
condition. condition.
If you don't see why, please stay away from my code.
Racing Timers: A Kernel Pastime Racing Timers: A Kernel Pastime
------------------------------- -------------------------------
......
...@@ -154,10 +154,11 @@ that the kernel developers have added a script to ease the process: ...@@ -154,10 +154,11 @@ that the kernel developers have added a script to ease the process:
This script will return the current maintainer(s) for a given file or This script will return the current maintainer(s) for a given file or
directory when given the "-f" option. If passed a patch on the directory when given the "-f" option. If passed a patch on the
command line, it will list the maintainers who should probably receive command line, it will list the maintainers who should probably receive
copies of the patch. There are a number of options regulating how hard copies of the patch. This is the preferred way (unlike "-f" option) to get the
get_maintainer.pl will search for maintainers; please be careful about list of people to Cc for your patches. There are a number of options
using the more aggressive options as you may end up including developers regulating how hard get_maintainer.pl will search for maintainers; please be
who have no real interest in the code you are modifying. careful about using the more aggressive options as you may end up including
developers who have no real interest in the code you are modifying.
If all else fails, talking to Andrew Morton can be an effective way to If all else fails, talking to Andrew Morton can be an effective way to
track down a maintainer for a specific piece of code. track down a maintainer for a specific piece of code.
......
...@@ -7,7 +7,7 @@ Intro ...@@ -7,7 +7,7 @@ Intro
===== =====
This document is designed to provide a list of the minimum levels of This document is designed to provide a list of the minimum levels of
software necessary to run the 4.x kernels. software necessary to run the current kernel version.
This document is originally based on my "Changes" file for 2.0.x kernels This document is originally based on my "Changes" file for 2.0.x kernels
and therefore owes credit to the same people as that file (Jared Mauch, and therefore owes credit to the same people as that file (Jared Mauch,
...@@ -56,6 +56,7 @@ iptables 1.4.2 iptables -V ...@@ -56,6 +56,7 @@ iptables 1.4.2 iptables -V
openssl & libcrypto 1.0.0 openssl version openssl & libcrypto 1.0.0 openssl version
bc 1.06.95 bc --version bc 1.06.95 bc --version
Sphinx\ [#f1]_ 1.7 sphinx-build --version Sphinx\ [#f1]_ 1.7 sphinx-build --version
cpio any cpio --version
====================== =============== ======================================== ====================== =============== ========================================
.. [#f1] Sphinx is needed only to build the Kernel documentation .. [#f1] Sphinx is needed only to build the Kernel documentation
...@@ -458,6 +459,11 @@ mcelog ...@@ -458,6 +459,11 @@ mcelog
- <http://www.mcelog.org/> - <http://www.mcelog.org/>
cpio
----
- <https://www.gnu.org/software/cpio/>
Networking Networking
********** **********
......
...@@ -77,7 +77,7 @@ as you intend it to. ...@@ -77,7 +77,7 @@ as you intend it to.
The maintainer will thank you if you write your patch description in a The maintainer will thank you if you write your patch description in a
form which can be easily pulled into Linux's source code management form which can be easily pulled into Linux's source code management
system, ``git``, as a "commit log". See :ref:`explicit_in_reply_to`. system, ``git``, as a "commit log". See :ref:`the_canonical_patch_format`.
Solve only one problem per patch. If your description starts to get Solve only one problem per patch. If your description starts to get
long, that's a sign that you probably need to split up your patch. long, that's a sign that you probably need to split up your patch.
...@@ -227,9 +227,10 @@ Select the recipients for your patch ...@@ -227,9 +227,10 @@ Select the recipients for your patch
You should always copy the appropriate subsystem maintainer(s) on any patch You should always copy the appropriate subsystem maintainer(s) on any patch
to code that they maintain; look through the MAINTAINERS file and the to code that they maintain; look through the MAINTAINERS file and the
source code revision history to see who those maintainers are. The source code revision history to see who those maintainers are. The
script scripts/get_maintainer.pl can be very useful at this step. If you script scripts/get_maintainer.pl can be very useful at this step (pass paths to
cannot find a maintainer for the subsystem you are working on, Andrew your patches as arguments to scripts/get_maintainer.pl). If you cannot find a
Morton (akpm@linux-foundation.org) serves as a maintainer of last resort. maintainer for the subsystem you are working on, Andrew Morton
(akpm@linux-foundation.org) serves as a maintainer of last resort.
You should also normally choose at least one mailing list to receive a copy You should also normally choose at least one mailing list to receive a copy
of your patch set. linux-kernel@vger.kernel.org should be used by default of your patch set. linux-kernel@vger.kernel.org should be used by default
...@@ -318,7 +319,10 @@ understands what is going on. ...@@ -318,7 +319,10 @@ understands what is going on.
Be sure to tell the reviewers what changes you are making and to thank them Be sure to tell the reviewers what changes you are making and to thank them
for their time. Code review is a tiring and time-consuming process, and for their time. Code review is a tiring and time-consuming process, and
reviewers sometimes get grumpy. Even in that case, though, respond reviewers sometimes get grumpy. Even in that case, though, respond
politely and address the problems they have pointed out. politely and address the problems they have pointed out. When sending a next
version, add a ``patch changelog`` to the cover letter or to individual patches
explaining the differences against the previous submission (see
:ref:`the_canonical_patch_format`).
See Documentation/process/email-clients.rst for recommendations on email See Documentation/process/email-clients.rst for recommendations on email
clients and mailing list etiquette. clients and mailing list etiquette.
......
...@@ -56,9 +56,9 @@ Next two are try_to_wake_up() statistics: ...@@ -56,9 +56,9 @@ Next two are try_to_wake_up() statistics:
Next three are statistics describing scheduling latency: Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies) 7) sum of all time spent running by tasks on this processor (in nanoseconds)
8) sum of all time spent waiting to run by tasks on this processor (in 8) sum of all time spent waiting to run by tasks on this processor (in
jiffies) nanoseconds)
9) # of timeslices run on this cpu 9) # of timeslices run on this cpu
...@@ -155,8 +155,8 @@ schedstats also adds a new /proc/<pid>/schedstat file to include some of ...@@ -155,8 +155,8 @@ schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in the same information on a per-process level. There are three fields in
this file correlating for that process to: this file correlating for that process to:
1) time spent on the cpu 1) time spent on the cpu (in nanoseconds)
2) time spent waiting on a runqueue 2) time spent waiting on a runqueue (in nanoseconds)
3) # of timeslices run on this cpu 3) # of timeslices run on this cpu
A program could be easily written to make use of these extra fields to A program could be easily written to make use of these extra fields to
......
...@@ -20,13 +20,13 @@ ...@@ -20,13 +20,13 @@
% - Indent of 2 chars is preserved for ease of comparison. % - Indent of 2 chars is preserved for ease of comparison.
% Summary of changes from default params: % Summary of changes from default params:
% Width of page number (\@pnumwidth): 1.55em -> 2.7em % Width of page number (\@pnumwidth): 1.55em -> 2.7em
% Width of chapter number: 1.5em -> 1.8em % Width of chapter number: 1.5em -> 2.4em
% Indent of section number: 1.5em -> 1.8em % Indent of section number: 1.5em -> 2.4em
% Width of section number: 2.6em -> 3.2em % Width of section number: 2.6em -> 3.2em
% Indent of sebsection number: 4.1em -> 5em % Indent of subsection number: 4.1em -> 5.6em
% Width of subsection number: 3.5em -> 4.3em % Width of subsection number: 3.5em -> 4.3em
% %
% These params can have 4 digit page counts, 2 digit chapter counts, % These params can have 4 digit page counts, 3 digit chapter counts,
% section counts of 4 digits + 1 period (e.g., 18.10), and subsection counts % section counts of 4 digits + 1 period (e.g., 18.10), and subsection counts
% of 5 digits + 2 periods (e.g., 18.7.13). % of 5 digits + 2 periods (e.g., 18.7.13).
\makeatletter \makeatletter
...@@ -37,7 +37,7 @@ ...@@ -37,7 +37,7 @@
\ifnum \c@tocdepth >\m@ne \ifnum \c@tocdepth >\m@ne
\addpenalty{-\@highpenalty}% \addpenalty{-\@highpenalty}%
\vskip 1.0em \@plus\p@ \vskip 1.0em \@plus\p@
\setlength\@tempdima{1.8em}% \setlength\@tempdima{2.4em}%
\begingroup \begingroup
\parindent \z@ \rightskip \@pnumwidth \parindent \z@ \rightskip \@pnumwidth
\parfillskip -\@pnumwidth \parfillskip -\@pnumwidth
...@@ -51,8 +51,8 @@ ...@@ -51,8 +51,8 @@
\endgroup \endgroup
\fi} \fi}
%% Redefine \l@section and \l@subsection %% Redefine \l@section and \l@subsection
\renewcommand*\l@section{\@dottedtocline{1}{1.8em}{3.2em}} \renewcommand*\l@section{\@dottedtocline{1}{2.4em}{3.2em}}
\renewcommand*\l@subsection{\@dottedtocline{2}{5em}{4.3em}} \renewcommand*\l@subsection{\@dottedtocline{2}{5.6em}{4.3em}}
\makeatother \makeatother
%% Sphinx < 1.8 doesn't have \sphinxtableofcontentshook %% Sphinx < 1.8 doesn't have \sphinxtableofcontentshook
\providecommand{\sphinxtableofcontentshook}{} \providecommand{\sphinxtableofcontentshook}{}
......
REPORTING BUGS REPORTING BUGS
============== ==============
Report bugs to <lkml@vger.kernel.org> Report bugs to <linux-kernel@vger.kernel.org>
and <linux-trace-devel@vger.kernel.org>
LICENSE LICENSE
======= =======
......
...@@ -81,9 +81,7 @@ Linux カーネルに対する全ての変更は diff(1) コマンドによる ...@@ -81,9 +81,7 @@ Linux カーネルに対する全ての変更は diff(1) コマンドによる
dontdiff ファイルには Linux カーネルのビルドプロセスの過程で生成された dontdiff ファイルには Linux カーネルのビルドプロセスの過程で生成された
ファイルの一覧がのっています。そして、それらはパッチを生成する diff(1) ファイルの一覧がのっています。そして、それらはパッチを生成する diff(1)
コマンドで無視されるべきです。dontdiff ファイルは 2.6.12 以後のバージョ コマンドで無視されるべきです。dontdiff ファイルは 2.6.12 以後のバージョ
ンの Linux カーネルソースツリーに含まれています。それより前のバージョン ンの Linux カーネルソースツリーに含まれています。
の Linux カーネルソースツリーに対する dontdiff ファイルは、
<http://www.xenotime.net/linux/doc/dontdiff>から取得することができます。
投稿するパッチの中に関係のない余分なファイルが含まれていないことを確 投稿するパッチの中に関係のない余分なファイルが含まれていないことを確
認してください。diff(1) コマンドで生成したパッチがあなたの意図したとお 認してください。diff(1) コマンドで生成したパッチがあなたの意図したとお
...@@ -125,6 +123,17 @@ http://savannah.nongnu.org/projects/quilt ...@@ -125,6 +123,17 @@ http://savannah.nongnu.org/projects/quilt
登録済みのバグエントリを修正するパッチであれば、そのバグエントリを示すバグ ID 登録済みのバグエントリを修正するパッチであれば、そのバグエントリを示すバグ ID
や URL を明記してください。 や URL を明記してください。
特定のコミットを参照したい場合は、その SHA-1 ID だけでなく、一行サマリ
も含めてください。それにより、それが何に関するコミットなのかがレビューする
人にわかりやすくなります。
例 (英文のママ):
Commit e21d2170f36602ae2708 ("video: remove unnecessary
platform_set_drvdata()") removed the unnecessary
platform_set_drvdata(), but left the variable "dev" unused,
delete it.
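The citation style shown above (SHA-1 ID followed by the quoted one-line summary) can be produced mechanically. The following is a minimal sketch only: `format_commit_ref` is a hypothetical helper, and the 12-digit abbreviation default is an assumption of this example, not something the text above prescribes.

```python
def format_commit_ref(sha: str, summary: str, abbrev: int = 12) -> str:
    """Return 'Commit <sha> ("<summary>")' for use in a changelog.

    `abbrev` (assumed default: 12) is how many leading hex digits to keep.
    """
    sha = sha.strip().lower()
    if len(sha) < abbrev or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError(f"not a usable SHA-1 ID: {sha!r}")
    return f'Commit {sha[:abbrev]} ("{summary.strip()}")'


# SHA-1 ID and summary taken from the example in the text above.
ref = format_commit_ref(
    "e21d2170f36602ae2708",
    "video: remove unnecessary platform_set_drvdata()",
)
print(ref)
```

Inside a git checkout, `git log -1 --format='%h ("%s")' <commit>` prints the same two pieces of information.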
3) パッチの分割 3) パッチの分割
意味のあるひとまとまりごとに変更を個々のパッチファイルに分けてください。 意味のあるひとまとまりごとに変更を個々のパッチファイルに分けてください。
...@@ -162,7 +171,8 @@ http://savannah.nongnu.org/projects/quilt ...@@ -162,7 +171,8 @@ http://savannah.nongnu.org/projects/quilt
MAINTAINERS ファイルとソースコードに目を通してください。そして、その変 MAINTAINERS ファイルとソースコードに目を通してください。そして、その変
更がメンテナのいる特定のサブシステムに加えられるものであることが分か 更がメンテナのいる特定のサブシステムに加えられるものであることが分か
れば、その人に電子メールを送ってください。 れば、その人に電子メールを送ってください。その際
./scripts/get_maintainer.pl のスクリプトが有用です。
もし、メンテナが載っていなかったり、メンテナからの応答がないなら、 もし、メンテナが載っていなかったり、メンテナからの応答がないなら、
LKML ( linux-kernel@vger.kernel.org )へパッチを送ってください。ほとんど LKML ( linux-kernel@vger.kernel.org )へパッチを送ってください。ほとんど
...@@ -400,7 +410,7 @@ Acked-by: が必ずしもパッチ全体の承認を示しているわけでは ...@@ -400,7 +410,7 @@ Acked-by: が必ずしもパッチ全体の承認を示しているわけでは
このタグはパッチに関心があると思われる人達がそのパッチの議論に含まれていたこと このタグはパッチに関心があると思われる人達がそのパッチの議論に含まれていたこと
を明文化します。 を明文化します。
14) Reported-by と Tested-by: と Reviewed-by: の利用 14) Reported-by:, Tested-by:, Reviewed-by: および Suggested-by: の利用
他の誰かによって報告された問題を修正するパッチであれば、問題報告者という寄与を 他の誰かによって報告された問題を修正するパッチであれば、問題報告者という寄与を
クレジットするために、Reported-by: タグを追加することを検討してください。 クレジットするために、Reported-by: タグを追加することを検討してください。
...@@ -449,6 +459,13 @@ Reviewd-by タグはそのパッチがカーネルに対して適切な修正で ...@@ -449,6 +459,13 @@ Reviewd-by タグはそのパッチがカーネルに対して適切な修正で
レビューを実施したレビューアによって提供される時、Reviewed-by: タグがあなたの レビューを実施したレビューアによって提供される時、Reviewed-by: タグがあなたの
パッチをカーネルにマージする可能性を高めるでしょう。 パッチをカーネルにマージする可能性を高めるでしょう。
Suggested-by: タグは、パッチのアイデアがその人からの提案に基づくものである
ことを示し、アイデアの提供をクレジットするものです。提案者の明示的な許可が
ない場合、特にそのアイデアが公開のフォーラムで示されていない場合には、この
タグをつけないように注意してください。とはいえ、アイデアの提供者をこつこつ
クレジットしていけば、望むらくはその人たちが将来別の機会に再度力を貸す気に
なってくれるかもしれません。
15) 標準的なパッチのフォーマット 15) 標準的なパッチのフォーマット
標準的なパッチのサブジェクトは以下のとおりです。 標準的なパッチのサブジェクトは以下のとおりです。
...@@ -681,10 +698,11 @@ Jeff Garzik, "Linux kernel patch submission format". ...@@ -681,10 +698,11 @@ Jeff Garzik, "Linux kernel patch submission format".
<https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html> <https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html>
Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer". Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer".
<http://www.kroah.com/log/2005/03/31/> <http://www.kroah.com/log/linux/maintainer.html>
<http://www.kroah.com/log/2005/07/08/> <http://www.kroah.com/log/linux/maintainer-02.html>
<http://www.kroah.com/log/2005/10/19/> <http://www.kroah.com/log/linux/maintainer-03.html>
<http://www.kroah.com/log/2006/01/11/> <http://www.kroah.com/log/linux/maintainer-04.html>
<http://www.kroah.com/log/linux/maintainer-05.html>
NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people! NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people!
<https://lore.kernel.org/r/20050711.125305.08322243.davem@davemloft.net> <https://lore.kernel.org/r/20050711.125305.08322243.davem@davemloft.net>
......
...@@ -262,21 +262,21 @@ Linux カーネルの開発プロセスは現在幾つかの異なるメイン ...@@ -262,21 +262,21 @@ Linux カーネルの開発プロセスは現在幾つかの異なるメイン
チ」と多数のサブシステム毎のカーネルブランチから構成されます。これらの チ」と多数のサブシステム毎のカーネルブランチから構成されます。これらの
ブランチとは - ブランチとは -
- メインの 4.x カーネルツリー - Linus のメインラインツリー
- 4.x.y -stable カーネルツリー - メジャー番号をまたぐ数本の安定版ツリー
- サブシステム毎のカーネルツリーとパッチ - サブシステム毎のカーネルツリー
- 統合テストのための 4.x -next カーネルツリー - 統合テストのための linux-next カーネルツリー
4.x カーネルツリー メインラインツリー
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
4.x カーネルは Linus Torvalds によってメンテナンスされ、 メインラインツリーは Linus Torvalds によってメンテナンスされ、
https://kernel.org の pub/linux/kernel/v4.x/ ディレクトリに存在します。 https://kernel.org のリポジトリに存在します。
この開発プロセスは以下のとおり - この開発プロセスは以下のとおり -
- 新しいカーネルがリリースされた直後に、2週間の特別期間が設けられ、 - 新しいカーネルがリリースされた直後に、2週間の特別期間が設けられ、
この期間中に、メンテナ達は Linus に大きな差分を送ることができます。 この期間中に、メンテナ達は Linus に大きな差分を送ることができます。
このような差分は通常 -next カーネルに数週間含まれてきたパッチです。 このような差分は通常 linux-next カーネルに数週間含まれてきたパッチです。
大きな変更は git(カーネルのソース管理ツール、詳細は 大きな変更は git(カーネルのソース管理ツール、詳細は
http://git-scm.com/ 参照) を使って送るのが好ましいやり方ですが、パッ http://git-scm.com/ 参照) を使って送るのが好ましいやり方ですが、パッ
チファイルの形式のまま送るのでも十分です。 チファイルの形式のまま送るのでも十分です。
...@@ -303,20 +303,18 @@ Andrew Morton が Linux-kernel メーリングリストにカーネルリリー ...@@ -303,20 +303,18 @@ Andrew Morton が Linux-kernel メーリングリストにカーネルリリー
前もって決められた計画によってリリースされるものではないから 前もって決められた計画によってリリースされるものではないから
です。」* です。」*
4.x.y -stable カーネルツリー メジャー番号をまたぐ数本の安定版ツリー
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
バージョン番号が3つの数字に分かれているカーネルは -stable カーネルです。 バージョン番号が3つの数字に分かれているカーネルは -stable カーネルです。
これには、4.x カーネルで見つかったセキュリティ問題や重大な後戻りに対す これには最初の2つのバージョン番号の数字に対応した、
る比較的小さい重要な修正が含まれます。 メインラインリリースで見つかったセキュリティ問題や
重大な後戻りに対する比較的小さい重要な修正が含まれます。
これは、開発/実験的バージョンのテストに協力することに興味が無く、最新 これは、開発/実験的バージョンのテストに協力することに興味が無く、最新
の安定したカーネルを使いたいユーザに推奨するブランチです。 の安定したカーネルを使いたいユーザに推奨するブランチです。
もし、4.x.y カーネルが存在しない場合には、番号が一番大きい 4.x が最新 安定版ツリーは"stable" チーム <stable@vger.kernel.org> でメンテされており、
の安定版カーネルです。
4.x.y は "stable" チーム <stable@vger.kernel.org> でメンテされており、
必要に応じてリリースされます。通常のリリース期間は 2週間毎ですが、差 必要に応じてリリースされます。通常のリリース期間は 2週間毎ですが、差
し迫った問題がなければもう少し長くなることもあります。セキュリティ関 し迫った問題がなければもう少し長くなることもあります。セキュリティ関
連の問題の場合はこれに対してだいたいの場合、すぐにリリースがされます。 連の問題の場合はこれに対してだいたいの場合、すぐにリリースがされます。
...@@ -326,7 +324,7 @@ Documentation/process/stable-kernel-rules.rst ファイルにはどのような ...@@ -326,7 +324,7 @@ Documentation/process/stable-kernel-rules.rst ファイルにはどのような
類の変更が -stable ツリーに受け入れ可能か、またリリースプロセスがどう 類の変更が -stable ツリーに受け入れ可能か、またリリースプロセスがどう
動くかが記述されています。 動くかが記述されています。
サブシステム毎のカーネルツリーとパッチ サブシステム毎のカーネルツリー
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
それぞれのカーネルサブシステムのメンテナ達は --- そして多くのカーネル それぞれのカーネルサブシステムのメンテナ達は --- そして多くのカーネル
...@@ -351,19 +349,19 @@ quilt シリーズとして公開されているパッチキューも使われ ...@@ -351,19 +349,19 @@ quilt シリーズとして公開されているパッチキューも使われ
けることができます。大部分のこれらの patchwork のサイトは けることができます。大部分のこれらの patchwork のサイトは
https://patchwork.kernel.org/ でリストされています。 https://patchwork.kernel.org/ でリストされています。
統合テストのための 4.x -next カーネルツリー 統合テストのための linux-next カーネルツリー
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
サブシステムツリーの更新内容がメインラインの 4.x ツリーにマージされる サブシステムツリーの更新内容がメインラインツリーにマージされる
前に、それらは統合テストされる必要があります。この目的のため、実質的に 前に、それらは統合テストされる必要があります。この目的のため、実質的に
全サブシステムツリーからほぼ毎日プルされてできる特別なテスト用のリポジ 全サブシステムツリーからほぼ毎日プルされてできる特別なテスト用のリポジ
トリが存在します- トリが存在します-
https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
このやり方によって、-next カーネルは次のマージ機会でどんなものがメイン このやり方によって、linux-next は次のマージ機会でどんなものがメイン
ラインカーネルにマージされるか、おおまかなの展望を提供します。-next カー ラインにマージされるか、おおまかな展望を提供します。
ネルの実行テストを行う冒険好きなテスターは大いに歓迎されます。 linux-next の実行テストを行う冒険好きなテスターは大いに歓迎されます。
バグレポート バグレポート
------------- -------------
......
...@@ -5,7 +5,7 @@ ...@@ -5,7 +5,7 @@
\kerneldocCJKon \kerneldocCJKon
\kerneldocBeginJP{ \kerneldocBeginJP{
Japanese translations 日本語訳
===================== =====================
.. toctree:: .. toctree::
......
...@@ -53,8 +53,8 @@ DAMON_RECLAIM找到在特定时间内没有被访问的内存区域并分页。 ...@@ -53,8 +53,8 @@ DAMON_RECLAIM找到在特定时间内没有被访问的内存区域并分页。
下面是每个参数的描述。 下面是每个参数的描述。
enable enabled
------ -------
启用或禁用DAMON_RECLAIM。 启用或禁用DAMON_RECLAIM。
......
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
详细用法 详细用法
======== ========
DAMON 为不同的用户提供了下面三种接口。 DAMON 为不同的用户提供了下面这些接口。
- *DAMON用户空间工具。* - *DAMON用户空间工具。*
`这 <https://github.com/awslabs/damo>`_ 为有这特权的人, 如系统管理员,希望有一个刚好 `这 <https://github.com/awslabs/damo>`_ 为有这特权的人, 如系统管理员,希望有一个刚好
...@@ -21,19 +21,290 @@ DAMON 为不同的用户提供了下面三种接口。 ...@@ -21,19 +21,290 @@ DAMON 为不同的用户提供了下面三种接口。
使用它,用户可以以人性化的方式使用DAMON的主要功能。不过,它可能不会为特殊情况进行高度调整。 使用它,用户可以以人性化的方式使用DAMON的主要功能。不过,它可能不会为特殊情况进行高度调整。
它同时支持虚拟和物理地址空间的监测。更多细节,请参考它的 `使用文档 它同时支持虚拟和物理地址空间的监测。更多细节,请参考它的 `使用文档
<https://github.com/awslabs/damo/blob/next/USAGE.md>`_。 <https://github.com/awslabs/damo/blob/next/USAGE.md>`_。
- *debugfs接口。* - *sysfs接口。*
:ref:`这 <debugfs_interface>` 是为那些希望更高级的使用DAMON的特权用户空间程序员准备的。 :ref:`这 <sysfs_interface>` 是为那些希望更高级的使用DAMON的特权用户空间程序员准备的。
使用它,用户可以通过读取和写入特殊的debugfs文件来使用DAMON的主要功能。因此,你可以编写和使 使用它,用户可以通过读取和写入特殊的sysfs文件来使用DAMON的主要功能。因此,你可以编写和使
用你个性化的DAMON debugfs包装程序,代替你读/写debugfs文件。 `DAMON用户空间工具 用你个性化的DAMON sysfs包装程序,代替你读/写sysfs文件。 `DAMON用户空间工具
<https://github.com/awslabs/damo>`_ 就是这种程序的一个例子 它同时支持虚拟和物理地址 <https://github.com/awslabs/damo>`_ 就是这种程序的一个例子 它同时支持虚拟和物理地址
空间的监测。注意,这个界面只提供简单的监测结果 :ref:`统计 <damos_stats>`。对于详细的监测 空间的监测。注意,这个界面只提供简单的监测结果 :ref:`统计 <damos_stats>`。对于详细的监测
结果,DAMON提供了一个:ref:`跟踪点 <tracepoint>`。 结果,DAMON提供了一个:ref:`跟踪点 <tracepoint>`。
- *debugfs接口。*
:ref:`这 <debugfs_interface>` 几乎与 :ref:`sysfs接口 <sysfs_interface>` 相同。
这将在下一个LTS内核发布后被移除,所以用户应该转移到
:ref:`sysfs接口 <sysfs_interface>`。
- *内核空间编程接口。* - *内核空间编程接口。*
:doc:`This </vm/damon/api>` 这是为内核空间程序员准备的。使用它,用户可以通过为你编写内 :doc:` </vm/damon/api>` 这是为内核空间程序员准备的。使用它,用户可以通过为你编写内
核空间的DAMON应用程序,最灵活有效地利用DAMON的每一个功能。你甚至可以为各种地址空间扩展DAMON。 核空间的DAMON应用程序,最灵活有效地利用DAMON的每一个功能。你甚至可以为各种地址空间扩展DAMON。
详细情况请参考接口 :doc:`文件 </vm/damon/api>`。 详细情况请参考接口 :doc:`文件 </vm/damon/api>`。
sysfs接口
=========
DAMON的sysfs接口是在定义 ``CONFIG_DAMON_SYSFS`` 时建立的。它在其sysfs目录下创建多
个目录和文件, ``<sysfs>/kernel/mm/damon/`` 。你可以通过对该目录下的文件进行写入和
读取来控制DAMON。
对于一个简短的例子,用户可以监测一个给定工作负载的虚拟地址空间,如下所示::
# cd /sys/kernel/mm/damon/admin/
# echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts
# echo vaddr > kdamonds/0/contexts/0/operations
# echo 1 > kdamonds/0/contexts/0/targets/nr_targets
# echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
# echo on > kdamonds/0/state
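The same sequence of writes can be expressed as data, which makes the required ordering explicit: each `nr_*` file has to be written before the numbered directory it creates is used. This is a sketch under assumptions, using the file names from the hierarchy listing in this document; `vaddr_monitoring_plan` and `apply_plan` are hypothetical helper names, and the PID is a placeholder.

```python
from pathlib import Path


def vaddr_monitoring_plan(pid: int) -> list[tuple[str, str]]:
    """Ordered (relative path, value) writes for one kdamond monitoring `pid`."""
    ctx = "kdamonds/0/contexts/0"
    return [
        ("kdamonds/nr_kdamonds", "1"),              # creates kdamonds/0/
        ("kdamonds/0/contexts/nr_contexts", "1"),   # creates contexts/0/
        (f"{ctx}/operations", "vaddr"),             # monitor virtual addresses
        (f"{ctx}/targets/nr_targets", "1"),         # creates targets/0/
        (f"{ctx}/targets/0/pid_target", str(pid)),
        ("kdamonds/0/state", "on"),                 # start monitoring last
    ]


def apply_plan(plan, admin="/sys/kernel/mm/damon/admin"):
    """Perform the writes in order; needs root and CONFIG_DAMON_SYSFS."""
    for rel_path, value in plan:
        (Path(admin) / rel_path).write_text(value)


plan = vaddr_monitoring_plan(1234)  # 1234 is a placeholder PID
for rel_path, value in plan:
    print(f"echo {value} > {rel_path}")
```

Only the plan is printed here; `apply_plan(plan)` would touch the real sysfs files and is left for the reader to call on a suitable system.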
文件层次结构
------------
DAMON sysfs接口的文件层次结构如下图所示。在下图中,父子关系用缩进表示,每个目录有
``/`` 后缀,每个目录中的文件用逗号(",")分开。 ::
/sys/kernel/mm/damon/admin
│ kdamonds/nr_kdamonds
│ │ 0/state,pid
│ │ │ contexts/nr_contexts
│ │ │ │ 0/operations
│ │ │ │ │ monitoring_attrs/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
│ │ │ │ │ │ nr_regions/min,max
│ │ │ │ │ targets/nr_targets
│ │ │ │ │ │ 0/pid_target
│ │ │ │ │ │ │ regions/nr_regions
│ │ │ │ │ │ │ │ 0/start,end
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ schemes/nr_schemes
│ │ │ │ │ │ 0/action
│ │ │ │ │ │ │ access_pattern/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
│ │ │ │ │ │ │ │ age/min,max
│ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ ...
│ │ │ │ ...
│ │ ...
--
DAMON sysfs接口的根是 ``<sysfs>/kernel/mm/damon/`` ,它有一个名为 ``admin`` 的
目录。该目录包含特权用户空间程序控制DAMON的文件。拥有根权限的用户空间工具或daemons可以
使用这个目录。
kdamonds/
---------
与监测相关的信息包括请求规格和结果被称为DAMON上下文。DAMON用一个叫做kdamond的内核线程
执行每个上下文,多个kdamonds可以并行运行。
在 ``admin`` 目录下,有一个目录,即``kdamonds``,它有控制kdamonds的文件存在。在开始
时,这个目录只有一个文件,``nr_kdamonds``。向该文件写入一个数字(``N``),就会创建名为
``0`` 到 ``N-1`` 的子目录数量。每个目录代表每个kdamond。
kdamonds/<N>/
-------------
在每个kdamond目录中,存在两个文件(``state`` 和 ``pid`` )和一个目录( ``contexts`` )。
读取 ``state`` 时,如果kdamond当前正在运行,则返回 ``on`` ,如果没有运行则返回 ``off`` 。
写入 ``on`` 或 ``off`` 使kdamond处于该状态。向 ``state`` 文件写 ``update_schemes_stats`` ,
更新kdamond的每个基于DAMON的操作方案的统计文件的内容。关于统计信息的细节,请参考
:ref:`stats section <sysfs_schemes_stats>`.
如果状态为 ``on``,读取 ``pid`` 显示kdamond线程的pid。
``contexts`` 目录包含控制这个kdamond要执行的监测上下文的文件。
kdamonds/<N>/contexts/
----------------------
在开始时,这个目录只有一个文件,即 ``nr_contexts`` 。向该文件写入一个数字( ``N`` ),就会创
建名为 ``0`` 到 ``N-1`` 的子目录数量。每个目录代表每个监测上下文。目前,每个kdamond只支持
一个上下文,所以只有 ``0`` 或 ``1`` 可以被写入文件。
contexts/<N>/
-------------
在每个上下文目录中,存在一个文件(``operations``)和三个目录(``monitoring_attrs``,
``targets``, 和 ``schemes``)。
DAMON支持多种类型的监测操作,包括对虚拟地址空间和物理地址空间的监测。你可以通过向文件
中写入以下关键词之一,并从文件中读取,来设置和获取DAMON将为上下文使用何种类型的监测操作。
- vaddr: 监测特定进程的虚拟地址空间
- paddr: 监视系统的物理地址空间
contexts/<N>/monitoring_attrs/
------------------------------
用于指定监测属性的文件,包括所需的监测质量和效率,都在 ``monitoring_attrs`` 目录中。
具体来说,这个目录下有两个目录,即 ``intervals`` 和 ``nr_regions`` 。
在 ``intervals`` 目录下,存在DAMON的采样间隔(``sample_us``)、聚集间隔(``aggr_us``)
和更新间隔(``update_us``)三个文件。你可以通过写入和读出这些文件来设置和获取微秒级的值。
在 ``nr_regions`` 目录下,有两个文件分别用于DAMON监测区域的下限和上限(``min`` 和 ``max`` ),
这两个文件控制着监测的开销。你可以通过向这些文件的写入和读出来设置和获取这些值。
关于间隔和监测区域范围的更多细节,请参考设计文件 (:doc:`/vm/damon/design`)。
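As an illustration of the `monitoring_attrs` layout described above, the sketch below builds the writes for the five attributes. The helper name is hypothetical, the check that the lower bound does not exceed the upper bound is a sanity check of this sketch only, and the sample values (5 ms, 100 ms, 1000 ms, 10 and 1000 regions) are borrowed from the debugfs example elsewhere in this document.

```python
def monitoring_attrs_plan(sample_ms, aggr_ms, update_ms, min_regions, max_regions):
    """Return (relative path, value) writes under contexts/<N>/monitoring_attrs/."""
    if min_regions > max_regions:
        raise ValueError("min_regions must not exceed max_regions")

    def to_us(ms):
        # The interval files take microseconds.
        return str(int(ms * 1000))

    return [
        ("intervals/sample_us", to_us(sample_ms)),
        ("intervals/aggr_us", to_us(aggr_ms)),
        ("intervals/update_us", to_us(update_ms)),
        ("nr_regions/min", str(min_regions)),
        ("nr_regions/max", str(max_regions)),
    ]


attrs = dict(monitoring_attrs_plan(5, 100, 1000, 10, 1000))
print(attrs)
```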
contexts/<N>/targets/
---------------------
在开始时,这个目录只有一个文件 ``nr_targets`` 。向该文件写入一个数字(``N``),就可以创建
名为 ``0`` 到 ``N-1`` 的子目录的数量。每个目录代表每个监测目标。
targets/<N>/
------------
在每个目标目录中,存在一个文件(``pid_target``)和一个目录(``regions``)。
如果你把 ``vaddr`` 写到 ``contexts/<N>/operations`` 中,每个目标应该是一个进程。你
可以通过将进程的pid写到 ``pid_target`` 文件中来指定DAMON的进程。
targets/<N>/regions
-------------------
当使用 ``vaddr`` 监测操作集时( ``vaddr`` 被写入 ``contexts/<N>/operations`` 文
件),DAMON自动设置和更新监测目标区域,这样就可以覆盖目标进程的整个内存映射。然而,用户可
能希望将初始监测区域设置为特定的地址范围。
相反,当使用 ``paddr`` 监测操作集时,DAMON不会自动设置和更新监测目标区域( ``paddr``
被写入 ``contexts/<N>/operations`` 中)。因此,在这种情况下,用户应该自己设置监测目标
区域。
在这种情况下,用户可以按照自己的意愿明确设置初始监测目标区域,将适当的值写入该目录下的文件。
开始时,这个目录只有一个文件, ``nr_regions`` 。向该文件写入一个数字(``N``),就可以创
建名为 ``0`` 到 ``N-1`` 的子目录。每个目录代表每个初始监测目标区域。
regions/<N>/
------------
在每个区域目录中,你会发现两个文件( ``start`` 和 ``end`` )。你可以通过向文件写入
和从文件中读出,分别设置和获得初始监测目标区域的起始和结束地址。
contexts/<N>/schemes/
---------------------
对于一般的基于DAMON的数据访问感知的内存管理优化,用户通常希望系统对特定访问模式的内存区
域应用内存管理操作。DAMON从用户那里接收这种形式化的操作方案,并将这些方案应用于目标内存
区域。用户可以通过读取和写入这个目录下的文件来获得和设置这些方案。
在开始时,这个目录只有一个文件,``nr_schemes``。向该文件写入一个数字(``N``),就可以
创建名为``0``到``N-1``的子目录的数量。每个目录代表每个基于DAMON的操作方案。
schemes/<N>/
------------
在每个方案目录中,存在四个目录(``access_pattern``, ``quotas``,``watermarks``,
和 ``stats``)和一个文件(``action``)。
``action`` 文件用于设置和获取你想应用于具有特定访问模式的内存区域的动作。可以写入文件
和从文件中读取的关键词及其含义如下。
- ``willneed``: 对有 ``MADV_WILLNEED`` 的区域调用 ``madvise()`` 。
- ``cold``: 对具有 ``MADV_COLD`` 的区域调用 ``madvise()`` 。
- ``pageout``: 为具有 ``MADV_PAGEOUT`` 的区域调用 ``madvise()`` 。
- ``hugepage``: 为带有 ``MADV_HUGEPAGE`` 的区域调用 ``madvise()`` 。
- ``nohugepage``: 为带有 ``MADV_NOHUGEPAGE`` 的区域调用 ``madvise()``。
- ``stat``: 什么都不做,只计算统计数据
schemes/<N>/access_pattern/
---------------------------
每个基于DAMON的操作方案的目标访问模式由三个范围构成,包括以字节为单位的区域大小、每个
聚合区间的监测访问次数和区域年龄的聚合区间数。
在 ``access_pattern`` 目录下,存在三个目录( ``sz``, ``nr_accesses``, 和 ``age`` ),
每个目录有两个文件(``min`` 和 ``max`` )。你可以通过向 ``sz``, ``nr_accesses``, 和
``age`` 目录下的 ``min`` 和 ``max`` 文件分别写入和读取来设置和获取给定方案的访问模式。
schemes/<N>/quotas/
-------------------
每个 ``动作`` 的最佳 ``目标访问模式`` 取决于工作负载,所以不容易找到。更糟糕的是,将某些动作
的方案设置得过于激进会造成严重的开销。为了避免这种开销,用户可以为每个方案限制时间和大小配额。
具体来说,用户可以要求DAMON在给定的时间间隔(``重置间隔``)内,最多只使用特定的时间(``时间配额``)来
应用动作,并且最多只对特定数量(``大小配额``)的具有目标访问模式的内存区域应用动作。
当预计超过配额限制时,DAMON会根据 ``目标访问模式`` 的大小、访问频率和年龄,对找到的内存区域
进行优先排序。为了进行个性化的优先排序,用户可以为这三个属性设置权重。
在 ``quotas`` 目录下,存在三个文件(``ms``, ``bytes``, ``reset_interval_ms``)和一个
目录(``weights``),其中有三个文件(``sz_permil``, ``nr_accesses_permil``, 和
``age_permil``)。
你可以设置以毫秒为单位的 ``时间配额`` ,以字节为单位的 ``大小配额`` ,以及以毫秒为单位的 ``重
置间隔`` ,分别向这三个文件写入数值。你还可以通过向 ``weights`` 目录下的三个文件写入数值来设
置大小、访问频率和年龄的优先权,单位为千分之一。
schemes/<N>/watermarks/
-----------------------
为了便于根据系统状态激活和停用每个方案,DAMON提供了一个称为水位的功能。该功能接收五个值,称为
``度量`` 、``间隔`` 、``高`` 、``中`` 、``低`` 。``度量`` 是指可以测量的系统度量值,如
空闲内存比率。如果系统的度量值高于 ``高`` 值或低于 ``低`` 值,则该方案被停用。如果
该值低于 ``中`` 值,则该方案被激活。
在水位目录下,存在五个文件(``metric``, ``interval_us``, ``high``, ``mid``, 和 ``low``)
用于设置每个值。你可以通过对这些文件的写入和读取来分别设置和获取这五个值。
可以写入 ``metric`` 文件的关键词和含义如下。
- none: 忽略水位
- free_mem_rate: 系统的自由内存率(千分比)。
``interval`` 应以微秒为单位写入。
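The watermark rule above can be read as a small decision function evaluated once per check interval. The sketch below is an interpretation of the text, not DAMON's implementation: `scheme_is_active` is a hypothetical name, and keeping the previous state while the metric lies between mid and high is an assumption, since the text does not spell out that case.

```python
def scheme_is_active(metric, high, mid, low, was_active):
    """Watermark decision for one check; all values share the metric's unit."""
    if metric > high or metric < low:
        return False       # outside [low, high]: the scheme is deactivated
    if metric < mid:
        return True        # below mid but not below low: the scheme is activated
    return was_active      # between mid and high: assumed to keep the old state


# Thresholds in permille, as in the free_mem_rate use case of this document.
for free_mem_rate in (650, 550, 450, 250):
    print(free_mem_rate, scheme_is_active(free_mem_rate, 600, 500, 300, False))
```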
schemes/<N>/stats/
------------------
DAMON统计每个方案被尝试应用的区域的总数量和字节数,每个方案被成功应用的区域的两个数字,以及
超过配额限制的总数量。这些统计数据可用于在线分析或调整方案。
可以通过读取 ``stats`` 目录下的文件(``nr_tried``, ``sz_tried``, ``nr_applied``,
``sz_applied``, 和 ``qt_exceeds``)分别检索这些统计数据。这些文件不是实时更新的,所以
你应该要求DAMON sysfs接口通过在相关的 ``kdamonds/<N>/state`` 文件中写入一个特殊的关键字
``update_schemes_stats`` 来更新统计信息的文件内容。
用例
~~~~
下面的命令应用了一个方案:“如果一个大小为[4KiB, 8KiB]的内存区域在[10, 20]的聚合时间间隔内
显示出每一个聚合时间间隔[0, 5]的访问量,请分页该区域。对于分页,每秒最多只能使用10ms,而且每
秒分页不能超过1GiB。在这一限制下,首先分页出具有较长年龄的内存区域。另外,每5秒钟检查一次系统
的可用内存率,当可用内存率低于50%时开始监测和分页,但如果可用内存率大于60%,或低于30%,则停
止监测。” ::
# cd <sysfs>/kernel/mm/damon/admin
# # populate directories
# echo 1 > kdamonds/nr_kdamonds; echo 1 > kdamonds/0/contexts/nr_contexts;
# echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes
# cd kdamonds/0/contexts/0/schemes/0
# # set the basic access pattern and the action
# echo 4096 > access_pattern/sz/min
# echo 8192 > access_pattern/sz/max
# echo 0 > access_pattern/nr_accesses/min
# echo 5 > access_pattern/nr_accesses/max
# echo 10 > access_pattern/age/min
# echo 20 > access_pattern/age/max
# echo pageout > action
# # set quotas
# echo 10 > quotas/ms
# echo $((1024*1024*1024)) > quotas/bytes
# echo 1000 > quotas/reset_interval_ms
# # set watermark
# echo free_mem_rate > watermarks/metric
# echo 5000000 > watermarks/interval_us
# echo 600 > watermarks/high
# echo 500 > watermarks/mid
# echo 300 > watermarks/low
请注意,我们强烈建议使用用户空间的工具,如 `damo <https://github.com/awslabs/damo>`_ ,
而不是像上面那样手动读写文件。以上只是一个例子。
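For readers who still want to script the files directly, a scheme can be described as a nested mapping that mirrors the `schemes/<N>/` directory layout and then flattened into individual writes. This is only a sketch: `flatten` is a hypothetical helper, the directory name `access_pattern` follows the hierarchy listing in this document, and the values are the ones from the use case above.

```python
def flatten(tree, prefix=""):
    """Yield (relative path, value) pairs from a nested dict, depth first."""
    for name, node in tree.items():
        path = f"{prefix}{name}"
        if isinstance(node, dict):
            yield from flatten(node, path + "/")
        else:
            yield path, str(node)


scheme = {
    "access_pattern": {
        "sz": {"min": 4096, "max": 8192},
        "nr_accesses": {"min": 0, "max": 5},
        "age": {"min": 10, "max": 20},
    },
    "action": "pageout",
    "quotas": {"ms": 10, "bytes": 1024 ** 3, "reset_interval_ms": 1000},
    "watermarks": {
        "metric": "free_mem_rate",
        "interval_us": 5_000_000,
        "high": 600,
        "mid": 500,
        "low": 300,
    },
}

writes = list(flatten(scheme))
for path, value in writes:
    print(f"echo {value} > {path}")
```

The paths are relative to `kdamonds/0/contexts/0/schemes/0`, which must already exist, that is, the `nr_*` files have been written first.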
debugfs接口 debugfs接口
=========== ===========
...@@ -46,7 +317,7 @@ DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``, ...@@ -46,7 +317,7 @@ DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``,
属性 属性
---- ----
用户可以通过读取和写入 ``attrs`` 文件获得和设置 ``采样间隔`` 、 ``聚集间隔`` 、 ``区域更新间隔`` 用户可以通过读取和写入 ``attrs`` 文件获得和设置 ``采样间隔`` 、 ``聚集间隔`` 、 ``更新间隔``
以及监测目标区域的最小/最大数量。要详细了解监测属性,请参考 `:doc:/vm/damon/design` 。例如, 以及监测目标区域的最小/最大数量。要详细了解监测属性,请参考 `:doc:/vm/damon/design` 。例如,
下面的命令将这些值设置为5ms、100ms、1000ms、10和1000,然后再次检查:: 下面的命令将这些值设置为5ms、100ms、1000ms、10和1000,然后再次检查::
...@@ -108,8 +379,8 @@ DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``, ...@@ -108,8 +379,8 @@ DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``,
1 20 40 1 20 40
1 50 100" > init_regions 1 50 100" > init_regions
请注意,这只是设置了初始的监测目标区域。在虚拟内存监测的情况下,DAMON会在一个 ``区域更新间隔`` 请注意,这只是设置了初始的监测目标区域。在虚拟内存监测的情况下,DAMON会在一个 ``更新间隔``
后自动更新区域的边界。因此,在这种情况下,如果用户不希望更新的话,应该把 ``区域的更新间隔`` 设 后自动更新区域的边界。因此,在这种情况下,如果用户不希望更新的话,应该把 ``更新间隔`` 设
置得足够大。 置得足够大。
......
.. highlight:: none
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/dev-tools/gdb-kernel-debugging.rst
:Translator: 高超 gao chao <gaochao49@huawei.com>
通过gdb调试内核和模块
=====================
Kgdb内核调试器、QEMU等虚拟机管理程序或基于JTAG的硬件接口,支持在运行时使用gdb
调试Linux内核及其模块。Gdb提供了一个强大的python脚本接口,内核也提供了一套
辅助脚本以简化典型的内核调试步骤。本文档为如何启用和使用这些脚本提供了一个简要的教程。
此教程基于QEMU/KVM虚拟机,但文中示例也适用于其他gdb stub。
环境配置要求
------------
- gdb 7.2+ (推荐版本: 7.4+) 且开启python支持 (通常发行版上都已支持)
设置
----
- 创建一个QEMU/KVM的linux虚拟机(详情请参考 www.linux-kvm.org 和 www.qemu.org )。
对于交叉开发,https://landley.net/aboriginal/bin 提供了一些镜像和工具链,
可以帮助搭建交叉开发环境。
- 编译内核时开启CONFIG_GDB_SCRIPTS,关闭CONFIG_DEBUG_INFO_REDUCED。
如果架构支持CONFIG_FRAME_POINTER,请保持开启。
- 在guest环境上安装该内核。如有必要,通过在内核command line中添加“nokaslr”来关闭KASLR。
此外,QEMU允许通过-kernel、-append、-initrd这些命令行选项直接启动内核。
但这通常仅在不依赖内核模块时才有效。有关此模式的更多详细信息,请参阅QEMU文档。
在这种情况下,如果架构支持KASLR,应该在禁用CONFIG_RANDOMIZE_BASE的情况下构建内核。
- 启用QEMU/KVM的gdb stub,可以通过如下方式实现
- 在VM启动时,通过在QEMU命令行中添加“-s”参数
- 在运行时通过从QEMU监视控制台发送“gdbserver”
- 切换到/path/to/linux-build(内核源码编译)目录
- 启动gdb:gdb vmlinux
注意:某些发行版可能会将gdb脚本的自动加载限制在已知的安全目录中。
如果gdb报告拒绝加载vmlinux-gdb.py(相关命令找不到),请将::
add-auto-load-safe-path /path/to/linux-build
添加到~/.gdbinit。更多详细信息,请参阅gdb帮助信息。
- 连接到已启动的guest环境::
(gdb) target remote :1234
使用Linux提供的gdb脚本的示例
----------------------------
- 加载模块(以及主内核)符号::
(gdb) lx-symbols
loading vmlinux
scanning for modules in /home/user/linux/build
loading @0xffffffffa0020000: /home/user/linux/build/net/netfilter/xt_tcpudp.ko
loading @0xffffffffa0016000: /home/user/linux/build/net/netfilter/xt_pkttype.ko
loading @0xffffffffa0002000: /home/user/linux/build/net/netfilter/xt_limit.ko
loading @0xffffffffa00ca000: /home/user/linux/build/net/packet/af_packet.ko
loading @0xffffffffa003c000: /home/user/linux/build/fs/fuse/fuse.ko
...
loading @0xffffffffa0000000: /home/user/linux/build/drivers/ata/ata_generic.ko
- 对一些尚未加载的模块中的函数设置断点,例如::
(gdb) b btrfs_init_sysfs
Function "btrfs_init_sysfs" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btrfs_init_sysfs) pending.
- 继续执行::
(gdb) c
- 加载模块并且能观察到正在加载的符号以及断点命中::
loading @0xffffffffa0034000: /home/user/linux/build/lib/libcrc32c.ko
loading @0xffffffffa0050000: /home/user/linux/build/lib/lzo/lzo_compress.ko
loading @0xffffffffa006e000: /home/user/linux/build/lib/zlib_deflate/zlib_deflate.ko
loading @0xffffffffa01b1000: /home/user/linux/build/fs/btrfs/btrfs.ko
Breakpoint 1, btrfs_init_sysfs () at /home/user/linux/fs/btrfs/sysfs.c:36
36 btrfs_kset = kset_create_and_add("btrfs", NULL, fs_kobj);
- 查看内核的日志缓冲区::
(gdb) lx-dmesg
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.8.0-rc4-dbg+ (...
[ 0.000000] Command line: root=/dev/sda2 resume=/dev/sda1 vga=0x314
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
....
- 查看当前task struct结构体的字段(仅x86和arm64支持)::
(gdb) p $lx_current().pid
$1 = 4998
(gdb) p $lx_current().comm
$2 = "modprobe\000\000\000\000\000\000\000"
- 对当前或指定的CPU使用per-cpu函数::
(gdb) p $lx_per_cpu("runqueues").nr_running
$3 = 1
(gdb) p $lx_per_cpu("runqueues", 2).nr_running
$4 = 0
- 使用container_of查看更多hrtimers信息::
(gdb) set $next = $lx_per_cpu("hrtimer_bases").clock_base[0].active.next
(gdb) p *$container_of($next, "struct hrtimer", "node")
$5 = {
node = {
node = {
__rb_parent_color = 18446612133355256072,
rb_right = 0x0 <irq_stack_union>,
rb_left = 0x0 <irq_stack_union>
},
expires = {
tv64 = 1835268000000
}
},
_softexpires = {
tv64 = 1835268000000
},
function = 0xffffffff81078232 <tick_sched_timer>,
base = 0xffff88003fd0d6f0,
state = 1,
start_pid = 0,
start_site = 0xffffffff81055c1f <hrtimer_start_range_ns+20>,
start_comm = "swapper/2\000\000\000\000\000\000"
}
命令和辅助调试功能列表
----------------------
命令和辅助调试功能可能会随着时间的推移而改进,此文显示的是初始版本的部分示例::
(gdb) apropos lx
function lx_current -- Return current task
function lx_module -- Find module by name and return the module variable
function lx_per_cpu -- Return per-cpu variable
function lx_task_by_pid -- Find Linux task by PID and return the task_struct variable
function lx_thread_info -- Calculate Linux thread_info from task variable
lx-dmesg -- Print Linux kernel log buffer
lx-lsmod -- List currently loaded modules
lx-symbols -- (Re-)load symbols of Linux kernel and currently loaded modules
可以通过“help <command-name>”或“help function <function-name>”命令
获取指定命令或指定调试功能的更多详细信息。
...@@ -25,6 +25,7 @@ Documentation/translations/zh_CN/dev-tools/testing-overview.rst ...@@ -25,6 +25,7 @@ Documentation/translations/zh_CN/dev-tools/testing-overview.rst
sparse sparse
gcov gcov
kasan kasan
gdb-kernel-debugging
Todolist: Todolist:
...@@ -34,7 +35,6 @@ Todolist: ...@@ -34,7 +35,6 @@ Todolist:
- kmemleak - kmemleak
- kcsan - kcsan
- kfence - kfence
- gdb-kernel-debugging
- kgdb - kgdb
- kselftest - kselftest
- kunit/index - kunit/index
...@@ -120,24 +120,24 @@ dt_compat列表(如果你好奇,该列表定义在arch/arm/include/asm/mach/ ...@@ -120,24 +120,24 @@ dt_compat列表(如果你好奇,该列表定义在arch/arm/include/asm/mach/
表示什么。在Documentation/devicetree/bindings中添加兼容字符串的文档。 表示什么。在Documentation/devicetree/bindings中添加兼容字符串的文档。
同样在ARM上,对于每个machine_desc,内核会查看是否有任何dt_compat列表条 同样在ARM上,对于每个machine_desc,内核会查看是否有任何dt_compat列表条
目出现在兼容属性中。如果有,那么该机器_desc就是驱动该机器的候选者。在搜索 目出现在兼容属性中。如果有,那么该machine_desc就是驱动该机器的候选者。在搜索
了整个machine_descs表之后,setup_machine_fdt()根据每个machine_desc 了整个machine_descs表之后,setup_machine_fdt()根据每个machine_desc
在兼容属性中匹配的条目,返回 “最兼容” 的machine_desc。如果没有找到匹配 在兼容属性中匹配的条目,返回 “最兼容” 的machine_desc。如果没有找到匹配
的machine_desc,那么它将返回NULL。 的machine_desc,那么它将返回NULL。
这个方案背后的原因是观察到,在大多数情况下,如果它们都使用相同的SoC或相同 这个方案背后的原因是观察到,在大多数情况下,如果它们都使用相同的SoC或相同
系列的SoC,一个机器_desc可以支持大量的电路板。然而,不可避免地会有一些例 系列的SoC,一个machine_desc可以支持大量的电路板。然而,不可避免地会有一些例
外情况,即特定的板子需要特殊的设置代码,这在一般情况下是没有用的。特殊情况 外情况,即特定的板子需要特殊的设置代码,这在一般情况下是没有用的。特殊情况
可以通过在通用设置代码中明确检查有问题的板子来处理,但如果超过几个情况下, 可以通过在通用设置代码中明确检查有问题的板子来处理,但如果超过几个情况下,
这样做很快就会变得很难看和/或无法维护。 这样做很快就会变得很难看和/或无法维护。
相反,兼容列表允许通用机器_desc通过在dt_compat列表中指定“不太兼容”的值 相反,兼容列表允许通用machine_desc通过在dt_compat列表中指定“不太兼容”的值
来提供对广泛的通用板的支持。在上面的例子中,通用板支持可以声称与“ti,omap3” 来提供对广泛的通用板的支持。在上面的例子中,通用板支持可以声称与“ti,omap3”
或“ti,omap3450”兼容。如果在最初的beagleboard上发现了一个bug,需要在 或“ti,omap3450”兼容。如果在最初的beagleboard上发现了一个bug,需要在
早期启动时使用特殊的变通代码,那么可以添加一个新的machine_desc,实现变通, 早期启动时使用特殊的变通代码,那么可以添加一个新的machine_desc,实现变通,
并且只在“ti,omap3-beagleboard”上匹配。 并且只在“ti,omap3-beagleboard”上匹配。
PowerPC使用了一个稍微不同的方案,它从每个机器_desc中调用.probe()钩子, PowerPC使用了一个稍微不同的方案,它从每个machine_desc中调用.probe()钩子,
并使用第一个返回TRUE的钩子。然而,这种方法没有考虑到兼容列表的优先级,对于 并使用第一个返回TRUE的钩子。然而,这种方法没有考虑到兼容列表的优先级,对于
新的架构支持可能应该避免。 新的架构支持可能应该避免。
......
...@@ -108,6 +108,7 @@ TODOList: ...@@ -108,6 +108,7 @@ TODOList:
:maxdepth: 2 :maxdepth: 2
core-api/index core-api/index
locking/index
accounting/index accounting/index
cpu-freq/index cpu-freq/index
iio/index iio/index
...@@ -123,7 +124,6 @@ TODOList: ...@@ -123,7 +124,6 @@ TODOList:
TODOList: TODOList:
* driver-api/index * driver-api/index
* locking/index
* block/index * block/index
* cdrom/index * cdrom/index
* ide/index * ide/index
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/locking/index.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
==
锁
==
.. toctree::
:maxdepth: 1
TODOList:
* locktypes
* lockdep-design
* lockstat
* locktorture
* mutex-design
* rt-mutex-design
* rt-mutex
* seqlock
* spinlocks
* ww-mutex-design
* preempt-locking
* pi-futex
* futex-requeue-pi
* hwspinlock
* percpu-rw-semaphore
* robust-futexes
* robust-futex-ABI
.. only:: subproject and html
Indices
=======
* :ref:`genindex`
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/locking/spinlocks.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
==========
加锁的教训
==========
教训 1:自旋锁
==============
加锁最基本的原语是自旋锁(spinlock)::
static DEFINE_SPINLOCK(xxx_lock);
unsigned long flags;
spin_lock_irqsave(&xxx_lock, flags);
... 这里是临界区 ..
spin_unlock_irqrestore(&xxx_lock, flags);
上述代码总是安全的。自旋锁将在 _本地_ 禁用中断,但它本身将保证全局锁定。所以它
将保证在该锁保护的区域内只有一个控制线程。即使在单处理器(UP)下也能很好的工作,
所以代码 _不_ 需要担心UP还是SMP的问题:自旋锁在两种情况下都能正常工作。
注意!自旋锁对内存的潜在影响由下述文档进一步描述:
Documentation/memory-barriers.txt
(5) ACQUIRE operations.
(6) RELEASE operations.
上述代码通常非常简单(对大部分情况,你通常需要并且只希望有一个自旋锁——使用多个
自旋锁会使事情变得更复杂,甚至更慢,而且通常仅仅在你 **理解的** 序列有被拆分的
需求时才值得这么做:如果你不确定的话,请不惜一切代价避免这样做)。
这是关于自旋锁的唯一真正困难的部分:一旦你开始使用自旋锁,它们往往会扩展到你以前
可能没有注意到的领域,因为你必须确保自旋锁正确地保护共享数据结构 **每一处** 被
使用的地方。自旋锁是最容易被添加到完全独立于其它代码的地方(例如,没有人访问的
内部驱动数据结构)的。
注意!仅当你在跨CPU核访问时使用 **同一把** 自旋锁,对它的使用才是安全的。
这意味着所有访问共享变量的代码必须对它们想使用的自旋锁达成一致。
----
教训 2:读-写自旋锁
===================
如果你的数据访问有一个非常自然的模式,倾向于从共享变量中读取数据,读-写自旋锁
(rw_lock)有时是有用的。它们允许多个读者同时出现在同一个临界区,但是如果有人想
改变变量,它必须获得一个独占的写锁。
注意!读-写自旋锁比原始自旋锁需要更多的原子内存操作。除非读者的临界区很长,
否则你最好只使用原始自旋锁。
例程看起来和上面一样::
rwlock_t xxx_lock = __RW_LOCK_UNLOCKED(xxx_lock);
unsigned long flags;
read_lock_irqsave(&xxx_lock, flags);
.. 仅读取信息的临界区 ...
read_unlock_irqrestore(&xxx_lock, flags);
write_lock_irqsave(&xxx_lock, flags);
.. 读取和独占写信息 ...
write_unlock_irqrestore(&xxx_lock, flags);
上面这种锁对于复杂的数据结构如链表可能会有用,特别是在不改变链表的情况下搜索其中
的条目。读锁允许许多并发的读者。任何希望 **修改** 链表的代码将必须先获取写锁。
注意!RCU锁更适合遍历链表,但需要仔细注意设计细节(见Documentation/RCU/listRCU.rst)。
另外,你不能把读锁“升级”为写锁,所以如果你在 _任何_ 时候需要做任何修改
(即使你不是每次都这样做),你必须在一开始就获得写锁。
注意!我们正在努力消除大多数情况下的读-写自旋锁的使用,所以请不要在没有达成
共识的情况下增加一个新的(相反,请参阅Documentation/RCU/rcu.rst以获得完整
信息)。
----
教训 3:重新审视自旋锁
======================
上述的自旋锁原语绝不是唯一的。它们是最安全的,在所有情况下都能正常工作,但部分
**因为** 它们是安全的,它们也是相当慢的。它们比原本需要的更慢,因为它们必须要
禁用中断(在X86上只是一条指令,但却是一条昂贵的指令——而在其他体系结构上,情况
可能更糟)。
如果你有必须保护跨CPU访问的数据结构且你想使用自旋锁的场景,你有可能使用代价小的
自旋锁版本。当且仅当你知道某自旋锁永远不会在中断处理程序中使用,你可以使用非中断
的版本::
spin_lock(&lock);
...
spin_unlock(&lock);
(当然,也可以使用相应的读-写锁版本)。这种自旋锁将同样可以保证独占访问,而且
速度会快很多。如果你知道有关的数据只在“进程上下文”中被存取,即,不涉及中断,
这种自旋锁就有用了。
当这些版本的自旋锁涉及中断时,你不能使用的原因是会陷入死锁::
spin_lock(&lock);
...
<- 中断来临:
spin_lock(&lock);
一个中断试图对一个已经锁定的变量上锁。如果中断发生在另一个CPU上,不会有问题;
但如果中断发生在已经持有自旋锁的同一个CPU上,将 _会_ 有问题,因为该锁显然永远
不会被释放(因为中断正在等待该锁,而锁的持有者被中断打断,并且无法继续执行,
直到中断处理结束)。
(这也是自旋锁的中断版本只需要禁用 _本地_ 中断的原因——在发生于其它CPU的中断中
使用同一把自旋锁是没问题的,因为发生于其它CPU的中断不会打断已经持锁的CPU,所以
锁的持有者可以继续执行并最终释放锁)。
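The deadlock described above can be imitated in user space with a non-recursive lock. This is an analogy only, not kernel code: `threading.Lock` stands in for the spinlock, an ordinary function call stands in for the interrupt, and `blocking=False` is used so that the sketch reports the problem instead of hanging the way a real spinning handler would.

```python
import threading

lock = threading.Lock()  # non-recursive, like a spinlock


def interrupt_handler():
    """Stand-in for an interrupt handler that needs the same lock."""
    # A real handler would spin forever here if the interrupted code on
    # this CPU already holds the lock; blocking=False makes that visible.
    got_it = lock.acquire(blocking=False)
    if got_it:
        lock.release()
    return got_it


with lock:                          # "process context" takes the lock ...
    same_cpu = interrupt_handler()  # ... and is interrupted while holding it

after_release = interrupt_handler()  # the lock is free again: no problem
print(same_cpu, after_release)
```

In the kernel, the `_irqsave` variants avoid the first situation by disabling local interrupts while the lock is held.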
Linus
----
参考信息
========
对于动态初始化,使用spin_lock_init()或rwlock_init()是合适的::
spinlock_t xxx_lock;
rwlock_t xxx_rw_lock;
static int __init xxx_init(void)
{
spin_lock_init(&xxx_lock);
rwlock_init(&xxx_rw_lock);
...
}
module_init(xxx_init);
对于静态初始化,使用DEFINE_SPINLOCK() / DEFINE_RWLOCK()或
__SPIN_LOCK_UNLOCKED() / __RW_LOCK_UNLOCKED()是合适的。
...@@ -252,7 +252,7 @@ Linux-next 集成测试树 ...@@ -252,7 +252,7 @@ Linux-next 集成测试树
在将子系统树的更新合并到主线树之前,需要对它们进行集成测试。为此,存在一个 在将子系统树的更新合并到主线树之前,需要对它们进行集成测试。为此,存在一个
特殊的测试存储库,其中几乎每天都会提取所有子系统树: 特殊的测试存储库,其中几乎每天都会提取所有子系统树:
https://git.kernel.org/p=linux/kernel/git/next/linux-next.git https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
通过这种方式,Linux-next 对下一个合并阶段将进入主线内核的内容给出了一个概要 通过这种方式,Linux-next 对下一个合并阶段将进入主线内核的内容给出了一个概要
展望。非常欢迎冒险的测试者运行测试Linux-next。 展望。非常欢迎冒险的测试者运行测试Linux-next。
......
...@@ -25,8 +25,10 @@ Linux调度器 ...@@ -25,8 +25,10 @@ Linux调度器
sched-domains sched-domains
sched-capacity sched-capacity
sched-energy sched-energy
schedutil
sched-nice-design sched-nice-design
sched-stats sched-stats
sched-debug
TODOList: TODOList:
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/scheduler/sched-debug.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
=============
调度器debugfs
=============
用配置项CONFIG_SCHED_DEBUG=y启动内核后,将可以访问/sys/kernel/debug/sched
下的调度器专用调试文件。其中一些文件描述如下。
numa_balancing
==============
`numa_balancing` 目录用来存放控制非统一内存访问(NUMA)平衡特性的相关文件。
如果该特性导致系统负载太高,那么可以通过 `scan_period_min_ms, scan_delay_ms,
scan_period_max_ms, scan_size_mb` 文件控制NUMA缺页的内核采样速率。
scan_period_min_ms, scan_delay_ms, scan_period_max_ms, scan_size_mb
-------------------------------------------------------------------
自动NUMA平衡会扫描任务地址空间,检测页面是否被正确放置,或者数据是否应该被
迁移到任务正在运行的本地内存结点,此时需解映射页面。每个“扫描延迟”(scan delay)
时间之后,任务扫描其地址空间中下一批“扫描大小”(scan size)个页面。若抵达
内存地址空间末尾,扫描器将从头开始重新扫描。
结合来看,“扫描延迟”和“扫描大小”决定扫描速率。当“扫描延迟”减小时,扫描速率
增加。“扫描延迟”和每个任务的扫描速率都是自适应的,且依赖历史行为。如果页面被
正确放置,那么扫描延迟就会增加;否则扫描延迟就会减少。“扫描大小”不是自适应的,
“扫描大小”越大,扫描速率越高。
更高的扫描速率会产生更高的系统开销,因为必须捕获缺页异常,并且潜在地必须迁移
数据。然而,当扫描速率越高,若工作负载模式发生变化,任务的内存将越快地迁移到
本地结点,由于远程内存访问而产生的性能影响将降到最低。下面这些文件控制扫描延迟
的阈值和被扫描的页面数量。
``scan_period_min_ms`` 是扫描一个任务虚拟内存的最小时间,单位是毫秒。它有效地
控制了每个任务的最大扫描速率。
``scan_delay_ms`` 是一个任务初始化创建(fork)时,第一次使用的“扫描延迟”。
``scan_period_max_ms`` 是扫描一个任务虚拟内存的最大时间,单位是毫秒。它有效地
控制了每个任务的最小扫描速率。
``scan_size_mb`` 是一次特定的扫描中,要扫描多少兆字节(MB)对应的页面数。
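The relationship stated above, one batch of "scan size" pages after every "scan delay", gives a simple per-task rate. The sketch below only restates that arithmetic; the function name and the sample numbers are hypothetical and are not kernel defaults.

```python
def scan_rate_mb_per_s(scan_size_mb: float, scan_delay_ms: float) -> float:
    """MB scanned per second when one batch is scanned every scan_delay_ms."""
    if scan_delay_ms <= 0:
        raise ValueError("scan_delay_ms must be positive")
    return scan_size_mb / (scan_delay_ms / 1000.0)


slow = scan_rate_mb_per_s(scan_size_mb=256, scan_delay_ms=2000)
fast = scan_rate_mb_per_s(scan_size_mb=256, scan_delay_ms=1000)
print(slow, fast)  # halving the delay doubles the rate
```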
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/scheduler/schedutil.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
=========
Schedutil
=========
.. note::
本文所有内容都假设频率和工作算力之间存在线性关系。我们知道这是有瑕疵的,
但这是最可行的近似处理。
PELT(实体负载跟踪,Per Entity Load Tracking)
==============================================
通过PELT,我们跟踪了各种调度器实体的一些指标,从单个任务到任务组分片到CPU
运行队列。我们使用指数加权移动平均数(Exponentially Weighted Moving Average,
EWMA)作为其基础,每个周期(1024us)都会衰减,衰减速率满足y^32 = 0.5。
也就是说,最近的32ms贡献负载的一半,而历史上的其它时间则贡献另一半。
具体而言:
ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
ewma(u) = ewma_sum(u) / ewma_sum(1)
由于这本质上是一个无限几何级数的累加,结果是可组合的,即ewma(A) + ewma(B) = ewma(A+B)。
这个属性是关键,因为它提供了在任务迁移时重新组合平均数的能力。
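The two properties claimed above, that the most recent 32 periods carry half of the weight and that the averages compose, can be checked numerically. This sketch truncates the infinite series to a finite history, which is an approximation; the names and the sample signals are illustrative and not taken from kernel/sched/pelt.c.

```python
Y = 0.5 ** (1 / 32)  # per-period decay factor, chosen so that Y**32 == 0.5


def ewma_sum(u):
    """u[0] is the most recent period, u[1] the one before it, and so on."""
    return sum(u_i * Y ** i for i, u_i in enumerate(u))


def ewma(u):
    """Normalised average: ewma_sum(u) / ewma_sum(1) over the same history."""
    return ewma_sum(u) / ewma_sum([1.0] * len(u))


n = 32 * 40  # long enough that the truncated tail is negligible
a = [1.0 if i % 2 else 0.0 for i in range(n)]  # hypothetical signal A
b = [0.25] * n                                 # hypothetical signal B
ab = [x + y for x, y in zip(a, b)]

# Share of the total weight carried by the 32 most recent periods (32 ms).
recent_share = ewma_sum([1.0] * 32) / ewma_sum([1.0] * n)
composable = abs(ewma(a) + ewma(b) - ewma(ab)) < 1e-9
print(recent_share, composable)
```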
请注意,阻塞态的任务仍然对累加值(任务组分片和CPU运行队列)有贡献,这反映了
它们在恢复运行后的预期贡献。
利用这一点,我们跟踪2个关键指标:“运行”和“可运行”。“运行”反映了一个调度实体
在CPU上花费的时间,而“可运行”反映了一个调度实体在运行队列中花费的时间。当只有
一个任务时,这两个指标是相同的,但一旦出现对CPU的争用,“运行”将减少以反映每个
任务在CPU上花费的时间,而“可运行”将增加以反映争用的激烈程度。
更多细节见:kernel/sched/pelt.c
频率 / CPU不变性
================
因为CPU频率在1GHz时利用率为50%和CPU频率在2GHz时利用率为50%是不一样的,同样
在小核上运行时利用率为50%和在大核上运行时利用率为50%是不一样的,我们允许架构
以两个比率来伸缩时间差,其中一个是动态电压频率升降(Dynamic Voltage and
Frequency Scaling,DVFS)比率,另一个是微架构比率。
对于简单的DVFS架构(软件有完全控制能力),我们可以很容易地计算该比率为::
            f_cur
  r_dvfs := -----
            f_max
对于由硬件控制DVFS的更多动态系统,我们使用硬件计数器(Intel APERF/MPERF,
ARMv8.4-AMU)来计算这一比率。具体到Intel,我们使用::
           APERF
  f_cur := ----- * P0
           MPERF

             4C-turbo;  如果可用并且使能了turbo
  f_max := { 1C-turbo;  如果使能了turbo
             P0;        其它情况

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max
我们选择4C turbo而不是1C turbo,以使其持续性略微更强。
r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最高性能水平的比率。
r_tot = r_dvfs * r_cpu
其结果是,上述“运行”和“可运行”的指标变成DVFS无关和CPU型号无关了。也就是说,
我们可以在CPU之间转移和比较它们。
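The ratios defined above combine by plain multiplication. The sketch below restates them, with the Intel f_max selection written as a function; all names and sample numbers are illustrative assumptions, and the performance levels are passed in directly rather than read from hardware.

```python
def intel_f_max(p0, turbo_enabled, turbo_1c=None, turbo_4c=None):
    """Pick f_max: 4C-turbo if available, else 1C-turbo, else P0."""
    if turbo_enabled and turbo_4c is not None:
        return turbo_4c
    if turbo_enabled and turbo_1c is not None:
        return turbo_1c
    return p0


def r_dvfs(f_cur, f_max):
    """DVFS ratio, capped at 1."""
    return min(1.0, f_cur / f_max)


def r_tot(f_cur, f_max, cpu_max_perf, system_max_perf):
    """r_tot = r_dvfs * r_cpu, with r_cpu = cpu_max_perf / system_max_perf."""
    return r_dvfs(f_cur, f_max) * (cpu_max_perf / system_max_perf)


# Hypothetical little core: running at 1 GHz out of 2 GHz, and half as
# capable at its best as the biggest CPU in the system.
scale = r_tot(f_cur=1.0e9, f_max=2.0e9, cpu_max_perf=512, system_max_perf=1024)
print(scale)
```

With this factor, 50% utilization measured on that core counts as 12.5% of the biggest CPU running at its highest frequency.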
更多细节见:
- kernel/sched/pelt.h:update_rq_clock_pelt()
- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP
==========================
由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。
为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST用出队时的“运行”值(此时该值
最高)驱动一个无限脉冲响应(Infinite Impulse Response,IIR)的EWMA。
另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
仅在利用率下降时衰减。
进一步,运行队列的(可运行任务的)利用率之和由下式计算:
util_est := \Sum_t max( t_running, t_util_est_ewma )
更多细节见: kernel/sched/fair.c:util_est_dequeue()
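上式可以粗略地写成如下代码。 ``struct toy_task`` 及其字段名是为演示而假设的,并不是内核中的数据结构。

```c
#include <assert.h>
#include <stddef.h>

/* 假设的每任务数据:当前“运行”值和util_est的EWMA值 */
struct toy_task {
    double running;
    double util_est_ewma;
};

/* util_est := 对运行队列上的每个任务 t 求 max( t_running, t_util_est_ewma ) 之和 */
double rq_util_est(const struct toy_task *t, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(t != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += t[i].running > t[i].util_est_ewma ?
               t[i].running : t[i].util_est_ewma;
    return sum;
}
```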
UCLAMP
======
可以在每个CFS或RT任务上设置有效的u_min和u_max clamp值(译注:clamp可以理解
为类似滤波器的能力,它定义了有效取值范围的最大值和最小值);运行队列为所有正在
运行的任务保持这些clamp的最大聚合值。
更多细节见: include/uapi/linux/sched/types.h
Schedutil / DVFS
================
每当调度器的负载跟踪被更新时(任务唤醒、任务迁移、时间流逝),我们都会调用
schedutil来更新硬件DVFS状态。
其基础是CPU运行队列的“运行”指标,根据上面的内容,它是CPU的频率不变的利用率
估计值。由此我们计算出一个期望的频率,如下::
               max( running, util_est );  如果使能UTIL_EST
    u_cfs := { running;                   其它情况

                 clamp( u_cfs + u_rt, u_min, u_max );  如果使能UCLAMP_TASK
    u_clamp := { u_cfs + u_rt;                         其它情况

    u := u_clamp + u_irq + u_dl;    [估计值。更多细节见源代码]

    f_des := min( f_max, 1.25 u * f_max )

关于IO-wait的说明:当发生更新是因为任务从IO完成中唤醒时,我们提升上面的“u”。
然后,这个频率被用来选择一个P-state或OPP,或者直接混入一个发给硬件的CPPC式
请求。
关于截止期限调度器的说明: 截止期限任务(偶发任务模型)使我们能够计算出满足
工作负荷所需的硬f_min值。
因为这些回调函数是直接来自调度器的,所以DVFS的硬件交互应该是“快速”和非阻塞的。
在硬件交互缓慢和昂贵的时候,schedutil支持DVFS请求限速,不过会降低效率。
更多信息见: kernel/sched/cpufreq_schedutil.c
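下面的草图把本节的期望频率公式直接翻译成代码,仅用于帮助理解,不代表 kernel/sched/cpufreq_schedutil.c 的实际实现。 ``struct toy_util`` 及各字段名都是假设的,所有利用率都假设已归一化到 [0, 1]。

```c
#include <assert.h>

/* 假设的输入集合:各项利用率已归一化到 [0, 1] */
struct toy_util {
    double running, util_est;   /* CFS的“运行”值与util_est */
    double u_rt, u_irq, u_dl;
    double u_min, u_max;        /* uclamp的聚合值 */
    int use_util_est;           /* 对应使能UTIL_EST */
    int use_uclamp;             /* 对应使能UCLAMP_TASK */
};

static double toy_clamp(double v, double lo, double hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* f_des := min( f_max, 1.25 u * f_max ) */
double toy_desired_freq(const struct toy_util *in, double f_max)
{
    double u_cfs, u_clamp, u, f_des;

    assert(in && f_max > 0.0);
    u_cfs = in->running;
    if (in->use_util_est && in->util_est > u_cfs)
        u_cfs = in->util_est;

    u_clamp = u_cfs + in->u_rt;
    if (in->use_uclamp)
        u_clamp = toy_clamp(u_clamp, in->u_min, in->u_max);

    u = u_clamp + in->u_irq + in->u_dl;
    f_des = 1.25 * u * f_max;
    return f_des < f_max ? f_des : f_max;
}
```

系数 1.25 意味着当 u 达到 0.8 时期望频率就已经到达 f_max,为负载上涨留出余量。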
注意
====
- 在低负载场景下,DVFS是最相关的,“运行”的值将密切反映利用率。
- 在负载饱和的场景下,任务迁移会导致一些瞬时性的使用率下降。假设我们有一个
CPU,有4个任务占用导致其饱和,接下来我们将一个任务迁移到另一个空闲CPU上,
旧的CPU的“运行”值将为0.75,而新的CPU将获得0.25。这是不可避免的,而且随着
时间流逝将自动修正。另注,由于没有空闲时间,我们还能保证f_max值吗?
- 上述大部分内容是关于避免DVFS下滑,以及独立的DVFS域发生负载迁移时不得不
重新学习/提升频率。
...@@ -77,7 +77,7 @@ DAMON目前为物理和虚拟地址空间提供了基元的实现。下面两个
========================
下面四个部分分别描述了DAMON的核心机制和五个监测属性,即 ``采样间隔`` 、 ``聚集间隔`` 、
``更新间隔`` 、 ``最小区域数`` 和 ``最大区域数`` 。
访问频率监测
...@@ -135,5 +135,6 @@ DAMON的输出显示了在给定的时间内哪些页面的访问频率是多少
监测目标地址范围可以动态改变。例如,虚拟内存可以动态地被映射和解映射。物理内存可以被
热插拔。
由于在某些情况下变化可能相当频繁,DAMON允许监控操作检查动态变化,包括内存映射变化,
并仅在用户指定的时间间隔( ``更新间隔`` )中的每个时间段,将其应用于监控操作相关的
数据结构,如抽象的监控目标内存区。
:Original: Documentation/vm/frontswap.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
=========
Frontswap
=========
Frontswap为交换页提供了一个 “transcendent memory” 的接口。在一些环境中,由
于交换页被保存在RAM(或类似RAM的设备)中,而不是交换磁盘,因此可以获得巨大的性能
节省(提高)。
.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
Frontswap之所以这么命名,是因为它可以被认为是与swap设备的“back”存储相反。存
储器被认为是一个同步并发安全的面向页面的“伪RAM设备”,符合transcendent memory
(如Xen的“tmem”,或内核内压缩内存,又称“zcache”,或未来的类似RAM的设备)的要
求;这个伪RAM设备不能被内核直接访问或寻址,其大小未知且可能随时间变化。驱动程序通过
调用frontswap_register_ops将自己与frontswap链接起来,以适当地设置frontswap_ops
的功能,它提供的功能必须符合某些策略,如下所示:
一个 “init” 将设备准备好接收与指定的交换设备编号(又称“类型”)相关的frontswap
交换页。一个 “store” 将把该页复制到transcendent memory,并与该页的类型和偏移
量相关联。一个 “load” 将把该页,如果找到的话,从transcendent memory复制到内核
内存,但不会从transcendent memory中删除该页。一个 “invalidate_page” 将从
transcendent memory中删除该页,一个 “invalidate_area” 将删除所有与交换类型
相关的页(例如,像swapoff)并通知 “device” 拒绝进一步存储该交换类型。
一旦一个页面被成功存储,在该页面上的匹配加载通常会成功。因此,当内核发现自己处于需
要交换页面的情况时,它首先尝试使用frontswap。如果存储的结果是成功的,那么数据就已
经成功的保存到了transcendent memory中,并且避免了磁盘写入,如果后来再读回数据,
也避免了磁盘读取。如果存储返回失败,transcendent memory已经拒绝了该数据,且该页
可以像往常一样被写入交换空间。
请注意,如果一个页面被存储,而该页面已经存在于transcendent memory中(一个 “重复”
的存储),要么存储成功,数据被覆盖,要么存储失败,该页面被废止。这确保了旧的数据永远
不会从frontswap中获得。
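下面用一个玩具后端来示意上述 store/load/invalidate 的语义,特别是“重复存储失败时旧数据必须作废”这一点。它与内核的 frontswap_ops 接口无关,所有以 ``toy_`` 或 ``TOY_`` 开头的名字和大小都是假设的。

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16    /* 假设:为演示而缩小的“页”大小 */
#define TOY_SLOTS     4     /* 假设:后端最多保存的页数 */

struct toy_backend {
    int used[TOY_SLOTS];
    unsigned int type[TOY_SLOTS];
    unsigned long offset[TOY_SLOTS];
    char data[TOY_SLOTS][TOY_PAGE_SIZE];
    int accept;             /* 为0时拒绝所有store,模拟后端空间不足 */
};

static int toy_find(const struct toy_backend *b, unsigned int type,
                    unsigned long offset)
{
    int i;

    for (i = 0; i < TOY_SLOTS; i++)
        if (b->used[i] && b->type[i] == type && b->offset[i] == offset)
            return i;
    return -1;
}

/* 成功返回0;失败返回-1。重复存储失败时,旧数据被作废 */
int toy_store(struct toy_backend *b, unsigned int type,
              unsigned long offset, const char *page)
{
    int i;

    assert(b && page);
    i = toy_find(b, type, offset);
    if (!b->accept) {
        if (i >= 0)
            b->used[i] = 0;     /* 确保旧数据不会再被读到 */
        return -1;
    }
    if (i < 0) {
        for (i = 0; i < TOY_SLOTS; i++)
            if (!b->used[i])
                break;
        if (i == TOY_SLOTS)
            return -1;          /* 没有空间,拒绝 */
    }
    b->used[i] = 1;
    b->type[i] = type;
    b->offset[i] = offset;
    memcpy(b->data[i], page, TOY_PAGE_SIZE);
    return 0;
}

/* 找到则复制到调用者的缓冲区并返回0,但不从后端删除该页 */
int toy_load(const struct toy_backend *b, unsigned int type,
             unsigned long offset, char *page)
{
    int i;

    assert(b && page);
    i = toy_find(b, type, offset);
    if (i < 0)
        return -1;
    memcpy(page, b->data[i], TOY_PAGE_SIZE);
    return 0;
}

void toy_invalidate_page(struct toy_backend *b, unsigned int type,
                         unsigned long offset)
{
    int i;

    assert(b);
    i = toy_find(b, type, offset);
    if (i >= 0)
        b->used[i] = 0;
}
```

store失败后,调用者(在内核中是交换子系统)应当像往常一样把该页写入真正的交换设备。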
如果配置正确,对frontswap的监控是通过 `/sys/kernel/debug/frontswap` 目录下的
debugfs完成的。frontswap的有效性可以通过以下方式测量(在所有交换设备中):
``failed_stores``
有多少次存储的尝试是失败的
``loads``
尝试了多少次加载(应该全部成功)
``succ_stores``
有多少次存储的尝试是成功的
``invalidates``
尝试了多少次作废
后台实现可以提供额外的指标。
经常问到的问题
==============
* 价值在哪里?
当一个工作负载开始交换时,性能就会下降。Frontswap提供了一个干净的、动态的接口,把交换
页读写到内核原本无法直接寻址的 “transcendent memory” 中,从而大大提高了许多这类工作负
载的性能。当数据被转换为不同的形式和大小(比如压缩)或者被秘密
移动(对于一些类似RAM的设备来说,这可能对写平衡很有用)时,这个接口是理想的。交换
页(和被驱逐的页面缓存页)是这种比RAM慢但比磁盘快得多的“伪RAM设备”的一大用途。
Frontswap对内核的影响相当小,为各种系统配置中更动态、更灵活的RAM利用提供了巨大的
灵活性:
在单一内核的情况下,又称“zcache”,页面被压缩并存储在本地内存中,从而增加了可以安
全保存在RAM中的匿名页面总数。Zcache本质上是用压缩/解压缩的CPU周期换取更好的内存利
用率。Benchmarks测试显示,当内存压力较低时,几乎没有影响,而在高内存压力下的一些
工作负载上,则有明显的性能改善(25%以上)。
“RAMster” 在zcache的基础上增加了对集群系统的 “peer-to-peer” transcendent memory
的支持。Frontswap页面像zcache一样被本地压缩,但随后被“remotified” 到另一个系
统的RAM。这使得RAM可以根据需要动态地来回负载平衡,也就是说,当系统A超载时,它可以
交换到系统B,反之亦然。RAMster也可以被配置成一个内存服务器,因此集群中的许多服务器
可以根据需要动态地交换到配置有大量内存的单一服务器上......而不需要预先配置每个客户
有多少内存可用。
在虚拟情况下,虚拟化的全部意义在于统计地将物理资源在多个虚拟机的不同需求之间进行复
用。对于RAM来说,这真的很难做到,而且在不改变内核的情况下,要做好这一点的努力基本上
是失败的(除了一些广为人知的特殊情况下的工作负载)。具体来说,Xen Transcendent Memory
后端允许管理器拥有的、原本 “fallow”(闲置)的RAM不仅可以在多个虚拟机之间进行“time-shared”,
而且页面可以被压缩和去重,以优化RAM的利用率。当客户操作系统被诱导交出未充分利用
的RAM时(如 “selfballooning”),突然出现的意外内存压力可能会导致交换;frontswap
允许这些页面被交换到管理器RAM中或从管理器RAM中交换(如果整体主机系统内存条件允许),
从而减轻计划外交换可能带来的可怕的性能影响。
一个KVM的实现正在进行中,并且已经被RFC'ed到lkml。而且,利用frontswap,对NVM作为
内存扩展技术的调查也在进行中。
* 当然,在某些情况下可能有性能上的优势,但frontswap的空间/时间开销是多少?
如果 CONFIG_FRONTSWAP 被禁用,每个 frontswap 钩子都会编译成空,唯一的开销是每
个 swapon'ed swap 设备的几个额外字节。如果 CONFIG_FRONTSWAP 被启用,但没有
frontswap的 “backend” 注册,那么每读或写一个交换页,只会多出一次全局变量与零的比
较。如果 CONFIG_FRONTSWAP 被启用,并且有一个frontswap的backend注册,并且
后端每次 “store” 请求都失败(即尽管声称可能,但没有提供内存),CPU 的开销仍然可以
忽略不计 - 因为每次frontswap失败都是在交换页写到磁盘之前,系统很可能是 I/O 密集
(I/O bound)的,无论如何使用一小部分的 CPU 都是无关紧要的。
至于空间,如果CONFIG_FRONTSWAP被启用,并且有一个frontswap的backend注册,那么
每个交换设备的每个交换页都会被分配一个比特。这是在内核已经为每个交换设备的每个交换
页分配的8位(在2.6.34之前是16位)上增加的。(Hugh Dickins观察到,frontswap可能
会偷取现有的8个比特,但是我们以后再来担心这个小的优化问题)。对于标准的4K页面大小的
非常大的交换盘(这很罕见),这是每32GB交换盘1MB开销。
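“每32GB交换盘1MB开销”可以这样验算:每个交换页占一个比特,32GB除以4K得到 2^23 个页,即 2^23 个比特,也就是 2^20 字节。下面的函数名是为演示而假设的。

```c
#include <assert.h>

/* 每个交换页一个比特:返回位图所需的字节数(向上取整) */
unsigned long long frontswap_map_bytes(unsigned long long swap_bytes,
                                       unsigned long long page_size)
{
    unsigned long long pages;

    assert(page_size > 0);
    pages = swap_bytes / page_size;
    return (pages + 7) / 8;
}
```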
当交换页存储在transcendent memory中而不是写到磁盘上时,有一个副作用,即这可能会
产生更多的内存压力,有可能超过其他的优点。一个backend,比如zcache,必须实现策略
来仔细(但动态地)管理内存限制,以确保这种情况不会发生。
* 好吧,那就用内核骇客能理解的术语来快速概述一下这个frontswap补丁的作用如何?
我们假设在内核初始化过程中,一个frontswap 的 “backend” 已经注册了;这个注册表
明这个frontswap 的 “backend” 可以访问一些不被内核直接访问的“内存”。它到底提
供了多少内存是完全动态和随机的。
每当一个交换设备被swapon时,就会调用frontswap_init(),把交换设备的编号(又称“类
型”)作为一个参数传给它。这就通知了frontswap,以期待 “store” 与该号码相关的交
换页的尝试。
每当交换子系统准备将一个页面写入交换设备时(参见swap_writepage()),就会调用
frontswap_store。Frontswap与frontswap backend协商,如果backend说它没有空
间,frontswap_store返回-1,内核就会照常把页换到交换设备上。注意,来自frontswap
backend的响应对内核来说是不可预测的;它可能选择从不接受一个页面,可能每九个页面接
受一个,也可能接受每一个页面。但是如果backend确实接受了一个页面,那么这个页面的数
据已经被复制并与类型和偏移量相关联了,而且backend保证了数据的持久性。在这种情况
下,frontswap在交换设备的“frontswap_map” 中设置了一个位,对应于交换设备上的
页面偏移量,否则它就会将数据写入该设备。
当交换子系统需要换入一个页面时(swap_readpage()),它首先调用frontswap_load(),
检查frontswap_map,看这个页面是否早先被frontswap backend接受。如果是,该页
的数据就会从frontswap后端填充,换入就完成了。如果不是,正常的交换代码将被执行,
以便从真正的交换设备上获得这一页的数据。
所以每次frontswap backend接受一个页面时,一次交换设备的写入和(可能的)一次交换设
备的读取,就被一次 “frontswap backend store” 和(可能的)一次 “frontswap backend load”
所取代,这可能会快得多。
* frontswap不能被配置为一个 “特殊的” 交换设备,它的优先级要高于任何真正的交换
设备(例如像zswap,或者可能是swap-over-nbd/NFS)?
首先,现有的交换子系统不允许有任何种类的交换层次结构。也许它可以被重写以适应层次
结构,但这将需要相当大的改变。即使它被重写,现有的交换子系统也使用了块I/O层,它
假定交换设备是固定大小的,其中的任何页面都是可线性寻址的。Frontswap几乎没有触
及现有的交换子系统,而是围绕着块I/O子系统的限制,提供了大量的灵活性和动态性。
例如,frontswap backend对任何交换页的接受是完全不可预测的。这对frontswap backend
的定义至关重要,因为它赋予了backend完全动态的决定权。在zcache中,人们无法预
先知道一个页面的可压缩性如何。可压缩性 “差” 的页面会被拒绝,而 “差” 本身也可
以根据当前的内存限制动态地定义。
此外,frontswap是完全同步的,而真正的交换设备,根据定义,是异步的,并且使用
块I/O。块I/O层不仅是不必要的,而且可能进行 “优化”,这对面向RAM的设备来说是
不合适的,包括将一些页面的写入延迟相当长的时间。同步是必须的,以确保后端的动
态性,并避免棘手的竞争条件,这将不必要地大大增加frontswap和/或块I/O子系统的
复杂性。也就是说,只有最初的 “store” 和 “load” 操作是需要同步的。一个独立
的异步线程可以自由地操作由frontswap存储的页面。例如,RAMster中的 “remotification”
线程使用标准的异步内核套接字,将压缩的frontswap页面移动到远程机器。同样,
KVM的客户方实现可以进行客户内压缩,并使用 “batched” hypercalls。
在虚拟化环境中,动态性允许管理程序(或主机操作系统)做“intelligent overcommit”。
例如,它可以选择只接受页面,直到主机交换可能即将发生,然后强迫客户机做他们
自己的交换。
transcendent memory规格的frontswap有一个坏处。因为任何 “store” 都可
能失败,所以必须在一个真正的交换设备上有一个真正的插槽来交换页面。因此,
frontswap必须作为每个交换设备的 “影子” 来实现,它有可能容纳交换设备可能
容纳的每一个页面,也有可能根本不容纳任何页面。这意味着frontswap所包含的页面不能超过
已swapon的交换设备的总容量。例如,如果在某些安装上没有配置交换设备,frontswap
就没有用。无交换设备的便携式设备仍然可以使用frontswap,但是这种设备的
backend必须配置某种 “ghost” 交换设备,并确保它永远不会被使用。
* 为什么会有这种关于 “重复存储” 的奇怪定义?如果一个页面以前被成功地存储过,
难道它不能总是被成功地覆盖吗?
几乎总是可以的,不,有时不能。考虑一个例子,数据被压缩了,原来的4K页面被压
缩到了1K。现在,有人试图用不可压缩的数据覆盖该页,因此会占用整个4K。但是
backend没有更多的空间了。在这种情况下,这个存储必须被拒绝。每当frontswap
拒绝一个会覆盖的存储时,它也必须使旧的数据作废,并确保它不再被访问。因为交
换子系统随后会把新的数据写到真正的交换设备上,这是确保一致性的正确做法。
* 为什么frontswap补丁会创建新的头文件swapfile.h?
frontswap代码依赖于一些swap子系统内部的数据结构,这些数据结构多年来一直
在静态和全局之间来回移动。这似乎是一个合理的妥协:将它们定义为全局,但在一
个新的包含文件中声明它们,该文件不被包含swap.h的大量源文件所包含。
Dan Magenheimer,最后更新于2012年4月9日
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/hmm.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
==================
异构内存管理 (HMM)
==================
提供基础设施和帮助程序以将非常规内存(设备内存,如板上 GPU 内存)集成到常规内核路径中,其
基石是此类内存的专用struct page(请参阅本文档的第 5 至 7 节)。
HMM 还为 SVM(共享虚拟内存)提供了可选的帮助程序,即允许设备透明地访问与 CPU 一致的程序
地址,这意味着 CPU 上的任何有效指针也是该设备的有效指针。这对于简化高级异构计算的使用变得
必不可少,其中 GPU、DSP 或 FPGA 用于代表进程执行各种计算。
本文档分为以下部分:在第一部分中,我揭示了与使用特定于设备的内存分配器相关的问题。在第二
部分中,我揭示了许多平台固有的硬件限制。第三部分概述了 HMM 设计。第四部分解释了 CPU 页
表镜像的工作原理以及 HMM 在这种情况下的目的。第五部分处理内核中如何表示设备内存。最后,
最后一节介绍了一个新的迁移助手,它允许利用设备 DMA 引擎。
.. contents:: :local:
使用特定于设备的内存分配器的问题
================================
具有大量板载内存(几 GB)的设备(如 GPU)历来通过专用驱动程序特定 API 管理其内存。这会
造成设备驱动程序分配和管理的内存与常规应用程序内存(私有匿名、共享内存或常规文件支持内存)
之间的隔断。从这里开始,我将把这个方面称为分割的地址空间。我使用共享地址空间来指代相反的情况:
即,设备可以透明地使用任何应用程序内存区域。
分割的地址空间的发生是因为设备只能访问通过设备特定 API 分配的内存。这意味着从设备的角度来
看,程序中的所有内存对象并不平等,这使得依赖于广泛的库的大型程序变得复杂。
具体来说,这意味着想要利用像 GPU 这样的设备的代码需要在通用分配的内存(malloc、mmap
私有、mmap 共享)和通过设备驱动程序 API 分配的内存之间复制对象(这仍然以 mmap 结束,
但是是设备文件)。
对于平面数据集(数组、网格、图像……),这并不难实现,但对于复杂数据集(列表、树……),
很难做到正确。复制一个复杂的数据集需要重新映射其每个元素之间的所有指针关系。这很容易出错,
而且由于数据集和地址的重复,程序更难调试。
分割地址空间也意味着库不能透明地使用它们从核心程序或另一个库中获得的数据,因此每个库可能
不得不使用设备特定的内存分配器来重复其输入数据集。大型项目会因此受到影响,并因为各种内存
拷贝而浪费资源。
复制每个库的API以接受每个设备特定分配器分配的内存作为输入或输出,并不是一个可行的选择。
这将导致库入口点的组合爆炸。
最后,随着高级语言结构(在 C++ 中,当然也在其他语言中)的进步,编译器现在有可能在没有程
序员干预的情况下利用 GPU 和其他设备。某些编译器识别的模式仅适用于共享地址空间。对所有
其他模式,使用共享地址空间也更合理。
I/O 总线、设备内存特性
======================
由于一些限制,I/O 总线削弱了共享地址空间。大多数 I/O 总线只允许从设备到主内存的基本
内存访问;甚至缓存一致性通常是可选的。从 CPU 访问设备内存甚至更加有限。通常情况下,它
不是缓存一致的。
如果我们只考虑 PCIE 总线,那么设备可以访问主内存(通常通过 IOMMU)并与 CPU 缓存一
致。但是,它只允许设备对主存储器进行一组有限的原子操作。这在另一个方向上更糟:CPU
只能访问有限范围的设备内存,而不能对其执行原子操作。因此,从内核的角度来看,设备内存不
能被视为与常规内存等同。
另一个严重的因素是带宽有限(约 32GBytes/s,PCIE 4.0 和 16 通道)。这比最快的 GPU
内存 (1 TBytes/s) 慢 33 倍。最后一个限制是延迟。从设备访问主内存的延迟比设备访问自
己的内存时高一个数量级。
一些平台正在开发新的 I/O 总线或对 PCIE 的添加/修改以解决其中一些限制
(OpenCAPI、CCIX)。它们主要允许 CPU 和设备之间的双向缓存一致性,并允许架构支持的所
有原子操作。遗憾的是,并非所有平台都遵循这一趋势,并且一些主要架构没有针对这些问题的硬
件解决方案。
因此,为了使共享地址空间有意义,我们不仅必须允许设备访问任何内存,而且还必须允许任何内
存在设备使用时迁移到设备内存(在迁移时阻止 CPU 访问)。
共享地址空间和迁移
==================
HMM 打算提供两个主要功能。第一个是通过复制cpu页表到设备页表中来共享地址空间,因此对
于进程地址空间中的任何有效主内存地址,相同的地址指向相同的物理内存。
为了实现这一点,HMM 提供了一组帮助程序来填充设备页表,同时跟踪 CPU 页表更新。设备页表
更新不像 CPU 页表更新那么容易。要更新设备页表,您必须分配一个缓冲区(或使用预先分配的
缓冲区池)并在其中写入 GPU 特定命令以执行更新(取消映射、缓存失效和刷新等)。这不能通
过所有设备的通用代码来完成。这就是为什么HMM提供帮助程序,把一切可以通用化的部分提取
出来,同时把硬件的具体细节留给设备驱动程序。
HMM 提供的第二种机制是一种新的 ZONE_DEVICE 内存,它允许为设备内存的每个页面分配一个
struct page。这些页面很特殊,因为 CPU 无法映射它们。然而,它们允许使用现有的迁移机
制将主内存迁移到设备内存,从 CPU 的角度来看,一切看起来都像是换出到磁盘的页面。使用
struct page可以与现有的 mm 机制进行最简单、最干净的集成。再次,HMM 仅提供帮助程序,
首先为设备内存热插拔新的 ZONE_DEVICE 内存,然后执行迁移。迁移内容和时间的策略决定留
给设备驱动程序。
请注意,任何 CPU 对设备页面的访问都会触发缺页异常并迁移回主内存。例如,当支持给定CPU
地址 A 的页面从主内存页面迁移到设备页面时,对地址 A 的任何 CPU 访问都会触发缺页异常
并启动向主内存的迁移。
凭借这两个特性,HMM 不仅允许设备镜像进程地址空间并保持 CPU 和设备页表同步,而且还通
过迁移设备正在使用的数据集部分来利用设备内存。
地址空间镜像实现和API
=====================
地址空间镜像的主要目标是允许将一定范围的 CPU 页表复制到一个设备页表中;HMM 有助于
保持两者同步。想要镜像进程地址空间的设备驱动程序必须从注册 mmu_interval_notifier
开始::
int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
struct mm_struct *mm, unsigned long start,
unsigned long length,
const struct mmu_interval_notifier_ops *ops);
在 ops->invalidate() 回调期间,设备驱动程序必须对范围执行更新操作(将范围标记为只
读,或完全取消映射等)。设备必须在驱动程序回调返回之前完成更新。
当设备驱动程序想要填充一个虚拟地址范围时,它可以使用::
int hmm_range_fault(struct hmm_range *range);
如果请求写访问,它将在丢失或只读条目上触发缺页异常(见下文)。缺页异常使用通用的 mm 缺
页异常代码路径,就像 CPU 缺页异常一样。
这两个函数都将 CPU 页表条目复制到它们的 pfns 数组参数中。该数组中的每个条目对应于虚拟
范围中的一个地址。HMM 提供了一组标志来帮助驱动程序识别特殊的 CPU 页表项。
在 sync_cpu_device_pagetables() 回调中锁定是驱动程序必须尊重的最重要的方面,以保
持事物正确同步。使用模式是::
    int driver_populate_range(...)
    {
         struct hmm_range range;
         int ret;
         ...

         range.notifier = &interval_sub;
         range.start = ...;
         range.end = ...;
         range.hmm_pfns = ...;

         if (!mmget_not_zero(interval_sub->notifier.mm))
             return -EFAULT;

    again:
         range.notifier_seq = mmu_interval_read_begin(&interval_sub);
         mmap_read_lock(mm);
         ret = hmm_range_fault(&range);
         if (ret) {
             mmap_read_unlock(mm);
             if (ret == -EBUSY)
                    goto again;
             return ret;
         }
         mmap_read_unlock(mm);

         take_lock(driver->update);
         if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
             release_lock(driver->update);
             goto again;
         }

         /* Use pfns array content to update device page table,
          * under the update lock */

         release_lock(driver->update);
         return 0;
    }

driver->update 锁与驱动程序在其 invalidate() 回调中使用的锁相同。该锁必须在调用
mmu_interval_read_retry() 之前保持,以避免与并发 CPU 页表更新发生任何竞争。
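上面的“读取序号、采集快照、加锁后检查序号、必要时重试”的模式,可以用下面这个单线程的玩具模型来体会。它不调用任何内核接口,所有 ``toy_`` 开头的名字都是假设的; ``race`` 回调用来模拟在采集快照与提交之间并发发生的失效。

```c
#include <assert.h>

/* 假设的简化模型:每发生一次失效(invalidate),序号加一 */
struct toy_notifier {
    unsigned long seq;
};

unsigned long toy_read_begin(const struct toy_notifier *n)
{
    return n->seq;
}

/* 自 toy_read_begin() 以来发生过失效则返回非零 */
int toy_read_retry(const struct toy_notifier *n, unsigned long seq)
{
    return n->seq != seq;
}

void toy_invalidate(struct toy_notifier *n)
{
    n->seq++;
}

/* 演示用回调:仅在第一次尝试时触发一次失效 */
void toy_race_once(struct toy_notifier *n, int attempt)
{
    if (attempt == 0)
        toy_invalidate(n);
}

/* 把 *cpu_pte 的快照提交到 *dev_pte;返回重试次数 */
int toy_populate(struct toy_notifier *n, const int *cpu_pte, int *dev_pte,
                 void (*race)(struct toy_notifier *, int))
{
    unsigned long seq;
    int snapshot, attempt = 0;

    assert(n && cpu_pte && dev_pte);
    for (;;) {
        seq = toy_read_begin(n);
        snapshot = *cpu_pte;        /* 相当于采集pfns快照 */
        if (race)
            race(n, attempt);       /* 此处可能发生并发失效 */
        /* 真实驱动会在这里先获取 driver->update 锁 */
        if (!toy_read_retry(n, seq))
            break;
        attempt++;                  /* 快照可能已过期,重来 */
    }
    *dev_pte = snapshot;            /* 在锁保护下更新设备页表 */
    return attempt;
}
```

关键在于序号检查必须在持有更新锁之后进行,这样失效回调就无法插入到检查与设备页表更新之间。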
利用 default_flags 和 pfn_flags_mask
====================================
hmm_range 结构有 2 个字段,default_flags 和 pfn_flags_mask,它们指定整个范围
的故障或快照策略,而不必为 pfns 数组中的每个条目设置它们。
例如,如果设备驱动程序需要至少具有读取权限的范围的页面,它会设置::
range->default_flags = HMM_PFN_REQ_FAULT;
range->pfn_flags_mask = 0;
并如上所述调用 hmm_range_fault()。这将填充至少具有读取权限的范围内的所有页面。
现在假设驱动程序想要做同样的事情,除了它想要拥有写权限的范围内的一页。现在驱动程序设
置::
range->default_flags = HMM_PFN_REQ_FAULT;
range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
range->pfns[index_of_write] = HMM_PFN_REQ_WRITE;
有了这个,HMM 将对所有页面至少以读取(即有效)权限触发缺页异常,而对于地址
== range->start + (index_of_write << PAGE_SHIFT),它将以写权限触发缺页异常,即,如果
CPU pte 没有设置写权限,那么HMM将调用handle_mm_fault()。
hmm_range_fault 完成后,标志位被设置为页表的当前状态,即如果页面可写,将设置
HMM_PFN_VALID | HMM_PFN_WRITE。
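一种与上面两个例子相符的理解是:每个条目的请求标志先与 pfn_flags_mask 相与,再与 default_flags 相或。下面的草图按这一假设编写,标志位的取值( ``DEMO_`` 前缀)是为演示而虚构的,真实定义请以内核头文件为准。

```c
#include <assert.h>

/* 假设的演示取值,并非内核中的真实数值 */
#define DEMO_REQ_FAULT (1UL << 0)
#define DEMO_REQ_WRITE (1UL << 1)

/* 假设:有效请求 = (条目请求 & pfn_flags_mask) | default_flags */
unsigned long demo_effective_req(unsigned long entry,
                                 unsigned long default_flags,
                                 unsigned long pfn_flags_mask)
{
    return (entry & pfn_flags_mask) | default_flags;
}

/* 对应文中的两个例子 */
void demo_check_examples(void)
{
    /* 例一:mask为0,条目内容被忽略,整个范围只要求读权限 */
    assert(demo_effective_req(DEMO_REQ_WRITE, DEMO_REQ_FAULT, 0)
           == DEMO_REQ_FAULT);
    /* 例二:mask放行WRITE,被标记的那个条目额外要求写权限 */
    assert(demo_effective_req(DEMO_REQ_WRITE, DEMO_REQ_FAULT, DEMO_REQ_WRITE)
           == (DEMO_REQ_FAULT | DEMO_REQ_WRITE));
}
```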
从核心内核的角度表示和管理设备内存
==================================
尝试了几种不同的设计来支持设备内存。第一个使用特定于设备的数据结构来保存有关迁移内存
的信息,HMM 将自身挂接到 mm 代码的各个位置,以处理对设备内存支持的地址的任何访问。
事实证明,这最终复制了 struct page 的大部分字段,并且还需要更新许多内核代码路径才
能理解这种新的内存类型。
大多数内核代码路径从不尝试访问页面后面的内存,而只关心struct page的内容。正因为如此,
HMM 切换到直接使用 struct page 用于设备内存,这使得大多数内核代码路径不知道差异。
我们只需要确保没有人试图从 CPU 端映射这些页面。
移入和移出设备内存
==================
由于 CPU 无法直接访问设备内存,因此设备驱动程序必须使用硬件 DMA 或设备特定的加载/存
储指令来迁移数据。migrate_vma_setup()、migrate_vma_pages() 和
migrate_vma_finalize() 函数旨在使驱动程序更易于编写并集中跨驱动程序的通用代码。
在将页面迁移到设备私有内存之前,需要创建特殊的设备私有 ``struct page`` 。这些将用
作特殊的“交换”页表条目,以便 CPU 进程在尝试访问已迁移到设备专用内存的页面时会发生异常。
这些可以通过以下方式分配和释放::
struct resource *res;
struct dev_pagemap pagemap;
res = request_free_mem_region(&iomem_resource, /* number of bytes */,
"name of driver resource");
pagemap.type = MEMORY_DEVICE_PRIVATE;
pagemap.range.start = res->start;
pagemap.range.end = res->end;
pagemap.nr_range = 1;
pagemap.ops = &device_devmem_ops;
memremap_pages(&pagemap, numa_node_id());
memunmap_pages(&pagemap);
release_mem_region(pagemap.range.start, range_len(&pagemap.range));
当资源可以绑定到 ``struct device`` 时,还可以使用 devm_request_free_mem_region()、
devm_memremap_pages()、devm_memunmap_pages() 和 devm_release_mem_region()。
整体迁移步骤类似于在系统内存中迁移 NUMA 页面(参见 :ref:`Page migration <page_migration>` ),
但这些步骤分为设备驱动程序特定代码和共享公共代码:
1. ``mmap_read_lock()``
设备驱动程序必须将 ``struct vm_area_struct`` 传递给migrate_vma_setup(),
因此需要在迁移期间保留 mmap_read_lock() 或 mmap_write_lock()。
2. ``migrate_vma_setup(struct migrate_vma *args)``
设备驱动初始化了 ``struct migrate_vma`` 的字段,并将该指针传递给
migrate_vma_setup()。``args->flags`` 字段是用来过滤哪些源页面应该被迁移。
例如,设置 ``MIGRATE_VMA_SELECT_SYSTEM`` 将只迁移系统内存,设置
``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` 将只迁移驻留在设备私有内存中的页
面。如果后者被设置, ``args->pgmap_owner`` 字段被用来识别驱动所拥有的设备
私有页。这就避免了试图迁移驻留在其他设备中的设备私有页。目前,只有匿名的私有VMA
范围可以被迁移到系统内存和设备私有内存。
migrate_vma_setup()所做的第一步是在遍历页表的前后分别调用
``mmu_notifier_invalidate_range_start()`` 和 ``mmu_notifier_invalidate_range_end()`` ,
使其他设备的MMU无效;遍历页表则是为了在 ``args->src`` 数组中填写要迁移的PFN。
``invalidate_range_start()`` 回调会收到一个 ``struct mmu_notifier_range`` ,
其 ``event`` 字段设置为MMU_NOTIFY_MIGRATE, ``owner`` 字段设置为传递给
migrate_vma_setup()的 ``args->pgmap_owner`` 字段。这允许设备驱动跳过无
效化回调,只无效化那些实际正在迁移的设备私有MMU映射。这一点将在下一节详细解释。
在遍历页表时,一个 ``pte_none()`` 或 ``is_zero_pfn()`` 条目导致一个有效
的 “zero” PFN 存储在 ``args->src`` 数组中。这让驱动分配设备私有内存并清
除它,而不是复制一个零页。指向系统内存或设备私有struct page的有效PTE条目将被
``lock_page()`` 锁定,与LRU隔离(仅限系统内存,因为设备私有页不在LRU上),从进
程中取消映射,并插入一个特殊的迁移PTE来代替原来的PTE。 migrate_vma_setup()
还清除了 ``args->dst`` 数组。
3. 设备驱动程序分配目标页面并将源页面复制到目标页面。
驱动程序检查每个 ``src`` 条目以查看该 ``MIGRATE_PFN_MIGRATE`` 位是否已
设置并跳过未迁移的条目。设备驱动程序还可以通过不填充页面的 ``dst`` 数组来选
择跳过页面迁移。
然后,驱动程序分配一个设备私有 struct page 或一个系统内存页,用 ``lock_page()``
锁定该页,并将 ``dst`` 数组条目填入::
dst[i] = migrate_pfn(page_to_pfn(dpage));
现在驱动程序知道这个页面正在被迁移,它可以使设备私有 MMU 映射无效并将设备私有
内存复制到系统内存或另一个设备私有页面。由于核心 Linux 内核会处理 CPU 页表失
效,因此设备驱动程序只需使其自己的 MMU 映射失效。
驱动程序可以使用 ``migrate_pfn_to_page(src[i])`` 来获取源页面的
``struct page`` ,并将源页面复制到目标页面;如果指针为 ``NULL`` ,意
味着源页面没有被填充到系统内存中,则清除目标设备的私有内存。
4. ``migrate_vma_pages()``
这一步是实际“提交”迁移的地方。
如果源页是 ``pte_none()`` 或 ``is_zero_pfn()`` 页,这时新分配的页会被插
入到CPU的页表中。如果一个CPU线程在同一页面上发生异常,这可能会失败。然而,页
表被锁定,只有一个新页会被插入。如果它失去了竞争,设备驱动将看到
``MIGRATE_PFN_MIGRATE`` 位被清除。
如果源页被锁定、隔离等,源 ``struct page`` 信息现在被复制到目标
``struct page`` ,最终完成CPU端的迁移。
5. 设备驱动为仍在迁移的页面更新设备MMU页表,回滚未迁移的页面。
如果 ``src`` 条目仍然有 ``MIGRATE_PFN_MIGRATE`` 位被设置,设备驱动可以
更新设备MMU,如果 ``MIGRATE_PFN_WRITE`` 位被设置,则设置写启用位。
6. ``migrate_vma_finalize()``
这一步用新页的页表项替换特殊的迁移页表项,并释放对源和目的 ``struct page``
的引用。
7. ``mmap_read_unlock()``
现在可以释放锁了。
独占访问存储器
==============
一些设备具有诸如原子PTE位的功能,可以用来实现对系统内存的原子访问。为了支持对一
个共享的虚拟内存页的原子操作,这样的设备对该页的访问必须是独占的,即排除来自CPU
的任何用户空间访问。 ``make_device_exclusive_range()`` 函数可以用来使一
个内存范围不能从用户空间访问。
这将用特殊的交换条目替换给定范围内的所有页的映射。任何试图访问交换条目的行为都会
导致一个异常,该异常会通过用原始映射替换该条目而得到恢复。驱动程序会被通知映射已
经被MMU通知器改变,之后它将不再有对该页的独占访问。独占访问被保证持续到驱动程序
放弃页面锁和页面引用为止,这时页面上的任何CPU异常都可以按所述进行。
内存 cgroup (memcg) 和 rss 统计
===============================
目前,设备内存被视为 rss 计数器中的任何常规页面(如果设备页面用于匿名,则为匿名,
如果设备页面用于文件支持页面,则为文件,如果设备页面用于共享内存,则为 shmem)。
这是为了保持现有应用程序的故意选择,这些应用程序可能在不知情的情况下开始使用设备
内存,运行不受影响。
一个缺点是 OOM 杀手可能会杀死使用大量设备内存而不是大量常规系统内存的应用程序,
因此不会释放太多系统内存。在决定以不同方式计算设备内存之前,我们希望收集更多关
于应用程序和系统在存在设备内存的情况下在内存压力下如何反应的实际经验。
对内存 cgroup 做出了相同的决定。设备内存页面与常规页面一样,计入同一个内存
cgroup。这确实简化了进出设备内存的迁移。这也意味着从设备内存迁移回常规
内存不会因为超过内存 cgroup 限制而失败。一旦我们对设备内存的使用方式及其对
内存资源控制的影响有了更多的了解,我们可能会在后面重新考虑这个选择。
请注意,设备内存永远不能由设备驱动程序或通过 GUP 固定,因此此类内存在进程退出时
总是被释放的。或者在共享内存或文件支持内存的情况下,当删除最后一个引用时。
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/vm/hugetlbfs_reserv.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
==============
Hugetlbfs 预留
==============
概述
====
:ref:`hugetlbpage` 中描述的巨页通常是预先分配给应用程序使用的。如果VMA指
示要使用巨页,这些巨页会在缺页异常时被实例化到任务的地址空间。如果在缺页异常
时没有巨页存在,任务就会被发送一个SIGBUS,并经常不高兴地死去。在加入巨页支
持后不久,人们决定,在mmap()时检测巨页的短缺情况会更好。这个想法是,如果
没有足够的巨页来覆盖映射,mmap()将失败。这首先是在mmap()时在代码中做一个
简单的检查,以确定是否有足够的空闲巨页来覆盖映射。就像内核中的大多数东西一
样,代码随着时间的推移而不断发展。然而,基本的想法是在mmap()时 “预留”
巨页,以确保巨页可以用于该映射中的缺页异常。下面的描述试图描述在v4.10内核
中是如何进行巨页预留处理的。
读者
====
这个描述主要是针对正在修改hugetlbfs代码的内核开发者。
数据结构
========
resv_huge_pages
这是一个全局的(per-hstate)预留的巨页的计数。预留的巨页只对预留它们的任
务可用。因此,一般可用的巨页的数量被计算为(``free_huge_pages - resv_huge_pages``)。
Reserve Map
预留映射由以下结构体描述::
struct resv_map {
struct kref refs;
spinlock_t lock;
struct list_head regions;
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
};
系统中每个巨页映射都有一个预留映射。resv_map中的regions列表描述了映射中的
区域。一个区域被描述为::
struct file_region {
struct list_head link;
long from;
long to;
};
file_region结构体的 ‘from’ 和 ‘to’ 字段是进入映射的巨页索引。根据映射的类型,在
resv_map 中的一个区域可能表示该范围存在预留,或预留不存在。
Flags for MAP_PRIVATE Reservations
这些被存储在预留的映射指针的底部。
``#define HPAGE_RESV_OWNER (1UL << 0)``
表示该任务是与该映射相关的预留的所有者。
``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
表示最初映射此范围(并创建预留)的任务由于COW失败而从该任务(子任务)中取消映
射了一个页面。
Page Flags
PagePrivate页面标志是用来指示在释放巨页时必须恢复巨页的预留。更多细节将在
“释放巨页” 一节中讨论。
预留映射位置(私有或共享)
==========================
一个巨页映射或段要么是私有的,要么是共享的。如果是私有的,它通常只对一个地址空间
(任务)可用。如果是共享的,它可以被映射到多个地址空间(任务)。对于这两种类型的映射,
预留映射的位置和语义是明显不同的。位置的差异是:
- 对于私有映射,预留映射挂在VMA结构体上。具体来说,就是vma->vm_private_data。这个预
  留映射是在创建映射(mmap(MAP_PRIVATE))时创建的。
- 对于共享映射,预留映射挂在inode上。具体来说,就是inode->i_mapping->private_data。
  由于共享映射总是由hugetlbfs文件系统中的文件支持,hugetlbfs代码确保每个inode包含一个预
  留映射。因此,预留映射在创建inode时被分配。
创建预留
========
当创建一个由巨页支持的共享内存段(shmget(SHM_HUGETLB))或通过mmap(MAP_HUGETLB)
创建一个映射时,就会创建预留。这些操作会导致对函数hugetlb_reserve_pages()的调用::
int hugetlb_reserve_pages(struct inode *inode,
long from, long to,
struct vm_area_struct *vma,
vm_flags_t vm_flags)
hugetlb_reserve_pages()做的第一件事是检查在调用shmget()或mmap()时是否指定了NORESERVE
标志。如果指定了NORESERVE,那么这个函数立即返回,因为不需要预留。
参数'from'和'to'是映射或基础文件的巨页索引。对于shmget(),'from'总是0,'to'对应于段/映射
的长度。对于mmap(),offset参数可以用来指定进入底层文件的偏移量。在这种情况下,'from'和'to'
参数已经被这个偏移量所调整。
PRIVATE和SHARED映射之间的一个很大的区别是预留在预留映射中的表示方式。
- 对于共享映射,预留映射中的条目表示对应页面的预留存在或曾经存在。当预留被消耗时,预留映射不被
修改。
- 对于私有映射,预留映射中没有条目表示相应页面存在预留。随着预留被消耗,条目被添加到预留映射中。
因此,预留映射也可用于确定哪些预留已被消耗。
对于私有映射,hugetlb_reserve_pages()创建预留映射并将其挂在VMA结构体上。此外,
HPAGE_RESV_OWNER标志被设置,以表明该VMA拥有预留。
预留映射被查阅以确定当前映射/段需要多少巨页预留。对于私有映射,这始终是一个值(to - from)。
然而,对于共享映射来说,一些预留可能已经存在于(to - from)的范围内。关于如何实现这一点的细节,
请参见 :ref:`预留映射的修改 <resv_map_modifications>` 一节。
该映射可能与一个子池(subpool)相关联。如果是这样,将查询子池以确保有足够的空间用于映射。子池
有可能已经预留了可用于映射的预留空间。更多细节请参见 :ref:`子池预留 <sub_pool_resv>`
一节。
在咨询了预留映射和子池之后,就知道了需要的新预留数量。hugetlb_acct_memory()函数被调用以检查
并获取所要求的预留数量。hugetlb_acct_memory()调用到可能分配和调整剩余页数的函数。然而,在这
些函数中,代码只是检查以确保有足够的空闲的巨页来容纳预留。如果有的话,全局预留计数resv_huge_pages
会被调整,如下所示::
    if (resv_needed <= (free_huge_pages - resv_huge_pages))
        resv_huge_pages += resv_needed;

注意,在检查和调整这些计数器时,全局锁hugetlb_lock会被持有。
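按照前文“一般可用的巨页数量为 free_huge_pages - resv_huge_pages”的定义,这一检查可以粗略地示意如下。 ``struct toy_hstate`` 和 ``toy_acct_reserve()`` 是为演示而假设的名字;这里省略了加锁,也省略了可能分配剩余页的路径。

```c
#include <assert.h>

/* 假设的简化版全局(per-hstate)计数器 */
struct toy_hstate {
    long free_huge_pages;
    long resv_huge_pages;
};

/* 空闲且尚未被预留的巨页足够时,增加预留计数并返回0;否则返回-1 */
int toy_acct_reserve(struct toy_hstate *h, long resv_needed)
{
    assert(h && resv_needed >= 0);
    if (resv_needed > h->free_huge_pages - h->resv_huge_pages)
        return -1;
    h->resv_huge_pages += resv_needed;
    return 0;
}
```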
如果有足够的空闲的巨页,并且全局计数resv_huge_pages被调整,那么与映射相关的预留映射被修改以
反映预留。在共享映射的情况下,将存在一个file_region,包括'from'-'to'范围。对于私有映射,
不对预留映射进行修改,因为没有条目表示存在预留。
如果hugetlb_reserve_pages()成功,全局预留数和与映射相关的预留映射将根据需要被修改,以确保
在'from'-'to'范围内存在预留。
消耗预留/分配一个巨页
===========================
当与预留相关的巨页在相应的映射中被分配和实例化时,预留就被消耗了。该分配是在函数alloc_huge_page()
中进行的::
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
alloc_huge_page被传递给一个VMA指针和一个虚拟地址,因此它可以查阅预留映射以确定是否存在预留。
此外,alloc_huge_page需要一个参数avoid_reserve,该参数表示即使看起来已经为指定的地址预留了
预留,也不应该使用预留。avoid_reserve参数最常被用于写时拷贝和页面迁移的情况下,即现有页面的额
外拷贝被分配。
调用辅助函数vma_needs_reservation()来确定是否存在对映射(vma)中地址的预留。关于这个函数的详
细内容,请参见 :ref:`预留映射帮助函数 <resv_map_helpers>` 一节。从
vma_needs_reservation()返回的值通常为0或1。如果该地址存在预留,则为0,如果不存在预留,则为1。
如果不存在预留,并且有一个与映射相关联的子池,则查询子池以确定它是否包含预留。如果子池包含预留,
则可将其中一个用于该分配。然而,在任何情况下,avoid_reserve参数都会否决为该分配使用预留。在
确定预留是否存在并可用于分配后,调用dequeue_huge_page_vma()函数。这个函数需要两个与预留有关
的参数:
- avoid_reserve,这是传递给alloc_huge_page()的同一个值/参数。
- chg,尽管这个参数的类型是long,但只有0或1的值被传递给dequeue_huge_page_vma。如果该值为0,
则表明存在预留(关于可能的问题,请参见 “预留和内存策略” 一节)。如果值
为1,则表示不存在预留,如果可能的话,必须从全局空闲池中取出该页。
与VMA的内存策略相关的空闲列表被搜索到一个空闲页。如果找到了一个页面,当该页面从空闲列表中移除时,
free_huge_pages的值被递减。如果有一个与该页相关的预留,将进行以下调整::
SetPagePrivate(page); /* 表示分配这个页面消耗了一个预留,
* 如果遇到错误,以至于必须释放这个页面,预留将被
* 恢复。 */
resv_huge_pages--; /* 减少全局预留计数 */
注意,如果找不到满足VMA内存策略的巨页,将尝试使用伙伴分配器分配一个。这就带来了超出预留范围
的剩余巨页和超额分配的问题。即使分配了一个多余的页面,也会进行与上面一样的基于预留的调整:
SetPagePrivate(page) 和 resv_huge_pages--.
在获得一个新的巨页后,(page)->private被设置为与该页面相关的子池的值,如果它存在的话。当页
面被释放时,这将被用于子池的计数。
然后调用函数vma_commit_reservation(),根据预留的消耗情况调整预留映射。一般来说,这涉及
到确保页面在区域映射的file_region结构体中被表示。对于预留存在的共享映射,预留映射中的条目
已经存在,所以不做任何改变。然而,如果共享映射中没有预留,或者这是一个私有映射,则必须创建一
个新的条目。
在alloc_huge_page()开始调用vma_needs_reservation()和页面分配后调用
vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_reserve_pages在共
享映射中为同一页面被调用,这将是可能的。在这种情况下,预留计数和子池空闲页计数会有一个偏差。
这种罕见的情况可以通过比较vma_needs_reservation和vma_commit_reservation的返回值来
识别。如果检测到这种竞争,子池和全局预留计数将被调整以进行补偿。关于这些函数的更多信息,请
参见 :ref:`预留映射帮助函数 <resv_map_helpers>` 一节。
实例化巨页
==========
在巨页分配之后,页面通常被添加到分配任务的页表中。在此之前,共享映射中的页面被添加到页面缓
存中,私有映射中的页面被添加到匿名反向映射中。在这两种情况下,PagePrivate标志被清除。因此,
当一个已经实例化的巨页被释放时,不会对全局预留计数(resv_huge_pages)进行调整。
释放巨页
========
巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含预
留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。
page->private字段指向与该页相关的任何子池。如果PagePrivate标志被设置,它表明全局预留计数
应该被调整(关于如何设置这些标志的信息,请参见
:ref:`消耗预留/分配一个巨页 <consume_resv>` )。
该函数首先调用hugepage_subpool_put_pages()来处理该页。如果这个函数返回一个0的值(不等于
传递的1的值),它表明预留与子池相关联,这个新释放的页面必须被用来保持子池预留的数量超过最小值。
因此,在这种情况下,全局resv_huge_pages计数器被递增。
如果页面中设置了PagePrivate标志,那么全局resv_huge_pages计数器总是会被递增。
子池预留
========
有一个结构体hstate与每个巨页尺寸相关联。hstate跟踪所有指定大小的巨页。一个子池代表一
个hstate中的页面子集,它与一个已挂载的hugetlbfs文件系统相关。
当一个hugetlbfs文件系统被挂载时,可以指定min_size选项,它表示文件系统所需的最小的巨页数量。
如果指定了这个选项,与min_size相对应的巨页的数量将被预留给文件系统使用。这个数字在结构体
hugepage_subpool的min_hpages字段中被跟踪。在挂载时,hugetlb_acct_memory(min_hpages)
被调用以预留指定数量的巨页。如果它们不能被预留,挂载就会失败。
当从子池中获取或释放页面时,会调用hugepage_subpool_get/put_pages()函数。
hugepage_subpool_get/put_pages被传入巨页数量,以此来调整子池的 “已用页面” 计数
(get为下降,put为上升)。通常情况下,它们返回与传入值相同的值;如果子池中没有足够的页面,
则返回一个错误。
然而,如果预留与子池相关联,可能会返回一个小于传递值的返回值。这个返回值表示必须进行的额外全局
池调整的数量。例如,假设一个子池包含3个预留的巨页,有人要求5个。与子池相关的3个预留页可以用来
满足部分请求。但是,必须从全局池中获得2个页面。为了向调用者转达这一信息,将返回值2。然后,调用
者要负责从全局池中获取另外两个页面。
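上面“子池有3个预留页、请求5个、返回2”的例子可以示意如下。这是一个只保留预留计数的简化模型, ``struct toy_subpool`` 及其字段是假设的,忽略了真实子池中的最大页数限制、已用页面计数和加锁。

```c
#include <assert.h>

/* 假设的简化子池:只记录尚未使用的预留巨页数 */
struct toy_subpool {
    long rsv_hpages;
};

/* 先用子池自己的预留满足请求;返回仍需从全局池获取的页数 */
long toy_subpool_get_pages(struct toy_subpool *sp, long delta)
{
    long from_rsv;

    assert(sp && delta >= 0);
    from_rsv = delta < sp->rsv_hpages ? delta : sp->rsv_hpages;
    sp->rsv_hpages -= from_rsv;
    return delta - from_rsv;
}
```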
COW和预留
==========
由于共享映射都指向并使用相同的底层页面,COW最大的预留问题是私有映射。在这种情况下,两个任务可
以指向同一个先前分配的页面。一个任务试图写到该页,所以必须分配一个新的页,以便每个任务都指向它
自己的页。
当该页最初被分配时,该页的预留被消耗了。当由于COW而试图分配一个新的页面时,有可能没有空闲的巨
页,分配会失败。
当最初创建私有映射时,通过设置所有者的预留映射指针中的HPAGE_RESV_OWNER位来标记映射的所有者。
由于所有者创建了映射,所有者拥有与映射相关的所有预留。因此,当一个写异常发生并且没有可用的页面
时,对预留的所有者和非所有者采取不同的行动。
在发生异常的任务不是所有者的情况下,异常将失败,该任务通常会收到一个SIGBUS。
如果所有者是发生异常的任务,我们希望它能够成功,因为它拥有原始的预留。为了达到这个目的,该页被
从非所有者任务中解映射出来。这样一来,唯一的引用就是来自拥有者的任务。此外,HPAGE_RESV_UNMAPPED
位被设置在非拥有任务的预留映射指针中。如果非拥有者任务后来在一个不存在的页面上发生异常,它可能
会收到一个SIGBUS。但是,映射/预留的原始拥有者的行为将与预期一致。
预留映射的修改
==============
以下低级函数用于对预留映射进行修改。通常情况下,这些函数不会被直接调用。而是调用一个预留映射辅
助函数,该函数调用这些低级函数中的一个。这些低级函数在源代码(mm/hugetlb.c)中得到了相当好的
记录。这些函数是::
long region_chg(struct resv_map *resv, long f, long t);
long region_add(struct resv_map *resv, long f, long t);
void region_abort(struct resv_map *resv, long f, long t);
long region_count(struct resv_map *resv, long f, long t);
在预留映射上的操作通常涉及两个操作:
1) region_chg()被调用来检查预留映射,并确定在指定的范围[f, t]内有多少页目前没有被表示。
   调用代码执行全局检查和分配,以确定是否有足够的巨页使操作成功。
2)
   a) 如果操作能够成功,region_add()将被调用,以实际修改先前传递给region_chg()的相同范围
      [f, t]的预留映射。
   b) 如果操作不能成功,region_abort被调用,在相同的范围[f, t]内中止操作。
注意,这是一个两步的过程, region_add()和 region_abort()在事先调用 region_chg()后保证
成功。 region_chg()负责预先分配任何必要的数据结构以确保后续操作(特别是 region_add())的
成功。
如上所述,region_chg()确定该范围内当前没有在映射中表示的页面的数量。region_add()返回添加
到映射中的范围内的页数。在大多数情况下, region_add() 的返回值与 region_chg() 的返回值相
同。然而,在共享映射的情况下,有可能在调用 region_chg() 和 region_add() 之间对预留映射进
行更改。在这种情况下,region_add()的返回值将与region_chg()的返回值不符。在这种情况下,全局计数
和子池计数很可能是不正确的,需要调整。检查这种情况并进行适当的调整是调用者的责任。
函数region_del()被调用以从预留映射中移除区域。
它通常在以下情况下被调用:
- 当hugetlbfs文件系统中的一个文件被删除时,该节点将被释放,预留映射也被释放。在释放预留映射
之前,所有单独的file_region结构体必须被释放。在这种情况下,region_del的范围是[0, LONG_MAX]。
- 当一个hugetlbfs文件正在被截断时。在这种情况下,所有在新文件大小之后分配的页面必须被释放。
此外,预留映射中任何超过新文件大小的file_region条目必须被删除。在这种情况下,region_del
的范围是[new_end_of_file, LONG_MAX]。
- 当在一个hugetlbfs文件中打洞时。在这种情况下,巨页被一次次从文件的中间移除。当这些页被移除
时,region_del()被调用以从预留映射中移除相应的条目。在这种情况下,region_del被传递的范
围是[page_idx, page_idx + 1]。
在任何情况下,region_del()都会返回从预留映射中删除的页面数量。在非常罕见的情况下,region_del()
会失败。这只能发生在打洞的情况下,即它必须分割一个现有的file_region条目,而不能分配一个新的
结构体。在这种错误情况下,region_del()将返回-ENOMEM。这里的问题是,预留映射将显示对该页有
预留。然而,子池和全局预留计数将不反映该预留。为了处理这种情况,调用函数hugetlb_fix_reserve_counts()
来调整计数器,使其与不能被删除的预留映射条目相对应。
region_count()在解除私有巨页映射时被调用。在私有映射中,预留映射中没有条目表明存在一个预留。
因此,通过计算预留映射中的条目数,我们知道有多少预留被消耗了,有多少预留是未完成的
(Outstanding = (end - start) - region_count(resv, start, end))。由于映射正在消
失,子池和全局预留计数被未完成的预留数量所减去。
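对私有映射而言,未完成预留数的计算可以示意如下。这里假设各区域互不重叠,并把范围当作半开区间 [f, t) 处理(与前文“一个页面对应 [page_idx, page_idx + 1]”的用法一致);用数组代替链表以及所有 ``toy_`` 开头的名字都是演示用的假设。

```c
#include <assert.h>
#include <stddef.h>

/* 假设的简化区域:覆盖巨页索引 [from, to) */
struct toy_region {
    long from;
    long to;
};

/* 统计 [f, t) 内被区域覆盖的页数;假设各区域互不重叠 */
long toy_region_count(const struct toy_region *r, size_t n, long f, long t)
{
    long chg = 0;
    size_t i;

    assert(f <= t);
    for (i = 0; i < n; i++) {
        long lo = r[i].from > f ? r[i].from : f;
        long hi = r[i].to < t ? r[i].to : t;

        if (hi > lo)
            chg += hi - lo;
    }
    return chg;
}

/* Outstanding = (end - start) - region_count(resv, start, end) */
long toy_outstanding(const struct toy_region *r, size_t n,
                     long start, long end)
{
    return (end - start) - toy_region_count(r, n, start, end);
}
```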
预留映射帮助函数
================
有几个辅助函数可以查询和修改预留映射。这些函数只对特定的巨页的预留感兴趣,所以它们只是传入一个
地址而不是一个范围。此外,它们还传入相关的VMA。从VMA中,可以确定映射的类型(私有或共享)和预留
映射的位置(inode或VMA)。这些函数只是调用 “预留映射的修改” 一节中描述的基础函数。然而,
它们确实考虑到了私有和共享映射的预留映射条目的 “相反” 含义,并向调用者隐藏了这个细节::
long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
该函数为指定的页面调用 region_chg()。如果不存在预留,则返回1。如果存在预留,则返回0::
long vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
这将调用 region_add(),用于指定的页面。与region_chg和region_add的情况一样,该函数应在
先前调用的vma_needs_reservation后调用。它将为该页添加一个预留条目。如果预留被添加,它将
返回1,如果没有则返回0。返回值应与之前调用vma_needs_reservation的返回值进行比较。如果出
现意外的差异,说明在两次调用之间修改了预留映射::
void vma_end_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
这将调用指定页面的 region_abort()。与region_chg和region_abort的情况一样,该函数应在
先前调用的vma_needs_reservation后被调用。它将中止/结束正在进行的预留添加操作::
long vma_add_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
这是一个特殊的包装函数,有助于在错误路径上清理预留。它只从restore_reserve_on_error()函数
中调用。该函数与vma_needs_reservation一起使用,试图将一个预留添加到预留映射中。它考虑到
了私有和共享映射的不同预留映射语义。因此,region_add被调用用于共享映射(因为映射中的条目表
示预留),而region_del被调用用于私有映射(因为映射中没有条目表示预留)。关于在错误路径上需
要做什么的更多信息,请参见 “错误路径中的预留清理” 。
错误路径中的预留清理
====================
正如在 :ref:`预留映射帮助函数 <resv_map_helpers>` 一节中提到的,预留的修改分两步进行。首
先,在分配页面之前调用vma_needs_reservation。如果分配成功,则调用vma_commit_reservation。
如果不是,则调用vma_end_reservation。全局和子池的预留计数根据操作的成功或失败进行调整,
一切都很好。
此外,在一个巨页被实例化后,PagePrivate标志被清空,这样,当页面最终被释放时,计数是
正确的。
然而,有几种情况是,在一个巨页被分配后,但在它被实例化之前,就遇到了错误。在这种情况下,
页面分配已经消耗了预留,并进行了适当的子池、预留映射和全局计数调整。如果页面在这个时候被释放
(在实例化和清除PagePrivate之前),那么free_huge_page将增加全局预留计数。然而,预留映射
显示预留被消耗了。这种不一致的状态将导致预留的巨页的 “泄漏” 。全局预留计数将比它原本的要高,
并阻止分配一个预先分配的页面。
函数 restore_reserve_on_error() 试图处理这种情况。它有相当完善的文档。这个函数的目的
是将预留映射恢复到页面分配前的状态。通过这种方式,预留映射的状态将与页面释放后的全局预留计
数相对应。
函数restore_reserve_on_error本身在试图恢复预留映射条目时可能会遇到错误。在这种情况下,
它将简单地清除该页的PagePrivate标志。这样一来,当页面被释放时,全局预留计数将不会被递增。
然而,预留映射将继续看起来像预留被消耗了一样。一个页面仍然可以被分配到该地址,但它不会像最
初设想的那样使用一个预留页。
有一些代码(最明显的是userfaultfd)不能调用restore_reserve_on_error。在这种情况下,
它简单地修改了PagePrivate,以便在释放巨页时不会泄露预留。
预留和内存策略
==============
当git第一次被用来管理Linux代码时,每个节点的巨页列表就存在于hstate结构中。预留的概念是
在一段时间后加入的。当预留被添加时,没有尝试将内存策略考虑在内。虽然cpusets与内存策略不
完全相同,但hugetlb_acct_memory中的这个注释总结了预留和cpusets/内存策略之间的相互作
用::
/*
* 当cpuset被配置时,它打破了严格的hugetlb页面预留,因为计数是在一个全局变量上完
* 成的。在有cpuset的情况下,这样的预留完全是垃圾,因为预留没有根据当前cpuset的
* 页面可用性来检查。在任务所在的cpuset中缺乏空闲的htlb页面时,应用程序仍然有可能
* 被内核OOM'ed。试图用cpuset来执行严格的计数几乎是不可能的(或者说太难看了),因
* 为cpuset太不稳定了,任务或内存节点可以在cpuset之间动态移动。与cpuset共享
* hugetlb映射的语义变化是不可取的。然而,为了预留一些语义,我们退回到检查当前空闲
* 页的可用性,作为一种最好的尝试,希望能将cpuset改变语义的影响降到最低。
*/
添加巨页预留是为了防止在缺页异常时出现意外的页面分配失败(OOM)。然而,如果一个应用
程序使用cpusets或内存策略,就不能保证在所需的节点上有巨页可用。即使有足够数量的全局
预留,也是如此。
Hugetlbfs回归测试
=================
最完整的hugetlb测试集在libhugetlbfs仓库。如果你修改了任何hugetlb相关的代码,请使用
libhugetlbfs测试套件来检查回归情况。此外,如果你添加了任何新的hugetlb功能,请在
libhugetlbfs中添加适当的测试。
--
Mike Kravetz,2017年4月7日
:Original: Documentation/vm/hwpoison.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
========
hwpoison
========
什么是hwpoison?
===============
即将推出的英特尔CPU支持从一些内存错误中恢复( ``MCA恢复`` )。这需要操作系统宣布
一个页面"poisoned",杀死与之相关的进程,并避免在未来使用它。
这个补丁包在虚拟内存(VM)子系统中实现了必要的(编程)框架。
引用概述中的评论::
高层机器检查(machine check)处理程序。处理被硬件报告为损坏的页面,通常是由于2位ECC内
存或高速缓存故障。
这主要是针对在后台检测到的损坏的页面。当当前的CPU试图访问损坏数据时,当前运行的进程
可以直接被杀死。这意味着,如果错误由于某种原因不能被处理,就可以安全地忽略它,因为损坏
的数据还没有被访问;等到真正访问时,会再发生一次机器检查。
处理不同状态的页面缓存页。这里棘手的部分是,相对于其他虚拟内存用户, 我们可以异
步访问任何页面。因为内存故障可能随时随地发生,可能违反了他们的一些假设。这就是
为什么这段代码必须非常小心。一般来说,它试图使用正常的锁规则,如获得标准锁,即使
这意味着错误处理可能需要很长的时间。
这里的一些操作有点低效,并且具有非线性的算法复杂性,因为数据结构没有针对这种情
况进行优化。特别是从vma到进程的映射就是这种情况。由于预计这种情况很少发生,所
以我们希望这样做不会带来问题。
该代码由mm/memory-failure.c中的高级处理程序、一个新的页面poison位和虚拟内存(VM)子系统中的
各种检查组成,用来处理poison的页面。
现在主要目标是KVM客户机,但它适用于所有类型的应用程序。支持KVM需要最近的qemu-kvm
版本。
对于KVM的使用,需要一个新的信号类型,这样KVM就可以用适当的地址将机器检查注入到客户
机中。这在理论上也允许其他应用程序处理内存故障。我们的期望是,几乎所有的应用程序都不会这
样做,但一些非常专业的应用程序可能会这样做。
故障恢复模式
============
内存故障恢复可以处于两种(实际上是三种)模式:
vm.memory_failure_recovery sysctl 置零:
所有的内存故障都会导致panic。不尝试恢复。
早期处理
(可以在全局和每个进程中控制) 一旦检测到错误,立即向应用程序发送SIGBUS。这允许
应用程序以温和的方式处理内存错误(例如,放弃受影响的对象)。这是KVM qemu使用的
模式。
推迟处理
当应用程序运行到损坏的页面时,发送SIGBUS。这对不知道内存错误的应用程序来说是
最好的,也是默认模式。注意,一些页面总是被当作late kill处理。
用户控制
========
vm.memory_failure_recovery
参阅 sysctl.txt
vm.memory_failure_early_kill
全局启用early kill
PR_MCE_KILL
设置early/late kill mode/revert 到系统默认值。
arg1: PR_MCE_KILL_CLEAR:
恢复到系统默认值
arg1: PR_MCE_KILL_SET:
arg2定义了线程特定模式
PR_MCE_KILL_EARLY:
Early kill
PR_MCE_KILL_LATE:
Late kill
PR_MCE_KILL_DEFAULT
使用系统全局默认值
注意,如果你想有一个专门的线程代表进程处理SIGBUS(BUS_MCEERR_AO),你应该在
指定线程上调用prctl(PR_MCE_KILL_EARLY)。否则,SIGBUS将被发送到主线程。
PR_MCE_KILL_GET
返回当前模式
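下面是调用这些 prctl 选项的一个最小示意,假设在 Linux 上用 glibc 的 ``<sys/prctl.h>`` 编译。两个包装函数的名字是为演示而取的;如果内核没有启用内存故障处理支持,调用预计会失败并返回-1。

```c
#include <assert.h>
#include <sys/prctl.h>

/* 为调用线程设置early kill策略;成功返回0,失败返回-1并设置errno */
int set_mce_early_kill(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* 返回调用线程的当前策略;失败返回-1并设置errno */
int get_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

如果希望由一个专门的线程处理SIGBUS(BUS_MCEERR_AO),应在该线程内调用 ``set_mce_early_kill()`` 。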
测试
====
* madvise(MADV_HWPOISON, ....) (as root) - 在测试过程中Poison一个页面
* 通过debugfs ``/sys/kernel/debug/hwpoison/`` hwpoison-inject模块
corrupt-pfn
在echo到这个文件的PFN处注入hwpoison故障。这做了一些早期过滤,以避
免在测试套件中损坏非预期页面。
unpoison-pfn
对echo到这个文件的PFN处的页面进行软件unpoison。这样,一个页面可以再次被
复用。这只对Linux注入的故障起作用,对真正的内存故障不起作用。
注意这些注入接口并不稳定,可能会在不同的内核版本中发生变化
corrupt-filter-dev-major, corrupt-filter-dev-minor
只处理与块设备major/minor定义的文件系统相关的页面的内存故障。-1U是通
配符值。这应该只用于人工注入的测试。
corrupt-filter-memcg
限制注入到memgroup拥有的页面。由memcg的inode号指定。
Example::
mkdir /sys/fs/cgroup/mem/hwpoison
usemem -m 100 -s 1000 &
echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
page-types -p `pidof init` --hwpoison # shall do nothing
page-types -p `pidof usemem` --hwpoison # poison its pages
corrupt-filter-flags-mask, corrupt-filter-flags-value
When specified, only poison pages if ((page_flags & mask) == value). This allows
stress testing of many kinds of pages. The page_flags are the same as in
/proc/kpageflags. The flag bits are defined in include/linux/kernel-page-flags.h
and documented in Documentation/admin-guide/mm/pagemap.rst.
* Architecture-specific MCE injector
x86 has mce-inject, mce-test
Some portable hwpoison test programs are in mce-test, see below.
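The ``corrupt-filter-flags-mask`` / ``corrupt-filter-flags-value`` test described
above is a plain mask-and-compare. A small sketch, assuming the bit positions
documented for /proc/kpageflags (DIRTY is bit 4, LRU is bit 5); the function name
is hypothetical:

```python
# Bit positions as documented for /proc/kpageflags.
KPF_DIRTY = 1 << 4
KPF_LRU = 1 << 5


def passes_flag_filter(page_flags, mask, value):
    """True if a page with these flags would be poisoned under the filter."""
    return (page_flags & mask) == value


# Example policy: only poison pages that are on the LRU and not dirty.
mask = KPF_LRU | KPF_DIRTY
value = KPF_LRU
```

Bits outside ``mask`` are ignored, so a page with additional flags set still
matches as long as the masked bits equal ``value``.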
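References
==========
http://halobates.de/mce-lc09-2.pdf
Overview presentation from LinuxCon 09
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
Test suite (hwpoison-specific portable tests in tsrc)
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
x86-specific injector
Limitations
===========
- Not all page types are supported and never will be. Most kernel internal
objects cannot be recovered, only LRU pages for now.
---
Andi Kleen, Oct 2009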
...@@ -27,27 +27,28 @@ TODO:待引用文档集被翻译完毕后请及时修改此处) ...@@ -27,27 +27,28 @@ TODO:待引用文档集被翻译完毕后请及时修改此处)
free_page_reporting free_page_reporting
highmem highmem
ksm ksm
frontswap
hmm
hwpoison
hugetlbfs_reserv
memory-model
mmu_notifier
numa
overcommit-accounting
page_frags
page_owner
page_table_check
remap_file_pages
split_page_table_lock
z3fold
zsmalloc
TODOLIST: TODOLIST:
* arch_pgtable_helpers * arch_pgtable_helpers
* free_page_reporting * free_page_reporting
* frontswap
* hmm
* hwpoison
* hugetlbfs_reserv * hugetlbfs_reserv
* memory-model
* mmu_notifier
* numa
* overcommit-accounting
* page_migration * page_migration
* page_frags
* page_owner
* page_table_check
* remap_file_pages
* slub * slub
* split_page_table_lock
* transhuge * transhuge
* unevictable-lru * unevictable-lru
* vmalloced-kernel-stacks * vmalloced-kernel-stacks
* z3fold
* zsmalloc
.. SPDX-License-Identifier: GPL-2.0
:Original: Documentation/vm/memory-model.rst
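:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
=====================
Physical Memory Model
=====================
Physical memory in a system may be addressed in different ways. The simplest
case is when the physical memory starts at address 0 and spans a contiguous
range up to the maximal address. This range may, however, contain small holes
that are not accessible to the CPU. There may also be several contiguous ranges
at completely distinct addresses. And do not forget NUMA, where different memory
banks are attached to different CPUs.
Linux abstracts this diversity using one of two memory models: FLATMEM and
SPARSEMEM. Each architecture defines which memory models it supports, what the
default memory model is, and whether that default can be overridden manually.
All the memory models track the status of physical page frames using
`struct page` arranged in one or more arrays.
Regardless of the selected memory model, there is a one-to-one mapping between
the physical page frame number (PFN) and the corresponding `struct page`.
Each memory model defines the :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that convert from PFN to `struct page` and vice versa.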
FLATMEM
=======
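The simplest memory model is FLATMEM. This model is suitable for non-NUMA
systems with contiguous, or mostly contiguous, physical memory.
In the FLATMEM memory model there is a global `mem_map` array that maps the
entire physical memory. For most architectures, the holes have entries in the
`mem_map` array. The `struct page` objects corresponding to the holes are never
fully initialized.
To allocate the `mem_map` array, architecture-specific setup code should call
free_area_init(). The mapping array is not usable, however, until the call to
memblock_free_all(), which hands all the memory to the page allocator.
An architecture may free the parts of the `mem_map` array that do not cover
actual physical pages. In that case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in `mem_map` into
account.
With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the `mem_map` array.
`ARCH_PFN_OFFSET` defines the first page frame number for systems whose physical
memory does not start at 0.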
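The FLATMEM conversion above is a single subtraction. A toy model, assuming a
hypothetical ``ARCH_PFN_OFFSET`` of 0x80000 (RAM starting at physical address
0x80000000 with 4 KiB pages); the function names are illustrative and are not
the kernel macros themselves:

```python
# Hypothetical value: first valid PFN on a system whose RAM starts at 2 GiB.
ARCH_PFN_OFFSET = 0x80000


def pfn_to_index(pfn):
    """Index into the global mem_map array for a given PFN."""
    return pfn - ARCH_PFN_OFFSET


def index_to_pfn(index):
    """Inverse conversion: mem_map index back to the PFN."""
    return index + ARCH_PFN_OFFSET
```

The two helpers are exact inverses, which mirrors the one-to-one mapping between
a PFN and its `struct page`.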
SPARSEMEM
=========
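SPARSEMEM is the most versatile memory model available in Linux, and it is the
only memory model that supports several advanced features such as hot-plug and
hot-remove of physical memory, alternative memory maps for non-volatile memory
devices, and deferred initialization of the memory map for larger systems.
The SPARSEMEM model presents physical memory as a collection of sections. A
section is represented by struct mem_section, which contains `section_mem_map`
that is, logically, a pointer to an array of `struct page`. It is, however,
stored with some other magic that aids section management. The section size and
the maximal number of sections are specified using the `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants, defined by each architecture that supports
SPARSEMEM. `MAX_PHYSMEM_BITS` is the actual width of a physical address that the
architecture supports, while `SECTION_SIZE_BITS` is an arbitrary value.
The maximal number of sections is denoted `NR_MEM_SECTIONS` and defined as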
.. math::
NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
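The `mem_section` objects are arranged in a two-dimensional array called
`mem_sections`. The size and placement of this array depend on
`CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of sections:
* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections` array is
static and has `NR_MEM_SECTIONS` rows. Each row holds a single `mem_section`
object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections` array is
dynamically allocated. Each row contains `PAGE_SIZE` worth of `mem_section`
objects, and the number of rows is calculated to fit all the memory sections.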
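The sizing rules above can be worked through numerically. The sketch below
assumes illustrative values only (``MAX_PHYSMEM_BITS`` = 46,
``SECTION_SIZE_BITS`` = 27, 4 KiB pages, and a hypothetical 16-byte
`mem_section`); real values depend on the architecture and configuration.

```python
def nr_mem_sections(max_physmem_bits, section_size_bits):
    """NR_MEM_SECTIONS = 2 ^ (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)."""
    return 1 << (max_physmem_bits - section_size_bits)


def mem_sections_rows(nr_sections, page_size, mem_section_size, extreme):
    """Number of rows in the mem_sections array.

    Without SPARSEMEM_EXTREME each row holds one mem_section; with it, each
    row holds PAGE_SIZE worth of mem_section objects (rounded up).
    """
    if not extreme:
        return nr_sections
    per_row = page_size // mem_section_size
    return -(-nr_sections // per_row)  # ceiling division
```

With these assumed numbers the static layout needs one row per section, while
the dynamic layout needs far fewer, larger rows that can be allocated on demand.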
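The architecture setup code should call sparse_init() to initialize the memory
sections and the memory maps.
With SPARSEMEM there are two possible ways to convert a PFN to the corresponding
`struct page`: "classic sparse" and "sparse vmemmap". The selection is made at
build time and is determined by the value of `CONFIG_SPARSEMEM_VMEMMAP`.
Classic sparse encodes the section number of a page in page->flags and uses the
high bits of a PFN to access the section that maps that page frame. Inside a
section, the PFN is the index into the array of pages.
Sparse vmemmap uses a virtually mapped memory map to optimize the pfn_to_page
and page_to_pfn operations. There is a global `struct page *vmemmap` pointer
that points to a virtually contiguous array of `struct page` objects. A PFN is
an index into that array, and the offset of a `struct page` from `vmemmap` is
the PFN of that page.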
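The two lookup schemes can be contrasted with a small model. It assumes 4 KiB
pages (``PAGE_SHIFT`` = 12) and ``SECTION_SIZE_BITS`` = 27 purely for
illustration. Note one simplification: this sketch returns an explicit offset
within the section, whereas the text above says the PFN itself indexes the
section's page array (the kernel stores the section pointer in an encoded form
to make that work).

```python
PAGE_SHIFT = 12          # assumed: 4 KiB pages
SECTION_SIZE_BITS = 27   # assumed: 128 MiB sections

PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT
PAGES_PER_SECTION = 1 << PFN_SECTION_SHIFT


def classic_sparse_lookup(pfn):
    """Return (section number, page offset inside that section)."""
    section_nr = pfn >> PFN_SECTION_SHIFT
    offset = pfn & (PAGES_PER_SECTION - 1)
    return section_nr, offset


def vmemmap_lookup(pfn):
    """With sparse vmemmap the PFN is directly the index into vmemmap."""
    return pfn
```

The vmemmap variant needs no section lookup at all, which is why it makes
pfn_to_page and page_to_pfn cheaper.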
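To use vmemmap, an architecture has to reserve a range of virtual addresses that
will map the physical pages containing the memory map, and make sure that
`vmemmap` points to that range. In addition, the architecture should implement
the :c:func:`vmemmap_populate` method that allocates the physical memory and
creates page tables for the virtual memory map. If an architecture has no
special requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory management.
The virtually mapped memory map allows storing the `struct page` objects for
persistent memory devices in pre-allocated storage on those devices. This
storage is represented by struct vmem_altmap, which is eventually passed to
vmemmap_populate() through a long chain of function calls. The
vmemmap_populate() implementation may use the `vmem_altmap` along with the
:c:func:`vmemmap_alloc_block_buf` helper to allocate the memory map on the
persistent memory device.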
ZONE_DEVICE
===========
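The `ZONE_DEVICE` facility builds on `SPARSEMEM_VMEMMAP` to offer `struct page`
`mem_map` services for device-driver-identified physical address ranges. The
"device" aspect of `ZONE_DEVICE` relates to the fact that the page objects for
these address ranges are never marked online, and that a reference must be taken
against the device, not just the page, to keep the memory pinned for active use.
`ZONE_DEVICE`, via :c:func:`devm_memremap_pages`, performs just enough memory
hotplug to turn on the :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` services for the given range of pfns. Since the page
reference count never drops below 1, the page is never tracked as free memory,
and the page's `struct list_head lru` space is repurposed for back-referencing
the host device/driver that mapped the memory.
While `SPARSEMEM` presents memory as a collection of sections, optionally
collected into memory blocks, `ZONE_DEVICE` users need a smaller granularity for
populating the `mem_map`. Given that `ZONE_DEVICE` memory is never marked
online, its memory ranges are never exposed through the sysfs memory hotplug API
on memory block boundaries. The implementation relies on this lack of
user-API constraint to allow sub-section-sized memory ranges to be specified to
:c:func:`arch_add_memory`, the top half of memory hotplug. Sub-section support
allows 2MB as the cross-arch common alignment granularity for
:c:func:`devm_memremap_pages`.
The users of `ZONE_DEVICE` are:
* pmem: Map platform persistent memory to be used as a direct-I/O target via DAX
mappings.
* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()` event
callbacks to allow a device driver to coordinate memory management events
related to device memory, typically GPU memory. See /vm/hmm.rst.
* p2pdma: Create `struct page` objects to allow peer devices in a PCI/E topology
to coordinate direct DMA operations between themselves, i.e. bypassing host
memory.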
:Original: Documentation/vm/mmu_notifier.rst
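:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
When do you need to notify inside the page table lock?
======================================================
When clearing a pte/pmd we are given a choice to notify the event under the page
table lock (the notify version of \*_clear_flush calls
mmu_notifier_invalidate_range). That notification is not necessary in all cases.
For secondary TLBs (non-CPU TLBs) such as an IOMMU TLB or a device TLB (when the
device uses something like ATS/PASID to get the IOMMU to walk the CPU page table
to access a process virtual address space), there are only two cases in which
you need to notify those secondary TLBs while holding the page table lock when
clearing a pte/pmd:
A) The backing page address is freed before
mmu_notifier_invalidate_range_end().
B) A page table entry is updated to point to a new page (COW, write fault on
the zero page, __replace_page(), ...).
Case A is obvious: you do not want to risk the device writing to a page that
might now be used by some completely different task.
Case B is more subtle. For correctness it requires the following sequence to
happen:
- take the page table lock
- clear the page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
- set the page table entry to point to the new page
If clearing the page table entry is not followed by a notify before setting the
new pte/pmd value, then you can break memory models such as C11 or C++11 for the
device.
Consider the following scenario (the device uses a feature similar to
ATS/PASID).
Take two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE, and
assume they are write-protected for COW (other cases of B apply too).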
::
[Time N] --------------------------------------------------------------------
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA and populate device TLB}
DEV-thread-2 {read addrB and populate device TLB}
[Time N+1] ------------------------------------------------------------------
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+2] ------------------------------------------------------------------
CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}}
CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ------------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {write to addrA which is a write to new page}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+3] ------------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {preempted}
CPU-thread-2 {}
CPU-thread-3 {write to addrB which is a write to new page}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+4] ------------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {}
DEV-thread-2 {}
[Time N+5] ------------------------------------------------------------------
CPU-thread-0 {preempted}
CPU-thread-1 {}
CPU-thread-2 {}
CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
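So here, because at time N+2 the clearing of the page table entry was not paired
with a notification to invalidate the secondary TLB, the device sees the new
value for addrB before it sees the new value for addrA. This breaks total memory
ordering for the device.
When changing a pte to write-protect it, or to point to a new write-protected
page with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range call to mmu_notifier_invalidate_range_end()
outside the page table lock. This is true even if the thread doing the page
table update is preempted right after releasing the page table lock but before
calling mmu_notifier_invalidate_range_end().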
:Original: Documentation/vm/numa.rst
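:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
Started Nov 1999 by <kanoj@sgi.com>
=====================================
What is Non-Uniform Memory Access (NUMA)?
=====================================
This question can be answered from a couple of perspectives: the hardware view
and the Linux software view.
From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0 or more
CPUs, local memory, and/or IO buses. For brevity, and to disambiguate the
hardware view of these physical components/assemblies from the software
abstraction, we call the components/assemblies "cells" in this document.
Each of the "cells" may be viewed as an SMP [symmetric multi-processor] subset
of the system, although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect; for example, a
crossbar or point-to-point link are common types of NUMA system interconnects.
Both of these types of interconnects can be aggregated to create NUMA platforms
with cells at multiple distances from other cells.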
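For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA, or ccNUMA, systems. With ccNUMA systems, all memory is visible to
and accessible from any CPU attached to any cell, and cache coherency is handled
in hardware by the processor caches and/or the system interconnect.
Memory access time and effective memory bandwidth vary depending on how far away
the cell containing the CPU or IO bus making the memory access is from the cell
containing the target memory. For example, access to memory by CPUs attached to
the same cell will experience faster access times and higher bandwidths than
accesses to memory on other, remote cells. NUMA platforms can have cells at
multiple remote distances from any given cell.
Platform vendors do not build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory (memory on the same cell, if any) or to
the closest cell with memory.
This leads to the Linux software view of a NUMA system:
Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells of the
hardware platform, abstracting away some of the details for some architectures.
As with physical cells, software nodes may contain 0 or more CPUs, memory,
and/or IO buses. And, again, memory accesses to memory on "closer" nodes (nodes
that map to closer cells) will generally experience faster access times and
higher effective bandwidth than accesses to more remote cells.
For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on these
architectures, one cannot assume that all CPUs that Linux associates with a
given node will see the same local memory access times and bandwidth.
In addition, for some architectures, again x86 is an example, Linux supports the
emulation of additional nodes. For NUMA emulation, Linux will carve up the
existing nodes, or the system memory for non-NUMA platforms, into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]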
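For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics, and locks to mediate access. In addition, Linux constructs for each
memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE] an
ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation, when a
zone has no available memory to satisfy a request, is called "overflow" or
"fallback".
Because some nodes contain multiple zones containing different types of memory,
Linux must decide whether to order the zonelists such that allocations fall back
to the same zone type on a different node, or to a different zone type on the
same node. This is an important consideration because some zones, such as DMA or
DMA32, represent relatively scarce resources. Linux chooses a default
node-ordered zonelist. This means it tries to fall back to other zones from the
same node before using remote nodes, which are ordered by NUMA distance.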
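The node-ordered fallback described above can be modelled in a few lines. This
is a simplified sketch, not the kernel's zonelist builder: the distance table,
the per-node zone lists, and the function name are all hypothetical inputs
chosen for illustration.

```python
def node_ordered_zonelist(local_node, distance, zones):
    """Build a fallback list of (node, zone) pairs for allocations on local_node.

    distance[a][b] is the NUMA distance from node a to node b (smallest for
    a == b); zones[n] lists node n's zones from most to least preferred.
    All zones of a nearer node come before any zone of a farther node.
    """
    nodes = sorted(zones, key=lambda n: (distance[local_node][n], n))
    return [(n, z) for n in nodes for z in zones[n]]


# Hypothetical two-node machine: node 0 has three zones, node 1 has one.
distance = {0: {0: 10, 1: 20}, 1: {0: 20, 1: 10}}
zones = {0: ["NORMAL", "DMA32", "DMA"], 1: ["NORMAL"]}
```

For node 0 this yields its own NORMAL, DMA32, and DMA zones first, and only then
node 1's NORMAL zone, which is the "same node, other zones first" behaviour the
paragraph describes.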
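By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically, Linux
will attempt to allocate from the first node in the appropriate zonelist for the
node where the request originates. This is called "local allocation". If the
"local" node cannot satisfy the request, the kernel will examine other nodes'
zones in the selected zonelist, looking for the first zone in the list that can
satisfy the request.
Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect, as
long as the task on whose behalf the kernel allocated some memory does not later
migrate away from that memory. The Linux scheduler is aware of the NUMA topology
of the platform, embodied in the "scheduling domains" data structures
[see Documentation/scheduler/sched-domains.rst], and the scheduler attempts to
minimize task migration to distant scheduling domains. However, the scheduler
does not take a task's NUMA footprint into account directly. Thus, under
sufficient imbalance, tasks can migrate between nodes, remote from their initial
node and kernel data structures.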
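System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].
System administrators can restrict the CPUs and nodes' memories that a
non-privileged user can specify in the scheduling or NUMA commands and functions
using control groups and cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]
On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node" (the node of the first zone in the CPU's node's
zonelist) will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists. So,
by default, local allocations will succeed with the kernel supplying the closest
available memory. This is a consequence of the same mechanism that allows such
allocations to fall back to other nearby nodes when a node that does contain
memory overflows.
Some kernel allocations do not want, or cannot tolerate, this allocation
fallback behavior. Rather, they want to be sure they get memory from the
specified node or get notified that the node has no free memory. This is usually
the case when a subsystem allocates per-CPU memory resources, for example.
A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions, and then request memory from only the
node id returned. When such an allocation fails, the requesting subsystem may
revert to its own fallback path. The slab kernel memory allocator is an example
of this. Alternatively, the subsystem may choose to disable, or not to enable,
itself on allocation failure. The kernel profiling subsystem is an example of
this.
If the architecture supports memoryless nodes (does not hide them), then CPUs
attached to memoryless nodes would always incur the fallback path overhead, or
some subsystems would fail to initialize if they attempted to allocate memory
exclusively from a node without memory. To support such architectures
transparently, kernel subsystems can use the numa_mem_id() or cpu_to_mem()
function to locate the "local memory node" for the calling or specified CPU.
Again, this is the same node from which default, local page allocations will be
attempted.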
:Original: Documentation/vm/overcommit-accounting.rst
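:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
=====================
Overcommit Accounting
=====================
The Linux kernel supports the following overcommit handling modes
0
Heuristic overcommit handling. Obvious overcommits of address space are refused.
Used for a typical system. It ensures that a seriously wild allocation fails
while allowing overcommit to reduce swap usage. root is allowed to allocate
slightly more memory in this mode. This is the default.
1
Always overcommit. Appropriate for some scientific applications. The classic
example is code using sparse arrays and just relying on the virtual memory
consisting almost entirely of zero pages.
2
Don't overcommit. The total address space commit for the system is not permitted
to exceed swap plus a configurable amount (default is 50%) of physical RAM.
Depending on the amount you use, in most situations this means a process will
not be killed while accessing pages but will receive errors on memory allocation
as appropriate.
Useful for applications that want to guarantee their memory allocations will be
available in the future without having to initialize every page.
The overcommit policy is set via the sysctl `vm.overcommit_memory`.
The overcommit amount can be set via `vm.overcommit_ratio` (percentage) or
`vm.overcommit_kbytes` (absolute value). These only have an effect when
`vm.overcommit_memory` is set to 2.
The current overcommit limit and amount committed are viewable in
``/proc/meminfo`` as CommitLimit and Committed_AS respectively.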
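The mode 2 limit described above ("swap plus a configurable amount of physical
RAM") can be sketched as a small calculation. This is a simplified model: it
ignores details such as huge page reservations, and the function name and the
example sizes are made up for illustration.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, overcommit_kbytes=0):
    """Approximate CommitLimit in KiB for vm.overcommit_memory = 2.

    If overcommit_kbytes is non-zero it is used as the absolute RAM share;
    otherwise overcommit_ratio percent of RAM is used. Swap is always added.
    """
    if overcommit_kbytes:
        allowed = overcommit_kbytes
    else:
        allowed = ram_kb * overcommit_ratio // 100
    return allowed + swap_kb
```

For a hypothetical machine with 8,000,000 KiB of RAM and 2,000,000 KiB of swap,
the default ratio gives a limit of 6,000,000 KiB; the value the kernel actually
reports can be compared against CommitLimit in ``/proc/meminfo``.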
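Gotchas
=======
The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you **MUST** mmap your stack for the
largest size you think you will need. For typical stack usage this does not
matter much, but it is a corner case if you really, really care.
In mode 2 the MAP_NORESERVE flag is ignored.
How It Works
============
The overcommit is based on the following rules
For a file backed map
| SHARED or READ-only - 0 cost (the file is the map, not swap)
| PRIVATE WRITABLE - size of mapping per instance
For an anonymous or ``/dev/zero`` map
| SHARED - size of mapping
| PRIVATE READ-only - 0 cost (but of little use)
| PRIVATE WRITABLE - size of mapping per instance
Additional accounting
| Pages made writable copies by mmap
| shmfs memory drawn from the same pool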
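The per-mapping rules above reduce to a small decision function. The sketch
below only encodes the table as written; it is not the kernel's accounting code,
and the function and parameter names are hypothetical.

```python
def commit_charge(size, file_backed, shared, writable):
    """Bytes charged against the commit limit for one mapping instance."""
    if file_backed:
        # SHARED or READ-only file maps cost nothing: the file is the backing.
        return size if (not shared and writable) else 0
    if shared:
        # Shared anonymous or /dev/zero maps are charged their full size.
        return size
    # Private anonymous maps are charged only when writable.
    return size if writable else 0
```

For example, a private writable mapping of a 1 MiB file is charged the full
1 MiB per instance, while a shared mapping of the same file is charged nothing.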
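Status
======
* We account mmap memory mappings
* We account mprotect changes in commit
* We account mremap changes in size
* We account brk
* We account munmap
* We report the commit status in /proc
* Account and check on fork
* Review stack handling/building on exec
* SHMfs accounting
* Implement actual limit enforcement
To Do
=====
* Account ptrace pages (this is hard)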
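:Original: Documentation/vm/page_frags.rst
:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
==============
Page fragments
==============
A page fragment is an arbitrary-length, arbitrary-offset area of memory which
resides within a 0 or higher order compound page. Multiple fragments within that
page are individually refcounted in the page's reference counter.
The page_frag functions, page_frag_alloc and page_frag_free, provide a simple
allocation framework for page fragments. This is used by the network stack and
network device drivers to provide a backing region of memory for use as either
an sk_buff->head, or in the "frags" portion of skb_shared_info.
In order to use the page fragment API, a backing page fragment cache is needed.
This provides a central point for the fragment allocation and tracks, and allows
multiple calls to make use of a cached page. The advantage of doing this is that
multiple calls to get_page can be avoided, which can be expensive at allocation
time. However, due to the nature of this caching, any calls to the cache must be
protected either by a per-cpu limitation, or by a per-cpu limitation combined
with forcing interrupts to be disabled when executing the fragment allocation.
The network stack uses two separate caches per CPU to handle fragment
allocation. The netdev_alloc_cache is used by callers of netdev_alloc_frag and
__netdev_alloc_skb. The napi_alloc_cache is used by callers of
__napi_alloc_frag and __napi_alloc_skb. The main difference between these two
sets of calls is the context in which they may be called. The "netdev" prefixed
functions are usable in any context, as these functions disable interrupts,
while the "napi" prefixed functions are only usable within the softirq context.
Many network device drivers use a similar methodology for allocating page
fragments, but the page fragments are cached at the ring or descriptor level. To
enable these cases it is necessary to provide a generic way of tearing down a
page cache. For this reason __page_frag_cache_drain was implemented. It allows
freeing multiple references from a single page via a single call. The advantage
of doing this is that it allows cleaning up the multiple references that were
added to a page in order to avoid calling get_page per allocation.
Alexander Duyck, Nov 29, 2016.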
:Original: Documentation/vm/page_owner.rst
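:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
==================================================
page owner: Tracking about who allocated each page
==================================================
Introduction
============
page owner is for tracking who allocated each page. It can be used to debug
memory leaks or to find a memory hogger. When an allocation happens, information
about the allocation, such as the call stack and the order of the pages, is
stored into dedicated storage for each page. When we need to know the status of
all pages, we can get and analyze this information.
Although we already have a tracepoint for tracing page allocation/free, using it
to analyze who allocated each page is rather complex. We need to enlarge the
trace buffer to prevent it from wrapping before the userspace program is
launched. And the launched program continually dumps out the trace buffer for
later analysis, which changes system behaviour more than just keeping the data
in memory would, so it is bad for debugging.
page owner can also be used for various purposes. For example, accurate
fragmentation statistics can be obtained through the gfp flag information of
each page. This is already implemented and activated if page owner is enabled.
Other usages are more than welcome.
page owner is disabled by default. So, if you would like to use it, you need to
add "page_owner=on" to your boot cmdline. If the kernel is built with page owner
and page owner is disabled at runtime because the boot option was not enabled,
the runtime overhead is marginal. If disabled at runtime, it does not require
memory to store owner information, so there is no runtime memory overhead. Page
owner inserts just two unlikely branches into the page allocator hotpath, and if
not enabled, allocation is done as in a kernel without page owner. These two
unlikely branches should not affect allocation performance, especially if the
static keys jump label patching functionality is available. The following is the
kernel's code size change due to this facility.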
- Without page owner::
text data bss dec hex filename
48392 2333 644 51369 c8a9 mm/page_alloc.o
- With page owner::
text data bss dec hex filename
48800 2445 644 51889 cab1 mm/page_alloc.o
6662 108 29 6799 1a8f mm/page_owner.o
1025 8 8 1041 411 mm/page_ext.o
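Although about 8 KB of code is added in total, page_alloc.o grows by 520 bytes
and less than half of that is in the hotpath. Building the kernel with page
owner and turning it on when needed would be a great option for debugging kernel
memory problems.
There is one notice that is caused by an implementation detail. page owner
stores information into the memory from the struct page extension. On sparse
memory systems this memory is initialized some time after the page allocator
starts, so until that initialization many pages can be allocated and they would
have no owner information. To fix this, these early allocated pages are
investigated and marked as allocated in the initialization phase. Although this
does not mean that they have the right owner information, at least we can tell
more accurately whether a page is allocated or not. On a 2GB memory x86-64 VM
box, 13343 early allocated pages are caught and marked, although they are mostly
allocated by the struct page extension feature. After that, no page is left in
an un-tracked state.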
Usage
=====
1) Build the user-space helper::
cd tools/vm
make page_owner_sort
2) Enable page owner: add "page_owner=on" to the boot cmdline.
3) Do the job that you want to debug.
4) Analyze the information from page owner::
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt
The general output of ``page_owner_full.txt`` is as follows::
Page allocated via order XXX, ...
PFN XXX ...
// Detailed stack
Page allocated via order XXX, ...
PFN XXX ...
// Detailed stack
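The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows in
buf, uses a regexp to extract the page order value, counts the times and pages
of buf, and finally sorts them according to the parameter(s).
See the result about who allocated each page in ``sorted_page_owner.txt``.
General output::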
XXX times, XXX pages:
Page allocated via order XXX, ...
// Detailed stack
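The processing that ``page_owner_sort`` performs, as described above, can be
approximated in a few lines. This is an illustrative sketch and not the actual
tool: it assumes records are separated by blank lines, and the sample dump below
is invented (real records carry more fields and a full stack trace).

```python
import re
from collections import Counter

ORDER_RE = re.compile(r"order (\d+)")


def summarize(dump_text):
    """Group identical records (ignoring PFN rows) and count times and pages.

    Returns a list of (times, pages, record_text), sorted by times descending.
    """
    times, pages = Counter(), Counter()
    for block in dump_text.strip().split("\n\n"):
        rows = [r for r in block.splitlines() if not r.startswith("PFN")]
        if not rows:
            continue
        key = "\n".join(rows)
        match = ORDER_RE.search(rows[0])
        order = int(match.group(1)) if match else 0
        times[key] += 1
        pages[key] += 1 << order  # an order-n allocation covers 2**n pages
    return sorted(((times[k], pages[k], k) for k in times),
                  key=lambda row: row[0], reverse=True)


# Invented sample input, for illustration only.
SAMPLE = """Page allocated via order 0, sample
PFN 100
 stack_a

Page allocated via order 2, sample
PFN 200
 stack_b

Page allocated via order 0, sample
PFN 101
 stack_a
"""
```

Here the two order-0 records share a stack and collapse into one entry counted
twice, while the single order-2 record accounts for four pages.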
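By default, ``page_owner_sort`` sorts according to the times of buf. If you want
to sort by the page counts of buf, use the -m parameter. The detailed parameters
are:
fundamental function:
Sort:
-a Sort by memory allocation time.
-m Sort by total memory.
-p Sort by pid.
-P Sort by tgid.
-r Sort by memory release time.
-s Sort by stack trace.
-t Sort by times (default).
additional function:
Cull:
-c Cull by comparing stacktrace instead of total block.
Filter:
-f Filter out the information of blocks whose memory has been released.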
.. SPDX-License-Identifier: GPL-2.0
:Original: Documentation/vm/page_table_check.rst
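:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
================
Page Table Check
================
Introduction
============
Page table check allows hardening the kernel by ensuring that some types of
memory corruption are prevented.
Page table check performs extra verifications at the time when new pages become
accessible from userspace by getting their page table entries (PTEs, PMDs, etc.)
added into the table.
In case of detected corruption, the kernel is crashed. There is a small
performance and memory overhead associated with page table check. Therefore, it
is disabled by default, but can optionally be enabled on systems where the extra
hardening outweighs the performance cost. Also, because page table check is
synchronous, it can help with debugging double-map memory corruption issues by
crashing the kernel at the time the wrong mapping occurs, instead of later, which
is often the case with memory corruption bugs.
Double mapping detection logic
==============================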
+-------------------+-------------------+-------------------+------------------+
| Current Mapping | New mapping | Permissions | Rule |
+===================+===================+===================+==================+
| Anonymous | Anonymous | Read | Allow |
+-------------------+-------------------+-------------------+------------------+
| Anonymous | Anonymous | Read / Write | Prohibit |
+-------------------+-------------------+-------------------+------------------+
| Anonymous | Named | Any | Prohibit |
+-------------------+-------------------+-------------------+------------------+
| Named | Anonymous | Any | Prohibit |
+-------------------+-------------------+-------------------+------------------+
| Named | Named | Any | Allow |
+-------------------+-------------------+-------------------+------------------+
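The rule table above can be expressed as a small predicate. The sketch below is
only a restatement of the table for illustration; the function name and the
string labels are hypothetical, and the kernel's implementation works on
per-page counters rather than on labels like these.

```python
def mapping_allowed(current, new, writable):
    """Apply the double-mapping rules.

    current, new: "anonymous" or "named"; writable: True for Read / Write.
    """
    if current == "named" and new == "named":
        return True            # Named + Named: allowed with any permissions
    if current == "anonymous" and new == "anonymous":
        return not writable    # Anonymous + Anonymous: read-only sharing only
    return False               # mixing anonymous and named is prohibited
```

Any combination that returns False corresponds to a "Prohibit" row, which is the
situation in which page table check crashes the kernel.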
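Enabling Page Table Check
=========================
Build the kernel with:
- PAGE_TABLE_CHECK=y
Note, it can only be enabled on platforms where ARCH_SUPPORTS_PAGE_TABLE_CHECK
is available.
- Boot with the "page_table_check=on" kernel parameter.
Optionally, build the kernel with PAGE_TABLE_CHECK_ENFORCED in order to have page
table check support without an extra kernel parameter.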
:Original: Documentation/vm/remap_file_pages.rst
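:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
==============================
remap_file_pages() system call
==============================
The remap_file_pages() system call is used to create a nonlinear mapping, that
is, a mapping in which the pages of the file are mapped into a nonsequential
order in memory. The advantage of using remap_file_pages() over using repeated
calls to mmap(2) is that the former approach does not require the kernel to
create additional VMA (Virtual Memory Area) data structures.
Supporting nonlinear mappings requires a significant amount of non-trivial code
in the kernel virtual memory subsystem, including hot paths. Also, to get
nonlinear mappings to work, the kernel needs a way to distinguish normal page
table entries from entries with a file offset (pte_file). The kernel reserves a
flag in the PTE for this purpose. PTE flags are a scarce resource, especially on
some CPU architectures. It would be nice to free up the flag for other usage.
Fortunately, there are not many users of remap_file_pages() in the wild. It is
only known that one enterprise RDBMS implementation uses the syscall on 32-bit
systems to map files bigger than can linearly fit into the 32-bit virtual
address space. This use-case is not critical anymore since 64-bit systems are
widely available.
The syscall is deprecated and is now replaced with an emulation. The emulation
creates new VMAs instead of nonlinear mappings. It works slower for the rare
users of remap_file_pages(), but the ABI is preserved.
One side effect of the emulation (apart from performance) is that the user can
hit the vm.max_map_count limit more easily due to the additional VMAs. See the
comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.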
:Original: Documentation/vm/split_page_table_lock.rst
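:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
=====================
Split page table lock
=====================
Originally, the mm->page_table_lock spinlock protected all page tables of the
mm_struct. But this approach leads to poor page fault scalability of
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.
With the split page table lock we have a separate per-table lock to serialize
access to the table. At the moment we use split locks for PTE and PMD tables.
Access to higher level tables is protected by mm->page_table_lock.
There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_lock()
maps the pte and takes the PTE table lock, returns a pointer to the taken lock;
- pte_unmap_unlock()
unlocks and unmaps the PTE table;
- pte_alloc_map_lock()
allocates the PTE table if needed and takes the lock, returns a pointer to the
taken lock or NULL if the allocation failed;
- pte_lockptr()
returns a pointer to the PTE table lock;
- pmd_lock()
takes the PMD table lock, returns a pointer to the taken lock;
- pmd_lockptr()
returns a pointer to the PMD table lock;
Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS. If the
split lock is disabled, all tables are guarded by mm->page_table_lock.
Split page table lock for PMD tables is enabled if it is enabled for PTE tables
and the architecture supports it (see below).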
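The two enabling conditions above can be written down as simple predicates. This
is an illustrative model of the build-time decision, with hypothetical function
names; in the kernel these are preprocessor conditions, not runtime functions.

```python
def use_split_pte_ptlocks(nr_cpus, split_ptlock_cpus=4):
    """PTE split locks are on when CONFIG_SPLIT_PTLOCK_CPUS <= NR_CPUS."""
    return split_ptlock_cpus <= nr_cpus


def use_split_pmd_ptlocks(nr_cpus, arch_enable_split_pmd_ptlock,
                          split_ptlock_cpus=4):
    """PMD split locks need PTE split locks plus architecture support."""
    return (use_split_pte_ptlocks(nr_cpus, split_ptlock_cpus)
            and arch_enable_split_pmd_ptlock)
```

On a hypothetical configuration with NR_CPUS = 2 both predicates are false, so
every table falls back to mm->page_table_lock.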
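Hugetlb and split page table lock
=================================
Hugetlb can support several page sizes. We use the split lock only for the PMD
level, but not for PUD.
Hugetlb-specific helpers:
- huge_pte_lock()
takes the pmd split lock for a PMD_SIZE page, mm->page_table_lock otherwise;
- huge_pte_lockptr()
returns a pointer to the table lock;
Support of split page table lock by an architecture
===================================================
There is no need to specially enable the PTE split page table lock: everything
required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
must be called on PTE table allocation / freeing.
Make sure the architecture does not use the slab allocator for page table
allocation: slab uses page->slab_cache for its pages. This field shares storage
with page->ptl.
The PMD split lock only makes sense if you have more than two page table levels.
Enabling the PMD split lock requires a pgtable_pmd_page_ctor() call on PMD table
allocation and a pgtable_pmd_page_dtor() call on freeing.
Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().
With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.
NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail, and this
must be handled properly.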
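page->ptl
=========
page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).
To avoid increasing the size of struct page and to get the best performance, we
use a trick:
- if spinlock_t fits into a long, we use page->ptl as the spinlock itself, so we
can avoid indirect access and save a cache line.
- if the size of spinlock_t is bigger than the size of long, we use page->ptl as
a pointer to spinlock_t and allocate it dynamically. This allows using the split
lock with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs one more cache
line for the indirect access.
The spinlock_t is allocated in pgtable_pte_page_ctor() for PTE tables and in
pgtable_pmd_page_ctor() for PMD tables.
Please never access page->ptl directly; use the appropriate helper.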
:Original: Documentation/vm/z3fold.rst
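:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
======
z3fold
======
z3fold is a special purpose allocator for storing compressed pages. It is
designed to store up to three compressed pages per physical page. It is a zbud
derivative which allows for a higher compression ratio while keeping the
simplicity and determinism of its predecessor.
The main differences between z3fold and zbud are:
* unlike zbud, z3fold allows for up to PAGE_SIZE allocations
* z3fold can hold up to 3 compressed pages in its page
* z3fold does not export any API itself and is thus intended to be used via the
zpool API.
To keep the determinism and simplicity, z3fold, just like zbud, always stores an
integral number of compressed pages per page, but it can store up to 3 pages
unlike zbud which can store at most 2. Therefore the compression ratio goes to
around 2.7x while zbud's is around 1.7x.
Unlike zbud (but like zsmalloc for that matter) z3fold_alloc() does not return a
dereferenceable pointer. Instead, it returns an unsigned long handle which
encodes the actual location of the allocated object.
Keeping the effective compression ratio close to zsmalloc's, z3fold does not
depend on MMU being enabled and provides more predictable reclaim behavior,
which makes it a better fit for small and response-critical systems.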
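:Original: Documentation/vm/zsmalloc.rst
:Translator:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:Proofreader:
========
zsmalloc
========
This allocator is designed for use with zram. Thus, the allocator is supposed to
work well under low memory conditions. In particular, it never attempts
higher-order page allocation, which is very likely to fail under memory
pressure. On the other hand, if we just use single (0-order) pages, it would
suffer from very high fragmentation: any object of size PAGE_SIZE/2 or larger
would occupy an entire page. This was one of the major issues with its
predecessor (xvmalloc).
To overcome these issues, zsmalloc allocates a bunch of 0-order pages and links
them together using various 'struct page' fields. These linked pages act as a
single higher-order page, i.e. an object can span 0-order page boundaries. The
code refers to these linked pages as a single entity called a zspage.
For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE since
this satisfies the requirements of all its current users (in the worst case, a
page is incompressible and is thus stored "as-is", i.e. in uncompressed form).
For allocation requests larger than this size, failure is returned (see
zs_malloc).
Additionally, zs_malloc() does not return a dereferenceable pointer. Instead, it
returns an opaque handle (unsigned long) which encodes the actual location of
the allocated object. The reason for this indirection is that zsmalloc does not
keep zspages permanently mapped, since that would cause issues on 32-bit systems
where the VA region for kernel space mappings is very small. So, before using
the allocated memory, the object has to be mapped using zs_map_object() to get a
usable pointer, and subsequently unmapped using zs_unmap_object().
stat
====
With CONFIG_ZSMALLOC_STAT, we can see zsmalloc internal information via
``/sys/kernel/debug/zsmalloc/<user name>``. Here is a sample of stat output::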
# cat /sys/kernel/debug/zsmalloc/zram0/classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage
...
...
9 176 0 1 186 129 8 4
10 192 1 0 2880 2872 135 3
11 208 0 1 819 795 42 2
12 224 0 1 219 159 12 4
...
...
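class
index
size
object size the zspage stores
almost_empty
the number of ZS_ALMOST_EMPTY zspages (see below)
almost_full
the number of ZS_ALMOST_FULL zspages (see below)
obj_allocated
the number of objects allocated
obj_used
the number of objects allocated to the user
pages_used
the number of pages allocated for the class
pages_per_zspage
the number of 0-order pages that make up a zspage
We assign a zspage to the ZS_ALMOST_EMPTY fullness group when n <= N / f, where
* n = number of allocated objects
* N = total number of objects the zspage can store
* f = fullness_threshold_frac (i.e., 4 at the moment)
Similarly, we assign a zspage to:
* ZS_ALMOST_FULL when n > N / f
* ZS_EMPTY when n == 0
* ZS_FULL when n == N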
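The fullness grouping above can be sketched as a small classifier. It follows
the formulas exactly as stated in this text (which may differ in detail from the
kernel source), checks the empty and full cases first, and uses a hypothetical
function name.

```python
def fullness_group(n, N, f=4):
    """Classify a zspage holding n allocated objects out of a capacity of N.

    f plays the role of fullness_threshold_frac. The comparison n <= N / f is
    done as n * f <= N to stay in integer arithmetic.
    """
    if n == 0:
        return "ZS_EMPTY"
    if n == N:
        return "ZS_FULL"
    if n * f <= N:
        return "ZS_ALMOST_EMPTY"
    return "ZS_ALMOST_FULL"
```

For a zspage that can store 8 objects, holding 1 or 2 objects puts it in the
almost-empty group, and holding 3 to 7 objects puts it in the almost-full group.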
...@@ -13,7 +13,7 @@ Following tables describe the expected semantics which can also be tested during ...@@ -13,7 +13,7 @@ Following tables describe the expected semantics which can also be tested during
boot via CONFIG_DEBUG_VM_PGTABLE option. All future changes in here or the debug boot via CONFIG_DEBUG_VM_PGTABLE option. All future changes in here or the debug
test need to be in sync. test need to be in sync.
======================
PTE Page Table Helpers PTE Page Table Helpers
====================== ======================
...@@ -79,7 +79,7 @@ PTE Page Table Helpers ...@@ -79,7 +79,7 @@ PTE Page Table Helpers
| ptep_set_access_flags | Converts into a more permissive PTE | | ptep_set_access_flags | Converts into a more permissive PTE |
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
======================
PMD Page Table Helpers PMD Page Table Helpers
====================== ======================
...@@ -153,7 +153,7 @@ PMD Page Table Helpers ...@@ -153,7 +153,7 @@ PMD Page Table Helpers
| pmdp_set_access_flags | Converts into a more permissive PMD | | pmdp_set_access_flags | Converts into a more permissive PMD |
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
======================
PUD Page Table Helpers PUD Page Table Helpers
====================== ======================
...@@ -209,7 +209,7 @@ PUD Page Table Helpers ...@@ -209,7 +209,7 @@ PUD Page Table Helpers
| pudp_set_access_flags | Converts into a more permissive PUD | | pudp_set_access_flags | Converts into a more permissive PUD |
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
==========================
HugeTLB Page Table Helpers HugeTLB Page Table Helpers
========================== ==========================
...@@ -235,7 +235,7 @@ HugeTLB Page Table Helpers ...@@ -235,7 +235,7 @@ HugeTLB Page Table Helpers
| huge_ptep_set_access_flags | Converts into a more permissive HugeTLB | | huge_ptep_set_access_flags | Converts into a more permissive HugeTLB |
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
========================
SWAP Page Table Helpers SWAP Page Table Helpers
======================== ========================
......
.. SPDX-License-Identifier: GPL-2.0
===========
Boot Memory
===========
...@@ -2,12 +2,39 @@ ...@@ -2,12 +2,39 @@
Linux Memory Management Documentation Linux Memory Management Documentation
===================================== =====================================
This is a collection of documents about the Linux memory management (mm) Memory Management Guide
subsystem internals with different level of details ranging from notes and =======================
mailing list responses for elaborating descriptions of data structures and
algorithms. If you are looking for advice on simply allocating memory, see the This is a guide to understanding the memory management subsystem
:ref:`memory_allocation`. For controlling and tuning guides, see the of Linux. If you are looking for advice on simply allocating memory,
:doc:`admin guide <../admin-guide/mm/index>`. see the :ref:`memory_allocation`. For controlling and tuning guides,
see the :doc:`admin guide <../admin-guide/mm/index>`.
.. toctree::
:maxdepth: 1
physical_memory
page_tables
process_addrs
bootmem
page_allocation
vmalloc
slab
highmem
page_reclaim
swap
page_cache
shmfs
oom
Legacy Documentation
====================
This is a collection of older documents about the Linux memory management
(MM) subsystem internals with different level of details ranging from
notes and mailing list responses for elaborating descriptions of data
structures and algorithms. It should all be integrated nicely into the
above structured documentation, or deleted if it has served its purpose.
.. toctree:: .. toctree::
:maxdepth: 1 :maxdepth: 1
...@@ -18,7 +45,6 @@ algorithms. If you are looking for advice on simply allocating memory, see the ...@@ -18,7 +45,6 @@ algorithms. If you are looking for advice on simply allocating memory, see the
damon/index damon/index
free_page_reporting free_page_reporting
frontswap frontswap
highmem
hmm hmm
hwpoison hwpoison
hugetlbfs_reserv hugetlbfs_reserv
......
.. SPDX-License-Identifier: GPL-2.0
======================
Out Of Memory Handling
======================
.. SPDX-License-Identifier: GPL-2.0
===============
Page Allocation
===============
.. SPDX-License-Identifier: GPL-2.0
==========
Page Cache
==========
.. SPDX-License-Identifier: GPL-2.0
============
Page Reclaim
============
.. SPDX-License-Identifier: GPL-2.0
===========
Page Tables
===========
.. SPDX-License-Identifier: GPL-2.0
===============
Physical Memory
===============
.. SPDX-License-Identifier: GPL-2.0
=================
Process Addresses
=================
.. SPDX-License-Identifier: GPL-2.0
========================
Shared Memory Filesystem
========================
.. SPDX-License-Identifier: GPL-2.0
===============
Slab Allocation
===============
.. SPDX-License-Identifier: GPL-2.0
====
Swap
====
.. SPDX-License-Identifier: GPL-2.0
======================================
Virtually Contiguous Memory Allocation
======================================
...@@ -6,7 +6,8 @@ Supported chips: ...@@ -6,7 +6,8 @@ Supported chips:
* Maxim ds18*20 based temperature sensors. * Maxim ds18*20 based temperature sensors.
* Maxim ds1825 based temperature sensors. * Maxim ds1825 based temperature sensors.
* GXCAS GC20MH01 temperature sensor. * GXCAS GX20MH01 temperature sensor.
* Maxim MAX31850 thermoelement interface.
Author: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Author: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
...@@ -15,7 +16,7 @@ Description ...@@ -15,7 +16,7 @@ Description
----------- -----------
w1_therm provides basic temperature conversion for ds18*20, ds28ea00, GX20MH01 w1_therm provides basic temperature conversion for ds18*20, ds28ea00, GX20MH01
devices. and MAX31850 devices.
Supported family codes: Supported family codes:
...@@ -137,3 +138,7 @@ bits in Config register; R2 bit in Config register enabling 13 and 14 bit ...@@ -137,3 +138,7 @@ bits in Config register; R2 bit in Config register enabling 13 and 14 bit
resolutions. The device is powered up in 14-bit resolution mode. The conversion resolutions. The device is powered up in 14-bit resolution mode. The conversion
times specified in the datasheet are too low and have to be increased. The times specified in the datasheet are too low and have to be increased. The
device supports driver features ``1`` and ``2``. device supports driver features ``1`` and ``2``.
MAX31850 device shares family number 0x3B with DS1825. The device is generally
compatible with DS1825. The higher 4 bits of Config register read all 1,
indicating 15, but the device is always operating in 14-bit resolution mode.
...@@ -32,14 +32,14 @@ Whenever the kernel tries to access an address that is currently not ...@@ -32,14 +32,14 @@ Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the accessible, the CPU generates a page fault exception and calls the
page fault handler:: page fault handler::
void do_page_fault(struct pt_regs *regs, unsigned long error_code) void exc_page_fault(struct pt_regs *regs, unsigned long error_code)
in arch/x86/mm/fault.c. The parameters on the stack are set up by in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/entry/entry_32.S. The parameter the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception. contains a reason code for the exception.
do_page_fault first obtains the unaccessible address from the CPU exc_page_fault() first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred, because the page space of the process, the fault probably occurred, because the page
was not swapped in, write protected or something similar. However, was not swapped in, write protected or something similar. However,
...@@ -57,10 +57,10 @@ Where does fixup point to? ...@@ -57,10 +57,10 @@ Where does fixup point to?
Since we jump to the contents of fixup, fixup obviously points Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros. to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination. the get_user() call in drivers/char/sysrq.c for a detailed examination.
The original code in sysrq.c line 587:: The original code in sysrq.c line 587::
...@@ -281,12 +281,15 @@ vma occurs? ...@@ -281,12 +281,15 @@ vma occurs?
> c017e7a5 <do_con_write+e1> movb (%ebx),%dl > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
#. MMU generates exception #. MMU generates exception
#. CPU calls do_page_fault #. CPU calls exc_page_fault()
#. do page fault calls search_exception_table (regs->eip == c017e7a5); #. exc_page_fault() calls do_user_addr_fault()
#. search_exception_table looks up the address c017e7a5 in the #. do_user_addr_fault() calls kernelmode_fixup_or_oops()
#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5);
#. fixup_exception() calls search_exception_tables()
#. search_exception_tables() looks up the address c017e7a5 in the
exception table (i.e. the contents of the ELF section __ex_table) exception table (i.e. the contents of the ELF section __ex_table)
and returns the address of the associated fault handle code c0199ff5. and returns the address of the associated fault handle code c0199ff5.
#. do_page_fault modifies its own return address to point to the fault #. fixup_exception() modifies its own return address to point to the fault
handle code and returns. handle code and returns.
#. execution continues in the fault handling code. #. execution continues in the fault handling code.
#. a) EAX becomes -EFAULT (== -14) #. a) EAX becomes -EFAULT (== -14)
...@@ -298,9 +301,9 @@ The steps 8a to 8c in a certain way emulate the faulting instruction. ...@@ -298,9 +301,9 @@ The steps 8a to 8c in a certain way emulate the faulting instruction.
That's it, mostly. If you look at our example, you might ask why That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the we set EAX to -EFAULT in the exception handler code. Well, the
get_user macro actually returns a value: 0, if the user access was get_user() macro actually returns a value: 0, if the user access was
successful, -EFAULT on failure. Our original code did not test this successful, -EFAULT on failure. Our original code did not test this
return value, however the inline assembly code in get_user tries to return value, however the inline assembly code in get_user() tries to
return -EFAULT. GCC selected EAX to return this value. return -EFAULT. GCC selected EAX to return this value.
NOTE: NOTE:
......
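The lookup step in the exception-tables walkthrough above (search_exception_tables() mapping the faulting address c017e7a5 to the fixup address c0199ff5) can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the struct ex_entry layout with absolute addresses, the lookup_fixup() helper, and the two neighbouring table entries are hypothetical and are not the kernel's actual definitions; only the c017e7a5/c0199ff5 pair is taken from the text.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified table entry holding absolute addresses, as in the walkthrough.
 * Assumption for illustration only; the real x86 entry layout differs.
 */
struct ex_entry {
    uint32_t insn;   /* address of the instruction that may fault */
    uint32_t fixup;  /* address of the fault handling code */
};

/*
 * Table sorted by insn. Only the middle pair comes from the walkthrough;
 * the other two entries are made-up neighbours.
 */
static const struct ex_entry ex_table[] = {
    { 0xc0100000u, 0xc0199f00u },  /* hypothetical */
    { 0xc017e7a5u, 0xc0199ff5u },  /* from the example */
    { 0xc0180000u, 0xc019a100u },  /* hypothetical */
};

/* Binary search over the sorted table; returns the fixup address or 0. */
static uint32_t lookup_fixup(uint32_t fault_ip)
{
    size_t lo = 0;
    size_t hi = sizeof(ex_table) / sizeof(ex_table[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ex_table[mid].insn == fault_ip)
            return ex_table[mid].fixup;
        if (ex_table[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;  /* no entry: the fault cannot be fixed up */
}
```

A fault address with no matching entry yields 0 here, which corresponds to the case where no fixup exists and the fault is a genuine kernel bug rather than a bad user pointer.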
...@@ -22,7 +22,7 @@ x86-specific Documentation ...@@ -22,7 +22,7 @@ x86-specific Documentation
mtrr mtrr
pat pat
intel-hfi intel-hfi
intel-iommu iommu
intel_txt intel_txt
amd-memory-encryption amd-memory-encryption
amd_hsmp amd_hsmp
......
=================== =================
Linux IOMMU Support x86 IOMMU Support
=================== =================
The architecture spec can be obtained from the below location. The architecture specs can be obtained from the below locations.
http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf - Intel: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
- AMD: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
This guide gives a quick cheat sheet for some basic understanding. This guide gives a quick cheat sheet for some basic understanding.
Some Keywords
- DMAR - DMA remapping
- DRHD - DMA Remapping Hardware Unit Definition
- RMRR - Reserved memory Region Reporting Structure
- ZLR - Zero length reads from PCI devices
- IOVA - IO Virtual address.
Basic stuff Basic stuff
----------- -----------
ACPI enumerates and lists the different DMA engines in the platform, and ACPI enumerates and lists the different IOMMUs on the platform, and
device scope relationships between PCI devices and which DMA engine controls device scope relationships between devices and which IOMMU controls
them. them.
What is RMRR? Some ACPI Keywords:
-------------
- DMAR - Intel DMA Remapping table
- DRHD - Intel DMA Remapping Hardware Unit Definition
- RMRR - Intel Reserved Memory Region Reporting Structure
- IVRS - AMD I/O Virtualization Reporting Structure
- IVDB - AMD I/O Virtualization Definition Block
- IVHD - AMD I/O Virtualization Hardware Definition
What is Intel RMRR?
^^^^^^^^^^^^^^^^^^^
There are some devices the BIOS controls, for e.g USB devices to perform There are some devices the BIOS controls, for e.g USB devices to perform
PS2 emulation. The regions of memory used for these devices are marked PS2 emulation. The regions of memory used for these devices are marked
...@@ -33,57 +35,79 @@ regions will fail. Hence BIOS uses RMRR to specify these regions along with ...@@ -33,57 +35,79 @@ regions will fail. Hence BIOS uses RMRR to specify these regions along with
devices that need to access these regions. OS is expected to setup devices that need to access these regions. OS is expected to setup
unity mappings for these regions for these devices to access these regions. unity mappings for these regions for these devices to access these regions.
How is IOVA generated? What is AMD IVRS?
---------------------- ^^^^^^^^^^^^^^^^^
Well behaved drivers call pci_map_*() calls before sending command to device The architecture defines an ACPI-compatible data structure called an I/O
that needs to perform DMA. Once DMA is completed and mapping is no longer Virtualization Reporting Structure (IVRS) that is used to convey information
required, device performs a pci_unmap_*() calls to unmap the region. related to I/O virtualization to system software. The IVRS describes the
configuration and capabilities of the IOMMUs contained in the platform as
well as information about the devices that each IOMMU virtualizes.
The IVRS provides information about the following:
The Intel IOMMU driver allocates a virtual address per domain. Each PCIE - IOMMUs present in the platform including their capabilities and proper configuration
device has its own domain (hence protection). Devices under p2p bridges - System I/O topology relevant to each IOMMU
share the virtual address with all devices under the p2p bridge due to - Peripheral devices that cannot be otherwise enumerated
transaction id aliasing for p2p bridges. - Memory regions used by SMI/SMM, platform firmware, and platform hardware. These are generally exclusion ranges to be configured by system software.
IOVA generation is pretty generic. We used the same technique as vmalloc() How is an I/O Virtual Address (IOVA) generated?
but these are not global address spaces, but separate for each domain. -----------------------------------------------
Different DMA engines may support different number of domains.
We also allocate guard pages with each mapping, so we can attempt to catch Well behaved drivers call dma_map_*() calls before sending command to device
any overflow that might happen. that needs to perform DMA. Once DMA is completed and mapping is no longer
required, driver performs dma_unmap_*() calls to unmap the region.
Intel Specific Notes
--------------------
Graphics Problems? Graphics Problems?
------------------ ^^^^^^^^^^^^^^^^^^
If you encounter issues with graphics devices, you can try adding If you encounter issues with graphics devices, you can try adding
option intel_iommu=igfx_off to turn off the integrated graphics engine. option intel_iommu=igfx_off to turn off the integrated graphics engine.
If this fixes anything, please ensure you file a bug reporting the problem. If this fixes anything, please ensure you file a bug reporting the problem.
Some exceptions to IOVA Some exceptions to IOVA
----------------------- ^^^^^^^^^^^^^^^^^^^^^^^
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses. address from PCI MMIO ranges so they are not allocated for IOVA addresses.
AMD Specific Notes
------------------
Graphics Problems?
^^^^^^^^^^^^^^^^^^
If you encounter issues with integrated graphics devices, you can try adding
option iommu=pt to the kernel command line use a 1:1 mapping for the IOMMU. If
this fixes anything, please ensure you file a bug reporting the problem.
Fault reporting Fault reporting
--------------- ---------------
When errors are reported, the DMA engine signals via an interrupt. The fault When errors are reported, the IOMMU signals via an interrupt. The fault
reason and device that caused it with fault reason is printed on console. reason and device that caused it is printed on the console.
See below for sample.
Kernel Log Samples
------------------
Boot Message Sample Intel Boot Messages
------------------- ^^^^^^^^^^^^^^^^^^^
Something like this gets printed indicating presence of DMAR tables Something like this gets printed indicating presence of DMAR tables
in ACPI. in ACPI:
ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0 ::
ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
When DMAR is being processed and initialized by ACPI, prints DMAR locations When DMAR is being processed and initialized by ACPI, prints DMAR locations
and any RMRR's processed:: and any RMRR's processed:
::
ACPI DMAR:Host address width 36 ACPI DMAR:Host address width 36
ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
...@@ -92,13 +116,14 @@ and any RMRR's processed:: ...@@ -92,13 +116,14 @@ and any RMRR's processed::
ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
When DMAR is enabled for use, you will notice.. When DMAR is enabled for use, you will notice:
PCI-DMA: Using DMAR IOMMU ::
-------------------------
Fault reporting PCI-DMA: Using DMAR IOMMU
^^^^^^^^^^^^^^^
Intel Fault reporting
^^^^^^^^^^^^^^^^^^^^^
:: ::
...@@ -107,9 +132,20 @@ Fault reporting ...@@ -107,9 +132,20 @@ Fault reporting
DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
DMAR:[fault reason 05] PTE Write access is not set DMAR:[fault reason 05] PTE Write access is not set
TBD AMD Boot Messages
---- ^^^^^^^^^^^^^^^^^
Something like this gets printed indicating presence of the IOMMU:
::
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
AMD Fault reporting
^^^^^^^^^^^^^^^^^^^
::
- For compatibility testing, could use unity map domain for all devices, just AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000]
provide a 1-1 for all useful memory under a single domain for all devices. AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000]
- API for paravirt ops for abstracting functionality for VMM folks.
...@@ -4632,6 +4632,7 @@ F: Documentation/dev-tools/checkpatch.rst ...@@ -4632,6 +4632,7 @@ F: Documentation/dev-tools/checkpatch.rst
CHINESE DOCUMENTATION CHINESE DOCUMENTATION
M: Alex Shi <alexs@kernel.org> M: Alex Shi <alexs@kernel.org>
M: Yanteng Si <siyanteng@loongson.cn>
S: Maintained S: Maintained
F: Documentation/translations/zh_CN/ F: Documentation/translations/zh_CN/
...@@ -6006,6 +6007,12 @@ L: linux-doc@vger.kernel.org ...@@ -6006,6 +6007,12 @@ L: linux-doc@vger.kernel.org
S: Maintained S: Maintained
F: Documentation/translations/it_IT F: Documentation/translations/it_IT
DOCUMENTATION/JAPANESE
R: Akira Yokosawa <akiyks@gmail.com>
L: linux-doc@vger.kernel.org
S: Maintained
F: Documentation/translations/ja_JP
DONGWOON DW9714 LENS VOICE COIL DRIVER DONGWOON DW9714 LENS VOICE COIL DRIVER
M: Sakari Ailus <sakari.ailus@linux.intel.com> M: Sakari Ailus <sakari.ailus@linux.intel.com>
L: linux-media@vger.kernel.org L: linux-media@vger.kernel.org
......