docs: move x86 documentation into Documentation/arch/

Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>
author: Jonathan Corbet <corbet@lwn.net> 2023-03-14 17:06:44 -0600
committer: Jonathan Corbet <corbet@lwn.net> 2023-03-30 12:58:51 -0600
commit: ff61f0791ce969d2db6c9f3b71d74ceec0a2e958 (patch)
tree: fe32be44aaf65f9c436a8f37cd4a18f6ec47c3cb /Documentation/x86
parent: f030c8fd64cea916d57d40bb7b59c1cff9ea3bc3 (diff)
download: lwn-ff61f0791ce969d2db6c9f3b71d74ceec0a2e958.tar.gz
lwn-ff61f0791ce969d2db6c9f3b71d74ceec0a2e958.zip
44 files changed, 0 insertions, 8681 deletions
diff --git a/Documentation/x86/amd-memory-encryption.rst b/Documentation/x86/amd-memory-encryption.rst
deleted file mode 100644
index 934310ce7258..000000000000
--- a/Documentation/x86/amd-memory-encryption.rst
+++ /dev/null
@@ -1,133 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================
-AMD Memory Encryption
-=====================
-
-Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are
-features found on AMD processors.
-
-SME provides the ability to mark individual pages of memory as encrypted using
-the standard x86 page tables.  A page that is marked encrypted will be
-automatically decrypted when read from DRAM and encrypted when written to
-DRAM.  SME can therefore be used to protect the contents of DRAM from physical
-attacks on the system.
-
-SEV enables running encrypted virtual machines (VMs) in which the code and data
-of the guest VM are secured so that a decrypted version is available only
-within the VM itself. SEV guest VMs have the concept of private and shared
-memory. Private memory is encrypted with the guest-specific key, while shared
-memory may be encrypted with hypervisor key. When SME is enabled, the hypervisor
-key is the same key which is used in SME.
-
-A page is encrypted when a page table entry has the encryption bit set (see
-below on how to determine its position).  The encryption bit can also be
-specified in the cr3 register, allowing the PGD table to be encrypted. Each
-successive level of page tables can also be encrypted by setting the encryption
-bit in the page table entry that points to the next table. This allows the full
-page table hierarchy to be encrypted. Note, this means that just because the
-encryption bit is set in cr3, doesn't imply the full hierarchy is encrypted.
-Each page table entry in the hierarchy needs to have the encryption bit set to
-achieve that. So, theoretically, you could have the encryption bit set in cr3
-so that the PGD is encrypted, but not set the encryption bit in the PGD entry
-for a PUD which results in the PUD pointed to by that entry to not be
-encrypted.
-
-When SEV is enabled, instruction pages and guest page tables are always treated
-as private. All the DMA operations inside the guest must be performed on shared
-memory. Since the memory encryption bit is controlled by the guest OS when it
-is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware
-forces the memory encryption bit to 1.
-
-Support for SME and SEV can be determined through the CPUID instruction. The
-CPUID function 0x8000001f reports information related to SME::
-
-	0x8000001f[eax]:
-		Bit[0] indicates support for SME
-		Bit[1] indicates support for SEV
-	0x8000001f[ebx]:
-		Bits[5:0]  pagetable bit number used to activate memory
-			   encryption
-		Bits[11:6] reduction in physical address space, in bits, when
-			   memory encryption is enabled (this only affects
-			   system physical addresses, not guest physical
-			   addresses)
-
-If support for SME is present, MSR 0xc00100010 (MSR_AMD64_SYSCFG) can be used to
-determine if SME is enabled and/or to enable memory encryption::
-
-	0xc0010010:
-		Bit[23]   0 = memory encryption features are disabled
-			  1 = memory encryption features are enabled
-
-If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if
-SEV is active::
-
-	0xc0010131:
-		Bit[0]	  0 = memory encryption is not active
-			  1 = memory encryption is active
-
-Linux relies on BIOS to set this bit if BIOS has determined that the reduction
-in the physical address space as a result of enabling memory encryption (see
-CPUID information above) will not conflict with the address space resource
-requirements for the system.  If this bit is not set upon Linux startup then
-Linux itself will not set it and memory encryption will not be possible.
-
-The state of SME in the Linux kernel can be documented as follows:
-
-	- Supported:
-	  The CPU supports SME (determined through CPUID instruction).
-
-	- Enabled:
-	  Supported and bit 23 of MSR_AMD64_SYSCFG is set.
-
-	- Active:
-	  Supported, Enabled and the Linux kernel is actively applying
-	  the encryption bit to page table entries (the SME mask in the
-	  kernel is non-zero).
-
-SME can also be enabled and activated in the BIOS. If SME is enabled and
-activated in the BIOS, then all memory accesses will be encrypted and it will
-not be necessary to activate the Linux memory encryption support.  If the BIOS
-merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate
-memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
-by supplying mem_encrypt=on on the kernel command line.  However, if BIOS does
-not enable SME, then Linux will not be able to activate memory encryption, even
-if configured to do so by default or the mem_encrypt=on command line parameter
-is specified.
-
-Secure Nested Paging (SNP)
-==========================
-
-SEV-SNP introduces new features (SEV_FEATURES[1:63]) which can be enabled
-by the hypervisor for security enhancements. Some of these features need
-guest side implementation to function correctly. The below table lists the
-expected guest behavior with various possible scenarios of guest/hypervisor
-SNP feature support.
-
-+-----------------+---------------+---------------+------------------+
-| Feature Enabled | Guest needs   | Guest has     | Guest boot       |
-| by the HV       | implementation| implementation| behaviour        |
-+=================+===============+===============+==================+
-|      No         |      No       |      No       |     Boot         |
-|                 |               |               |                  |
-+-----------------+---------------+---------------+------------------+
-|      No         |      Yes      |      No       |     Boot         |
-|                 |               |               |                  |
-+-----------------+---------------+---------------+------------------+
-|      No         |      Yes      |      Yes      |     Boot         |
-|                 |               |               |                  |
-+-----------------+---------------+---------------+------------------+
-|      Yes        |      No       |      No       | Boot with        |
-|                 |               |               | feature enabled  |
-+-----------------+---------------+---------------+------------------+
-|      Yes        |      Yes      |      No       | Graceful boot    |
-|                 |               |               | failure          |
-+-----------------+---------------+---------------+------------------+
-|      Yes        |      Yes      |      Yes      | Boot with        |
-|                 |               |               | feature enabled  |
-+-----------------+---------------+---------------+------------------+
-
-More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR
-
-[1] https://www.amd.com/system/files/TechDocs/40332.pdf
diff --git a/Documentation/x86/amd_hsmp.rst b/Documentation/x86/amd_hsmp.rst
deleted file mode 100644
index 440e4b645a1c..000000000000
--- a/Documentation/x86/amd_hsmp.rst
+++ /dev/null
@@ -1,86 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============================================
-AMD HSMP interface
-============================================
-
-Newer Fam19h EPYC server line of processors from AMD support system
-management functionality via HSMP (Host System Management Port).
-
-The Host System Management Port (HSMP) is an interface to provide
-OS-level software with access to system management functions via a
-set of mailbox registers.
-
-More details on the interface can be found in chapter
-"7 Host System Management Port (HSMP)" of the family/model PPR
-Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
-
-HSMP interface is supported on EPYC server CPU models only.
-
-
-HSMP device
-============================================
-
-amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice
-/dev/hsmp to let user space programs run hsmp mailbox commands.
-
-$ ls -al /dev/hsmp
-crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
-
-Characteristics of the dev node:
- * Write mode is used for running set/configure commands
- * Read mode is used for running get/status monitor commands
-
-Access restrictions:
- * Only root user is allowed to open the file in write mode.
- * The file can be opened in read mode by all the users.
-
-In-kernel integration:
- * Other subsystems in the kernel can use the exported transport
-   function hsmp_send_message().
- * Locking across callers is taken care by the driver.
-
-
-An example
-==========
-
-To access hsmp device from a C program.
-First, you need to include the headers::
-
-  #include <linux/amd_hsmp.h>
-
-Which defines the supported messages/message IDs.
-
-Next thing, open the device file, as follows::
-
-  int file;
-
-  file = open("/dev/hsmp", O_RDWR);
-  if (file < 0) {
-    /* ERROR HANDLING; you can check errno to see what went wrong */
-    exit(1);
-  }
-
-The following IOCTL is defined:
-
-``ioctl(file, HSMP_IOCTL_CMD, struct hsmp_message *msg)``
-  The argument is a pointer to a::
-
-    struct hsmp_message {
-    	__u32	msg_id;				/* Message ID */
-    	__u16	num_args;			/* Number of input argument words in message */
-    	__u16	response_sz;			/* Number of expected output/response words */
-    	__u32	args[HSMP_MAX_MSG_LEN];		/* argument/response buffer */
-    	__u16	sock_ind;			/* socket number */
-    };
-
-The ioctl would return a non-zero on failure; you can read errno to see
-what happened. The transaction returns 0 on success.
-
-More details on the interface and message definitions can be found in chapter
-"7 Host System Management Port (HSMP)" of the respective family/model PPR
-eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
-
-User space C-APIs are made available by linking against the esmi library,
-which is provided by the E-SMS project https://developer.amd.com/e-sms/.
-See: https://github.com/amd/esmi_ib_library
diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst
deleted file mode 100644
index 240d084782a6..000000000000
--- a/Documentation/x86/boot.rst
+++ /dev/null
@@ -1,1443 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================
-The Linux/x86 Boot Protocol
-===========================
-
-On the x86 platform, the Linux kernel uses a rather complicated boot
-convention.  This has evolved partially due to historical aspects, as
-well as the desire in the early days to have the kernel itself be a
-bootable image, the complicated PC memory model and due to changed
-expectations in the PC industry caused by the effective demise of
-real-mode DOS as a mainstream operating system.
-
-Currently, the following versions of the Linux/x86 boot protocol exist.
-
-=============	============================================================
-Old kernels	zImage/Image support only.  Some very early kernels
-		may not even support a command line.
-
-Protocol 2.00	(Kernel 1.3.73) Added bzImage and initrd support, as
-		well as a formalized way to communicate between the
-		boot loader and the kernel.  setup.S made relocatable,
-		although the traditional setup area still assumed
-		writable.
-
-Protocol 2.01	(Kernel 1.3.76) Added a heap overrun warning.
-
-Protocol 2.02	(Kernel 2.4.0-test3-pre3) New command line protocol.
-		Lower the conventional memory ceiling.	No overwrite
-		of the traditional setup area, thus making booting
-		safe for systems which use the EBDA from SMM or 32-bit
-		BIOS entry points.  zImage deprecated but still
-		supported.
-
-Protocol 2.03	(Kernel 2.4.18-pre1) Explicitly makes the highest possible
-		initrd address available to the bootloader.
-
-Protocol 2.04	(Kernel 2.6.14) Extend the syssize field to four bytes.
-
-Protocol 2.05	(Kernel 2.6.20) Make protected mode kernel relocatable.
-		Introduce relocatable_kernel and kernel_alignment fields.
-
-Protocol 2.06	(Kernel 2.6.22) Added a field that contains the size of
-		the boot command line.
-
-Protocol 2.07	(Kernel 2.6.24) Added paravirtualised boot protocol.
-		Introduced hardware_subarch and hardware_subarch_data
-		and KEEP_SEGMENTS flag in load_flags.
-
-Protocol 2.08	(Kernel 2.6.26) Added crc32 checksum and ELF format
-		payload. Introduced payload_offset and payload_length
-		fields to aid in locating the payload.
-
-Protocol 2.09	(Kernel 2.6.26) Added a field of 64-bit physical
-		pointer to single linked list of struct	setup_data.
-
-Protocol 2.10	(Kernel 2.6.31) Added a protocol for relaxed alignment
-		beyond the kernel_alignment added, new init_size and
-		pref_address fields.  Added extended boot loader IDs.
-
-Protocol 2.11	(Kernel 3.6) Added a field for offset of EFI handover
-		protocol entry point.
-
-Protocol 2.12	(Kernel 3.8) Added the xloadflags field and extension fields
-		to struct boot_params for loading bzImage and ramdisk
-		above 4G in 64bit.
-
-Protocol 2.13	(Kernel 3.14) Support 32- and 64-bit flags being set in
-		xloadflags to support booting a 64-bit kernel from 32-bit
-		EFI
-
-Protocol 2.14	BURNT BY INCORRECT COMMIT
-                ae7e1238e68f2a472a125673ab506d49158c1889
-		(x86/boot: Add ACPI RSDP address to setup_header)
-		DO NOT USE!!! ASSUME SAME AS 2.13.
-
-Protocol 2.15	(Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max.
-=============	============================================================
-
-.. note::
-     The protocol version number should be changed only if the setup header
-     is changed. There is no need to update the version number if boot_params
-     or kernel_info are changed. Additionally, it is recommended to use
-     xloadflags (in this case the protocol version number should not be
-     updated either) or kernel_info to communicate supported Linux kernel
-     features to the boot loader. Due to very limited space available in
-     the original setup header every update to it should be considered
-     with great care. Starting from the protocol 2.15 the primary way to
-     communicate things to the boot loader is the kernel_info.
-
-
-Memory Layout
-=============
-
-The traditional memory map for the kernel loader, used for Image or
-zImage kernels, typically looks like::
-
-		|			 |
-	0A0000	+------------------------+
-		|  Reserved for BIOS	 |	Do not use.  Reserved for BIOS EBDA.
-	09A000	+------------------------+
-		|  Command line		 |
-		|  Stack/heap		 |	For use by the kernel real-mode code.
-	098000	+------------------------+
-		|  Kernel setup		 |	The kernel real-mode code.
-	090200	+------------------------+
-		|  Kernel boot sector	 |	The kernel legacy boot sector.
-	090000	+------------------------+
-		|  Protected-mode kernel |	The bulk of the kernel image.
-	010000	+------------------------+
-		|  Boot loader		 |	<- Boot sector entry point 0000:7C00
-	001000	+------------------------+
-		|  Reserved for MBR/BIOS |
-	000800	+------------------------+
-		|  Typically used by MBR |
-	000600	+------------------------+
-		|  BIOS use only	 |
-	000000	+------------------------+
-
-When using bzImage, the protected-mode kernel was relocated to
-0x100000 ("high memory"), and the kernel real-mode block (boot sector,
-setup, and stack/heap) was made relocatable to any address between
-0x10000 and end of low memory. Unfortunately, in protocols 2.00 and
-2.01 the 0x90000+ memory range is still used internally by the kernel;
-the 2.02 protocol resolves that problem.
-
-It is desirable to keep the "memory ceiling" -- the highest point in
-low memory touched by the boot loader -- as low as possible, since
-some newer BIOSes have begun to allocate some rather large amounts of
-memory, called the Extended BIOS Data Area, near the top of low
-memory.	 The boot loader should use the "INT 12h" BIOS call to verify
-how much low memory is available.
-
-Unfortunately, if INT 12h reports that the amount of memory is too
-low, there is usually nothing the boot loader can do but to report an
-error to the user.  The boot loader should therefore be designed to
-take up as little space in low memory as it reasonably can.  For
-zImage or old bzImage kernels, which need data written into the
-0x90000 segment, the boot loader should make sure not to use memory
-above the 0x9A000 point; too many BIOSes will break above that point.
-
-For a modern bzImage kernel with boot protocol version >= 2.02, a
-memory layout like the following is suggested::
-
-		~                        ~
-		|  Protected-mode kernel |
-	100000  +------------------------+
-		|  I/O memory hole	 |
-	0A0000	+------------------------+
-		|  Reserved for BIOS	 |	Leave as much as possible unused
-		~                        ~
-		|  Command line		 |	(Can also be below the X+10000 mark)
-	X+10000	+------------------------+
-		|  Stack/heap		 |	For use by the kernel real-mode code.
-	X+08000	+------------------------+
-		|  Kernel setup		 |	The kernel real-mode code.
-		|  Kernel boot sector	 |	The kernel legacy boot sector.
-	X       +------------------------+
-		|  Boot loader		 |	<- Boot sector entry point 0000:7C00
-	001000	+------------------------+
-		|  Reserved for MBR/BIOS |
-	000800	+------------------------+
-		|  Typically used by MBR |
-	000600	+------------------------+
-		|  BIOS use only	 |
-	000000	+------------------------+
-
-  ... where the address X is as low as the design of the boot loader permits.
-
-
-The Real-Mode Kernel Header
-===========================
-
-In the following text, and anywhere in the kernel boot sequence, "a
-sector" refers to 512 bytes.  It is independent of the actual sector
-size of the underlying medium.
-
-The first step in loading a Linux kernel should be to load the
-real-mode code (boot sector and setup code) and then examine the
-following header at offset 0x01f1.  The real-mode code can total up to
-32K, although the boot loader may choose to load only the first two
-sectors (1K) and then examine the bootup sector size.
-
-The header looks like:
-
-===========	========	=====================	============================================
-Offset/Size	Proto		Name			Meaning
-===========	========	=====================	============================================
-01F1/1		ALL(1)		setup_sects		The size of the setup in sectors
-01F2/2		ALL		root_flags		If set, the root is mounted readonly
-01F4/4		2.04+(2)	syssize			The size of the 32-bit code in 16-byte paras
-01F8/2		ALL		ram_size		DO NOT USE - for bootsect.S use only
-01FA/2		ALL		vid_mode		Video mode control
-01FC/2		ALL		root_dev		Default root device number
-01FE/2		ALL		boot_flag		0xAA55 magic number
-0200/2		2.00+		jump			Jump instruction
-0202/4		2.00+		header			Magic signature "HdrS"
-0206/2		2.00+		version			Boot protocol version supported
-0208/4		2.00+		realmode_swtch		Boot loader hook (see below)
-020C/2		2.00+		start_sys_seg		The load-low segment (0x1000) (obsolete)
-020E/2		2.00+		kernel_version		Pointer to kernel version string
-0210/1		2.00+		type_of_loader		Boot loader identifier
-0211/1		2.00+		loadflags		Boot protocol option flags
-0212/2		2.00+		setup_move_size		Move to high memory size (used with hooks)
-0214/4		2.00+		code32_start		Boot loader hook (see below)
-0218/4		2.00+		ramdisk_image		initrd load address (set by boot loader)
-021C/4		2.00+		ramdisk_size		initrd size (set by boot loader)
-0220/4		2.00+		bootsect_kludge		DO NOT USE - for bootsect.S use only
-0224/2		2.01+		heap_end_ptr		Free memory after setup end
-0226/1		2.02+(3)	ext_loader_ver		Extended boot loader version
-0227/1		2.02+(3)	ext_loader_type		Extended boot loader ID
-0228/4		2.02+		cmd_line_ptr		32-bit pointer to the kernel command line
-022C/4		2.03+		initrd_addr_max		Highest legal initrd address
-0230/4		2.05+		kernel_alignment	Physical addr alignment required for kernel
-0234/1		2.05+		relocatable_kernel	Whether kernel is relocatable or not
-0235/1		2.10+		min_alignment		Minimum alignment, as a power of two
-0236/2		2.12+		xloadflags		Boot protocol option flags
-0238/4		2.06+		cmdline_size		Maximum size of the kernel command line
-023C/4		2.07+		hardware_subarch	Hardware subarchitecture
-0240/8		2.07+		hardware_subarch_data	Subarchitecture-specific data
-0248/4		2.08+		payload_offset		Offset of kernel payload
-024C/4		2.08+		payload_length		Length of kernel payload
-0250/8		2.09+		setup_data		64-bit physical pointer to linked list
-							of struct setup_data
-0258/8		2.10+		pref_address		Preferred loading address
-0260/4		2.10+		init_size		Linear memory required during initialization
-0264/4		2.11+		handover_offset		Offset of handover entry point
-0268/4		2.15+		kernel_info_offset	Offset of the kernel_info
-===========	========	=====================	============================================
-
-.. note::
-  (1) For backwards compatibility, if the setup_sects field contains 0, the
-      real value is 4.
-
-  (2) For boot protocol prior to 2.04, the upper two bytes of the syssize
-      field are unusable, which means the size of a bzImage kernel
-      cannot be determined.
-
-  (3) Ignored, but safe to set, for boot protocols 2.02-2.09.
-
-If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
-the boot protocol version is "old".  Loading an old kernel, the
-following parameters should be assumed::
-
-	Image type = zImage
-	initrd not supported
-	Real-mode kernel must be located at 0x90000.
-
-Otherwise, the "version" field contains the protocol version,
-e.g. protocol version 2.01 will contain 0x0201 in this field.  When
-setting fields in the header, you must make sure only to set fields
-supported by the protocol version in use.
-
-
-Details of Header Fields
-========================
-
-For each field, some are information from the kernel to the bootloader
-("read"), some are expected to be filled out by the bootloader
-("write"), and some are expected to be read and modified by the
-bootloader ("modify").
-
-All general purpose boot loaders should write the fields marked
-(obligatory).  Boot loaders who want to load the kernel at a
-nonstandard address should fill in the fields marked (reloc); other
-boot loaders can ignore those fields.
-
-The byte order of all fields is littleendian (this is x86, after all.)
-
-============	===========
-Field name:	setup_sects
-Type:		read
-Offset/size:	0x1f1/1
-Protocol:	ALL
-============	===========
-
-  The size of the setup code in 512-byte sectors.  If this field is
-  0, the real value is 4.  The real-mode code consists of the boot
-  sector (always one 512-byte sector) plus the setup code.
-
-============	=================
-Field name:	root_flags
-Type:		modify (optional)
-Offset/size:	0x1f2/2
-Protocol:	ALL
-============	=================
-
-  If this field is nonzero, the root defaults to readonly.  The use of
-  this field is deprecated; use the "ro" or "rw" options on the
-  command line instead.
-
-============	===============================================
-Field name:	syssize
-Type:		read
-Offset/size:	0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
-Protocol:	2.04+
-============	===============================================
-
-  The size of the protected-mode code in units of 16-byte paragraphs.
-  For protocol versions older than 2.04 this field is only two bytes
-  wide, and therefore cannot be trusted for the size of a kernel if
-  the LOAD_HIGH flag is set.
-
-============	===============
-Field name:	ram_size
-Type:		kernel internal
-Offset/size:	0x1f8/2
-Protocol:	ALL
-============	===============
-
-  This field is obsolete.
-
-============	===================
-Field name:	vid_mode
-Type:		modify (obligatory)
-Offset/size:	0x1fa/2
-============	===================
-
-  Please see the section on SPECIAL COMMAND LINE OPTIONS.
-
-============	=================
-Field name:	root_dev
-Type:		modify (optional)
-Offset/size:	0x1fc/2
-Protocol:	ALL
-============	=================
-
-  The default root device device number.  The use of this field is
-  deprecated, use the "root=" option on the command line instead.
-
-============	=========
-Field name:	boot_flag
-Type:		read
-Offset/size:	0x1fe/2
-Protocol:	ALL
-============	=========
-
-  Contains 0xAA55.  This is the closest thing old Linux kernels have
-  to a magic number.
-
-============	=======
-Field name:	jump
-Type:		read
-Offset/size:	0x200/2
-Protocol:	2.00+
-============	=======
-
-  Contains an x86 jump instruction, 0xEB followed by a signed offset
-  relative to byte 0x202.  This can be used to determine the size of
-  the header.
-
-============	=======
-Field name:	header
-Type:		read
-Offset/size:	0x202/4
-Protocol:	2.00+
-============	=======
-
-  Contains the magic number "HdrS" (0x53726448).
-
-============	=======
-Field name:	version
-Type:		read
-Offset/size:	0x206/2
-Protocol:	2.00+
-============	=======
-
-  Contains the boot protocol version, in (major << 8)+minor format,
-  e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
-  10.17.
-
-============	=================
-Field name:	realmode_swtch
-Type:		modify (optional)
-Offset/size:	0x208/4
-Protocol:	2.00+
-============	=================
-
-  Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
-
-============	=============
-Field name:	start_sys_seg
-Type:		read
-Offset/size:	0x20c/2
-Protocol:	2.00+
-============	=============
-
-  The load low segment (0x1000).  Obsolete.
-
-============	==============
-Field name:	kernel_version
-Type:		read
-Offset/size:	0x20e/2
-Protocol:	2.00+
-============	==============
-
-  If set to a nonzero value, contains a pointer to a NUL-terminated
-  human-readable kernel version number string, less 0x200.  This can
-  be used to display the kernel version to the user.  This value
-  should be less than (0x200*setup_sects).
-
-  For example, if this value is set to 0x1c00, the kernel version
-  number string can be found at offset 0x1e00 in the kernel file.
-  This is a valid value if and only if the "setup_sects" field
-  contains the value 15 or higher, as::
-
-	0x1c00  < 15*0x200 (= 0x1e00) but
-	0x1c00 >= 14*0x200 (= 0x1c00)
-
-	0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15.
-
-============	==================
-Field name:	type_of_loader
-Type:		write (obligatory)
-Offset/size:	0x210/1
-Protocol:	2.00+
-============	==================
-
-  If your boot loader has an assigned id (see table below), enter
-  0xTV here, where T is an identifier for the boot loader and V is
-  a version number.  Otherwise, enter 0xFF here.
-
-  For boot loader IDs above T = 0xD, write T = 0xE to this field and
-  write the extended ID minus 0x10 to the ext_loader_type field.
-  Similarly, the ext_loader_ver field can be used to provide more than
-  four bits for the bootloader version.
-
-  For example, for T = 0x15, V = 0x234, write::
-
-	type_of_loader  <- 0xE4
-	ext_loader_type <- 0x05
-	ext_loader_ver  <- 0x23
-
-  Assigned boot loader ids (hexadecimal):
-
-	== =======================================
-	0  LILO
-	   (0x00 reserved for pre-2.00 bootloader)
-	1  Loadlin
-	2  bootsect-loader
-	   (0x20, all other values reserved)
-	3  Syslinux
-	4  Etherboot/gPXE/iPXE
-	5  ELILO
-	7  GRUB
-	8  U-Boot
-	9  Xen
-	A  Gujin
-	B  Qemu
-	C  Arcturus Networks uCbootloader
-	D  kexec-tools
-	E  Extended (see ext_loader_type)
-	F  Special (0xFF = undefined)
-	10 Reserved
-	11 Minimal Linux Bootloader
-	   <http://sebastian-plotz.blogspot.de>
-	12 OVMF UEFI virtualization stack
-	13 barebox
-	== =======================================
-
-  Please contact <hpa@zytor.com> if you need a bootloader ID value assigned.
-
-============	===================
-Field name:	loadflags
-Type:		modify (obligatory)
-Offset/size:	0x211/1
-Protocol:	2.00+
-============	===================
-
-  This field is a bitmask.
-
-  Bit 0 (read):	LOADED_HIGH
-
-	- If 0, the protected-mode code is loaded at 0x10000.
-	- If 1, the protected-mode code is loaded at 0x100000.
-
-  Bit 1 (kernel internal): KASLR_FLAG
-
-	- Used internally by the compressed kernel to communicate
-	  KASLR status to kernel proper.
-
-	    - If 1, KASLR enabled.
-	    - If 0, KASLR disabled.
-
-  Bit 5 (write): QUIET_FLAG
-
-	- If 0, print early messages.
-	- If 1, suppress early messages.
-
-		This requests to the kernel (decompressor and early
-		kernel) to not write early messages that require
-		accessing the display hardware directly.
-
-  Bit 6 (obsolete): KEEP_SEGMENTS
-
-	Protocol: 2.07+
-
-        - This flag is obsolete.
-
-  Bit 7 (write): CAN_USE_HEAP
-
-	Set this bit to 1 to indicate that the value entered in the
-	heap_end_ptr is valid.  If this field is clear, some setup code
-	functionality will be disabled.
-
-
-============	===================
-Field name:	setup_move_size
-Type:		modify (obligatory)
-Offset/size:	0x212/2
-Protocol:	2.00-2.01
-============	===================
-
-  When using protocol 2.00 or 2.01, if the real mode kernel is not
-  loaded at 0x90000, it gets moved there later in the loading
-  sequence.  Fill in this field if you want additional data (such as
-  the kernel command line) moved in addition to the real-mode kernel
-  itself.
-
-  The unit is bytes starting with the beginning of the boot sector.
-
-  This field is can be ignored when the protocol is 2.02 or higher, or
-  if the real-mode code is loaded at 0x90000.
-
-============	========================
-Field name:	code32_start
-Type:		modify (optional, reloc)
-Offset/size:	0x214/4
-Protocol:	2.00+
-============	========================
-
-  The address to jump to in protected mode.  This defaults to the load
-  address of the kernel, and can be used by the boot loader to
-  determine the proper load address.
-
-  This field can be modified for two purposes:
-
-    1. as a boot loader hook (see Advanced Boot Loader Hooks below.)
-
-    2. if a bootloader which does not install a hook loads a
-       relocatable kernel at a nonstandard address it will have to modify
-       this field to point to the load address.
-
-============	==================
-Field name:	ramdisk_image
-Type:		write (obligatory)
-Offset/size:	0x218/4
-Protocol:	2.00+
-============	==================
-
-  The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
-  zero if there is no initial ramdisk/ramfs.
-
-============	==================
-Field name:	ramdisk_size
-Type:		write (obligatory)
-Offset/size:	0x21c/4
-Protocol:	2.00+
-============	==================
-
-  Size of the initial ramdisk or ramfs.  Leave at zero if there is no
-  initial ramdisk/ramfs.
-
-============	===============
-Field name:	bootsect_kludge
-Type:		kernel internal
-Offset/size:	0x220/4
-Protocol:	2.00+
-============	===============
-
-  This field is obsolete.
-
-============	==================
-Field name:	heap_end_ptr
-Type:		write (obligatory)
-Offset/size:	0x224/2
-Protocol:	2.01+
-============	==================
-
-  Set this field to the offset (from the beginning of the real-mode
-  code) of the end of the setup stack/heap, minus 0x0200.
-
-============	================
-Field name:	ext_loader_ver
-Type:		write (optional)
-Offset/size:	0x226/1
-Protocol:	2.02+
-============	================
-
-  This field is used as an extension of the version number in the
-  type_of_loader field.  The total version number is considered to be
-  (type_of_loader & 0x0f) + (ext_loader_ver << 4).
-
-  The use of this field is boot loader specific.  If not written, it
-  is zero.
-
-  Kernels prior to 2.6.31 did not recognize this field, but it is safe
-  to write for protocol version 2.02 or higher.
-
-============	=====================================================
-Field name:	ext_loader_type
-Type:		write (obligatory if (type_of_loader & 0xf0) == 0xe0)
-Offset/size:	0x227/1
-Protocol:	2.02+
-============	=====================================================
-
-  This field is used as an extension of the type number in
-  type_of_loader field.  If the type in type_of_loader is 0xE, then
-  the actual type is (ext_loader_type + 0x10).
-
-  This field is ignored if the type in type_of_loader is not 0xE.
-
-  Kernels prior to 2.6.31 did not recognize this field, but it is safe
-  to write for protocol version 2.02 or higher.
-
-============	==================
-Field name:	cmd_line_ptr
-Type:		write (obligatory)
-Offset/size:	0x228/4
-Protocol:	2.02+
-============	==================
-
-  Set this field to the linear address of the kernel command line.
-  The kernel command line can be located anywhere between the end of
-  the setup heap and 0xA0000; it does not have to be located in the
-  same 64K segment as the real-mode code itself.
-
-  Fill in this field even if your boot loader does not support a
-  command line, in which case you can point this to an empty string
-  (or better yet, to the string "auto".)  If this field is left at
-  zero, the kernel will assume that your boot loader does not support
-  the 2.02+ protocol.
-
-============	===============
-Field name:	initrd_addr_max
-Type:		read
-Offset/size:	0x22c/4
-Protocol:	2.03+
-============	===============
-
-  The maximum address that may be occupied by the initial
-  ramdisk/ramfs contents.  For boot protocols 2.02 or earlier, this
-  field is not present, and the maximum address is 0x37FFFFFF.  (This
-  address is defined as the address of the highest safe byte, so if
-  your ramdisk is exactly 131072 bytes long and this field is
-  0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
-
-============	============================
-Field name:	kernel_alignment
-Type:		read/modify (reloc)
-Offset/size:	0x230/4
-Protocol:	2.05+ (read), 2.10+ (modify)
-============	============================
-
-  Alignment unit required by the kernel (if relocatable_kernel is
-  true.)  A relocatable kernel that is loaded at an alignment
-  incompatible with the value in this field will be realigned during
-  kernel initialization.
-
-  Starting with protocol version 2.10, this reflects the kernel
-  alignment preferred for optimal performance; it is possible for the
-  loader to modify this field to permit a lesser alignment.  See the
-  min_alignment and pref_address field below.
-
-============	==================
-Field name:	relocatable_kernel
-Type:		read (reloc)
-Offset/size:	0x234/1
-Protocol:	2.05+
-============	==================
-
-  If this field is nonzero, the protected-mode part of the kernel can
-  be loaded at any address that satisfies the kernel_alignment field.
-  After loading, the boot loader must set the code32_start field to
-  point to the loaded code, or to a boot loader hook.
-
-============	=============
-Field name:	min_alignment
-Type:		read (reloc)
-Offset/size:	0x235/1
-Protocol:	2.10+
-============	=============
-
-  This field, if nonzero, indicates as a power of two the minimum
-  alignment required, as opposed to preferred, by the kernel to boot.
-  If a boot loader makes use of this field, it should update the
-  kernel_alignment field with the alignment unit desired; typically::
-
-	kernel_alignment = 1 << min_alignment
-
-  There may be a considerable performance cost with an excessively
-  misaligned kernel.  Therefore, a loader should typically try each
-  power-of-two alignment from kernel_alignment down to this alignment.
-
-============	==========
-Field name:	xloadflags
-Type:		read
-Offset/size:	0x236/2
-Protocol:	2.12+
-============	==========
-
-  This field is a bitmask.
-
-  Bit 0 (read):	XLF_KERNEL_64
-
-	- If 1, this kernel has the legacy 64-bit entry point at 0x200.
-
-  Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G
-
-        - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G.
-
-  Bit 2 (read):	XLF_EFI_HANDOVER_32
-
-	- If 1, the kernel supports the 32-bit EFI handoff entry point
-          given at handover_offset.
-
-  Bit 3 (read): XLF_EFI_HANDOVER_64
-
-	- If 1, the kernel supports the 64-bit EFI handoff entry point
-          given at handover_offset + 0x200.
-
-  Bit 4 (read): XLF_EFI_KEXEC
-
-	- If 1, the kernel supports kexec EFI boot with EFI runtime support.
-
-
-============	============
-Field name:	cmdline_size
-Type:		read
-Offset/size:	0x238/4
-Protocol:	2.06+
-============	============
-
-  The maximum size of the command line without the terminating
-  zero. This means that the command line can contain at most
-  cmdline_size characters. With protocol version 2.05 and earlier, the
-  maximum size was 255.
-
-============	====================================
-Field name:	hardware_subarch
-Type:		write (optional, defaults to x86/PC)
-Offset/size:	0x23c/4
-Protocol:	2.07+
-============	====================================
-
-  In a paravirtualized environment the hardware low level architectural
-  pieces such as interrupt handling, page table handling, and
-  accessing process control registers needs to be done differently.
-
-  This field allows the bootloader to inform the kernel we are in one
-  one of those environments.
-
-  ==========	==============================
-  0x00000000	The default x86/PC environment
-  0x00000001	lguest
-  0x00000002	Xen
-  0x00000003	Moorestown MID
-  0x00000004	CE4100 TV Platform
-  ==========	==============================
-
-============	=========================
-Field name:	hardware_subarch_data
-Type:		write (subarch-dependent)
-Offset/size:	0x240/8
-Protocol:	2.07+
-============	=========================
-
-  A pointer to data that is specific to hardware subarch
-  This field is currently unused for the default x86/PC environment,
-  do not modify.
-
-============	==============
-Field name:	payload_offset
-Type:		read
-Offset/size:	0x248/4
-Protocol:	2.08+
-============	==============
-
-  If non-zero then this field contains the offset from the beginning
-  of the protected-mode code to the payload.
-
-  The payload may be compressed. The format of both the compressed and
-  uncompressed data should be determined using the standard magic
-  numbers.  The currently supported compression formats are gzip
-  (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA
-  (magic number 5D 00), XZ (magic number FD 37), LZ4 (magic number
-  02 21) and ZSTD (magic number 28 B5). The uncompressed payload is
-  currently always ELF (magic number 7F 45 4C 46).
-
-============	==============
-Field name:	payload_length
-Type:		read
-Offset/size:	0x24c/4
-Protocol:	2.08+
-============	==============
-
-  The length of the payload.
-
-============	===============
-Field name:	setup_data
-Type:		write (special)
-Offset/size:	0x250/8
-Protocol:	2.09+
-============	===============
-
-  The 64-bit physical pointer to NULL terminated single linked list of
-  struct setup_data. This is used to define a more extensible boot
-  parameters passing mechanism. The definition of struct setup_data is
-  as follow::
-
-	struct setup_data {
-		u64 next;
-		u32 type;
-		u32 len;
-		u8  data[0];
-	};
-
-  Where, the next is a 64-bit physical pointer to the next node of
-  linked list, the next field of the last node is 0; the type is used
-  to identify the contents of data; the len is the length of data
-  field; the data holds the real payload.
-
-  This list may be modified at a number of points during the bootup
-  process.  Therefore, when modifying this list one should always make
-  sure to consider the case where the linked list already contains
-  entries.
-
-  The setup_data is a bit awkward to use for extremely large data objects,
-  both because the setup_data header has to be adjacent to the data object
-  and because it has a 32-bit length field. However, it is important that
-  intermediate stages of the boot process have a way to identify which
-  chunks of memory are occupied by kernel data.
-
-  Thus setup_indirect struct and SETUP_INDIRECT type were introduced in
-  protocol 2.15::
-
-    struct setup_indirect {
-      __u32 type;
-      __u32 reserved;  /* Reserved, must be set to zero. */
-      __u64 len;
-      __u64 addr;
-    };
-
-  The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be
-  SETUP_INDIRECT itself since making the setup_indirect a tree structure
-  could require a lot of stack space in something that needs to parse it
-  and stack space can be limited in boot contexts.
-
-  Let's give an example how to point to SETUP_E820_EXT data using setup_indirect.
-  In this case setup_data and setup_indirect will look like this::
-
-    struct setup_data {
-      __u64 next = 0 or <addr_of_next_setup_data_struct>;
-      __u32 type = SETUP_INDIRECT;
-      __u32 len = sizeof(setup_indirect);
-      __u8 data[sizeof(setup_indirect)] = struct setup_indirect {
-        __u32 type = SETUP_INDIRECT | SETUP_E820_EXT;
-        __u32 reserved = 0;
-        __u64 len = <len_of_SETUP_E820_EXT_data>;
-        __u64 addr = <addr_of_SETUP_E820_EXT_data>;
-      }
-    }
-
-.. note::
-     SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished
-     from SETUP_INDIRECT itself. So, this kind of objects cannot be provided
-     by the bootloaders.
-
-============	============
-Field name:	pref_address
-Type:		read (reloc)
-Offset/size:	0x258/8
-Protocol:	2.10+
-============	============
-
-  This field, if nonzero, represents a preferred load address for the
-  kernel.  A relocating bootloader should attempt to load at this
-  address if possible.
-
-  A non-relocatable kernel will unconditionally move itself and to run
-  at this address.
-
-============	=======
-Field name:	init_size
-Type:		read
-Offset/size:	0x260/4
-============	=======
-
-  This field indicates the amount of linear contiguous memory starting
-  at the kernel runtime start address that the kernel needs before it
-  is capable of examining its memory map.  This is not the same thing
-  as the total amount of memory the kernel needs to boot, but it can
-  be used by a relocating boot loader to help select a safe load
-  address for the kernel.
-
-  The kernel runtime start address is determined by the following algorithm::
-
-	if (relocatable_kernel)
-	runtime_start = align_up(load_address, kernel_alignment)
-	else
-	runtime_start = pref_address
-
-============	===============
-Field name:	handover_offset
-Type:		read
-Offset/size:	0x264/4
-============	===============
-
-  This field is the offset from the beginning of the kernel image to
-  the EFI handover protocol entry point. Boot loaders using the EFI
-  handover protocol to boot the kernel should jump to this offset.
-
-  See EFI HANDOVER PROTOCOL below for more details.
-
-============	==================
-Field name:	kernel_info_offset
-Type:		read
-Offset/size:	0x268/4
-Protocol:	2.15+
-============	==================
-
-  This field is the offset from the beginning of the kernel image to the
-  kernel_info. The kernel_info structure is embedded in the Linux image
-  in the uncompressed protected mode region.
-
-
-The kernel_info
-===============
-
-The relationships between the headers are analogous to the various data
-sections:
-
-  setup_header = .data
-  boot_params/setup_data = .bss
-
-What is missing from the above list? That's right:
-
-  kernel_info = .rodata
-
-We have been (ab)using .data for things that could go into .rodata or .bss for
-a long time, for lack of alternatives and -- especially early on -- inertia.
-Also, the BIOS stub is responsible for creating boot_params, so it isn't
-available to a BIOS-based loader (setup_data is, though).
-
-setup_header is permanently limited to 144 bytes due to the reach of the
-2-byte jump field, which doubles as a length field for the structure, combined
-with the size of the "hole" in struct boot_params that a protected-mode loader
-or the BIOS stub has to copy it into. It is currently 119 bytes long, which
-leaves us with 25 very precious bytes. This isn't something that can be fixed
-without revising the boot protocol entirely, breaking backwards compatibility.
-
-boot_params proper is limited to 4096 bytes, but can be arbitrarily extended
-by adding setup_data entries. It cannot be used to communicate properties of
-the kernel image, because it is .bss and has no image-provided content.
-
-kernel_info solves this by providing an extensible place for information about
-the kernel image. It is readonly, because the kernel cannot rely on a
-bootloader copying its contents anywhere, but that is OK; if it becomes
-necessary it can still contain data items that an enabled bootloader would be
-expected to copy into a setup_data chunk.
-
-All kernel_info data should be part of this structure. Fixed size data have to
-be put before kernel_info_var_len_data label. Variable size data have to be put
-after kernel_info_var_len_data label. Each chunk of variable size data has to
-be prefixed with header/magic and its size, e.g.::
-
-  kernel_info:
-          .ascii  "LToP"          /* Header, Linux top (structure). */
-          .long   kernel_info_var_len_data - kernel_info
-          .long   kernel_info_end - kernel_info
-          .long   0x01234567      /* Some fixed size data for the bootloaders. */
-  kernel_info_var_len_data:
-  example_struct:                 /* Some variable size data for the bootloaders. */
-          .ascii  "0123"          /* Header/Magic. */
-          .long   example_struct_end - example_struct
-          .ascii  "Struct"
-          .long   0x89012345
-  example_struct_end:
-  example_strings:                /* Some variable size data for the bootloaders. */
-          .ascii  "ABCD"          /* Header/Magic. */
-          .long   example_strings_end - example_strings
-          .asciz  "String_0"
-          .asciz  "String_1"
-  example_strings_end:
-  kernel_info_end:
-
-This way the kernel_info is self-contained blob.
-
-.. note::
-     Each variable size data header/magic can be any 4-character string,
-     without \0 at the end of the string, which does not collide with
-     existing variable length data headers/magics.
-
-
-Details of the kernel_info Fields
-=================================
-
-============	========
-Field name:	header
-Offset/size:	0x0000/4
-============	========
-
-  Contains the magic number "LToP" (0x506f544c).
-
-============	========
-Field name:	size
-Offset/size:	0x0004/4
-============	========
-
-  This field contains the size of the kernel_info including kernel_info.header.
-  It does not count kernel_info.kernel_info_var_len_data size. This field should be
-  used by the bootloaders to detect supported fixed size fields in the kernel_info
-  and beginning of kernel_info.kernel_info_var_len_data.
-
-============	========
-Field name:	size_total
-Offset/size:	0x0008/4
-============	========
-
-  This field contains the size of the kernel_info including kernel_info.header
-  and kernel_info.kernel_info_var_len_data.
-
-============	==============
-Field name:	setup_type_max
-Offset/size:	0x000c/4
-============	==============
-
-  This field contains maximal allowed type for setup_data and setup_indirect structs.
-
-
-The Image Checksum
-==================
-
-From boot protocol version 2.08 onwards the CRC-32 is calculated over
-the entire file using the characteristic polynomial 0x04C11DB7 and an
-initial remainder of 0xffffffff.  The checksum is appended to the
-file; therefore the CRC of the file up to the limit specified in the
-syssize field of the header is always 0.
-
-
-The Kernel Command Line
-=======================
-
-The kernel command line has become an important way for the boot
-loader to communicate with the kernel.  Some of its options are also
-relevant to the boot loader itself, see "special command line options"
-below.
-
-The kernel command line is a null-terminated string. The maximum
-length can be retrieved from the field cmdline_size.  Before protocol
-version 2.06, the maximum was 255 characters.  A string that is too
-long will be automatically truncated by the kernel.
-
-If the boot protocol version is 2.02 or later, the address of the
-kernel command line is given by the header field cmd_line_ptr (see
-above.)  This address can be anywhere between the end of the setup
-heap and 0xA0000.
-
-If the protocol version is *not* 2.02 or higher, the kernel
-command line is entered using the following protocol:
-
-  - At offset 0x0020 (word), "cmd_line_magic", enter the magic
-    number 0xA33F.
-
-  - At offset 0x0022 (word), "cmd_line_offset", enter the offset
-    of the kernel command line (relative to the start of the
-    real-mode kernel).
-
-  - The kernel command line *must* be within the memory region
-    covered by setup_move_size, so you may need to adjust this
-    field.
-
-
-Memory Layout of The Real-Mode Code
-===================================
-
-The real-mode code requires a stack/heap to be set up, as well as
-memory allocated for the kernel command line.  This needs to be done
-in the real-mode accessible memory in bottom megabyte.
-
-It should be noted that modern machines often have a sizable Extended
-BIOS Data Area (EBDA).  As a result, it is advisable to use as little
-of the low megabyte as possible.
-
-Unfortunately, under the following circumstances the 0x90000 memory
-segment has to be used:
-
-	- When loading a zImage kernel ((loadflags & 0x01) == 0).
-	- When loading a 2.01 or earlier boot protocol kernel.
-
-.. note::
-     For the 2.00 and 2.01 boot protocols, the real-mode code
-     can be loaded at another address, but it is internally
-     relocated to 0x90000.  For the "old" protocol, the
-     real-mode code must be loaded at 0x90000.
-
-When loading at 0x90000, avoid using memory above 0x9a000.
-
-For boot protocol 2.02 or higher, the command line does not have to be
-located in the same 64K segment as the real-mode setup code; it is
-thus permitted to give the stack/heap the full 64K segment and locate
-the command line above it.
-
-The kernel command line should not be located below the real-mode
-code, nor should it be located in high memory.
-
-
-Sample Boot Configuartion
-=========================
-
-As a sample configuration, assume the following layout of the real
-mode segment.
-
-    When loading below 0x90000, use the entire segment:
-
-        =============	===================
-	0x0000-0x7fff	Real mode kernel
-	0x8000-0xdfff	Stack and heap
-	0xe000-0xffff	Kernel command line
-	=============	===================
-
-    When loading at 0x90000 OR the protocol version is 2.01 or earlier:
-
-	=============	===================
-	0x0000-0x7fff	Real mode kernel
-	0x8000-0x97ff	Stack and heap
-	0x9800-0x9fff	Kernel command line
-	=============	===================
-
-Such a boot loader should enter the following fields in the header::
-
-	unsigned long base_ptr;	/* base address for real-mode segment */
-
-	if ( setup_sects == 0 ) {
-		setup_sects = 4;
-	}
-
-	if ( protocol >= 0x0200 ) {
-		type_of_loader = <type code>;
-		if ( loading_initrd ) {
-			ramdisk_image = <initrd_address>;
-			ramdisk_size = <initrd_size>;
-		}
-
-		if ( protocol >= 0x0202 && loadflags & 0x01 )
-			heap_end = 0xe000;
-		else
-			heap_end = 0x9800;
-
-		if ( protocol >= 0x0201 ) {
-			heap_end_ptr = heap_end - 0x200;
-			loadflags |= 0x80; /* CAN_USE_HEAP */
-		}
-
-		if ( protocol >= 0x0202 ) {
-			cmd_line_ptr = base_ptr + heap_end;
-			strcpy(cmd_line_ptr, cmdline);
-		} else {
-			cmd_line_magic	= 0xA33F;
-			cmd_line_offset = heap_end;
-			setup_move_size = heap_end + strlen(cmdline)+1;
-			strcpy(base_ptr+cmd_line_offset, cmdline);
-		}
-	} else {
-		/* Very old kernel */
-
-		heap_end = 0x9800;
-
-		cmd_line_magic	= 0xA33F;
-		cmd_line_offset = heap_end;
-
-		/* A very old kernel MUST have its real-mode code
-		   loaded at 0x90000 */
-
-		if ( base_ptr != 0x90000 ) {
-			/* Copy the real-mode kernel */
-			memcpy(0x90000, base_ptr, (setup_sects+1)*512);
-			base_ptr = 0x90000;		 /* Relocated */
-		}
-
-		strcpy(0x90000+cmd_line_offset, cmdline);
-
-		/* It is recommended to clear memory up to the 32K mark */
-		memset(0x90000 + (setup_sects+1)*512, 0,
-		       (64-(setup_sects+1))*512);
-	}
-
-
-Loading The Rest of The Kernel
-==============================
-
-The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
-in the kernel file (again, if setup_sects == 0 the real value is 4.)
-It should be loaded at address 0x10000 for Image/zImage kernels and
-0x100000 for bzImage kernels.
-
-The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
-bit (LOAD_HIGH) in the loadflags field is set::
-
-	is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
-	load_address = is_bzImage ? 0x100000 : 0x10000;
-
-Note that Image/zImage kernels can be up to 512K in size, and thus use
-the entire 0x10000-0x90000 range of memory.  This means it is pretty
-much a requirement for these kernels to load the real-mode part at
-0x90000.  bzImage kernels allow much more flexibility.
-
-Special Command Line Options
-============================
-
-If the command line provided by the boot loader is entered by the
-user, the user may expect the following command line options to work.
-They should normally not be deleted from the kernel command line even
-though not all of them are actually meaningful to the kernel.  Boot
-loader authors who need additional command line options for the boot
-loader itself should get them registered in
-Documentation/admin-guide/kernel-parameters.rst to make sure they will not
-conflict with actual kernel options now or in the future.
-
-  vga=<mode>
-	<mode> here is either an integer (in C notation, either
-	decimal, octal, or hexadecimal) or one of the strings
-	"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
-	(meaning 0xFFFD).  This value should be entered into the
-	vid_mode field, as it is used by the kernel before the command
-	line is parsed.
-
-  mem=<size>
-	<size> is an integer in C notation optionally followed by
-	(case insensitive) K, M, G, T, P or E (meaning << 10, << 20,
-	<< 30, << 40, << 50 or << 60).  This specifies the end of
-	memory to the kernel. This affects the possible placement of
-	an initrd, since an initrd should be placed near end of
-	memory.  Note that this is an option to *both* the kernel and
-	the bootloader!
-
-  initrd=<file>
-	An initrd should be loaded.  The meaning of <file> is
-	obviously bootloader-dependent, and some boot loaders
-	(e.g. LILO) do not have such a command.
-
-In addition, some boot loaders add the following options to the
-user-specified command line:
-
-  BOOT_IMAGE=<file>
-	The boot image which was loaded.  Again, the meaning of <file>
-	is obviously bootloader-dependent.
-
-  auto
-	The kernel was booted without explicit user intervention.
-
-If these options are added by the boot loader, it is highly
-recommended that they are located *first*, before the user-specified
-or configuration-specified command line.  Otherwise, "init=/bin/sh"
-gets confused by the "auto" option.
-
-
-Running the Kernel
-==================
-
-The kernel is started by jumping to the kernel entry point, which is
-located at *segment* offset 0x20 from the start of the real mode
-kernel.  This means that if you loaded your real-mode kernel code at
-0x90000, the kernel entry point is 9020:0000.
-
-At entry, ds = es = ss should point to the start of the real-mode
-kernel code (0x9000 if the code is loaded at 0x90000), sp should be
-set up properly, normally pointing to the top of the heap, and
-interrupts should be disabled.  Furthermore, to guard against bugs in
-the kernel, it is recommended that the boot loader sets fs = gs = ds =
-es = ss.
-
-In our example from above, we would do::
-
-	/* Note: in the case of the "old" kernel protocol, base_ptr must
-	   be == 0x90000 at this point; see the previous sample code */
-
-	seg = base_ptr >> 4;
-
-	cli();	/* Enter with interrupts disabled! */
-
-	/* Set up the real-mode kernel stack */
-	_SS = seg;
-	_SP = heap_end;
-
-	_DS = _ES = _FS = _GS = seg;
-	jmp_far(seg+0x20, 0);	/* Run the kernel */
-
-If your boot sector accesses a floppy drive, it is recommended to
-switch off the floppy motor before running the kernel, since the
-kernel boot leaves interrupts off and thus the motor will not be
-switched off, especially if the loaded kernel has the floppy driver as
-a demand-loaded module!
-
-
-Advanced Boot Loader Hooks
-==========================
-
-If the boot loader runs in a particularly hostile environment (such as
-LOADLIN, which runs under DOS) it may be impossible to follow the
-standard memory location requirements.  Such a boot loader may use the
-following hooks that, if set, are invoked by the kernel at the
-appropriate time.  The use of these hooks should probably be
-considered an absolutely last resort!
-
-IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
-%edi across invocation.
-
-  realmode_swtch:
-	A 16-bit real mode far subroutine invoked immediately before
-	entering protected mode.  The default routine disables NMI, so
-	your routine should probably do so, too.
-
-  code32_start:
-	A 32-bit flat-mode routine *jumped* to immediately after the
-	transition to protected mode, but before the kernel is
-	uncompressed.  No segments, except CS, are guaranteed to be
-	set up (current kernels do, but older ones do not); you should
-	set them up to BOOT_DS (0x18) yourself.
-
-	After completing your hook, you should jump to the address
-	that was in this field before your boot loader overwrote it
-	(relocated, if appropriate.)
-
-
-32-bit Boot Protocol
-====================
-
-For machine with some new BIOS other than legacy BIOS, such as EFI,
-LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
-based on legacy BIOS can not be used, so a 32-bit boot protocol needs
-to be defined.
-
-In 32-bit boot protocol, the first step in loading a Linux kernel
-should be to setup the boot parameters (struct boot_params,
-traditionally known as "zero page"). The memory for struct boot_params
-should be allocated and initialized to all zero. Then the setup header
-from offset 0x01f1 of kernel image on should be loaded into struct
-boot_params and examined. The end of setup header can be calculated as
-follow::
-
-	0x0202 + byte value at offset 0x0201
-
-In addition to read/modify/write the setup header of the struct
-boot_params as that of 16-bit boot protocol, the boot loader should
-also fill the additional fields of the struct boot_params as
-described in chapter Documentation/x86/zero-page.rst.
-
-After setting up the struct boot_params, the boot loader can load the
-32/64-bit kernel in the same way as that of 16-bit boot protocol.
-
-In 32-bit boot protocol, the kernel is started by jumping to the
-32-bit kernel entry point, which is the start address of loaded
-32/64-bit kernel.
-
-At entry, the CPU must be in 32-bit protected mode with paging
-disabled; a GDT must be loaded with the descriptors for selectors
-__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
-segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
-must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
-must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
-address of the struct boot_params; %ebp, %edi and %ebx must be zero.
-
-64-bit Boot Protocol
-====================
-
-For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader
-and we need a 64-bit boot protocol.
-
-In 64-bit boot protocol, the first step in loading a Linux kernel
-should be to setup the boot parameters (struct boot_params,
-traditionally known as "zero page"). The memory for struct boot_params
-could be allocated anywhere (even above 4G) and initialized to all zero.
-Then, the setup header at offset 0x01f1 of kernel image on should be
-loaded into struct boot_params and examined. The end of setup header
-can be calculated as follows::
-
-	0x0202 + byte value at offset 0x0201
-
-In addition to read/modify/write the setup header of the struct
-boot_params as that of 16-bit boot protocol, the boot loader should
-also fill the additional fields of the struct boot_params as described
-in chapter Documentation/x86/zero-page.rst.
-
-After setting up the struct boot_params, the boot loader can load
-64-bit kernel in the same way as that of 16-bit boot protocol, but
-kernel could be loaded above 4G.
-
-In 64-bit boot protocol, the kernel is started by jumping to the
-64-bit kernel entry point, which is the start address of loaded
-64-bit kernel plus 0x200.
-
-At entry, the CPU must be in 64-bit mode with paging enabled.
-The range with setup_header.init_size from start address of loaded
-kernel and zero page and command line buffer get ident mapping;
-a GDT must be loaded with the descriptors for selectors
-__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
-segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
-must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
-must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base
-address of the struct boot_params.
-
-EFI Handover Protocol (deprecated)
-==================================
-
-This protocol allows boot loaders to defer initialisation to the EFI
-boot stub. The boot loader is required to load the kernel/initrd(s)
-from the boot media and jump to the EFI handover protocol entry point
-which is hdr->handover_offset bytes from the beginning of
-startup_{32,64}.
-
-The boot loader MUST respect the kernel's PE/COFF metadata when it comes
-to section alignment, the memory footprint of the executable image beyond
-the size of the file itself, and any other aspect of the PE/COFF header
-that may affect correct operation of the image as a PE/COFF binary in the
-execution context provided by the EFI firmware.
-
-The function prototype for the handover entry point looks like this::
-
-    efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp)
-
-'handle' is the EFI image handle passed to the boot loader by the EFI
-firmware, 'table' is the EFI system table - these are the first two
-arguments of the "handoff state" as described in section 2.3 of the
-UEFI specification. 'bp' is the boot loader-allocated boot params.
-
-The boot loader *must* fill out the following fields in bp::
-
-  - hdr.cmd_line_ptr
-  - hdr.ramdisk_image (if applicable)
-  - hdr.ramdisk_size  (if applicable)
-
-All other fields should be zero.
-
-NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
-      entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
-      loading protocol (refer to [0] for an example of the bootloader side of
-      this), which removes the need for any knowledge on the part of the EFI
-      bootloader regarding the internal representation of boot_params or any
-      requirements/limitations regarding the placement of the command line
-      and ramdisk in memory, or the placement of the kernel image itself.
-
-[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0
diff --git a/Documentation/x86/booting-dt.rst b/Documentation/x86/booting-dt.rst
deleted file mode 100644
index 965a374071ab..000000000000
--- a/Documentation/x86/booting-dt.rst
+++ /dev/null
@@ -1,21 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-DeviceTree Booting
-------------------
-
-  There is one single 32bit entry point to the kernel at code32_start,
-  the decompressor (the real mode entry point goes to the same  32bit
-  entry point once it switched into protected mode). That entry point
-  supports one calling convention which is documented in
-  Documentation/x86/boot.rst
-  The physical pointer to the device-tree block is passed via setup_data
-  which requires at least boot protocol 2.09.
-  The type filed is defined as
-
-  #define SETUP_DTB                      2
-
-  This device-tree is used as an extension to the "boot page". As such it
-  does not parse / consider data which is already covered by the boot
-  page. This includes memory size, reserved ranges, command line arguments
-  or initrd address. It simply holds information which can not be retrieved
-  otherwise like interrupt routing or a list of devices behind an I2C bus.
diff --git a/Documentation/x86/buslock.rst b/Documentation/x86/buslock.rst
deleted file mode 100644
index 31ec0ef78086..000000000000
--- a/Documentation/x86/buslock.rst
+++ /dev/null
@@ -1,132 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. include:: <isonum.txt>
-
-===============================
-Bus lock detection and handling
-===============================
-
-:Copyright: |copy| 2021 Intel Corporation
-:Authors: - Fenghua Yu <fenghua.yu@intel.com>
-          - Tony Luck <tony.luck@intel.com>
-
-Problem
-=======
-
-A split lock is any atomic operation whose operand crosses two cache lines.
-Since the operand spans two cache lines and the operation must be atomic,
-the system locks the bus while the CPU accesses the two cache lines.
-
-A bus lock is acquired through either split locked access to writeback (WB)
-memory or any locked access to non-WB memory. This is typically thousands of
-cycles slower than an atomic operation within a cache line. It also disrupts
-performance on other cores and brings the whole system to its knees.
-
-Detection
-=========
-
-Intel processors may support either or both of the following hardware
-mechanisms to detect split locks and bus locks.
-
-#AC exception for split lock detection
---------------------------------------
-
-Beginning with the Tremont Atom CPU split lock operations may raise an
-Alignment Check (#AC) exception when a split lock operation is attemped.
-
-#DB exception for bus lock detection
-------------------------------------
-
-Some CPUs have the ability to notify the kernel by an #DB trap after a user
-instruction acquires a bus lock and is executed. This allows the kernel to
-terminate the application or to enforce throttling.
-
-Software handling
-=================
-
-The kernel #AC and #DB handlers handle bus lock based on the kernel
-parameter "split_lock_detect". Here is a summary of different options:
-
-+------------------+----------------------------+-----------------------+
-|split_lock_detect=|#AC for split lock		|#DB for bus lock	|
-+------------------+----------------------------+-----------------------+
-|off	  	   |Do nothing			|Do nothing		|
-+------------------+----------------------------+-----------------------+
-|warn		   |Kernel OOPs			|Warn once per task and |
-|(default)	   |Warn once per task, add a	|and continues to run.  |
-|		   |delay, add synchronization	|			|
-|		   |to prevent more than one	|			|
-|		   |core from executing a	|			|
-|		   |split lock in parallel.	|			|
-|		   |sysctl split_lock_mitigate	|			|
-|		   |can be used to avoid the	|			|
-|		   |delay and synchronization	|			|
-|		   |When both features are	|			|
-|		   |supported, warn in #AC	|			|
-+------------------+----------------------------+-----------------------+
-|fatal		   |Kernel OOPs			|Send SIGBUS to user.	|
-|		   |Send SIGBUS to user		|			|
-|		   |When both features are	|			|
-|		   |supported, fatal in #AC	|			|
-+------------------+----------------------------+-----------------------+
-|ratelimit:N	   |Do nothing			|Limit bus lock rate to	|
-|(0 < N <= 1000)   |				|N bus locks per second	|
-|		   |				|system wide and warn on|
-|		   |				|bus locks.		|
-+------------------+----------------------------+-----------------------+
-
-Usages
-======
-
-Detecting and handling bus lock may find usages in various areas:
-
-It is critical for real time system designers who build consolidated real
-time systems. These systems run hard real time code on some cores and run
-"untrusted" user processes on other cores. The hard real time cannot afford
-to have any bus lock from the untrusted processes to hurt real time
-performance. To date the designers have been unable to deploy these
-solutions as they have no way to prevent the "untrusted" user code from
-generating split lock and bus lock to block the hard real time code to
-access memory during bus locking.
-
-It's also useful for general computing to prevent guests or user
-applications from slowing down the overall system by executing instructions
-with bus lock.
-
-
-Guidance
-========
-off
----
-
-Disable checking for split lock and bus lock. This option can be useful if
-there are legacy applications that trigger these events at a low rate so
-that mitigation is not needed.
-
-warn
-----
-
-A warning is emitted when a bus lock is detected which allows to identify
-the offending application. This is the default behavior.
-
-fatal
------
-
-In this case, the bus lock is not tolerated and the process is killed.
-
-ratelimit
----------
-
-A system wide bus lock rate limit N is specified where 0 < N <= 1000. This
-allows a bus lock rate up to N bus locks per second. When the bus lock rate
-is exceeded then any task which is caught via the buslock #DB exception is
-throttled by enforced sleeps until the rate goes under the limit again.
-
-This is an effective mitigation in cases where a minimal impact can be
-tolerated, but an eventual Denial of Service attack has to be prevented. It
-allows to identify the offending processes and analyze whether they are
-malicious or just badly written.
-
-Selecting a rate limit of 1000 allows the bus to be locked for up to about
-seven million cycles each second (assuming 7000 cycles for each bus
-lock). On a 2 GHz processor that would be about 0.35% system slowdown.
diff --git a/Documentation/x86/cpuinfo.rst b/Documentation/x86/cpuinfo.rst
deleted file mode 100644
index 08246e8ac835..000000000000
--- a/Documentation/x86/cpuinfo.rst
+++ /dev/null
@@ -1,154 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=================
-x86 Feature Flags
-=================
-
-Introduction
-============
-
-On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition
-in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature
-or KVM want to expose the feature to a KVM guest, it can and should have
-an X86_FEATURE_* defined. These flags represent hardware features as
-well as software features.
-
-If users want to know if a feature is available on a given system, they
-try to find the flag in /proc/cpuinfo. If a given flag is present, it
-means that the kernel supports it and is currently making it available.
-If such flag represents a hardware feature, it also means that the
-hardware supports it.
-
-If the expected flag does not appear in /proc/cpuinfo, things are murkier.
-Users need to find out the reason why the flag is missing and find the way
-how to enable it, which is not always easy. There are several factors that
-can explain missing flags: the expected feature failed to enable, the feature
-is missing in hardware, platform firmware did not enable it, the feature is
-disabled at build or run time, an old kernel is in use, or the kernel does
-not support the feature and thus has not enabled it. In general, /proc/cpuinfo
-shows features which the kernel supports. For a full list of CPUID flags
-which the CPU supports, use tools/arch/x86/kcpuid.
-
-How are feature flags created?
-==============================
-
-a: Feature flags can be derived from the contents of CPUID leaves.
-------------------------------------------------------------------
-These feature definitions are organized mirroring the layout of CPUID
-leaves and grouped in words with offsets as mapped in enum cpuid_leafs
-in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details).
-If a feature is defined with a X86_FEATURE_<name> definition in
-cpufeatures.h, and if it is detected at run time, the flags will be
-displayed accordingly in /proc/cpuinfo. For example, the flag "avx2"
-comes from X86_FEATURE_AVX2 in cpufeatures.h.
-
-b: Flags can be from scattered CPUID-based features.
-----------------------------------------------------
-Hardware features enumerated in sparsely populated CPUID leaves get
-software-defined values. Still, CPUID needs to be queried to determine
-if a given feature is present. This is done in init_scattered_cpuid_features().
-For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is
-checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1].
-
-The intent of scattering CPUID leaves is to not bloat struct
-cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf
-[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1]
-has only one feature and would waste 31 bits of space in the x86_capability[]
-array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted
-memory is not trivial.
-
-c: Flags can be created synthetically under certain conditions for hardware features.
--------------------------------------------------------------------------------------
-Examples of conditions include whether certain features are present in
-MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed
-conditions are met, the features are enabled by the set_cpu_cap or
-setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS,
-the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and
-"split_lock_detect" will be displayed. The flag "ring3mwait" will be
-displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors.
-
-d: Flags can represent purely software features.
-------------------------------------------------
-These flags do not represent hardware features. Instead, they represent a
-software feature implemented in the kernel. For example, Kernel Page Table
-Isolation is purely software feature and its feature flag X86_FEATURE_PTI is
-also defined in cpufeatures.h.
-
-Naming of Flags
-===============
-
-The script arch/x86/kernel/cpu/mkcapflags.sh processes the
-#define X86_FEATURE_<name> from cpufeatures.h and generates the
-x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the
-resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming
-of flags in the x86_cap/bug_flags[] are as follows:
-
-a: The name of the flag is from the string in X86_FEATURE_<name> by default.
-----------------------------------------------------------------------------
-By default, the flag <name> in /proc/cpuinfo is extracted from the respective
-X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from
-X86_FEATURE_AVX2.
-
-b: The naming can be overridden.
---------------------------------
-If the comment on the line for the #define X86_FEATURE_* starts with a
-double-quote character (""), the string inside the double-quote characters
-will be the name of the flags. For example, the flag "sse4_1" comes from
-the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition.
-
-There are situations in which overriding the displayed name of the flag is
-needed. For instance, /proc/cpuinfo is a userspace interface and must remain
-constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one
-shall override the new naming with the name already used in /proc/cpuinfo.
-
-c: The naming override can be "", which means it will not appear in /proc/cpuinfo.
-----------------------------------------------------------------------------------
-The feature shall be omitted from /proc/cpuinfo if it does not make sense for
-the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is
-defined in cpufeatures.h but that flag is an internal kernel feature used
-in the alternative runtime patching functionality. So, its name is overridden
-with "". Its flag will not appear in /proc/cpuinfo.
-
-Flags are missing when one or more of these happen
-==================================================
-
-a: The hardware does not enumerate support for it.
---------------------------------------------------
-For example, when a new kernel is running on old hardware or the feature is
-not enabled by boot firmware. Even if the hardware is new, there might be a
-problem enabling the feature at run time, the flag will not be displayed.
-
-b: The kernel does not know about the flag.
--------------------------------------------
-For example, when an old kernel is running on new hardware.
-
-c: The kernel disabled support for it at compile-time.
-------------------------------------------------------
-For example, if 5-level-paging is not enabled when building (i.e.,
-CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_.
-Even though the feature will still be detected via CPUID, the kernel disables
-it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57).
-
-d: The feature is disabled at boot-time.
-----------------------------------------
-A feature can be disabled either using a command-line parameter or because
-it failed to be enabled. The command-line parameter clearcpuid= can be used
-to disable features using the feature number as defined in
-/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction
-Protection can be disabled using clearcpuid=514. The number 514 is calculated
-from #define X86_FEATURE_UMIP (16*32 + 2).
-
-In addition, there exists a variety of custom command-line parameters that
-disable specific features. The list of parameters includes, but is not limited
-to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using
-"no5lvl".
-
-e: The feature was known to be non-functional.
-----------------------------------------------
-The feature was known to be non-functional because a dependency was
-missing at runtime. For example, AVX flags will not show up if XSAVE feature
-is disabled since they depend on XSAVE feature. Another example would be broken
-CPUs and them missing microcode patches. Due to that, the kernel decides not to
-enable a feature.
-
-.. [#f1] 5-level paging uses linear address of 57 bits.
diff --git a/Documentation/x86/earlyprintk.rst b/Documentation/x86/earlyprintk.rst
deleted file mode 100644
index 51ef11e8f725..000000000000
--- a/Documentation/x86/earlyprintk.rst
+++ /dev/null
@@ -1,151 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============
-Early Printk
-============
-
-Mini-HOWTO for using the earlyprintk=dbgp boot option with a
-USB2 Debug port key and a debug cable, on x86 systems.
-
-You need two computers, the 'USB debug key' special gadget and
-two USB cables, connected like this::
-
-  [host/target] <-------> [USB debug key] <-------> [client/console]
-
-Hardware requirements
-=====================
-
-  a) Host/target system needs to have USB debug port capability.
-
-     You can check this capability by looking at a 'Debug port' bit in
-     the lspci -vvv output::
-
-       # lspci -vvv
-       ...
-       00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI])
-               Subsystem: Lenovo ThinkPad T61
-               Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
-               Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
-               Latency: 0
-               Interrupt: pin D routed to IRQ 19
-               Region 0: Memory at fe227000 (32-bit, non-prefetchable) [size=1K]
-               Capabilities: [50] Power Management version 2
-                       Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
-                       Status: D0 PME-Enable- DSel=0 DScale=0 PME+
-               Capabilities: [58] Debug port: BAR=1 offset=00a0
-                            ^^^^^^^^^^^ <==================== [ HERE ]
-               Kernel driver in use: ehci_hcd
-               Kernel modules: ehci-hcd
-       ...
-
-     .. note::
-       If your system does not list a debug port capability then you probably
-       won't be able to use the USB debug key.
-
-  b) You also need a NetChip USB debug cable/key:
-
-        http://www.plxtech.com/products/NET2000/NET20DC/default.asp
-
-     This is a small blue plastic connector with two USB connections;
-     it draws power from its USB connections.
-
-  c) You need a second client/console system with a high speed USB 2.0 port.
-
-  d) The NetChip device must be plugged directly into the physical
-     debug port on the "host/target" system. You cannot use a USB hub in
-     between the physical debug port and the "host/target" system.
-
-     The EHCI debug controller is bound to a specific physical USB
-     port and the NetChip device will only work as an early printk
-     device in this port.  The EHCI host controllers are electrically
-     wired such that the EHCI debug controller is hooked up to the
-     first physical port and there is no way to change this via software.
-     You can find the physical port through experimentation by trying
-     each physical port on the system and rebooting.  Or you can try
-     and use lsusb or look at the kernel info messages emitted by the
-     usb stack when you plug a usb device into various ports on the
-     "host/target" system.
-
-     Some hardware vendors do not expose the usb debug port with a
-     physical connector and if you find such a device send a complaint
-     to the hardware vendor, because there is no reason not to wire
-     this port into one of the physically accessible ports.
-
-  e) It is also important to note, that many versions of the NetChip
-     device require the "client/console" system to be plugged into the
-     right hand side of the device (with the product logo facing up and
-     readable left to right).  The reason being is that the 5 volt
-     power supply is taken from only one side of the device and it
-     must be the side that does not get rebooted.
-
-Software requirements
-=====================
-
-  a) On the host/target system:
-
-    You need to enable the following kernel config option::
-
-      CONFIG_EARLY_PRINTK_DBGP=y
-
-    And you need to add the boot command line: "earlyprintk=dbgp".
-
-    .. note::
-      If you are using Grub, append it to the 'kernel' line in
-      /etc/grub.conf.  If you are using Grub2 on a BIOS firmware system,
-      append it to the 'linux' line in /boot/grub2/grub.cfg. If you are
-      using Grub2 on an EFI firmware system, append it to the 'linux'
-      or 'linuxefi' line in /boot/grub2/grub.cfg or
-      /boot/efi/EFI/<distro>/grub.cfg.
-
-    On systems with more than one EHCI debug controller you must
-    specify the correct EHCI debug controller number.  The ordering
-    comes from the PCI bus enumeration of the EHCI controllers.  The
-    default with no number argument is "0" or the first EHCI debug
-    controller.  To use the second EHCI debug controller, you would
-    use the command line: "earlyprintk=dbgp1"
-
-    .. note::
-      normally earlyprintk console gets turned off once the
-      regular console is alive - use "earlyprintk=dbgp,keep" to keep
-      this channel open beyond early bootup. This can be useful for
-      debugging crashes under Xorg, etc.
-
-  b) On the client/console system:
-
-    You should enable the following kernel config option::
-
-      CONFIG_USB_SERIAL_DEBUG=y
-
-    On the next bootup with the modified kernel you should
-    get a /dev/ttyUSBx device(s).
-
-    Now this channel of kernel messages is ready to be used: start
-    your favorite terminal emulator (minicom, etc.) and set
-    it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to
-    see the raw output.
-
-  c) On Nvidia Southbridge based systems: the kernel will try to probe
-     and find out which port has a debug device connected.
-
-Testing
-=======
-
-You can test the output by using earlyprintk=dbgp,keep and provoking
-kernel messages on the host/target system. You can provoke a harmless
-kernel message by for example doing::
-
-     echo h > /proc/sysrq-trigger
-
-On the host/target system you should see this help line in "dmesg" output::
-
-     SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
-
-On the client/console system do::
-
-       cat /dev/ttyUSB0
-
-And you should see the help line above displayed shortly after you've
-provoked it on the host system.
-
-If it does not work then please ask about it on the linux-kernel@vger.kernel.org
-mailing list or contact the x86 maintainers.
diff --git a/Documentation/x86/elf_auxvec.rst b/Documentation/x86/elf_auxvec.rst
deleted file mode 100644
index 18e4744717f9..000000000000
--- a/Documentation/x86/elf_auxvec.rst
+++ /dev/null
@@ -1,53 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==================================
-x86-specific ELF Auxiliary Vectors
-==================================
-
-This document describes the semantics of the x86 auxiliary vectors.
-
-Introduction
-============
-
-ELF Auxiliary vectors enable the kernel to efficiently provide
-configuration-specific parameters to userspace. In this example, a program
-allocates an alternate stack based on the kernel-provided size::
-
-   #include <sys/auxv.h>
-   #include <elf.h>
-   #include <signal.h>
-   #include <stdlib.h>
-   #include <assert.h>
-   #include <err.h>
-
-   #ifndef AT_MINSIGSTKSZ
-   #define AT_MINSIGSTKSZ	51
-   #endif
-
-   ....
-   stack_t ss;
-
-   ss.ss_sp = malloc(ss.ss_size);
-   assert(ss.ss_sp);
-
-   ss.ss_size = getauxval(AT_MINSIGSTKSZ) + SIGSTKSZ;
-   ss.ss_flags = 0;
-
-   if (sigaltstack(&ss, NULL))
-        err(1, "sigaltstack");
-
-
-The exposed auxiliary vectors
-=============================
-
-AT_SYSINFO is used for locating the vsyscall entry point.  It is not
-exported on 64-bit mode.
-
-AT_SYSINFO_EHDR is the start address of the page containing the vDSO.
-
-AT_MINSIGSTKSZ denotes the minimum stack size required by the kernel to
-deliver a signal to user-space.  AT_MINSIGSTKSZ comprehends the space
-consumed by the kernel to accommodate the user context for the current
-hardware configuration.  It does not comprehend subsequent user-space stack
-consumption, which must be added by the user.  (e.g. Above, user-space adds
-SIGSTKSZ to AT_MINSIGSTKSZ.)
diff --git a/Documentation/x86/entry_64.rst b/Documentation/x86/entry_64.rst
deleted file mode 100644
index 0afdce3c06f4..000000000000
--- a/Documentation/x86/entry_64.rst
+++ /dev/null
@@ -1,110 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============
-Kernel Entries
-==============
-
-This file documents some of the kernel entries in
-arch/x86/entry/entry_64.S.  A lot of this explanation is adapted from
-an email from Ingo Molnar:
-
-https://lore.kernel.org/r/20110529191055.GC9835%40elte.hu
-
-The x86 architecture has quite a few different ways to jump into
-kernel code.  Most of these entry points are registered in
-arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S
-for 64-bit, arch/x86/entry/entry_32.S for 32-bit and finally
-arch/x86/entry/entry_64_compat.S which implements the 32-bit compatibility
-syscall entry points and thus provides for 32-bit processes the
-ability to execute syscalls when running on 64-bit kernels.
-
-The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
-
-Some of these entries are:
-
- - system_call: syscall instruction from 64-bit code.
-
- - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall
-   either way.
-
- - entry_INT80_compat, ia32_sysenter: syscall and sysenter from 32-bit
-   code
-
- - interrupt: An array of entries.  Every IDT vector that doesn't
-   explicitly point somewhere else gets set to the corresponding
-   value in interrupts.  These point to a whole array of
-   magically-generated functions that make their way to common_interrupt()
-   with the interrupt number as a parameter.
-
- - APIC interrupts: Various special-purpose interrupts for things
-   like TLB shootdown.
-
- - Architecturally-defined exceptions like divide_error.
-
-There are a few complexities here.  The different x86-64 entries
-have different calling conventions.  The syscall and sysenter
-instructions have their own peculiar calling conventions.  Some of
-the IDT entries push an error code onto the stack; others don't.
-IDT entries using the IST alternative stack mechanism need their own
-magic to get the stack frames right.  (You can find some
-documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
-Volume 3, Chapter 6.)
-
-Dealing with the swapgs instruction is especially tricky.  Swapgs
-toggles whether gs is the kernel gs or the user gs.  The swapgs
-instruction is rather fragile: it must nest perfectly and only in
-single depth, it should only be used if entering from user mode to
-kernel mode and then when returning to user-space, and precisely
-so. If we mess that up even slightly, we crash.
-
-So when we have a secondary entry, already in kernel mode, we *must
-not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
-not switched/swapped yet.
-
-Now, there's a secondary complication: there's a cheap way to test
-which mode the CPU is in and an expensive way.
-
-The cheap way is to pick this info off the entry frame on the kernel
-stack, from the CS of the ptregs area of the kernel stack::
-
-	xorl %ebx,%ebx
-	testl $3,CS+8(%rsp)
-	je error_kernelspace
-	SWAPGS
-
-The expensive (paranoid) way is to read back the MSR_GS_BASE value
-(which is what SWAPGS modifies)::
-
-	movl $1,%ebx
-	movl $MSR_GS_BASE,%ecx
-	rdmsr
-	testl %edx,%edx
-	js 1f   /* negative -> in kernel */
-	SWAPGS
-	xorl %ebx,%ebx
-  1:	ret
-
-If we are at an interrupt or user-trap/gate-alike boundary then we can
-use the faster check: the stack will be a reliable indicator of
-whether SWAPGS was already done: if we see that we are a secondary
-entry interrupting kernel mode execution, then we know that the GS
-base has already been switched. If it says that we interrupted
-user-space execution then we must do the SWAPGS.
-
-But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
-which might have triggered right after a normal entry wrote CS to the
-stack but before we executed SWAPGS, then the only safe way to check
-for GS is the slower method: the RDMSR.
-
-Therefore, super-atomic entries (except NMI, which is handled separately)
-must use idtentry with paranoid=1 to handle gsbase correctly.  This
-triggers three main behavior changes:
-
- - Interrupt entry will use the slower gsbase check.
- - Interrupt entry from user mode will switch off the IST stack.
- - Interrupt exit to kernel mode will not attempt to reschedule.
-
-We try to only use IST entries and the paranoid entry code for vectors
-that absolutely need the more expensive check for the GS base - and we
-generate all 'normal' entry points with the regular (faster) paranoid=0
-variant.
diff --git a/Documentation/x86/exception-tables.rst b/Documentation/x86/exception-tables.rst
deleted file mode 100644
index efde1fef4fbd..000000000000
--- a/Documentation/x86/exception-tables.rst
+++ /dev/null
@@ -1,357 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============================
-Kernel level exception handling
-===============================
-
-Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
-
-When a process runs in kernel mode, it often has to access user
-mode memory whose address has been passed by an untrusted program.
-To protect itself the kernel has to verify this address.
-
-In older versions of Linux this was done with the
-int verify_area(int type, const void * addr, unsigned long size)
-function (which has since been replaced by access_ok()).
-
-This function verified that the memory area starting at address
-'addr' and of size 'size' was accessible for the operation specified
-in type (read or write). To do this, verify_read had to look up the
-virtual memory area (vma) that contained the address addr. In the
-normal case (correctly working program), this test was successful.
-It only failed for a few buggy programs. In some kernel profiling
-tests, this normally unneeded verification used up a considerable
-amount of time.
-
-To overcome this situation, Linus decided to let the virtual memory
-hardware present in every Linux-capable CPU handle this test.
-
-How does this work?
-
-Whenever the kernel tries to access an address that is currently not
-accessible, the CPU generates a page fault exception and calls the
-page fault handler::
-
-  void exc_page_fault(struct pt_regs *regs, unsigned long error_code)
-
-in arch/x86/mm/fault.c. The parameters on the stack are set up by
-the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
-regs is a pointer to the saved registers on the stack, error_code
-contains a reason code for the exception.
-
-exc_page_fault() first obtains the inaccessible address from the CPU
-control register CR2. If the address is within the virtual address
-space of the process, the fault probably occurred, because the page
-was not swapped in, write protected or something similar. However,
-we are interested in the other case: the address is not valid, there
-is no vma that contains this address. In this case, the kernel jumps
-to the bad_area label.
-
-There it uses the address of the instruction that caused the exception
-(i.e. regs->eip) to find an address where the execution can continue
-(fixup). If this search is successful, the fault handler modifies the
-return address (again regs->eip) and returns. The execution will
-continue at the address in fixup.
-
-Where does fixup point to?
-
-Since we jump to the contents of fixup, fixup obviously points
-to executable code. This code is hidden inside the user access macros.
-I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h
-as an example. The definition is somewhat hard to follow, so let's peek at
-the code generated by the preprocessor and the compiler. I selected
-the get_user() call in drivers/char/sysrq.c for a detailed examination.
-
-The original code in sysrq.c line 587::
-
-        get_user(c, buf);
-
-The preprocessor output (edited to become somewhat readable)::
-
-  (
-    {
-      long __gu_err = - 14 , __gu_val = 0;
-      const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
-      if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
-        (((sizeof(*(buf))) <= 0xC0000000UL) &&
-        ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
-        do {
-          __gu_err  = 0;
-          switch ((sizeof(*(buf)))) {
-            case 1:
-              __asm__ __volatile__(
-                "1:      mov" "b" " %2,%" "b" "1\n"
-                "2:\n"
-                ".section .fixup,\"ax\"\n"
-                "3:      movl %3,%0\n"
-                "        xor" "b" " %" "b" "1,%" "b" "1\n"
-                "        jmp 2b\n"
-                ".section __ex_table,\"a\"\n"
-                "        .align 4\n"
-                "        .long 1b,3b\n"
-                ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
-                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
-                break;
-            case 2:
-              __asm__ __volatile__(
-                "1:      mov" "w" " %2,%" "w" "1\n"
-                "2:\n"
-                ".section .fixup,\"ax\"\n"
-                "3:      movl %3,%0\n"
-                "        xor" "w" " %" "w" "1,%" "w" "1\n"
-                "        jmp 2b\n"
-                ".section __ex_table,\"a\"\n"
-                "        .align 4\n"
-                "        .long 1b,3b\n"
-                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
-                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
-                break;
-            case 4:
-              __asm__ __volatile__(
-                "1:      mov" "l" " %2,%" "" "1\n"
-                "2:\n"
-                ".section .fixup,\"ax\"\n"
-                "3:      movl %3,%0\n"
-                "        xor" "l" " %" "" "1,%" "" "1\n"
-                "        jmp 2b\n"
-                ".section __ex_table,\"a\"\n"
-                "        .align 4\n"        "        .long 1b,3b\n"
-                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
-                              (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
-                break;
-            default:
-              (__gu_val) = __get_user_bad();
-          }
-        } while (0) ;
-      ((c)) = (__typeof__(*((buf))))__gu_val;
-      __gu_err;
-    }
-  );
-
-WOW! Black GCC/assembly magic. This is impossible to follow, so let's
-see what code gcc generates::
-
- >         xorl %edx,%edx
- >         movl current_set,%eax
- >         cmpl $24,788(%eax)
- >         je .L1424
- >         cmpl $-1073741825,64(%esp)
- >         ja .L1423
- > .L1424:
- >         movl %edx,%eax
- >         movl 64(%esp),%ebx
- > #APP
- > 1:      movb (%ebx),%dl                /* this is the actual user access */
- > 2:
- > .section .fixup,"ax"
- > 3:      movl $-14,%eax
- >         xorb %dl,%dl
- >         jmp 2b
- > .section __ex_table,"a"
- >         .align 4
- >         .long 1b,3b
- > .text
- > #NO_APP
- > .L1423:
- >         movzbl %dl,%esi
-
-The optimizer does a good job and gives us something we can actually
-understand. Can we? The actual user access is quite obvious. Thanks
-to the unified address space we can just access the address in user
-memory. But what does the .section stuff do?????
-
-To understand this we have to look at the final kernel::
-
- > objdump --section-headers vmlinux
- >
- > vmlinux:     file format elf32-i386
- >
- > Sections:
- > Idx Name          Size      VMA       LMA       File off  Algn
- >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
- >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
- >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
- >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
- >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
- >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
- >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
- >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
- >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
- >                   CONTENTS, ALLOC, LOAD, DATA
- >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
- >                   ALLOC
- >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
- >                   CONTENTS, READONLY
- >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
- >                   CONTENTS, READONLY
-
-There are obviously 2 non standard ELF sections in the generated object
-file. But first we want to find out what happened to our code in the
-final kernel executable::
-
- > objdump --disassemble --section=.text vmlinux
- >
- > c017e785 <do_con_write+c1> xorl   %edx,%edx
- > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
- > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
- > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
- > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
- > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
- > c017e79f <do_con_write+db> movl   %edx,%eax
- > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
- > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
- > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
-
-The whole user memory access is reduced to 10 x86 machine instructions.
-The instructions bracketed in the .section directives are no longer
-in the normal execution path. They are located in a different section
-of the executable file::
-
- > objdump --disassemble --section=.fixup vmlinux
- >
- > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
- > c0199ffa <.fixup+10ba> xorb   %dl,%dl
- > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
-
-And finally::
-
- > objdump --full-contents --section=__ex_table vmlinux
- >
- >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
- >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
- >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................
-
-or in human readable byte order::
-
- >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
- >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
-                               ^^^^^^^^^^^^^^^^^
-                               this is the interesting part!
- >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
-
-What happened? The assembly directives::
-
-  .section .fixup,"ax"
-  .section __ex_table,"a"
-
-told the assembler to move the following code to the specified
-sections in the ELF object file. So the instructions::
-
-  3:      movl $-14,%eax
-          xorb %dl,%dl
-          jmp 2b
-
-ended up in the .fixup section of the object file and the addresses::
-
-        .long 1b,3b
-
-ended up in the __ex_table section of the object file. 1b and 3b
-are local labels. The local label 1b (1b stands for next label 1
-backward) is the address of the instruction that might fault, i.e.
-in our case the address of the label 1 is c017e7a5:
-the original assembly code: > 1:      movb (%ebx),%dl
-and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
-
-The local label 3 (backwards again) is the address of the code to handle
-the fault, in our case the actual value is c0199ff5:
-the original assembly code: > 3:      movl $-14,%eax
-and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
-
-If the fixup was able to handle the exception, control flow may be returned
-to the instruction after the one that triggered the fault, ie. local label 2b.
-
-The assembly code::
-
- > .section __ex_table,"a"
- >         .align 4
- >         .long 1b,3b
-
-becomes the value pair::
-
- >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
-                               ^this is ^this is
-                               1b       3b
-
-c017e7a5,c0199ff5 in the exception table of the kernel.
-
-So, what actually happens if a fault from kernel mode with no suitable
-vma occurs?
-
-#. access to invalid address::
-
-    > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
-#. MMU generates exception
-#. CPU calls exc_page_fault()
-#. exc_page_fault() calls do_user_addr_fault()
-#. do_user_addr_fault() calls kernelmode_fixup_or_oops()
-#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5);
-#. fixup_exception() calls search_exception_tables()
-#. search_exception_tables() looks up the address c017e7a5 in the
-   exception table (i.e. the contents of the ELF section __ex_table)
-   and returns the address of the associated fault handle code c0199ff5.
-#. fixup_exception() modifies its own return address to point to the fault
-   handle code and returns.
-#. execution continues in the fault handling code.
-#. a) EAX becomes -EFAULT (== -14)
-   b) DL  becomes zero (the value we "read" from user space)
-   c) execution continues at local label 2 (address of the
-      instruction immediately after the faulting user access).
-
-The steps 8a to 8c in a certain way emulate the faulting instruction.
-
-That's it, mostly. If you look at our example, you might ask why
-we set EAX to -EFAULT in the exception handler code. Well, the
-get_user() macro actually returns a value: 0, if the user access was
-successful, -EFAULT on failure. Our original code did not test this
-return value, however the inline assembly code in get_user() tries to
-return -EFAULT. GCC selected EAX to return this value.
-
-NOTE:
-Due to the way that the exception table is built and needs to be ordered,
-only use exceptions for code in the .text section.  Any other section
-will cause the exception table to not be sorted correctly, and the
-exceptions will fail.
-
-Things changed when 64-bit support was added to x86 Linux. Rather than
-double the size of the exception table by expanding the two entries
-from 32-bits to 64 bits, a clever trick was used to store addresses
-as relative offsets from the table itself. The assembly code changed
-from::
-
-    .long 1b,3b
-  to:
-          .long (from) - .
-          .long (to) - .
-
-and the C-code that uses these values converts back to absolute addresses
-like this::
-
-	ex_insn_addr(const struct exception_table_entry *x)
-	{
-		return (unsigned long)&x->insn + x->insn;
-	}
-
-In v4.6 the exception table entry was expanded with a new field "handler".
-This is also 32-bits wide and contains a third relative function
-pointer which points to one of:
-
-1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
-     This is legacy case that just jumps to the fixup code
-
-2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
-     This case provides the fault number of the trap that occurred at
-     entry->insn. It is used to distinguish page faults from machine
-     check.
-
-More functions can easily be added.
-
-CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post
-link of the kernel image, via a host utility scripts/sorttable. It will set the
-symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section
-at boot time. With the exception table sorted, at runtime when an exception
-occurs we can quickly lookup the __ex_table entry via binary search.
-
-This is not just a boot time optimization, some architectures require this
-table to be sorted in order to handle exceptions relatively early in the boot
-process. For example, i386 makes use of this form of exception handling before
-paging support is even enabled!
diff --git a/Documentation/x86/features.rst b/Documentation/x86/features.rst
deleted file mode 100644
index b663f15053ce..000000000000
--- a/Documentation/x86/features.rst
+++ /dev/null
@@ -1,3 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. kernel-feat:: $srctree/Documentation/features x86
diff --git a/Documentation/x86/i386/IO-APIC.rst b/Documentation/x86/i386/IO-APIC.rst
deleted file mode 100644
index ce4d8df15e7c..000000000000
--- a/Documentation/x86/i386/IO-APIC.rst
+++ /dev/null
@@ -1,123 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=======
-IO-APIC
-=======
-
-:Author: Ingo Molnar <mingo@kernel.org>
-
-Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC',
-which is an enhanced interrupt controller. It enables us to route
-hardware interrupts to multiple CPUs, or to CPU groups. Without an
-IO-APIC, interrupts from hardware will be delivered only to the
-CPU which boots the operating system (usually CPU#0).
-
-Linux supports all variants of compliant SMP boards, including ones with
-multiple IO-APICs. Multiple IO-APICs are used in high-end servers to
-distribute IRQ load further.
-
-There are (a few) known breakages in certain older boards, such bugs are
-usually worked around by the kernel. If your MP-compliant SMP board does
-not boot Linux, then consult the linux-smp mailing list archives first.
-
-If your box boots fine with enabled IO-APIC IRQs, then your
-/proc/interrupts will look like this one::
-
-  hell:~> cat /proc/interrupts
-             CPU0
-    0:    1360293    IO-APIC-edge  timer
-    1:          4    IO-APIC-edge  keyboard
-    2:          0          XT-PIC  cascade
-   13:          1          XT-PIC  fpu
-   14:       1448    IO-APIC-edge  ide0
-   16:      28232   IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet
-   17:      51304   IO-APIC-level  eth0
-  NMI:          0
-  ERR:          0
-  hell:~>
-
-Some interrupts are still listed as 'XT PIC', but this is not a problem;
-none of those IRQ sources is performance-critical.
-
-
-In the unlikely case that your board does not create a working mp-table,
-you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This
-is non-trivial though and cannot be automated. One sample /etc/lilo.conf
-entry::
-
-	append="pirq=15,11,10"
-
-The actual numbers depend on your system, on your PCI cards and on their
-PCI slot position. Usually PCI slots are 'daisy chained' before they are
-connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4
-lines)::
-
-               ,-.        ,-.        ,-.        ,-.        ,-.
-     PIRQ4 ----| |-.    ,-| |-.    ,-| |-.    ,-| |--------| |
-               |S|  \  /  |S|  \  /  |S|  \  /  |S|        |S|
-     PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l|
-               |o|  \/    |o|  \/    |o|  \/    |o|        |o|
-     PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t|
-               |1| /\     |2| /\     |3| /\     |4|        |5|
-     PIRQ1 ----| |-  `----| |-  `----| |-  `----| |--------| |
-               `-'        `-'        `-'        `-'        `-'
-
-Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD::
-
-                               ,-.
-                         INTD--| |
-                               |S|
-                         INTC--|l|
-                               |o|
-                         INTB--|t|
-                               |x|
-                         INTA--| |
-                               `-'
-
-These INTA-D PCI IRQs are always 'local to the card', their real meaning
-depends on which slot they are in. If you look at the daisy chaining diagram,
-a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of
-the PCI chipset. Most cards issue INTA, this creates optimal distribution
-between the PIRQ lines. (distributing IRQ sources properly is not a
-necessity, PCI IRQs can be shared at will, but it's a good for performance
-to have non shared interrupts). Slot5 should be used for videocards, they
-do not use interrupts normally, thus they are not daisy chained either.
-
-so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in
-Slot2, then you'll have to specify this pirq= line::
-
-	append="pirq=11,9"
-
-the following script tries to figure out such a default pirq= line from
-your PCI configuration::
-
-	echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
-
-note that this script won't work if you have skipped a few slots or if your
-board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
-connected in some strange way). E.g. if in the above case you have your SCSI
-card (IRQ11) in Slot3, and have Slot1 empty::
-
-	append="pirq=0,9,11"
-
-[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting)
-slots.]
-
-Generally, it's always possible to find out the correct pirq= settings, just
-permute all IRQ numbers properly ... it will take some time though. An
-'incorrect' pirq line will cause the booting process to hang, or a device
-won't function properly (e.g. if it's inserted as a module).
-
-If you have 2 PCI buses, then you can use up to 8 pirq values, although such
-boards tend to have a good configuration.
-
-Be prepared that it might happen that you need some strange pirq line::
-
-	append="pirq=0,0,0,0,0,0,9,11"
-
-Use smart trial-and-error techniques to find out the correct pirq line ...
-
-Good luck and mail to linux-smp@vger.kernel.org or
-linux-kernel@vger.kernel.org if you have any problems that are not covered
-by this document.
-
diff --git a/Documentation/x86/i386/index.rst b/Documentation/x86/i386/index.rst
deleted file mode 100644
index 8747cf5bbd49..000000000000
--- a/Documentation/x86/i386/index.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============
-i386 Support
-============
-
-.. toctree::
-   :maxdepth: 2
-
-   IO-APIC
diff --git a/Documentation/x86/ifs.rst b/Documentation/x86/ifs.rst
deleted file mode 100644
index 97abb696a680..000000000000
--- a/Documentation/x86/ifs.rst
+++ /dev/null
@@ -1,2 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-.. kernel-doc:: drivers/platform/x86/intel/ifs/ifs.h
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
deleted file mode 100644
index c73d133fd37c..000000000000
--- a/Documentation/x86/index.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========================
-x86-specific Documentation
-==========================
-
-.. toctree::
-   :maxdepth: 2
-   :numbered:
-
-   boot
-   booting-dt
-   cpuinfo
-   topology
-   exception-tables
-   kernel-stacks
-   entry_64
-   earlyprintk
-   orc-unwinder
-   zero-page
-   tlb
-   mtrr
-   pat
-   intel-hfi
-   iommu
-   intel_txt
-   amd-memory-encryption
-   amd_hsmp
-   tdx
-   pti
-   mds
-   microcode
-   resctrl
-   tsx_async_abort
-   buslock
-   usb-legacy-support
-   i386/index
-   x86_64/index
-   ifs
-   sva
-   sgx
-   features
-   elf_auxvec
-   xstate
diff --git a/Documentation/x86/intel-hfi.rst b/Documentation/x86/intel-hfi.rst
deleted file mode 100644
index 49dea58ea4fb..000000000000
--- a/Documentation/x86/intel-hfi.rst
+++ /dev/null
@@ -1,72 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============================================================
-Hardware-Feedback Interface for scheduling on Intel Hardware
-============================================================
-
-Overview
---------
-
-Intel has described the Hardware Feedback Interface (HFI) in the Intel 64 and
-IA-32 Architectures Software Developer's Manual (Intel SDM) Volume 3 Section
-14.6 [1]_.
-
-The HFI gives the operating system a performance and energy efficiency
-capability data for each CPU in the system. Linux can use the information from
-the HFI to influence task placement decisions.
-
-The Hardware Feedback Interface
--------------------------------
-
-The Hardware Feedback Interface provides to the operating system information
-about the performance and energy efficiency of each CPU in the system. Each
-capability is given as a unit-less quantity in the range [0-255]. Higher values
-indicate higher capability. Energy efficiency and performance are reported in
-separate capabilities. Even though on some systems these two metrics may be
-related, they are specified as independent capabilities in the Intel SDM.
-
-These capabilities may change at runtime as a result of changes in the
-operating conditions of the system or the action of external factors. The rate
-at which these capabilities are updated is specific to each processor model. On
-some models, capabilities are set at boot time and never change. On others,
-capabilities may change every tens of milliseconds. For instance, a remote
-mechanism may be used to lower Thermal Design Power. Such change can be
-reflected in the HFI. Likewise, if the system needs to be throttled due to
-excessive heat, the HFI may reflect reduced performance on specific CPUs.
-
-The kernel or a userspace policy daemon can use these capabilities to modify
-task placement decisions. For instance, if either the performance or energy
-capabilities of a given logical processor becomes zero, it is an indication that
-the hardware recommends to the operating system to not schedule any tasks on
-that processor for performance or energy efficiency reasons, respectively.
-
-Implementation details for Linux
---------------------------------
-
-The infrastructure to handle thermal event interrupts has two parts. In the
-Local Vector Table of a CPU's local APIC, there exists a register for the
-Thermal Monitor Register. This register controls how interrupts are delivered
-to a CPU when the thermal monitor generates and interrupt. Further details
-can be found in the Intel SDM Vol. 3 Section 10.5 [1]_.
-
-The thermal monitor may generate interrupts per CPU or per package. The HFI
-generates package-level interrupts. This monitor is configured and initialized
-via a set of machine-specific registers. Specifically, the HFI interrupt and
-status are controlled via designated bits in the IA32_PACKAGE_THERM_INTERRUPT
-and IA32_PACKAGE_THERM_STATUS registers, respectively. There exists one HFI
-table per package. Further details can be found in the Intel SDM Vol. 3
-Section 14.9 [1]_.
-
-The hardware issues an HFI interrupt after updating the HFI table and is ready
-for the operating system to consume it. CPUs receive such interrupt via the
-thermal entry in the Local APIC's Local Vector Table.
-
-When servicing such interrupt, the HFI driver parses the updated table and
-relays the update to userspace using the thermal notification framework. Given
-that there may be many HFI updates every second, the updates relayed to
-userspace are throttled at a rate of CONFIG_HZ jiffies.
-
-References
-----------
-
-.. [1] https://www.intel.com/sdm
diff --git a/Documentation/x86/intel_txt.rst b/Documentation/x86/intel_txt.rst
deleted file mode 100644
index d83c1a2122c9..000000000000
--- a/Documentation/x86/intel_txt.rst
+++ /dev/null
@@ -1,227 +0,0 @@
-=====================
-Intel(R) TXT Overview
-=====================
-
-Intel's technology for safer computing, Intel(R) Trusted Execution
-Technology (Intel(R) TXT), defines platform-level enhancements that
-provide the building blocks for creating trusted platforms.
-
-Intel TXT was formerly known by the code name LaGrande Technology (LT).
-
-Intel TXT in Brief:
-
--  Provides dynamic root of trust for measurement (DRTM)
--  Data protection in case of improper shutdown
--  Measurement and verification of launched environment
-
-Intel TXT is part of the vPro(TM) brand and is also available some
-non-vPro systems.  It is currently available on desktop systems
-based on the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell
-Optiplex 755, HP dc7800, etc.) and mobile systems based on the GM45,
-PM45, and GS45 Express chipsets.
-
-For more information, see http://www.intel.com/technology/security/.
-This site also has a link to the Intel TXT MLE Developers Manual,
-which has been updated for the new released platforms.
-
-Intel TXT has been presented at various events over the past few
-years, some of which are:
-
-      - LinuxTAG 2008:
-          http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag.html
-
-      - TRUST2008:
-          http://www.trust-conference.eu/downloads/Keynote-Speakers/
-          3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf
-
-      - IDF, Shanghai:
-          http://www.prcidf.com.cn/index_en.html
-
-      - IDFs 2006, 2007
-	  (I'm not sure if/where they are online)
-
-Trusted Boot Project Overview
-=============================
-
-Trusted Boot (tboot) is an open source, pre-kernel/VMM module that
-uses Intel TXT to perform a measured and verified launch of an OS
-kernel/VMM.
-
-It is hosted on SourceForge at http://sourceforge.net/projects/tboot.
-The mercurial source repo is available at http://www.bughost.org/
-repos.hg/tboot.hg.
-
-Tboot currently supports launching Xen (open source VMM/hypervisor
-w/ TXT support since v3.2), and now Linux kernels.
-
-
-Value Proposition for Linux or "Why should you care?"
-=====================================================
-
-While there are many products and technologies that attempt to
-measure or protect the integrity of a running kernel, they all
-assume the kernel is "good" to begin with.  The Integrity
-Measurement Architecture (IMA) and Linux Integrity Module interface
-are examples of such solutions.
-
-To get trust in the initial kernel without using Intel TXT, a
-static root of trust must be used.  This bases trust in BIOS
-starting at system reset and requires measurement of all code
-executed between system reset through the completion of the kernel
-boot as well as data objects used by that code.  In the case of a
-Linux kernel, this means all of BIOS, any option ROMs, the
-bootloader and the boot config.  In practice, this is a lot of
-code/data, much of which is subject to change from boot to boot
-(e.g. changing NICs may change option ROMs).  Without reference
-hashes, these measurement changes are difficult to assess or
-confirm as benign.  This process also does not provide DMA
-protection, memory configuration/alias checks and locks, crash
-protection, or policy support.
-
-By using the hardware-based root of trust that Intel TXT provides,
-many of these issues can be mitigated.  Specifically: many
-pre-launch components can be removed from the trust chain, DMA
-protection is provided to all launched components, a large number
-of platform configuration checks are performed and values locked,
-protection is provided for any data in the event of an improper
-shutdown, and there is support for policy-based execution/verification.
-This provides a more stable measurement and a higher assurance of
-system configuration and initial state than would be otherwise
-possible.  Since the tboot project is open source, source code for
-almost all parts of the trust chain is available (excepting SMM and
-Intel-provided firmware).
-
-How Does it Work?
-=================
-
--  Tboot is an executable that is launched by the bootloader as
-   the "kernel" (the binary the bootloader executes).
--  It performs all of the work necessary to determine if the
-   platform supports Intel TXT and, if so, executes the GETSEC[SENTER]
-   processor instruction that initiates the dynamic root of trust.
-
-   -  If tboot determines that the system does not support Intel TXT
-      or is not configured correctly (e.g. the SINIT AC Module was
-      incorrect), it will directly launch the kernel with no changes
-      to any state.
-   -  Tboot will output various information about its progress to the
-      terminal, serial port, and/or an in-memory log; the output
-      locations can be configured with a command line switch.
-
--  The GETSEC[SENTER] instruction will return control to tboot and
-   tboot then verifies certain aspects of the environment (e.g. TPM NV
-   lock, e820 table does not have invalid entries, etc.).
--  It will wake the APs from the special sleep state the GETSEC[SENTER]
-   instruction had put them in and place them into a wait-for-SIPI
-   state.
-
-   -  Because the processors will not respond to an INIT or SIPI when
-      in the TXT environment, it is necessary to create a small VT-x
-      guest for the APs.  When they run in this guest, they will
-      simply wait for the INIT-SIPI-SIPI sequence, which will cause
-      VMEXITs, and then disable VT and jump to the SIPI vector.  This
-      approach seemed like a better choice than having to insert
-      special code into the kernel's MP wakeup sequence.
-
--  Tboot then applies an (optional) user-defined launch policy to
-   verify the kernel and initrd.
-
-   -  This policy is rooted in TPM NV and is described in the tboot
-      project.  The tboot project also contains code for tools to
-      create and provision the policy.
-   -  Policies are completely under user control and if not present
-      then any kernel will be launched.
-   -  Policy action is flexible and can include halting on failures
-      or simply logging them and continuing.
-
--  Tboot adjusts the e820 table provided by the bootloader to reserve
-   its own location in memory as well as to reserve certain other
-   TXT-related regions.
--  As part of its launch, tboot DMA protects all of RAM (using the
-   VT-d PMRs).  Thus, the kernel must be booted with 'intel_iommu=on'
-   in order to remove this blanket protection and use VT-d's
-   page-level protection.
--  Tboot will populate a shared page with some data about itself and
-   pass this to the Linux kernel as it transfers control.
-
-   -  The location of the shared page is passed via the boot_params
-      struct as a physical address.
-
--  The kernel will look for the tboot shared page address and, if it
-   exists, map it.
--  As one of the checks/protections provided by TXT, it makes a copy
-   of the VT-d DMARs in a DMA-protected region of memory and verifies
-   them for correctness.  The VT-d code will detect if the kernel was
-   launched with tboot and use this copy instead of the one in the
-   ACPI table.
--  At this point, tboot and TXT are out of the picture until a
-   shutdown (S<n>)
--  In order to put a system into any of the sleep states after a TXT
-   launch, TXT must first be exited.  This is to prevent attacks that
-   attempt to crash the system to gain control on reboot and steal
-   data left in memory.
-
-   -  The kernel will perform all of its sleep preparation and
-      populate the shared page with the ACPI data needed to put the
-      platform in the desired sleep state.
-   -  Then the kernel jumps into tboot via the vector specified in the
-      shared page.
-   -  Tboot will clean up the environment and disable TXT, then use the
-      kernel-provided ACPI information to actually place the platform
-      into the desired sleep state.
-   -  In the case of S3, tboot will also register itself as the resume
-      vector.  This is necessary because it must re-establish the
-      measured environment upon resume.  Once the TXT environment
-      has been restored, it will restore the TPM PCRs and then
-      transfer control back to the kernel's S3 resume vector.
-      In order to preserve system integrity across S3, the kernel
-      provides tboot with a set of memory ranges (RAM and RESERVED_KERN
-      in the e820 table, but not any memory that BIOS might alter over
-      the S3 transition) that tboot will calculate a MAC (message
-      authentication code) over and then seal with the TPM. On resume
-      and once the measured environment has been re-established, tboot
-      will re-calculate the MAC and verify it against the sealed value.
-      Tboot's policy determines what happens if the verification fails.
-      Note that the c/s 194 of tboot which has the new MAC code supports
-      this.
-
-That's pretty much it for TXT support.
-
-
-Configuring the System
-======================
-
-This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels.
-
-In BIOS, the user must enable:  TPM, TXT, VT-x, VT-d.  Not all BIOSes
-allow these to be individually enabled/disabled and the screens in
-which to find them are BIOS-specific.
-
-grub.conf needs to be modified as follows::
-
-        title Linux 2.6.29-tip w/ tboot
-          root (hd0,0)
-                kernel /tboot.gz logging=serial,vga,memory
-                module /vmlinuz-2.6.29-tip intel_iommu=on ro
-                       root=LABEL=/ rhgb console=ttyS0,115200 3
-                module /initrd-2.6.29-tip.img
-                module /Q35_SINIT_17.BIN
-
-The kernel option for enabling Intel TXT support is found under the
-Security top-level menu and is called "Enable Intel(R) Trusted
-Execution Technology (TXT)".  It is considered EXPERIMENTAL and
-depends on the generic x86 support (to allow maximum flexibility in
-kernel build options), since the tboot code will detect whether the
-platform actually supports Intel TXT and thus whether any of the
-kernel code is executed.
-
-The Q35_SINIT_17.BIN file is what Intel TXT refers to as an
-Authenticated Code Module.  It is specific to the chipset in the
-system and can also be found on the Trusted Boot site.  It is an
-(unencrypted) module signed by Intel that is used as part of the
-DRTM process to verify and configure the system.  It is signed
-because it operates at a higher privilege level in the system than
-any other macrocode and its correct operation is critical to the
-establishment of the DRTM.  The process for determining the correct
-SINIT ACM for a system is documented in the SINIT-guide.txt file
-that is on the tboot SourceForge site under the SINIT ACM downloads.
diff --git a/Documentation/x86/iommu.rst b/Documentation/x86/iommu.rst
deleted file mode 100644
index 42c7a6faa39a..000000000000
--- a/Documentation/x86/iommu.rst
+++ /dev/null
@@ -1,151 +0,0 @@
-=================
-x86 IOMMU Support
-=================
-
-The architecture specs can be obtained from the below locations.
-
-- Intel: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
-- AMD: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
-
-This guide gives a quick cheat sheet for some basic understanding.
-
-Basic stuff
------------
-
-ACPI enumerates and lists the different IOMMUs on the platform, and
-device scope relationships between devices and which IOMMU controls
-them.
-
-Some ACPI Keywords:
-
-- DMAR - Intel DMA Remapping table
-- DRHD - Intel DMA Remapping Hardware Unit Definition
-- RMRR - Intel Reserved Memory Region Reporting Structure
-- IVRS - AMD I/O Virtualization Reporting Structure
-- IVDB - AMD I/O Virtualization Definition Block
-- IVHD - AMD I/O Virtualization Hardware Definition
-
-What is Intel RMRR?
-^^^^^^^^^^^^^^^^^^^
-
-There are some devices the BIOS controls, for e.g USB devices to perform
-PS2 emulation. The regions of memory used for these devices are marked
-reserved in the e820 map. When we turn on DMA translation, DMA to those
-regions will fail. Hence BIOS uses RMRR to specify these regions along with
-devices that need to access these regions. OS is expected to setup
-unity mappings for these regions for these devices to access these regions.
-
-What is AMD IVRS?
-^^^^^^^^^^^^^^^^^
-
-The architecture defines an ACPI-compatible data structure called an I/O
-Virtualization Reporting Structure (IVRS) that is used to convey information
-related to I/O virtualization to system software.  The IVRS describes the
-configuration and capabilities of the IOMMUs contained in the platform as
-well as information about the devices that each IOMMU virtualizes.
-
-The IVRS provides information about the following:
-
-- IOMMUs present in the platform including their capabilities and proper configuration
-- System I/O topology relevant to each IOMMU
-- Peripheral devices that cannot be otherwise enumerated
-- Memory regions used by SMI/SMM, platform firmware, and platform hardware. These are generally exclusion ranges to be configured by system software.
-
-How is an I/O Virtual Address (IOVA) generated?
------------------------------------------------
-
-Well behaved drivers call dma_map_*() calls before sending command to device
-that needs to perform DMA. Once DMA is completed and mapping is no longer
-required, driver performs dma_unmap_*() calls to unmap the region.
-
-Intel Specific Notes
---------------------
-
-Graphics Problems?
-^^^^^^^^^^^^^^^^^^
-
-If you encounter issues with graphics devices, you can try adding
-option intel_iommu=igfx_off to turn off the integrated graphics engine.
-If this fixes anything, please ensure you file a bug reporting the problem.
-
-Some exceptions to IOVA
-^^^^^^^^^^^^^^^^^^^^^^^
-
-Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
-The same is true for peer to peer transactions. Hence we reserve the
-address from PCI MMIO ranges so they are not allocated for IOVA addresses.
-
-AMD Specific Notes
-------------------
-
-Graphics Problems?
-^^^^^^^^^^^^^^^^^^
-
-If you encounter issues with integrated graphics devices, you can try adding
-option iommu=pt to the kernel command line use a 1:1 mapping for the IOMMU.  If
-this fixes anything, please ensure you file a bug reporting the problem.
-
-Fault reporting
----------------
-When errors are reported, the IOMMU signals via an interrupt. The fault
-reason and device that caused it is printed on the console.
-
-
-Kernel Log Samples
-------------------
-
-Intel Boot Messages
-^^^^^^^^^^^^^^^^^^^
-
-Something like this gets printed indicating presence of DMAR tables
-in ACPI:
-
-::
-
-	ACPI: DMAR (v001 A M I  OEMDMAR  0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
-
-When DMAR is being processed and initialized by ACPI, prints DMAR locations
-and any RMRR's processed:
-
-::
-
-	ACPI DMAR:Host address width 36
-	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
-	ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
-	ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
-	ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
-	ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
-
-When DMAR is enabled for use, you will notice:
-
-::
-
-	PCI-DMA: Using DMAR IOMMU
-
-Intel Fault reporting
-^^^^^^^^^^^^^^^^^^^^^
-
-::
-
-	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
-	DMAR:[fault reason 05] PTE Write access is not set
-	DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
-	DMAR:[fault reason 05] PTE Write access is not set
-
-AMD Boot Messages
-^^^^^^^^^^^^^^^^^
-
-Something like this gets printed indicating presence of the IOMMU:
-
-::
-
-	iommu: Default domain type: Translated
-	iommu: DMA domain TLB invalidation policy: lazy mode
-
-AMD Fault reporting
-^^^^^^^^^^^^^^^^^^^
-
-::
-
-	AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000]
-	AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000]
diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst
deleted file mode 100644
index 6b0bcf027ff1..000000000000
--- a/Documentation/x86/kernel-stacks.rst
+++ /dev/null
@@ -1,152 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=============
-Kernel Stacks
-=============
-
-Kernel stacks on x86-64 bit
-===========================
-
-Most of the text from Keith Owens, hacked by AK
-
-x86_64 page size (PAGE_SIZE) is 4K.
-
-Like all other architectures, x86_64 has a kernel stack for every
-active thread.  These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
-These stacks contain useful data as long as a thread is alive or a
-zombie. While the thread is in user space the kernel stack is empty
-except for the thread_info structure at the bottom.
-
-In addition to the per thread stacks, there are specialized stacks
-associated with each CPU.  These stacks are only used while the kernel
-is in control on that CPU; when a CPU returns to user space the
-specialized stacks contain no useful data.  The main CPU stacks are:
-
-* Interrupt stack.  IRQ_STACK_SIZE
-
-  Used for external hardware interrupts.  If this is the first external
-  hardware interrupt (i.e. not a nested hardware interrupt) then the
-  kernel switches from the current task to the interrupt stack.  Like
-  the split thread and interrupt stacks on i386, this gives more room
-  for kernel interrupt processing without having to increase the size
-  of every per thread stack.
-
-  The interrupt stack is also used when processing a softirq.
-
-Switching to the kernel interrupt stack is done by software based on a
-per CPU interrupt nest counter. This is needed because x86-64 "IST"
-hardware stacks cannot nest without races.
-
-x86_64 also has a feature which is not available on i386, the ability
-to automatically switch to a new stack for designated events such as
-double fault or NMI, which makes it easier to handle these unusual
-events on x86_64.  This feature is called the Interrupt Stack Table
-(IST).  There can be up to 7 IST entries per CPU. The IST code is an
-index into the Task State Segment (TSS). The IST entries in the TSS
-point to dedicated stacks; each stack can be a different size.
-
-An IST is selected by a non-zero value in the IST field of an
-interrupt-gate descriptor.  When an interrupt occurs and the hardware
-loads such a descriptor, the hardware automatically sets the new stack
-pointer based on the IST value, then invokes the interrupt handler.  If
-the interrupt came from user mode, then the interrupt handler prologue
-will switch back to the per-thread stack.  If software wants to allow
-nested IST interrupts then the handler must adjust the IST values on
-entry to and exit from the interrupt handler.  (This is occasionally
-done, e.g. for debug exceptions.)
-
-Events with different IST codes (i.e. with different stacks) can be
-nested.  For example, a debug interrupt can safely be interrupted by an
-NMI.  arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
-pointers on entry to and exit from all IST events, in theory allowing
-IST events with the same code to be nested.  However in most cases, the
-stack size allocated to an IST assumes no nesting for the same code.
-If that assumption is ever broken then the stacks will become corrupt.
-
-The currently assigned IST stacks are:
-
-* ESTACK_DF.  EXCEPTION_STKSZ (PAGE_SIZE).
-
-  Used for interrupt 8 - Double Fault Exception (#DF).
-
-  Invoked when handling one exception causes another exception. Happens
-  when the kernel is very confused (e.g. kernel stack pointer corrupt).
-  Using a separate stack allows the kernel to recover from it well enough
-  in many cases to still output an oops.
-
-* ESTACK_NMI.  EXCEPTION_STKSZ (PAGE_SIZE).
-
-  Used for non-maskable interrupts (NMI).
-
-  NMI can be delivered at any time, including when the kernel is in the
-  middle of switching stacks.  Using IST for NMI events avoids making
-  assumptions about the previous state of the kernel stack.
-
-* ESTACK_DB.  EXCEPTION_STKSZ (PAGE_SIZE).
-
-  Used for hardware debug interrupts (interrupt 1) and for software
-  debug interrupts (INT3).
-
-  When debugging a kernel, debug interrupts (both hardware and
-  software) can occur at any time.  Using IST for these interrupts
-  avoids making assumptions about the previous state of the kernel
-  stack.
-
-  To handle nested #DB correctly there exist two instances of DB stacks. On
-  #DB entry the IST stackpointer for #DB is switched to the second instance
-  so a nested #DB starts from a clean stack. The nested #DB switches
-  the IST stackpointer to a guard hole to catch triple nesting.
-
-* ESTACK_MCE.  EXCEPTION_STKSZ (PAGE_SIZE).
-
-  Used for interrupt 18 - Machine Check Exception (#MC).
-
-  MCE can be delivered at any time, including when the kernel is in the
-  middle of switching stacks.  Using IST for MCE events avoids making
-  assumptions about the previous state of the kernel stack.
-
-For more details see the Intel IA32 or AMD AMD64 architecture manuals.
-
-
-Printing backtraces on x86
-==========================
-
-The question about the '?' preceding function names in an x86 stacktrace
-keeps popping up, here's an indepth explanation. It helps if the reader
-stares at print_context_stack() and the whole machinery in and around
-arch/x86/kernel/dumpstack.c.
-
-Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:
-
-We always scan the full kernel stack for return addresses stored on
-the kernel stack(s) [1]_, from stack top to stack bottom, and print out
-anything that 'looks like' a kernel text address.
-
-If it fits into the frame pointer chain, we print it without a question
-mark, knowing that it's part of the real backtrace.
-
-If the address does not fit into our expected frame pointer chain we
-still print it, but we print a '?'. It can mean two things:
-
- - either the address is not part of the call chain: it's just stale
-   values on the kernel stack, from earlier function calls. This is
-   the common case.
-
- - or it is part of the call chain, but the frame pointer was not set
-   up properly within the function, so we don't recognize it.
-
-This way we will always print out the real call chain (plus a few more
-entries), regardless of whether the frame pointer was set up correctly
-or not - but in most cases we'll get the call chain right as well. The
-entries printed are strictly in stack order, so you can deduce more
-information from that as well.
-
-The most important property of this method is that we _never_ lose
-information: we always strive to print _all_ addresses on the stack(s)
-that look like kernel text addresses, so if debug information is wrong,
-we still print out the real call chain as well - just with more question
-marks than ideal.
-
-.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
-       the right order, and try to cross from one stack into another
-       reconstructing the call chain. This works most of the time.
diff --git a/Documentation/x86/mds.rst b/Documentation/x86/mds.rst
deleted file mode 100644
index 5d4330be200f..000000000000
--- a/Documentation/x86/mds.rst
+++ /dev/null
@@ -1,193 +0,0 @@
-Microarchitectural Data Sampling (MDS) mitigation
-=================================================
-
-.. _mds:
-
-Overview
---------
-
-Microarchitectural Data Sampling (MDS) is a family of side channel attacks
-on internal buffers in Intel CPUs. The variants are:
-
- - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
- - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
- - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
- - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)
-
-MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
-dependent load (store-to-load forwarding) as an optimization. The forward
-can also happen to a faulting or assisting load operation for a different
-memory address, which can be exploited under certain conditions. Store
-buffers are partitioned between Hyper-Threads so cross thread forwarding is
-not possible. But if a thread enters or exits a sleep state the store
-buffer is repartitioned which can expose data from one thread to the other.
-
-MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
-L1 miss situations and to hold data which is returned or sent in response
-to a memory or I/O operation. Fill buffers can forward data to a load
-operation and also write data to the cache. When the fill buffer is
-deallocated it can retain the stale data of the preceding operations which
-can then be forwarded to a faulting or assisting load operation, which can
-be exploited under certain conditions. Fill buffers are shared between
-Hyper-Threads so cross thread leakage is possible.
-
-MLPDS leaks Load Port Data. Load ports are used to perform load operations
-from memory or I/O. The received data is then forwarded to the register
-file or a subsequent operation. In some implementations the Load Port can
-contain stale data from a previous operation which can be forwarded to
-faulting or assisting loads under certain conditions, which again can be
-exploited eventually. Load ports are shared between Hyper-Threads so cross
-thread leakage is possible.
-
-MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
-memory that takes a fault or assist can leave data in a microarchitectural
-structure that may later be observed using one of the same methods used by
-MSBDS, MFBDS or MLPDS.
-
-Exposure assumptions
---------------------
-
-It is assumed that attack code resides in user space or in a guest with one
-exception. The rationale behind this assumption is that the code construct
-needed for exploiting MDS requires:
-
- - to control the load to trigger a fault or assist
-
- - to have a disclosure gadget which exposes the speculatively accessed
-   data for consumption through a side channel.
-
- - to control the pointer through which the disclosure gadget exposes the
-   data
-
-The existence of such a construct in the kernel cannot be excluded with
-100% certainty, but the complexity involved makes it extremly unlikely.
-
-There is one exception, which is untrusted BPF. The functionality of
-untrusted BPF is limited, but it needs to be thoroughly investigated
-whether it can be used to create such a construct.
-
-
-Mitigation strategy
--------------------
-
-All variants have the same mitigation strategy at least for the single CPU
-thread case (SMT off): Force the CPU to clear the affected buffers.
-
-This is achieved by using the otherwise unused and obsolete VERW
-instruction in combination with a microcode update. The microcode clears
-the affected CPU buffers when the VERW instruction is executed.
-
-For virtualization there are two ways to achieve CPU buffer
-clearing. Either the modified VERW instruction or via the L1D Flush
-command. The latter is issued when L1TF mitigation is enabled so the extra
-VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
-be issued.
-
-If the VERW instruction with the supplied segment selector argument is
-executed on a CPU without the microcode update there is no side effect
-other than a small number of pointlessly wasted CPU cycles.
-
-This does not protect against cross Hyper-Thread attacks except for MSBDS
-which is only exploitable cross Hyper-thread when one of the Hyper-Threads
-enters a C-state.
-
-The kernel provides a function to invoke the buffer clearing:
-
-    mds_clear_cpu_buffers()
-
-The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
-(idle) transitions.
-
-As a special quirk to address virtualization scenarios where the host has
-the microcode updated, but the hypervisor does not (yet) expose the
-MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
-hope that it might actually clear the buffers. The state is reflected
-accordingly.
-
-According to current knowledge additional mitigations inside the kernel
-itself are not required because the necessary gadgets to expose the leaked
-data cannot be controlled in a way which allows exploitation from malicious
-user space or VM guests.
-
-Kernel internal mitigation modes
---------------------------------
-
- ======= ============================================================
- off      Mitigation is disabled. Either the CPU is not affected or
-          mds=off is supplied on the kernel command line
-
- full     Mitigation is enabled. CPU is affected and MD_CLEAR is
-          advertised in CPUID.
-
- vmwerv	  Mitigation is enabled. CPU is affected and MD_CLEAR is not
-	  advertised in CPUID. That is mainly for virtualization
-	  scenarios where the host has the updated microcode but the
-	  hypervisor does not expose MD_CLEAR in CPUID. It's a best
-	  effort approach without guarantee.
- ======= ============================================================
-
-If the CPU is affected and mds=off is not supplied on the kernel command
-line then the kernel selects the appropriate mitigation mode depending on
-the availability of the MD_CLEAR CPUID bit.
-
-Mitigation points
------------------
-
-1. Return to user space
-^^^^^^^^^^^^^^^^^^^^^^^
-
-   When transitioning from kernel to user space the CPU buffers are flushed
-   on affected CPUs when the mitigation is not disabled on the kernel
-   command line. The migitation is enabled through the static key
-   mds_user_clear.
-
-   The mitigation is invoked in prepare_exit_to_usermode() which covers
-   all but one of the kernel to user space transitions.  The exception
-   is when we return from a Non Maskable Interrupt (NMI), which is
-   handled directly in do_nmi().
-
-   (The reason that NMI is special is that prepare_exit_to_usermode() can
-    enable IRQs.  In NMI context, NMIs are blocked, and we don't want to
-    enable IRQs with NMIs blocked.)
-
-
-2. C-State transition
-^^^^^^^^^^^^^^^^^^^^^
-
-   When a CPU goes idle and enters a C-State the CPU buffers need to be
-   cleared on affected CPUs when SMT is active. This addresses the
-   repartitioning of the store buffer when one of the Hyper-Threads enters
-   a C-State.
-
-   When SMT is inactive, i.e. either the CPU does not support it or all
-   sibling threads are offline CPU buffer clearing is not required.
-
-   The idle clearing is enabled on CPUs which are only affected by MSBDS
-   and not by any other MDS variant. The other MDS variants cannot be
-   protected against cross Hyper-Thread attacks because the Fill Buffer and
-   the Load Ports are shared. So on CPUs affected by other variants, the
-   idle clearing would be a window dressing exercise and is therefore not
-   activated.
-
-   The invocation is controlled by the static key mds_idle_clear which is
-   switched depending on the chosen mitigation mode and the SMT state of
-   the system.
-
-   The buffer clear is only invoked before entering the C-State to prevent
-   that stale data from the idling CPU from spilling to the Hyper-Thread
-   sibling after the store buffer got repartitioned and all entries are
-   available to the non idle sibling.
-
-   When coming out of idle the store buffer is partitioned again so each
-   sibling has half of it available. The back from idle CPU could be then
-   speculatively exposed to contents of the sibling. The buffers are
-   flushed either on exit to user space or on VMENTER so malicious code
-   in user space or the guest cannot speculatively access them.
-
-   The mitigation is hooked into all variants of halt()/mwait(), but does
-   not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
-   has been superseded by the intel_idle driver around 2010 and is
-   preferred on all affected CPUs which are expected to gain the MD_CLEAR
-   functionality in microcode. Aside of that the IO-Port mechanism is a
-   legacy interface which is only used on older systems which are either
-   not affected or do not receive microcode updates anymore.
diff --git a/Documentation/x86/microcode.rst b/Documentation/x86/microcode.rst
deleted file mode 100644
index b627c6f36bcf..000000000000
--- a/Documentation/x86/microcode.rst
+++ /dev/null
@@ -1,240 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========================
-The Linux Microcode Loader
-==========================
-
-:Authors: - Fenghua Yu <fenghua.yu@intel.com>
-          - Borislav Petkov <bp@suse.de>
-	  - Ashok Raj <ashok.raj@intel.com>
-
-The kernel has a x86 microcode loading facility which is supposed to
-provide microcode loading methods in the OS. Potential use cases are
-updating the microcode on platforms beyond the OEM End-Of-Life support,
-and updating the microcode on long-running systems without rebooting.
-
-The loader supports three loading methods:
-
-Early load microcode
-====================
-
-The kernel can update microcode very early during boot. Loading
-microcode early can fix CPU issues before they are observed during
-kernel boot time.
-
-The microcode is stored in an initrd file. During boot, it is read from
-it and loaded into the CPU cores.
-
-The format of the combined initrd image is microcode in (uncompressed)
-cpio format followed by the (possibly compressed) initrd image. The
-loader parses the combined initrd image during boot.
-
-The microcode files in cpio name space are:
-
-on Intel:
-  kernel/x86/microcode/GenuineIntel.bin
-on AMD  :
-  kernel/x86/microcode/AuthenticAMD.bin
-
-During BSP (BootStrapping Processor) boot (pre-SMP), the kernel
-scans the microcode file in the initrd. If microcode matching the
-CPU is found, it will be applied in the BSP and later on in all APs
-(Application Processors).
-
-The loader also saves the matching microcode for the CPU in memory.
-Thus, the cached microcode patch is applied when CPUs resume from a
-sleep state.
-
-Here's a crude example how to prepare an initrd with microcode (this is
-normally done automatically by the distribution, when recreating the
-initrd, so you don't really have to do it yourself. It is documented
-here for future reference only).
-::
-
-  #!/bin/bash
-
-  if [ -z "$1" ]; then
-      echo "You need to supply an initrd file"
-      exit 1
-  fi
-
-  INITRD="$1"
-
-  DSTDIR=kernel/x86/microcode
-  TMPDIR=/tmp/initrd
-
-  rm -rf $TMPDIR
-
-  mkdir $TMPDIR
-  cd $TMPDIR
-  mkdir -p $DSTDIR
-
-  if [ -d /lib/firmware/amd-ucode ]; then
-          cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin
-  fi
-
-  if [ -d /lib/firmware/intel-ucode ]; then
-          cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin
-  fi
-
-  find . | cpio -o -H newc >../ucode.cpio
-  cd ..
-  mv $INITRD $INITRD.orig
-  cat ucode.cpio $INITRD.orig > $INITRD
-
-  rm -rf $TMPDIR
-
-
-The system needs to have the microcode packages installed into
-/lib/firmware or you need to fixup the paths above if yours are
-somewhere else and/or you've downloaded them directly from the processor
-vendor's site.
-
-Late loading
-============
-
-You simply install the microcode packages your distro supplies and
-run::
-
-  # echo 1 > /sys/devices/system/cpu/microcode/reload
-
-as root.
-
-The loading mechanism looks for microcode blobs in
-/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation
-packages already put them there.
-
-Since kernel 5.19, late loading is not enabled by default.
-
-The /dev/cpu/microcode method has been removed in 5.19.
-
-Why is late loading dangerous?
-==============================
-
-Synchronizing all CPUs
-----------------------
-
-The microcode engine which receives the microcode update is shared
-between the two logical threads in a SMT system. Therefore, when
-the update is executed on one SMT thread of the core, the sibling
-"automatically" gets the update.
-
-Since the microcode can "simulate" MSRs too, while the microcode update
-is in progress, those simulated MSRs transiently cease to exist. This
-can result in unpredictable results if the SMT sibling thread happens to
-be in the middle of an access to such an MSR. The usual observation is
-that such MSR accesses cause #GPs to be raised to signal that former are
-not present.
-
-The disappearing MSRs are just one common issue which is being observed.
-Any other instruction that's being patched and gets concurrently
-executed by the other SMT sibling, can also result in similar,
-unpredictable behavior.
-
-To eliminate this case, a stop_machine()-based CPU synchronization was
-introduced as a way to guarantee that all logical CPUs will not execute
-any code but just wait in a spin loop, polling an atomic variable.
-
-While this took care of device or external interrupts, IPIs including
-LVT ones, such as CMCI etc, it cannot address other special interrupts
-that can't be shut off. Those are Machine Check (#MC), System Management
-(#SMI) and Non-Maskable interrupts (#NMI).
-
-Machine Checks
---------------
-
-Machine Checks (#MC) are non-maskable. There are two kinds of MCEs.
-Fatal un-recoverable MCEs and recoverable MCEs. While un-recoverable
-errors are fatal, recoverable errors can also happen in kernel context
-are also treated as fatal by the kernel.
-
-On certain Intel machines, MCEs are also broadcast to all threads in a
-system. If one thread is in the middle of executing WRMSR, a MCE will be
-taken at the end of the flow. Either way, they will wait for the thread
-performing the wrmsr(0x79) to rendezvous in the MCE handler and shutdown
-eventually if any of the threads in the system fail to check in to the
-MCE rendezvous.
-
-To be paranoid and get predictable behavior, the OS can choose to set
-MCG_STATUS.MCIP. Since MCEs can be at most one in a system, if an
-MCE was signaled, the above condition will promote to a system reset
-automatically. OS can turn off MCIP at the end of the update for that
-core.
-
-System Management Interrupt
----------------------------
-
-SMIs are also broadcast to all CPUs in the platform. Microcode update
-requests exclusive access to the core before writing to MSR 0x79. So if
-it does happen such that, one thread is in WRMSR flow, and the 2nd got
-an SMI, that thread will be stopped in the first instruction in the SMI
-handler.
-
-Since the secondary thread is stopped in the first instruction in SMI,
-there is very little chance that it would be in the middle of executing
-an instruction being patched. Plus OS has no way to stop SMIs from
-happening.
-
-Non-Maskable Interrupts
------------------------
-
-When thread0 of a core is doing the microcode update, if thread1 is
-pulled into NMI, that can cause unpredictable behavior due to the
-reasons above.
-
-OS can choose a variety of methods to avoid running into this situation.
-
-
-Is the microcode suitable for late loading?
--------------------------------------------
-
-Late loading is done when the system is fully operational and running
-real workloads. Late loading behavior depends on what the base patch on
-the CPU is before upgrading to the new patch.
-
-This is true for Intel CPUs.
-
-Consider, for example, a CPU has patch level 1 and the update is to
-patch level 3.
-
-Between patch1 and patch3, patch2 might have deprecated a software-visible
-feature.
-
-This is unacceptable if software is even potentially using that feature.
-For instance, say MSR_X is no longer available after an update,
-accessing that MSR will cause a #GP fault.
-
-Basically there is no way to declare a new microcode update suitable
-for late-loading. This is another one of the problems that caused late
-loading to be not enabled by default.
-
-Builtin microcode
-=================
-
-The loader supports also loading of a builtin microcode supplied through
-the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit is
-currently supported.
-
-Here's an example::
-
-  CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
-  CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
-
-This basically means, you have the following tree structure locally::
-
-  /lib/firmware/
-  |-- amd-ucode
-  ...
-  |   |-- microcode_amd_fam15h.bin
-  ...
-  |-- intel-ucode
-  ...
-  |   |-- 06-3a-09
-  ...
-
-so that the build system can find those files and integrate them into
-the final kernel image. The early loader finds them and applies them.
-
-Needless to say, this method is not the most flexible one because it
-requires rebuilding the kernel each time updated microcode from the CPU
-vendor is available.
diff --git a/Documentation/x86/mtrr.rst b/Documentation/x86/mtrr.rst
deleted file mode 100644
index 9f0b1851771a..000000000000
--- a/Documentation/x86/mtrr.rst
+++ /dev/null
@@ -1,354 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========================================
-MTRR (Memory Type Range Register) control
-=========================================
-
-:Authors: - Richard Gooch <rgooch@atnf.csiro.au> - 3 Jun 1999
-          - Luis R. Rodriguez <mcgrof@do-not-panic.com> - April 9, 2015
-
-
-Phasing out MTRR use
-====================
-
-MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by
-drivers on Linux is now completely phased out, device drivers should use
-arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on
-non-PAT systems while a no-op but equally effective on PAT enabled systems.
-
-Even if Linux does not use MTRRs directly, some x86 platform firmware may still
-set up MTRRs early before booting the OS. They do this as some platform
-firmware may still have implemented access to MTRRs which would be controlled
-and handled by the platform firmware directly. An example of platform use of
-MTRRs is through the use of SMI handlers, one case could be for fan control,
-the platform code would need uncachable access to some of its fan control
-registers. Such platform access does not need any Operating System MTRR code in
-place other than mtrr_type_lookup() to ensure any OS specific mapping requests
-are aligned with platform MTRR setup. If MTRRs are only set up by the platform
-firmware code though and the OS does not make any specific MTRR mapping
-requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID.
-
-For details refer to Documentation/x86/pat.rst.
-
-.. tip::
-  On Intel P6 family processors (Pentium Pro, Pentium II and later)
-  the Memory Type Range Registers (MTRRs) may be used to control
-  processor access to memory ranges. This is most useful when you have
-  a video (VGA) card on a PCI or AGP bus. Enabling write-combining
-  allows bus write transfers to be combined into a larger transfer
-  before bursting over the PCI/AGP bus. This can increase performance
-  of image write operations 2.5 times or more.
-
-  The Cyrix 6x86, 6x86MX and M II processors have Address Range
-  Registers (ARRs) which provide a similar functionality to MTRRs. For
-  these, the ARRs are used to emulate the MTRRs.
-
-  The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
-  MTRRs. These are supported.  The AMD Athlon family provide 8 Intel
-  style MTRRs.
-
-  The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These
-  are supported.
-
-  The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs.
-
-  The CONFIG_MTRR option creates a /proc/mtrr file which may be used
-  to manipulate your MTRRs. Typically the X server should use
-  this. This should have a reasonably generic interface so that
-  similar control registers on other processors can be easily
-  supported.
-
-There are two interfaces to /proc/mtrr: one is an ASCII interface
-which allows you to read and write. The other is an ioctl()
-interface. The ASCII interface is meant for administration. The
-ioctl() interface is meant for C programs (i.e. the X server). The
-interfaces are described below, with sample commands and C code.
-
-
-Reading MTRRs from the shell
-============================
-::
-
-  % cat /proc/mtrr
-  reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
-  reg01: base=0x08000000 ( 128MB), size=  64MB: write-back, count=1
-
-Creating MTRRs from the C-shell::
-
-  # echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr
-
-or if you use bash::
-
-  # echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr
-
-And the result thereof::
-
-  % cat /proc/mtrr
-  reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
-  reg01: base=0x08000000 ( 128MB), size=  64MB: write-back, count=1
-  reg02: base=0xf8000000 (3968MB), size=   4MB: write-combining, count=1
-
-This is for video RAM at base address 0xf8000000 and size 4 megabytes. To
-find out your base address, you need to look at the output of your X
-server, which tells you where the linear framebuffer address is. A
-typical line that you may get is::
-
-  (--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000
-
-Note that you should only use the value from the X server, as it may
-move the framebuffer base address, so the only value you can trust is
-that reported by the X server.
-
-To find out the size of your framebuffer (what, you don't actually
-know?), the following line will tell you::
-
-  (--) S3: videoram:  4096k
-
-That's 4 megabytes, which is 0x400000 bytes (in hexadecimal).
-A patch is being written for XFree86 which will make this automatic:
-in other words the X server will manipulate /proc/mtrr using the
-ioctl() interface, so users won't have to do anything. If you use a
-commercial X server, lobby your vendor to add support for MTRRs.
-
-
-Creating overlapping MTRRs
-==========================
-::
-
-  %echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr
-  %echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr
-
-And the results::
-
-  % cat /proc/mtrr
-  reg00: base=0x00000000 (   0MB), size=  64MB: write-back, count=1
-  reg01: base=0xfb000000 (4016MB), size=  16MB: write-combining, count=1
-  reg02: base=0xfb000000 (4016MB), size=   4kB: uncachable, count=1
-
-Some cards (especially Voodoo Graphics boards) need this 4 kB area
-excluded from the beginning of the region because it is used for
-registers.
-
-NOTE: You can only create type=uncachable region, if the first
-region that you created is type=write-combining.
-
-
-Removing MTRRs from the C-shel
-==============================
-::
-
-  % echo "disable=2" >! /proc/mtrr
-
-or using bash::
-
-  % echo "disable=2" >| /proc/mtrr
-
-
-Reading MTRRs from a C program using ioctl()'s
-==============================================
-::
-
-  /*  mtrr-show.c
-
-      Source file for mtrr-show (example program to show MTRRs using ioctl()'s)
-
-      Copyright (C) 1997-1998  Richard Gooch
-
-      This program is free software; you can redistribute it and/or modify
-      it under the terms of the GNU General Public License as published by
-      the Free Software Foundation; either version 2 of the License, or
-      (at your option) any later version.
-
-      This program is distributed in the hope that it will be useful,
-      but WITHOUT ANY WARRANTY; without even the implied warranty of
-      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-      GNU General Public License for more details.
-
-      You should have received a copy of the GNU General Public License
-      along with this program; if not, write to the Free Software
-      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
-      Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
-      The postal address is:
-        Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
-  */
-
-  /*
-      This program will use an ioctl() on /proc/mtrr to show the current MTRR
-      settings. This is an alternative to reading /proc/mtrr.
-
-
-      Written by      Richard Gooch   17-DEC-1997
-
-      Last updated by Richard Gooch   2-MAY-1998
-
-
-  */
-  #include <stdio.h>
-  #include <stdlib.h>
-  #include <string.h>
-  #include <sys/types.h>
-  #include <sys/stat.h>
-  #include <fcntl.h>
-  #include <sys/ioctl.h>
-  #include <errno.h>
-  #include <asm/mtrr.h>
-
-  #define TRUE 1
-  #define FALSE 0
-  #define ERRSTRING strerror (errno)
-
-  static char *mtrr_strings[MTRR_NUM_TYPES] =
-  {
-      "uncachable",               /* 0 */
-      "write-combining",          /* 1 */
-      "?",                        /* 2 */
-      "?",                        /* 3 */
-      "write-through",            /* 4 */
-      "write-protect",            /* 5 */
-      "write-back",               /* 6 */
-  };
-
-  int main ()
-  {
-      int fd;
-      struct mtrr_gentry gentry;
-
-      if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 )
-      {
-    if (errno == ENOENT)
-    {
-        fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
-        stderr);
-        exit (1);
-    }
-    fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
-    exit (2);
-      }
-      for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0;
-    ++gentry.regnum)
-      {
-    if (gentry.size < 1)
-    {
-        fprintf (stderr, "Register: %u disabled\n", gentry.regnum);
-        continue;
-    }
-    fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n",
-      gentry.regnum, gentry.base, gentry.size,
-      mtrr_strings[gentry.type]);
-      }
-      if (errno == EINVAL) exit (0);
-      fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
-      exit (3);
-  }   /*  End Function main  */
-
-
-Creating MTRRs from a C programme using ioctl()'s
-=================================================
-::
-
-  /*  mtrr-add.c
-
-      Source file for mtrr-add (example programme to add an MTRRs using ioctl())
-
-      Copyright (C) 1997-1998  Richard Gooch
-
-      This program is free software; you can redistribute it and/or modify
-      it under the terms of the GNU General Public License as published by
-      the Free Software Foundation; either version 2 of the License, or
-      (at your option) any later version.
-
-      This program is distributed in the hope that it will be useful,
-      but WITHOUT ANY WARRANTY; without even the implied warranty of
-      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-      GNU General Public License for more details.
-
-      You should have received a copy of the GNU General Public License
-      along with this program; if not, write to the Free Software
-      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
-      Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
-      The postal address is:
-        Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
-  */
-
-  /*
-      This programme will use an ioctl() on /proc/mtrr to add an entry. The first
-      available mtrr is used. This is an alternative to writing /proc/mtrr.
-
-
-      Written by      Richard Gooch   17-DEC-1997
-
-      Last updated by Richard Gooch   2-MAY-1998
-
-
-  */
-  #include <stdio.h>
-  #include <string.h>
-  #include <stdlib.h>
-  #include <unistd.h>
-  #include <sys/types.h>
-  #include <sys/stat.h>
-  #include <fcntl.h>
-  #include <sys/ioctl.h>
-  #include <errno.h>
-  #include <asm/mtrr.h>
-
-  #define TRUE 1
-  #define FALSE 0
-  #define ERRSTRING strerror (errno)
-
-  static char *mtrr_strings[MTRR_NUM_TYPES] =
-  {
-      "uncachable",               /* 0 */
-      "write-combining",          /* 1 */
-      "?",                        /* 2 */
-      "?",                        /* 3 */
-      "write-through",            /* 4 */
-      "write-protect",            /* 5 */
-      "write-back",               /* 6 */
-  };
-
-  int main (int argc, char **argv)
-  {
-      int fd;
-      struct mtrr_sentry sentry;
-
-      if (argc != 4)
-      {
-    fprintf (stderr, "Usage:\tmtrr-add base size type\n");
-    exit (1);
-      }
-      sentry.base = strtoul (argv[1], NULL, 0);
-      sentry.size = strtoul (argv[2], NULL, 0);
-      for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type)
-      {
-    if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break;
-      }
-      if (sentry.type >= MTRR_NUM_TYPES)
-      {
-    fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]);
-    exit (2);
-      }
-      if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 )
-      {
-    if (errno == ENOENT)
-    {
-        fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
-        stderr);
-        exit (3);
-    }
-    fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
-    exit (4);
-      }
-      if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1)
-      {
-    fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
-    exit (5);
-      }
-      fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n");
-      sleep (5);
-      close (fd);
-      fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n",
-      stderr);
-  }   /*  End Function main  */
diff --git a/Documentation/x86/orc-unwinder.rst b/Documentation/x86/orc-unwinder.rst
deleted file mode 100644
index cdb257015bd9..000000000000
--- a/Documentation/x86/orc-unwinder.rst
+++ /dev/null
@@ -1,182 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============
-ORC unwinder
-============
-
-Overview
-========
-
-The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is
-similar in concept to a DWARF unwinder.  The difference is that the
-format of the ORC data is much simpler than DWARF, which in turn allows
-the ORC unwinder to be much simpler and faster.
-
-The ORC data consists of unwind tables which are generated by objtool.
-They contain out-of-band data which is used by the in-kernel ORC
-unwinder.  Objtool generates the ORC data by first doing compile-time
-stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
-all the code paths of a .o file, it determines information about the
-stack state at each instruction address in the file and outputs that
-information to the .orc_unwind and .orc_unwind_ip sections.
-
-The per-object ORC sections are combined at link time and are sorted and
-post-processed at boot time.  The unwinder uses the resulting data to
-correlate instruction addresses with their stack states at run time.
-
-
-ORC vs frame pointers
-=====================
-
-With frame pointers enabled, GCC adds instrumentation code to every
-function in the kernel.  The kernel's .text size increases by about
-3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
-Gorman [1]_ have shown a slowdown of 5-10% for some workloads.
-
-In contrast, the ORC unwinder has no effect on text size or runtime
-performance, because the debuginfo is out of band.  So if you disable
-frame pointers and enable the ORC unwinder, you get a nice performance
-improvement across the board, and still have reliable stack traces.
-
-Ingo Molnar says:
-
-  "Note that it's not just a performance improvement, but also an
-  instruction cache locality improvement: 3.2% .text savings almost
-  directly transform into a similarly sized reduction in cache
-  footprint. That can transform to even higher speedups for workloads
-  whose cache locality is borderline."
-
-Another benefit of ORC compared to frame pointers is that it can
-reliably unwind across interrupts and exceptions.  Frame pointer based
-unwinds can sometimes skip the caller of the interrupted function, if it
-was a leaf function or if the interrupt hit before the frame pointer was
-saved.
-
-The main disadvantage of the ORC unwinder compared to frame pointers is
-that it needs more memory to store the ORC unwind tables: roughly 2-4MB
-depending on the kernel config.
-
-
-ORC vs DWARF
-============
-
-ORC debuginfo's advantage over DWARF itself is that it's much simpler.
-It gets rid of the complex DWARF CFI state machine and also gets rid of
-the tracking of unnecessary registers.  This allows the unwinder to be
-much simpler, meaning fewer bugs, which is especially important for
-mission critical oops code.
-
-The simpler debuginfo format also enables the unwinder to be much faster
-than DWARF, which is important for perf and lockdep.  In a basic
-performance test by Jiri Slaby [2]_, the ORC unwinder was about 20x
-faster than an out-of-tree DWARF unwinder.  (Note: That measurement was
-taken before some performance tweaks were added, which doubled
-performance, so the speedup over DWARF may be closer to 40x.)
-
-The ORC data format does have a few downsides compared to DWARF.  ORC
-unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
-than DWARF-based eh_frame tables.
-
-Another potential downside is that, as GCC evolves, it's conceivable
-that the ORC data may end up being *too* simple to describe the state of
-the stack for certain optimizations.  But IMO this is unlikely because
-GCC saves the frame pointer for any unusual stack adjustments it does,
-so I suspect we'll really only ever need to keep track of the stack
-pointer and the frame pointer between call frames.  But even if we do
-end up having to track all the registers DWARF tracks, at least we will
-still be able to control the format, e.g. no complex state machines.
-
-
-ORC unwind table generation
-===========================
-
-The ORC data is generated by objtool.  With the existing compile-time
-stack metadata validation feature, objtool already follows all code
-paths, and so it already has all the information it needs to be able to
-generate ORC data from scratch.  So it's an easy step to go from stack
-validation to ORC data generation.
-
-It should be possible to instead generate the ORC data with a simple
-tool which converts DWARF to ORC data.  However, such a solution would
-be incomplete due to the kernel's extensive use of asm, inline asm, and
-special sections like exception tables.
-
-That could be rectified by manually annotating those special code paths
-using GNU assembler .cfi annotations in .S files, and homegrown
-annotations for inline asm in .c files.  But asm annotations were tried
-in the past and were found to be unmaintainable.  They were often
-incorrect/incomplete and made the code harder to read and keep updated.
-And based on looking at glibc code, annotating inline asm in .c files
-might be even worse.
-
-Objtool still needs a few annotations, but only in code which does
-unusual things to the stack like entry code.  And even then, far fewer
-annotations are needed than what DWARF would need, so they're much more
-maintainable than DWARF CFI annotations.
-
-So the advantages of using objtool to generate ORC data are that it
-gives more accurate debuginfo, with very few annotations.  It also
-insulates the kernel from toolchain bugs which can be very painful to
-deal with in the kernel since we often have to workaround issues in
-older versions of the toolchain for years.
-
-The downside is that the unwinder now becomes dependent on objtool's
-ability to reverse engineer GCC code flow.  If GCC optimizations become
-too complicated for objtool to follow, the ORC data generation might
-stop working or become incomplete.  (It's worth noting that livepatch
-already has such a dependency on objtool's ability to follow GCC code
-flow.)
-
-If newer versions of GCC come up with some optimizations which break
-objtool, we may need to revisit the current implementation.  Some
-possible solutions would be asking GCC to make the optimizations more
-palatable, or having objtool use DWARF as an additional input, or
-creating a GCC plugin to assist objtool with its analysis.  But for now,
-objtool follows GCC code quite well.
-
-
-Unwinder implementation details
-===============================
-
-Objtool generates the ORC data by integrating with the compile-time
-stack metadata validation feature, which is described in detail in
-tools/objtool/Documentation/objtool.txt.  After analyzing all
-the code paths of a .o file, it creates an array of orc_entry structs,
-and a parallel array of instruction addresses associated with those
-structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
-respectively.
-
-The ORC data is split into the two arrays for performance reasons, to
-make the searchable part of the data (.orc_unwind_ip) more compact.  The
-arrays are sorted in parallel at boot time.
-
-Performance is further improved by the use of a fast lookup table which
-is created at runtime.  The fast lookup table associates a given address
-with a range of indices for the .orc_unwind table, so that only a small
-subset of the table needs to be searched.
-
-
-Etymology
-=========
-
-Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
-enemies.  Similarly, the ORC unwinder was created in opposition to the
-complexity and slowness of DWARF.
-
-"Although Orcs rarely consider multiple solutions to a problem, they do
-excel at getting things done because they are creatures of action, not
-thought." [3]_  Similarly, unlike the esoteric DWARF unwinder, the
-veracious ORC unwinder wastes no time or siloconic effort decoding
-variable-length zero-extended unsigned-integer byte-coded
-state-machine-based debug information entries.
-
-Similar to how Orcs frequently unravel the well-intentioned plans of
-their adversaries, the ORC unwinder frequently unravels stacks with
-brutal, unyielding efficiency.
-
-ORC stands for Oops Rewind Capability.
-
-
-.. [1] https://lore.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
-.. [2] https://lore.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
-.. [3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/Documentation/x86/pat.rst b/Documentation/x86/pat.rst
deleted file mode 100644
index 5d901771016d..000000000000
--- a/Documentation/x86/pat.rst
+++ /dev/null
@@ -1,240 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========================
-PAT (Page Attribute Table)
-==========================
-
-x86 Page Attribute Table (PAT) allows for setting the memory attribute at the
-page level granularity. PAT is complementary to the MTRR settings which allows
-for setting of memory types over physical address ranges. However, PAT is
-more flexible than MTRR due to its capability to set attributes at page level
-and also due to the fact that there are no hardware limitations on number of
-such attribute settings allowed. Added flexibility comes with guidelines for
-not having memory type aliasing for the same physical memory with multiple
-virtual addresses.
-
-PAT allows for different types of memory attributes. The most commonly used
-ones that will be supported at this time are:
-
-===  ==============
-WB   Write-back
-UC   Uncached
-WC   Write-combined
-WT   Write-through
-UC-  Uncached Minus
-===  ==============
-
-
-PAT APIs
-========
-
-There are many different APIs in the kernel that allows setting of memory
-attributes at the page level. In order to avoid aliasing, these interfaces
-should be used thoughtfully. Below is a table of interfaces available,
-their intended usage and their memory attribute relationships. Internally,
-these APIs use a reserve_memtype()/free_memtype() interface on the physical
-address range to avoid any aliasing.
-
-+------------------------+----------+--------------+------------------+
-| API                    |    RAM   |  ACPI,...    |  Reserved/Holes  |
-+------------------------+----------+--------------+------------------+
-| ioremap                |    --    |    UC-       |       UC-        |
-+------------------------+----------+--------------+------------------+
-| ioremap_cache          |    --    |    WB        |       WB         |
-+------------------------+----------+--------------+------------------+
-| ioremap_uc             |    --    |    UC        |       UC         |
-+------------------------+----------+--------------+------------------+
-| ioremap_wc             |    --    |    --        |       WC         |
-+------------------------+----------+--------------+------------------+
-| ioremap_wt             |    --    |    --        |       WT         |
-+------------------------+----------+--------------+------------------+
-| set_memory_uc,         |    UC-   |    --        |       --         |
-| set_memory_wb          |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| set_memory_wc,         |    WC    |    --        |       --         |
-| set_memory_wb          |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| set_memory_wt,         |    WT    |    --        |       --         |
-| set_memory_wb          |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| pci sysfs resource     |    --    |    --        |       UC-        |
-+------------------------+----------+--------------+------------------+
-| pci sysfs resource_wc  |    --    |    --        |       WC         |
-| is IORESOURCE_PREFETCH |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| pci proc               |    --    |    --        |       UC-        |
-| !PCIIOC_WRITE_COMBINE  |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| pci proc               |    --    |    --        |       WC         |
-| PCIIOC_WRITE_COMBINE   |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| /dev/mem               |    --    |   WB/WC/UC-  |    WB/WC/UC-     |
-| read-write             |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| /dev/mem               |    --    |    UC-       |       UC-        |
-| mmap SYNC flag         |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| /dev/mem               |    --    |   WB/WC/UC-  |  WB/WC/UC-       |
-| mmap !SYNC flag        |          |              |                  |
-| and                    |          |(from existing|  (from existing  |
-| any alias to this area |          |alias)        |  alias)          |
-+------------------------+----------+--------------+------------------+
-| /dev/mem               |    --    |    WB        |       WB         |
-| mmap !SYNC flag        |          |              |                  |
-| no alias to this area  |          |              |                  |
-| and                    |          |              |                  |
-| MTRR says WB           |          |              |                  |
-+------------------------+----------+--------------+------------------+
-| /dev/mem               |    --    |    --        |       UC-        |
-| mmap !SYNC flag        |          |              |                  |
-| no alias to this area  |          |              |                  |
-| and                    |          |              |                  |
-| MTRR says !WB          |          |              |                  |
-+------------------------+----------+--------------+------------------+
-
-
-Advanced APIs for drivers
-=========================
-
-A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
-vmf_insert_pfn.
-
-Drivers wanting to export some pages to userspace do it by using mmap
-interface and a combination of:
-
-  1) pgprot_noncached()
-  2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
-
-With PAT support, a new API pgprot_writecombine is being added. So, drivers can
-continue to use the above sequence, with either pgprot_noncached() or
-pgprot_writecombine() in step 1, followed by step 2.
-
-In addition, step 2 internally tracks the region as UC or WC in memtype
-list in order to ensure no conflicting mapping.
-
-Note that this set of APIs only works with IO (non RAM) regions. If driver
-wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
-as step 0 above and also track the usage of those pages and use set_memory_wb()
-before the page is freed to free pool.
-
-MTRR effects on PAT / non-PAT systems
-=====================================
-
-The following table provides the effects of using write-combining MTRRs when
-using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally
-mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will
-be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add()
-is made, should already have been ioremapped with WC attributes or PAT entries,
-this can be done by using ioremap_wc() / set_memory_wc().  Devices which
-combine areas of IO memory desired to remain uncacheable with areas where
-write-combining is desirable should consider use of ioremap_uc() followed by
-set_memory_wc() to white-list effective write-combined areas.  Such use is
-nevertheless discouraged as the effective memory type is considered
-implementation defined, yet this strategy can be used as last resort on devices
-with size-constrained regions where otherwise MTRR write-combining would
-otherwise not be effective.
-::
-
-  ====  =======  ===  =========================  =====================
-  MTRR  Non-PAT  PAT  Linux ioremap value        Effective memory type
-  ====  =======  ===  =========================  =====================
-        PAT                                        Non-PAT |  PAT
-        |PCD                                               |
-        ||PWT                                              |
-        |||                                                |
-  WC    000      WB   _PAGE_CACHE_MODE_WB             WC   |   WC
-  WC    001      WC   _PAGE_CACHE_MODE_WC             WC*  |   WC
-  WC    010      UC-  _PAGE_CACHE_MODE_UC_MINUS       WC*  |   UC
-  WC    011      UC   _PAGE_CACHE_MODE_UC             UC   |   UC
-  ====  =======  ===  =========================  =====================
-
-  (*) denotes implementation defined and is discouraged
-
-.. note:: -- in the above table mean "Not suggested usage for the API". Some
-  of the --'s are strictly enforced by the kernel. Some others are not really
-  enforced today, but may be enforced in future.
-
-For ioremap and pci access through /sys or /proc - The actual type returned
-can be more restrictive, in case of any existing aliasing for that address.
-For example: If there is an existing uncached mapping, a new ioremap_wc can
-return uncached mapping in place of write-combine requested.
-
-set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver
-will first make a region uc, wc or wt and switch it back to wb after use.
-
-Over time writes to /proc/mtrr will be deprecated in favor of using PAT based
-interfaces. Users writing to /proc/mtrr are suggested to use above interfaces.
-
-Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
-types.
-
-Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges.
-
-
-PAT debugging
-=============
-
-With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by::
-
-  # mount -t debugfs debugfs /sys/kernel/debug
-  # cat /sys/kernel/debug/x86/pat_memtype_list
-  PAT memtype list:
-  uncached-minus @ 0x7fadf000-0x7fae0000
-  uncached-minus @ 0x7fb19000-0x7fb1a000
-  uncached-minus @ 0x7fb1a000-0x7fb1b000
-  uncached-minus @ 0x7fb1b000-0x7fb1c000
-  uncached-minus @ 0x7fb1c000-0x7fb1d000
-  uncached-minus @ 0x7fb1d000-0x7fb1e000
-  uncached-minus @ 0x7fb1e000-0x7fb25000
-  uncached-minus @ 0x7fb25000-0x7fb26000
-  uncached-minus @ 0x7fb26000-0x7fb27000
-  uncached-minus @ 0x7fb27000-0x7fb28000
-  uncached-minus @ 0x7fb28000-0x7fb2e000
-  uncached-minus @ 0x7fb2e000-0x7fb2f000
-  uncached-minus @ 0x7fb2f000-0x7fb30000
-  uncached-minus @ 0x7fb31000-0x7fb32000
-  uncached-minus @ 0x80000000-0x90000000
-
-This list shows physical address ranges and various PAT settings used to
-access those physical address ranges.
-
-Another, more verbose way of getting PAT related debug messages is with
-"debugpat" boot parameter. With this parameter, various debug messages are
-printed to dmesg log.
-
-PAT Initialization
-==================
-
-The following table describes how PAT is initialized under various
-configurations. The PAT MSR must be updated by Linux in order to support WC
-and WT attributes. Otherwise, the PAT MSR has the value programmed in it
-by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests.
-
- ==== ===== ==========================  =========  =======
- MTRR PAT   Call Sequence               PAT State  PAT MSR
- ==== ===== ==========================  =========  =======
- E    E     MTRR -> PAT init            Enabled    OS
- E    D     MTRR -> PAT init            Disabled    -
- D    E     MTRR -> PAT disable         Disabled   BIOS
- D    D     MTRR -> PAT disable         Disabled    -
- -    np/E  PAT  -> PAT disable         Disabled   BIOS
- -    np/D  PAT  -> PAT disable         Disabled    -
- E    !P/E  MTRR -> PAT init            Disabled   BIOS
- D    !P/E  MTRR -> PAT disable         Disabled   BIOS
- !M   !P/E  MTRR stub -> PAT disable    Disabled   BIOS
- ==== ===== ==========================  =========  =======
-
-  Legend
-
- ========= =======================================
- E         Feature enabled in CPU
- D	   Feature disabled/unsupported in CPU
- np	   "nopat" boot option specified
- !P	   CONFIG_X86_PAT option unset
- !M	   CONFIG_MTRR option unset
- Enabled   PAT state set to enabled
- Disabled  PAT state set to disabled
- OS        PAT initializes PAT MSR with OS setting
- BIOS      PAT keeps PAT MSR with BIOS setting
- ========= =======================================
-
diff --git a/Documentation/x86/pti.rst b/Documentation/x86/pti.rst
deleted file mode 100644
index 4b858a9bad8d..000000000000
--- a/Documentation/x86/pti.rst
+++ /dev/null
@@ -1,195 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========================
-Page Table Isolation (PTI)
-==========================
-
-Overview
-========
-
-Page Table Isolation (pti, previously known as KAISER [1]_) is a
-countermeasure against attacks on the shared user/kernel address
-space such as the "Meltdown" approach [2]_.
-
-To mitigate this class of attacks, we create an independent set of
-page tables for use only when running userspace applications.  When
-the kernel is entered via syscalls, interrupts or exceptions, the
-page tables are switched to the full "kernel" copy.  When the system
-switches back to user mode, the user copy is used again.
-
-The userspace page tables contain only a minimal amount of kernel
-data: only what is needed to enter/exit the kernel such as the
-entry/exit functions themselves and the interrupt descriptor table
-(IDT).  There are a few strictly unnecessary things that get mapped
-such as the first C function when entering an interrupt (see
-comments in pti.c).
-
-This approach helps to ensure that side-channel attacks leveraging
-the paging structures do not function when PTI is enabled.  It can be
-enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
-Once enabled at compile-time, it can be disabled at boot with the
-'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
-
-Page Table Management
-=====================
-
-When PTI is enabled, the kernel manages two sets of page tables.
-The first set is very similar to the single set which is present in
-kernels without PTI.  This includes a complete mapping of userspace
-that the kernel can use for things like copy_to_user().
-
-Although _complete_, the user portion of the kernel page tables is
-crippled by setting the NX bit in the top level.  This ensures
-that any missed kernel->user CR3 switch will immediately crash
-userspace upon executing its first instruction.
-
-The userspace page tables map only the kernel data needed to enter
-and exit the kernel.  This data is entirely contained in the 'struct
-cpu_entry_area' structure which is placed in the fixmap which gives
-each CPU's copy of the area a compile-time-fixed virtual address.
-
-For new userspace mappings, the kernel makes the entries in its
-page tables like normal.  The only difference is when the kernel
-makes entries in the top (PGD) level.  In addition to setting the
-entry in the main kernel PGD, a copy of the entry is made in the
-userspace page tables' PGD.
-
-This sharing at the PGD level also inherently shares all the lower
-layers of the page tables.  This leaves a single, shared set of
-userspace page tables to manage.  One PTE to lock, one set of
-accessed bits, dirty bits, etc...
-
-Overhead
-========
-
-Protection against side-channel attacks is important.  But,
-this protection comes at a cost:
-
-1. Increased Memory Use
-
-  a. Each process now needs an order-1 PGD instead of order-0.
-     (Consumes an additional 4k per process).
-  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
-     aligned so that it can be mapped by setting a single PMD
-     entry.  This consumes nearly 2MB of RAM once the kernel
-     is decompressed, but no space in the kernel image itself.
-
-2. Runtime Cost
-
-  a. CR3 manipulation to switch between the page table copies
-     must be done at interrupt, syscall, and exception entry
-     and exit (it can be skipped when the kernel is interrupted,
-     though.)  Moves to CR3 are on the order of a hundred
-     cycles, and are required at every entry and exit.
-  b. A "trampoline" must be used for SYSCALL entry.  This
-     trampoline depends on a smaller set of resources than the
-     non-PTI SYSCALL entry code, so requires mapping fewer
-     things into the userspace page tables.  The downside is
-     that stacks must be switched at entry time.
-  c. Global pages are disabled for all kernel structures not
-     mapped into both kernel and userspace page tables.  This
-     feature of the MMU allows different processes to share TLB
-     entries mapping the kernel.  Losing the feature means more
-     TLB misses after a context switch.  The actual loss of
-     performance is very small, however, never exceeding 1%.
-  d. Process Context IDentifiers (PCID) is a CPU feature that
-     allows us to skip flushing the entire TLB when switching page
-     tables by setting a special bit in CR3 when the page tables
-     are changed.  This makes switching the page tables (at context
-     switch, or kernel entry/exit) cheaper.  But, on systems with
-     PCID support, the context switch code must flush both the user
-     and kernel entries out of the TLB.  The user PCID TLB flush is
-     deferred until the exit to userspace, minimizing the cost.
-     See intel.com/sdm for the gory PCID/INVPCID details.
-  e. The userspace page tables must be populated for each new
-     process.  Even without PTI, the shared kernel mappings
-     are created by copying top-level (PGD) entries into each
-     new process.  But, with PTI, there are now *two* kernel
-     mappings: one in the kernel page tables that maps everything
-     and one for the entry/exit structures.  At fork(), we need to
-     copy both.
-  f. In addition to the fork()-time copying, there must also
-     be an update to the userspace PGD any time a set_pgd() is done
-     on a PGD used to map userspace.  This ensures that the kernel
-     and userspace copies always map the same userspace
-     memory.
-  g. On systems without PCID support, each CR3 write flushes
-     the entire TLB.  That means that each syscall, interrupt
-     or exception flushes the TLB.
-  h. INVPCID is a TLB-flushing instruction which allows flushing
-     of TLB entries for non-current PCIDs.  Some systems support
-     PCIDs, but do not support INVPCID.  On these systems, addresses
-     can only be flushed from the TLB for the current PCID.  When
-     flushing a kernel address, we need to flush all PCIDs, so a
-     single kernel address flush will require a TLB-flushing CR3
-     write upon the next use of every PCID.
-
-Possible Future Work
-====================
-1. We can be more careful about not actually writing to CR3
-   unless its value is actually changed.
-2. Allow PTI to be enabled/disabled at runtime in addition to the
-   boot-time switching.
-
-Testing
-========
-
-To test stability of PTI, the following test procedure is recommended,
-ideally doing all of these in parallel:
-
-1. Set CONFIG_DEBUG_ENTRY=y
-2. Run several copies of all of the tools/testing/selftests/x86/ tests
-   (excluding MPX and protection_keys) in a loop on multiple CPUs for
-   several minutes.  These tests frequently uncover corner cases in the
-   kernel entry code.  In general, old kernels might cause these tests
-   themselves to crash, but they should never crash the kernel.
-3. Run the 'perf' tool in a mode (top or record) that generates many
-   frequent performance monitoring non-maskable interrupts (see "NMI"
-   in /proc/interrupts).  This exercises the NMI entry/exit code which
-   is known to trigger bugs in code paths that did not expect to be
-   interrupted, including nested NMIs.  Using "-c" boosts the rate of
-   NMIs, and using two -c with separate counters encourages nested NMIs
-   and less deterministic behavior.
-   ::
-
-	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
-
-4. Launch a KVM virtual machine.
-5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
-   This has been a lightly-tested code path and needs extra scrutiny.
-
-Debugging
-=========
-
-Bugs in PTI cause a few different signatures of crashes
-that are worth noting here.
-
- * Failures of the selftests/x86 code.  Usually a bug in one of the
-   more obscure corners of entry_64.S
- * Crashes in early boot, especially around CPU bringup.  Bugs
-   in the trampoline code or mappings cause these.
- * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
-   like screwing up a page table switch.  Also caused by
-   incorrectly mapping the IRQ handler entry code.
- * Crashes at the first NMI.  The NMI code is separate from main
-   interrupt handlers and can have bugs that do not affect
-   normal interrupts.  Also caused by incorrectly mapping NMI
-   code.  NMIs that interrupt the entry code must be very
-   careful and can be the cause of crashes that show up when
-   running perf.
- * Kernel crashes at the first exit to userspace.  entry_64.S
-   bugs, or failing to map some of the exit code.
- * Crashes at first interrupt that interrupts userspace. The paths
-   in entry_64.S that return to userspace are sometimes separate
-   from the ones that return to the kernel.
- * Double faults: overflowing the kernel stack because of page
-   faults upon page faults.  Caused by touching non-pti-mapped
-   data in the entry code, or forgetting to switch to kernel
-   CR3 before calling into C functions which are not pti-mapped.
- * Userspace segfaults early in boot, sometimes manifesting
-   as mount(8) failing to mount the rootfs.  These have
-   tended to be TLB invalidation issues.  Usually invalidating
-   the wrong PCID, or otherwise missing an invalidation.
-
-.. [1] https://gruss.cc/files/kaiser.pdf
-.. [2] https://meltdownattack.com/meltdown.pdf
diff --git a/Documentation/x86/resctrl.rst b/Documentation/x86/resctrl.rst
deleted file mode 100644
index 387ccbcb558f..000000000000
--- a/Documentation/x86/resctrl.rst
+++ /dev/null
@@ -1,1447 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-.. include:: <isonum.txt>
-
-===========================================
-User Interface for Resource Control feature
-===========================================
-
-:Copyright: |copy| 2016 Intel Corporation
-:Authors: - Fenghua Yu <fenghua.yu@intel.com>
-          - Tony Luck <tony.luck@intel.com>
-          - Vikas Shivappa <vikas.shivappa@intel.com>
-
-
-Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT).
-AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
-
-This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
-flag bits:
-
-===============================================	================================
-RDT (Resource Director Technology) Allocation	"rdt_a"
-CAT (Cache Allocation Technology)		"cat_l3", "cat_l2"
-CDP (Code and Data Prioritization)		"cdp_l3", "cdp_l2"
-CQM (Cache QoS Monitoring)			"cqm_llc", "cqm_occup_llc"
-MBM (Memory Bandwidth Monitoring)		"cqm_mbm_total", "cqm_mbm_local"
-MBA (Memory Bandwidth Allocation)		"mba"
-SMBA (Slow Memory Bandwidth Allocation)         ""
-BMEC (Bandwidth Monitoring Event Configuration) ""
-===============================================	================================
-
-Historically, new features were made visible by default in /proc/cpuinfo. This
-resulted in the feature flags becoming hard to parse by humans. Adding a new
-flag to /proc/cpuinfo should be avoided if user space can obtain information
-about the feature from resctrl's info directory.
-
-To use the feature mount the file system::
-
- # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl
-
-mount options are:
-
-"cdp":
-	Enable code/data prioritization in L3 cache allocations.
-"cdpl2":
-	Enable code/data prioritization in L2 cache allocations.
-"mba_MBps":
-	Enable the MBA Software Controller(mba_sc) to specify MBA
-	bandwidth in MBps
-
-L2 and L3 CDP are controlled separately.
-
-RDT features are orthogonal. A particular system may support only
-monitoring, only control, or both monitoring and control.  Cache
-pseudo-locking is a unique way of using cache control to "pin" or
-"lock" data in the cache. Details can be found in
-"Cache Pseudo-Locking".
-
-
-The mount succeeds if either of allocation or monitoring is present, but
-only those files and directories supported by the system will be created.
-For more details on the behavior of the interface during monitoring
-and allocation, see the "Resource alloc and monitor groups" section.
-
-Info directory
-==============
-
-The 'info' directory contains information about the enabled
-resources. Each resource has its own subdirectory. The subdirectory
-names reflect the resource names.
-
-Each subdirectory contains the following files with respect to
-allocation:
-
-Cache resource(L3/L2)  subdirectory contains the following files
-related to allocation:
-
-"num_closids":
-		The number of CLOSIDs which are valid for this
-		resource. The kernel uses the smallest number of
-		CLOSIDs of all enabled resources as limit.
-"cbm_mask":
-		The bitmask which is valid for this resource.
-		This mask is equivalent to 100%.
-"min_cbm_bits":
-		The minimum number of consecutive bits which
-		must be set when writing a mask.
-
-"shareable_bits":
-		Bitmask of shareable resource with other executing
-		entities (e.g. I/O). User can use this when
-		setting up exclusive cache partitions. Note that
-		some platforms support devices that have their
-		own settings for cache use which can over-ride
-		these bits.
-"bit_usage":
-		Annotated capacity bitmasks showing how all
-		instances of the resource are used. The legend is:
-
-			"0":
-			      Corresponding region is unused. When the system's
-			      resources have been allocated and a "0" is found
-			      in "bit_usage" it is a sign that resources are
-			      wasted.
-
-			"H":
-			      Corresponding region is used by hardware only
-			      but available for software use. If a resource
-			      has bits set in "shareable_bits" but not all
-			      of these bits appear in the resource groups'
-			      schematas then the bits appearing in
-			      "shareable_bits" but no resource group will
-			      be marked as "H".
-			"X":
-			      Corresponding region is available for sharing and
-			      used by hardware and software. These are the
-			      bits that appear in "shareable_bits" as
-			      well as a resource group's allocation.
-			"S":
-			      Corresponding region is used by software
-			      and available for sharing.
-			"E":
-			      Corresponding region is used exclusively by
-			      one resource group. No sharing allowed.
-			"P":
-			      Corresponding region is pseudo-locked. No
-			      sharing allowed.
-
-Memory bandwidth(MB) subdirectory contains the following files
-with respect to allocation:
-
-"min_bandwidth":
-		The minimum memory bandwidth percentage which
-		user can request.
-
-"bandwidth_gran":
-		The granularity in which the memory bandwidth
-		percentage is allocated. The allocated
-		b/w percentage is rounded off to the next
-		control step available on the hardware. The
-		available bandwidth control steps are:
-		min_bandwidth + N * bandwidth_gran.
-
-"delay_linear":
-		Indicates if the delay scale is linear or
-		non-linear. This field is purely informational
-		only.
-
-"thread_throttle_mode":
-		Indicator on Intel systems of how tasks running on threads
-		of a physical core are throttled in cases where they
-		request different memory bandwidth percentages:
-
-		"max":
-			the smallest percentage is applied
-			to all threads
-		"per-thread":
-			bandwidth percentages are directly applied to
-			the threads running on the core
-
-If RDT monitoring is available there will be an "L3_MON" directory
-with the following files:
-
-"num_rmids":
-		The number of RMIDs available. This is the
-		upper bound for how many "CTRL_MON" + "MON"
-		groups can be created.
-
-"mon_features":
-		Lists the monitoring events if
-		monitoring is enabled for the resource.
-		Example::
-
-			# cat /sys/fs/resctrl/info/L3_MON/mon_features
-			llc_occupancy
-			mbm_total_bytes
-			mbm_local_bytes
-
-		If the system supports Bandwidth Monitoring Event
-		Configuration (BMEC), then the bandwidth events will
-		be configurable. The output will be::
-
-			# cat /sys/fs/resctrl/info/L3_MON/mon_features
-			llc_occupancy
-			mbm_total_bytes
-			mbm_total_bytes_config
-			mbm_local_bytes
-			mbm_local_bytes_config
-
-"mbm_total_bytes_config", "mbm_local_bytes_config":
-	Read/write files containing the configuration for the mbm_total_bytes
-	and mbm_local_bytes events, respectively, when the Bandwidth
-	Monitoring Event Configuration (BMEC) feature is supported.
-	The event configuration settings are domain specific and affect
-	all the CPUs in the domain. When either event configuration is
-	changed, the bandwidth counters for all RMIDs of both events
-	(mbm_total_bytes as well as mbm_local_bytes) are cleared for that
-	domain. The next read for every RMID will report "Unavailable"
-	and subsequent reads will report the valid value.
-
-	Following are the types of events supported:
-
-	====    ========================================================
-	Bits    Description
-	====    ========================================================
-	6       Dirty Victims from the QOS domain to all types of memory
-	5       Reads to slow memory in the non-local NUMA domain
-	4       Reads to slow memory in the local NUMA domain
-	3       Non-temporal writes to non-local NUMA domain
-	2       Non-temporal writes to local NUMA domain
-	1       Reads to memory in the non-local NUMA domain
-	0       Reads to memory in the local NUMA domain
-	====    ========================================================
-
-	By default, the mbm_total_bytes configuration is set to 0x7f to count
-	all the event types and the mbm_local_bytes configuration is set to
-	0x15 to count all the local memory events.
-
-	Examples:
-
-	* To view the current configuration::
-	  ::
-
-	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
-	    0=0x7f;1=0x7f;2=0x7f;3=0x7f
-
-	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
-	    0=0x15;1=0x15;3=0x15;4=0x15
-
-	* To change the mbm_total_bytes to count only reads on domain 0,
-	  the bits 0, 1, 4 and 5 needs to be set, which is 110011b in binary
-	  (in hexadecimal 0x33):
-	  ::
-
-	    # echo  "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
-
-	    # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
-	    0=0x33;1=0x7f;2=0x7f;3=0x7f
-
-	* To change the mbm_local_bytes to count all the slow memory reads on
-	  domain 0 and 1, the bits 4 and 5 needs to be set, which is 110000b
-	  in binary (in hexadecimal 0x30):
-	  ::
-
-	    # echo  "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
-
-	    # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
-	    0=0x30;1=0x30;3=0x15;4=0x15
-
-"max_threshold_occupancy":
-		Read/write file provides the largest value (in
-		bytes) at which a previously used LLC_occupancy
-		counter can be considered for re-use.
-
-Finally, in the top level of the "info" directory there is a file
-named "last_cmd_status". This is reset with every "command" issued
-via the file system (making new directories or writing to any of the
-control files). If the command was successful, it will read as "ok".
-If the command failed, it will provide more information that can be
-conveyed in the error returns from file operations. E.g.
-::
-
-	# echo L3:0=f7 > schemata
-	bash: echo: write error: Invalid argument
-	# cat info/last_cmd_status
-	mask f7 has non-consecutive 1-bits
-
-Resource alloc and monitor groups
-=================================
-
-Resource groups are represented as directories in the resctrl file
-system.  The default group is the root directory which, immediately
-after mounting, owns all the tasks and cpus in the system and can make
-full use of all resources.
-
-On a system with RDT control features additional directories can be
-created in the root directory that specify different amounts of each
-resource (see "schemata" below). The root and these additional top level
-directories are referred to as "CTRL_MON" groups below.
-
-On a system with RDT monitoring the root directory and other top level
-directories contain a directory named "mon_groups" in which additional
-directories can be created to monitor subsets of tasks in the CTRL_MON
-group that is their ancestor. These are called "MON" groups in the rest
-of this document.
-
-Removing a directory will move all tasks and cpus owned by the group it
-represents to the parent. Removing one of the created CTRL_MON groups
-will automatically remove all MON groups below it.
-
-All groups contain the following files:
-
-"tasks":
-	Reading this file shows the list of all tasks that belong to
-	this group. Writing a task id to the file will add a task to the
-	group. If the group is a CTRL_MON group the task is removed from
-	whichever previous CTRL_MON group owned the task and also from
-	any MON group that owned the task. If the group is a MON group,
-	then the task must already belong to the CTRL_MON parent of this
-	group. The task is removed from any previous MON group.
-
-
-"cpus":
-	Reading this file shows a bitmask of the logical CPUs owned by
-	this group. Writing a mask to this file will add and remove
-	CPUs to/from this group. As with the tasks file a hierarchy is
-	maintained where MON groups may only include CPUs owned by the
-	parent CTRL_MON group.
-	When the resource group is in pseudo-locked mode this file will
-	only be readable, reflecting the CPUs associated with the
-	pseudo-locked region.
-
-
-"cpus_list":
-	Just like "cpus", only using ranges of CPUs instead of bitmasks.
-
-
-When control is enabled all CTRL_MON groups will also contain:
-
-"schemata":
-	A list of all the resources available to this group.
-	Each resource has its own line and format - see below for details.
-
-"size":
-	Mirrors the display of the "schemata" file to display the size in
-	bytes of each allocation instead of the bits representing the
-	allocation.
-
-"mode":
-	The "mode" of the resource group dictates the sharing of its
-	allocations. A "shareable" resource group allows sharing of its
-	allocations while an "exclusive" resource group does not. A
-	cache pseudo-locked region is created by first writing
-	"pseudo-locksetup" to the "mode" file before writing the cache
-	pseudo-locked region's schemata to the resource group's "schemata"
-	file. On successful pseudo-locked region creation the mode will
-	automatically change to "pseudo-locked".
-
-When monitoring is enabled all MON groups will also contain:
-
-"mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
-	directories have one file per event (e.g. "llc_occupancy",
-	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
-	files provide a read out of the current value of the event for
-	all tasks in the group. In CTRL_MON groups these files provide
-	the sum for all tasks in the CTRL_MON group and all tasks in
-	MON groups. Please see example section for more details on usage.
-
-Resource allocation rules
--------------------------
-
-When a task is running the following rules define which resources are
-available to it:
-
-1) If the task is a member of a non-default group, then the schemata
-   for that group is used.
-
-2) Else if the task belongs to the default group, but is running on a
-   CPU that is assigned to some specific group, then the schemata for the
-   CPU's group is used.
-
-3) Otherwise the schemata for the default group is used.
-
-Resource monitoring rules
--------------------------
-1) If a task is a member of a MON group, or non-default CTRL_MON group
-   then RDT events for the task will be reported in that group.
-
-2) If a task is a member of the default CTRL_MON group, but is running
-   on a CPU that is assigned to some specific group, then the RDT events
-   for the task will be reported in that group.
-
-3) Otherwise RDT events for the task will be reported in the root level
-   "mon_data" group.
-
-
-Notes on cache occupancy monitoring and control
-===============================================
-When moving a task from one group to another you should remember that
-this only affects *new* cache allocations by the task. E.g. you may have
-a task in a monitor group showing 3 MB of cache occupancy. If you move
-to a new group and immediately check the occupancy of the old and new
-groups you will likely see that the old group is still showing 3 MB and
-the new group zero. When the task accesses locations still in cache from
-before the move, the h/w does not update any counters. On a busy system
-you will likely see the occupancy in the old group go down as cache lines
-are evicted and re-used while the occupancy in the new group rises as
-the task accesses memory and loads into the cache are counted based on
-membership in the new group.
-
-The same applies to cache allocation control. Moving a task to a group
-with a smaller cache partition will not evict any cache lines. The
-process may continue to use them from the old partition.
-
-Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
-to identify a control group and a monitoring group respectively. Each of
-the resource groups are mapped to these IDs based on the kind of group. The
-number of CLOSid and RMID are limited by the hardware and hence the creation of
-a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
-and creation of "MON" group may fail if we run out of RMIDs.
-
-max_threshold_occupancy - generic concepts
-------------------------------------------
-
-Note that an RMID once freed may not be immediately available for use as
-the RMID is still tagged the cache lines of the previous user of RMID.
-Hence such RMIDs are placed on limbo list and checked back if the cache
-occupancy has gone down. If there is a time when system has a lot of
-limbo RMIDs but which are not ready to be used, user may see an -EBUSY
-during mkdir.
-
-max_threshold_occupancy is a user configurable value to determine the
-occupancy at which an RMID can be freed.
-
-Schemata files - general concepts
----------------------------------
-Each line in the file describes one resource. The line starts with
-the name of the resource, followed by specific values to be applied
-in each of the instances of that resource on the system.
-
-Cache IDs
----------
-On current generation systems there is one L3 cache per socket and L2
-caches are generally just shared by the hyperthreads on a core, but this
-isn't an architectural requirement. We could have multiple separate L3
-caches on a socket, multiple cores could share an L2 cache. So instead
-of using "socket" or "core" to define the set of logical cpus sharing
-a resource we use a "Cache ID". At a given cache level this will be a
-unique number across the whole system (but it isn't guaranteed to be a
-contiguous sequence, there may be gaps).  To find the ID for each logical
-CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
-
-Cache Bit Masks (CBM)
----------------------
-For cache resources we describe the portion of the cache that is available
-for allocation using a bitmask. The maximum value of the mask is defined
-by each cpu model (and may be different for different cache levels). It
-is found using CPUID, but is also provided in the "info" directory of
-the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
-requires that these masks have all the '1' bits in a contiguous block. So
-0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
-and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
-of the capacity of the cache. You could partition the cache into four
-equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
-
-Memory bandwidth Allocation and monitoring
-==========================================
-
-For Memory bandwidth resource, by default the user controls the resource
-by indicating the percentage of total memory bandwidth.
-
-The minimum bandwidth percentage value for each cpu model is predefined
-and can be looked up through "info/MB/min_bandwidth". The bandwidth
-granularity that is allocated is also dependent on the cpu model and can
-be looked up at "info/MB/bandwidth_gran". The available bandwidth
-control steps are: min_bw + N * bw_gran. Intermediate values are rounded
-to the next control step available on the hardware.
-
-The bandwidth throttling is a core specific mechanism on some of Intel
-SKUs. Using a high bandwidth and a low bandwidth setting on two threads
-sharing a core may result in both threads being throttled to use the
-low bandwidth (see "thread_throttle_mode").
-
-The fact that Memory bandwidth allocation(MBA) may be a core
-specific mechanism where as memory bandwidth monitoring(MBM) is done at
-the package level may lead to confusion when users try to apply control
-via the MBA and then monitor the bandwidth to see if the controls are
-effective. Below are such scenarios:
-
-1. User may *not* see increase in actual bandwidth when percentage
-   values are increased:
-
-This can occur when aggregate L2 external bandwidth is more than L3
-external bandwidth. Consider an SKL SKU with 24 cores on a package and
-where L2 external  is 10GBps (hence aggregate L2 external bandwidth is
-240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
-threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
-bandwidth of 100GBps although the percentage value specified is only 50%
-<< 100%. Hence increasing the bandwidth percentage will not yield any
-more bandwidth. This is because although the L2 external bandwidth still
-has capacity, the L3 external bandwidth is fully used. Also note that
-this would be dependent on number of cores the benchmark is run on.
-
-2. Same bandwidth percentage may mean different actual bandwidth
-   depending on # of threads:
-
-For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
-thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although
-they have same percentage bandwidth of 10%. This is simply because as
-threads start using more cores in an rdtgroup, the actual bandwidth may
-increase or vary although user specified bandwidth percentage is same.
-
-In order to mitigate this and make the interface more user friendly,
-resctrl added support for specifying the bandwidth in MBps as well.  The
-kernel underneath would use a software feedback mechanism or a "Software
-Controller(mba_sc)" which reads the actual bandwidth using MBM counters
-and adjust the memory bandwidth percentages to ensure::
-
-	"actual bandwidth < user specified bandwidth".
-
-By default, the schemata would take the bandwidth percentage values
-where as user can switch to the "MBA software controller" mode using
-a mount option 'mba_MBps'. The schemata format is specified in the below
-sections.
-
-L3 schemata file details (code and data prioritization disabled)
-----------------------------------------------------------------
-With CDP disabled the L3 schemata format is::
-
-	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-
-L3 schemata file details (CDP enabled via mount option to resctrl)
-------------------------------------------------------------------
-When CDP is enabled L3 control is split into two separate resources
-so you can specify independent masks for code and data like this::
-
-	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-
-L2 schemata file details
-------------------------
-CDP is supported at L2 using the 'cdpl2' mount option. The schemata
-format is either::
-
-	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-
-or
-
-	L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-	L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-
-
-Memory bandwidth Allocation (default mode)
-------------------------------------------
-
-Memory b/w domain is L3 cache.
-::
-
-	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-
-Memory bandwidth Allocation specified in MBps
----------------------------------------------
-
-Memory bandwidth domain is L3 cache.
-::
-
-	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
-
-Slow Memory Bandwidth Allocation (SMBA)
----------------------------------------
-AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
-CXL.memory is the only supported "slow" memory device. With the
-support of SMBA, the hardware enables bandwidth allocation on
-the slow memory devices. If there are multiple such devices in
-the system, the throttling logic groups all the slow sources
-together and applies the limit on them as a whole.
-
-The presence of SMBA (with CXL.memory) is independent of slow memory
-devices presence. If there are no such devices on the system, then
-configuring SMBA will have no impact on the performance of the system.
-
-The bandwidth domain for slow memory is L3 cache. Its schemata file
-is formatted as:
-::
-
-	SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-
-Reading/writing the schemata file
----------------------------------
-Reading the schemata file will show the state of all resources
-on all domains. When writing you only need to specify those values
-which you wish to change.  E.g.
-::
-
-  # cat schemata
-  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
-  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
-  # echo "L3DATA:2=3c0;" > schemata
-  # cat schemata
-  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
-  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
-
-Reading/writing the schemata file (on AMD systems)
---------------------------------------------------
-Reading the schemata file will show the current bandwidth limit on all
-domains. The allocated resources are in multiples of one eighth GB/s.
-When writing to the file, you need to specify what cache id you wish to
-configure the bandwidth limit.
-
-For example, to allocate 2GB/s limit on the first cache id:
-
-::
-
-  # cat schemata
-    MB:0=2048;1=2048;2=2048;3=2048
-    L3:0=ffff;1=ffff;2=ffff;3=ffff
-
-  # echo "MB:1=16" > schemata
-  # cat schemata
-    MB:0=2048;1=  16;2=2048;3=2048
-    L3:0=ffff;1=ffff;2=ffff;3=ffff
-
-Reading/writing the schemata file (on AMD systems) with SMBA feature
---------------------------------------------------------------------
-Reading and writing the schemata file is the same as without SMBA in
-above section.
-
-For example, to allocate 8GB/s limit on the first cache id:
-
-::
-
-  # cat schemata
-    SMBA:0=2048;1=2048;2=2048;3=2048
-      MB:0=2048;1=2048;2=2048;3=2048
-      L3:0=ffff;1=ffff;2=ffff;3=ffff
-
-  # echo "SMBA:1=64" > schemata
-  # cat schemata
-    SMBA:0=2048;1=  64;2=2048;3=2048
-      MB:0=2048;1=2048;2=2048;3=2048
-      L3:0=ffff;1=ffff;2=ffff;3=ffff
-
-Cache Pseudo-Locking
-====================
-CAT enables a user to specify the amount of cache space that an
-application can fill. Cache pseudo-locking builds on the fact that a
-CPU can still read and write data pre-allocated outside its current
-allocated area on a cache hit. With cache pseudo-locking, data can be
-preloaded into a reserved portion of cache that no application can
-fill, and from that point on will only serve cache hits. The cache
-pseudo-locked memory is made accessible to user space where an
-application can map it into its virtual address space and thus have
-a region of memory with reduced average read latency.
-
-The creation of a cache pseudo-locked region is triggered by a request
-from the user to do so that is accompanied by a schemata of the region
-to be pseudo-locked. The cache pseudo-locked region is created as follows:
-
-- Create a CAT allocation CLOSNEW with a CBM matching the schemata
-  from the user of the cache region that will contain the pseudo-locked
-  memory. This region must not overlap with any current CAT allocation/CLOS
-  on the system and no future overlap with this cache region is allowed
-  while the pseudo-locked region exists.
-- Create a contiguous region of memory of the same size as the cache
-  region.
-- Flush the cache, disable hardware prefetchers, disable preemption.
-- Make CLOSNEW the active CLOS and touch the allocated memory to load
-  it into the cache.
-- Set the previous CLOS as active.
-- At this point the closid CLOSNEW can be released - the cache
-  pseudo-locked region is protected as long as its CBM does not appear in
-  any CAT allocation. Even though the cache pseudo-locked region will from
-  this point on not appear in any CBM of any CLOS an application running with
-  any CLOS will be able to access the memory in the pseudo-locked region since
-  the region continues to serve cache hits.
-- The contiguous region of memory loaded into the cache is exposed to
-  user-space as a character device.
-
-Cache pseudo-locking increases the probability that data will remain
-in the cache via carefully configuring the CAT feature and controlling
-application behavior. There is no guarantee that data is placed in
-cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
-“locked” data from cache. Power management C-states may shrink or
-power off cache. Deeper C-states will automatically be restricted on
-pseudo-locked region creation.
-
-It is required that an application using a pseudo-locked region runs
-with affinity to the cores (or a subset of the cores) associated
-with the cache on which the pseudo-locked region resides. A sanity check
-within the code will not allow an application to map pseudo-locked memory
-unless it runs with affinity to cores associated with the cache on which the
-pseudo-locked region resides. The sanity check is only done during the
-initial mmap() handling, there is no enforcement afterwards and the
-application self needs to ensure it remains affine to the correct cores.
-
-Pseudo-locking is accomplished in two stages:
-
-1) During the first stage the system administrator allocates a portion
-   of cache that should be dedicated to pseudo-locking. At this time an
-   equivalent portion of memory is allocated, loaded into allocated
-   cache portion, and exposed as a character device.
-2) During the second stage a user-space application maps (mmap()) the
-   pseudo-locked memory into its address space.
-
-Cache Pseudo-Locking Interface
-------------------------------
-A pseudo-locked region is created using the resctrl interface as follows:
-
-1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
-2) Change the new resource group's mode to "pseudo-locksetup" by writing
-   "pseudo-locksetup" to the "mode" file.
-3) Write the schemata of the pseudo-locked region to the "schemata" file. All
-   bits within the schemata should be "unused" according to the "bit_usage"
-   file.
-
-On successful pseudo-locked region creation the "mode" file will contain
-"pseudo-locked" and a new character device with the same name as the resource
-group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
-by user space in order to obtain access to the pseudo-locked memory region.
-
-An example of cache pseudo-locked region creation and usage can be found below.
-
-Cache Pseudo-Locking Debugging Interface
-----------------------------------------
-The pseudo-locking debugging interface is enabled by default (if
-CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
-
-There is no explicit way for the kernel to test if a provided memory
-location is present in the cache. The pseudo-locking debugging interface uses
-the tracing infrastructure to provide two ways to measure cache residency of
-the pseudo-locked region:
-
-1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
-   from these measurements are best visualized using a hist trigger (see
-   example below). In this test the pseudo-locked region is traversed at
-   a stride of 32 bytes while hardware prefetchers and preemption
-   are disabled. This also provides a substitute visualization of cache
-   hits and misses.
-2) Cache hit and miss measurements using model specific precision counters if
-   available. Depending on the levels of cache on the system the pseudo_lock_l2
-   and pseudo_lock_l3 tracepoints are available.
-
-When a pseudo-locked region is created a new debugfs directory is created for
-it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
-write-only file, pseudo_lock_measure, is present in this directory. The
-measurement of the pseudo-locked region depends on the number written to this
-debugfs file:
-
-1:
-     writing "1" to the pseudo_lock_measure file will trigger the latency
-     measurement captured in the pseudo_lock_mem_latency tracepoint. See
-     example below.
-2:
-     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
-     residency (cache hits and misses) measurement captured in the
-     pseudo_lock_l2 tracepoint. See example below.
-3:
-     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
-     residency (cache hits and misses) measurement captured in the
-     pseudo_lock_l3 tracepoint.
-
-All measurements are recorded with the tracing infrastructure. This requires
-the relevant tracepoints to be enabled before the measurement is triggered.
-
-Example of latency debugging interface
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-In this example a pseudo-locked region named "newlock" was created. Here is
-how we can measure the latency in cycles of reading from this region and
-visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
-is set::
-
-  # :> /sys/kernel/tracing/trace
-  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
-  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
-  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
-  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
-  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
-
-  # event histogram
-  #
-  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
-  #
-
-  { latency:        456 } hitcount:          1
-  { latency:         50 } hitcount:         83
-  { latency:         36 } hitcount:         96
-  { latency:         44 } hitcount:        174
-  { latency:         48 } hitcount:        195
-  { latency:         46 } hitcount:        262
-  { latency:         42 } hitcount:        693
-  { latency:         40 } hitcount:       3204
-  { latency:         38 } hitcount:       3484
-
-  Totals:
-      Hits: 8192
-      Entries: 9
-    Dropped: 0
-
-Example of cache hits/misses debugging
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-In this example a pseudo-locked region named "newlock" was created on the L2
-cache of a platform. Here is how we can obtain details of the cache hits
-and misses using the platform's precision counters.
-::
-
-  # :> /sys/kernel/tracing/trace
-  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
-  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
-  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
-  # cat /sys/kernel/tracing/trace
-
-  # tracer: nop
-  #
-  #                              _-----=> irqs-off
-  #                             / _----=> need-resched
-  #                            | / _---=> hardirq/softirq
-  #                            || / _--=> preempt-depth
-  #                            ||| /     delay
-  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
-  #              | |       |   ||||       |         |
-  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0
-
-
-Examples for RDT allocation usage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-1) Example 1
-
-On a two socket machine (one L3 cache per socket) with just four bits
-for cache bit masks, minimum b/w of 10% with a memory bandwidth
-granularity of 10%.
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-  # mkdir p0 p1
-  # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
-  # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
-
-The default resource group is unmodified, so we have access to all parts
-of all caches (its schemata file reads "L3:0=f;1=f").
-
-Tasks that are under the control of group "p0" may only allocate from the
-"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
-Tasks in group "p1" use the "lower" 50% of cache on both sockets.
-
-Similarly, tasks that are under the control of group "p0" may use a
-maximum memory b/w of 50% on socket0 and 50% on socket 1.
-Tasks in group "p1" may also use 50% memory b/w on both sockets.
-Note that unlike cache masks, memory b/w cannot specify whether these
-allocations can overlap or not. The allocations specifies the maximum
-b/w that the group may be able to use and the system admin can configure
-the b/w accordingly.
-
-If resctrl is using the software controller (mba_sc) then user can enter the
-max b/w in MB rather than the percentage values.
-::
-
-  # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
-  # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
-
-In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
-of 1024MB where as on socket 1 they would use 500MB.
-
-2) Example 2
-
-Again two sockets, but this time with a more realistic 20-bit mask.
-
-Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
-processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
-neighbors, each of the two real-time tasks exclusively occupies one quarter
-of L3 cache on socket 0.
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-
-First we reset the schemata for the default group so that the "upper"
-50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
-ordinary tasks::
-
-  # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
-
-Next we make a resource group for our first real time task and give
-it access to the "top" 25% of the cache on socket 0.
-::
-
-  # mkdir p0
-  # echo "L3:0=f8000;1=fffff" > p0/schemata
-
-Finally we move our first real time task into this resource group. We
-also use taskset(1) to ensure the task always runs on a dedicated CPU
-on socket 0. Most uses of resource groups will also constrain which
-processors tasks run on.
-::
-
-  # echo 1234 > p0/tasks
-  # taskset -cp 1 1234
-
-Ditto for the second real time task (with the remaining 25% of cache)::
-
-  # mkdir p1
-  # echo "L3:0=7c00;1=fffff" > p1/schemata
-  # echo 5678 > p1/tasks
-  # taskset -cp 2 5678
-
-For the same 2 socket system with memory b/w resource and CAT L3 the
-schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is
-10):
-
-For our first real time task this would request 20% memory b/w on socket 0.
-::
-
-  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
-
-For our second real time task this would request an other 20% memory b/w
-on socket 0.
-::
-
-  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
-
-3) Example 3
-
-A single socket system which has real-time tasks running on core 4-7 and
-non real-time workload assigned to core 0-3. The real-time tasks share text
-and data, so a per task association is not required and due to interaction
-with the kernel it's desired that the kernel on these cores shares L3 with
-the tasks.
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-
-First we reset the schemata for the default group so that the "upper"
-50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
-cannot be used by ordinary tasks::
-
-  # echo "L3:0=3ff\nMB:0=50" > schemata
-
-Next we make a resource group for our real time cores and give it access
-to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
-socket 0.
-::
-
-  # mkdir p0
-  # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
-
-Finally we move core 4-7 over to the new group and make sure that the
-kernel and the tasks running there get 50% of the cache. They should
-also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
-siblings and only the real time threads are scheduled on the cores 4-7.
-::
-
-  # echo F0 > p0/cpus
-
-4) Example 4
-
-The resource groups in previous examples were all in the default "shareable"
-mode allowing sharing of their cache allocations. If one resource group
-configures a cache allocation then nothing prevents another resource group
-to overlap with that allocation.
-
-In this example a new exclusive resource group will be created on a L2 CAT
-system with two L2 cache instances that can be configured with an 8-bit
-capacity bitmask. The new exclusive resource group will be configured to use
-25% of each cache instance.
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl/
-  # cd /sys/fs/resctrl
-
-First, we observe that the default group is configured to allocate to all L2
-cache::
-
-  # cat schemata
-  L2:0=ff;1=ff
-
-We could attempt to create the new resource group at this point, but it will
-fail because of the overlap with the schemata of the default group::
-
-  # mkdir p0
-  # echo 'L2:0=0x3;1=0x3' > p0/schemata
-  # cat p0/mode
-  shareable
-  # echo exclusive > p0/mode
-  -sh: echo: write error: Invalid argument
-  # cat info/last_cmd_status
-  schemata overlaps
-
-To ensure that there is no overlap with another resource group the default
-resource group's schemata has to change, making it possible for the new
-resource group to become exclusive.
-::
-
-  # echo 'L2:0=0xfc;1=0xfc' > schemata
-  # echo exclusive > p0/mode
-  # grep . p0/*
-  p0/cpus:0
-  p0/mode:exclusive
-  p0/schemata:L2:0=03;1=03
-  p0/size:L2:0=262144;1=262144
-
-A new resource group will on creation not overlap with an exclusive resource
-group::
-
-  # mkdir p1
-  # grep . p1/*
-  p1/cpus:0
-  p1/mode:shareable
-  p1/schemata:L2:0=fc;1=fc
-  p1/size:L2:0=786432;1=786432
-
-The bit_usage will reflect how the cache is used::
-
-  # cat info/L2/bit_usage
-  0=SSSSSSEE;1=SSSSSSEE
-
-A resource group cannot be forced to overlap with an exclusive resource group::
-
-  # echo 'L2:0=0x1;1=0x1' > p1/schemata
-  -sh: echo: write error: Invalid argument
-  # cat info/last_cmd_status
-  overlaps with exclusive group
-
-Example of Cache Pseudo-Locking
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
-region is exposed at /dev/pseudo_lock/newlock that can be provided to
-application for argument to mmap().
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl/
-  # cd /sys/fs/resctrl
-
-Ensure that there are bits available that can be pseudo-locked, since only
-unused bits can be pseudo-locked the bits to be pseudo-locked needs to be
-removed from the default resource group's schemata::
-
-  # cat info/L2/bit_usage
-  0=SSSSSSSS;1=SSSSSSSS
-  # echo 'L2:1=0xfc' > schemata
-  # cat info/L2/bit_usage
-  0=SSSSSSSS;1=SSSSSS00
-
-Create a new resource group that will be associated with the pseudo-locked
-region, indicate that it will be used for a pseudo-locked region, and
-configure the requested pseudo-locked region capacity bitmask::
-
-  # mkdir newlock
-  # echo pseudo-locksetup > newlock/mode
-  # echo 'L2:1=0x3' > newlock/schemata
-
-On success the resource group's mode will change to pseudo-locked, the
-bit_usage will reflect the pseudo-locked region, and the character device
-exposing the pseudo-locked region will exist::
-
-  # cat newlock/mode
-  pseudo-locked
-  # cat info/L2/bit_usage
-  0=SSSSSSSS;1=SSSSSSPP
-  # ls -l /dev/pseudo_lock/newlock
-  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
-
-::
-
-  /*
-  * Example code to access one page of pseudo-locked cache region
-  * from user space.
-  */
-  #define _GNU_SOURCE
-  #include <fcntl.h>
-  #include <sched.h>
-  #include <stdio.h>
-  #include <stdlib.h>
-  #include <unistd.h>
-  #include <sys/mman.h>
-
-  /*
-  * It is required that the application runs with affinity to only
-  * cores associated with the pseudo-locked region. Here the cpu
-  * is hardcoded for convenience of example.
-  */
-  static int cpuid = 2;
-
-  int main(int argc, char *argv[])
-  {
-    cpu_set_t cpuset;
-    long page_size;
-    void *mapping;
-    int dev_fd;
-    int ret;
-
-    page_size = sysconf(_SC_PAGESIZE);
-
-    CPU_ZERO(&cpuset);
-    CPU_SET(cpuid, &cpuset);
-    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
-    if (ret < 0) {
-      perror("sched_setaffinity");
-      exit(EXIT_FAILURE);
-    }
-
-    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
-    if (dev_fd < 0) {
-      perror("open");
-      exit(EXIT_FAILURE);
-    }
-
-    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
-            dev_fd, 0);
-    if (mapping == MAP_FAILED) {
-      perror("mmap");
-      close(dev_fd);
-      exit(EXIT_FAILURE);
-    }
-
-    /* Application interacts with pseudo-locked memory @mapping */
-
-    ret = munmap(mapping, page_size);
-    if (ret < 0) {
-      perror("munmap");
-      close(dev_fd);
-      exit(EXIT_FAILURE);
-    }
-
-    close(dev_fd);
-    exit(EXIT_SUCCESS);
-  }
-
-Locking between applications
-----------------------------
-
-Certain operations on the resctrl filesystem, composed of read/writes
-to/from multiple files, must be atomic.
-
-As an example, the allocation of an exclusive reservation of L3 cache
-involves:
-
-  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
-  2. Find a contiguous set of bits in the global CBM bitmask that is clear
-     in any of the directory cbmmasks
-  3. Create a new directory
-  4. Set the bits found in step 2 to the new directory "schemata" file
-
-If two applications attempt to allocate space concurrently then they can
-end up allocating the same bits so the reservations are shared instead of
-exclusive.
-
-To coordinate atomic operations on the resctrlfs and to avoid the problem
-above, the following locking procedure is recommended:
-
-Locking is based on flock, which is available in libc and also as a shell
-script command
-
-Write lock:
-
- A) Take flock(LOCK_EX) on /sys/fs/resctrl
- B) Read/write the directory structure.
- C) funlock
-
-Read lock:
-
- A) Take flock(LOCK_SH) on /sys/fs/resctrl
- B) If success read the directory structure.
- C) funlock
-
-Example with bash::
-
-  # Atomically read directory structure
-  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
-
-  # Read directory contents and create new subdirectory
-
-  $ cat create-dir.sh
-  find /sys/fs/resctrl/ > output.txt
-  mask = function-of(output.txt)
-  mkdir /sys/fs/resctrl/newres/
-  echo mask > /sys/fs/resctrl/newres/schemata
-
-  $ flock /sys/fs/resctrl/ ./create-dir.sh
-
-Example with C::
-
-  /*
-  * Example code do take advisory locks
-  * before accessing resctrl filesystem
-  */
-  #include <sys/file.h>
-  #include <stdlib.h>
-
-  void resctrl_take_shared_lock(int fd)
-  {
-    int ret;
-
-    /* take shared lock on resctrl filesystem */
-    ret = flock(fd, LOCK_SH);
-    if (ret) {
-      perror("flock");
-      exit(-1);
-    }
-  }
-
-  void resctrl_take_exclusive_lock(int fd)
-  {
-    int ret;
-
-    /* release lock on resctrl filesystem */
-    ret = flock(fd, LOCK_EX);
-    if (ret) {
-      perror("flock");
-      exit(-1);
-    }
-  }
-
-  void resctrl_release_lock(int fd)
-  {
-    int ret;
-
-    /* take shared lock on resctrl filesystem */
-    ret = flock(fd, LOCK_UN);
-    if (ret) {
-      perror("flock");
-      exit(-1);
-    }
-  }
-
-  void main(void)
-  {
-    int fd, ret;
-
-    fd = open("/sys/fs/resctrl", O_DIRECTORY);
-    if (fd == -1) {
-      perror("open");
-      exit(-1);
-    }
-    resctrl_take_shared_lock(fd);
-    /* code to read directory contents */
-    resctrl_release_lock(fd);
-
-    resctrl_take_exclusive_lock(fd);
-    /* code to read and write directory contents */
-    resctrl_release_lock(fd);
-  }
-
-Examples for RDT Monitoring along with allocation usage
-=======================================================
-Reading monitored data
-----------------------
-Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
-show the current snapshot of LLC occupancy of the corresponding MON
-group or CTRL_MON group.
-
-
-Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
-------------------------------------------------------------------------
-On a two socket machine (one L3 cache per socket) with just four bits
-for cache bit masks::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-  # mkdir p0 p1
-  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
-  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
-  # echo 5678 > p1/tasks
-  # echo 5679 > p1/tasks
-
-The default resource group is unmodified, so we have access to all parts
-of all caches (its schemata file reads "L3:0=f;1=f").
-
-Tasks that are under the control of group "p0" may only allocate from the
-"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
-Tasks in group "p1" use the "lower" 50% of cache on both sockets.
-
-Create monitor groups and assign a subset of tasks to each monitor group.
-::
-
-  # cd /sys/fs/resctrl/p1/mon_groups
-  # mkdir m11 m12
-  # echo 5678 > m11/tasks
-  # echo 5679 > m12/tasks
-
-fetch data (data shown in bytes)
-::
-
-  # cat m11/mon_data/mon_L3_00/llc_occupancy
-  16234000
-  # cat m11/mon_data/mon_L3_01/llc_occupancy
-  14789000
-  # cat m12/mon_data/mon_L3_00/llc_occupancy
-  16789000
-
-The parent ctrl_mon group shows the aggregated data.
-::
-
-  # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
-  31234000
-
-Example 2 (Monitor a task from its creation)
---------------------------------------------
-On a two socket machine (one L3 cache per socket)::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-  # mkdir p0 p1
-
-An RMID is allocated to the group once its created and hence the <cmd>
-below is monitored from its creation.
-::
-
-  # echo $$ > /sys/fs/resctrl/p1/tasks
-  # <cmd>
-
-Fetch the data::
-
-  # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
-  31789000
-
-Example 3 (Monitor without CAT support or before creating CAT groups)
----------------------------------------------------------------------
-
-Assume a system like HSW has only CQM and no CAT support. In this case
-the resctrl will still mount but cannot create CTRL_MON directories.
-But user can create different MON groups within the root group thereby
-able to monitor all tasks including kernel threads.
-
-This can also be used to profile jobs cache size footprint before being
-able to allocate them to different allocation groups.
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-  # mkdir mon_groups/m01
-  # mkdir mon_groups/m02
-
-  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
-  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
-
-Monitor the groups separately and also get per domain data. From the
-below its apparent that the tasks are mostly doing work on
-domain(socket) 0.
-::
-
-  # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
-  31234000
-  # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
-  34555
-  # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
-  31234000
-  # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
-  32789
-
-
-Example 4 (Monitor real time tasks)
------------------------------------
-
-A single socket system which has real time tasks running on cores 4-7
-and non real time tasks on other cpus. We want to monitor the cache
-occupancy of the real time threads on these cores.
-::
-
-  # mount -t resctrl resctrl /sys/fs/resctrl
-  # cd /sys/fs/resctrl
-  # mkdir p1
-
-Move the cpus 4-7 over to p1::
-
-  # echo f0 > p1/cpus
-
-View the llc occupancy snapshot::
-
-  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
-  11234000
-
-Intel RDT Errata
-================
-
-Intel MBM Counters May Report System Memory Bandwidth Incorrectly
------------------------------------------------------------------
-
-Errata SKX99 for Skylake server and BDF102 for Broadwell server.
-
-Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
-according to the assigned Resource Monitor ID (RMID) for that logical
-core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
-metrics, may report incorrect system bandwidth for certain RMID values.
-
-Implication: Due to the errata, system memory bandwidth may not match
-what is reported.
-
-Workaround: MBM total and local readings are corrected according to the
-following correction factor table:
-
-+---------------+---------------+---------------+-----------------+
-|core count	|rmid count	|rmid threshold	|correction factor|
-+---------------+---------------+---------------+-----------------+
-|1		|8		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|2		|16		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|3		|24		|15		|0.969650	  |
-+---------------+---------------+---------------+-----------------+
-|4		|32		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|6		|48		|31		|0.969650	  |
-+---------------+---------------+---------------+-----------------+
-|7		|56		|47		|1.142857	  |
-+---------------+---------------+---------------+-----------------+
-|8		|64		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|9		|72		|63		|1.185115	  |
-+---------------+---------------+---------------+-----------------+
-|10		|80		|63		|1.066553	  |
-+---------------+---------------+---------------+-----------------+
-|11		|88		|79		|1.454545	  |
-+---------------+---------------+---------------+-----------------+
-|12		|96		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|13		|104		|95		|1.230769	  |
-+---------------+---------------+---------------+-----------------+
-|14		|112		|95		|1.142857	  |
-+---------------+---------------+---------------+-----------------+
-|15		|120		|95		|1.066667	  |
-+---------------+---------------+---------------+-----------------+
-|16		|128		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|17		|136		|127		|1.254863	  |
-+---------------+---------------+---------------+-----------------+
-|18		|144		|127		|1.185255	  |
-+---------------+---------------+---------------+-----------------+
-|19		|152		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|20		|160		|127		|1.066667	  |
-+---------------+---------------+---------------+-----------------+
-|21		|168		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|22		|176		|159		|1.454334	  |
-+---------------+---------------+---------------+-----------------+
-|23		|184		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|24		|192		|127		|0.969744	  |
-+---------------+---------------+---------------+-----------------+
-|25		|200		|191		|1.280246	  |
-+---------------+---------------+---------------+-----------------+
-|26		|208		|191		|1.230921	  |
-+---------------+---------------+---------------+-----------------+
-|27		|216		|0		|1.000000	  |
-+---------------+---------------+---------------+-----------------+
-|28		|224		|191		|1.143118	  |
-+---------------+---------------+---------------+-----------------+
-
-If rmid > rmid threshold, MBM total and local values should be multiplied
-by the correction factor.
-
-See:
-
-1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
-http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
-
-2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
-http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
-
-3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
-https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
-
-for further information.
diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
deleted file mode 100644
index 2bcbffacbed5..000000000000
--- a/Documentation/x86/sgx.rst
+++ /dev/null
@@ -1,302 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============================
-Software Guard eXtensions (SGX)
-===============================
-
-Overview
-========
-
-Software Guard eXtensions (SGX) hardware enables for user space applications
-to set aside private memory regions of code and data:
-
-* Privileged (ring-0) ENCLS functions orchestrate the construction of the
-  regions.
-* Unprivileged (ring-3) ENCLU functions allow an application to enter and
-  execute inside the regions.
-
-These memory regions are called enclaves. An enclave can be only entered at a
-fixed set of entry points. Each entry point can hold a single hardware thread
-at a time.  While the enclave is loaded from a regular binary file by using
-ENCLS functions, only the threads inside the enclave can access its memory. The
-region is denied from outside access by the CPU, and encrypted before it leaves
-from LLC.
-
-The support can be determined by
-
-	``grep sgx /proc/cpuinfo``
-
-SGX must both be supported in the processor and enabled by the BIOS.  If SGX
-appears to be unsupported on a system which has hardware support, ensure
-support is enabled in the BIOS.  If a BIOS presents a choice between "Enabled"
-and "Software Enabled" modes for SGX, choose "Enabled".
-
-Enclave Page Cache
-==================
-
-SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
-with an enclave. It is contained in a BIOS-reserved region of physical memory.
-Unlike pages used for regular memory, pages can only be accessed from outside of
-the enclave during enclave construction with special, limited SGX instructions.
-
-Only a CPU executing inside an enclave can directly access enclave memory.
-However, a CPU executing inside an enclave may access normal memory outside the
-enclave.
-
-The kernel manages enclave memory similar to how it treats device memory.
-
-Enclave Page Types
-------------------
-
-**SGX Enclave Control Structure (SECS)**
-   Enclave's address range, attributes and other global data are defined
-   by this structure.
-
-**Regular (REG)**
-   Regular EPC pages contain the code and data of an enclave.
-
-**Thread Control Structure (TCS)**
-   Thread Control Structure pages define the entry points to an enclave and
-   track the execution state of an enclave thread.
-
-**Version Array (VA)**
-   Version Array pages contain 512 slots, each of which can contain a version
-   number for a page evicted from the EPC.
-
-Enclave Page Cache Map
-----------------------
-
-The processor tracks EPC pages in a hardware metadata structure called the
-*Enclave Page Cache Map (EPCM)*.  The EPCM contains an entry for each EPC page
-which describes the owning enclave, access rights and page type among the other
-things.
-
-EPCM permissions are separate from the normal page tables.  This prevents the
-kernel from, for instance, allowing writes to data which an enclave wishes to
-remain read-only.  EPCM permissions may only impose additional restrictions on
-top of normal x86 page permissions.
-
-For all intents and purposes, the SGX architecture allows the processor to
-invalidate all EPCM entries at will.  This requires that software be prepared to
-handle an EPCM fault at any time.  In practice, this can happen on events like
-power transitions when the ephemeral key that encrypts enclave memory is lost.
-
-Application interface
-=====================
-
-Enclave build functions
------------------------
-
-In addition to the traditional compiler and linker build process, SGX has a
-separate enclave “build” process.  Enclaves must be built before they can be
-executed (entered). The first step in building an enclave is opening the
-**/dev/sgx_enclave** device.  Since enclave memory is protected from direct
-access, special privileged instructions are then used to copy data into enclave
-pages and establish enclave page permissions.
-
-.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
-   :functions: sgx_ioc_enclave_create
-               sgx_ioc_enclave_add_pages
-               sgx_ioc_enclave_init
-               sgx_ioc_enclave_provision
-
-Enclave runtime management
---------------------------
-
-Systems supporting SGX2 additionally support changes to initialized
-enclaves: modifying enclave page permissions and type, and dynamically
-adding and removing of enclave pages. When an enclave accesses an address
-within its address range that does not have a backing page then a new
-regular page will be dynamically added to the enclave. The enclave is
-still required to run EACCEPT on the new page before it can be used.
-
-.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
-   :functions: sgx_ioc_enclave_restrict_permissions
-               sgx_ioc_enclave_modify_types
-               sgx_ioc_enclave_remove_pages
-
-Enclave vDSO
-------------
-
-Entering an enclave can only be done through SGX-specific EENTER and ERESUME
-functions, and is a non-trivial process.  Because of the complexity of
-transitioning to and from an enclave, enclaves typically utilize a library to
-handle the actual transitions.  This is roughly analogous to how glibc
-implementations are used by most applications to wrap system calls.
-
-Another crucial characteristic of enclaves is that they can generate exceptions
-as part of their normal operation that need to be handled in the enclave or are
-unique to SGX.
-
-Instead of the traditional signal mechanism to handle these exceptions, SGX
-can leverage special exception fixup provided by the vDSO.  The kernel-provided
-vDSO function wraps low-level transitions to/from the enclave like EENTER and
-ERESUME.  The vDSO function intercepts exceptions that would otherwise generate
-a signal and return the fault information directly to its caller.  This avoids
-the need to juggle signal handlers.
-
-.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
-   :functions: vdso_sgx_enter_enclave_t
-
-ksgxd
-=====
-
-SGX support includes a kernel thread called *ksgxd*.
-
-EPC sanitization
-----------------
-
-ksgxd is started when SGX initializes.  Enclave memory is typically ready
-for use when the processor powers on or resets.  However, if SGX has been in
-use since the reset, enclave pages may be in an inconsistent state.  This might
-occur after a crash and kexec() cycle, for instance.  At boot, ksgxd
-reinitializes all enclave pages so that they can be allocated and re-used.
-
-The sanitization is done by going through EPC address space and applying the
-EREMOVE function to each physical page. Some enclave pages like SECS pages have
-hardware dependencies on other pages which prevents EREMOVE from functioning.
-Executing two EREMOVE passes removes the dependencies.
-
-Page reclaimer
---------------
-
-Similar to the core kswapd, ksgxd, is responsible for managing the
-overcommitment of enclave memory.  If the system runs out of enclave memory,
-*ksgxd* “swaps” enclave memory to normal memory.
-
-Launch Control
-==============
-
-SGX provides a launch control mechanism. After all enclave pages have been
-copied, kernel executes EINIT function, which initializes the enclave. Only after
-this the CPU can execute inside the enclave.
-
-EINIT function takes an RSA-3072 signature of the enclave measurement.  The function
-checks that the measurement is correct and signature is signed with the key
-hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the
-SHA256 of a public key.
-
-Those MSRs can be configured by the BIOS to be either readable or writable.
-Linux supports only writable configuration in order to give full control to the
-kernel on launch control policy. Before calling EINIT function, the driver sets
-the MSRs to match the enclave's signing key.
-
-Encryption engines
-==================
-
-In order to conceal the enclave data while it is out of the CPU package, the
-memory controller has an encryption engine to transparently encrypt and decrypt
-enclave memory.
-
-In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
-encrypt pages leaving the CPU caches. MEE uses a n-ary Merkle tree with root in
-SRAM to maintain integrity of the encrypted data. This provides integrity and
-anti-replay protection but does not scale to large memory sizes because the time
-required to update the Merkle tree grows logarithmically in relation to the
-memory size.
-
-CPUs starting from Icelake use Total Memory Encryption (TME) in the place of
-MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
-means integrity and replay-attacks are not mitigated.  B, it includes
-additional changes to prevent cipher text from being returned and SW memory
-aliases from being created.
-
-DMA to enclave memory is blocked by range registers on both MEE and TME systems
-(SDM section 41.10).
-
-Usage Models
-============
-
-Shared Library
---------------
-
-Sensitive data and the code that acts on it is partitioned from the application
-into a separate library. The library is then linked as a DSO which can be loaded
-into an enclave. The application can then make individual function calls into
-the enclave through special SGX instructions. A run-time within the enclave is
-configured to marshal function parameters into and out of the enclave and to
-call the correct library function.
-
-Application Container
----------------------
-
-An application may be loaded into a container enclave which is specially
-configured with a library OS and run-time which permits the application to run.
-The enclave run-time and library OS work together to execute the application
-when a thread enters the enclave.
-
-Impact of Potential Kernel SGX Bugs
-===================================
-
-EPC leaks
----------
-
-When EPC page leaks happen, a WARNING like this is shown in dmesg:
-
-"EREMOVE returned ... and an EPC page was leaked.  SGX may become unusable..."
-
-This is effectively a kernel use-after-free of an EPC page, and due
-to the way SGX works, the bug is detected at freeing. Rather than
-adding the page back to the pool of available EPC pages, the kernel
-intentionally leaks the page to avoid additional errors in the future.
-
-When this happens, the kernel will likely soon leak more EPC pages, and
-SGX will likely become unusable because the memory available to SGX is
-limited. However, while this may be fatal to SGX, the rest of the kernel
-is unlikely to be impacted and should continue to work.
-
-As a result, when this happpens, user should stop running any new
-SGX workloads, (or just any new workloads), and migrate all valuable
-workloads. Although a machine reboot can recover all EPC memory, the bug
-should be reported to Linux developers.
-
-
-Virtual EPC
-===========
-
-The implementation has also a virtual EPC driver to support SGX enclaves
-in guests. Unlike the SGX driver, an EPC page allocated by the virtual
-EPC driver doesn't have a specific enclave associated with it. This is
-because KVM doesn't track how a guest uses EPC pages.
-
-As a result, the SGX core page reclaimer doesn't support reclaiming EPC
-pages allocated to KVM guests through the virtual EPC driver. If the
-user wants to deploy SGX applications both on the host and in guests
-on the same machine, the user should reserve enough EPC (by taking out
-total virtual EPC size of all SGX VMs from the physical EPC size) for
-host SGX applications so they can run with acceptable performance.
-
-Architectural behavior is to restore all EPC pages to an uninitialized
-state also after a guest reboot.  Because this state can be reached only
-through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
-provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
-on all pages in the virtual EPC.
-
-``EREMOVE`` can fail for three reasons.  Userspace must pay attention
-to expected failures and handle them as follows:
-
-1. Page removal will always fail when any thread is running in the
-   enclave to which the page belongs.  In this case the ioctl will
-   return ``EBUSY`` independent of whether it has successfully removed
-   some pages; userspace can avoid these failures by preventing execution
-   of any vcpu which maps the virtual EPC.
-
-2. Page removal will cause a general protection fault if two calls to
-   ``EREMOVE`` happen concurrently for pages that refer to the same
-   "SECS" metadata pages.  This can happen if there are concurrent
-   invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
-   file descriptor in the guest is closed at the same time as
-   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
-   This can be avoided in userspace by serializing calls to the ioctl()
-   and to close(), but in general it should not be a problem.
-
-3. Finally, page removal will fail for SECS metadata pages which still
-   have child pages.  Child pages can be removed by executing
-   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
-   mapped into the guest.  This means that the ioctl() must be called
-   twice: an initial set of calls to remove child pages and a subsequent
-   set of calls to remove SECS pages.  The second set of calls is only
-   required for those mappings that returned a nonzero value from the
-   first call.  It indicates a bug in the kernel or the userspace client
-   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
-   a return code other than 0.
diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst
deleted file mode 100644
index 2e9b8b0f9a0f..000000000000
--- a/Documentation/x86/sva.rst
+++ /dev/null
@@ -1,286 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================================
-Shared Virtual Addressing (SVA) with ENQCMD
-===========================================
-
-Background
-==========
-
-Shared Virtual Addressing (SVA) allows the processor and device to use the
-same virtual addresses avoiding the need for software to translate virtual
-addresses to physical addresses. SVA is what PCIe calls Shared Virtual
-Memory (SVM).
-
-In addition to the convenience of using application virtual addresses
-by the device, it also doesn't require pinning pages for DMA.
-PCIe Address Translation Services (ATS) along with Page Request Interface
-(PRI) allow devices to function much the same way as the CPU handling
-application page-faults. For more information please refer to the PCIe
-specification Chapter 10: ATS Specification.
-
-Use of SVA requires IOMMU support in the platform. IOMMU is also
-required to support the PCIe features ATS and PRI. ATS allows devices
-to cache translations for virtual addresses. The IOMMU driver uses the
-mmu_notifier() support to keep the device TLB cache and the CPU cache in
-sync. When an ATS lookup fails for a virtual address, the device should
-use the PRI in order to request the virtual address to be paged into the
-CPU page tables. The device must use ATS again in order the fetch the
-translation before use.
-
-Shared Hardware Workqueues
-==========================
-
-Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
-the use of Shared Work Queues (SWQ) by both applications and Virtual
-Machines (VM's). This allows better hardware utilization vs. hard
-partitioning resources that could result in under utilization. In order to
-allow the hardware to distinguish the context for which work is being
-executed in the hardware by SWQ interface, SIOV uses Process Address Space
-ID (PASID), which is a 20-bit number defined by the PCIe SIG.
-
-PASID value is encoded in all transactions from the device. This allows the
-IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
-Resource Identifier (RID) which is the Bus/Device/Function.
-
-
-ENQCMD
-======
-
-ENQCMD is a new instruction on Intel platforms that atomically submits a
-work descriptor to a device. The descriptor includes the operation to be
-performed, virtual addresses of all parameters, virtual address of a completion
-record, and the PASID (process address space ID) of the current process.
-
-ENQCMD works with non-posted semantics and carries a status back if the
-command was accepted by hardware. This allows the submitter to know if the
-submission needs to be retried or other device specific mechanisms to
-implement fairness or ensure forward progress should be provided.
-
-ENQCMD is the glue that ensures applications can directly submit commands
-to the hardware and also permits hardware to be aware of application context
-to perform I/O operations via use of PASID.
-
-Process Address Space Tagging
-=============================
-
-A new thread-scoped MSR (IA32_PASID) provides the connection between
-user processes and the rest of the hardware. When an application first
-accesses an SVA-capable device, this MSR is initialized with a newly
-allocated PASID. The driver for the device calls an IOMMU-specific API
-that sets up the routing for DMA and page-requests.
-
-For example, the Intel Data Streaming Accelerator (DSA) uses
-iommu_sva_bind_device(), which will do the following:
-
-- Allocate the PASID, and program the process page-table (%cr3 register) in the
-  PASID context entries.
-- Register for mmu_notifier() to track any page-table invalidations to keep
-  the device TLB in sync. For example, when a page-table entry is invalidated,
-  the IOMMU propagates the invalidation to the device TLB. This will force any
-  future access by the device to this virtual address to participate in
-  ATS. If the IOMMU responds with proper response that a page is not
-  present, the device would request the page to be paged in via the PCIe PRI
-  protocol before performing I/O.
-
-This MSR is managed with the XSAVE feature set as "supervisor state" to
-ensure the MSR is updated during context switch.
-
-PASID Management
-================
-
-The kernel must allocate a PASID on behalf of each process which will use
-ENQCMD and program it into the new MSR to communicate the process identity to
-platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
-from this process.  When a user submits a work descriptor to a device using the
-ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
-value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
-with the same PASID. The platform IOMMU uses the PASID in the transaction to
-perform address translation. The IOMMU APIs setup the corresponding PASID
-entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in
-x86).
-
-The MSR must be configured on each logical CPU before any application
-thread can interact with a device. Threads that belong to the same
-process share the same page tables, thus the same MSR value.
-
-PASID Life Cycle Management
-===========================
-
-PASID is initialized as INVALID_IOASID (-1) when a process is created.
-
-Only processes that access SVA-capable devices need to have a PASID
-allocated. This allocation happens when a process opens/binds an SVA-capable
-device but finds no PASID for this process. Subsequent binds of the same, or
-other devices will share the same PASID.
-
-Although the PASID is allocated to the process by opening a device,
-it is not active in any of the threads of that process. It's loaded to the
-IA32_PASID MSR lazily when a thread tries to submit a work descriptor
-to a device using the ENQCMD.
-
-That first access will trigger a #GP fault because the IA32_PASID MSR
-has not been initialized with the PASID value assigned to the process
-when the device was opened. The Linux #GP handler notes that a PASID has
-been allocated for the process, and so initializes the IA32_PASID MSR
-and returns so that the ENQCMD instruction is re-executed.
-
-On fork(2) or exec(2) the PASID is removed from the process as it no
-longer has the same address space that it had when the device was opened.
-
-On clone(2) the new task shares the same address space, so will be
-able to use the PASID allocated to the process. The IA32_PASID is not
-preemptively initialized as the PASID value might not be allocated yet or
-the kernel does not know whether this thread is going to access the device
-and the cleared IA32_PASID MSR reduces context switch overhead by xstate
-init optimization. Since #GP faults have to be handled on any threads that
-were created before the PASID was assigned to the mm of the process, newly
-created threads might as well be treated in a consistent way.
-
-Due to complexity of freeing the PASID and clearing all IA32_PASID MSRs in
-all threads in unbind, free the PASID lazily only on mm exit.
-
-If a process does a close(2) of the device file descriptor and munmap(2)
-of the device MMIO portal, then the driver will unbind the device. The
-PASID is still marked VALID in the PASID_MSR for any threads in the
-process that accessed the device. But this is harmless as without the
-MMIO portal they cannot submit new work to the device.
-
-Relationships
-=============
-
- * Each process has many threads, but only one PASID.
- * Devices have a limited number (~10's to 1000's) of hardware workqueues.
-   The device driver manages allocating hardware workqueues.
- * A single mmap() maps a single hardware workqueue as a "portal" and
-   each portal maps down to a single workqueue.
- * For each device with which a process interacts, there must be
-   one or more mmap()'d portals.
- * Many threads within a process can share a single portal to access
-   a single device.
- * Multiple processes can separately mmap() the same portal, in
-   which case they still share one device hardware workqueue.
- * The single process-wide PASID is used by all threads to interact
-   with all devices.  There is not, for instance, a PASID for each
-   thread or each thread<->device pair.
-
-FAQ
-===
-
-* What is SVA/SVM?
-
-Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
-work in the same address space, i.e., to share it. Some call it Shared
-Virtual Memory (SVM), but Linux community wanted to avoid confusing it with
-POSIX Shared Memory and Secure Virtual Machines which were terms already in
-circulation.
-
-* What is a PASID?
-
-A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
-(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
-PASID is included in all transactions between the platform and the device.
-
-* How are shared workqueues different?
-
-Traditionally, in order for userspace applications to interact with hardware,
-there is a separate hardware instance required per process. For example,
-consider doorbells as a mechanism of informing hardware about work to process.
-Each doorbell is required to be spaced 4k (or page-size) apart for process
-isolation. This requires hardware to provision that space and reserve it in
-MMIO. This doesn't scale as the number of threads becomes quite large. The
-hardware also manages the queue depth for Shared Work Queues (SWQ), and
-consumers don't need to track queue depth. If there is no space to accept
-a command, the device will return an error indicating retry.
-
-A user should check Deferrable Memory Write (DMWr) capability on the device
-and only submits ENQCMD when the device supports it. In the new DMWr PCIe
-terminology, devices need to support DMWr completer capability. In addition,
-it requires all switch ports to support DMWr routing and must be enabled by
-the PCIe subsystem, much like how PCIe atomic operations are managed for
-instance.
-
-SWQ allows hardware to provision just a single address in the device. When
-used with ENQCMD to submit work, the device can distinguish the process
-submitting the work since it will include the PASID assigned to that
-process. This helps the device scale to a large number of processes.
-
-* Is this the same as a user space device driver?
-
-Communicating with the device via the shared workqueue is much simpler
-than a full blown user space driver. The kernel driver does all the
-initialization of the hardware. User space only needs to worry about
-submitting work and processing completions.
-
-* Is this the same as SR-IOV?
-
-Single Root I/O Virtualization (SR-IOV) focuses on providing independent
-hardware interfaces for virtualizing hardware. Hence, it's required to be
-almost fully functional interface to software supporting the traditional
-BARs, space for interrupts via MSI-X, its own register layout.
-Virtual Functions (VFs) are assisted by the Physical Function (PF)
-driver.
-
-Scalable I/O Virtualization builds on the PASID concept to create device
-instances for virtualization. SIOV requires host software to assist in
-creating virtual devices; each virtual device is represented by a PASID
-along with the bus/device/function of the device.  This allows device
-hardware to optimize device resource creation and can grow dynamically on
-demand. SR-IOV creation and management is very static in nature. Consult
-references below for more details.
-
-* Why not just create a virtual function for each app?
-
-Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
-duplicated hardware for PCI config space and interrupts such as MSI-X.
-Resources such as interrupts have to be hard partitioned between VFs at
-creation time, and cannot scale dynamically on demand. The VFs are not
-completely independent from the Physical Function (PF). Most VFs require
-some communication and assistance from the PF driver. SIOV, in contrast,
-creates a software-defined device where all the configuration and control
-aspects are mediated via the slow path. The work submission and completion
-happen without any mediation.
-
-* Does this support virtualization?
-
-ENQCMD can be used from within a guest VM. In these cases, the VMM helps
-with setting up a translation table to translate from Guest PASID to Host
-PASID. Please consult the ENQCMD instruction set reference for more
-details.
-
-* Does memory need to be pinned?
-
-When devices support SVA along with platform hardware such as IOMMU
-supporting such devices, there is no need to pin memory for DMA purposes.
-Devices that support SVA also support other PCIe features that remove the
-pinning requirement for memory.
-
-Device TLB support - Device requests the IOMMU to lookup an address before
-use via Address Translation Service (ATS) requests.  If the mapping exists
-but there is no page allocated by the OS, IOMMU hardware returns that no
-mapping exists.
-
-Device requests the virtual address to be mapped via Page Request
-Interface (PRI). Once the OS has successfully completed the mapping, it
-returns the response back to the device. The device requests again for
-a translation and continues.
-
-IOMMU works with the OS in managing consistency of page-tables with the
-device. When removing pages, it interacts with the device to remove any
-device TLB entry that might have been cached before removing the mappings from
-the OS.
-
-References
-==========
-
-VT-D:
-https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d
-
-SIOV:
-https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux
-
-ENQCMD in ISE:
-https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
-
-DSA spec:
-https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
deleted file mode 100644
index dc8d9fd2c3f7..000000000000
--- a/Documentation/x86/tdx.rst
+++ /dev/null
@@ -1,261 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================================
-Intel Trust Domain Extensions (TDX)
-=====================================
-
-Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
-the host and physical attacks by isolating the guest register state and by
-encrypting the guest memory. In TDX, a special module running in a special
-mode sits between the host and the guest and manages the guest/host
-separation.
-
-Since the host cannot directly access guest registers or memory, much
-normal functionality of a hypervisor must be moved into the guest. This is
-implemented using a Virtualization Exception (#VE) that is handled by the
-guest kernel. A #VE is handled entirely inside the guest kernel, but some
-require the hypervisor to be consulted.
-
-TDX includes new hypercall-like mechanisms for communicating from the
-guest to the hypervisor or the TDX module.
-
-New TDX Exceptions
-==================
-
-TDX guests behave differently from bare-metal and traditional VMX guests.
-In TDX guests, otherwise normal instructions or memory accesses can cause
-#VE or #GP exceptions.
-
-Instructions marked with an '*' conditionally cause exceptions.  The
-details for these instructions are discussed below.
-
-Instruction-based #VE
----------------------
-
-- Port I/O (INS, OUTS, IN, OUT)
-- HLT
-- MONITOR, MWAIT
-- WBINVD, INVD
-- VMCALL
-- RDMSR*,WRMSR*
-- CPUID*
-
-Instruction-based #GP
----------------------
-
-- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
-  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
-- ENCLS, ENCLU
-- GETSEC
-- RSM
-- ENQCMD
-- RDMSR*,WRMSR*
-
-RDMSR/WRMSR Behavior
---------------------
-
-MSR access behavior falls into three categories:
-
-- #GP generated
-- #VE generated
-- "Just works"
-
-In general, the #GP MSRs should not be used in guests.  Their use likely
-indicates a bug in the guest.  The guest may try to handle the #GP with a
-hypercall but it is unlikely to succeed.
-
-The #VE MSRs are typically able to be handled by the hypervisor.  Guests
-can make a hypercall to the hypervisor to handle the #VE.
-
-The "just works" MSRs do not need any special guest handling.  They might
-be implemented by directly passing through the MSR to the hardware or by
-trapping and handling in the TDX module.  Other than possibly being slow,
-these MSRs appear to function just as they would on bare metal.
-
-CPUID Behavior
---------------
-
-For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
-return values (in guest EAX/EBX/ECX/EDX) are configurable by the
-hypervisor. For such cases, the Intel TDX module architecture defines two
-virtualization types:
-
-- Bit fields for which the hypervisor controls the value seen by the guest
-  TD.
-
-- Bit fields for which the hypervisor configures the value such that the
-  guest TD either sees their native value or a value of 0.  For these bit
-  fields, the hypervisor can mask off the native values, but it can not
-  turn *on* values.
-
-A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
-not know how to handle. The guest kernel may ask the hypervisor for the
-value with a hypercall.
-
-#VE on Memory Accesses
-======================
-
-There are essentially two classes of TDX memory: private and shared.
-Private memory receives full TDX protections.  Its content is protected
-against access from the hypervisor.  Shared memory is expected to be
-shared between guest and hypervisor and does not receive full TDX
-protections.
-
-A TD guest is in control of whether its memory accesses are treated as
-private or shared.  It selects the behavior with a bit in its page table
-entries.  This helps ensure that a guest does not place sensitive
-information in shared memory, exposing it to the untrusted hypervisor.
-
-#VE on Shared Memory
---------------------
-
-Access to shared mappings can cause a #VE.  The hypervisor ultimately
-controls whether a shared memory access causes a #VE, so the guest must be
-careful to only reference shared pages it can safely handle a #VE.  For
-instance, the guest should be careful not to access shared memory in the
-#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).
-
-Shared mapping content is entirely controlled by the hypervisor. The guest
-should only use shared mappings for communicating with the hypervisor.
-Shared mappings must never be used for sensitive memory content like kernel
-stacks.  A good rule of thumb is that hypervisor-shared memory should be
-treated the same as memory mapped to userspace.  Both the hypervisor and
-userspace are completely untrusted.
-
-MMIO for virtual devices is implemented as shared memory.  The guest must
-be careful not to access device MMIO regions unless it is also prepared to
-handle a #VE.
-
-#VE on Private Pages
---------------------
-
-An access to private mappings can also cause a #VE.  Since all kernel
-memory is also private memory, the kernel might theoretically need to
-handle a #VE on arbitrary kernel memory accesses.  This is not feasible, so
-TDX guests ensure that all guest memory has been "accepted" before memory
-is used by the kernel.
-
-A modest amount of memory (typically 512M) is pre-accepted by the firmware
-before the kernel runs to ensure that the kernel can start up without
-being subjected to a #VE.
-
-The hypervisor is permitted to unilaterally move accepted pages to a
-"blocked" state. However, if it does this, page access will not generate a
-#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
-to handle the exception.
-
-Linux #VE handler
-=================
-
-Just like page faults or #GP's, #VE exceptions can be either handled or be
-fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
-An unhandled kernel #VE results in an oops.
-
-Handling nested exceptions on x86 is typically nasty business.  A #VE
-could be interrupted by an NMI which triggers another #VE and hilarity
-ensues.  The TDX #VE architecture anticipated this scenario and includes a
-feature to make it slightly less nasty.
-
-During #VE handling, the TDX module ensures that all interrupts (including
-NMIs) are blocked.  The block remains in place until the guest makes a
-TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
-or a new #VE can be delivered.
-
-However, the guest kernel must still be careful to avoid potential
-#VE-triggering actions (discussed above) while this block is in place.
-While the block is in place, any #VE is elevated to a double fault (#DF)
-which is not recoverable.
-
-MMIO handling
-=============
-
-In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
-mapping which will cause a VMEXIT on access, and then the hypervisor
-emulates the access.  That is not possible in TDX guests because VMEXIT
-will expose the register state to the host. TDX guests don't trust the host
-and can't have their state exposed to the host.
-
-In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
-guest #VE handler then emulates the MMIO instruction inside the guest and
-converts it into a controlled TDCALL to the host, rather than exposing
-guest state to the host.
-
-MMIO addresses on x86 are just special physical addresses. They can
-theoretically be accessed with any instruction that accesses memory.
-However, the kernel instruction decoding method is limited. It is only
-designed to decode instructions like those generated by io.h macros.
-
-MMIO access via other means (like structure overlays) may result in an
-oops.
-
-Shared Memory Conversions
-=========================
-
-All TDX guest memory starts out as private at boot.  This memory can not
-be accessed by the hypervisor.  However, some kernel users like device
-drivers might have a need to share data with the hypervisor.  To do this,
-memory must be converted between shared and private.  This can be
-accomplished using some existing memory encryption helpers:
-
- * set_memory_decrypted() converts a range of pages to shared.
- * set_memory_encrypted() converts memory back to private.
-
-Device drivers are the primary user of shared memory, but there's no need
-to touch every driver. DMA buffers and ioremap() do the conversions
-automatically.
-
-TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
-converted to shared on boot.
-
-For coherent DMA allocation, the DMA buffer gets converted on the
-allocation. Check force_dma_unencrypted() for details.
-
-Attestation
-===========
-
-Attestation is used to verify the TDX guest trustworthiness to other
-entities before provisioning secrets to the guest. For example, a key
-server may want to use attestation to verify that the guest is the
-desired one before releasing the encryption keys to mount the encrypted
-rootfs or a secondary drive.
-
-The TDX module records the state of the TDX guest in various stages of
-the guest boot process using the build time measurement register (MRTD)
-and runtime measurement registers (RTMR). Measurements related to the
-guest initial configuration and firmware image are recorded in the MRTD
-register. Measurements related to initial state, kernel image, firmware
-image, command line options, initrd, ACPI tables, etc are recorded in
-RTMR registers. For more details, as an example, please refer to TDX
-Virtual Firmware design specification, section titled "TD Measurement".
-At TDX guest runtime, the attestation process is used to attest to these
-measurements.
-
-The attestation process consists of two steps: TDREPORT generation and
-Quote generation.
-
-TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT (TDREPORT_STRUCT)
-from the TDX module. TDREPORT is a fixed-size data structure generated by
-the TDX module which contains guest-specific information (such as build
-and boot measurements), platform security version, and the MAC to protect
-the integrity of the TDREPORT. A user-provided 64-Byte REPORTDATA is used
-as input and included in the TDREPORT. Typically it can be some nonce
-provided by attestation service so the TDREPORT can be verified uniquely.
-More details about the TDREPORT can be found in Intel TDX Module
-specification, section titled "TDG.MR.REPORT Leaf".
-
-After getting the TDREPORT, the second step of the attestation process
-is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT
-by design can only be verified on the local platform as the MAC key is
-bound to the platform. To support remote verification of the TDREPORT,
-TDX leverages Intel SGX Quoting Enclave to verify the TDREPORT locally
-and convert it to a remotely verifiable Quote. Method of sending TDREPORT
-to QE is implementation specific. Attestation software can choose
-whatever communication channel available (i.e. vsock or TCP/IP) to
-send the TDREPORT to QE and receive the Quote.
-
-References
-==========
-
-TDX reference material is collected here:
-
-https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
diff --git a/Documentation/x86/tlb.rst b/Documentation/x86/tlb.rst
deleted file mode 100644
index 82ec58ae63a8..000000000000
--- a/Documentation/x86/tlb.rst
+++ /dev/null
@@ -1,83 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=======
-The TLB
-=======
-
-When the kernel unmaps or modified the attributes of a range of
-memory, it has two choices:
-
- 1. Flush the entire TLB with a two-instruction sequence.  This is
-    a quick operation, but it causes collateral damage: TLB entries
-    from areas other than the one we are trying to flush will be
-    destroyed and must be refilled later, at some cost.
- 2. Use the invlpg instruction to invalidate a single page at a
-    time.  This could potentially cost many more instructions, but
-    it is a much more precise operation, causing no collateral
-    damage to other TLB entries.
-
-Which method to do depends on a few things:
-
- 1. The size of the flush being performed.  A flush of the entire
-    address space is obviously better performed by flushing the
-    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
- 2. The contents of the TLB.  If the TLB is empty, then there will
-    be no collateral damage caused by doing the global flush, and
-    all of the individual flush will have ended up being wasted
-    work.
- 3. The size of the TLB.  The larger the TLB, the more collateral
-    damage we do with a full flush.  So, the larger the TLB, the
-    more attractive an individual flush looks.  Data and
-    instructions have separate TLBs, as do different page sizes.
- 4. The microarchitecture.  The TLB has become a multi-level
-    cache on modern CPUs, and the global flushes have become more
-    expensive relative to single-page flushes.
-
-There is obviously no way the kernel can know all these things,
-especially the contents of the TLB during a given flush.  The
-sizes of the flush will vary greatly depending on the workload as
-well.  There is essentially no "right" point to choose.
-
-You may be doing too many individual invalidations if you see the
-invlpg instruction (or instructions _near_ it) show up high in
-profiles.  If you believe that individual invalidations being
-called too often, you can lower the tunable::
-
-	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
-
-This will cause us to do the global flush for more cases.
-Lowering it to 0 will disable the use of the individual flushes.
-Setting it to 1 is a very conservative setting and it should
-never need to be 0 under normal circumstances.
-
-Despite the fact that a single individual flush on x86 is
-guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
-flushes.  THP is treated exactly the same as normal memory.
-
-You might see invlpg inside of flush_tlb_mm_range() show up in
-profiles, or you can use the trace_tlb_flush() tracepoints. to
-determine how long the flush operations are taking.
-
-Essentially, you are balancing the cycles you spend doing invlpg
-with the cycles that you spend refilling the TLB later.
-
-You can measure how expensive TLB refills are by using
-performance counters and 'perf stat', like this::
-
-  perf stat -e
-    cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
-    cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
-    cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
-    cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
-    cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
-    cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
-
-That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
-may have differently-named counters, but they should at least
-be there in some form.  You can use pmu-tools 'ocperf list'
-(https://github.com/andikleen/pmu-tools) to find the right
-counters for a given CPU.
-
-.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
-   says: "One execution of INVLPG is sufficient even for a page
-   with size greater than 4 KBytes."
diff --git a/Documentation/x86/topology.rst b/Documentation/x86/topology.rst
deleted file mode 100644
index 7f58010ea86a..000000000000
--- a/Documentation/x86/topology.rst
+++ /dev/null
@@ -1,234 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============
-x86 Topology
-============
-
-This documents and clarifies the main aspects of x86 topology modelling and
-representation in the kernel. Update/change when doing changes to the
-respective code.
-
-The architecture-agnostic topology definitions are in
-Documentation/admin-guide/cputopology.rst. This file holds x86-specific
-differences/specialities which must not necessarily apply to the generic
-definitions. Thus, the way to read up on Linux topology on x86 is to start
-with the generic one and look at this one in parallel for the x86 specifics.
-
-Needless to say, code should use the generic functions - this file is *only*
-here to *document* the inner workings of x86 topology.
-
-Started by Thomas Gleixner <tglx@linutronix.de> and Borislav Petkov <bp@alien8.de>.
-
-The main aim of the topology facilities is to present adequate interfaces to
-code which needs to know/query/use the structure of the running system wrt
-threads, cores, packages, etc.
-
-The kernel does not care about the concept of physical sockets because a
-socket has no relevance to software. It's an electromechanical component. In
-the past a socket always contained a single package (see below), but with the
-advent of Multi Chip Modules (MCM) a socket can hold more than one package. So
-there might be still references to sockets in the code, but they are of
-historical nature and should be cleaned up.
-
-The topology of a system is described in the units of:
-
-    - packages
-    - cores
-    - threads
-
-Package
-=======
-Packages contain a number of cores plus shared resources, e.g. DRAM
-controller, shared caches etc.
-
-Modern systems may also use the term 'Die' for package.
-
-AMD nomenclature for package is 'Node'.
-
-Package-related topology information in the kernel:
-
-  - cpuinfo_x86.x86_max_cores:
-
-    The number of cores in a package. This information is retrieved via CPUID.
-
-  - cpuinfo_x86.x86_max_dies:
-
-    The number of dies in a package. This information is retrieved via CPUID.
-
-  - cpuinfo_x86.cpu_die_id:
-
-    The physical ID of the die. This information is retrieved via CPUID.
-
-  - cpuinfo_x86.phys_proc_id:
-
-    The physical ID of the package. This information is retrieved via CPUID
-    and deduced from the APIC IDs of the cores in the package.
-
-    Modern systems use this value for the socket. There may be multiple
-    packages within a socket. This value may differ from cpu_die_id.
-
-  - cpuinfo_x86.logical_proc_id:
-
-    The logical ID of the package. As we do not trust BIOSes to enumerate the
-    packages in a consistent way, we introduced the concept of logical package
-    ID so we can sanely calculate the number of maximum possible packages in
-    the system and have the packages enumerated linearly.
-
-  - topology_max_packages():
-
-    The maximum possible number of packages in the system. Helpful for per
-    package facilities to preallocate per package information.
-
-  - cpu_llc_id:
-
-    A per-CPU variable containing:
-
-      - On Intel, the first APIC ID of the list of CPUs sharing the Last Level
-        Cache
-
-      - On AMD, the Node ID or Core Complex ID containing the Last Level
-        Cache. In general, it is a number identifying an LLC uniquely on the
-        system.
-
-Cores
-=====
-A core consists of 1 or more threads. It does not matter whether the threads
-are SMT- or CMT-type threads.
-
-AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses
-"core".
-
-Core-related topology information in the kernel:
-
-  - smp_num_siblings:
-
-    The number of threads in a core. The number of threads in a package can be
-    calculated by::
-
-	threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
-
-
-Threads
-=======
-A thread is a single scheduling unit. It's the equivalent to a logical Linux
-CPU.
-
-AMDs nomenclature for CMT threads is "Compute Unit Core". The kernel always
-uses "thread".
-
-Thread-related topology information in the kernel:
-
-  - topology_core_cpumask():
-
-    The cpumask contains all online threads in the package to which a thread
-    belongs.
-
-    The number of online threads is also printed in /proc/cpuinfo "siblings."
-
-  - topology_sibling_cpumask():
-
-    The cpumask contains all online threads in the core to which a thread
-    belongs.
-
-  - topology_logical_package_id():
-
-    The logical package ID to which a thread belongs.
-
-  - topology_physical_package_id():
-
-    The physical package ID to which a thread belongs.
-
-  - topology_core_id();
-
-    The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo
-    "core_id."
-
-
-
-System topology examples
-========================
-
-.. note::
-  The alternative Linux CPU enumeration depends on how the BIOS enumerates the
-  threads. Many BIOSes enumerate all threads 0 first and then all threads 1.
-  That has the "advantage" that the logical Linux CPU numbers of threads 0 stay
-  the same whether threads are enabled or not. That's merely an implementation
-  detail and has no practical impact.
-
-1) Single Package, Single Core::
-
-   [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-
-2) Single Package, Dual Core
-
-   a) One thread per core::
-
-	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-		    -> [core 1] -> [thread 0] -> Linux CPU 1
-
-   b) Two threads per core::
-
-	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-				-> [thread 1] -> Linux CPU 1
-		    -> [core 1] -> [thread 0] -> Linux CPU 2
-				-> [thread 1] -> Linux CPU 3
-
-      Alternative enumeration::
-
-	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-				-> [thread 1] -> Linux CPU 2
-		    -> [core 1] -> [thread 0] -> Linux CPU 1
-				-> [thread 1] -> Linux CPU 3
-
-      AMD nomenclature for CMT systems::
-
-	[node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
-				     -> [Compute Unit Core 1] -> Linux CPU 1
-		 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
-				     -> [Compute Unit Core 1] -> Linux CPU 3
-
-4) Dual Package, Dual Core
-
-   a) One thread per core::
-
-	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-		    -> [core 1] -> [thread 0] -> Linux CPU 1
-
-	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
-		    -> [core 1] -> [thread 0] -> Linux CPU 3
-
-   b) Two threads per core::
-
-	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-				-> [thread 1] -> Linux CPU 1
-		    -> [core 1] -> [thread 0] -> Linux CPU 2
-				-> [thread 1] -> Linux CPU 3
-
-	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 4
-				-> [thread 1] -> Linux CPU 5
-		    -> [core 1] -> [thread 0] -> Linux CPU 6
-				-> [thread 1] -> Linux CPU 7
-
-      Alternative enumeration::
-
-	[package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
-				-> [thread 1] -> Linux CPU 4
-		    -> [core 1] -> [thread 0] -> Linux CPU 1
-				-> [thread 1] -> Linux CPU 5
-
-	[package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
-				-> [thread 1] -> Linux CPU 6
-		    -> [core 1] -> [thread 0] -> Linux CPU 3
-				-> [thread 1] -> Linux CPU 7
-
-      AMD nomenclature for CMT systems::
-
-	[node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
-				     -> [Compute Unit Core 1] -> Linux CPU 1
-		 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
-				     -> [Compute Unit Core 1] -> Linux CPU 3
-
-	[node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4
-				     -> [Compute Unit Core 1] -> Linux CPU 5
-		 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6
-				     -> [Compute Unit Core 1] -> Linux CPU 7
diff --git a/Documentation/x86/tsx_async_abort.rst b/Documentation/x86/tsx_async_abort.rst
deleted file mode 100644
index 583ddc185ba2..000000000000
--- a/Documentation/x86/tsx_async_abort.rst
+++ /dev/null
@@ -1,117 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-TSX Async Abort (TAA) mitigation
-================================
-
-.. _tsx_async_abort:
-
-Overview
---------
-
-TSX Async Abort (TAA) is a side channel attack on internal buffers in some
-Intel processors similar to Microachitectural Data Sampling (MDS).  In this
-case certain loads may speculatively pass invalid data to dependent operations
-when an asynchronous abort condition is pending in a Transactional
-Synchronization Extensions (TSX) transaction.  This includes loads with no
-fault or assist condition. Such loads may speculatively expose stale data from
-the same uarch data structures as in MDS, with same scope of exposure i.e.
-same-thread and cross-thread. This issue affects all current processors that
-support TSX.
-
-Mitigation strategy
--------------------
-
-a) TSX disable - one of the mitigations is to disable TSX. A new MSR
-IA32_TSX_CTRL will be available in future and current processors after
-microcode update which can be used to disable TSX. In addition, it
-controls the enumeration of the TSX feature bits (RTM and HLE) in CPUID.
-
-b) Clear CPU buffers - similar to MDS, clearing the CPU buffers mitigates this
-vulnerability. More details on this approach can be found in
-:ref:`Documentation/admin-guide/hw-vuln/mds.rst <mds>`.
-
-Kernel internal mitigation modes
---------------------------------
-
- =============    ============================================================
- off              Mitigation is disabled. Either the CPU is not affected or
-                  tsx_async_abort=off is supplied on the kernel command line.
-
- tsx disabled     Mitigation is enabled. TSX feature is disabled by default at
-                  bootup on processors that support TSX control.
-
- verw             Mitigation is enabled. CPU is affected and MD_CLEAR is
-                  advertised in CPUID.
-
- ucode needed     Mitigation is enabled. CPU is affected and MD_CLEAR is not
-                  advertised in CPUID. That is mainly for virtualization
-                  scenarios where the host has the updated microcode but the
-                  hypervisor does not expose MD_CLEAR in CPUID. It's a best
-                  effort approach without guarantee.
- =============    ============================================================
-
-If the CPU is affected and the "tsx_async_abort" kernel command line parameter is
-not provided then the kernel selects an appropriate mitigation depending on the
-status of RTM and MD_CLEAR CPUID bits.
-
-Below tables indicate the impact of tsx=on|off|auto cmdline options on state of
-TAA mitigation, VERW behavior and TSX feature for various combinations of
-MSR_IA32_ARCH_CAPABILITIES bits.
-
-1. "tsx=off"
-
-=========  =========  ============  ============  ==============  ===================  ======================
-MSR_IA32_ARCH_CAPABILITIES bits     Result with cmdline tsx=off
-----------------------------------  -------------------------------------------------------------------------
-TAA_NO     MDS_NO     TSX_CTRL_MSR  TSX state     VERW can clear  TAA mitigation       TAA mitigation
-                                    after bootup  CPU buffers     tsx_async_abort=off  tsx_async_abort=full
-=========  =========  ============  ============  ==============  ===================  ======================
-    0          0           0         HW default         Yes           Same as MDS           Same as MDS
-    0          0           1        Invalid case   Invalid case       Invalid case          Invalid case
-    0          1           0         HW default         No         Need ucode update     Need ucode update
-    0          1           1          Disabled          Yes           TSX disabled          TSX disabled
-    1          X           1          Disabled           X             None needed           None needed
-=========  =========  ============  ============  ==============  ===================  ======================
-
-2. "tsx=on"
-
-=========  =========  ============  ============  ==============  ===================  ======================
-MSR_IA32_ARCH_CAPABILITIES bits     Result with cmdline tsx=on
-----------------------------------  -------------------------------------------------------------------------
-TAA_NO     MDS_NO     TSX_CTRL_MSR  TSX state     VERW can clear  TAA mitigation       TAA mitigation
-                                    after bootup  CPU buffers     tsx_async_abort=off  tsx_async_abort=full
-=========  =========  ============  ============  ==============  ===================  ======================
-    0          0           0         HW default        Yes            Same as MDS          Same as MDS
-    0          0           1        Invalid case   Invalid case       Invalid case         Invalid case
-    0          1           0         HW default        No          Need ucode update     Need ucode update
-    0          1           1          Enabled          Yes               None              Same as MDS
-    1          X           1          Enabled          X              None needed          None needed
-=========  =========  ============  ============  ==============  ===================  ======================
-
-3. "tsx=auto"
-
-=========  =========  ============  ============  ==============  ===================  ======================
-MSR_IA32_ARCH_CAPABILITIES bits     Result with cmdline tsx=auto
-----------------------------------  -------------------------------------------------------------------------
-TAA_NO     MDS_NO     TSX_CTRL_MSR  TSX state     VERW can clear  TAA mitigation       TAA mitigation
-                                    after bootup  CPU buffers     tsx_async_abort=off  tsx_async_abort=full
-=========  =========  ============  ============  ==============  ===================  ======================
-    0          0           0         HW default    Yes                Same as MDS           Same as MDS
-    0          0           1        Invalid case  Invalid case        Invalid case          Invalid case
-    0          1           0         HW default    No              Need ucode update     Need ucode update
-    0          1           1          Disabled      Yes               TSX disabled          TSX disabled
-    1          X           1          Enabled       X                 None needed           None needed
-=========  =========  ============  ============  ==============  ===================  ======================
-
-In the tables, TSX_CTRL_MSR is a new bit in MSR_IA32_ARCH_CAPABILITIES that
-indicates whether MSR_IA32_TSX_CTRL is supported.
-
-There are two control bits in IA32_TSX_CTRL MSR:
-
-      Bit 0: When set it disables the Restricted Transactional Memory (RTM)
-             sub-feature of TSX (will force all transactions to abort on the
-             XBEGIN instruction).
-
-      Bit 1: When set it disables the enumeration of the RTM and HLE feature
-             (i.e. it will make CPUID(EAX=7).EBX{bit4} and
-             CPUID(EAX=7).EBX{bit11} read as 0).
diff --git a/Documentation/x86/usb-legacy-support.rst b/Documentation/x86/usb-legacy-support.rst
deleted file mode 100644
index e01c08b7c981..000000000000
--- a/Documentation/x86/usb-legacy-support.rst
+++ /dev/null
@@ -1,50 +0,0 @@
-
-.. SPDX-License-Identifier: GPL-2.0
-
-==================
-USB Legacy support
-==================
-
-:Author: Vojtech Pavlik <vojtech@suse.cz>, January 2004
-
-
-Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a
-feature that allows one to use the USB mouse and keyboard as if they were
-their classic PS/2 counterparts.  This means one can use an USB keyboard to
-type in LILO for example.
-
-It has several drawbacks, though:
-
-1) On some machines, the emulated PS/2 mouse takes over even when no USB
-   mouse is present and a real PS/2 mouse is present.  In that case the extra
-   features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
-   not be available.
-
-2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
-   system crashes, because the SMM BIOS is not expecting to be in PAE mode.
-   The Intel E7505 is a typical machine where this happens.
-
-3) If AMD64 64-bit mode is enabled, again system crashes often happen,
-   because the SMM BIOS isn't expecting the CPU to be in 64-bit mode.  The
-   BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
-   yet.
-
-Solutions:
-
-Problem 1)
-  can be solved by loading the USB drivers prior to loading the
-  PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into
-  the kernel unconditionally, this means the USB drivers need to be
-  compiled-in, too.
-
-Problem 2)
-  can currently only be solved by either disabling HIGHMEM64G
-  in the kernel config or USB Legacy support in the BIOS. A BIOS update
-  could help, but so far no such update exists.
-
-Problem 3)
-  is usually fixed by a BIOS update. Check the board
-  manufacturers web site. If an update is not available, disable USB
-  Legacy support in the BIOS. If this alone doesn't help, try also adding
-  idle=poll on the kernel command line. The BIOS may be entering the SMM
-  on the HLT instruction as well.
diff --git a/Documentation/x86/x86_64/5level-paging.rst b/Documentation/x86/x86_64/5level-paging.rst
deleted file mode 100644
index b792bbdc0b01..000000000000
--- a/Documentation/x86/x86_64/5level-paging.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============
-5-level paging
-==============
-
-Overview
-========
-Original x86-64 was limited by 4-level paging to 256 TiB of virtual address
-space and 64 TiB of physical address space. We are already bumping into
-this limit: some vendors offer servers with 64 TiB of memory today.
-
-To overcome the limitation upcoming hardware will introduce support for
-5-level paging. It is a straight-forward extension of the current page
-table structure adding one more layer of translation.
-
-It bumps the limits to 128 PiB of virtual address space and 4 PiB of
-physical address space. This "ought to be enough for anybody" ©.
-
-QEMU 2.9 and later support 5-level paging.
-
-Virtual memory layout for 5-level paging is described in
-Documentation/x86/x86_64/mm.rst
-
-
-Enabling 5-level paging
-=======================
-CONFIG_X86_5LEVEL=y enables the feature.
-
-Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware.
-In this case additional page table level -- p4d -- will be folded at
-runtime.
-
-User-space and large virtual address space
-==========================================
-On x86, 5-level paging enables 56-bit userspace virtual address space.
-Not all user space is ready to handle wide addresses. It's known that
-at least some JIT compilers use higher bits in pointers to encode their
-information. It collides with valid pointers with 5-level paging and
-leads to crashes.
-
-To mitigate this, we are not going to allocate virtual address space
-above 47-bit by default.
-
-But userspace can ask for allocation from full address space by
-specifying hint address (with or without MAP_FIXED) above 47-bits.
-
-If hint address set above 47-bit, but MAP_FIXED is not specified, we try
-to look for unmapped area by specified address. If it's already
-occupied, we look for unmapped area in *full* address space, rather than
-from 47-bit window.
-
-A high hint address would only affect the allocation in question, but not
-any future mmap()s.
-
-Specifying high hint address on older kernel or on machine without 5-level
-paging support is safe. The hint will be ignored and kernel will fall back
-to allocation from 47-bit address space.
-
-This approach helps to easily make application's memory allocator aware
-about large address space without manually tracking allocated virtual
-address space.
-
-One important case we need to handle here is interaction with MPX.
-MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
-need to make sure that MPX cannot be enabled we already have VMA above
-the boundary and forbid creating such VMAs once MPX is enabled.
diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
deleted file mode 100644
index cbd14124a667..000000000000
--- a/Documentation/x86/x86_64/boot-options.rst
+++ /dev/null
@@ -1,319 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================
-AMD64 Specific Boot Options
-===========================
-
-There are many others (usually documented in driver documentation), but
-only the AMD64 specific ones are listed here.
-
-Machine check
-=============
-Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
-
-   mce=off
-		Disable machine check
-   mce=no_cmci
-		Disable CMCI(Corrected Machine Check Interrupt) that
-		Intel processor supports.  Usually this disablement is
-		not recommended, but it might be handy if your hardware
-		is misbehaving.
-		Note that you'll get more problems without CMCI than with
-		due to the shared banks, i.e. you might get duplicated
-		error logs.
-   mce=dont_log_ce
-		Don't make logs for corrected errors.  All events reported
-		as corrected are silently cleared by OS.
-		This option will be useful if you have no interest in any
-		of corrected errors.
-   mce=ignore_ce
-		Disable features for corrected errors, e.g. polling timer
-		and CMCI.  All events reported as corrected are not cleared
-		by OS and remained in its error banks.
-		Usually this disablement is not recommended, however if
-		there is an agent checking/clearing corrected errors
-		(e.g. BIOS or hardware monitoring applications), conflicting
-		with OS's error handling, and you cannot deactivate the agent,
-		then this option will be a help.
-   mce=no_lmce
-		Do not opt-in to Local MCE delivery. Use legacy method
-		to broadcast MCEs.
-   mce=bootlog
-		Enable logging of machine checks left over from booting.
-		Disabled by default on AMD Fam10h and older because some BIOS
-		leave bogus ones.
-		If your BIOS doesn't do that it's a good idea to enable though
-		to make sure you log even machine check events that result
-		in a reboot. On Intel systems it is enabled by default.
-   mce=nobootlog
-		Disable boot machine check logging.
-   mce=monarchtimeout (number)
-		monarchtimeout:
-		Sets the time in us to wait for other CPUs on machine checks. 0
-		to disable.
-   mce=bios_cmci_threshold
-		Don't overwrite the bios-set CMCI threshold. This boot option
-		prevents Linux from overwriting the CMCI threshold set by the
-		bios. Without this option, Linux always sets the CMCI
-		threshold to 1. Enabling this may make memory predictive failure
-		analysis less effective if the bios sets thresholds for memory
-		errors since we will not see details for all errors.
-   mce=recovery
-		Force-enable recoverable machine check code paths
-
-   nomce (for compatibility with i386)
-		same as mce=off
-
-   Everything else is in sysfs now.
-
-APICs
-=====
-
-   apic
-	Use IO-APIC. Default
-
-   noapic
-	Don't use the IO-APIC.
-
-   disableapic
-	Don't use the local APIC
-
-   nolapic
-     Don't use the local APIC (alias for i386 compatibility)
-
-   pirq=...
-	See Documentation/x86/i386/IO-APIC.rst
-
-   noapictimer
-	Don't set up the APIC timer
-
-   no_timer_check
-	Don't check the IO-APIC timer. This can work around
-	problems with incorrect timer initialization on some boards.
-
-   apicpmtimer
-	Do APIC timer calibration using the pmtimer. Implies
-	apicmaintimer. Useful when your PIT timer is totally broken.
-
-Timing
-======
-
-  notsc
-    Deprecated, use tsc=unstable instead.
-
-  nohpet
-    Don't use the HPET timer.
-
-Idle loop
-=========
-
-  idle=poll
-    Don't do power saving in the idle loop using HLT, but poll for rescheduling
-    event. This will make the CPUs eat a lot more power, but may be useful
-    to get slightly better performance in multiprocessor benchmarks. It also
-    makes some profiling using performance counters more accurate.
-    Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
-    CPUs) this option has no performance advantage over the normal idle loop.
-    It may also interact badly with hyperthreading.
-
-Rebooting
-=========
-
-   reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old]
-      bios
-        Use the CPU reboot vector for warm reset
-      warm
-        Don't set the cold reboot flag
-      cold
-        Set the cold reboot flag
-      triple
-        Force a triple fault (init)
-      kbd
-        Use the keyboard controller. cold reset (default)
-      acpi
-        Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
-        the ACPI reset does not work, the reboot path attempts the reset
-        using the keyboard controller.
-      efi
-        Use efi reset_system runtime service. If EFI is not configured or
-        the EFI reset does not work, the reboot path attempts the reset using
-        the keyboard controller.
-      pci
-        Use a write to the PCI config space register 0xcf9 to trigger reboot.
-
-   Using warm reset will be much faster especially on big memory
-   systems because the BIOS will not go through the memory check.
-   Disadvantage is that not all hardware will be completely reinitialized
-   on reboot so there may be boot problems on some systems.
-
-   reboot=force
-     Don't stop other CPUs on reboot. This can make reboot more reliable
-     in some cases.
-
-   reboot=default
-     There are some built-in platform specific "quirks" - you may see:
-     "reboot: <name> series board detected. Selecting <type> for reboots."
-     In the case where you think the quirk is in error (e.g. you have
-     newer BIOS, or newer board) using this option will ignore the built-in
-     quirk table, and use the generic default reboot actions.
-
-NUMA
-====
-
-  numa=off
-    Only set up a single NUMA node spanning all memory.
-
-  numa=noacpi
-    Don't parse the SRAT table for NUMA setup
-
-  numa=nohmat
-    Don't parse the HMAT table for NUMA setup, or soft-reserved memory
-    partitioning.
-
-  numa=fake=<size>[MG]
-    If given as a memory unit, fills all system RAM with nodes of
-    size interleaved over physical nodes.
-
-  numa=fake=<N>
-    If given as an integer, fills all system RAM with N fake nodes
-    interleaved over physical nodes.
-
-  numa=fake=<N>U
-    If given as an integer followed by 'U', it will divide each
-    physical node into N emulated nodes.
-
-ACPI
-====
-
-  acpi=off
-    Don't enable ACPI
-  acpi=ht
-    Use ACPI boot table parsing, but don't enable ACPI interpreter
-  acpi=force
-    Force ACPI on (currently not needed)
-  acpi=strict
-    Disable out of spec ACPI workarounds.
-  acpi_sci={edge,level,high,low}
-    Set up ACPI SCI interrupt.
-  acpi=noirq
-    Don't route interrupts
-  acpi=nocmcff
-    Disable firmware first mode for corrected errors. This
-    disables parsing the HEST CMC error source to check if
-    firmware has set the FF flag. This may result in
-    duplicate corrected error reports.
-
-PCI
-===
-
-  pci=off
-    Don't use PCI
-  pci=conf1
-    Use conf1 access.
-  pci=conf2
-    Use conf2 access.
-  pci=rom
-    Assign ROMs.
-  pci=assign-busses
-    Assign busses
-  pci=irqmask=MASK
-    Set PCI interrupt mask to MASK
-  pci=lastbus=NUMBER
-    Scan up to NUMBER busses, no matter what the mptable says.
-  pci=noacpi
-    Don't use ACPI to set up PCI interrupt routing.
-
-IOMMU (input/output memory management unit)
-===========================================
-Multiple x86-64 PCI-DMA mapping implementations exist, for example:
-
-   1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all
-      (e.g. because you have < 3 GB memory).
-      Kernel boot message: "PCI-DMA: Disabling IOMMU"
-
-   2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
-      Kernel boot message: "PCI-DMA: using GART IOMMU"
-
-   3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
-      e.g. if there is no hardware IOMMU in the system and it is need because
-      you have >3GB memory or told the kernel to us it (iommu=soft))
-      Kernel boot message: "PCI-DMA: Using software bounce buffering
-      for IO (SWIOTLB)"
-
-::
-
-  iommu=[<size>][,noagp][,off][,force][,noforce]
-  [,memaper[=<order>]][,merge][,fullflush][,nomerge]
-  [,noaperture]
-
-General iommu options:
-
-    off
-      Don't initialize and use any kind of IOMMU.
-    noforce
-      Don't force hardware IOMMU usage when it is not needed. (default).
-    force
-      Force the use of the hardware IOMMU even when it is
-      not actually needed (e.g. because < 3 GB memory).
-    soft
-      Use software bounce buffering (SWIOTLB) (default for
-      Intel machines). This can be used to prevent the usage
-      of an available hardware IOMMU.
-
-iommu options only relevant to the AMD GART hardware IOMMU:
-
-    <size>
-      Set the size of the remapping area in bytes.
-    allowed
-      Overwrite iommu off workarounds for specific chipsets.
-    fullflush
-      Flush IOMMU on each allocation (default).
-    nofullflush
-      Don't use IOMMU fullflush.
-    memaper[=<order>]
-      Allocate an own aperture over RAM with size 32MB<<order.
-      (default: order=1, i.e. 64MB)
-    merge
-      Do scatter-gather (SG) merging. Implies "force" (experimental).
-    nomerge
-      Don't do scatter-gather (SG) merging.
-    noaperture
-      Ask the IOMMU not to touch the aperture for AGP.
-    noagp
-      Don't initialize the AGP driver and use full aperture.
-    panic
-      Always panic when IOMMU overflows.
-
-iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
-implementation:
-
-    swiotlb=<slots>[,force,noforce]
-      <slots>
-        Prereserve that many 2K slots for the software IO bounce buffering.
-      force
-        Force all IO through the software TLB.
-      noforce
-        Do not initialize the software TLB.
-
-
-Miscellaneous
-=============
-
-  nogbpages
-    Do not use GB pages for kernel direct mappings.
-  gbpages
-    Use GB pages for kernel direct mappings.
-
-
-AMD SEV (Secure Encrypted Virtualization)
-=========================================
-Options relating to AMD SEV, specified via the following format:
-
-::
-
-   sev=option1[,option2]
-
-The available options are:
-
-   debug
-     Enable debug messages.
diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec.rst b/Documentation/x86/x86_64/cpu-hotplug-spec.rst
deleted file mode 100644
index 8d1c91f0c880..000000000000
--- a/Documentation/x86/x86_64/cpu-hotplug-spec.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================================================
-Firmware support for CPU hotplug under Linux/x86-64
-===================================================
-
-Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
-know in advance of boot time the maximum number of CPUs that could be plugged
-into the system. ACPI 3.0 currently has no official way to supply
-this information from the firmware to the operating system.
-
-In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
-ACPI 3.0 specification).  ACPI already has the concept of disabled LAPIC
-objects by setting the Enabled bit in the LAPIC object to zero.
-
-For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
-CPU is already available in the MADT. If the CPU is not available yet
-it should have its LAPIC Enabled bit set to 0. Linux will use the number
-of disabled LAPICs to compute the maximum number of future CPUs.
-
-In the worst case the user can overwrite this choice using a command line
-option (additional_cpus=...), but it is recommended to supply the correct
-number (or a reasonable approximation of it, with erring towards more not less)
-in the MADT to avoid manual configuration.
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
deleted file mode 100644
index ff9bcfd2cc14..000000000000
--- a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================
-Fake NUMA For CPUSets
-=====================
-
-:Author: David Rientjes <rientjes@cs.washington.edu>
-
-Using numa=fake and CPUSets for Resource Management
-
-This document describes how the numa=fake x86_64 command-line option can be used
-in conjunction with cpusets for coarse memory management.  Using this feature,
-you can create fake NUMA nodes that represent contiguous chunks of memory and
-assign them to cpusets and their attached tasks.  This is a way of limiting the
-amount of system memory that are available to a certain class of tasks.
-
-For more information on the features of cpusets, see
-Documentation/admin-guide/cgroup-v1/cpusets.rst.
-There are a number of different configurations you can use for your needs.  For
-more information on the numa=fake command line option and its various ways of
-configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst.
-
-For the purposes of this introduction, we'll assume a very primitive NUMA
-emulation setup of "numa=fake=4*512,".  This will split our system memory into
-four equal chunks of 512M each that we can now use to assign to cpusets.  As
-you become more familiar with using this combination for resource control,
-you'll determine a better setup to minimize the number of nodes you have to deal
-with.
-
-A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
-
-	Faking node 0 at 0000000000000000-0000000020000000 (512MB)
-	Faking node 1 at 0000000020000000-0000000040000000 (512MB)
-	Faking node 2 at 0000000040000000-0000000060000000 (512MB)
-	Faking node 3 at 0000000060000000-0000000080000000 (512MB)
-	...
-	On node 0 totalpages: 130975
-	On node 1 totalpages: 131072
-	On node 2 totalpages: 131072
-	On node 3 totalpages: 131072
-
-Now following the instructions for mounting the cpusets filesystem from
-Documentation/admin-guide/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory
-address spaces) to individual cpusets::
-
-	[root@xroads /]# mkdir exampleset
-	[root@xroads /]# mount -t cpuset none exampleset
-	[root@xroads /]# mkdir exampleset/ddset
-	[root@xroads /]# cd exampleset/ddset
-	[root@xroads /exampleset/ddset]# echo 0-1 > cpus
-	[root@xroads /exampleset/ddset]# echo 0-1 > mems
-
-Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
-memory allocations (1G).
-
-You can now assign tasks to these cpusets to limit the memory resources
-available to them according to the fake nodes assigned as mems::
-
-	[root@xroads /exampleset/ddset]# echo $$ > tasks
-	[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
-	[1] 13425
-
-Notice the difference between the system memory usage as reported by
-/proc/meminfo between the restricted cpuset case above and the unrestricted
-case (i.e. running the same 'dd' command without assigning it to a fake NUMA
-cpuset):
-
-	========	============	==========
-	Name		Unrestricted	Restricted
-	========	============	==========
-	MemTotal	3091900 kB	3091900 kB
-	MemFree		42113 kB	1513236 kB
-	========	============	==========
-
-This allows for coarse memory management for the tasks you assign to particular
-cpusets.  Since cpusets can form a hierarchy, you can create some pretty
-interesting combinations of use-cases for various classes of tasks for your
-memory management needs.
diff --git a/Documentation/x86/x86_64/fsgs.rst b/Documentation/x86/x86_64/fsgs.rst
deleted file mode 100644
index 50960e09e1f6..000000000000
--- a/Documentation/x86/x86_64/fsgs.rst
+++ /dev/null
@@ -1,199 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Using FS and GS segments in user space applications
-===================================================
-
-The x86 architecture supports segmentation. Instructions which access
-memory can use segment register based addressing mode. The following
-notation is used to address a byte within a segment:
-
-  Segment-register:Byte-address
-
-The segment base address is added to the Byte-address to compute the
-resulting virtual address which is accessed. This allows to access multiple
-instances of data with the identical Byte-address, i.e. the same code. The
-selection of a particular instance is purely based on the base-address in
-the segment register.
-
-In 32-bit mode the CPU provides 6 segments, which also support segment
-limits. The limits can be used to enforce address space protections.
-
-In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is
-always 0 to provide a full 64bit address space. The FS and GS segments are
-still functional in 64-bit mode.
-
-Common FS and GS usage
-------------------------------
-
-The FS segment is commonly used to address Thread Local Storage (TLS). FS
-is usually managed by runtime code or a threading library. Variables
-declared with the '__thread' storage class specifier are instantiated per
-thread and the compiler emits the FS: address prefix for accesses to these
-variables. Each thread has its own FS base address so common code can be
-used without complex address offset calculations to access the per thread
-instances. Applications should not use FS for other purposes when they use
-runtimes or threading libraries which manage the per thread FS.
-
-The GS segment has no common use and can be used freely by
-applications. GCC and Clang support GS based addressing via address space
-identifiers.
-
-Reading and writing the FS/GS base address
-------------------------------------------
-
-There exist two mechanisms to read and write the FS/GS base address:
-
- - the arch_prctl() system call
-
- - the FSGSBASE instruction family
-
-Accessing FS/GS base with arch_prctl()
---------------------------------------
-
- The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all
- kernel versions.
-
- Reading the base:
-
-   arch_prctl(ARCH_GET_FS, &fsbase);
-   arch_prctl(ARCH_GET_GS, &gsbase);
-
- Writing the base:
-
-   arch_prctl(ARCH_SET_FS, fsbase);
-   arch_prctl(ARCH_SET_GS, gsbase);
-
- The ARCH_SET_GS prctl may be disabled depending on kernel configuration
- and security settings.
-
-Accessing FS/GS base with the FSGSBASE instructions
----------------------------------------------------
-
- With the Ivy Bridge CPU generation Intel introduced a new set of
- instructions to access the FS and GS base registers directly from user
- space. These instructions are also supported on AMD Family 17H CPUs. The
- following instructions are available:
-
-  =============== ===========================
-  RDFSBASE %reg   Read the FS base register
-  RDGSBASE %reg   Read the GS base register
-  WRFSBASE %reg   Write the FS base register
-  WRGSBASE %reg   Write the GS base register
-  =============== ===========================
-
- The instructions avoid the overhead of the arch_prctl() syscall and allow
- more flexible usage of the FS/GS addressing modes in user space
- applications. This does not prevent conflicts between threading libraries
- and runtimes which utilize FS and applications which want to use it for
- their own purpose.
-
-FSGSBASE instructions enablement
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If
- available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs.
-
- The availability of the instructions does not enable them
- automatically. The kernel has to enable them explicitly in CR4. The
- reason for this is that older kernels make assumptions about the values in
- the GS register and enforce them when GS base is set via
- arch_prctl(). Allowing user space to write arbitrary values to GS base
- would violate these assumptions and cause malfunction.
-
- On kernels which do not enable FSGSBASE the execution of the FSGSBASE
- instructions will fault with a #UD exception.
-
- The kernel provides reliable information about the enabled state in the
- ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the
- kernel has FSGSBASE instructions enabled and applications can use them.
- The following code example shows how this detection works::
-
-   #include <sys/auxv.h>
-   #include <elf.h>
-
-   /* Will be eventually in asm/hwcap.h */
-   #ifndef HWCAP2_FSGSBASE
-   #define HWCAP2_FSGSBASE        (1 << 1)
-   #endif
-
-   ....
-
-   unsigned val = getauxval(AT_HWCAP2);
-
-   if (val & HWCAP2_FSGSBASE)
-        printf("FSGSBASE enabled\n");
-
-FSGSBASE instructions compiler support
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE
-instructions. Clang 5 supports them as well.
-
-  =================== ===========================
-  _readfsbase_u64()   Read the FS base register
-  _readfsbase_u64()   Read the GS base register
-  _writefsbase_u64()  Write the FS base register
-  _writegsbase_u64()  Write the GS base register
-  =================== ===========================
-
-To utilize these instrinsics <immintrin.h> must be included in the source
-code and the compiler option -mfsgsbase has to be added.
-
-Compiler support for FS/GS based addressing
--------------------------------------------
-
-GCC version 6 and newer provide support for FS/GS based addressing via
-Named Address Spaces. GCC implements the following address space
-identifiers for x86:
-
-  ========= ====================================
-  __seg_fs  Variable is addressed relative to FS
-  __seg_gs  Variable is addressed relative to GS
-  ========= ====================================
-
-The preprocessor symbols __SEG_FS and __SEG_GS are defined when these
-address spaces are supported. Code which implements fallback modes should
-check whether these symbols are defined. Usage example::
-
-  #ifdef __SEG_GS
-
-  long data0 = 0;
-  long data1 = 1;
-
-  long __seg_gs *ptr;
-
-  /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */
-  ....
-
-  /* Set GS base to point to data0 */
-  _writegsbase_u64(&data0);
-
-  /* Access offset 0 of GS */
-  ptr = 0;
-  printf("data0 = %ld\n", *ptr);
-
-  /* Set GS base to point to data1 */
-  _writegsbase_u64(&data1);
-  /* ptr still addresses offset 0! */
-  printf("data1 = %ld\n", *ptr);
-
-
-Clang does not provide the GCC address space identifiers, but it provides
-address spaces via an attribute based mechanism in Clang 2.6 and newer
-versions:
-
- ==================================== =====================================
-  __attribute__((address_space(256))  Variable is addressed relative to GS
-  __attribute__((address_space(257))  Variable is addressed relative to FS
- ==================================== =====================================
-
-FS/GS based addressing with inline assembly
--------------------------------------------
-
-In case the compiler does not support address spaces, inline assembly can
-be used for FS/GS based addressing mode::
-
-	mov %fs:offset, %reg
-	mov %gs:offset, %reg
-
-	mov %reg, %fs:offset
-	mov %reg, %gs:offset
diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst
deleted file mode 100644
index a56070fc8e77..000000000000
--- a/Documentation/x86/x86_64/index.rst
+++ /dev/null
@@ -1,17 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============
-x86_64 Support
-==============
-
-.. toctree::
-   :maxdepth: 2
-
-   boot-options
-   uefi
-   mm
-   5level-paging
-   fake-numa-for-cpusets
-   cpu-hotplug-spec
-   machinecheck
-   fsgs
diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst
deleted file mode 100644
index cea12ee97200..000000000000
--- a/Documentation/x86/x86_64/machinecheck.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============================================================
-Configurable sysfs parameters for the x86-64 machine check code
-===============================================================
-
-Machine checks report internal hardware error conditions detected
-by the CPU. Uncorrected errors typically cause a machine check
-(often with panic), corrected ones cause a machine check log entry.
-
-Machine checks are organized in banks (normally associated with
-a hardware subsystem) and subevents in a bank. The exact meaning
-of the banks and subevent is CPU specific.
-
-mcelog knows how to decode them.
-
-When you see the "Machine check errors logged" message in the system
-log then mcelog should run to collect and decode machine check entries
-from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
-
-Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
-(N = CPU number).
-
-The directory contains some configurable entries. See
-Documentation/ABI/testing/sysfs-mce for more details.
-
-TBD document entries for AMD threshold interrupt configuration
-
-For more details about the x86 machine check architecture
-see the Intel and AMD architecture manuals from their developer websites.
-
-For more details about the architecture
-see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst
deleted file mode 100644
index 35e5e18c83d0..000000000000
--- a/Documentation/x86/x86_64/mm.rst
+++ /dev/null
@@ -1,157 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=================
-Memory Management
-=================
-
-Complete virtual memory map with 4-level page tables
-====================================================
-
-.. note::
-
- - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
-   from the top of the 64-bit address space. It's easier to understand the layout
-   when seen both in absolute addresses and in distance-from-top notation.
-
-   For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
-   64-bit address space (ffffffffffffffff).
-
-   Note that as we get closer to the top of the address space, the notation changes
-   from TB to GB and then MB/KB.
-
- - "16M TB" might look weird at first sight, but it's an easier way to visualize size
-   notation than "16 EB", which few will recognize at first sight as 16 exabytes.
-   It also shows it nicely how incredibly large 64-bit address space is.
-
-::
-
-  ========================================================================================================================
-      Start addr    |   Offset   |     End addr     |  Size   | VM area description
-  ========================================================================================================================
-                    |            |                  |         |
-   0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
-  __________________|____________|__________________|_________|___________________________________________________________
-                    |            |                  |         |
-   0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
-                    |            |                  |         |     virtual memory addresses up to the -128 TB
-                    |            |                  |         |     starting offset of kernel mappings.
-  __________________|____________|__________________|_________|___________________________________________________________
-                                                              |
-                                                              | Kernel-space virtual memory, shared between all processes:
-  ____________________________________________________________|___________________________________________________________
-                    |            |                  |         |
-   ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
-   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
-   ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
-   ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
-   ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
-   ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
-   ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
-   ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
-   ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
-  __________________|____________|__________________|_________|____________________________________________________________
-                                                              |
-                                                              | Identical layout to the 56-bit one from here on:
-  ____________________________________________________________|____________________________________________________________
-                    |            |                  |         |
-   fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
-                    |            |                  |         | vaddr_end for KASLR
-   fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
-   fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
-   ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
-   ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
-   ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
-   ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
-   ffffffff80000000 |-2048    MB |                  |         |
-   ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
-   ffffffffff000000 |  -16    MB |                  |         |
-      FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
-   ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
-   ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
-  __________________|____________|__________________|_________|___________________________________________________________
-
-
-Complete virtual memory map with 5-level page tables
-====================================================
-
-.. note::
-
- - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
-   from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
-   offset and many of the regions expand to support the much larger physical
-   memory supported.
-
-::
-
-  ========================================================================================================================
-      Start addr    |   Offset   |     End addr     |  Size   | VM area description
-  ========================================================================================================================
-                    |            |                  |         |
-   0000000000000000 |    0       | 00ffffffffffffff |   64 PB | user-space virtual memory, different per mm
-  __________________|____________|__________________|_________|___________________________________________________________
-                    |            |                  |         |
-   0100000000000000 |  +64    PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
-                    |            |                  |         |     virtual memory addresses up to the -64 PB
-                    |            |                  |         |     starting offset of kernel mappings.
-  __________________|____________|__________________|_________|___________________________________________________________
-                                                              |
-                                                              | Kernel-space virtual memory, shared between all processes:
-  ____________________________________________________________|___________________________________________________________
-                    |            |                  |         |
-   ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
-   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
-   ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
-   ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
-   ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
-   ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
-   ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
-   ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
-   ffdf000000000000 |   -8.25 PB | fffffbffffffffff |   ~8 PB | KASAN shadow memory
-  __________________|____________|__________________|_________|____________________________________________________________
-                                                              |
-                                                              | Identical layout to the 47-bit one from here on:
-  ____________________________________________________________|____________________________________________________________
-                    |            |                  |         |
-   fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
-                    |            |                  |         | vaddr_end for KASLR
-   fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
-   fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
-   ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
-   ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
-   ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
-   ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
-   ffffffff80000000 |-2048    MB |                  |         |
-   ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
-   ffffffffff000000 |  -16    MB |                  |         |
-      FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
-   ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
-   ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
-  __________________|____________|__________________|_________|___________________________________________________________
-
-Architecture defines a 64-bit virtual address. Implementations can support
-less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
-through to the most-significant implemented bit are sign extended.
-This causes hole between user space and kernel addresses if you interpret them
-as unsigned.
-
-The direct mapping covers all memory in the system up to the highest
-memory address (this means in some cases it can also include PCI memory
-holes).
-
-We map EFI runtime services in the 'efi_pgd' PGD in a 64GB large virtual
-memory window (this size is arbitrary, it can be raised later if needed).
-The mappings are not part of any other kernel PGD and are only available
-during EFI runtime calls.
-
-Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
-physical memory, vmalloc/ioremap space and virtual memory map are randomized.
-Their order is preserved but their base will be offset early at boot time.
-
-Be very careful vs. KASLR when changing anything here. The KASLR address
-range must not overlap with anything except the KASAN shadow area, which is
-correct as KASAN disables KASLR.
-
-For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
-hole: ffffffffffff4111
diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/x86/x86_64/uefi.rst
deleted file mode 100644
index fbc30c9a071d..000000000000
--- a/Documentation/x86/x86_64/uefi.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================================
-General note on [U]EFI x86_64 support
-=====================================
-
-The nomenclature EFI and UEFI are used interchangeably in this document.
-
-Although the tools below are _not_ needed for building the kernel,
-the needed bootloader support and associated tools for x86_64 platforms
-with EFI firmware and specifications are listed below.
-
-1. UEFI specification:  http://www.uefi.org
-
-2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
-   support. Elilo with x86_64 support can be used.
-
-3. x86_64 platform with EFI/UEFI firmware.
-
-Mechanics
----------
-
-- Build the kernel with the following configuration::
-
-	CONFIG_FB_EFI=y
-	CONFIG_FRAMEBUFFER_CONSOLE=y
-
-  If EFI runtime services are expected, the following configuration should
-  be selected::
-
-	CONFIG_EFI=y
-	CONFIG_EFIVAR_FS=y or m		# optional
-
-- Create a VFAT partition on the disk
-- Copy the following to the VFAT partition:
-
-	elilo bootloader with x86_64 support, elilo configuration file,
-	kernel image built in first step and corresponding
-	initrd. Instructions on building elilo and its dependencies
-	can be found in the elilo sourceforge project.
-
-- Boot to EFI shell and invoke elilo choosing the kernel image built
-  in first step.
-- If some or all EFI runtime services don't work, you can try following
-  kernel command line parameters to turn off some or all EFI runtime
-  services.
-
-	noefi
-		turn off all EFI runtime services
-	reboot_type=k
-		turn off EFI reboot runtime service
-
-- If the EFI memory map has additional entries not in the E820 map,
-  you can include those entries in the kernels memory map of available
-  physical RAM by using the following kernel command line parameter.
-
-	add_efi_memmap
-		include EFI memory map of available physical RAM
diff --git a/Documentation/x86/xstate.rst b/Documentation/x86/xstate.rst
deleted file mode 100644
index 5cec7fb558d6..000000000000
--- a/Documentation/x86/xstate.rst
+++ /dev/null
@@ -1,74 +0,0 @@
-Using XSTATE features in user space applications
-================================================
-
-The x86 architecture supports floating-point extensions which are
-enumerated via CPUID. Applications consult CPUID and use XGETBV to
-evaluate which features have been enabled by the kernel XCR0.
-
-Up to AVX-512 and PKRU states, these features are automatically enabled by
-the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
-are enabled by XCR0 as well, but the first use of related instruction is
-trapped by the kernel because by default the required large XSTATE buffers
-are not allocated automatically.
-
-Using dynamically enabled XSTATE features in user space applications
---------------------------------------------------------------------
-
-The kernel provides an arch_prctl(2) based mechanism for applications to
-request the usage of such features. The arch_prctl(2) options related to
-this are:
-
--ARCH_GET_XCOMP_SUPP
-
- arch_prctl(ARCH_GET_XCOMP_SUPP, &features);
-
- ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
- type uint64_t. The second argument is a pointer to that storage.
-
--ARCH_GET_XCOMP_PERM
-
- arch_prctl(ARCH_GET_XCOMP_PERM, &features);
-
- ARCH_GET_XCOMP_PERM stores the features for which the userspace process
- has permission in userspace storage of type uint64_t. The second argument
- is a pointer to that storage.
-
--ARCH_REQ_XCOMP_PERM
-
- arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);
-
- ARCH_REQ_XCOMP_PERM allows to request permission for a dynamically enabled
- feature or a feature set. A feature set can be mapped to a facility, e.g.
- AMX, and can require one or more XSTATE components to be enabled.
-
- The feature argument is the number of the highest XSTATE component which
- is required for a facility to work.
-
-When requesting permission for a feature, the kernel checks the
-availability. The kernel ensures that sigaltstacks in the process's tasks
-are large enough to accommodate the resulting large signal frame. It
-enforces this both during ARCH_REQ_XCOMP_SUPP and during any subsequent
-sigaltstack(2) calls. If an installed sigaltstack is smaller than the
-resulting sigframe size, ARCH_REQ_XCOMP_SUPP results in -ENOSUPP. Also,
-sigaltstack(2) results in -ENOMEM if the requested altstack is too small
-for the permitted features.
-
-Permission, when granted, is valid per process. Permissions are inherited
-on fork(2) and cleared on exec(3).
-
-The first use of an instruction related to a dynamically enabled feature is
-trapped by the kernel. The trap handler checks whether the process has
-permission to use the feature. If the process has no permission then the
-kernel sends SIGILL to the application. If the process has permission then
-the handler allocates a larger xstate buffer for the task so the large
-state can be context switched. In the unlikely cases that the allocation
-fails, the kernel sends SIGSEGV.
-
-Dynamic features in signal frames
----------------------------------
-
-Dynamcally enabled features are not written to the signal frame upon signal
-entry if the feature is in its initial configuration.  This differs from
-non-dynamic features which are always written regardless of their
-configuration.  Signal handlers can examine the XSAVE buffer's XSTATE_BV
-field to determine if a features was written.
diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
deleted file mode 100644
index 45aa9cceb4f1..000000000000
--- a/Documentation/x86/zero-page.rst
+++ /dev/null
@@ -1,47 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========
-Zero Page
-=========
-The additional fields in struct boot_params as a part of 32-bit boot
-protocol of kernel. These should be filled by bootloader or 16-bit
-real-mode setup code of the kernel. References/settings to it mainly
-are in::
-
-  arch/x86/include/uapi/asm/bootparam.h
-
-===========	=====	=======================	=================================================
-Offset/Size	Proto	Name			Meaning
-
-000/040		ALL	screen_info		Text mode or frame buffer information
-						(struct screen_info)
-040/014		ALL	apm_bios_info		APM BIOS information (struct apm_bios_info)
-058/008		ALL	tboot_addr      	Physical address of tboot shared page
-060/010		ALL	ist_info		Intel SpeedStep (IST) BIOS support information
-						(struct ist_info)
-070/008		ALL	acpi_rsdp_addr		Physical address of ACPI RSDP table
-080/010		ALL	hd0_info		hd0 disk parameter, OBSOLETE!!
-090/010		ALL	hd1_info		hd1 disk parameter, OBSOLETE!!
-0A0/010		ALL	sys_desc_table		System description table (struct sys_desc_table),
-						OBSOLETE!!
-0B0/010		ALL	olpc_ofw_header		OLPC's OpenFirmware CIF and friends
-0C0/004		ALL	ext_ramdisk_image	ramdisk_image high 32bits
-0C4/004		ALL	ext_ramdisk_size	ramdisk_size high 32bits
-0C8/004		ALL	ext_cmd_line_ptr	cmd_line_ptr high 32bits
-13C/004		ALL	cc_blob_address		Physical address of Confidential Computing blob
-140/080		ALL	edid_info		Video mode setup (struct edid_info)
-1C0/020		ALL	efi_info		EFI 32 information (struct efi_info)
-1E0/004		ALL	alt_mem_k		Alternative mem check, in KB
-1E4/004		ALL	scratch			Scratch field for the kernel setup code
-1E8/001		ALL	e820_entries		Number of entries in e820_table (below)
-1E9/001		ALL	eddbuf_entries		Number of entries in eddbuf (below)
-1EA/001		ALL	edd_mbr_sig_buf_entries	Number of entries in edd_mbr_sig_buffer
-						(below)
-1EB/001		ALL     kbd_status      	Numlock is enabled
-1EC/001		ALL     secure_boot		Secure boot is enabled in the firmware
-1EF/001		ALL	sentinel		Used to detect broken bootloaders
-290/040		ALL	edd_mbr_sig_buffer	EDD MBR signatures
-2D0/A00		ALL	e820_table		E820 memory map table
-						(array of struct e820_entry)
-D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
-===========	=====	=======================	=================================================
author	Jonathan Corbet <corbet@lwn.net>	2023-03-14 17:06:44 -0600
committer	Jonathan Corbet <corbet@lwn.net>	2023-03-30 12:58:51 -0600
commit	ff61f0791ce969d2db6c9f3b71d74ceec0a2e958 (patch)
tree	fe32be44aaf65f9c436a8f37cd4a18f6ec47c3cb /Documentation/x86
parent	f030c8fd64cea916d57d40bb7b59c1cff9ea3bc3 (diff)
download	lwn-ff61f0791ce969d2db6c9f3b71d74ceec0a2e958.tar.gz lwn-ff61f0791ce969d2db6c9f3b71d74ceec0a2e958.zip