blob: 2b84dc8c3149d04be0306021051528a61b952a32 (
plain) (
tree)
|
|
What: /sys/bus/platform/devices/smpro-errmon.*/error_[core|mem|pcie|other]_[ce|ue]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Contains the 48-byte Ampere (Vendor-Specific) Error Record printed
in hex format according to the table below:
+--------+---------------+-------------+------------------------------------------------------------+
| Offset | Field | Size (byte) | Description |
+--------+---------------+-------------+------------------------------------------------------------+
| 00 | Error Type | 1 | See :ref:`the table below <smpro-error-types>` for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 01 | Subtype | 1 | See :ref:`the table below <smpro-error-types>` for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 02 | Instance | 2 | See :ref:`the table below <smpro-error-types>` for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 04 | Error status | 4 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 08 | Error Address | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 16 | Error Misc 0 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 24 | Error Misc 1 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 32 | Error Misc 2 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
| 40 | Error Misc 3 | 8 | See ARM RAS specification for details |
+--------+---------------+-------------+------------------------------------------------------------+
The table below defines the value of error types, their subtype, subcomponent and instance:
.. _smpro-error-types:
+-----------------+------------+----------+----------------+----------------------------------------+
| Error Group | Error Type | Sub type | Sub component | Instance |
+-----------------+------------+----------+----------------+----------------------------------------+
| CPM (core) | 0 | 0 | Snoop-Logic | CPM # |
+-----------------+------------+----------+----------------+----------------------------------------+
| CPM (core) | 0 | 2 | Armv8 Core 1 | CPM # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 1 | ERR1 | MCU # \| SLOT << 11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 2 | ERR2 | MCU # \| SLOT << 11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 3 | ERR3 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 4 | ERR4 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 5 | ERR5 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 6 | ERR6 | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| MCU (mem) | 1 | 7 | Link Error | MCU # |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 0 | Cross Point | X \| (Y << 5) \| NS <<11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 1 | Home Node(IO) | X \| (Y << 5) \| NS <<11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 2 | Home Node(Mem) | X \| (Y << 5) \| NS <<11 \| device<<12 |
+-----------------+------------+----------+----------------+----------------------------------------+
| Mesh (other) | 2 | 4 | CCIX Node | X \| (Y << 5) \| NS <<11 |
+-----------------+------------+----------+----------------+----------------------------------------+
| 2P Link (other) | 3 | 0 | N/A | Altra 2P Link # |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 0 | ERR0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 1 | ERR1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 2 | ERR2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 3 | ERR3 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 4 | ERR4 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 5 | ERR5 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 6 | ERR6 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 7 | ERR7 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 8 | ERR8 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 9 | ERR9 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 10 | ERR10 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 11 | ERR11 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 12 | ERR12 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| GIC (other) | 5 | 13-21 | ERR13 | RC # + 1 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TCU | 100 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU0 | 0 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU1 | 1 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU2 | 2 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU3 | 3 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU4 | 4 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU5 | 5 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU6 | 6 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU7 | 7 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU8 | 8 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMMU (other) | 6 | TBU9 | 9 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe AER (pcie) | 7 | Root | 0 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe AER (pcie) | 7 | Device | 1 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe RC (pcie) | 8 | RCA HB | 0 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe RC (pcie) | 8 | RCB HB | 1 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| PCIe RC (pcie) | 8 | RASDP | 8 | RC # |
+-----------------+------------+----------+----------------+----------------------------------------+
| OCM (other) | 9 | ERR0 | 0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| OCM (other) | 9 | ERR1 | 1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| OCM (other) | 9 | ERR2 | 2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMpro (other) | 10 | ERR0 | 0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMpro (other) | 10 | ERR1 | 1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| SMpro (other) | 10 | MPA_ERR | 2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| PMpro (other) | 11 | ERR0 | 0 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| PMpro (other) | 11 | ERR1 | 1 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
| PMpro (other) | 11 | MPA_ERR | 2 | 0 |
+-----------------+------------+----------+----------------+----------------------------------------+
Example::
# cat error_other_ue
880807001e004010401040101500000001004010401040100c0000000000000000000000000000000000000000000000
The detail of each sysfs entries is as below:
+-------------+---------------------------------------------------------+----------------------------------+
| Error | Sysfs entry | Description (when triggered) |
+-------------+---------------------------------------------------------+----------------------------------+
| Core's CE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ce | Core has CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Core's UE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ue | Core has UE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ce | Memory has CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ue | Memory has UE error |
+-------------+---------------------------------------------------------+----------------------------------+
| PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ce | any PCIe controller has CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ue | any PCIe controller has UE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Other's CE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ce | any other CE error |
+-------------+---------------------------------------------------------+----------------------------------+
| Other's UE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ue | any other UE error |
+-------------+---------------------------------------------------------+----------------------------------+
UE: Uncorrect-able Error
CE: Correct-able Error
For details, see section `3.3 Ampere (Vendor-Specific) Error Record Formats,
Altra Family RAS Supplement`.
What: /sys/bus/platform/devices/smpro-errmon.*/overflow_[core|mem|pcie|other]_[ce|ue]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Return the overflow status of each type HW error reported:
- 0 : No overflow
- 1 : There is an overflow and the oldest HW errors are dropped
The detail of each sysfs entries is as below:
+-------------+-----------------------------------------------------------+---------------------------------------+
| Overflow | Sysfs entry | Description |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Core's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ce | Core CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Core's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ue | Core UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ce | Memory CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ue | Memory UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ce | any PCIe controller CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ue | any PCIe controller UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Other's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ce| any other CE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
| Other's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ue| other UE error overflow |
+-------------+-----------------------------------------------------------+---------------------------------------+
where:
- UE: Uncorrect-able Error
- CE: Correct-able Error
What: /sys/bus/platform/devices/smpro-errmon.*/[error|warn]_[smpro|pmpro]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Contains the internal firmware error/warning printed as hex format.
The detail of each sysfs entries is as below:
+---------------+------------------------------------------------------+--------------------------+
| Error | Sysfs entry | Description |
+---------------+------------------------------------------------------+--------------------------+
| SMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_smpro | system has SMpro error |
+---------------+------------------------------------------------------+--------------------------+
| SMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_smpro | system has SMpro warning |
+---------------+------------------------------------------------------+--------------------------+
| PMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_pmpro | system has PMpro error |
+---------------+------------------------------------------------------+--------------------------+
| PMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_pmpro | system has PMpro warning |
+---------------+------------------------------------------------------+--------------------------+
For details, see section `5.10 RAS Internal Error Register Definitions,
Altra Family Soc BMC Interface Specification`.
What: /sys/bus/platform/devices/smpro-errmon.*/event_[vrd_warn_fault|vrd_hot|dimm_hot]
KernelVersion: 6.1
Contact: Quan Nguyen <quan@os.amperecomputing.com>
Description:
(RO) Contains the detail information in case of VRD/DIMM warning/hot events
in hex format as below::
AAAA
where:
- ``AAAA``: The event detail information data
The detail of each sysfs entries is as below:
+---------------+---------------------------------------------------------------+---------------------+
| Event | Sysfs entry | Description |
+---------------+---------------------------------------------------------------+---------------------+
| VRD HOT | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_hot | VRD Hot |
+---------------+---------------------------------------------------------------+---------------------+
| VR Warn/Fault | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_warn_fault | VR Warning or Fault |
+---------------+---------------------------------------------------------------+---------------------+
| DIMM HOT | /sys/bus/platform/devices/smpro-errmon.*/event_dimm_hot | DIMM Hot |
+---------------+---------------------------------------------------------------+---------------------+
For more details, see section `5.7 GPI Status Registers,
Altra Family Soc BMC Interface Specification`.
|