diff options
author | Rafael J. Wysocki <rafael.j.wysocki@intel.com> | 2024-07-18 21:01:14 +0200 |
---|---|---|
committer | Rafael J. Wysocki <rafael.j.wysocki@intel.com> | 2024-07-24 12:40:23 +0200 |
commit | f7c1b0e4ae47e67c6f9af84568a5f4a80638ccd8 (patch) | |
tree | 1de4c3f5587f3df8621bfdaf9de12e78fff447fb /drivers/thermal/thermal_core.h | |
parent | e5f98896efb3b6350cb6f1c241394966dcbcf240 (diff) | |
download | lwn-f7c1b0e4ae47e67c6f9af84568a5f4a80638ccd8.tar.gz lwn-f7c1b0e4ae47e67c6f9af84568a5f4a80638ccd8.zip |
thermal: core: Back off when polling thermal zones on errors
Commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone
temperature is invalid") introduced a polling mechanism by which the
thermal core attampts to get a valid temperature value for thermal zones
where the .get_temp() callback returns errors to start with (for
example, due to initialization ordering woes). However, this polling is
carried out periodically ad infinitum and every iteration of it causes
a message to be printed to the kernel log which means a lot of log noise
on systems where there are thermal zones that never get ready for some
reason. It is also not really useful to continuously poll thermal zones
that never respond.
To address this, modify the thermal core to increase the delay between
consecutive thermal zone temperature checks after every check that fails
until it reaches a certain maximum value. At that point, the thermal
zone in question will be disabled, but user space will be able to
reenable it if it believes that the failure is transient.
Also change the code to print messages regarding failed temperature
checks to the kernel log only twice, once when the thermal zone's
.get_temp() callback returns an error for the first time and once when
disabling the given thermal zone. In addition, a dev_crit() message
will be printed at that point if the given thermal zone contains a
critical trip point to notify the system operator about the situation.
Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid")
Link: https://lore.kernel.org/linux-acpi/CAGnHSE=RyPK++UG0-wAtVKgeJxe0uzFYgLxm+RUOKKoQquW=Ow@mail.gmail.com/
Reported-by: Tom Yan <tom.ty89@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2962033.e9J7NaK4W3@rjwysocki.net
Diffstat (limited to 'drivers/thermal/thermal_core.h')
-rw-r--r-- | drivers/thermal/thermal_core.h | 10 |
1 files changed, 7 insertions, 3 deletions
diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h index ba8e6fc807ca..4cf2b7230d04 100644 --- a/drivers/thermal/thermal_core.h +++ b/drivers/thermal/thermal_core.h @@ -67,6 +67,8 @@ struct thermal_governor { * @polling_delay_jiffies: number of jiffies to wait between polls when * checking whether trip points have been crossed (0 for * interrupt driven systems) + * @recheck_delay_jiffies: delay after a failed attempt to determine the zone + * temperature before trying again * @temperature: current temperature. This is only for core code, * drivers should use thermal_zone_get_temp() to get the * current temperature @@ -108,6 +110,7 @@ struct thermal_zone_device { int num_trips; unsigned long passive_delay_jiffies; unsigned long polling_delay_jiffies; + unsigned long recheck_delay_jiffies; int temperature; int last_temperature; int emul_temperature; @@ -137,10 +140,11 @@ struct thermal_zone_device { #define THERMAL_TEMP_INIT INT_MIN /* - * Default delay after a failing thermal zone temperature check before - * attempting to check it again. + * Default and maximum delay after a failed thermal zone temperature check + * before attempting to check it again (in jiffies). */ -#define THERMAL_RECHECK_DELAY_MS 250 +#define THERMAL_RECHECK_DELAY msecs_to_jiffies(250) +#define THERMAL_MAX_RECHECK_DELAY (120 * HZ) /* Default Thermal Governor */ #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE) |