summaryrefslogtreecommitdiff
path: root/Documentation/mm/damon/design.rst
blob: fe08a3796e60124ff99a10e4d44dadbf8b1fcfc1 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
.. SPDX-License-Identifier: GPL-2.0

======
Design
======


.. _damon_design_execution_model_and_data_structures:

Execution Model and Data Structures
===================================

The monitoring-related information including the monitoring request
specification and DAMON-based operation schemes are stored in a data structure
called DAMON ``context``.  DAMON executes each context with a kernel thread
called ``kdamond``.  Multiple kdamonds could run in parallel, for different
types of monitoring.


Overall Architecture
====================

DAMON subsystem is configured with three layers including

- Operations Set: Implements fundamental operations for DAMON that depends on
  the given monitoring target address-space and available set of
  software/hardware primitives,
- Core: Implements core logics including monitoring overhead/accuracy control
  and access-aware system operations on top of the operations set layer, and
- Modules: Implements kernel modules for various purposes that provides
  interfaces for the user space, on top of the core layer.


.. _damon_design_configurable_operations_set:

Configurable Operations Set
---------------------------

For data access monitoring and additional low level work, DAMON needs a set of
implementations for specific operations that are dependent on and optimized for
the given target address space.  On the other hand, the accuracy and overhead
tradeoff mechanism, which is the core logic of DAMON, is in the pure logic
space.  DAMON separates the two parts in different layers, namely DAMON
Operations Set and DAMON Core Logics Layers, respectively.  It further defines
the interface between the layers to allow various operations sets to be
configured with the core logic.

Due to this design, users can extend DAMON for any address space by configuring
the core logic to use the appropriate operations set.  If any appropriate set
is unavailable, users can implement one on their own.

For example, physical memory, virtual memory, swap space, those for specific
processes, NUMA nodes, files, and backing memory devices would be supportable.
Also, if some architectures or devices supporting special optimized access
check primitives, those will be easily configurable.


Programmable Modules
--------------------

Core layer of DAMON is implemented as a framework, and exposes its application
programming interface to all kernel space components such as subsystems and
modules.  For common use cases of DAMON, DAMON subsystem provides kernel
modules that built on top of the core layer using the API, which can be easily
used by the user space end users.


.. _damon_operations_set:

Operations Set Layer
====================

The monitoring operations are defined in two parts:

1. Identification of the monitoring target address range for the address space.
2. Access check of specific address range in the target space.

DAMON currently provides below three operation sets.  Below two subsections
describe how those work.

 - vaddr: Monitor virtual address spaces of specific processes
 - fvaddr: Monitor fixed virtual address ranges
 - paddr: Monitor the physical address space of the system


 .. _damon_design_vaddr_target_regions_construction:

VMA-based Target Address Range Construction
-------------------------------------------

A mechanism of ``vaddr`` DAMON operations set that automatically initializes
and updates the monitoring target address regions so that entire memory
mappings of the target processes can be covered.

This mechanism is only for the ``vaddr`` operations set.  In cases of
``fvaddr`` and ``paddr`` operation sets, users are asked to manually set the
monitoring target address ranges.

Only small parts in the super-huge virtual address space of the processes are
mapped to the physical memory and accessed.  Thus, tracking the unmapped
address regions is just wasteful.  However, because DAMON can deal with some
level of noise using the adaptive regions adjustment mechanism, tracking every
mapping is not strictly required but could even incur a high overhead in some
cases.  That said, too huge unmapped areas inside the monitoring target should
be removed to not take the time for the adaptive mechanism.

For the reason, this implementation converts the complex mappings to three
distinct regions that cover every mapped area of the address space.  The two
gaps between the three regions are the two biggest unmapped areas in the given
address space.  The two biggest unmapped areas would be the gap between the
heap and the uppermost mmap()-ed region, and the gap between the lowermost
mmap()-ed region and the stack in most of the cases.  Because these gaps are
exceptionally huge in usual address spaces, excluding these will be sufficient
to make a reasonable trade-off.  Below shows this in detail::

    <heap>
    <BIG UNMAPPED REGION 1>
    <uppermost mmap()-ed region>
    (small mmap()-ed regions and munmap()-ed regions)
    <lowermost mmap()-ed region>
    <BIG UNMAPPED REGION 2>
    <stack>


PTE Accessed-bit Based Access Check
-----------------------------------

Both of the implementations for physical and virtual address spaces use PTE
Accessed-bit for basic access checks.  Only one difference is the way of
finding the relevant PTE Accessed bit(s) from the address.  While the
implementation for the virtual address walks the page table for the target task
of the address, the implementation for the physical address walks every page
table having a mapping to the address.  In this way, the implementations find
and clear the bit(s) for next sampling target address and checks whether the
bit(s) set again after one sampling period.  This could disturb other kernel
subsystems using the Accessed bits, namely Idle page tracking and the reclaim
logic.  DAMON does nothing to avoid disturbing Idle page tracking, so handling
the interference is the responsibility of sysadmins.  However, it solves the
conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags,
as Idle page tracking does.


Core Logics
===========


Monitoring
----------

Below four sections describe each of the DAMON core mechanisms and the five
monitoring attributes, ``sampling interval``, ``aggregation interval``,
``update interval``, ``minimum number of regions``, and ``maximum number of
regions``.


Access Frequency Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The output of DAMON says what pages are how frequently accessed for a given
duration.  The resolution of the access frequency is controlled by setting
``sampling interval`` and ``aggregation interval``.  In detail, DAMON checks
access to each page per ``sampling interval`` and aggregates the results.  In
other words, counts the number of the accesses to each page.  After each
``aggregation interval`` passes, DAMON calls callback functions that previously
registered by users so that users can read the aggregated results and then
clears the results.  This can be described in below simple pseudo-code::

    while monitoring_on:
        for page in monitoring_target:
            if accessed(page):
                nr_accesses[page] += 1
        if time() % aggregation_interval == 0:
            for callback in user_registered_callbacks:
                callback(monitoring_target, nr_accesses)
            for page in monitoring_target:
                nr_accesses[page] = 0
        sleep(sampling interval)

The monitoring overhead of this mechanism will arbitrarily increase as the
size of the target workload grows.


.. _damon_design_region_based_sampling:

Region Based Sampling
~~~~~~~~~~~~~~~~~~~~~

To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
that assumed to have the same access frequencies into a region.  As long as the
assumption (pages in a region have the same access frequencies) is kept, only
one page in the region is required to be checked.  Thus, for each ``sampling
interval``, DAMON randomly picks one page in each region, waits for one
``sampling interval``, checks whether the page is accessed meanwhile, and
increases the access frequency counter of the region if so.  The counter is
called ``nr_accesses`` of the region.  Therefore, the monitoring overhead is
controllable by setting the number of regions.  DAMON allows users to set the
minimum and the maximum number of regions for the trade-off.

This scheme, however, cannot preserve the quality of the output if the
assumption is not guaranteed.


Adaptive Regions Adjustment
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even somehow the initial monitoring target regions are well constructed to
fulfill the assumption (pages in same region have similar access frequencies),
the data access pattern can be dynamically changed.  This will result in low
monitoring quality.  To keep the assumption as much as possible, DAMON
adaptively merges and splits each region based on their access frequency.

For each ``aggregation interval``, it compares the access frequencies
(``nr_accesses``) of adjacent regions.  If the difference is small, and if the
sum of the two regions' sizes is smaller than the size of total regions divided
by the ``minimum number of regions``, DAMON merges the two regions.  If the
resulting number of total regions is still higher than ``maximum number of
regions``, it repeats the merging with increasing access frequenceis difference
threshold until the upper-limit of the number of regions is met, or the
threshold becomes higher than possible maximum value (``aggregation interval``
divided by ``sampling interval``).   Then, after it reports and clears the
aggregated access frequency of each region, it splits each region into two or
three regions if the total number of regions will not exceed the user-specified
maximum number of regions after the split.

In this way, DAMON provides its best-effort quality and minimal overhead while
keeping the bounds users set for their trade-off.


.. _damon_design_age_tracking:

Age Tracking
~~~~~~~~~~~~

By analyzing the monitoring results, users can also find how long the current
access pattern of a region has maintained.  That could be used for good
understanding of the access pattern.  For example, page placement algorithm
utilizing both the frequency and the recency could be implemented using that.
To make such access pattern maintained period analysis easier, DAMON maintains
yet another counter called ``age`` in each region.  For each ``aggregation
interval``, DAMON checks if the region's size and access frequency
(``nr_accesses``) has significantly changed.  If so, the counter is reset to
zero.  Otherwise, the counter is increased.


Dynamic Target Space Updates Handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The monitoring target address range could dynamically changed.  For example,
virtual memory could be dynamically mapped and unmapped.  Physical memory could
be hot-plugged.

As the changes could be quite frequent in some cases, DAMON allows the
monitoring operations to check dynamic changes including memory mapping changes
and applies it to monitoring operations-related data structures such as the
abstracted monitoring target memory area only for each of a user-specified time
interval (``update interval``).


.. _damon_design_damos:

Operation Schemes
-----------------

One common purpose of data access monitoring is access-aware system efficiency
optimizations.  For example,

    paging out memory regions that are not accessed for more than two minutes

or

    using THP for memory regions that are larger than 2 MiB and showing a high
    access frequency for more than one minute.

One straightforward approach for such schemes would be profile-guided
optimizations.  That is, getting data access monitoring results of the
workloads or the system using DAMON, finding memory regions of special
characteristics by profiling the monitoring results, and making system
operation changes for the regions.  The changes could be made by modifying or
providing advice to the software (the application and/or the kernel), or
reconfiguring the hardware.  Both offline and online approaches could be
available.

Among those, providing advice to the kernel at runtime would be flexible and
effective, and therefore widely be used.   However, implementing such schemes
could impose unnecessary redundancy and inefficiency.  The profiling could be
redundant if the type of interest is common.  Exchanging the information
including monitoring results and operation advice between kernel and user
spaces could be inefficient.

To allow users to reduce such redundancy and inefficiencies by offloading the
works, DAMON provides a feature called Data Access Monitoring-based Operation
Schemes (DAMOS).  It lets users specify their desired schemes at a high
level.  For such specifications, DAMON starts monitoring, finds regions having
the access pattern of interest, and applies the user-desired operation actions
to the regions, for every user-specified time interval called
``apply_interval``.


.. _damon_design_damos_action:

Operation Action
~~~~~~~~~~~~~~~~

The management action that the users desire to apply to the regions of their
interest.  For example, paging out, prioritizing for next reclamation victim
selection, advising ``khugepaged`` to collapse or split, or doing nothing but
collecting statistics of the regions.

The list of supported actions is defined in DAMOS, but the implementation of
each action is in the DAMON operations set layer because the implementation
normally depends on the monitoring target address space.  For example, the code
for paging specific virtual address ranges out would be different from that for
physical address ranges.  And the monitoring operations implementation sets are
not mandated to support all actions of the list.  Hence, the availability of
specific DAMOS action depends on what operations set is selected to be used
together.

The list of the supported actions, their meaning, and DAMON operations sets
that supports each action are as below.

 - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
 - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
 - ``pageout``: Reclaim the region.
   Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
 - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
 - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
 - ``lru_prio``: Prioritize the region on its LRU lists.
   Supported by ``paddr`` operations set.
 - ``lru_deprio``: Deprioritize the region on its LRU lists.
   Supported by ``paddr`` operations set.
 - ``migrate_hot``: Migrate the regions prioritizing warmer regions.
   Supported by ``paddr`` operations set.
 - ``migrate_cold``: Migrate the regions prioritizing colder regions.
   Supported by ``paddr`` operations set.
 - ``stat``: Do nothing but count the statistics.
   Supported by all operations sets.

Applying the actions except ``stat`` to a region is considered as changing the
region's characteristics.  Hence, DAMOS resets the age of regions when any such
actions are applied to those.


.. _damon_design_damos_access_pattern:

Target Access Pattern
~~~~~~~~~~~~~~~~~~~~~

The access pattern of the schemes' interest.  The patterns are constructed with
the properties that DAMON's monitoring results provide, specifically the size,
the access frequency, and the age.  Users can describe their access pattern of
interest by setting minimum and maximum values of the three properties.  If a
region's three properties are in the ranges, DAMOS classifies it as one of the
regions that the scheme is having an interest in.


.. _damon_design_damos_quotas:

Quotas
~~~~~~

DAMOS upper-bound overhead control feature.  DAMOS could incur high overhead if
the target access pattern is not properly tuned.  For example, if a huge memory
region having the access pattern of interest is found, applying the scheme's
action to all pages of the huge region could consume unacceptably large system
resources.  Preventing such issues by tuning the access pattern could be
challenging, especially if the access patterns of the workloads are highly
dynamic.

To mitigate that situation, DAMOS provides an upper-bound overhead control
feature called quotas.  It lets users specify an upper limit of time that DAMOS
can use for applying the action, and/or a maximum bytes of memory regions that
the action can be applied within a user-specified time duration.


.. _damon_design_damos_quotas_prioritization:

Prioritization
^^^^^^^^^^^^^^

A mechanism for making a good decision under the quotas.  When the action
cannot be applied to all regions of interest due to the quotas, DAMOS
prioritizes regions and applies the action to only regions having high enough
priorities so that it will not exceed the quotas.

The prioritization mechanism should be different for each action.  For example,
rarely accessed (colder) memory regions would be prioritized for page-out
scheme action.  In contrast, the colder regions would be deprioritized for huge
page collapse scheme action.  Hence, the prioritization mechanisms for each
action are implemented in each DAMON operations set, together with the actions.

Though the implementation is up to the DAMON operations set, it would be common
to calculate the priority using the access pattern properties of the regions.
Some users would want the mechanisms to be personalized for their specific
case.  For example, some users would want the mechanism to weigh the recency
(``age``) more than the access frequency (``nr_accesses``).  DAMOS allows users
to specify the weight of each access pattern property and passes the
information to the underlying mechanism.  Nevertheless, how and even whether
the weight will be respected are up to the underlying prioritization mechanism
implementation.


.. _damon_design_damos_quotas_auto_tuning:

Aim-oriented Feedback-driven Auto-tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Automatic feedback-driven quota tuning.  Instead of setting the absolute quota
value, users can specify the metric of their interest, and what target value
they want the metric value to be.  DAMOS then automatically tunes the
aggressiveness (the quota) of the corresponding scheme.  For example, if DAMOS
is under achieving the goal, DAMOS automatically increases the quota.  If DAMOS
is over achieving the goal, it decreases the quota.

The goal can be specified with three parameters, namely ``target_metric``,
``target_value``, and ``current_value``.  The auto-tuning mechanism tries to
make ``current_value`` of ``target_metric`` be same to ``target_value``.
Currently, two ``target_metric`` are provided.

- ``user_input``: User-provided value.  Users could use any metric that they
  has interest in for the value.  Use space main workload's latency or
  throughput, system metrics like free memory ratio or memory pressure stall
  time (PSI) could be examples.  Note that users should explicitly set
  ``current_value`` on their own in this case.  In other words, users should
  repeatedly provide the feedback.
- ``some_mem_psi_us``: System-wide ``some`` memory pressure stall information
  in microseconds that measured from last quota reset to next quota reset.
  DAMOS does the measurement on its own, so only ``target_value`` need to be
  set by users at the initial time.  In other words, DAMOS does self-feedback.


.. _damon_design_damos_watermarks:

Watermarks
~~~~~~~~~~

Conditional DAMOS (de)activation automation.  Users might want DAMOS to run
only under certain situations.  For example, when a sufficient amount of free
memory is guaranteed, running a scheme for proactive reclamation would only
consume unnecessary system resources.  To avoid such consumption, the user would
need to manually monitor some metrics such as free memory ratio, and turn
DAMON/DAMOS on or off.

DAMOS allows users to offload such works using three watermarks.  It allows the
users to configure the metric of their interest, and three watermark values,
namely high, middle, and low.  If the value of the metric becomes above the
high watermark or below the low watermark, the scheme is deactivated.  If the
metric becomes below the mid watermark but above the low watermark, the scheme
is activated.  If all schemes are deactivated by the watermarks, the monitoring
is also deactivated.  In this case, the DAMON worker thread only periodically
checks the watermarks and therefore incurs nearly zero overhead.


.. _damon_design_damos_filters:

Filters
~~~~~~~

Non-access pattern-based target memory regions filtering.  If users run
self-written programs or have good profiling tools, they could know something
more than the kernel, such as future access patterns or some special
requirements for specific types of memory. For example, some users may know
only anonymous pages can impact their program's performance.  They can also
have a list of latency-critical processes.

To let users optimize DAMOS schemes with such special knowledge, DAMOS provides
a feature called DAMOS filters.  The feature allows users to set an arbitrary
number of filters for each scheme.  Each filter specifies the type of target
memory, and whether it should exclude the memory of the type (filter-out), or
all except the memory of the type (filter-in).

For efficient handling of filters, some types of filters are handled by the
core layer, while others are handled by operations set.  In the latter case,
hence, support of the filter types depends on the DAMON operations set.  In
case of the core layer-handled filters, the memory regions that excluded by the
filter are not counted as the scheme has tried to the region.  In contrast, if
a memory regions is filtered by an operations set layer-handled filter, it is
counted as the scheme has tried.  This difference affects the statistics.

Below types of filters are currently supported.

- anonymous page
    - Applied to pages that containing data that not stored in files.
    - Handled by operations set layer.  Supported by only ``paddr`` set.
- memory cgroup
    - Applied to pages that belonging to a given cgroup.
    - Handled by operations set layer.  Supported by only ``paddr`` set.
- young page
    - Applied to pages that are accessed after the last access check from the
      scheme.
    - Handled by operations set layer.  Supported by only ``paddr`` set.
- address range
    - Applied to pages that belonging to a given address range.
    - Handled by the core logic.
- DAMON monitoring target
    - Applied to pages that belonging to a given DAMON monitoring target.
    - Handled by the core logic.


Application Programming Interface
---------------------------------

The programming interface for kernel space data access-aware applications.
DAMON is a framework, so it does nothing by itself.  Instead, it only helps
other kernel components such as subsystems and modules building their data
access-aware applications using DAMON's core features.  For this, DAMON exposes
its all features to other kernel components via its application programming
interface, namely ``include/linux/damon.h``.  Please refer to the API
:doc:`document </mm/damon/api>` for details of the interface.


Modules
=======

Because the core of DAMON is a framework for kernel components, it doesn't
provide any direct interface for the user space.  Such interfaces should be
implemented by each DAMON API user kernel components, instead.  DAMON subsystem
itself implements such DAMON API user modules, which are supposed to be used
for general purpose DAMON control and special purpose data access-aware system
operations, and provides stable application binary interfaces (ABI) for the
user space.  The user space can build their efficient data access-aware
applications using the interfaces.


General Purpose User Interface Modules
--------------------------------------

DAMON modules that provide user space ABIs for general purpose DAMON usage in
runtime.

DAMON user interface modules, namely 'DAMON sysfs interface' and 'DAMON debugfs
interface' are DAMON API user kernel modules that provide ABIs to the
user-space.  Please note that DAMON debugfs interface is currently deprecated.

Like many other ABIs, the modules create files on sysfs and debugfs, allow
users to specify their requests to and get the answers from DAMON by writing to
and reading from the files.  As a response to such I/O, DAMON user interface
modules control DAMON and retrieve the results as user requested via the DAMON
API, and return the results to the user-space.

The ABIs are designed to be used for user space applications development,
rather than human beings' fingers.  Human users are recommended to use such
user space tools.  One such Python-written user space tool is available at
Github (https://github.com/awslabs/damo), Pypi
(https://pypistats.org/packages/damo), and Fedora
(https://packages.fedoraproject.org/pkgs/python-damo/damo/).

Please refer to the ABI :doc:`document </admin-guide/mm/damon/usage>` for
details of the interfaces.


Special-Purpose Access-aware Kernel Modules
-------------------------------------------

DAMON modules that provide user space ABI for specific purpose DAMON usage.

DAMON sysfs/debugfs user interfaces are for full control of all DAMON features
in runtime.  For each special-purpose system-wide data access-aware system
operations such as proactive reclamation or LRU lists balancing, the interfaces
could be simplified by removing unnecessary knobs for the specific purpose, and
extended for boot-time and even compile time control.  Default values of DAMON
control parameters for the usage would also need to be optimized for the
purpose.

To support such cases, yet more DAMON API user kernel modules that provide more
simple and optimized user space interfaces are available.  Currently, two
modules for proactive reclamation and LRU lists manipulation are provided.  For
more detail, please read the usage documents for those
(:doc:`/admin-guide/mm/damon/reclaim` and
:doc:`/admin-guide/mm/damon/lru_sort`).