<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/arch/arm64/include/asm/atomic_lse.h, branch docs-mw</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=docs-mw</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=docs-mw'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2025-11-27T18:15:24+00:00</updated>
<entry>
<title>arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros</title>
<updated>2025-11-27T18:15:24+00:00</updated>
<author>
<name>Seongsu Park</name>
<email>sgsu.park@samsung.com</email>
</author>
<published>2025-11-26T02:10:25+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=c86d9f8764ba2ffa4e19cca40918c12ccc3ad909'/>
<id>urn:sha1:c86d9f8764ba2ffa4e19cca40918c12ccc3ad909</id>
<content type='text'>
The ATOMIC_FETCH_OP_AND and ATOMIC64_FETCH_OP_AND macros accept 'mb' and
'cl' parameters but never use them in their implementation. These macros
simply delegate to the corresponding andnot functions, which handle the
actual atomic operations and memory barriers.

Signed-off-by: Seongsu Park &lt;sgsu.park@samsung.com&gt;
Acked-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arch: Remove cmpxchg_double</title>
<updated>2023-06-05T07:36:39+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T13:08:44+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=febe950dbfb464799beb0339cc6fb10699f4a5da'/>
<id>urn:sha1:febe950dbfb464799beb0339cc6fb10699f4a5da</id>
<content type='text'>
No moar users, remove the monster.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Reviewed-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Acked-by: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Tested-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Link: https://lore.kernel.org/r/20230531132323.991907085@infradead.org
</content>
</entry>
<entry>
<title>arch: Introduce arch_{,try_}_cmpxchg128{,_local}()</title>
<updated>2023-06-05T07:36:35+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T13:08:36+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=b23e139d0b66c0216e7e9361a5021290395f504c'/>
<id>urn:sha1:b23e139d0b66c0216e7e9361a5021290395f504c</id>
<content type='text'>
For all architectures that currently support cmpxchg_double()
implement the cmpxchg128() family of functions that is basically the
same but with a saner interface.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Reviewed-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Acked-by: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Acked-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Tested-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Link: https://lore.kernel.org/r/20230531132323.452120708@infradead.org
</content>
</entry>
<entry>
<title>arm64: atomics: lse: improve cmpxchg implementation</title>
<updated>2023-03-28T20:13:25+00:00</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2023-03-14T15:36:57+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=e5cacb540fd2509484d6849c0d5372bd67d174b9'/>
<id>urn:sha1:e5cacb540fd2509484d6849c0d5372bd67d174b9</id>
<content type='text'>
For historical reasons, the LSE implementation of cmpxchg*() hard-codes
the GPRs to use, and shuffles registers around with MOVs. This is no
longer necessary, and can be simplified.

When the LSE cmpxchg implementation was added in commit:

  c342f78217e822d2 ("arm64: cmpxchg: patch in lse instructions when supported by the CPU")

... the LL/SC implementation of cmpxchg() would be placed out-of-line,
and the in-line assembly for cmpxchg would default to:

	NOP
	BL	&lt;ll_sc_cmpxchg*_implementation&gt;
	NOP

The LL/SC implementation of each cmpxchg() function accepted arguments
as per AAPCS64 rules, to it was necessary to place the pointer in x0,
the older value in X1, and the new value in x2, and acquire the return
value from x0. The LL/SC implementation required a temporary register
(e.g. for the STXR status value). As the LL/SC implementation preserved
the old value, the LSE implementation does likewise.

Since commit:

  addfc38672c73efd ("arm64: atomics: avoid out-of-line ll/sc atomics")

... the LSE and LL/SC implementations of cmpxchg are inlined as separate
asm blocks, with another branch choosing between thw two. Due to this,
it is no longer necessary for the LSE implementation to match the
register constraints of the LL/SC implementation. This was partially
dealt with by removing the hard-coded use of x30 in commit:

  3337cb5aea594e40 ("arm64: avoid using hard-coded registers for LSE atomics")

... but we didn't clean up the hard-coding of x0, x1, and x2.

This patch simplifies the LSE implementation of cmpxchg, removing the
register shuffling and directly clobbering the 'old' argument. This
gives the compiler greater freedom for register allocation, and avoids
redundant work.

The new constraints permit 'old' (Rs) and 'new' (Rt) to be allocated to
the same register when the initial values of the two are the same, e.g.
resulting in:

	CAS	X0, X0, [X1]

This is safe as Rs is only written back after the initial values of Rs
and Rt are consumed, and there are no UNPREDICTABLE behaviours to avoid
when Rs == Rt.

The new constraints also permit 'new' to be allocated to the zero
register, avoiding a MOV in a few cases. The same cannot be done for
'old' as it is both an input and output, and any caller of cmpxchg()
should care about the output value. Note that for CAS* the use of the
zero register never affects the ordering (while for SWP* the use of the
zero regsiter for the 'old' value drops any ACQUIRE semantic).

Compared to v6.2-rc4, a defconfig vmlinux is ~116KiB smaller, though the
resulting Image is the same size due to internal alignment and padding:

  [mark@lakrids:~/src/linux]% ls -al vmlinux-*
  -rwxr-xr-x 1 mark mark 137269304 Jan 16 11:59 vmlinux-after
  -rwxr-xr-x 1 mark mark 137387936 Jan 16 10:54 vmlinux-before
  [mark@lakrids:~/src/linux]% ls -al Image-*
  -rw-r--r-- 1 mark mark 38711808 Jan 16 11:59 Image-after
  -rw-r--r-- 1 mark mark 38711808 Jan 16 10:54 Image-before

This patch does not touch cmpxchg_double*() as that requires contiguous
register pairs, and separate patches will replace it with cmpxchg128*().

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Robin Murphy &lt;robin.murphy@arm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Link: https://lore.kernel.org/r/20230314153700.787701-2-mark.rutland@arm.com
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: cmpxchg_double*: hazard against entire exchange variable</title>
<updated>2023-01-05T15:28:15+00:00</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2023-01-04T15:16:26+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=031af50045ea97ed4386eb3751ca2c134d0fc911'/>
<id>urn:sha1:031af50045ea97ed4386eb3751ca2c134d0fc911</id>
<content type='text'>
The inline assembly for arm64's cmpxchg_double*() implementations use a
+Q constraint to hazard against other accesses to the memory location
being exchanged. However, the pointer passed to the constraint is a
pointer to unsigned long, and thus the hazard only applies to the first
8 bytes of the location.

GCC can take advantage of this, assuming that other portions of the
location are unchanged, leading to a number of potential problems.

This is similar to what we fixed back in commit:

  fee960bed5e857eb ("arm64: xchg: hazard against entire exchange variable")

... but we forgot to adjust cmpxchg_double*() similarly at the same
time.

The same problem applies, as demonstrated with the following test:

| struct big {
|         u64 lo, hi;
| } __aligned(128);
|
| unsigned long foo(struct big *b)
| {
|         u64 hi_old, hi_new;
|
|         hi_old = b-&gt;hi;
|         cmpxchg_double_local(&amp;b-&gt;lo, &amp;b-&gt;hi, 0x12, 0x34, 0x56, 0x78);
|         hi_new = b-&gt;hi;
|
|         return hi_old ^ hi_new;
| }

... which GCC 12.1.0 compiles as:

| 0000000000000000 &lt;foo&gt;:
|    0:   d503233f        paciasp
|    4:   aa0003e4        mov     x4, x0
|    8:   1400000e        b       40 &lt;foo+0x40&gt;
|    c:   d2800240        mov     x0, #0x12                       // #18
|   10:   d2800681        mov     x1, #0x34                       // #52
|   14:   aa0003e5        mov     x5, x0
|   18:   aa0103e6        mov     x6, x1
|   1c:   d2800ac2        mov     x2, #0x56                       // #86
|   20:   d2800f03        mov     x3, #0x78                       // #120
|   24:   48207c82        casp    x0, x1, x2, x3, [x4]
|   28:   ca050000        eor     x0, x0, x5
|   2c:   ca060021        eor     x1, x1, x6
|   30:   aa010000        orr     x0, x0, x1
|   34:   d2800000        mov     x0, #0x0                        // #0    &lt;--- BANG
|   38:   d50323bf        autiasp
|   3c:   d65f03c0        ret
|   40:   d2800240        mov     x0, #0x12                       // #18
|   44:   d2800681        mov     x1, #0x34                       // #52
|   48:   d2800ac2        mov     x2, #0x56                       // #86
|   4c:   d2800f03        mov     x3, #0x78                       // #120
|   50:   f9800091        prfm    pstl1strm, [x4]
|   54:   c87f1885        ldxp    x5, x6, [x4]
|   58:   ca0000a5        eor     x5, x5, x0
|   5c:   ca0100c6        eor     x6, x6, x1
|   60:   aa0600a6        orr     x6, x5, x6
|   64:   b5000066        cbnz    x6, 70 &lt;foo+0x70&gt;
|   68:   c8250c82        stxp    w5, x2, x3, [x4]
|   6c:   35ffff45        cbnz    w5, 54 &lt;foo+0x54&gt;
|   70:   d2800000        mov     x0, #0x0                        // #0     &lt;--- BANG
|   74:   d50323bf        autiasp
|   78:   d65f03c0        ret

Notice that at the lines with "BANG" comments, GCC has assumed that the
higher 8 bytes are unchanged by the cmpxchg_double() call, and that
`hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and
LL/SC versions of cmpxchg_double().

This patch fixes the issue by passing a pointer to __uint128_t into the
+Q constraint, ensuring that the compiler hazards against the entire 16
bytes being modified.

With this change, GCC 12.1.0 compiles the above test as:

| 0000000000000000 &lt;foo&gt;:
|    0:   f9400407        ldr     x7, [x0, #8]
|    4:   d503233f        paciasp
|    8:   aa0003e4        mov     x4, x0
|    c:   1400000f        b       48 &lt;foo+0x48&gt;
|   10:   d2800240        mov     x0, #0x12                       // #18
|   14:   d2800681        mov     x1, #0x34                       // #52
|   18:   aa0003e5        mov     x5, x0
|   1c:   aa0103e6        mov     x6, x1
|   20:   d2800ac2        mov     x2, #0x56                       // #86
|   24:   d2800f03        mov     x3, #0x78                       // #120
|   28:   48207c82        casp    x0, x1, x2, x3, [x4]
|   2c:   ca050000        eor     x0, x0, x5
|   30:   ca060021        eor     x1, x1, x6
|   34:   aa010000        orr     x0, x0, x1
|   38:   f9400480        ldr     x0, [x4, #8]
|   3c:   d50323bf        autiasp
|   40:   ca0000e0        eor     x0, x7, x0
|   44:   d65f03c0        ret
|   48:   d2800240        mov     x0, #0x12                       // #18
|   4c:   d2800681        mov     x1, #0x34                       // #52
|   50:   d2800ac2        mov     x2, #0x56                       // #86
|   54:   d2800f03        mov     x3, #0x78                       // #120
|   58:   f9800091        prfm    pstl1strm, [x4]
|   5c:   c87f1885        ldxp    x5, x6, [x4]
|   60:   ca0000a5        eor     x5, x5, x0
|   64:   ca0100c6        eor     x6, x6, x1
|   68:   aa0600a6        orr     x6, x5, x6
|   6c:   b5000066        cbnz    x6, 78 &lt;foo+0x78&gt;
|   70:   c8250c82        stxp    w5, x2, x3, [x4]
|   74:   35ffff45        cbnz    w5, 5c &lt;foo+0x5c&gt;
|   78:   f9400480        ldr     x0, [x4, #8]
|   7c:   d50323bf        autiasp
|   80:   ca0000e0        eor     x0, x7, x0
|   84:   d65f03c0        ret

... sampling the high 8 bytes before and after the cmpxchg, and
performing an EOR, as we'd expect.

For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note
that linux-4.9.y is oldest currently supported stable release, and
mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run
on my machines due to library incompatibilities.

I've also used a standalone test to check that we can use a __uint128_t
pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM
3.9.1.

Fixes: 5284e1b4bc8a ("arm64: xchg: Implement cmpxchg_double")
Fixes: e9a4b795652f ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU")
Reported-by: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/
Reported-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/lkml/Y6GXoO4qmH9OIZ5Q@hirez.programming.kicks-ass.net/
Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: stable@vger.kernel.org
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Steve Capper &lt;steve.capper@arm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Link: https://lore.kernel.org/r/20230104151626.3262137-1-mark.rutland@arm.com
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: atomic: always inline the assembly</title>
<updated>2022-09-09T12:58:33+00:00</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2022-08-17T15:59:14+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=78f6f5c994ed22a35ce1cd3ec9aeda8e2fa328e6'/>
<id>urn:sha1:78f6f5c994ed22a35ce1cd3ec9aeda8e2fa328e6</id>
<content type='text'>
The __lse_*() and __ll_sc_*() atomic implementations are marked as
inline rather than __always_inline, permitting a compiler to generate
out-of-line versions, which may be instrumented.

We marked the atomic wrappers as __always_inline in commit:

  c35a824c31834d94 ("arm64: make atomic helpers __always_inline")

... but did not think to do the same for the underlying implementations.

If the compiler were to out-of-line an LSE or LL/SC atomic, this could
break noinstr code. Ensure this doesn't happen by marking the underlying
implementations as __always_inline.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Link: https://lore.kernel.org/r/20220817155914.3975112-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: atomics: lse: Dereference matching size</title>
<updated>2022-01-20T09:13:48+00:00</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2022-01-12T20:22:59+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=3364c6ce23c6e347c63fc895e0d6d1e8a9407849'/>
<id>urn:sha1:3364c6ce23c6e347c63fc895e0d6d1e8a9407849</id>
<content type='text'>
When building with -Warray-bounds, the following warning is generated:

In file included from ./arch/arm64/include/asm/lse.h:16,
                 from ./arch/arm64/include/asm/cmpxchg.h:14,
                 from ./arch/arm64/include/asm/atomic.h:16,
                 from ./include/linux/atomic.h:7,
                 from ./include/asm-generic/bitops/atomic.h:5,
                 from ./arch/arm64/include/asm/bitops.h:25,
                 from ./include/linux/bitops.h:33,
                 from ./include/linux/kernel.h:22,
                 from kernel/printk/printk.c:22:
./arch/arm64/include/asm/atomic_lse.h:247:9: warning: array subscript 'long unsigned int[0]' is partly outside array bounds of 'atomic_t[1]' [-Warray-bounds]
  247 |         asm volatile(                                                   \
      |         ^~~
./arch/arm64/include/asm/atomic_lse.h:266:1: note: in expansion of macro '__CMPXCHG_CASE'
  266 | __CMPXCHG_CASE(w,  , acq_, 32,  a, "memory")
      | ^~~~~~~~~~~~~~
kernel/printk/printk.c:3606:17: note: while referencing 'printk_cpulock_owner'
 3606 | static atomic_t printk_cpulock_owner = ATOMIC_INIT(-1);
      |                 ^~~~~~~~~~~~~~~~~~~~

This is due to the compiler seeing an unsigned long * cast against
something (atomic_t) that is int sized. Replace the cast with the
matching size cast. This results in no change in binary output.

Note that __ll_sc__cmpxchg_case_##name##sz already uses the same
constraint:

	[v] "+Q" (*(u##sz *)ptr

Which is why only the LSE form needs updating and not the
LL/SC form, so this change is unlikely to be problematic.

Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Cc: linux-arm-kernel@lists.infradead.org
Acked-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Acked-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20220112202259.3950286-1-keescook@chromium.org
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: atomics: lse: define RETURN ops in terms of FETCH ops</title>
<updated>2021-12-14T13:00:33+00:00</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2021-12-10T15:14:10+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=053f58bab331e6c31c486661fdb5f6bad5f413de'/>
<id>urn:sha1:053f58bab331e6c31c486661fdb5f6bad5f413de</id>
<content type='text'>
The FEAT_LSE atomic instructions include LD* instructions which return
the original value of a memory location can be used to directly
implement FETCH opertations. Each RETURN op is implemented as a copy of
the corresponding FETCH op with a trailing instruction to generate the
new value of the memory location. We only directly implement
*_fetch_add*(), for which we have a trailing `add` instruction.

As the compiler has no visibility of the `add`, this leads to less than
optimal code generation when consuming the result.

For example, the compiler cannot constant-fold the addition into later
operations, and currently GCC 11.1.0 will compile:

       return __lse_atomic_sub_return(1, v) == 0;

As:

	mov     w1, #0xffffffff
	ldaddal w1, w2, [x0]
	add     w1, w1, w2
	cmp     w1, #0x0
	cset    w0, eq  // eq = none
	ret

This patch improves this by replacing the `add` with C addition after
the inline assembly block, e.g.

	ret += i;

This allows the compiler to manipulate `i`. This permits the compiler to
merge the `add` and `cmp` for the above, e.g.

	mov     w1, #0xffffffff
	ldaddal w1, w1, [x0]
	cmp     w1, #0x1
	cset    w0, eq  // eq = none
	ret

With this change the assembly for each RETURN op is identical to the
corresponding FETCH op (including barriers and clobbers) so I've removed
the inline assembly and rewritten each RETURN op in terms of the
corresponding FETCH op, e.g.

| static inline void __lse_atomic_add_return(int i, atomic_t *v)
| {
|       return __lse_atomic_fetch_add(i, v) + i
| }

The new construction does not adversely affect the common case, and
before and after this patch GCC 11.1.0 can compile:

	__lse_atomic_add_return(i, v)

As:

	ldaddal w0, w2, [x1]
	add     w0, w0, w2

... while having the freedom to do better elsewhere.

This is intended as an optimization and cleanup.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20211210151410.2782645-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: atomics: lse: improve constraints for simple ops</title>
<updated>2021-12-14T13:00:23+00:00</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2021-12-10T15:14:09+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=8a578a759ad61b13240190a12cdfa8845c5fa949'/>
<id>urn:sha1:8a578a759ad61b13240190a12cdfa8845c5fa949</id>
<content type='text'>
We have overly conservative assembly constraints for the basic FEAT_LSE
atomic instructions, and using more accurate and permissive constraints
will allow for better code generation.

The FEAT_LSE basic atomic instructions have come in two forms:

	LD{op}{order}{size} &lt;Rs&gt;, &lt;Rt&gt;, [&lt;Rn&gt;]
	ST{op}{order}{size} &lt;Rs&gt;, [&lt;Rn&gt;]

The ST* forms are aliases of the LD* forms where:

	ST{op}{order}{size} &lt;Rs&gt;, [&lt;Rn&gt;]
Is:
	LD{op}{order}{size} &lt;Rs&gt;, XZR, [&lt;Rn&gt;]

For either form, both &lt;Rs&gt; and &lt;Rn&gt; are read but not written back to,
and &lt;Rt&gt; is written with the original value of the memory location.
Where (&lt;Rt&gt; == &lt;Rs&gt;) or (&lt;Rt&gt; == &lt;Rn&gt;), &lt;Rt&gt; is written *after* the
other register value(s) are consumed. There are no UNPREDICTABLE or
CONSTRAINED UNPREDICTABLE behaviours when any pair of &lt;Rs&gt;, &lt;Rt&gt;, or
&lt;Rn&gt; are the same register.

Our current inline assembly always uses &lt;Rs&gt; == &lt;Rt&gt;, treating this
register as both an input and an output (using a '+r' constraint). This
forces the compiler to do some unnecessary register shuffling and/or
redundant value generation.

For example, the compiler cannot reuse the &lt;Rs&gt; value, and currently GCC
11.1.0 will compile:

	__lse_atomic_add(1, a);
	__lse_atomic_add(1, b);
	__lse_atomic_add(1, c);

As:

	mov     w3, #0x1
	mov     w4, w3
	stadd   w4, [x0]
	mov     w0, w3
	stadd   w0, [x1]
	stadd   w3, [x2]

We can improve this with more accurate constraints, separating &lt;Rs&gt; and
&lt;Rt&gt;, where &lt;Rs&gt; is an input-only register ('r'), and &lt;Rt&gt; is an
output-only value ('=r'). As &lt;Rt&gt; is written back after &lt;Rs&gt; is
consumed, it does not need to be earlyclobber ('=&amp;r'), leaving the
compiler free to use the same register for both &lt;Rs&gt; and &lt;Rt&gt; where this
is desirable.

At the same time, the redundant 'r' constraint for `v` is removed, as
the `+Q` constraint is sufficient.

With this change, the above example becomes:

	mov     w3, #0x1
	stadd   w3, [x0]
	stadd   w3, [x1]
	stadd   w3, [x2]

I've made this change for the non-value-returning and FETCH ops. The
RETURN ops have a multi-instruction sequence for which we cannot use the
same constraints, and a subsequent patch will rewrite hte RETURN ops in
terms of the FETCH ops, relying on the ability for the compiler to reuse
the &lt;Rs&gt; value.

This is intended as an optimization.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20211210151410.2782645-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: atomics: lse: define ANDs in terms of ANDNOTs</title>
<updated>2021-12-14T13:00:23+00:00</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2021-12-10T15:14:08+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=5e9e43c987b2614078b795d68eae6598eea3067a'/>
<id>urn:sha1:5e9e43c987b2614078b795d68eae6598eea3067a</id>
<content type='text'>
The FEAT_LSE atomic instructions include atomic bit-clear instructions
(`ldclr*` and `stclr*`) which can be used to directly implement ANDNOT
operations. Each AND op is implemented as a copy of the corresponding
ANDNOT op with a leading `mvn` instruction to apply a bitwise NOT to the
`i` argument.

As the compiler has no visibility of the `mvn`, this leads to less than
optimal code generation when generating `i` into a register. For
example, __lse_atomic_fetch_and(0xf, v) can be compiled to:

	mov     w1, #0xf
	mvn     w1, w1
	ldclral w1, w1, [x2]

This patch improves this by replacing the `mvn` with NOT in C before the
inline assembly block, e.g.

	i = ~i;

This allows the compiler to generate `i` into a register more optimally,
e.g.

	mov     w1, #0xfffffff0
	ldclral w1, w1, [x2]

With this change the assembly for each AND op is identical to the
corresponding ANDNOT op (including barriers and clobbers), so I've
removed the inline assembly and rewritten each AND op in terms of the
corresponding ANDNOT op, e.g.

| static inline void __lse_atomic_and(int i, atomic_t *v)
| {
| 	return __lse_atomic_andnot(~i, v);
| }

This is intended as an optimization and cleanup.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Will Deacon &lt;will@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20211210151410.2782645-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
</feed>
