XOR swap algorithm

The XOR swap is sometimes discussed as a program optimization, but there are almost no cases where swapping via exclusive or provides benefit over the standard, obvious technique of using a temporary variable.

The algorithm typically corresponds to three machine-code instructions, one for each step of the pseudocode X := X XOR Y; Y := Y XOR X; X := X XOR Y. Because XOR is commutative, the two operands of each step can in principle be written in either order. On some architectures, however, the first operand of the XOR instruction specifies the target location at which the result of the operation is stored, which prevents this interchangeability. In System/370 assembly, where R1 and R2 are distinct registers, each XR operation leaves its result in the register named in the first argument. In RISC-V assembly, with the values X and Y in registers X10 and X11, xor places the result of the operation in the first register (as on x86).

In the pseudocode or high-level language version, however, the algorithm fails if x and y use the same storage location: the value stored in that location is zeroed out by the first XOR instruction and then remains zero; it is not "swapped with itself".

The correctness of the algorithm follows from four properties of XOR, referred to as L1 through L4 (where ⊕ denotes XOR).[a] Suppose that we have two distinct registers R1 and R2, with initial values A and B respectively. The first XOR leaves A ⊕ B in R1; the second leaves (A ⊕ B) ⊕ B = A in R2; the third leaves (A ⊕ B) ⊕ A = B in R1, so the two values have been exchanged.

In terms of linear algebra, the three steps correspond to a matrix product which expresses the elementary matrix of switching two rows (or columns) in terms of the transvections (shears) of adding one element to the other.

At least on recent x86 CPUs, from both AMD and Intel, moving between registers regularly incurs zero latency.

Problems similar to the same-location failure occur with call by name, as in Jensen's Device, where swapping i and A[i] via a temporary variable yields incorrect results because the arguments are related: swapping via temp = i; i = A[i]; A[i] = temp changes the value of i in the second statement, which then results in the incorrect i value for A[i] in the third statement.

The underlying principle of the XOR swap algorithm can be applied to any operation meeting criteria L1 through L4 above. Replacing XOR by addition and subtraction gives various slightly different, but largely equivalent, formulations. For example, AddSwap computes X := X + Y; Y := X − Y; X := X − Y.[4] Unlike the XOR swap, this variation requires that the underlying processor or programming language use a method such as modular arithmetic or bignums to guarantee that the computation of X + Y cannot cause an error due to integer overflow. Signed integer overflow is undefined behavior in C, so modular arithmetic is not guaranteed by the standard, which may lead to incorrect results. The sequence of operations in AddSwap can be expressed via matrix multiplication.

On architectures lacking a dedicated swap instruction, the XOR swap algorithm is required for optimal register allocation, because it avoids the extra temporary register. On modern GPU architectures, spilling variables is expensive due to limited memory bandwidth and high memory latency, while limiting register usage can improve performance due to dynamic partitioning of the register file.
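
The three XOR steps, the same-location failure, and the addition/subtraction variant discussed above can be illustrated with a minimal C sketch. The function names xor_swap, xor_swap_unguarded and add_swap are hypothetical and chosen only for this illustration, and the pointer-equality guard is one possible way of handling aliasing, not a prescribed part of the algorithm. Unsigned operands are assumed so that the addition variant relies on C's well-defined modular arithmetic rather than on signed overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exchange *x and *y using three XORs.
   The guard returns early when both pointers name the same object,
   because the first XOR would otherwise zero that object. */
void xor_swap(unsigned int *x, unsigned int *y)
{
    assert(x != NULL && y != NULL);
    if (x == y)
        return;
    *x ^= *y;   /* *x = A ^ B            */
    *y ^= *x;   /* *y = B ^ (A ^ B) = A  */
    *x ^= *y;   /* *x = (A ^ B) ^ A = B  */
}

/* The same three steps without the guard, kept only to demonstrate
   the failure when x and y refer to the same storage location. */
void xor_swap_unguarded(unsigned int *x, unsigned int *y)
{
    assert(x != NULL && y != NULL);
    *x ^= *y;   /* if x == y, this stores 0 */
    *y ^= *x;   /* 0 ^ 0 is still 0         */
    *x ^= *y;
}

/* Hypothetical helper: the addition/subtraction variant.
   Unsigned arithmetic in C wraps modulo UINT_MAX + 1, so the sum
   cannot trigger undefined behavior even when it exceeds UINT_MAX. */
void add_swap(unsigned int *x, unsigned int *y)
{
    assert(x != NULL && y != NULL);
    if (x == y)
        return;     /* the aliased case would also end in zero */
    *x = *x + *y;   /* *x = A + B (mod 2^N)  */
    *y = *x - *y;   /* *y = (A + B) - B = A  */
    *x = *x - *y;   /* *x = (A + B) - A = B  */
}
```

With a = 5 and b = 9, calling xor_swap(&a, &b) leaves a = 9 and b = 5. Calling xor_swap_unguarded(&z, &z) leaves z equal to 0 whatever its initial value, which is the failure described above.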