Sequentialising Parallel Moves
This article describes how to lower parallel moves in a compiler. Many compiler textbooks do not cover this topic at all, rendering their treatment of things such as emitting register arguments (for calls) incorrect in general.
Introduction
A set of parallel moves (or assignments), written (d1, ..., dn) := (s1, ..., sn), performs all of the individual moves di := si as if simultaneously: every source is read before any destination is written. The destinations must be pairwise distinct, though the same source may feed several destinations.
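For example, the two-element parallel move (B, C) := (C, B) denotes a swap: B receives C's old value and C receives B's old value, because both sources are read before either destination is written. In the OCaml code later in this article, such a move is represented as a pair of equal-length arrays:
(* the swap (B, C) := (C, B); dst.(i) receives src.(i), with all
   sources conceptually read before any destination is written *)
let src = [| "C"; "B" |]
let dst = [| "B"; "C" |]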
Example
Parallel moves appear in a few places in compilers. Most notably, in the destruction of ϕ-nodes when going out of SSA and when populating register arguments for calls. This article will focus on the latter for demonstration.
Consider the following C snippet:
int bar(int y, int x);
int foo(int x, int y) {
    return bar(y, x);
}
If we assume a calling convention where arguments are passed in registers r0, r1, ..., rN, then we expect that this function will enter with x within r0 and y within r1. These variables are then passed to bar in reverse order. In order to do this shuffling of registers correctly, some care must be taken.
For example, consider the naive (incorrect) lowering of this function:
foo:
    mov r0, r1
    mov r1, r0
    b bar
This is clearly incorrect, as it logically results in a call similar to bar(y, y). In other words, we have overwritten one of our source registers because it also appears as a destination register within our parallel move. This is the canonical "swap" problem (as it is known in the SSA literature).
In order to do this lowering correctly, the compiler must identify that there is a cycle between these assignment locations and either break the cycle with a temporary register (to preserve the overwritten value) or use a dedicated swap instruction (such as xchg on x86_64).
For example, the following is a correct lowering:
foo:
    mov r2, r0
    mov r0, r1
    mov r1, r2
    b bar
As we will see later, it turns out you can always sequentialise parallel moves with a single temporary location (per register class). The following sections describe how we can tackle this problem in a principled way by identifying the two major categories of dependencies between locations.
Location Transfer Graph
In order to visualise dependencies between locations involved in a parallel move, it is convenient to create a graph representation.
For each move d := s in the parallel move, we create a node for each location (if not already present) and add a directed edge from s to d.
Consider the graph created for the parallel moves (B, D, C) := (A, A, B): it contains the edges A → B, A → D, and B → C.
To aid the presentation, each edge is labelled with its index as it appears in the parallel move notation.
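Rendered in text form, the graph for these moves looks like this (edge labels in square brackets):
A --[1]--> B --[3]--> C
A --[2]--> D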
If we were to naively introduce copies in order of appearance, we’d get:
B := A
D := A
C := B
This is wrong, as C := B is emitted after B has already been overwritten with A's value, so C receives A's value instead of B's original one.
Consider a correct lowering:
C := B
B := A
D := A
This is correct because B's original value is transferred into C before B is overwritten. Note that correct lowerings are not unique; the following ordering is equally valid:
C := B
D := A
B := A
Each correct lowering corresponds to traversing the location transfer graph in post-order. This makes sense when you consider that post-order traversals of graphs result in a node visitation order that corresponds to a valid topological sort of the transpose (reverse) graph. This knowledge motivates an algorithm that, before emitting a move, first emits all of the moves that read that move's destination, so the destination's old value is consumed before it is clobbered.
Dealing with Cycles
However, what if cycles exist in the location transfer graph? In the case of cycles, we can always use a single temporary to break the cycle. The basic idea is that if, while recursively processing the moves that depend on an edge, we encounter that same edge again, we have found a cycle: we preserve the edge's source in a temporary location and complete its move from the temporary later.
Let us add a cycle to the previous example to see how this works. The parallel moves under consideration below are now (B, D, C, A) := (A, A, B, C): the new move A := C (edge 4) closes the cycle A → B → C → A.
Suppose we start by considering edge 1 (B := A). Before emitting it, we must process the moves that read its destination: edge 3 (C := B), which in turn leads us to edge 4 (A := C). Edge 4's destination, A, is read by edge 1, which we are still in the middle of processing, so we have found a cycle. We break it by preserving edge 1's source in a temporary:
t := A
Then, still looking for moves that read A, we find edge 2 (D := A), which has yet to be moved, so it is processed and emitted next:
t := A
D := A
Then, as our post-order traversal works its way back up the recursion stack, we continue to emit the moves for edges 4 and 3 (A := C, then C := B):
t := A
D := A
A := C
C := B
Then, when we get back to the first move we considered, edge 1, we emit its move, which now reads from the temporary:
t := A
D := A
A := C
C := B
B := t
It should be clear that the order in which certain moves are emitted depends on which successor edges the algorithm considers first at each point (when there is more than one). This is similar in concept to the fact that many valid topological orderings can exist for the same DAG.
Topological Sort Algorithm
Earlier, there was a mention of the fact that a post-order traversal produces a node visitation order that is a valid topological ordering of the transpose graph (ignoring the fact that true topological orderings can only be produced for DAGs). Indeed, we could formulate an algorithm based on Kahn's algorithm that works on the transpose graph by first processing the edges for nodes that have no predecessors (removing such nodes, and related edges, as it goes). Then, if the graph still contains edges, there must be a cycle, so the algorithm chooses an edge at random, preserves that edge's destination location in a temporary location, and continues processing as before. Care must be taken in such an algorithm to process the components of the location transfer graph one at a time, as the graph may have many components! We wouldn't want cycle resolution in one component to use the temporary location while it is still in use by another component that hasn't been fully processed, so we must break one cycle at a time and fully process one component before focusing on another.
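To make this concrete, here is a minimal OCaml sketch of this Kahn-style strategy. It is not taken from any paper or compiler: it assumes locations are plain strings, takes the single scratch location tmp as a string, and works directly on the src/dst arrays (checking for pending readers) rather than materialising a graph.
let kahn_parallel_move (src : string array) (dst : string array) (tmp : string) =
  let n = Array.length src in
  let pending = Array.make n true in
  (* self-moves are no-ops; retire them immediately *)
  for i = 0 to n - 1 do
    if src.(i) = dst.(i) then pending.(i) <- false
  done;
  let remaining = ref 0 in
  Array.iter (fun p -> if p then incr remaining) pending;
  let emit d s = Printf.printf "%s := %s\n" d s in
  (* does some pending move other than [i] still read location [loc]? *)
  let read_by_other i loc =
    let found = ref false in
    for j = 0 to n - 1 do
      if j <> i && pending.(j) && src.(j) = loc then found := true
    done;
    !found
  in
  while !remaining > 0 do
    (* phase 1: drain all safe moves, i.e. those whose destination no
       pending move still reads (no-predecessor nodes of the transpose) *)
    let progress = ref true in
    while !progress do
      progress := false;
      for i = 0 to n - 1 do
        if pending.(i) && not (read_by_other i dst.(i)) then begin
          emit dst.(i) src.(i);
          pending.(i) <- false;
          decr remaining;
          progress := true
        end
      done
    done;
    (* phase 2: every remaining move lies on a cycle; break exactly one
       cycle by saving a destination in the temporary and redirecting
       its readers, then return to phase 1 *)
    if !remaining > 0 then begin
      let k = ref 0 in
      while not pending.(!k) do incr k done;
      emit tmp dst.(!k);
      for j = 0 to n - 1 do
        if pending.(j) && src.(j) = dst.(!k) then src.(j) <- tmp
      done
    end
  done
Because phase 1 fully drains a broken cycle (now a path) before phase 2 runs again, the temporary is always dead by the time another cycle is broken, so a single temporary suffices. Note that this sketch may break a cycle at a different edge than the post-order algorithm would, producing a different, but equally valid, sequence of moves.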
In the next section, we’ll see how we can dispense with the idea of a graph-centric algorithm, choosing, instead, to implement an algorithm that ensures coverage of each edge in one pass by maintaining a set of processed edges as it processes each move in the order given.
Algorithm
The following algorithm is taken (with comments added) from page 5 of the paper Tilting at windmills with Coq by Rideau, Serpette, and Leroy.
(* state of each edge being considered in the algorithm *)
type status = To_move | Being_moved | Moved

let parallel_move src dst tmp =
  (* initialise state array *)
  let n = Array.length src in
  let status = Array.make n To_move in
  (* procedure to process one edge and its dependent moves *)
  let rec move_one i =
    (* don't process redundant self-edges *)
    if src.(i) <> dst.(i) then begin
      (* mark this edge as being moved *)
      status.(i) <- Being_moved;
      (* visit its children first (recursive post-order) *)
      for j = 0 to n - 1 do
        (* a child is a move whose source is the current move's destination *)
        if src.(j) = dst.(i) then
          match status.(j) with
          | To_move ->
            (* yet to be moved, move it *)
            move_one j
          | Being_moved ->
            (* cycle detected, emit a preservation *)
            let t = tmp src.(j) in
            Printf.printf "%s := %s\n" t src.(j);
            (* destructively book-keep the fact that j's source is now the temporary *)
            src.(j) <- t
          | Moved ->
            (* already moved, do nothing *)
            ()
      done;
      Printf.printf "%s := %s\n" dst.(i) src.(i);
      status.(i) <- Moved
    end
  in
  (* loop to ensure each move is considered at least once *)
  for i = 0 to n - 1 do
    if status.(i) = To_move then move_one i
  done
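As a quick sanity check, we can run parallel_move on the cyclic parallel move (B, D, C, A) := (A, A, B, C) from the earlier walkthrough, passing a tmp callback that always returns a fixed name:
(* the cyclic example from earlier: (B, D, C, A) := (A, A, B, C) *)
let () =
  let src = [| "A"; "A"; "B"; "C" |] in
  let dst = [| "B"; "D"; "C"; "A" |] in
  (* a real back-end would pick a scratch register suited to the
     register class of the location it is given *)
  parallel_move src dst (fun _ -> "t")
This prints exactly the five moves derived in the walkthrough: t := A, D := A, A := C, C := B, and B := t.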
The tmp parameter exists because we need a single temporary register per register class (floating-point registers, for example, will require a different temporary register from general-purpose ones).
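As a sketch of what this might look like, the hypothetical callback below chooses the scratch location by inspecting the location it is asked to preserve (the naming convention and the scratch names are illustrative, not from any real ABI):
(* illustrative only: pick a scratch in the register class of [loc];
   here, locations starting with 'f' are deemed floating-point *)
let tmp loc =
  if String.length loc > 0 && loc.[0] = 'f' then "ftmp"
  else "t"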
An iterative version of the same algorithm can be found here.
Notes
- The algorithm described above is quadratic because of the way it finds successor edges in the location transfer graph (linear search). A real implementation may wish to preprocess the edges to make this lookup faster. However, the number of moves n is typically very small in compilers' use cases for parallel moves, so it probably doesn't matter in practice. A practical implementation may also wish to replace recursion with iteration; this is straightforward with a stack and extra book-keeping (gist).
- The fact that each component in the location transfer graph can have at most one cycle is a consequence of the destination locations being distinct, which is crucial to the definition of a parallel move: each location is written by at most one move, so every node has at most one incoming edge, and a component can therefore contain at most one cycle.
- In the paper by Leroy et al., it is noted that one could process the “windmills” (the name given to the shape of components in a location transfer graph for parallel moves) by processing the “blades” of each windmill followed by the “axles”. This idea is exactly the variant of Kahn’s algorithm described earlier in this article for conceptual purposes. An extant compiler back-end that uses the Leroy et al. algorithm is QBE (and of course, CompCert, for which the algorithm was originally intended).
- There are other strategies in instruction set architectures for swapping the contents of two registers. Notably, in x86(_64), the xchg instruction can be used. However, I don’t know of a mainstream compiler that uses it for the purpose of swapping registers when lowering parallel moves before calls. In fact, GCC, when pushed, will actually spill to the stack (see here) rather than emit an xchg (it seems there’s no code path in GCC that would ever emit xchg in this case). The reason for not emitting xchg in general is probably that: (1) the instruction issues 3 micro-ops that cannot benefit from mov elision, and (2) using a memory operand with xchg implies atomicity. Of course, other swapping strategies, such as XOR swapping, could be used but are probably equally undesirable. For illustration, a hypothetical xchg-based lowering of foo is sketched below.
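Assuming the System V AMD64 calling convention (first two integer arguments in edi and esi), such a lowering of foo could look like the following; as noted above, mainstream compilers do not actually emit this:
foo:
    xchg edi, esi
    jmp bar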