A variant of a diff algorithm for constrained conditions
Introduction
One of the projects I currently work on is Praefectus, which will be a GTD application for my needs.
Several weeks ago, I've decided that one of the first tasks Praefectus will do
will be file renaming: having a set of numbered files on my disk (
say, 1.homework-a.md
, 2.office-work.md
, 3.homework-b.md
) and certain
ordering condition (say, I've decided that all the homework is more important
than any office work today), it should rename the files to match the ordering
condition. As I store my files under a version control system (usually, Git), it
would be good to have it performing only a minimal set of renames: there could
be gaps between items (which may eventually be filled), but for basic rename, to
keep my file history more clean, I'd like it to not rename items that could stay
as-is.
For my example it would mean that Praefectus should only
rename 2.office-work.md
into 4.office-work.md
, and leave number 2
unoccupied.
While investigating different algorithms solving similar problems, I eventually came to the conclusion that my task is very similar to what a diff algorithm would do: having two sequences of items, it calculates a minimal set of edits on sequence one to get sequence two as the result.
The task is similar to a diff algorithm one, but it's not the same. Usual diff algorithm doesn't know anything about item numbering: it just views the input sequence as a sequence, and "thinks" it could insert any amount of new items between any two. While this will work for some cases, for some it won't.
For example, let's consider this sequence of items: 1.A
, 3.D
, 4.E
, 5.B
,
6.C
. We want to take this sequence and change the numbers near each items for the letters to follow the alphabet order; for example, 1.A
, 10.B
, 20.C
, 30.D
, 40.E
would be a valid answer — albeit not the best one.
A usual diff algorithm would consider taking items B
and C
, and
inserting them between A
and D
. The problem is there's only one free
position between A
and D
, so this particular edit graph is impossible and
shouldn't be considered at all! What's possible is to delete item 3.D
and
insert it as 4.D
(that would require us to also rename 4.E
to free the
space). Though, a better solution would be to move D
and E
to the end of the
sequence.
The algorithm should be significantly modified to take item positions into account. In this article, I'll describe the required algorithm modifications.
Task Description
Let A
and B
to be two sequences of items (equatable to each other),
and ItemsA
a mapping from each item in sequence A
to a number. For
simplicity let's not consider cases when there're several equal items in the
sequence A
or sequence B
, and every number in a mapping ItemsA
is distinct
(no two items map to same number).
The goal of the algorithm is to produce a shortest command sequence which,
when applied to a sequence A
and mapping ItemsA
, will produce a mapping
ItemsB
, which has the following properties:
- It maps every item in sequence
B
to a number. - All the numbers are distinct (no two items map to same number).
- When ordered by the corresponding numbers in
ItemsB
, sequenceB
remains itself.
A command is one of the following:
- add item (from sequence
B
with positionX
inItemsB
) - move item (from the corresponding position from
ItemsA
to a positionX
inItemsB
)
Diff Algorithm
A similar task has already been solved in diff algorithms (for example, see Eugene W. Myers, An O(ND) Difference Algorithm and Its Variations: Algorithmica (1986), pp. 251-266).
A diff algorithm considers an edit graph of the sequences: a grid of size
Size(A)+1 × Size(B)+1
, where the horizontal movements (from left to right)
correspond to deletions of the corresponding item of the initial sequence;
vertical movements (from top to bottom) correspond to insertions of an item from
the target sequence; diagonal movements (only available in places where the
initial and the target sequences match) correspond to leaving an item of the
initial sequence.
Here's an example edit graph for converting sequence ABCD
into ACBD
:
0 A C B D
·──·──·──·──·
│╲ │ │ │ │
│ ╲│ │ │ │
A ·──·──·──·──·
│ │ │╲ │ │
│ │ │ ╲│ │
B ·──·──·──·──·
│ │╲ │ │ │
│ │ ╲│ │ │
C ·──·──·──·──·
│ │ │ │╲ │
│ │ │ │ ╲│
D ·──·──·──·──·
The goal is to get from the top left corner to the bottom right corner of the graph.
Myers' algorithm helps to find the shortest path in this graph.
Now, let's consider an example of applying the order ABCD (B = "ABCD"
) to the
initial sequence "1A 3C 4B 5D" (A = ACBD
,
ItemsA = (A → 1, B → 4, C → 3, D → 5)
) with a minimal amount of moves.
0 1A 3C 4B 5D
·──·──·──·──·
║╲ │ ║ ║ │
║ ╲│ ║ ║ │
A ·──·──·──·──·
║ │ ║╲ ║ │
║ │ ║ ╲║ │
B ·──·──·──·──·
║ │╲ ║ ║ │
║ │ ╲║ ║ │
C ·──·──·──·──·
║ │ ║ ║╲ │
║ │ ║ ║ ╲│
D ·──·──·──·──·
In this graph, certain paths (marked as double lines ║) are forbidden, because they could lead to insertion of the file in between of two existing subsequent files (or with a number below 1), which would lead to us having to renumber the latter file anyway, which is essentially the same in complexity as removing a file and inserting a new one.
Though, these movements are only forbidden conditionally: it's still possible to
delete the corresponding item and insert another one in its place. For example,
such route in the edit graph is forbidden, because it requires to insert an item
after 1A
, but before 2C
:
0 1A 2C
· · ·
╲
╲
A · · ·
│
│
B · · ·
╲
╲
C · · ·
Although this route isn't forbidden, because it deletes the item 2C
before
inserting anything on its place (so it would produce a sequence 1A 2B 3C
, with
two moves):
0 1A 2C
· · ·
╲
╲
A · · ·
│
│
B · ·──·
│
│
C · · ·
To accomodate the concept of "available insertion count between columns" and these forbidden paths, I propose to use a set of extensions to the edit graph.
- The X dimension of the edit graph should be determined not by the item count
of the sequence
A
, but by the maximal number in the mappingItemsA
. Say, for sequence with mapping "1A 3C 4B 5D", the max X dimension should be 5 and not 4. - Diagonal movement from any node to a node in a column where no mapping exists (to the column 2 in the example sequence) is allowed (as if such a nonexistent item was equal to any other item).
- No vertical movements allowed, except for several cases:
- moves in the rightmost column (since there's an infinite amount of space for the new items there);
- moves involved in a "special maneuver": a vertical move immediately after a horizontal one (essentially, insertion of item into a position from which an item was just removed).
For the test example, this set of rules will allow to produce the following edit graph:
0 1A 2_ 3C 4B 5D
·──·──·──·──·──·
╲ ║╲ ║ ║ │
╲║ ╲ ║ ║ │
A ·──·──·──·──·──·
║╲ ║╲ ║ │
║ ╲ ║ ╲║ │
B ·──·──·──·──·──·
║╲ ╲ ║ ║ │
║ ╲ ╲║ ║ │
C ·──·──·──·──·──·
║╲ ║ ║╲ │
║ ╲ ║ ║ ╲│
D ·──·──·──·──·──·
Here, double lines ║ mark paths that may only be taken as part of the "special maneuver".
Traversing of this graph with a simple breadth-first algorithm will allow to get a shortest edit sequence. While traversing the graph, the following should count as a single move: a single step right or down (when allowed), optionally followed by traversing all the available diagonal steps.
Currently, this rule is implemented and tested in a Praefectus code (as of
commit 9970a266d584d53536195145b611b808c48838b3
, see on GitHub).
Possible Improvements
It would be beneficial to involve property-based testing into testing the algorithm.
Current algorithm doesn't take into account that "skipping" the empty columns should be counted as a "free" move. I.e. the following route only has two edits (delete item
1A
, delete item3C
), not three (since deleting of the item2_
doesn't count):0 1A 2_ 3C ·──·──·──·
Further reading
To better understand Myers' algorithm (and especially how to reconstruct the edit sequence basing on the data it provides), see the blog post series of James Colgan: