A variant of a diff algorithm for constrained conditions

Publication date: 2021-02-12

Introduction

One of the projects I currently work on is Praefectus, which will be a GTD application for my needs.

Several weeks ago, I've decided that one of the first tasks Praefectus will do will be file renaming: having a set of numbered files on my disk ( say, 1.homework-a.md, 2.office-work.md, 3.homework-b.md) and certain ordering condition (say, I've decided that all the homework is more important than any office work today), it should rename the files to match the ordering condition. As I store my files under a version control system (usually, Git), it would be good to have it performing only a minimal set of renames: there could be gaps between items (which may eventually be filled), but for basic rename, to keep my file history more clean, I'd like it to not rename items that could stay as-is.

For my example it would mean that Praefectus should only rename 2.office-work.md into 4.office-work.md, and leave number 2 unoccupied.

While investigating different algorithms solving similar problems, I eventually came to the conclusion that my task is very similar to what a diff algorithm would do: having two sequences of items, it calculates a minimal set of edits on sequence one to get sequence two as the result.

The task is similar to a diff algorithm one, but it's not the same. Usual diff algorithm doesn't know anything about item numbering: it just views the input sequence as a sequence, and "thinks" it could insert any amount of new items between any two. While this will work for some cases, for some it won't.

For example, let's consider this sequence of items: 1.A, 3.D, 4.E, 5.B, 6.C. We want to take this sequence and change the numbers near each items for the letters to follow the alphabet order; for example, 1.A, 10.B, 20.C, 30.D, 40.E would be a valid answer — albeit not the best one.

A usual diff algorithm would consider taking items B and C, and inserting them between A and D. The problem is there's only one free position between A and D, so this particular edit graph is impossible and shouldn't be considered at all! What's possible is to delete item 3.D and insert it as 4.D (that would require us to also rename 4.E to free the space). Though, a better solution would be to move D and E to the end of the sequence.

The algorithm should be significantly modified to take item positions into account. In this article, I'll describe the required algorithm modifications.

Task Description

Let A and B to be two sequences of items (equatable to each other), and ItemsA a mapping from each item in sequence A to a number. For simplicity let's not consider cases when there're several equal items in the sequence A or sequence B, and every number in a mapping ItemsA is distinct (no two items map to same number).

The goal of the algorithm is to produce a shortest command sequence which, when applied to a sequence A and mapping ItemsA, will produce a mapping ItemsB, which has the following properties:

  1. It maps every item in sequence B to a number.
  2. All the numbers are distinct (no two items map to same number).
  3. When ordered by the corresponding numbers in ItemsB, sequence B remains itself.

A command is one of the following:

Diff Algorithm

A similar task has already been solved in diff algorithms (for example, see Eugene W. Myers, An O(ND) Difference Algorithm and Its Variations: Algorithmica (1986), pp. 251-266).

A diff algorithm considers an edit graph of the sequences: a grid of size Size(A)+1 × Size(B)+1, where the horizontal movements (from left to right) correspond to deletions of the corresponding item of the initial sequence; vertical movements (from top to bottom) correspond to insertions of an item from the target sequence; diagonal movements (only available in places where the initial and the target sequences match) correspond to leaving an item of the initial sequence.

Here's an example edit graph for converting sequence ABCD into ACBD:

 0   A  C  B  D
  ·──·──·──·──·
  │╲ │  │  │  │
  │ ╲│  │  │  │
A ·──·──·──·──·
  │  │  │╲ │  │
  │  │  │ ╲│  │
B ·──·──·──·──·
  │  │╲ │  │  │
  │  │ ╲│  │  │
C ·──·──·──·──·
  │  │  │  │╲ │
  │  │  │  │ ╲│
D ·──·──·──·──·

The goal is to get from the top left corner to the bottom right corner of the graph.

Myers' algorithm helps to find the shortest path in this graph.

Now, let's consider an example of applying the order ABCD (B = "ABCD") to the initial sequence "1A 3C 4B 5D" (A = ACBD, ItemsA = (A → 1, B → 4, C → 3, D → 5)) with a minimal amount of moves.

 0   1A 3C 4B 5D
  ·──·──·──·──·
  ║╲ │  ║  ║  │
  ║ ╲│  ║  ║  │
A ·──·──·──·──·
  ║  │  ║╲ ║  │
  ║  │  ║ ╲║  │
B ·──·──·──·──·
  ║  │╲ ║  ║  │
  ║  │ ╲║  ║  │
C ·──·──·──·──·
  ║  │  ║  ║╲ │
  ║  │  ║  ║ ╲│
D ·──·──·──·──·

In this graph, certain paths (marked as double lines ║) are forbidden, because they could lead to insertion of the file in between of two existing subsequent files (or with a number below 1), which would lead to us having to renumber the latter file anyway, which is essentially the same in complexity as removing a file and inserting a new one.

Though, these movements are only forbidden conditionally: it's still possible to delete the corresponding item and insert another one in its place. For example, such route in the edit graph is forbidden, because it requires to insert an item after 1A, but before 2C:

 0   1A 2C
  ·  ·  ·
   ╲
    ╲
A ·  ·  ·
     │
     │
B ·  ·  ·
      ╲
       ╲
C ·  ·  ·

Although this route isn't forbidden, because it deletes the item 2C before inserting anything on its place (so it would produce a sequence 1A 2B 3C, with two moves):

 0   1A 2C
  ·  ·  ·
   ╲
    ╲
A ·  ·  ·
     │
     │
B ·  ·──·
        │
        │
C ·  ·  ·

To accomodate the concept of "available insertion count between columns" and these forbidden paths, I propose to use a set of extensions to the edit graph.

  1. The X dimension of the edit graph should be determined not by the item count of the sequence A, but by the maximal number in the mapping ItemsA. Say, for sequence with mapping "1A 3C 4B 5D", the max X dimension should be 5 and not 4.
  2. Diagonal movement from any node to a node in a column where no mapping exists (to the column 2 in the example sequence) is allowed (as if such a nonexistent item was equal to any other item).
  3. No vertical movements allowed, except for several cases:
    • moves in the rightmost column (since there's an infinite amount of space for the new items there);
    • moves involved in a "special maneuver": a vertical move immediately after a horizontal one (essentially, insertion of item into a position from which an item was just removed).

For the test example, this set of rules will allow to produce the following edit graph:

 0   1A 2_ 3C 4B 5D
  ·──·──·──·──·──·
   ╲ ║╲    ║  ║  │
    ╲║ ╲   ║  ║  │
A ·──·──·──·──·──·
     ║╲    ║╲ ║  │
     ║ ╲   ║ ╲║  │
B ·──·──·──·──·──·
     ║╲  ╲ ║  ║  │
     ║ ╲  ╲║  ║  │
C ·──·──·──·──·──·
     ║╲    ║  ║╲ │
     ║ ╲   ║  ║ ╲│
D ·──·──·──·──·──·

Here, double lines ║ mark paths that may only be taken as part of the "special maneuver".

Traversing of this graph with a simple breadth-first algorithm will allow to get a shortest edit sequence. While traversing the graph, the following should count as a single move: a single step right or down (when allowed), optionally followed by traversing all the available diagonal steps.

Currently, this rule is implemented and tested in a Praefectus code (as of commit 9970a266d584d53536195145b611b808c48838b3 , see on GitHub).

Possible Improvements

  1. It would be beneficial to involve property-based testing into testing the algorithm.

  2. Current algorithm doesn't take into account that "skipping" the empty columns should be counted as a "free" move. I.e. the following route only has two edits (delete item 1A, delete item 3C), not three (since deleting of the item 2_ doesn't count):

     0   1A 2_ 3C
      ·──·──·──·
    

Further reading

To better understand Myers' algorithm (and especially how to reconstruct the edit sequence basing on the data it provides), see the blog post series of James Colgan: