{\huge Puzzling Evidence}
A connection between two ancient problems
Arturs Backurs and Piotr Indyk are student and advisor. The latter, Piotr, is one of the leading algorithms and complexity theorists in the world---what an honor it must be for Arturs to have him as an advisor.
Today Ken and I want to talk about their paper on edit distance and an aspect that we find puzzling.
The paper in question is ``Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false).'' It appears in the upcoming STOC 2015.
What we find puzzling is that the beautiful connection it makes between two old problems operates between two different levels of scaling, ``$latex {n}&fg=000000$'' and ``$latex {2^n}&fg=000000$.'' This messes up our intuition, at least mine.
I, Dick, have thought long and hard, over many years, about both the edit distance problem and algorithms for SAT. I always felt that both problems should admit much better than the ``obvious'' algorithms. However, I was much more positive about our ability to make a breakthrough on edit distance than to do the same for SAT.
Their way of linking the two problems is to me quite puzzling. Quoting my favorite band, the Talking Heads:
... Now don't you wanna get right with me? (puzzling evidence)
I hope you get ev'rything you need (puzzling evidence)
Puzzling evidence
Puzzling evidence
Puzzling evidence ...
The Problems
The edit distance between two strings is defined as the minimum number of insertions, deletions, or substitutions of symbols needed to transform one string into the other. Thus CAT requires three substitutions to become ATE, but it can also be done by one insertion and one deletion: pop the C to make AT and then append E to make ATE. Thus the edit distance between these two words is 2. The problem of computing the edit distance occurs in so many fields of science that it is hard to figure out who invented what first. The case of strings of length $latex {N}&fg=000000$ is easily seen to be computable in quadratic time, $latex {O(N^{2})}&fg=000000$, by a dynamic programming algorithm that builds up edit distances between initial substrings.
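For concreteness, here is a minimal sketch of that dynamic program in Python (our own illustration, not code from any of the papers; the names s, t, and D are ours): D[i][j] holds the edit distance between the first i symbols of s and the first j symbols of t.

def edit_distance(s: str, t: str) -> int:
    """Classic O(|s|*|t|) dynamic program for edit distance."""
    m, n = len(s), len(t)
    # D[i][j] = edit distance between s[:i] and t[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[m][n]

assert edit_distance("CAT", "ATE") == 2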
Chak-Kuen Wong and Ashok Chandra proved this is optimal in the restricted model where one can only compare characters to each other. There are algorithms that beat quadratic by logarithmic factors---they essentially treat blocks of characters as one. But it remains open after much research whether there is an algorithm that runs in time $latex {N^{1.999998}}&fg=000000$, for example.
The SAT problem is the usual question of testing whether a set of Boolean clauses can all be satisfied at the same time by a single assignment to the Boolean variables.
The Connection
Backurs and Indyk prove that if there exists $latex {\delta > 0}&fg=000000$ such that edit distance can be decided in time $latex {O(N^{2 - \delta})}&fg=000000$, then there exists $latex {\epsilon > 0}&fg=000000$ such that $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ for formulas with $latex {n}&fg=000000$ variables and $latex {m}&fg=000000$ clauses can be solved in time $latex {m^{O(1)}2^{(1-\epsilon)n}}&fg=000000$. They build on a connection to SETH shown ten years ago by Ryan Williams as part of a wider-ranging paper.
The basic idea concerns how $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ behaves with regard to partitions of the $latex {n}&fg=000000$ variables of a formula $latex {\phi = C_1 \wedge\cdots \wedge C_m}&fg=000000$ into two size-$latex {n/2}&fg=000000$ subsets, call them $latex {X_1}&fg=000000$ and $latex {X_2}&fg=000000$. Let $latex {A}&fg=000000$ be the set of assignments to $latex {X_1}&fg=000000$ and $latex {B}&fg=000000$ to $latex {X_2}&fg=000000$. For every assignment $latex {a \in A}&fg=000000$, let $latex {S_a}&fg=000000$ be the set of clauses it satisfies, and $latex {T_a}&fg=000000$ the remaining clauses which it does not satisfy. Similarly define $latex {S_b}&fg=000000$ and $latex {T_b}&fg=000000$ for $latex {b \in B}&fg=000000$. Then
$latex \displaystyle \phi \in \mathsf{SAT} \iff (\exists a,b)[|S_a \cup S_b| = m] \iff (\exists a,b)[T_a \cap T_b = \emptyset]. &fg=000000$
Now let us identify $latex {a}&fg=000000$ with $latex {T_a}&fg=000000$ regarded as an $latex {m}&fg=000000$-bit vector and similarly $latex {b}&fg=000000$ with $latex {T_b}&fg=000000$, also re-labeling $latex {A}&fg=000000$ to be the set of $latex {N = 2^{n/2}}&fg=000000$-many $latex {T_a}&fg=000000$'s and $latex {B}&fg=000000$ the set of $latex {T_b}&fg=000000$'s. Then as Williams observed, $latex {\phi}&fg=000000$ is satisfiable if and only if we have a yes-instance of the following problem:

Orthogonal Vectors Problem ($latex {\mathsf{OVP}}&fg=000000$): Given two sets $latex {A,B}&fg=000000$ of $latex {N}&fg=000000$-many length-$latex {m}&fg=000000$ binary vectors, are there $latex {a \in A}&fg=000000$ and $latex {b \in B}&fg=000000$ such that $latex {\sum_{i=1}^m a_i b_i = 0}&fg=000000$?
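To make Williams's observation concrete, here is a hedged toy implementation, entirely our own (the DIMACS-style clause encoding and the function names are our assumptions). It builds, for each half-assignment, the characteristic vector of $latex {T_a}&fg=000000$ or $latex {T_b}&fg=000000$:

from itertools import product

def sat_to_ovp(n, clauses):
    # Split the variables into X1 = {1,...,n/2} and X2 = {n/2+1,...,n}
    # (n assumed even).  For each assignment to a half, output the m-bit
    # vector whose i-th bit is 1 iff clause i is NOT satisfied, i.e. the
    # characteristic vector of T_a (resp. T_b).  Clauses are DIMACS-style
    # lists of signed, 1-based variable indices.
    h = n // 2

    def unsat_vectors(variables):
        vecs = []
        for bits in product([False, True], repeat=len(variables)):
            assign = dict(zip(variables, bits))
            vec = tuple(
                0 if any((lit > 0) == assign[abs(lit)]
                         for lit in clause if abs(lit) in assign)
                else 1
                for clause in clauses)
            vecs.append(vec)
        return vecs

    A = unsat_vectors(range(1, h + 1))      # the 2^(n/2) vectors T_a
    B = unsat_vectors(range(h + 1, n + 1))  # the 2^(n/2) vectors T_b
    return A, B

# Toy formula (x1 OR x2) AND (NOT x1 OR NOT x2) over n = 2 variables:
A, B = sat_to_ovp(2, [[1, 2], [-1, -2]])
print(A)  # [(1, 0), (0, 1)] : unsatisfied-clause vectors for x1 = F, T
print(B)  # [(1, 0), (0, 1)] : likewise for x2 = F, T

Then $latex {\phi}&fg=000000$ is satisfiable exactly when some $latex {a \in A}&fg=000000$ and $latex {b \in B}&fg=000000$ are orthogonal, which the brute-force snippet below checks.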
Obviously one can solve $latex {\mathsf{OVP}}&fg=000000$ in time $latex {O(N^2 m)}&fg=000000$ by trying all pairs. The nub is what happens if we achieve anything even slightly better than quadratic in the exponent, say time $latex {N^{2-\epsilon} m^{O(1)}}&fg=000000$. Then with $latex {\epsilon' = \epsilon/2}&fg=000000$ we get time
$latex \displaystyle \left(2^{n/2}\right)^{2-\epsilon} m^{O(1)} = 2^{(n/2)(2-\epsilon)} m^{O(1)} = 2^{n - \epsilon' n} m^{O(1)} &fg=000000$
for $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$, which contradicts SETH.

What's puzzling is that the evidence against doing better than quadratic comes when $latex {N}&fg=000000$ is already exponential, $latex {N = 2^{n/2}}&fg=000000$. Moreover, the $latex {\mathsf{OVP}}&fg=000000$ instances $latex {(A,B)}&fg=000000$ involved are ridiculously large, exponential sized, and we don't even care that they have a succinct genesis in terms of $latex {\phi}&fg=000000$. (Note that we have swapped the letters $latex {n}&fg=000000$ and $latex {N}&fg=000000$ from their paper---we find it helpful to keep ``$latex {N}&fg=000000$'' as the larger one.)
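Here, again as our own sketch, is the trivial all-pairs check just mentioned; the comments record the exponent accounting above.

def has_orthogonal_pair(A, B):
    """Decide OVP by brute force: O(N^2 * m) time for N vectors of
    length m.  On instances built from a SAT formula, N = 2^(n/2), so
    N^2 = 2^n: this is exhaustive search over full assignments in
    disguise.  Any N^(2-eps) * poly(m) algorithm here would give a
    2^((1-eps/2)n) * poly(m) algorithm for CNF-SAT, refuting SETH."""
    return any(all(x * y == 0 for x, y in zip(a, b))
               for a in A for b in B)

# (1,0) and (0,1) are orthogonal; no pair among (1,0),(1,1) is.
assert has_orthogonal_pair([(1, 0)], [(0, 1)])
assert not has_orthogonal_pair([(1, 0), (1, 1)], [(1, 0), (1, 1)])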
Backurs and Indyk itemize several problems to which this connection has been extended since 2010, but we agree that Edit Distance is the most striking addition to the list. Their new result is a kind of ``SETH-reduction'' from $latex {\mathsf{OVP}}&fg=000000$ to Edit Distance. Can we capture its essence without referencing $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ each time?
The Reductions
The results here and before all use an unusual type of reduction. I think it would be useful to formalize this reduction, and try to understand its properties. It is not totally correct to call it simply a quasi-linear time reduction because multiple parameters are involved---we can call them $latex {N}&fg=000000$ and $latex {m}&fg=000000$ quite generally.
In the above case with $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ and $latex {\mathsf{OVP}}&fg=000000$, if the clause size is fixed then we have $latex {m = n^{O(1)}}&fg=000000$ so $latex {m = \mathsf{polylog}(N)}&fg=000000$. It hence suffices to have a reduction from $latex {\mathsf{OVP}}&fg=000000$ to Edit Distance that is computable in quasi-linear time, here meaning time $latex {N\mathsf{polylog}(N)}&fg=000000$. Indeed, we can allow time $latex {N\cdot g(N)}&fg=000000$ for any function $latex {g(N) = N^{o(1)}}&fg=000000$.
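To spell out why this much slack is tolerable (a back-of-envelope step of ours, not quoted from the paper): a reduction running in time $latex {N \cdot g(N)}&fg=000000$ with $latex {g(N) = N^{o(1)}}&fg=000000$ produces strings of length at most $latex {M = N^{1+o(1)}}&fg=000000$, so a hypothetical $latex {O(M^{2-\delta})}&fg=000000$-time algorithm for edit distance yields total time

$latex \displaystyle N^{1+o(1)} + \left(N^{1+o(1)}\right)^{2-\delta} = N^{2-\delta+o(1)} \leq N^{2-\delta'} &fg=000000$

for any fixed $latex {\delta' < \delta}&fg=000000$ once $latex {N}&fg=000000$ is large, which is exactly the kind of bound fed into the SETH argument above.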
When talking just about the problems $latex {\mathsf{OVP}}&fg=000000$ and Edit Distance, however---without reference to $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$---$latex {m}&fg=000000$ and $latex {N}&fg=000000$ are separate parameters with no relation specified. It suffices to say that the reduction is polynomial in $latex {m}&fg=000000$ and quasi-linear in $latex {N}&fg=000000$. This is essentially what Backurs and Indyk do. Their ``$latex {m}&fg=000000$'' is called ``$latex {d}&fg=000000$''; they define $latex {\ell_0 = 1000d}&fg=000000$, $latex {\ell_1 = \ell_0^2}&fg=000000$, and $latex {\ell_2 = \ell_0^3}&fg=000000$; then they work with products such as $latex {\ell_0 d}&fg=000000$, and so on. The details in their paper are considerable, involving an initial reduction from $latex {\mathsf{OVP}}&fg=000000$ to a problem they call $latex {\mathsf{PATTERN}}&fg=000000$, and this is one reason we'd like to streamline the reduction concept.
If we assume $latex {m = \Omega(\log N)}&fg=000000$, then ``quasi-linear in $latex {N}&fg=000000$ and polynomial in $latex {m}&fg=000000$'' is the same as ``linear in $latex {N}&fg=000000$ and polynomial in $latex {m}&fg=000000$.'' Perhaps the latter phrase is the simplest and best way to define the reduction? However, harking back to the $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ instances with $latex {m = O(n)}&fg=000000$, ultimately we do not need to specify ``polynomial in $latex {m}&fg=000000$'' either. It is enough for the reduction to take time sub-exponential in $latex {m}&fg=000000$, but of the stronger form $latex {2^{o(m)}}&fg=000000$, not $latex {2^{m - \epsilon m}}&fg=000000$. If the clause size is not fixed, so that, say, $latex {m = n^2}&fg=000000$ comes into play, then we would need $latex {2^{o(m^{1/2})}}&fg=000000$.
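Again as a back-of-envelope check of ours: if the reduction runs in time (and hence produces strings of length) $latex {N \cdot 2^{o(m)}}&fg=000000$, then with $latex {m = O(n)}&fg=000000$ and $latex {N = 2^{n/2}}&fg=000000$ a subquadratic edit-distance algorithm finishes in time

$latex \displaystyle \left(N \cdot 2^{o(m)}\right)^{2-\delta} = N^{2-\delta} \cdot 2^{o(n)} = 2^{n - (\delta/2)n + o(n)}, &fg=000000$

which still contradicts SETH; a factor of $latex {2^{m - \epsilon m}}&fg=000000$ would instead swamp the savings, and with $latex {m = n^2}&fg=000000$ we need $latex {2^{o(m^{1/2})} = 2^{o(n)}}&fg=000000$ for the same cancellation.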
Parameterized complexity defines reductions with two parameters, but the simplest ones are not exactly suitable. Finally, we wonder whether it would help to stipulate any of the specific structure that comes from $latex {\mathsf{SAT}}&fg=000000$, including that the instances $latex {(A,B)}&fg=000000$ are succinct. Note that we once covered succinctness and a hypothesis roughly related to SETH (see this comment for a circuit form). This paper highlighted by Backurs and Indyk works from $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ but notes it could have used $latex {\mathsf{OVP}}&fg=000000$, while still not formalizing the reduction concept. Likewise for their other major references: some work from $latex {\mathsf{CNF}\text{-}\mathsf{SAT}}&fg=000000$ and others do not. The latest of them, by Karl Bringmann and Marvin Künnemann, who show ``SETH-hardness'' for Edit Distance on binary strings, defines a framework for the gadgets used in all these reductions.
Power and Puzzlement
The remarks about the reduction time in ``$latex {m}&fg=000000$'' make us recall the three most common levels of exponential hardness. The ``power index'' of a language $latex {L}&fg=000000$ was coined by Richard Stearns and Harry Hunt in a 1990 paper. It is the infimum of $latex {\epsilon > 0}&fg=000000$ such that $latex {L}&fg=000000$ belongs to time $latex {2^{O(n^\epsilon)}}&fg=000000$. Their ``Satisfiability Hypothesis'' (SH) about the power index of satisfiability is much weaker than SETH, though not as weak as conjecturing merely a lower bound of $latex {2^{n^{\Omega(1)}}}&fg=000000$.
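To fix ideas, the three levels can be stated as follows, in increasing strength (our paraphrase, not verbatim from any one source): a bare lower bound of $latex {2^{n^{\Omega(1)}}}&fg=000000$ for SAT; ETH, which asserts that there is an $latex {\epsilon > 0}&fg=000000$ such that $latex {3}&fg=000000$-$latex {\mathsf{SAT}}&fg=000000$ requires time $latex {2^{\epsilon n}}&fg=000000$; and SETH, which asserts that for every $latex {\epsilon > 0}&fg=000000$ there is a $latex {k}&fg=000000$ such that $latex {k}&fg=000000$-$latex {\mathsf{SAT}}&fg=000000$ requires time $latex {2^{(1-\epsilon)n}}&fg=000000$.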
The latter two (ETH and SETH) come in slightly different versions bringing $latex {m}&fg=000000$ and/or a fixed $latex {k}&fg=000000$ in $latex {k}&fg=000000$-$latex {\mathsf{SAT}}&fg=000000$ into the picture, and of course all these versions might be false. SETH is distinguished as the closest to the upper bound. There are randomized algorithms for $latex {k}&fg=000000$-$latex {\mathsf{SAT}}&fg=000000$ that run in time $latex {2^{n - (\epsilon/k)n}}&fg=000000$ for some $latex {\epsilon > 0}&fg=000000$ and all $latex {k}&fg=000000$, and $latex {k}&fg=000000$ can be replaced by $latex {\log(m/n)}&fg=000000$ in general.
Stearns and Hunt emphasized the effect of reductions on SH and on the power index in general. The same can be done for ETH. But we remain equally puzzled about the issue of the size of the problems used in the reductions. We start with a SAT problem that uses $latex {b}&fg=000000$ bits in its description. This is then eventually viewed as an edit distance problem whose description uses a number of bits exponential in $latex {b}&fg=000000$. The $latex {N}&fg=000000$ in the edit problem is extremely large. Of course this is just fine, since we only claim to get an exponential time algorithm.
The point is that our intuition about the edit distance problem is all drawn from problems of modest size $latex {n}&fg=000000$. I, Dick, actually had a student build a hardware machine that solved modest-size instances many times faster than any computer. So all my intuition was---is---about small edit distance problems. When $latex {n}&fg=000000$ becomes astronomical, my intuition may fall apart. Could this explain why the result here seems puzzling?
Open Problems
So what is the complexity of computing the edit distance? Can we really not do better than the obvious algorithms? This seems hard, nay puzzling, to us; but it may indeed be the case.
The 3SUM problem, which we recently covered, is attacked by some of these papers but has not as yet been brought within the scope of the reductions from $latex {\mathsf{OVP}}&fg=000000$. The $latex {\tilde{O}(n^{3/2})}&fg=000000$ decision-tree upper bound has not yet yielded an $latex {n^{2-\epsilon}}&fg=000000$-time algorithm, but perhaps the reductions also preserve decision-tree complexity, so that this is a concrete obstacle?