{\huge Minsky The Theorist}
Marvin Minsky's contributions to complexity theory
\vspace{.25in} Marvin Minsky sadly passed away last Sunday. He was one of the great leaders of artificial intelligence (AI), and his early work helped shape the field for decades.
Today Ken and I remember him also as a theorist.
Like many early experts in computing, Minsky started out as a mathematician. He majored in math at Harvard and then earned a PhD at Princeton under Albert Tucker for a thesis titled, ``Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain Model Problem.'' This shows his interest in AI, with a focus on machine learning and neural nets, from the get-go. However, he also made foundational contributions to computability and complexity, and one of his first students, Manuel Blum, who finished in 1964, laid down the axioms for abstract complexity measures.
We will present two beautiful results of Minsky's that prove our point. AI may have lost a founder, but so has complexity theory. His brilliance will be missed by all of computing, not just AI.
Counter Machines
In 1967 Minsky published his book Computation: Finite and Infinite Machines. This book was really not an AI book, but rather a book on the power of various types of abstract machines.
\includegraphics[width=1.25in]{comp.png}
The book includes a beautiful result which we'll call the two-counter theorem:
Theorem 1 A two-counter machine is universal, and hence has an undecidable halting problem.
A two-counter machine has two counters---not surprising. It can test each counter to see if it is zero or not, and can add $latex {1}&fg=000000$ to a counter, or subtract $latex {1}&fg=000000$ from a counter. The finite control of the machine is deterministic. The import of the two-counter theorem is that only two counters are needed to build a universal computational device. It is easy to prove that a small number of counters suffice to simulate Turing machines. But the beauty of this theorem is getting the number down to two.
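To make the model concrete, here is a minimal interpreter sketch in Python. It is our own illustration, not Minsky's formulation: the instruction format and the helper {\tt run} are hypothetical, but they capture exactly the allowed operations: increment, decrement, and a zero test that steers the finite control.

\begin{verbatim}
# Minimal two-counter machine interpreter (illustrative sketch only).
# Hypothetical program format:
#   state -> ("inc", i, next_state)                       add 1 to counter i
#   state -> ("dec", i, next_if_positive, next_if_zero)   subtract 1 or branch on zero
#   state -> ("halt",)
def run(program, start, counters=(0, 0), max_steps=10000):
    c = list(counters)
    state = start
    for _ in range(max_steps):
        instr = program[state]
        if instr[0] == "halt":
            return tuple(c)
        if instr[0] == "inc":
            _, i, state = instr
            c[i] += 1
        else:  # "dec"
            _, i, next_pos, next_zero = instr
            if c[i] > 0:
                c[i] -= 1
                state = next_pos
            else:
                state = next_zero
    raise RuntimeError("step limit reached; the machine may not halt")

# Example: move the contents of counter 0 into counter 1.
prog = {
    "loop": ("dec", 0, "bump", "done"),
    "bump": ("inc", 1, "loop"),
    "done": ("halt",),
}
print(run(prog, "loop", counters=(5, 0)))   # prints (0, 5)
\end{verbatim}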
To recall some formal language theory, one counter is not sufficient to build a universal machine. This follows because a counter is a restricted kind of pushdown store: it has a unary alphabet plus the ability to detect when the store is empty. Single-pushdown machines have a decidable halting problem. It was long known that two pushdown stores can simulate one Turing tape and hence are universal: one holds the characters to the left of the tape head and the other holds the characters to the right. What is surprising, useful, and nontrivial is that two counters are equally powerful.

I will not go into the details of how Minsky proved his theorem, but will give a hint. The main trick is a clever representation that fits a whole tuple of numbers into a single number. Perhaps Minsky used his AI thinking here; I mention this since one of the key ideas in AI is how we represent information. In this case he represents, say, a vector $latex {(a,b,c)}&fg=000000$ of natural numbers by
$latex \displaystyle 2^{a}3^{b}5^{c}. &fg=000000$
This is stored in one counter. The other counter is used as scratch space to test whether, say, $latex {b=0}&fg=000000$, or to replace $latex {b}&fg=000000$ by $latex {b+1}&fg=000000$. Making this work is a nice exercise in programming; I covered it in a post back in March 2009.

I must add a personal note here. I used this theorem in one of my early results, way back when I started doing research. A very pretty theorem of Minsky.

Ken was also inspired by the connection from counter machines to algebraic geometry via polynomial ideal theory that was developed by Ernst Mayr and Albert Meyer while at MIT. The basic idea is that if we have, say, an instruction {\tt (q,x--,y++,r)} that decrements counter $latex {x}&fg=000000$ and increments $latex {y}&fg=000000$ while going from state $latex {q}&fg=000000$ to state $latex {r}&fg=000000$, then it is represented by the polynomial $latex {qx - ry}&fg=000000$. If the start and final states are $latex {s}&fg=000000$ and $latex {f}&fg=000000$, then the counter machine accepts the null input if and only if the polynomial $latex {s - f}&fg=000000$ belongs to the ideal generated by the instructions.
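Before moving on, here is a sketch in Python of the arithmetic behind Minsky's representation trick above. It is our own illustration rather than his construction verbatim: with the tuple $latex {(a,b,c)}&fg=000000$ stored as $latex {n = 2^{a}3^{b}5^{c}}&fg=000000$ in one counter, incrementing $latex {b}&fg=000000$ means multiplying $latex {n}&fg=000000$ by $latex {3}&fg=000000$, decrementing it means dividing by $latex {3}&fg=000000$, and testing $latex {b=0}&fg=000000$ means testing divisibility by $latex {3}&fg=000000$. Each of these needs only the second counter as scratch space, since the remainder stays below the modulus and so fits into the finite control.

\begin{verbatim}
# Sketch of how a two-counter machine manipulates the encoding
# n = 2^a * 3^b * 5^c held in one counter, with the other as scratch.
# Only counter-legal steps are used: +1, -1, and zero tests.

def multiply(n, k):
    """Replace n by n*k (e.g. k = 3 increments the exponent b)."""
    scratch = 0
    while n > 0:            # drain n, adding k to scratch per unit
        n -= 1
        for _ in range(k):
            scratch += 1
    while scratch > 0:      # move the result back into the main counter
        scratch -= 1
        n += 1
    return n

def divmod_by(n, k):
    """Return (n // k, n % k); remainder 0 means k still divides n."""
    scratch = 0
    remainder = 0           # always < k, so it fits in the finite control
    while n > 0:
        n -= 1
        remainder += 1
        if remainder == k:
            remainder = 0
            scratch += 1
    return scratch, remainder

n = 2**3 * 3**2 * 5**1      # encodes (a, b, c) = (3, 2, 1)
n = multiply(n, 3)          # b := b + 1
q, r = divmod_by(n, 3)      # r == 0 certifies b > 0; q encodes (3, 2, 1) again
print(n, q, r)              # prints 1080 360 0
\end{verbatim}

In an actual two-counter machine these transfers between the two counters are orchestrated by the finite control, but the counting pattern above is the heart of the trick.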
Perceptrons
In 1969 Minsky and Seymour Papert published their book Perceptrons: An Introduction to Computational Geometry. It was later republished in 1987.
\includegraphics[width=2.5in]{PerceptronsCovers.png}
The book made a huge impression on me personally, but I completely missed its importance. The book is also notorious for having created a storm of controversy in the AI community. Perhaps everyone missed what it really was about. Let's take a look at what that was and why it was misunderstood.
The Book
For starters the book was unusual. While it is a math book, with definitions and theorems, it is easy to see that it looks different from any other math book. The results are clear, but they are not always stated in a precise manner. For example, Jan Mycielski's review in the Jan. 1972 AMS Bulletin upgrades the statement of the main positive result about learning by perceptrons. The grand effect of this---combined with hand-drawn diagrams---is a bit misleading. It makes the book not seem like a math book; it makes it very readable and fun, but somehow hides the beautiful ideas that are there. At least it did that for me when I first read the book. Even the original cover is in colors that are hard to read. Apparently this was done to make a point about human perception.
It is interesting that even Minsky said the book was misunderstood. As quoted in a history of the controversy over it, he said: ``It would seem that Perceptrons has much the same role as [H.P. Lovecraft's] The Necronomicon---that is, often cited but never read.''
With hindsight about how complexity theory developed, as the excitement over circuit lower bounds in the 1980s gave way to barriers in the 1990s, Ken and I propose a simple explanation for why the book was misunderstood:
It gave the first strong lower bounds for a hefty class of Boolean circuits, over a decade ahead of its time.
How hefty was shown by Richard Beigel, the late Nick Reingold, and Dan Spielman (BRS) in their 1991 paper, ``The Perceptron Strikes Back'':
Every family of $latex {\mathsf{AC^0}}&fg=000000$ depth-$latex {d}&fg=000000$ circuits can be simulated by probabilistic perceptrons of size $latex {2^{O(\log^{4d} n)}}&fg=000000$.
That is, $latex {\mathsf{AC^0}}&fg=000000$ has quasipolynomial size perceptrons. The key link is provided by families of low-degree polynomials of a kind that were first instrumental in Seinosuke Toda's celebrated theorem putting the polynomial hierarchy inside unbounded-error probabilistic polynomial time. In consequence of the strong 1980s lower bounds on constant-depth circuits computing parity, BRS observed that subexponential-size perceptrons cannot compute parity either.
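To give a flavor of such low-degree polynomials, here is the standard construction for OR over the two-element field; we note it is the textbook example, not the specific machinery of BRS. Choose random subsets $latex {S_1,\dots,S_t \subseteq \{1,\dots,n\}}&fg=000000$ and set

$latex \displaystyle p(x) \;=\; 1 \;-\; \prod_{j=1}^{t}\Big(1 - \sum_{i \in S_j} x_i\Big) \pmod{2}. &fg=000000$

If $latex {x=0}&fg=000000$ then $latex {p(x)=0}&fg=000000$; if some $latex {x_i=1}&fg=000000$ then each factor vanishes with probability at least $latex {1/2}&fg=000000$, so $latex {p(x)=\mathrm{OR}(x)}&fg=000000$ except with probability at most $latex {2^{-t}}&fg=000000$, while the degree is only $latex {t}&fg=000000$. Roughly speaking, composing such gadgets level by level is how low depth turns into low degree.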
That inability to compute parity is exactly what those Perceptrons editions symbolize on their covers. By the Jordan curve theorem, every simple closed curve divides the plane into ``odd'' and ``even'' regions. A perceptron cannot tell whether two arbitrary points are in the same region. This is so even though one need only count the parity of the number of crossings of a generic line segment between the points. Minsky and Papert went on to deduce that connectedness of a graphically-drawn region cannot be recognized by perceptrons either. Of course we know that connectedness is hard for parity under simple reductions and is in fact complete for deterministic logspace.
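The simplest instance of this parity obstacle can even be checked mechanically. The little brute-force sketch below is our own illustration and is not a proof for arbitrary real weights (the book's perceptrons are also a richer model than a single gate); it enumerates threshold gates with small integer weights and finds every two-bit Boolean function except parity and its complement.

\begin{verbatim}
# Brute-force illustration: no single threshold gate
# [w1*x1 + w2*x2 >= theta] with small integer weights computes XOR.
from itertools import product

def threshold_gate(w1, w2, theta):
    return tuple(int(w1 * x1 + w2 * x2 >= theta)
                 for x1, x2 in product((0, 1), repeat=2))

xor_table = tuple(x1 ^ x2 for x1, x2 in product((0, 1), repeat=2))

realizable = {threshold_gate(w1, w2, theta)
              for w1 in range(-3, 4)
              for w2 in range(-3, 4)
              for theta in range(-4, 5)}

print(len(realizable))          # 14 of the 16 two-bit functions show up
print(xor_table in realizable)  # False: parity is not among them
\end{verbatim}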
Missed Opportunities and Understandings
How close were Minsky and Papert to knowing they had $latex {\mathsf{AC^0}}&fg=000000$ and more besides? Papert subsequently wrote a book with Robert McNaughton, Counter-Free Automata. This 1971 book drew connections to algebraic and logical characterizations of formal languages, but did not achieve the power and clarity of descriptive complexity which flowered in the 1980s.
The other key concept they lacked was approximation. This could have been available via their probabilistic analysis but would have needed the mathematical sophistication employed by BRS: probabilistic polynomials and approximation by polynomials. Perhaps they could have succeeded by noting connections between approximation and randomized algorithms that were employed a decade later by the likes of Andy Yao. Of course, all of us who worked in the 1970s had those opportunities too.
The misunderstanding of their book reminds Ken of some attempts he's seen to prove that the $latex {\mathsf{NC}}&fg=000000$ hierarchy of polylog circuit depth (and polynomial size) is properly contained in polynomial time. Take a function $latex {f}&fg=000000$ which you can prove is not computed by your favorite family $latex {C_n}&fg=000000$ of (quasi-)polynomial size circuits of depth 1---or depth ``1-plus,'' as perceptrons are. We may suppose $latex {f(x)}&fg=000000$ always has the same length as $latex {x}&fg=000000$. Now define the function $latex {f'(x)}&fg=000000$ on any $latex {x}&fg=000000$ of length $latex {n}&fg=000000$ to be the $latex {n}&fg=000000$-fold composition $latex {f^n(x)}&fg=000000$. Then $latex {f'}&fg=000000$ still belongs to $latex {\mathsf{P}}&fg=000000$. You might expect to prove that computing $latex {f'}&fg=000000$ on inputs of length $latex {n}&fg=000000$ requires $latex {n}&fg=000000$ levels of your circuits $latex {C_n}&fg=000000$, thus placing $latex {f'}&fg=000000$ outside $latex {\mathsf{NC}}&fg=000000$ and establishing $latex {\mathsf{P \neq NC}}&fg=000000$.
But this conclusion about $latex {f'}&fg=000000$ does not follow ipso facto. The inability of one layer of perceptrons to compute simple functions $latex {f}&fg=000000$ does not extend to multiple layers---and some length-preserving offshoots $latex {f}&fg=000000$ of parity make $latex {f'}&fg=000000$ essentially the same function, no more complex (we give a toy example at the end of this section). Of course we understand this now about layers in circuit complexity. Getting snagged on this point---without sharp mathematical guidelines in the book---is what strikes us as coming out even in Wikipedia's discussion of perceptrons:
{}[The lower bounds] led to the field of neural network research stagnating for many years, before it was recognized that a feedforward neural network with two or more layers (also called a multilayer perceptron) had far greater processing power than perceptrons with one layer. ... It is often believed that [Minsky and Papert] also conjectured (incorrectly) that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function.
One can judge our point further relative to this in-depth paper on the controversy.
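To make that collapse concrete, here is a toy example of our own (it is not in the book or in Wikipedia): let $latex {f(x)}&fg=000000$ be $latex {x}&fg=000000$ with its first bit replaced by the parity of $latex {x}&fg=000000$. One layer of perceptron-like gates already struggles with $latex {f}&fg=000000$ because of parity, yet $latex {f \circ f}&fg=000000$ is the identity, so the $latex {n}&fg=000000$-fold composition $latex {f^{n}}&fg=000000$ is never harder than $latex {f}&fg=000000$ itself.

\begin{verbatim}
# Toy check that f o f is the identity, so f^n is either f or the
# identity and composing adds no complexity.
from itertools import product

def f(bits):
    parity = sum(bits) % 2
    return (parity,) + bits[1:]       # replace the first bit by the parity

for x in product((0, 1), repeat=4):
    assert f(f(x)) == x               # f is an involution

print("f o f is the identity on all 4-bit inputs")
\end{verbatim}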
Not Missing the Frontier
None of this should dim our appreciation of Minsky as a theorist. We offer a simple way to put this positively. The two-pronged frontier of circuit complexity lower bounds is represented by counting modulo composite numbers and by threshold circuits of small constant depth.
We haven't cracked counting modulo 6 any more than counting modulo any composite $latex {m}&fg=000000$ in general. Nor do we have strong lower bounds against such threshold circuits---while in the arithmetical case at least, high exponential bounds on depth 3 broadly suffice.
Well, a perceptron is a one-layer threshold circuit with some extra AND/OR gates, and modular counting is the first nub that Minsky and Papert identified. They knew they needed some extra layers and probably flew over the import of having an extra factor in the modulus, but the point is that they landed right around those frontiers---in 1969. They were after broader scientific exploration than ``$latex {\mathsf{P = NP}}&fg=000000$'' which hadn't yet become a continent on the map. This makes us wonder what kind of deeper reflection than worst-case complexity might land us closer to where solutions to the lower-bound impasse lie.
Open Problems
Our condolences to his family and the many people who worked with him and knew him best.