This is an especially opportune time to carry out this exercise. The President's Information Technology Advisory Committee (PITAC) Report calls for an increase in support of information technology research of roughly a billion dollars over the next five years, and the Administration's proposed Federal budget for FY 2000 demonstrates a commitment to sustained growth in information technology research through its initiative, Information Technology for the Twenty-First Century (IT2). Since NSF will be the lead agency of IT2 and NSF is essentially the sole agency funding research in theory of computing, it is certainly appropriate to inquire into the impact of theoretical research on computer science and to attempt to measure the importance of theoretical research for solving strategic long-range problems as called for by the PITAC Report.
We find that theory of computing is a vibrant and exciting discipline of essential importance, about which we will demonstrate the following major points:
Theoretical research will need to play a critical role in meeting the challenges set forth in the PITAC report. New fundamental advances will be necessary, as will focused research in various application areas, where collaboration allows theoreticians and practitioners to inform and enrich one another's work.
The potential impact of theory of computing research is enormous, as history demonstrates; in view of this, funding for cutting-edge theory of computing should be increased dramatically. We find that there is a need to increase the number of researchers in theoretical computer science, to increase interaction among researchers, and to encourage various types of collaboration. To these ends, we conclude this report with a series of specific recommendations to NSF. We recommend a sizable increase in the number and size of awards, creation of a postdoctoral research program, and facilitation of proposals from groups of researchers at different institutions. Also, we recommend several new initiatives:
The historical role of theory of computing is clear: ``Theory pervades the daily practice of computer science and lends legitimacy to the very identity of the field.'' (Funding a Revolution: Government Support for Computing Research, Computer Science and Telecommunications Board, NRC). Today theory of computing is thriving. In this report we will provide evidence to demonstrate the following major points:
Research innovations in the last ten to fifteen years have resulted in new formulations and results that promise significant impact in the future. We now have fast (polynomial-time) algorithms that provide approximate answers to many NP-complete problems. We use randomized algorithms that provide fast solutions to hard problems with high probability. We employ interactive proof systems, which serve to convince one player of the truth of a statement known to a second player, to verify electronic exchanges. These are but a few examples of promising research directions. It is important to realize that several of these innovations resulted from theoreticians' attempts to understand fundamental aspects of computing. They sprang from the context of theory of computing and would not have occurred in the context of specific applications within the technologies that existed at the time those discoveries were made. It is also important to note that theoretical concepts have often taken decades to be assimilated into the mainstream of computing, but such assimilation has had profound practical impact.
Theory of computing ``underlies many aspects of the construction, explanation, and understanding of computers. ... Many ... theoretical concepts from different sources have now become so embedded in computing and communications that they pervade the thinking of all computer scientists.'' (ibid.)
One strength of theory of computing is that it is not tied to any particular technology, but provides tools for understanding and reasoning about technology in general. For example, the concept of an abstract machine and the simulation of one abstract machine by another, though invented by theoreticians decades ago, can help to understand the modern Web browser. Namely, the Web browser provides users with a common abstract machine across different computing platforms. When a user follows a link to data, a browser invokes the proper interpreter (an abstract machine) to process the data, for example to view an image or run a Java program.
To summarize, ``We need more basic research---the kind of groundbreaking, high-risk/high-return research that will provide the ideas and methods for new disciplinary paradigms a decade or more in the future. We must make wise investments that will bear fruit over the next forty years.'' (Information Technology Research: Investing in Our Future, PITAC Report, March 1998)
``The Committee finds that the Federal agenda for information technology R&D has moved too far in the direction of near-term applications development, at the expense of long-term, high risk, fundamental investigations in the underlying issues confronting the field.'' (PITAC Report, March 1998)

Here we explain how theoretical research will play a major role in carrying out the type of fundamental research that is called for in the PITAC report. As we proceed, we will substantiate the following points:
Lasting contributions to these and other applications areas are likely to be rooted not only in theoretical research that has an applications focus, but also in foundational research whose relevance to a given application may not be apparent at first.
The problems of providing reliable and secure electronic commerce are far from solved. There is a need to develop new cryptographic algorithms that offer added security and improved efficiency and that solve newly arising technical problems.
Cryptographic algorithms may be the cornerstone of network security, but on top of these algorithms one must add multiple layers of network protocols, a common security infrastructure, and software to actually realize all this. The security of a network depends on the security and soundness of each of these components, together with their interactions. Yet how can we tell whether the software that implements these protocols meets their specifications, or even whether those specifications guarantee the desired level of security? To deal with these very subtle problems, one needs tools from concurrency theory, semantics, and formal methods. Theoreticians have provided, and will continue to provide, formalisms for reasoning about the correctness and security of protocols. (Formal approaches were used to identify the exact source of the Pentium bug.)
With the growing size of the Internet and the amounts of data available, the scalability of our computational methods is becoming increasingly important. Extremely large problems arise when searching the Web, and in managing huge servers that have to service millions of requests per day against exabyte (10^18 bytes) data stores. Theoreticians have long focused on developing algorithms with low asymptotic complexity and methods that work well on large data sets that reside on distributed computing devices or external memories. With increased problem sizes this focus is becoming even more important. Current methods already incorporate many nontrivial theoretical ideas such as hashing, graph search techniques, and divide-and-conquer.
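To make one of these ideas concrete, the following sketch (illustrative only; the key names and the choice of SHA-256 are assumptions made for this example, not part of the report) shows how hashing can spread keys or requests nearly uniformly across a collection of servers.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to one of num_shards servers using a hash function.

    A good hash spreads keys nearly uniformly, so no single server
    receives a disproportionate share of the requests or data.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

if __name__ == "__main__":
    # Hypothetical request keys, distributed over 4 servers.
    for key in ["user:alice", "user:bob", "page:index.html", "img:logo.png"]:
        print(key, "-> server", shard_for(key, 4))
```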
Theoreticians typically create algorithms that serve as the basic building blocks of large systems. An important way for NSF to engage in long-term research in this area is to invest more in such building blocks. New technologies are bringing many new algorithmic problems into focus. Research on scalable methods and asymptotic complexity provides the best basis for addressing the challenge that these problems present. The following paragraphs elaborate on specific building blocks that theory of computing will contribute to scalable information systems.
Theoreticians have developed methods that provide astonishing speedups for large problem instances. Two of the most successful methods are randomization, such as random sampling, and approximation algorithms. These techniques can provide speedups that are well beyond what was previously conceivable. Research in these areas is very active, producing a continuous stream of dramatically improved algorithms for important problems. Random sampling is essential in order to obtain answers in sublinear time. When a problem is known to be hard, the typical current response is to resort to heuristics with no guarantees. Instead, approximation methods can lead to simple and practical algorithms that guarantee a close-to-optimal solution. The difference between an approximation algorithm and a heuristic is that the former comes with a proven performance guarantee. Approaches such as these would have been nearly impossible to discover in the absence of a suitable theoretical framework.
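As a concrete illustration of the distinction drawn above, the sketch below gives the classic greedy 2-approximation for minimum vertex cover; the small example graph is invented for the illustration. Unlike an ad hoc heuristic, this algorithm carries a proof that its output is never more than twice the optimum.

```python
def approx_vertex_cover(edges):
    """Greedy maximal-matching 2-approximation for minimum vertex cover.

    Repeatedly pick an uncovered edge and add both endpoints to the cover.
    The chosen edges form a matching, and any vertex cover must contain at
    least one endpoint of each matched edge, so the result is at most twice
    the optimum -- a provable performance guarantee.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover

if __name__ == "__main__":
    # Edges of a small example graph.
    edges = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 4)]
    print(approx_vertex_cover(edges))
```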
Thinking of the Web as a huge graph is frequently important for the development of Web search techniques. Graph-algorithmic methods, studied by theoreticians for decades, are used in current commercial Web search engines. Many new issues arise from the Web's enormous size and from the fact that the graph is dynamically changing and implemented as a distributed system. Another important aspect is the information latent in the Web's hyperlink structure. These issues lead to hard graph-algorithmic problems which, when solved, will be valuable for future Web search engines.
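As one illustrative sketch (not a description of any particular search engine), the following power-iteration computation assigns each page an importance score derived from the hyperlink structure; the damping parameter and the toy graph are assumptions made for the example.

```python
def link_rank(graph, damping=0.85, iterations=50):
    """Power iteration for a link-based importance score on a directed graph.

    graph maps each page to the list of pages it links to (every linked page
    is assumed to appear as a key).  A page's score is divided among its
    out-links; pages with no out-links spread their score uniformly.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            links = graph[p]
            share = rank[p] / len(links) if links else rank[p] / n
            for q in (links if links else pages):
                new_rank[q] += damping * share
        rank = new_rank
    return rank

if __name__ == "__main__":
    # A toy web graph, purely for illustration.
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(link_rank(toy_web))
```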
The fields of data mining and discovery science seek ways to find interesting properties in information, with applications to commerce, user interfaces, and more. The challenge is to query and process huge amounts of information quickly and accurately. Researchers are exploring new methods that are rooted in graph algorithms, information theory, and learning theory to advance these fields.
Many of the known asymptotically faster methods use linear space. Implementations of these linear-space algorithms utilize external data storage. Researchers are trying to determine whether we can instead make effective use of small internal memory. Theoreticians are addressing the problem on multiple fronts. Some approaches seek to compress data, often by provably good sampling techniques, so that the online processing can be done in the much faster internal memory. Another important line of research is to exploit locality in data layout in order to optimize access to the data in external storage. Theory is needed to identify the performance limits of these approaches and to determine when each is applicable.
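The following sketch illustrates one such provably good sampling technique, reservoir sampling, which maintains a uniform random sample of a data stream in a single pass so that subsequent processing can take place in internal memory; the stream in the example merely stands in for data read from external storage.

```python
import random

def reservoir_sample(stream, k):
    """One-pass uniform sample of k items from a stream of unknown length.

    After processing n >= k items, each item is in the sample with
    probability exactly k/n, so a small, provably representative summary
    of a huge external data set can be kept in fast internal memory.
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)      # random index in [0, i]
            if j < k:
                sample[j] = item
    return sample

if __name__ == "__main__":
    # The range object stands in for a stream read from external storage.
    print(reservoir_sample(range(1_000_000), 5))
```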
The rapid growth of the Internet, the demands on network bandwidth, and the stringent service requirements of real-time applications, such as on-demand video, highlight the need for improved data compression and caching. Theoreticians have made a variety of contributions to data compression. These contributions often require expertise in more than one theoretical domain, further underscoring the role of the theoretician. For example, prefetching and caching mechanisms based on universal data compression techniques have been shown to capture the locality present in data accesses and thus to optimize bandwidth.
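As an illustration of the flavor of such universal techniques, the sketch below implements a simple LZ78-style dictionary compressor; the incrementally learned dictionary of observed phrases is the kind of structure a prefetcher can use to predict future accesses. The input string is invented for the example.

```python
def lz78_compress(text):
    """LZ78-style universal compression.

    Output is a list of (phrase_index, next_char) pairs, where phrase_index
    refers to a previously seen phrase (0 is the empty phrase).  The
    dictionary is built on the fly, so no prior model of the data is needed.
    """
    dictionary = {"": 0}
    phrase = ""
    output = []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a leftover phrase: it is already in the dictionary.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

if __name__ == "__main__":
    print(lz78_compress("abababababa"))
```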
Model checking provides an automatic verification method for hardware and software. It has been very successful for large hardware designs because the regularity of hardware supports the use of compact data structures (e.g., binary decision diagrams) that can succinctly represent very large state spaces. To achieve similar success for software systems, more flexible representational techniques must be found. In regard to programming logics: modal and temporal logics provide the basis of model checking as specification languages, process calculi are commonly used to design and analyze network and cryptographic protocols, and the recent invention of proof-carrying code is an important new tool for network security.
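To convey the basic idea, the following sketch is a minimal explicit-state model checker that searches all reachable states of a toy two-process model for a violation of a mutual-exclusion invariant; the model is invented for the example, and real tools replace this explicit search with symbolic representations such as binary decision diagrams.

```python
from collections import deque

def check_invariant(initial, successors, invariant):
    """Explicit-state model checking of an invariant by breadth-first search.

    Explores every reachable state; returns a counterexample path to a
    violating state, or None if the invariant holds in all reachable states.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return list(reversed(path))
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

if __name__ == "__main__":
    # Toy model: two processes, each idle (0) or in its critical section (1);
    # this deliberately broken protocol lets either process enter freely.
    def successors(state):
        p, q = state
        return [(1 - p, q), (p, 1 - q)]

    mutual_exclusion = lambda s: not (s[0] == 1 and s[1] == 1)
    print(check_invariant((0, 0), successors, mutual_exclusion))
```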
While the tools and insights provided by model checking, logics of programs, and the like, are quite useful, each of these areas addresses only a very narrow set of problems. The practice of software development desperately needs a strong, sound mathematical foundation that encompasses the entire development process. The challenge for theory is to broaden and integrate these areas to deliver this foundation.
Related to biomolecular computing research is the area of molecular nanostructures, which seeks to reliably self-assemble large structures from a small number of molecular building blocks. Theoreticians have shown that the self-assembly process is controllable by appropriate design (``programming'') of the initial building blocks, thereby significantly expanding the potential for creation of interesting nanostructures.
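The following sketch is a purely abstract illustration of this idea, not a model of any real chemistry: a constant number of local assembly rules (here, each new cell takes the exclusive-or of its north and west neighbors) determines an arbitrarily large global pattern.

```python
def grow_xor_pattern(size):
    """Abstract 'self-assembly': each interior cell is the XOR of its
    north and west neighbors, with a seed row and column of 1s.

    A constant number of local rules thus determines an arbitrarily
    large global structure (a Sierpinski-like pattern).
    """
    grid = [[0] * size for _ in range(size)]
    for i in range(size):
        grid[i][0] = 1   # seed column
        grid[0][i] = 1   # seed row
    for i in range(1, size):
        for j in range(1, size):
            grid[i][j] = grid[i - 1][j] ^ grid[i][j - 1]
    return grid

if __name__ == "__main__":
    for row in grow_xor_pattern(16):
        print("".join("#" if cell else "." for cell in row))
```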
In addition to the preceding speculative technological advances, there are numerous innovative advances that already exist, but which are not being exploited to capacity due to insufficient knowledge about how to use them. A prime example is the use of clusters of workstations as a platform for parallel computing. Whereas the technology for achieving cooperation among workstations in a cluster is well developed, the environment that they present is sufficiently different from that of a multiprocessor that we do not yet know how to schedule parallel computations effectively on clusters. A few recent theoretical studies are just beginning to develop the understanding and the algorithms necessary to remedy this situation.
In addition, modeling of biomolecules raises new questions in geometry and physics. G. D. Rose wrote that ``what role a protein takes in the grand biological opera depends on exactly one thing: its shape. For a protein molecule, function follows form.'' (No assembly required, The Sciences, 36 (1996), 26-31). New models for biomolecules, such as alpha shapes, are being developed by theoretical computer scientists that can aid in the design of drugs, in the classification of molecular components, and in the simulation of docking and folding processes.
The proliferation of biological data, and the need for its systematic and flexible storage, retrieval, and manipulation, is creating opportunities in the database field. Genomic databases are heterogeneous, distributed, and semistructured, or have a schema that is in flux, and thus offer novel challenges to database design, including its more fundamental aspects. The study of gene expression, protein structure, and cell differentiation, as well as the emergence of microarrays, is introducing potentially unlimited data (while the whole human genome is less than one gigabyte, a limited size that initially misled many database researchers into underestimating the field). There are new challenges in the indexing of sequences and three-dimensional structures, in lab workflow storage and process modeling, and in processing queries that involve specialized approximate pattern matching and complex geometric relations such as docking. These problems would benefit greatly from the theory community's extensive experience with string matching, scheduling, and computational geometry. As has happened in the past with several other challenges to database research, early principled theoretical investigation of these problems is already an important ingredient of the field's research agenda.
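As an illustration of the kind of algorithmic machinery involved, the sketch below is the classic dynamic program for approximate pattern matching, reporting the positions in a text where a pattern occurs with at most k edits; the short DNA-like strings are invented for the example.

```python
def approx_matches(pattern, text, k):
    """Find end positions in text where pattern occurs with at most k edits.

    Classic dynamic program for approximate string matching: column entry i
    holds the minimum edit distance between pattern[:i] and some substring
    of the text ending at the current position (row 0 is all zeros, so a
    match may begin anywhere in the text).
    """
    m = len(pattern)
    prev = list(range(m + 1))          # column for the empty text prefix
    matches = []
    for pos, ch in enumerate(text, start=1):
        curr = [0] * (m + 1)           # curr[0] = 0: a match may start here
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(prev[i - 1] + cost,   # match or substitute
                          prev[i] + 1,          # text character unmatched
                          curr[i - 1] + 1)      # pattern character unmatched
        if curr[m] <= k:
            matches.append(pos)
        prev = curr
    return matches

if __name__ == "__main__":
    # Hypothetical short DNA-like strings, purely for illustration.
    print(approx_matches("ACGT", "TTACGTTACTTAGGT", k=1))
```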
Quite apart from technically-grounded evidence such as we present here, there are many other indicators of the importance, relevance, and health of theory of computing today. For example, consider the number of conferences centered on theoretical work, ranging from some that focus heavily on particular application domains to others that focus on foundational studies. Fundamental and innovative work in theoretical computer science has received favorable attention in broad-based scientific publications and in the national media over the past decade (for example, interactive proof systems, zero-knowledge, probabilistically checkable proofs, descriptive complexity); this has helped to make computer science research visible to those outside of our field. Researchers in theory of computing collaborate with computer scientists and engineers in many areas, with mathematicians, and with researchers in finance and the physical and biological sciences. Several leading theoreticians at major research institutions have founded their own companies in cryptography, security, web applications, and other areas. Many others are being lured from academia to be leading scientists at such companies, at large well-established corporations, and in arenas such as finance and biotechnology.
While many of the deep theoretical problems that are attracting the best minds in our field are rooted in, or inspired by, challenges such as those listed above, it is often difficult for researchers to properly tackle such problems in the context of an application-driven research environment. One reason for this is the long time period needed for work on fundamental problems to come to full fruition. In addition, solutions to such problems draw on diverse mathematical and computational methods, and so the interaction of a broad community of theoretical researchers is essential.
Moreover, the best theoretical results typically influence the course of research in several application areas, and so it is extremely useful to maintain an identifiable corpus of theoretical knowledge that is accessible to the Computer Science community and to the scientific research community at large.
A third reason for the importance of foundational research that is not targeted at a specific application is that, because of its focus on high-risk and speculative problems, and because of the often surprising or unpredictable consequences of theoretical research, it provides the seed corn for the innovations of the future. This is underscored by the introductory quote of this section.
For all of these reasons, unfettered research in foundational theoretical areas is vital, in order to provide better understanding of the capabilities and limitations of computers, and to ensure future innovations in science and technology.
The theory of computing community continues to produce wonderful fundamental ideas, and, over time, these influence practice in important ways. The interplay among concepts such as pseudorandom number generation, interactive proofs, and secure cryptographic protocols is beautiful and deep, and holds significant potential for the practice of cryptography. The introduction of interactive proof systems and probabilistically checkable proofs has broadened and enriched our understanding of the concept of proof. Probabilistically checkable proofs have turned out to be a fundamental tool for studying the limits of polynomial-time approximation algorithms.
Foundational questions will require a concerted effort in the areas of classical Turing machine-like models and variants (such as randomized or quantum models), models of learning, formal methods and program inference, models of nonsymbolic reasoning, logical characterization of complexity classes, lower bounds, models of on-line computation, models for communication of information, models for inferring information from incomplete data, models for data storage and retrieval in a multimedia context, and parallel and distributed models. Study of connections between models and results in more than one of these areas can be particularly fruitful. For example, on-line algorithms, common in key frameworks such as operating systems, financial control, and real-time systems, have generated fundamental concepts that are important for distributed systems design, in particular where information flow is more complex than traditional input/output.
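To illustrate the flavor of on-line computation, the sketch below implements the classic ``ski-rental'' strategy, which, without knowledge of the future, never pays more than twice the optimal off-line cost; the rental and purchase costs are invented for the example.

```python
def ski_rental(days_skied, rent_cost=1, buy_cost=10):
    """Classic on-line strategy: rent until the rent already paid would
    reach the purchase price, then buy.

    Without knowing days_skied in advance, the total cost is at most
    twice the optimal off-line cost (the strategy is 2-competitive).
    """
    total = 0
    for day in range(1, days_skied + 1):
        if (day - 1) * rent_cost < buy_cost:
            total += rent_cost          # keep renting
        else:
            total += buy_cost           # buy and pay nothing afterwards
            break
    return total

if __name__ == "__main__":
    for d in (3, 10, 30):
        alg = ski_rental(d)
        opt = min(d * 1, 10)            # off-line optimum knows d in advance
        print(f"days={d:2d}  on-line cost={alg:2d}  off-line optimum={opt:2d}")
```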
In other cases (such as linear programming), initial breakthroughs in reducing asymptotic running time, while not practical in and of themselves, serve to stimulate new research that eventually leads to practical algorithms. What tends to happen is that once a barrier, such as the existence of a polynomial-time algorithm for a problem, is broken, there is strong justification and real motivation for researchers to revisit a problem that previously appeared impenetrable. Very often, painstaking refinement of the seminal breakthrough technique leads to a truly practical algorithm. Ultimately, this type of research has long-lasting impact on the practice of computing. Tantalizing open questions remain, such as whether there is a polynomial-time algorithm for graph isomorphism, or whether one can efficiently learn Boolean formulas in disjunctive normal form from random examples.
On numerous occasions, theory of computing research has provided the insights that explain why popular, important heuristic algorithms work. Significantly, these insights have suggested major improvements to the heuristics. One example of this scenario was the study by Turner in the 1980s that explained why the decades-old Cuthill-McKee heuristic for minimizing the bandwidth of sparse linear systems works well in practice; this understanding allowed Turner to devise an improved heuristic. A second example, also from the 1980s, explained why the Kernighan-Lin graph bisection heuristic, which is important in the field of circuit layout, works well in practice.
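To convey the flavor of such a heuristic, the sketch below gives a minimal reverse Cuthill-McKee-style reordering (a breadth-first search from a minimum-degree vertex, visiting neighbors in increasing order of degree); it illustrates the basic idea only and does not reproduce the analyses cited above. The example graph is invented.

```python
from collections import deque

def cuthill_mckee_order(adj):
    """Cuthill-McKee-style reordering of a sparse symmetric matrix/graph.

    Breadth-first search starting from a vertex of minimum degree, visiting
    neighbors in order of increasing degree.  Reversing the resulting order
    (reverse Cuthill-McKee) typically reduces the matrix bandwidth further.
    """
    order = []
    visited = set()
    # Handle each connected component, seeding with a minimum-degree vertex.
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return list(reversed(order))   # the "reverse" in reverse Cuthill-McKee

if __name__ == "__main__":
    # A small example graph given as an adjacency list.
    graph = {0: [3, 5], 1: [2, 4], 2: [1, 4], 3: [0, 5],
             4: [1, 2, 6], 5: [0, 3], 6: [4]}
    print(cuthill_mckee_order(graph))
```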
We learned that unfettered research in foundational theoretical areas, research that is not targeted at specific applications, is vital. Historic evidence demonstrates that such research transforms computer science dramatically.
Our recommendations fall into two broad categories. The first addresses ways in which traditional NSF funding for theory of computing needs to be strengthened. The second lists innovative new initiatives for enhancing research capabilities. The traditional grant program should be improved by the following means: