The report of the 2008 Extremely Large Database (XLDB) Workshop, a conference that addresses methods, architectures, and best practices for data-intensive science, observed, “For the largest-scale datasets, there is no debate that computation must be moved close to where the data resides, rather than moving the data to the computation.” New programming models such as MapReduce have shown excellent performance for data processing on large clusters (e.g., in Google's production applications) with relatively low programming overhead. Hadoop is a widely used open-source implementation of the MapReduce programming model.
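To make the programming model concrete, the sketch below follows the canonical word-count example in the style of the Hadoop MapReduce tutorial: the mapper emits a (word, 1) pair for every token in its input split, and the reducer sums the counts for each word. Because Hadoop schedules map tasks on the nodes that store the corresponding input blocks, the computation moves to the data rather than the data to the computation. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, not drawn from the cited sources.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on the nodes holding the input blocks and
  // emits (word, 1) for every token it encounters.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  // Also usable as a combiner to pre-aggregate counts on the map side.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a jar, such a job would typically be launched with "hadoop jar wordcount.jar WordCount <input-dir> <output-dir>", where the input and output paths are placeholders.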
Simultaneously, active disks have been implemented commercially in the form of the Netezza Performance Server (NPS), a data-intensive supercomputer (DISC). Results from the NPS have proven quite remarkable, with data analysis delivered orders of magnitude faster than on currently used platforms.
The Center for Computational Research (CCR) at UB provides extensive on-line documentation and a wide variety of training, including hosted workshops. Topics include the fundamentals of parallel computing, an introduction to CCR, debugging and profiling tools, and bioinformatics resources.
Software packages in bioinformatics, chemistry/biochemistry, engineering, and physics currently installed and maintained by CCR staff are listed in CCR's on-line documentation.