http://www.stanford.edu/~navneet/ee392c.html
Navneet Aron, Shivnath Babu, Sorav Bansal,
and Abhyudaya Chodisetti
Abstract
With the advent of high-speed link technologies, the bottleneck in network
processing is shifting to the CPU. Dedicated network processors are typically
used to keep up with high-speed network traffic, and TCP/IP processing may
become a bottleneck when a general-purpose processor is used instead. In this
paper, we examine whether TCP/IP stack processing can be made more efficient by
parallelizing it, making it well suited for chip multiprocessors. We implemented
a parallel TCP stack by extending LwIP, an existing TCP stack for embedded
devices. We found that parallelizing the LwIP stack does indeed improve
performance, and we demonstrate the improvement through experiments on a
shared-memory multiprocessor machine.
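The abstract does not say how the stack was decomposed; one common way to parallelize TCP processing is connection-level parallelism, where each flow is hashed to a fixed worker thread so per-connection TCP state needs no cross-thread locking. A minimal sketch of that idea (all names are illustrative, not LwIP APIs):

```python
# Connection-level parallelism sketch: packets of the same flow always reach
# the same worker, so per-connection state (sequence numbers, windows) is
# accessed by exactly one thread and needs no locking.
import queue
import threading

NUM_WORKERS = 4

def connection_hash(src_ip, src_port, dst_ip, dst_port):
    """Map a 4-tuple to a worker index; the same flow always maps the same way."""
    return hash((src_ip, src_port, dst_ip, dst_port)) % NUM_WORKERS

class ParallelStack:
    def __init__(self, process_segment):
        self.queues = [queue.Queue() for _ in range(NUM_WORKERS)]
        self.workers = [
            threading.Thread(target=self._worker, args=(q, process_segment),
                             daemon=True)
            for q in self.queues
        ]
        for w in self.workers:
            w.start()

    def dispatch(self, pkt):
        """Demultiplex an incoming packet to its flow's worker queue."""
        idx = connection_hash(pkt["src_ip"], pkt["src_port"],
                              pkt["dst_ip"], pkt["dst_port"])
        self.queues[idx].put(pkt)

    def _worker(self, q, process_segment):
        while True:
            pkt = q.get()
            process_segment(pkt)   # per-connection TCP processing, lock-free
            q.task_done()
```

The dispatcher is the only shared stage; everything after it scales with the number of distinct connections.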
Joel Coburn, Jayanth Gummaraju, Varun
Malhotra, Janani Ravi, and Suzanne Rivoire
Abstract
Reconfigurable
architectures present many opportunities for specialized optimization of
hardware and software, which can often be harnessed only with a high-level
understanding of the programs to be run. Using generic standard libraries is a
convenient mechanism for the programmer to convey this high-level information
to the compiler. In this paper, we propose compile-time techniques for
optimizing programs that use the C++ Standard Template Library and a framework
for integrating the techniques into a compiler. We also show ways to refine our
analyses in response to future experimental results.
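The paper's framework is not detailed in the abstract, but the core idea, that knowledge of a standard library's semantics lets the compiler choose better implementations, can be illustrated with a toy example: if the compiler knows a container is kept sorted, a linear find can be lowered to a binary search. Here "compilation" is modeled as picking an implementation from container properties (a hypothetical stand-in, not the paper's technique):

```python
# Toy library-aware optimization: choose a find() implementation based on
# high-level knowledge about the container, the way an STL-aware compiler
# might specialize std::find for a sorted sequence.
import bisect

def compile_find(container_props):
    """Pick a find() implementation from high-level library knowledge."""
    if container_props.get("sorted"):
        def find(seq, x):            # binary search: O(log n)
            i = bisect.bisect_left(seq, x)
            return i if i < len(seq) and seq[i] == x else -1
    else:
        def find(seq, x):            # generic linear scan: O(n)
            for i, v in enumerate(seq):
                if v == x:
                    return i
            return -1
    return find
```

Both implementations agree on all inputs; only the asymptotic cost differs, which is exactly the kind of substitution that requires semantic rather than syntactic information.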
Metha Jeeradit, Jean Suh, Honggo Wijaya,
and Chi Ho Yue
Abstract
Designing a CMP system is a hard optimization problem because of
the large parameter space involved. In this paper, we present a method for automatically
determining an ideal CMP configuration for a given application set using
genetic programming. Our approach is a static, software-based implementation
that provides an effective method of searching for the ideal hardware
configurations. The results of our approach can also be used as a training set
for a neural network extension that can dynamically reconfigure the hardware
based on the applications being run.
Initial results are promising, showing substantial improvement between the
best configuration found in the first iteration and that found in the last.
The algorithm also converges quickly, within 20 generations, for a combined
application set.
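The genome, fitness model, and GA parameters are not given in the abstract; the loop below is only an illustrative sketch of genetic search over a CMP parameter space, with a made-up analytic fitness standing in for simulated performance:

```python
# Illustrative genetic search over CMP configurations. The configuration
# space (core count, L2 size) and the fitness model are assumptions, not
# the paper's actual setup.
import random

CORES = [1, 2, 4, 8, 16]
L2_KB = [256, 512, 1024, 2048, 4096]

def fitness(cfg):
    cores, l2 = cfg
    # Hypothetical model: speedup saturates with core count, cache helps up
    # to a point, and exceeding an area budget is penalized.
    speedup = cores / (1 + 0.1 * cores)
    cache_bonus = min(l2, 1024) / 1024
    area_penalty = 0.2 if cores * l2 > 8192 else 0.0
    return speedup + cache_bonus - area_penalty

def evolve(generations=20, pop_size=16, seed=0):
    rng = random.Random(seed)
    pop = [(rng.choice(CORES), rng.choice(L2_KB)) for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_per_gen.append(fitness(pop[0]))
        survivors = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = (a[0], b[1])                   # one-point crossover
            if rng.random() < 0.2:                 # mutation
                child = (rng.choice(CORES), child[1])
            children.append(child)
        pop = survivors + children                 # elitism: parents survive
    return best_per_gen
```

Because the parents survive each generation, the best fitness is non-decreasing, which mirrors the improvement-between-first-and-last-iteration measurement the abstract describes.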
Jing Jiang, Ilya Katsnelson, and Ernesto
Staroswiecki
Abstract
The drive to
achieve high levels of availability and reliability in computer systems has
fueled the development of fault-tolerance techniques for some time. These
techniques, however, have been designed mostly with single-processor,
symmetric multiprocessor (SMP), or small chip-multiprocessor (CMP) systems in
mind (i.e., up to two cores per chip) [1][2][3]. This work needs to be
extended to larger CMP and polymorphic systems.
In this paper we present a technique to detect and recover from both
transient and permanent errors within a chip multiprocessor. We also run
benchmarks in the presence of faults to evaluate both the overall performance
degradation caused by errors under our fault-tolerance scheme and the local
performance profile, to better understand the implications of a fault for a
processor element and its neighbors.
Finally, since this is not only a research paper, but also a class report
for EE392C, Spring 2003, we describe our experience while working on this
project, as well as most of our ideas that are now part of the future work
section.
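The abstract does not spell out the detection and recovery mechanism; a common CMP approach is redundant execution with comparison, where a mismatch triggers re-execution (covering transient faults) and repeated mismatches cause the faulty core to be identified by a third core and retired (covering permanent faults). A hedged sketch of that scheme, not the paper's exact design:

```python
# Dual-modular-redundancy sketch: run each task on two cores, compare, retry
# on mismatch, and retire a core that keeps disagreeing. The fault model
# (a corrupted integer result) is purely illustrative.
RETRY_LIMIT = 2

class Core:
    def __init__(self, core_id, fault=None):
        self.core_id = core_id
        self.fault = fault          # None, "transient", or "permanent"
        self.healthy = True

    def run(self, task, arg):
        if self.fault == "permanent":
            return task(arg) ^ 1    # always corrupts the result
        if self.fault == "transient":
            self.fault = None       # corrupts once, then recovers
            return task(arg) ^ 1
        return task(arg)

def execute_checked(cores, task, arg):
    """Detect via comparison; recover via retry, then core retirement."""
    while True:
        a, b = [c for c in cores if c.healthy][:2]
        for _ in range(RETRY_LIMIT):
            ra, rb = a.run(task, arg), b.run(task, arg)
            if ra == rb:
                return ra
        # Repeated disagreement suggests a permanent fault: a third healthy
        # core votes to identify the bad core, which is then retired.
        referee = [c for c in cores if c.healthy and c not in (a, b)][0]
        rr = referee.run(task, arg)
        bad = a if a.run(task, arg) != rr else b
        bad.healthy = False
```

The transient case costs only a retry, while the permanent case degrades the system by one core, the kind of local performance impact the benchmarks above are meant to measure.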
Dave Bloom, Brad Schumitsch, Garret Smith,
and John Whaley
Abstract
We investigated techniques for automatically exploiting method-level
parallelism using method-level speculation. Our system, AMP, profiles the code
and then automatically identifies candidates for speculation. We constructed a
full tool chain for trace simulations and investigated a variety of heuristics
for choosing which methods to speculate on. On a set of standard sequential
benchmarks running on a 4-processor system, our best heuristic achieves an
average running time of only 73% of the sequential version's.
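The abstract names profiling and heuristic candidate selection but not the heuristics themselves. One plausible shape for such a heuristic, sketched here with assumed names and thresholds, is to speculate only on methods whose expected overlapped work outweighs the expected misspeculation and rollback cost observed in the profile:

```python
# Hypothetical speculation-candidate heuristic (not AMP's actual rules):
# a method qualifies when the work saved by successful speculation exceeds
# the cost of wasted work plus rollback on misspeculation.
ROLLBACK_COST = 50          # cycles to squash and re-execute (assumed)
MIN_WORK = 100              # methods shorter than this aren't worth forking

def pick_candidates(profile):
    """profile: {method: (avg_cycles, misspec_rate)} -> ranked candidates."""
    scored = []
    for method, (avg_cycles, misspec_rate) in profile.items():
        if avg_cycles < MIN_WORK:
            continue                               # fork overhead dominates
        gain = (1 - misspec_rate) * avg_cycles     # expected overlapped work
        loss = misspec_rate * (avg_cycles + ROLLBACK_COST)
        if gain > loss:
            scored.append((gain - loss, method))
    return [m for _, m in sorted(scored, reverse=True)]
```

Ranking by net expected benefit lets a runtime with a fixed number of speculative contexts spend them on the most profitable methods first.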
Rohit Kumar Gupta, Paul Wang Lee, Wajahat
Qadeer, and Rebecca Sara Schultz
Abstract
Vector processors have been used successfully to obtain significant speed-ups
for data-parallel applications. However, vector architectures may not be
suitable for general-purpose computing, and many novel techniques have been
developed for that purpose, most notably in the direction of multithreading.
We have developed a core for a scalable chip multiprocessor that efficiently
combines both approaches. We present a hardware overview, discuss details of
the pipeline and instruction set, and demonstrate how our system compares to
other scalar and vector systems.
Amin Firoozshahian, Arjun Singh and John
Kim
Abstract
As the number of input
ports in an on-chip interconnection network scales, existing structures such
as buses and crossbars run into limitations. Other topologies such as a torus
or a mesh scale well, but their large path diversity makes it difficult to
provide the ordering that is often desired. This paper presents two schemes to
overcome these difficulties: a multistage arbitration scheme, which divides
the arbitration so that the arbitration cycle time can be kept small, and a
Clos-like structure, which scales better and also achieves higher throughput.
To evaluate the two schemes, we developed a cycle-accurate simulator and
compared their performance in terms of throughput, latency, and logic cost.
The simulation results, together with the properties presented in the paper,
show that a Clos-like network achieves much higher throughput at the cost of
more logic.
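For context on why a Clos-like structure scales better than a single crossbar, a standard three-stage Clos(m, n, r) network gives every input/output pair exactly m distinct paths (one per middle switch) and is rearrangeably nonblocking when m >= n, while needing fewer crosspoints than one large crossbar. A small sketch of these textbook properties (not the paper's simulator):

```python
# Three-stage Clos(m, n, r): r ingress switches with n input ports each,
# m middle switches, r egress switches; every ingress connects to every
# middle switch, and every middle switch to every egress.
def clos_paths(m, n, r, src, dst):
    """Enumerate the (ingress, middle, egress) paths from input src to output dst."""
    ingress = src // n            # first-stage switch owning this input
    egress = dst // n             # third-stage switch owning this output
    return [(ingress, mid, egress) for mid in range(m)]

def rearrangeably_nonblocking(m, n):
    """Clos's classic condition for rearrangeable nonblocking operation."""
    return m >= n

def crosspoints_crossbar(num_ports):
    return num_ports * num_ports                  # grows quadratically

def crosspoints_clos(m, n, r):
    # r ingress (n x m) + m middle (r x r) + r egress (m x n) switches.
    return 2 * r * n * m + m * r * r
```

For a 32-port network (r = 8, n = 4, m = 4), the Clos structure uses 512 crosspoints against the crossbar's 1024, at the price of extra hops and switching logic, which matches the throughput-versus-cost trade-off the simulations report.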