CS181 - Fall 2006
Problem Set 1 -- Matrix Multiplication

This assignment has two parts. The first part involves programming and is due at 11:30PM on Tuesday, September 12. The second does not involve programming and is due at the beginning of class at 2:45PM on Wednesday, September 13.

### Part I: Optimizing Dense Matrix Multiplication

For the first part of this problem set you will write optimized code for doing matrix multiplication:

C = C + A*B,

where A, B, and C are all square matrices. (If you don't remember how matrix multiplication works, this is a good time to review your linear algebra notes.)
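For reference, the update can be written as three nested loops. The sketch below is not the provided implementation: the function name `naive_matmul` and the row-major storage order are assumptions, so check the supplied code for the layout it actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; the provided 3-loop code may differ.
 * Computes C = C + A*B for n-by-n matrices stored row-major
 * (an assumption): element (i, j) lives at index i*n + j. */
void naive_matmul(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = C[i * n + j];   /* start from the existing C entry */
            for (size_t k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Note that the innermost loop walks along a row of A but down a column of B, so under this layout the two operands are accessed with very different strides.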

The obvious matrix multiplication code (consisting of three nested loops) peaks at under 140 MFlops on lucy, a machine with a 3.06 GHz Intel Xeon processor (the same processor used in a supercomputer at the US Army Research Lab). Furthermore, the performance seems to decrease (slightly) as the problem size increases, and for some matrix sizes it falls to about 20 MFlops (see the figure). This is terrible.
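To put these rates in context, an n-by-n update of this form performs 2n^3 floating-point operations (n^3 multiplications and n^3 additions). A minimal sketch of the conversion from running time to MFlops follows. The helper name `matmul_mflops` is hypothetical, and it is an assumption that the test program counts operations this way.

```c
#include <assert.h>

/* Hypothetical helper: MFlops achieved by one n-by-n C = C + A*B
 * that took `seconds` of wall-clock time, counting 2*n^3 operations. */
double matmul_mflops(unsigned n, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (seconds * 1.0e6);
}
```

By this count, a 1000-by-1000 multiplication running at 140 MFlops would take roughly 14 seconds.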

Fortunately, with your knowledge of uniprocessors, you have reason to believe that you are personally capable of optimizing the code so it achieves significantly better performance (while still computing the right thing, of course).

You can find a makefile, the test program, a basic 3-loop implementation of matrix multiplication, and a gnuplot file for generating plots of the timings here. That should be enough to get you started on writing your own optimized matrix multiplication subroutine.

#### To Submit

This part of the assignment is due by 11:30PM on Tuesday, September 12th. In class on September 13th you will be expected to give a short (5-10 minute) presentation on the optimizations you made. In addition you should submit the following:
• The file containing your matrix multiplication code
• A makefile that creates an executable of the given matmul.c with your code. (Alternatively, a very precise description of what I need to do to create the executable.)
• A writeup of what you tried, what you found worked, an explanation for why your improved code exhibits the behavior it does, and so on. This should include either a plot or a table showing matrix size vs. mflops of the final version of your code.
• You should submit these either by emailing me everything or by putting it on a web page and giving me a pointer.

Your grade will be based on the set of things you considered or tried, your explanations of why the "optimizations" you tried did or did not help, your explanation of why your final code is truly optimized, and the performance of your subroutine.

#### Rules

• Everyone must submit their own code.

• Your subroutine must work with the test program given (note that the matrix sizes I end up testing may differ from those currently in the test code). However, any language (including assembly) is fine.
• You may not use code that you find on the web.
• You may not use an automatic code generator (e.g., PhiPAC or ATLAS) to generate your code.
• However, you are encouraged to look for information on how those generators work, for more information on the actual architecture of the machine, and for any other information that you think might help you write the fastest code possible.
• You may use compiler optimizations, as long as you adequately explain why you thought to use them, what they do, and why that's a good thing.

• You'll find the test program, a makefile, a basic 3-loop implementation of matrix multiplication, and a gnuplot file for generating plots of your running times here.
• Lucy is a Dell PowerEdge 1750, with a Xeon processor (actually it has 2, but your code will only be running on one). The processor is rated at 3.06 GHz. It has an 8K L1 data cache, a 512K L2 cache, a 1M L3 cache, and 2GB memory.
• PhiPAC and ATLAS are automatic code generators for tuned matrix multiplication kernels.
• You can get information about the processor here.

### Part II: Variations on Dense Matrix Multiplication

Many variations on this problem come up in practice. For this part of the assignment you'll consider two that come up relatively frequently.

Discuss how your optimization strategy from Part I would change in each of the following situations:

• You are multiplying a matrix and a vector:

z = z + A*x,

where A is a square matrix and x and z are vectors.

• Often the matrix A is sparse rather than dense: most of its entries are zero, so it is usually not stored as a 2-D array. Instead, the data structure stores only the values of the nonzero entries and their locations in the matrix.

Now you're multiplying a sparse matrix and a vector:

z = z + A*x,

where A is a square, sparse matrix and x and z are (dense) vectors.

In practice sparse matrices often come up when modelling some structure that exists "in the real world"; the nonzeros represent relationships between objects that are "close" or "similar" in some way. Sparse matrix-vector multiplications are necessary for certain eigensolvers and iterative solvers for linear systems (among other computations).
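To make the two variations concrete, here is a minimal, unoptimized sketch of each. None of this comes from the assignment's code: the function names, the row-major dense layout, and the choice of compressed sparse row (CSR) storage are all assumptions. CSR is only one common way to store nonzero values together with their locations.

```c
#include <assert.h>
#include <stddef.h>

/* Dense case: z = z + A*x, with A stored row-major (an assumption). */
void dense_matvec(size_t n, const double *A, const double *x, double *z)
{
    for (size_t i = 0; i < n; i++) {
        double sum = z[i];
        for (size_t j = 0; j < n; j++) {
            sum += A[i * n + j] * x[j];   /* each entry of A is used exactly once */
        }
        z[i] = sum;
    }
}

/* Hypothetical CSR container for an n-by-n sparse matrix.
 * The nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1
 * of col_idx (column of each nonzero) and val (its value). */
typedef struct {
    size_t n;
    const size_t *row_ptr;   /* length n + 1 */
    const size_t *col_idx;   /* length = number of nonzeros */
    const double *val;       /* length = number of nonzeros */
} csr_matrix;

/* Sparse case: z = z + A*x, visiting only the stored nonzeros. */
void csr_matvec(const csr_matrix *A, const double *x, double *z)
{
    for (size_t i = 0; i < A->n; i++) {
        double sum = z[i];
        for (size_t p = A->row_ptr[i]; p < A->row_ptr[i + 1]; p++) {
            assert(A->col_idx[p] < A->n);        /* column index must be in range */
            sum += A->val[p] * x[A->col_idx[p]]; /* indirect access into x */
        }
        z[i] = sum;
    }
}
```

In the dense routine, the access pattern is fixed by n alone. In the CSR routine, the reads of x depend on the column indices stored in the matrix, so the access pattern is not known until run time.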

#### To Submit

This part of the assignment is due by 2:45PM on Wednesday, September 13th. At the beginning of class you should hand in your responses (typed, not handwritten).

Your grade will be based on how well you've thought through the issues and how reasonable your proposed changes are.