# Sparse Matrix Dot Product Parallel Algorithm Design

• This page is for documentation and discussion of parallel algorithms on GPU for dot product involving sparse matrices.
• dot(dense, csr.T) = dense
• Lemma: csr.T is equivalent to interpreting the same csr input as a csc matrix
• Algorithm: Treat the input as a csc matrix, use one GPU thread per column, scan for every row in dense matrix, output to corresponding position in output dense matrix
• dot(dense, csr) = dense
• Take One: Transpose the right-hand side csr matrix (in parallel fashion), then apply the above algorithm
• Take Two: Change csr to coo through a histogrammming and exclusive scan on the indices vector of input csr, then sort all elements by column, then apply the above algorithm
• Take Three(flaky): One GPU thread per row on original input, for each element in the row, multiply by the corresponding element in the dense matrix, then atomically add to corresponding destination in output (flaky due to float32 additions do not obey commutative law).
• dot(csr.T, dense) = dense
• Transpose the lefthand side to it's csc form (in parallel fashion), then do the multiply-and-add for each element in the output matrix using a warp or single thread depending on the shape and size of the inputs.
• No labels

1. A quick question: Is there CPU version of the parallel algorithm for sparse matrix dot?

1. I think this algorithm can be migrated onto CPUs with minor modifications to the details, so the big picture will still be the same.

So is this migration in your plan?

What's the estimated benefit (speedup/efficiency) this solution can gain from CPU and GPU?

1. The common issue with sparse matrices is that they usually represent large matrices that cannot be represented in dense form due to memory limitations, so it would also be necessary to perform related computations without converting them to dense forms.

For this algorithm I do have some rough benchmarks on GPU, the relative speedup compared to a fallback-to-dense solution depends on the sparsity of the input matrices (the more sparse, the more speedup), I'll put those results and the benchmarking script on github once I finalize it.

There's also a need to avoid atmoicAdd operations because float additions are not commutable, the above algorithm works hard to avoid it.

There's no plan on finishing up the CPU implementation right now as it's not needed by anyone yet, but we'll definitely support that when we have some time to do so.

If you're interested in the actual GPU implementation you can take a look at my pending PR on github: https://github.com/apache/incubator-mxnet/pull/10371

Please let me know if you have further questions, thanks for your attention on this!

1. Thanks for the rich info Seem dense solution on CPU is not such attractive even big memory helps a lot.

Looking forward to seeing your benchmark scripts (or the part of real cases).

We will evaluate it and decide if we can take CPU part of this task.

1. Update: The CPU implementation has been merged as well #11113