• This page documents and discusses parallel GPU algorithms for dot products involving sparse matrices.
  • dot(dense, csr.T) = dense
    • Lemma: csr.T is equivalent to interpreting the same csr input as a csc matrix
    • Algorithm: Treat the input as a csc matrix, use one GPU thread per column, scan every row of the dense matrix, and write to the corresponding position in the output dense matrix (see the first sketch below this list)
  • dot(dense, csr) = dense
    • Take One: Transpose the right-hand side csr matrix (in parallel fashion), then apply the above algorithm
    • Take Two: Convert the csr to coo through histogramming and an exclusive scan on the indices vector of the input csr, then sort all elements by column, then apply the above algorithm
    • Take Three (flaky): One GPU thread per row of the original input; for each element in the row, multiply by the corresponding element in the dense matrix, then atomically add into the corresponding destination in the output (flaky because float32 addition is not associative, so the non-deterministic ordering of the atomic adds makes results non-reproducible; see the second sketch below this list).
  • dot(csr.T, dense) = dense
    • Transpose the left-hand side to its csc form (in parallel fashion), then do the multiply-and-add for each element of the output matrix using a warp or a single thread, depending on the shape and size of the inputs (see the third sketch below this list).
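
Below is a minimal CUDA sketch of the dot(dense, csr.T) algorithm above, assuming float32 values, row-major dense storage, and int32 indices; the kernel name and argument names are hypothetical. Interpreting the csr arrays (indptr/indices/data) as csc, each thread owns one output column, so no atomics are needed and the result is deterministic.

// Hypothetical sketch: out = dot(dense, csr.T), where dense is M x K
// and the csr matrix is N x K (so csr.T is K x N and out is M x N).
// Reading indptr/indices/data as csc gives the transpose directly.
__global__ void DotDenseCsrTransKernel(const float* dense,   // M x K, row-major
                                       const float* data,    // csr values
                                       const int* indptr,    // length N+1
                                       const int* indices,   // column ids in [0, K)
                                       float* out,           // M x N, row-major
                                       int M, int K, int N) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // output column == csr row
  if (j >= N) return;
  for (int i = 0; i < M; ++i) {                   // scan every row of dense
    float acc = 0.0f;
    for (int p = indptr[j]; p < indptr[j + 1]; ++p) {
      acc += dense[i * K + indices[p]] * data[p];
    }
    out[i * N + j] = acc;  // this thread owns column j: no atomics needed
  }
}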
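For contrast, here is a hypothetical sketch of "Take Three" for dot(dense, csr), with the same storage assumptions as above. It shows why the approach is flaky: several threads may atomically add into the same output element, and since float32 addition is not associative, the non-deterministic ordering of the additions makes results vary across runs.

// Hypothetical sketch: out = dot(dense, csr), where dense is M x K,
// the csr matrix is K x N, and out (M x N) is zero-initialized.
// One thread per csr row; correct in exact arithmetic, but
// non-reproducible in float32 because of the atomicAdd ordering.
__global__ void DotDenseCsrAtomicKernel(const float* dense,   // M x K, row-major
                                        const float* data,    // csr values
                                        const int* indptr,    // length K+1
                                        const int* indices,   // column ids in [0, N)
                                        float* out,           // M x N, zeroed
                                        int M, int K, int N) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;  // csr row == dense column
  if (k >= K) return;
  for (int p = indptr[k]; p < indptr[k + 1]; ++p) {
    int j = indices[p];
    float v = data[p];
    for (int i = 0; i < M; ++i) {
      // Several threads may target out[i][j] concurrently.
      atomicAdd(&out[i * N + j], dense[i * K + k] * v);
    }
  }
}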
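Finally, a sketch for dot(csr.T, dense), assuming the left-hand side has already been transposed in parallel into its csc form (stored here as t_indptr/t_indices/t_data; the names are hypothetical). This version uses one thread per output element; the warp-per-element variant mentioned above would split the nonzero loop across lanes and reduce the partial sums with __shfl_down_sync.

// Hypothetical sketch: out = dot(csr.T, dense), where the original csr
// matrix is N x M (so csr.T is M x N), dense is N x P, and out is M x P.
// t_* hold the transposed (csc) form of the csr input.
__global__ void DotCsrTransDenseKernel(const float* t_data,   // transposed values
                                       const int* t_indptr,   // length M+1
                                       const int* t_indices,  // row ids in [0, N)
                                       const float* dense,    // N x P, row-major
                                       float* out,            // M x P, row-major
                                       int M, int P) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
  if (tid >= M * P) return;
  int i = tid / P;   // output row
  int j = tid % P;   // output column
  float acc = 0.0f;
  for (int p = t_indptr[i]; p < t_indptr[i + 1]; ++p) {
    acc += t_data[p] * dense[t_indices[p] * P + j];
  }
  out[tid] = acc;    // deterministic: one thread per output element
}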

6 Comments

  1. A quick question: Is there a CPU version of the parallel algorithm for the sparse matrix dot product?

    1. I think this algorithm can be migrated to CPUs with minor modifications to the details, so the big picture will still be the same.

      1. Thanks for your answer. 

        So is this migration in your plans?

        What is the estimated benefit (speedup/efficiency) this solution can gain on CPU and on GPU?

        1. The common issue with sparse matrices is that they usually represent large matrices that cannot be stored in dense form due to memory limitations, so it is also necessary to perform the related computations without converting them to dense form.

          For this algorithm I do have some rough benchmarks on GPU. The relative speedup compared to a fallback-to-dense solution depends on the sparsity of the input matrices (the sparser the input, the greater the speedup). I'll put those results and the benchmarking script on github once I finalize them.

          There's also a need to avoid atomicAdd operations because float additions are not associative, and the above algorithm works hard to avoid them.

          There's no plan to finish the CPU implementation right now, as nobody needs it yet, but we'll definitely support it when we have some time to do so.

          If you're interested in the actual GPU implementation you can take a look at my pending PR on github: https://github.com/apache/incubator-mxnet/pull/10371

          Please let me know if you have further questions, and thanks for your attention to this!

          1. Thanks for the rich info (smile) 

            It seems the dense solution on CPU is not that attractive, even though large memory helps a lot.

            Looking forward to seeing your benchmark scripts (or the real-world cases).

            We will evaluate it and decide whether we can take on the CPU part of this task.

            1. Update: The CPU implementation has been merged as well: #11113