I currently want to multiply a large sparse matrix(~1M x 200k) with its transpose. The values of the resulting matrix would be in float.

- I tried loading the matrix in scipy's sparse matrix and by multiplying each row of first matrix with the second matrix. The multiplication took ~2hrs to complete.

What is the efficient way to achieve this multiplication? Because I see a pattern in the computation.

- The matrix being large and sparse .
- The multiplication of a matrix with its transpose. So, the resulting matrix would be symmetric.

I would like to know what libraries can achieve the computation faster. It can be in Python, R, C, C++ or any other one.

I suppose your main need is to save memory. First as you multiply a matrix with its transpose, you do not need any memeory for the transpose : all of its cells are directly accessible through first matrix (tA[i,j] = A[j,i]). Almost 1/3 of memory saved.

I can see that computation time cannot be neglected too. As the resulting matrix will be symetric, you can compute only one half and directly store the other. Near half of computation time saved.

And if you are sure that you initial matrix is really sparse, and so can hope the resulting one will be too, you can directly store the result in a scipy sparse matrix, COO format :
only three lists to store the non null values.

But ... I do not know any library to do that and you will have to code it yourself in your prefered language (probably python as you spoke of scipy).

Python code example (matrix = A[M][N])

```
I = []
J = []
V = []
for i in range(M):
for j in range(i:M) :
X = 0.0
for k in range(N):
X += A[i ][k] * A[k][j]
if X != 0.0 # or abs (X) > epsilon if floating point accuracy is a concern ...
I.append (i )
J.append(j)
V.append(X)
I.append (j )
J.append(i)
V.append(X)
```

And I, J, V are what is needed for a scipy COO sparse matrix via :

```
RESULT = sparse.coo_matrix((V,(I,J)),shape=(N, N))
```

```
def URMdist(URM):
NLin=URM.shape[0]
NCol=URM.shape[1]
URMt=URM.T
Result = lil_matrix((NLin,NLin))
for Lin in range(0,NLin):
X = 0.0
for Col in range(Lin,NLin):
X = URM[Col,:].dot(URMt[:,Lin])
if X != 0.0:
Result[Lin,Col] = Result[Col,Lin] = X
return Result
```

The problem in your case is the shear size of you original matrix. If `A`

has in the order of `1e6`

rows then `A*A.T`

has on the order of `1e12`

entries! There is absolutely no way that you can work with such a matrix in your desktop computer and expect it to be fast. Also, I believe that the problem is not in using Python because the critical loops in Scipy are implemented in C or Fortran (see here).

That being said, I was facing a smaller version of your problem myself (~200k rows, ~3k cols) and I found that Python works quite nicely if you are careful and you use sparse matrices everywhere. (I was breaking the sparsity because I was dividing the sparse matrix by a vector. Don't do this.)

You do the multiplication with the normal overloaded operator `*`

for sparse matrices.

```
import scipy.sparse as sp
import numpy as np
# ... in my case my sparse matrix is obtained from real data
# ... but we get something similar below
A = sp.rand(int(2e5), int(3e3), density=1e-3, format='csr')
# If used in the jupyter notebook
%time A * A.T # Wall time: 2.95 s
# Another option
At = A.T.tocsr()
%time A * At # Wall time: 2.89 s
```

I found quite interesting that the second option is not significantly faster, even though the documentations says that it's faster to multiply two `csr`

matrices.

Note that the result will also depend on the density of your sparse matrix.