4

I have a fairly large matrix, roughly a billion rows by 500,000 columns, and I need to compute its singular value decomposition. The problem is that my computer's RAM cannot hold the whole matrix at once, so I need an incremental way of computing the SVD: take one row, or a few hundred or a few thousand rows, at a time (not too many), do what I need to do with those numbers, and then throw them away so the memory can be used for the rest of the data.

A couple of papers on similar problems have been posted, such as http://www.bradblock.com/Incremental_singular_value_decomposition_of_uncertain_data_with_missing_values.pdf and http://www.jofcis.com/publishedpapers/2012_8_8_3207_3214.pdf.

I am wondering whether anyone has done previous research on this, or has any suggestions on how I should approach it. I really do need the FASTEST approach, without losing too much accuracy in the data.
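
To make concrete what I mean by "incremental", here is a minimal sketch of one update step, assuming NumPy and assuming only the singular values and the right singular vectors are needed (a billion-row left factor would not fit in memory anyway). The name `update_svd` and the block shapes are illustrative, not taken from the papers above; the step is the standard rank-augmentation trick: project the new rows onto the current basis, orthogonalize the residual, and re-diagonalize a small core matrix.

```python
# Minimal sketch of one incremental truncated-SVD update, assuming NumPy.
# Only the singular values S and right singular vectors V are kept; the
# left factor (one row per data row) is never formed.
import numpy as np

def update_svd(S, V, B, k):
    """Fold a block of new rows B (c x n) into a running truncated SVD.

    S : (k,)   current singular values
    V : (n, k) current right singular vectors (orthonormal columns)
    Returns (S, V), updated and truncated back to rank k.
    """
    C = B @ V                                  # project new rows onto span(V)
    R = B - C @ V.T                            # residual outside span(V)
    Q, T = np.linalg.qr(R.T)                   # orthonormal basis (n x c) for residual
    c = B.shape[0]
    K = np.block([[np.diag(S), np.zeros((k, c))],   # small (k+c) x (k+c)
                  [C,          T.T]])               # core matrix
    _, S_new, Vt_new = np.linalg.svd(K, full_matrices=False)
    V_new = np.hstack([V, Q]) @ Vt_new.T[:, :k]     # rotate basis, truncate
    return S_new[:k], V_new
```

Each update only touches an $n \times (k+c)$ working set, so memory scales with the 500,000 columns and the target rank, not with the billion rows.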

  • 0
    As far as I know there is a paper about this. I don't remember it, but my colleagues should know. In my experience, SVD is quite slow in MATLAB. How big is your matrix? (2012-08-06)
  • 0
    @SeyhmusGüngören It will be nearly a billion by 500,000, which is why I need to take in a few rows at a time. Overall, there are going to be around 500 trillion numbers. (2012-08-06)
  • 0
    OK, I see the problem. You are right, one needs too much memory for such a calculation. If you don't receive any answer by tomorrow, I will ask my colleagues and let you know about a possible solution. (2012-08-06)
  • 0
    @SeyhmusGüngören That would be very much appreciated! Thank you so much. (2012-08-06)
  • 0
    You might want to see [this](http://www.cs.yale.edu/publications/techreports/tr966.pdf). (2012-08-07)
  • 0
    @J.M. I will take a look at that. Thank you very much! I'll tell you how it goes. (2012-08-07)
  • 0
    OK, I asked my colleagues; some haven't replied yet. If they tell me something I will let you know, but it is so strange. Where do you get those matrices? $10^9 \times 5\times10^5$ means $5\times10^{14}$ numbers, and if each number is represented by $32$ bits then you have roughly $1.6\times10^{16}$ bits, which is $2\times10^{15}$ bytes, about $2000$ terabytes. You cannot even store such a huge dataset on your hard disk. (2012-08-07)
  • 0
    @SeyhmusGüngören Here is what I'm planning to do if I can compute the SVD incrementally: many documents will be stored on my server (not yet, but eventually about a billion documents). I will grab the text from one or a few files at a time, do what I need to do to incrementally update the SVD, drop them from RAM, and move on to more documents. The second dimension will be 500,000, as in about 500,000 total words across all of the documents. (2012-08-07)
  • 0
    Also, I will probably be using sparse matrices to store the data. (2012-08-08)
  • 0
    @J.M. Sorry, I thought I had commented on this: the article you provided talks about updating a previously existing SVD. I need to be able to start from the beginning. This might not be possible, but it would be good to at least be told that. (2012-08-08)
  • 0
    "I need to be able to start from the beginning." - and it's not terribly hard to compute the SVD of a $1\times n$ or an $m\times 1$ matrix... ;) There's your starting point.2012-08-09
  • 0
    Hmm, okay. I am also concerned about the accuracy of the final result, since starting with an SVD based on many vectors, rather than a single vector, might give a more precise answer, but I will take a go at this method and start off with one vector to see if it still works appropriately (see the bootstrapping sketch after these comments). (2012-08-09)
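
Following J.M.'s suggestion, here is the bootstrapping I have in mind, as a sketch: seed the factorization from the first chunk alone, then fold in the rest using the hypothetical `update_svd` from the sketch above. `row_chunks` stands in for whatever yields blocks of document rows; the first chunk needs at least $k$ rows to seed a rank-$k$ factorization.

```python
# Bootstrapping sketch: seed from the first chunk, then update incrementally.
# `row_chunks` is a hypothetical iterable of (c x n) NumPy arrays.
import numpy as np

def streaming_svd(row_chunks, k):
    chunks = iter(row_chunks)
    first = next(chunks)                      # must have at least k rows
    _, S, Vt = np.linalg.svd(first, full_matrices=False)
    S, V = S[:k], Vt[:k].T                    # rank-k seed
    for B in chunks:
        S, V = update_svd(S, V, B, k)         # defined in the sketch above
    return S, V
```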

2 Answers

2

You could compute the SVD of randomly chosen submatrices of your original matrix, as shown e.g. in the 2004 paper by Drineas, Frieze, Kannan, Vempala and Vinay, and scale the result to obtain an approximate SVD of the original matrix. There has been quite a bit of additional work on randomized matrix methods since then. The grandfather of all this is the Kaczmarz method of 1937 for solving $Ax = b$ when only one row of $A$ is accessible at a time.
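
For illustration, a rough sketch of the row-sampling idea: pick $c$ rows with probability proportional to their squared norms, rescale so expectations match, and take the SVD of the small sampled matrix; its singular values and right singular vectors approximate those of the full matrix. The names here are mine, not the paper's, and the matrix is in memory only to keep the example short; at your sizes you would accumulate the squared row norms in one streaming pass and pull the sampled rows in a second.

```python
# Row-sampling sketch in the spirit of Drineas et al.; shown on an
# in-memory A for brevity.
import numpy as np

def sampled_svd(A, c, seed=0):
    rng = np.random.default_rng(seed)
    norms = np.einsum('ij,ij->i', A, A)              # squared row norms
    p = norms / norms.sum()                          # sampling distribution
    idx = rng.choice(A.shape[0], size=c, replace=True, p=p)
    S_small = A[idx] / np.sqrt(c * p[idx])[:, None]  # rescaled sample (c x n)
    _, sigma, Vt = np.linalg.svd(S_small, full_matrices=False)
    return sigma, Vt                                 # approximate values / V^T
```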

It might also be useful to check whether a few of the top singular values are sufficient for your purposes. If so, Lanczos-type methods, for example, will yield additional time savings.
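
As a concrete illustration of the Lanczos route: SciPy's `svds` (ARPACK-based Lanczos) needs only matrix-vector products, so the operand can be a sparse matrix or a `LinearOperator` whose `matvec` streams rows from disk. A toy run on a sparse random matrix (the sizes are placeholders):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

A = sp.random(100_000, 50_000, density=1e-4, format='csr', random_state=0)
U, s, Vt = svds(A, k=10)   # ten largest singular triplets; s comes back in ascending order
```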

  • 0
    I'm really liking the Lanczos method so far. I think I might go with that! Thank you for your help! (2012-08-16)
0

So, is there any other property the matrix has? For example, is it sparse or symmetric, real or complex, etc.? There are optimized algorithms for various situations, and the SVD may not be the best option; it would be helpful to know what you are trying to get from the SVD.

  • 0
    This is more of a comment than an answer... (2012-08-15)