Parallelization of computer codes can be used to obtain speedups approaching the number of processors employed, but parallel codes and computer systems can be difficult and expensive to develop and maintain. Current microprocessor architectures support significant pipelining and SIMD streaming, but most computer codes fail to take advantage of this. We show that by exploiting MATLAB's support for vectorization we are able to make better use of the microprocessor architecture and achieve dramatic speedups of the forward and adjoint sensitivity calculations required for a large 3-D inverse problem in biomedical fluorescence tomography on a single-processor computer, and that these speedups increase as the problem size increases. Our vectorized implementations replicate large amounts of data and are thus memory intensive; however, we effectively remove memory constraints by using domain decomposition. We show that global matrix assembly for a large (98,304-element) finite element mesh is sped up by a factor of 6.5 and adjoint sensitivity calculations are sped up by a factor of 502 on a single-processor 2.2 GHz Pentium IV.
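To make the idea of vectorized assembly with domain decomposition concrete, the following is a minimal sketch and not the authors' implementation. It uses Python with NumPy in place of MATLAB, assumes a toy 1-D mesh of linear elements rather than the 3-D mesh described above, and stores the global matrix densely for simplicity. The function names and the `chunk` parameter are hypothetical. The loop version adds one element matrix at a time. The vectorized version builds all element matrices in a chunk at once, which replicates index and coefficient data, and processes the mesh in chunks so that the replicated arrays stay bounded in size.

```python
import numpy as np

def assemble_loop(nodes, elems):
    """Reference version: add each 2x2 element matrix to K one entry at a time."""
    n = len(nodes)
    K = np.zeros((n, n))
    for a, b in elems:
        h = nodes[b] - nodes[a]                       # element length
        ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h  # 1-D linear stiffness
        idx = (a, b)
        for i in range(2):
            for j in range(2):
                K[idx[i], idx[j]] += ke[i, j]
    return K

def assemble_vectorized(nodes, elems, chunk=1024):
    """Vectorized version: handle `chunk` elements per step (assumed parameter).

    Chunking stands in for domain decomposition: it caps the size of the
    replicated row/column/value arrays built in each step.
    """
    n = len(nodes)
    K = np.zeros((n, n))
    pattern = np.array([[1.0, -1.0], [-1.0, 1.0]])
    for start in range(0, len(elems), chunk):
        e = elems[start:start + chunk]                # (m, 2) node indices
        h = nodes[e[:, 1]] - nodes[e[:, 0]]           # (m,) element lengths
        ke = pattern[None, :, :] / h[:, None, None]   # (m, 2, 2) element matrices
        rows = np.repeat(e, 2, axis=1)                # (m, 4): a, a, b, b
        cols = np.tile(e, (1, 2))                     # (m, 4): a, b, a, b
        # np.add.at accumulates correctly when an index pair repeats.
        np.add.at(K, (rows.ravel(), cols.ravel()), ke.ravel())
    return K

# Small mesh: 10 equal elements on [0, 1].
nodes = np.linspace(0.0, 1.0, 11)
elems = np.column_stack([np.arange(10), np.arange(1, 11)])
K_loop = assemble_loop(nodes, elems)
K_vec = assemble_vectorized(nodes, elems, chunk=3)
```

Both functions should produce the same matrix for any chunk size. A production code for a mesh of the size quoted above would use sparse storage rather than a dense array.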