James Demmel

Faculty Scientist, UC Berkeley

demmel@berkeley.edu

Phone: +1 510 495 2851

Web: http://www.cs.berkeley.edu/~demmel/

Journal Articles

Xuan Jiang, Raja Sengupta, James Demmel, Samuel Williams, "Large scale multi-GPU based parallel traffic simulation for accelerated traffic assignment and propagation", Transportation Research Part C: Emerging Technologies, December 2024, 169:104873, doi: 10.1016/j.trc.2024.104873

H. Luo, J.W. Demmel, Y. Cho, X. S. Li, Y. Liu, "Non-smooth Bayesian optimization in tuning problems", arxiv-preprint, September 21, 2021

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

James Demmel, Hong-Diep Nguyen, "Parallel Reproducible Summation", IEEE Transactions on Computers, Special Section on Computer Arithmetic 2014, August 11, 2014, doi: 10.1109/TC.2014.2345391

Reproducibility, i.e. getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources, and floating point non-associativity, makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [7], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm reproducibly computes highly accurate results with an absolute error bound of (formula) at a cost of 7n FLOPs and a small constant amount of extra memory usage. Higher accuracies are also possible by increasing the number of error-free transformations. As long as all operations are performed in to-nearest rounding mode, results computed by the proposed algorithms are reproducible for any run on any platform. In particular, our algorithm requires the minimum number of reductions, i.e. one reduction of an array of six double precision floating point numbers per sum, and hence is well suited for massively parallel environments.
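The two ingredients named in the abstract above, floating point non-associativity and error-free transformations, can be illustrated with a minimal sketch. It uses Knuth's scalar TwoSum as a stand-in example of an error-free transformation; this is an assumption made for illustration and is not Rump's vector algorithm or the authors' method. The function name `two_sum` is hypothetical.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple:
    """Knuth's TwoSum (illustrative): return (s, e) where s is the rounded
    sum fl(a + b) and e is the rounding error, so that s + e == a + b
    holds exactly (barring overflow)."""
    s = a + b
    b_virtual = s - a          # the part of b that actually made it into s
    a_virtual = s - b_virtual  # the part of a that actually made it into s
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


# Non-associativity: the same three numbers, two summation orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False: the result depends on the order of operations

# Error-free transformation: no information is lost by the rounded addition.
s, e = two_sum(0.1, 0.2)
exact = Fraction(0.1) + Fraction(0.2)   # exact rational sum of the two doubles
print(Fraction(s) + Fraction(e) == exact)  # True
```

Because a dynamically scheduled parallel reduction may combine partial sums in a different order on each run, the first comparison is the reason bitwise reproducibility is not automatic.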

O. Marques, J. Demmel, C. Voemel, B. Parlett, "A Testing Infrastructure for Symmetric Tridiagonal Eigensolvers", ACM TOMS, 2008, 35

J. Demmel, O. Marques, C. Voemel, B. Parlett, "Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers", SIAM Journal on Scientific Computing, 2008, 30:1508–1526

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Download File: parco08-spmv.pdf (pdf: 1.5 MB)

Conference Papers

Y. Cho, J. W. Demmel, X. S. Li, Y. Liu, H. Luo, "Enhancing autotuning capability with a history database", IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 20, 2021

Download File: GPTuneHistoryDB.pdf (pdf: 390 KB)

Y. Liu, W. M. Sid-Lakhdar, O. Marques, X. Zhu, C. Meng, J. W. Demmel, X. S. Li, "GPTune: multitask learning for autotuning exascale applications", PPoPP, February 17, 2021, doi: 10.1145/3437801.3441621

Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight, "A 3D Parallel Algorithm for QR Decomposition", SPAA '18, 2018

Yang You, Aydin Buluc, James Demmel, "Scaling deep learning on GPU and Knights Landing clusters", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), 2017

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149-1158, doi: 10.1109/IPDPS.2014.119

Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, Hong Diep Nguyen, E. Solomonik, "Reconstructing Householder Vectors from Tall-Skinny QR", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, 2014, 1159-1170, doi: 10.1109/IPDPS.2014.120

Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, Sivan Toledo, "Communication optimal parallel multiplication of sparse random matrices", SPAA 2013: The 25th ACM Symposium on Parallelism in Algorithms and Architectures, Montreal, Canada, 2013, 222-231, doi: 10.1145/2486159.2486196

Download File: spaa134-ballard.pdf (pdf: 301 KB)

E. Solomonik, A. Buluç, J. Demmel, "Minimizing communication in all-pairs shortest paths", International Parallel and Distributed Processing Symposium (IPDPS), 2013

Download File: 25dapspipdps13.pdf (pdf: 256 KB)

James Demmel, Hong-Diep Nguyen, "Fast Reproducible Floating-Point Summation", Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), April 10, 2013, doi: 10.1109/ARITH.2013.9

Reproducibility, i.e. getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [1]. However, the combination of dynamic scheduling of parallel computing resources, and floating point nonassociativity, makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [2], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm trades off efficiency and accuracy: we reproducibly attain reasonably accurate results (with an absolute error bound c · n² · macheps · max |v_i| for a small constant c) with just 2n + O(1) floating-point operations, and quite accurate results (with an absolute error bound c · n³ · macheps² · max |v_i|) with 5n + O(1) floating point operations, both with just two reduction operations. Higher accuracies are also possible by increasing the number of error-free transformations. As long as the same rounding mode is used, results computed by the proposed algorithms are reproducible for any run on any platform.
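The idea behind the lower-accuracy variant described above can be sketched in a few lines. This is a simplified, sequential illustration under stated assumptions, not the authors' algorithm: it assumes IEEE double precision with round-to-nearest, finite inputs, and no overflow or underflow, and the name `reproducible_sum` and the choice of the constant `M` are hypothetical. Each input is first rounded to a multiple of a common unit determined by n and max |v_i|; the rounded values then add up without any rounding error, so the order of summation no longer matters. The two reduction operations mentioned in the abstract correspond here to the `max` and the final sum.

```python
import math


def reproducible_sum(values):
    """Order-independent sum (illustrative sketch, see assumptions above)."""
    n = len(values)
    if n == 0:
        return 0.0
    # Reduction 1: the maximum magnitude does not depend on the input order.
    m = max(abs(v) for v in values)
    # frexp gives n * m = mant * 2**e with 0.5 <= mant < 1, so every |v| < 2**e.
    _, e = math.frexp(n * m)
    # M = 3 * 2**e: M + v stays inside one binade, so the addition rounds
    # every v to a multiple of the same unit u = 2**(e - 51).
    M = math.ldexp(1.5, e + 1)
    total = 0.0
    for v in values:
        q = (M + v) - M   # v rounded to a multiple of u; the subtraction is exact
        # Reduction 2: all partial sums are multiples of u bounded by about
        # n * m, which fits in 53 bits, so this addition is exact.
        total += q
    return total


data = [1.0e6, 3.14159, -2.5e5, 1.0e-3, 7.0]
print(reproducible_sum(data) == reproducible_sum(list(reversed(data))))
```

Each input is perturbed by at most u/2, and u is proportional to n · macheps · max |v_i|, so the total error of this sketch is on the order of n² · macheps · max |v_i|, the same shape as the first bound quoted in the abstract.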

Aydın Buluç, Samuel Williams, Leonid Oliker, James Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", IPDPS, IEEE, 2011, doi: 10.1109/IPDPS.2011.73

Download File: ipdps2011.pdf (pdf: 770 KB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010

Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Download File: sc07-spmv.pdf (pdf: 438 KB)

Book Chapters

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

Presentation/Talks

James Demmel, Hong-Diep Nguyen, Numerical Accuracy and Reproducibility at Exascale, Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), April 10, 2013

Download File: pres_33.pdf (pdf: 300 KB)

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011

Reports

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E

Web Articles

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009

Posters

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008

Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)