Accelerating GLM using Normal Equation and GPGPU on a Tall-Skinny Data Set

Accelerating Solution of Generalized Linear Models by Solving Normal Equation using GPGPU on a Large Real-World Tall-Skinny Data Set

T. Sang, R. Kobayashi, R. S. Yamaguchi, T. Nakata “Accelerating Solution of Generalized Linear Models by Solving Normal Equation using GPGPU on a Large Real-World Tall-Skinny Data Set”, IEEE  31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp.112-119, Oct. 2019 .

  • Machine learning algorithm becomes more popular, recently
  • Training time gets longer when the data size grows and needs to be reduced.
  • Especially, in the IoT data analysis field, it is common to see a very tall-skinny dataset (e.g. 4×107 by 27).
  • E.g location history, physical activity history, shopping history, …
  • We wanted to speed up the machine learning algorithm on a real-world application which uses the Generalized Linear Model (GLM)

Lifestyle authentication(*)

(*) 小林 良輔, 疋田 敏朗, 鈴木 宏哉, 山口 利恵.“行動 センシングログを元にしたライフスタイル認証の提 案”, コンピュータセキュリティシンポジウム 2016 論文集, 2016(2), pp.1284-1290 (2016)

  • A technology which makes use of lifestyle information to identify the user
  • By Yamaguchi Laboratory @ Social ICT Research Center, the University of Tokyo
  • Using information from smartphone device and website without explicit user activity
  • As a result of convenience and security,
  • increase number of users and aim to a safer society
  • Our research focuses on “user access pattern

Training execution profiling

  • Total time complexity: O(m×n2×k) (m=2,719,348: no. of samples, n=27: no. of features, k=7: no. of iterations)
  • Total execution time: 112.79 seconds
  • Training mostly depends on the “.Call” which is called in many iterations
  • C extension “.Call”  routine solves the Linear Least Squares equation (LS equation)
  • The standard library uses canonical Linpack library with minor modification
  • The first step is to rewrite the “.Call” routine with GPU APIs
  • Speedup upper bound limit of this step: 2.2 times
  • This is base step for further future optimization

Acceleration comparison between two approaches

  • In the conventional algorithm, GPGPU reduces running time by 4.9x
  • Folded with 6.4x time of Normal Equation
  • Leading to 31.3x time speedup in total