
Optimizing Supervised Machine Learning Algorithms and Implementing Deep Learning in HPCC Systems
Victor Herrera, Maryam Najafabadi – LexisNexis / Florida Atlantic University Cooperative Research

Topics: Machine Learning Algorithms, Big Data Management, Optimized Code, Parallel Processing, Developing ML Algorithms on HPCC/ECL, LexisNexis HPCC Platform, ECL ML Library, High-Level Data-Centric Declarative Language, Scalability, Dictionary Approach, Open Source

Agenda
• Optimizing Supervised Methods – Victor Herrera
• Toward Deep Learning – Maryam Najafabadi

Optimizing Supervised Methods

Overview
ML-ECL Random Forest optimization:
• Significantly decreased the time taken by the learning and classification phases
• Improved classification performance
Working with sparse data:
• Sparse ARFF gives a reduced dataset representation
• Sped up the Naïve Bayes algorithm's learning and classification time on highly sparse datasets

Random Forest (Breiman, 2001)
• Ensemble supervised learning algorithm for classification and regression
• Operates by constructing a multitude of decision trees
• Main idea: most of the trees are good for most of the data and make their mistakes in different places
• How: decision-tree bagging (random samples with replacement), splits over a random selection of features, majority voting
• Why Random Forest: it overcomes the overfitting problem; handles wide, unbalanced-class, and noisy data; generally outperforms single algorithms; and is well suited to parallelization

Recursive Partitioning as an Iterative Process in ECL
Decision-tree learning process, GrowTree(Training Data):
• Assign all instances (dependent and independent datasets) to the root node
• Calculate the node purity; if the node is pure enough, return a LEAF node with a label
• Otherwise, find the best attribute on which to split the training data and split it into subsets Di
• For each subset Di: children += GrowTree(Di); when done, return a SPLIT node plus its children
Notes:
• Random Forest learning is based on recursive partitioning, as in decision trees
• Forward references are not allowed in ECL, so decision-tree learning is implemented in ECL as an iterative process via LOOP(dataset, …, loopbody), as sketched below
• The iterative split/partition process is bounded by the purity threshold and the maximum tree level; the resulting node-instance records are laid out as a decision tree and transformed into the decision-tree model format
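The recursion-to-iteration rewrite can be illustrated outside ECL. The following is a minimal Python sketch (not the ECL-ML implementation) of growing one decision tree level by level from a flat table of (node_id, instance) assignments, which is essentially the role the LOOP-driven split/partition process plays on the HPCC platform. The `best_split` helper and the purity/level parameters are illustrative choices; in a Random Forest the split search would additionally be restricted to a random subset of the features and repeated over bootstrap samples.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, idxs):
    """Exhaustive search for the (feature, threshold) pair with the lowest
    weighted Gini impurity; a Random Forest would limit this search to a
    random subset of the features."""
    best_feat, best_thr, best_score = None, None, float("inf")
    for f in range(len(X[0])):
        for thr in sorted({X[i][f] for i in idxs}):
            left = [y[i] for i in idxs if X[i][f] <= thr]
            right = [y[i] for i in idxs if X[i][f] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(idxs)
            if score < best_score:
                best_feat, best_thr, best_score = f, thr, score
    return best_feat, best_thr

def grow_tree_iterative(X, y, max_level=5, purity_threshold=0.0):
    """Grow a decision tree level by level instead of recursively.

    One flat record per (node_id, instance) pair is kept; each pass plays the
    role of the loopbody: pure (or unsplittable) nodes become leaves, impure
    nodes are split, and their instances are re-assigned to the child nodes.
    Returns node_id -> majority label for leaves and node_id -> (feature,
    threshold) for split nodes.
    """
    assign = [(1, i) for i in range(len(y))]      # every instance starts in the root node
    model, next_id = {}, 2
    for level in range(max_level):
        by_node = defaultdict(list)
        for node_id, i in assign:
            by_node[node_id].append(i)
        new_assign = []
        for node_id, idxs in by_node.items():
            labels = [y[i] for i in idxs]
            if gini(labels) <= purity_threshold or level == max_level - 1:
                model[node_id] = Counter(labels).most_common(1)[0][0]   # pure enough: leaf
                continue
            feat, thr = best_split(X, y, idxs)
            if feat is None:                                            # no usable split: leaf
                model[node_id] = Counter(labels).most_common(1)[0][0]
                continue
            model[node_id] = (feat, thr)                                # split node
            left_id, right_id = next_id, next_id + 1
            next_id += 2
            for i in idxs:                                              # re-assign to children
                new_assign.append((left_id if X[i][feat] <= thr else right_id, i))
        if not new_assign:                        # every branch ended in a leaf
            break
        assign = new_assign
    return model
```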
Random Forest Learning Optimization – Initial Implementation
Bootstrapping (sampling with replacement) associates original instance IDs to new IDs in a hash table and assigns the sampled training data – at least K x N x M records, with K the number of trees and S the number of features to select per split – to the root nodes before forest growth (the iterative split/partition process, bounded by purity and maximum tree level) begins.
Initial implementation flaws:
• On every single iteration of the split/partition LOOP, at least K x N x M records are sent to the loopbody function
• Every node-instance record passes through the loopbody function on each iteration, regardless of whether its processing is already complete
• Resources are wasted by including the independent data in the loopbody function's input: node purity depends only on the dependent data, and finding the best split per node only needs subsets of the independent data (feature selection)
• The implementation was not fully parallelized

Random Forest Learning Optimization – Optimized Implementation
A review of the initial implementation helped to reorganize the process and data flows: bootstrapping now samples the dependent and the independent training data separately, producing an original-to-new instance ID hash table, a sampled independent dataset, and sampled dependent data placed in the root nodes (K x N records), so that far less data enters the iterative split/partition process.
We improved the initial approach to:
• Filter out records that require no further processing (the LOOP row-filter; see the sketch after this list)
• Pass only one record per instance – its dependent value – into the loopbody function
• Fetch only the required independent data from within the function at each iteration
• Take full advantage of the distributed data storage and parallel processing capabilities of the HPCC Systems platform
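The first improvement, setting finished records aside before each pass, corresponds to the row-filter form of ECL's LOOP. Below is a tiny Python analogue of that behavior (the record layout, predicate, and helper names are illustrative, not the HPCC code): records that no longer need processing skip the loop body entirely and are appended to the final result.

```python
def loop_with_row_filter(records, max_iterations, still_active, loop_body):
    """Analogue of the LOOP row-filter pattern described above.

    Only records for which still_active(rec) is True are handed to loop_body
    on each pass; finished records are set aside immediately rather than
    being re-processed, and are appended to the final result at the end.
    """
    done = [r for r in records if not still_active(r)]
    active = [r for r in records if still_active(r)]
    for counter in range(1, max_iterations + 1):
        if not active:
            break
        produced = loop_body(active, counter)
        done.extend(r for r in produced if not still_active(r))
        active = [r for r in produced if still_active(r)]
    return done + active

# Toy usage: keep halving each value until it drops below 1; values already
# below 1 never re-enter the loop body (cf. pure nodes during forest growth).
out = loop_with_row_filter(
    records=[{"v": 40.0}, {"v": 3.0}, {"v": 0.5}],
    max_iterations=10,
    still_active=lambda r: r["v"] >= 1.0,
    loop_body=lambda recs, counter: [{"v": r["v"] / 2.0} for r in recs],
)
print(out)   # [{'v': 0.5}, {'v': 0.75}, {'v': 0.625}]
```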
RF Split/Partition loopbody Function – Optimized
Input: node-instance data. The function calculates each node's Gini impurity and filters the nodes by impurity: pure-enough nodes exit as leaves, while the impure node-instance data fetches its random feature selection from the sampled independent dataset, the best split per node is chosen, and the instances are re-assigned to the new child nodes, which form the output.
The loopbody function is fully parallelized:
• It receives and returns one record per instance
• Node impurity and best-split-per-node calculations are done locally, with the node-instance data DISTRIBUTEd by node_id
• The random feature selection is fetched with a local JOIN: the sampled independent data is generated and DISTRIBUTEd by instance ID at bootstrap, and the instance-feature selection combinations dataset (RETRIEVER) is likewise DISTRIBUTEd by instance ID
• Instance relocation to the new nodes is done locally: the impure node-instance data remains DISTRIBUTEd by node_id and is combined with the split-node data through a LOOKUP JOIN

Toward Deep Learning

Sparse Autoencoder
• Autoencoder: the expected output is the same as the input
• Sparsity: constrain the hidden neurons to be inactive most of the time
• Stacking sparse autoencoders makes a deep network

Formulating the Sparse Autoencoder as an Optimization Problem
• Parameters: the weight and bias values
• Objective function: the difference between the output and the expected output, plus a penalty term to impose sparsity
• Define a function that calculates the objective value and the gradient at a given point (a minimal sketch follows below)
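This is not the ECL/PBblas code from the talk, but a minimal NumPy sketch of the pattern the slide describes, assuming a single hidden layer and sigmoid activations: pack the weights and biases into one parameter vector, write a function that returns both the objective value (reconstruction error plus weight decay plus a KL-divergence sparsity penalty) and its gradient, and hand it to an L-BFGS optimizer (here SciPy's L-BFGS-B). The hyperparameter names (lam, rho, beta) and the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(theta, X, n_hidden, lam=1e-4, rho=0.05, beta=3.0):
    """Return (cost, gradient) for a one-hidden-layer sparse autoencoder.

    X holds one example per column (n_visible x m); theta packs W1, W2, b1, b2.
    The cost is squared reconstruction error + L2 weight decay + a KL-divergence
    sparsity penalty pushing average hidden activations toward rho.
    """
    n, m = X.shape
    h = n_hidden
    W1 = theta[:h * n].reshape(h, n)
    W2 = theta[h * n:2 * h * n].reshape(n, h)
    b1 = theta[2 * h * n:2 * h * n + h]
    b2 = theta[2 * h * n + h:]

    a2 = sigmoid(W1 @ X + b1[:, None])           # hidden activations
    a3 = sigmoid(W2 @ a2 + b2[:, None])          # reconstruction of the input
    rho_hat = a2.mean(axis=1)                    # average activation per hidden unit

    cost = (0.5 / m) * np.sum((a3 - X) ** 2) \
         + (lam / 2) * (np.sum(W1 ** 2) + np.sum(W2 ** 2)) \
         + beta * np.sum(rho * np.log(rho / rho_hat)
                         + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

    # Back-propagation for the gradient, including the sparsity term.
    delta3 = (a3 - X) * a3 * (1 - a3)
    sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    delta2 = (W2.T @ delta3 + sparsity[:, None]) * a2 * (1 - a2)
    W1grad = delta2 @ X.T / m + lam * W1
    W2grad = delta3 @ a2.T / m + lam * W2
    grad = np.concatenate([W1grad.ravel(), W2grad.ravel(),
                           delta2.mean(axis=1), delta3.mean(axis=1)])
    return cost, grad

# Toy usage: learn 25 feature detectors from random 8x8 "patches" (64 inputs),
# letting L-BFGS drive the objective-and-gradient function defined above.
rng = np.random.default_rng(0)
X = rng.random((64, 1000))
h, n = 25, 64
theta0 = np.concatenate([rng.normal(scale=0.01, size=2 * h * n), np.zeros(h + n)])
result = minimize(sparse_autoencoder_cost, theta0, args=(X, h), jac=True,
                  method="L-BFGS-B", options={"maxiter": 200})
W1 = result.x[:h * n].reshape(h, n)              # learned feature detectors
```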
Sparse Autoencoder Results
• 10,000 randomly selected 8x8 patches
• MNIST dataset

SoftMax Regression
• Generalizes logistic regression to more than two classes
• MNIST -> 10 different classes

Formulating SoftMax Regression as an Optimization Problem
• Parameters: K x n variables
• Objective function: a generalization of the logistic regression objective
• Define a function that calculates the objective value and the gradient at a given point (a sketch appears after the closing slide)

SoftMax Results
• Tested on MNIST data, using the features extracted by the sparse autoencoder
• 96% accuracy

Toward Deep Learning
• Provide the features learned by one sparse autoencoder layer as input to the next
• Stack the layers up to build a deep network
• Fine-tuning: forward propagation calculates the cost value and back-propagation calculates the gradients; L-BFGS is used to fine-tune

Taking Advantage of HPCC Systems
• PBblas
• Graphs

Example

Summary
• Optimization algorithms are an important aspect of advanced machine learning problems
• L-BFGS implemented on HPCC Systems
  • SoftMax
  • Sparse Autoencoder
• Other algorithms can be implemented by supplying a calculation of the objective value and gradient
• Toward deep learning

Questions?
Thank You
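Rounding out the summary's point that further algorithms can be added by supplying an objective-value-and-gradient function, here is the analogous NumPy sketch for K-class SoftMax regression referred to above. Again this is a hedged sketch rather than the HPCC/ECL implementation; the weight-decay constant and variable names are illustrative. The same minimize(..., jac=True, method="L-BFGS-B") call shown earlier would drive it.

```python
import numpy as np

def softmax_cost(theta, X, y, K, lam=1e-4):
    """Return (cost, gradient) for K-class SoftMax regression.

    theta packs a K x n weight matrix; X is n x m (one example per column)
    and y holds integer labels 0..K-1. The cost is the negative log-likelihood
    of the generalized logistic model plus L2 weight decay.
    """
    n, m = X.shape
    W = theta.reshape(K, n)
    scores = W @ X                                # K x m class scores
    scores -= scores.max(axis=0, keepdims=True)   # stabilize the exponentials
    P = np.exp(scores)
    P /= P.sum(axis=0, keepdims=True)             # class probabilities
    Y = np.zeros((K, m))
    Y[y, np.arange(m)] = 1.0                      # one-hot ground truth
    cost = -np.sum(Y * np.log(P)) / m + (lam / 2) * np.sum(W ** 2)
    grad = (P - Y) @ X.T / m + lam * W
    return cost, grad.ravel()

# Usage with L-BFGS, as in the earlier sketch:
# minimize(softmax_cost, np.zeros(K * n), args=(X, y, K), jac=True, method="L-BFGS-B")
```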
