Scalable Regression Tree Learning on Hadoop using OpenPlanet
W. Yin, Y. Simmhan, and V. Prasanna. International Workshop on MapReduce and its Applications (MAPREDUCE), pages 57--64. (2012)
Abstract
As scientific and engineering domains attempt to effectively analyze
the deluge of data arriving from sensors and instruments, machine
learning is becoming a key data mining tool to build prediction models.
Regression tree is a popular learning model that combines decision
trees and linear regression to forecast numerical target variables
based on a set of input features. MapReduce is well suited for addressing
such data-intensive learning applications, and a proprietary regression
tree algorithm, PLANET, using MapReduce has been proposed earlier.
In this paper, we describe an open source implementation of this algorithm,
OpenPlanet, on the Hadoop framework using a hybrid approach. Further,
we evaluate the performance of OpenPlanet using real-world datasets
from the Smart Power Grid domain to perform energy use forecasting,
and propose tuning strategies of Hadoop parameters to improve the
performance of the default configuration by 75% for a training dataset
of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
%0 Conference Paper
%1 Yin:mapreduce:2012
%A Yin, Wei
%A Simmhan, Yogesh
%A Prasanna, Viktor
%B International Workshop on MapReduce and its Applications (MAPREDUCE)
%D 2012
%K cloud, grid, hadoop, learning, machine, map, peer, reduce, reviewed, smart, usc
%P 57--64
%T Scalable Regression Tree Learning on Hadoop using OpenPlanet
%U http://ceng.usc.edu/~simmhan/pubs/yin-mapreduce-2012.pdf
%X As scientific and engineering domains attempt to effectively analyze
the deluge of data arriving from sensors and instruments, machine
learning is becoming a key data mining tool to build prediction models.
Regression tree is a popular learning model that combines decision
trees and linear regression to forecast numerical target variables
based on a set of input features. MapReduce is well suited for addressing
such data-intensive learning applications, and a proprietary regression
tree algorithm, PLANET, using MapReduce has been proposed earlier.
In this paper, we describe an open source implementation of this algorithm,
OpenPlanet, on the Hadoop framework using a hybrid approach. Further,
we evaluate the performance of OpenPlanet using real-world datasets
from the Smart Power Grid domain to perform energy use forecasting,
and propose tuning strategies of Hadoop parameters to improve the
performance of the default configuration by 75% for a training dataset
of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
@inproceedings{Yin:mapreduce:2012,
abstract = {As scientific and engineering domains attempt to effectively analyze
the deluge of data arriving from sensors and instruments, machine
learning is becoming a key data mining tool to build prediction models.
Regression tree is a popular learning model that combines decision
trees and linear regression to forecast numerical target variables
based on a set of input features. MapReduce is well suited for addressing
such data-intensive learning applications, and a proprietary regression
tree algorithm, PLANET, using MapReduce has been proposed earlier.
In this paper, we describe an open source implementation of this algorithm,
OpenPlanet, on the Hadoop framework using a hybrid approach. Further,
we evaluate the performance of OpenPlanet using real-world datasets
from the Smart Power Grid domain to perform energy use forecasting,
and propose tuning strategies of Hadoop parameters to improve the
performance of the default configuration by 75% for a training dataset
of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.},
added-at = {2014-08-13T04:08:36.000+0200},
author = {Yin, Wei and Simmhan, Yogesh and Prasanna, Viktor},
biburl = {https://www.bibsonomy.org/bibtex/225fcd4e5ca0dcf88c5cd037d11bf17cc/simmhan},
booktitle = {International Workshop on MapReduce and its Applications (MAPREDUCE)},
interhash = {5a17fc3883059b0f47db8127ac0c2e11},
intrahash = {25fcd4e5ca0dcf88c5cd037d11bf17cc},
keywords = {cloud, grid, hadoop, learning, machine, map, peer, reduce, reviewed, smart, usc},
owner = {Simmhan},
pages = {57--64},
timestamp = {2014-08-13T04:08:36.000+0200},
title = {Scalable Regression Tree Learning on Hadoop using OpenPlanet},
url = {http://ceng.usc.edu/~simmhan/pubs/yin-mapreduce-2012.pdf},
year = 2012
}