Decision Trees on Embedded Systems

Many approaches for classification and regression deliver high accuracy but are due to their black-box nature no longer interpretable for its user, for example models produced with support vector machines (SVMs). Hence, these models do not provide us any intuitive insights why they are performing well in certain situations while miserably failing in others.

The contrary are white-box models, which let us look inside the model to find out what the model learned from the training data and how it is using the input variables to predict the value of the output variable. A class of popular white-box models are decision trees. In the following, we will have a closer look at them, more specifically at Matlab’s TreeBagger random decision forest implementation, and show how we can run the classifier on embedded systems. The code discussed here can be found on GitHub.

Decision trees are tree-like graphs. Each interior node represents one of the input variables. Edges going to the node’s children represent possible values of that input variable. The leaf nodes depict the values of the output variable given that the input variables match the values of the edges traversed from the root to the given leaf node. The following tree is an example from Akanoo, a company providing automated on-site marketing solutions. It shows the classification of web shop visitors as buyers or non buyers:

example_decision_tree

The tree shows the basic principle of a decision tree analysis. In reality the analysis will depend on a variety of additional factors, i.e., input variables. Akanoo uses for example a combination of over 50 independent variables to calculate purchasing probabilities.

Matlab’s TreeBagger function combines multiple decision trees, each using a random subset of the input variables, to increase the classification accuracy. The following example uses Fisher’s iris flower data set to show how TreeBagger is used to create 20 decision trees to predict three different flower species based on four input variables: sepal length, sepal width, petal length, and petal width.

% fisheriris data set:
%  - meas: Matrix of input variables
%  - species: vector of species
load fisheriris
num_bags=20;

% create numerical class labels
species_class(find(strcmp(species, 'setosa'))) = 1;
species_class(find(strcmp(species, 'versicolor'))) = 2;
species_class(find(strcmp(species, 'virginica'))) = 3;

% create decision trees
B = TreeBagger(num_bags, meas, species_class);

B holds the ensemble of 20 decision trees, which we can use to predict the species based on the four input variables. In the following, we use the first set of input variables in our measurement data set to predict the species.

predict(B, meas(1,:))

Each of the trees in B have around 20 nodes, even for this rather simple classification example. For more complex problems the trees easily grow to many hundreds of nodes. As long as we use the classifier within Matlab this is not really a problem. However, as soon as we want to deploy the classifier on a real system in a device, we need to re-implement the decision trees in a low-level language, such as C or C++, and export the trees trained in Matlab. Luckily, Paul Kendrick provides on GitHub the project decisionTreeMat2Cpp doing exactly what we need for this:

This program takes a decision tree trained in Matlab using TreeBagger or the classification tree function ClassificationTree and outputs a text file containing all the branch information. This text file can be read by the attached C++ class, and then used to make decisions based on presented features in deployed applications.

However, many embedded systems do not have a file system, turning it impossible to read in the text files describing the decision trees. Hence, we extend the package to directly output a header file containing all the branch information. This header file we can easily include in the C++ class, which implements the logic of the decision trees. The following Matlab command

extractDecTreeStruct(B, unique(species_class), 1, num_bags);

creates the header file decTreeConstants.h, which looks like this (only showing the first few lines of the file):

#ifndef DECTREECONSTANTS_h
#define DECTREECONSTANTS_h

const int NO_CLASSES = 3;
const int NO_BAGS = 20;

const int NO_BRANCHES[20] = {5, 10, 7, 9, 11, 6, 7, 8, 8, 5, 8, 7, 5, 10, 9, 12, 8, 6, 5, 13};

const int BRANCH_LENGTHS[20][13] = {
{1, 2, 3, 4, 4},
{2, 3, 3, 3, 3, 4, 4, 4, 5, 5},
{1, 2, 3, 4, 5, 6, 6},
{1, 3, 3, 4, 4, 5, 5, 5, 5},
{1, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6},
{1, 2, 3, 4, 5, 5},
...
...

This simple C++ example illustrates how we can use then the decision tree class to classify a species based on the input variables 5.7, 2.8, 4.1, and 1.3 corresponding to sepal length, sepal width, petal length, and petal width:

#include "DecisionTreeClass.hpp"

using namespace std;

int main(int argc, char* argv[]) {

  DTree tree;

  float input[4] = {5.7f, 2.8f, 4.1f, 1.3f};
  int prediction = tree.decisionTreeFun(input);
  cout << "Predicted species: " << prediction << endl;

  return 0;
}

Check out the Github project TreeBagger-Matlab2Cpp for more details and the full source code.

Another challenging problem not covered by this post persists. Embedded systems are chronically short on memory and, hence, may have difficulties to fit the large three dimensional arrays needed by the decision trees’ C++ implementation.

One thought on “Decision Trees on Embedded Systems”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s