Large Air Quality Data Set Available for Download

During my PhD, we built ten sensor nodes to measure different air pollutants and deployed them on top of public transport vehicles in the city of Zurich, Switzerland. With this installation we collected a unique pollution data set comprising by far the largest number of measurements at that time.

Balz Maag, a co-researcher, made part of this unique data set publicly available:

Zenodo is a strong supporter of open data in all its forms (meaning data that anyone is free to use, reuse, and redistribute) and takes an incentives approach to encourage depositing under an open license.

The data set contains 11 million samples: one year (April 2012 to April 2013) of ultrafine particle (UFP) concentration measurements. The data was collected by a mobile sensor network whose sensors were mounted on top of 10 streetcars in the city of Zurich, Switzerland. The data has been post-processed by performing a periodic null-offset calibration and by filtering samples recorded during sensor malfunctions.

A small excerpt of the data set:

2012.04.19 14:12,47.373288,8.522049,1.1,5,6400,48.1,16.0
2012.04.19 14:12,47.373272,8.522053,1.1,5,6545,47.4,16.1
2012.04.19 14:12,47.373253,8.522068,1.1,5,6656,47.2,16.3
2012.04.19 14:12,47.373244,8.522065,1.1,5,6731,47.1,16.4
2012.04.19 14:12,47.373233,8.522051,1.1,5,6451,47.9,16.0
2012.04.19 14:12,47.373233,8.522044,1.1,5,6400,48.1,16.0
2012.04.19 14:12,47.373237,8.522035,1.1,5,6178,48.7,15.7
2012.04.19 14:12,47.373248,8.522030,1.1,5,6378,47.2,15.6

The data set has been used and is described in more detail in the following publications:

  • David Hasenfratz et al. Pushing the Spatio-Temporal Resolution Limit of Urban Air Pollution Maps. IEEE International Conference on Pervasive Computing and Communications (PerCom). Budapest, Hungary, March 2014. Best Paper Award.
  • David Hasenfratz et al. Deriving High-Resolution Urban Air Pollution Maps Using Mobile Sensor Nodes. Pervasive and Mobile Computing. Elsevier, 2015.
  • David Hasenfratz et al. Demo Abstract: Health-Optimal Routing in Urban Areas. ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). Seattle, USA, April 2015.
  • Michael Müller et al. Statistical modelling of particle number concentration in Zurich at high spatio-temporal resolution utilizing data from a mobile sensor network. Atmospheric Environment. Elsevier, 2016.

Visualizing MQTT Data

The interest in crowd-sourced applications is steadily increasing from year to year. Drastic events like the Fukushima Daiichi nuclear disaster in March 2011 showed the power and value of public initiatives, in this case building Geiger counters to accurately measure radiation. This was particularly valuable in a situation where the general public did not have high confidence in the numbers reported by the government.

The availability of low-cost air quality sensors has pushed another application in recent years: do-it-yourself air quality stations. The OK (Open Knowledge) Lab Stuttgart provides a very nice how-to guide to building and operating an air quality station with integrated temperature, humidity, and particulate matter sensors. The data can be shared with their platform using the provided API. Based on the data of thousands of distributed stations, a particulate matter (PM10) pollution map is published:

Components of the air quality station.

Since I am interested in air quality, it was obvious to build a station for myself. The setup with a NodeMCU ESP8266 for data processing and communication (WiFi), an SDS011 fine dust sensor, and a DHT22 temperature and humidity sensor is easy and straightforward. While they also provide a firmware to get started quickly, I was interested in sending the sensor data over MQTT, which is not supported by the original firmware. MQTT is a lightweight messaging protocol for sensor data streams from small sensors and mobile devices, optimized for high-latency or unreliable networks. Hence, I stripped down their full-fledged NodeMCU firmware and extended it with the ability to send all sensor data to an MQTT broker. The code can be found here:

A deployed air quality station.

Much to my surprise, it was not easy to find a simple online platform to visualize sensor data sent over MQTT. First I tried Amazon Web Services (AWS), one of the dominant players in providing cloud service platforms. While AWS provides many great services, its setup for this specific scenario is difficult. It requires an additional MQTT broker sitting between the NodeMCU ESP8266 sensor node and AWS’ own broker and involves multiple different service platforms, as shown in the high-level architecture diagram below.

High-level architecture overview of the involved services.

Furthermore, Amazon QuickSight only provides rudimentary data visualization capabilities in its current state. Showing real-time plots of the collected sensor data is not yet possible.

I found that this experience was not the exception but rather the rule: it was difficult to find a simple, easy-to-use cloud service for visualizing sensor data sent over MQTT. After searching for a while, I came across ThingSpeak:

ThingSpeak is an open source “Internet of Things” application and API to store and retrieve data from things using HTTP over the Internet or via a Local Area Network. With ThingSpeak, you can create sensor logging applications, location tracking applications, and a social network of things with status updates.

ThingSpeak has integrated support for the numerical computing software MATLAB from MathWorks, allowing ThingSpeak users to analyze and visualize uploaded data using MATLAB without purchasing a MATLAB license. ThingSpeak provides several paid license options but is also available as a free service for small non-commercial home projects (~8200 messages per day) with limits on capacity and update rates. These limitations are not a problem for an air quality station.

ThingSpeak runs its own MQTT broker. On the ThingSpeak platform, several channels can be created; each channel can process and visualize up to eight sensor data types. A channel can be fed with data by publishing to the topic channels/CHANNEL-ID/publish/API-KEY, where CHANNEL-ID is the channel’s unique ID and API-KEY is the channel-specific write API key. A detailed description with example code for an Arduino client can be found here: Publish to a Channel Using Arduino Client.

For every channel, a public dashboard can also be defined. This is shown below for my own channel visualizing the air quality station’s sensor data:

Public dashboard of the air quality sensor’s public channel.

Surprising App Analytics

Hosting apps on the iOS App Store is tied to a developer account with an annual fee of $99. Every year around this time I need to decide whether keeping it is still worth it. Cancelling the account means that the apps hosted through it are removed from the App Store.

If it weren’t for nostalgia, I would have closed the account a couple of years ago. At the moment I have a single app in the Store, based on the outcome of my PhD thesis. The Health-Optimal Route Planner demonstrates one of many interesting use cases enabled by high-resolution urban air pollution maps. The app computes and displays the healthiest route between any two locations in the city of Zurich and compares it to the shortest route between the two given locations.

The app itself, while being a nice use case demonstrator, has limited use for the general public. The high-resolution pollution map needed to implement the use case is only available for the city of Zurich. It comes as no surprise that the number of app installations is very small. Since its release in September 2014 there were a total of 554 downloads until this summer, with a small peak in the early days when a local newspaper wrote a story about the app.


Not having updated the app for the last 2.5 years, I decided to have a last glimpse at the statistics before closing the developer account. Rather surprisingly, the numbers have changed: since this August, the app gets installed around 60 times a day.


This does not make any sense. As mentioned above, the app’s value is limited to a small geographic region; this number of downloads from the region of Zurich is difficult to imagine. Hence, let’s have a look at the geographic distribution of the installations in the last two months.


This makes even less sense. The vast majority of installations originates from China. Being curious, the next question was whether the app is installed by some automated service or whether there are real users behind it. The app itself does not track any user interaction. But since it makes use of the Google Places and Maps APIs, and Google limits the number of requests before charging for them, the number of requests is logged and can be analyzed. The Places API receives a request when a user enters the location she wants to use to compute the optimal route. Only if the user specifies valid start and end locations (they need to be within the city of Zurich) and hits the Compute route button is the Maps API used to illustrate the computed routes. The stats show the following number of requests in the past days.



It seems that some of the users are trying to use the app by entering a location. But only a tiny subset of these attempts results in an actual route computation, indicating that most users would like to query a location not supported by the app.

The question remains: what is the reason for the app’s popularity in China? I don’t know, but I would be very curious to find out! By the way, the download statistics of the Android version of the app did not increase in the same time period.

Time-Delay Neural Networks

In the last post, we described how to implement Decision Trees on Embedded Systems. We ended the post by stating that while decision trees are well suited for data classification, their implementation on embedded systems is very challenging due to the, more often than not, limited memory.

Time-delay neural networks (TDNNs), another approach to data classification, gained momentum in recent years. They perform very well on time series and are therefore interesting for a wide range of applications, such as stock market prediction, image sequence analysis, and speech recognition. Further, they provide one significant advantage compared to decision trees: they can be implemented more efficiently on embedded systems. In the following, we will show how Matlab’s Neural Network Time Series Tool or its command-line counterpart, the timedelaynet function, can be used to train time-delay neural networks and run them on embedded systems. The code discussed here can be found on GitHub.

Artificial neural networks approximate the operation of the human brain. Wikipedia has the following short and crisp introduction:

Neural networks are a computational approach which is based on a large collection of neural units loosely modeling the way a biological brain solves problems with large clusters of biological neurons connected by axons. Each neural unit is connected with many others, and links can be enforcing or inhibitory in their effect on the activation state of connected neural units. Each individual neural unit may have a summation function which combines the values of all its inputs together. There may be a threshold function or limiting function on each connection and on the unit itself such that it must surpass it before it can propagate to other neurons. These systems are self-learning and trained rather than explicitly programmed and excel in areas where the solution or feature detection is difficult to express in a traditional computer program.

Time-delay neural networks work on sequential data, e.g., time series, by augmenting the input with time-delayed copies of previous inputs:


We use Matlab’s pollution mortality data set to show how to create and deploy a time-delay neural network. The data set’s input consists of eight measurements of the ambient environment (temperature, relative humidity, carbon monoxide, sulfur dioxide, nitrogen dioxide, hydrocarbons, ozone, particulate) and three output variables (total mortality, respiratory mortality, cardiovascular mortality). We can either use the Neural Network GUI

or command line functions to construct a time-delay neural network with ten hidden nodes and a two step delay:

% Pollution data set:
% - PollutionInputs: every input consists of 8 variables.
% - PollutionTargets: every target consists of 3 variables.
load pollution_dataset

% Training function
trainFcn = 'trainlm'; % Levenberg-Marquardt backpropagation.

% Create a time-delay network.
inputDelays = 1:2;
hiddenLayerSize = 10;
net = timedelaynet(inputDelays, hiddenLayerSize, trainFcn);

% Prepare data for training.
[x,xi,ai,t] = preparets(net, pollutionInputs, pollutionTargets);

% Setup Division of data for training, validation, testing.
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;

% Train the network.
[net,tr] = train(net,x,t,xi,ai);

Next, we want to deploy the classifier on a real system and, hence, we need a C/C++ implementation of the created and trained neural network. We implement the export function extractNeuralNetwork, which takes as input the trained network net and automatically creates the class TDNNClassifier with the header files Data.h and TDNNClassifier.h and the source file TDNNClassifier.cpp:

% Create a C++ implementation of the trained network.
extractNeuralNetwork(net);
Now we can use the neural network in a C/C++ project to predict pollution-induced mortality rates:

#include <iostream>
#include "Data.h"
#include "TDNNClassifier.h"

using namespace std;

static const float POLLUTION_DATASET[5][8] = {
  {72.38, 29.20, 11.51, 3.37, 9.64, 45.79, 6.69, 72.72},
  {67.19, 67.51, 8.92, 2.59, 10.05, 43.90, 6.83, 49.60},
  {62.94, 61.42, 9.48, 3.29, 7.80, 32.18, 4.98, 55.68},
  {72.49, 58.99, 10.28, 3.04, 13.39, 40.43, 9.25, 55.16},
  {74.25, 34.80, 10.57, 3.39, 11.90, 48.53, 9.15, 66.02}
};

int main(int argc, char* argv[]) {
  // Test the classifier with the first five inputs of Matlab's dataset
  // pollution_dataset, which was used to train the neural network.
  Data data;
  TDNNClassifier classifier;
  float *prediction;

  for (int d = 0; d < 5; d++) {
    for (int i = 0; i < 8; i++) {
      data.value[i] = POLLUTION_DATASET[d][i];
    }
    prediction = classifier.Predict(data);
    cout << "Prediction " << prediction[0] << ", " << prediction[1]
         << ", " << prediction[2] << endl;
  }
  return 0;
}

Check out the Github project TDNN-Matlab2Cpp for more details and the full source code.

Decision Trees on Embedded Systems

Many approaches for classification and regression deliver high accuracy but are, due to their black-box nature, not interpretable for their users; models produced with support vector machines (SVMs) are one example. Hence, these models do not provide us with any intuitive insight into why they perform well in certain situations while failing miserably in others.

White-box models are the opposite: they let us look inside the model to find out what it learned from the training data and how it uses the input variables to predict the value of the output variable. A popular class of white-box models are decision trees. In the following, we will have a closer look at them, more specifically at Matlab’s TreeBagger random decision forest implementation, and show how we can run the classifier on embedded systems. The code discussed here can be found on GitHub.

Decision trees are tree-like graphs. Each interior node represents one of the input variables, and the edges going to the node’s children represent possible values of that input variable. The leaf nodes depict the values of the output variable given that the input variables match the values of the edges traversed from the root to the given leaf node. The following tree is an example from Akanoo, a company providing automated on-site marketing solutions. It shows the classification of web shop visitors as buyers or non-buyers:


The tree shows the basic principle of a decision tree analysis. In reality, the analysis depends on a variety of additional factors, i.e., input variables. Akanoo, for example, uses a combination of over 50 independent variables to calculate purchasing probabilities.

Matlab’s TreeBagger function combines multiple decision trees, each using a random subset of the input variables, to increase the classification accuracy. The following example uses Fisher’s iris flower data set to show how TreeBagger is used to create 20 decision trees to predict three different flower species based on four input variables: sepal length, sepal width, petal length, and petal width.

% fisheriris data set:
%  - meas: Matrix of input variables
%  - species: vector of species
load fisheriris

% create numerical class labels
species_class(find(strcmp(species, 'setosa'))) = 1;
species_class(find(strcmp(species, 'versicolor'))) = 2;
species_class(find(strcmp(species, 'virginica'))) = 3;

% create decision trees
num_bags = 20;
B = TreeBagger(num_bags, meas, species_class);

B holds the ensemble of 20 decision trees, which we can use to predict the species based on the four input variables. In the following, we use the first set of input variables in our measurement data set to predict the species.

predict(B, meas(1,:))

Each of the trees in B has around 20 nodes, even for this rather simple classification example. For more complex problems, the trees easily grow to many hundreds of nodes. As long as we use the classifier within Matlab, this is not really a problem. However, as soon as we want to deploy the classifier on a real device, we need to re-implement the decision trees in a low-level language, such as C or C++, and export the trees trained in Matlab. Luckily, Paul Kendrick provides on GitHub the project decisionTreeMat2Cpp, which does exactly what we need:

This program takes a decision tree trained in Matlab using TreeBagger or the classification tree function ClassificationTree and outputs a text file containing all the branch information. This text file can be read by the attached C++ class, and then used to make decisions based on presented features in deployed applications.

However, many embedded systems do not have a file system, making it impossible to read in the text files describing the decision trees. Hence, we extend the package to directly output a header file containing all the branch information. This header file can easily be included in the C++ class that implements the logic of the decision trees. The following Matlab command

extractDecTreeStruct(B, unique(species_class), 1, num_bags);

creates the header file decTreeConstants.h, which looks like this (only showing the first few lines of the file):


const int NO_CLASSES = 3;
const int NO_BAGS = 20;

const int NO_BRANCHES[20] = {5, 10, 7, 9, 11, 6, 7, 8, 8, 5, 8, 7, 5, 10, 9, 12, 8, 6, 5, 13};

const int BRANCH_LENGTHS[20][13] = {
{1, 2, 3, 4, 4},
{2, 3, 3, 3, 3, 4, 4, 4, 5, 5},
{1, 2, 3, 4, 5, 6, 6},
{1, 3, 3, 4, 4, 5, 5, 5, 5},
{1, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6},
{1, 2, 3, 4, 5, 5},

This simple C++ example illustrates how we can then use the decision tree class to classify a species based on the input variables 5.7, 2.8, 4.1, and 1.3, corresponding to sepal length, sepal width, petal length, and petal width:

#include <iostream>
#include "DecisionTreeClass.hpp"

using namespace std;

int main(int argc, char* argv[]) {

  DTree tree;

  float input[4] = {5.7f, 2.8f, 4.1f, 1.3f};
  int prediction = tree.decisionTreeFun(input);
  cout << "Predicted species: " << prediction << endl;

  return 0;
}
Check out the Github project TreeBagger-Matlab2Cpp for more details and the full source code.

Another challenging problem not covered by this post persists: embedded systems are chronically short on memory and, hence, may have difficulties fitting the large three-dimensional arrays needed by the decision trees’ C++ implementation.