Latest on Neural Networks research

We are now several weeks into the implementation of our Neural Network predictive algorithm, and it is worth sharing some of the lessons we have learnt. This is the first post in a series, and we will update it regularly with our findings.

Neural Networks have been a hot topic in recent months, and with the release of new libraries (notably Google's TensorFlow) a very active community has developed.

As often happens when a new technology moves into the mainstream, we have been seeing the usual promises: “Learn Neural Networks in 1h” or “How to write your own Neural Network model in 4 Python lines”… Unlikely!

Most of these claims are a blind copy-paste of the TensorFlow examples or of posts from researchers. While an example may be correct for the dataset it has been trained on, that does not mean it will converge on and predict any other dataset (read it as “it cannot predict a random dataset”).

Our approach has been to first understand the basics of Neural Network theory, then dig deep into the low-level TensorFlow API to build a network that can converge to an analytical recurrent function, a \([-1, 1]\)-bounded \(f(x)=\sin\left(\frac{x}{\lambda}\right)\), and use different activation functions to check the error and the speed of convergence.
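To give a concrete idea of what “converging to an analytical recurrent function” means in practice, here is a minimal sketch (not the code in our framework; the helper name, window length and sampling step are placeholder choices) of how training samples for this target can be generated:

```python
import numpy as np

def make_samples(lam, n_samples=1000, window=50, dx=0.1, rng=np.random):
    """Generate (input window, next value) pairs from f(x) = sin(x / lam).

    A random phase shift is applied to each window so the network never
    sees exactly the same portion of the curve twice.
    """
    X, y = [], []
    for _ in range(n_samples):
        shift = rng.uniform(0.0, 2.0 * np.pi * lam)       # random shift
        xs = shift + dx * np.arange(window + 1)
        series = np.sin(xs / lam)                         # [-1, 1] bounded target
        X.append(series[:-1])                             # input sequence
        y.append(series[-1])                              # next value to predict
    return np.array(X)[..., None], np.array(y)[:, None]   # shapes: (N, window, 1), (N, 1)
```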

Only after we had fully understood these steps could we move mindfully to the analysis of real time series, with data coming from the crypto exchange we use for trading.

The basic ideas have been taken from a GitHub repo, and we modified the code to integrate it into our internal Python framework.

As our final objective is to predict data, we focus on LSTM networks only (LSTM stands for Long Short-Term Memory). We skip the theory here and advise reading this excellent blog post.
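For reference, a minimal single-layer LSTM regressor written against the TensorFlow 1.x low-level API looks roughly like the sketch below. The window length, hidden size and learning rate are illustrative assumptions, not the configuration of our actual network.

```python
import tensorflow as tf  # TensorFlow 1.x style API

WINDOW, HIDDEN = 50, 32          # assumed window length and LSTM size

inputs = tf.placeholder(tf.float32, [None, WINDOW, 1], name="inputs")
targets = tf.placeholder(tf.float32, [None, 1], name="targets")

# A single LSTM layer unrolled over the input window.
cell = tf.nn.rnn_cell.BasicLSTMCell(HIDDEN)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# Regress the next value from the last hidden state. The output activation
# (ReLU vs linear) is exactly what the tests below compare.
last = outputs[:, -1, :]
prediction = tf.layers.dense(last, 1, activation=tf.nn.relu)  # use activation=None for a linear output

loss = tf.losses.mean_squared_error(targets, prediction)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```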

Test 1 - Single Training dataset - ReLU activation function

The network is trained on \(f(x)=\sin\left(\frac{x}{\lambda}\right), \lambda \in [0.5\mathrel{\ldotp\ldotp}4]\) using random shifts for 25000 epochs, and we set a fixed seed for the random number generator. The activation function used is a rectified linear unit (ReLU), \(f(x)=\max(x, 0)\).
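A training loop for this setup could look as follows; it reuses the hypothetical make_samples helper and the graph sketched above, and is meant only to illustrate the fixed seed, the random shifts and the 25000 epochs, not to reproduce our exact run.

```python
import numpy as np

tf.set_random_seed(42)          # fixed seed for the random number generator
np.random.seed(42)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(25000):
        lam = np.random.uniform(0.5, 4.0)        # frequency drawn from [0.5, 4]
        X, y = make_samples(lam, n_samples=64)   # random shifts applied inside
        _, mse = sess.run([train_op, loss], {inputs: X, targets: y})
        if epoch % 1000 == 0:
            print(epoch, mse)
```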

Unsurprisingly, the network captures the positive contributions of the various frequencies and returns zero for the negative contributions, since the ReLU output cannot go below zero.

Test 1

This test shows how important it is to carefully choose the activation function (or, conversely, to properly scale your input data) to guarantee successful convergence. The same comment applies to the other network hyper-parameters.
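As a simple illustration of the scaling remedy (hypothetical, not what we do in our framework): since a ReLU output cannot produce negative values, the \([-1, 1]\)-bounded target can be shifted into \([0, 1]\) before training and mapped back after prediction.

```python
import numpy as np

x = np.linspace(0.0, 4.0 * np.pi, 9)
y = np.sin(x / 2.0)                  # target bounded in [-1, 1]
relu_view = np.maximum(y, 0.0)       # what a ReLU output can express: negatives collapse to 0

y_scaled = (y + 1.0) / 2.0           # shift the target into [0, 1] before training
y_restored = 2.0 * y_scaled - 1.0    # invert the mapping after prediction
```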

Test 2 - Single Training dataset - Linear activation function

The network is trained on \(f(x)=\sin\left(\frac{x}{\lambda}\right), \lambda = 2\) only, using random shifts for 25000 epochs, and we set a fixed seed for the random number generator. The activation function used is linear, \(f(x)=x\). The network structure is the same as in the previous test.

The expectation, therefore, is that when we try to apply the network to different frequencies, it will not be able to properly grasp the specific frequency and will revert to something close to the one it knows (this becomes visible for \(\lambda=3\)). This demonstrates that using a training set built on a totally different sequence does not lead to good results.

Test 2

Test 3 - Multiple Training dataset - Linear activation function

The network is trained on \(f(x)=\sin\left(\frac{x}{\lambda}\right), \lambda \in [0.5\mathrel{\ldotp\ldotp}4]\) using random shifts for 25000 epochs, and we set a fixed seed for the random number generator. The activation function used is linear, \(f(x)=x\). The network structure is the same as in the previous tests.

Test 3

When we test this network with different datasets, it is able to properly capture the various frequencies.
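A sketch of how this check can be run, continuing the hypothetical training session above (again, placeholder code rather than our actual test harness):

```python
# Inside the same tf.Session used for training above.
for lam in (0.7, 1.5, 3.0):                       # frequencies to probe
    X, y = make_samples(lam, n_samples=256)
    pred = sess.run(prediction, {inputs: X})
    print("lambda=%.1f  test MSE=%.5f" % (lam, np.mean((pred - y) ** 2)))
```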

Written on May 20, 2018