# An Autoregressive Neural Network Approach to Forecasting Bitcoin Price

### by MAX Contributor Andrew Jim

DISCLAIMER: This post and its contents should in no way be considered investment advice.

“Essentially, all models are wrong, but some are useful.”
George Box

Neural networks (NN) have been one of the more interesting machine learning models, both because their structure is inspired by the brain and because they can be made much more complex (think of deep learning with many hidden layers). Neural networks have not always been popular: they have been viewed as a computationally expensive black box, and in some cases they do not yield better results than simpler methods. In this article, we are going to explore a simpler NN model to capture the nonlinear movements in Bitcoin price: a multilayer feed forward neural net with a single hidden layer using autoregressive (lagged) inputs.

## 1. Introduction

An NN can be thought of as a “network of neurons” (nodes) organized in layers, analogous to the way information passes through neurons in humans. The advantage of a neural network is its adaptive nature: it learns from the inputs provided and trains itself on the data (i.e. it optimizes its weights for a better prediction).

## A Feed Forward Neural Network — The Multilayer Perceptron:

A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. The outputs of one layer are the inputs to the next. The first layer of the neural network receives the inputs. Coefficients (weights) are attached to these inputs, and a linear combination of the weighted inputs is passed (“fed forward”) to the hidden layer. The results from the hidden layer nodes are modified by a nonlinear activation function (in our case a sigmoid function) before being passed to the output layer, which has only one node representing the predicted value.

The example below shows a NN with 4 inputs and 1 hidden layer with 3 neurons which feed to one single output:

Activation functions are an important feature of a neural network since they decide whether a node should be “fired up” (i.e. activated), that is, whether the information the node is receiving is relevant or should be ignored. This is summarized in the general formula below, along with a bias parameter which helps the model find a best fit for the given input data (think of it conceptually as an ‘intercept’ that allows us to shift the activation function, which may be critical for successful learning).
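To make the feed-forward step concrete, here is a minimal sketch of a 4-3-1 network in plain Python. Each hidden node computes sigmoid(w1*x1 + ... + wn*xn + b), and the output node combines the hidden values. The weights below are made up for illustration, and the linear (un-squashed) output node is an assumption that is common for regression rather than something confirmed by this article.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass through a single-hidden-layer network.

    x        : list of inputs
    w_hidden : one list of weights per hidden node (len(x) weights each)
    b_hidden : one bias per hidden node
    w_out    : one weight per hidden node
    b_out    : bias of the single output node
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    # Assumed linear output node (no activation on the final prediction).
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Illustrative, made-up weights for 4 inputs, 3 hidden nodes, 1 output.
x = [0.5, -1.2, 0.3, 0.8]
w_hidden = [[0.1, -0.2, 0.4, 0.0],
            [0.3, 0.1, -0.5, 0.2],
            [-0.4, 0.2, 0.1, 0.3]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.6, -0.3, 0.8]
b_out = 0.05
y_hat = forward(x, w_hidden, b_hidden, w_out, b_out)
```

Changing a bias shifts where the corresponding sigmoid switches on, which is the ‘intercept’ role described above.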

## Training a Neural Net:

The connections in an NN are weighted, and the weights are optimized using a backpropagation algorithm/learning rule (in our case a gradient descent algorithm). This algorithm iteratively adjusts the weights and biases to minimize a loss function (in our case RMSE, the Root Mean Square Error), which in essence measures the difference between the predicted and actual observed values.

Thus, better prediction = lower loss, and training a network = trying to minimize its loss.

Note: the initial weights at the input layer take random values to begin with and are updated using the observed data. As such, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is normally “trained” several times using different random starting weights, and the results are averaged.
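The training loop and the averaging over random starts can be sketched as follows. This is a toy illustration, not the code used for the results in this article: it uses plain stochastic gradient descent on the squared error (which has the same minimizer as RMSE), a made-up sine-wave target, and assumed values for the learning rate, epoch count and number of restarts.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, net):
    """Return (prediction, hidden activations) for one input row."""
    w_h, b_h, w_o, b_o = net
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_h, b_h)]
    return sum(w * h for w, h in zip(w_o, hidden)) + b_o, hidden

def rmse(X, y, net):
    return math.sqrt(sum((predict(x, net)[0] - t) ** 2 for x, t in zip(X, y)) / len(y))

def train(X, y, n_hidden=3, lr=0.1, epochs=500, seed=0):
    """Fit a single-hidden-layer net by stochastic gradient descent."""
    rng = random.Random(seed)              # a different seed gives different starting weights
    n_in = len(X[0])
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            out, hidden = predict(x, (w_h, b_h, w_o, b_o))
            err = out - t                  # gradient of 0.5 * err^2 w.r.t. the output
            for j in range(n_hidden):
                # Backpropagate through the sigmoid, whose derivative is h * (1 - h).
                delta = err * w_o[j] * hidden[j] * (1.0 - hidden[j])
                w_o[j] -= lr * err * hidden[j]
                for i in range(n_in):
                    w_h[j][i] -= lr * delta * x[i]
                b_h[j] -= lr * delta
            b_o -= lr * err
    return w_h, b_h, w_o, b_o

# Toy data: learn y = sin(2x) on [-1, 1].
X = [[i / 10.0] for i in range(-10, 11)]
y = [math.sin(2.0 * row[0]) for row in X]

# Train several nets from different random starts and average their predictions.
nets = [train(X, y, seed=s) for s in range(5)]

def ensemble_predict(x):
    return sum(predict(x, net)[0] for net in nets) / len(nets)
```

Averaging over restarts smooths out the dependence on any single random initialization, at the cost of training the network several times.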

## Time Series Data:

For time series data (in our case daily Bitcoin prices), the lagged (autoregressive) values of the time series are used as inputs to a neural network. The objective is then to determine how many lags to include in the input layer and how many neurons to include in the hidden layer to produce a forecast that minimizes RMSE. For forecasting, the network is applied iteratively (e.g. to forecast 10 steps ahead, we use the forecasts for the preceding 9 steps along with the original historical data).
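As a small sketch of both ideas, the helpers below build the lagged input rows and then forecast several steps ahead by feeding each forecast back in as a lag. The function names are hypothetical, and the "model" is a stand-in (the mean of the lags) so that the mechanics are visible without a trained network.

```python
def make_lagged(series, p):
    """Turn a series into (inputs, targets) using p lags.

    Each input row is [y[t-1], y[t-2], ..., y[t-p]] and the target is y[t].
    """
    X, y = [], []
    for t in range(p, len(series)):
        X.append([series[t - k] for k in range(1, p + 1)])
        y.append(series[t])
    return X, y

def forecast_iteratively(history, p, model, steps):
    """Forecast `steps` values ahead, reusing earlier forecasts as lags."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        lags = [window[-k] for k in range(1, p + 1)]   # most recent value first
        y_next = model(lags)
        forecasts.append(y_next)
        window.append(y_next)                          # the forecast becomes "history"
    return forecasts

def mean_of_lags(lags):
    """Stand-in for a trained network: predicts the average of its inputs."""
    return sum(lags) / len(lags)

series = [1, 2, 3, 4, 5, 6]
X, y = make_lagged(series, p=2)
# X == [[2, 1], [3, 2], [4, 3], [5, 4]] and y == [3, 4, 5, 6]
path = forecast_iteratively(series, p=2, model=mean_of_lags, steps=3)
# path == [5.5, 5.75, 5.625]
```

Because later steps depend on earlier forecasts rather than on observed prices, errors compound as the horizon grows.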

## Preparing to fit the neural network:

Before fitting a neural network, some data pre-processing needs to be done. Neural networks are not that easy to train and tune. It is good practice to scale and/or transform the data before training to avoid spurious results or long training times due to lack of convergence. NNs do not require data normality (or stationarity for that matter), but transforming the data (e.g. a log transform) can help to remove noise, reduce variance and improve the signal in time series forecasting. In our case we will use a Box-Cox transform (a power transformation whose exponent, lambda, is chosen via maximum likelihood estimation).

In our case, we scale the data using ‘Z-normalization’ (i.e. subtracting the data mean from each data point and then dividing by the standard deviation). Given the high historical volatility of BTC, robust scaling (subtracting the median and dividing by the interquartile range, IQR) was also tried, but the results were similar.
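The pre-processing steps can be sketched in plain Python as below. This is an illustration under stated assumptions: lambda is picked by a simple grid search over the Box-Cox profile log-likelihood (a library routine would normally do this with a proper optimizer), the grid range of -2 to 2 is arbitrary, and the function names are hypothetical. Box-Cox requires strictly positive data, which holds for prices.

```python
import math
import statistics

def boxcox(x, lam):
    """Box-Cox power transform; lam == 0 is the log transform."""
    return [math.log(v) if lam == 0 else (v ** lam - 1.0) / lam for v in x]

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lam (up to a constant), assuming the
    transformed data are normally distributed."""
    y = boxcox(x, lam)
    n = len(x)
    return (lam - 1.0) * sum(math.log(v) for v in x) - 0.5 * n * math.log(statistics.pvariance(y))

def best_lambda(x, grid=None):
    """Grid-search the lam that maximizes the log-likelihood."""
    grid = grid or [i / 20.0 for i in range(-40, 41)]   # -2.0 to 2.0 in steps of 0.05
    return max(grid, key=lambda lam: boxcox_loglik(x, lam))

def z_normalize(x):
    """Subtract the mean, then divide by the standard deviation."""
    mu, sd = statistics.mean(x), statistics.pstdev(x)
    return [(v - mu) / sd for v in x]

def robust_scale(x):
    """Subtract the median, then divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    med = statistics.median(x)
    return [(v - med) / (q3 - q1) for v in x]

# Made-up, right-skewed positive data standing in for prices.
prices = [math.exp(v) for v in (-2, -1, 0, 1, 2)]
lam = best_lambda(prices)
scaled = z_normalize(boxcox(prices, lam))
```

Forecasts made on the transformed scale have to be mapped back (undo the scaling, then invert the Box-Cox transform) before they can be read as prices.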

The chart below shows BTC after the Box-Cox transform and standard normalization, along with a histogram that has much less skew and variance and more closely resembles a normal distribution.

## 3. Key Neural Network Parameters

Number of layers and nodes in each layer:
There are no fixed rules on the number of layers to use or the number of nodes to use in each layer. Usually one hidden layer is enough to model a large number of different non-linear applications (our model below uses one hidden layer). As for the number of nodes inside the hidden layer, a common empirical rule of thumb is that it should lie between the input layer size and the output layer size, usually 1/2 to 2/3 of the number of nodes in the input layer. In our case, the number of nodes in the input layer will be determined by the optimal number of non-seasonal lags (p) from a linear autoregressive model AR(p), where the lagged values are used as predictor variables. The number of nodes in our hidden layer will be at least 1/2 of the number of nodes in the input layer.

Other Model Hyperparameters:
There are some model parameters which are specific to the problem and they are found by “tuning” the model via trial and error or using best practices. This is done to improve the skill of the model predictions. In our case, two key hyperparameters are addressed:

## (a) Early Stopping

The normal practice is to split our dataset (e.g. 90/10) into a training set and a test set. The NN model is initially fit only on the data in the training set. The resulting parameters (weights for the neural network etc.) are then used to produce a forecast which can be compared to the test set data to provide an unbiased evaluation of the model fit.

A major challenge in training neural networks is how long to train them. Too little training will mean that the model will underfit, and too much training will mean that the model will overfit the training dataset and have poor prediction performance on the test set or other new data. A compromise is to stop training at the point when performance on the test set starts to degrade. This widely used approach to training neural networks is called ‘early stopping’.

In our case, we trained the NN over a large number of epochs, where one epoch is when the entire dataset is passed through the NN once. Once the performance of the model on the test set started to degrade, the training process was stopped. The line plot below shows the impact on test accuracy as measured by RMSE vs the number of epoch iterations. We can see via the ups and downs on the chart that test accuracy increases to a point and then begins to decrease again after epoch 300. Therefore we can expect that a good time to stop training would be around epoch 300.
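A generic early-stopping loop might look like the sketch below. The two callbacks and the `patience` parameter are hypothetical names introduced for illustration; the toy error curve is constructed to bottom out at epoch 300 so that the behaviour is easy to follow, and it is not the curve from the plot.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=1000, patience=20):
    """Stop once the held-out error has not improved for `patience` epochs.

    train_one_epoch() : runs one pass over the training data (hypothetical callback)
    evaluate()        : returns the current held-out error, e.g. RMSE (hypothetical callback)
    Returns (best_epoch, best_error).
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = evaluate()
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement for `patience` epochs
    return best_epoch, best_error

# Toy stand-ins: the "error" falls until epoch 300 and rises afterwards.
state = {"epoch": 0}

def fake_train():
    state["epoch"] += 1

def fake_eval():
    return abs(state["epoch"] - 300) / 300.0 + 0.1

best_epoch, best_error = train_with_early_stopping(fake_train, fake_eval)
# best_epoch == 300; the loop itself runs until epoch 320 before giving up
```

In a real run you would also keep a copy of the weights from the best epoch, since the weights at the moment of stopping are already slightly past the optimum.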

## (b) Regularization: Weight Decay

Two related hyperparameters govern how the weights are updated during training: the learning rate and the weight decay. The learning rate controls the rate at which the NN model learns. In general, a larger learning rate allows the model to learn faster, at the cost of arriving at a sub-optimal final set of weights, whereas a smaller learning rate may allow the model to learn a more optimal set of weights but can take significantly longer to train. Unfortunately, we cannot analytically calculate the optimal learning rate for a given model on a given dataset. Instead, a ‘good enough’ learning rate must be estimated via trial and error.

Weight decay is a regularization penalty rather than a step size. At each update the weights are shrunk by a factor slightly below 1 (equivalently, a penalty proportional to the sum of squared weights is added to the loss), which prevents the weights from growing too large. This approach tends to reduce the overfitting of an NN on the training data and helps improve the performance of the model on new data. In our model, we do a sensitivity analysis using fixed weight decay penalties of between 0 and 0.01 on the training set, and choose the one that gives the best accuracy on the test set as measured using MAPE (Mean Absolute Percentage Error; lower is better).

From the line plot below, we can see that using a weight decay penalty of about 0.002 gives better accuracy (i.e. lower loss) on the test data vs the default of using no (i.e. zero) weight decay. Any larger decay penalty shows noticeably poorer test accuracy.
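To show what the decay penalty does mechanically, here is a minimal sketch on a one-parameter toy model (a straight line through the origin) rather than a neural net. The data, learning rate and function names are made up for illustration, and which decay value wins on this toy data says nothing about Bitcoin.

```python
def sgd_step(w, grad, lr, decay):
    """One gradient step on loss + (decay / 2) * sum(w^2).

    Equivalent to shrinking each weight by (1 - lr * decay) and then
    taking the usual gradient step.
    """
    return [(1.0 - lr * decay) * wi - lr * gi for wi, gi in zip(w, grad)]

def fit_slope(xs, ys, decay, lr=0.01, epochs=2000):
    """Fit y = w * x by gradient descent on mean squared error with weight decay."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w = sgd_step([w], [grad], lr, decay)[0]
    return w

def mape(actual, predicted):
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up data: fit on the first four points, score on the last two.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5.0, 6.0], [10.0, 12.0]

# Sensitivity analysis: try several decay penalties and compare held-out MAPE.
scores = {}
for decay in [0.0, 0.002, 0.005, 0.01]:
    w = fit_slope(train_x, train_y, decay)
    scores[decay] = mape(test_y, [w * x for x in test_x])
best_decay = min(scores, key=scores.get)
```

The larger the decay, the more the fitted weight is pulled toward zero, which is what keeps weights from growing too large in a bigger network.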

## 4. Fitting the Neural Net to Bitcoin

We split our BTC time series data (about 3,280 daily price observations starting from 2010-07-17) into a training set (90%) on which to train the NN and a test set (10%) on which to evaluate the NN predictions.
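For ordered data the split is a simple cut in time, with no shuffling. The sketch below uses placeholder numbers instead of real prices and also shows one way to keep the test set isolated during pre-processing: the scaling statistics are computed on the training portion only and then reused on the test portion. That detail is an assumption about good practice, not a description of the exact pipeline used here.

```python
import statistics

def chronological_split(series, train_fraction=0.9):
    """Split an ordered series without shuffling: the earliest part trains, the latest part tests."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]

prices = [float(i + 1) for i in range(3280)]   # placeholder values, not real BTC prices
train, test = chronological_split(prices)      # 2952 training and 328 test observations

# Scaling statistics come from the training set only and are reused on the test set.
mu, sd = statistics.mean(train), statistics.pstdev(train)
train_scaled = [(v - mu) / sd for v in train]
test_scaled = [(v - mu) / sd for v in test]
```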

Note that the test set data is isolated and not used in any way to influence training of the NN or tuning of the parameters, so that the forecast produced from the training data will be unbiased. The training data is transformed using a Box-Cox transformation, then scaled/normalized and fed into our neural net as shown in the plot below. There are a total of 33 weighted connections formed by 6 nodes (X1 to X6) in the input layer representing the 6 most recent daily price lags, 4 nodes (H1 to H4) in the hidden layer, and 2 bias nodes (B1 and B2), which all feed forward to the 1 output node (O1) (the predicted price):

A (6-4-1) feed forward autoregressive neural net.
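The figure of 33 weighted connections follows directly from the layer sizes, as this short check shows (the function name is hypothetical):

```python
def count_weights(n_inputs, n_hidden, n_outputs=1):
    """Number of weighted connections in a single-hidden-layer network with bias nodes."""
    input_to_hidden = n_inputs * n_hidden     # X1..X6 each connect to H1..H4
    hidden_biases = n_hidden                  # bias node B1 connects to every hidden node
    hidden_to_output = n_hidden * n_outputs   # H1..H4 each connect to O1
    output_biases = n_outputs                 # bias node B2 connects to the output node
    return input_to_hidden + hidden_biases + hidden_to_output + output_biases

total = count_weights(6, 4, 1)   # 24 + 4 + 4 + 1 = 33
```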

In order to produce confidence intervals, future sample paths were iteratively simulated to build up knowledge of the forecast distribution based on the fitted neural net. As an example, the simulation below shows just 10 such future paths covering about 180 days forward from timestamp 3100 (approx end of Dec 2018):

Repeating this simulation 10,000 times gives us a good picture of the forecast distribution. The plot below shows those paths with 68% and 95% confidence intervals with the mean of those paths highlighted in dark blue. The actual test series in red is also overlaid for comparison.
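One way to build such a forecast distribution is sketched below: each path is generated step by step, adding a randomly drawn past residual to the model's prediction and feeding the result back in as a lag. Treat this as an assumption-laden illustration. The residual bootstrap, the persistence "model" standing in for the fitted network, the made-up residuals and the function names are all choices made for the example, and it runs 1,000 paths rather than 10,000 to stay quick.

```python
import random
import statistics

def simulate_paths(history, p, model, residuals, steps, n_paths, seed=0):
    """Simulate future sample paths by iterating the model with resampled errors."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        window = list(history[-p:])
        path = []
        for _ in range(steps):
            lags = [window[-k] for k in range(1, p + 1)]
            y_next = model(lags) + rng.choice(residuals)   # prediction plus a random shock
            path.append(y_next)
            window.append(y_next)
        paths.append(path)
    return paths

def interval(paths, step, level):
    """Central interval at one forecast step, read off the simulated values."""
    values = sorted(path[step] for path in paths)
    lo_idx = int((1.0 - level) / 2.0 * (len(values) - 1))
    hi_idx = int((1.0 + level) / 2.0 * (len(values) - 1))
    return values[lo_idx], values[hi_idx]

def persistence(lags):
    """Stand-in for the fitted network: tomorrow equals today."""
    return lags[0]

history = [100.0] * 6
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]        # made-up in-sample errors
paths = simulate_paths(history, 6, persistence, residuals, steps=30, n_paths=1000)

mean_path = [statistics.mean(path[s] for path in paths) for s in range(30)]
lo68, hi68 = interval(paths, step=29, level=0.68)
lo95, hi95 = interval(paths, step=29, level=0.95)
```

The bands widen with the horizon because every simulated step adds a fresh shock on top of the accumulated ones.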

We can obtain a matrix of accuracy measures. In our case we are minimizing RMSE as our loss function but MAPE (mean absolute percentage error) is also highlighted since it is sometimes easier to understand expressing accuracy as a percentage:

The results show our NN was trained and fitted to within 3.6% MAPE of the original training data set, a relatively close fit given the volatile nature of Bitcoin. As we generalize our problem and use our trained NN to forecast 180 days ahead (from 2018-12-31 to 2019-06-30), we can see that both error measures, RMSE and MAPE, increase as expected. The chart shows a wide forecast range; however, the mean of the forecasted paths does rise in line with the test set data. Using the withheld test set for comparison, MAPE tells us that we can expect to see, on average, a forecasting error of about 15.5%. At 180 days ahead, actual BTC price was ~10,100 vs the forecasted mean of ~8,000 (a ~20% discrepancy). As the chart shows, Bitcoin price clearly had a recent rally to above ~13,000, taking it to the upper bound of the 95% confidence interval and perhaps suggesting limited upside from current levels in the very near term.
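For reference, the two accuracy measures can be written in a few lines. The last line applies the percentage-error calculation to the single 180-day figures quoted above (actual ~10,100, forecast mean ~8,000) purely as a worked example of the arithmetic.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: in the same units as the data (here, price)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean Absolute Percentage Error: average error as a percentage of the actual value."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Single-point check of the 180-day discrepancy: |10100 - 8000| / 10100, roughly 20.8%.
discrepancy = mape([10100.0], [8000.0])
```

RMSE penalizes large misses more heavily because errors are squared, while MAPE treats a 10% miss the same whether the price is 1,000 or 10,000.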

## 5. Final Model Evaluation

We now take our final model, with the optimized parameters and hyperparameters obtained from the training dataset, and apply it to the full dataset (train + test) in order to make a forward forecast from the latest trading observation (on 2019-07-09).

The following forecast distribution along with confidence intervals and accuracy measurements were obtained:

There are a few notable takeaways. The model fits the full dataset relatively well, training to within 3.15% MAPE (RMSE tells a similar story). The 180-day forward forecast shows that the mean price level should stabilize around ~10,000 (i.e. the model does not expect a heavy pullback from sell-offs after the recent strong rally), but the confidence intervals remain wide (68% band between ~7,700 and ~11,700, 95% band between ~5,500 and ~14,200). On the upside in particular, the upper band of the 95% confidence interval peaks at ~15,200, implying that the model does not expect to see a rally above that level in the short term. Clearly, the NN forecasts will evolve with time and price fluctuations, but the question is: how many time steps (days) ahead should we realistically expect the forecast to hold within a certain error level? We can answer that using cross-validation below.

## 6. Cross-Validation

Cross-validation is a statistical resampling technique used to estimate the skill of machine learning models when we have limited data. That is, to estimate how accurately a predictive model will perform in practice.

## Time Series Cross-Validation:

Time series data (or other intrinsically ‘ordered data’) can be problematic for cross-validation given the temporal dependencies of each data point, so standard cross validation where test set data is randomized will not work. Instead we use a ‘time series cross validation’ method where the training set only consists of data that occurred prior to the observation that forms the test set. Doing so means that future observations are not used in constructing the forecast. As shown in the diagram below, a certain point in time is selected where everything before that point (blue) is the training data and everything after that point (red) is the test data. The forecast accuracy is computed by averaging over all the test sets.

This procedure is sometimes known as “walk forward validation with a rolling forecasting origin”, because the origin on which the forecast is based rolls forward in time. It allows us to detect structural changes in the data, and it mirrors practice, where we will likely retrain our model as new data becomes available anyway. Retraining gives the model the best opportunity to make good forecasts at each time step, and this cross-validation technique provides a much more robust estimate of how the chosen model and parameters will perform in practice.
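A rolling-origin evaluation can be sketched as below. The function names, the naive "repeat the last value" forecaster and the synthetic trending series are all stand-ins chosen for illustration, and moving the origin forward 7 observations at a time is just one possible setting. In a real run the forecaster would refit the network on each training window.

```python
def rolling_origin_mape(series, fit_and_forecast, min_train, horizon, step=1):
    """Walk-forward validation: MAPE (in %) for each forecast horizon 1..horizon.

    fit_and_forecast(train, horizon) must return `horizon` forecasts using
    only the observations in `train` (hypothetical callback).
    """
    errors = {h: [] for h in range(1, horizon + 1)}
    for origin in range(min_train, len(series) - horizon + 1, step):
        train = series[:origin]                      # only data before the origin
        forecast = fit_and_forecast(train, horizon)
        for h in range(1, horizon + 1):
            actual = series[origin + h - 1]
            errors[h].append(abs((actual - forecast[h - 1]) / actual))
    return {h: 100.0 * sum(e) / len(e) for h, e in errors.items()}

def naive_forecast(train, horizon):
    """Stand-in forecaster: repeat the last observed value."""
    return [train[-1]] * horizon

# Synthetic, steadily rising series (not real prices).
series = [100.0 + i for i in range(200)]
mape_by_h = rolling_origin_mape(series, naive_forecast, min_train=100, horizon=30, step=7)
```

Plotting `mape_by_h` against the horizon gives the kind of error-versus-days-ahead curve used to decide how far out a forecast can be trusted.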

In our case, the following accuracy measures were obtained using a 7 day step ahead rolling origin forecast (instead of just 1 day steps) to see if the model is still robust in forecasting slightly further out. The plots below show the accuracy measures for MAPE and RMSE plotted against forecast days ahead:

Intuitively from the plots, we can see that our error measures increase (forecast accuracy decreases) as the number of time steps ahead (h) increases. In our case, reading from the right-hand plot, to keep MAPE at or below 10% (RMSE below ~1000) we should realistically not rely on the forecast beyond 20 to 25 days. Similarly, for MAPE below 5% (RMSE below ~600) the forecast horizon should be about 1 week. So in summary, using the 20-day forecast to the end of July as a guide, the model does not expect a sharp pullback and forecasts a mean of 11,300 with a 68% confidence interval of 9,900 to 12,750 and a 95% confidence interval of 8,100 to 14,600.