STOCK-MARKET-RETURN-PREDICTION-USING-R

πŸ“ˆ Stock Market Return Prediction Using R

Forecasting S&P 500 & FTSE All-Share indices across 30 years of market data using five time series models β€” evaluated across three economic eras including the 2008 financial crisis and COVID-19 onset

View Live Analysis Language Models


πŸ” Project Overview

This project investigates the long-term predictability of major stock market indices by benchmarking five classical forecasting methodologies against 30 years of historical price data (January 1990 – January 2020). The study focuses on the S&P 500 Composite and FTSE All-Share Price Index, evaluating each model’s accuracy across three distinct time horizons designed to capture different market regimes: the dot-com boom, the 2008 financial crisis, and the post-recovery bull market.

The core research question: Can traditional statistical models reliably predict equity index returns, and under what conditions do they break down?


πŸ“Š Key Results at a Glance

Model Best RMSE (S&P 500) Best MAPE Period
ARIMA 82.26 6.37% 2000
SES 184.24 4.99% 2020
Moving Average (2OMA) 253.21 8.04% 2020
Simple Forecast (Drift) 254.93 17.41% 2000
Simple Forecast (Mean) β€” β€” Baseline

Key finding: ARIMA delivered the strongest performance, achieving an RMSE of 82.26 and MAPE of just 6.37% on S&P 500 data for the 1990–2000 period. All models, however, failed to anticipate the 2008 financial crisis β€” confirming that conventional time series methods cannot model black swan events.


🧠 Methodology

Data Source

Three Time Horizons

Data was split into three training windows, each forecasting the next 24 months:

Horizon Training Data Forecast Period
1 Jan 1990 – Dec 1997 1998–1999
2 Jan 1990 – Dec 2009 2010–2011
3 Jan 1990 – Dec 2018 2019–2020

This design captures three distinct economic regimes: the 1990s bull market, the post-GFC recovery, and the pre-pandemic expansion.


πŸ”¬ Models Implemented

1. Simple Forecasts (Mean, NaΓ―ve, Drift)

Baseline benchmarks. The Drift method (random walk with trend) consistently outperformed Mean and NaΓ―ve across both indices by incorporating directional momentum. None of the three methods could capture the 2008 downturn or COVID-19 onset.

2. Moving Average (2nd-Order)

A 2OMA was applied using the ma() function from the forecast package. While it effectively smoothed noise, it introduced lag β€” a structural limitation that results in delayed responses to sharp market reversals. RMSE ranged from 253 (S&P, 2020) to 591 (FTSE, 2010).

3. Simple Exponential Smoothing (SES)

SES assigned exponentially decreasing weights to historical observations, making it more adaptive than simple moving averages. It achieved the lowest MAPE of 4.99% for S&P 500 in the 2020 horizon, outperforming simpler methods in stable trend environments.

4. ARIMA (Auto-selected via AIC)

auto.arima() with AIC-based model selection was used for all six training sets. ARIMA achieved the best overall accuracy, with the lowest AIC of 775.26 for S&P 500 (1990–2000). Residual diagnostics (ACF, PACF, histogram, density plots) confirmed near-random residuals for this model, indicating a well-specified fit.

ARIMA accuracy results:

5. Seasonal-Trend Decomposition (STL)

STL decomposition separated each series into trend, seasonal, and remainder components. Findings confirmed a persistent upward long-term trend in both indices, with a consistent annual seasonal pattern (Β±10–20 index points in S&P) that proved resilient even through the 2008 crisis. Remainder forecasts were modelled using ARIMA on the residual component.


πŸ“‰ Statistical Analysis

Log Returns Summary

Index N Mean Daily Return Std Dev Min Max
S&P 500 7,844 0.00028 0.011 -0.095 +0.110
FTSE All-Share 7,844 0.00016 0.0099 -0.087 +0.088

Both distributions exhibit leptokurtosis (fat tails), confirming that returns are not normally distributed β€” a critical consideration for any risk model assuming Gaussian behaviour.

Accuracy Metrics Used


πŸ› οΈ Tech Stack

Tool Purpose
R 4.2 Core programming language
forecast ARIMA, SES, Moving Average, STL forecasting
tsfe Source dataset (global equity indices)
ggplot2 All visualisations and forecast plots
tidyverse Data manipulation pipeline
lubridate Date handling and time series indexing
ggfortify Time series autoplot extensions
Metrics RMSE, MAE, MAPE calculation
kableExtra Formatted accuracy tables
PerformanceAnalytics Financial analytics support
quantmod Financial data utilities
fpp2 Forecasting Principles and Practice framework
gt Publication-quality tables
RMarkdown / Posit Cloud Reproducible analysis and reporting

πŸ“ Repository Structure

STOCK-MARKET-RETURN-PREDICTION-USING-R/
β”‚
β”œβ”€β”€ Advanced_Financial_Data_Analytics_Assignment.html   # Full rendered analysis with all plots
β”œβ”€β”€ Advanced_Financial_Data_Analytics_Assignment.Rmd    # Source R Markdown file
└── README.md

πŸš€ How to Reproduce

# Install required packages
install.packages(c("forecast", "ggplot2", "tidyverse", "lubridate",
                   "ggfortify", "Metrics", "kableExtra", "fpp2", 
                   "PerformanceAnalytics", "quantmod", "gt", "remotes"))

# Install the TSFE data package from GitHub
remotes::install_github("barryquinn1/tsfe")

# Open and knit the .Rmd file in RStudio or Posit Cloud
# All data loads automatically from tsfe::indices β€” no external data files needed

πŸ’‘ Key Insights & Conclusions

  1. ARIMA is the best classical model for stable trend periods β€” it achieved up to 93.6% accuracy (MAPE 6.37%) on S&P 500 data from the 1990s bull market.

  2. All models fail during structural breaks. The 2008 financial crisis and the 2020 COVID-19 shock exposed a fundamental limitation: backward-looking statistical models cannot anticipate black swan events driven by exogenous shocks.

  3. Leptokurtosis is pervasive. Both indices show fat-tailed return distributions, meaning models that assume normality will systematically underestimate the probability of extreme price movements.

  4. Seasonal patterns are robust. STL decomposition revealed consistent annual seasonality in both indices that persisted across all three economic regimes β€” a finding with potential applications in tactical asset allocation.

  5. Longer training data β‰  better MAPE. The 2020-horizon ARIMA models achieved the lowest MAPE values (3.98% for S&P 500) despite higher absolute RMSE, partly because higher index price levels arithmetically compress percentage errors.

Future Enhancements


πŸ“š References


πŸ‘¨β€πŸ’» Author

Siddharth Shekhar Singh
Data & AI Automation Analyst | NI Water, Belfast
LinkedIn


This project was completed as part of an Advanced Financial Data Analytics assignment. The full rendered HTML report with all visualisations is available in the repository.