Supervised Learning: Regression

Shanshan Wang
shanshan.wang@uni-due.de
Feb. 16, 2021

Contents

Main objective of the analysis, specifying whether our regression models are focused on prediction or interpretation.

  1. Data Exploration
    • Reading datasets
    • Brief description of the dataset and a summary of its attributes
    • Data cleaning and feature engineering

  2. Models
    • Model 1: a simple linear regression
    • Model 2: a polynomial regression
    • Model 3: a LASSO regression
    • Model 4: a Ridge regression
    • Model 5: an Elastic Net regression

  3. Results and Discussion
  4. Summary
  5. Suggestions for next steps in analyzing this data

1 Data Exploration

Reading datasets

Load NASDAQ-100 company symbols from Wikipedia
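A minimal sketch of this step, assuming the constituent table can be read directly from the Wikipedia page with pandas' read_html; the table position and the column name ("Ticker"/"Symbol") are assumptions and may change as the page is edited.

```python
import pandas as pd

# Read all tables from the Nasdaq-100 Wikipedia page and pick the one that
# contains the ticker column (column name is an assumption).
tables = pd.read_html("https://en.wikipedia.org/wiki/Nasdaq-100")
constituents = next(t for t in tables if "Ticker" in t.columns or "Symbol" in t.columns)
symbols = constituents.filter(regex="Ticker|Symbol").iloc[:, 0].tolist()
print(len(symbols), symbols[:5])
```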

Load stock price data with YahooFinancials
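A sketch of fetching daily adjusted close prices with the YahooFinancials package for the symbols loaded above; the date range follows the period described below, and the exact layout of the returned dictionary is an assumption based on the package's typical output.

```python
import pandas as pd
from yahoofinancials import YahooFinancials

# Fetch daily historical prices for all symbols over the study period.
yf = YahooFinancials(symbols)
raw = yf.get_historical_price_data("2010-01-04", "2020-12-30", "daily")

# Assemble one column of adjusted close prices per ticker, indexed by date.
prices = pd.DataFrame({
    sym: pd.Series({bar["formatted_date"]: bar["adjclose"] for bar in raw[sym]["prices"]})
    for sym in symbols
    if raw.get(sym) and raw[sym].get("prices")
})
prices.index = pd.to_datetime(prices.index)
```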

Brief description of the data set and a summary of its attributes

The dataset contains adjusted prices of 102 stocks over 2768 trading days from Jan. 4, 2010 to Dec. 30, 2020. The 102 stocks belong to the NASDAQ-100 index; their company names are taken from the Wikipedia list as updated on December 21, 2020. Companies added to the NASDAQ-100 index more recently may not have data for the early part of this period, so some of the stocks in the dataset contain missing values.

A row in the dataset represents the data of all stocks on one trading day, and a column holds the time series of one stock. The prices are stored as float64, and 19 of the 102 stocks contain missing data. A description of the dataset, including the mean, standard deviation, minimum, and maximum for each stock, is listed in the following table.

Data cleaning and feature engineering

Check missing data
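For example, the missing values can be counted per ticker on the `prices` DataFrame from the sketch above:

```python
# Number of missing entries per ticker; the text reports 19 tickers with gaps.
missing = prices.isnull().sum()
print(missing[missing > 0])
```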

Data cleaning by removing the tickers with missing data and grouping the tickers by sectors

Data normalization
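A sketch of both steps, continuing from `prices` above; min-max scaling each series to [0, 1] is an assumption, since the text does not fix the normalization scheme.

```python
# Drop the tickers that contain any missing values, then scale each remaining
# price series to [0, 1] (the exact normalization scheme is an assumption).
clean = prices.dropna(axis=1)
scaled = (clean - clean.min()) / (clean.max() - clean.min())
```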

Group the tickers by GICS Sector

Obtain the data of sectors and the whole market
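A sketch of building the sector and market indices as means of the scaled prices; the mapping `ticker_to_sector` (ticker to GICS Sector) is a hypothetical helper derived from the Wikipedia constituent table.

```python
# Average the scaled prices over the tickers of each sector and over the whole
# market, giving one index per sector and one market index.
sector_index = scaled.T.groupby(ticker_to_sector).mean().T   # one column per sector
market_index = scaled.mean(axis=1)                            # average over all tickers
```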

Line plots of features for the sectors and the market

Pair plots of features

With the filtered dataset, we can use pair plots to visualize the target and the feature-target relationships.
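For instance, with seaborn (the column name "Market(t+1)" is illustrative):

```python
import seaborn as sns

# Pair plot of the sector indices together with the next-day market index.
frame = sector_index.copy()
frame["Market(t+1)"] = market_index.shift(-1)
sns.pairplot(frame.dropna())
```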

Correlations of features
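A possible way to inspect these correlations, assuming the `sector_index` and `market_index` objects from the sketches above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the sector indices and the market index, as a heatmap.
corr = sector_index.join(market_index.rename("Market")).corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```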

Separate our features from our target

We refer to the scaled prices averaged over all tickers in each sector, or over the whole market, as the index of that sector or of the market. Thus, we take the indices of the 7 sectors at day $t$ as the features and the index of the market at day $t+1$ as the target. The target index reflects the trend of the whole market: if the target index is larger (smaller) than the previous one, there is an upward (downward) trend in the market; if the target index equals the previous one, the market remains stable.
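A minimal sketch of this feature/target construction, obtained by shifting the market index back by one day:

```python
# Features: the 7 sector indices at day t; target: the market index at day t+1.
# The last row has no next-day value and is dropped.
X = sector_index.iloc[:-1].values
y = market_index.shift(-1).iloc[:-1].values
```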

Basic feature engineering: adding polynomial and interaction terms with Scikit-Learn

We will add quadratic polynomial terms or transformations of those features, allowing us to express a non-linear relationship while still using linear regression as our model.
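A sketch using Scikit-Learn's PolynomialFeatures; degree 2 is used here as an example.

```python
from sklearn.preprocessing import PolynomialFeatures

# Quadratic polynomial and interaction terms built from the sector features.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X.shape, "->", X_poly.shape)
```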

2 Models

The training covers three types of linear regression models: a simple linear regression as a baseline, a regression with polynomial effects, and regularized regressions. Preferably, all models use the same training and test splits and the same cross-validation method.
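A possible shared setup; the 70/30 split ratio and the 5 folds are assumptions, and shuffling is disabled to preserve the time ordering of the series.

```python
from sklearn.model_selection import train_test_split, KFold

# One training/test split and one CV scheme, reused by all models below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)
cv = KFold(n_splits=5, shuffle=False)
```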

Model 1: a simple linear regression
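A minimal baseline sketch with Scikit-Learn's LinearRegression, reporting the $R^2$ score and mean squared error used in the discussion below.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Ordinary least-squares baseline, evaluated on the held-out test set.
lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_test)
print(r2_score(y_test, pred), mean_squared_error(y_test, pred))
```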

Model 2: a polynomial regression
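One way to implement the polynomial model is a pipeline whose degree is selected by cross-validation; the degree range [1, 2, 3] is an assumption, and a best degree of 1 reduces the model to the baseline.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Polynomial regression: expand the features, then fit a linear model; the
# degree is chosen with the shared CV scheme.
pipe = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())
grid = GridSearchCV(pipe, {"polynomialfeatures__degree": [1, 2, 3]}, cv=cv)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```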

Model 3: a LASSO regression

A regularized regression with LASSO's L1 penalty, which performs feature selection by driving some coefficients to zero
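A sketch with LassoCV, which picks the penalty strength by cross-validation; coefficients driven exactly to zero correspond to de-selected sector features.

```python
from sklearn.linear_model import LassoCV

# LASSO with the regularization strength alpha chosen by cross-validation.
lasso = LassoCV(cv=cv).fit(X_train, y_train)
print(lasso.alpha_, lasso.coef_)
```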

Model 4: a Ridge regression

A regularized regression with Ridge's L2 penalty, which shrinks the coefficients without setting them exactly to zero
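A sketch with RidgeCV; the alpha grid is an assumption, chosen to bracket the value 0.42 reported in the results.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Ridge regression with the penalty selected over a grid of alphas.
ridge = RidgeCV(alphas=np.linspace(0.01, 1.0, 100)).fit(X_train, y_train)
print(ridge.alpha_, ridge.score(X_test, y_test))
```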

Model 5: an Elastic Net regression

A regularized regression with Elastic Net's combined L1 and L2 penalties
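A sketch with ElasticNetCV, which mixes the L1 and L2 penalties; the l1_ratio grid is an assumption.

```python
from sklearn.linear_model import ElasticNetCV

# Elastic Net: both the penalty strength and the L1/L2 mixing ratio are chosen
# by cross-validation with the shared CV scheme.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=cv).fit(X_train, y_train)
print(enet.alpha_, enet.l1_ratio_)
```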

3 Results and Discussion

By comparison, the simple linear regression has the highest $R^2$ score and the lowest mean squared error; it therefore fits our data best and predicts the market index with the highest accuracy. The high $R^2$ score of the linear regression suggests a linear relationship between the market index and each sector index.

The results from the polynomial regression and the Ridge regression are close to those from the linear regression: the best polynomial degree is 1, so the polynomial regression reduces to a simple linear regression. The best $\alpha$ in the Ridge regression is 0.42, which yields a high $R^2$ score of 0.9991.

4 Summary

  1. Each sector shows strongly positive correlations with the market and with the other sectors.
  2. The temporal correlation matrix of the sector indices shows strong correlations in time, visible as diagonal blocks.
  3. Comparing the different regression models, the simple linear regression predicts the future market index with high accuracy. This also suggests that the market index has a linear relationship with the sector indices.

5 Suggestions for next steps in analyzing this data

  1. Add more features to the data
  2. Identify the market trends by predicting the returns of the market
  3. Analyze the spectral information of the correlation matrix
  4. Analyze the time-lagged correlations among stocks