Supervised Learning: Regression

Shanshan Wang
shanshan.wang@uni-due.de
Feb. 16, 2021

Contents

Main objective of the analysis, specifying whether the regression models focus on prediction or interpretation.

  1. Data Exploration
    • Reading datasets
    • Brief description of the dataset and a summary of its attributes
    • Data cleaning and feature engineering

  2. Models
    • Model 1: a simple linear regression
    • Model 2: a polynomial regression
    • Model 3: a LASSO regression
    • Model 4: a Ridge regression
    • Model 5: an Elastic Net regression

  3. Results and Discussion
  4. Summary
  5. Suggestions for next steps in analyzing this data

1 Data Exploration

Reading datasets

Load NASDAQ-100 company symbols from Wikipedia
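As a rough sketch of this step, the ticker symbols can be pulled out of the tables that pandas parses from the Wikipedia page. The helper name `extract_tickers`, the candidate column names "Ticker" and "Symbol", and the URL in the comment are assumptions made for illustration; the column header on the page may differ. The sample tables below are stand-ins so that the example runs without network access.

```python
import pandas as pd

def extract_tickers(tables, column_candidates=("Ticker", "Symbol")):
    """Return ticker symbols from the first table that has a ticker-like column.

    `tables` is a list of DataFrames, e.g. the list returned by pd.read_html.
    The candidate column names are assumptions about the page layout.
    """
    for table in tables:
        for col in column_candidates:
            if col in table.columns:
                return table[col].astype(str).str.strip().tolist()
    raise ValueError("no table with a ticker column found")

# Stand-in for the list of tables parsed from the page (made-up content)
sample_tables = [
    pd.DataFrame({"Year": [2019, 2020]}),
    pd.DataFrame({"Company": ["Company A", "Company B"],
                  "Ticker": ["AAA ", "BBB"]}),
]
tickers = extract_tickers(sample_tables)
print(tickers)

# With network access and an HTML parser such as lxml installed, one could try:
# tables = pd.read_html("https://en.wikipedia.org/wiki/Nasdaq-100")
# tickers = extract_tickers(tables)
```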

Load stock price data with YahooFinancials
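A minimal sketch of how the downloaded prices could be arranged into one table, with one row per day and one column per ticker. The helper `prices_to_frame` is hypothetical, and the sample dictionary uses made-up tickers and values that only mimic the nested layout returned by `get_historical_price_data` (a "prices" list with "formatted_date" and "adjclose" entries per ticker). The actual download is left as a comment because it needs network access.

```python
import pandas as pd

def prices_to_frame(raw, field="adjclose"):
    """Convert a {ticker: {"prices": [...]}} dict into a date-by-ticker DataFrame."""
    series = {}
    for ticker, payload in raw.items():
        rows = payload.get("prices", [])
        series[ticker] = pd.Series(
            {pd.Timestamp(r["formatted_date"]): r[field] for r in rows},
            dtype="float64",
        )
    # Columns are aligned on the union of dates; gaps become NaN
    return pd.DataFrame(series).sort_index()

# Made-up sample in the assumed layout; "BBB" starts one day later
sample_raw = {
    "AAA": {"prices": [
        {"formatted_date": "2010-01-04", "adjclose": 10.0},
        {"formatted_date": "2010-01-05", "adjclose": 10.5},
    ]},
    "BBB": {"prices": [
        {"formatted_date": "2010-01-05", "adjclose": 20.0},
    ]},
}
prices = prices_to_frame(sample_raw)
print(prices)

# With the yahoofinancials package installed, the raw dict could come from:
# from yahoofinancials import YahooFinancials
# raw = YahooFinancials(["AAPL", "MSFT"]).get_historical_price_data(
#     "2010-01-04", "2020-12-30", "daily")
```

A ticker that enters the data later than the others shows up with NaN in its early rows, which is the kind of gap described for recently added index members.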

Brief description of the dataset and a summary of its attributes

The dataset contains the adjusted prices of 102 stocks over 2768 trading days, from Jan. 4, 2010 to Dec. 30, 2020. The 102 stocks belong to the NASDAQ-100 index; the company list is taken from Wikipedia, where it was last updated on December 21, 2020. Companies added to the index more recently may have no price data for the early part of this period, so some stocks in the dataset contain missing values.

Each row of the dataset holds the prices of all stocks on one trading day, and each column is the price time series of one stock. The prices are stored as float64, and 19 of the 102 stocks contain missing data. The following table summarizes the dataset per stock, including the mean, standard deviation, minimum and maximum.
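A per-stock summary of this kind can be produced with `DataFrame.describe`. The sketch below uses a small made-up price table with hypothetical tickers; the real dataset would have 102 columns.

```python
import numpy as np
import pandas as pd

# Made-up prices: rows are trading days, columns are tickers
prices = pd.DataFrame(
    {"AAA": [1.0, 2.0, 3.0, 4.0], "BBB": [np.nan, 10.0, 12.0, 14.0]},
    index=pd.date_range("2010-01-04", periods=4, freq="B"),
)

# describe() ignores NaN; transpose so that each row is one stock
summary = prices.describe().T[["mean", "std", "min", "max"]]
print(prices.dtypes)
print(summary)
```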

Data cleaning and feature engineering

Check for missing data
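One way to run this check, sketched on made-up data with hypothetical tickers, is to count the missing values per column and list the tickers that have any.

```python
import numpy as np
import pandas as pd

# Made-up prices; "BBB" and "CCC" have gaps at the start
prices = pd.DataFrame({
    "AAA": [1.0, 2.0, 3.0],
    "BBB": [np.nan, np.nan, 5.0],
    "CCC": [np.nan, 7.0, 8.0],
})

missing_per_ticker = prices.isna().sum()
# Keep only tickers with at least one gap, largest count first
incomplete = missing_per_ticker[missing_per_ticker > 0].sort_values(ascending=False)
print(incomplete)
print(f"{len(incomplete)} of {prices.shape[1]} tickers contain missing data")
```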

Clean the data by removing the tickers with missing data and grouping the remaining tickers by sector
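Removing the incomplete tickers amounts to dropping every column that contains a missing value. A minimal sketch on made-up data with hypothetical tickers:

```python
import numpy as np
import pandas as pd

# Made-up prices; "BBB" has a gap on the first day
prices = pd.DataFrame({
    "AAA": [1.0, 2.0, 3.0],
    "BBB": [np.nan, 4.0, 5.0],
    "CCC": [6.0, 7.0, 8.0],
})

# Drop every ticker (column) that has at least one missing value
cleaned = prices.dropna(axis="columns")
removed = prices.columns.difference(cleaned.columns).tolist()
print("removed:", removed)
print(cleaned)
```

Dropping columns rather than rows keeps the full date range for the remaining tickers, at the cost of losing the stocks that joined the index late.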

Data normalization
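The normalization method is not specified here. As an illustration only, the sketch below assumes a z-score per stock (subtract the mean, divide by the standard deviation), which puts stocks with very different price levels on a common scale. The tickers and values are made up.

```python
import pandas as pd

# Made-up prices with very different levels per ticker
prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0, 13.0],
    "BBB": [100.0, 90.0, 110.0, 120.0],
})

# Assumed method: z-score per column, using the population standard deviation
normalized = (prices - prices.mean()) / prices.std(ddof=0)
print(normalized.round(3))
```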

Group the tickers by GICS Sector

Obtain the data of sectors and the whole market
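The grouping and aggregation could be sketched as follows. The ticker-to-sector mapping `sector_of` is hypothetical, and the use of an equal-weighted mean for both the sector series and the market series is an assumption; the values are made up.

```python
import pandas as pd

# Made-up normalized prices for three hypothetical tickers
normalized = pd.DataFrame({
    "AAA": [1.0, 2.0, 3.0],
    "BBB": [3.0, 4.0, 5.0],
    "CCC": [10.0, 20.0, 30.0],
})

# Hypothetical mapping from ticker to GICS sector
sector_of = {
    "AAA": "Information Technology",
    "BBB": "Information Technology",
    "CCC": "Health Care",
}

# Transpose so tickers form the index, group them by sector, average, transpose back
sectors = normalized.T.groupby(sector_of).mean().T

# Assumption: the market series is the equal-weighted mean over all tickers
sectors["Market"] = normalized.mean(axis=1)
print(sectors)
```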

Line plots of features for the sectors and the market

Pair plots of features

With the filtered dataset, we can use pair plots to visualize the target and the feature-target relationships.
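A pair plot of this kind can be drawn with `pandas.plotting.scatter_matrix`, used here as a stand-in for whichever plotting tool produced the original figure. The data are random and the column names, including the choice of "Market" as the target, are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random stand-in data: two sector series loosely tied to a market series
rng = np.random.default_rng(0)
n_days = 250
market = rng.normal(size=n_days).cumsum()
data = pd.DataFrame({
    "Information Technology": market + rng.normal(scale=0.5, size=n_days),
    "Health Care": 0.5 * market + rng.normal(scale=0.5, size=n_days),
    "Market": market,  # assumed target
})

# Off-diagonal panels show pairwise scatter plots, the diagonal shows histograms
axes = scatter_matrix(data, figsize=(7, 7), diagonal="hist", alpha=0.5)

# To write the figure to disk:
# axes[0, 0].figure.savefig("pair_plot.png", dpi=100)
```

The row or column belonging to the target shows the feature-target relationships, and the remaining panels show how the features relate to each other.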