Exploratory Data Analysis for Machine Learning

Shanshan Wang
shanshan.wang@uni-due.de
Feb. 15, 2021

Contents

  1. Data exploration
    • Initial plan for data exploration
    • Reading a dataset
    • Brief description of the dataset and a summary of its attributes
    • Data cleaning and feature engineering
    • Key Findings and Insights

  2. Hypothesis tests
    • Three hypotheses about this data
    • Significance tests and the results
    • Suggestions for next steps

  3. Summary
    • The quality of this data set
    • A request for additional data if needed

1 Data Exploration

Initial plan for data exploration

  1. Load a stock dataset with YahooFinancials
  2. Check the information of the dataset
  3. Filter out the tickers with missing data
  4. Calculate returns of stock prices for each ticker
  5. Normalize the prices and the returns for each ticker
  6. Group the tickers by their industrial sectors
  7. Calculate the mean return of tickers for each sector and for the whole market
  8. Visualize the trends of prices and returns for the sectors and the market with line plots
  9. Visualize the relationships of returns among the sectors and the market with pair plots
  10. Separate the feature data from the target data
  11. Add polynomial and interaction terms for the basic feature engineering

Reading datasets

Load NASDAQ-100 company symbols from Wikipedia

Load price data of stocks with YahooFinancials

Brief description of the data set and a summary of its attributes

The dataset contains adjusted prices of 102 stocks over 2768 trading days from Jan. 4, 2010 to Dec. 30, 2020. The 102 stocks are constituents of the NASDAQ-100 index; the list of companies was taken from Wikipedia, as updated on December 21, 2020. Companies that were added to the index recently may have no price data for the early part of the period, so some stocks in the dataset contain missing values.

Each row of the dataset holds the prices of all stocks on one trading day, and each column is the time series of one stock. Prices are stored as float64, and 19 of the 102 stocks contain missing data. The following table lists descriptive statistics for each stock, such as the mean, standard deviation, minimum, and maximum.
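
A summary of this kind can be produced with pandas. The following is a minimal sketch on a small synthetic frame; the ticker names and values are placeholders and not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are trading days, columns are tickers.
dates = pd.bdate_range("2010-01-04", periods=5)
prices = pd.DataFrame(
    {
        "AAA": [10.0, 10.5, 10.2, 10.8, 11.0],
        "BBB": [np.nan, np.nan, 20.0, 20.4, 20.1],  # early values missing
        "CCC": [5.0, 5.1, 5.3, 5.2, 5.4],
    },
    index=dates,
)

# Data type of each column (float64 for prices).
print(prices.dtypes)

# Missing values per ticker, and the number of tickers with any gap.
missing_per_ticker = prices.isna().sum()
n_incomplete = int((missing_per_ticker > 0).sum())
print(missing_per_ticker)
print("tickers with missing data:", n_incomplete)

# Count, mean, standard deviation, minimum, quartiles, and maximum per ticker.
summary = prices.describe()
print(summary)
```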

Data cleaning and feature engineering

Data cleaning by removing the tickers with missing data and grouping the tickers by sectors
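
As a sketch of this cleaning step, assuming the prices are held in a pandas DataFrame with one column per ticker, every column that contains a missing value can be dropped in one call. The tickers below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices; "BBB" has gaps at the start of the period.
prices = pd.DataFrame(
    {
        "AAA": [10.0, 10.5, 10.2, 10.8],
        "BBB": [np.nan, np.nan, 20.0, 20.4],
        "CCC": [5.0, 5.1, 5.3, 5.2],
    }
)

# Keep only the tickers (columns) without any missing value.
clean = prices.dropna(axis="columns", how="any")

# Record which tickers were removed, for the report.
dropped = sorted(set(prices.columns) - set(clean.columns))
print("dropped:", dropped)
print("remaining:", list(clean.columns))
```

Dropping whole tickers keeps every remaining time series at full length, at the cost of discarding recently listed companies.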

Data normalization
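
The exact scaling used in this report is not stated here, so the sketch below assumes simple returns and a z-score normalization per ticker (zero mean, unit standard deviation). The helper `zscore` and the toy prices are illustrative only.

```python
import pandas as pd

# Hypothetical toy prices for two tickers.
prices = pd.DataFrame(
    {
        "AAA": [10.0, 11.0, 12.1, 11.0],
        "CCC": [5.0, 5.5, 5.5, 6.05],
    }
)

# Simple return r_t = (p_t - p_{t-1}) / p_{t-1}.
# The first row has no predecessor and is dropped.
returns = prices.pct_change().dropna()


def zscore(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to zero mean and unit standard deviation (assumed scheme)."""
    return (frame - frame.mean()) / frame.std(ddof=0)


scaled_prices = zscore(prices)
scaled_returns = zscore(returns)
print(scaled_prices.round(3))
print(scaled_returns.round(3))
```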

Group the tickers by GICS Sector

Obtain the data of sectors and the whole market
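
One way to obtain sector and market series is to average the tickers within each sector and across all tickers. The sketch below assumes equal weights; the ticker-to-sector mapping `sector_of` is a hypothetical stand-in for the GICS Sector column.

```python
import pandas as pd

# Hypothetical daily returns: rows are days, columns are tickers.
returns = pd.DataFrame(
    {
        "AAA": [0.01, -0.02, 0.03],
        "BBB": [0.03, 0.00, -0.01],
        "CCC": [-0.01, 0.01, 0.02],
    }
)

# Hypothetical mapping from ticker to sector.
sector_of = pd.Series(
    {
        "AAA": "Information Technology",
        "BBB": "Information Technology",
        "CCC": "Health Care",
    }
)

# Transpose so tickers are rows, average within each sector, transpose back.
sector_returns = returns.T.groupby(sector_of).mean().T

# Market series: equally weighted mean over all tickers.
market_returns = returns.mean(axis=1).rename("Market")

print(sector_returns)
print(market_returns)
```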

Line plots of features for the sectors and the market

Pair plots of features

With the filtered dataset, we can use pair plots to visualize the target and the feature-target relationships.

Separate our features from our target

We refer to the scaled price averaged over all tickers in a sector, or in the whole market, as the index of that sector or of the market. We then take the returns of the 7 sector indices on day $t$ as the features and the return of the market index on day $t+1$ as the target. The target return reflects the trend of the whole market: if it is larger (smaller) than 0, the market trends upward (downward); if it equals zero, the market is flat.
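
The alignment of day-$t$ features with the day-$t+1$ target can be sketched with a one-step shift. The column names and values below are hypothetical, and only two sectors are shown for brevity.

```python
import pandas as pd

# Hypothetical index returns: two sector columns and the market.
data = pd.DataFrame(
    {
        "Sector A": [0.01, 0.02, -0.01, 0.00],
        "Sector B": [0.00, 0.01, 0.02, -0.02],
        "Market": [0.005, 0.015, 0.005, -0.01],
    }
)
feature_cols = ["Sector A", "Sector B"]

# Target: the market return of the following day, aligned to day t.
target = data["Market"].shift(-1).rename("Market_next")

# The last day has no next-day return, so it is dropped from both sides.
X = data[feature_cols].iloc[:-1]
y = target.iloc[:-1]

# The sign of the target gives the direction of the market.
direction = y.apply(lambda r: "up" if r > 0 else ("down" if r < 0 else "flat"))

print(X)
print(y)
print(direction)
```

Shifting the target rather than the features keeps each row free of information from the day being predicted.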

Basic feature engineering: adding polynomial and interaction terms

We add quadratic polynomial terms or other transformations of the features, which lets us express non-linear relationships while still using linear regression as the model.

Polynomial Features

Feature interactions

Polynomial Features in Scikit-Learn
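
A minimal sketch of the expansion with scikit-learn's `PolynomialFeatures`, using three made-up feature columns instead of the 7 sector returns. With `degree=2` and `include_bias=False`, the output holds the original features followed by all squares and pairwise products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical feature matrix: 2 samples, 3 features (x0, x1, x2).
X = np.array(
    [
        [1.0, 2.0, 3.0],
        [0.5, -1.0, 2.0],
    ]
)

# Degree-2 expansion without the constant column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Column order: x0, x1, x2, x0^2, x0*x1, x0*x2, x1^2, x1*x2, x2^2
print(X_poly.shape)
print(X_poly[0])
```

For $n$ features the degree-2 expansion has $n + n(n+1)/2$ columns, so 7 sector returns would yield 35 columns.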

Key Findings and Insights

  1. We use an index to indicate the scaled price averaged over the tickers in each sector.
  2. For the sectors and the market, the indices and the returns show similar trends, with dramatic fluctuations visible in every case after Dec. 9, 2019.
  3. The scatter plots show clear positive correlations between the returns of the sectors and the market, and among the sectors themselves.
  4. The histograms, both for the sectors and for the market, suggest that the returns very likely do not follow a normal distribution.

2 Hypothesis tests

Hypothesis 1

H0: The return of the market follows a Gaussian distribution

H1: The return of the market does not follow a Gaussian distribution

Result: The p-value is smaller than 0.05, so the null hypothesis is rejected at the 5% significance level: the return of the market does not follow a Gaussian distribution.
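
The test used for this result is not named here. As one possible choice, the sketch below applies the D'Agostino-Pearson test (`scipy.stats.normaltest`) to synthetic heavy-tailed data that stands in for the market returns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for market returns: Student t with 3 degrees of freedom,
# which has heavier tails than a Gaussian.
returns = 0.01 * rng.standard_t(df=3, size=2000)

# H0: the sample comes from a normal distribution.
statistic, p_value = stats.normaltest(returns)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"statistic={statistic:.2f}, p-value={p_value:.3g}, reject H0: {reject_h0}")
```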

Hypothesis 2

H0: The return of Utilities and the return of Health Care are independent.

H1: There is a dependency between the return of Utilities and the return of Health Care.

Result: The p-value is smaller than 0.05, so the null hypothesis is rejected at the 5% significance level: there is a dependency between the return of Utilities and the return of Health Care.
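
The report does not state which test was applied. One common option is a Pearson correlation test, sketched below on synthetic series that share a common component; the variable names mirror the two sectors but the data are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins: both series share a common factor plus independent noise.
common = rng.normal(size=500)
utilities = 0.01 * (common + rng.normal(size=500))
health_care = 0.01 * (common + rng.normal(size=500))

# H0: the two series are linearly uncorrelated.
r, p_value = stats.pearsonr(utilities, health_care)

alpha = 0.05
print(f"r={r:.3f}, p-value={p_value:.3g}, reject H0: {p_value < alpha}")
```

A Pearson test only detects linear dependence; a rank-based alternative such as `scipy.stats.spearmanr` also captures monotonic non-linear relationships.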

Hypothesis 3

H0: The means of the returns of different sectors are equal.

H1: One or more of the means of the returns for different sectors are unequal.

Result: The p-value is larger than 0.05, so the null hypothesis cannot be rejected at the 5% significance level. The data provide no significant evidence that the mean returns of the sectors differ.
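
A comparison of several group means is commonly done with a one-way ANOVA; whether this is the test behind the result above is an assumption. The sketch uses three small made-up samples whose means are nearly equal.

```python
from scipy import stats

# Hypothetical daily returns for three sectors with (almost) equal means.
sector_a = [-0.020, -0.010, 0.000, 0.010, 0.020]
sector_b = [-0.029, -0.009, 0.001, 0.011, 0.031]
sector_c = [-0.010, -0.005, 0.000, 0.005, 0.010]

# H0: all group means are equal.
f_statistic, p_value = stats.f_oneway(sector_a, sector_b, sector_c)

alpha = 0.05
if p_value < alpha:
    decision = "reject H0"
else:
    # Not rejecting H0 does not prove that the means are equal.
    decision = "fail to reject H0"
print(f"F={f_statistic:.4f}, p-value={p_value:.3f}, {decision}")
```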

Suggestions for next steps in analyzing this data

  1. Identify the distribution of the market returns
  2. Calculate the correlation matrix of the stocks
  3. Analyze the spectrum of the correlation matrix
  4. Analyze the time-lagged correlations among stocks
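
The correlation-based steps above could start from something like the following sketch. The data are synthetic, the tickers are hypothetical, and `lagged_corr` is an illustrative helper rather than a library function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic returns: AAA and BBB share a common factor, CCC is independent.
common = rng.normal(size=300)
returns = pd.DataFrame(
    {
        "AAA": common + rng.normal(size=300),
        "BBB": common + rng.normal(size=300),
        "CCC": rng.normal(size=300),
    }
)

# Correlation matrix of the stocks.
corr = returns.corr()

# Spectrum: eigenvalues of the symmetric correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(corr.to_numpy())[::-1]


def lagged_corr(x: pd.Series, y: pd.Series, lag: int) -> float:
    """Correlation between x on day t and y on day t + lag."""
    return x.corr(y.shift(-lag))


lag1 = lagged_corr(returns["AAA"], returns["BBB"], lag=1)

print(corr.round(3))
print("eigenvalues:", eigenvalues.round(3))
print("lag-1 correlation AAA -> BBB:", round(lag1, 3))
```

The eigenvalues of a correlation matrix sum to the number of stocks, so a single large eigenvalue indicates a dominant common (market) mode.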

3 Summary

The dataset is of high quality with little missing data; the tickers with missing values were removed during data cleaning. After cleaning, it contains the daily prices of 83 stocks from 2010 to 2020, with more than 2000 data points in each time series. After processing, the dataset can be used to analyze the evolution of prices or returns for each stock, each sector, and the whole market. Apart from the list of NASDAQ-100 constituents, no additional data are needed for further analysis.