Exploratory Data Analysis for Machine Learning

Shanshan Wang
shanshan.wang@uni-due.de
Feb. 15, 2021

Contents

  1. Data exploration
    • Initial plan for data exploration
    • Reading a dataset
    • Brief description of the dataset and a summary of its attributes
    • Data cleaning and feature engineering
    • Key Findings and Insights

  2. Hypothesis tests
    • Three hypotheses about this data
    • Significance tests and the results
    • Suggestions for next steps

  3. Summary
    • The quality of this data set
    • A request for additional data if needed

1 Data Exploration

Initial plan for data exploration

  1. Load a stock dataset with YahooFinancials
  2. Inspect the dataset (shape, data types, missing values)
  3. Filter out the tickers with missing data
  4. Calculate returns of stock prices for each ticker
  5. Normalize the prices and the returns for each ticker
  6. Group the tickers by their industrial sectors
  7. Calculate the mean return of tickers for each sector and for the whole market
  8. Visualize the trends of prices and returns for the sectors and the market with line plots
  9. Visualize the relationships of returns among the sectors and the market with pair plots
  10. Separate the feature data from the target data
  11. Add polynomial and interaction terms as basic feature engineering

Reading datasets

Load NASDAQ-100 company symbols from Wikipedia
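
The symbol list could be read along the following lines. This is a minimal sketch, not the code used for this report: the page URL, the candidate column names ("Ticker" or "Symbol") and the helper names `pick_constituents` and `load_nasdaq100` are assumptions, and the layout of the Wikipedia page may change over time.

```python
import pandas as pd

# Assumed source page; the report only states that the list comes from Wikipedia.
WIKI_URL = "https://en.wikipedia.org/wiki/Nasdaq-100"


def pick_constituents(tables, ticker_names=("Ticker", "Symbol")):
    """Return the first table that has a ticker column, renamed to 'Ticker'."""
    for table in tables:
        for name in ticker_names:
            if name in table.columns:
                return table.rename(columns={name: "Ticker"})
    raise ValueError("no table with a ticker column was found")


def load_nasdaq100(url=WIKI_URL):
    """Download all HTML tables on the page and keep the constituents table.

    pd.read_html needs network access and an HTML parser such as lxml.
    """
    return pick_constituents(pd.read_html(url))
```

Calling `load_nasdaq100()` would return one row per company; the "Ticker" column can then be passed on as the list of symbols to download.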

Load stock price data with YahooFinancials
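
A sketch of the download step is given below. It assumes the `yahoofinancials` package and its `get_historical_price_data` method, which returns a nested dictionary with a "prices" list per ticker. The date range is taken from the dataset description, and the helper names `fetch_history` and `to_price_frame` are hypothetical.

```python
import pandas as pd


def fetch_history(tickers, start="2010-01-04", end="2020-12-30"):
    """Download daily price records (needs network access and yahoofinancials)."""
    from yahoofinancials import YahooFinancials  # third-party package

    return YahooFinancials(tickers).get_historical_price_data(start, end, "daily")


def to_price_frame(raw):
    """Turn the nested response into a date-by-ticker table of adjusted prices."""
    columns = {}
    for ticker, payload in raw.items():
        rows = (payload or {}).get("prices") or []
        columns[ticker] = pd.Series(
            {row["formatted_date"]: row["adjclose"] for row in rows},
            dtype="float64",
        )
    frame = pd.DataFrame(columns)
    frame.index = pd.to_datetime(frame.index)
    # Dates missing for a ticker show up as NaN after alignment.
    return frame.sort_index()
```

With this layout, `to_price_frame(fetch_history(symbols))` gives one row per trading day and one column per ticker, which matches the shape described in the next section.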

Brief description of the data set and a summary of its attributes

The dataset contains adjusted prices of 102 stocks over 2768 trading days, from Jan. 4, 2010 to Dec. 30, 2020. The 102 stocks belong to the NASDAQ-100 index; their company names are taken from the list on Wikipedia as updated on December 21, 2020. Companies added to the NASDAQ-100 index most recently may have no data for the early part of this period, so some of the stocks in the dataset contain missing data.

Each row of the dataset holds the prices of all stocks on one trading day, and each column is the time series of one stock. The prices are stored as float64, and 19 of the 102 stocks contain missing data. The following table summarizes the dataset, giving the mean, standard deviation, minimum, and maximum for each stock.
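
The attributes quoted above (shape, data type, number of stocks with missing data, per-stock statistics) can be collected with a few pandas calls. The sketch below assumes a date-by-ticker DataFrame called `prices`; the helper name `summarize` is hypothetical.

```python
import pandas as pd


def summarize(prices):
    """Collect the basic facts about a date-by-ticker price table."""
    missing_per_ticker = prices.isna().sum()
    return {
        "shape": prices.shape,  # (trading days, tickers)
        "dtypes": prices.dtypes.astype(str).value_counts().to_dict(),
        "n_with_missing": int((missing_per_ticker > 0).sum()),
        # One row per ticker, restricted to the statistics named in the text.
        "describe": prices.describe().T[["mean", "std", "min", "max"]],
    }
```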

Data cleaning and feature engineering

Data cleaning by removing the tickers with missing data and grouping the tickers by sectors
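
Removing the tickers with missing data amounts to dropping every column that contains at least one NaN. A minimal sketch, assuming the date-by-ticker layout described earlier (the helper name `drop_incomplete_tickers` is hypothetical):

```python
import pandas as pd


def drop_incomplete_tickers(prices):
    """Drop every ticker (column) that has at least one missing price."""
    incomplete = prices.columns[prices.isna().any()]
    return prices.drop(columns=incomplete), list(incomplete)
```

Returning the dropped tickers as well makes it easy to check how many stocks were removed.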

Data normalization
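
The report does not state which return definition or normalization was used. The sketch below assumes simple one-day percentage returns and a z-score normalization per ticker; log returns or min-max scaling would be equally plausible choices. The helper names are hypothetical.

```python
import pandas as pd


def simple_returns(prices):
    """One-day percentage change per ticker; the first row has no return."""
    return prices.pct_change(fill_method=None).iloc[1:]


def zscore(frame):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)
```

Normalizing each column separately puts tickers with very different price levels on a comparable scale before they are averaged or plotted together.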

Group the tickers by GICS Sector
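
Grouping can be sketched as a mapping from sector name to the tickers that survived the cleaning step. The column names "Ticker" and "GICS Sector" are assumptions about the constituents table, and `tickers_by_sector` is a hypothetical helper.

```python
import pandas as pd


def tickers_by_sector(constituents, available):
    """Map each GICS sector to the tickers that are present in the cleaned data."""
    kept = constituents[constituents["Ticker"].isin(available)]
    return kept.groupby("GICS Sector")["Ticker"].apply(list).to_dict()
```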

Obtain the data of sectors and the whole market
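
Following the plan, the sector series is the mean over the tickers in that sector and the market series is the mean over all tickers. The sketch below assumes equal weights; the report does not mention any weighting, and `sector_and_market` is a hypothetical helper.

```python
import pandas as pd


def sector_and_market(frame, sector_map):
    """Equal-weighted mean per sector, plus the mean over all tickers."""
    out = pd.DataFrame({
        sector: frame[tickers].mean(axis=1)
        for sector, tickers in sector_map.items()
    })
    out["Market"] = frame.mean(axis=1)
    return out
```

The same function can be applied to the normalized prices and to the normalized returns, giving one column per sector and one "Market" column in each case.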

Line plots of features for the sectors and the market
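
A line plot with one curve per sector and one for the market could be drawn as follows. This sketch uses matplotlib directly, which is an assumption about the plotting library; `plot_lines` is a hypothetical helper and the input is assumed to be a date-indexed table with one column per sector plus the market.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_lines(frame, title, ylabel):
    """Draw one line per column of a date-indexed table and return the axes."""
    fig, ax = plt.subplots(figsize=(10, 5))
    for name in frame.columns:
        ax.plot(frame.index, frame[name], label=name)
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel(ylabel)
    ax.legend(ncol=2, fontsize="small")
    fig.tight_layout()
    return ax
```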

Pair plots of features

With the filtered dataset, we can use pair plots to visualize the target and the feature-target relationships.