Supervised learning: classifying weather conditions of Cologne Bonn airport

Shanshan Wang
shanshan.wang@uni-due.de
Jan. 7, 2021

Table of Contents

1 Introduction

1.1 Main objective

Flight operations at an airport are likely to be affected by weather conditions. Correct classification and prediction of weather conditions are therefore important for air travel. This study aims to classify weather conditions at Cologne Bonn airport using supervised machine learning algorithms. To this end, eleven supervised learning models are used to classify weather conditions based on features including air temperature, atmospheric pressure, humidity, wind and so on. The models are standard logistic regression, L1 regularized logistic regression, L2 regularized logistic regression, K nearest neighbors (KNN), support vector machines (SVM), decision trees, random forest, extra random forest, gradient boosted trees, adaptive boosting (AdaBoost) and a voting classifier. They are compared and ranked in terms of precision, recall, f-score, accuracy, ROC-AUC scores and the confusion matrix of the prediction results. For the case studied here, the random forest model shows the best prediction performance.

1.2 Outline

The report is organized as follows:

2 Data

2.1 Data description

The data set used in this study comes from the website https://rp5.ru/Weather_archive_in_Cologne,_Bonn_(airport). The data, at a resolution of one hour, spans the period from October 1, 2021 to December 31, 2021. In total, 28 attributes are present in the raw data set. After data cleaning, 8 attributes are used in this study: 7 attributes serve as the features for classification and 1 attribute, the weather condition, is the target of classification. Table 1 lists the data type and description of the 8 attributes. The weather conditions from October to December of 2021 comprise the classes cloud, fog, rain and snow. For the sake of prediction, each class is represented by an integer. The final cleaned data matrix has 2201 rows and 8 columns.

Table 1: Description of attributes in the used datasets
Attribute | Data type | Description
T | float64 | Air temperature (degrees Celsius) at 2 metres height above the earth's surface
Po | float64 | Atmospheric pressure at weather station level (millimeters of mercury)
P | float64 | Atmospheric pressure reduced to mean sea level (millimeters of mercury)
U | int64 | Relative humidity (%) at a height of 2 metres above the earth's surface
DD | int64 | Mean wind direction (compass points) at a height of 10-12 metres above the earth's surface over the 10-minute period immediately preceding the observation
Ff | int64 | Mean wind speed (metres per second) at a height of 10-12 metres above the earth's surface over the 10-minute period immediately preceding the observation
WW | int64 | Present weather reported from a weather station
Td | float64 | Dewpoint temperature (degrees Celsius) at a height of 2 metres above the earth's surface

2.2 Data cleaning

The scikit-learn LabelEncoder is used to fit_transform the "DD" and "WW" columns into integer codes.
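A minimal sketch of this cleaning step, assuming the raw archive has been downloaded as a CSV file (the file name and separator below are assumptions, not taken from the report):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical file name; rp5.ru archives are typically semicolon-separated exports.
df = pd.read_csv("cologne_bonn_weather.csv", sep=";")

# Keep only the 8 attributes of Table 1 and drop rows with missing values.
df = df[["T", "Po", "P", "U", "DD", "Ff", "WW", "Td"]].dropna()

# Encode the categorical wind-direction and weather-condition columns as integers.
for col in ["DD", "WW"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```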

2.3 Feature engineering

2.4 A summary about data

The raw data set has 2201 rows and 28 columns and includes many missing values. Data cleaning removed the missing values and reduced the number of attribute columns. The resulting data set has 2201 rows and 8 columns, where 7 columns serve as the features for classification and 1 column is the target of classification. We normalized the 7 features, since different data scales would otherwise give the features different importances in classification. The correlations among the 7 features reveal a strong relationship between the features P and Po and between the features T and Td. We further split the data into a training set and a test set, with 30% of the data used for testing. The weather condition classes are cloud, represented by 0, fog, represented by 1, rain, represented by 2, and snow, represented by 3. Among the four classes, rain accounts for 54.45% of the samples, fog for 26.47%, cloud for 18.60% and snow for 0.45%.
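A sketch of these preparation steps (feature scaling, correlation check, 70/30 split); the choice of scaler, the stratification and the random seed are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features (7 columns) and target (the encoded weather condition "WW").
X = df.drop(columns=["WW"])
y = df["WW"]

# Correlations among the 7 features; P-Po and T-Td show strong relationships.
print(X.corr())

# Normalize the features so that different scales do not dominate the classifiers.
# StandardScaler is one possible choice; the report does not name a specific scaler.
X_scaled = StandardScaler().fit_transform(X)

# Hold out 30% of the data for testing (stratification is an added assumption,
# used here to preserve the very small snow class in both sets).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42
)
```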

3 Classifier models

This section briefly describes and applies 11 supervised learning models for classification. Each model is first fitted on the training data set. For some models, a grid search is carried out to find the best estimator. Each fitted model is then used to predict the weather conditions on the test data set. The prediction performance is evaluated by the precision, recall, f-score, accuracy, ROC-AUC scores, and the confusion matrix of the predictions. The 11 classification models and their comparison are organized as follows.
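A hedged sketch of this fit-predict-evaluate workflow, using one of the 11 models as an example; the grid values are illustrative, and X_train, y_train, X_test, y_test refer to the split from the sketch in Section 2.4:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Grid search over a small, illustrative hyperparameter grid.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Predict the weather conditions on the test set and evaluate the predictions.
y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)
print(classification_report(y_test, y_pred))               # precision, recall, f-score, accuracy
print(confusion_matrix(y_test, y_pred))
print(roc_auc_score(y_test, y_proba, multi_class="ovr"))   # one-vs-rest ROC-AUC for the 4 classes
```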

3.1 Logistic regression

Logistic regression models the probabilities for classification problems with a logistic function.

General logistic function in terms of the variable $x$: $$P(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x+\varepsilon)}}\approx\frac{e^{(\beta_0+\beta_1 x)}}{1+e^{(\beta_0+\beta_1 x)}}\ . $$ Odds in terms of the independent variable $x$: $$\frac{P(x)}{1-P(x)}=e^{(\beta_0+\beta_1 x)}\ , \qquad \log \frac{P(x)}{1-P(x)}=\beta_0+\beta_1 x \ .$$
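An illustrative sketch of the standard, L1 regularized and L2 regularized variants in scikit-learn (the regularization strengths are assumptions; the train/test arrays come from the earlier sketches):

```python
from sklearn.linear_model import LogisticRegression

models = {
    # Standard (unregularized); on older scikit-learn versions use penalty="none".
    "standard": LogisticRegression(penalty=None, max_iter=1000),
    # L1 regularization requires a compatible solver such as liblinear or saga.
    "L1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    # L2 regularization is the scikit-learn default.
    "L2": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # test-set accuracy
```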

A confusion matrix is a table that summarizes the performance of a classification algorithm, e.g.,

                | Predicted Positive  | Predicted Negative
Actual Positive | True Positive (TP)  | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)

For predicting correctly, $$\mathsf{Accuracy}=\frac{\mathsf{TP}+\mathsf{TN}}{\mathsf{TP}+\mathsf{TN}+\mathsf{FP}+\mathsf{FN}}$$ For identifying all positive instances, $$\mathsf{Recall~or~Sensitivity}=\frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}}$$
For identifying only positive instances, $$\mathsf{Precision}=\frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}}$$ For avoiding false alarms, $$\mathsf{Specificity}=\frac{\mathsf{TN}}{\mathsf{TN}+\mathsf{FP}}$$ F1 score captures the trade-off between recall and precision, $$\mathsf{F1}=2\frac{\mathsf{Precision}\times\mathsf{Recall}}{\mathsf{Precision}+\mathsf{Recall}}$$
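For a binary case, these metrics can be computed directly from the entries of the confusion matrix; a small sketch (note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]], with actual classes in rows and predicted classes in columns):

```python
from sklearn.metrics import confusion_matrix

# y_true and y_pred are illustrative binary label arrays.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
recall      = tp / (tp + fn)          # sensitivity
precision   = tp / (tp + fp)
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
```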

Multiple class error metrics :

               | Predicted Class 1 | Predicted Class 2 | Predicted Class 3
Actual Class 1 | TP1               |                   |
Actual Class 2 |                   | TP2               |
Actual Class 3 |                   |                   | TP3
$$\mathsf{Accuracy}=\frac{\mathsf{TP1}+\mathsf{TP2}+\mathsf{TP3}}{\mathsf{Total}}$$
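In code, the multi-class accuracy is the trace of the confusion matrix divided by the total number of samples, e.g. with the test-set predictions from the earlier workflow sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
accuracy = np.trace(cm) / cm.sum()   # (TP1 + TP2 + TP3 + ...) / Total
```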

The receiver operating characteristic (ROC) curve plots the sensitivity (recall) and can be used to evaluate a classifier. The area under the ROC curve, i.e., the ROC-AUC, measures how well the two classes are separated. In the ROC plot of true positive rate (sensitivity) versus false positive rate (1 - specificity), the diagonal represents random guessing (ROC-AUC = 0.5), the lower right triangle corresponds to performance worse than guessing (ROC-AUC < 0.5), and the top left triangle corresponds to performance better than guessing (ROC-AUC > 0.5). The closer the curve lies to the top left corner, the better the classification model. The ROC curve is generally better suited for data with balanced classes.

Precision-recall curve measures the trade-off between precision and recall. It is generally better for data with imbalanced classes.
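For the four weather classes, both curves can be drawn in a one-vs-rest fashion; a sketch using the predicted probabilities y_proba from the earlier workflow sketch (class labels as defined in Section 2.4):

```python
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.preprocessing import label_binarize

classes = [0, 1, 2, 3]                       # cloud, fog, rain, snow
y_test_bin = label_binarize(y_test, classes=classes)

for i, label in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_proba[:, i])
    prec, rec, _ = precision_recall_curve(y_test_bin[:, i], y_proba[:, i])
    print(f"class {label}: ROC-AUC = {auc(fpr, tpr):.3f}")
```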

3.2 K nearest neighbors

For an appropriate K nearest neighbors (KNN) model, a suitable value of K and a method for measuring the distance between neighbors are required. The elbow method is a common approach to determine the value of K. In a curve of the error rate as a function of K, K is chosen at the elbow point where the model approaches the minimum error on the hold-out set; beyond the elbow point, the rate of improvement slows or stops. Widely used distance measures include the Euclidean distance $$d_E=\sqrt{\sum\limits_i (p_i-q_i)^2}$$ and the Manhattan distance $$d_M=\sum\limits_i |p_i-q_i| \ .$$ Features with large scales have a heavier effect on the distances, and hence on the outcome, than features with small scales. To avoid this, feature scaling is important so that all features contribute to the distance on a similar scale.

Here we select K = 3, where the error rate is lowest and the F1 score, which captures the trade-off between recall and precision, is high.
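A sketch of the elbow scan over K (the scanned range of K values is an assumption):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Scan K and track the error rate on the hold-out set; the default metric is Euclidean.
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate = 1 - knn.score(X_test, y_test)
    f1 = f1_score(y_test, knn.predict(X_test), average="weighted")
    print(k, round(error_rate, 4), round(f1, 4))

# The elbow (here K = 3) is where the error stops improving noticeably.
```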

3.3 Support vector machines

A support vector machine is a supervised learning model for classification, regression and other tasks; it constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space.

There are no clear linear decision boundaries between the four categories in a 2-dimensional space, but decision boundaries in the form of hyperplanes may exist in a 3-dimensional (or higher-dimensional) space.
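A sketch of a support vector classifier with a non-linear kernel, which implicitly maps the features into a higher-dimensional space (the hyperparameter values are assumptions):

```python
from sklearn.svm import SVC

# RBF kernel: a separating hyperplane may exist in the implicit high-dimensional
# feature space even though no linear boundary is visible in 2 dimensions.
svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))   # test-set accuracy
```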

3.4 Decision trees

Algorithm:

The best split can be defined as the one that maximizes the information gained from the split. The information gain of an entropy-based split from node $i$ at step $t$ to nodes $j$ at step $t+1$ is defined by $$IG(i,t+1)=H(i,t)-\sum\limits_j p(j,t+1)\,H(j,t+1) \ ,$$ where the entropy of node $i$ at step $t$ is $$H(i,t)=-\sum\limits_{c=1}^{n}p_c(i,t)\log_2\big(p_c(i,t)\big)\ ,$$ $p_c(i,t)$ is the probability of class $c$ in node $i$ at step $t$, and $p(j,t+1)$ is the fraction of samples reaching node $j$. In contrast to splitting based on the classification error $$E(i,t)=1-\max_c\big(p_c(i,t)\big)\ ,$$ which can leave the end nodes inhomogeneous, splitting based on entropy reaches the goal of homogeneous end nodes. Another frequently used splitting criterion is the Gini index, defined by $$G(i,t)=1-\sum\limits_{c=1}^{n}p_c^2(i,t) \ . $$

To find the best split at each step, a greedy search can be used.
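The splitting criteria above can be illustrated with a small numeric sketch; the helper function and the example class probabilities are purely illustrative (in scikit-learn the criterion is simply selected via DecisionTreeClassifier(criterion="entropy") or criterion="gini"):

```python
import numpy as np

def entropy(p):
    """H = -sum_c p_c * log2(p_c) for a vector of class probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Parent node with a 50/50 class mix, split into two child nodes that receive
# 60% and 40% of the samples with class probabilities [0.9, 0.1] and [0.1, 0.9].
parent = [0.5, 0.5]
children = [([0.9, 0.1], 0.6), ([0.1, 0.9], 0.4)]

# Information gain IG = H(parent) - sum_j p_j * H(child_j).
ig = entropy(parent) - sum(w * entropy(c) for c, w in children)
print(round(ig, 3))  # > 0: the split makes the child nodes more homogeneous
```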

Decision trees tend to overfit and have high variance. One solution to reduce the variance is to prune the tree based on a classification error threshold. Another solution is to combine the predictions from many different trees.
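A sketch of the pruning approach via a depth limit and cost-complexity pruning (the parameter values are assumptions, not tuned); combining many trees is the idea behind the random forest and boosting models listed in Section 1.1:

```python
from sklearn.tree import DecisionTreeClassifier

# Limit the depth and apply cost-complexity pruning (ccp_alpha) to reduce variance.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, ccp_alpha=0.001)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))  # train vs. test accuracy
```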