Supervised learning: classifying weather conditions of Cologne Bonn airport

Shanshan Wang
shanshan.wang@uni-due.de
Jan. 7, 2021

Table of Contents

1 Introduction
  1.1 Main objective
  1.2 Outline
2 Data
  2.1 Data description
  2.2 Data cleaning
  2.3 Feature engineering
  2.4 A summary of the data
3 Classifier models
  3.1 Logistic regression
  3.2 K nearest neighbors
  3.3 Support vector machines
  3.4 Decision trees
  3.5 Bagging and random forest
  3.6 Boosting and stacking
  3.7 Comparison of models
4 Summary
5 Suggestions for next steps

1 Introduction

1.1 Main objective

Flight operations at an airport are strongly affected by weather conditions, so the correct classification and prediction of weather conditions is important for air travel. This study aims to classify the weather conditions at Cologne Bonn airport using supervised machine learning algorithms. To this end, eleven supervised learning models are used to classify the weather conditions based on features including air temperature, atmospheric pressure, humidity, wind and so on. The models are standard logistic regression, L1 regularized logistic regression, L2 regularized logistic regression, K nearest neighbors (KNN), support vector machines (SVM), decision trees, random forest, extra random forest, gradient boosted trees, adaptive boosting (AdaBoost) and a voting classifier. They are compared and ranked in terms of the precision, recall, F-score, accuracy, ROC-AUC score and the confusion matrix of the prediction results. For the case studied here, the random forest model shows the best prediction performance.

1.2 Outline

The report is organized as follows. Section 2 describes the data set, the data cleaning, the feature engineering and a short data summary. Section 3 introduces the 11 classifier models, applies them to the data and compares their performance. Section 4 summarizes the study, and Section 5 gives suggestions for next steps.

2 Data

2.1 Data description

The data set used in this study comes from the website https://rp5.ru/Weather_archive_in_Cologne,_Bonn_(airport). The data, at a resolution of one hour, span the period from October 1, 2021 to December 31, 2021. In total, 28 attributes are present in the raw data set. After data cleaning, 8 attributes are used in this study: 7 attributes serve as the features for the classification and 1 attribute, the weather condition, is the classification target. Table 1 lists the data type and description of the 8 attributes. The weather conditions from October to December 2021 comprise clouds, fog, rain and snow. For the prediction, each class is represented by an integer. The final cleaned data matrix has 2201 rows and 8 columns.

Table 1: Description of attributes in the used data set

Attribute | Data type | Description
T  | float64 | Air temperature (degrees Celsius) at a height of 2 metres above the earth's surface
Po | float64 | Atmospheric pressure at weather station level (millimeters of mercury)
P  | float64 | Atmospheric pressure reduced to mean sea level (millimeters of mercury)
U  | int64   | Relative humidity (%) at a height of 2 metres above the earth's surface
DD | int64   | Mean wind direction (compass points) at a height of 10-12 metres above the earth's surface over the 10-minute period immediately preceding the observation
Ff | int64   | Mean wind speed (meters per second) at a height of 10-12 metres above the earth's surface over the 10-minute period immediately preceding the observation
WW | int64   | Present weather reported from the weather station
Td | float64 | Dewpoint temperature (degrees Celsius) at a height of 2 metres above the earth's surface

2.2 Data cleaning

LabelEncoder from scikit-learn is used to fit_transform the "DD" and "WW" columns to integers.
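
A minimal sketch of this cleaning step, assuming the raw data were exported to a CSV file (the file name here is hypothetical) with the column names of Table 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical file name; the raw export from rp5.ru may differ.
df = pd.read_csv("cologne_bonn_weather.csv")

# Keep the 8 attributes of Table 1 and drop rows with missing values.
df = df[["T", "Po", "P", "U", "DD", "Ff", "WW", "Td"]].dropna()

# Encode the categorical columns "DD" (wind direction) and "WW" (weather condition)
# as integers.
for col in ["DD", "WW"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```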

2.3 Feature engineering

2.4 A summary of the data

The raw data set has 2201 rows and 28 columns and contains many missing values. The data cleaning removed the missing values and reduced the number of attribute columns. The available data thus have 2201 rows and 8 columns, where 7 columns serve as the features for the classification and 1 column as the classification target. We normalized the 7 features, since their different scales would otherwise give them different importance in the classification. The correlations among the 7 features reveal a strong relationship between P and Po and between T and Td. We further split the data into a training and a test set, holding out 30% of the data for testing. The weather condition classes are cloud, represented by 0, fog, represented by 1, rain, represented by 2, and snow, represented by 3. Among the four classes, rain accounts for 54.45% of the observations, fog for 26.47%, cloud for 18.60% and snow for 0.45%.
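
A minimal sketch of the scaling and splitting described above, assuming the cleaned DataFrame df from Section 2.2; the choice of StandardScaler and the random_state are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 7 feature columns and the target column "WW" (weather condition).
features = ["T", "Po", "P", "U", "DD", "Ff", "Td"]
X, y = df[features], df["WW"]

# Correlations among the 7 features (e.g. P vs Po, T vs Td).
print(X.corr())

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Scale the features so that large scales do not dominate the classification.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```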

3 Classifier models

This section briefly introduces and applies 11 supervised learning models for the classification. Each model is first fitted to the training data set; for some models, a grid search is carried out to find the best estimator. Each fitted model is then used to predict the weather conditions on the test data set. The prediction performance is evaluated by the precision, recall, F-score, accuracy, ROC-AUC score and the confusion matrix of the predictions. The 11 classification models and their comparison are organized as follows.
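
The fit-predict-evaluate pattern is the same for all models. A generic sketch, assuming the split and scaled arrays from Section 2.4 and using logistic regression only as a placeholder:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Placeholder model; the same pattern applies to all 11 classifiers.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall and F-score per class.
print(classification_report(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
# One-vs-rest ROC-AUC for the multi-class problem.
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test),
                                multi_class="ovr"))
print(confusion_matrix(y_test, y_pred))
```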

3.1 Logistic regression

Logistic regression models the probabilities for classification problems with a logistic function.

The general logistic function in terms of the variable $x$ is $$P(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x+\varepsilon)}}\approx\frac{e^{(\beta_0+\beta_1 x)}}{1+e^{(\beta_0+\beta_1 x)}}\ . $$ The odds of the outcome in terms of the variable $x$ are $$\frac{P(x)}{1-P(x)}=e^{(\beta_0+\beta_1 x)}\ , \qquad \log \frac{P(x)}{1-P(x)}=\beta_0+\beta_1 x \ .$$
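
A sketch of the three logistic regression variants used in this study; the solver and the regularization strength C are assumptions:

```python
from sklearn.linear_model import LogisticRegression

# Standard (unregularized), L1 and L2 regularized logistic regression.
lr_models = {
    # penalty=None requires scikit-learn >= 1.2 (use penalty="none" in older versions).
    "standard": LogisticRegression(penalty=None, max_iter=1000),
    "L1": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
    "L2": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
}
for name, clf in lr_models.items():
    clf.fit(X_train, y_train)
```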

The confusion matrix is a table that summarizes the performance of a classification algorithm, e.g.,

                | Predicted Positive  | Predicted Negative
Actual Positive | True Positive (TP)  | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)

For the overall fraction of correct predictions, $$\mathsf{Accuracy}=\frac{\mathsf{TP}+\mathsf{TN}}{\mathsf{TP}+\mathsf{TN}+\mathsf{FP}+\mathsf{FN}}$$ For identifying all positive instances, $$\mathsf{Recall~or~Sensitivity}=\frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}}$$
For identifying only truly positive instances among the predicted positives, $$\mathsf{Precision}=\frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}}$$ For avoiding false alarms, $$\mathsf{Specificity}=\frac{\mathsf{TN}}{\mathsf{TN}+\mathsf{FP}}$$ The F1 score captures the trade-off between recall and precision, $$\mathsf{F1}=2\,\frac{\mathsf{Precision}\times\mathsf{Recall}}{\mathsf{Precision}+\mathsf{Recall}}$$

Multiple-class error metrics, e.g. for three classes:

               | Predicted Class 1 | Predicted Class 2 | Predicted Class 3
Actual Class 1 | TP1               |                   |
Actual Class 2 |                   | TP2               |
Actual Class 3 |                   |                   | TP3
$$\mathsf{Accuracy}=\frac{\mathsf{TP1}+\mathsf{TP2}+\mathsf{TP3}}{\mathsf{Total}}$$

The receiver operating characteristic (ROC) curve plots the sensitivity (recall) against the false positive rate and can be used to evaluate a classifier. The area under the ROC curve, i.e., the ROC-AUC, measures how well the two classes are separated. In the ROC plot of true positive rate (sensitivity) versus false positive rate (1 - specificity), the diagonal represents random guessing (ROC-AUC = 0.5), the lower right triangle corresponds to performance worse than guessing (ROC-AUC < 0.5), and the upper left triangle to performance better than guessing (ROC-AUC > 0.5). The closer the curve lies to the top left corner, the better the classification model. The ROC curve is generally better suited to data with balanced classes.

The precision-recall curve measures the trade-off between precision and recall. It is generally better suited to data with imbalanced classes.
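
A sketch of per-class ROC and precision-recall curves in the one-vs-rest setting, assuming a fitted model with predict_proba and the class encoding of Section 2.4:

```python
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.preprocessing import label_binarize

classes = [0, 1, 2, 3]  # cloud, fog, rain, snow
y_test_bin = label_binarize(y_test, classes=classes)
y_score = model.predict_proba(X_test)

for k, name in zip(classes, ["cloud", "fog", "rain", "snow"]):
    # Curves for class `name` against all other classes; these arrays
    # can then be plotted or summarized by the area under the curve.
    fpr, tpr, _ = roc_curve(y_test_bin[:, k], y_score[:, k])
    precision, recall, _ = precision_recall_curve(y_test_bin[:, k], y_score[:, k])
```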

3.2 K nearest neighbors

For an appropriate K nearest neighbors (KNN) model, a suitable value of K and a method for measuring the distance between neighbors are required. The elbow method is a common approach to determine the K value: in a curve of the error rate as a function of K, K is chosen at the elbow point, where the model approaches the minimum error on the hold-out set; beyond the elbow point, the rate of improvement slows or stops. Widely used distance measures include the Euclidean distance $$d_E=\sqrt{\sum\limits_i (p_i-q_i)^2}$$ and the Manhattan distance $$d_M=\sum\limits_i |p_i-q_i| \ .$$ Features on large scales contribute large distances and therefore have a heavier effect on the outcome than features on small scales. To avoid this, feature scaling is important so that all features contribute to the distance on a similar scale.

Here we select K = 3, where the error rate is lowest and the F1 score, which captures the trade-off between recall and precision, is high.
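
A sketch of the elbow search over K described above; the range of K values is an assumption:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Error rate on the hold-out set as a function of K.
error_rate = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(np.mean(knn.predict(X_test) != y_test))

# K = 3 is chosen at the elbow point, where the error rate is lowest.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
```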

3.3 Support vector machines

A support vector machine is a supervised learning model for classification, regression or other tasks that constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space.

There are no clear linear decision boundaries between the four classes in a 2-dimensional feature space, but hyperplane decision boundaries may exist in a 3- or higher-dimensional space.
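
A sketch of an SVM classifier with a grid search over the kernel and its parameters; the parameter grid itself is an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10],
              "gamma": ["scale", 0.1, 1],
              "kernel": ["linear", "rbf"]}

# probability=True enables predict_proba for the ROC-AUC evaluation.
svm_search = GridSearchCV(SVC(probability=True), param_grid, cv=5)
svm_search.fit(X_train, y_train)
svm_best = svm_search.best_estimator_
```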

3.4 Decision trees

Algorithm: starting from the root node, evaluate the candidate splits of the features at each node, choose the best split, and recurse on the resulting child nodes until a stopping criterion (e.g. a maximum depth, a minimum number of samples, or pure nodes) is reached.

The best split can be defined as the one that maximizes the information gained by the split. The information gain of an entropy-based split from node $i$ at step $t$ into nodes $j$ at step $t+1$ is defined by $$IG(i,t+1)=H(i,t)-\sum\limits_j p(j,t+1)\,H(j,t+1) \ ,$$ where the entropy of node $i$ at step $t$ is $$H(i,t)=-\sum\limits_{c=1}^{n}p_c(i,t)\log_2\big(p_c(i,t)\big)$$ and $p_c(i,t)$ is the proportion of class $c$ in that node. In contrast to splitting based on the classification error $$E(i,t)=1-\max_c\big(p_c(i,t)\big) \ ,$$ which tends to leave the end nodes not homogeneous, splitting based on the entropy can reach the goal of homogeneous end nodes. Another frequently used splitting criterion is the Gini index, defined by $$G(i,t)=1-\sum\limits_{c=1}^{n}p_c^2(i,t) \ . $$

To find the best split at each step, a greedy search can be used.

A decision tree tends to overfit and has high variance. One solution to reduce the variance is to prune the tree based on a classification error threshold. Another solution is to combine the predictions of many different trees.
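
A sketch of a decision tree with entropy-based splitting; limiting max_depth is one simple form of pruning to reduce the variance, and the depth value shown is an assumption:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy",  # or "gini" for the Gini index
                              max_depth=5,
                              random_state=42)
tree.fit(X_train, y_train)
```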

3.5 Bagging and random forest

Bagging, short for bootstrap aggregating, is an ensemble method that helps to reduce variance and avoid overfitting. It is usually applied to decision trees. For $n$ independent trees, each with variance $\sigma^2$, the variance of the bagged estimate is $\sigma^2/n$. However, the bootstrap samples may be correlated: if the correlation coefficient is $\rho$, the bagging variance equals $\rho\sigma^2 +(1-\rho)\sigma^2/n$, which tends to $\rho\sigma^2$ as $n$ increases. To avoid this, more randomness can be introduced by further de-correlating the trees, i.e., by using a random subset of features for each tree, e.g. a subset of $\sqrt{m}$ features for classification and $m/3$ features for regression, where $m$ is the total number of features. This method is called random forest; relative to bagging, its variance is further reduced. By additionally creating the splits at random, even more randomness is introduced; this method is called extra random trees (extremely randomized trees).
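
A sketch of the bagging, random forest and extra trees classifiers; n_estimators and random_state are assumptions, and max_features="sqrt" corresponds to the $\sqrt{m}$ feature subset for classification mentioned above:

```python
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)

bagging = BaggingClassifier(n_estimators=100, random_state=42)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
extra = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=42)

for clf in (bagging, forest, extra):
    clf.fit(X_train, y_train)
```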

3.6 Boosting and stacking

Boosting algorithm: weak learners are fitted sequentially, with each successive learner concentrating on the observations that the previous learners handled poorly (e.g. by re-weighting the samples or fitting the residuals); the final prediction combines all learners with appropriate weights.

A learning rate smaller than 1 acts as regularization and reduces overfitting: it shrinks the contribution of each successive learner by a factor less than one.

Different boosting methods use different loss functions. The AdaBoost (adaptive boosting) loss function is exponential, $e^{-\mathrm{margin}}$, and is more sensitive to outliers than other types of boosting. The gradient boosting loss function is more robust than the AdaBoost loss function; a common implementation of gradient boosting uses the log-likelihood loss $\log(1+e^{-\mathrm{margin}})$.

A stacked model combines base models of any kind. The output of a stacked model combines the outputs of the base models via weighting or a majority vote.
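
A sketch of the boosting and voting classifiers; the hyperparameters and the choice of base estimators for the vote are assumptions, and `knn` and `forest` refer to the classifiers sketched in the previous sections:

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Soft voting averages the predicted class probabilities of the base models.
vote = VotingClassifier(estimators=[("knn", knn), ("rf", forest), ("gbt", gbt)],
                        voting="soft")

for clf in (ada, gbt, vote):
    clf.fit(X_train, y_train)
```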

3.7 Comparison of models

The performance of the above classification models is compared in this section. The confusion matrix of the prediction results is visualized for each model. The diagonal elements of each matrix are the true positive counts of the classes: the larger a diagonal element, the more cases of the corresponding class are classified correctly. Due to the high proportion of rain, the third diagonal element is large in every confusion matrix. The models are also compared in terms of the precision, recall, F-score, accuracy and ROC-AUC score of the prediction results. Among all models, K nearest neighbors, random forest, extra random forest and the voting classifier perform better. Based on the value averaged over precision, recall, F-score, accuracy and ROC-AUC score, we further rank all models. As a result, the random forest performs best and the L1 regularized logistic regression performs worst in this study.
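
A sketch of this ranking by the average of the five scores; `models` is assumed to be a dict mapping model names to the fitted classifiers (all with predict_proba available, e.g. SVC(probability=True)), and the use of weighted averaging over classes is an assumption:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

rows = {}
for name, clf in models.items():
    y_pred = clf.predict(X_test)
    rows[name] = [precision_score(y_test, y_pred, average="weighted", zero_division=0),
                  recall_score(y_test, y_pred, average="weighted"),
                  f1_score(y_test, y_pred, average="weighted"),
                  accuracy_score(y_test, y_pred),
                  roc_auc_score(y_test, clf.predict_proba(X_test), multi_class="ovr")]

scores = pd.DataFrame(rows, index=["precision", "recall", "f1", "accuracy", "roc_auc"]).T
scores["mean"] = scores.mean(axis=1)
print(scores.sort_values("mean", ascending=False))  # best model first
```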

4 Summary

This study used 11 supervised learning models to classify the weather conditions at Cologne Bonn airport based on 7 features, including air temperature, atmospheric pressure, humidity, wind and so on. The 11 models are standard logistic regression, L1 regularized logistic regression, L2 regularized logistic regression, K nearest neighbors (KNN), support vector machines (SVM), decision trees, random forest, extra random forest, gradient boosted trees, adaptive boosting (AdaBoost) and a voting classifier. Each model, with its best estimator fitted on the training data set, was used to predict the weather conditions on the test data set. We evaluated the performance of the models by several indices, including the precision, recall, F-score, accuracy, ROC-AUC score and the confusion matrix. Among all models, K nearest neighbors, random forest, extra random forest and the voting classifier perform better. Ranking the models by the average of precision, recall, F-score, accuracy and ROC-AUC score, we found that the random forest performs best and the L1 regularized logistic regression performs worst in this study.

5 Suggestions for next steps

The value of random_state has some effect on the initialization of each model. In particular, the effect is noticeable for the tree-based methods. This issue should be examined and addressed in the next steps.

In the current study, the classification was based on features observed at the same time as the target, i.e., the weather condition. A further step is to use the classification models to predict the weather condition from features observed at a time preceding that of the corresponding target, and to evaluate and compare the models' performance in this setting.