Clustering road sections with respect to the correlations of traffic flows

Shanshan Wang
shanshan.wang@uni-due.de
Dec. 31, 2021


1 Introduction

Road sections in a traffic network manifest strong correlations of traffic flows, especially during rush hours. Due to these correlations, classifying road sections into groups benefits the identification of the traffic behavior of road sections based on the behavior of one or several road sections in the same group. The collective behavior of road sections in a group is important for traffic planning and management. Therefore, this report aims at clustering road sections with respect to the correlations of traffic flows. To this end, we work out the correlation matrix with the open dataset from the Ministry of Transport of the State of North Rhine-Westphalia (NRW), Germany. After a dimensional reduction of the correlation matrix, four clustering methods, namely $k$-means clustering, hierarchical clustering, DBSCAN clustering and mean shift clustering, are applied to our data set, and their clustering results are compared in terms of the silhouette values, which measure the cluster cohesion and separation. We summarize our results and give suggestions for next steps around this study.

2 Data

2.1 Description of datasets

This report uses the open dataset from the Ministry of Transport of the State of North Rhine-Westphalia (NRW), Germany, published under the Data licence Germany attribution 2.0. It lists the annual results of the traffic census in NRW during 2017. The attributes of the raw dataset are listed in Table 2, which contains the attributes' abbreviations, full names, units and data types. The data attributes retained after data cleaning are listed in Table 3. The data is aggregated by counting station name (ZST_NAME) and by street class and number (STRASSE), respectively, resulting in two data matrices, df3 and df4, which will be used for the analysis.

Table 1: the annual results of traffic census in NRW during 2017

Table 2: a summary of the dataset's attributes

2.2 Data cleaning

The data cleaning for the used dataset is carried out by converting the data types of some variables, i.e., attributes, grouping by counting station name (ZST_NAME) and by street class and number (STRASSE), respectively, and checking for and removing missing values.
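A minimal sketch of these cleaning steps in pandas is given below; the file name and the converted column are hypothetical placeholders for illustration, not the report's actual code.

```python
import pandas as pd

# Hypothetical file name; the real data is the NRW traffic census 2017.
df = pd.read_csv("verkehrszaehlung_nrw_2017.csv")

# Convert attributes stored with wrong data types (here a hypothetical
# numeric column that was read as strings).
df["DTV"] = pd.to_numeric(df["DTV"], errors="coerce")

# Aggregate by counting station name and by street class and number,
# then drop rows with missing values.
df3 = df.groupby("ZST_NAME").mean(numeric_only=True).dropna()
df4 = df.groupby("STRASSE").mean(numeric_only=True).dropna()
```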

Table 3: a summary of used data attributes

Table 4: data of road sections

Table 5: data of motorways

Now the data frames df3, holding the data of road sections, and df4, holding the data of motorways, are cleaned and ready for use. In the following, we will focus on the data matrix df3 with 18 attributes as columns and 323 counting stations, representing 323 road sections, as rows.

2.3 Feature engineering

Feature engineering is performed by visualizing the data matrix, displaying the relationships among six vehicle types, examining the skew values and logarithmically transforming the highly skewed variables.

Examine the skew values and log transform
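A minimal sketch of this step on df3; the 0.75 skewness threshold follows Section 2.4.

```python
import numpy as np

# Skew values of all attributes; pick out the highly skewed ones.
skew = df3.skew()
skewed_cols = skew[skew > 0.75].index

# log(1 + x) keeps zero counts finite while compressing large values.
df3[skewed_cols] = np.log1p(df3[skewed_cols])
```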

2.4 Summary of data exploration, data cleaning and feature engineering

The raw dataset includes 342 rows of items and 28 columns of attributes. Some attributes have wrong data types or missing values and cannot be used directly. Therefore, data cleaning is required to make the data available for analysis.

To this end, we converted the attributes with wrong data types to the correct types, selected the useful attributes, aggregated the data by counting station name (ZST_NAME) and by street class and number (STRASSE), respectively, and removed the rows of the data matrix with missing values. The resulting data matrix includes 323 rows of items and 18 columns of attributes. The attributes are also the variables for each item in each row.

The feature engineering reveals that the value ranges differ greatly across attributes. We thus examined the skew values and picked out the attributes with skew values larger than 0.75. A logarithmic transformation was applied to those attributes. In this way, all attributes have a similar scale and are comparable.

3 Correlation matrices of traffic flows

This section compares the correlation matrices of traffic flows before and after logarithmically transforming the highly skewed variables. To this end, we carried out the steps sketched below:
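A minimal sketch of the correlation step, assuming df3_raw denotes the data before the log transform (a hypothetical name); stcorr_df6 is the matrix named in Section 4.

```python
# pandas computes column-wise Pearson correlations, so the data matrix is
# transposed first to correlate road sections (rows of df3) with each other.
corr_before = df3_raw.T.corr()   # 323 x 323, before the transformation
stcorr_df6 = df3.T.corr()        # 323 x 323, after the transformation
```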

The results can be summarized as follows:

  - The vehicle flows show strong correlations among different road sections, represented by counting stations.
  - The logarithmic transformation puts all variables on a similar scale and makes them comparable. It also makes the correlations stronger than the original ones.
  - The distribution of correlations approaches a Cauchy-Lorentz distribution for small values before the transformation, but an exponential distribution for small values after the transformation.
  - The Shapiro-Wilk and Anderson-Darling normality tests on the distribution of correlations demonstrate that the elements of the correlation matrices are non-Gaussian distributed.
  - Reordering the counting stations and displaying them in a clustered heatmap makes the strongly correlated groups stand out as diagonal blocks in each correlation matrix. After the logarithmic transformation, the strongly correlated groups are more pronounced and better distinguishable from the weakly correlated ones.
  - By locating the maximal absolute correlation, the road section most strongly correlated with each given section can be identified.

4 Dimensionality reduction with PCA

In this section, we reduce the dimensions of the $323\times 323$ correlation matrix, i.e., stcorr_df6, resulting from the logarithmically transformed data of traffic flows, in the following way:

The above figure reveals that for n_components=4 the relative importances of all elements take similar values, in contrast to the other numbers of dimensions, which show greater importance for a few particular elements but less importance for the rest.

For n=3, 4 and 5, the densities of relative importance are positive. Among the three cases, the densities of relative importance for n=4 are larger than those for n=3 and n=5.

By fitting the KernelPCA model with grid-searched parameters, the best estimator results for n=4. Therefore, in the following, we will use the first four principal components of the correlation matrix, i.e., the $323\times 4$ matrix comp_pca, for clustering.
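Since KernelPCA has no built-in score for an unsupervised grid search, the sketch below emulates the model selection by comparing reconstruction errors over a small grid; the grid of kernels and component numbers is an assumption, not the report's exact setup.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.asarray(stcorr_df6)

# Score each parameter combination by how well the low-dimensional
# representation reconstructs the original correlation matrix.
best = (np.inf, None)
for n in (2, 3, 4, 5, 6):
    for kernel in ("linear", "rbf"):
        kpca = KernelPCA(n_components=n, kernel=kernel,
                         fit_inverse_transform=True)
        X_low = kpca.fit_transform(X)
        err = np.mean((X - kpca.inverse_transform(X_low)) ** 2)
        if err < best[0]:
            best = (err, (n, kernel))

# With n=4 as the best estimator, keep the first four principal components.
comp_pca = KernelPCA(n_components=4).fit_transform(X)   # 323 x 4
```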

5 Clustering models

The $323\times 4$ matrix comp_pca contains 323 rows of road sections and 4 columns of attributes, namely the first four principal components of the correlation matrix. This section uses four methods for clustering and compares their clustering results:

5.1 $k$-means clustering

There is a point where the above curve starts to bend, known as the elbow point. The x-value of this point is regarded as a reasonable trade-off between error and number of clusters. Therefore, we choose 7 as the optimal number of clusters.
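A minimal sketch of how such an elbow curve can be produced; the range of k values is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sum of squared distances to the nearest centroid (inertia) per k.
ks = range(2, 15)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0)
       .fit(comp_pca).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters $k$")
plt.ylabel("SSE (inertia)")
plt.show()
```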

The silhouette coefficient is a measure of cluster cohesion and separation. In the following, we use the silhouette coefficient to quantify how well a data point fits into its assigned cluster based on how close the data point is to other points in the cluster and how far away the data point is from points in other clusters. Silhouette coefficient values range between -1 and 1. Larger numbers indicate that samples are closer to their clusters than they are to other clusters.
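A minimal sketch of computing the silhouette coefficient for the $k$-means labels; k=7 follows the elbow choice above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(comp_pca)
print(silhouette_score(comp_pca, labels))   # in [-1, 1]; larger is better
```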

5.2 Hierarchical clustering
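A minimal sketch of hierarchical agglomerative clustering on comp_pca; the Ward linkage and the cluster count of 7 are assumptions, not the report's tuned values.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Dendrogram of the Ward linkage, to inspect the merge structure.
dendrogram(linkage(comp_pca, method="ward"))
plt.show()

# Flat clustering with an assumed cluster count of 7.
labels = AgglomerativeClustering(n_clusters=7, linkage="ward").fit_predict(comp_pca)
print(silhouette_score(comp_pca, labels))
```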

5.3 DBSCAN clustering

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) assigns one of three possible labels to any point: a core point has at least min_samples points within a distance eps; a border point lies within eps of a core point without being a core point itself; all remaining points are noise points.

DBSCAN algorithm: starting from an arbitrary unvisited core point, a cluster is grown by repeatedly absorbing all points that are density-reachable from its core points; points not reachable from any core point are labeled as noise.
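A minimal sketch of DBSCAN on comp_pca; the values of eps and min_samples are assumptions that would need tuning, e.g., via a k-distance plot.

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(comp_pca)
labels = db.labels_                                   # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```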

5.4 Mean shift clustering

The mean shift algorithm, which works similarly to $k$-means, partitions points according to their nearest cluster centroid. It proceeds as follows:

  1. Start from a random point as a centroid and set a window of size $W$, a square area covering this point.
  2. Calculate the weighted mean of density within the window $W$ (see the formula below).
  3. Shift the centroid of the window to the new mean, following the gradient toward the denser region.
  4. Repeat steps 2 and 3 until convergence (no shift), i.e., until a local density maximum ("mode") is reached.
  5. Repeat steps 1-4 for all data points.
  6. Assign data points to the centroids they converge to. In other words, the data points that lead to the same mode are grouped into the same cluster.

The weighted mean is calculated by

$m(x)=\frac{\sum\limits_{x_i\in W}x_iK(x_i-x)}{\sum\limits_{x_i\in W}K(x_i-x)}$

where $x_i$ is the position of point $i$ within the window, $x$ is the previous mean, $m(x)$ is the new mean, $\sum\limits_{x_i\in W}$ denotes a sum over the points within the window (bandwidth) $W$, and $K(\cdot)$ is the weighting (kernel) function.
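A minimal sketch of mean shift clustering on comp_pca; estimating the bandwidth from the data, and the quantile used for it, are assumptions.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# The bandwidth plays the role of the window size W.
bw = estimate_bandwidth(comp_pca, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(comp_pca)
labels, modes = ms.labels_, ms.cluster_centers_
```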

5.5 Comparison of clustering methods
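A minimal sketch of comparing the four methods by their silhouette scores; all parameter values here are assumptions carried over from the sketches above.

```python
from sklearn.cluster import (KMeans, AgglomerativeClustering, DBSCAN,
                             MeanShift)
from sklearn.metrics import silhouette_score

models = {
    "k-means": KMeans(n_clusters=7, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=7),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "mean shift": MeanShift(),
}
for name, model in models.items():
    labels = model.fit_predict(comp_pca)
    if len(set(labels)) > 1:             # silhouette needs >= 2 clusters
        print(name, silhouette_score(comp_pca, labels))
```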

6 Summary

This study used the data of traffic flows from 323 counting stations with regard to 18 attributes. Data cleaning and feature engineering made the raw data usable. In particular, a logarithmic transformation was performed on the attributes with large skew values. We worked out a $323\times 323$ correlation matrix from the processed data. The correlation matrix shows significant correlation structures both before and after the logarithmic transformation, but the latter are more pronounced.

We performed principal component analysis (PCA) for the dimensional reduction of the correlation matrix. The first four principal components were retained, so that the data matrix for clustering has dimensions $323\times 4$.

We applied four methods to cluster the road sections, indicated by counting stations: $k$-means clustering, hierarchical agglomerative clustering, DBSCAN clustering and mean shift clustering. For each method, we calculated the silhouette values, which measure the cluster cohesion and separation. Using the silhouette values to evaluate the clustering quality, we further compared the four methods. The results show that, among the four methods, mean shift clustering fits our data best. The differences among the four methods with respect to cluster cohesion and separation are also visible in a 3-dimensional space spanned by the first three principal components, where $k$-means clustering and mean shift clustering show better separation of clusters than the other two methods.

7 Suggestions for next steps

The above study compared the four clustering methods by silhouette scores. A further step would be to split the data set into a training and a testing set. The training set can be clustered by the four methods. Instead of the silhouette scores, the testing set can then be used to evaluate the clustering quality. The best method, with the parameters resulting from training, can then be used to predict the cluster to which unclassified data points belong.
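A minimal sketch of this suggestion for a method with a predict step, such as $k$-means; scoring the test labels with the silhouette coefficient is only one possible choice of evaluation and is an assumption here.

```python
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_train, X_test = train_test_split(comp_pca, test_size=0.2, random_state=0)

# Fit on the training set, then assign unseen points to the learned clusters.
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X_train)
test_labels = km.predict(X_test)
print(silhouette_score(X_test, test_labels))
```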