Shanshan Wang
shanshan.wang@uni-due.de
Dec.31, 2021
Road sections in a traffic network manifest strong correlation of traffic flows, especially during the period of rush hours. Due to the correlations, classifying road sections into groups benifit the identification of traffic behavior of road sections based on the behavior of one or several road sections in the same group. The collective behavior of road sections in a group is of importance in implimenting traffic planning and managenment. Therefore, this report aims at clustering road sections with respect to the correlations of traffic flows. To this end, we work out the correlation matrix with the open dataset from Ministry of Transport of the State of North Rhine-Westphalia (NRW), Germany. After dimensional reduction of the correlation matrix, four clustering methods, including $k$-means clustering, hierarchical clustering, DBSCAN clustering and mean shift clustering, are applied to our data set and their clustering results are compared in terms of the silhouette values, which measures the cluster cohesion and separation. We summize our results and give the suggestions for next steps around this study.
This report uses the open dataset from Ministry of Transport of the State of North Rhine-Westphalia (NRW), Germany, with the Data license Germany attribution 2.0. It lists the annual results of traffic census in NRW during 2017. The attributes of the raw dataset are listed in Table 2, which contains the attributes' abbrivations, full names, units and data types. After data cleaning, the used data attributes are listed Table 3. The data is aggregated by counting station name (ZST_NAME) and street class and number (STRASSE), respectively, resulting a data matrix df3 and df4, which will be used for analysis.
import pandas as pd
import numpy as np
import math
df=pd.read_csv('https://open.nrw/sites/default/files/opendatafiles/180501_NW_Jahrestabelle_2017.csv',
sep=';',encoding='latin_1',header=1)
df.drop([0],inplace=True)
df.reset_index(drop=True,inplace=True)
df
ID | STRASSE | STR_KL | STR_NR | STR_NR_ZUSATZ | ZST_NAME | TK_BLATT | ZST_NR | RICHTUNG_01 | RICHTUNG_02 | ... | DTV_PKW_LFW_MOT | DTV_PKW | DTV_LFW | DTV_MOT | DTV_PKW_mit_Anh | DTV_Bus | DTV_LKW_ohne_Anh | DTV_LKW_mit_Anh_Sattel | DTV_Sattel | DTV_Sonstige | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.0 | A 1 | A | 1.0 | NaN | Lotte | 3.713 | 5.143 | Bremen | Münster | ... | 59.364 | 53.019 | 6.151 | 194 | 1.371 | 191 | 1.943 | 12.914 | 10.424 | 0 |
1 | 10.0 | A 1 | A | 1.0 | NaN | Ladbergen | 3.812 | 5.118 | Bremen | Münster | ... | 49.662 | 43.686 | 5.779 | 197 | 1.183 | 141 | 2.007 | 9.437 | 7.211 | 477 |
2 | 20.0 | A 1 | A | 1.0 | NaN | Ascheberg | 4.111 | 5.121 | Bremen | Köln | ... | 55.362 | 50.779 | 4.461 | 122 | 914 | 124 | 2.011 | 8.007 | 5.675 | 1.014 |
3 | 25.0 | A 1 | A | 1.0 | NaN | Unna | 4.411 | 5.144 | Münster | Köln | ... | 97.513 | 88.203 | 8.891 | 419 | 1.554 | 227 | 2.657 | 14.265 | 11.196 | 0 |
4 | 30.0 | A 1 | A | 1.0 | NaN | Hengsen | 4.511 | 5.101 | Kamen, Münster | Wuppertal | ... | 89.824 | 81.237 | 8.107 | 480 | 1.378 | 197 | 4.005 | 14.489 | 11.015 | 1.829 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
337 | 2770.0 | L 937 | L | 937.0 | NaN | Detmold (S) | 4.019 | 5.518 | Detmold | Schlangen | ... | 13.973 | NaN | NaN | NaN | 83 | 55 | 495 | 11 | NaN | 213 |
338 | 2780.0 | L 938 | L | 938.0 | NaN | Detmold (W) | 4.019 | 5.519 | Detmold | Hiddesen | ... | 9.517 | NaN | NaN | NaN | 100 | 608 | 1077 | 47 | NaN | 146 |
339 | 2790.0 | L 954 | L | 954.0 | NaN | Bad Driburg | 4.220 | 5.529 | Bad Driburg | Warburg | ... | 7.024 | 6506 | 439 | 79 | 85 | 61 | 280 | 137 | 75 | 12 |
340 | 2800.0 | K 10 | K | 10.0 | NaN | Sennestadt (O) | 4017.000 | 5521.000 | Bielefeld | Oerlinghausen | ... | 4367 | NaN | NaN | NaN | 227 | 0 | 76 | 44 | NaN | 0 |
341 | 2810.0 | K 10 | K | 10.0 | NaN | Breckerfeld (O) | 4710.000 | 5508.000 | Gloertalsperre | Ennepetalsperre | ... | 2459 | NaN | NaN | NaN | 123 | 1 | 120 | 58 | NaN | 0 |
342 rows × 28 columns
df_legend=pd.read_csv('https://open.nrw/sites/default/files/opendatafiles/180501_NW_Jahrestabellen_2017_Legende.csv',
sep=';',encoding='latin_1')
df_attributes=df_legend[['Bezeichnung','Gegenstand','Dimension']].iloc[0:28,:]
df_attributes['type']=df.dtypes.to_list()
df_attributes
Bezeichnung | Gegenstand | Dimension | type | |
---|---|---|---|---|
0 | ID | Sortierungsvariable (aufsteigende Straßenkateg... | NaN | float64 |
1 | STRASSE | Straßenklasse und Nummer | NaN | object |
2 | STR_KL | Straßenklasse | NaN | object |
3 | STR_NR | Straßennummer | NaN | float64 |
4 | STR_NR_ZUSATZ | Straßennummerzusatz | NaN | object |
5 | ZST_NAME | Zählstellenname | NaN | object |
6 | TK_BLATT | TK-Blatt-Nummer | NaN | float64 |
7 | ZST_NR | Zählstellennummer | NaN | float64 |
8 | RICHTUNG_01 | Richtungsbezeichnung_1 | NaN | object |
9 | RICHTUNG_02 | Richtungsbezeichnung_2 | NaN | object |
10 | DTV_KFZ | DTV KFZ Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
11 | DTVmofr_KFZ | DTV KFZ Mo-Fr (Gesamtquerschnitt) | [KFZ/24h] | object |
12 | DTVsa_KFZ | DTV KFZ Sa (Gesamtquerschnitt) | [KFZ/24h] | object |
13 | DTVso_KFZ | DTV KFZ So (Gesamtquerschnitt) | [KFZ/24h] | object |
14 | DTV_SGV | DTV SGV Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
15 | DTV_SV | DTV SV Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
16 | DTVmofr_SV | DTV SV Mo-Fr (Gesamtquerschnitt) | [KFZ/24h] | object |
17 | DTVsa_SV | DTV SV Sa (Gesamtquerschnitt) | [KFZ/24h] | object |
18 | DTV_PKW_LFW_MOT | DTV PKW/Lieferwagen/Motorräder Mo-So (Gesamtqu... | [KFZ/24h] | object |
19 | DTV_PKW | DTV PKW Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
20 | DTV_LFW | DTV Lieferwagen Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
21 | DTV_MOT | DTV Motorräder Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
22 | DTV_PKW_mit_Anh | DTV Pkw mit Anhänger Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
23 | DTV_Bus | DTV Bus Mo-So (Gesamtquerschnitt) | [KFZ/24h] | object |
24 | DTV_LKW_ohne_Anh | DTV LKW ohne Anhänger_Mo-So Gesamtquerschnitt | [KFZ/24h] | object |
25 | DTV_LKW_mit_Anh_Sattel | DTV LKW mit Anhänger und Sattelzüge_Mo-So Gesa... | [KFZ/24h] | object |
26 | DTV_Sattel | DTV Sattelzüge_Mo-So Gesamtquerschnitt | [KFZ/24h] | object |
27 | DTV_Sonstige | DTV Sonstige_Mo-So Gesamtquerschnitt | [KFZ/24h] | object |
The data cleaning for the used dataset is carried out by converting the data types of some variables, i.e., attributes, grouping with counting station name (ZST_NAME) and street class and number (STRASSE), respectively, and checking and removing the missing values.
# change the data types
df2=df.copy()
df2.iloc[:,[1,2,4,5,8,9]]=df.iloc[:,[1,2,4,5,8,9]].astype('string',copy=False)
df2.iloc[:,10:28]=df.iloc[:,10:28].astype('float64',copy=False)
df2.drop(['ID','STR_NR_ZUSATZ','STR_KL','STR_NR','TK_BLATT','ZST_NR'],inplace=True,axis=1)
df2.describe()
DTV_KFZ | DTVmofr_KFZ | DTVsa_KFZ | DTVso_KFZ | DTV_SGV | DTV_SV | DTVmofr_SV | DTVsa_SV | DTV_PKW_LFW_MOT | DTV_PKW | DTV_LFW | DTV_MOT | DTV_PKW_mit_Anh | DTV_Bus | DTV_LKW_ohne_Anh | DTV_LKW_mit_Anh_Sattel | DTV_Sattel | DTV_Sonstige | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 341.000000 | 341.000000 | 341.000000 | 341.000000 | 339.000000 | 339.000000 | 341.000000 | 341.000000 | 339.000000 | 323.000000 | 323.000000 | 323.000000 | 339.000000 | 339.000000 | 339.000000 | 339.000000 | 323.000000 | 339.000000 |
mean | 143.951516 | 162.058554 | 117.628396 | 94.911223 | 285.593041 | 265.519622 | 334.917185 | 159.866798 | 295.728891 | 312.230529 | 300.872659 | 213.245867 | 274.359814 | 92.235988 | 213.789903 | 193.156590 | 171.324372 | 173.691655 |
std | 1321.034589 | 1512.193497 | 1044.156382 | 807.249933 | 805.148107 | 454.349521 | 721.708356 | 247.228792 | 3253.154933 | 2100.420417 | 443.682714 | 187.464179 | 280.342760 | 83.343361 | 362.188630 | 523.662929 | 305.198408 | 272.616367 |
min | 1.632000 | 1.647000 | 1.434000 | 1.374000 | 1.009000 | 1.093000 | 1.089000 | 1.017000 | 1.514000 | 1.258000 | 1.090000 | 1.704000 | 1.006000 | 0.000000 | 1.016000 | 0.000000 | 1.068000 | 0.000000 |
25% | 9.644000 | 10.646000 | 8.008000 | 6.342000 | 6.806500 | 6.905000 | 8.431000 | 4.080000 | 8.676000 | 8.396500 | 5.247000 | 89.000000 | 62.000000 | 29.500000 | 2.472000 | 5.834000 | 5.413500 | 0.000000 |
50% | 29.158000 | 31.420000 | 23.240000 | 18.925000 | 17.586000 | 16.478000 | 18.880000 | 72.000000 | 22.993000 | 25.188000 | 10.028000 | 148.000000 | 145.000000 | 62.000000 | 109.000000 | 16.000000 | 27.000000 | 23.000000 |
75% | 75.697000 | 81.721000 | 64.526000 | 54.414000 | 372.000000 | 401.000000 | 476.000000 | 206.000000 | 65.821500 | 62.091500 | 512.000000 | 289.500000 | 474.500000 | 130.000000 | 283.000000 | 227.000000 | 204.500000 | 258.000000 |
max | 23590.000000 | 26998.000000 | 18623.000000 | 14371.000000 | 12747.000000 | 3479.000000 | 8891.000000 | 2251.000000 | 55406.000000 | 27063.000000 | 2988.000000 | 1552.000000 | 1274.000000 | 608.000000 | 4716.000000 | 8031.000000 | 2477.000000 | 1105.000000 |
idx=[]
for x in ['ID','STR_NR_ZUSATZ','STR_KL','STR_NR','TK_BLATT','ZST_NR']:
idx.append(df_attributes[df_attributes['Bezeichnung']==x].index[0].tolist())
idx
[0, 4, 2, 3, 6, 7]
df2_attributes=df_attributes.copy()
df2_attributes.drop(index=idx,inplace=True)
df2_attributes.drop(columns=['type'],inplace=True)
df2_attributes['type']=df2.dtypes.to_list()
df2_attributes.reset_index(drop=True,inplace=True)
df2_attributes
Bezeichnung | Gegenstand | Dimension | type | |
---|---|---|---|---|
0 | STRASSE | Straßenklasse und Nummer | NaN | string |
1 | ZST_NAME | Zählstellenname | NaN | string |
2 | RICHTUNG_01 | Richtungsbezeichnung_1 | NaN | string |
3 | RICHTUNG_02 | Richtungsbezeichnung_2 | NaN | string |
4 | DTV_KFZ | DTV KFZ Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
5 | DTVmofr_KFZ | DTV KFZ Mo-Fr (Gesamtquerschnitt) | [KFZ/24h] | float64 |
6 | DTVsa_KFZ | DTV KFZ Sa (Gesamtquerschnitt) | [KFZ/24h] | float64 |
7 | DTVso_KFZ | DTV KFZ So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
8 | DTV_SGV | DTV SGV Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
9 | DTV_SV | DTV SV Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
10 | DTVmofr_SV | DTV SV Mo-Fr (Gesamtquerschnitt) | [KFZ/24h] | float64 |
11 | DTVsa_SV | DTV SV Sa (Gesamtquerschnitt) | [KFZ/24h] | float64 |
12 | DTV_PKW_LFW_MOT | DTV PKW/Lieferwagen/Motorräder Mo-So (Gesamtqu... | [KFZ/24h] | float64 |
13 | DTV_PKW | DTV PKW Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
14 | DTV_LFW | DTV Lieferwagen Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
15 | DTV_MOT | DTV Motorräder Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
16 | DTV_PKW_mit_Anh | DTV Pkw mit Anhänger Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
17 | DTV_Bus | DTV Bus Mo-So (Gesamtquerschnitt) | [KFZ/24h] | float64 |
18 | DTV_LKW_ohne_Anh | DTV LKW ohne Anhänger_Mo-So Gesamtquerschnitt | [KFZ/24h] | float64 |
19 | DTV_LKW_mit_Anh_Sattel | DTV LKW mit Anhänger und Sattelzüge_Mo-So Gesa... | [KFZ/24h] | float64 |
20 | DTV_Sattel | DTV Sattelzüge_Mo-So Gesamtquerschnitt | [KFZ/24h] | float64 |
21 | DTV_Sonstige | DTV Sonstige_Mo-So Gesamtquerschnitt | [KFZ/24h] | float64 |
df3=df2.groupby(['ZST_NAME']).mean()
df3.head()
DTV_KFZ | DTVmofr_KFZ | DTVsa_KFZ | DTVso_KFZ | DTV_SGV | DTV_SV | DTVmofr_SV | DTVsa_SV | DTV_PKW_LFW_MOT | DTV_PKW | DTV_LFW | DTV_MOT | DTV_PKW_mit_Anh | DTV_Bus | DTV_LKW_ohne_Anh | DTV_LKW_mit_Anh_Sattel | DTV_Sattel | DTV_Sonstige | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ZST_NAME | ||||||||||||||||||
AD Bonn-Nordost (N) | 120.127 | 134.011 | 97.080 | 84.835 | 7.454 | 7.753 | 10.298 | 3.331 | 110.475 | 101.790 | 8.252 | 433.0 | 823.0 | 299.0 | 3.133 | 4.321 | 3.412 | 1077.0 |
AD Bonn-Nordost (S) | 88.024 | 100.522 | 66.545 | 56.852 | 3.443 | 3.588 | 4.806 | 1.470 | 83.724 | 77.314 | 6.050 | 360.0 | 412.0 | 145.0 | 2.133 | 1.310 | 918.000 | 299.0 |
AD Bonn-Nordost (W) | 98.699 | 109.482 | 82.359 | 70.025 | 6.698 | 6.918 | 9.243 | 2.861 | 89.982 | 82.141 | 7.318 | 523.0 | 785.0 | 220.0 | 2.749 | 3.949 | 3.184 | 1014.0 |
AD Essen-Ost (O) | 122.509 | 132.714 | 114.031 | 89.698 | 8.051 | 8.275 | 11.137 | 3.022 | 113.139 | 105.280 | 7.616 | 243.0 | 535.0 | 224.0 | 3.109 | 4.942 | 4.061 | 560.0 |
AD Essen-Ost (S) | 60.391 | 67.577 | 45.811 | 44.276 | 3.692 | 3.771 | 5.165 | 1.118 | 56.090 | 52.105 | 3.853 | 132.0 | 298.0 | 79.0 | 1.534 | 2.158 | 1.701 | 232.0 |
df3.shape
(341, 18)
# check missing value
df3.isna().sum(axis=0)
DTV_KFZ 0 DTVmofr_KFZ 0 DTVsa_KFZ 0 DTVso_KFZ 0 DTV_SGV 2 DTV_SV 2 DTVmofr_SV 0 DTVsa_SV 0 DTV_PKW_LFW_MOT 2 DTV_PKW 18 DTV_LFW 18 DTV_MOT 18 DTV_PKW_mit_Anh 2 DTV_Bus 2 DTV_LKW_ohne_Anh 2 DTV_LKW_mit_Anh_Sattel 2 DTV_Sattel 18 DTV_Sonstige 2 dtype: int64
# remove missing values
df3.dropna(inplace=True)
df3.shape
(323, 18)
# check missing values again
df3.isna().sum(axis=0).sum()
0
df4=df2.groupby(['STRASSE']).mean()
df4.head()
DTV_KFZ | DTVmofr_KFZ | DTVsa_KFZ | DTVso_KFZ | DTV_SGV | DTV_SV | DTVmofr_SV | DTVsa_SV | DTV_PKW_LFW_MOT | DTV_PKW | DTV_LFW | DTV_MOT | DTV_PKW_mit_Anh | DTV_Bus | DTV_LKW_ohne_Anh | DTV_LKW_mit_Anh_Sattel | DTV_Sattel | DTV_Sonstige | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STRASSE | ||||||||||||||||||
A 1 | 77.569071 | 81.277714 | 70.564143 | 68.831571 | 9.567929 | 9.682143 | 12.910071 | 3.891857 | 66.59950 | 60.104857 | 6.153571 | 341.071429 | 371.351143 | 114.214286 | 117.434857 | 7.660286 | 5.962929 | 148.345929 |
A 2 | 93.449500 | 101.295000 | 79.981750 | 73.868000 | 16.646875 | 16.828125 | 22.437750 | 7.340500 | 75.15825 | 68.299200 | 7.852000 | 336.200000 | 207.287250 | 181.250000 | 3.483750 | 13.163125 | 10.585000 | 344.625000 |
A 3 | 110.086667 | 115.281333 | 102.781952 | 95.811238 | 13.143429 | 13.365667 | 17.633143 | 5.959524 | 95.02300 | 85.738667 | 8.861762 | 341.509714 | 221.048286 | 222.238095 | 36.174952 | 10.030619 | 8.126524 | 400.553048 |
A 30 | 44.595500 | 48.739000 | 38.420000 | 33.493500 | 9.248500 | 9.368500 | 12.230000 | 4.955000 | 34.15850 | 30.640000 | 3.254000 | 264.500000 | 667.000000 | 120.000000 | 1.369000 | 7.879500 | 6.465000 | 401.000000 |
A 31 | 33.046000 | 34.343000 | 30.403000 | 30.149000 | 4.220000 | 4.277000 | 5.851000 | 1.074000 | 28.16500 | 25.188000 | 2.956000 | 21.000000 | 580.000000 | 57.000000 | 650.000000 | 3.570000 | 2.789000 | 24.000000 |
# check missing value
df4.isna().sum(axis=0)
DTV_KFZ 0 DTVmofr_KFZ 0 DTVsa_KFZ 0 DTVso_KFZ 0 DTV_SGV 0 DTV_SV 0 DTVmofr_SV 0 DTVsa_SV 0 DTV_PKW_LFW_MOT 0 DTV_PKW 8 DTV_LFW 8 DTV_MOT 8 DTV_PKW_mit_Anh 0 DTV_Bus 0 DTV_LKW_ohne_Anh 0 DTV_LKW_mit_Anh_Sattel 0 DTV_Sattel 8 DTV_Sonstige 0 dtype: int64
# remove missing values
df4.dropna(inplace=True)
df4.shape
(127, 18)
# check missing values again
df4.isna().sum(axis=0).sum()
0
Now data frames df3 for the data of road sections and df4 for the data of motorways are clearned for using. In the following, we will focus on the data matrix df3 with 18 attributes as columns and 323 counting stations, representing 323 road sections, as rows.
Feature engineering is performed by visualizing the data matrix, displaying the relationship of six vechicle types, examing the skew values and logarithmically tranforming the highly skewed variables.
import seaborn as sns
from scipy import stats
from scipy.stats import norm, expon, cauchy
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
# data matrix of transpose of df3
ax = plt.subplots(figsize=(15, 5))
ax=sns.heatmap(df3.T,cmap="magma_r", vmin=0, vmax=800).set(ylabel='',xlabel='')
# draw the relationship of six vechicle types both by scatter plots and kernel density estimate (KDE)
vechicle=df3[['DTV_PKW','DTV_LFW','DTV_MOT','DTV_Bus','DTV_Sattel','DTV_Sonstige']]
g = sns.pairplot(vechicle, kind="kde", height=2.5)
g.map_upper(sns.scatterplot,color="k")
g.map_lower(sns.scatterplot,color="k")
g.add_legend()
<seaborn.axisgrid.PairGrid at 0x7fdb5add08b0>