Ranking and Clustering Cities in North Rhine-Westphalia, Germany

-- A Project for Applied Data Science Capstone by IBM/Coursera

Shanshan Wang
shanshan.wang@uni-due.de
Feb. 9, 2021

1 Introduction

Strategic city planning helps a state government improve the economic and living standards of its citizens. To this end, a better understanding of the cities in the state is important. In this project we cluster and evaluate the cities in North Rhine-Westphalia, Germany, with respect to six fields: working, education, living facilities, transportation, health care and leisure places. In this way, we can identify the top cities in each field as well as the bottom cities that need improvement in that field. Moreover, we can reveal the correlations among cities based on these fields and examine how a change in one city affects the correlated cities. Such correlations can support the simultaneous development of multiple cities and are therefore useful for city planning.

To simplify the problem, we use the frequency of appearance of categorized venues as an index of the level of development in each city field. Using $k$-means clustering and hierarchical clustering, we classify the cities into five clusters. For each cluster, we reveal a correlation pattern among the different city fields. To give recommendations for traveling or suggestions for city planning, we rank the top five and bottom five cities in each field.

This project report is organized as follows. In section 2, we describe the datasets and the processing of the raw data. In section 3, we compute the frequency of venues in each category and classify the cities by $k$-means clustering and hierarchical clustering. In section 4, we analyze and discuss the characteristics of the city clusters and select the best and the worst cities in each field. We conclude in section 5.

2 Datasets

The project uses two datasets. The first is from Wikipedia, from which we downloaded a table listing the population rank, name, population in 2017, area in square kilometers and population per square kilometer of the ten largest cities in North Rhine-Westphalia (NRW). The city names are then used to find their locations.

The second dataset is from Foursquare. For a given search query, i.e., a keyword, we search for relevant venues within a radius of 100,000 meters (100 km) around each central location. The central locations are the ten largest cities in NRW, so the retrieved venues cover almost the whole state. The Foursquare location data includes venue names, categories, addresses, latitudes, longitudes, distances, postal codes, city names, state names, countries and so on. We considered multiple search queries, i.e., Company, GmbH, Factory, Fabrik, Office, Restaurant, Supermarket, Shop, University, Universität, College, School, Hospital, Residence, Haus, Park and Transport, and stored the query for each venue as an additional column of the table.

In total, we downloaded 8321 data points for categorized venues from Foursquare, of which 6144 are located in NRW. They are visualized on maps by category. We group the search queries into six main categories: working (Company, GmbH, Factory, Fabrik and Office), education (University, Universität, College and School), living facilities (Restaurant, Supermarket, Shop, Residence and Haus), health care (Hospital), transportation (Transport) and leisure places (Park). In the following, we use these 6144 data points for our calculations.

Import necessary Libraries

Download the top ten largest cities in North Rhine-Westphalia as central locations

Define Foursquare Credentials and Version

Define a function for loading data from Foursquare with the central cities with a radius
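
A minimal sketch of how such a loader might build its request, assuming the legacy Foursquare v2 `venues/search` endpoint. The function name `build_search_url`, the placeholder credentials, the version date and the limit of 50 are illustrative assumptions, not the notebook's actual code. The sketch only constructs the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; newer Foursquare API versions use a different scheme.
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, version, lat, lng, query,
                     radius_m=100000, limit=50):
    """Return a venue-search URL around (lat, lng); the radius is in meters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "query": query,
        "radius": radius_m,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Hypothetical credentials and approximate coordinates for Düsseldorf.
url = build_search_url("MY_ID", "MY_SECRET", "20210209", 51.2277, 6.7735, "Hospital")
```

Calling the function once per central city and per search query would give the set of requests described in section 2.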

Search and load the building data based on the given key words

Define information of interest and filter dataframe

Select the rows with the locations in state Nordrhein-Westfalen
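
The state filter could look roughly like the sketch below. The toy table, the column names `name`, `city` and `state`, and the set of accepted spellings are assumptions for illustration; the real Foursquare table may label the state differently.

```python
import pandas as pd

# Toy stand-in for the filtered Foursquare table; column names are assumed.
df_venues = pd.DataFrame({
    "name": ["Uniklinik", "Hofgarten", "Hauptbahnhof", "Stadtpark"],
    "city": ["Essen", "Düsseldorf", "Osnabrück", "Bochum"],
    "state": ["Nordrhein-Westfalen", "NRW", "Niedersachsen", "Nordrhein-Westfalen"],
})

# Accept a few spellings of the state name (an assumption about the raw data).
nrw_names = {"Nordrhein-Westfalen", "NRW", "North Rhine-Westphalia"}

# Keep only the rows whose state matches one of the accepted spellings.
df_nrw = df_venues[df_venues["state"].isin(nrw_names)].reset_index(drop=True)
```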

Create map of Nordrhein-Westfalen (NRW) using latitude and longitude values and add markers to map

Company(GmbH), Factory (Fabrik), Office

Restaurant, Supermarket, Shop, Residence, Haus

University (Universität), College, School

Hospital

Transportation

Park

3 Methodologies

3.1 Frequency of categorized venues

Calculate the number of venues of each field in each city
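
One way to obtain such counts is a cross-tabulation of city against search query. The sketch below uses made-up rows and assumed column names `city` and `query`; it is not the notebook's original cell.

```python
import pandas as pd

# Toy venue table: one row per venue, with its city and the query that found it.
df_nrw = pd.DataFrame({
    "city": ["Köln", "Köln", "Köln", "Essen", "Essen", "Bochum"],
    "query": ["Office", "Park", "Park", "Hospital", "Office", "School"],
})

# Rows are cities, columns are search queries, cells are venue counts.
# Combinations that never occur are filled with 0.
df_counts = pd.crosstab(df_nrw["city"], df_nrw["query"])
```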

Convert the number of venues to the appearance frequency
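
The report does not spell out the normalization, so the sketch below assumes that the appearance frequency of a category in a city is that city's share of all venues in the category. Other choices, such as normalizing by the city's total number of venues, would also be plausible.

```python
import pandas as pd

# Toy count table: rows are cities, columns are search queries.
df_counts = pd.DataFrame(
    {"Office": [1, 1, 0], "Park": [2, 0, 0], "School": [0, 0, 1]},
    index=["Köln", "Essen", "Bochum"],
)

# Assumed definition: divide each column by its total,
# so every category sums to 1 across cities.
df_freq = df_counts / df_counts.sum(axis=0)
```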

Combine the sub-categories into one main category
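
The grouping of search queries into the six main categories from section 2 can be written as a lookup table. In the sketch below the mapping follows the text, while the toy frequencies and the choice to sum (rather than average) the sub-categories are assumptions.

```python
import pandas as pd

# Search query -> main category, as listed in section 2.
MAIN_CATEGORY = {
    "Company": "working", "GmbH": "working", "Factory": "working",
    "Fabrik": "working", "Office": "working",
    "University": "education", "Universität": "education",
    "College": "education", "School": "education",
    "Restaurant": "living facilities", "Supermarket": "living facilities",
    "Shop": "living facilities", "Residence": "living facilities",
    "Haus": "living facilities",
    "Hospital": "health care",
    "Transport": "transportation",
    "Park": "leisure places",
}

# Toy frequencies for a few sub-categories.
df_sub = pd.DataFrame(
    {"Company": [0.4, 0.1], "Office": [0.2, 0.3],
     "School": [0.5, 0.0], "Park": [0.1, 0.6]},
    index=["Köln", "Essen"],
)

# Transpose so the sub-categories form the index, group them by main
# category, add them up, and transpose back to cities x main categories.
df_main = df_sub.T.groupby(MAIN_CATEGORY).sum().T
```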

3.2 $k$-means clustering

We will use df_city2 as our data matrix to cluster the cities in NRW into 5 groups.
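
A minimal sketch of this step with scikit-learn's `KMeans`, run on random stand-in data rather than the real `df_city2`. The names `df_demo` and `df_demo_km`, the random seed and `n_init=10` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

fields = ["working", "education", "living facilities",
          "health care", "transportation", "leisure places"]

# Random stand-in for the city x field frequency matrix.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    rng.random((12, len(fields))),
    columns=fields,
    index=[f"city_{i}" for i in range(12)],
)

# Fit k-means with 5 clusters; a fixed random_state makes the run repeatable.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(df_demo)

# Attach the cluster label of each city as an extra column.
df_demo_km = df_demo.assign(cluster=kmeans.labels_)
```

Because the cluster numbering produced by $k$-means is arbitrary, any ordering of clusters by overall level has to be imposed afterwards.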

Update the cluster labels in the data matrix df_city_km

Visualize the clusters in a map

3.3 Hierarchical clustering

Calculate the correlation matrix of cities with the data matrix df_city
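
Since the data matrix has cities as rows and fields as columns, a city-by-city correlation matrix can be obtained by transposing before calling `corr`. The sketch below uses invented values and the hypothetical name `df_demo`.

```python
import pandas as pd

# Toy city x field matrix.
df_demo = pd.DataFrame(
    {"working": [0.9, 0.8, 0.1],
     "education": [0.7, 0.6, 0.4],
     "leisure places": [0.2, 0.1, 0.8]},
    index=["city_a", "city_b", "city_c"],
)

# DataFrame.corr correlates columns, so transpose to correlate cities.
corr_cities = df_demo.T.corr()
```

Here `city_a` and `city_b` have the same profile up to a constant shift and are therefore perfectly correlated, while `city_c` has the opposite profile and correlates negatively with both.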

Classify cities with hierarchical clustering. The strongly correlated city groups are shown in the diagonal blocks in the clustermap.

Get the first k cluster labels
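
Flat cluster labels can be cut from a hierarchical tree with SciPy's `fcluster`. The sketch below is an assumption about how this might be done, not the notebook's code: it builds two synthetic groups of cities, clusters the rows of their correlation matrix with average linkage, and asks for at most `k = 2` clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)

# Two synthetic field profiles; three noisy "cities" follow each of them.
profile_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
profile_b = profile_a[::-1]
noise = 0.05 * rng.standard_normal((6, 6))
X = np.vstack([profile_a] * 3 + [profile_b] * 3) + noise

# City-by-city correlation matrix (np.corrcoef treats rows as variables).
corr = np.corrcoef(X)

# Build the tree on the rows of the correlation matrix, then cut it
# so that at most k flat clusters remain.
k = 2
Z = linkage(corr, method="average")
labels = fcluster(Z, t=k, criterion="maxclust")
```

The labels start at 1, and the first three cities end up in one cluster and the last three in the other.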

Reorder the city index in data matrix df_city2 based on the cluster labels and reconstruct the heatmap with clusters. The cities within each cluster are not reordered, so the following heatmap differs slightly from the clustermap above.

Visualize the city clusters in a map

4 Results and Discussion

4.1 City clusters by k-means clustering

Average the frequencies over all cities in each cluster
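
A small sketch of the per-cluster average, using a toy table with a `cluster` column. The names `df_demo_km`, `cluster_means` and `cluster_score` are hypothetical, and the single score per cluster (the mean over fields) is an assumed summary.

```python
import pandas as pd

# Toy labeled data matrix: two fields, four cities, two clusters.
df_demo_km = pd.DataFrame(
    {"working": [0.8, 0.6, 0.1, 0.3],
     "education": [0.4, 0.6, 0.2, 0.0],
     "cluster": [0, 0, 1, 1]},
    index=["city_a", "city_b", "city_c", "city_d"],
)

# Mean frequency of each field over the cities of each cluster.
cluster_means = df_demo_km.groupby("cluster").mean()

# One overall score per cluster: the mean over all fields.
cluster_score = cluster_means.mean(axis=1)
```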

For each cluster, get the city names and the score, i.e., the average frequency over different fields

Box plots of each cluster in each city field

Cluster 1 includes the state capital Düsseldorf and the state's largest city, Köln. This cluster exhibits higher levels in all six fields: working, education, living facilities, health care, transportation and leisure places. In contrast, cluster 6, which contains many small cities, has the lowest level in each city field. Besides the cities in cluster 1, the cities in cluster 2, i.e., Dortmund and Essen, are suitable for working, living and relaxing, while the cities in cluster 3, i.e., Bochum and Duisburg, have more facilities in education, health care and transportation.

Correlation matrices of city fields for each cluster

The correlation matrices among the six categories show different patterns. In cluster 1, Düsseldorf and Köln exhibit strong correlations among working, education and leisure places on the one hand, and among living facilities, health care and transportation on the other. Clusters 2 and 3 show strong correlations among all fields except health care in cluster 2 and transportation in cluster 3. Although clearly strong correlations between working and education and between living facilities and health care can be found in cluster 4, the correlations in the last three clusters are weaker than those in the first three clusters, as the larger number of cities in the last clusters reduces the correlation values. The strong positive correlations in the matrices imply that an improvement (or a deterioration) in one city field tends to come with a rise (or a drop) in the level of another field. The strong negative correlations, in contrast, reveal an opposite relation between two city fields.

4.2 City clusters by hierarchical clustering

For each cluster, get the city names and the score, i.e., the average frequency over different fields

Unlike $k$-means clustering, which works on the data matrix, hierarchical clustering groups the cities with respect to their correlation matrix. Therefore, cities with strong correlations are grouped into the same cluster, and the resulting clusters differ from those obtained by $k$-means clustering.

Box plots of each cluster in each city field

The figure shows similar values for the different clusters across categories. We find more outliers than in the corresponding figure for $k$-means clustering, as more cities are included in each cluster.

Correlation matrices of city fields for each cluster

The correlation matrix is computed by averaging the frequencies of venues over the different categories. Thus, the clusters obtained by hierarchical clustering also show strongly positive correlations among the six categories, except for the individual categories that distinguish the clusters.

4.3 Ranking of cities

Find the top and bottom five cities in each field.
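
The ranking can be sketched with pandas' `nlargest` and `nsmallest`, applied field by field. The toy values and city labels below are invented for illustration.

```python
import pandas as pd

# Toy city x field frequency matrix with distinct values in each column.
df_demo = pd.DataFrame(
    {"working": [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02],
     "education": [0.05, 0.30, 0.10, 0.25, 0.02, 0.20, 0.08]},
    index=["city_a", "city_b", "city_c", "city_d", "city_e", "city_f", "city_g"],
)

# For every field, list the five cities with the highest and lowest frequency.
top5 = {f: df_demo[f].nlargest(5).index.tolist() for f in df_demo.columns}
bottom5 = {f: df_demo[f].nsmallest(5).index.tolist() for f in df_demo.columns}
```

With real data, ties between cities would need a deliberate tie-breaking rule, which this sketch does not address.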

The top five cities, e.g., Düsseldorf, Köln, Bochum, Essen and Duisburg, are recommended for travelling or living, depending on personal requirements.

The bottom five cities, e.g., Ahlen, Alpen, Aachen, Alfter and Arnsberg, are suggested for inclusion in the state's city planning and for improvement in the corresponding fields.

5 Conclusions

With the location data from Foursquare, we visualized the different cities on maps. To simplify the problem, we focused on the fields of working, education, living facilities, health care, transportation and leisure places in North Rhine-Westphalia (NRW), Germany.

Using the frequency of categorized venues as a data matrix, we classified the cities in NRW by $k$-means clustering. The $k$-means clustering is better suited to classifying cities and identifying the dominant ones, e.g., Düsseldorf and Köln in most fields, Dortmund and Essen in the fields of working, living and relaxing, and Bochum and Duisburg in the fields of health care and transportation. In addition, we used hierarchical clustering to classify cities based on their correlation matrix, so that cities with strongly positive correlations are grouped together. The correlations among cities with respect to each field provide a new perspective for city planning, which could become more efficient through the collective development of positively correlated cities.

By ranking the cities based on the frequency of venues in each field, we can find the cities that are suitable for travelling or living. Meanwhile, we identify the cities whose facilities in the corresponding fields need improvement. However, because of the limited quality of the dataset, the results of this project may be biased.

6 References

  1. North Rhine-Westphalia https://en.wikipedia.org/wiki/North_Rhine-Westphalia
  2. Location data from https://foursquare.com
  3. Map data from https://www.openstreetmap.org
  4. $k$-means clustering https://scikit-learn.org/stable/modules/clustering.html#k-means
  5. Hierarchical clustering https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering