-- A Project for Applied Data Science Capstone by IBM/Coursera
Shanshan Wang
shanshan.wang@uni-due.de
Feb. 9, 2021
A strategic city planning is of benefit to a state government to improve citizens' economic and living levels. To this end, a better understanding to the cities in a state is of importance. In this project we will cluster and evaluate the cities in North Rhine-Westphalia, Germany based on the fields in working, education, living facilities, transportation, health care and leisure places. By this way, we can find out the top cities in each field as well as the bottom cities that to be improved in the corresponding field. Moreover, we can reveal the correlations among cities based on the above-mentioned fields and figure out how a city's change impacts on the correlated cities. The correlations among cities will facilitate the development of multiple cities synchronously and therefore is useful to be applied to a city planning.
To simplify the issue, we focus on the frequency of appearance of categorized venues as an index to estimate the level of the development in each city field. By $k$-means clustering and hierarchical clustering, we classify the cities in five clusters. For each cluster, a correlation pattern among different city fields is disclosed. To give a recommendation for traveling or a suggestion for city planning, we rank the five top and bottom cities in each field.
This project report is organized as follows. In section 2, we describe the dataset we used and the processing for dealing with the raw data. In section 3, we work out the frequency values of venues in each category and classify the cities by the $k$-means clustering and the hierarchical clustering. In section 4, we analyze and discuss the characteristics of city clusters and select out the best and the worst cities in each field. We finally conclude our results in section 5.
The project uses two datasets. One is from Wikipedia, where we downloaded a table which lists the ranks of population ranks, names, populations in 2017, areas in square kilometer and populations in per square kilometer of the ten largest cities in North Rhine-Westphalia (NRW). The information of the city names are then used to find their locations.
The other dataset is from Foursquare company. With a given search query, i.e., a key word, we search the relevant venues around the central locations with a radius of 100000 kilometer. These central locations is set as the locations of the largest ten cities in NRW. In this way, the searched venues almost come from the whole state. The location data from Foursquare company includes the information of location names, categories, addresses, latitudes, longitudes, distances, postal codes, city's names, state's names, countries and so on. We considered multiple search queries, i.e., Company, GmbH, Factory, Fabrik, Office, Restaurant, Supermarket, Shop, University, Universität, College, School, Hospital, Residence, Haus, Park, Transport, and added their information as a column of that table.
Totally we downloaded 8321 data points for categorized venues from Foursquare company, where the 6144 data points are located in NRW. They are visualized on the maps by categories. We split all search queries into six main categories. They are named as working from the search queries Company, GmbH, Factory, Fabrik and Office, education from University, Universität, College and School, living facilities from Restaurant, Supermarket, Shop, Residence and Haus, health care from Hospital, transportation from Transport, and leisure places from Park. In the following, we will use these 6144 data points for our calculation.
Import necessary Libraries
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from scipy.cluster import hierarchy
import seaborn as sns; sns.set_theme()
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt # plotting library
import requests # library to handle requests
!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize
! pip install folium==0.12.0
import folium # plotting library
print('Folium installed')
print('Libraries imported.')
Requirement already satisfied: geopy in /Users/working/opt/anaconda3/lib/python3.8/site-packages (2.1.0) Requirement already satisfied: geographiclib<2,>=1.49 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from geopy) (1.50) Requirement already satisfied: folium==0.12.0 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (0.12.0) Requirement already satisfied: numpy in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from folium==0.12.0) (1.19.2) Requirement already satisfied: branca>=0.3.0 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from folium==0.12.0) (0.4.2) Requirement already satisfied: jinja2>=2.9 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from folium==0.12.0) (2.11.2) Requirement already satisfied: requests in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from folium==0.12.0) (2.24.0) Requirement already satisfied: MarkupSafe>=0.23 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from jinja2>=2.9->folium==0.12.0) (1.1.1) Requirement already satisfied: idna<3,>=2.5 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from requests->folium==0.12.0) (2.10) Requirement already satisfied: chardet<4,>=3.0.2 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from requests->folium==0.12.0) (3.0.4) Requirement already satisfied: certifi>=2017.4.17 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from requests->folium==0.12.0) (2020.6.20) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/working/opt/anaconda3/lib/python3.8/site-packages (from requests->folium==0.12.0) (1.25.11) Folium installed Libraries imported.
Download the top ten largest cities in North Rhine-Westphalia as central locations
URL = "https://en.wikipedia.org/wiki/North_Rhine-Westphalia"
tables = pd.read_html(URL)
df_topcities=tables[4]
df_topcities
Pos. | Name | Pop. 2017 | Area (km²) | Pop. per km2 | Map | |
---|---|---|---|---|---|---|
0 | 1 | Cologne | 1080394 | 405.15 | 2668 | NaN |
1 | 2 | Düsseldorf | 617280 | 217.01 | 2839 | NaN |
2 | 3 | Dortmund | 586600 | 280.37 | 2090 | NaN |
3 | 4 | Essen | 583393 | 210.38 | 2774 | NaN |
4 | 5 | Duisburg | 498110 | 232.81 | 2140 | NaN |
5 | 6 | Bochum | 365529 | 145.43 | 2509 | NaN |
6 | 7 | Wuppertal | 353590 | 168.37 | 2100 | NaN |
7 | 8 | Bielefeld | 332552 | 257.83 | 1285 | NaN |
8 | 9 | Bonn | 325490 | 141.22 | 2307 | NaN |
9 | 10 | Münster | 313559 | 302.91 | 1034 | NaN |
Define Foursquare Credentials and Version
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 200
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
Your credentails: CLIENT_ID: CLIENT_SECRET:
Define a function for loading data from Foursquare with the central cities with a radius
radius = 100000
def GetData(CLIENT_ID,CLIENT_SECRET,ACCESS_TOKEN,VERSION,LIMIT,df_topcities,search_query,radius):
for index, city in enumerate(df_topcities['Name']):
# Define a user_agent and an instance of the geocoder
address = city+', Germany'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
# Define the corresponding URL
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude,ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
# Send the GET Request and examine the results
results = requests.get(url).json()
# assign relevant part of JSON to venues
venues = results['response']['venues']
# tranform venues into a dataframe
dataframe=pd.json_normalize(venues)
dataframe['search_query']=search_query
if index==0:
dataframe_total=dataframe
else:
dataframe_total=dataframe_total.append(dataframe, ignore_index=True)
del address,location,latitude,longitude,url,results,venues
return dataframe_total
Search and load the building data based on the given key words
search_query=['Company','GmbH','Factory','Fabrik','Office','Restaurant','Supermarket','Shop','University','Universität','College','School','Hospital','Residence','Haus','Park','Transport']
for i in range(len(search_query)):
dataframe=GetData(CLIENT_ID,CLIENT_SECRET,ACCESS_TOKEN,VERSION,LIMIT,df_topcities,search_query[i],radius)
if i==0:
dataframe_total=dataframe
else:
dataframe_total=dataframe_total.append(dataframe, ignore_index=True)
dataframe_total.shape
(8321, 20)
dataframe_total.head()
id | name | categories | referralId | hasPerk | location.address | location.lat | location.lng | location.labeledLatLngs | location.distance | location.postalCode | location.cc | location.city | location.state | location.country | location.formattedAddress | venuePage.id | location.crossStreet | location.neighborhood | search_query | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5a311f5d029a5526b06f81db | DAHLER & COMPANY Immobilien Köln | [{'id': '5032885091d4c4b30a586d66', 'name': 'R... | v-1612611839 | False | Schildergasse 49 | 50.936320 | 6.953768 | [{'label': 'display', 'lat': 50.9363195, 'lng'... | 491 | 50667 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Schildergasse 49, 50667 Köln, Deutschland] | NaN | NaN | NaN | Company |
1 | 4d12234340e6548166d98d1a | Hair Company | [{'id': '4bf58dd8d48988d110951735', 'name': 'S... | v-1612611839 | False | Herzogstr. 10 - 12 | 50.937113 | 6.953708 | [{'label': 'display', 'lat': 50.937113, 'lng':... | 460 | 50667 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Herzogstr. 10 - 12, 50667 Köln, Deutschland] | NaN | NaN | NaN | Company |
2 | 5be061b6491be7002c55bfc1 | aesthetic company | [{'id': '54541900498ea6ccd0202697', 'name': 'H... | v-1612611839 | False | Neusser Str. 265 | 50.963793 | 6.953513 | [{'label': 'display', 'lat': 50.963793, 'lng':... | 2867 | 50733 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Neusser Str. 265, 50733 Köln, Deutschland] | NaN | NaN | NaN | Company |
3 | 556c82b9498ed18f2480b170 | Immobilien Company Petra Emmer e.K. | [{'id': '5032885091d4c4b30a586d66', 'name': 'R... | v-1612611839 | False | Dürener Str. 149 | 50.931332 | 6.917490 | [{'label': 'display', 'lat': 50.9313324, 'lng'... | 3081 | 50931 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Dürener Str. 149, 50931 Köln, Deutschland] | 150522531 | NaN | NaN | Company |
4 | 50ae192ce4b0f706ce71e90f | Practice Company | [{'id': '4bf58dd8d48988d197941735', 'name': 'C... | v-1612611839 | False | NaN | 50.938027 | 6.945670 | [{'label': 'display', 'lat': 50.93802689569254... | 1004 | NaN | DE | Köln | Nordrhein-Westfalen | Deutschland | [Köln, Deutschland] | NaN | NaN | NaN | Company |
Define information of interest and filter dataframe
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories','search_query'] + [col for col in dataframe_total.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe_total.loc[:, filtered_columns]
# function that extracts the category of the venue
def get_category_type(row):
try:
categories_list = row['categories']
except:
categories_list = row['venue.categories']
if len(categories_list) == 0:
return None
else:
return categories_list[0]['name']
# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)
# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
dataframe_filtered.head()
name | categories | search_query | address | lat | lng | labeledLatLngs | distance | postalCode | cc | city | state | country | formattedAddress | crossStreet | neighborhood | id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DAHLER & COMPANY Immobilien Köln | Real Estate Office | Company | Schildergasse 49 | 50.936320 | 6.953768 | [{'label': 'display', 'lat': 50.9363195, 'lng'... | 491 | 50667 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Schildergasse 49, 50667 Köln, Deutschland] | NaN | NaN | 5a311f5d029a5526b06f81db |
1 | Hair Company | Salon / Barbershop | Company | Herzogstr. 10 - 12 | 50.937113 | 6.953708 | [{'label': 'display', 'lat': 50.937113, 'lng':... | 460 | 50667 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Herzogstr. 10 - 12, 50667 Köln, Deutschland] | NaN | NaN | 4d12234340e6548166d98d1a |
2 | aesthetic company | Health & Beauty Service | Company | Neusser Str. 265 | 50.963793 | 6.953513 | [{'label': 'display', 'lat': 50.963793, 'lng':... | 2867 | 50733 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Neusser Str. 265, 50733 Köln, Deutschland] | NaN | NaN | 5be061b6491be7002c55bfc1 |
3 | Immobilien Company Petra Emmer e.K. | Real Estate Office | Company | Dürener Str. 149 | 50.931332 | 6.917490 | [{'label': 'display', 'lat': 50.9313324, 'lng'... | 3081 | 50931 | DE | Köln | Nordrhein-Westfalen | Deutschland | [Dürener Str. 149, 50931 Köln, Deutschland] | NaN | NaN | 556c82b9498ed18f2480b170 |
4 | Practice Company | College Administrative Building | Company | NaN | 50.938027 | 6.945670 | [{'label': 'display', 'lat': 50.93802689569254... | 1004 | NaN | DE | Köln | Nordrhein-Westfalen | Deutschland | [Köln, Deutschland] | NaN | NaN | 50ae192ce4b0f706ce71e90f |
dataframe_filtered.shape
(8321, 17)
Select the rows with the locations in state Nordrhein-Westfalen
df=dataframe_filtered[dataframe_filtered['state']=='Nordrhein-Westfalen'];
#removes duplicate rows based on lantitude and longitude.
df.drop_duplicates(subset=['lat', 'lng', 'city', 'search_query'], keep='last')
df.reset_index(drop=True,inplace=True)
df.shape
(6144, 17)
Create map of Nordrhein-Westfalen (NRW) using latitude and longitude values and add markers to map
address = 'Nordrhein-Westfalen, Germany'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
map_NRW = folium.Map(location=[latitude, longitude], zoom_start=8)
map_NRW1 = folium.Map(location=[latitude, longitude], zoom_start=8)
map_NRW2 = folium.Map(location=[latitude, longitude], zoom_start=8)
map_NRW3 = folium.Map(location=[latitude, longitude], zoom_start=8)
map_NRW4 = folium.Map(location=[latitude, longitude], zoom_start=8)
map_NRW5 = folium.Map(location=[latitude, longitude], zoom_start=8)
map_NRW6 = folium.Map(location=[latitude, longitude], zoom_start=8)
for lat, lng, name, categories, city, sq in zip(df['lat'], df['lng'], df['name'], df['categories'], df['city'],df['search_query']):
label = '{}, {}, {}'.format(name, categories, city)
label = folium.Popup(label, parse_html=True)
if sq in ('Company','GmbH','Factory','Fabrik','Office'):
color_query='orange'
folium.CircleMarker([lat, lng],radius=5,popup=label,color='black',fill=True,fill_color=color_query,fill_opacity=0.7,
parse_html=False).add_to(map_NRW1)
elif sq in ('Restaurant','Supermarket','Shop','Residence','Haus'):
color_query='red'
folium.CircleMarker([lat, lng],radius=5,popup=label,color='black',fill=True,fill_color=color_query,fill_opacity=0.7,
parse_html=False).add_to(map_NRW2)
elif sq in ('University','College','School','Universität'):
color_query='blue'
folium.CircleMarker([lat, lng],radius=5,popup=label,color='black',fill=True,fill_color=color_query,fill_opacity=0.7,
parse_html=False).add_to(map_NRW3)
elif sq in ('Hospital'):
color_query='white'
folium.CircleMarker([lat, lng],radius=5,popup=label,color='black',fill=True,fill_color=color_query,fill_opacity=0.7,
parse_html=False).add_to(map_NRW4)
elif sq in ('Transport'):
color_query='purple'
folium.CircleMarker([lat, lng],radius=5,popup=label,color='black',fill=True,fill_color=color_query,fill_opacity=0.7,
parse_html=False).add_to(map_NRW5)
elif sq in ('Park'):
color_query='green'
folium.CircleMarker([lat, lng],radius=5,popup=label,color='black',fill=True,fill_color=color_query,fill_opacity=0.7,
parse_html=False).add_to(map_NRW6)
folium.CircleMarker(
[lat, lng],
radius=4,
popup=label,
color='black',
fill=True,
fill_color=color_query, #'#3186cc',
fill_opacity=0.7,
parse_html=False).add_to(map_NRW)
map_NRW1