Battle of the Suburbs: Finding the best Suburb in Brisbane using k-Means Clustering Algorithm

IBM Data Science Professional Certificate Capstone Project

16 min read · Jul 29, 2020

Photo by Brisbane Local Marketing on Unsplash

Introduction

Brisbane is the third-largest city in Australia and the capital of Queensland, otherwise known as the Sunshine State. It has a diverse population of approximately 2.5 million people and ranks in the top 50 of the world’s most liveable cities. With over 280 days of sunshine and pleasant weather all year round, Brisbane attracts people both locally and internationally, seeing roughly 9 million visitors and 20,000 new residents every year. Moreover, with Queensland’s quick recovery from the COVID-19 crisis and the economy opening up, many are looking towards Queensland’s sunshine capital to start a new life. The biggest challenge for families moving to Brisbane is finding a place to live that is affordable, safe, and has some form of recreation. But how does one go about this without any experience living in the city? Well, you could ask someone who’s lived there to give you their opinion… or better yet, you could ask a computer what it thinks is the best suburb in Brisbane!

In this project, I utilise an unsupervised machine learning algorithm known as k-Means clustering, which takes unlabelled data and groups it based on similar features, to uncover underlying patterns about Brisbane’s suburbs. I use easily accessible datasets from the ABS and QLD Police for demographic and crime statistics, scrape a property website for housing prices, and use Foursquare’s Places API to get information about nearby venues. All of this feeds into the algorithm to try to answer the question: what are the most liveable suburbs in Brisbane? The results from this project will help prospective visitors and residents choose suburbs to live in based on their needs.

This project is part of my IBM Data Science Professional Certificate Capstone Project. For more details about the methodology, please check out my Jupyter notebook and detailed report available in my GitHub Repository.

Photo by Agnieszka Kowalczyk on Unsplash

Data Sources

Demographic Data: The ABS has an online Table Builder tool that allows users to build their own tables from a large number of variables in the Census data. For this project, we will use the Table Builder tool to extract population and age composition records for suburbs in the Brisbane region.
Crime Statistics: QLD Police provides an online crime statistics portal which records criminal offences with information on location, date, offence type, and time of day. The numbers are derived from official crime reports filed in the Queensland region. For this project, we will be using this database to retrieve crime statistics for Brisbane suburbs over the past five years.
Property Prices: In Australia, property sales data is available on many real estate websites; however, this information is either scattered across several places or monetised by real estate agencies. Fortunately for us, speakingsame.com is a real estate website that automatically collects data through scouring the net for house sales and rental prices advertised online. Information is updated daily with the median house and rental rate for each suburb. We will scrape the website’s data for housing data related to Brisbane suburbs.
Location Data: Foursquare Labs Inc. is one of many location data providers that can give information on the types of venues that exist within a radius of geographical coordinates provided by the user. The Foursquare location database is built on crowdsourced information entered by users checking into venues and filling in missing information about the place. Today many companies like Uber, Apple Maps, and Snapchat utilise this database for location information. In this project, we will use Foursquare’s Places API to request nearby venue information using the location of Brisbane suburbs, and group locations based on the number of food, shopping, and activity spots present.

Methodology

You can skip this part and jump straight to Exploratory Data Analysis if you just want to see the insights. Otherwise, in this section, I’ll go over how I obtained and merged the datasets.

Getting a list of suburbs in Brisbane

To begin I need a list of suburbs in Brisbane which I’ll use to obtain and merge other data sets. I create a simple web scraper using BeautifulSoup to obtain the list of suburbs from the Brisbane Local Councils website and add it to a table.
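A minimal sketch of such a scraper is shown below. It assumes the suburb names sit in the cells of an HTML table. The sample markup and the `extract_suburbs` helper are hypothetical, since the structure of the council page isn't reproduced in this post.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page; the real markup will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Suburb</th></tr>
  <tr><td>Ascot</td></tr>
  <tr><td> Bardon </td></tr>
  <tr><td>Chermside</td></tr>
</table>
"""

def extract_suburbs(html: str) -> pd.DataFrame:
    """Collect the text of every table cell into a one-column table."""
    soup = BeautifulSoup(html, "html.parser")
    names = [td.get_text(strip=True) for td in soup.find_all("td")]
    return pd.DataFrame({"Suburb": names})

suburbs = extract_suburbs(SAMPLE_HTML)
print(suburbs)
```

In practice the HTML string would come from an HTTP request to the page, and the selector would need adjusting to whatever element actually holds the suburb names.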

Geo-Coordinates for each suburb

Now that I have the names of all Brisbane suburbs, I can use the Google Maps Geocode API to get the geo-coordinates for each suburb, which I’ll need to plot them on a map and to find nearby locations later on.
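As a rough illustration of this step, the sketch below builds a Geocoding API request URL and pulls the coordinates out of a response. The helper names, the address suffix, and the hand-written sample response are assumptions for illustration, and no network call is made here.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(suburb: str, api_key: str) -> str:
    """Build a Geocoding API request URL for a Brisbane suburb."""
    # The address suffix is an assumption to disambiguate suburb names.
    params = {"address": f"{suburb}, Brisbane QLD, Australia", "key": api_key}
    return f"{GEOCODE_URL}?{urlencode(params)}"

def parse_lat_lng(response: dict):
    """Return (lat, lng) from the first result, or None if nothing matched."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written response in the shape the API returns.
# The coordinates are illustrative only, not real output.
sample_response = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": -27.43, "lng": 153.06}}}],
}

url = build_geocode_url("Ascot", "YOUR_API_KEY")
coords = parse_lat_lng(sample_response)
print(coords)
```

Keeping the parsing separate from the request makes it easy to handle suburbs for which the API returns no match.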

Housing Data

To get the housing and rental prices for each suburb, I use a Pandas function to read the table directly from the site http://house.speakingsame.com/city.php?q=Brisbane&sta=qld

Crime and Demographics Data

Next, I download the population, age, and crime data from their respective databases. While the population and age data were listed per suburb, the crime data from the QLD Police was a list of criminal offence reports since 2015 detailing the type of offence, the date it happened, and the location (in this case, the suburb). This yielded 435,529 rows of data, which needed to be reduced to counts of offences by suburb.

I use the groupby function in Pandas to count criminal offences by suburb, which yields 182 rows of data.

The age data were grouped in 5-year age brackets from 0 to 100+ years old, giving 22 columns of data.

To make this easier to work with, I group them into 4 buckets: Gen Z (0–19 years), Gen Y (20–34 years), Gen X (35–54 years), and Baby Boomers (55+ years).
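The grouping step might look something like this sketch. The column names and the handful of rows are made up for illustration and are not taken from the QLD Police data.

```python
import pandas as pd

# Made-up rows standing in for the offence reports; column names are assumptions.
offences = pd.DataFrame({
    "Suburb": ["Ascot", "Ascot", "Bardon", "Ascot", "Bardon", "Chermside"],
    "Offence": ["Theft", "Assault", "Theft", "Fraud", "Theft", "Assault"],
})

# One row per suburb with the number of reports filed there.
offence_counts = (
    offences.groupby("Suburb")
    .size()
    .reset_index(name="Offences")
)
print(offence_counts)
```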
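One way to collapse the 5-year brackets into these buckets is sketched below. It assumes each bracket column is keyed by its lower age bound, which is a simplification of the Table Builder export, and the counts are made up.

```python
import pandas as pd

def generation(lower_age: int) -> str:
    """Map the lower bound of a 5-year bracket to one of the four buckets."""
    if lower_age < 20:
        return "Gen Z"
    if lower_age < 35:
        return "Gen Y"
    if lower_age < 55:
        return "Gen X"
    return "Baby Boomers"

# Hypothetical bracket columns keyed by their lower bound (0, 5, ..., 100).
lower_bounds = list(range(0, 105, 5))
# Made-up counts: 10 people in every bracket for two suburbs.
ages = pd.DataFrame(10, index=["Suburb A", "Suburb B"], columns=lower_bounds)

# Sum the brackets belonging to each bucket, then convert to proportions.
buckets = ages.T.groupby(generation).sum().T
shares = buckets.div(buckets.sum(axis=1), axis=0)
print(buckets)
```

Working with proportions rather than raw counts keeps large and small suburbs comparable.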

Nearby Venues

To get a list of nearby venues, I use the Foursquare API and a function provided by IBM, which uses the geo-coordinates of each suburb to find venues within a 1 km radius.

Using the function, I get a list of nearby venues and the suburb each one is located in. There were 263 total categories and over 3,000 different venues across all suburbs.

As I’m only interested in the number of nearby food, shopping, and recreational venues, I aggregate them based on venue categories and group them by suburb.
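The aggregation could be sketched as follows. The category-to-group mapping and the venue rows are hypothetical, and the real mapping would need to cover all 263 categories.

```python
import pandas as pd

# Hypothetical mapping from venue category to one of three groups.
CATEGORY_GROUPS = {
    "Café": "Food", "Thai Restaurant": "Food",
    "Supermarket": "Shopping", "Clothing Store": "Shopping",
    "Park": "Activity", "Gym": "Activity",
}

# Made-up venue rows in the shape of (suburb, venue category).
venues = pd.DataFrame({
    "Suburb": ["Ascot", "Ascot", "Ascot", "Bardon", "Bardon"],
    "Category": ["Café", "Park", "Thai Restaurant", "Supermarket", "Bus Stop"],
})

# Label each venue and drop categories outside the three groups.
venues["Group"] = venues["Category"].map(CATEGORY_GROUPS)
labelled = venues.dropna(subset=["Group"])

# Count venues per suburb and group, one column per group.
venue_counts = (
    labelled.groupby(["Suburb", "Group"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["Food", "Shopping", "Activity"], fill_value=0)
)
print(venue_counts)
```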

Finding the number of nearby food, shopping, and recreational venues by Suburb

Calculating the Rate of Offences and Venues

After cleaning the data up, I merge all these various datasets into one data frame.

At this point, I realise that the population of a suburb will probably have a strong influence on the number of offences and nearby venues. To make these comparable across suburbs, I use the population of each suburb to calculate the rate of offences and venues per 1,000 people.
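The per-capita conversion is simple arithmetic: count divided by population, times 1,000. A small sketch with made-up figures and assumed column names:

```python
import pandas as pd

# Made-up figures; column names are assumptions, not the notebook's.
df = pd.DataFrame({
    "Suburb": ["Suburb A", "Suburb B"],
    "Population": [5000, 20000],
    "Offences": [400, 1000],
    "Food": [10, 10],
})

# Scale each count to a rate per 1,000 residents.
for col in ["Offences", "Food"]:
    df[f"{col} per 1000"] = df[col] / df["Population"] * 1000
print(df)
```

Here Suburb B has more offences in absolute terms but a lower rate once its larger population is accounted for.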

Rate of offences and venues per 1000 people

Now we can finally explore the data to see what we’re working with!

Exploratory Data Analysis

According to the Brisbane City Council, there are 192 suburbs located in the Brisbane City LGA. After vigorously cleaning and wrangling the various dataframes and merging them into one, we were left with 172. Suburbs with a population under 500 were dropped from our analysis since they were either industrial suburbs (Brisbane Airport, Port of Brisbane, etc.) or uninhabited (Mount Coot-tha, Banks Creek).

Population

We can use Python’s Folium library to visualise the suburbs’ locations and their relative population sizes, as seen below:

The average population of a suburb is approximately 6,500, with the most populous suburbs being Forest Lake (22,898), Sunnybank Hills (18,087) & Calamvale (17,123). Pallara (514), Archerfield (540) & Sumner (590) were the least populated.

Age Groups

Top & bottom 5 Suburbs by Age Group Composition

From this diagram, we can see that Gen X (35–54 years) is the most common age group, averaging 27.9% of the population, slightly higher than the other groups. Otherwise, there seems to be a relatively equal distribution of the age groups. Gen Z (0–19 years) has a lot of outliers: suburbs such as Newstead & Fortitude Valley have a relatively small proportion of children, indicating that these suburbs probably have a low number of families with children. In contrast, Gen Y (20–34 years) makes up the largest share of the population in Bowen Hills, Fortitude Valley, and Brisbane City. This doesn’t come as a surprise since these suburbs are close to the CBD, where most of the businesses and nightlife areas are. Lastly, Pinjarra Hills has the highest proportion of Baby Boomers (55+ years), who make up 58.4% of its population.

Crime

Distribution of offence rates in Brisbane suburbs

Brisbane is regarded as a safe city with a crime index of only 34.83. However, like any big city, some areas see a significantly higher number of criminal offences. The figure above shows the distribution of criminal offence rates in Brisbane suburbs. Crime is generally low, averaging 81.9 offences per 1,000 residents per year. However, on the map we see that the city centre, especially Brisbane City and Fortitude Valley, has significantly higher crime rates, making it more dangerous than the rest of the city.

Brisbane crime map by yearly offence rate per 1000 residents

Property Prices

The figure shows the distribution of median house prices and rents along with the most expensive and cheapest suburbs in Brisbane. We can see that both graphs follow a similar distribution, with the average cost of a house being $774,883 and the average rent $548.90 per week.

Venues

The second box & whisker plot seen below shows the distribution of nearby venues. We can see that certain suburbs like Petrie Terrace have a lot of nearby food, shopping and activity venues whereas suburbs such as Chermside West, Bridgeman Downs, and Carindale have very few.

Top & bottom 5 suburbs for nearby venues

Correlation Matrix

The correlation matrix in the figure above provides some valuable insights. As expected, we see a strong positive correlation between the venue types, indicating that some suburbs have a lot of nearby venues whereas others may have very few. The median house price and rent were also expected to have a positive correlation. It was interesting to see that there was a slight positive correlation between the proportion of Gen Y (20–34 years) and both crime and the number of nearby venues. Lastly, we do see a slight relationship between Gen X (35–54 years) and Gen Z (0–19 years), suggesting that some suburbs are more popular with families, especially those with younger children. We also see some negative correlations between Gen Y and both Gen Z and Baby Boomers.
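For readers who want to reproduce this kind of matrix, a minimal sketch with made-up numbers and assumed column names follows. The values in the figure come from the full suburb table, not from this toy data.

```python
import pandas as pd

# Made-up feature table; the real one has a row per suburb.
features = pd.DataFrame({
    "House Price": [500, 600, 700, 800, 900],
    "Rent": [400, 450, 500, 550, 600],
    "Crime Rate": [90, 85, 70, 60, 40],
})

# Pearson correlation between every pair of numeric columns.
corr = features.corr()
print(corr.round(2))
```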

Model Development

K-Means clustering is an unsupervised machine learning algorithm which can be used to find clusters of data points that share similar attributes and to assign each point to one of those groups. The user defines the number of clusters, and the algorithm places one centroid per cluster. It then iterates: each data point is assigned to its nearest centroid, and each centroid is moved to the mean of the points assigned to it, until the assignments stop changing. From our exploratory data analysis, we have observed some correlations between our variables and have spotted some clusters forming. We can run the k-Means algorithm on our data set to find groups of suburbs that share similar qualities, which can guide decision making when it comes to deciding where you want to live.
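To make the iteration concrete, here is a bare-bones sketch of the standard k-Means loop (often called Lloyd's algorithm) in NumPy. It is an illustration on made-up points, not the implementation used in this project, and the `k_means` function name is my own.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points, then move centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of its points (kept in place if it has none).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated made-up blobs.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = k_means(X, k=2)
print(labels)
```

Which blob ends up labelled 0 or 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.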

First, we’ll need to drop any non-numerical columns from our data set such as the name of the suburbs and geo-coordinates and then normalise the dataset.
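A sketch of this preparation step is below. The post doesn't pin down the normalisation used, so z-score standardisation is an assumption here (it matches what scikit-learn's `StandardScaler` computes). The column names and values are made up.

```python
import pandas as pd

# Made-up table; column names are assumptions.
df = pd.DataFrame({
    "Suburb": ["A", "B", "C"],
    "Latitude": [-27.4, -27.5, -27.6],
    "Longitude": [153.0, 153.1, 153.2],
    "House Price": [600000, 800000, 1000000],
    "Crime Rate": [60.0, 80.0, 100.0],
})

# Keep only the numeric features the clustering should see.
features = df.drop(columns=["Suburb", "Latitude", "Longitude"])

# Standardise: zero mean and unit (population) standard deviation per column.
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(3))
```

Without this step, a feature measured in hundreds of thousands of dollars would dominate the distance calculation over one measured in tens.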

Normalising the dataset for k-Means clustering

The next step is to find the optimal number of clusters. We will use the elbow method, which runs the k-Means algorithm for a range of k values and plots the within-cluster error for each; the inflexion point on the graph is generally regarded as the optimum k. From the figure below, no clear elbow exists; however, we do see a slight inflexion at k=4 and will use this as the number of clusters.
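The elbow computation can be sketched with scikit-learn as below. The data is synthetic (three made-up blobs), and the choice of `n_init=10` and the range of k are assumptions rather than the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three blobs of 30 points each in two dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Within-cluster sum of squares (inertia) for each candidate k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting inertia against k gives the elbow chart. On clean synthetic data the bend is sharp, whereas on real suburb data it can be as faint as the one described above.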

Results

Cluster 0: Suburbs to avoid

Bowen Hills, Brisbane City, Fortitude Valley, Herston, Milton, Newstead, Petrie Terrace, South Brisbane, Spring Hill & Woolloongabba (Total: 10)

Avg Yearly offence rate for Clusters & Brisbane

These suburbs are conveniently located at or near the Brisbane CBD, but they see some of the highest crime rates, significantly higher than the average for the entire Brisbane City. They are also the 2nd most expensive group of suburbs after Cluster 3, and the higher crime rates don’t make these suburbs a viable option. Nevertheless, these suburbs do have the most nearby venues and the highest proportion of Gen Y (20–34) residents. They could be a viable option for young individuals who want to be close to the action, but Cluster 2 suburbs could offer a bit more in terms of savings and safety. These suburbs should be avoided by families, and the age group distribution confirms this, showing close to 50% for Gen Y while Gen Z (0–19) only makes up 11%.

Cluster 1: Budget suburbs

Acacia Ridge, Algester, Anstead, Ashgrove, Aspley, Bald Hills, Banyo, Bellbowrie, Belmont, Boondall, Bracken Ridge, Bridgeman Downs, Brighton, Calamvale, Camp Hill, Carina, Carindale, Carseldine, Chapel Hill, Chermside West, Chuwar, Corinda, Deagon, Doolandella, Drewvale, Durack, Eight Mile Plains, Ellen Grove, Everton Park, Ferny Grove, Fitzgibbon, Forest Lake, Geebung, Gordon Park, Heathwood, Hemmant, Holland Park, Holland Park West, Inala, Jamboree Heights, Jindalee, Karana Downs, Kedron, Kenmore, Keperra, Kuraby, Lota, Mackenzie, Manly West, Mansfield, McDowall, Middle Park, Mitchelton, Moggill, Moorooka, Mount Crosby, Mount Gravatt East, Murarrie, Nudgee, Oxley, Pallara, Parkinson, Pinjarra Hills, Riverhills, Runcorn, Salisbury, Sandgate, Seven Hills, Seventeen Mile Rocks, Sherwood, Sinnamon Park, Stafford Heights, Stretton, Sunnybank Hills, Taigum, Tarragindi, The Gap, Tingalpa, Upper Kedron, Wakerley, Wavell Heights, Westlake, Wishart, Wynnum, Wynnum West, Yeronga & Zillmere (Total: 87)

Cluster 1 contains the most suburbs and has the highest population, with the majority of Brisbane residents living in these suburbs. It is quite similar to Cluster 3 in that crime is low and the split of the age groups is identical. However, house prices and rents are significantly lower, with homes going for around $600,000 and rent at roughly $488/week, explaining why most choose to live in these areas. The main downside of these suburbs is the lack of nearby venues, on which they score the lowest amongst all clusters.

Avg Median house price for Clusters & Brisbane

Looking at the map, we can see that all of these suburbs are outside the city centre, and residents will be required to commute a fair bit if they are to come to the CBD. Some of the suburbs are located close to the University of Queensland’s St Lucia campus and Griffith University’s Nathan campus. The cheaper rent, low crime and proximity to these campuses would make the suburbs of Yeronga, Sherwood and Chapel Hill (UQ) and Salisbury, Moorooka, and Tarragindi (Griffith) ideal options for students. The cheaper rent and low crime rates would also benefit families on a budget and young professionals who wouldn’t mind the lack of nearby venues and the commute to the city.

Cluster 2: Mid-tier suburbs

Albion, Alderley, Annerley, Archerfield, Auchenflower, Cannon Hill, Carina Heights, Chermside, Coopers Plains, Coorparoo, Darra, Dutton Park, East Brisbane, Enoggera, Fairfield, Gaythorne, Greenslopes, Highgate Hill, Indooroopilly, Kangaroo Point, Kelvin Grove, Lutwyche, Macgregor, Morningside, Mount Gravatt, Nathan, Newmarket, Northgate, Nundah, Richlands, Robertson, Rocklea, St Lucia, Stafford, Sumner, Sunnybank, Taringa, Toowong, Upper Mount Gravatt, Virginia, Wacol, West End, Windsor, Wooloowin & Yeerongpilly (Total: 45)

The suburbs of Cluster 2 are the mid-tier suburbs of Brisbane. They are more expensive than the suburbs in Cluster 1 but still cheaper than those in Cluster 3, with residents expected to pay approximately $800,000 for a house or around $519/week to rent. Looking at the map, we can see that one of the main advantages these suburbs have is that they are a lot closer to the CBD and that there are a lot more food and shopping spots nearby compared with Clusters 1 and 3.

Avg Composition of Age Groups for Clusters & Brisbane

However, the main drawback is higher crime rates. This group of suburbs would suit individuals and young professionals who want to be closer to the city and enjoy a fair share of nearby food and shopping venues. Families would be ill-advised to reside in these suburbs due to the higher crime, while Cluster 1 and 3 suburbs offer more for them. The age group distribution highlights this, showing a higher proportion of Gen Y (20–34 years) living in this area at 34%, with lower proportions of Gen Z and Baby Boomers at approximately 20%.

Cluster 3: Premium suburbs

Ascot, Balmoral, Bardon, Brookfield, Bulimba, Burbank, Chandler, Chelmer, Clayfield, Fig Tree Pocket, Graceville, Grange, Gumdale, Hamilton, Hawthorne, Hendra, Kenmore Hills, Manly, Mount Ommaney, New Farm, Norman Park, Paddington, Pullenvale, Red Hill, Rochedale, Shorncliffe, Teneriffe, Tennyson, Upper Brookfield & Wilston (Total: 30)

This cluster of suburbs has the highest property prices, with house prices close to a million dollars and rents of around $771/week. Nevertheless, the benefit is that these areas are the safest in Brisbane, with the lowest crime rates. There appears to be a high proportion of Gen X (35–54 years) and Gen Z (0–19 years) in the area, which could indicate that a lot of families are living in these suburbs. Additionally, there is a moderate number of nearby food, shopping, and activity venues that should give you plenty to do and explore.

From the map, we can see that most of the suburbs are scattered around the city centre. The ideal suburb would depend on location with New Farm and Paddington being closest to the CBD while Manly and Shorncliffe would give you the best access to the coast. With the high price, low crime, and the moderate number of nearby amenities, these suburbs would be perfect for upper-middle-class families looking for a safe place for themselves and their children.

Discussion

To test the accuracy of our clustering algorithm, we compare the results with Domain’s list of Brisbane suburbs ranked for liveability. Compared with this article, we do see some similarities in results, with Ascot, Grange, Paddington, Red Hill, and Wilston appearing among the top suburbs in both. However, we do see some significant differences, especially with Brisbane City and Fortitude Valley ranked 26th and 35th, respectively, when our clustering algorithm classed them as suburbs to avoid.

Overall, it was interesting to see how the algorithm had clustered the suburbs based on the relationships between variables. For example, in Cluster 0, the algorithm seems to have grouped suburbs with high crime, a high proportion of Gen Y, and a high number of venues. Clusters 1 and 3 had similar attributes except for property prices and the number of nearby venues, and the clustering algorithm did an excellent job of dividing these groups up to differentiate between the premium and budget suburbs.
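Comparisons like these come from averaging each feature within a cluster. A minimal sketch with made-up rows and assumed column names:

```python
import pandas as pd

# Made-up rows; "Cluster" holds the label k-Means assigned to each suburb.
df = pd.DataFrame({
    "Suburb": ["A", "B", "C", "D"],
    "Cluster": [0, 0, 1, 1],
    "Crime Rate": [200.0, 300.0, 50.0, 70.0],
    "House Price": [700000, 900000, 550000, 650000],
})

# Average of every numeric feature within each cluster.
profile = df.groupby("Cluster").mean(numeric_only=True)
# Number of suburbs per cluster.
profile["Suburbs"] = df.groupby("Cluster").size()
print(profile)
```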

However, we do see some issues arising from the algorithm. It seems to have clustered suburbs such as Herston & Milton into Cluster 0 because they had a high number of nearby venues, even though they had only moderate crime. These suburbs would have been better suited to Cluster 2. Similarly, Archerfield was grouped into Cluster 2 even though it had higher crime rates than Herston & Milton; it just lacked the number of nearby venues to meet the criteria for Cluster 0. Interestingly, when rerunning the algorithm with five clusters, it does break Cluster 0 into two groups, separating out Milton, Petrie Terrace, and Newstead and classifying them as suburbs with slightly less crime than Cluster 0 but a lot more nearby venues. Nevertheless, Herston, Archerfield and a few more remain clustered incorrectly. The drawbacks of k-Means are visible here, namely its sensitivity to outliers and the ambiguity over the ideal number of clusters. It would be interesting to see how the algorithm would perform with more data on suburbs that can measure their liveability.

Conclusion

Looking back at the question asked at the start of this post, the algorithm provides a few options for the ideal suburb depending on your background and what you’re looking for. If you are a family moving to Brisbane looking for a safe place with plenty of recreational activities nearby, and you have money to spend, Cluster 3 is your ideal choice. If you are looking for a cheaper place and don’t mind the lack of nearby amenities or the commute to the city, then Cluster 1 is your best bet. Finally, if you are a young professional and want to be close to the action, but don’t want to pay premium prices for accommodation, then Cluster 2 is the right place for you. Cluster 0 suburbs are best avoided since there are better options available, listed above. If you do want to be right in the CBD, though, please be wary of the high levels of crime in the area.

However, it is essential to note that no algorithm is perfect. The purpose of this project was to see how well a machine learning algorithm can determine what the best suburbs in Brisbane are, based purely on data. While machine learning can uncover hidden trends, getting some first-hand experience of living in the area or asking a local can provide additional insight. In other words, don’t blame the algorithm if you end up in a less than ideal situation! I hope this post serves you well in your quest to find a new home in Brisbane, Queensland’s sunshine capital!