Houses in sunset

In a previous post, I looked at data from the HPI Index, and discussed the impact of the Great Recession and subsequent government policies on the movement of house prices.

The HPI dataset doesn’t contain granular details on the differences between new and old homes, which makes it difficult to see what (if any) impact the introduction of Help to Buy had on flat prices in Outer London. In this post, I look at how to import the Price Paid dataset, which contains data for all transactions recorded by the Land Registry.

As well as providing raw transaction-level data, this dataset allows us to differentiate between new and old homes. Although the data doesn’t tell us which transactions were completed using Help to Buy, it is useful for gauging what impact (if any) the introduction of the scheme had on prices and volumes. The dataset from 2013 to 2019 contains 6.8 million transactions; because of its size, I imported the data into a database and used Python for the rest, a process I discuss at the end of the post.

There is currently a difference of about £150k between the average prices of new and old flats in Outer London, peaking at about £200k around 2018. Help to Buy appears to have contributed to this gap: after the scheme was introduced in 2013, the difference begins to creep up. The average prices of old flats track the LOESS line closely, whilst the prices of new flats are far more scattered. This is possibly related to the Mortgage Credit Directive, which caps what people are able to bid on old flats relative to their incomes.

There were three changes to interest rates in the period shown above: the first came after the Brexit vote, when rates were halved to 0.25%, followed by two 25 basis point increases in 2017 and 2018. The Brexit vote also prompted the Bank of England to engage in £435 billion of Quantitative Easing and to introduce the Term Funding Scheme. During the Conservative Party Conference in October 2017, Help to Buy, which was originally intended to finish in 2017, was extended with another £10 billion to 2027. We can clearly see the impact of this extension on new flat prices at the end of 2017.

This dataset is not ideally suited to examining the impact of Help to Buy, since purchases made using the scheme are not flagged in the data. However, as we can see price increases for new flats occurring at the same time as the introduction of the scheme, we can look at sales volumes to see whether the prices corresponded to demand.

| 2013–2016 | Average Monthly Volume | Percentage |
|-----------|-----------------------:|-----------:|
| New Flats | 975 | 17.1% |
| Old Flats | 4717 | 82.9% |
| Total | 5693 | |

| 2016–2019 | Average Monthly Volume | Percentage |
|-----------|-----------------------:|-----------:|
| New Flats | 1355 | 28.4% |
| Old Flats | 3409 | 71.6% |
| Total | 4765 | |

The tables above compare two three-year periods, omitting March 2016, when there was a rush to buy second homes ahead of changes to Stamp Duty. The share of monthly sales taken by new flats rose by 11.3 percentage points between these two periods, in a market for flats that contracted by 16%. The question that arises in my mind is whether it is mortgage affordability rules that are forcing more people to use Help to Buy, or QE money that is allowing house builders to charge more and wait for the buyer who is willing to pay a premium. Either way, this is working out a treat for the large builders.
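As a sanity check, the shares in the tables above can be recomputed from monthly volume data. This is a minimal sketch using toy numbers in place of the real query results; the `period_shares` helper is my own illustrative function, not part of the analysis code below:

```python
import pandas as pd

# Toy monthly volume data standing in for the real ol_flats DataFrame
ol_flats = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-01", "2015-06-01", "2017-06-01", "2018-06-01"]),
    "volume_new": [950, 1000, 1300, 1410],
    "volume_old": [4700, 4734, 3450, 3368],
})

# Exclude March 2016, which was distorted by the Stamp Duty rush
mask = ~((ol_flats["date"].dt.year == 2016) & (ol_flats["date"].dt.month == 3))
flats = ol_flats[mask]

def period_shares(df, start, end):
    """Average monthly volumes and new-build share for a date range."""
    period = df[(df["date"] >= start) & (df["date"] < end)]
    new, old = period["volume_new"].mean(), period["volume_old"].mean()
    return new, old, 100 * new / (new + old)

early = period_shares(flats, "2013-04-01", "2016-04-01")
late = period_shares(flats, "2016-04-01", "2019-04-01")
print(f"2013-2016: new share {early[2]:.1f}%")
print(f"2016-2019: new share {late[2]:.1f}%")
```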

A couple of caveats on the data: I am just starting to get my head around this dataset, and the numbers I have calculated do not exactly match those from the HPI data. Volumes tend to be off by about 100 each month, and average prices are slightly different. This could be down to a number of reasons, including:

  1. Dates for transactions are handled differently in the HPI data
  2. The geographic region for ‘Outer London’ in the HPI is different from the ‘Greater London’ defined in the Prices Paid data
  3. The handling of amendments to the data; I chose to ignore deletions for simplicity
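On the third point: one way to apply amendments is to treat delete records as retractions of earlier additions. A toy pandas sketch of this idea, assuming (as I read the record status field) that 'A' marks an addition and 'D' a deletion sharing the same transaction id:

```python
import pandas as pd

# Toy transactions standing in for import.sales; the status semantics
# ('A' = addition, 'D' = deletion of the same id) are an assumption
sales = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t2"],
    "price": [250000, 310000, 180000, 310000],
    "status": ["A", "A", "A", "D"],
})

# Collect the ids flagged for deletion, then keep additions not retracted
deleted = set(sales.loc[sales["status"] == "D", "id"])
clean = sales[(sales["status"] == "A") & ~sales["id"].isin(deleted)]
print(clean["id"].tolist())  # → ['t1', 't3']
```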

Despite these differences, I have found it a useful dataset for getting my head around the market. It really is amazing how much variance the monthly HPI dataset, which makes the headlines every month, actually hides when you break the data down.

Method for working with the Prices Paid Dataset

The Prices Paid dataset is available in separate files for each year from 2013 onwards (named pp-2013.csv onwards). I used the awk command in the terminal to combine the files for the different years into a single file that could be imported into the database in one go.

awk '{print}' pp-*.csv > 2013-2019_sales_data.csv

You can see the line count for the output file using the wc command:

wc -l 2013-2019_sales_data.csv

I then used the sed command to insert a header row based on the column descriptions for the dataset.

sed -i '1i "id","price","date","postcode","property_type","new_building","duration","paon","saon","street","locality","town","district","county","ppd_type","status"' 2013-2019_sales_data.csv
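If you prefer to skip the header step, pandas can also assign column names at read time. A sketch using an in-memory two-row sample in place of the real file (the sample rows are made up for illustration):

```python
import io
import pandas as pd

COLUMNS = [
    "id", "price", "date", "postcode", "property_type", "new_building",
    "duration", "paon", "saon", "street", "locality", "town", "district",
    "county", "ppd_type", "status",
]

# Two fake rows in the headerless layout of the raw pp-*.csv files
raw = io.StringIO(
    '"t1","250000","2013-05-01 00:00","AB1 2CD","F","N","L",'
    '"1","","HIGH STREET","","LONDON","BARNET","GREATER LONDON","A","A"\n'
    '"t2","480000","2017-08-01 00:00","EF3 4GH","F","Y","L",'
    '"2","","LOW STREET","","LONDON","BRENT","GREATER LONDON","A","A"\n'
)

sample = pd.read_csv(raw, header=None, names=COLUMNS)
print(sample[["id", "price", "county"]])
```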

I used a Postgres database to load the CSV file because I already had its Docker image, but an SQLite database would have been fine. I created a new database called houses, created a table, and then imported the file using these SQL commands in psql:

CREATE SCHEMA import;

CREATE TABLE import.sales(
    id TEXT PRIMARY KEY,
    price NUMERIC,
    date TIMESTAMP,
    postcode CHAR(8),
    property_type CHAR(1),
    new_building CHAR(1),
    duration CHAR(1),
    paon TEXT,
    saon TEXT,
    street TEXT,
    locality TEXT,
    town TEXT,
    district TEXT,
    county TEXT,
    ppd_type CHAR(1),
    status CHAR(1)
);

COPY import.sales FROM '2013-2019_sales_data.csv' WITH DELIMITER ',' HEADER CSV;

SELECT date, postcode, price FROM import.sales LIMIT 10;

I used Jupyter Lab to do the analysis in Python, using SQLAlchemy to connect to the Postgres database.

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

engine = create_engine("postgresql://[user]:[password]@[ip address of db]/houses")

Get the averages for flat prices from the database.

# Get the data for new flats
average_new_ol = pd.read_sql_query(
    """
    SELECT ROUND(AVG(price)::numeric,2) AS average_price,
	   COUNT(id) AS volume,
	   TO_CHAR(date,'MM') AS month,
	   TO_CHAR(date,'YYYY') AS year
    FROM import.sales
    WHERE county='GREATER LONDON'
    AND date < '2019-12-01'
    AND status = 'A'
    AND new_building = 'Y'
    AND property_type = 'F'
    GROUP BY month, year;
    """,
    engine,
)

# Convert the month and year columns into a single date
average_new_ol["date"] = pd.to_datetime(
    average_new_ol[["month", "year"]].assign(day=1)
)
average_new_ol.drop(columns=["month", "year"], inplace=True)


# Get the data for old flats
average_old_ol = pd.read_sql_query(
    """
    SELECT ROUND(AVG(price)::numeric,2) AS average_price,
	   COUNT(id) AS volume,
	   TO_CHAR(date,'MM') AS month,
	   TO_CHAR(date,'YYYY') AS year
    FROM import.sales
    WHERE county='GREATER LONDON'
    AND date < '2019-12-01'
    AND status = 'A'
    AND new_building = 'N'
    AND property_type = 'F'
    GROUP BY month, year;
    """,
    engine,
)

# Convert the month and year columns into a single date
average_old_ol["date"] = pd.to_datetime(
    average_old_ol[["month", "year"]].assign(day=1)
)
average_old_ol.drop(columns=["month", "year"], inplace=True)

# Merge the two datasets on the date column
ol_flats = pd.merge(
    average_new_ol, average_old_ol, on="date", suffixes=("_new", "_old")
)

Calculate the LOESS

# Sort the dataset by date
ol_flats = ol_flats.sort_values(by="date")

import statsmodels.api as sm

# Calculate the LOESS values for the new flats
lowess = sm.nonparametric.lowess
z = lowess(ol_flats["average_price_new"], ol_flats["date"], frac=1.0 / 3, it=0)
ol_flats["price_loess_new"] = pd.DataFrame(z)[1].values

# Do it again for the old flats
z = lowess(ol_flats["average_price_old"], ol_flats["date"], frac=1.0 / 3, it=0)
ol_flats["price_loess_old"] = pd.DataFrame(z)[1].values

ol_flats.head()
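The new-build premium discussed at the start of the post can be read directly off the merged frame. A minimal sketch, with toy prices standing in for the query results:

```python
import pandas as pd

# Toy merged frame standing in for ol_flats
ol_flats = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2019-06-01"]),
    "average_price_new": [480000.0, 450000.0],
    "average_price_old": [281000.0, 300000.0],
})

# Monthly premium paid for a new flat over an old one
ol_flats["new_premium"] = (
    ol_flats["average_price_new"] - ol_flats["average_price_old"]
)
print(ol_flats[["date", "new_premium"]])
```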

And a quick plot in Matplotlib

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig, ax = plt.subplots(figsize=(16, 9))

# New Flats
ax = sns.regplot(
    data=ol_flats,
    x="date",
    y="average_price_new",
    color="tab:orange",
    fit_reg=False,
    scatter_kws={"alpha": 0.5},
    label="New Flats",
)
ax = sns.lineplot(x="date", y="price_loess_new", color="tab:orange", data=ol_flats)

# Old Flats
ax = sns.regplot(
    data=ol_flats,
    x="date",
    y="average_price_old",
    color="tab:blue",
    fit_reg=False,
    scatter_kws={"alpha": 0.5},
    label="Old Flats",
)
ax = sns.lineplot(x="date", y="price_loess_old", color="tab:blue", data=ol_flats)

plt.suptitle("Average Selling Prices for Flats in Outer London", fontsize=20)
plt.legend(framealpha=1, borderpad=1)

And a stacked bar chart for the sales volumes

fig, ax = plt.subplots(figsize=(16, 9))

# New flats
ax = plt.bar(
    ol_flats.date,
    ol_flats.volume_new,
    color="tab:orange",
    width=22,
    label="New Flats"
)

# Old flats, stacked on top of the new flats
ax = plt.bar(
    ol_flats.date,
    ol_flats.volume_old,
    color="tab:blue",
    bottom=ol_flats.volume_new,
    width=22,
    label="Old Flats",
)

plt.suptitle("Flat Sales Volumes in Outer London", fontsize=20)
plt.legend(framealpha=1, borderpad=1)

Cover photo by Tom Thain on Unsplash