TimeSeries Decomposition in Python with statsmodels and Pandas

by Paul Balzer on 3 January 2016


A lot of data is recorded in the time domain, which means each data point has the form

timestamp: value
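In Pandas, such data is typically held in a `Series` with a `DatetimeIndex`. The following is a minimal sketch with made-up occupancy values (not taken from the Dresden dataset); the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical occupancy readings (in percent) at 15-minute intervals
index = pd.date_range("2015-01-05 08:00", periods=4, freq="15min")
occupancy = pd.Series([12.0, 15.5, 21.0, 30.5], index=index, name="occupancy")

# Each entry is one "timestamp: value" pair
print(occupancy)
```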

A useful way to gain insight into such data is to decompose the time series. That usually means separating the data into three components:

  • seasonal
  • trend
  • residual

This well-known function from R (`decompose`) has been available in Python via statsmodels since version 0.6. Yeah! Let’s take a look at it using parking lot data from the city of Dresden.

The Data

The Open Data folks of Dresden (@offenesdresden) collected the parking lot occupancy of a shopping mall called ‘Centrum-Galerie’ in the city of Dresden for over a year. After my talk at PyData 2015, a guy from New York came up to me (thank you!) and said I should decompose the data first and then try to predict the occupancy of the parking lot from the decomposed time series. I tried, but the results were not as good as with my own approach (see talk video). Give it a try:

  Centrum-Galerie-Belegung.csv

Nevertheless, at least this blog post came out of it.

Pandas Time Series Decomposition with Python

After loading the .csv file with Pandas using

import pandas as pd
centrumGalerie = pd.read_csv('Centrum-Galerie-Belegung.csv',
 names=['Datum', 'Belegung'],
 index_col=['Datum'],
 parse_dates=True)
centrumGalerie.Belegung.plot()

Over a year of open data for the ‘Centrum Galerie’ parking lot in Dresden: 100% means you will never find a free space for your car there

we can simply decompose the data with statsmodels:

import statsmodels.api as sm

The `seasonal_decompose()` function needs a parameter called `freq` (renamed `period` in newer statsmodels releases). It could in principle be inferred from the Pandas time index, but that was not fully functional at the time of writing, so we have to specify it ourselves. The value is the number of samples in one interval that may repeat, such as an hour, a day, a week, or whatever period you are interested in. Our data is stored at 15-minute resolution and I want to see a weekly seasonality, so our `freq` is

\(\mathrm{decompfreq} = \frac{24\,\mathrm{h/day} \cdot 60\,\mathrm{min/h}}{15\,\mathrm{min}} \cdot 7\,\mathrm{days} = 672\)

The Python implementation is this:

decompfreq = 24*60//15*7  # 672 samples per week; // keeps the result an integer on Python 3
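The same number can also be derived from the two time spans directly, which avoids hand-written unit conversions. This is only a sketch; the names `sampling_interval` and `season_length` are introduced here for illustration and do not appear in the original notebook.

```python
import pandas as pd

sampling_interval = pd.Timedelta("15min")  # resolution of the recorded data
season_length = pd.Timedelta("7D")         # the seasonality we want to see

# Number of samples that make up one season (one week)
decompfreq = season_length // sampling_interval
print(decompfreq)
```

Changing `season_length` to `pd.Timedelta("1D")` would give the sample count for a daily seasonality instead.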

Now we can decompose the Pandas TimeSeries with statsmodels:

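# interpolate() fills missing readings so the series contains no NaN gaps
res = sm.tsa.seasonal_decompose(centrumGalerie.Belegung.interpolate(),
 period=int(decompfreq),  # this keyword was called `freq` in older statsmodels releases
 model='additive')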
resplot = res.plot()

The resulting decomposed time series looks like this:

Seasonal decomposition of the data: ‘Observed’ is the original data, ‘Seasonal’ is the pattern that repeats every `freq` samples, ‘Trend’ is the long-term movement, and ‘Residual’ is everything that is not described by seasonal + trend

We chose the `additive` model, so Trend + Seasonal + Residual should add back up to `Observed`.

Evaluation of the TimeSeries Decomposition

The most interesting component is the ‘Trend’, which clearly shows the impact of school holidays and Christmas in Germany. Apparently, a lot of people drove back into the city after 24 December to return or exchange their Christmas presents. One may ask what caused the huge increase in the trend at the end of April 2015. Well, let’s take a look at what happened next to the ‘Centrum-Galerie’, where a lot of parking spots were also located: the beginning of a huge construction site (sorry, the link is in German).

Here is the full IPython Notebook

5 Comments

  1. Danke schoen for the nice article. May I ask some questions?
    1. Why is there a large gap in the graphs at around Dec 2014? Maybe Xmas holidays?
    2. Does the gap not cause any problem in the process of decomposition?
    I am considering a problem of missing data in a data file. That is, there is no data during holidays, but it is not regular; some weeks have 5 days but others have 4 or 3. If the freq is simply set to 5, I expect the result will not be correct, and therefore some kind of data manipulation is required. But I don’t know how. If you know anything about this, any comment is very welcome. Thanks.
