February 2, 2011
Sadly the Dublin bikes do not look like this.
Dublin bikes revisited
Same idea, more data and more cities
Quite a while ago I wrote a post about the Dublin bikes scheme. J.C.Decaux, who run the scheme, make real-time data about the location of the bikes throughout the city available and I thought it would be interesting to collect this data and carry out some simple analysis of it.
At the time when I wrote my previous post, October 2009, I only had 7 weeks of data (about the length of time the scheme had then been running). I was pleased to discover that it was possible to learn a bit about Dubliners’ bike habits even with this limited quantity of data. Nevertheless, I wanted to return to the idea when I had a bigger data-set.
It has taken a while but I have finally collected over a year of data for the Dublin bikes scheme as well as almost a year of data for similar bike schemes (also run by J.C.Decaux) in Brussels, Lyon, Paris and Seville.
Questioning the data
To my mind, there are all sorts of questions one might investigate using this bike data. Below I describe those that I looked into. I would be interested to hear about others’ work. (Incidentally, I have made my data available for download; see below.)
Bike usage by time of day
The fundamental datum that J.C.Decaux make available is the number of bikes docked at each station at a given point in time. Let us call this a snapshot of a given station at a moment in time. By adding snapshot data up across all stations at the same time we can deduce how many bikes are in use: the fewer bikes are docked, the more are out on the road. Taking the average over all days in the data-set and treating weekdays and weekends separately (for obvious reasons) we can generate charts of bike usage as a function of time of day.
In the charts below we plot the average number of bikes in use as a function of time of day minus the number that were in use at midnight.
Dublin bike usage by time of day (weekdays)
Dublin bike usage by time of day (weekends)
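For anyone who wants to reproduce charts like these, here is a minimal sketch of the calculation in Python 3 (the scripts later in this post are Python 2). It is not the code that produced the charts. It assumes each record has already been reduced to a (station ID, seconds since midnight, bikes docked) tuple and that weekday or weekend filtering has been done beforehand; the function name and the 5-minute bin width are illustrative choices. If we also assume the total fleet is constant, the change in bikes in use since midnight equals the fall in bikes docked since midnight, so the fleet size is never needed.

```python
from collections import defaultdict


def usage_relative_to_midnight(records, bin_seconds=300):
    """Average bikes in use per time-of-day bin, minus the value in the first bin.

    records: iterable of (station_id, seconds_since_midnight, bikes_docked),
    pooled over all the days of interest (e.g. weekdays only).
    """
    # (station, bin) -> [sum of docked counts, number of snapshots]
    totals = defaultdict(lambda: [0, 0])
    for station_id, ssm, docked in records:
        cell = totals[(station_id, int(ssm) // bin_seconds)]
        cell[0] += docked
        cell[1] += 1

    # Average per station within each bin, then add the stations together.
    # NB: assumes every station has at least one snapshot in every bin.
    docked_by_bin = defaultdict(float)
    for (_station_id, bin_index), (total, count) in totals.items():
        docked_by_bin[bin_index] += total / count

    if not docked_by_bin:
        return {}
    # Earliest bin present; this is midnight when bin 0 has data.
    baseline = docked_by_bin[min(docked_by_bin)]
    # Fewer bikes docked than at the baseline means more bikes in use.
    return {b: baseline - docked_by_bin[b] for b in sorted(docked_by_bin)}


# Toy input: two stations, two 5-minute bins (made-up numbers).
toy = [(1, 0, 10), (2, 10, 5), (1, 300, 8), (2, 310, 5), (1, 320, 6)]
print(usage_relative_to_midnight(toy))
```

Averaging per station before summing matters because the stations are scraped one after another, so a bin rarely contains the same number of snapshots for every station.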
The peaks in the morning and evening presumably represent people going to and coming from work. As such they give us a way to estimate the length of people’s working days (including the time getting to/from the bike stations). I made this observation in my previous post and it occurred to me that it might be interesting to use this as a method for comparing the length of the working day across different European cities. At the time, I did not have any data for other cities. Now however, that has all changed! Below are the corresponding (weekday) charts for Brussels, Lyon, Paris and Seville.
Brussels bike usage by time of day
Lyon bike usage by time of day
Paris bike usage by time of day
Seville bike usage by time of day
The first thing that strikes me about this data is that of the five cities, four of them have reasonably unsurprising profiles but one, Brussels, does not. Based solely on this quick plotting exercise, it looks like something may be wrong (or at least very different) with the Brussels bike scheme. In fact I was not completely surprised when I saw this because I had had some correspondence with an entrepreneurial Brussels bike user, Jonathan Van Parys, who mentioned he had concerns about the Brussels bike scheme. In his words:
Brussels’s physical geography leads to a rather inefficient allocation of the bikes as the day goes on (ie. stations at the top of hills are empty while those at the bottom are full), which can be a little frustrating
Furthermore he has set up an excellent website with useful data about the Brussels bike scheme with the intention of drawing attention to this important issue.
Returning to the original idea of comparing different European cities’ apparent/implied working hours, let us gather our data in a table:
| City | Morning peak (local time) | Evening peak (local time) | Time between peaks |
|---|---|---|---|
| Dublin | 08:50 | 17:45 | 8 hrs 55 mins |
| Lyon | 08:50 | 18:10 | 9 hrs 20 mins |
| Paris | 08:55 | 18:50 | 9 hrs 55 mins |
| Seville | 07:55 | 20:25 | 12 hrs 30 mins |
So what do we make of that!? Certainly the data makes no suggestion that our European counterparts are slacking off with a shorter working week. Indeed Dublin comes in bottom of the four cities. The peak times are much less well defined for the other cities than for Dublin (this is because there are far more bikes and stations) but even allowing for this, there is no way this data suggests we work any harder here in Dublin (indeed it is hard not to conclude the opposite!). Of course this is an extremely crude method and shouldn’t really be taken very seriously for a whole host of reasons but I still thought it would be fun to see what I might find.
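The "time between peaks" column in the table above is plain clock arithmetic on the two peak times. A small Python 3 sketch that reproduces it is given here; the helper names are made up for illustration, and the peak times themselves are simply copied from the table rather than detected from the data.

```python
def minutes_between(morning, evening):
    """Minutes from an 'HH:MM' morning time to an 'HH:MM' evening time, same day."""
    h1, m1 = map(int, morning.split(':'))
    h2, m2 = map(int, evening.split(':'))
    return (h2 * 60 + m2) - (h1 * 60 + m1)


def format_gap(minutes):
    """Render a number of minutes in the style used in the table."""
    return '%d hrs %d mins' % divmod(minutes, 60)


# Peak times as listed in the table.
peaks = {
    'Dublin': ('08:50', '17:45'),
    'Lyon': ('08:50', '18:10'),
    'Paris': ('08:55', '18:50'),
    'Seville': ('07:55', '20:25'),
}

for city, (morning, evening) in peaks.items():
    print(city, format_gap(minutes_between(morning, evening)))
```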
Two other brief points here are:
- The reason the graphs for the other cities, especially Paris and Brussels, are so noisy is that there are a great many more stations in these cities than there are in Dublin. As a result, and because it takes time to scrape the data from each station, I have a similar number of records per day for these cities as I do for Dublin but a much smaller number of snapshots of the entire city. (E.g., if we have 100 records between 10 stations we have 10 snapshots of the city but if we have 100 records between 50 stations, we only have 2 snapshots of the city). Fewer data-points means more noise.
- There appear to be two peaks around lunchtime for Lyon. I do not know why this is the case and would be interested if anyone has an idea. (Perhaps it is a seasonal effect and people try to avoid the hottest sun in the summer? I could look for this in the data but I haven’t bothered.)
Another obvious question I thought I’d ask of the data was how the Summer compared to the Winter and how Mondays compared to Fridays in terms of bike usage.
Regarding Summer vs. Winter, the situation is mildly interesting. The peaks in our usage by time-of-day graph occur at exactly the same times and the morning peak reaches the same value in both seasons. However usage peaks at a significantly smaller value later in the day for Winter than for Summer. My guess is that people prefer using the bikes in Summer but not so much that they are willing to be late for work! (So morning usage is unaffected by the weather but not so for afternoon/evening.)
Dublin bike usage by time of day (weekdays): Winter vs. Summer
For Monday vs. Friday, there is no significant difference. I had wondered if it might be possible to observe a late arrival/early departure time effect but the data does not even hint at its presence.
Dublin bike usage by time of day (weekdays): Monday vs. Friday
Bike usage by station
Yet another question one can ask is how the different stations compare to each other in terms of busyness. This is impossible to measure perfectly using the snapshot-type data that I have because many dockings/undockings can take place between two snapshots and it is impossible to know how many. Nevertheless we can count how many times each station’s snapshot changes throughout the day to get an approximate idea of station busyness. This is a biased measure (e.g., it would be biased against stations that have intense but sparse periods of busyness and biased in favour of stations with a more even spread of busyness) but there should still be some value in its results. Here is what I found:
Busyness by station (count of snapshot changes versus busyness rank)
Evidently there is quite a significant spread. I would expect the true chart to have approximately the same ordering but a much greater range of values. Even based on this data, it is worth noting that Hardwicke Street and Parnell Square North stand out as extremely underused stations relative to the others. These two are quite close to each other, so it might be interesting to plot station usage on a map to see this data in terms of the city geography. Below is the underlying data for the chart including the dictionary which reveals stations by busyness rank.
| Busyness rank | Station name | Busyness |
|---|---|---|
| 39 | Parnell Square North | 13651 |
| 35 | Fitzwilliam Square West | 19516 |
| 32 | Ormond Quay Upper | 20794 |
| 29 | St. Stephen’s Green East | 22104 |
| 27 | Leinster Street South | 23997 |
| 26 | Merrion Square West | 24129 |
| 24 | Cathal Brugha Street | 24824 |
| 22 | Mountjoy Square West | 26692 |
| 20 | James Street East | 26879 |
| 15 | Merrion Square East | 29301 |
| 14 | St. Stephen’s Green South | 29408 |
| 13 | Fownes Street Upper | 29743 |
| 7 | Princes Street / O’Connell Street | 34765 |
| 1 | Custom House Quay | 40035 |
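The busyness measure used for this section (counting how often a station's snapshot differs from its previous snapshot) can be sketched in a few lines of Python 3. This is an illustration, not the script that produced the figures above; it assumes the records arrive in time order as (station ID, timestamp, bikes docked) tuples, and the function name is made up.

```python
from collections import Counter


def count_snapshot_changes(records):
    """Count, per station, how often the docked count differs from the previous snapshot.

    records: iterable of (station_id, timestamp, bikes_docked) in time order.
    """
    last_docked = {}
    changes = Counter()
    for station_id, _timestamp, docked in records:
        changes.setdefault(station_id, 0)  # so idle stations still appear
        previous = last_docked.get(station_id)
        if previous is not None and previous != docked:
            changes[station_id] += 1
        last_docked[station_id] = docked
    return changes


# Toy input (made-up numbers): station 1 changes twice, station 2 never.
toy = [(1, 0, 5), (2, 1, 7), (1, 60, 3), (2, 61, 7), (1, 120, 3), (1, 180, 4)]
busyness = count_snapshot_changes(toy)
print(busyness.most_common())  # busiest station first
```

The toy data also shows why the measure undercounts: station 1 going from 5 bikes to 3 between snapshots counts as one change even though at least two bikes were taken.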
Finally I thought it might be worthwhile looking at weather effects directly (not just through the seasons). I found that Met Eireann make historical weather station data available and scraped a little over a year of it (available for download below). I tested bike usage against weather station data from Dublin Airport (obviously I would prefer more central weather station data, like Merrion Square, but I could not get it). The data I tested against was:
- Rainfall (in mm)
- Sunshine (in hours)
- Minimum day’s temperature (in degrees Celsius)
- Maximum day’s temperature (in degrees Celsius)
I had one data-point for each of the above measurements per day. I calculated a day’s busyness in terms of bike usage by adding up the number of changes to a station’s snapshots over the whole day, across all stations. I then scatter-plotted this busyness against the corresponding day’s weather figure for each day.
At first I expected the rainfall data to have the strongest predictive power, however this turned out not to be the case. Thinking about it, I should not have been surprised. Firstly, I only have one point in the scatter plot per day and only just over a year of data, so only about 400 data-points. While it might seem surprising to those of us who live in Dublin, it did not actually rain at all on the vast majority of days so most of the data-set is wasted on dry days. Secondly, while I believe rain is a very strong disincentive for people to cycle it usually only rains for relatively brief periods so we would probably need rainfall data on a much finer timescale like every 5 or 10 minutes to see the effect clearly. Finally, we have rainfall data for Dublin Airport, not Dublin City Centre. In any case, below are the charts.
Dublin rainfall (mm) as a function of time
note how similar Summer and Winter are!
Busyness vs. rainfall (mm) (no apparent relationship)
After giving up on rainfall, I did however find that there appeared to be a relationship between maximum day’s temperature and bike usage. This makes sense, and it is easy to believe that such a weather effect would not be as susceptible to the same problems as the rainfall effect since temperature is much less localised than rain and persists for longer. While the scatter plot is noisy, it does look like there is a plausible positive relationship. (Btw if you’re an experimental physicist, the relationship probably looks weak; to a quant in finance it looks strong!) Here are the charts:
Dublin maximum temperature as a function of time
note the maximum is below 0 in January 2010!
Busyness vs. max temp. (plausible positive relationship)
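The relationships above were judged by eye from the scatter plots. One way to put a number on such a relationship, which was not done in this analysis and is offered purely as an assumption about how one might proceed, is the Pearson correlation coefficient between daily busyness and the day's weather figure. A self-contained Python 3 sketch follows; the toy values are made up for illustration and are not taken from the data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length >= 2')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative values: one (max temperature, busyness) pair per day.
toy_max_temp = [2.0, 8.0, 14.0, 20.0]
toy_busyness = [900, 1100, 1000, 1300]
print(round(pearson_r(toy_max_temp, toy_busyness), 3))
```

A coefficient near 0 would match the "no apparent relationship" verdict for rainfall, while a clearly positive value would support the temperature observation, though with only about 400 noisy points it should be read cautiously.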
There is a bike-sharing blog that is worth a visit if you are at all interested in this sort of thing if only to see its world map of bike schemes around the world. I had not realised they were so widespread till I saw this.
Although I like the Dublin bike scheme, I can’t help wondering if it is good value for money. It may well be, but I can’t help wondering. It seems to me that there is a potential conflict of interest for the city councilors who make the decisions (the scheme is popular so they could be biased in favour of it, even at a bad price) or at least that the best decisions may not be made because of a possible asymmetric perception of opportunity cost relative to upfront cost by either the councilors or the public. A significant part of the payment for the scheme is in advertising revenue forgone by providing J.C.Decaux with free hoardings throughout the city. I would be interested to see the pricing detail. I have been told by those in the know that the keyword when buying such advertising space is the rate card and it seems that this website has some figures for that. I have not quite found the time to look into this in detail myself but would be interested to hear others’ thoughts. I’m sure that a few FOI requests could turn up some interesting figures.
For my own part, I feel like I have spent more than enough time analysing bike schemes for now (indeed I had to force myself to write this post). However Dublin Bus are still promising that they will be providing GPS data on the location of their buses any day (they missed their own 2010 deadline). They claim to have a pilot scheme running on the 123 route and I did manage to find one stop with a so-called real time passenger information (RTPI) display on this route, but it was blank. However in the last few weeks I have noticed the posts for several such displays appearing beside stops on Nassau Street so I expect this data may be on the way. I look forward to gathering and analysing it to see if I can detect any correlation at all between timetables and the behaviour of the buses.
A post-script on technical details
Based on some of the queries I have received in relation to my other post on this topic, I thought it might be useful if I included a little bit of technical information about how/where to scrape the data. In that post, I give the URLs to visit in order to get the data but I did not supply the simple python script to actually scrape the data. I have received quite a few requests for this so here it is (in all its hacky glory!):
    # NB: Python 2 script (urllib.urlretrieve, map() returning a list).
    import urllib, time, csv, sys, datetime, gzip
    from xml.dom import minidom

    URL_station_list = 'https://abo-%s.cyclocity.fr/service/carto'
    URL_data = 'https://abo-%s.cyclocity.fr/service/stationdetails/%d'

    def prev_day_s(d_s):
        # 'YYYYMMDD' string for the day before d_s.
        d = datetime.date(int(d_s[:4]), int(d_s[4:6]), int(d_s[6:8]))
        d -= datetime.timedelta(days=1)
        return '%04d%02d%02d' % (d.year, d.month, d.day)

    def yyyymmdd_from_epoch(t):
        tm = time.localtime(t)
        return '%04d%02d%02d' % (tm.tm_year, tm.tm_mon, tm.tm_mday)

    def get_stations(city, date, get_file = True):
        # Download (or re-read) the station list and return the sorted station numbers.
        fname = '%s/%s.stations.xml' % (city, date)
        if get_file:
            urllib.urlretrieve(URL_station_list % city, fname)
        stations = map(lambda x: int(x.getAttribute('number')), minidom.parse(fname).getElementsByTagName('marker'))
        assert len(stations) > 0 # Horrible hack to quickfix case when web serves up incorrect stations data.
        return sorted(stations)

    if len(sys.argv) != 5:
        sys.stderr.write('Usage: python %s <city> <run time> <main loop delay> <request delay>\n' % sys.argv[0])
        sys.exit(1)

    city = sys.argv[1]
    (run_time, main_delay, req_delay) = map(float, sys.argv[2:])

    start_epoch = time.time()
    date = yyyymmdd_from_epoch(start_epoch)

    # Fall back to yesterday's saved station list if today's cannot be fetched.
    try:
        stations = get_stations(city, date)
    except:
        stations = get_stations(city, prev_day_s(date), False)

    out = csv.writer(gzip.open('%s/%s.out.csv.gz' % (city, date), 'a'))
    err = csv.writer(gzip.open('%s/%s.err.csv.gz' % (city, date), 'a'))

    # Poll every station in turn until the run time is used up.
    t = time.time()
    while t < start_epoch + run_time:
        for i in stations:
            t = time.time()
            try:
                out.writerow([i, t] + map(str.strip, urllib.urlopen(URL_data % (city, i))))
            except:
                err.writerow([i, t, sys.exc_info()[0], sys.exc_info()[1]])
            time.sleep(req_delay)
        time.sleep(main_delay)
I also thought it would be worth recording the summary statistics for my various data-sets. So here are a few that seem relevant:
| City | Start date | End date | Number of records | Link to data |
|---|---|---|---|---|
| Brussels | 7-Dec-2009 | 8-Nov-2010 | 33,741,610 | bruxelles.csv.bz2 (121 MB) |
| Dublin | 19-Sep-2009 | 8-Nov-2010 | 30,909,771 | dublin.csv.bz2 (104 MB) |
| Lyon | 7-Dec-2009 | 8-Nov-2010 | 34,236,139 | lyon.csv.bz2 (147 MB) |
| Seville | 7-Dec-2009 | 8-Nov-2010 | 33,365,277 | seville.csv.bz2 (124 MB) |
| Paris | 7-Dec-2009 | 8-Nov-2010 | 40,232,297 | paris.csv.bz2 (189 MB) |
A record consists of a snapshot of a station and as such consists of a station ID, a date/time-stamp and a number which is the number of bikes docked at that station at that date/time. (It would be extremely interesting to get hold of J.C.Decaux’s data in which they can actually track individual bikes.)
One last point which might be worth making (if only to document it for my own sake) is that I ended up working with these moderately large data-sets in a raw csv format but that I experimented with using sqlite to manage them. A simple python script like this:
    # NB: Python 2 script. utils is a small helper module of my own (not shown here).
    import sqlite3
    import sys
    import datetime
    import csv
    import gzip
    from xml.dom import minidom
    from xml.parsers.expat import ExpatError
    from utils import yyyymmdd_from_date, date_from_yyyymmdd, ssm_from_datetime

    path = '../raw_data/v1/%s/%s.out.csv.gz'

    city, startdate, enddate = sys.argv[1:4]
    startdate = date_from_yyyymmdd(startdate)
    enddate = date_from_yyyymmdd(enddate)

    db = sqlite3.connect('%s.db' % city)
    curs = db.cursor()
    curs.execute('create table station_snaps (station_id integer, date text, ssm integer, available integer)')

    # Load one day's scraped file at a time.
    d = startdate
    while d <= enddate:
        db_rows = []
        try:
            for row in csv.reader(gzip.open(path % (city, yyyymmdd_from_date(d)))):
                try:
                    station_id = int(row[0])
                    timestamp = float(row[1])
                    dt = datetime.datetime.fromtimestamp(timestamp) # NB: This handles DST properly.
                    xml_data = '\n'.join(row[2:])
                    fld = minidom.parseString(xml_data).getElementsByTagName('available')[0].firstChild
                    if fld is not None:
                        db_rows.append((station_id, yyyymmdd_from_date(dt), round(ssm_from_datetime(dt)), int(fld.data)))
                except (ValueError, IndexError, ExpatError):
                    pass # Skip malformed records.
        except (IOError, csv.Error):
            pass # Skip missing or unreadable files.
        if len(db_rows) > 1:
            curs.executemany('insert into station_snaps values (?, ?, ?, ?)', db_rows)
            db.commit()
        print d
        d += datetime.timedelta(days=1)
    db.close()

turns the scraped xml files into a nice sqlite3 file and then we can calculate many of our statistics very easily, in principle, using SQL. For example:
    select ssm_300, sum(avg_available)
    from (select station_id, ssm/300 ssm_300, avg(available) avg_available
          from station_snaps
          group by station_id, ssm_300)
    group by ssm_300
    order by ssm_300
Unfortunately however this was just too slow (at least on my laptop) so I had to do things by hand. This is to be expected given that the database cannot use the natural time-ordering of the records which I could in my scripts.
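To see what the query above computes without building the full database, it can be run against a tiny in-memory SQLite table. The following Python 3 sketch does this; the table definition and the query are the ones given above, while the function name and the toy rows are made up for illustration. It assumes the ssm column holds seconds since midnight, so that ssm/300 (integer division in SQLite) is a 5-minute bin index.

```python
import sqlite3

QUERY = """
select ssm_300, sum(avg_available)
from (select station_id, ssm/300 ssm_300, avg(available) avg_available
      from station_snaps
      group by station_id, ssm_300)
group by ssm_300
order by ssm_300
"""


def docked_by_bin(rows):
    """Load (station_id, date, ssm, available) rows into memory and run QUERY."""
    db = sqlite3.connect(':memory:')
    db.execute('create table station_snaps '
               '(station_id integer, date text, ssm integer, available integer)')
    db.executemany('insert into station_snaps values (?, ?, ?, ?)', rows)
    result = db.execute(QUERY).fetchall()
    db.close()
    return result


# Toy rows (made-up numbers): two stations, two 5-minute bins, two days.
toy_rows = [
    (1, '20100101', 0, 10),
    (2, '20100101', 10, 5),
    (1, '20100101', 300, 8),
    (2, '20100101', 310, 5),
    (1, '20100102', 320, 6),
]
print(docked_by_bin(toy_rows))
```

Each output row is a bin index together with the average number of docked bikes summed over stations, which is the quantity behind the time-of-day charts before the midnight value is subtracted.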