2017-05-02

Contents

  • Introductory comments
  • An example: the propensity to cycle tool
  • dplyr and tibbles (if time allows)
  • Discussion: can GDS save the world?

Introduction

What is saving the world?

Many ways of saying the same thing:

  • 'Policy-led research'
  • 'Impact'
  • 'Socially beneficial research'
  • Don't be evil (Google)

My definition: building an evidence-base for sustainable systems.

  • In the context of climate change, that means:
  • Building an evidence-base to transition away from fossil fuels
  • But it could also be interpreted in terms of other (quantifiable) social/economic/environmental indicators

Why bother?

Aren't the machines 'on it' anyway?

  • Agency is no longer an exclusively human domain
  • Machine learning "is finding commercial applications that range from self-driving cars to websites that recommend products on the basis of a user's browsing history" (Castelvecchi 2016).

  • Where does the (data) scientist come in, in this context?

  • More important than ever to set high level aims/goals
  • Autonomous vehicle analogy: robots are good at getting from A to B but not deciding where and when to travel

What is Geographic Data Science?

  • You tell me!
  • How does it differ from good old 'GIS'?
  • What does the science in the title mean?
  • Why the focus on data rather than information?

Code example:

library(tibble) # provides frame_data() (tribble() is its newer name)
d = frame_data(
  ~Attribute, ~GIS, ~GDS,
  "Home disciplines", "Geography", "Geography, Computing, Statistics",
  "Software focus", "Graphic User Interface", "Code",
  "Reproducibility", "Minimal", "Maximal"
)

Comparing GDS with GIS

knitr::kable(d)

Attribute        | GIS                    | GDS
-----------------|------------------------|---------------------------------
Home disciplines | Geography              | Geography, Computing, Statistics
Software focus   | Graphic User Interface | Code
Reproducibility  | Minimal                | Maximal

Geographic data science CAN 'save the world'

But only if it's open and scientific

Reasoning:

  • Evidence inevitably gets skewed by political aims
  • If the people doing the research are influenced by dominant political forces, findings will be biased for political gain (solved by independent, well-funded public research).
  • People doing policy relevant research watch out (regarding politicians):

“Their very spirit undergoes a pervasive transformation,” and they finally end up as “experts at exchanging smiles, handshakes, and favors.” (Reclus 2013, original: 1898)

Importance of open data and methods

  • If the data underlying policy is hidden, it can be represented to push certain aims (solved by open data)
  • If the data is 'open' but the tools are closed, results remain open to political influence
  • Which brings us onto our next topic…

Where will cycling uptake happen?

Prior work (source: Lovelace et al. 2017)

Tool                     | Scale    | Coverage        | Public access | Format of output | Levels of analysis | Software licence
-------------------------|----------|-----------------|---------------|------------------|--------------------|-----------------
Propensity to Cycle Tool | National | England         | Yes           | Online map       | A, OD, R, RN       | Open source
Prioritization Index     | City     | Montreal        | No            | GIS-based        | P, A, R            | Proprietary
PAT                      | Local    | Parts of Dublin | No            | GIS-based        | A, OD, R           | Proprietary
Usage intensity index    | City     | Belo Horizonte  | No            | GIS-based        | A, OD, R, I        | Proprietary
Bicycle share model      | National | England, Wales  | No            | Static           | A, R               | Unknown
Cycling Potential Tool   | City     | London          | No            | Static           | A, I               | Unknown
Santa Monica model       | City     | Santa Monica    | No            | Static           | P, OD, A           | Unknown

The PCT team

"If you want to go far, go as a team"

  • Robin Lovelace (Lead Developer, University of Leeds)

  • James Woodcock (Principal Investigator, Cambridge University)
  • Anna Goodman (Lead Data Analyst, LSHTM)
  • Rachel Aldred (Lead Policy and Practice, Westminster University)
  • Ali Abbas (User Interface, University of Cambridge)
  • Alvaro Ullrich (Data Management, University of Cambridge)
  • Nikolai Berkoff (System Architecture, Independent Developer)
  • Malcolm Morgan (GIS and infrastructure expert, UoL)

The PCT in CWIS and LCWIP

Included in the Cycling and Walking Investment Strategy (CWIS) and the Local Cycling and Walking Infrastructure Plan (LCWIP)

How the PCT works

Shows on the map where there is high cycling potential, under 4 scenarios of change:

  • Government Target
  • Gender Equality
  • Go Dutch
  • Ebikes

Scenario shift in desire lines

Source: Lovelace et al. (2017)

  • Origin-destination data shows 'desire lines'
  • How will these shift with cycling uptake? (a minimal code sketch follows this list)
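
A minimal sketch of how origin-destination data can be turned into desire lines with stplanr, assuming the example datasets bundled with the package (flow, a table of OD counts, and cents, the zone centroids):

library(stplanr)
# od2line() joins each origin-destination pair with a straight line
desire_lines = od2line(flow = flow, zones = cents)
plot(desire_lines) # the resulting 'desire lines'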

Scenario shift in network load

A live demo for Liverpool

"Actions speak louder than words"

tibbles and dplyr: A detour for programmers

Why data carpentry?

  • If you 'hack' or 'munge' data, it won't scale
  • So ultimately it's about being able to handle Big Data
  • We'll cover the basics of data frames and tibbles
  • And the basics of dplyr, an excellent package for data carpentry
    • dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d
##   x y
## 1 1 A
## 2 2 B
## 3 3 C

In reality this is a list, meaning functions work on each column:

summary(d)
##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0
plot(d)

Subsetting

In base R, there are many ways to subset:

d[1,] # the first line
##   x y
## 1 1 A
d[,1] # the first column
## [1] 1 2 3
d$x # the first column
## [1] 1 2 3
d[1] # the first column, as a data frame
##   x
## 1 1
## 2 2
## 3 3

The tibble

Recently the data frame has been extended:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt
## # A tibble: 3 × 2
##       x     y
##   <int> <chr>
## 1     1     A
## 2     2     B
## 3     3     C

Advantages of the tibble

It comes down to efficiency and usability

  • When printed, a tibble reports the class of each column
  • Character vectors are not coerced into factors (demonstrated below)
  • When printing a tibble to screen, only the first ten rows are displayed
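
A minimal contrast of the coercion and printing behaviour (stringsAsFactors is set explicitly here because the data.frame() default changed in R 4.0):

library(tibble)
df = data.frame(x = c("A", "B"), stringsAsFactors = TRUE)
class(df$x) # "factor": the historic data.frame behaviour
tb = tibble(x = c("A", "B"))
class(tb$x) # "character": tibbles never coerce
tb          # the print method also reports each column's class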

dplyr

Like tibbles, dplyr has advantages over historic ways of doing things

  • Type stability (data frame in, data frame out; a small contrast follows the example below)
  • Consistent functions: dedicated verbs rather than [ doing everything
  • Piping makes complex operations easy
library(dplyr) # dplyr must be loaded first
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))
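
To illustrate the type-stability point, a small contrast using the d and dt objects defined on earlier slides:

library(dplyr)
class(d[, 1])        # base subsetting drops to an atomic vector: "integer"
class(select(dt, x)) # dplyr returns a tibble: "tbl_df" "tbl" "data.frame"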

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

vs

top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)

Subsetting with dplyr

Only 1 way to do it, making life simpler:

select(dt, x) # select columns
## # A tibble: 3 × 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
slice(dt, 2) # 'slice' rows
## # A tibble: 1 × 2
##       x     y
##   <int> <chr>
## 1     2     B

How we've used this in the PCT

Worked example: PCT data in West Yorkshire

  • We'll download and visualise some transport data
u_pct = "https://github.com/npct/pct-data/raw/master/west-yorkshire/l.Rds"
if(!file.exists("l.Rds"))
  download.file(u_pct, "l.Rds")
library(stplanr)
## Loading required package: sp
l = readRDS("l.Rds")
plot(l)

Analysing where people walk

sel_walk = l$foot > 9
l_walk = l[sel_walk,]
plot(l)
plot(l_walk, add = T, col = "red")

library(dplyr) # for next slide...

Doing it with sf (!)

l_walk1 = l %>% filter(All > 10) # fails: filter() does not work on sp objects
library(sf)
## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302
l_sf = st_as_sf(l)
plot(l_sf[6])

Subsetting with sf

much easier

l_walk2 = l_sf %>% 
  filter(foot > 9)
plot(l_sf[6])
plot(l_walk2, add = T)

Subsetting with sf

results

A more advanced example

l_sf$distsf = as.numeric(st_length(l_sf))
l_drive_short2 = l_sf %>% 
  filter(distsf < 1000) %>% 
  filter(car_driver > foot)

Result: where people drive short distances rather than walk

library(tmap)
tmap_mode("view")
## tmap mode set to interactive viewing
qtm(l_drive_short2)

Discussion: ensuring research is used for the greater good

Points of discussion

It is clear that geographical research can have large policy impacts.

  • Researchers can act to maximise the social benefit of their research
  • That involves getting the evidence out to as many people as possible
  • And using open source, accessible tools - the 'science' in GDS?

But many questions remain:

  • Where to draw the line between impartial research and campaigning?
  • To what extent should researchers who open-source their work defend it against commercial exploitation?

Final question

  • What can you do to maximise the social benefits arising from your work?
  • Thanks for listening - get in touch via r.lovelace@leeds.ac.uk or @robinlovelace

References

Castelvecchi, Davide. 2016. “Can We Open the Black Box of AI?” Nature News 538 (7623): 20. doi:10.1038/538020a.

Lovelace, Robin, Anna Goodman, Rachel Aldred, Nikolai Berkoff, Ali Abbas, and James Woodcock. 2017. “The Propensity to Cycle Tool: An Open Source Online System for Sustainable Transport Planning.” Journal of Transport and Land Use 10 (1). doi:10.5198/jtlu.2016.862.

Reclus, Elisée. 2013. Anarchy, Geography, Modernity: Selected Writings of Elisée Reclus. Edited by John Clark and Camille Martin. Oakland, CA: PM Press.