class: center, middle, inverse, title-slide # Transport Data Science: from regional to street levels ## 🚀
Emerging discipline and community of practice? ### Robin Lovelace ### University of Leeds,
Institute for Transport Studies
### University of Pennsylvania, 2020-04-06 (updated: 2020-04-06) Reproducible source code:
github.com/Robinlovelace/presentations
--- background-image: url(https://d3l4am9dimtbet.cloudfront.net/wp-content/uploads/2018/07/In-Pics_-These-Mind-Blowing-Aerial-Shots-of-Mumbai-Were-Taken-by-a-Drone.jpg) <!-- 14h30-15h30 lecture part (this could already drift towards the practical part but doesn't have to ...) --> <!-- 15:30-16h00 coffee break --> <!-- 16h00-17h00 hands-on workshop (could be somewhat open-ended, we'll have the room all afternoon - but if we would go into extra time, we should make a short 'cut' around 17h such that those who have/want to go home can do so). --> # Introduction Source: [thebetterindia.com](https://www.thebetterindia.com/148450/mumbai-aerial-shots/) <!-- Mumbai is by [some](https://www.independent.co.uk/travel/worlds-biggest-cities-mexico-city-new-york-karachi-tokyo-lagos-kolkata-kinshasa-dhaka-delhi-a8158426.html) measures the world's largest city --> -- Abstract: Data Science has emerged as an area of high and consistent growth in many sectors. High tech industries such as search engine optimisation, marketing and retail analytics have been quick to adopt new workflows. Transport has arguably been slow to adapt to the transformations towards open source software (as opposed to proprietary products that still dominate), command line and scriptable interfaces (unlike the graphical user interfaces of tools such as Excel) and code sharing and version development via platforms such as GitHub. This talk demonstrates these new workflows, building on the [Transport chapter in the open source book Geocomputation with R](https://geocompr.robinlovelace.net/transport.html) and experience developing and deploying data science tools that are having a real world impact on transport policy and practice. It will provide insight into how multiple geographic scales of analysis were used to develop the [Propensity to Cycle Tool](http://www.pct.bike/), which Robin developed in collaboration with colleagues from 4 universities and which is being used by dozens of Local Authorities to develop strategic cycle networks. Furthermore, a live demo of the recently release [stats19](https://docs.ropensci.org/stats19/) package, which provides fast access to crash data for road safety research, will highlight the power of reproducibility and that transport data science is a practical field best discovered by doing it and collaboration using reproducible code. ```r # to reproduce these slides do: pkgs = c("rgdal", "sf", "geojsonsf") install.packages(pkgs) ``` -- --- ## whoami ```r system("whoami") ``` -- .pull-left[ - Environmental geographer - Learned R for PhD on energy and transport - Now work at the University of Leeds (ITS and LIDA) - Focus: Applied geocomputation - Strong interest in technology + reproducibility, e.g.: ```r devtools::install_github("r-rust/gifski") system("youtube-dl https://youtu.be/CzxeJlgePV4 -o v.mp4") system("ffmpeg -i v.mp4 -t 00:00:03 -c copy out.mp4") system("ffmpeg -i out.mp4 frame%04d.png ") f = list.files(pattern = "frame") gifski::gifski(f, gif_file = "g.gif", width = 200, height = 200) ``` ] -- .pull-right[ Image credit: Jeroen Ooms + others ```r knitr::include_graphics("https://user-images.githubusercontent.com/1825120/39661313-534efd66-5047-11e8-8d99-a5597fe160ff.gif") ``` <img src="https://user-images.githubusercontent.com/1825120/39661313-534efd66-5047-11e8-8d99-a5597fe160ff.gif" width="100%" /> ] --- ## Importance of Geographic data .pull-left[ ### Geographic data is everywhere ### Underlies some of society's biggest issues ### Can give general analyses local (actionable) meaning ] -- .pull-right[ ![](https://raw.githubusercontent.com/npct/pct-team/master/figures/Leeds-network.png)<!-- --> Example: Propensity to Cycle Tool (PCT) in Manchester: http://pct.bike/m/?r=greater-manchester ] --- # Classifying transport data Source (check it): https://www.openstreetmap.org/edit#map=11/53.8137/-1.5240 ![](https://user-images.githubusercontent.com/1825120/78499529-195c0380-7749-11ea-8f7a-43dfd479039e.png)<!-- --> --- ## Information and data pyramids Data science is climbing the DIKW pyramid <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/DIKW_Pyramid.svg/220px-DIKW_Pyramid.svg.png" width="70%" /> -- Data science *can* convert data to information and graphics but *cannot*, on its own, create knowledge, let alone wisdom --- ## A geographic availability pyramid .pull-left[ - Recommendations - Build this here! - City-specific datasets - Bristol cycle count data - Hard-to-access national data - Open international/national datasets - Open origin-destination data from UK Census - Globally available, low-grade data (bottom) - OpenStreetMap, Elevation data ] .pull-right[ ![](https://user-images.githubusercontent.com/1825120/78549175-dc037e80-77f9-11ea-999e-e526632a1d53.png) ] --- ## An ease-of access pyramid - Data provision packages - Use the pct package - stats19 package - Pre-processed data - E.g. downloading data from website www.pct.bike - Messy official data - Raw STATS19 data --- ## A geographic level of detail pyramid - Agents - Route networks - Nodes - Routes - Desire lines - Transport zones --- ## Observations - Official sources are often smaller in sizes but higher in Quality - Unofficial sources provide higher volumes but tend to be noisy, e.g. study on Twitter vs Mobile vs in-person survey (Lovelace et al. [2016](https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12081)) - Another way to classify data is by quality: signal/noise ratios - Globally available datasets would be at the bottom of this pyramid; local surveys at the top. Source: https://geocompr.robinlovelace.net/read-write.html - Which would be best to inform policy? --- ## Portals - UK geoportal, providing geographic data at many levels: https://geoportal.statistics.gov.uk - Other national geoportals exist, such as this: http://www.geoportal.org/ - A good source of cleaned origin destination data is the Region downloads tab in the Propensity to Cycle Tool - see the Region data tab for West Yorkshire here, for example: http://www.pct.bike/m/?r=west-yorkshire - OpenStreetMap is an excellent source of geographic data with global coverage. You can download data on specific queries (e.g. highway=cycleway) from the overpass-turbo service: https://overpass-turbo.eu/ or with the **osmdata** package --- ## Online lists For other datasets, search online! Good starting points in your research may be: - The open data section in Geocomputation with R - https://geocompr.robinlovelace.net/read-write.html#retrieving-data - Transport datasets mentioned here: https://data.world/datasets/transportation - UK government transport data: https://ckan.publishing.service.gov.uk/publisher/department-for-transport --- ## Data packages - The **openrouteservice** github package provides routing data - The stats19 package can get road crash data for anywhere in Great Britain (Lovelace et al. 2019) see here for info: https://itsleeds.github.io/stats19/ - The pct package provides access to data in the PCT: https://github.com/ITSLeeds/pct - There are many other R packages to help access data --- ## Example: osmdata .pull-left[ ``` r library(osmdata) library(sf) system.time({ # around 2 seconds n = "louvain-la-neuve" v = "primary|secondary|cycleway" louvain = opq(n) %>% add_osm_feature("highway", v, value_exact = F) %>% osmdata_sf() }) #> user system elapsed #> 0.125 0.020 1.814 louvain_highway = louvain$osm_lines plot(louvain_highway) ``` ] -- .pull-right[ ![](https://i.imgur.com/09yWU7V.png)<!-- --> ] --- ## Example: geofabrik ``` r # remotes::install_github("itsleeds/geofabrik") library(geofabrik) lvn_centroid = tmaptools::geocode_OSM("louvain-la-neuve", as.sf = T) system.time({ # belgium = get_geofabrik(lvn_centroid) # warning: downloads giant file belgium = get_geofabrik("Beligium") # equivalent code }) #> although coordinates are longitude/latitude, st_contains assumes that they are planar #> The place is within these geofabrik zones: Europe, Belgium #> Selecting the smallest: Belgium #> Downloading http://download.geofabrik.de/europe/belgium-latest.osm.pbf to #> ~/h/data/osm/Belgium.osm.pbf #> Old attributes: attributes=name,highway,waterway,aerialway,barrier,man_made #> New attributes: attributes=name,highway,waterway,aerialway,barrier,man_made,maxspeed,oneway,building,surface,landuse,natural,start_date,wall,service,lanes,layer,tracktype,bridge,foot,bicycle,lit,railway,footway #> Using ini file that can can be edited with file.edit(/tmp/RtmprktWUQ/ini_new.ini) #> user system elapsed #> 72.514 18.150 269.507 plot(louvain_highway) ``` <sup>Created on 2020-02-06 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup> -- #### We made our code 100+ times slower! -- #### ~30 times slower excluding download time --- ## Geofabrik II: pre-downloaded data and filtering ```r belgium_filename = geofabrik::gf_filename("Belgium") belgium_filename ``` ``` ## [1] "~/hd/data/osm/Belgium.osm.pbf" ``` ```r utils:::format.object_size(file.size(belgium_filename), units = "MB") ``` ``` ## [1] "NA Mb" ``` ```r # Around 1 GB in RAM... ``` ```r library(geofabrik) system.time({ belgium_cycleway = get_geofabrik("Belgium", key = "highway", value = "cycleway") }) #> user system elapsed #> 23.658 1.933 23.130 format(object.size(belgium_cycleway), units = "MB") #> [1] "36.9 Mb" ``` -- #### 'Only' ~10 times slower now --- ## Preprocessing bulk OSM extracts ```r # piggyback::pb_download_url("belgium_cycleway.Rds") u = "https://github.com/Robinlovelace/presentations/releases/download/2020-02/belgium_cycleway.Rds" f = "belgium_cycleway.Rds" if(!file.exists(f)) { download.file(url = u, destfile = f) } system.time({ belgium_cycleway = readRDS("belgium_cycleway.Rds") }) ``` ``` ## user system elapsed ## 0.242 0.011 0.257 ``` ```r format(object.size(belgium_cycleway), units = "MB") ``` ``` ## [1] "36.9 Mb" ``` #### Now ~10 times faster! -- #### And 10 times more useful? --- ## Spatial subsetting ```r lvn_box = stplanr::geo_bb(louvain_highway) mapview::mapview(lvn_box) system.time({ lvn_lines = belgium[lvn_box, ] }) #> user system elapsed #> 8.767 0.142 8.911 ``` ```r piggyback::pb_download_url("lvn_lines.Rds") ``` ``` ## [1] "https://github.com/Robinlovelace/presentations/releases/download/2020-02/lvn_lines.Rds" ``` ```r u = "https://github.com/Robinlovelace/presentations/releases/download/2020-02/lvn_lines.Rds" f = "lvn_lines.Rds" if(!file.exists(f)) download.file(url = u, destfile = f) system.time({ lvn_lines = readRDS("lvn_lines.Rds") }) ``` ``` ## user system elapsed ## 0.038 0.000 0.039 ``` ```r format(object.size(lvn_lines), units = "MB") ``` ``` ## [1] "5.6 Mb" ``` --- # Getting data for the transport network in Leeds ```r library(geofabrik) westyorkshire = geofabrik::get_geofabrik("west yorkshire") # subset them centering in chapeltown place_name = "chapeltown leeds" place_point = tmaptools::geocode_OSM(place_name) place_df = data.frame( name = place_name, lon = place_point$coords[1], lat = place_point$coords[2] ) place_sf = sf::st_as_sf(place_df, coords = c("lon", "lat"), crs = 4326) place_buffer = stplanr::geo_projected(place_sf, sf::st_buffer, dist = 2000) ctown = westyorkshire[place_buffer, ] key = "primary|secondary|tertiary|cycleway|trunk|motorway" sel_key = grepl(key, x = ctown$highway) ctown_roads = ctown[sel_key, ] saveRDS(ctown, "ctown.Rds") saveRDS(ctown_roads, "ctown_roads.Rds") plot(sf::st_geometry(ctown_roads)) ``` --- ```r ## Visualising spatial networks u = "https://github.com/Robinlovelace/presentations/releases/download/2020-02/ctown_roads.Rds" ctown_roads = readRDS(url(u)) library(sf) plot(ctown_roads["highway"]) ``` ![](2020-04-06-pennsylvania-tds-regional-street_files/figure-html/unnamed-chunk-21-1.png)<!-- --> --- ## Highway types + data cleaning ```r summary(as.factor(ctown_roads$highway)) ``` ``` ## cycleway motorway motorway_link secondary secondary_link tertiary ## 65 29 27 66 15 176 ## tertiary_link trunk trunk_link ## 24 158 44 ``` ```r # remove messy variables manually (if needed) library(dplyr) ctown_roads = ctown_roads %>% mutate(highway2 = case_when( stringr::str_detect(highway, "const|corri|eleva|_link|plat|unc") ~ as.character(NA), TRUE ~ ctown_roads$highway )) ``` --- ## Cleaning II ```r # (a) base R way highway_table = table(ctown_roads$highway) highway_table ``` ``` ## ## cycleway motorway motorway_link secondary secondary_link tertiary ## 65 29 27 66 15 176 ## tertiary_link trunk trunk_link ## 24 158 44 ``` ```r n10 = round(nrow(ctown_roads)) / 10 ctown_roads_other = names(highway_table)[highway_table < n10] ctown_roads$highway3 = ctown_roads$highway ctown_roads$highway3[ctown_roads$highway3 %in% ctown_roads_other] = "other" ``` --- ## Results... ```r plot(ctown_roads %>% select(contains("high"))) ``` ![](2020-04-06-pennsylvania-tds-regional-street_files/figure-html/unnamed-chunk-25-1.png)<!-- --> --- ## Mapping packages ``` ## tmap mode set to plotting ``` ```r library(tmap) tm_shape(ctown_roads) + tm_lines("highway3", lwd = 3, ) ``` ![](2020-04-06-pennsylvania-tds-regional-street_files/figure-html/unnamed-chunk-27-1.png)<!-- --> --- ## Interactive mapping ```r tmap_mode("view") tm_shape(ctown_roads) + tm_lines("highway3", lwd = 3) ```
--- ## OD data ```r library(stplanr) library(od) od_data_leeds = flow zones_leeds = zones_sf class(od_data_leeds) ``` ``` ## [1] "data.frame" ``` ```r desire_lines_leeds = od::od_to_sf(od_data_leeds, z = zones_sf) class(desire_lines_leeds) ``` ``` ## [1] "sf" "data.frame" ``` --- ## OD data on the network ```r tmap_mode("view") tm_shape(ctown_roads) + tm_lines("highway3", lwd = 3) + tm_shape(desire_lines_leeds) + tm_lines() ``` ``` ## Warning: The shape desire_lines_leeds is invalid. See sf::st_is_valid ``` ``` ## Warning: Currect projection of shape desire_lines_leeds unknown. Long-lat (WGS84) is assumed. ```
--- ## Routing .pull-left[ ```r desire_lines_leeds_interzonal = desire_lines_leeds %>% filter(Area.of.residence != Area.of.workplace) routes_cycling_leeds = route( l = desire_lines_leeds_interzonal[2:9], route_fun = cyclestreets::journey ) ``` ``` ## Most common output is sf ``` ] .pull-right[ ```r tm_shape(ctown_roads) + tm_lines("highway3", lwd = 3) + tm_shape(routes_cycling_leeds) + tm_lines() ```
] --- # Other routing services .pull-left[ - See Section ([12.5](https://geocompr.robinlovelace.net/transport.html#routes)) of Geocomputation with R ```r library(osrm) routes_driving_leeds = route( l = desire_lines_leeds_interzonal[2:9], route_fun = osrm::osrmRoute, returnclass = "sf" ) ``` ] .pull-right[ ```r tm_shape(ctown_roads) + tm_lines("highway3", lwd = 3) + tm_shape(routes_driving_leeds) + tm_lines() ``` ] --- ## Route aggregation ```r # names(routes_cycling_leeds) rnet_leeds = overline(routes_cycling_leeds, attrib = "All") tm_shape(ctown_roads) + tm_lines("highway3", lwd = 3) + tm_shape(rnet_leeds) + tm_lines(lwd = "All", scale = 9) ```
--- ## Spatial networks ```r remotes::install_github("luukvdmeer/sfnetworks") ``` ```r library(sfnetworks) ``` --- ## National route network dataset ```r plot(sf::st_geometry(belgium_cycleway)) ``` ![](https://user-images.githubusercontent.com/1825120/73952495-e3400600-48f6-11ea-8e52-33df6f15613f.png) --- # TDS for policy Source: the Propensity to Cycle Tool (PCT) project, demo at www.pct.bike <img src="https://raw.githubusercontent.com/npct/pct-team/master/figures/front-page-leeds-pct-demo.png" width="90%" /> -- Source - https://github.com/npct which hosts national web tool PCT [www.pct.bike](http://www.pct.bike/) --- ## Tools for reproducible research Transport data science is simply the application of 'data science' to transport - In terms of technique this means command-line tools for scalable analysis - Principles: reproducibility, generalisability, (public) accessibility ### Example II: **stats19** - Package providing access to open access road casualty dataset: https://docs.ropensci.org/stats19/ Used by the Parliamentary library to provide evidence for Members of Parliament (MPs): https://commonslibrary.parliament.uk/economy-business/transport/roads/constituency-data-traffic-accidents/ --- # The practical session It depends on you level of R skills, in ascending order of difficulty 1. Have a play with the OSM data for Leeds and reproduce the examples in these slides, hosted in the releases page of https://github.com/Robinlovelace/presentations (easy-to-hard) - Visualise the dataset with tmap and other packages (see Chapter 8) - Find the OD pairs with the highest levels of walking, cycling and driving - Find the driving routes with osrmRoute() - Undertake spatial network analysis of the data (advanced, see sfnetworks package) - Devise your own research questions using available data (PhD!) 2. Work through the Transport Chapter (12) in Geocompuation with R (intermediate) 3. Work through an introduction to road safety analysis with R (easier) : https://docs.ropensci.org/stats19/articles/stats19-training.html 4. Apply the methods in that chapter to a different UK city (advanced) 5. Technical issues: how to convert road network into polygons (see https://github.com/ropensci/stplanr/issues/362 ), how to do 'kriging' on a spatial network (see https://github.com/luukvdmeer/sfnetworks/issues/11 ) <!-- specific stats - kriging on network --> <!-- geo question --> --- ### Interested in spatial networks? Check-out the sfnetworks package (and issue tracker): https://github.com/luukvdmeer/sfnetworks Join us for the Spatial Networks Hackathon, date TBC ![](https://user-images.githubusercontent.com/1825120/78507945-1b3eba80-777b-11ea-87a0-fa84965982e9.png) --- # Thanks! Contact me at r. lovelace at leeds ac dot uk (email), [`@robinlovelace`](https://twitter.com/robinlovelace) -- Check-out links to my work at [robinlovelace.net](https://www.robinlovelace.net/) -- Thanks to everyone for building a open and collaborative communities Thanks to the University of Leeds and the Institute for Transport Studies [![](https://avatars3.githubusercontent.com/u/22447619?s=400&u=2d566bedf62374d5066a50d2dd7c87c84470f69b&v=4)](https://environment.leeds.ac.uk/transport)