class: center, middle, inverse, title-slide

# Transport Data Science: from regional to street levels
## 🚀 Emerging discipline and community of practice?
### Robin Lovelace
### University of Leeds, Institute for Transport Studies
### University of Pennsylvania, 2020-04-06 (updated: 2020-04-06)

Reproducible source code: github.com/Robinlovelace/presentations
---
background-image: url(https://d3l4am9dimtbet.cloudfront.net/wp-content/uploads/2018/07/In-Pics_-These-Mind-Blowing-Aerial-Shots-of-Mumbai-Were-Taken-by-a-Drone.jpg)

<!-- 14h30-15h30 lecture part (this could already drift towards the practical part but doesn't have to ...) -->
<!-- 15:30-16h00 coffee break -->
<!-- 16h00-17h00 hands-on workshop (could be somewhat open-ended, we'll have the room all afternoon - but if we would go into extra time, we should make a short 'cut' around 17h such that those who have/want to go home can do so). -->

# Introduction

Source: [thebetterindia.com](https://www.thebetterindia.com/148450/mumbai-aerial-shots/)

<!-- Mumbai is by [some](https://www.independent.co.uk/travel/worlds-biggest-cities-mexico-city-new-york-karachi-tokyo-lagos-kolkata-kinshasa-dhaka-delhi-a8158426.html) measures the world's largest city -->

--

Abstract: Data Science has emerged as an area of high and consistent growth in many sectors. High tech industries such as search engine optimisation, marketing and retail analytics have been quick to adopt new workflows. Transport has arguably been slow to adapt to the transformations towards open source software (as opposed to proprietary products that still dominate), command line and scriptable interfaces (unlike the graphical user interfaces of tools such as Excel) and code sharing and version development via platforms such as GitHub. This talk demonstrates these new workflows, building on the [Transport chapter in the open source book Geocomputation with R](https://geocompr.robinlovelace.net/transport.html) and experience developing and deploying data science tools that are having a real world impact on transport policy and practice.
It will provide insight into how multiple geographic scales of analysis were used to develop the [Propensity to Cycle Tool](http://www.pct.bike/), which Robin developed in collaboration with colleagues from 4 universities and which is being used by dozens of Local Authorities to develop strategic cycle networks. Furthermore, a live demo of the recently released [stats19](https://docs.ropensci.org/stats19/) package, which provides fast access to crash data for road safety research, will highlight the power of reproducibility and show that transport data science is a practical field, best discovered by doing it and by collaborating using reproducible code.

```r
# to reproduce these slides do:
pkgs = c("rgdal", "sf", "geojsonsf")
install.packages(pkgs)
```

--

---

## whoami

```r
system("whoami")
```

--

.pull-left[

- Environmental geographer

- Learned R for PhD on energy and transport

- Now work at the University of Leeds (ITS and LIDA)

- Focus: Applied geocomputation

- Strong interest in technology + reproducibility, e.g.:

```r
devtools::install_github("r-rust/gifski")
system("youtube-dl https://youtu.be/CzxeJlgePV4 -o v.mp4")
system("ffmpeg -i v.mp4 -t 00:00:03 -c copy out.mp4")
system("ffmpeg -i out.mp4 frame%04d.png ")
f = list.files(pattern = "frame")
gifski::gifski(f, gif_file = "g.gif", width = 200, height = 200)
```

]

--

.pull-right[

Image credit: Jeroen Ooms + others

```r
knitr::include_graphics("https://user-images.githubusercontent.com/1825120/39661313-534efd66-5047-11e8-8d99-a5597fe160ff.gif")
```

<img src="https://user-images.githubusercontent.com/1825120/39661313-534efd66-5047-11e8-8d99-a5597fe160ff.gif" width="100%" />

]

---

## Importance of Geographic data

.pull-left[

### Geographic data is everywhere

### Underlies some of society's biggest issues

### Can give general analyses local (actionable) meaning

]

--

.pull-right[

![](https://raw.githubusercontent.com/npct/pct-team/master/figures/Leeds-network.png)<!-- -->

Example: Propensity to Cycle Tool (PCT) in Manchester: http://pct.bike/m/?r=greater-manchester

]

---

# Classifying transport data

Source (check it): https://www.openstreetmap.org/edit#map=11/53.8137/-1.5240

![](https://user-images.githubusercontent.com/1825120/78499529-195c0380-7749-11ea-8f7a-43dfd479039e.png)<!-- -->

---

## Information and data pyramids

Data science is climbing the DIKW pyramid

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/DIKW_Pyramid.svg/220px-DIKW_Pyramid.svg.png" width="70%" />

--

Data science *can* convert data to information and graphics but *cannot*, on its own, create knowledge, let alone wisdom

---

## A geographic availability pyramid

.pull-left[

- Recommendations
  - Build this here!
- City-specific datasets
  - Bristol cycle count data
- Hard-to-access national data
- Open international/national datasets
  - Open origin-destination data from UK Census
- Globally available, low-grade data (bottom)
  - OpenStreetMap, Elevation data

]

.pull-right[

![](https://user-images.githubusercontent.com/1825120/78549175-dc037e80-77f9-11ea-999e-e526632a1d53.png)

]

---

## An ease-of-access pyramid

- Data provision packages
  - Use the pct package
  - stats19 package
- Pre-processed data
  - E.g. downloading data from website www.pct.bike
- Messy official data
  - Raw STATS19 data

---

## A geographic level of detail pyramid

- Agents
- Route networks
- Nodes
- Routes
- Desire lines
- Transport zones

---

## Observations

- Official sources are often smaller in size but higher in quality
- Unofficial sources provide higher volumes but tend to be noisy, e.g. a study of Twitter vs mobile vs in-person survey data (Lovelace et al. [2016](https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12081))
- Another way to classify data is by quality: signal/noise ratios
- Globally available datasets would be at the bottom of this pyramid; local surveys at the top. Source: https://geocompr.robinlovelace.net/read-write.html
- Which would be best to inform policy?
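---

## Signal vs noise: a simulated sketch

As a rough illustration of the volume/quality trade-off on the Observations slide, the simulation below compares a small random sample with a much larger but biased and noisy one. Every number here is made up for illustration (the exponential trip distances, the sample sizes and the bias mechanism are all assumptions), so it shows the principle only, not a result about any real dataset.

```r
set.seed(2020)

# Hypothetical 'true' population of trip distances in km (simulated, not real data)
population = rexp(1e5, rate = 1 / 5)
true_mean = mean(population)

# Small 'official' survey: 500 trips sampled at random
official = sample(population, 500)

# Large 'unofficial' source: 50,000 records, assumed to over-represent
# long trips (sampling probability proportional to distance) plus noise
p = population / sum(population)
unofficial = sample(population, 5e4, replace = TRUE, prob = p) + rnorm(5e4, sd = 2)

# Absolute error of each source's estimate of the mean trip distance
err = c(
  official = abs(mean(official) - true_mean),
  unofficial = abs(mean(unofficial) - true_mean)
)
round(err, 2)
```

In this made-up setting the larger dataset ends up further from the true mean, because bias does not shrink as volume grows.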
---

## Portals

- UK geoportal, providing geographic data at many levels: https://geoportal.statistics.gov.uk
- Other national geoportals exist, such as this: http://www.geoportal.org/
- A good source of cleaned origin-destination data is the Region downloads tab in the Propensity to Cycle Tool - see the Region data tab for West Yorkshire here, for example: http://www.pct.bike/m/?r=west-yorkshire
- OpenStreetMap is an excellent source of geographic data with global coverage. You can download data on specific queries (e.g. highway=cycleway) from the overpass-turbo service: https://overpass-turbo.eu/ or with the **osmdata** package

---

## Online lists

For other datasets, search online! Good starting points in your research may be:

- The open data section in Geocomputation with R - https://geocompr.robinlovelace.net/read-write.html#retrieving-data
- Transport datasets mentioned here: https://data.world/datasets/transportation
- UK government transport data: https://ckan.publishing.service.gov.uk/publisher/department-for-transport

---

## Data packages

- The **openrouteservice** GitHub package provides routing data
- The stats19 package can get road crash data for anywhere in Great Britain (Lovelace et al. 2019); see here for info: https://itsleeds.github.io/stats19/
- The pct package provides access to data in the PCT: https://github.com/ITSLeeds/pct
- There are many other R packages to help access data

---

## Example: osmdata

.pull-left[

``` r
library(osmdata)
library(sf)
system.time({ # around 2 seconds
  n = "louvain-la-neuve"
  v = "primary|secondary|cycleway"
  louvain = opq(n) %>%
    add_osm_feature("highway", v, value_exact = FALSE) %>%
    osmdata_sf()
})
#>    user  system elapsed
#>   0.125   0.020   1.814
louvain_highway = louvain$osm_lines
plot(louvain_highway)
```

]

--

.pull-right[

![](https://i.imgur.com/09yWU7V.png)<!-- -->

]

---

## Example: geofabrik

``` r
# remotes::install_github("itsleeds/geofabrik")
library(geofabrik)
lvn_centroid = tmaptools::geocode_OSM("louvain-la-neuve", as.sf = TRUE)
system.time({
  # belgium = get_geofabrik(lvn_centroid) # warning: downloads giant file
  belgium = get_geofabrik("Belgium") # equivalent code
})
#> although coordinates are longitude/latitude, st_contains assumes that they are planar
#> The place is within these geofabrik zones: Europe, Belgium
#> Selecting the smallest: Belgium
#> Downloading http://download.geofabrik.de/europe/belgium-latest.osm.pbf to
#> ~/h/data/osm/Belgium.osm.pbf
#> Old attributes: attributes=name,highway,waterway,aerialway,barrier,man_made
#> New attributes: attributes=name,highway,waterway,aerialway,barrier,man_made,maxspeed,oneway,building,surface,landuse,natural,start_date,wall,service,lanes,layer,tracktype,bridge,foot,bicycle,lit,railway,footway
#> Using ini file that can can be edited with file.edit(/tmp/RtmprktWUQ/ini_new.ini)
#>    user  system elapsed
#>  72.514  18.150 269.507
plot(louvain_highway)
```

<sup>Created on 2020-02-06 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>

--

#### We made our code 100+ times slower!
--

#### ~30 times slower excluding download time

---

## Geofabrik II: pre-downloaded data and filtering

```r
belgium_filename = geofabrik::gf_filename("Belgium")
belgium_filename
```

```
## [1] "~/hd/data/osm/Belgium.osm.pbf"
```

```r
utils:::format.object_size(file.size(belgium_filename), units = "MB")
```

```
## [1] "NA Mb"
```

```r
# Around 1 GB in RAM...
```

```r
library(geofabrik)
system.time({
  belgium_cycleway = get_geofabrik("Belgium", key = "highway", value = "cycleway")
})
#>    user  system elapsed
#>  23.658   1.933  23.130
format(object.size(belgium_cycleway), units = "MB")
#> [1] "36.9 Mb"
```

--

#### 'Only' ~10 times slower now

---

## Preprocessing bulk OSM extracts

```r
# piggyback::pb_download_url("belgium_cycleway.Rds")
u = "https://github.com/Robinlovelace/presentations/releases/download/2020-02/belgium_cycleway.Rds"
f = "belgium_cycleway.Rds"
if(!file.exists(f)) {
  download.file(url = u, destfile = f)
}
system.time({
  belgium_cycleway = readRDS("belgium_cycleway.Rds")
})
```

```
##    user  system elapsed
##   0.242   0.011   0.257
```

```r
format(object.size(belgium_cycleway), units = "MB")
```

```
## [1] "36.9 Mb"
```

#### Now ~10 times faster!

--

#### And 10 times more useful?
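---

## Caching pre-processed data: a sketch

The download-if-missing pattern on the previous slide can be wrapped in a small helper. This is a minimal sketch in base R: `read_cached()` is a hypothetical name (not part of any package), and `slow_step()` with its toy data frame stands in for a slow download or query.

```r
# Hypothetical helper: run `create()` once, save the result as RDS,
# then read it back from disk on every later call
read_cached = function(path, create) {
  if (!file.exists(path)) {
    saveRDS(create(), path)
  }
  readRDS(path)
}

f = tempfile(fileext = ".Rds")
slow_step = function() {
  Sys.sleep(1) # stands in for a slow download or OSM query
  data.frame(highway = c("cycleway", "primary"), n = c(2, 1))
}

# First call pays the cost of slow_step(); the second only reads the file
t_first = system.time(x1 <- read_cached(f, slow_step))[["elapsed"]]
t_second = system.time(x2 <- read_cached(f, slow_step))[["elapsed"]]
identical(x1, x2)
```

The trade-off is the usual one with caches: the file on disk will not update itself when the upstream data changes, so it needs deleting or versioning.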
---

## Spatial subsetting

```r
lvn_box = stplanr::geo_bb(louvain_highway)
mapview::mapview(lvn_box)
system.time({
  lvn_lines = belgium[lvn_box, ]
})
#>    user  system elapsed
#>   8.767   0.142   8.911
```

```r
piggyback::pb_download_url("lvn_lines.Rds")
```

```
## [1] "https://github.com/Robinlovelace/presentations/releases/download/2020-02/lvn_lines.Rds"
```

```r
u = "https://github.com/Robinlovelace/presentations/releases/download/2020-02/lvn_lines.Rds"
f = "lvn_lines.Rds"
if(!file.exists(f)) download.file(url = u, destfile = f)
system.time({
  lvn_lines = readRDS("lvn_lines.Rds")
})
```

```
##    user  system elapsed
##   0.038   0.000   0.039
```

```r
format(object.size(lvn_lines), units = "MB")
```

```
## [1] "5.6 Mb"
```

---

# Getting data for the transport network in Leeds

```r
library(geofabrik)
westyorkshire = geofabrik::get_geofabrik("west yorkshire")
# subset the data, centred on Chapeltown
place_name = "chapeltown leeds"
place_point = tmaptools::geocode_OSM(place_name)
place_df = data.frame(
  name = place_name,
  lon = place_point$coords[1],
  lat = place_point$coords[2]
)
place_sf = sf::st_as_sf(place_df, coords = c("lon", "lat"), crs = 4326)
# 2 km buffer, calculated in a projected CRS
place_buffer = stplanr::geo_projected(place_sf, sf::st_buffer, dist = 2000)
ctown = westyorkshire[place_buffer, ]
key = "primary|secondary|tertiary|cycleway|trunk|motorway"
sel_key = grepl(key, x = ctown$highway)
ctown_roads = ctown[sel_key, ]
saveRDS(ctown, "ctown.Rds")
saveRDS(ctown_roads, "ctown_roads.Rds")
plot(sf::st_geometry(ctown_roads))
```

---

## Visualising spatial networks

```r
u = "https://github.com/Robinlovelace/presentations/releases/download/2020-02/ctown_roads.Rds"
ctown_roads = readRDS(url(u))
library(sf)
plot(ctown_roads["highway"])
```

![](2020-04-06-pennsylvania-tds-regional-street_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---

## Highway types + data cleaning

```r
summary(as.factor(ctown_roads$highway))
```

```
##       cycleway       motorway  motorway_link      secondary secondary_link       tertiary
##             65             29             27             66             15            176
##  tertiary_link          trunk     trunk_link
##             24            158             44
```

```r
# remove messy variables manually (if needed)
library(dplyr)
ctown_roads = ctown_roads %>%
  mutate(highway2 = case_when(
    # set rare or messy highway types to NA
    stringr::str_detect(highway, "const|corri|eleva|_link|plat|unc") ~ NA_character_,
    TRUE ~ highway
  ))
```

---

## Cleaning II

```r
# (a) base R way
highway_table = table(ctown_roads$highway)
highway_table
```

```
##
##       cycleway       motorway  motorway_link      secondary secondary_link       tertiary
##             65             29             27             66             15            176
##  tertiary_link          trunk     trunk_link
##             24            158             44
```

```r
# threshold: 10% of all road segments
n10 = round(nrow(ctown_roads) / 10)
# highway types with fewer segments than the threshold become "other"
ctown_roads_other = names(highway_table)[highway_table < n10]
ctown_roads$highway3 = ctown_roads$highway
ctown_roads$highway3[ctown_roads$highway3 %in% ctown_roads_other] = "other"
```

---

## Results...

```r
plot(ctown_roads %>% select(contains("high")))
```

![](2020-04-06-pennsylvania-tds-regional-street_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---

## Mapping packages

```
## tmap mode set to plotting
```

```r
library(tmap)
tm_shape(ctown_roads) +
  tm_lines("highway3", lwd = 3)
```

![](2020-04-06-pennsylvania-tds-regional-street_files/figure-html/unnamed-chunk-27-1.png)<!-- -->

---

## Interactive mapping

```r
tmap_mode("view")
tm_shape(ctown_roads) +
  tm_lines("highway3", lwd = 3)
```
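---

## Cleaning II: a self-contained sketch

The lumping step from the Cleaning II slide can be tried without downloading anything. The sketch below rebuilds a plain character vector from the counts printed on the Highway types slide; `lump_rare()` is a hypothetical helper name, not a function from any package.

```r
# Rebuild a highway vector from the counts shown on the Highway types slide
counts = c(
  cycleway = 65, motorway = 29, motorway_link = 27, secondary = 66,
  secondary_link = 15, tertiary = 176, tertiary_link = 24,
  trunk = 158, trunk_link = 44
)
highway = rep(names(counts), times = counts)

# Hypothetical helper: relabel categories with fewer than `min_n` members
lump_rare = function(x, min_n, other = "other") {
  tab = table(x)
  rare = names(tab)[tab < min_n]
  x[x %in% rare] = other
  x
}

n10 = round(length(highway) / 10) # 10% of all segments
highway3 = lump_rare(highway, min_n = n10)
table(highway3)
```

With these counts the threshold is 60 segments, so the five less common types (motorway and the four `_link` types) are merged into "other" while cycleway, secondary, tertiary and trunk are kept.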
---

## OD data

```r
library(stplanr)
library(od)
od_data_leeds = flow
zones_leeds = zones_sf
class(od_data_leeds)
```

```
## [1] "data.frame"
```

```r
desire_lines_leeds = od::od_to_sf(od_data_leeds, z = zones_sf)
class(desire_lines_leeds)
```

```
## [1] "sf"         "data.frame"
```

---

## OD data on the network

```r
tmap_mode("view")
tm_shape(ctown_roads) +
  tm_lines("highway3", lwd = 3) +
  tm_shape(desire_lines_leeds) +
  tm_lines()
```

```
## Warning: The shape desire_lines_leeds is invalid. See sf::st_is_valid
```

```
## Warning: Currect projection of shape desire_lines_leeds unknown. Long-lat (WGS84) is assumed.
```
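---

## From OD pairs to desire lines: a sketch

To show what turning OD data into desire lines involves, the sketch below builds straight lines between zone points by hand with **sf**. It is a conceptual illustration only, not the implementation used by `od::od_to_sf()`; the zone codes, coordinates and trip counts are made up.

```r
library(sf)

# Hypothetical zone centroids (longitude/latitude) and OD pairs
zones = data.frame(
  code = c("a", "b", "c"),
  x = c(-1.55, -1.50, -1.52),
  y = c(53.80, 53.82, 53.85)
)
od = data.frame(
  o = c("a", "a", "b"),
  d = c("b", "c", "c"),
  all = c(10, 5, 8)
)

# Look up origin and destination coordinates for each OD pair
xy = as.matrix(zones[c("x", "y")])
o_xy = xy[match(od$o, zones$code), , drop = FALSE]
d_xy = xy[match(od$d, zones$code), , drop = FALSE]

# One straight line per OD pair, from origin point to destination point
lines = lapply(seq_len(nrow(od)), function(i) {
  st_linestring(rbind(o_xy[i, ], d_xy[i, ]))
})
desire_lines = st_sf(od, geometry = st_sfc(lines, crs = 4326))
class(desire_lines)
```

Setting `crs = 4326` explicitly when the geometry column is created tells downstream mapping code which coordinate system the lines use.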
---

## Routing

.pull-left[

```r
desire_lines_leeds_interzonal = desire_lines_leeds %>%
  filter(Area.of.residence != Area.of.workplace)
routes_cycling_leeds = route(
  l = desire_lines_leeds_interzonal[2:9],
  route_fun = cyclestreets::journey
)
```

```
## Most common output is sf
```

]

.pull-right[

```r
tm_shape(ctown_roads) +
  tm_lines("highway3", lwd = 3) +
  tm_shape(routes_cycling_leeds) +
  tm_lines()
```