A workshop for Housing Data Coalition on using R to analyze NYC housing data
All the code for this article is available in this RStudio Cloud project and in this GitHub repo
R is a free and open-source software environment for statistical computing and graphics. It has a narrower set of uses and a smaller user base than Python, but because it is designed specifically for data analysis it is a great language for data analysts to learn. In recent years it has become increasingly popular and also much easier to learn, especially for those new to coding.
There are three main reasons that I enjoy using R myself and teaching it to others.
Here is just a small sample of some of the great resources available for learning R:
The RStudio Desktop application is free to download. They also now provide RStudio Cloud, a free service that works just like RStudio Desktop but is accessed entirely through the browser and requires no installation. We'll be using RStudio Cloud for this workshop, and you can get started by opening this RStudio Cloud project and making a permanent copy in your own workspace.
This is the default RStudio interface.
* The top left pane contains all your code files. This is where you can write and save all the code for your analysis.
* The bottom left pane has the R console, where you can run individual pieces of R code, such as quick calculations, printing objects, or anything else that you don't need to save in your final files.
* The top right pane contains a list of all the objects currently in your environment. If you click on a dataframe object it will open in a viewer in the top left pane, where you can browse, sort, and filter your view of the data (without altering the object itself).
* The bottom right pane contains a few important tabs: the plot viewer where any graphs you create will appear, the file explorer, and the help page.
R is an open-source language, so in addition to the basic functions that come standard with R (referred to as Base R) there are more than 10,000 user-written packages that can accomplish virtually any task in R. There is an official repository for these packages called CRAN that does some vetting of package quality, and packages from there can be installed directly from R using:
install.packages("PackageName")
These packages only need to be installed like this once; after that initial installation we only need to load the packages that we want to use for each analysis with library().
All of the packages we are using here are part of a collection of R packages referred to as the tidyverse.
The Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
All of these packages are extremely well-maintained and have helpful websites that include examples and guides, function documentation, cheatsheets, and links to the GitHub repos where the packages are developed.
The following are the core set of Tidyverse packages, but there are many more.
* dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges
* readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf)
* tidyr helps you create tidy data. Tidy data is data where: each variable is in a column, each observation is a row, and each value is a cell
* stringr provides a cohesive set of functions designed to make working with strings as easy as possible
* forcats provides a suite of useful tools that solve common problems with factors
* purrr is a complete and consistent set of tools for working with functions and vectors
* ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics
In addition to the package websites, there is an amazing free book that covers how to use all these packages to do data analysis, called R for Data Science.
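All of these core packages can also be installed together at once via the tidyverse meta-package, and then loaded individually as we do below:
install.packages("tidyverse")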
library(dplyr) # manipulate dataframes
library(readr) # read/write dataframes
library(tidyr) # reshaping dataframes
library(stringr) # string manipulation
library(forcats) # factor manipulation
library(purrr) # iteration (instead of loops)
library(ggplot2) # making plots
We’ll be using a dataset of buildings from the Public Advocate’s NYC Landlord Watchlist.
read_csv() guesses the data type of each column and tells you the column specification it used. Often this will be what you want, but if you want to override the guesses you can supply your own specification.
watchlist_bldgs <- read_csv("data/landlord-watchlist-buildings_2018-12-08.csv")
watchlist_bldgs <- read_csv(
  file = "data/landlord-watchlist-buildings_2018-12-08.csv",
  col_types = cols(
    .default = col_character(),
    units = col_integer(),
    violations = col_integer()
  )
)
Now let’s take a look at this new dataframe that we’ve imported. You can print the dataframe to get a simple preview.
watchlist_bldgs
# A tibble: 452 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan C… 5416 4 A… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonathan C… 5422 4 A… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonathan C… 314 57 S… 17 70 314 57 ST… BROOKL… 11220
4 Jonathan C… 372 BALT… 8 76 372 BALTI… BROOKL… 11201
5 Jonathan C… 166 BLEE… 6 41 166 BLEEC… BROOKL… 11221
6 Jonathan C… 590 BUSH… 4 55 590 BUSHW… BROOKL… 11206
7 Jonathan C… 165 HOWA… 8 39 165 HOWAR… BROOKL… 11233
8 Jonathan C… 208 HOYT… 3 9 208 HOYT … BROOKL… 11217
9 Jonathan C… 203 HULL… 6 143 203 HULL … BROOKL… 11233
10 Jonathan C… 318 HUMB… 7 102 318 HUMBO… BROOKL… 11211
# ... with 442 more rows
When simply printing the dataframe you’ll only see a few rows and as many columns as fit nicely on your screen. When you have many columns it’s often helpful to use the function glimpse() to see a list of all your columns.
glimpse(watchlist_bldgs)
Observations: 452
Variables: 8
$ landlord <chr> "Jonathan Cohen-Silvershore Properties", "Jona...
$ address <chr> "5416 4 AVENUE, BROOKLYN 11220", "5422 4 AVENU...
$ units <int> 16, 21, 17, 8, 6, 4, 8, 3, 6, 7, 6, 8, 6, 6, 7...
$ violations <int> 62, 78, 70, 76, 41, 55, 39, 9, 143, 102, 19, 5...
$ number <chr> "5416", "5422", "314", "372", "166", "590", "1...
$ street <chr> "4 AVENUE", "4 AVENUE", "57 STREET", "BALTIC S...
$ borough <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "BROOKLYN"...
$ zip <chr> "11220", "11220", "11220", "11201", "11221", "...
dplyr
The package dplyr contains functions for basic data manipulation. It is organized around 5 main functions that take a dataframe and manipulate it in some way. The functions are named as verbs which help explain what they do.
* filter() - filter to the rows you want to keep based on conditions
* select() - select columns you want to keep
* arrange() - sort dataframe by a column
* mutate() - adds new columns
* summarise() - collapse multiple rows down to a single one
Every one of these functions takes a dataframe as the first argument and returns an altered version of that dataframe.
Inside of these functions columns are referred to with just their names without quotes.
filter()
Use filter() to find rows/cases where conditions are true. Rows where the condition evaluates to NA are dropped.
bk_bldgs <- filter(watchlist_bldgs, borough == "BROOKLYN")
bk_bldgs
# A tibble: 256 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan C… 5416 4 A… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonathan C… 5422 4 A… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonathan C… 314 57 S… 17 70 314 57 ST… BROOKL… 11220
4 Jonathan C… 372 BALT… 8 76 372 BALTI… BROOKL… 11201
5 Jonathan C… 166 BLEE… 6 41 166 BLEEC… BROOKL… 11221
6 Jonathan C… 590 BUSH… 4 55 590 BUSHW… BROOKL… 11206
7 Jonathan C… 165 HOWA… 8 39 165 HOWAR… BROOKL… 11233
8 Jonathan C… 208 HOYT… 3 9 208 HOYT … BROOKL… 11217
9 Jonathan C… 203 HULL… 6 143 203 HULL … BROOKL… 11233
10 Jonathan C… 318 HUMB… 7 102 318 HUMBO… BROOKL… 11211
# ... with 246 more rows
Multiple conditions passed as separate arguments are combined with &; an equivalent version using an explicit & is shown after the output below.
bk_big_bldgs <- filter(watchlist_bldgs, units > 10, borough == "QUEENS")
bk_big_bldgs
# A tibble: 4 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan C… 1708 SUM… 39 116 1708 SUMMER… QUEENS 11385
2 RONALD SWA… 34-01 28… 41 157 34-01 28 AVE… QUEENS 11103
3 RONALD SWA… 47-02 88… 29 98 47-02 88 STR… QUEENS 11373
4 HILLDISDE … 87-40 16… 119 383 87-40 165 ST… QUEENS 11432
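Passing multiple conditions as separate arguments, as above, is equivalent to combining them explicitly with &, so this call returns the same rows:
filter(watchlist_bldgs, units > 10 & borough == "QUEENS")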
select()
Use select() to keep or drop columns. You can either specify a set of variables to keep by listing them, or specify columns to be dropped with -.
select(watchlist_bldgs, landlord, borough, units)
# A tibble: 452 x 3
landlord borough units
<chr> <chr> <int>
1 Jonathan Cohen-Silvershore Properties BROOKLYN 16
2 Jonathan Cohen-Silvershore Properties BROOKLYN 21
3 Jonathan Cohen-Silvershore Properties BROOKLYN 17
4 Jonathan Cohen-Silvershore Properties BROOKLYN 8
5 Jonathan Cohen-Silvershore Properties BROOKLYN 6
6 Jonathan Cohen-Silvershore Properties BROOKLYN 4
7 Jonathan Cohen-Silvershore Properties BROOKLYN 8
8 Jonathan Cohen-Silvershore Properties BROOKLYN 3
9 Jonathan Cohen-Silvershore Properties BROOKLYN 6
10 Jonathan Cohen-Silvershore Properties BROOKLYN 7
# ... with 442 more rows
select(watchlist_bldgs, -landlord)
# A tibble: 452 x 7
address units violations number street borough zip
<chr> <int> <int> <chr> <chr> <chr> <chr>
1 5416 4 AVENUE, BR… 16 62 5416 4 AVENUE BROOKL… 11220
2 5422 4 AVENUE, BR… 21 78 5422 4 AVENUE BROOKL… 11220
3 314 57 STREET, BR… 17 70 314 57 STREET BROOKL… 11220
4 372 BALTIC STREET… 8 76 372 BALTIC S… BROOKL… 11201
5 166 BLEECKER STRE… 6 41 166 BLEECKER… BROOKL… 11221
6 590 BUSHWICK AVEN… 4 55 590 BUSHWICK… BROOKL… 11206
7 165 HOWARD AVENUE… 8 39 165 HOWARD A… BROOKL… 11233
8 208 HOYT STREET, … 3 9 208 HOYT STR… BROOKL… 11217
9 203 HULL STREET, … 6 143 203 HULL STR… BROOKL… 11233
10 318 HUMBOLDT STRE… 7 102 318 HUMBOLDT… BROOKL… 11211
# ... with 442 more rows
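select() also understands column ranges; for example, this quick illustration (not part of the original workshop code) keeps the landlord column plus every column from number through zip:
select(watchlist_bldgs, landlord, number:zip)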
You can rename the columns that you are selecting within select(), or use rename() which keeps all columns.
select(watchlist_bldgs, borough_name = borough)
# A tibble: 452 x 1
borough_name
<chr>
1 BROOKLYN
2 BROOKLYN
3 BROOKLYN
4 BROOKLYN
5 BROOKLYN
6 BROOKLYN
7 BROOKLYN
8 BROOKLYN
9 BROOKLYN
10 BROOKLYN
# ... with 442 more rows
rename(watchlist_bldgs, landlord_name = landlord)
# A tibble: 452 x 8
landlord_name address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonathan Coh… 5416 4… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonathan Coh… 5422 4… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonathan Coh… 314 57… 17 70 314 57 ST… BROOKL… 11220
4 Jonathan Coh… 372 BA… 8 76 372 BALTI… BROOKL… 11201
5 Jonathan Coh… 166 BL… 6 41 166 BLEEC… BROOKL… 11221
6 Jonathan Coh… 590 BU… 4 55 590 BUSHW… BROOKL… 11206
7 Jonathan Coh… 165 HO… 8 39 165 HOWAR… BROOKL… 11233
8 Jonathan Coh… 208 HO… 3 9 208 HOYT … BROOKL… 11217
9 Jonathan Coh… 203 HU… 6 143 203 HULL … BROOKL… 11233
10 Jonathan Coh… 318 HU… 7 102 318 HUMBO… BROOKL… 11211
# ... with 442 more rows
mutate()
Use mutate() to add new columns to a dataset. mutate() keeps all the existing columns and adds new ones to the end of the dataset, while the variant transmute() creates new columns but keeps only the new ones.
mutate(watchlist_bldgs, landlord_lower = str_to_lower(landlord))
# A tibble: 452 x 9
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 Jonatha… 5416 4… 16 62 5416 4 AVE… BROOKL… 11220
2 Jonatha… 5422 4… 21 78 5422 4 AVE… BROOKL… 11220
3 Jonatha… 314 57… 17 70 314 57 ST… BROOKL… 11220
4 Jonatha… 372 BA… 8 76 372 BALTI… BROOKL… 11201
5 Jonatha… 166 BL… 6 41 166 BLEEC… BROOKL… 11221
6 Jonatha… 590 BU… 4 55 590 BUSHW… BROOKL… 11206
7 Jonatha… 165 HO… 8 39 165 HOWAR… BROOKL… 11233
8 Jonatha… 208 HO… 3 9 208 HOYT … BROOKL… 11217
9 Jonatha… 203 HU… 6 143 203 HULL … BROOKL… 11233
10 Jonatha… 318 HU… 7 102 318 HUMBO… BROOKL… 11211
# ... with 442 more rows, and 1 more variable: landlord_lower <chr>
transmute(watchlist_bldgs, violations_per_unit = violations / units)
# A tibble: 452 x 1
violations_per_unit
<dbl>
1 3.88
2 3.71
3 4.12
4 9.5
5 6.83
6 13.8
7 4.88
8 3
9 23.8
10 14.6
# ... with 442 more rows
arrange()
Use arrange() to order the rows in your dataset by the values of one or more columns. By default they will be in ascending order, and you can use desc() for descending order.
arrange(watchlist_bldgs, landlord, desc(units))
# A tibble: 452 x 8
landlord address units violations number street borough zip
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 ADAM STR… 150 WEST … 31 92 150 WEST 1… BRONX 10468
2 ADAM STR… 19 EAST 1… 25 105 19 EAST 1… MANHAT… 10035
3 ADAM STR… 611 WEST … 22 78 611 WEST 1… MANHAT… 10031
4 ADAM STR… 535 EDGEC… 21 96 535 EDGECO… MANHAT… 10032
5 ADAM STR… 480 CONVE… 15 76 480 CONVEN… MANHAT… 10031
6 ADAM STR… 616 EAST … 13 62 616 EAST 1… BRONX 10458
7 ADAM STR… 2372 ARTH… 12 47 2372 ARTHUR… BRONX 10458
8 ADAM STR… 2550 CRES… 11 67 2550 CRESTO… BRONX 10468
9 ADAM STR… 293 PLEAS… 10 49 293 PLEASA… MANHAT… 10029
10 ADAM STR… 2405 BEAU… 9 33 2405 BEAUMO… BRONX 10458
# ... with 442 more rows
summarize()
You can use summarize() on a dataset to collapse all the rows down to a single row to calculate an aggregate statistic of one or more columns. It works in a similar way to mutate(), except that whereas with mutate() you create new columns that are the same length as your existing dataset, with summarise() you use some sort of aggregate function (like sum()) that takes a column of multiple values and returns only one value.
summarise(watchlist_bldgs, total_units = sum(units))
# A tibble: 1 x 1
total_units
<int>
1 7192
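summarise() can also compute several aggregates at once; for example, this quick sketch (not in the original workshop code) gets the total units and the average number of violations per building:
summarise(watchlist_bldgs, total_units = sum(units), avg_violations = mean(violations))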
group_by()
The 6th function is group_by(), and it doesn't change the contents of your dataframe, but instead affects how all of the above functions work when they are subsequently called on the dataframe. After a dataframe has been grouped by one or more columns, all functions apply to each group of rows in the dataset as if it were its own dataset. group_by() is most commonly used with summarize(). Alone, summarize() will collapse a dataframe to a single row, but with a grouped dataframe it is collapsed down to one row per group. After you have finished with your grouped operations, use ungroup() to make sure the grouping doesn't unintentionally alter later operations.
boro_bldgs <- group_by(watchlist_bldgs, borough)
boro_bldgs <- summarise(boro_bldgs, total_units = sum(units))
boro_bldgs <- ungroup(boro_bldgs)
boro_bldgs
# A tibble: 5 x 2
borough total_units
<chr> <int>
1 BRONX 1662
2 BROOKLYN 2723
3 MANHATTAN 2458
4 QUEENS 237
5 STATEN ISLAND 112
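group_by() affects the other verbs too, not just summarise(). As a quick illustration (not part of the original workshop code, and using the made-up names boro_shares and share_of_boro_units), a grouped mutate() computes each building's share of its borough's total units while keeping one row per building:
boro_shares <- group_by(watchlist_bldgs, borough)
boro_shares <- mutate(boro_shares, share_of_boro_units = units / sum(units)) # sum(units) is computed within each borough
boro_shares <- ungroup(boro_shares)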
%>% (“pipe”)
As you can see above, when you want to make a series of changes to a dataframe you can end up repeating yourself a lot and overwriting a dataframe with each step. Thankfully there's a way to avoid this!
The beauty of dplyr is that all of the functions above take a dataframe as the first argument, and return an altered version of that dataframe as the result. This allows us to start with a dataframe and link together multiple functions so that they each make a change to the dataframe then pass it along to the next function. dplyr includes a special operator, %>% (pronounced “pipe”), that allows us to chain together these function calls. When reading the code for these pipelines you can read %>% as “then”.
This %>% takes the object on the left and passes it to the function on the right as the first argument.
For a simple example, let’s look at the function str_c(), which concatenates strings together. Instead of passing "a" and "b" as the first and second argument, we can use the %>% to “pipe” the "a" into the function as the first argument and the "b" becomes the second argument.
str_c("a", "b")
[1] "ab"
"a" %>% str_c("b")
[1] "ab"
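The same idea applies to the dplyr verbs we used above, so these two lines are equivalent ways of filtering to the Brooklyn buildings:
filter(watchlist_bldgs, borough == "BROOKLYN")
watchlist_bldgs %>% filter(borough == "BROOKLYN")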
Now let’s practice putting together some of these dplyr functions into a little data manipulation pipeline by getting some information about the landlords on the watchlist and the buildings they own in Brooklyn.
The long pipeline of these dplyr functions can seem overwhelming at first, but once you get familiar with the functions you’ll be able to read these code chunks like a little paragraph explaining the changes being made to a dataframe. To help illustrate this the following paragraph is a written explanation of every step of the accompanying block of code.
We'll start with the full watchlist_bldgs dataset, then "pipe" (%>%) it into the next function to filter the dataset to just buildings where the borough is "BROOKLYN". Then we mutate the dataset to add a new column called landlord_name that is simply a more nicely-formatted version of the existing landlord column. Then we select only the columns that we need: landlord_name, units, and HPD violations. Then we group_by the new landlord_name column, and then, with the dataset grouped, we'll summarize the data across all buildings for each landlord to get some summary information about each landlord and their buildings in Brooklyn. Specifically, we'll summarize to get the total number of buildings using the special n() function that counts the number of rows; we'll also get the total_units by summing the units across all buildings for each landlord, and we'll get the avg_bldg_size of each landlord's Brooklyn buildings by taking the mean of units across their buildings. Similarly, we get the sum and mean of HPD violations for each landlord. We've now gone from a dataset in which each row represents a building to one in which each row is a landlord. Since we are done with our grouped operations we can ungroup the data, then finally we can arrange the dataset in descending order of the number of buildings the landlord owns in Brooklyn. After all of this our final resulting dataset is assigned to a new dataframe we'll call bk_landlords.
bk_landlords <- watchlist_bldgs %>%
  filter(borough == "BROOKLYN") %>%
  mutate(landlord_name = str_to_title(landlord)) %>%
  select(landlord_name, units, violations) %>%
  group_by(landlord_name) %>%
  summarize(
    buildings = n(),
    total_units = sum(units),
    avg_bldg_size = mean(units),
    total_viol = sum(violations),
    avg_bldg_viol = mean(violations)
  ) %>%
  ungroup() %>%
  arrange(desc(buildings))
bk_landlords
# A tibble: 53 x 6
landlord_name buildings total_units avg_bldg_size total_viol
<chr> <int> <int> <dbl> <int>
1 Jonathan Coh… 18 149 8.28 974
2 Meir Fried 17 125 7.35 690
3 Deodat Lowtan 14 80 5.71 565
4 Cheryl Ighod… 11 68 6.18 413
5 Joel Kohn 10 60 6 312
6 Iskyo Aronov 9 37 4.11 495
7 Yosef Emergi 9 46 5.11 528
8 Chaim Schwar… 8 39 4.88 361
9 Isaac Schwar… 8 110 13.8 510
10 Jacob Rubin 8 52 6.5 349
# ... with 43 more rows, and 1 more variable: avg_bldg_viol <dbl>
ggplot2
Now let's visualize this new dataset we've created using the package ggplot2.
ggplot2 is designed to work with dataframe inputs, so the first step is always to use ggplot(data = your_dataframe). You can build plots step by step by adding layers with +. The second step is always aes(), which establishes the aesthetics of the plot by mapping columns from the dataframe to aesthetic elements of the plot. For example, here we are setting the x axis values to landlord names and y to the total number of HPD violations. After the aes is set, you can use one of the many geom_*() functions to transform the aesthetics into a type of plot. In this case we want a column plot, so we use geom_col(). Finally, we can label any part of our graph with labs().
ggplot(data = bk_landlords) +
  aes(x = landlord_name, y = total_viol) +
  geom_col() +
  labs(
    title = '"Worst" Landlords in Brooklyn',
    subtitle = "Total HPD Violations in All Buildings for 2017",
    x = NULL,
    y = "Number of Violations",
    caption = "Source: NYC Public Advocate's Landlord Watchlist"
  )
With only the defaults ggplot2 graphs tend to look pretty good, and are not too difficult to create. However, there are definitely some things we’ll want to improve with this graph. Luckily, there is a near-infinite amount of customization possible with ggplot2 to get the plot looking exactly the way you want.
To start, there are clearly too many landlords to display clearly in a graph like this, so we can use dplyr to arrange the data by violations and filter to keep only the top 10 landlords. The first landlord name doesn't match the same format as the others, so let's remove the " Properties" part using str_remove() from the stringr package. It would also be nice if the landlords were sorted in order of the number of violations. To achieve this we can change the landlord_name column from a string to R's factor datatype, which allows us to specify an ordering for the values. Specifically, we'll use the function fct_reorder() from the package forcats to make the column a factor and put the values in order based on the values of the total_viol column.
Now we can use this new dataframe with ggplot2 and make a few more changes to improve the graph further. One obvious problem with our initial graph is that the landlord names are completely illegible due to overlap. To solve this we can use the ggplot2 function coord_flip() to flip our bars sideways so we can read the labels more cleanly. Another smaller adjustment we can make is to format the violation count labels on our y-axis. To make changes to anything related to one of the aesthetic elements of a plot we can use one of the many scale_*_* functions. The first * is always one of the aes element types, and the second * indicates the type of data that is mapped to it. In our case we want to make a change to the y axis, and we've mapped our count of violations to y so it's a continuous scale, so the function we'll want to use is scale_y_continuous(). Within that function we'll use the formatting function comma from the scales package on our axis labels. Lastly, we can use one of the theme_* functions to apply some alternative styling to the plot. These functions provide some helpful preset styling, but you can make your own fine-tuned adjustments using theme(). This can get a bit overwhelming, but just to illustrate what's possible, here we'll remove the unnecessary lines on the plot, move the landlord name labels over a bit, and change the font of the caption.
landlord_bk_10_worst <- bk_landlords %>%
  arrange(desc(total_viol)) %>%
  filter(row_number() <= 10) %>%
  mutate(
    landlord_name = str_remove(landlord_name, " Properties"),
    landlord_name = fct_reorder(landlord_name, total_viol)
  )
ggplot(data = landlord_bk_10_worst) +
  aes(x = landlord_name, y = total_viol) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(margin = margin(r = -15)),
    plot.caption = element_text(face = "italic", color = "darkgrey", margin = margin(t = 10))
  ) +
  labs(
    title = '10 "Worst" Landlords in Brooklyn',
    subtitle = "Total HPD Violations in All Buildings for 2017",
    x = NULL,
    y = "Number of Violations",
    caption = "Source: NYC Public Advocate's Landlord Watchlist"
  )
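If you want to save the finished plot to an image file, ggplot2's ggsave() writes out the most recently displayed plot; the file name and dimensions here are just examples:
ggsave("bk_worst_landlords.png", width = 8, height = 5) # "bk_worst_landlords.png" is an example file name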
To learn more about our dataset of buildings we can join on additional information from other datasets. However, we are missing an important column that would allow us to join to other datasets. In NYC, properties are identified in many administrative datasets by their tax lot identifier known as Borough-Block-Lot (BBL), and having a BBL column opens up lots of datasets, but our list of buildings is missing BBL. Luckily we do have the building addresses, and this allows us to use geocoding to get BBL (and other geographic information) using NYC's Geoclient API.
To access the Geoclient API in R we can use the geoclient package. The first step is to set the API keys that allow us to use the API. For this we'll use geoclient_api_keys() to save our keys as environment variables so that we only have to input them once. The readRenviron() call just makes sure that after adding the new environment variables we have access to them right away without having to restart R. To make sure that we aren't including private API keys in any public files, we'll use the getPass package to create pop-up windows for us to input this information without writing it down in the code.
You can acquire your own Geoclient API credentials by first registering with NYC's Developer Portal, then creating a new project and selecting "Geoclient v1" from the available APIs.
library(geoclient) # remotes::install_github("austensen/geoclient")
geoclient_api_keys(
  id = getPass::getPass("geoclient ID:"),
  key = getPass::getPass("geoclient KEY:"),
  install = TRUE,
  overwrite = TRUE
)
readRenviron("~/.Renviron")
The Geoclient API allows for geocoding from a variety of input location types, and for each type there is an accompanying R function in the geoclient package. Here we want to geocode the building addresses, so we'll start with the dataset of buildings and pipe it into the function geo_address_data(), specifying all the address components that are required for the function. This will return a dataset of the same length and in the same order as the input dataset, containing all the many geographic identifiers and other pieces of information returned by the API. For our purposes we're primarily interested in the BBL column, but while we're at it let's also keep some other geographic IDs and the latitude/longitude for each property, renaming some variables as we select them.
geo_info <- watchlist_bldgs %>%
  geo_address_data(number, street, borough, zip) %>%
  select(
    bbl,
    bin = buildingIdentificationNumber,
    tract_2010 = censusTract2010,
    cd = communityDistrict,
    nta,
    lat = latitudeInternalLabel,
    lng = longitudeInternalLabel
  )
geo_info
# A tibble: 452 x 7
bbl bin tract_2010 cd nta lat lng
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 3008220039 3014544 " 76 " 307 BK32 40.6 -74.0
2 3008220041 3014545 " 76 " 307 BK32 40.6 -74.0
3 3008460011 3015608 " 74 " 307 BK32 40.6 -74.0
4 3004030030 3006430 " 69 " 306 BK38 40.7 -74.0
5 3033060024 3075649 " 419 " 304 BK78 40.7 -73.9
6 3031500031 3071890 " 389 " 304 BK78 40.7 -73.9
7 3015050001 3040558 " 377 " 303 BK35 40.7 -73.9
8 3004030037 3006432 " 69 " 306 BK38 40.7 -74.0
9 3015340038 3041728 " 369 " 316 BK79 40.7 -73.9
10 3027780003 3069286 " 495 " 301 BK90 40.7 -73.9
# ... with 442 more rows
Because the resulting dataframe is the exact same size and order as our input dataset we can simply bind the two datasets together side-by-side with dplyr’s bind_cols().
watchlist_bldgs_geos <- bind_cols(watchlist_bldgs, geo_info)
watchlist_bldgs_geos
# A tibble: 452 x 15
landlord address units violations number street borough zip bbl
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
1 Jonatha… 5416 4… 16 62 5416 4 AVE… BROOKL… 11220 3008…
2 Jonatha… 5422 4… 21 78 5422 4 AVE… BROOKL… 11220 3008…
3 Jonatha… 314 57… 17 70 314 57 ST… BROOKL… 11220 3008…
4 Jonatha… 372 BA… 8 76 372 BALTI… BROOKL… 11201 3004…
5 Jonatha… 166 BL… 6 41 166 BLEEC… BROOKL… 11221 3033…
6 Jonatha… 590 BU… 4 55 590 BUSHW… BROOKL… 11206 3031…
7 Jonatha… 165 HO… 8 39 165 HOWAR… BROOKL… 11233 3015…
8 Jonatha… 208 HO… 3 9 208 HOYT … BROOKL… 11217 3004…
9 Jonatha… 203 HU… 6 143 203 HULL … BROOKL… 11233 3015…
10 Jonatha… 318 HU… 7 102 318 HUMBO… BROOKL… 11211 3027…
# ... with 442 more rows, and 6 more variables: bin <chr>,
# tract_2010 <chr>, cd <chr>, nta <chr>, lat <dbl>, …
Now that we have BBL for all our buildings we can link to other bbl-level datasets. NYC-DB is an incredible resource of NYC housing datasets that all contain BBL.
We’ll start by connecting to the NYC-DB PostgreSQL database using the DBI package.
library(DBI)
con <- dbConnect(
  RPostgres::Postgres(),
  user = getPass::getPass("NYC-DB USER:"),
  password = getPass::getPass("NYC-DB PASSWORD:"),
  host = getPass::getPass("NYC-DB HOST:"),
  port = 5432,
  dbname = "nycdb"
)
To see all the tables available to use in NYC-DB we can use dbListTables().
dbListTables(con)
[1] "acris_country_codes"
[2] "personal_property_legals"
[3] "acris_document_control_codes"
[4] "acris_property_type_codes"
[5] "acris_ucc_collateral_codes"
[6] "dob_complaints"
[7] "dobjobs"
[8] "dof_sales"
[9] "hpd_business_addresses"
[10] "hpd_complaint_problems"
[11] "hpd_complaints"
[12] "hpd_contacts"
[13] "hpd_corporate_owners"
[14] "hpd_registrations"
[15] "hpd_registrations_grouped_by_bbl"
[16] "hpd_registrations_grouped_by_bbl_with_contacts"
[17] "hpd_violations"
[18] "marshal_evictions_17"
[19] "personal_property_master"
[20] "personal_property_parties"
[21] "personal_property_references"
[22] "personal_property_remarks"
[23] "pluto_18v1"
[24] "real_property_legals"
[25] "real_property_master"
[26] "real_property_parties"
[27] "real_property_references"
[28] "real_property_remarks"
[29] "rentstab"
A great thing about dplyr is that it not only works on regular dataframes, but you can also use it with tables in a SQL database. To start working with a table in dplyr you use tbl() to connect to that table.
hpd_viol <- tbl(con, "hpd_violations")
This will now act like a regular dataframe, and we can print it to see the first few rows. What it’s actually doing is querying the table in the database and only retrieving these few preview rows and converting the data into an R dataframe.
hpd_viol
# Source: table<hpd_violations> [?? x 40]
# Database: postgres
violationid buildingid registrationid boroid borough housenumber
<int> <int> <int> <chr> <chr> <chr>
1 11884817 51282 201932 2 BRONX 2707
2 11884815 694627 402586 4 QUEENS 801
3 11884819 333183 347984 3 BROOKL… 423
4 12312611 735531 14351 5 STATEN… 114
5 11884821 51282 201932 2 BRONX 2707
6 11530875 697656 406497 4 QUEENS 17-37
7 11884823 63112 200727 2 BRONX 352
8 11884824 333183 347984 3 BROOKL… 423
9 11884825 333183 347984 3 BROOKL… 423
10 11884828 333183 347984 3 BROOKL… 423
# ... with more rows, and 34 more variables: lowhousenumber <chr>,
# highhousenumber <chr>, streetname <chr>, streetcode <chr>,
# postcode <chr>, …
Just like before we can also use glimpse() to get a better sense of the contents of the dataset.
glimpse(hpd_viol)
Observations: ??
Variables: 40
$ violationid <int> 11884817, 11884815, 11884819, 12312...
$ buildingid <int> 51282, 694627, 333183, 735531, 5128...
$ registrationid <int> 201932, 402586, 347984, 14351, 2019...
$ boroid <chr> "2", "4", "3", "5", "2", "4", "2", ...
$ borough <chr> "BRONX", "QUEENS", "BROOKLYN", "STA...
$ housenumber <chr> "2707", "801", "423", "114", "2707"...
$ lowhousenumber <chr> "2707", "801", "423", "114", "2707"...
$ highhousenumber <chr> "2707", "801", "423", "114", "2707"...
$ streetname <chr> "BRIGGS AVENUE", "SENECA AVENUE", "...
$ streetcode <chr> "14620", "61890", "58930", "27990",...
$ postcode <chr> "10458", "11385", "11233", "10314",...
$ apartment <chr> "IB", NA, "2A", "1", "IB", NA, "3G"...
$ story <chr> "2", NA, "2", "1", "2", NA, "3", "2...
$ block <int> 3300, 3463, 1516, 1632, 3300, 3564,...
$ lot <int> 56, 12, 48, 108, 56, 33, 16, 48, 48...
$ class <chr> "B", "B", "A", "B", "A", "I", "C", ...
$ inspectiondate <date> 2017-07-18, 2017-07-18, 2017-07-18...
$ approveddate <date> 2017-07-19, 2017-07-18, 2017-07-22...
$ originalcertifybydate <date> 2017-09-07, 2017-09-06, 2017-11-10...
$ originalcorrectbydate <date> 2017-08-24, 2017-08-23, 2017-10-27...
$ newcertifybydate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ newcorrectbydate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ certifieddate <date> NA, 2017-10-06, NA, NA, NA, NA, 20...
$ ordernumber <chr> "508", "513", "594", "949", "556", ...
$ novid <int> 5805072, 5804390, 5807912, 6054008,...
$ novdescription <chr> "SECTION 27-2005 ADM CODE REPAIR T...
$ novissueddate <date> 2017-07-20, 2017-07-19, 2017-07-24...
$ currentstatusid <int> 2, 4, 2, 2, 2, 9, 19, 2, 2, 2, 2, 2...
$ currentstatus <chr> "NOV SENT OUT", "NOV CERTIFIED LATE...
$ currentstatusdate <date> 2017-07-20, 2017-10-06, 2017-07-24...
$ novtype <chr> "Original", "Original", "Original",...
$ violationstatus <chr> "Open", "Open", "Open", "Open", "Op...
$ latitude <dbl> 40.86633, 40.70143, 40.68163, 40.61...
$ longitude <dbl> -73.89170, -73.90535, -73.91234, -7...
$ communityboard <chr> "7", "5", "16", "2", "7", "5", "1",...
$ councildistrict <int> 15, 34, 41, 50, 15, 34, 8, 41, 41, ...
$ censustract <chr> "40501", "551", "371", "29104", "40...
$ bin <int> 2017013, 4082786, 3041170, 5032029,...
$ bbl <chr> "2033000056", "4034630012", "301516...
$ nta <chr> "Bedford Park-Fordham North", "Ridg...
We already have the total count of HPD violations in 2017 from the original dataset, but with NYC-DB we get more detailed information. Let's get the count of heat violations for the 2017-2018 heat season. Just like before we can build up our data manipulation using dplyr functions in a pipeline, and behind the scenes dplyr translates all these functions into SQL queries. We can use all the main dplyr functions, but for some specific tasks, like writing conditions for filter(), there may be things that aren't supported by dplyr's SQL translation. Luckily, as a backup we can use the sql() function to insert raw SQL code into parts of our dplyr functions. Sometimes this is easier even when translations are possible. For example, below we use some SQL code borrowed from another project to identify heat violations based on the notice-of-violation description text and to filter to the months of the heat season. We can paste that SQL code right into our filter statement along with our regular R code to filter for just our list of BBLs.
We can use all the same dplyr functions to work with the table on the database, and when you print the result it will just be a preview. You can do this to figure out how you want to extract the data, and then use collect() to download the extract into a local dataframe in R. Behind the scenes dplyr is translating all these functions into SQL queries, and you can see what it's doing with show_query().
watchlist_bbls <- pull(watchlist_bldgs_geos, bbl)
violation_bbls <- hpd_viol %>%
  filter(
    bbl %in% watchlist_bbls, # only get the BBLs we need
    sql("novdescription ~ '27-20(2[8-9]|3[0-3])'"), # heat violations
    sql("novissueddate between '2017-10-01' and '2018-05-31'") # 2017-2018 heat season
  ) %>%
  group_by(bbl) %>%
  summarise(heat_viols = n())
violation_bbls
# Source: lazy query [?? x 2]
# Database: postgres
bbl heat_viols
<chr> <S3: integer64>
1 1007650017 1
2 1010330045 2
3 1010370049 1
4 1010410004 1
5 1010420009 2
6 1010550040 4
7 1011420141 1
8 1012430036 1
9 1013260037 1
10 1014400023 4
# ... with more rows
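As mentioned above, you can inspect the SQL that dplyr generated for this pipeline with show_query() (the exact query text it prints will depend on your dplyr/dbplyr versions):
show_query(violation_bbls)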
Just like before when we printed the database table, when we print the result of this dplyr pipeline we are still only getting a preview of the full results. Once we are happy with how our query is looking and we're ready to download the full results into R, we can use collect(). Then we can join the heat violations onto our existing dataframe of watchlist buildings. Note that inner_join() keeps only the buildings that matched at least one heat violation; if you wanted to keep every building you could use left_join() instead.
heat_violations <- collect(violation_bbls)
watchlist_bldgs_geos_heat <- inner_join(watchlist_bldgs_geos, heat_violations, by = "bbl")
We won't be able to cover spatial data in R in this workshop, but as a final example and preview of some of R's spatial functionality, we can make a simple interactive map to visualize and explore the watchlist buildings.
There is a growing collection of packages for working with spatial data in R, largely organized around the r-spatial group and the core package sf. sf has similar conventions as PostGIS and enables you to create spatial dataframes that are compatible with all the same dplyr functions we’ve been using above.
Here we'll just use st_as_sf() to convert our data into a spatial sf-dataframe by creating point geometries from the latitude and longitude coordinates we got from Geoclient, and because we are working with lat/lng coordinates we'll set the coordinate reference system (crs) to 4326. Now when we print the dataframe you'll see some additional metadata.
library(sf)
watchlist_bldgs_spatial <- st_as_sf(watchlist_bldgs_geos_heat, coords = c("lng", "lat"), crs = 4326)
watchlist_bldgs_spatial
Simple feature collection with 117 features and 14 fields
geometry type: POINT
dimension: XY
bbox: xmin: -74.08035 ymin: 40.59765 xmax: -73.79844 ymax: 40.89073
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
# A tibble: 117 x 15
landlord address units violations number street borough zip bbl
<chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr>
1 Jonatha… 5416 4… 16 62 5416 4 AVE… BROOKL… 11220 3008…
2 Jonatha… 5422 4… 21 78 5422 4 AVE… BROOKL… 11220 3008…
3 Jonatha… 166 BL… 6 41 166 BLEEC… BROOKL… 11221 3033…
4 Jonatha… 590 BU… 4 55 590 BUSHW… BROOKL… 11206 3031…
5 Jonatha… 203 HU… 6 143 203 HULL … BROOKL… 11233 3015…
6 Jonatha… 196 KI… 6 19 196 KINGS… BROOKL… 11222 3026…
7 Jonatha… 20 MEN… 6 51 20 MENAH… BROOKL… 11221 3033…
8 Jonatha… 321 ME… 6 23 321 MENAH… BROOKL… 11237 3033…
9 Jonatha… 234 SC… 6 44 234 SCHEN… BROOKL… 11213 3013…
10 Jonatha… 128 SU… 6 22 128 SUMPT… BROOKL… 11233 3015…
# ... with 107 more rows, and 6 more variables: bin <chr>,
# tract_2010 <chr>, cd <chr>, nta <chr>, heat_viols <S3:
# integer64>, …
There is a leaflet package that provides an R interface for creating interactive web maps with leaflet.js. We won't get into all the details of that package now, but will instead use the mapview package, which quickly makes a simple leaflet map for you and takes care of all the details. This is very useful for exploring the data when you aren't yet concerned about the visual presentation.
library(mapview)
mapview(watchlist_bldgs_spatial)
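For comparison, here is a minimal sketch (not part of the workshop code) of what the same map might look like using the leaflet package directly; it assumes a recent version of leaflet, which accepts sf point dataframes:
library(leaflet)
leaflet(watchlist_bldgs_spatial) %>%
  addTiles() %>% # add a basemap
  addCircleMarkers(popup = ~address) # one marker per building, with the address as a clickable popup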