All the code for this article is available in this RStudio Cloud project and in this GitHub repo

R

R is a free and open-source software environment for statistical computing and graphics. It has a more narrow set of uses and a smaller user base compared python, but because it is specifically designed for data analysis it is a great language for data analysts to learn. In recent years it has become increasingly popular and has also become much easier to learn, especially for those new to coding.

There are three main reasons that I enjoying using R myself and teaching it to others.

R has an amazing community of users that have produced wealth of user friendly guides, documentation, and other tools and resources to support people learning the language.
RStudio is an incredibly helpful application (Integrated Development Environment) in which you can work with R.
A wide ecosystem of user-written packages that provide tools for almost every possible use case. Especially important is the collection of packages known as the Tidyverse that prioritize good design and documentation that make it easy to learn R.

R Community Resources

Here is just a small sample of some of the great resources available for learning R:

R for Data Science, a wonderful free book that provides a good introduction to R for data analysis
Tidyverse packages website provides comprehensive documentation for all packages, including helpful guides and examples
Stat545, a university course on data science with R that shares most of the materials
RStudio webinars, online learning materials, and cheatsheets
RStudio Community, a friendlier version of StackOverflow dedicated to R and RStudio packages
StackOverflow for R
DataCamp courses on R

RStudio

The RStudio Desktop application is free to download. They also now provide Rstudio Cloud as a free service that works just like RStudio Desktop but it is all accessed through the browser and requires no installation. We’ll be using RStudio Cloud this for this workshop, and you can get started by opening this RStudio Cloud project and making a permanent copy in your own work space.

This is the default RStudio interface. * The top left pane contains all your code files. This is where you can write and save all the code for your analysis. * The bottom left pane has the R console where you can run individual pieces of R code, such as quick calculations, printing objects, or anything else that you don’t need to save in your final files. * The top right pane contains a list of all the objects currently in your environment. If you click on a dataframe object it will open in a viewer in the top left pane where you can browse, sort, and filter your view of the data (without altering the object itself) * The bottom right pane contains a few important tabs: the plot viewer where any graphs you create will appear, the files explorer, and the help page

Packages in R

Installing and Loading Packages

R is an open-source language so in addition to the basic functions that come standard with R (referred to as Base R) there are more than 10,000 user written packages that can accomplish virtually any task in R. There is an official repository for these packages called CRAN that does some vetting of the quality of packages, and packages from here can be installed directly from R using:


install.packages("PackageName")

These packages only need to be installed like this once, and after that initial installation we only need to load the packages that we want use for each analysis with library().

Tidyverse Packages

All of the packages we are using here are part of a collection of R packages referred to as the tidyverse.

The Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

All of these packages are extremely well-maintained and have helpful websites that include, examples and guides, function documentation, cheatsheets, and links to the GitHub repos where the packages are developed.

The following are the core set of Tidyverse packages, but there are many more.

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf)
tidyr helps you create tidy data. Tidy data is data where: Each variable is in a column, each observation is a row, and each value is a cell
stringr provides a cohesive set of functions designed to make working with strings as easy as possible
forcats provides a suite of useful tools that solve common problems with factors
purrr is a complete and consistent set of tools for working with functions and vectors
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics

In addition to the package websites, there is an amazing free book that covers how to use all these packages to do data analysis, called R for Data Science.

There is a tidyverse package that loads all seven of these core packages.


library(dplyr) # manipulate dataframes
library(readr) # read/write dataframes
library(tidyr) # reshapeing dataframes
library(stringr) # string manipulation
library(forcats) # factor manipulation
library(purrr) # iteration (instead of loops)
library(ggplot2) # making plots

Landlord Watchlist Buildings

We’ll be using a dataset of buildings from the Public Advocate’s NYC Landlord Watchlist.

Import and Preview Dataset

read_csv() guesses about the data type of each column, and gives you the column specification is used. Often times this will be what you want, but if you want to override the guesses you can supply your own specification.

In R the (admittedly quirky) convention is to use <- as the assignment operator instead of =.


watchlist_bldgs <- read_csv("data/landlord-watchlist-buildings_2018-12-08.csv")


watchlist_bldgs <- read_csv(
  file = "data/landlord-watchlist-buildings_2018-12-08.csv",
  col_types = cols(
    .default = col_character(),
    units = col_integer(),
    violations = col_integer()
  )
)

Now let’s take a look at this new dataframe that we’ve imported. You can print the dataframe to get a simple preview.

In R when you run code with just an object it is printed to the console the same as if the print() function was used.


watchlist_bldgs


# A tibble: 452 x 8
   landlord    address   units violations number street borough zip  
   <chr>       <chr>     <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonathan C… 5416 4 A…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonathan C… 5422 4 A…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonathan C… 314 57 S…    17         70 314    57 ST… BROOKL… 11220
 4 Jonathan C… 372 BALT…     8         76 372    BALTI… BROOKL… 11201
 5 Jonathan C… 166 BLEE…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonathan C… 590 BUSH…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonathan C… 165 HOWA…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonathan C… 208 HOYT…     3          9 208    HOYT … BROOKL… 11217
 9 Jonathan C… 203 HULL…     6        143 203    HULL … BROOKL… 11233
10 Jonathan C… 318 HUMB…     7        102 318    HUMBO… BROOKL… 11211
# ... with 442 more rows

When simply printing the dataframe you’ll only see a few rows and as many columns as fit nicely on your screen. When you have many columns it’s often helpful to use the function glimpse() to see a list of all your columns.


glimpse(watchlist_bldgs)


Observations: 452
Variables: 8
$ landlord   <chr> "Jonathan Cohen-Silvershore Properties", "Jona...
$ address    <chr> "5416 4 AVENUE, BROOKLYN 11220", "5422 4 AVENU...
$ units      <int> 16, 21, 17, 8, 6, 4, 8, 3, 6, 7, 6, 8, 6, 6, 7...
$ violations <int> 62, 78, 70, 76, 41, 55, 39, 9, 143, 102, 19, 5...
$ number     <chr> "5416", "5422", "314", "372", "166", "590", "1...
$ street     <chr> "4 AVENUE", "4 AVENUE", "57 STREET", "BALTIC S...
$ borough    <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "BROOKLYN"...
$ zip        <chr> "11220", "11220", "11220", "11201", "11221", "...

In RStudio you can also use View() to open an window where you can interactively view your dataframe, and even sort and filter that view (without changing the dataframe).

Data Manipulation with `dplyr`

The package dplyr contains functions for basic data manipulation. It is organized around 5 main functions that take a dataframe and manipulate it in some way. The functions are named as verbs which help explain what they do.

filter() - filter to the rows you want to keep based on conditions
select() - select columns you want to keep
arrange() - sort dataframe by a column
mutate() - adds new columns
summarise() - collapse multiple rows down to a single one

Every one of these functions takes a dataframe as the first argument and returns an altered version of that dataframe.

Inside of these functions columns are referred to with just their names without quotes.

`filter()`

Use filter() find rows/cases where conditions are true. Rows where the condition evaluates to NA are dropped.


bk_bldgs <- filter(watchlist_bldgs, borough == "BROOKLYN")

bk_bldgs


# A tibble: 256 x 8
   landlord    address   units violations number street borough zip  
   <chr>       <chr>     <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonathan C… 5416 4 A…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonathan C… 5422 4 A…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonathan C… 314 57 S…    17         70 314    57 ST… BROOKL… 11220
 4 Jonathan C… 372 BALT…     8         76 372    BALTI… BROOKL… 11201
 5 Jonathan C… 166 BLEE…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonathan C… 590 BUSH…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonathan C… 165 HOWA…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonathan C… 208 HOYT…     3          9 208    HOYT … BROOKL… 11217
 9 Jonathan C… 203 HULL…     6        143 203    HULL … BROOKL… 11233
10 Jonathan C… 318 HUMB…     7        102 318    HUMBO… BROOKL… 11211
# ... with 246 more rows

Multiple conditions are combined with &.


bk_big_bldgs <- filter(watchlist_bldgs, units > 10, borough == "QUEENS")

bk_big_bldgs


# A tibble: 4 x 8
  landlord    address   units violations number street  borough zip  
  <chr>       <chr>     <int>      <int> <chr>  <chr>   <chr>   <chr>
1 Jonathan C… 1708 SUM…    39        116 1708   SUMMER… QUEENS  11385
2 RONALD SWA… 34-01 28…    41        157 34-01  28 AVE… QUEENS  11103
3 RONALD SWA… 47-02 88…    29         98 47-02  88 STR… QUEENS  11373
4 HILLDISDE … 87-40 16…   119        383 87-40  165 ST… QUEENS  11432

`select()`

Use select() to keep or drop columns. You can either specify a set of variables to keep by listing them, or specify columns to be dropped with -.


select(watchlist_bldgs, landlord, borough, units)


# A tibble: 452 x 3
   landlord                              borough  units
   <chr>                                 <chr>    <int>
 1 Jonathan Cohen-Silvershore Properties BROOKLYN    16
 2 Jonathan Cohen-Silvershore Properties BROOKLYN    21
 3 Jonathan Cohen-Silvershore Properties BROOKLYN    17
 4 Jonathan Cohen-Silvershore Properties BROOKLYN     8
 5 Jonathan Cohen-Silvershore Properties BROOKLYN     6
 6 Jonathan Cohen-Silvershore Properties BROOKLYN     4
 7 Jonathan Cohen-Silvershore Properties BROOKLYN     8
 8 Jonathan Cohen-Silvershore Properties BROOKLYN     3
 9 Jonathan Cohen-Silvershore Properties BROOKLYN     6
10 Jonathan Cohen-Silvershore Properties BROOKLYN     7
# ... with 442 more rows


select(watchlist_bldgs, -landlord)


# A tibble: 452 x 7
   address            units violations number street    borough zip  
   <chr>              <int>      <int> <chr>  <chr>     <chr>   <chr>
 1 5416 4 AVENUE, BR…    16         62 5416   4 AVENUE  BROOKL… 11220
 2 5422 4 AVENUE, BR…    21         78 5422   4 AVENUE  BROOKL… 11220
 3 314 57 STREET, BR…    17         70 314    57 STREET BROOKL… 11220
 4 372 BALTIC STREET…     8         76 372    BALTIC S… BROOKL… 11201
 5 166 BLEECKER STRE…     6         41 166    BLEECKER… BROOKL… 11221
 6 590 BUSHWICK AVEN…     4         55 590    BUSHWICK… BROOKL… 11206
 7 165 HOWARD AVENUE…     8         39 165    HOWARD A… BROOKL… 11233
 8 208 HOYT STREET, …     3          9 208    HOYT STR… BROOKL… 11217
 9 203 HULL STREET, …     6        143 203    HULL STR… BROOKL… 11233
10 318 HUMBOLDT STRE…     7        102 318    HUMBOLDT… BROOKL… 11211
# ... with 442 more rows

You can rename the columns that you are selecting within select(), or use rename() which keeps all columns.


select(watchlist_bldgs, borough_name = borough)


# A tibble: 452 x 1
   borough_name
   <chr>       
 1 BROOKLYN    
 2 BROOKLYN    
 3 BROOKLYN    
 4 BROOKLYN    
 5 BROOKLYN    
 6 BROOKLYN    
 7 BROOKLYN    
 8 BROOKLYN    
 9 BROOKLYN    
10 BROOKLYN    
# ... with 442 more rows


rename(watchlist_bldgs, landlord_name = landlord)


# A tibble: 452 x 8
   landlord_name address units violations number street borough zip  
   <chr>         <chr>   <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonathan Coh… 5416 4…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonathan Coh… 5422 4…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonathan Coh… 314 57…    17         70 314    57 ST… BROOKL… 11220
 4 Jonathan Coh… 372 BA…     8         76 372    BALTI… BROOKL… 11201
 5 Jonathan Coh… 166 BL…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonathan Coh… 590 BU…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonathan Coh… 165 HO…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonathan Coh… 208 HO…     3          9 208    HOYT … BROOKL… 11217
 9 Jonathan Coh… 203 HU…     6        143 203    HULL … BROOKL… 11233
10 Jonathan Coh… 318 HU…     7        102 318    HUMBO… BROOKL… 11211
# ... with 442 more rows

`mutate()`

Use mutate() to add new columns to a dataset. mutate() keeps all the existing columns and adds new one to the end of the dataset, and the variant transmute() creates new columns but keeps only the new ones.


mutate(watchlist_bldgs, landlord_lower = str_to_lower(landlord))


# A tibble: 452 x 9
   landlord address units violations number street borough zip  
   <chr>    <chr>   <int>      <int> <chr>  <chr>  <chr>   <chr>
 1 Jonatha… 5416 4…    16         62 5416   4 AVE… BROOKL… 11220
 2 Jonatha… 5422 4…    21         78 5422   4 AVE… BROOKL… 11220
 3 Jonatha… 314 57…    17         70 314    57 ST… BROOKL… 11220
 4 Jonatha… 372 BA…     8         76 372    BALTI… BROOKL… 11201
 5 Jonatha… 166 BL…     6         41 166    BLEEC… BROOKL… 11221
 6 Jonatha… 590 BU…     4         55 590    BUSHW… BROOKL… 11206
 7 Jonatha… 165 HO…     8         39 165    HOWAR… BROOKL… 11233
 8 Jonatha… 208 HO…     3          9 208    HOYT … BROOKL… 11217
 9 Jonatha… 203 HU…     6        143 203    HULL … BROOKL… 11233
10 Jonatha… 318 HU…     7        102 318    HUMBO… BROOKL… 11211
# ... with 442 more rows, and 1 more variable: landlord_lower <chr>


transmute(watchlist_bldgs, violations_per_unit = violations / units)


# A tibble: 452 x 1
   violations_per_unit
                 <dbl>
 1                3.88
 2                3.71
 3                4.12
 4                9.5 
 5                6.83
 6               13.8 
 7                4.88
 8                3   
 9               23.8 
10               14.6 
# ... with 442 more rows

`arrange()`

Use arrange() to add order the rows in your dataset by the values of one or more columns. Be default they will be in ascending order, and you can use desc() for descending order.


arrange(watchlist_bldgs, landlord, desc(units))


# A tibble: 452 x 8
   landlord  address    units violations number street  borough zip  
   <chr>     <chr>      <int>      <int> <chr>  <chr>   <chr>   <chr>
 1 ADAM STR… 150 WEST …    31         92 150    WEST 1… BRONX   10468
 2 ADAM STR… 19 EAST 1…    25        105 19     EAST 1… MANHAT… 10035
 3 ADAM STR… 611 WEST …    22         78 611    WEST 1… MANHAT… 10031
 4 ADAM STR… 535 EDGEC…    21         96 535    EDGECO… MANHAT… 10032
 5 ADAM STR… 480 CONVE…    15         76 480    CONVEN… MANHAT… 10031
 6 ADAM STR… 616 EAST …    13         62 616    EAST 1… BRONX   10458
 7 ADAM STR… 2372 ARTH…    12         47 2372   ARTHUR… BRONX   10458
 8 ADAM STR… 2550 CRES…    11         67 2550   CRESTO… BRONX   10468
 9 ADAM STR… 293 PLEAS…    10         49 293    PLEASA… MANHAT… 10029
10 ADAM STR… 2405 BEAU…     9         33 2405   BEAUMO… BRONX   10458
# ... with 442 more rows

`summarize()`

You can use summarize() on a dataset to collapse down all the rows to a single row to calculate an aggregate statistic of one or more columns. It works in a similar way as mutate(), except whereas in mutate you can create new columns that are the same length as your existing dataset, with summarise() you will sum some sort of aggregate function (like sum()) that takes a column of multiple values and returns only one value.


summarise(watchlist_bldgs, total_units = sum(units))


# A tibble: 1 x 1
  total_units
        <int>
1        7192

`group_by()`

The 6th function is group_by() and this doesn’t change the contents of your dataframe, but instead affects how all of the above functions work if they are subsequently called on the dataframe. After a dataframe has been grouped by one or more columns, all functions apply to each group of rows in the dataset as if it was it’s own dataset. group_by() is most commonly used with summarize. Alone summarize() will collapse a dataframe to a single row, but with a grouped dataframe it is collapsed down to one row per group. After you have finished with your grouped operations use ungroup() to make sure that it doesn’t unintentionally alter later operations.


boro_bldgs <- group_by(watchlist_bldgs, borough)
boro_bldgs <- summarise(boro_bldgs, total_units = sum(units))
boro_bldgs <- ungroup(boro_bldgs)

boro_bldgs


# A tibble: 5 x 2
  borough       total_units
  <chr>               <int>
1 BRONX                1662
2 BROOKLYN             2723
3 MANHATTAN            2458
4 QUEENS                237
5 STATEN ISLAND         112

Data manipulation pipelines with `%>%` (“pipe”)

As you can see above when you want to make a series of changes to a dataframe you can end up repeating yourself a lot and overwriting a dataframe with each step. Thankfully there’s a way to avoid this!

The beauty of dplyr is that all of the functions above take a dataframe as the first argument, and return an altered version of that dataframe as the result. This allows us to start with a dataframe and link together multiple functions so that they each make a change to the dataframe then pass it along to the next function. dplyr includes a special operator, %>% (pronounced “pipe”), that allows us to chain together these function calls. When reading the code for these pipelines you can read %>% as “then”.

This %>% takes the object on the left and passes it to the function on the right as the first argument.

For a simple example, let’s look at the function str_c(), which concatenates strings together. Instead of passing "a" and "b" as the first and second argument, we can use the %>% to “pipe” the "a" into the function as the first argument and the "b" becomes the second argument.


str_c("a", "b")


[1] "ab"


"a" %>% str_c("b")


[1] "ab"

Now let’s practice putting together some of these dplyr functions into a little data manipulation pipeline by getting some information about the landlords on the watchlist and the buildings they own in Brooklyn.

The long pipeline of these dplyr functions can seem overwhelming at first, but once you get familiar with the functions you’ll be able to read these code chunks like a little paragraph explaining the changes being made to a dataframe. To help illustrate this the following paragraph is a written explanation of every step of the accompanying block of code.

We’ll start with the full watchlist_bldgs dataset, then “pipe” (%>%) it into the next function to filter the dataset to just buildings where the borough is "Brooklyn". Then we mutate the dataset to add a new column called landlord_name that is simply a more nicely-formatted version of the existing landlord column. Then we select only the columns that we need: landlord_name, units, and HPD violations. Then we group_by the new landlord_name column, and then, with the dataset grouped, we’ll summarize the data across all buildings for each landlord to get some summary information about each landlord and their buildings in Brooklyn. Specifically, we’ll summarize to get the total number of buildings using the special n() function that counts the number of rows, we’ll also get the total_units by summing the units across all buildings for each landlord, and we’ll get the avg_bldg_size of each landlord’s Brooklyn buildings by taking the mean of units across their buildings. Similarly, we get the sum and mean of HPD violations for each landlord. We’ve now gone from a dataset in which each row represents a building to one in which each row is a landlord. Since we are done with our grouped operations we can ungroup the data, then finally we can arrange the dataset in descending order of the number of buildings the landlord owns in Brooklyn. After all of this our final resulting dataset is assigned to a new dataframe we’ll call bk_landlords.


bk_landlords <- watchlist_bldgs %>% 
  filter(borough == "BROOKLYN") %>% 
  mutate(landlord_name = str_to_title(landlord)) %>% 
  select(landlord_name, units, violations) %>% 
  group_by(landlord_name) %>% 
  summarize(
    buildings = n(),
    total_units = sum(units),
    avg_bldg_size = mean(units),
    total_viol = sum(violations),
    avg_bldg_viol = mean(violations)
  ) %>% 
  ungroup() %>% 
  arrange(desc(buildings))

bk_landlords


# A tibble: 53 x 6
   landlord_name buildings total_units avg_bldg_size total_viol
   <chr>             <int>       <int>         <dbl>      <int>
 1 Jonathan Coh…        18         149          8.28        974
 2 Meir Fried           17         125          7.35        690
 3 Deodat Lowtan        14          80          5.71        565
 4 Cheryl Ighod…        11          68          6.18        413
 5 Joel Kohn            10          60          6           312
 6 Iskyo Aronov          9          37          4.11        495
 7 Yosef Emergi          9          46          5.11        528
 8 Chaim Schwar…         8          39          4.88        361
 9 Isaac Schwar…         8         110         13.8         510
10 Jacob Rubin           8          52          6.5         349
# ... with 43 more rows, and 1 more variable: avg_bldg_viol <dbl>

Making graphs with `ggplot2`

Now let’s visualize this new dataset we’ve created using the package ggplot2.

ggplot2 is designed to work with dataframe inputs, so the first step is always to use ggplot(data = your_dataframe). You can build plots step by step by adding layers with +. The second step is always aes(), which establishes the aesthetics of the plot by mapping columns from the dataframe to aesthetic elements of the plot. For example, here we are setting the x axis values to landlord names and y to the total number of HPD violations. After the aes is set, you can use one of the many geom_*() functions to transform the aesthetics into a type of plot. In this case we want a column plot, so we use geom_column(). Finally, we can label any part of our graph with labs().


ggplot(data = bk_landlords) +
  aes(x = landlord_name, y = total_viol) +
  geom_col() +
  labs(
    title = '"Worst" Landlords in Brooklyn',
    subtitle = "Total HPD Violations in All Buildings for 2017",
    x = NULL,
    y = "Number of Violations",
    caption = "Source: NYC Public Advocate's Landlord Watchlist"
  )

With only the defaults ggplot2 graphs tend to look pretty good, and are not too difficult to create. However, there are definitely some things we’ll want to improve with this graph. Luckily, there is a near-infinite amount of customization possible with ggplot2 to get the plot looking exactly the way you want.

To start, there are clearly too many landlords to display clearly in a graph like this, so we can use dplyr toarrange the data by violations and filter to keep only the top 10 landlords. The first landlord name doesn’t match the same format as the other, so let’s remove the " Properties" part using str_remove from the stringr package. It would also be nice if the landlords were sorted in order of the number of violations. To achieve this we can change the landlord_name column from a string to instead use R’s factor datatype, which allows us to specify an ordering to the values. Specifically, we’ll use the function fct_reorder() from the package forcats to make the column a factor and put the values in order based on the values of the total_viol column.

Now we can use this new dataframe with ggplot2 and make a few more changes to improve the graph further. One obvious problem with our initial graph is that the landlord names are completely illegible due to overlap. To solve this we can use the ggplot2 function coord_flip() to flip our bars sideways so we can read the labels more cleanly. Another smaller adjustment we can make it to format the violation count labels on our y-axis. To make changes to anything related to one of the aesthetic elements of a plot we can use one of the many scale_*_* functions. The first * is always one of the aes element types, and the second * indicates the type of data that is mapped to it. In our case we want to make a change to the y axis and we’ve mapped our count of violations to y so it a continuous scale, so the function we’ll want to use is scale_y_continuous(). Now within that function we’ll want to use the formatting function comma from the scales package on our axis labels. Lastly, we can use one of the theme_* functions to apply some alternative styling to the plot. These functions provide you some helpful preset styling, but you can make your own fine-tuned adjustments using theme(). This can get a bit overwhelming, but just to illustrate what’s possible, here we’ll remove the unnecessary lines on the plot, move the landlord name labels over a bit, and change the font of the caption.

If you have a package already installed, you can use a function from it without loading it with library() by using package::function(). Here we are doing this for scales::comma because we only use this single function from the package once.


landlord_bk_10_worst <- bk_landlords %>% 
  arrange(desc(total_viol)) %>% 
  filter(row_number() <= 10) %>% 
  mutate(
    landlord_name = str_remove(landlord_name, " Properties"),
    landlord_name = fct_reorder(landlord_name, total_viol)
  )

ggplot(data = landlord_bk_10_worst) +
  aes(x = landlord_name, y = total_viol) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(margin = margin(r = -15)),
    plot.caption = element_text(face = "italic", color = "darkgrey", margin = margin(t = 10))
  ) +
  labs(
    title = '10 "Worst" Landlords in Brooklyn',
    subtitle = "Total HPD Violations in All Buildings for 2017",
    x = NULL,
    y = "Number of Violations",
    caption = "Source: NYC Public Advocate's Landlord Watchlist"
  )

Geocoding addresses with Geoclient API

To learn more about out our dataset of buildings we can join on additional information from other datasets. However we are missing an important column to allow us to join to other datasets. In NYC properties are identified in many administrative datasets by their tax lot identifiers know as Borough-Block-Lot (BBL), and if you have a BBL column it opens up lots of datasets bu out list of buildings is missing BBL. Luckily we do have the building addresses, and this allows us to use geocoding to get BBL (and other geographic information) using NYC’s Geoclient API.

To access the Geoclient API in R we can use the geoclient package. The first step is to set your API keys that allow us to use the API, for this we’ll use geoclient_api_keys to set environment variables with our keys to that we only have to input the once. The readRenviron() just makes sure that after adding the new environment variables we have access to them now without having to restart R. To make sure that we aren’t including private API keys in any public files we’ll use the getPass package to create pop-up windows for us to input this information without writing it down in the code.

You can acquire your own Geoclient API credentials by first registering with the NYC’s Developer Portal, then creating a new project and selecting “Geoclient v1” from available APIs.

The geoclient package is one that I wrote myself and it isn’t published on the official R package repository, but can instead be installed directly from GitHub with the remotes package.


library(geoclient) # remotes::install_github("austensen/geoclient")

geoclient_api_keys(
  id = getPass::getPass("geoclient ID:"),
  key = getPass::getPass("geoclient KEY:"),
  install = TRUE,
  overwrite = TRUE
)

readRenviron("~/.Renviron")

The Geoclient API allows for gecoding from a variety of input location types, and for each type there is an accompanying R function in the geoclient package. Here we want to geocode the building addesses so we’ll start with the dataset of buildings and pipe it into the function geo_address_data() and speficy all the address compoenents that are required for the function. This will return a dataset, of the same length and in the same order s the input dataset, containing all the many geographic identifiers and other pieces of intormation returned by the API. For our purposes we’re primarily interested in the BBL column, but wehile we’re at it let’s also keep some other geographic IDs and the latitude/longitude for each property, renaming some variables while we’re selecting them.


geo_info <- watchlist_bldgs %>% 
  geo_address_data(number, street, borough, zip) %>% 
  select(
    bbl, 
    bin = buildingIdentificationNumber, 
    tract_2010 = censusTract2010,
    cd = communityDistrict,
    nta,
    lat = latitudeInternalLabel,
    lng = longitudeInternalLabel
  )


geo_info


# A tibble: 452 x 7
   bbl        bin     tract_2010 cd    nta     lat   lng
   <chr>      <chr>   <chr>      <chr> <chr> <dbl> <dbl>
 1 3008220039 3014544 "  76  "   307   BK32   40.6 -74.0
 2 3008220041 3014545 "  76  "   307   BK32   40.6 -74.0
 3 3008460011 3015608 "  74  "   307   BK32   40.6 -74.0
 4 3004030030 3006430 "  69  "   306   BK38   40.7 -74.0
 5 3033060024 3075649 " 419  "   304   BK78   40.7 -73.9
 6 3031500031 3071890 " 389  "   304   BK78   40.7 -73.9
 7 3015050001 3040558 " 377  "   303   BK35   40.7 -73.9
 8 3004030037 3006432 "  69  "   306   BK38   40.7 -74.0
 9 3015340038 3041728 " 369  "   316   BK79   40.7 -73.9
10 3027780003 3069286 " 495  "   301   BK90   40.7 -73.9
# ... with 442 more rows

Because the resulting dataframe is the exact same size and order as our input dataset we can simply bind the two datasets together side-by-side with dplyr’s bind_cols().


watchlist_bldgs_geos <- bind_cols(watchlist_bldgs, geo_info)

watchlist_bldgs_geos


# A tibble: 452 x 15
   landlord address units violations number street borough zip   bbl  
   <chr>    <chr>   <int>      <int> <chr>  <chr>  <chr>   <chr> <chr>
 1 Jonatha… 5416 4…    16         62 5416   4 AVE… BROOKL… 11220 3008…
 2 Jonatha… 5422 4…    21         78 5422   4 AVE… BROOKL… 11220 3008…
 3 Jonatha… 314 57…    17         70 314    57 ST… BROOKL… 11220 3008…
 4 Jonatha… 372 BA…     8         76 372    BALTI… BROOKL… 11201 3004…
 5 Jonatha… 166 BL…     6         41 166    BLEEC… BROOKL… 11221 3033…
 6 Jonatha… 590 BU…     4         55 590    BUSHW… BROOKL… 11206 3031…
 7 Jonatha… 165 HO…     8         39 165    HOWAR… BROOKL… 11233 3015…
 8 Jonatha… 208 HO…     3          9 208    HOYT … BROOKL… 11217 3004…
 9 Jonatha… 203 HU…     6        143 203    HULL … BROOKL… 11233 3015…
10 Jonatha… 318 HU…     7        102 318    HUMBO… BROOKL… 11211 3027…
# ... with 442 more rows, and 6 more variables: bin <chr>,
#   tract_2010 <chr>, cd <chr>, nta <chr>, lat <dbl>, …

Querying the NYC-DB SQL Database

Now that we have BBL for all our buildings we can link to other bbl-level datasets. NYC-DB is an incredible resource of NYC housing datasets that all contain BBL.

We’ll start by connecting to the NYC-DB PostgreSQL database using the DBI package.


library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  user = getPass::getPass("NYC-DB USER:"),
  password = getPass::getPass("NYC-DB PASSWORD:"),
  host = getPass::getPass("NYC-DB HOST:"),
  port = 5432,
  dbname = "nycdb"
)

To see all the tables available to use in NYC-DB we can use dbListTables().


dbListTables(con)


 [1] "acris_country_codes"                           
 [2] "personal_property_legals"                      
 [3] "acris_document_control_codes"                  
 [4] "acris_property_type_codes"                     
 [5] "acris_ucc_collateral_codes"                    
 [6] "dob_complaints"                                
 [7] "dobjobs"                                       
 [8] "dof_sales"                                     
 [9] "hpd_business_addresses"                        
[10] "hpd_complaint_problems"                        
[11] "hpd_complaints"                                
[12] "hpd_contacts"                                  
[13] "hpd_corporate_owners"                          
[14] "hpd_registrations"                             
[15] "hpd_registrations_grouped_by_bbl"              
[16] "hpd_registrations_grouped_by_bbl_with_contacts"
[17] "hpd_violations"                                
[18] "marshal_evictions_17"                          
[19] "personal_property_master"                      
[20] "personal_property_parties"                     
[21] "personal_property_references"                  
[22] "personal_property_remarks"                     
[23] "pluto_18v1"                                    
[24] "real_property_legals"                          
[25] "real_property_master"                          
[26] "real_property_parties"                         
[27] "real_property_references"                      
[28] "real_property_remarks"                         
[29] "rentstab"

A great thing about dplyr is that it not only works on regular dataframes, but you can also use it with tables in a SQL database. To start working with a table in dplyr you use tbl() to connect to that table.


hpd_viol <- tbl(con, "hpd_violations")

This will now act like a regular dataframe, and we can print it to see the first few rows. What it’s actually doing is querying the table in the database and only retrieving these few preview rows and converting the data into an R dataframe.


hpd_viol


# Source:   table<hpd_violations> [?? x 40]
# Database: postgres
   violationid buildingid registrationid boroid borough housenumber
         <int>      <int>          <int> <chr>  <chr>   <chr>      
 1    11884817      51282         201932 2      BRONX   2707       
 2    11884815     694627         402586 4      QUEENS  801        
 3    11884819     333183         347984 3      BROOKL… 423        
 4    12312611     735531          14351 5      STATEN… 114        
 5    11884821      51282         201932 2      BRONX   2707       
 6    11530875     697656         406497 4      QUEENS  17-37      
 7    11884823      63112         200727 2      BRONX   352        
 8    11884824     333183         347984 3      BROOKL… 423        
 9    11884825     333183         347984 3      BROOKL… 423        
10    11884828     333183         347984 3      BROOKL… 423        
# ... with more rows, and 34 more variables: lowhousenumber <chr>,
#   highhousenumber <chr>, streetname <chr>, streetcode <chr>,
#   postcode <chr>, …

Just like before we can also use glimpse() to get a better sense of the contents of the dataset.


glimpse(hpd_viol)


Observations: ??
Variables: 40
$ violationid           <int> 11884817, 11884815, 11884819, 12312...
$ buildingid            <int> 51282, 694627, 333183, 735531, 5128...
$ registrationid        <int> 201932, 402586, 347984, 14351, 2019...
$ boroid                <chr> "2", "4", "3", "5", "2", "4", "2", ...
$ borough               <chr> "BRONX", "QUEENS", "BROOKLYN", "STA...
$ housenumber           <chr> "2707", "801", "423", "114", "2707"...
$ lowhousenumber        <chr> "2707", "801", "423", "114", "2707"...
$ highhousenumber       <chr> "2707", "801", "423", "114", "2707"...
$ streetname            <chr> "BRIGGS AVENUE", "SENECA AVENUE", "...
$ streetcode            <chr> "14620", "61890", "58930", "27990",...
$ postcode              <chr> "10458", "11385", "11233", "10314",...
$ apartment             <chr> "IB", NA, "2A", "1", "IB", NA, "3G"...
$ story                 <chr> "2", NA, "2", "1", "2", NA, "3", "2...
$ block                 <int> 3300, 3463, 1516, 1632, 3300, 3564,...
$ lot                   <int> 56, 12, 48, 108, 56, 33, 16, 48, 48...
$ class                 <chr> "B", "B", "A", "B", "A", "I", "C", ...
$ inspectiondate        <date> 2017-07-18, 2017-07-18, 2017-07-18...
$ approveddate          <date> 2017-07-19, 2017-07-18, 2017-07-22...
$ originalcertifybydate <date> 2017-09-07, 2017-09-06, 2017-11-10...
$ originalcorrectbydate <date> 2017-08-24, 2017-08-23, 2017-10-27...
$ newcertifybydate      <date> NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ newcorrectbydate      <date> NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ certifieddate         <date> NA, 2017-10-06, NA, NA, NA, NA, 20...
$ ordernumber           <chr> "508", "513", "594", "949", "556", ...
$ novid                 <int> 5805072, 5804390, 5807912, 6054008,...
$ novdescription        <chr> "SECTION 27-2005 ADM CODE  REPAIR T...
$ novissueddate         <date> 2017-07-20, 2017-07-19, 2017-07-24...
$ currentstatusid       <int> 2, 4, 2, 2, 2, 9, 19, 2, 2, 2, 2, 2...
$ currentstatus         <chr> "NOV SENT OUT", "NOV CERTIFIED LATE...
$ currentstatusdate     <date> 2017-07-20, 2017-10-06, 2017-07-24...
$ novtype               <chr> "Original", "Original", "Original",...
$ violationstatus       <chr> "Open", "Open", "Open", "Open", "Op...
$ latitude              <dbl> 40.86633, 40.70143, 40.68163, 40.61...
$ longitude             <dbl> -73.89170, -73.90535, -73.91234, -7...
$ communityboard        <chr> "7", "5", "16", "2", "7", "5", "1",...
$ councildistrict       <int> 15, 34, 41, 50, 15, 34, 8, 41, 41, ...
$ censustract           <chr> "40501", "551", "371", "29104", "40...
$ bin                   <int> 2017013, 4082786, 3041170, 5032029,...
$ bbl                   <chr> "2033000056", "4034630012", "301516...
$ nta                   <chr> "Bedford Park-Fordham North", "Ridg...

We already have the total count of HPD violations in 2017 from the original dataset, but with NYC-DB we get more detailed information. Let’s get the count of heat violations for the 2017-2018 heat season. Just like before we can build up our data manipulation using dplyr functions in a pipeline, and behind the scenes dplyr translates all these functions into SQL queries. We can use all the main dplyr functions, but for some specific tasks like writing conditions for filter() there may be things that aren’t supported by dplyr’s SQL translation. Luckily as a backup we can use the sql() function to insert raw SQL code into parts of our dplyr functions. Sometimes this is easier even when translations are possible. For example there is some SQL code borrowed from another project for identifying heat violations based on the notice of violation description text and filtering to the months of the heat season. We can paste that SQL code right into our filter statement along with our regular R code to filter for just our list of BBLs.

We can use all the same dplyr functions to work with the table on the database. when you print it will just be a preview. You can do this to figure out how you want to extract the data. then use collect() to download the extract into a local dataset in R. behind the scenes dplyr is translating all these functions into SQL queries. You can see what it’s doing with show_query().


watchlist_bbls <- pull(watchlist_bldgs_geos, bbl)

violation_bbls <- hpd_viol %>% 
  filter(
    bbl %in% watchlist_bbls, # only get the BBLs we need
    sql("novdescription ~ '27-20(2[8-9]|3[0-3])'"), # heat violations
    sql("novissueddate between '2017-10-01' and '2018-05-31'") # 2017-2018 heat season
  ) %>% 
  group_by(bbl) %>% 
  summarise(heat_viols = n())

violation_bbls


# Source:   lazy query [?? x 2]
# Database: postgres
   bbl        heat_viols     
   <chr>      <S3: integer64>
 1 1007650017 1              
 2 1010330045 2              
 3 1010370049 1              
 4 1010410004 1              
 5 1010420009 2              
 6 1010550040 4              
 7 1011420141 1              
 8 1012430036 1              
 9 1013260037 1              
10 1014400023 4              
# ... with more rows

You can see behind the scenes at the SQL translation using show_query(). Alternatively, if you already know SQL you can just write the query yourself and use dbGetQuery().

Just like before when we printed the database table, when we print the result of this dplyr pipeline we are still only getting a preview of the full results. Once we are happy with how our query is looking an we’re ready to download the full results into R we can use collect(). Now we can join the heat violations onto our existing dataframe of watchlist buildings.


heat_violations <- collect(violation_bbls)

watchlist_bldgs_geos_heat <- inner_join(watchlist_bldgs_geos, heat_violations, by = "bbl")

Spatial Data in R

We won’t be able to cover spatial data in R in this workshop, but as a final example and preview of some of R’s spatial functionality we can map a simple interactive map to visualize and explore the watchlist buildings.

There is a growing collection of packages for working with spatial data in R, largely organized around the r-spatial group and the core package sf. sf has similar conventions as PostGIS and enables you to create spatial dataframes that are compatible with all the same dplyr functions we’ve been using above.

Here we’ll just use st_as_sf() to convert our data into a spatial sf-dataframe by creating point geometries from the latitude and longitude coordinates we got from Geoclient, and because we are working with lat/lng we’ll set the coordinate reference system (crs) to 4326. Now when we print the dataframe you see some additional metadata.


library(sf)

watchlist_bldgs_spatial <- st_as_sf(watchlist_bldgs_geos_heat, coords = c("lng", "lat"), crs = 4326)

watchlist_bldgs_spatial


Simple feature collection with 117 features and 14 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: -74.08035 ymin: 40.59765 xmax: -73.79844 ymax: 40.89073
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
# A tibble: 117 x 15
   landlord address units violations number street borough zip   bbl  
   <chr>    <chr>   <int>      <int> <chr>  <chr>  <chr>   <chr> <chr>
 1 Jonatha… 5416 4…    16         62 5416   4 AVE… BROOKL… 11220 3008…
 2 Jonatha… 5422 4…    21         78 5422   4 AVE… BROOKL… 11220 3008…
 3 Jonatha… 166 BL…     6         41 166    BLEEC… BROOKL… 11221 3033…
 4 Jonatha… 590 BU…     4         55 590    BUSHW… BROOKL… 11206 3031…
 5 Jonatha… 203 HU…     6        143 203    HULL … BROOKL… 11233 3015…
 6 Jonatha… 196 KI…     6         19 196    KINGS… BROOKL… 11222 3026…
 7 Jonatha… 20 MEN…     6         51 20     MENAH… BROOKL… 11221 3033…
 8 Jonatha… 321 ME…     6         23 321    MENAH… BROOKL… 11237 3033…
 9 Jonatha… 234 SC…     6         44 234    SCHEN… BROOKL… 11213 3013…
10 Jonatha… 128 SU…     6         22 128    SUMPT… BROOKL… 11233 3015…
# ... with 107 more rows, and 6 more variables: bin <chr>,
#   tract_2010 <chr>, cd <chr>, nta <chr>, heat_viols <S3:
#   integer64>, …

There is a leaflet package that provides an R interface for creating interactive web maps with leaflet.js. We won’t get into all the details of that package now, but instead will use the mapview package which quickly makes you a simple leaflet map and takes care of all the details. This is very useful for simply exploring the data when you aren’t yet concerned about the visual presentation.


library(mapview)

mapview(watchlist_bldgs_spatial)

R for NYC Housing Data Analysis

Table of Contents

R

R Community Resources

RStudio

Packages in R

Installing and Loading Packages

Tidyverse Packages

Landlord Watchlist Buildings

Import and Preview Dataset

Data Manipulation with `dplyr`

`filter()`

`select()`

`mutate()`

`arrange()`

`summarize()`

`group_by()`

Data manipulation pipelines with `%>%` (“pipe”)

Making graphs with `ggplot2`

Geocoding addresses with Geoclient API

Querying the NYC-DB SQL Database

Spatial Data in R

R for NYC Housing Data Analysis

Table of Contents

R

R Community Resources

RStudio

Packages in R

Installing and Loading Packages

Tidyverse Packages

Landlord Watchlist Buildings

Import and Preview Dataset

Data Manipulation with dplyr

filter()

select()

mutate()

arrange()

summarize()

group_by()

Data manipulation pipelines with %>% (“pipe”)

Making graphs with ggplot2

Geocoding addresses with Geoclient API

Querying the NYC-DB SQL Database

Spatial Data in R

Data Manipulation with `dplyr`

`filter()`

`select()`

`mutate()`

`arrange()`

`summarize()`

`group_by()`

Data manipulation pipelines with `%>%` (“pipe”)

Making graphs with `ggplot2`