Data Wrangling with dplyr

What is dplyr?

The dplyr package is designed to make data wrangling faster, clearer, and more intuitive by providing a consistent way to manipulate data.

At its core, dplyr is built around the idea of “verbs”, where each function represents a common action you perform on data:

  • filter() keeps rows that meet a condition
  • arrange() sorts rows
  • select() keeps or renames columns
  • mutate() creates new columns from existing ones
  • group_by() divides the data into groups
  • summarize() collapses groups into summary statistics

These verbs can be combined to create a step-by-step workflow that mirrors how analysts actually think. Another way to think of this is like a “data recipe” where we stipulate a set of steps to follow to give us a particular data set.

Why This Matters for Crime Analysts

Crime data is often messy, complex, and large. Analysts regularly need to:

  • Isolate specific types of incidents (e.g., use of force)
  • Focus on certain time periods or locations
  • Create new variables (e.g., month, crime categories)
  • Summarize patterns across groups (e.g., by neighborhood or offense type)

dplyr makes these tasks:

  • Readable: Code looks like a sequence of logical steps
  • Efficient: Handles large datasets quickly
  • Reproducible: Every step is documented and can be rerun

Instead of manually filtering or copying data in spreadsheets, analysts can build a clear, repeatable workflow that turns raw data into meaningful insight. As you work through this section, the advantages of these features should become clearer.

To facilitate that, let’s look at some data on use of force incidents.

Use of Force in Phoenix

Imagine you are a crime analyst working for a non-profit in Phoenix, Arizona. Your supervisor asks: “We’ve been getting questions from local constituents about use of force. Can you tell me what’s going on?” At first glance, this sounds straightforward: just open the data and look (since these are publicly available, it is easy to do so).

The DWVpack package contains a data set of use of force incidents from 2018-2024 in Phoenix. (As a reminder, if you have not installed the DWVpack package, please review the Getting Started with R chapter where this is covered.) To look at the data, we need to first load the DWVpack package and then examine the data, called phx_uof_2018_2024 in the package, using the View() function. This function takes an object and creates a separate viewing pane where we can “view” the data.

Here are the commands:

# load the package
library( DWVpack )

# look at the data 
View( phx_uof_2018_2024 )

Take a look through the data file and think about what we have. In the rows, we have each incident from 2018-2024. If you scroll to the bottom of the file, you will see that there are 12324 total incidents. In the columns we have the variables.
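Rather than scrolling to the bottom of the viewer, you can confirm the number of rows directly with base R’s nrow() function. A quick check, assuming the DWVpack package has been loaded as above:

```r
# count the rows (one row per incident)
nrow( phx_uof_2018_2024 )   # 12324
```

Each row is an incident, so this matches the 12324 total incidents you see in the viewer.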

Let’s take a look at the first column, which is the variable INC_IA_NO. According to the data dictionary for these data, INC_IA_NO is a record number used as a unique identifier for each officer Use of Force (UOF) report. A Use of Force report is completed for each involved individual within an incident. This field is available beginning in 2021, when the Phoenix Police Department implemented a new application for tracking UOF interactions that allows additional information to be collected about each interaction.

What about the tenth column, HUNDRED_BLOCK? This is the address of the incident, anonymized to the hundred block. As you can see if you scroll through the incidents, there are some listed as “Not Available”, meaning that (for some reason) the location information was not entered.

If you keep looking through the data (go ahead, I can wait), you will see that we can’t just dive into analyzing the data. We need to clean and structure the data. This is where data wrangling comes in.

Analytical Questions and Workflow

To guide our work, consider the types of questions we might want to answer:

  • Are incidents increasing or decreasing over time?
  • What proportion of incidents involve Male (vs. Female) individuals?
  • What proportion of incidents involve the suspect having a gun?
  • Are non-white suspects more prevalent in the data than White suspects?

Each of these questions requires transforming raw data into a usable format as raw datasets rarely arrive ready for analysis (as we saw above). Instead, we typically move through a series of steps (a workflow):

  • Select relevant variables
  • Filter observations of interest
  • Create new variables (e.g., time, categories)
  • Group and summarize data
  • Check for missing or unusual values

The dplyr package is designed to make each of these steps clear, consistent, and reproducible.

dplyr Basics

You’re about to learn the primary dplyr verbs (functions), which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:

  • The first argument is always a data frame.
  • The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
  • The output is always a new data frame.

Because each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we’ll do so with the pipe, |>. We’ll discuss the pipe more, but in brief, the pipe allows us to string together multiple verbs. The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:


phx_uof_2018_2024 |>
  filter( CIT_GENDER == "Male" ) |> 
  group_by( INC_YEAR ) |> 
  summarize(
    male_uof = n()
  )
  

dplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections, you’ll learn the most important verbs for rows, columns, and groups. Let’s dive in!

To use these functions, we need to install dplyr:


install.packages( "dplyr" )

After installing (or if you already have the package installed), we need to use the library() function to load it:

library( dplyr )

Rows

The most important verbs that operate on rows of a dataset are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss distinct() which finds rows with unique values. Unlike arrange() and filter() it can also optionally modify the columns.

filter()

filter() allows you to keep rows based on the values of the columns. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all use of force incidents that involve “Male” suspects:


phx_uof_2018_2024 |>
  filter( CIT_GENDER == "Male" )
  

What does this say? It says to take the phx_uof_2018_2024 object, “then” (this is the |> operator) filter the rows that meet the condition. In this case, we are filtering to keep only the rows where the CIT_GENDER column is equal to “Male”.

We can use various operators for the condition. Above we used == (equal to), but we could use != (not equal to), which would give us all the values except “Male”. Let’s think about this. Here are all the unique values for the CIT_GENDER variable:

table( phx_uof_2018_2024$CIT_GENDER )

       Female          Male Not Available       Unknown 
         1137         11180             6             1 

Look at the values and think about what the following code will return:


phx_uof_2018_2024 |>
  filter( CIT_GENDER != "Male" )
  

For numeric variables, we can also use > (greater than), >= (greater than or equal to), < (less than), and <= (less than or equal to).

For example, if we wanted all years after 2018:


phx_uof_2018_2024 |>
  filter( INC_YEAR > 2018 ) 
  

We can also combine conditions with & to indicate “and” (check for both conditions) or with | to indicate “or” (check for either condition):


# take year 2018 and Males using the & "and" operator:

phx_uof_2018_2024 |>
  filter( CIT_GENDER == "Male" & INC_YEAR == 2018 )


# take years 2018 and 2024 using the | "or" operator:

phx_uof_2018_2024 |>
  filter( INC_YEAR == 2018 | INC_YEAR == 2024 ) 
  

When you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:


uof_2018_and_2024 <- phx_uof_2018_2024 |>
  filter( INC_YEAR == 2018 | INC_YEAR == 2024 ) 
  

arrange()

arrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns. For example, the following code sorts by INC_YEAR then by CIT_GENDER.


phx_uof_2018_2024 |>
  arrange( INC_YEAR, CIT_GENDER )
  

We get the earliest year first, then female, then male, then the next year, and so on. Note that we have not altered the values of the data; we have just arranged the rows differently.

distinct()

distinct() finds all the unique values of a column in a dataset.


phx_uof_2018_2024 |>
  distinct( INC_YEAR )
  

We can also use distinct() to get combinations of values for several columns:


phx_uof_2018_2024 |>
  distinct( INC_YEAR, CIT_GENDER )
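As noted at the start of this section, distinct() can also optionally keep the other columns. One way to do this is with dplyr’s .keep_all argument, sketched here: it keeps the first row for each unique value while retaining every column.

```r
# keep the first incident for each gender value, with all columns intact
phx_uof_2018_2024 |>
  distinct( CIT_GENDER, .keep_all = TRUE )
```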
  

Above, I used the table() function to essentially do the same thing, except that function returns counts. We can use the count() function in dplyr to achieve this:


phx_uof_2018_2024 |>
  count( INC_YEAR, CIT_GENDER )
  

So why not just use the table() function? A strength of dplyr is that we can chain multiple commands together into a recipe using our pipe, |>.

For example, let’s use filter() to take incidents involving Males and then look at the frequency over the years:


phx_uof_2018_2024 |> 
  filter( CIT_GENDER == "Male" ) |> 
  count( INC_YEAR )
  

As we will see, we can string together many of these instructions for how we want to transform the data into a single command using the |> operator.

Columns

There are also verbs that operate on the columns. The two we will focus on here are select() and mutate().

select()

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:

  • Select columns by name:
phx_uof_2018_2024 |> 
  select( INC_YEAR, CIT_GENDER )
  • Select all columns between INC_YEAR and CIT_GENDER:
phx_uof_2018_2024 |> 
  select( INC_YEAR : CIT_GENDER )
  • Select all columns except those from INC_YEAR through CIT_GENDER:
phx_uof_2018_2024 |> 
  select( !INC_YEAR : CIT_GENDER )

Historically this operation was done with - instead of !, so you’re likely to see that in the wild. These two operators serve the same purpose but with subtle differences in behavior. I recommend using ! because it reads as “not” and combines well with & and |.
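To see the two operators side by side, both of the following drop a single column from the data; for a simple case like this they behave the same:

```r
# older style: drop CIT_GENDER with -
phx_uof_2018_2024 |>
  select( -CIT_GENDER )

# preferred style: drop CIT_GENDER with ! ("not")
phx_uof_2018_2024 |>
  select( !CIT_GENDER )
```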

There are also a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • num_range("x", 1:3): matches x1, x2 and x3.
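Since several variables in these data share the CIT_ prefix (e.g., CIT_GENDER, CIT_RACE, CIT_AGE), the helpers make it easy to grab related columns at once. A brief sketch:

```r
# keep every column whose name starts with "CIT"
phx_uof_2018_2024 |>
  select( starts_with( "CIT" ) )

# keep every column whose name contains "YEAR" (e.g., INC_YEAR)
phx_uof_2018_2024 |>
  select( contains( "YEAR" ) )
```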

You can rename variables as you select() them by using =. The new name appears on the left-hand side of the =, and the old variable appears on the right-hand side:

phx_uof_2018_2024 |> 
  select( year = INC_YEAR )

This is a nice feature when you are building a data set and the variable names are long, confusing, etc.

mutate()

The job of mutate() is to add new columns that are calculated from the existing columns. For example, let’s transform the CIT_RACE variable into a variable that defines the cases as “white” and “non-white”. First, let’s look at the values of the CIT_RACE variable:

phx_uof_2018_2024 |> 
  distinct( CIT_RACE )
                           CIT_RACE
1  American Indian / Alaskan Native
2                             White
3          Asian / Pacific Islander
4                             Black
5                            indian
6                          AmIndian
7                          Hispanic
8                           Unknown
9                     Not Available
10         Black / African American

Now, having seen there are 10 distinct values for the CIT_RACE variable, we can reclassify them:


phx_uof_2018_2024 |> 
  select( race = CIT_RACE ) |> 
  mutate( white = ifelse( race == "White", "White", "Non-White" ) )
  

Here, we did a few steps. First, we used select() to keep just the CIT_RACE column, renaming it race. Second, we created a variable called white using the ifelse() function. Simply, the function says: if race is equal to “White”, then the variable white will be “White”; if not, it will be “Non-White”.

Groups

So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. Here, we’ll focus on the most important functions: summarize() and group_by().

summarize()

The most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group. In dplyr, this operation is performed by summarize(). For example, let’s compute the average age of citizens using the CIT_AGE column:

phx_uof_2018_2024 |> 
  summarize( 
    avg_age = mean( CIT_AGE ) 
    )
  avg_age
1      NA

Huh? This doesn’t look right. Notice that in the code above, we are trying to calculate a mean on a variable that has missing values (i.e. NA). For the mean() function to work, we need to tell it what to do with the NA values. Specifically, we need to use the na.rm = argument by setting it to TRUE such that it removes missing values.

phx_uof_2018_2024 |> 
  summarize( 
    avg_age = mean( CIT_AGE, na.rm = TRUE )
    )
   avg_age
1 33.13641

Ah, that looks better.

Note that we can include other info in the output, such as the number of cases. This is done using n() and, as we will see below, is really useful when we have multiple means.

phx_uof_2018_2024 |> 
  summarize( 
    avg_age = mean( CIT_AGE, na.rm = TRUE ),
    n = n()
    )
   avg_age     n
1 33.13641 12324

group_by()

Many times we want to divide a dataset into groups and then perform calculations for each of those groups (e.g. mean, standard deviation, n, and so on). We can use group_by() to divide your dataset into groups in this way:

phx_uof_2018_2024 |>
  select( INC_YEAR ) |> 
  group_by( INC_YEAR )

Take a look at the output. You may notice that it looks a bit different. It returns output that is class tibble instead of data.frame. This is a small difference and does not change our data, just the way it is printed. Note that it also tells us how the data are grouped (look at where it says “Groups”).

Now, you may be saying, “isn’t this what arrange() does?” Not exactly. arrange() sorts the rows, so the resulting data object is just re-ordered based on the variables we gave it. group_by() is used when we want to perform a calculation, like mutate() or summarize(), and we first want to group the data based on a variable.

For example, now that our data are grouped, we can use the summarize() function to give us group specific info. Let’s start by seeing how many cases occur each year:

phx_uof_2018_2024 |>
  select( INC_YEAR ) |> 
  group_by( INC_YEAR ) |> 
  summarize( 
    total = n() 
    )
# A tibble: 7 × 2
  INC_YEAR total
     <int> <int>
1     2018  1101
2     2019  1212
3     2020  1391
4     2021  1544
5     2022  2132
6     2023  2302
7     2024  2642

What if we wanted to know how many cases are in each race/ethnic group each year? Well, we could include the CIT_RACE variable in our selection AND in the group_by() statement:

phx_uof_2018_2024 |>
  select( INC_YEAR, CIT_RACE ) |> 
  group_by( INC_YEAR, CIT_RACE ) |> 
  summarize( 
    total = n()
    )
# A tibble: 40 × 3
# Groups:   INC_YEAR [7]
   INC_YEAR CIT_RACE                         total
      <int> <chr>                            <int>
 1     2018 American Indian / Alaskan Native    78
 2     2018 Asian / Pacific Islander            12
 3     2018 Black                              244
 4     2018 White                              767
 5     2019 American Indian / Alaskan Native    94
 6     2019 Asian / Pacific Islander             9
 7     2019 Black                              233
 8     2019 White                              876
 9     2020 American Indian / Alaskan Native    88
10     2020 Asian / Pacific Islander             7
# ℹ 30 more rows

We can include multiple calculations here if we want additional information:

phx_uof_2018_2024 |>
  select( INC_YEAR, CIT_RACE, CIT_AGE ) |> 
  group_by( INC_YEAR, CIT_RACE ) |> 
  summarize( 
    total = n(),
    mean_age = mean( CIT_AGE )
    )
# A tibble: 40 × 4
# Groups:   INC_YEAR [7]
   INC_YEAR CIT_RACE                         total mean_age
      <int> <chr>                            <int>    <dbl>
 1     2018 American Indian / Alaskan Native    78     31.8
 2     2018 Asian / Pacific Islander            12     38.6
 3     2018 Black                              244     NA  
 4     2018 White                              767     33.7
 5     2019 American Indian / Alaskan Native    94     30.8
 6     2019 Asian / Pacific Islander             9     30.8
 7     2019 Black                              233     32.8
 8     2019 White                              876     NA  
 9     2020 American Indian / Alaskan Native    88     31.6
10     2020 Asian / Pacific Islander             7     39.1
# ℹ 30 more rows

Note that some cases have NA for the mean. This is because we didn’t tell the mean() function what to do with missing values. We can fix that using the na.rm= argument in the function:

phx_uof_2018_2024 |>
  select( INC_YEAR, CIT_RACE, CIT_AGE ) |> 
  group_by( INC_YEAR, CIT_RACE ) |> 
  summarize( 
    total = n(),
    mean_age = mean( CIT_AGE, na.rm = TRUE )
    )
# A tibble: 40 × 4
# Groups:   INC_YEAR [7]
   INC_YEAR CIT_RACE                         total mean_age
      <int> <chr>                            <int>    <dbl>
 1     2018 American Indian / Alaskan Native    78     31.8
 2     2018 Asian / Pacific Islander            12     38.6
 3     2018 Black                              244     33.9
 4     2018 White                              767     33.7
 5     2019 American Indian / Alaskan Native    94     30.8
 6     2019 Asian / Pacific Islander             9     30.8
 7     2019 Black                              233     32.8
 8     2019 White                              876     33.6
 9     2020 American Indian / Alaskan Native    88     31.6
10     2020 Asian / Pacific Islander             7     39.1
# ℹ 30 more rows


Analytical Questions and Workflow (revisited)

Now that we know how to manipulate the dataset, let’s revisit our questions:

  • Are incidents increasing or decreasing over time?
  • How many incidents involve Male (vs. Female) individuals?
  • What proportion of incidents involve the suspect having a gun (see the SUBJ_ARMED_GUN column)?
  • Are non-white suspects more prevalent in the data than White suspects?

Before reviewing the solutions, I would encourage you to take a crack at these on your own, and then work through the answers here.

Are incidents increasing or decreasing over time?

We can answer this question by creating a simple count by year of incidents:

phx_uof_2018_2024 |>
  select( INC_YEAR ) |> 
  group_by( INC_YEAR ) |> 
  summarize( total = n() )
# A tibble: 7 × 2
  INC_YEAR total
     <int> <int>
1     2018  1101
2     2019  1212
3     2020  1391
4     2021  1544
5     2022  2132
6     2023  2302
7     2024  2642

This table is fairly easy to read, and the increase is clear. But if we had lots of cases, it might be hard to see the trend. One way to make this easier is to take all of the cases over the years and calculate what proportion each year represents.

Before we write the code for it, let’s think about it conceptually. We want: “Count incidents by year, then divide each year’s count by the total number of incidents.”

So our data recipe should be:

  1. Group by year
  2. Count incidents per year
  3. Compute proportion using the total
phx_uof_2018_2024 |>
  select( INC_YEAR ) |> 
  group_by( INC_YEAR ) |> 
  summarize( total = n() ) |> 
  mutate( year_proportion = total / sum( total ) )
# A tibble: 7 × 3
  INC_YEAR total year_proportion
     <int> <int>           <dbl>
1     2018  1101          0.0893
2     2019  1212          0.0983
3     2020  1391          0.113 
4     2021  1544          0.125 
5     2022  2132          0.173 
6     2023  2302          0.187 
7     2024  2642          0.214 

How many incidents involve Male (vs. Female) individuals?

phx_uof_2018_2024 |>
  select( CIT_GENDER ) |> 
  group_by( CIT_GENDER ) |> 
  summarize( total = n() ) 
# A tibble: 4 × 2
  CIT_GENDER    total
  <chr>         <int>
1 Female         1137
2 Male          11180
3 Not Available     6
4 Unknown           1

What proportion of incidents involve the suspect having a gun?

phx_uof_2018_2024 |>
  select( SUBJ_ARMED_GUN ) |> 
  group_by( SUBJ_ARMED_GUN ) |> 
  summarize( total = n() ) |> 
  mutate( gun_proportion = total / sum( total ) )
# A tibble: 2 × 3
  SUBJ_ARMED_GUN total gun_proportion
  <chr>          <int>          <dbl>
1 No             11706         0.950 
2 Yes              618         0.0501

Are non-white suspects more prevalent in the data than White suspects?

phx_uof_2018_2024 |>
  select( CIT_RACE ) |> 
  group_by( CIT_RACE ) |> 
  summarize( total = n() ) |> 
  mutate( race_proportion = total / sum( total ) )
# A tibble: 10 × 3
   CIT_RACE                         total race_proportion
   <chr>                            <int>           <dbl>
 1 AmIndian                             3       0.000243 
 2 American Indian / Alaskan Native   782       0.0635   
 3 Asian / Pacific Islander            95       0.00771  
 4 Black                             3028       0.246    
 5 Black / African American            28       0.00227  
 6 Hispanic                             9       0.000730 
 7 Not Available                        6       0.000487 
 8 Unknown                             11       0.000893 
 9 White                             8361       0.678    
10 indian                               1       0.0000811

This gives us all the categories, so we would want to create a variable that gets at White vs. Non-White. Remember that we did this above using mutate( white = ifelse( race == "White", "White", "Non-White" ) ).

phx_uof_2018_2024 |>
  select( race = CIT_RACE ) |> 
  mutate( white = ifelse( race == "White", "White", "Non-White" ) ) |> 
  group_by( white ) |> 
  summarize( total = n() ) |> 
  mutate( race_proportion = total / sum( total ) )
# A tibble: 2 × 3
  white     total race_proportion
  <chr>     <int>           <dbl>
1 Non-White  3963           0.322
2 White      8361           0.678


Test Your Knowledge Exercises