Tidy Data

Have you ever opened a dataset and immediately thought, “This is a mess”? Columns don’t make sense. Dates are inconsistent. And even if you can read it, it’s hard to actually do anything with it. If so, then your second thought might have been, “this is going to be a lot of work to fix”.

Sound familiar? That’s not a coincidence. Rather it is the reality of working with open-source data or data “in the wild”. In data wrangling, we want to corral that wild data and get it into a format with which we can work.

There’s a common misconception that data analysis is mostly about: building models, creating visualizations, creating output, etc. But in practice, most analysts spend the majority of their time doing something cleaning and organizing data. Ironically, when we (i.e. scholars) teach methods and statistics for crime analysis, most of the time is spent on analysis and very little time is spent on preparing data.

“Tidying” data is one of the most important skills you can develop. In this chapter we are going to focus on turning messy data into something you can actually analyze.

What Is “Tidy Data”?

We’re going to use a concept from the tidyverse(a collection of R packages for data transformation) called tidy data. This gives us a standard way to structure datasets so they are easy to work with. Also, organizing data in this way is quickly becoming an industry standard, so knowing what it is and how to do it is valuable.

A dataset is considered tidy when it follows three simple rules:

Each variable has its own column
Each observation has its own row
Each value lives in a single cell

That might sound simple and obvious, but most “wild” datasets violate at least one of these rules and are therefore “untidy”. Let’s see some examples.

Untidy Data: Example 1

# A tibble: 4 × 2
  incident_id demographics   
        <int> <chr>          
1           1 Male_Adult     
2           2 Female_Juvenile
3           3 Male_Juvenile  
4           4 Female_Adult

See any issues? You probably noticed that “gender” and “age group” are mashed together in the variable demographics. This is a problem because demographics is actually two variables. How might we solve this problem?

Well, we just need to create two variables:

# A tibble: 4 × 3
  incident_id gender age_group
        <int> <chr>  <chr>    
1           1 Male   Adult    
2           2 Female Juvenile 
3           3 Male   Juvenile 
4           4 Female Adult

Untidy Data: Example 2

Try this one:

# A tibble: 2 × 3
  precinct arrests_2022 arrests_2023
  <chr>           <dbl>        <dbl>
1 A                  50           60
2 B                  30           45

If you said “year” is hidden in column names, then you win a prize! Data organized in this way make it hard to analyze trends by year. Rather, we would prefer data like this:

# A tibble: 4 × 3
  precinct year  arrests
  <chr>    <chr>   <dbl>
1 A        2022       50
2 A        2023       60
3 B        2022       30
4 B        2023       45

Note the difference. We know have a variable for “year”.

Untidy Data: Example 3

Last one (you got this!):

# A tibble: 2 × 2
  incident_id subjects 
        <int> <chr>    
1           1 John;Mike
2           2 Sara;Anna

These data are untidy because two people are stored in one row. It would be difficult to count or analyze individuals properly. We can fix this by making a variable for “subjects” and listing each one:

# A tibble: 4 × 2
  incident_id subjects
        <int> <chr>   
1           1 John    
2           1 Mike    
3           2 Sara    
4           2 Anna

Why Tidy?

Why ensure that your data is tidy? There are two main advantages:

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. 2, There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. Most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

dplyr (remember dplyr!?!), ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

Since we will be working with functions that tidy data, we need to install tidyverse:

install.packages( "tidyverse" )

After installing (or if you already have the package installed), we need to use the library() function to load it:

library( tidyverse )

Reshaping Data

Now that we know what tidy data should look like, the next step is learning how to transform messy data into that structure. One of the most common problems you’ll run into is that your data is in the wrong shape.

Take a look at this example:

# A tibble: 2 × 3
  precinct arrests_2022 arrests_2023
  <chr>           <dbl>        <dbl>
1 A                  50           60
2 B                  30           45

If I asked you to give me a table of “arrests” over time, what problems do you see?

You probably noticed that the year is embedded in column names. This makes it had to extract this information for year.

Wide vs Long Data

This issue we saw in the example above has a specific name. These data are wide, meaning that there are many columns and the variable (in this case “year”) is spread over multiple columns. This is in contrast to data that is long, where there are fewer columns and values stacked into rows

Think of it like this:
- “Wide data spreads information across columns.” - “Long data stacks information down rows.”

`pivot_longer`

There is where the pivot_longer function comes in.

First, let’s create our simple wide data object:

data_wide <- data.frame(
  precinct = c( "A", "B" ),
  arrests_2022 = c( 50, 30 ),
  arrests_2023 = c( 60, 45 )
)

data_wide

  precinct arrests_2022 arrests_2023
1        A           50           60
2        B           30           45

Now, let’s create the long data object. Let me show you what it does first and we will go through each line:

data_long <- data_wide |>
  pivot_longer(
    cols = c( "arrests_2022", "arrests_2023"),
    names_to = "year",
    values_to = "arrests"
  )

data_long

# A tibble: 4 × 3
  precinct year         arrests
  <chr>    <chr>          <dbl>
1 A        arrests_2022      50
2 A        arrests_2023      60
3 B        arrests_2022      30
4 B        arrests_2023      45

Ok, let’s see what we did:

cols = c( "arrests_2022", "arrests_2023") asks “which column are we resphaping?” and we use two column names (note that we have to use c() to combine them into a vector.
names_to = "year" tells pivot_longer() where the old column names are supposed to go.
values_to = "arrests" tells the function what to do with the values.

In this case, they are going to go to a variable called num_arrests.

If you are a bit lost, think about it like this: we are taking column names and turning them into data.

If you are a visual learner, think of it this way:

flowchart LR

A["WIDE DATA<br/>(columns contain values)<br/><pre>
┌───────────┬───────────────┬───────────────┐
│ precinct  │ arrests_2022  │ arrests_2023  │
├───────────┼───────────────┼───────────────┤
│ A         │ 50            │ 30            │
│ B         │ 60            │ 45            │
└───────────┴───────────────┴───────────────┘
</pre>"]

B["pivot_longer()<br/>
cols = starts_with('arrests')<br/>
names_to = 'year'<br/>
values_to = 'arrests'"]

C["LONG DATA<br/>(values stacked into rows)<br/><pre>
┌───────────┬────────────────┬─────────┐
│ precinct  │ year           │ arrests │
├───────────┼────────────────┼─────────┤
│ A         │ arrests_2022   │ 50      │
│ A         │ arrests_2023   │ 60      │
│ B         │ arrests_2022   │ 30      │
│ B         │ arrests_2023   │ 45      │
└───────────┴────────────────┴─────────┘
</pre>"]

A --> B --> C

We are almost to a tidy dataset. What is missing? Take a look at the values for the year variable. It contains the string “arrests_” which was carried over from our transformation. What we need to is remove this information. There is a nice function called str_remove() which takes an object or column and removes any characters. The str_remove() function is in a package called stringr which is part of the tidyverse. We can put that into the mutate() function we we already saw in dpylr.

# make sure we load the functions
library( dplyr )

# now remove the unncessary text
data_long <- data_long |>
  mutate(
    year = str_remove(year, "arrests_")
  )

Now, we can create a table using the tools from dplyr!

Perhaps average arrests by year:

data_long |> 
  group_by( year ) |> 
  summarize( avg_arrests = mean( arrests ) )

# A tibble: 2 × 2
  year  avg_arrests
  <chr>       <dbl>
1 2022         40  
2 2023         52.5

Or average arrests by precinct:

data_long |> 
  group_by( precinct ) |> 
  summarize( avg_arrests = mean( arrests ) )

# A tibble: 2 × 2
  precinct avg_arrests
  <chr>          <dbl>
1 A               55  
2 B               37.5

`pivot_wide`

So far, we’ve taken messy wide data and made it tidy. But sometimes we actually want to go the other direction. This usually happens when we’re preparing results for reports or summaries. There is where the pivot_wide function comes in. First, let’s create a simple long data object. Suppose we had some data like this for crimes in two neighborhoods:

data_long <- tibble(
  neighborhood = c("A", "A", "A", "B", "B", "B" ),
  year = c( "2020", "2021", "2022","2020", "2021", "2022" ),
  crimes = c(50, 60, 30, 45, 90, 40 )
)

data_long

# A tibble: 6 × 3
  neighborhood year  crimes
  <chr>        <chr>  <dbl>
1 A            2020      50
2 A            2021      60
3 A            2022      30
4 B            2020      45
5 B            2021      90
6 B            2022      40

If we want this in a format that’s easier to read in a report we could have one row per precinct and separate columns for each year. We can do that by creating a data object that is in wide format using the pivot_wide() function:

data_wide <- data_long |>
  pivot_wider(
    names_from = year,
    values_from = crimes
  )

data_wide

# A tibble: 2 × 4
  neighborhood `2020` `2021` `2022`
  <chr>         <dbl>  <dbl>  <dbl>
1 A                50     60     30
2 B                45     90     40

Ok, let’s see what we did:

names_from = "year" tells pivot_wider() to use values in the year column as new column names.
values_from = "crimes" tells the function to fill the values with counts of crimes.

Note the difference from pivot_longer() in terms of the arguments which uses ...._to = whereas pivot_wider() uses ..._from =.

If you are a visual learner, think of it this way:

flowchart LR

A["LONG DATA<br/>(values stacked into rows)<br/><pre>
┌───────────────┬──────┬─────────┐
│ neighborhood  │ year │ crimes  │
├───────────────┼──────┼─────────┤
│ A             │ 2020 │ 50      │
│ A             │ 2021 │ 60      │
│ A             │ 2022 │ 30      │
│ B             │ 2020 │ 45      │
│ B             │ 2021 │ 90      │
│ B             │ 2022 │ 40      │
└───────────────┴──────┴─────────┘
</pre>"]

B["pivot_wider()<br/>
names_from = year<br/>
values_from = crimes"]

C["WIDE DATA<br/>(columns contain values)<br/><pre>
┌───────────────┬───────────────┬───────────────┬───────────────┐
│ neighborhood  │ crimes_2020   │ crimes_2021   │ crimes_2022   │
├───────────────┼───────────────┼───────────────┼───────────────┤
│ A             │ 50            │ 60            │ 30            │
│ B             │ 45            │ 90            │ 40            │
└───────────────┴───────────────┴───────────────┴───────────────┘
</pre>"]

A --> B --> C

Note that this works if we are using multiple variables as well. For example, say we were looking at arrests and calls for service in a neighborhood and our long data looked like this:

data_long <- tibble(
  neighborhood = c("A", "A", "A", "B", "B", "B" ),
  year = c( "2020", "2021", "2022","2020", "2021", "2022" ),
  arrests = c(50, 60, 30, 45, 90, 40 ),
  calls = c( 100, 125, 90, 265, 75, 182 )
)

data_long

# A tibble: 6 × 4
  neighborhood year  arrests calls
  <chr>        <chr>   <dbl> <dbl>
1 A            2020       50   100
2 A            2021       60   125
3 A            2022       30    90
4 B            2020       45   265
5 B            2021       90    75
6 B            2022       40   182

Then, we can create columns by year for arrests and calls like this:

data_wide <- data_long |>
  pivot_wider(
    names_from = year,
    values_from = c( arrests, calls )
  )

data_wide

# A tibble: 2 × 7
  neighborhood arrests_2020 arrests_2021 arrests_2022 calls_2020 calls_2021
  <chr>               <dbl>        <dbl>        <dbl>      <dbl>      <dbl>
1 A                      50           60           30        100        125
2 B                      45           90           40        265         75
# ℹ 1 more variable: calls_2022 <dbl>

In general, use pivot_wider() when you’re making a table for a report, you want side-by-side comparisons, you’re exporting data (Excel, dashboards), and so on. In comparing the two functions, think of it like this: pivot_longer() helps us prepare data for analysis, whereas pivot_wider() helps us present results.