# A tibble: 4 × 2
incident_id demographics
<int> <chr>
1 1 Male_Adult
2 2 Female_Juvenile
3 3 Male_Juvenile
4 4 Female_Adult
Tidy Data
Have you ever opened a dataset and immediately thought, “This is a mess”? Columns don’t make sense. Dates are inconsistent. And even if you can read it, it’s hard to actually do anything with it. If so, then your second thought might have been, “this is going to be a lot of work to fix”.
Sound familiar? That’s not a coincidence. Rather it is the reality of working with open-source data or data “in the wild”. In data wrangling, we want to corral that wild data and get it into a format with which we can work.
There’s a common misconception that data analysis is mostly about: building models, creating visualizations, creating output, etc. But in practice, most analysts spend the majority of their time doing something cleaning and organizing data. Ironically, when we (i.e. scholars) teach methods and statistics for crime analysis, most of the time is spent on analysis and very little time is spent on preparing data.
“Tidying” data is one of the most important skills you can develop. In this chapter we are going to focus on turning messy data into something you can actually analyze.
What Is “Tidy Data”?
We’re going to use a concept from the tidyverse(a collection of R packages for data transformation) called tidy data. This gives us a standard way to structure datasets so they are easy to work with. Also, organizing data in this way is quickly becoming an industry standard, so knowing what it is and how to do it is valuable.
A dataset is considered tidy when it follows three simple rules:
- Each variable has its own column
- Each observation has its own row
- Each value lives in a single cell
That might sound simple and obvious, but most “wild” datasets violate at least one of these rules and are therefore “untidy”. Let’s see some examples.
Untidy Data: Example 1
See any issues? You probably noticed that “gender” and “age group” are mashed together in the variable demographics. This is a problem because demographics is actually two variables. How might we solve this problem?
Well, we just need to create two variables:
# A tibble: 4 × 3
incident_id gender age_group
<int> <chr> <chr>
1 1 Male Adult
2 2 Female Juvenile
3 3 Male Juvenile
4 4 Female Adult
Untidy Data: Example 2
Try this one:
# A tibble: 2 × 3
precinct arrests_2022 arrests_2023
<chr> <dbl> <dbl>
1 A 50 60
2 B 30 45
If you said “year” is hidden in column names, then you win a prize! Data organized in this way make it hard to analyze trends by year. Rather, we would prefer data like this:
# A tibble: 4 × 3
precinct year arrests
<chr> <chr> <dbl>
1 A 2022 50
2 A 2023 60
3 B 2022 30
4 B 2023 45
Note the difference. We know have a variable for “year”.
Untidy Data: Example 3
Last one (you got this!):
# A tibble: 2 × 2
incident_id subjects
<int> <chr>
1 1 John;Mike
2 2 Sara;Anna
These data are untidy because two people are stored in one row. It would be difficult to count or analyze individuals properly. We can fix this by making a variable for “subjects” and listing each one:
# A tibble: 4 × 2
incident_id subjects
<int> <chr>
1 1 John
2 1 Mike
3 2 Sara
4 2 Anna
Why Tidy?
Why ensure that your data is tidy? There are two main advantages:
- There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. 2, There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. Most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
dplyr (remember dplyr!?!), ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Since we will be working with functions that tidy data, we need to install tidyverse:
install.packages( "tidyverse" )After installing (or if you already have the package installed), we need to use the library() function to load it:
Reshaping Data
Now that we know what tidy data should look like, the next step is learning how to transform messy data into that structure. One of the most common problems you’ll run into is that your data is in the wrong shape.
Take a look at this example:
# A tibble: 2 × 3
precinct arrests_2022 arrests_2023
<chr> <dbl> <dbl>
1 A 50 60
2 B 30 45
If I asked you to give me a table of “arrests” over time, what problems do you see?
You probably noticed that the year is embedded in column names. This makes it had to extract this information for year.
Wide vs Long Data
This issue we saw in the example above has a specific name. These data are wide, meaning that there are many columns and the variable (in this case “year”) is spread over multiple columns. This is in contrast to data that is long, where there are fewer columns and values stacked into rows
Think of it like this:
- “Wide data spreads information across columns.” - “Long data stacks information down rows.”
pivot_longer
There is where the pivot_longer function comes in.
First, let’s create our simple wide data object:
data_wide <- data.frame(
precinct = c( "A", "B" ),
arrests_2022 = c( 50, 30 ),
arrests_2023 = c( 60, 45 )
)
data_wide precinct arrests_2022 arrests_2023
1 A 50 60
2 B 30 45
Now, let’s create the long data object. Let me show you what it does first and we will go through each line:
data_long <- data_wide |>
pivot_longer(
cols = c( "arrests_2022", "arrests_2023"),
names_to = "year",
values_to = "arrests"
)
data_long# A tibble: 4 × 3
precinct year arrests
<chr> <chr> <dbl>
1 A arrests_2022 50
2 A arrests_2023 60
3 B arrests_2022 30
4 B arrests_2023 45
Ok, let’s see what we did:
cols = c( "arrests_2022", "arrests_2023")asks “which column are we resphaping?” and we use two column names (note that we have to usec()to combine them into a vector.names_to = "year"tellspivot_longer()where the old column names are supposed to go.values_to = "arrests"tells the function what to do with the values.
In this case, they are going to go to a variable called num_arrests.
If you are a bit lost, think about it like this: we are taking column names and turning them into data.
If you are a visual learner, think of it this way:
flowchart LR
A["WIDE DATA<br/>(columns contain values)<br/><pre>
┌───────────┬───────────────┬───────────────┐
│ precinct │ arrests_2022 │ arrests_2023 │
├───────────┼───────────────┼───────────────┤
│ A │ 50 │ 30 │
│ B │ 60 │ 45 │
└───────────┴───────────────┴───────────────┘
</pre>"]
B["pivot_longer()<br/>
cols = starts_with('arrests')<br/>
names_to = 'year'<br/>
values_to = 'arrests'"]
C["LONG DATA<br/>(values stacked into rows)<br/><pre>
┌───────────┬────────────────┬─────────┐
│ precinct │ year │ arrests │
├───────────┼────────────────┼─────────┤
│ A │ arrests_2022 │ 50 │
│ A │ arrests_2023 │ 60 │
│ B │ arrests_2022 │ 30 │
│ B │ arrests_2023 │ 45 │
└───────────┴────────────────┴─────────┘
</pre>"]
A --> B --> C
We are almost to a tidy dataset. What is missing? Take a look at the values for the year variable. It contains the string “arrests_” which was carried over from our transformation. What we need to is remove this information. There is a nice function called str_remove() which takes an object or column and removes any characters. The str_remove() function is in a package called stringr which is part of the tidyverse. We can put that into the mutate() function we we already saw in dpylr.
# make sure we load the functions
library( dplyr )
# now remove the unncessary text
data_long <- data_long |>
mutate(
year = str_remove(year, "arrests_")
)Now, we can create a table using the tools from dplyr!
Perhaps average arrests by year:
# A tibble: 2 × 2
year avg_arrests
<chr> <dbl>
1 2022 40
2 2023 52.5
Or average arrests by precinct:
pivot_wide
So far, we’ve taken messy wide data and made it tidy. But sometimes we actually want to go the other direction. This usually happens when we’re preparing results for reports or summaries. There is where the pivot_wide function comes in. First, let’s create a simple long data object. Suppose we had some data like this for crimes in two neighborhoods:
# A tibble: 6 × 3
neighborhood year crimes
<chr> <chr> <dbl>
1 A 2020 50
2 A 2021 60
3 A 2022 30
4 B 2020 45
5 B 2021 90
6 B 2022 40
If we want this in a format that’s easier to read in a report we could have one row per precinct and separate columns for each year. We can do that by creating a data object that is in wide format using the pivot_wide() function:
data_wide <- data_long |>
pivot_wider(
names_from = year,
values_from = crimes
)
data_wide# A tibble: 2 × 4
neighborhood `2020` `2021` `2022`
<chr> <dbl> <dbl> <dbl>
1 A 50 60 30
2 B 45 90 40
Ok, let’s see what we did:
names_from = "year"tellspivot_wider()to use values in the year column as new column names.values_from = "crimes"tells the function to fill the values with counts of crimes.
Note the difference from pivot_longer() in terms of the arguments which uses ...._to = whereas pivot_wider() uses ..._from =.
If you are a visual learner, think of it this way:
flowchart LR A["LONG DATA<br/>(values stacked into rows)<br/><pre> ┌───────────────┬──────┬─────────┐ │ neighborhood │ year │ crimes │ ├───────────────┼──────┼─────────┤ │ A │ 2020 │ 50 │ │ A │ 2021 │ 60 │ │ A │ 2022 │ 30 │ │ B │ 2020 │ 45 │ │ B │ 2021 │ 90 │ │ B │ 2022 │ 40 │ └───────────────┴──────┴─────────┘ </pre>"] B["pivot_wider()<br/> names_from = year<br/> values_from = crimes"] C["WIDE DATA<br/>(columns contain values)<br/><pre> ┌───────────────┬───────────────┬───────────────┬───────────────┐ │ neighborhood │ crimes_2020 │ crimes_2021 │ crimes_2022 │ ├───────────────┼───────────────┼───────────────┼───────────────┤ │ A │ 50 │ 60 │ 30 │ │ B │ 45 │ 90 │ 40 │ └───────────────┴───────────────┴───────────────┴───────────────┘ </pre>"] A --> B --> C
Note that this works if we are using multiple variables as well. For example, say we were looking at arrests and calls for service in a neighborhood and our long data looked like this:
# A tibble: 6 × 4
neighborhood year arrests calls
<chr> <chr> <dbl> <dbl>
1 A 2020 50 100
2 A 2021 60 125
3 A 2022 30 90
4 B 2020 45 265
5 B 2021 90 75
6 B 2022 40 182
Then, we can create columns by year for arrests and calls like this:
data_wide <- data_long |>
pivot_wider(
names_from = year,
values_from = c( arrests, calls )
)
data_wide# A tibble: 2 × 7
neighborhood arrests_2020 arrests_2021 arrests_2022 calls_2020 calls_2021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 50 60 30 100 125
2 B 45 90 40 265 75
# ℹ 1 more variable: calls_2022 <dbl>
In general, use pivot_wider() when you’re making a table for a report, you want side-by-side comparisons, you’re exporting data (Excel, dashboards), and so on. In comparing the two functions, think of it like this: pivot_longer() helps us prepare data for analysis, whereas pivot_wider() helps us present results.