Data Transformation

When we first start working with data, it’s easy to assume that once we load a dataset into R, we’re ready to analyze it. In reality, that’s almost never the case. Most open data is messy, inconsistent, and not immediately useful for answering questions. It is wild.

In the previous chapter we discussed how to get data tidy, but even with a tidy dataset we are not yet ready for visualization and analysis. We still have to get the objects of analysis and visualization prepared.

That’s where data transformation comes in. Data transformation is the process of taking raw data and turning it into something we can actually use for analysis. We reshape variables, create new ones, and standardize messy information so that patterns can be identified and decisions can be made.

For example, a dataset might give us:

On their own, these aren’t very helpful. But through transformation, we can turn them into: - Day of the week (Was this a weekend incident?)
- Time category (Did this occur during a night shift?)
- Cleaned location data (So everything is consistently labeled)

In this chapter, we’re going to focus on three major types of transformations:

All of these transformations have one ultimate goal: to prepare our data for analysis and visualization.

Why is this important? When we move into visualization using ggplot2 and start building visual displays, we won’t just be plotting data we pulled from an open source location. We’ll be plotting variables that we created, like shifts, weekends, trends over time, and categorized outcomes.

Logical Vectors & Conditional Operations

At its core, a logical vector is just a series of TRUE and FALSE values. These come from asking questions of your data.

For example:
- Is the hour greater than 20?
- Is the day Saturday or Sunday?
- Does this record match a certain condition?

In R, every time you ask a question like that, you get back TRUE or FALSE.

To illustrate, let’s start by creating a toy data set of crime incidents to work with:

dat <- data.frame(
  id = 1:10,
  INC_HOUR = c(2, 9, 14, 21, 23, 7, 12, 18, 3, 16),
  INC_DAY_WEEK = c(
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
    "Saturday", "Sunday", "Monday", "Friday", "Sunday"
  ),
  OFFENSE = c(
    "Theft", "Assault", "Burglary", "Theft", "Robbery",
    "Assault", "Theft", "Burglary", "Robbery", "Assault"
  )
)

dat
   id INC_HOUR INC_DAY_WEEK  OFFENSE
1   1        2       Monday    Theft
2   2        9      Tuesday  Assault
3   3       14    Wednesday Burglary
4   4       21     Thursday    Theft
5   5       23       Friday  Robbery
6   6        7     Saturday  Assault
7   7       12       Sunday    Theft
8   8       18       Monday Burglary
9   9        3       Friday  Robbery
10 10       16       Sunday  Assault

Let’s think of a few questions we can ask with the data:

  • Which incidents occur after 7?
  • Which incidents occur on Friday?

The way we ask questions with logical vectors is to use comparison operators:

  • == equal to (note: = is used to assign objects in R so we use == for an equality statement)
  • != not equal to
  • > greater than
  • < less than
  • >= greater than or equal to
  • <= less than or equal to

Which incidents occur after 7?

To answer this we can use the > operator like so:

dat$INC_HOUR > 7

Let’s take a closer look at our code: we identify the object dat, we use the $ sign to indicate the specific column we are interested in, here it is INC_HOUR, and then we state the logical statement > 7. Before you run this line, look at the data and think about which cases are going to be TRUE and which are going to be FALSE.

When you run it, you get this:

 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE

It is a vector of logical values telling us for which case the condition is TRUE or FALSE. For example, the first case has a value of 2 for the INC_HOUR variable so the value for the logical vector is FALSE because it is not greater than 7.

Which incidents occur on Friday?

dat$INC_DAY_WEEK == "Friday"

Here, we are saying which cases are equal to "Friday". This is an important point to pause and remind you that the R programming language is case sensitive. Also, that "Friday" is a character (as opposed to say, numeric). Try these and see if it gives you the same results:

dat$INC_DAY_WEEK == "friday"

dat$INC_DAY_WEEK == "FRIDAY"

dat$INC_DAY_WEEK == "Friday "

These (should) come back as all FALSE. Why?

These are all FALSE because the condition we stated, that dat$INC_DAY_WEEK is equal to "friday" or "FRIDAY" or "Friday ", is not met by any case. This is because our day of the week variable is a character so it wants the EXACT string of letters. Can you see how these three instances are different?

We will discuss strings more below and how to edit them (particularly when there are inconsistencies).

Operators can be used together to satisfy multiple logical conditions using the | (“or”) or the & (“and”) operators:

# or
dat$INC_HOUR >= 7 | dat$INC_DAY_WEEK != "Friday"

# and
dat$INC_HOUR >= 7 & dat$INC_DAY_WEEK != "Friday"

Think about what the first statement above says: “incident hour is greater than or equal to 7 or incident day is not equal to Friday”. Think about what the second statement above says: “incident hour is greater than or equal to 7 and incident day is not equal to Friday”.

A final operator that is really useful is the %in% operator. This allows us to specify a set of values or strings and then extract the cases that meet that condition. For example we could write:

dat$INC_DAY_WEEK == "Saturday" | dat$INC_DAY_WEEK ==  "Sunday"

Or, we could write it like this:

dat$INC_DAY_WEEK %in% c( "Saturday", "Sunday" )

These statements give the same logical vector, but one is cleaner. For example, what if we have several days:

# messy
dat$INC_DAY_WEEK == "Friday" | dat$INC_DAY_WEEK == "Saturday" | dat$INC_DAY_WEEK == "Sunday"

# clean
dat$INC_DAY_WEEK %in% c("Friday", "Saturday", "Sunday")

As we move on with the material, the use of logical vectors will make a lot more sense because you will see that they are a convenient way to filter data. For example, in the prior chapters we saw how the filter() function in the dplyr package could be used to select specific columns. But, we can plug a logical vector into the filter() function to get a subset of data:

dat |> 
  dplyr::filter( INC_DAY_WEEK == "Saturday" )
  id INC_HOUR INC_DAY_WEEK OFFENSE
1  6        7     Saturday Assault

Here, we took just those cases meeting the condition of the logical vector (note: I used dplyr::filter to call the dplyr package rather than loading the library with library( dplyr ), to save space).

Suppose we want cases that are after noon and on the weekend:

dat |> 
  dplyr::filter( dat$INC_HOUR >= 12 & INC_DAY_WEEK %in% c( "Saturday", "Sunday" ) )
  id INC_HOUR INC_DAY_WEEK OFFENSE
1  7       12       Sunday   Theft
2 10       16       Sunday Assault

Working with Strings (Text Data)

Now we shift from logical thinking to text data, which are everywhere in open-source datasets.

Think about variables like: - Offense descriptions
- Location fields - City names
- Narrative reports

Unlike numbers, text data are often inconsistent (“Phoenix”, “PHOENIX”, “phx”) and messy (extra spaces, abbreviations) making them hard to analyze directly.

Let’s create a separate toy dataset to work with for strings.

dat <- data.frame(
  id = 1:12,
  city = c(
    "Phoenix", "PHOENIX ", "phx", "Tempe", "tempe", " Glendale",
    "GLENDALE", "Mesa", "mesa ", "Scottsdale", "SCOTTSDALE", "scottsdale"
  ),
  offense = c(
    "Assault", "ASSAULT", "assault ", "Burglary", "BURGLARY", "burglary",
    "Domestic Violence", "domestic violence", "Domestic   Violence",
    "Theft", "THEFT", " theft "
  ),
  location = c(
    "123 MAIN ST", "123 Main St", "123 Main Street",
    "45 E 2ND AVE", "45 E 2nd Ave", "45 e 2ND AVE",
    "700 W RIO SALADO PKWY", "700 W Rio Salado Pkwy", "700 W RIO SALADO PARKWAY",
    "100 N CENTER RD", "100 N Center Rd", "100 N CENTER RD"
  )
)

dat
   id       city             offense                 location
1   1    Phoenix             Assault              123 MAIN ST
2   2   PHOENIX              ASSAULT              123 Main St
3   3        phx            assault           123 Main Street
4   4      Tempe            Burglary             45 E 2ND AVE
5   5      tempe            BURGLARY             45 E 2nd Ave
6   6   Glendale            burglary             45 e 2ND AVE
7   7   GLENDALE   Domestic Violence    700 W RIO SALADO PKWY
8   8       Mesa   domestic violence    700 W Rio Salado Pkwy
9   9      mesa  Domestic   Violence 700 W RIO SALADO PARKWAY
10 10 Scottsdale               Theft          100 N CENTER RD
11 11 SCOTTSDALE               THEFT          100 N Center Rd
12 12 scottsdale              theft           100 N CENTER RD

Take a look through the data. Look for the following issues we commonly find with string variables: - inconsistent capitalization - extra spaces
- abbreviations - punctuation differences - multi-word labels - alternate spellings for the same thing

This is fairly standard for what you will see with open-source data. Let’s take a look at fixing some of these issues.

Standardize Case

A very common first step is to make everything the same case. Why do this? There are several reasons: It makes values easier to compare, it reduces duplicates caused only by case differences, it makes later cleaning more consistent.

For example, using the unique() function, we can list all the unique values of a variable. Since these are strings, we are looking at the unique strings:

unique( dat$city )
 [1] "Phoenix"    "PHOENIX "   "phx"        "Tempe"      "tempe"     
 [6] " Glendale"  "GLENDALE"   "Mesa"       "mesa "      "Scottsdale"
[11] "SCOTTSDALE" "scottsdale"

We can see that there are several different cases for each name. We can fix this by using some functions from the stringr package in conjunction with functions from dplyr. You can install the package using install.packages( "stringr" ).

Now that the functions are loaded, let’s adjust the cases to be consistent:

dat <- dat |>
  mutate(
    city = str_to_title( city ),
    offense = str_to_title( offense ),
    location = str_to_upper( location )
  )

We used a few functions, here is what they are doing: -str_to_title() converts to title case, where only the first letter of each word is capitalized. If you look, you can see that the first letter of “city” and “offense” are now capitalized. - str_to_upper() converts to upper case. For “location” we made all the letters upper case.

There is also: - str_to_lower() which converts to lower case. - str_to_sentence() which convert to sentence case, where only the first letter of sentence is capitalized (nice for long strings like quotes or descriptions).

Remove White Space

Extra spaces are one of the most common hidden problems. Take a look through the dat object and see if you can spot instances where there is extra space (i.e. white space) that should not be there.

For the variable “city” you can see that the second and ninth cases have an extra space:

dat$city
 [1] "Phoenix"    "Phoenix "   "Phx"        "Tempe"      "Tempe"     
 [6] " Glendale"  "Glendale"   "Mesa"       "Mesa "      "Scottsdale"
[11] "Scottsdale" "Scottsdale"

To fix this, we can use the str_squish() function which removes leading spaces (space in front of a word), removes trailing spaces (space after a word), and collapses multiple spaces into one:

dat <- dat |>
  mutate(
    city = str_squish( city ),
    offense = str_squish( offense ),
    location = str_squish( location )
  )

This looks much better, we can see the difference by examining the unique strings for each variable again:

unique( dat$city )
[1] "Phoenix"    "Phx"        "Tempe"      "Glendale"   "Mesa"      
[6] "Scottsdale"
unique( dat$offense )
[1] "Assault"           "Burglary"          "Domestic Violence"
[4] "Theft"            
unique( dat$location )
[1] "123 MAIN ST"              "123 MAIN STREET"         
[3] "45 E 2ND AVE"             "700 W RIO SALADO PKWY"   
[5] "700 W RIO SALADO PARKWAY" "100 N CENTER RD"         

Group Similar Values

As you can see above, we still have the issue that different naming and abbreviations are used for the same city. We can fix this by building on what we saw above with the %in% operator and the case_when() function. Essentially what we want to do is find all the unique ways that a city is named, and replaced those with a single consistent name.

To do this, we will create a new variable called city_clean which is the cleaned version of the names. We will create a new variable because we don’t want to write over the old names (in case we want them for reference).

dat <- dat |>
  mutate(
    city_clean = case_when(
      city %in% c( "Phx", "Phoenix" ) ~ "Phoenix",
      city %in% c( "Tempe" ) ~ "Tempe",
      city %in% c( "Glendale" ) ~ "Glendale",
      city %in% c( "Mesa" ) ~ "Mesa",
      city %in% c( "Scottsdale" ) ~ "Scottsdale",
      TRUE ~ city
    )
  )

Ok, that was a lot so let’s unpack it: - we are using mutate to create a new variable. - we are using case_when() to check conditions we specify from the first to the last value of the variable “city”. What the function is doing is taking each case and if the condition is TRUE, it assigns the value on the right side. - the statement city %in% c("Phx", "Phoenix") ~ "Phoenix" says “take the variable city and if the string is”Phx” or “Phoenix” make the string in city_clean be “Phoenix”. - we repeated this for Temple, Glendale, Mesa, and Scottsdale. - the last line TRUE ~ city tells the function what to do if the conditions above are not met. For example, what should it do if not of the conditions are above are met? Well, we want it to return the value for “city” (otherwise it will return NA; try it!)

Replace Text

Sometimes you do not want a new label, you just want to fix the text. We can use the str_replace_all() function to accomplish this task.

For example, take a look at the “location” variable and you will see that there are a few different usages of street. If we want these to be consistent, then we can replace the text:

dat <- dat |>
  mutate(
    location = str_replace_all( location, "STREET", "ST" ),
    location = str_replace_all( location, "PARKWAY", "PKWY" ),
    location = str_replace_all( location, "ROAD", "RD" )
  )

Putting It All Together

We have discussed these separately, but once we know what we want to clean up we can put it all together in a single workflow.

Since we have been writing over our dat object, we need to recreate it from the original:

dat <- data.frame(
  id = 1:12,
  city = c(
    "Phoenix", "PHOENIX ", "phx", "Tempe", "tempe", " Glendale",
    "GLENDALE", "Mesa", "mesa ", "Scottsdale", "SCOTTSDALE", "scottsdale"
  ),
  offense = c(
    "Assault", "ASSAULT", "assault ", "Burglary", "BURGLARY", "burglary",
    "Domestic Violence", "domestic violence", "Domestic   Violence",
    "Theft", "THEFT", " theft "
  ),
  location = c(
    "123 MAIN ST", "123 Main St", "123 Main Street",
    "45 E 2ND AVE", "45 E 2nd Ave", "45 e 2ND AVE",
    "700 W RIO SALADO PKWY", "700 W Rio Salado Pkwy", "700 W RIO SALADO PARKWAY",
    "100 N CENTER RD", "100 N Center Rd", "100 N CENTER RD"
  )
)

dat
   id       city             offense                 location
1   1    Phoenix             Assault              123 MAIN ST
2   2   PHOENIX              ASSAULT              123 Main St
3   3        phx            assault           123 Main Street
4   4      Tempe            Burglary             45 E 2ND AVE
5   5      tempe            BURGLARY             45 E 2nd Ave
6   6   Glendale            burglary             45 e 2ND AVE
7   7   GLENDALE   Domestic Violence    700 W RIO SALADO PKWY
8   8       Mesa   domestic violence    700 W Rio Salado Pkwy
9   9      mesa  Domestic   Violence 700 W RIO SALADO PARKWAY
10 10 Scottsdale               Theft          100 N CENTER RD
11 11 SCOTTSDALE               THEFT          100 N Center Rd
12 12 scottsdale              theft           100 N CENTER RD

Now, let’s build out the full workflow:

dat_clean <- dat |>
  
  # fix the case
  mutate(
    city = str_to_title( city ),
    offense = str_to_title( offense ),
    location = str_to_upper( location )
  ) |> 
  
  # clean the white space
  mutate(
    city = str_squish( city ),
    offense = str_squish( offense ),
    location = str_squish( location ) 
    ) |> 
  
  # make names consistent
  mutate(
    city_clean = case_when(
      city %in% c( "Phx", "Phoenix" ) ~ "Phoenix",
      city %in% c( "Tempe" ) ~ "Tempe",
      city %in% c( "Glendale" ) ~ "Glendale",
      city %in% c( "Mesa" ) ~ "Mesa",
      city %in% c( "Scottsdale" ) ~ "Scottsdale",
      TRUE ~ city )
    ) |> 
  
  # replace text for consistency
  mutate(
    location = str_replace_all( location, "STREET", "ST" ),
    location = str_replace_all( location, "PARKWAY", "PKWY" ),
    location = str_replace_all( location, "ROAD", "RD" )
  )

Dates

Dates are notoriously messy and frustrating to handle. For example, take a look at this fictitious (though representative!) data:

   id incident_date
1   1    2022-01-15
2   2    01/22/2021
3   3 March 3, 2024
4   4    2023/04/10
5   5     5-12-2024
6   6    2020-06-18
7   7    07/04/2024
8   8 August 9 2021
9   9    2024.09.14
10 10    10/31/2022

Do you see the problem? Everything is in a different format and is inconsistent. Wild data!

Fortunately, there is an excellent package for working with dates and times: lubridate.

To see the functionality, let’s create our toy data set to work with:

dat_dates <- data.frame(
  id = 1:10,
  
  incident_date = c(
    "2022-01-15",
    "01/22/2021",
    "March 3, 2024",
    "2023/04/10",
    "5-12-2024",
    "2020-06-18",
    "07/04/2024",
    "August 9 2021",
    "2024.09.14",
    "10/31/2022"
  )
)

dat_dates

The parse_date_time() Function

Essentially what we want is the dates to be the same format. In lubridate there is a function, parse_date_time(), that does exactly this: it takes the input and parses it into a consistent format.

I will give you the code chunk first, then we will work through it:

dat_dates |>
  mutate(
    date_clean = parse_date_time(
      incident_date,
      orders = c( "ymd", "mdy", "B d Y" )
    )
  )
   id incident_date date_clean
1   1    2022-01-15 2022-01-15
2   2    01/22/2021 2021-01-22
3   3 March 3, 2024 2024-03-03
4   4    2023/04/10 2023-04-10
5   5     5-12-2024 2024-05-12
6   6    2020-06-18 2020-06-18
7   7    07/04/2024 2024-07-04
8   8 August 9 2021 2021-08-09
9   9    2024.09.14 2024-09-14
10 10    10/31/2022 2022-10-31

This prints out the messy data as well as a new column “date_clean” that has been parsed. Ok, let’s take a look at what the function is doing:
- using mutate(), we are creating a new variable called “date_clean” with the parse_date_time() function. - the parse_date_time() function takes the messy data, “incident_date”, and then uses some additional info to parse it. - the orders = argument takes several pieces: ymd is year-month-day, mdy is month-day-year, and B d Y is full month name, day, year.

The orders = argument might be a bit confusing. Essentially we are telling the function the different ways that the date is presented in the “incident_date” variable. For example, in the dat_dates object, look at the first case and you will see that the date looks like 2022-01-15. This is year, month, day format which the orders = argument recognizes with the ymd value. How about mdy? Take a look at the second case: 01/22/2021. And so on. You can see these different format codes by examining the help page for the parse_date_time() function with: ?parse_date_time.

Working with dates is often a trial and error process of getting a sense of what formats are being used and how to parse them correctly. For example, let’s say we didn’t see that some of the dates are in ymd format and exclude this from orders =:

dat_dates |>
  mutate(
    date_clean = parse_date_time(
      incident_date,
      orders = c( "mdy", "B d Y" )
    )
  )
   id incident_date date_clean
1   1    2022-01-15       <NA>
2   2    01/22/2021 2021-01-22
3   3 March 3, 2024 2024-03-03
4   4    2023/04/10       <NA>
5   5     5-12-2024 2024-05-12
6   6    2020-06-18       <NA>
7   7    07/04/2024 2024-07-04
8   8 August 9 2021 2021-08-09
9   9    2024.09.14       <NA>
10 10    10/31/2022 2022-10-31

We get an error that says it failed to parse several cases. These cases have a value of <NA> in the output object. If we take a look, then we can see that these are cases with the year-month-day format. We just need to add that to our orders = statement.

Date parsing is one of the messiest parts of open-source data wrangling. Just keep in mind that the goal is not to memorize formats, but to learn how to recognize patterns and standardize them.

Transformation

Let’s create an object with the cleaned dates:

dat_dates_clean <- dat_dates |>
  mutate(
    date_clean = parse_date_time(
      incident_date,
      orders = c( "ymd", "mdy", "B d Y" )
    )
  )

Now that our date variable is clean, we can actually use it to make some other useful variables:

dat_dates_clean <- dat_dates_clean |>
  mutate(
    year = year( date_clean ),
    month = month( date_clean, label = TRUE),
    day = day( date_clean )
  )

In the statement above we use mutate() to create several variables: “year”, “month”, and “day”. When creating “month” we see the argument label = to TRUE to keep the labels for the months (otherwise it defaults to a numeric value such as “1” for “Jan”).

Now that are data are cleaned, we can use dplyr to answer questions like: Which month has the most incidents?

dat_dates_clean |>
  group_by( month ) |>  
  summarize( n() )
# A tibble: 9 × 2
  month `n()`
  <ord> <int>
1 Jan       2
2 Mar       1
3 Apr       1
4 May       1
5 Jun       1
6 Jul       1
7 Aug       1
8 Sep       1
9 Oct       1

Example

Let’s pull this together by working with an example. As we have seen in prior chapters, the DWVpack package contains a data file on crimes in Phoenix, AZ called phx_crime. Take a look at the time_date variable for that object:

# load the library
library( DWVpack )

# print out the first few cases using the head() function
head( phx_crime$time_date )
[1] "11/01/2015  00:00" "11/01/2015  00:00" "11/01/2015  00:00"
[4] "11/01/2015  00:00" "11/01/2015  00:00" "11/01/2015  00:00"

We can see that the time_date variable contains the date of the incident as well as the time and looks like this: 11/01/2015 00:00. There is a lot of information here. What we want to do is pull it apart and get several variables:
- Year - Month - Day - Time

We can use dplyr and lubridate to help us create these variables. Let’s start by getting the date and time separated. If you take a look, 11/01/2015 00:00, you will see that the format is: “month/day/year hour:minute”. There is a function in lubridate called mdy_hm() that is build specifically for this format and converts a character string into a true date-time object that R can recognize and work with. For example, mdy_hm( "11/1/2015 14:30" ) becomes something like "2015-11-01 14:30:00 UTC" where 2015-11-01 is the date stored in standardized YYYY-MM-DD format, 14:30:00 is the time in 24-hour format, and UTC is the time zone (“Coordinated Universal Time”).

Take a look using the Phoenix data:

# create the object
datetime <- mdy_hm( phx_crime$time_date )

# take a look at what it creates
tail( datetime )
[1] "2025-09-02 23:00:00 UTC" "2025-09-02 23:29:00 UTC"
[3] "2025-09-02 23:38:00 UTC" "2025-09-02 23:42:00 UTC"
[5] "2025-09-02 23:49:00 UTC" "2025-09-02 23:55:00 UTC"

Next we want to get the date and time as separate objects. We do this using the as_date() and format() functions.

# create the objects
date <- as_date( datetime )
time <- format( datetime, "%H:%M" )

# inspect the first few cases

tail( date )
[1] "2025-09-02" "2025-09-02" "2025-09-02" "2025-09-02" "2025-09-02"
[6] "2025-09-02"
tail( time )
[1] "23:00" "23:29" "23:38" "23:42" "23:49" "23:55"

The formatting codes in format() indicate that we want time in 24-hour (%H) and minutes (%M).

As a final step, we want to pull out calendar components (i.e. year, month, day) as we did above:

year  <- year( datetime )

# here, we want to keep the label and don't want it abbreviated
month <- month( datetime, label = TRUE, abbr = FALSE)

day   <- day( datetime )

I have presented this in separate steps, but we can put this all together as a “data recipe” using the mutate() function:

phx_crime_clean_date_time <- phx_crime |>
  mutate(
    
    # Parse datetime
    datetime = mdy_hm( time_date ),

    # Separate date and time
    date = as_date( datetime ),
    time = format( datetime, "%H:%M" ),

    # Create components
    year  = year( datetime ),
    month = month( datetime, label = TRUE, abbr = FALSE ),
    day   = day( datetime )

  )

We will work with these data in subsequent chapters. To avoid having to create a tidy data set with the cleaned date variables, DWVpack includes a data object called tidy_phx_crime which has already cleaned up the date and time variables.


NEED EXERCISES FOR TEST YOUR KNOWLEDGE