Pre-Processing Crime Data for Phoenix

The Crime Data file from the data portal contains incidents reported to the Phoenix Police Department. The city updates the file daily at 11 a.m., and it covers incidents from November 2015 through 7 days before the posting date.

Before we do anything with it, we need to do some pre-processing: cleaning up variables, creating basic objects, and incorporating population data to calculate rates.

Setup and Loading the Data

Preparing the environment

First we need to clear the workspace and load the libraries we are going to use.

# clear workspace
rm( list = ls() )


# load libraries
library( dplyr )        # used for wrangling the data
library( tidyr )        # used for wrangling the data
library( openxlsx )     # for opening an excel file
library( here )         # for referencing the local directory
library( tidycensus )   # getting data from the census API
library( zoo )          # to help with interpolation

Loading the data

Next we want to load the data and do some cleaning. The data can be read directly from the download link on the data portal.

# get the data
url <- "https://www.phoenixopendata.com/dataset/cc08aace-9ca9-467f-b6c1-f0879ab1a358/resource/0ce3411a-2fc6-4302-a33f-167f68608a20/download/crime-data_crime-data_crimestat.csv"

# assign the data to an object
crimeData <- read.csv( url, as.is = TRUE, header = TRUE )

# remove cases that are missing data
crimeData <- na.omit( crimeData )

# remove duplicate ids (every incident number that appears more than once is dropped entirely)
duplicateIds <- crimeData$INC.NUMBER[ duplicated( crimeData$INC.NUMBER )]
crimeData <- crimeData[ !crimeData$INC.NUMBER %in% duplicateIds, ]
rm( duplicateIds )

# clean up the dates
date.vec <- strptime( crimeData$OCCURRED.ON, format="%m/%d/%Y %H:%M" )
crimeData$year   <- format( date.vec, format="%Y" )
crimeData$month  <- format( date.vec, format="%B" )
crimeData$day365 <- format( date.vec, format="%j" )
crimeData$week   <- format( date.vec, format="%V" )

# now we want to clean up the times
crimeData$hour   <- format(date.vec, format = "%H") %>% as.numeric()

# clean up the variable classifying the cases
crimeData <- 
  crimeData %>% 
  mutate( crime.type = case_when( 
    UCR.CRIME.CATEGORY == "AGGRAVATED ASSAULT" ~ "Assault",
    UCR.CRIME.CATEGORY == "ARSON" ~ "Arson",
    UCR.CRIME.CATEGORY == "BURGLARY" ~ "Burglary",
    UCR.CRIME.CATEGORY == "DRUG OFFENSE" ~ "Drugs",
    UCR.CRIME.CATEGORY == "LARCENY-THEFT" ~ "Theft",
    UCR.CRIME.CATEGORY == "MURDER AND NON-NEGLIGENT MANSLAUGHTER" ~ "Homicide",
    UCR.CRIME.CATEGORY == "MOTOR VEHICLE THEFT" ~ "MV Theft",
    UCR.CRIME.CATEGORY == "RAPE" ~ "Rape",
    UCR.CRIME.CATEGORY == "ROBBERY" ~ "Robbery" ) )

# drop cases from 2015 (these are dropped because the data cover only the last part of 2015)
crimeData <- 
  crimeData %>% 
  filter( year != 2015 )

# drop cases for the most recent month (since the data for the current month are incomplete)
crimeData <- crimeData[ ! ( 
  crimeData$month == format( Sys.Date(), format="%B" ) &
    crimeData$year == format( Sys.Date(), format="%Y" ) 
) , ]
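
As a rough illustration of the cleaning steps above, the sketch below applies the same duplicate and date logic to a small made-up data frame. The incident numbers and timestamps are hypothetical; only the column names and the date format are taken from the code above.

```r
# toy data: hypothetical incident numbers and timestamps
toy <- data.frame(
  INC.NUMBER  = c( "A1", "A2", "A2", "A3" ),
  OCCURRED.ON = c( "02/01/2016 13:45", "07/04/2017 00:10",
                   "07/04/2017 00:10", "12/31/2018 23:59" ),
  stringsAsFactors = FALSE
)

# every incident number that appears more than once is dropped entirely,
# so both "A2" rows are removed (not just the second copy)
dupIds <- toy$INC.NUMBER[ duplicated( toy$INC.NUMBER ) ]
toy    <- toy[ !toy$INC.NUMBER %in% dupIds, ]

# parse the timestamps and derive the date fields
toyDates   <- strptime( toy$OCCURRED.ON, format = "%m/%d/%Y %H:%M" )
toy$year   <- format( toyDates, format = "%Y" )   # four-digit year, stored as text
toy$day365 <- format( toyDates, format = "%j" )   # day of the year, zero-padded
toy$hour   <- as.numeric( format( toyDates, format = "%H" ) )

toy
```

Note that year and day365 come back as character strings, which matters later when they are compared, sorted, or used as join keys.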

Adjusting for Population

Right now, the crimeData object is a list of incidents. When we go to aggregate over months or years, we are going to want to adjust based on population differences. In other words, we will want to calculate the rates (not just examine raw counts).

The rates are computed using population data from the Census Bureau. The command below uses the Census Bureau API to pull the data. To see more about how the API is used, see the phoenix-population.R file. As you will see if you look through the phoenix-population.R file, population data are available for the years 2016-2022 and interpolated for the years 2023-2025.

import::here( "phoenixPopDat",
              .from = here::here( "utils/phoenix-population.R" ),
              .character_only = TRUE )

We now have an object, phoenixPopDat, that holds the yearly population for Phoenix. As noted above, the population figures for 2023-2025 are just estimates, as these are not yet available through the Census Bureau’s API. The figure for 2020 is interpolated as well, due to low response rates. The Census Bureau did release a set of experimental estimates for the 2020 1-year ACS, but for ease of analysis here we will just interpolate the data.
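
To make the interpolation idea concrete, here is a minimal sketch using base R's approx(). The population figures are made up, and this is not necessarily how phoenix-population.R does it. In particular, carrying the last observed value forward for 2023-2025 (rule = 2) is an assumption made only for illustration.

```r
# hypothetical yearly population figures; NA marks years to be filled in
toyPop <- data.frame(
  year       = 2016:2025,
  population = c( 1600000, 1620000, 1640000, 1660000, NA,
                  1700000, 1720000, NA, NA, NA )
)

# flag the years that have an observed value
observed <- !is.na( toyPop$population )

# linear interpolation between observed years; rule = 2 means years outside
# the observed range take the nearest observed value
toyPop$population <- approx(
  x    = toyPop$year[ observed ],
  y    = toyPop$population[ observed ],
  xout = toyPop$year,
  rule = 2
)$y

toyPop
```

In this toy example the 2020 value lands halfway between 2019 and 2021, and the years after 2022 simply repeat the 2022 value.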

Appending the Geo Data

We can now append geographic data to the incidents. This step reads the crimeDatGeo2016-2024.rds file, which is created by the crime-geo.R script. That script uses the Census API to pull coordinate data.

# get the file
crimeDatGeo <- readRDS( here( "data/data-geo/crimeDatGeo2016-2024.rds" ) )

# merge the geographic data with the cases
crimeData <- left_join( crimeData, crimeDatGeo, by = "INC.NUMBER" )
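
Because this is a left join, every incident is kept and any incident without a match in the geographic file gets NA for the new columns. The sketch below shows that behavior on made-up data; the lon and lat column names are hypothetical and may not match the columns in the actual .rds file.

```r
library( dplyr )

# hypothetical incidents and coordinates
toyCases <- data.frame( INC.NUMBER = c( "A1", "A2", "A3" ),
                        crime.type = c( "Theft", "Arson", "Robbery" ) )
toyGeo   <- data.frame( INC.NUMBER = c( "A1", "A3" ),
                        lon = c( -112.07, -112.10 ),
                        lat = c( 33.45, 33.51 ) )

# all three cases are kept; "A2" has no coordinates, so lon and lat are NA
toyJoined <- left_join( toyCases, toyGeo, by = "INC.NUMBER" )

# share of cases without coordinates
missingShare <- mean( is.na( toyJoined$lon ) )
missingShare
```

A check like missingShare is a quick way to see how many incidents fall outside the period covered by the geographic file.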

Creating Objects

Counts

Now that we have the incidents pre-processed and the population data loaded, we can build a few objects to use for analysis.

# daily count of crimes
# (group on the calendar date itself so the rows stay in chronological order
#  and each count stays attached to its own day; grouping on the month name
#  would sort the months alphabetically)
crimesByDay <- 
  crimeData %>% 
  mutate( days = as.Date( strptime( OCCURRED.ON, format="%m/%d/%Y %H:%M" ) ) ) %>% 
  filter( !is.na( days ) ) %>% 
  group_by( days ) %>% 
  summarize( counts = n() ) %>% 
  arrange( days ) %>% 
  mutate( day.time = seq( 1, length( counts ) ) ) %>% 
  select( counts, day.time, days )
crimesByDay <- as.data.frame( crimesByDay )

# monthly count of crimes
crimesByMonth <- 
  crimeData %>% 
  select( year, month ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, month ) %>% 
  summarize( counts = n() ) %>% 
  spread( year, counts ) %>% 
  arrange( match( month, month.name ) ) %>% 
  select( !month )

# monthly count of crimes by type
crimesByTypeByMonth <-
  crimeData %>% 
  select( year, month, crime.type ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, month, crime.type ) %>% 
  summarize( counts = n() ) %>% 
  arrange( match( month, month.name ) )

# month is kept because it is needed below:
# order it by calendar month, then switch to the abbreviated names
crimesByTypeByMonth$month <- factor( crimesByTypeByMonth$month, levels = month.name )
crimesByTypeByMonth$month <- factor( month.abb[ crimesByTypeByMonth$month ], levels = month.abb )

# crimes by year
crimesByYear <- 
  crimeData %>% 
  select( year ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year ) %>% 
  summarize( counts = n() ) 

# yearly count of crimes by type
crimesByTypeByYear <-
  crimeData %>% 
  select( year, crime.type ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, crime.type ) %>% 
  summarize( counts = n() )
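
The group, count, and widen pattern used for the monthly counts can be sketched on a toy data frame. This version uses count() and pivot_wider(), which the tidyr documentation recommends over the older spread(); it is an alternative way to write the step, not the code used above, and the rows are made up.

```r
library( dplyr )
library( tidyr )

# hypothetical incidents: one row per case
toy <- data.frame(
  year  = c( "2016", "2016", "2016", "2017", "2017" ),
  month = c( "January", "January", "February", "January", "February" )
)

toyByMonth <-
  toy %>%
  count( year, month, name = "counts" ) %>%                   # one row per year-month
  pivot_wider( names_from = year, values_from = counts ) %>%  # one column per year
  arrange( match( month, month.name ) )                       # calendar order

toyByMonth
```

The result has one row per month and one column per year, which is the same shape as crimesByMonth before its month column is dropped.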

Rates

Now we want to create objects that capture the rates.

# crime rate is calculated as the count of 
# crimes divided by the population size, then multiplied by 100,000

# calculate the crime rate by month
# (column i of the counts is paired with row i of phoenixPopDat,
#  so both must start in the same year and have no gaps)
crimeRatesMonth <- as.data.frame( crimesByMonth )

for ( i in seq_len( ncol( crimeRatesMonth ) ) ){
  crimeRatesMonth[,i] <- ( crimeRatesMonth[,i] / phoenixPopDat[i,2] ) * 100000
}

# calculate the crime rate by year
# (row i of the counts is paired with row i of phoenixPopDat)
crimeRatesYear <- as.data.frame( crimesByYear )

for ( i in seq_len( nrow( crimeRatesYear ) ) ){
  crimeRatesYear[i,2] <- ( crimeRatesYear[i,2] / phoenixPopDat[i,2] ) * 100000
}

# rename the counts column to be rates
crimeRatesYear <- crimeRatesYear %>%  rename( rates = counts )


# calculate rates for types by month
crimeRatesMonthType <- as.data.frame( crimesByTypeByMonth )

# assign the population data
crimeRatesMonthType <- crimeRatesMonthType %>%
  left_join( phoenixPopDat, by = "year" )

# create the rates
crimeRatesMonthType <- crimeRatesMonthType %>% 
  mutate( rates = ( ( counts / population ) * 100000 ) )


# calculate rates for types by year
crimeRatesYearType <- as.data.frame( crimesByTypeByYear )

# assign the population data
crimeRatesYearType <- crimeRatesYearType %>%
  left_join( phoenixPopDat, by = "year" )

# create the rates
crimeRatesYearType <- crimeRatesYearType %>% 
  mutate( rates = ( ( counts / population ) * 100000 ) )
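
The rate formula is simple enough to check by hand. The sketch below works through it with made-up counts and populations, and looks up each year's population by its year value rather than by row position, which is one way to guard against the two tables starting in different years. All numbers and object names here are hypothetical.

```r
# hypothetical yearly counts and populations
toyCounts <- data.frame( year   = c( "2016", "2017" ),
                         counts = c( 5000, 6600 ) )
toyPop    <- data.frame( year       = c( 2015, 2016, 2017 ),
                         population = c( 1500000, 1600000, 1650000 ) )

# look up each year's population by value rather than by row position
popForYear <- toyPop$population[ match( as.numeric( toyCounts$year ), toyPop$year ) ]

# rate per 100,000 residents
toyCounts$rates <- ( toyCounts$counts / popForYear ) * 100000
toyCounts
```

With these numbers, 5,000 crimes in a population of 1,600,000 works out to 312.5 per 100,000, and the extra 2015 row in the population table does not shift the pairing.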

Saving the File

We can now save the objects as files that can be referenced for analysis on separate pages.

# save the objects as .rds files

saveRDS( crimeData,       file = here( "data/crimeData.rds" ) )
saveRDS( crimesByDay,     file = here( "data/crimesByDay.rds" ) )
saveRDS( crimesByMonth,   file = here( "data/crimesByMonth.rds" ) )
saveRDS( crimesByYear,    file = here( "data/crimesByYear.rds" ) )
saveRDS( crimeRatesMonth, file = here( "data/crimeRatesMonth.rds" ) )
saveRDS( crimeRatesYear,  file = here( "data/crimeRatesYear.rds" ) )

# crimes by type
saveRDS( crimesByTypeByMonth, file = here( "data/crimesByTypeByMonth.rds" ) )
saveRDS( crimesByTypeByYear , file = here( "data/crimesByTypeByYear.rds" ) )
saveRDS( crimeRatesMonthType, file = here( "data/crimeRatesMonthType.rds" ) )
# the yearly rates by type are saved under the file name crimeRatesMonthYear.rds
saveRDS( crimeRatesYearType , file = here( "data/crimeRatesMonthYear.rds" ) )

If you look in the data folder of the repository, you will see that these files have been added. Each time I run this script, I have to push the updated files to the repository, but we can now read them with the readRDS() function.
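
As a small sketch of the save and read round trip, the example below writes a made-up object to a temporary file and reads it back. The object and the temporary path are hypothetical; on the analysis pages the path would instead point at one of the files saved above.

```r
# hypothetical object standing in for one of the saved data frames
toyObject <- data.frame( year   = c( "2016", "2017" ),
                         counts = c( 10L, 12L ) )

# write to a temporary file rather than the repository's data folder
toyFile <- tempfile( fileext = ".rds" )
saveRDS( toyObject, file = toyFile )

# read the object back and confirm it matches what was saved
toyCopy <- readRDS( toyFile )
identical( toyObject, toyCopy )
```

Unlike save() and load(), readRDS() returns the object instead of restoring it under its original name, so it can be assigned to any name on the page that uses it.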

Next steps…

Now that the data are pre-processed, we can reference these objects when we conduct analyses. Visit the Crime in Phoenix page to see these analyses.

Please report any needed corrections to the Issues page. Thanks!