The Crime Data file from the data portal contains incidents reported to the Phoenix Police Department. The city updates the file at 11am every day and it contains data beginning November 2015 up to 7 days before the posting date.
Before we do anything with it, we need to do some pre-processing: cleaning up variables, creating basic objects, and incorporating population data to calculate rates.
First we need to clear the workspace and load the libraries we are going to use.
# clear workspace
rm( list = ls() )
# load libraries
library( dplyr ) # used for wrangling the data
library( tidyr ) # used for wrangling the data
library( openxlsx ) # for opening an excel file
library( here ) # for referencing the local directory
library( tidycensus ) # getting data from the census API
library( zoo ) # to help with interpolation
Next we want to load the data and do some cleaning. The data can be read directly from the download link on the data portal.
# get the data
url <- "https://www.phoenixopendata.com/dataset/cc08aace-9ca9-467f-b6c1-f0879ab1a358/resource/0ce3411a-2fc6-4302-a33f-167f68608a20/download/crime-data_crime-data_crimestat.csv"
# assign the data to an object
crime_dat <- read.csv( url, as.is = TRUE, header = TRUE )
# remove cases that are missing data
crime_dat <- na.omit( crime_dat )
# remove duplicate ids (every record that shares a repeated incident number is dropped, not just the extra copies)
duplicate_ids <- crime_dat$INC.NUMBER[ duplicated( crime_dat$INC.NUMBER )]
crime_dat <- crime_dat[ !crime_dat$INC.NUMBER %in% duplicate_ids, ]
rm( duplicate_ids )
# clean up the dates
date_vec <- strptime( crime_dat$OCCURRED.ON, format="%m/%d/%Y %H:%M" )
crime_dat$year <- format( date_vec, format="%Y" )
crime_dat$month <- format( date_vec, format="%B" )
crime_dat$day365 <- format( date_vec, format="%j" )
crime_dat$week <- format( date_vec, format="%V" )
# now we want to clean up the times
crime_dat$hour <- format(date_vec, format = "%H") %>% as.numeric()
# clean up the variable classifying the cases
crime_dat <-
crime_dat %>%
mutate(
crime_type = case_when(
UCR.CRIME.CATEGORY == "AGGRAVATED ASSAULT" ~ "Assault",
UCR.CRIME.CATEGORY == "ARSON" ~ "Arson",
UCR.CRIME.CATEGORY == "BURGLARY" ~ "Burglary",
UCR.CRIME.CATEGORY == "DRUG OFFENSE" ~ "Drugs",
UCR.CRIME.CATEGORY == "LARCENY-THEFT" ~ "Theft",
UCR.CRIME.CATEGORY == "MURDER AND NON-NEGLIGENT MANSLAUGHTER" ~ "Homicide",
UCR.CRIME.CATEGORY == "MOTOR VEHICLE THEFT" ~ "MV Theft",
UCR.CRIME.CATEGORY == "RAPE" ~ "Rape",
UCR.CRIME.CATEGORY == "ROBBERY" ~ "Robbery" ),
crime_vnv = case_when(
UCR.CRIME.CATEGORY == "AGGRAVATED ASSAULT" ~ "Violent",
UCR.CRIME.CATEGORY == "ARSON" ~ "NonViolent",
UCR.CRIME.CATEGORY == "BURGLARY" ~ "NonViolent",
UCR.CRIME.CATEGORY == "DRUG OFFENSE" ~ "NonViolent",
UCR.CRIME.CATEGORY == "LARCENY-THEFT" ~ "NonViolent",
UCR.CRIME.CATEGORY == "MURDER AND NON-NEGLIGENT MANSLAUGHTER" ~ "Violent",
UCR.CRIME.CATEGORY == "MOTOR VEHICLE THEFT" ~ "NonViolent",
UCR.CRIME.CATEGORY == "RAPE" ~ "Violent",
UCR.CRIME.CATEGORY == "ROBBERY" ~ "Violent" )
)
# drop cases from 2015 (these are dropped because 2015 is only partially covered)
crime_dat <-
crime_dat %>%
filter( year != 2015 )
# drop cases for the most recent month (since the data for the current month are incomplete)
crime_dat <- crime_dat[ ! (
crime_dat$month == format( Sys.Date(), format="%B" ) &
crime_dat$year == format( Sys.Date(), format="%Y" )
) , ]
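To see what the duplicate filter and the date parsing do, here is a minimal sketch on a made-up data frame. The incident numbers and timestamps are hypothetical, `toy` is not part of the pipeline, and the `tz = "UTC"` argument is an assumption added only to keep the example free of daylight-saving surprises.

```r
# toy incidents (hypothetical values, for illustration only)
toy <- data.frame(
  INC.NUMBER  = c( "A1", "B2", "B2", "C3" ),
  OCCURRED.ON = c( "01/01/2016 00:30", "03/01/2016 14:05",
                   "03/01/2016 14:05", "12/31/2016 23:59" ),
  stringsAsFactors = FALSE
)

# ids that appear more than once
dup_ids <- toy$INC.NUMBER[ duplicated( toy$INC.NUMBER ) ]

# drop every record sharing a duplicated id (both copies of "B2" are removed)
toy <- toy[ !toy$INC.NUMBER %in% dup_ids, ]

# parse the timestamps using the same format string as above
toy_dates <- strptime( toy$OCCURRED.ON, format = "%m/%d/%Y %H:%M", tz = "UTC" )

toy$year   <- format( toy_dates, format = "%Y" )                # four-digit year
toy$day365 <- format( toy_dates, format = "%j" )                # zero-padded day of year
toy$hour   <- as.numeric( format( toy_dates, format = "%H" ) )  # hour of day, 0-23
toy
```

Note that this approach discards all copies of a repeated incident number rather than keeping the first one.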
Right now, the crime_dat object is a list of incidents. When we go to aggregate over months or years, we are going to want to adjust for population differences. In other words, we will want to calculate the rates (not just examine raw counts).
The rates are computed using population data from the Census Bureau. The command below uses the Census Bureau API to pull the data. To see more about how the API is used, see the phoenix-population.R file. As you will see if you look through the phoenix-population.R file, population data are available for the years 2016-2022 and interpolated for the years 2023-2025.
import::here( "phoenixPopDat",
.from = here::here( "utils/phoenix-population.R" ),
.character_only = TRUE )
We now have an object, phoenixPopDat, that holds the yearly population for Phoenix. As noted above, the populations for 2023-2025 are just estimates, as these are not yet available through the Census Bureau’s API. The data for 2020 are interpolated as well, due to low response rates. The Census Bureau did release a set of experimental estimates for the 2020 1-year ACS, but for ease of analysis here we will just interpolate the data.
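As a rough illustration of what interpolating a single year looks like, the sketch below fills a missing 2020 value by linear interpolation between the neighboring years using base R's approx(). The population figures are made up, and the actual phoenix-population.R script may do this differently (the zoo package loaded above offers na.approx() for the same purpose).

```r
# hypothetical yearly populations with 2020 missing (not real Census figures)
pop <- data.frame(
  year       = 2018:2022,
  population = c( 1600000, 1620000, NA, 1660000, 1680000 )
)

# linear interpolation over the observed years, evaluated at every year
observed <- !is.na( pop$population )
pop$population <- approx(
  x    = pop$year[ observed ],
  y    = pop$population[ observed ],
  xout = pop$year
)$y

# the 2020 value now sits on the line between 2019 and 2021
pop$population[ pop$year == 2020 ]
```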
We can also append geographic data to the incidents. This piece reads the crimeDatGeo2016-2024.rds file that is created using the crime-geo.R script. It uses the Census API to pull coordinate data.
# get the file
crime_dat_geo <- readRDS( here( "data/data-geo/crimeDatGeo2016-2024.rds" ) )
# merge the geographic data with the cases
crime_dat <- left_join( crime_dat, crime_dat_geo, by = "INC.NUMBER" )
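Because a left join keeps every row of the left table, incidents without a match in the geographic file simply get missing coordinates. A small sketch with made-up data shows this. The lat and lon column names are assumptions, since the contents of the geographic file are not shown here.

```r
library( dplyr )

# hypothetical incidents and geographic lookup
incidents <- data.frame( INC.NUMBER = c( "A1", "B2", "C3" ),
                         stringsAsFactors = FALSE )
geo <- data.frame( INC.NUMBER = c( "A1", "C3" ),
                   lat = c( 33.45, 33.51 ),     # assumed column name
                   lon = c( -112.07, -112.10 ), # assumed column name
                   stringsAsFactors = FALSE )

joined <- left_join( incidents, geo, by = "INC.NUMBER" )

# every incident is kept; "B2" has no match, so its coordinates are NA
nrow( joined ) == nrow( incidents )
sum( is.na( joined$lat ) )
```

Comparing row counts before and after the join is a quick way to confirm that the lookup has at most one row per incident number.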
Now that we have the counts pre-processed and the population data, we can build a few objects that we can use for analysis.
# daily count of crimes
# group by year and day of year only: day365 is zero-padded, so the groups
# sort chronologically (grouping on the month name would sort alphabetically)
crimes_by_day <-
crime_dat %>%
select( year, day365 ) %>%
filter( !is.na( day365 ) ) %>%
group_by( year, day365 ) %>%
summarize( counts = n() ) %>%
ungroup() %>%
mutate( day_time = seq( 1, length( counts ) ) ) %>%
# rebuild the calendar date from the year and the day of year
mutate( days = as.Date( paste( year, day365 ), format = "%Y %j" ) ) %>%
select( counts, day_time, days ) %>%
arrange( day_time )
crimes_by_day <- as.data.frame( crimes_by_day )
# monthly count of crimes
crimes_by_month <-
crime_dat %>%
select( year, month ) %>%
filter( year != 2015 ) %>%
filter( !is.na( year ) ) %>%
group_by( year, month ) %>%
summarize( counts = n() ) %>%
spread( year, counts ) %>%
arrange( match( month, month.name ) ) %>%
select( !month )
# monthly count of crimes by type
crimes_by_type_by_month <-
crime_dat %>%
select( year, month, crime_type ) %>%
filter( year != 2015 ) %>%
filter( !is.na( year ) ) %>%
group_by( year, month, crime_type ) %>%
summarize( counts = n() ) %>%
arrange( match( month, month.name ) ) %>%
# month is still a grouping variable here, so select( !month ) would silently keep it; ungroup instead
ungroup()
crimes_by_type_by_month$month <-factor( crimes_by_type_by_month$month,levels = month.name )
crimes_by_type_by_month$month <-factor( month.abb[crimes_by_type_by_month$month],levels = month.abb )
# monthly count of violent/nonviolent crimes
crimes_by_vnv_by_month <-
crime_dat %>%
select( year, month, crime_vnv ) %>%
filter( year != 2015 ) %>%
filter( !is.na( year ) ) %>%
group_by( year, month, crime_vnv ) %>%
summarize( counts = n() ) %>%
arrange( match( month, month.name ) ) %>%
# month is still a grouping variable here, so select( !month ) would silently keep it; ungroup instead
ungroup()
crimes_by_vnv_by_month$month <-factor( crimes_by_vnv_by_month$month,levels = month.name )
crimes_by_vnv_by_month$month <-factor( month.abb[crimes_by_vnv_by_month$month],levels = month.abb )
# crimes by year
crimes_by_year <-
crime_dat %>%
select( year ) %>%
filter( year != 2015 ) %>%
filter( !is.na( year ) ) %>%
group_by( year ) %>%
summarize( counts = n() )
# yearly count of crimes by type
crimes_by_type_by_year <-
crime_dat %>%
select( year, crime_type ) %>%
filter( year != 2015 ) %>%
filter( !is.na( year ) ) %>%
group_by( year, crime_type ) %>%
summarize( counts = n() )
# yearly count of violent/nonviolent crimes
crimes_by_vnv_by_year <-
crime_dat %>%
select( year, crime_vnv ) %>%
filter( year != 2015 ) %>%
filter( !is.na( year ) ) %>%
group_by( year, crime_vnv ) %>%
summarize( counts = n() )
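The arrange( match( month, month.name ) ) step in the monthly objects matters because month names stored as text sort alphabetically, not by calendar. A minimal base R sketch with made-up counts shows the idea:

```r
# hypothetical monthly counts; rows start in alphabetical order
toy_months <- data.frame(
  month  = c( "April", "February", "January", "March" ),
  counts = c( 40, 20, 10, 30 ),
  stringsAsFactors = FALSE
)

# match() returns each month's position in month.name (January = 1, ...)
calendar_pos <- match( toy_months$month, month.name )

# reorder the rows by calendar position
toy_months <- toy_months[ order( calendar_pos ), ]
toy_months$month
```

This relies on the month names being in English, as they are in month.name, while format( ..., format = "%B" ) follows the locale of the R session.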
Now we want to create objects that capture the rates.
# crime rate is calculated as the count of
# crimes divided by the population size, then multiplied by 100,000
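# calculate the crime rate by month
# each column of crimes_by_month is a year; this assumes that row i of
# phoenixPopDat holds the population for the i-th year column
crime_rates_month <- as.data.frame( crimes_by_month )
for ( i in seq_len( ncol( crime_rates_month ) ) ){
crime_rates_month[,i] <- ( crime_rates_month[,i] / phoenixPopDat[i,2] ) * 100000
}
# calculate the crime rate by year
# this assumes that row i of phoenixPopDat matches row i of crimes_by_year
crime_rates_year <- as.data.frame( crimes_by_year )
for ( i in seq_len( nrow( crime_rates_year ) ) ){
crime_rates_year[i,2] <- ( crime_rates_year[i,2] / phoenixPopDat[i,2] ) * 100000
}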
# rename the counts column to be rates
crime_rates_year <- crime_rates_year %>% rename( rates = counts )
# calculate rates for types by month
crime_rates_month_type <- as.data.frame( crimes_by_type_by_month )
# assign the population data
crime_rates_month_type <- crime_rates_month_type %>%
left_join( phoenixPopDat, by = "year" )
# create the rates
crime_rates_month_type <- crime_rates_month_type %>%
mutate( rates = ( ( counts / population ) * 100000 ) )
# calculate rates for violent/nonviolent by month
crime_rates_month_vnv <- as.data.frame( crimes_by_vnv_by_month )
# assign the population data
crime_rates_month_vnv <- crime_rates_month_vnv %>%
left_join( phoenixPopDat, by = "year" )
# create the rates
crime_rates_month_vnv <- crime_rates_month_vnv %>%
mutate( rates = ( ( counts / population ) * 100000 ) )
# calculate rates for types by year
crime_rates_year_type <- as.data.frame( crimes_by_type_by_year )
# assign the population data
crime_rates_year_type <- crime_rates_year_type %>%
left_join( phoenixPopDat, by = "year" )
# create the rates
crime_rates_year_type <- crime_rates_year_type %>%
mutate( rates = ( ( counts / population ) * 100000 ) )
# calculate rates for violent/nonviolent by year
crime_rates_year_vnv <- as.data.frame( crimes_by_vnv_by_year )
# assign the population data
crime_rates_year_vnv <- crime_rates_year_vnv %>%
left_join( phoenixPopDat, by = "year" )
# create the rates
crime_rates_year_vnv <- crime_rates_year_vnv %>%
mutate( rates = ( ( counts / population ) * 100000 ) )
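The rate calculation itself is simple arithmetic. The sketch below works through it on made-up numbers, matching counts to populations by year rather than by row position. The year and population column names follow the joins above; the values are hypothetical, and base R's merge() stands in for left_join() to keep the example free of dependencies.

```r
# hypothetical yearly counts and populations
toy_counts <- data.frame( year   = c( "2016", "2017" ),
                          counts = c( 64000, 49500 ),
                          stringsAsFactors = FALSE )
toy_pop <- data.frame( year       = c( "2017", "2016" ),   # deliberately out of order
                       population = c( 1650000, 1600000 ),
                       stringsAsFactors = FALSE )

# match on year, then compute crimes per 100,000 residents
toy_rates <- merge( toy_counts, toy_pop, by = "year" )
toy_rates$rates <- ( toy_rates$counts / toy_rates$population ) * 100000
toy_rates
```

Joining on the year keeps each count paired with the right population even if the two tables are sorted differently or cover different years.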
We can now save the objects as .rds files that can be referenced for analysis on separate pages.
# save the objects as .rds files
saveRDS( crime_dat, file = here( "data/crime_dat.rds" ) )
saveRDS( crimes_by_day, file = here( "data/crimes_by_day.rds" ) )
saveRDS( crimes_by_month, file = here( "data/crimes_by_month.rds" ) )
saveRDS( crimes_by_year, file = here( "data/crimes_by_year.rds" ) )
saveRDS( crime_rates_month, file = here( "data/crime_rates_month.rds" ) )
saveRDS( crime_rates_year, file = here( "data/crime_rates_year.rds" ) )
# crimes by type
saveRDS( crimes_by_type_by_month, file = here( "data/crimes_by_type_by_month.rds" ) )
saveRDS( crimes_by_type_by_year , file = here( "data/crimes_by_type_by_year.rds" ) )
saveRDS( crime_rates_month_type , file = here( "data/crime_rates_month_type.rds" ) )
saveRDS( crime_rates_year_type , file = here( "data/crime_rates_year_type.rds" ) )
saveRDS( crimes_by_vnv_by_month , file = here( "data/crimes_by_vnv_by_month.rds" ) )
saveRDS( crimes_by_vnv_by_year , file = here( "data/crimes_by_vnv_by_year.rds" ) )
saveRDS( crime_rates_month_vnv , file = here( "data/crime_rates_month_vnv.rds" ) )
saveRDS( crime_rates_year_vnv , file = here( "data/crime_rates_year_vnv.rds" ) )
If you look in the data folder of the repository, you will see that these files have been added. Whenever I run this script, I have to push the updated files to the repository. We can then load them elsewhere using the readRDS() function.
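As a minimal sketch of the save-and-reload round trip, the example below writes a small stand-in object to a temporary file and reads it back. The object and its values are hypothetical, and a temporary file is used so that nothing in the data folder is touched.

```r
# a small stand-in object (hypothetical values)
toy_year <- data.frame( year   = c( "2016", "2017" ),
                        counts = c( 10, 12 ),
                        stringsAsFactors = FALSE )

# write to a temporary file rather than the data folder
tmp <- tempfile( fileext = ".rds" )
saveRDS( toy_year, file = tmp )

# read it back; readRDS() returns the object, so it can be given any name
toy_back <- readRDS( tmp )
all.equal( toy_year, toy_back )
```

On an analysis page, the equivalent call would point at one of the paths saved above, for example readRDS( here( "data/crimes_by_year.rds" ) ).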
Now that the data are pre-processed, we can reference these objects when we conduct analyses. Visit the Crime in Phoenix page to see these analyses.
Please report any needed corrections to the Issues page. Thanks!