The Crime Data file from the data portal contains incidents reported to the Phoenix Police Department. The city updates the file at 11am every day, and it contains data beginning in November 2015 and running up to 7 days before the posting date. Before we do anything with it, we need to do some pre-processing: cleaning up variables, creating basic objects, and incorporating population data to calculate rates.



Setup and Loading the Data


Preparing the environment

First we need to clear the workspace and load the libraries we are going to use.


# clear workspace
rm( list = ls() )


# load libraries
library( dplyr )        # used for wrangling the data
library( tidyr )        # used for wrangling the data
library( openxlsx )     # for opening an excel file
library( here )         # for referencing the local directory
library( tidycensus )   # getting data from the census API
library( zoo )          # to help with interpolation


Loading the data

Next we want to load the data and do some cleaning. The data can be read directly from the data portal using the URL below.


# get the data
url <- "https://www.phoenixopendata.com/dataset/cc08aace-9ca9-467f-b6c1-f0879ab1a358/resource/0ce3411a-2fc6-4302-a33f-167f68608a20/download/crime-data_crime-data_crimestat.csv"
crimeData <- read.csv( url, as.is = TRUE, header = TRUE )
crimeData <- na.omit( crimeData )

# remove duplicate ids (every incident whose id appears more than once is dropped, including the first copy)
duplicateIds <- crimeData$INC.NUMBER[ duplicated( crimeData$INC.NUMBER )]
crimeData <- crimeData[ !crimeData$INC.NUMBER %in% duplicateIds, ]
rm( duplicateIds )

# clean up the dates
date.vec <- strptime( crimeData$OCCURRED.ON, format="%m/%d/%Y %H:%M" )
crimeData$year   <- format( date.vec, format="%Y" )
crimeData$month  <- format( date.vec, format="%B" )
crimeData$day365 <- format( date.vec, format="%j" )
crimeData$week   <- format( date.vec, format="%V" )

# now we want to clean up the times
crimeData$hour   <- format(date.vec, format = "%H") %>% as.numeric()

# clean up the variable classifying the cases
crimeData <- 
  crimeData %>% 
  mutate( crime.type = case_when( 
    UCR.CRIME.CATEGORY == "AGGRAVATED ASSAULT" ~ "Assault",
    UCR.CRIME.CATEGORY == "ARSON" ~ "Arson",
    UCR.CRIME.CATEGORY == "BURGLARY" ~ "Burglary",
    UCR.CRIME.CATEGORY == "DRUG OFFENSE" ~ "Drugs",
    UCR.CRIME.CATEGORY == "LARCENY-THEFT" ~ "Theft",
    UCR.CRIME.CATEGORY == "MURDER AND NON-NEGLIGENT MANSLAUGHTER" ~ "Homicide",
    UCR.CRIME.CATEGORY == "MOTOR VEHICLE THEFT" ~ "MV Theft",
    UCR.CRIME.CATEGORY == "RAPE" ~ "Rape",
    UCR.CRIME.CATEGORY == "ROBBERY" ~ "Robbery" ) )

# drop cases from 2015 (the file only covers the last part of that year, so 2015 is incomplete)
crimeData <- 
  crimeData %>% 
  filter( year != 2015 )

# drop cases for the most recent month
crimeData <- crimeData[ ! ( 
  crimeData$month == format( Sys.Date(), format="%B" ) &
    crimeData$year == format( Sys.Date(), format="%Y" ) 
) , ]
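
The date handling above can be checked on a few toy values. This is a minimal sketch, assuming the OCCURRED.ON strings follow the "%m/%d/%Y %H:%M" layout used in the script; the timestamps are made up, and the time zone is fixed to UTC only so the example behaves the same on any machine.

```r
# made-up timestamps in the layout assumed for OCCURRED.ON
occurred <- c( "11/01/2015 00:00", "02/29/2016 13:45", "not a date" )

# parse the strings; anything that does not match the format becomes NA
parsed <- strptime( occurred, format = "%m/%d/%Y %H:%M", tz = "UTC" )

# pull out the same pieces the script creates
year   <- format( parsed, format = "%Y" )                # four-digit year (character)
day365 <- format( parsed, format = "%j" )                # day of the year, 001-366 (character)
hour   <- as.numeric( format( parsed, format = "%H" ) )  # hour of the day, 0-23

data.frame( occurred, year, day365, hour )
```

Note that year and day365 come back as character strings, and that an unparseable value yields NA in every derived column, which is why later steps filter on !is.na().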


Adjusting for Population

Right now, the crimeData object is a list of incidents. When we go to aggregate over months or years, we are going to want to adjust based on population differences. In other words, we will want to calculate the rates (not just examine raw counts).

The rates are computed using population data from the Census Bureau. The command below uses the Census Bureau API to pull the data. To see more about how the API is used, see the phoenix-population.R file.


import::here( "phoenixPopDat",
              .from = here::here( "utils/phoenix-population.R" ),
              .character_only = TRUE )


We now have an object, phoenixPopDat, that holds the yearly population for Phoenix. Note that the populations for 2022-2024 are only estimates, not figures reported by the Census Bureau.


Appending the Geo Data

We can also append geographic data to the incidents. This step reads the crimeDatGeo2016-2024.rds file, which is created by the crime-geo.R script. That script uses the Census API to pull coordinate data.

# get the file
crimeDatGeo <- readRDS( here( "data/data-geo/crimeDatGeo2016-2024.rds" ) )

# merge the geographic data with the cases
crimeData <- left_join( crimeData, crimeDatGeo, by = "INC.NUMBER" )
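
A left join keeps every incident, so any case without a match in the geographic file ends up with missing coordinates. The sketch below illustrates this on toy data. The incident ids, the coordinate values, and the lon and lat column names are all hypothetical; the real columns in crimeDatGeo may differ.

```r
library( dplyr )

# hypothetical incidents and coordinates (made-up values)
incidents <- data.frame( INC.NUMBER = c( "A1", "A2", "A3" ),
                         crime.type = c( "Theft", "Burglary", "Arson" ),
                         stringsAsFactors = FALSE )
coords    <- data.frame( INC.NUMBER = c( "A1", "A3" ),
                         lon = c( -112.07, -112.10 ),
                         lat = c( 33.45, 33.51 ),
                         stringsAsFactors = FALSE )

# every incident is kept; unmatched incidents get NA coordinates
joined <- left_join( incidents, coords, by = "INC.NUMBER" )

# ids of incidents with no geographic match
unmatched <- joined$INC.NUMBER[ is.na( joined$lon ) ]
unmatched
```

Counting the unmatched ids after the merge is a quick way to see how many incidents lack coordinates.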


Creating Objects


Counts

Now that we have the incidents pre-processed and the population data loaded, we can build a few objects that we can use for analysis.


# daily count of crimes
# (group on the calendar date itself so the counts stay in chronological order;
#  grouping on the month name would sort the months alphabetically)
crimesByDay <- 
  crimeData %>% 
  mutate( days = as.Date( strptime( OCCURRED.ON, format="%m/%d/%Y %H:%M" ) ) ) %>% 
  filter( !is.na( days ) ) %>% 
  group_by( days ) %>% 
  summarize( counts = n() ) %>% 
  arrange( days ) %>% 
  mutate( day.time = seq_along( counts ) ) %>% 
  select( counts, day.time, days )
crimesByDay <- as.data.frame( crimesByDay )

# monthly count of crimes
crimesByMonth <- 
  crimeData %>% 
  select( year, month ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, month ) %>% 
  summarize( counts = n() ) %>% 
  spread( year, counts ) %>% 
  arrange( match( month, month.name ) ) %>% 
  select( !month )

# monthly count of crimes by type
crimesByTypeByMonth <-
  crimeData %>% 
  select( year, month, crime.type ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, month, crime.type ) %>% 
  summarize( counts = n() ) %>% 
  arrange( match( month, month.name ) )

# recode month as a factor of abbreviated names in calendar order (Jan, Feb, ...)
crimesByTypeByMonth$month <- factor( crimesByTypeByMonth$month, levels = month.name )
crimesByTypeByMonth$month <- factor( month.abb[ crimesByTypeByMonth$month ], levels = month.abb )

# crimes by year
crimesByYear <- 
  crimeData %>% 
  select( year ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year ) %>% 
  summarize( counts = n() ) 

# yearly count of crimes by type
crimesByTypeByYear <-
  crimeData %>% 
  select( year, crime.type ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, crime.type ) %>% 
  summarize( counts = n() )
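
To make the shape of the monthly object concrete, here is a small sketch of the count-then-widen pattern on toy data. The rows are made up, and it uses count() and pivot_wider() (the current tidyr replacement for spread()) as shorthand, so it is an illustration of the idea rather than the exact code above.

```r
library( dplyr )
library( tidyr )

# made-up incidents: one row per case
toy <- data.frame( year  = c( "2016", "2016", "2016", "2017", "2017" ),
                   month = c( "January", "January", "February", "January", "February" ),
                   stringsAsFactors = FALSE )

wide <- toy %>% 
  count( year, month, name = "counts" ) %>%                  # one row per year-month
  pivot_wider( names_from = year, values_from = counts ) %>% # one column per year
  arrange( match( month, month.name ) )                      # calendar order, not alphabetical

wide
```

Each row is a month and each year gets its own column, which is the layout the monthly rate calculation relies on once the month column is dropped.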


Rates

Now we want to create objects that capture the rates.


# crime rate is calculated as the count of 
# crimes divided by the population size, then multiplied by 100,000

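# calculate the crime rate by month
# (each column of crimesByMonth is a year, so look up that year's population by name)
crimeRatesMonth <- as.data.frame( crimesByMonth )

for ( i in seq_len( ncol( crimeRatesMonth ) ) ){
  pop.i <- phoenixPopDat$population[ match( names( crimeRatesMonth )[i], phoenixPopDat$year ) ]
  crimeRatesMonth[,i] <- ( crimeRatesMonth[,i] / pop.i ) * 100000
}

# calculate the crime rate by year
# (match on the year value rather than on row position)
crimeRatesYear <- as.data.frame( crimesByYear )

for ( i in seq_len( nrow( crimeRatesYear ) ) ){
  pop.i <- phoenixPopDat$population[ match( crimeRatesYear$year[i], phoenixPopDat$year ) ]
  crimeRatesYear[i,2] <- ( crimeRatesYear[i,2] / pop.i ) * 100000
}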

# rename the counts column to be rates
crimeRatesYear <- crimeRatesYear %>%  rename( rates = counts )


# calculate rates for types by month
crimeRatesMonthType <- as.data.frame( crimesByTypeByMonth )

# assign the population data
crimeRatesMonthType <- crimeRatesMonthType %>%
  left_join( phoenixPopDat, by = "year" )

# create the rates
crimeRatesMonthType <- crimeRatesMonthType %>% 
  mutate( rates = ( ( counts / population ) * 100000 ) )


# calculate rates for types by year
crimeRatesYearType <- as.data.frame( crimesByTypeByYear )

# assign the population data
crimeRatesYearType <- crimeRatesYearType %>%
  left_join( phoenixPopDat, by = "year" )

# create the rates
crimeRatesYearType <- crimeRatesYearType %>% 
  mutate( rates = ( ( counts / population ) * 100000 ) )
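
As a worked example of the rate formula, the sketch below applies it to toy numbers. The counts and populations are made up for illustration and are not Phoenix figures; base merge() stands in for left_join() so the example needs no packages.

```r
# made-up yearly counts and populations
toyCounts <- data.frame( year = c( "2016", "2017" ),
                         counts = c( 1500, 3300 ),
                         stringsAsFactors = FALSE )
toyPop    <- data.frame( year = c( "2016", "2017" ),
                         population = c( 1500000, 1650000 ),
                         stringsAsFactors = FALSE )

# attach the population to each year, then apply the formula
toyRates <- merge( toyCounts, toyPop, by = "year" )
toyRates$rates <- ( toyRates$counts / toyRates$population ) * 100000

toyRates
```

Here 1,500 crimes among 1,500,000 residents works out to 100 crimes per 100,000 residents, and 3,300 crimes among 1,650,000 residents works out to 200 per 100,000.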


Saving the File

We can now save the objects as files that can be referenced for analysis on separate pages.


# save the objects as .rds files

saveRDS( crimeData,       file = here( "data/crimeData.rds" ) )
saveRDS( crimesByDay,     file = here( "data/crimesByDay.rds" ) )
saveRDS( crimesByMonth,   file = here( "data/crimesByMonth.rds" ) )
saveRDS( crimesByYear,    file = here( "data/crimesByYear.rds" ) )
saveRDS( crimeRatesMonth, file = here( "data/crimeRatesMonth.rds" ) )
saveRDS( crimeRatesYear,  file = here( "data/crimeRatesYear.rds" ) )

# crimes by type
saveRDS( crimesByTypeByMonth, file = here( "data/crimesByTypeByMonth.rds" ) )
saveRDS( crimesByTypeByYear , file = here( "data/crimesByTypeByYear.rds" ) )
saveRDS( crimeRatesMonthType, file = here( "data/crimeRatesMonthType.rds" ) )
saveRDS( crimeRatesYearType , file = here( "data/crimeRatesMonthYear.rds" ) )

If you look in the data folder of the repository, you will see that these files have been added. Each time I run this script, I have to push the updated files to the repository. We can then load them on other pages using the readRDS() function.
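
For readers new to .rds files, this is a minimal sketch of the save-and-reload round trip. It writes a toy object to a temporary file rather than to the repository's data folder, so the path here is only a stand-in.

```r
# a small made-up object to save
toy <- data.frame( year = c( "2016", "2017" ),
                   counts = c( 10L, 12L ),
                   stringsAsFactors = FALSE )

# write to a temporary .rds file (a stand-in for a file in the data folder)
path <- tempfile( fileext = ".rds" )
saveRDS( toy, file = path )

# read it back; the object keeps its column names and types
restored <- readRDS( path )
identical( toy, restored )
```

Because readRDS() returns the object instead of loading it under a fixed name, an analysis page can assign it to whatever name it likes.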


Next steps…

Now that the data are pre-processed, we can reference these objects when we conduct analyses. Visit the Crime in Phoenix page to see these analyses.





Please report any needed corrections to the Issues page. Thanks!



Last updated 19 December, 2024