What do you think? Is homicide on the rise? Is homicide declining? Having a hard time answering that question (and being confident about your answer)?

As I discuss in a different post, crime trends are very difficult to assess without seeing the data, both because of the peculiar nature of crime and because of the long delay in crime reporting (a result of the way the data are collected). Judging whether homicide is increasing or decreasing is particularly hard because the event is relatively infrequent yet tends to receive disproportionate attention in the media.

In a different post I discussed the trends in incidents reported to the police across all crimes. Here, let’s just focus on homicide.


Note: as with other posts, just click the tab to see the code



Daily Counts of Homicide from 2015-2024



Let’s pull the most recent crime incident data from the Phoenix open data portal.

The data are reported as UCR crime classifications and have geographic information (block address, zip) as well as the date and time of the incident. As of February, 2024, there were 526,426 crime incidents with complete data from 11/2015 to within a week of the current date.

We will select just homicides. In the data, these appear as MURDER AND NON-NEGLIGENT MANSLAUGHTER, which we recode as Homicide. As of February, 2024, there were 1,351 homicides reported from 11/2015 to within a week of the current date.

Let’s start by looking at the daily count.



This plot is a bit difficult to read given the low frequency of events each day. Note that days on which zero homicides were reported are excluded from the plot.

We can see that there was a substantial increase in homicide beginning in March of 2020. This was followed by a decline into 2021 and then a small increase later in the year.

A major take-away from the plot, though, is that the homicide level appears to have remained elevated since mid-2020. The peak seen in October of 2020 is similar to the peak in October of 2017.

For a more in-depth examination of crime in 2020 and how it differed from 2019, check out this analysis.



Getting the data (code)


# set the url where the data are located.
url <- "https://www.phoenixopendata.com/dataset/cc08aace-9ca9-467f-b6c1-f0879ab1a358/resource/0ce3411a-2fc6-4302-a33f-167f68608a20/download/crimestat.csv"

# pull in the csv file.
crime.data <- read.csv( url, as.is = TRUE, header = TRUE )

# drop incomplete cases (na.omit() removes every row that contains an NA, including rows missing the date).
crime.data <- na.omit( crime.data )

# take a look at the data.
head( crime.data )  

Preprocessing the data (code)


Now that the data are in the workspace, let’s clean up the date and the crime categories to make plotting them fairly easy. To do so, I am drawing from a lab from ASU’s Foundations of Data Science Part I course in the Program Evaluation and Data Analytics. See the “Working with Dates” section of the site. We will use the strptime() and format() functions here.


# The date and time variable is a character string.
head( crime.data$OCCURRED.ON )
## [1] "11/01/2015  00:00" "11/01/2015  00:00" "11/01/2015  00:00" "11/01/2015  00:00" "11/01/2015  00:00" "11/01/2015  00:00"
is.character( crime.data$OCCURRED.ON )
## [1] TRUE
# Convert the character strings to a date-time object.
date.vec <- strptime( crime.data$OCCURRED.ON, format="%m/%d/%Y %H:%M" )
head( date.vec )
## [1] "2015-11-01 MST" "2015-11-01 MST" "2015-11-01 MST" "2015-11-01 MST" "2015-11-01 MST" "2015-11-01 MST"
tail( date.vec )
## [1] "2024-01-31 23:00:00 MST" "2024-01-31 23:00:00 MST" "2024-01-31 23:27:00 MST" "2024-01-31 23:30:00 MST" "2024-01-31 23:30:00 MST"
## [6] "2024-01-31 23:57:00 MST"
# Now, let's use the format() function to create several objects based on the date and time.
crime.data$year   <- format( date.vec, format="%Y" )
crime.data$month  <- format( date.vec, format="%B" )
crime.data$day365 <- format( date.vec, format="%j" )
crime.data$week   <- format( date.vec, format="%V" )

# Clean up the variable classifying the cases.
# dplyr provides the pipe (%>%), mutate(), and case_when() used here.
library( dplyr )

crime.data <- 
  crime.data %>% 
  mutate( crime.type = case_when( 
    UCR.CRIME.CATEGORY == "AGGRAVATED ASSAULT" ~ "Assault",
    UCR.CRIME.CATEGORY == "ARSON" ~ "Arson",
    UCR.CRIME.CATEGORY == "BURGLARY" ~ "Burglary",
    UCR.CRIME.CATEGORY == "DRUG OFFENSE" ~ "Drugs",
    UCR.CRIME.CATEGORY == "LARCENY-THEFT" ~ "Theft",
    UCR.CRIME.CATEGORY == "MURDER AND NON-NEGLIGENT MANSLAUGHTER" ~ "Homicide",
    UCR.CRIME.CATEGORY == "MOTOR VEHICLE THEFT" ~ "MV Theft",
    UCR.CRIME.CATEGORY == "RAPE" ~ "Rape",
    UCR.CRIME.CATEGORY == "ROBBERY" ~ "Robbery" ) )

# Drop cases for the most recent (incomplete) month since the low counts will throw off the scale.
crime.data <- crime.data[ ! ( 
  crime.data$month == format( Sys.Date(), format="%B" ) &
    crime.data$year == format( Sys.Date(), format="%Y" ) 
) , ]

Plotting the data (code)


# Now, let's use dplyr and tidyr to get the data in a format where we can look at the time series.
library( dplyr )
library( tidyr )

# Use dplyr to keep only the homicide incidents.
homicide.data <-
  crime.data %>% 
  filter( crime.type == "Homicide"  )

# Create the object of homicides by day.
# Group by year and day of the year only: grouping by the month name as well
# would sort the months alphabetically and scramble the time order.
homicides.by.day <- 
  homicide.data %>% 
  select( year, day365 ) %>%   
  filter( !is.na( day365 ) ) %>% 
  group_by( year, day365 ) %>% 
  summarize( counts = n() ) %>% 
  ungroup() %>% 
  # Rebuild the actual calendar date from the year and the day of the year.
  mutate( days = as.Date( paste( year, day365 ), format = "%Y %j" ) ) %>% 
  arrange( days ) %>% 
  mutate( day.time = seq( 1, length( counts ) ) ) %>% 
  select( counts, day.time, days )

# Finally, let's take a look using ggplot2.
library( ggplot2 )
library( scales )  # provides comma() for the axis labels

# Convert the tibble to a plain data frame for plotting.
homicides.by.day <- as.data.frame( homicides.by.day )

# Now let's plot it!
homicides.by.day %>% 
  ggplot( aes( days, counts ) ) +
  geom_line( color = "grey80" ) +
  geom_point( alpha = 1/5, color = "black" ) +
  labs( x = "", y = "Counts of Homicide per Day" ) + 
  ggtitle( "Daily Homicides in Phoenix, AZ" ) +
  geom_smooth( color = "darkblue", span = 0.2 ) +
  scale_y_continuous(label = comma) +
  theme_minimal() 
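
As noted earlier, the daily plot leaves out days with zero homicides. The following is a minimal sketch, using made-up dates rather than the portal data, of how zero-count days could be filled in if you wanted them on the plot. The object names (incident.dates, all.days, zero.filled) are hypothetical and are not part of the code above.

```r
# Hypothetical incident dates (one element per homicide); not from the portal.
incident.dates <- as.Date( c( "2020-03-01", "2020-03-01", "2020-03-04" ) )

# Every calendar day between the first and the last incident.
all.days <- seq( min( incident.dates ), max( incident.dates ), by = "day" )

# Tabulate against the full set of days so that empty days get a zero.
daily.counts <- table(
  factor( as.character( incident.dates ), levels = as.character( all.days ) )
)

# Combine the dates and counts into a data frame that could be passed to ggplot().
zero.filled <- data.frame(
  days   = all.days,
  counts = as.integer( daily.counts )
)

zero.filled
```

Including the zeros would pull a smoothed trend line downward, since most days have no homicides, so whether to fill them in depends on what you want the plot to emphasize.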



What about Homicide by year and month?


Seasonality


The daily count view is useful, but it might help if we make two changes:

  • First, we should break it down by year and month. This adjustment accounts for the seasonality of crime. One way to show this seasonal variation in homicide is to plot the monthly incidents for each year. Note that for 2024, the line stops at January, as that is the last month of complete data from the portal.

  • Second, we should convert the counts to rates. This adjusts for changes in the population of Phoenix from 2016 to 2024. The homicide rate is calculated as the count of homicides divided by the population size, then multiplied by 100,000. This tells us how many homicides occur per 100,000 people in the population.
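
To make the rate calculation concrete, here is a small worked example. The population figure is the 2019 Census estimate for Phoenix that appears in the population table later in this post; the monthly count of 20 is a made-up number used only for illustration.

```r
# 2019 Census population estimate for Phoenix (from the table used later in this post).
population.2019 <- 1680992

# Hypothetical monthly count of homicides, for illustration only.
homicide.count <- 20

# Rate per 100,000 residents: count divided by population, times 100,000.
homicide.rate <- ( homicide.count / population.2019 ) * 100000

round( homicide.rate, 2 )
```

With these inputs the rate works out to roughly 1.19 homicides per 100,000 residents for that month.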



The plot showing the monthly rates by year helps us visualize the data better. There are a few important patterns we can take from the plot:

  • First, the trend for 2020 is unusual relative to other years, particularly for the period of March-October. For a more elaborate discussion of this trend, see this analysis.

  • Second, relative to past years, homicides in 2022 appear to be more similar to pre-2020 trends. For example, comparing 2024 to 2019 shows a similar pattern of homicide incidents.


As a final point, we can restructure the visualization to show the monthly counts disaggregated by year. These are counts, not rates (unlike the plot above), but they reflect the same data. The visualization is different in that the years are not superimposed. Also, note that the scale of the y-axis changes across the years.

The plot shows the 2-month moving average to help illustrate the trends. This plot reinforces what we see in the plot above: there is an apparent upward swing in homicide. This can also be seen in the values on the y-axis, where the maximum value tends to be higher in the more recent years, beginning in 2020.
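
For readers unfamiliar with moving averages, here is a minimal sketch of one way to compute a 2-month moving average in base R. This is an assumption about the general technique, not necessarily the code that produced the plot: it uses a trailing window (the current and the previous month) and made-up counts.

```r
# Hypothetical monthly counts, for illustration only.
monthly.counts <- c( 10, 12, 8, 14 )

# Trailing 2-month moving average: each value is the mean of the current
# and the previous month. The first month has no predecessor, so it is NA.
moving.avg <- stats::filter( monthly.counts, filter = rep( 1/2, 2 ), sides = 1 )

as.numeric( moving.avg )
```

Calling stats::filter() with the package prefix avoids the name clash with dplyr's filter(), which is loaded elsewhere in this post.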



Reworking the data to monthly incidents (code)


This takes some reworking of the data.

  • First, rather than collapsing by day, we want to record counts by month.

  • Second, to create the rates, we need to adjust by the population for each year.

  • Third, we need to create a ts() object. That is, we need to create a time series object using the ts() function. We will also use the ggseasonplot() from the forecast package.
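
Before working with the real data, it may help to see how a month-by-year table turns into a ts() object. The sketch below uses a made-up matrix (toy.rates) with 12 rows for months and 2 columns for years; the names and values are hypothetical.

```r
# Hypothetical monthly rates: 12 rows (months) by 2 columns (years).
toy.rates <- matrix(
  1:24 / 10, nrow = 12, ncol = 2,
  dimnames = list( month.name, c( "2016", "2017" ) )
)

# as.vector() reads the matrix column by column, so the values come out
# as January-December 2016 followed by January-December 2017.
toy.ts <- ts( as.vector( toy.rates ), start = c( 2016, 1 ), frequency = 12 )

# Pull a single month back out to confirm the alignment.
jan.2017 <- window( toy.ts, start = c( 2017, 1 ), end = c( 2017, 1 ) )
as.numeric( jan.2017 )
```

The key point is that the rows must already be in calendar order (January through December) before flattening; otherwise the values end up attached to the wrong months.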


# Back to dplyr! Let's create an object that is monthly counts and sorted by year. 
homicides.by.month <- 
  homicide.data %>% 
  select( year, month ) %>%   
  filter( year != 2015 ) %>%  
  filter( !is.na( year ) ) %>% 
  group_by( year, month ) %>% 
  summarize( counts = n() ) %>% 
  spread( year, counts ) %>% 
  arrange( match( month, month.name ) ) %>% 
  select( !month )

Creating rates (code)

Let’s pull population data from the Census Bureau. Specifically, the data for Arizona. This is an Excel file with estimates of population for incorporated places. Since this is an .xlsx file, we will use the openxlsx package.

Let’s pull it in and get the data for Phoenix.

library( openxlsx )

# get the data.
pop.data <- read.xlsx(
  "https://www2.census.gov/programs-surveys/popest/tables/2010-2019/cities/totals/SUB-IP-EST2019-ANNRES-04.xlsx",
  colNames = TRUE,
  startRow = 4
)

# Find the row with the data for Phoenix.
grep("Phoenix", pop.data[,1])
## [1] 55
# It is the 55th row in the object. So, we need to pull that row.
phoenix.pop <-  pop.data[55,]
phoenix.pop
##                       X1  Census Estimates.Base    2010    2011    2012    2013    2014    2015    2016    2017    2018    2019
## 55 Phoenix city, Arizona 1445632        1446691 1449038 1469796 1499274 1526491 1555445 1583690 1612199 1633560 1654675 1680992
# Now, we only need the data for 2016-2019.
phoenix.pop <- phoenix.pop[-c(1:9)]
phoenix.pop
##       2016    2017    2018    2019
## 55 1612199 1633560 1654675 1680992

Ok! We have our population data. But what about the years after 2019? This file stops at 2019, so we need to fill those years in. We could do this in various ways, but for ease, let’s just add the growth from 2018 to 2019 for each additional year.
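
As a quick check on the arithmetic, here is a vectorized sketch of the same constant-growth idea using the 2018 and 2019 estimates from the table above. The names yearly.growth, projected.years, and projected.pop are hypothetical, and the straight-line growth is an assumption made for convenience, not an official estimate.

```r
# Census estimates for Phoenix, copied from the table above.
pop.2018 <- 1654675
pop.2019 <- 1680992

# Assumption: the population keeps growing by the 2018-to-2019 increment.
yearly.growth <- pop.2019 - pop.2018

# Projected values for 2020 through 2024 (five years past the last estimate).
projected.years <- 2020:2024
projected.pop   <- pop.2019 + yearly.growth * seq_along( projected.years )
names( projected.pop ) <- projected.years

projected.pop
```

Under this assumption the population grows by 26,317 per year, reaching 1,707,309 in 2020 and 1,812,577 in 2024.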

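# Just add the 2018-2019 difference for each year until you get the actual demographic data.
# The homicide data run through 2024, so we need a population value for every year up to 2024.
phoenix.pop.2020 <- phoenix.pop[4] + phoenix.pop[4] - phoenix.pop[3]
phoenix.pop.2021 <- phoenix.pop.2020 + phoenix.pop[4] - phoenix.pop[3]
phoenix.pop.2022 <- phoenix.pop.2021 + phoenix.pop[4] - phoenix.pop[3]
phoenix.pop.2023 <- phoenix.pop.2022 + phoenix.pop[4] - phoenix.pop[3]
phoenix.pop.2024 <- phoenix.pop.2023 + phoenix.pop[4] - phoenix.pop[3]
phoenix.pop.data <- as.numeric( c( phoenix.pop, phoenix.pop.2020, phoenix.pop.2021, phoenix.pop.2022, phoenix.pop.2023, phoenix.pop.2024 ) )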

# now, calculate the homicide rate. Crime rate is calculated as the count of homicides divided by the population size, then multiplied by 100,000.

homicide.rates <- as.data.frame( homicides.by.month )

for ( i in 1: dim( homicide.rates )[2] ){
  homicide.rates[,i] <- ( homicide.rates[,i] / phoenix.pop.data[i] ) * 100000
}

# Now, let's use the ts() function to create a time series object.
library( forecast )

monthly.homicides.by.year <- ts(
  matrix( as.matrix( homicide.rates ), ncol = 1 ), 
  start=c( 2016, 1 ), 
  end=c( as.numeric( tail( names( homicide.rates ), n=1 ) ), 12 ), frequency=12
)

Plotting the data (code)


# Let's take a look using ggseasonplot().
library( ggplot2 )
library( forecast )
library( scales )  # provides comma() for the axis labels

monthly.homicides.by.year %>% 
  ggseasonplot(
    year.labels=FALSE,
    main = "Plot of Monthly Homicide Rate by Years for Phoenix",
    col = rainbow( dim( homicide.rates )[2] ) ) + 
  scale_y_continuous( label = comma ) +
  geom_line( size = 1.2 ) +
  theme_gray() 





Please report any needed corrections to the Issues page. Thanks!



Last updated 14 February, 2024