The analysis of the crime data in Austria for the year 2021, focusing on NUTS 3 regions, with the data sourced from Eurostat.

Data Preprocessing

We will focus on the Austrian regions according to the NUTS 3 administrative division.

# Load necessary libraries
library(eurostat)
library(ggplot2)
library(psych)
id = 'crim_gen_reg'
crim_data = get_eurostat(id=id)

# Filter for data from the year 2021
data_2021 = subset(crim_data, format(TIME_PERIOD, '%Y') == '2021')

# Filter for Austrian NUTS 3 regions
at_data = data_2021[grepl('^AT[0-9]{3}$', data_2021$geo), ]
df = subset(at_data, select = c(unit, iccs, geo, values))
df = label_eurostat(df)

# The subcategories 'Burglary of private residential premises' and 
# 'Theft of a motorized land vehicle' are already included in 'Burglary' and 'Theft'.
# We will exclude them to avoid duplication.
df = subset(df, !(iccs %in% c('Burglary of private residential premises',
                              'Theft of a motorized land vehicle')))

# Separate data into absolute numbers and per 100k inhabitants
nr_df = subset(df, df$unit == 'Number', select = c(iccs, geo, values))
pht_df = subset(df, df$unit == 'Per hundred thousand inhabitants', select = c(iccs, geo, values))

# Aggregate data by crime category (iccs) and region (geo)
nr_iccs_df = aggregate(list(values = nr_df$values), list(iccs = nr_df$iccs), sum)
nr_geo_df = aggregate(list(values = nr_df$values), list(geo = nr_df$geo), sum)

pht_iccs_df = aggregate(list(values = pht_df$values), list(iccs = pht_df$iccs), mean)
pht_geo_df = aggregate(list(values = pht_df$values), list(geo = pht_df$geo), mean)

Initial Data Exploration

We will analyze the number of criminal offenses in the regions of Austria according to the NUTS 3 administrative division for the year 2021.

At the NUTS 3 level, the territorial units of Austria are divided into so-called groups of political districts (Gruppen von Politischen Bezirken). The territorial division is shown on the following map.

Categories of Criminal Offenses

From 2008 onwards, the statistics include police-recorded offences for homicide, assault, sexual violence, robbery, burglary, (of which) burglary of residential premises, theft, (of which) theft of motorized land vehicle. [src]

rows = nr_iccs_df[order(-nr_iccs_df$values),]
row.names(rows) = 1:5
rows

theme_set(theme_gray(base_size = 14))
options(repr.plot.width=14, repr.plot.height=6)
ggplot(nr_iccs_df, aes(x=reorder(iccs, values), y=values)) + ggtitle('Categories of criminal offences') +
    geom_bar(stat='identity') + ylab('count') +
    geom_text(aes(label=values), hjust=-0.3) +
    coord_flip() + theme(axis.title.y = element_blank()) +
    expand_limits(y = c(0, 75000))
ggplot(pht_df, aes(x=reorder(iccs, values), y=values, fill=iccs)) +
    geom_boxplot(outlier.color='red', show.legend=F) + ylab('relative count [per 10^5 inh.]') +
    coord_flip() + theme(axis.title.y = element_blank()) +
    geom_jitter(color='black', size=0.1, alpha=0.8, show.legend=F)
A data.frame: 5 x 2
iccsvalues
<chr><dbl>
1Theft73213
2Burglary40385
3Assault34287
4Robbery2118
5Intentional homicide59

The table and the cumulative frequency chart show that in 2021, the most common category of crime was theft (a total of 73,213 cases), and the least common was intentional homicide (a total of 59 cases). The second graph (a boxplot of relative frequencies per 100,000 inhabitants) also shows that for more frequent crimes, the relative values vary significantly depending on the region. Specifically for theft, we observe two outlier values, which in this case are Wien and Linz-Wels.

Regions by NUTS 3 Division

rows = merge(x = nr_geo_df, y = pht_geo_df, by = 'geo')
rows = rows[order(-rows$values.x),]
colnames(rows) <- c('geo', 'count', 'relative.count')
row.names(rows) = 1:35

cat('Top 5')
head(rows, 5)

cat('Bottom 5')
tail(rows, 5)

summary(nr_geo_df)
r = describe(nr_geo_df$values, skew=F, IQR=T, ranges=F)
rownames(r) = c('values')
r$var = c(var(nr_geo_df$values))
r[,c(2,4,7,6)]

ggplot(rows, aes(x=count/relative.count, y=count)) + geom_point() + ggtitle('Inhabitants v. Crime') +
    xlab('inhabitants [10^5]') + ylab('crime count')
ggplot(rows[rows$geo != 'Wien', ], aes(x=count/relative.count, y=count)) + geom_point() + ggtitle('Inhabitants v. Crime (w/o Wien)') +
    xlab('inhabitants [10^5]') + ylab('crime count')
Top 5
A data.frame: 5 x 3
geocountrelative.count
<chr><dbl><dbl>
1Wien63198657.988
2Linz-Wels11245376.068
3Graz8469377.248
4Salzburg und Umgebung7600409.668
5Innsbruck5587357.274
Bottom 5
A data.frame: 5 x 3
geocountrelative.count
<chr><dbl><dbl>
31Liezen573143.986
32Osttirol516211.416
33Außerfern199120.410
34Mittelburgenland16487.576
35Lungau112111.344
          geo            values       
 Length:35          Min.   :  112  
 Class :character   1st Qu.: 1034  
 Mode  :character   Median : 1970  
                    Mean   : 4287  
                    3rd Qu.: 3352  
                    Max.   :63198  
A psych: 1 x 4
nsdvarIQR
<dbl><dbl><dbl><dbl>
values3510540.041110924392318.5

The table shows that the areas with the highest crime counts are Wien (63,198), Linz-Wels (11,245), and Graz (8,469), which contain Austria’s largest cities. The fewest crimes are recorded in smaller regions with smaller populations: Mittelburgenland (164) and Lungau (112). The graphs also show that regions with larger populations have a higher frequency of recorded crimes.

rows = merge(x = nr_geo_df, y = pht_geo_df, by = 'geo')
rows = rows[order(-rows$values.y),]
colnames(rows) <- c('geo', 'count', 'relative.count')
row.names(rows) = 1:35
head(rows, 5)

options(repr.plot.height=6)
ggplot(pht_geo_df, aes(x=values)) +
    geom_histogram(bins=8) + xlab('Relative crime count') + ylab('Frequency') +
    geom_rug(aes(values, y = NULL), length = unit(0.02, "npc")) +
    geom_boxplot(outlier.color='red', show.legend=F, position = position_nudge(y = -0.2))
A data.frame: 5 x 3
geocountrelative.count
<chr><dbl><dbl>
1Wien63198657.988
2Bludenz-Bregenzer Wald2822609.134
3Salzburg und Umgebung7600409.668
4Graz8469377.248
5Linz-Wels11245376.068

The histogram of relative crime frequencies shows the approximate distribution of observed values across all categories and regions. Interesting data points are the regions Bludenz-Bregenzer Wald and Salzburg und Umgebung, which, despite having a relatively low absolute number of crimes, have a high relative frequency, ranking second and third after Vienna.

Analyzing the Relationship Between Region and Crime Type

ct = xtabs(formula=values ~ geo + iccs, data=nr_df)
addmargins(ct)
A table: 36 x 6 of type dbl
AssaultBurglaryIntentional homicideRobberyTheftSum
Außerfern521812126199
Bludenz-Bregenzer Wald79756912114342822
Graz1807200958945598469
Innsbruck153392136630645587
Innviertel55347801310422086
Klagenfurt-Villach113894824323334464
Liezen16212201288573
Linz-Wels237330504239557911245
Lungau25180069112
Mittelburgenland39560168164
Mostviertel-Eisenwurzen43959211711342183
Mühlviertel292297145651159
Niederösterreich-Süd971122114321084344
Nordburgenland303381287241418
Oberkärnten22713722431799
Östliche Obersteiermark5283910119131843
Oststeiermark43444511510161911
Osttirol10313602275516
Pinzgau-Pongau4384030147531608
Rheintal-Bodenseegebiet114580813218973883
Salzburg und Umgebung2027187839535977600
Sankt Pölten46671133713562573
Steyr-Kirchdorf336444288101600
Südburgenland13211812324577
Tiroler Oberland264127259457909
Tiroler Unterland83137322114912718
Traunviertel44351611610302006
Unterkärnten315241066061168
Waldviertel4495220109891970
Weinviertel40067331710842177
West- und Südsteiermark385314196401349
Westliche Obersteiermark17315404405736
Wien13669196851411702866063198
Wiener Umland/Nordteil39958311211982193
Wiener Umland/Südteil639104612921883903
Sum342874038559211873213150062

The values in the contingency table are consistent with previous observations. High values are registered in populated regions and for common crimes. At first glance, the distribution of crime categories appears to be similar across all regions. We will now test this hypothesis.

Fisher’s Exact Test

We will use Fisher’s Exact Test to verify whether the probabilities of crime categories depend on the region, at a 5% significance level. This test can handle zero values in some cells and is suitable for larger contingency tables.

\(H_0\): Each row (region) is a realization of the same distribution (crime category), i.e., \(p_{ij} = p_{i:} \cdot p_{:j}\) for \(i=1,\ldots,35\) and \(j=1,\ldots, 5\).

\(H_A\): \(H_0\) is not true.

fish = fisher.test(ct, simulate.p.value = TRUE)
fish
  Fisher's Exact Test for Count Data with simulated p-value (based on
  2000 replicates)

data:  ct
p-value = 0.0004998
alternative hypothesis: two.sided

At a 5% significance level, we reject the null hypothesis in favor of the alternative, which states that the records of crime categories are not realizations of the same distribution. This means the probabilities of individual crime categories are different for different regions.

Hypothesis Testing

Hypothesis 1: Correlation Between Population and Crime Count

ggplot(rows, aes(x=count/relative.count, y=count)) + geom_point() + ggtitle('Inhabitants v. Crime') +
    xlab('inhabitants [10^5]') + ylab('crime count')
ggplot(rows[rows$geo != 'Wien', ], aes(x=count/relative.count, y=count)) + geom_point() + ggtitle('Inhabitants v. Crime (w/o Wien)') +
    xlab('inhabitants [10^5]') + ylab('crime count')

During the initial data exploration, we observed that regions with a larger population have a higher frequency of recorded criminal offenses. We will now test this hypothesis at a 5% significance level using a non-parametric correlation coefficient test - Spearman’s rank correlation coefficient.

\(H_0 :\) There is zero correlation between the number of inhabitants and the frequency of crime in a region: \(\rho_S = 0\)

\(H_A :\) There is a positive correlation between the number of inhabitants and the frequency of crime in a region: \(\rho_S \> 0\)

x = rows$count/rows$relative.count
y = rows$count

round(cor(x, y, method='spearman'), 4)
cor.test(x, y, method='spearman', alternative='greater')

0.8641

  Spearman's rank correlation rho

data:  x and y
S = 970, p-value = 1.586e-08
alternative hypothesis: true rho is greater than 0
sample estimates:
       rho 
0.8641457 

The Spearman’s correlation coefficient value of 0.86 indicates a very strong positive correlation. At a 5% significance level, we reject the null hypothesis in favor of the alternative that there is a positive correlation between the number of inhabitants and the number of criminal offenses. The more populous a region is, the more crimes are recorded in that area.

Hypothesis 2: Comparing Homicide Rates in Austria and Slovakia

cd = crim_data[grepl('^AT[0-9]{3}$', crim_data$geo), ]
cd = label_eurostat(cd, fix_duplicated = TRUE)
cd = subset(cd, unit == 'Number')
cd = subset(cd, freq == 'Annual')
cd = subset(cd, iccs == 'Intentional homicide')
cd = aggregate(list(values = cd$values), list(TIME_PERIOD = cd$TIME_PERIOD), sum)
mean.AT = mean(cd$values)
median.AT = median(cd$values)
colnames(cd) = c('year', 'AT')
cd.AT = cd

cd = crim_data[grepl('^SK[0-9]{3}$', crim_data$geo), ]
cd = label_eurostat(cd, fix_duplicated = TRUE)
cd = subset(cd, unit == 'Number')
cd = subset(cd, freq == 'Annual')
cd = subset(cd, iccs == 'Intentional homicide')
cd = aggregate(list(values = cd$values), list(TIME_PERIOD = cd$TIME_PERIOD), sum)
mean.SK= mean(cd$values)
median.SK= median(cd$values)
colnames(cd) = c('year', 'SK')
cd.SK = cd

cd.AT$SK = cd.SK$SK
cd = cd.AT
cd

cat('AT mean, median: ', mean.AT, ',', median.AT, '\n')
cat('SK mean, median: ', mean.SK, ',', median.SK)
A data.frame: 14 x 3
yearATSK
<date><dbl><dbl>
2008-01-015894
2009-01-015184
2010-01-015987
2011-01-018296
2012-01-018875
2013-01-016278
2014-01-014072
2015-01-014248
2016-01-014959
2017-01-016179
2018-01-017367
2019-01-017476
2020-01-015463
2021-01-015955
AT mean, median:  60.85714 , 59 
SK mean, median:  73.78571 , 75.5

We observe that the sample median number of homicides for the period 2008 to 2021 is 59 in Austria and 75.5 in Slovakia. We want to test at a 5% significance level whether the homicide rate in Austria is the same or significantly lower. We choose the non-parametric Mann-Whitney U test, which does not require data normality. Let \(\tilde{\mu}*\text{AT}\) and \(\tilde{\mu}*\text{SK}\) be the true respective median values.

\(H_0: \tilde{\mu}*\text{AT} = \tilde{\mu}*\text{SK}\)

\(H_A: \tilde{\mu}*\text{AT} < \tilde{\mu}*\text{SK}\)

wilcox.test(cd$AT, cd$SK, alternative='less', exact=F)
  Wilcoxon rank sum test with continuity correction

data:  cd$AT and cd$SK
W = 50, p-value = 0.01449
alternative hypothesis: true location shift is less than 0

At a 5% significance level, we reject the null hypothesis in favor of the alternative, that the median frequency of homicides in Slovakia is significantly higher than in Austria.

Hypothesis 3: Normality of Homicide Data in Vienna

cd = crim_data
cd = subset(cd, unit == 'NR')
cd = subset(cd, freq == 'A')
cd = subset(cd, geo == 'AT13')
cd = label_eurostat(cd)
cd = subset(cd, iccs == 'Intentional homicide')

ggplot(cd, aes(x=values)) + xlab('Count of intentional homicide') + ylab('Density') +
    geom_histogram(aes(y=after_stat(density)), colour = 1, fill = 'white', bins=5) +
    stat_function(fun=dnorm, 
                  args=list(mean=mean(cd$values), sd=sd(cd$values)),
                  colour='red', lwd=2, linetype='dashed')

From the records of homicide frequencies in Vienna, we will test whether they come from a normal distribution. We will verify this hypothesis with a Shapiro-Wilk test and then compare the test’s conclusion with a Q-Q plot.

\(H_0\): The annual number of homicides in the Vienna region comes from a normal distribution.

\(H_A\): The annual number of homicides in the Vienna region does not come from a normal distribution.

shapiro.test(cd$values)

options(repr.plot.height=5)
ggplot(cd, aes(sample=values)) +
        stat_qq(distribution=qnorm, show.legend=T) +
        stat_qq_line(distribution=qnorm, show.legend=F)
  Shapiro-Wilk normality test

data:  cd$values
W = 0.94623, p-value = 0.5039

At a 5% significance level, we do not reject the null hypothesis that the frequencies of homicides in Vienna come from a normal distribution. This result is consistent with the Q-Q plot, where the records lie on the line without any obvious systematic under/overestimation.