Clustering of fatalities in car crashes

Clustering_Fatalities

This project was part of the Data Mining course curriculum from the Master of Science in Data Science at SMU.

Team Members:

  • Lisa Mendez
  • Brandon Lawrence
  • Dominique Edwards
  • Mariana Llamas-Cendon

 

1. Business Understanding

 

1.1 Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?).

The National Highway Traffic Safety Administration’s (NHTSA) Office of Vehicle Safety Research’s mission is to “strategize, plan, and implement research programs to continually further the Agency’s goals in reduction of crashes, fatalities, and injuries. Our research is prioritized based on potential for crash/fatality/injury reductions and is aligned with Congressional Mandates, DOT & NHTSA goals.”
The Fatality Analysis Reporting System (FARS) provides NHTSA with a nationwide census of data regarding fatal injuries suffered in motor vehicle traffic crashes.
FARS contains data derived from a census of fatal traffic crashes within all 50 states, the District of Columbia, American Samoa, Guam, the U.S. Virgin Islands, and Puerto Rico.
In this context a crash, or accident, must involve a motor vehicle traveling on a trafficway open to the public and must result in the death of at least one person (an occupant or a non-motorist) within 30 days of the crash; each such death produces a record in FARS.
Although FARS data have been collected since 1975 (to measure highway safety, identify traffic safety issues, recommend solutions, and provide an objective basis for evaluating the effectiveness of motor vehicle safety standards and highway safety programs), the data set used in this project corresponds only to the year 2015.

1.2 How would you measure the effectiveness of a good prediction algorithm? Be specific.

The effectiveness of a prediction algorithm can be measured with cross-validation, for example five-fold, ten-fold, or stratified cross-validation. Metrics such as precision, recall, and confusion (truth) matrices can also be used to determine how well the prediction model generalizes to new data.
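
As an illustration of this kind of evaluation, the sketch below uses scikit-learn (not used elsewhere in this notebook) with placeholder data and a placeholder classifier; the estimator, the synthetic data, and the scoring choice are assumptions for the example, not part of this project.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the accident features and labels (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified 10-fold cross-validation scored with precision
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring='precision_macro', cv=cv)
print('mean precision: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))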

 

2. Data Understanding

 

The data set used in this project named “accident_reduced_GOOD” is a cleaned version of the original data set (that contained 52 attributes), in which those attributes not useful to the analysis were already deleted. The main causes for deletion were:

  • Missing or unknown data for those attributes represented more than 50% of the total number of observations
  • Attributes were redundant
  • Attributes did not provide any insight

In the original data set, missing or unknown values were coded as 9, 99, 9999, 99.9999, or 999.9999, so they were easily identifiable, while 8, 88, 8888, 88.8888, 888.8888, 98, and 9998 meant that values were not reported or not available.
Values that did not provide any useful information, such as 0 ("not applicable") or 7, 97, 98 ("other," depending on the attribute), were dropped.

For categorical variables with an "other" level (coded 7, 8, or 997) that provided no additional or useful information, the rows containing that level were also dropped.

Attribute LGT_COND, for instance, contains several levels describing darkness: 2 (Dark – Not Lighted), 3 (Dark – Lighted), and 6 (Dark – Unknown Lighting). Level 6 provides no additional or useful information compared to levels 2 and 3, so we dropped the rows containing it.
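
For reference, a minimal sketch of the row drop described above, assuming the raw accident table is loaded in a pandas DataFrame named df:

# Drop the rows coded as level 6 (Dark - unknown lighting) in LGT_COND
df = df[df.LGT_COND != 6]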

Columns DAY, MONTH, and YEAR were combined into a new column named DATETIME (and then dropped), and DATETIME was coerced to dtype datetime. Since we could not find a feasible way to combine the DATETIME column with the HOUR and MINUTE columns, or even to properly combine HOUR and MINUTE into a single timestamp, we decided to drop MINUTE; HOUR provides sufficient information that can be used.
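
A possible way to build the DATETIME column described above is sketched here; it assumes pandas is imported as pd and that the original DAY, MONTH, and YEAR columns are still present in a DataFrame named df:

# Assemble a datetime column from the YEAR/MONTH/DAY components, then drop them
df['DATETIME'] = pd.to_datetime(
    df[['YEAR', 'MONTH', 'DAY']].rename(columns={'YEAR': 'year',
                                                 'MONTH': 'month',
                                                 'DAY': 'day'}))
df = df.drop(['DAY', 'MONTH', 'YEAR'], axis=1)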

We did not find any duplicates in the data set. ST_CASE is the only attribute that should contain only unique values, so we performed a duplicate check just to make sure no case was recorded twice; the check came out clean. ST_CASE itself was then dropped because it only identifies the reported accident case.

Regarding outliers, once the coded missing values were converted to NaNs, we reviewed the continuous, categorical, and binary attributes with .describe() to check that the minimum and maximum values corresponded to the levels and ranges we expected. Nevertheless, we will use boxplots to look at the remaining attributes more closely, determine whether outliers are present, and, if so, assess their importance and decide whether to keep those observations.

The continuous variables in our data set had no missing values, therefore imputation was not required.

The list of attributes dropped is found in the following table.

Attributes deleted from original data set

| Attribute name | Delete reason |
|---|---|
| COUNTY | Geographical codes that will not contribute to the analysis |
| ST_CASE | Unique identifier for each reported accident; does not contribute to the analysis |
| CITY | Too few observations available |
| DAY | Merged with MONTH and YEAR into a new column named DATETIME |
| MONTH | Merged with DAY and YEAR into a new column named DATETIME |
| MINUTE | Could not be converted into an appropriate format |
| SP_JUR | Identifies whether the location on the trafficway where the crash occurred qualifies as a Special Jurisdiction. With about 30,560 instances of class 0 (No Special Jurisdiction), this attribute mostly adds noise, and similar information can be obtained from ROUTE |
| RD_OWNER | Very similar to ROUTE, except that here the route is identified with a particular agency (Federal, State, or Local) |
| RUR_URB | Identifies whether the road where the accident occurred is rural or urban; this information can be derived from ROUTE |
| LONGITUD | Identifies the location of the crash using Global Position coordinates; will not contribute to the analysis |
| LATITUDE | Same reason as LONGITUD |
| FUNC_SYS | Describes the type of road where the accident happened; this information can be obtained from ROUTE |
| TYP_INT | Identifies and allows separation of various intersection types, but can be derived from RELJCT2, which identifies the crash’s location with respect to presence in or proximity to components typically found in junction or interchange areas |
| NHS | Identifies whether the crash occurred on a trafficway that is part of the National Highway System; this information can be obtained from ROUTE |
| TWAY_ID | Records the trafficway on which the crash occurred; this information can be derived from ROUTE |
| TWAY_ID2 | Too few values available |
| MILEPT | Too few observations available |
| WEATHER1 | Too few observations available |
| WEATHER2 | Too few observations available |
| SCH_BUS | Not relevant for the analysis |
| RAIL | Too few observations |
| NOT_HOUR | Too few observations available |
| NOT_MIN | Too few observations available |
| ARR_HOUR | Too few observations available |
| ARR_MIN | Too few observations available |
| HOSP_HR | Not relevant for the analysis |
| HOSP_MIN | Not relevant for the analysis |
| CF1 | Too few observations available |
| CF2 | Too few observations available |
| CF3 | Too few observations available |

Attributes kept

The “accident_reduced_GOOD” data set was cleaned prior to this project in order to reduce the noise that unimportant or redundant attributes could add. The cleaning was based on the following criteria:

  • Some missing values were already removed when attributes that provided no insight or had too few values were dropped, but the data set used for this project still needs to be checked for missing values.
  • Instances of attributes with fewer than 10 observations were removed (some attributes remain to be reviewed against this cut-off of 10 to decide whether their rare instances are worth keeping).
  • No continuous attribute had missing values, so imputation was not required.

The attributes used for this analysis are found in the following table, along with a brief description of each:

Attributes kept

| Attribute name | Type | Description |
|---|---|---|
| STATE | Categorical/nominal | Originally coded 1-56. The following were deleted because the analysis was restricted to the 50 U.S. states: American Samoa, D.C., Guam, Puerto Rico, Virgin Islands |
| VE_TOTAL | Continuous | Number of contact motor vehicles that the officer reported on the PAR as a unit involved in the crash |
| VE_FORMS | Continuous | Count of the number of vehicles in-transport involved in the crash |
| PVH_INVL | Continuous | Count of the number of parked and working vehicles involved in the crash |
| PEDS | Continuous | Number of Person Forms (Not a Motor Vehicle Occupant) applicable to this case (i.e., non-occupants) |
| PERNOTMVIT | Continuous | Count of the number of non-motorists in the crash |
| PERMVIT | Continuous | Count of the number of motorists in the crash. A motorist is a driver, passenger, or unknown occupant type of a motor vehicle in-transport |
| PERSONS | Continuous | Count of the number of Person Level (Motor Vehicle Occupant) forms |
| DAY_WEEK | Categorical/ordinal | Records the day of the week on which the crash occurred (1 = Sunday through 7 = Saturday). One-hot encoded |
| HOUR | Continuous | Records the hour at which the crash occurred (0-23) |
| ROUTE | Categorical/nominal | Identifies the route signing of the trafficway on which the crash occurred. Levels: 1 Interstate; 2 U.S. Highway; 3 State Highway; 4 County Road; 5 Local Street – Township; 6 Local Street – Municipality; 7 Local Street – Frontage Road. One-hot encoded |
| HARM_EV | Categorical/nominal | More than 30 levels; describes the first injury- or damage-producing event of the crash. One-hot encoded |
| RELJCT1 | Categorical/binary | Identifies the crash’s location with respect to presence in an interchange area |
| RELJCT2 | Categorical/nominal | Identifies the crash’s location with respect to presence in or proximity to components typically found in junction or interchange areas. The coding of this data element is done in two sub-fields (see also C20B) and is based on the location of the “First Harmful Event” of the crash. Levels: 1 Non-Junction; 2 Intersection; 3 Intersection Related; 4 Driveway Access; 5 Entrance/Exit Ramp Related; 6 Railway Grade Crossing; 7 Crossover Related; 8 Driveway Access Related; 16 Shared-Use Path Crossing; 17 Acceleration/Deceleration Lane; 18 Through Roadway; 19 Other Location Within Interchange Area; 20 Entrance/Exit Ramp. One-hot encoded |
| WRK_ZONE | Categorical/nominal | Identifies a motor vehicle traffic crash in which the first harmful event occurs within the boundaries of a work zone, or on an approach to or exit from a work zone, resulting from an activity, behavior, or control related to the movement of the traffic units through the work zone. Levels: 0 None; 1 Construction; 2 Maintenance; 3 Utility. One-hot encoded |
| MAN_COLL | Categorical/nominal | Describes the orientation of two motor vehicles in-transport when they are involved in the “First Harmful Event” of a collision crash; if the “First Harmful Event” is not a collision between two motor vehicles in-transport, it is classified as such (level 0). Levels: 0 Not Collision with Motor Vehicle in Transport; 1 Front-to-Rear; 2 Front-to-Front; 6 Angle; 7 Sideswipe – Same Direction; 8 Sideswipe – Opposite Direction; 9 Rear-to-Side; 10 Rear-to-Rear; 11 Other (End-Swipes and Others). One-hot encoded |
| REL_ROAD | Categorical/nominal | Identifies the location of the crash as it relates to its position within or outside the trafficway, based on the “First Harmful Event”. Levels: 1 On Roadway; 2 On Shoulder; 3 On Median; 4 On Roadside; 5 Outside Trafficway; 6 Off Roadway – Location Unknown; 7 In Parking Lane/Zone; 8 Gore; 10 Separator; 11 Continuous Left-Turn Lane. One-hot encoded |
| LGT_COND | Categorical/nominal | Records the type/level of light that existed at the time of the crash as indicated in the case material. Levels: 1 Daylight; 2 Dark – Not Lighted; 3 Dark – Lighted; 4 Dawn; 5 Dusk. One-hot encoded |
| WEATHER | Categorical/nominal | Records the prevailing atmospheric conditions that existed at the time of the crash as indicated in the case material. Levels: 0 No Additional Atmospheric Conditions; 1 Clear; 2 Rain; 3 Sleet, Hail; 4 Snow; 5 Fog, Smog, Smoke; 6 Severe Crosswinds; 7 Blowing Sand, Soil, Dirt; 8 Other; 10 Cloudy; 11 Blowing Snow; 12 Freezing Rain. One-hot encoded |
| FATALS | Continuous | Records the number of fatally injured persons in the crash (01-99) |
| DRUNK_DR | Continuous | Records the number of drunk drivers involved in the fatal crash (00-99) |
In [4]:
import warnings
warnings.filterwarnings("ignore")

from datetime import datetime

import pandas as pd
from pandas import DataFrame
import numpy as np
#import matplotlib as plt
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
df_reduced = pd.read_excel('accident_reduced_GOOD.xlsx')
df_reduced.head()
Out[5]:
STATE VE_TOTAL VE_FORMS PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS DAY_WEEK HOUR HARM_EV MAN_COLL RELJCT1 RELJCT2 WRK_ZONE REL_ROAD LGT_COND WEATHER FATALS DRUNK_DR
0 1 1 1 0 0 0 1 1 5 2.0 35.0 0.0 0.0 1.0 0 4.0 2.0 1.0 1 1
1 1 1 1 0 0 0 1 1 5 22.0 34.0 0.0 0.0 1.0 0 3.0 2.0 10.0 1 0
2 1 1 1 0 0 0 2 2 5 1.0 42.0 0.0 0.0 1.0 0 4.0 2.0 1.0 1 1
3 1 1 1 0 0 0 1 1 1 0.0 53.0 0.0 0.0 1.0 0 4.0 2.0 10.0 1 1
4 1 2 2 0 0 0 2 2 4 7.0 12.0 6.0 0.0 2.0 0 1.0 1.0 1.0 1 0

5 rows × 21 columns

In [6]:
print ('The number of rows is       ', df_reduced.shape[0])
print ('The number of attributes is ', df_reduced.shape[1])
The number of rows is        30782
The number of attributes is  21
In [7]:
df_reduced.describe()
Out[7]:
STATE VE_TOTAL VE_FORMS PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS DAY_WEEK HOUR HARM_EV MAN_COLL RELJCT1 RELJCT2 WRK_ZONE REL_ROAD LGT_COND WEATHER FATALS DRUNK_DR
count 30782.000000 30782.000000 30782.000000 30782.000000 30782.000000 30782.000000 30782.000000 30782.000000 30782.000000 30559.000000 30765.000000 30713.000000 30770.000000 30743.000000 30782.000000 30742.000000 30633.000000 30528.000000 30782.000000 30782.000000
mean 27.257326 1.560295 1.525275 0.035020 0.213599 0.222533 2.290982 2.299916 4.101683 12.809974 18.096408 1.590597 0.042931 1.992811 0.034338 2.197645 1.821793 2.741713 1.091515 0.282048
std 16.164732 0.896582 0.860231 0.264058 0.481171 0.508517 1.957383 1.962260 2.090771 6.868892 13.917434 2.507310 0.202706 2.811779 0.295109 1.583078 0.973972 3.397274 0.367148 0.466383
min 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
25% 12.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 2.000000 7.000000 9.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
50% 27.000000 1.000000 1.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 14.000000 12.000000 0.000000 0.000000 1.000000 0.000000 1.000000 2.000000 1.000000 1.000000 0.000000
75% 42.000000 2.000000 2.000000 0.000000 0.000000 0.000000 3.000000 3.000000 6.000000 19.000000 30.000000 2.000000 0.000000 2.000000 0.000000 4.000000 2.000000 2.000000 1.000000 1.000000
max 56.000000 58.000000 58.000000 11.000000 16.000000 16.000000 93.000000 93.000000 7.000000 23.000000 59.000000 11.000000 1.000000 20.000000 4.000000 11.000000 5.000000 12.000000 10.000000 3.000000

8 rows × 21 columns

Data cleaning

Dealing with missing values

As mentioned earlier, the missing values that were easily identifiable have already been deleted. It is now imperative to check for missing values within the current data set.

Based on our experience with this data set, we expect missing values to be located within the categorical attributes but not the continuous ones.

In [8]:
df_reduced.isnull().any()
Out[8]:
STATE         False
VE_TOTAL      False
VE_FORMS      False
PVH_INVL      False
PEDS          False
PERNOTMVIT    False
PERMVIT       False
PERSONS       False
DAY_WEEK      False
HOUR           True
ROUTE          True
HARM_EV        True
MAN_COLL       True
RELJCT1        True
RELJCT2        True
WRK_ZONE      False
REL_ROAD       True
LGT_COND       True
WEATHER        True
FATALS        False
DRUNK_DR      False
dtype: bool

None of the continuous variables have missing values:

  • **VE_TOTAL**
  • **VE_FORMS**
  • **PVH_INVL**
  • **PEDS**
  • **PERNOTMVIT**
  • **PERMVIT**
  • **PERSONS**
  • **FATALS**
  • **DRUNK_DR**

We tried eliminating the missing values in the categorical attributes by imputing the mode, but missing values remained. Since some of our categorical variables are binary, dropping the rows with missing values from the dataframe is the safest option.

In the following two cells, we drop the rows with missing values from the whole data set and then check that the missing values are in fact gone.
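
For reference, a mode-imputation attempt might look like the sketch below (the column list comes from the isnull() check above); as explained, we ultimately chose to drop the rows instead.

# Sketch only: fill NaNs in each categorical column with that column's mode
for col in ['HOUR', 'ROUTE', 'HARM_EV', 'MAN_COLL', 'RELJCT1',
            'RELJCT2', 'REL_ROAD', 'LGT_COND', 'WEATHER']:
    df_reduced[col] = df_reduced[col].fillna(df_reduced[col].mode()[0])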

In [9]:
# Drop missing values from data set
df_reduced.dropna(inplace = True)
In [10]:
# Check that missing values are in fact gone
df_reduced.isnull().any()
Out[10]:
STATE         False
VE_TOTAL      False
VE_FORMS      False
PVH_INVL      False
PEDS          False
PERNOTMVIT    False
PERMVIT       False
PERSONS       False
DAY_WEEK      False
HOUR          False
ROUTE         False
HARM_EV       False
MAN_COLL      False
RELJCT1       False
RELJCT2       False
WRK_ZONE      False
REL_ROAD      False
LGT_COND      False
WEATHER       False
FATALS        False
DRUNK_DR      False
dtype: bool

Coercing values from float to integer

The categorical attributes in the original data set were recorded as integers, yet the table above shows that many of them were read as floats; this happens because pandas stores columns that contain NaN values as floats. For the sake of consistency, we will coerce those floats back to integers. This could not be done earlier because the missing values were still present.

In [11]:
df_reduced['HOUR'] = df_reduced.HOUR.astype(int)
df_reduced['ROUTE'] = df_reduced.ROUTE.astype(int)
df_reduced['HARM_EV'] = df_reduced.HARM_EV.astype(int)
df_reduced['MAN_COLL'] = df_reduced.MAN_COLL.astype(int)
df_reduced['RELJCT1'] = df_reduced.RELJCT1.astype(int)
df_reduced['RELJCT2'] = df_reduced.RELJCT2.astype(int)
df_reduced['REL_ROAD'] = df_reduced.REL_ROAD.astype(int)
df_reduced['LGT_COND'] = df_reduced.LGT_COND.astype(int)
df_reduced['WEATHER'] = df_reduced.WEATHER.astype(int)

Deleting instances with fewer than 10 observations

A cutoff of 10 was chosen for all of the attributes: the intention of this project is not to look at individual cases, but at values that occur often enough to be considered important. A cutoff of 10 seemed appropriate because it still keeps lower-frequency values that could be interesting to look at, without narrowing the data down to only the most common ones.

First, it is necessary to check the frequencies of each of the attributes that we believe could contain values occurring fewer than 10 times.
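
A compact, hypothetical way to apply the same cutoff to a list of columns, instead of filtering each rare value by hand, is sketched here (the column list in the usage example is illustrative):

# Drop every value of the given columns that occurs fewer than `cutoff` times
def drop_rare_levels(df, columns, cutoff=10):
    for col in columns:
        counts = df[col].value_counts()
        rare = counts[counts < cutoff].index
        df = df[~df[col].isin(rare)]
    return df

# Example usage:
# df_reduced = drop_rare_levels(df_reduced, ['VE_TOTAL', 'PERNOTMVIT', 'PERMVIT'], cutoff=10)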

In [12]:
df_reduced.VE_TOTAL.value_counts()
Out[12]:
1     16036
2     10620
3      1685
4       382
5       112
6        58
7        25
8        13
9         9
12        2
11        2
19        2
31        1
22        1
14        1
10        1
58        1
29        1
16        1
Name: VE_TOTAL, dtype: int64
In [13]:
df_reduced = df_reduced[df_reduced.VE_TOTAL != 11]
df_reduced = df_reduced[df_reduced.VE_TOTAL != 12]
df_reduced = df_reduced[df_reduced.VE_TOTAL != 9]
In [14]:
df_reduced.VE_FORMS.value_counts()
Out[14]:
1     16585
2     10340
3      1509
4       325
5        98
6        44
7        21
8         9
18        2
31        1
14        1
29        1
58        1
10        1
19        1
16        1
Name: VE_FORMS, dtype: int64
In [15]:
df_reduced.PERNOTMVIT.value_counts()
Out[15]:
0     23246
1      5286
2       295
3        61
4        27
5         9
6         7
7         5
9         2
11        1
16        1
Name: PERNOTMVIT, dtype: int64
In [16]:
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 16]
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 11]
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 13]
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 9]
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 7]
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 6]
df_reduced = df_reduced[df_reduced.PERNOTMVIT != 5]
In [17]:
df_reduced.PERMVIT.value_counts()
Out[17]:
1     11812
2      7958
3      4272
4      2167
5      1189
6       660
7       346
8       206
9        86
10       59
11       42
12       28
0        27
13       15
14        8
18        6
15        6
16        4
24        4
17        2
27        2
22        1
28        1
32        1
48        1
93        1
49        1
29        1
50        1
23        1
19        1
35        1
25        1
20        1
68        1
37        1
53        1
Name: PERMVIT, dtype: int64
In [18]:
df_reduced = df_reduced[df_reduced.PERMVIT != 53]
df_reduced = df_reduced[df_reduced.PERMVIT != 37]
df_reduced = df_reduced[df_reduced.PERMVIT != 68]
df_reduced = df_reduced[df_reduced.PERMVIT != 20]
df_reduced = df_reduced[df_reduced.PERMVIT != 25]
df_reduced = df_reduced[df_reduced.PERMVIT != 35]
df_reduced = df_reduced[df_reduced.PERMVIT != 23]
df_reduced = df_reduced[df_reduced.PERMVIT != 28]
df_reduced = df_reduced[df_reduced.PERMVIT != 50]
df_reduced = df_reduced[df_reduced.PERMVIT != 49]
df_reduced = df_reduced[df_reduced.PERMVIT != 93]
df_reduced = df_reduced[df_reduced.PERMVIT != 48]
df_reduced = df_reduced[df_reduced.PERMVIT != 32]
df_reduced = df_reduced[df_reduced.PERMVIT != 29]
df_reduced = df_reduced[df_reduced.PERMVIT != 22]
df_reduced = df_reduced[df_reduced.PERMVIT != 19]
df_reduced = df_reduced[df_reduced.PERMVIT !=17]
df_reduced = df_reduced[df_reduced.PERMVIT != 24]
df_reduced = df_reduced[df_reduced.PERMVIT != 16]
df_reduced = df_reduced[df_reduced.PERMVIT != 15]
df_reduced = df_reduced[df_reduced.PERMVIT != 18]
df_reduced = df_reduced[df_reduced.PERMVIT != 14]
In [19]:
df_reduced.PEDS.value_counts()
Out[19]:
0    23325
1     5224
2      264
3       39
4       17
Name: PEDS, dtype: int64
In [20]:
df_reduced.PERSONS.value_counts()
Out[20]:
1     11717
2      7989
3      4306
4      2179
5      1200
6       665
7       347
8       207
9        84
10       59
11       44
12       28
0        27
13       14
27        2
14        1
Name: PERSONS, dtype: int64
In [21]:
df_reduced = df_reduced[df_reduced.PERSONS != 27]
df_reduced = df_reduced[df_reduced.PERSONS != 14]
In [22]:
df_reduced.PVH_INVL.value_counts()
Out[22]:
0    28150
1      556
2      112
3       30
4       11
5        4
7        3
Name: PVH_INVL, dtype: int64
In [23]:
df_reduced = df_reduced[df_reduced.PVH_INVL != 7]
df_reduced = df_reduced[df_reduced.PVH_INVL != 5]
In [24]:
df_reduced.DAY_WEEK.value_counts()
Out[24]:
7    4957
1    4654
6    4361
5    3975
4    3725
2    3616
3    3571
Name: DAY_WEEK, dtype: int64
In [25]:
df_reduced.HOUR.value_counts()
Out[25]:
18    1676
17    1661
20    1636
21    1626
19    1583
15    1523
16    1490
14    1365
22    1359
13    1267
23    1178
0     1119
12    1101
1     1069
2     1066
6     1055
11    1031
7     1023
10     942
5      890
9      859
3      834
8      833
4      673
Name: HOUR, dtype: int64
In [26]:
df_reduced.ROUTE.value_counts()
Out[26]:
3    9168
6    5250
2    5044
4    4494
1    3650
5    1041
7     212
Name: ROUTE, dtype: int64
In [27]:
df_reduced.HARM_EV.value_counts()
Out[27]:
12    11174
8      4374
1      2498
42     2062
33      961
34      893
35      842
24      687
9       677
30      606
32      381
59      364
38      348
14      310
31      264
53      248
25      239
43      228
5       184
11      169
52      158
18      129
15      123
21       86
23       85
10       79
58       73
39       68
17       61
57       59
19       59
26       47
41       39
46       35
20       33
3        33
7        30
16       29
54       24
48       23
44       22
40       20
49       13
45       12
51       10
Name: HARM_EV, dtype: int64
In [28]:
df_reduced.MAN_COLL.value_counts()
Out[28]:
0     17658
6      5262
2      3036
1      2015
7       468
8       308
11       73
9        38
10        1
Name: MAN_COLL, dtype: int64
In [29]:
# Level 10 refers to Rear-to-Rear type of collision
df_reduced = df_reduced[df_reduced.MAN_COLL != 10]
In [30]:
df_reduced.RELJCT1.value_counts()
Out[30]:
0    27618
1     1240
Name: RELJCT1, dtype: int64
In [31]:
df_reduced.RELJCT2.value_counts()
Out[31]:
1     19782
2      4735
3      2211
8       839
5       433
18      301
19      213
20      112
4        99
6        90
7        43
Name: RELJCT2, dtype: int64
In [32]:
df_reduced.WRK_ZONE.value_counts()
Out[32]:
0    28287
1      404
4      110
2       57
Name: WRK_ZONE, dtype: int64
In [33]:
df_reduced.REL_ROAD.value_counts()
Out[33]:
1     17375
4      9048
3      1142
5       507
2       424
8       158
10       84
6        56
7        49
11       15
Name: REL_ROAD, dtype: int64
In [34]:
df_reduced.LGT_COND.value_counts()
Out[34]:
1    14009
2     8239
3     5283
5      730
4      597
Name: LGT_COND, dtype: int64
In [35]:
df_reduced.WEATHER.value_counts()
Out[35]:
1     20799
10     4954
2      2269
5       368
4       302
3        95
12       23
6        21
11       20
7         7
Name: WEATHER, dtype: int64
In [36]:
df_reduced = df_reduced[df_reduced.WEATHER != 7]
df_reduced = df_reduced[df_reduced.WEATHER != 8]
In [37]:
df_reduced.DRUNK_DR.value_counts()
Out[37]:
0    20929
1     7707
2      213
3        2
Name: DRUNK_DR, dtype: int64
In [38]:
df_reduced = df_reduced[df_reduced.DRUNK_DR != 3]
In [39]:
df_reduced.FATALS.value_counts()
Out[39]:
1    26726
2     1753
3      275
4       70
5       20
6        5
Name: FATALS, dtype: int64
In [40]:
df_reduced = df_reduced[df_reduced.FATALS != 6]

Delete levels from attribute STATE

The STATE attribute is categorical with 56 levels (1 to 56), each representing a state or territory. Our analysis considers only the 50 U.S. states (including Hawaii and Alaska), so the following levels will be removed:

  • American Samoa
  • D.C.
  • Guam
  • Puerto Rico
  • Virgin Islands
In [41]:
# Delete those states that are not part of continental USA.
df_reduced = df_reduced[df_reduced.STATE != 3] # Delete rows that pertain to American Samoa
df_reduced = df_reduced[df_reduced.STATE != 11] # Delete rows that pertain to D.C.
df_reduced = df_reduced[df_reduced.STATE != 14] # Delete rows that contain Guam
df_reduced = df_reduced[df_reduced.STATE != 43] # Delete rows that pertain to Puerto Rico
df_reduced = df_reduced[df_reduced.STATE != 52] # Delete rows that pertain to Virgin Islands

After deleting the states and territories that won’t be used in this project, we check the frequencies of the remaining states to make sure that none has fewer than 10 observations.

In [42]:
df_reduced.STATE.value_counts()
Out[42]:
6     2818
48    2386
12    1498
13    1308
37    1259
42    1066
36    1034
39    1002
17     879
26     874
45     861
47     832
1      765
29     762
18     744
51     688
21     674
22     641
4      631
28     596
40     572
55     506
8      504
53     485
34     453
5      447
24     436
41     359
27     348
20     309
32     286
19     276
9      245
54     241
35     235
49     235
30     192
31     185
16     151
25     147
23     139
56     122
46     114
33     101
38     100
10      91
15      81
2       55
50      49
44      41
Name: STATE, dtype: int64

All of the states that were kept have more than 10 observations, so there is no need to perform any other actions on this variable.

In [43]:
print ('The number of rows is       ', df_reduced.shape[0])
print ('The number of attributes is ', df_reduced.shape[1])
The number of rows is        28823
The number of attributes is  21

After removing the rows containing values with fewer than 10 occurrences in each attribute, and deleting the territories outside the 50 states, only 1,959 observations were removed, about 6% of the data.

Properly Indexing dataframe

Pandas creates a default integer index starting at 0, and after dropping rows the index contains gaps, so for our own readability we rebuild it to run consecutively starting at 1.

In [44]:
#change index to start at 1 not 0
df_reduced.index = np.arange(1, len(df_reduced)+1)

Visualizing the first version of the “cleaned” data set

In [45]:
df_reduced.describe()
Out[45]:
STATE VE_TOTAL VE_FORMS PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS DAY_WEEK HOUR HARM_EV MAN_COLL RELJCT1 RELJCT2 WRK_ZONE REL_ROAD LGT_COND WEATHER FATALS DRUNK_DR
count 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000
mean 27.862436 1.551435 1.519759 0.031676 0.205149 0.212816 2.269160 2.276828 4.097006 12.795962 18.116088 1.611977 0.043021 2.000312 0.033029 2.195122 1.815182 2.733893 1.089963 0.281997
std 16.182681 0.740342 0.715242 0.221970 0.442517 0.458635 1.619565 1.621389 2.086971 6.867671 13.903453 2.517900 0.202908 2.848601 0.284969 1.585693 0.972106 3.389701 0.351929 0.466114
min 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
25% 13.000000 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 2.000000 7.000000 9.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
50% 28.000000 1.000000 1.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 14.000000 12.000000 0.000000 0.000000 1.000000 0.000000 1.000000 2.000000 1.000000 1.000000 0.000000
75% 42.000000 2.000000 2.000000 0.000000 0.000000 0.000000 3.000000 3.000000 6.000000 19.000000 30.000000 2.000000 0.000000 2.000000 0.000000 4.000000 2.000000 2.000000 1.000000 1.000000
max 56.000000 8.000000 8.000000 4.000000 4.000000 4.000000 13.000000 13.000000 7.000000 23.000000 59.000000 11.000000 1.000000 20.000000 4.000000 11.000000 5.000000 12.000000 5.000000 2.000000

8 rows × 21 columns

Using describe, it becomes evident that no value with fewer than 10 occurrences remains and that there are no missing values.

In [46]:
df_reduced.plot(kind='box', figsize = (30, 30))
Out[46]:

Outliers

Considering that a cutoff of 10 was established for these data, the points displayed outside the whiskers in the boxplot figure above are not really outliers: even the most unusual scenarios, for instance a crash involving more than six cars, have meaningful representation within a one-year period and are therefore not as uncommon as they might appear.

In this section we will review only those attributes that based on their boxplot seem odd or interesting.

Based on the boxplots for each of the attributes in the figure above, VE_TOTAL (number of vehicles involved in a crash, whether in-transport or not) displays, as expected after deleting all values with fewer than 10 observations, some points outside the whiskers, with a maximum of 8. A crash involving 8 cars might seem extremely unusual, but our data contain 13 such accidents in a single year, so this cannot be considered an outlier.

VE_FORMS (number of vehicles in-transport involved in the crash), on the other hand, displays almost exactly the same values as VE_TOTAL. Their means differ by 0.03 and their standard deviations by 0.02, which suggests that the two variables are measuring essentially the same thing. Therefore, VE_FORMS will be deleted next.

PVH_INVL (number of parked vehicles involved in the crash) shows a maximum of 4 vehicles involved in a single accident. Again, given the cutoff of 10, and even though the 25%, median, and 75% quantiles all equal 0, having 11 accidents involving 4 parked vehicles is not that unusual.

The high values shown in PEDS (number of people involved in the accident who were not occupants of a vehicle) are not outliers either, even though the min, 25%, median, and 75% all equal 0: there are 17 events in our data that involved the maximum value of 4 such people. It could be argued that PEDS and PERNOTMVIT are strikingly similar, with means and standard deviations differing by roughly 0.02, but they are not equal. Both have a maximum value of 4 people, yet the frequencies clearly differ: PEDS has 17 accidents involving 4 people, while PERNOTMVIT has 27.

The case of PERMVIT (number of motorists in the crash) and PERSONS (number of occupants of vehicles directly involved in the crash) is similar: judging only by their boxplots they look identical, and their min, 25%, 50%, 75%, and max values are equal. The negligible difference between their means and standard deviations could also suggest that the two attributes are identical, and both are clearly skewed, denoting that they are not normally distributed. Nevertheless, their frequencies differ: PERMVIT’s maximum value of 13 occurs 15 times, while PERSONS’s occurs 14 times, and the count of their most common value (1) also differs slightly, with 11,812 occurrences for PERMVIT versus 11,717 for PERSONS.

MAN_COLL is another categorical attribute, with 11 levels; level 0 is the median. Interestingly, the largest non-outlier value equals the 75% quantile (level 2). One reason for this behavior is that the frequencies drop sharply after level 2 (3,036 observations), falling to 468 and eventually to 38. RELJCT2 is a similar case, with frequencies decreasing dramatically as the levels increase; there is nothing strange about its boxplot having no lower whisker, since the median corresponds to level 1, which has the highest frequency (19,782).

For DRUNK_DR the median is 0, as expected, but interestingly there were 213 cases in which two of the drivers involved in an accident were drunk.

FATALS likewise indicates that a single fatality per accident is the most common scenario, with 26,726 cases. Nevertheless, accidents resulting in 5 fatalities occurred 20 times in 2015.

Some attributes that show nothing really interesting are:

  • STATE has no outliers, and it seems to be normally distributed.
  • DAY_WEEK is a categorical variable with only 7 levels denoting each day of the week starting from Sunday, so as it was expected it has no outliers. Surprising though is the fact that it is normally distributed.
  • The case of HOUR is similar to that of DAY_WEEK in the sense that their values are well-defined from 0-23. HOUR though is slightly right skewed, with a median value of 14 hours.
  • ROUTE is a categorical attribute with 7 different levels starting at 1. Also it is not normally distributed as it is slightly skewed to the left. The most common level is 3 with more than 9 thousand instances.
  • HARM_EV is another categorical attribute with 59 levels from which level 12 is the median, while the mean is approximately 18. This attribute is clearly positively skewed.
  • RELJCT1 is an unbalanced binary attribute, where class 0 represents about 90% of the data.
In [47]:
# Delete VE_FORMS as it seems to be redundant with VE_TOTAL
df_reduced.drop('VE_FORMS', axis= 1, inplace =True)

Data preparation for analysis

Now that missing values, non-significant attributes, and non-significant instances (cutoff of 10) have been removed from the data set, the next step is to prepare the data for analysis.

First and foremost, df_reduced will contain the discretized versions of some of the attributes that contain more than 10 levels, such as:

  • For attribute HARM_EV, which contains 59 possible levels, some of which belong to the same category, we will combine similar levels, such as Embankment–Earth, Embankment–Rock, Stone or Concrete, and Embankment–Material Unknown, into a single level “Embankment”. This reduces the attribute’s dimensionality, making it more manageable and easier to interpret.
  • Attribute HOUR has 23 levels, therefore we will discretize it and store its new values in a new column named HR_RANGE as follows:
  • **1** = 0-3 a.m. (when supposedly most of the crashes involving fatalities occur)
  • **2** = 4-7 a.m. (morning hours before hitting the road for most people)
  • **3** = 8-11 a.m. (rush hour to get to work)
  • **4** = 12-15 p.m. (lunch hour)
  • **5** = 16 – 19 p.m. (peak hour after work)
  • **6** = 20-23 (night time)

We tried using pd.cut but could not get the bins to behave as we wanted, so we took the longer route and created a new column HR_RANGE derived from HOUR; the original HOUR column will then be dropped.
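
For reference, a pd.cut version of the same binning might look like this (it assumes HOUR holds integer values 0-23 and uses the six ranges listed above):

# Right-closed bin edges giving the ranges 0-3, 4-7, 8-11, 12-15, 16-19, 20-23
bins = [-1, 3, 7, 11, 15, 19, 23]
labels = [1, 2, 3, 4, 5, 6]
df_accident['HR_RANGE'] = pd.cut(df_accident['HOUR'], bins=bins, labels=labels).astype(int)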

Derived from df_reduced, we will create four different subsets:

  • The first, called “df_accident”, will contain the attribute STATE (which we have not used in other projects) along with all the other attributes, some of which we discretized to reduce their dimensionality because they contained more than 10 levels;
  • The second, called df_acc_dum, will store the dummy (one-hot encoded) versions of the categorical variables;
  • The third, called “df_reg”, will replace STATE with a new attribute REGION derived from it, with only four levels: West (1), Midwest (2), South (3), and Northeast (4);
  • The fourth, called df_r_dumm, will store the one-hot encoded categorical attributes derived from “df_reg”, considering that most clustering techniques do not work well with categorical variables (half of our attributes).

We would like to test whether one-hot encoding the categorical attributes plays an important role in the predictive power of the final model.

Creation of df_accident data frame

For df_accident, we will reduce the dimensionality of two attributes that we believe are important for determining the number of fatalities, or for understanding how and under what circumstances a fatality is more likely to happen:

  • HARM_EV, which describes the first injury- or damage-producing event of the crash; we will group its many levels into six broader categories of similar causes.
  • HOUR, which indicates the hour (0-23, in military time) at which the accident happened; we will discretize this attribute as explained in the previous cell.
In [48]:
# Create new data frame df_accident

df_accident = df_reduced

# reducing dimensionality of HARM_EV attribute by creating a new colum to store the new combined levels
# and then drop original colum

df_accident['HARM_EV_RED'] = 0

# 1. Rollover (1), Fire/Explosion (2), immersion (3), Motor vehicle in transport (12), Motor vehicle parked (14). 
#will be stored in 1 as "Causes" for accident.
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 2] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 3] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 12] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 14] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 54] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 55] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 72] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 73] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 45] = 1
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 35] = 1


#2. Gas Inhalation (4), Fell/Jumped from vehicle (5), Injured in vehicle (non-collision-- 6) will be stored as 2, which now 
# will become "non-collision causes" for accident
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 4] = 2
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 5] = 2
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 6] = 2

# 3. pedestrian (8), Pedalcyclist (9), non-motorist on personal conveyance (15) will be stored as 3, becoming "People cause for
# accident"
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 8] = 3
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 9] = 3
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 15] = 3

# 4. Live animal (11) will be stored as 4, becoming "Living beings cause for accident"
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 11] = 4
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 49] = 4

# 5. Post (31), Utility Pole (30) Culvert (32), Curb(33), Ditch (34), Concrete Traffic Barrier (25), 
#Traffic Signal support (46), will be stored in 5, will now become "Traffic/road causes for accident"
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 31] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 30] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 32] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 33] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 34] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 25] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 46] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 59] = 5
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 44] = 5

# 6.Fence (38), Wall (39), Fire Hydrant (40), Shrubbery (41), Tree (42), Snowbank (48) 
# Thrown or falling object (16), Boulder (17), Other Object (18), Bulding (19), Bridge Pier or support (21), Bridge Rail (23),
# Guardrail Face (24),  Other barrier (26) will be stored as 6 under what we will call now "Barriers". 

df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 38] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 39] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 40] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 41] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 42] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 48] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 16] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 17] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 18] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 19] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 21] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 24] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 26] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 50] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 51] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 52] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 53] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 57] = 6
df_accident['HARM_EV_RED'][df_accident['HARM_EV'] == 58] = 6
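
For reference, the same regrouping could be written more compactly with a code-to-group dictionary and Series.map; the sketch below mirrors the assignments in the cell above (unmapped HARM_EV codes fall back to 0, the default used there), and the group labels in the comments paraphrase the ones used in that cell.

harm_groups = {
    # 1: vehicle-related causes
    2: 1, 3: 1, 12: 1, 14: 1, 35: 1, 45: 1, 54: 1, 55: 1, 72: 1, 73: 1,
    # 2: non-collision causes
    4: 2, 5: 2, 6: 2,
    # 3: people causes
    8: 3, 9: 3, 15: 3,
    # 4: living beings
    11: 4, 49: 4,
    # 5: traffic/road causes
    25: 5, 30: 5, 31: 5, 32: 5, 33: 5, 34: 5, 44: 5, 46: 5, 59: 5,
    # 6: barriers
    16: 6, 17: 6, 18: 6, 19: 6, 21: 6, 24: 6, 26: 6, 38: 6, 39: 6, 40: 6,
    41: 6, 42: 6, 48: 6, 50: 6, 51: 6, 52: 6, 53: 6, 57: 6, 58: 6,
}
df_accident['HARM_EV_RED'] = df_accident['HARM_EV'].map(harm_groups).fillna(0).astype(int)
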
In [49]:
#Delete original attribute HARM_EV
if 'HARM_EV' in df_accident:
    del df_accident['HARM_EV'] 
In [50]:
# Create new column to store the discretized values for HOUR then drop original attribute. 
df_accident['HR_RANGE'] = 0
df_accident['HR_RANGE'][df_accident['HOUR'] <=3] = 1
df_accident['HR_RANGE'][df_accident['HOUR'] ==4] = 2
df_accident['HR_RANGE'][df_accident['HOUR'] ==5] = 2
df_accident['HR_RANGE'][df_accident['HOUR'] == 6] = 2
df_accident['HR_RANGE'][df_accident['HOUR'] == 7] = 2
df_accident['HR_RANGE'][df_accident['HOUR'] == 8] = 3
df_accident['HR_RANGE'][df_accident['HOUR'] == 9] = 3
df_accident['HR_RANGE'][df_accident['HOUR'] == 10] = 3
df_accident['HR_RANGE'][df_accident['HOUR'] == 11] = 3
df_accident['HR_RANGE'][df_accident['HOUR'] == 12] = 4
df_accident['HR_RANGE'][df_accident['HOUR'] == 13] = 4
df_accident['HR_RANGE'][df_accident['HOUR'] == 14] = 4
df_accident['HR_RANGE'][df_accident['HOUR'] == 15] = 4
df_accident['HR_RANGE'][df_accident['HOUR'] == 16] = 5
df_accident['HR_RANGE'][df_accident['HOUR'] == 17] = 5
df_accident['HR_RANGE'][df_accident['HOUR'] == 18] = 5
df_accident['HR_RANGE'][df_accident['HOUR'] == 19] = 5
df_accident['HR_RANGE'][df_accident['HOUR'] == 20] = 6
df_accident['HR_RANGE'][df_accident['HOUR'] == 21] = 6
df_accident['HR_RANGE'][df_accident['HOUR'] == 22] = 6
df_accident['HR_RANGE'][df_accident['HOUR'] == 23] = 6
In [51]:
#Delete original attribute HOUR
if 'HOUR' in df_accident:
    del df_accident['HOUR'] 
In [52]:
df_accident.describe()
Out[52]:
STATE VE_TOTAL PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS DAY_WEEK ROUTE MAN_COLL RELJCT1 RELJCT2 WRK_ZONE REL_ROAD LGT_COND WEATHER FATALS DRUNK_DR HARM_EV_RED HR_RANGE
count 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000
mean 27.862436 1.551435 0.031676 0.205149 0.212816 2.269160 2.276828 4.097006 3.373903 1.611977 0.043021 2.000312 0.033029 2.195122 1.815182 2.733893 1.089963 0.281997 2.532561 3.819484
std 16.182681 0.740342 0.221970 0.442517 0.458635 1.619565 1.621389 2.086971 1.609935 2.517900 0.202908 2.848601 0.284969 1.585693 0.972106 3.389701 0.351929 0.466114 2.066207 1.702918
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000
25% 13.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 2.000000 2.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 2.000000
50% 28.000000 1.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 3.000000 0.000000 0.000000 1.000000 0.000000 1.000000 2.000000 1.000000 1.000000 0.000000 1.000000 4.000000
75% 42.000000 2.000000 0.000000 0.000000 0.000000 3.000000 3.000000 6.000000 4.000000 2.000000 0.000000 2.000000 0.000000 4.000000 2.000000 2.000000 1.000000 1.000000 5.000000 5.000000
max 56.000000 8.000000 4.000000 4.000000 4.000000 13.000000 13.000000 7.000000 7.000000 11.000000 1.000000 20.000000 4.000000 11.000000 5.000000 12.000000 5.000000 2.000000 6.000000 6.000000
In [53]:
df_accident.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28823 entries, 1 to 28823
Data columns (total 20 columns):
STATE          28823 non-null int64
VE_TOTAL       28823 non-null int64
PVH_INVL       28823 non-null int64
PEDS           28823 non-null int64
PERNOTMVIT     28823 non-null int64
PERMVIT        28823 non-null int64
PERSONS        28823 non-null int64
DAY_WEEK       28823 non-null int64
ROUTE          28823 non-null int64
MAN_COLL       28823 non-null int64
RELJCT1        28823 non-null int64
RELJCT2        28823 non-null int64
WRK_ZONE       28823 non-null int64
REL_ROAD       28823 non-null int64
LGT_COND       28823 non-null int64
WEATHER        28823 non-null int64
FATALS         28823 non-null int64
DRUNK_DR       28823 non-null int64
HARM_EV_RED    28823 non-null int64
HR_RANGE       28823 non-null int64
dtypes: int64(20)
memory usage: 4.6 MB

From describe, we can see the newly created attributes HARM_EV_RED and HR_RANGE, each with the reduced number of levels that was defined above.

Let’s now visualize the attributes in order to find more information about each.

In [54]:
df_accident.plot(kind='box', figsize = (30, 30))
Out[54]:

The boxplots above do not differ much from the ones graphed previously, which is what we expected. The newly created attributes HARM_EV_RED and HR_RANGE display a fairly balanced distribution, unlike the variables from which they were derived, and no points that could be considered outliers, so from this standpoint the discretization was a good choice.

Is weather the cause for most fatalities?

In [55]:
#df_reduced.plot(x='HR_RANGE', y = 'FATALS', kind = 'scatter')

df_WTR= df_accident.groupby(by=['WEATHER'])

# Sum number of Fatalities
fatal_weather = df_WTR.FATALS.sum()

# Graph Chart
plt.xlabel('Total Fatalities')
g_fatal_weather = fatal_weather.plot(kind='barh')

The bar plot above is quite interesting: at least for the year 2015, about 25,000 fatalities were not related to bad weather (level 1, Clear). About 2,500 occurred under rainy conditions (2), fewer than 2,000 under either snow (4) or fog/smog/smoke (5), and, quite interestingly, more than 5,000 occurred under cloudy (10) conditions.

Is the lighting condition a main cause for accidents with a great number of fatalities?

In [56]:
df_lgt= df_accident.groupby(by=['LGT_COND'])

# Sum number of Fatalities
fatal_cond = df_lgt.FATALS.sum()

# Graph Chart
plt.xlabel('Total Fatalities')
g_fatal_cond = fatal_cond.plot(kind='barh')

Based on the above graph, most fatalities, about 14,500, occurred in plain daylight (level 1). Darkness without lighting (2) accounted for more than 8,000 fatalities in 2015, while darkness with lighting (3) accounted for almost 6,000.

Weather and total number of accidents

We have already seen that most fatalities happen under normal weather conditions, but what about the total number of accidents?

In [57]:
df_weath= df_accident.groupby(by=['WEATHER'])

# Sum number of Fatalities
fatal_accidents = df_weath.VE_TOTAL.sum()

# Graph Chart
plt.xlabel('Total Accidents')
g_fatal_accidents = fatal_accidents.plot(kind='barh')

At first sight the bar plots of WEATHER versus FATALS and WEATHER versus VE_TOTAL look alike. Looking more closely, however, the number of total accidents under normal weather conditions in 2015 exceeds 30,000 occurrences, against about 25,000 fatalities under the same conditions. There were also more than 3,000 total accidents in the rain (2), compared with about 2,500 fatalities under that condition, and about 7,500 under cloudy (10) conditions, compared with a little over 5,000 fatalities.

Deadly states

Are there any states that concentrate a large share of the total fatalities due to traffic accidents? We expect states with more inhabitants to record a higher number of fatalities.

In [58]:
# create a temporary data frame, grouping data by state 

df_st= df_accident.groupby(by=['STATE'])

# Sum number of Fatalities on a new variable that will be used to plot
fatal_states = df_st.FATALS.sum()

# Graph Chart
plt.xlabel('Total Fatalities')
g_fatal_states = fatal_states.plot(kind='barh', figsize = (30,30))

The data show that in 2015 California (6) concentrated a little over 3,000 of the total fatalities, followed by Texas (48) with about 2,750. Alaska (2) presented the lowest fatality count at fewer than 100, but it is important to put the number of inhabitants of Alaska (737,709) in perspective against California (39 million), Texas (27 million), and the rest of the states to determine whether the difference is in fact significant.
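
A rough per-capita comparison for the three states mentioned above is sketched here; the population figures are the approximate values quoted in the text (FIPS codes: 2 = Alaska, 6 = California, 48 = Texas) and are only illustrative.

# Fatalities per 100,000 inhabitants, using fatal_states computed in the cell above
population = {2: 737709, 6: 39000000, 48: 27000000}

for fips, pop in population.items():
    per_100k = fatal_states[fips] / float(pop) * 100000
    print('State {}: {:.1f} fatalities per 100,000 inhabitants'.format(fips, per_100k))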

Creation of df_acc_dum data frame

This dataframe will store the one-hot encoded (dummy) versions of the categorical attributes. Afterwards, we will drop the attributes from which the dummies were derived.

In [59]:
df_acc_dum = df_accident.copy(deep=True)

One-hot encoding of df_acc_dum categoricals

In [60]:
# Create dummy variable from STATE attribute
#st_df = pd.get_dummies(df_acc_dum.STATE,prefix='ST')
#df_acc_dum = pd.concat((df_acc_dum, st_df), axis =1)
In [61]:
# Create dummy variable from ROUTE attribute
rte_df = pd.get_dummies(df_acc_dum.ROUTE,prefix='ROUTE')
df_acc_dum = pd.concat((df_acc_dum, rte_df), axis =1)
In [62]:
# Create dummy variable from HARM_EV attribute
harm_ev_df = pd.get_dummies(df_acc_dum.HARM_EV_RED,prefix='HARM_RED')
df_acc_dum = pd.concat((df_acc_dum, harm_ev_df), axis =1)
In [63]:
# Create dummy variable from RELJCT2 attribute
rjct_df = pd.get_dummies(df_acc_dum.RELJCT2,prefix='RELJCT2')
df_acc_dum = pd.concat((df_acc_dum, rjct_df), axis =1)
In [64]:
# Create dummy variable from WRK_ZONE attribute
wzone_df = pd.get_dummies(df_acc_dum.WRK_ZONE,prefix='WRK_ZONE')
df_acc_dum = pd.concat((df_acc_dum, wzone_df), axis =1)
In [65]:
# Create dummy variable from REL_ROAD attribute
rrd_df = pd.get_dummies(df_acc_dum.REL_ROAD,prefix='REL_ROAD')
df_acc_dum = pd.concat((df_acc_dum, rrd_df), axis =1)
In [66]:
# Create dummy variable from LGT_COND attribute
lcond_df = pd.get_dummies(df_acc_dum.LGT_COND,prefix='LGT_COND')
df_acc_dum = pd.concat((df_acc_dum, lcond_df), axis =1)
In [67]:
# Create dummy variable from WEATHER attribute
wth_df = pd.get_dummies(df_acc_dum.WEATHER,prefix='WEATHER')
df_acc_dum = pd.concat((df_acc_dum, wth_df), axis =1)
In [68]:
# Create dummy variable from DAY_WEEK attribute
daywk_df = pd.get_dummies(df_acc_dum.DAY_WEEK,prefix='DAY_WEEK')
df_acc_dum = pd.concat((df_acc_dum, daywk_df), axis =1)
In [69]:
# Create dummy variable from HR_RANGE attribute
h_range = pd.get_dummies(df_acc_dum.HR_RANGE,prefix='HR_RANGE')
df_acc_dum = pd.concat((df_acc_dum, h_range), axis =1)
In [70]:
# # Create dummy variable from MAN_COLL attribute
m_coll = pd.get_dummies(df_acc_dum.MAN_COLL,prefix='MAN_COLL')
df_acc_dum = pd.concat((df_acc_dum, m_coll), axis =1)
In [71]:
# To clean df_acc_dum, drop those attributes from which the dummy variables were derived
    
if 'ROUTE' in df_acc_dum:
    del df_acc_dum['ROUTE'] 

    
# Note: this condition checks for 'HARM_EV', which was already removed from df_accident,
# so HARM_EV_RED is not actually dropped here and remains among the columns counted below.
if 'HARM_EV' in df_acc_dum:
    del df_acc_dum['HARM_EV_RED'] 
    
if 'RELJCT2' in df_acc_dum:
    del df_acc_dum['RELJCT2'] 

    
if 'WRK_ZONE' in df_acc_dum:
    del df_acc_dum['WRK_ZONE'] 
    
if 'REL_ROAD' in df_acc_dum:
    del df_acc_dum['REL_ROAD'] 
    
if 'LGT_COND' in df_acc_dum:
    del df_acc_dum['LGT_COND'] 
    
if 'WEATHER' in df_acc_dum:
    del df_acc_dum['WEATHER'] 
    

if 'DAY_WEEK' in df_acc_dum:
    del df_acc_dum['DAY_WEEK']
    
if 'HR_RANGE' in df_acc_dum:
    del df_acc_dum['HR_RANGE']

    
if 'MAN_COLL' in df_acc_dum:
    del df_acc_dum['MAN_COLL']
    
In [72]:
print ('The number of rows is       ', df_acc_dum.shape[0])
print ('The number of attributes is ', df_acc_dum.shape[1])
The number of rows is        28823
The number of attributes is  85

By performing one-hot encoding, the number of attributes rose to 85.
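
For reference, pd.get_dummies can also encode several columns in one call and drop the originals at the same time. The sketch below is roughly equivalent to the cells above (the result name df_acc_dum_alt is hypothetical, and the default prefixes are the column names, so the HARM_EV_RED dummies would be named slightly differently from the HARM_RED_* columns created above):

# One-call alternative: encode the listed categorical columns and drop the originals
cat_cols = ['ROUTE', 'HARM_EV_RED', 'RELJCT2', 'WRK_ZONE', 'REL_ROAD',
            'LGT_COND', 'WEATHER', 'DAY_WEEK', 'HR_RANGE', 'MAN_COLL']
df_acc_dum_alt = pd.get_dummies(df_accident, columns=cat_cols)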

In [73]:
df_acc_dum.describe(include='all')
Out[73]:
STATE VE_TOTAL PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS RELJCT1 FATALS DRUNK_DR HR_RANGE_5 HR_RANGE_6 MAN_COLL_0 MAN_COLL_1 MAN_COLL_2 MAN_COLL_6 MAN_COLL_7 MAN_COLL_8 MAN_COLL_9 MAN_COLL_11
count 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000
mean 27.862436 1.551435 0.031676 0.205149 0.212816 2.269160 2.276828 0.043021 1.089963 0.281997 0.222045 0.200881 0.612011 0.069875 0.105124 0.182354 0.016168 0.010651 0.001318 0.002498
std 16.182681 0.740342 0.221970 0.442517 0.458635 1.619565 1.621389 0.202908 0.351929 0.466114 0.415628 0.400666 0.487300 0.254940 0.306719 0.386143 0.126122 0.102655 0.036286 0.049918
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 13.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 28.000000 1.000000 0.000000 0.000000 0.000000 2.000000 2.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 42.000000 2.000000 0.000000 0.000000 0.000000 3.000000 3.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 56.000000 8.000000 4.000000 4.000000 4.000000 13.000000 13.000000 1.000000 5.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 85 columns

Creation of df_reg

This dataframe will differ from df_accident in the sense that instead of containing STATE, it will contain a discretized version of it, called REGION, as we had already explained.

In [74]:
df_reg = df_accident.copy(deep=True)

# We decided to create a REGION attribute considering that the STATE attribute had 50 levels, which we will use only
# as a reference, whereas REGION can allow us to interpret, discover and create new relationships. 
# We decided to group states in only four regions: West (1), Midwest (2), South (3) and Northeast (4), as it is done in
# https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States. 
# Map each state (FIPS code) to its region; codes not listed keep the default value of 0.
region_map = {
    1: 3, 2: 1, 4: 1, 5: 3, 6: 1, 8: 1, 9: 4, 10: 4, 12: 3, 13: 3,
    15: 1, 16: 1, 17: 2, 18: 2, 19: 2, 20: 2, 21: 3, 22: 3, 23: 4,
    24: 3, 25: 4, 26: 4, 27: 4, 28: 3, 29: 4, 30: 1, 31: 4, 32: 1,
    33: 4, 34: 4, 35: 1, 36: 4, 37: 3, 38: 2, 39: 2, 40: 3, 41: 1,
    42: 4, 44: 4, 45: 3, 46: 2, 47: 3, 48: 3, 49: 1, 50: 4, 51: 3,
    53: 1, 54: 3, 55: 2, 56: 1,
}
df_reg['REGION'] = df_reg['STATE'].map(region_map).fillna(0).astype(int)
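
As a quick sanity check on the mapping (a sketch; output not shown here), we can look at the distribution of REGION and at any state codes that were left unmapped:

In [ ]:
# Sanity check: number of crashes per region, and any state codes that kept the default value of 0
print(df_reg['REGION'].value_counts().sort_index())
print(df_reg.loc[df_reg['REGION'] == 0, 'STATE'].unique())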
In [75]:
if 'STATE' in df_reg:
    del df_reg['STATE']
In [76]:
df_reg.describe(include='all')
Out[76]:
VE_TOTAL PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS DAY_WEEK ROUTE MAN_COLL RELJCT1 RELJCT2 WRK_ZONE REL_ROAD LGT_COND WEATHER FATALS DRUNK_DR HARM_EV_RED HR_RANGE REGION
count 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000
mean 1.551435 0.031676 0.205149 0.212816 2.269160 2.276828 4.097006 3.373903 1.611977 0.043021 2.000312 0.033029 2.195122 1.815182 2.733893 1.089963 0.281997 2.532561 3.819484 2.628665
std 0.740342 0.221970 0.442517 0.458635 1.619565 1.621389 2.086971 1.609935 2.517900 0.202908 2.848601 0.284969 1.585693 0.972106 3.389701 0.351929 0.466114 2.066207 1.702918 1.022042
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000
25% 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 2.000000 2.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 2.000000 2.000000
50% 1.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 3.000000 0.000000 0.000000 1.000000 0.000000 1.000000 2.000000 1.000000 1.000000 0.000000 1.000000 4.000000 3.000000
75% 2.000000 0.000000 0.000000 0.000000 3.000000 3.000000 6.000000 4.000000 2.000000 0.000000 2.000000 0.000000 4.000000 2.000000 2.000000 1.000000 1.000000 5.000000 5.000000 3.000000
max 8.000000 4.000000 4.000000 4.000000 13.000000 13.000000 7.000000 7.000000 11.000000 1.000000 20.000000 4.000000 11.000000 5.000000 12.000000 5.000000 2.000000 6.000000 6.000000 4.000000
In [77]:
df_reg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28823 entries, 1 to 28823
Data columns (total 20 columns):
VE_TOTAL       28823 non-null int64
PVH_INVL       28823 non-null int64
PEDS           28823 non-null int64
PERNOTMVIT     28823 non-null int64
PERMVIT        28823 non-null int64
PERSONS        28823 non-null int64
DAY_WEEK       28823 non-null int64
ROUTE          28823 non-null int64
MAN_COLL       28823 non-null int64
RELJCT1        28823 non-null int64
RELJCT2        28823 non-null int64
WRK_ZONE       28823 non-null int64
REL_ROAD       28823 non-null int64
LGT_COND       28823 non-null int64
WEATHER        28823 non-null int64
FATALS         28823 non-null int64
DRUNK_DR       28823 non-null int64
HARM_EV_RED    28823 non-null int64
HR_RANGE       28823 non-null int64
REGION         28823 non-null int64
dtypes: int64(20)
memory usage: 4.6 MB

Creation of the df_r_dumm dataframe

This dataframe is similar to df_acc_dum in that it also stores the one-hot encoded attributes, along with REGION, which will be one-hot encoded as well.

In [78]:
df_r_dumm = df_reg.copy(deep=True)

One hot encoding of categorical variables of df_r_dumm

In [79]:
# Create dummy variable from ROUTE attribute
route_df = pd.get_dummies(df_r_dumm.ROUTE,prefix='ROUTE')
df_r_dumm = pd.concat((df_r_dumm, route_df), axis =1)
In [80]:
# Create dummy variables from the HARM_EV_RED attribute
harm_df = pd.get_dummies(df_r_dumm.HARM_EV_RED,prefix='HARM_RED')
df_r_dumm = pd.concat((df_r_dumm, harm_df), axis =1)
In [81]:
# Create dummy variable from RELJCT2 attribute
reljct_df = pd.get_dummies(df_r_dumm.RELJCT2,prefix='RELJCT2')
df_r_dumm = pd.concat((df_r_dumm, reljct_df), axis =1)
In [82]:
# Create dummy variable from WRK_ZONE attribute
wrk_df = pd.get_dummies(df_r_dumm.WRK_ZONE,prefix='WRK_ZONE')
df_r_dumm = pd.concat((df_r_dumm, wrk_df), axis =1)
In [83]:
# Create dummy variable from REL_ROAD attribute
relrd_df = pd.get_dummies(df_r_dumm.REL_ROAD,prefix='REL_ROAD')
df_r_dumm = pd.concat((df_r_dumm, relrd_df), axis =1)
In [84]:
# Create dummy variable from LGT_COND attribute
lgt_df = pd.get_dummies(df_r_dumm.LGT_COND,prefix='LGT_COND')
df_r_dumm = pd.concat((df_r_dumm, lgt_df), axis =1)
In [85]:
# Create dummy variable from WEATHER attribute
wtr_df = pd.get_dummies(df_r_dumm.WEATHER,prefix='WEATHER')
df_r_dumm = pd.concat((df_r_dumm, wtr_df), axis =1)
In [86]:
# Create dummy variable from DAY_WEEK attribute
wk_df = pd.get_dummies(df_r_dumm.DAY_WEEK,prefix='DAY_WEEK')
df_r_dumm = pd.concat((df_r_dumm, wk_df), axis =1)
In [87]:
# Create dummy variable from HR_RANGE attribute
hr_range = pd.get_dummies(df_r_dumm.HR_RANGE,prefix='HR_RANGE')
df_r_dumm = pd.concat((df_r_dumm, hr_range), axis =1)
In [88]:
# Create dummy variables from the REGION attribute
df_region = pd.get_dummies(df_r_dumm.REGION,prefix='REGION')
df_r_dumm = pd.concat((df_r_dumm, df_region), axis =1)
In [89]:
# Create dummy variables from the MAN_COLL attribute
man_coll = pd.get_dummies(df_r_dumm.MAN_COLL,prefix='MAN_COLL')
df_r_dumm = pd.concat((df_r_dumm, man_coll), axis =1)
In [90]:
# To clean df_r_dumm, drop those attributes from which the dummy variables were derived

if 'ROUTE' in df_r_dumm:
    del df_r_dumm['ROUTE']


# As above, this condition checks for 'HARM_EV', which is not a column of df_r_dumm, so HARM_EV_RED
# is never actually dropped and still appears in the describe() output below.
if 'HARM_EV' in df_r_dumm:
    del df_r_dumm['HARM_EV_RED']
    
if 'RELJCT2' in df_r_dumm:
    del df_r_dumm['RELJCT2'] 

    
if 'WRK_ZONE' in df_r_dumm:
    del df_r_dumm['WRK_ZONE'] 
    
if 'REL_ROAD' in df_r_dumm:
    del df_r_dumm['REL_ROAD'] 
    
if 'LGT_COND' in df_r_dumm:
    del df_r_dumm['LGT_COND'] 
    
if 'WEATHER' in df_r_dumm:
    del df_r_dumm['WEATHER'] 
    

if 'DAY_WEEK' in df_r_dumm:
    del df_r_dumm['DAY_WEEK']
    
if 'HR_RANGE' in df_r_dumm:
    del df_r_dumm['HR_RANGE']
    
if 'REGION' in df_r_dumm:
    del df_r_dumm['REGION']
    
if 'MAN_COLL' in df_r_dumm:
    del df_r_dumm['MAN_COLL']
    
In [91]:
print ('The number of rows is       ', df_r_dumm.shape[0])
print ('The number of attributes is ', df_r_dumm.shape[1])
The number of rows is        28823
The number of attributes is  88

Replacing STATE with the one-hot encoded REGION (four dummy columns) only added a net of 3 columns relative to df_acc_dum (88 vs. 85 attributes).
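
A quick way to confirm where the extra columns come from (a sketch, assuming both dataframes are still in memory):

In [ ]:
# STATE (one column) was replaced by four REGION dummy columns, a net gain of three attributes
print([c for c in df_r_dumm.columns if c.startswith('REGION_')])
print(df_r_dumm.shape[1] - df_acc_dum.shape[1])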

In [92]:
df_r_dumm.describe(include='all')
Out[92]:
VE_TOTAL PVH_INVL PEDS PERNOTMVIT PERMVIT PERSONS RELJCT1 FATALS DRUNK_DR HARM_EV_RED REGION_3 REGION_4 MAN_COLL_0 MAN_COLL_1 MAN_COLL_2 MAN_COLL_6 MAN_COLL_7 MAN_COLL_8 MAN_COLL_9 MAN_COLL_11
count 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000 28823.000000
mean 1.551435 0.031676 0.205149 0.212816 2.269160 2.276828 0.043021 1.089963 0.281997 2.532561 0.458106 0.192034 0.612011 0.069875 0.105124 0.182354 0.016168 0.010651 0.001318 0.002498
std 0.740342 0.221970 0.442517 0.458635 1.619565 1.621389 0.202908 0.351929 0.466114 2.066207 0.498250 0.393907 0.487300 0.254940 0.306719 0.386143 0.126122 0.102655 0.036286 0.049918
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 1.000000 0.000000 0.000000 0.000000 2.000000 2.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 2.000000 0.000000 0.000000 0.000000 3.000000 3.000000 0.000000 1.000000 1.000000 5.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 8.000000 4.000000 4.000000 4.000000 13.000000 13.000000 1.000000 5.000000 2.000000 6.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 88 columns

Modeling and Evaluation (50 points total)

Different tasks will require different evaluation methods. Be as thorough as possible when analyzing
the data you have chosen and use visualizations of the results to explain the performance and
expected outcomes whenever possible. Guide the reader through your analysis with plenty of
discussion of the results.

For this task we tried to divide the data from the four dataframes we created into an 80% train / 20% test split, but after more than four hours of trying we could not make it work properly.

We tried converting the different dataframes into X and y numpy arrays and writing a loop to split them into X_train, y_train, X_test, and y_test; we also tried train_test_split, but the clustering methods returned errors on the numpy arrays that we were unable to fix in the time we had to finish the project. With more time we are confident we would have found a solution.

Therefore, we decided to keep our dataframes as pandas objects, but splitting those also turned out to be a tall order under our time constraint, so the dataframes were left intact.
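
For reference, one way the 80/20 split could be done while keeping everything as pandas objects is sketched below (we did not use it for the results that follow; it assumes df_r_dumm and its FATALS column as defined above):

In [ ]:
from sklearn.model_selection import train_test_split

# train_test_split accepts pandas DataFrames/Series directly and returns objects of the same type,
# so downstream code can keep using column names on the training portion
X_df = df_r_dumm.drop('FATALS', axis=1)
y_sr = df_r_dumm['FATALS']
X_train, X_test, y_train, y_test = train_test_split(X_df, y_sr, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)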

We then decided to test how KMeans behaves when using a dataframe with no one-hot encoded attributes versus one with one-hot encoded attributes, in order to decide which works better based on accuracy and deviation, and to choose the better-performing one for the other clustering methods: agglomerative and DBSCAN.

Modeling of dataframe df_accident

We used stratified K-fold cross-validation with 10 splits and created a Random Forest classifier with 150 estimators. We will not change its parameters; instead, we will try KMeans with different dataframes and different attributes to decide which yields the best combination of accuracy and deviation. Then we will use that particular combination to perform the other clustering methods and evaluate them in the same way.

Training and adjusting parameters

In [93]:
%%time 

from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=10)

clf = RandomForestClassifier(n_estimators=150,random_state=1)



X = df_accident[['STATE', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]

y = df_accident['FATALS']

acc = cross_val_score(clf,X,y=y,cv=cv)

print ("Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
Average accuracy =  90.0391601924 +- 1.18571260271
CPU times: user 27.1 s, sys: 1.49 s, total: 28.6 s
Wall time: 34.5 s
In [94]:
X1 = df_accident[['PERSONS', 'PERMVIT']]
X2 = df_accident[['STATE', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]

params = []
for n_person in range(3,5):
    for n_permvit in range(3,10):
        # get the first clustering
        cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
        cls_person.fit(X1)
        newfeature_person = cls_person.labels_ # the labels from kmeans clustering

        # append on the second clustering
        cls_permvit = KMeans(n_clusters=n_permvit, init='k-means++',random_state=1)
        cls_permvit.fit(X2)
        newfeature_permvit = cls_permvit.labels_ # the labels from kmeans clustering
        
        # append to dataframe
        y = df_accident['FATALS']
        X = df_accident[['STATE', 'VE_TOTAL', 'PEDS', 'PERMVIT', 'PERSONS', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]
        X = np.column_stack((X,pd.get_dummies(newfeature_person),pd.get_dummies(newfeature_permvit)))

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((n_person,n_permvit,acc.mean()*100,acc.std()*100)) # save state

        print ("Clusters",n_person,n_permvit,"Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
        
        
Clusters 3 3 Average accuracy =  91.1772625529 +- 0.809078230501
Clusters 3 4 Average accuracy =  91.2708692553 +- 0.53649571456
Clusters 3 5 Average accuracy =  91.1876828277 +- 0.885859754237
Clusters 3 6 Average accuracy =  91.2119655142 +- 0.878254792137
Clusters 3 7 Average accuracy =  91.2258483721 +- 0.807301867525
Clusters 3 8 Average accuracy =  91.3542158084 +- 0.66011986219
Clusters 3 9 Average accuracy =  91.1495521593 +- 0.874140083122
Clusters 4 3 Average accuracy =  91.2605368608 +- 0.725835773996
Clusters 4 4 Average accuracy =  91.3090624795 +- 0.544059737054
Clusters 4 5 Average accuracy =  91.156483404 +- 0.980667575489
Clusters 4 6 Average accuracy =  91.2119462434 +- 0.771837507198
Clusters 4 7 Average accuracy =  91.2432010467 +- 0.761776997933
Clusters 4 8 Average accuracy =  91.2813509358 +- 0.684569461582
Clusters 4 9 Average accuracy =  91.1565122415 +- 0.924601139294

Based on the above results, 3 clusters for PERSONS and 4 for PERMVIT give the best trade-off: an accuracy of 91.27, close to the highest observed, together with the lowest deviation (0.54). Next we will use these numbers of clusters to perform KMeans on those two attributes together.
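
The same choice can also be read directly off the params list built in the loop above (a sketch; each entry is the (n_person, n_permvit, mean accuracy, standard deviation) tuple we appended):

In [ ]:
# Lowest standard deviation first, ties broken by highest mean accuracy
best = sorted(params, key=lambda t: (t[3], -t[2]))[0]
print(best)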

In [95]:
n_person=3
n_permvit=4

X2 = df_accident[['STATE', 'VE_TOTAL','PERSONS', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]

cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
cls_person.fit(X1)
newfeature_person = cls_person.labels_ # the labels from kmeans clustering

# append on the second clustering
cls_permvit = KMeans(n_clusters=n_permvit, init='k-means++',random_state=1)
cls_permvit.fit(X2)
newfeature_permvit = cls_permvit.labels_ # the labels from kmeans clustering

plt.figure()
plt.subplot(1,2,1)
X2=X2.values
plt.scatter(X2[:, 7], X2[:, 1]+np.random.random(X2[:, 1].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Persons Involved')
plt.grid()

plt.subplot(1,2,2)
plt.scatter(X2[:, 7], X2[:, 8]+np.random.random(X2[:, 1].shape)/2, c=newfeature_permvit, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Alcohol Involved')
plt.grid()

X1=X1.values
plt.figure()
plt.scatter(X1[:, 0], X1[:, 1]+np.random.random(X1[:, 0].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('People Involved'), plt.ylabel('People in Vehicle')
plt.grid()

Based on the above graphs, PERSONS (people in vehicle) and PERMVIT (people involved) can be clustered fairly easily into 3 groups. PERSONS and WEATHER also show three, somewhat defined clusters. WEATHER and alcohol involvement are a different story, as only one cluster (in red) can be easily identified.

Next we will try performing KMeans on just PERSONS.

In [96]:
X3 = df_accident[['PERSONS', 'PERMVIT']]
X4 = df_accident[['STATE', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]

params = []
for n_person in range(3,7):
        # get the first clustering
        cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
        cls_person.fit(X3)
        newfeature_person = cls_person.labels_ # the labels from kmeans clustering


        y = df_accident['FATALS']
        X = df_accident[['STATE', 'VE_TOTAL', 'PEDS', 'PERMVIT', 'PERSONS', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]
        X = np.column_stack((X,pd.get_dummies(newfeature_person)))

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((n_person,acc.mean()*100,acc.std()*100)) # save state

        print ("Clusters",n_person,"Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
        
        
Clusters 3 Average accuracy =  91.4790291582 +- 0.245862662762
Clusters 4 Average accuracy =  91.3819008699 +- 0.307578533487
Clusters 5 Average accuracy =  91.406166691 +- 0.295276546539
Clusters 6 Average accuracy =  91.4200760359 +- 0.289407462158

We can see that performing KMeans on just PERSONS did increase the accuracy and lower the deviation, making it tighter. For graphing the clusters we will use 3, the number that provides the highest accuracy and lowest deviation.

In [97]:
n_person=3


X4 = df_accident[['STATE', 'VE_TOTAL','PERSONS', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'LGT_COND', 'WEATHER', 'DRUNK_DR']]

cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
cls_person.fit(X3)
newfeature_person = cls_person.labels_ # the labels from kmeans clustering

plt.figure()
plt.subplot(1,2,1)
X4=X4.values
plt.scatter(X4[:, 7], X4[:, 1]+np.random.random(X4[:, 1].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Persons Involved')
plt.grid()

plt.subplot(1,2,2)
plt.scatter(X4[:, 7], X4[:, 8]+np.random.random(X4[:, 1].shape)/2, c=newfeature_permvit, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Alcohol Involved')
plt.grid()

X3=X3.values
plt.figure()
plt.scatter(X3[:, 0], X3[:, 1]+np.random.random(X3[:, 0].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('People Involved'), plt.ylabel('People in Vehicle')
plt.grid()

We chose three clusters because they give the highest accuracy (91.48) and the lowest deviation (0.25). Even though there isn't a dramatic difference from the previous graph, the clusters in this plot are slightly better defined.

Lighting condition and weather

From the df_accident dataframe we observed that most accidents happen under normal weather and lighting conditions, but a good number also happened in "unusual" conditions. So we want to know what happens if we apply KMeans to these two attributes (not one-hot encoded) together.

In [98]:
X5 = df_accident[['LGT_COND', 'WEATHER']]
X6 = df_accident[['STATE', 'PERSONS', 'PERMVIT', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'HARM_EV_RED', 'DRUNK_DR']]

params = []
for n_lgt in range(3,7):
     for n_wthr in range(3,10):
    
        # get the first clustering
        cls_lgt = KMeans(n_clusters=n_lgt, init='k-means++',random_state=1)
        cls_lgt.fit(X5)
        newfeature_lgt = cls_lgt.labels_ # the labels from kmeans clustering

        # append second clustering
        cls_wthr = KMeans(n_clusters=n_wthr, init='k-means++',random_state=1)
        cls_wthr.fit(X6)
        newfeature_wthr = cls_wthr.labels_ # the labels from kmeans clustering

        y = df_accident['FATALS']
        X = df_accident[['STATE', 'PERSONS', 'PERMVIT', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'HARM_EV_RED', 'DRUNK_DR']]
        X = np.column_stack((X,pd.get_dummies(newfeature_lgt),pd.get_dummies(newfeature_wthr) ))

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((n_lgt, n_wthr,acc.mean()*100,acc.std()*100)) # save state

        print ("Clusters",n_lgt, n_wthr, "Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
        
        
Clusters 3 3 Average accuracy =  91.121791257 +- 0.658452787108
Clusters 3 4 Average accuracy =  91.1009458991 +- 0.64449512011
Clusters 3 5 Average accuracy =  90.8823561124 +- 0.922536030319
Clusters 3 6 Average accuracy =  91.041967485 +- 0.792655383407
Clusters 3 7 Average accuracy =  91.0662405349 +- 0.734261718598
Clusters 3 8 Average accuracy =  91.0731524762 +- 0.721583074588
Clusters 3 9 Average accuracy =  91.1564328152 +- 0.703395062105
Clusters 4 3 Average accuracy =  91.1460486592 +- 0.635841301003
Clusters 4 4 Average accuracy =  91.0974845138 +- 0.700248619731
Clusters 4 5 Average accuracy =  91.0349893848 +- 0.882021132247
Clusters 4 6 Average accuracy =  91.1217358824 +- 0.790341098368
Clusters 4 7 Average accuracy =  91.1252285632 +- 0.755286350444
Clusters 4 8 Average accuracy =  91.114797459 +- 0.843519861624
Clusters 4 9 Average accuracy =  91.1355958812 +- 0.75075793137
Clusters 5 3 Average accuracy =  91.2571260043 +- 0.653598131982
Clusters 5 4 Average accuracy =  91.2223833385 +- 0.57422621246
Clusters 5 5 Average accuracy =  91.2327422477 +- 0.647123712075
Clusters 5 6 Average accuracy =  91.173830005 +- 0.865509152334
Clusters 5 7 Average accuracy =  91.1356548415 +- 0.805746915127
Clusters 5 8 Average accuracy =  91.180766028 +- 0.745767577532
Clusters 5 9 Average accuracy =  91.2848483948 +- 0.764129689792
Clusters 6 3 Average accuracy =  91.3438376409 +- 0.583246977411
Clusters 6 4 Average accuracy =  91.2882965579 +- 0.617418618186
Clusters 6 5 Average accuracy =  91.2848099082 +- 0.666377116232
Clusters 6 6 Average accuracy =  91.2571139622 +- 0.836353250627
Clusters 6 7 Average accuracy =  91.2258796199 +- 0.786574554884
Clusters 6 8 Average accuracy =  91.225877207 +- 0.90181195039
Clusters 6 9 Average accuracy =  91.3819417895 +- 0.62417960537

Accuracy hasn't changed much, but the deviation went up again. We attribute this to the fact that both are categorical attributes that were not previously one-hot encoded but were instead treated as numeric codes during the KMeans process, which may not have been adequate. Nevertheless, we decided to use the combination of 6 clusters for lighting and 3 for weather, since their average accuracy was 91.3 and their deviation was the lowest at 0.58.

In [99]:
n_lgt=6
n_wthr=3

X6 = df_accident[['STATE', 'PERSONS', 'PERMVIT', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'HARM_EV_RED', 'DRUNK_DR']]

cls_lgt = KMeans(n_clusters=n_lgt, init='k-means++',random_state=1)
cls_lgt.fit(X5)
newfeature_lgt = cls_lgt.labels_

# append on the second clustering
cls_wthr = KMeans(n_clusters=n_wthr, init='k-means++',random_state=1)
cls_wthr.fit(X6)
newfeature_wthr = cls_wthr.labels_ # the labels from kmeans clustering

plt.figure()
plt.subplot(1,2,1)
X6=X6.values
plt.scatter(X6[:, 7], X6[:, 1]+np.random.random(X6[:, 1].shape)/2, c=newfeature_lgt, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Lighting Condition')
plt.grid()

plt.subplot(1,2,2)
plt.scatter(X6[:, 7], X6[:, 8]+np.random.random(X6[:, 1].shape)/2, c=newfeature_wthr, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Alcohol Involved')
plt.grid()

X5=X5.values
plt.figure()
plt.scatter(X5[:, 0], X5[:, 1]+np.random.random(X5[:, 0].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)  # colored by the PERSONS/PERMVIT cluster labels computed earlier
plt.xlabel('Lighting Condition'), plt.ylabel('WEATHER')
plt.grid()

None of the graphs above display clusterings that can be separated at plain sight; the visuals are confusing and do not allow us to draw a solid conclusion. We did not know what to expect when performing KMeans on the two categorical attributes treated as numeric codes.

Only the plot of WEATHER versus alcohol involvement displays a more defined clustering, in which one big green cluster can be identified. We can only say that weather and alcohol do not seem to behave very differently as the lighting condition changes.

There is nothing we can conclude just from looking at the clusters from these two attributes together.

Therefore, we wanted to see whether changing the number of clusters could produce more interesting results: we reduced the number of clusters for lighting from 6 to 5 and increased the number of clusters for weather from 3 to 4.

In [100]:
n_lgt=5
n_wthr=4

X6 = df_accident[['STATE', 'PERSONS', 'PERMVIT', 'VE_TOTAL', 'DAY_WEEK', 'HR_RANGE', 
                 'MAN_COLL', 'HARM_EV_RED', 'DRUNK_DR']]

cls_lgt = KMeans(n_clusters=n_lgt, init='k-means++',random_state=1)
cls_lgt.fit(X5)
newfeature_lgt = cls_lgt.labels_

# append on the second clustering
cls_wthr = KMeans(n_clusters=n_wthr, init='k-means++',random_state=1)
cls_wthr.fit(X6)
newfeature_wthr = cls_wthr.labels_ # the labels from kmeans clustering

plt.figure()
plt.subplot(1,2,1)
X6=X6.values
plt.scatter(X6[:, 7], X6[:, 1]+np.random.random(X6[:, 1].shape)/2, c=newfeature_lgt, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Lighting Condition')
plt.grid()

plt.subplot(1,2,2)
plt.scatter(X6[:, 7], X6[:, 8]+np.random.random(X6[:, 1].shape)/2, c=newfeature_wthr, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Alcohol Involved')
plt.grid()

# X5 was already converted to a NumPy array in the previous cell
plt.figure()
plt.scatter(X5[:, 0], X5[:, 1]+np.random.random(X5[:, 0].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)  # colored by the PERSONS/PERMVIT cluster labels computed earlier
plt.xlabel('Lighting Condition'), plt.ylabel('WEATHER')
plt.grid()

Changing the number of clusters did not return a clearer picture, except when lighting and alcohol involvement are graphed together; for the most part we can again conclude that, even under different lighting conditions, accidents involving alcohol or drunk drivers remain pretty much part of the same cluster. We were expecting to see a greater difference in this respect, but did not.

KMeans with df_acc_dum

As previously mentioned, we will now try to cluster a different dataframe, this time containing one-hot encoded attributes.
We would like to test again how the one-hot encoded versions of LGT_COND and WEATHER behave when used with KMeans.

In [101]:
X_1 = df_acc_dum[['LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2',
                 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12']]
X_2 = df_acc_dum[['STATE', 'PERSONS', 'PERMVIT', 'VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'DRUNK_DR']]

params = []
for n_lgt in range(3,7):
     for n_wthr in range(3,10):
    
        # get the first clustering
        cls_lgt = KMeans(n_clusters=n_lgt, init='k-means++',random_state=1)
        cls_lgt.fit(X_1)
        newfeature_lgt = cls_lgt.labels_ # the labels from kmeans clustering

        # append second clustering
        cls_wthr = KMeans(n_clusters=n_wthr, init='k-means++',random_state=1)
        cls_wthr.fit(X_2)
        newfeature_wthr = cls_wthr.labels_ # the labels from kmeans clustering

        y = df_acc_dum['FATALS']
        X = df_acc_dum[['STATE', 'PERSONS', 'PERMVIT', 'VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'DRUNK_DR']]
        # Note: unlike the earlier cells, the cluster labels (newfeature_lgt, newfeature_wthr)
        # are never appended to X here, so every iteration trains the classifier on identical features.
        

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((n_lgt, n_wthr,acc.mean()*100,acc.std()*100)) # save state

        print ("Clusters",n_lgt, n_wthr, "Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
        
        
Clusters 3 3 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 3 4 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 3 5 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 3 6 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 3 7 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 3 8 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 3 9 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 3 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 4 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 5 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 6 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 7 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 8 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 4 9 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 3 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 4 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 5 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 6 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 7 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 8 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 5 9 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 3 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 4 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 5 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 6 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 7 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 8 Average accuracy =  90.4867684823 +- 0.389364377398
Clusters 6 9 Average accuracy =  90.4867684823 +- 0.389364377398

Using the two one-hot encoded attributes this way did not work out: we got exactly the same accuracy and deviation on every iteration. Looking back at the code, the reason is that the cluster labels (newfeature_lgt and newfeature_wthr) were computed but never appended to the feature matrix X, so the Random Forest was trained on identical inputs regardless of the number of clusters.
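
For reference, the missing step is the same one used in the earlier cells: inside the loop, after computing both sets of labels, the following line would append them to X (a sketch, not re-run here):

In [ ]:
X = np.column_stack((X, pd.get_dummies(newfeature_lgt), pd.get_dummies(newfeature_wthr)))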

Instead, we will apply KMeans to two continuous variables, this time using the dataframe df_acc_dum, in which the categorical variables are one-hot encoded.

In [102]:
X_1 = df_acc_dum[['PERSONS', 'PERMVIT']]
X_2 = df_acc_dum[['STATE', 'VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2',
                 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'DRUNK_DR']]


params = []
for n_person in range(3,5):
    for n_permvit in range(3,10):
        # get the first clustering
        cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
        cls_person.fit(X_1)
        newfeature_person = cls_person.labels_ # the labels from kmeans clustering

        # append on the second clustering
        cls_permvit = KMeans(n_clusters=n_permvit, init='k-means++',random_state=1)
        cls_permvit.fit(X_2)
        newfeature_permvit = cls_permvit.labels_ # the labels from kmeans clustering

        y = df_acc_dum['FATALS']
        X = df_acc_dum[['STATE', 'VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2',
                 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'DRUNK_DR']]
        X = np.column_stack((X,pd.get_dummies(newfeature_person),pd.get_dummies(newfeature_permvit)))

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((n_person,n_permvit,acc.mean()*100,acc.std()*100)) # save state

        print ("Clusters",n_person,n_permvit,"Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
        
Clusters 3 3 Average accuracy =  91.1148046828 +- 0.350919753096
Clusters 3 4 Average accuracy =  91.1876755714 +- 0.305644692967
Clusters 3 5 Average accuracy =  91.1148263641 +- 0.493990625566
Clusters 3 6 Average accuracy =  91.2847870192 +- 0.312010891088
Clusters 3 7 Average accuracy =  91.4444140871 +- 0.311303990711
Clusters 3 8 Average accuracy =  91.3819899404 +- 0.445075758891
Clusters 3 9 Average accuracy =  91.2536526169 +- 0.742869101888
Clusters 4 3 Average accuracy =  91.222397776 +- 0.480462235568
Clusters 4 4 Average accuracy =  91.3369112521 +- 0.460393906905
Clusters 4 5 Average accuracy =  91.3993101294 +- 0.475598455941
Clusters 4 6 Average accuracy =  91.4895072055 +- 0.329512800055
Clusters 4 7 Average accuracy =  91.5034129423 +- 0.341523726666
Clusters 4 8 Average accuracy =  91.5658647663 +- 0.408332817988
Clusters 4 9 Average accuracy =  91.3924379226 +- 0.77310470876

When performing KMeans on PERSONS and PERMVIT, df_accident (no one-hot encoded attributes) and df_acc_dum (with one-hot encoded attributes) point to the same choice of clusters, 3 and 4, which combines an accuracy of around 91% with the lowest deviation in both cases. Clustering with df_acc_dum, however, achieved a tighter deviation of about 0.31, compared to about 0.54 for df_accident.

Therefore, for the rest of the analysis we will use only the dataframes with one-hot encoded attributes: df_acc_dum, and df_r_dumm, in which the STATE attribute was replaced with the derived REGION attribute.

First, we will use the best-performing combination of clusters, 3 for PERSONS and 4 for PERMVIT, to visualize.

In [103]:
n_person=3
n_permvit=4

X_2 = df_acc_dum[['STATE', 'VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2',
                 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'DRUNK_DR']]



cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
cls_person.fit(X_1)
newfeature_person = cls_person.labels_ # the labels from kmeans clustering

# append on the second clustering
cls_permvit = KMeans(n_clusters=n_permvit, init='k-means++',random_state=1)
cls_permvit.fit(X_2)
newfeature_permvit = cls_permvit.labels_ # the labels from kmeans clustering

plt.figure()
plt.subplot(1,2,1)
X_2=X_2.values
plt.scatter(X_2[:, 7], X_2[:, 1]+np.random.random(X_2[:, 1].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Persons Involved')
plt.grid()

plt.subplot(1,2,2)
plt.scatter(X_2[:, 7], X_2[:, 8]+np.random.random(X_2[:, 1].shape)/2, c=newfeature_permvit, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('WEATHER'), plt.ylabel('Alcohol Involved')
plt.grid()

X_1=X_1.values
plt.figure()
plt.scatter(X_1[:, 0], X_1[:, 1]+np.random.random(X_1[:, 0].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('People Involved'), plt.ylabel('People in Vehicle')
plt.grid()

Using df_acc_dum was a good choice, as the clusters are more easily distinguishable and better defined than when the data has no one-hot encoded attributes. It is evident that 3 clusters work well for PERSONS and PERMVIT: one cluster (green) covers roughly 0 to 2 people in the vehicle and people involved in the accident (not vehicle occupants); a second cluster (purple) covers roughly 4 to 6; and a third cluster (red) covers about 6 to 12.

Therefore, at this point we decided to proceed considering only those dataframes we have created that contain one-hot encoded attributes:

  • df_acc_dum, which includes the attribute STATE in its original form
  • df_r_dumm, which contains the newly created attribute REGION (derived from STATE) that was one-hot encoded also.

After getting the accuracy and deviation results from df_r_dumm, we will decide whether agglomerative clustering and DBSCAN should be run on both dataframes or only on the one that provides better results: higher accuracy and lower deviation.

KMeans with df_r_dumm

This dataframe contains REGION, a one-hot encoded attribute with 4 levels derived from STATE.

We will try this dataframe with the same continuous attributes PERSONS and PERMVIT to compare.

In [ ]:
X_1 = df_r_dumm[['PERSONS', 'PERMVIT']]
X_2 = df_r_dumm[['VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 
                 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2', 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 
                 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'REGION_1', 'REGION_2', 'REGION_3', 'REGION_4', 'DRUNK_DR']]


params = []
for n_person in range(3,5):
    for n_permvit in range(3,5):
        # get the first clustering
        cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
        cls_person.fit(X_1)
        newfeature_person = cls_person.labels_ # the labels from kmeans clustering

        # append on the second clustering
        cls_permvit = KMeans(n_clusters=n_permvit, init='k-means++',random_state=1)
        cls_permvit.fit(X_2)
        newfeature_permvit = cls_permvit.labels_ # the labels from kmeans clustering

        y = df_r_dumm['FATALS']
        X = df_r_dumm[['VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 
                 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2', 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 
                 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'REGION_1', 'REGION_2', 'REGION_3', 'REGION_4', 'DRUNK_DR']]
        X = np.column_stack((X,pd.get_dummies(newfeature_person),pd.get_dummies(newfeature_permvit)))

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((n_person,n_permvit,acc.mean()*100,acc.std()*100)) # save state

        print ("Clusters",n_person,n_permvit,"Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
        

Using REGION instead of STATE, together with the one-hot encoded attributes, made a difference mainly in the deviation: overall accuracy did not change much, but the deviation dropped dramatically. The best-performing clusters in this case are 3 and 5, with an accuracy of 91.3 and the lowest deviation at 0.16. We will use this combination of clusters to visualize next.

In [105]:
n_person=3
n_permvit=5

X_2 = df_r_dumm[['VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 
                 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2', 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 
                 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'REGION_1', 'REGION_2', 'REGION_3', 'REGION_4', 'DRUNK_DR']]



cls_person = KMeans(n_clusters=n_person, init='k-means++',random_state=1)
cls_person.fit(X_1)
newfeature_person = cls_person.labels_ # the labels from kmeans clustering

# append on the second clustering
cls_permvit = KMeans(n_clusters=n_permvit, init='k-means++',random_state=1)
cls_permvit.fit(X_2)
newfeature_permvit = cls_permvit.labels_ # the labels from kmeans clustering

plt.figure()
plt.subplot(1,2,1)
X_2=X_2.values
plt.scatter(X_2[:, 0], X_2[:, 7]+np.random.random(X_2[:, 7].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)  # column 0: VE_TOTAL, column 7: DAY_WEEK_7 (Saturday)
plt.xlabel('Vehicles Involved'), plt.ylabel('Saturday')
plt.grid()

plt.subplot(1,2,2)
plt.scatter(X_2[:, 7], X_2[:, 46]+np.random.random(X_2[:, 7].shape)/2, c=newfeature_permvit, cmap=plt.cm.rainbow, s=20, linewidths=0)  # column 7: DAY_WEEK_7 (Saturday), column 46: DRUNK_DR
plt.xlabel('Saturday'), plt.ylabel('Alcohol Involved')
plt.grid()

X_1=X_1.values
plt.figure()
plt.scatter(X_1[:, 0], X_1[:, 1]+np.random.random(X_1[:, 0].shape)/2, c=newfeature_person, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('People Involved'), plt.ylabel('People in Vehicle')
plt.grid()

Agglomerative Clustering and DBSCAN

Now that we have decided which dataframe to use, df_r_dumm, we will perform agglomerative clustering and DBSCAN on it, reusing the feature set X_2 defined above.

The linkage parameters we will compare are ward, complete, and average, to see which performs best for the data at hand. Also, because of the time it takes to run the clusterings when using all of the data, we reduced the number of clusters tried to range(3, 5), i.e., 3 and 4.

In [107]:
%%time 

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
import numpy as np

X_1 = df_r_dumm[['PERSONS', 'PERMVIT']]

params= []
for link in ['ward', 'complete', 'average']:
    for n_person in range(3,5):
        

            # append on the clustering
            cls_person = AgglomerativeClustering(n_clusters=n_person, linkage=link)
            cls_person.fit(X_2)
            newfeature_person = cls_person.labels_ # the labels from agglomerative clustering
        
        

            y = df_r_dumm['FATALS']
            X = df_r_dumm[['VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 
                 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2', 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 
                 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'REGION_1', 'REGION_2', 'REGION_3', 'REGION_4', 'DRUNK_DR']]
            X = np.column_stack((X,pd.get_dummies(newfeature_person)))

            acc = cross_val_score(clf,X,y=y,cv=cv)
            params.append((n_person,acc.mean()*100,acc.std()*100)) # save state

            print ("C=",n_person, link,"Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
C= 3 ward Average accuracy =  91.3403136646 +- 0.398750618492
C= 4 ward Average accuracy =  91.3437918979 +- 0.338833702568
C= 3 complete Average accuracy =  91.3299150586 +- 0.398908843548
C= 4 complete Average accuracy =  91.3437931056 +- 0.376177761137
C= 3 average Average accuracy =  91.3507363423 +- 0.39869253004
C= 4 average Average accuracy =  91.357667542 +- 0.373444990436
CPU times: user 12min 54s, sys: 5min 38s, total: 18min 33s
Wall time: 24min 15s

DBSCAN

We will test eps values of 0.1, 1 and 5, and keep min_samples in range(5, 8), i.e., 5 to 7.

In [ ]:
%%time 

from sklearn.cluster import DBSCAN
from sklearn import metrics

params = []
for eps in [0.1, 1, 5]:
    for mpts in range(5,8):

        # append on the clustering
        cls_person = DBSCAN(eps=eps, min_samples=mpts)
        cls_person.fit(X_2)
        newfeature_person = cls_person.labels_ # the labels from DBSCAN clustering

        y = df_r_dumm['FATALS']
        X = df_r_dumm[['VE_TOTAL', 'DAY_WEEK_1', 'DAY_WEEK_2', 'DAY_WEEK_3', 'DAY_WEEK_4' , 
                  'DAY_WEEK_5', 'DAY_WEEK_6', 'DAY_WEEK_7', 'HR_RANGE_1', 'HR_RANGE_2', 'HR_RANGE_3', 'HR_RANGE_4', 
                  'HR_RANGE_5','HR_RANGE_6', 'MAN_COLL_0','MAN_COLL_1', 'MAN_COLL_2','MAN_COLL_6', 'MAN_COLL_7', 
                  'MAN_COLL_8', 'MAN_COLL_9', 'MAN_COLL_11', 'HARM_RED_1','HARM_RED_2', 'HARM_RED_3', 
                  'HARM_RED_4', 'HARM_RED_5', 'HARM_RED_6', 'LGT_COND_1', 'LGT_COND_2', 'LGT_COND_3', 
                 'LGT_COND_4', 'LGT_COND_5',  'WEATHER_1', 'WEATHER_2', 'WEATHER_3', 'WEATHER_4', 'WEATHER_5', 
                 'WEATHER_6', 'WEATHER_10', 'WEATHER_11', 'WEATHER_12', 'REGION_1', 'REGION_2', 'REGION_3', 'REGION_4', 'DRUNK_DR']]
        X = np.column_stack((X,pd.get_dummies(newfeature_person)))

        acc = cross_val_score(clf,X,y=y,cv=cv)
        params.append((eps,mpts,acc.mean()*100,acc.std()*100)) # save state

        print (eps,mpts,"Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
0.1 5 Average accuracy =  91.3542145756 +- 0.358016967876
0.1 6 Average accuracy =  91.3299186617 +- 0.349985661518
0.1 7 Average accuracy =  91.3299162613 +- 0.379072725774
1 5 Average accuracy =  91.305626361 +- 0.387867452108

Evaluation

In [ ]:
n_clusters = 5

# These external metrics need reference labels; FATALS is used here only as a rough stand-in
# (an assumption), since FARS provides no true cluster labels.
labels_true = df_r_dumm['FATALS'].values

# Fit both models; the DBSCAN clustering is the one evaluated below
model_km = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1).fit(X_2)
model = DBSCAN(eps=0.1, min_samples=10).fit(X_2)

labels = model.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
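
Note that homogeneity, completeness, and the adjusted Rand index all measure agreement with a set of reference labels. With FATALS as the stand-in, they describe how well the clusters line up with the number of fatalities in each crash rather than with any true cluster structure, so these scores should be read as a rough indication only.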

Visualizations

In [ ]:
data = X_2

cls = DBSCAN(eps=0.1, min_samples=6)
cls.fit(data)
dbs_labels = cls.labels_ 

cls = AgglomerativeClustering(n_clusters=14, linkage='complete')
cls.fit(data)
hac_labels = cls.labels_ 

cls = KMeans(n_clusters=3, random_state=1)
cls.fit(data)
kmn_labels = cls.labels_

fig = plt.figure(figsize=(12,8))
title = ['DBSCAN','HAC','KMEANS']

for i,l in enumerate([dbs_labels,hac_labels,kmn_labels]):

    plt.subplot(3,2,2*i+1)
    # Color by the labels of the method being plotted (l); column 0: VE_TOTAL, column 7: DAY_WEEK_7 (Saturday)
    plt.scatter(X_2[:, 0], X_2[:, 7]+np.random.random(X_2[:, 7].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('Vehicles Involved'), plt.ylabel('Saturday')
    plt.grid()
    plt.title(title[i])

    plt.subplot(3,2,2*i+2)
    # Column 7: DAY_WEEK_7 (Saturday), column 46: DRUNK_DR
    plt.scatter(X_2[:, 7], X_2[:, 46]+np.random.random(X_2[:, 7].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('Saturday'), plt.ylabel('Alcohol Involved')
    plt.grid()
    plt.title(title[i])
    
    


plt.tight_layout()
plt.show()

Summary

We did not realize how much time and computing resources each analysis would take when using all of our data, especially when running agglomerative clustering. This limited how much we could experiment, as the average run time for agglomerative clustering was about 30 minutes.

One interesting finding was that performing KMeans on two one-hot encoded attributes produced a variety of clusters from which it was almost impossible to draw a conclusion.

It was interesting to see the clusters that came out of the KMeans modeling when plotting WEATHER against PERSONS (not vehicle occupants) and WEATHER against DRUNK_DR, because we were expecting 3 to 5 clusters that would give clearer insight into how these attributes could have played a role in an event with at least one fatality.

It was also interesting to find that alcohol and weather were not as dependent as we thought; we expected the combination of unusual weather and a drunk driver to stand out, but what we got was one big cluster that did not provide any additional or interesting information.

Persons involved in an accident that were not vehicle occupants can be higher than the number of persons in the vehicle but not the other way around.

Using DBSCAN with different eps and min_samples values did not have much impact on accuracy, but we were surprised to see that a low eps of 0.1 and a high one of 5 returned lower deviations than an eps of 1, which we expected to fall between the two. The most interesting thing is that, with an eps of 5 and min_samples ranging from 5 to 7, every iteration returned the same deviation.

Nonetheless, the model that performed best from our standpoint, given that our measure of success was high accuracy and low deviation, was the one that used the df_r_dumm dataframe.

Deployment

How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?

Our K-Means-based model is useful for helping predict whether the circumstances of a car
accident are associated with one or more fatalities. The issue with our model is that our data
is highly correlated: every entry in the data set is a crash that resulted in at least one fatality.
It is still useful for checking which variables may have contributed to the fatality and how they
relate to the accident.
The FARS dataset is one that many departments of transportation can use with various
clustering methods to study fatalities based on variables such as weather conditions, lighting,
etc. Even though the prediction rate may be overstated due to the high correlation, it still
shows the conditions under which fatalities occur, so if those conditions can be mitigated,
fatalities could be reduced.
Due to the size of the dataset, the agglomerative clustering method takes a long time and has
about the same accuracy as K-Means, so K-Means is the less time-consuming option with the
entire data set.

How would you deploy your model for interested parties?

This model would best be deployed through a web portal where various variables could be
selected when running the K-Means clustering model; however, experienced users should be
the ones to run the model, as they will have the knowledge to change variables as needed.
Knowing the data will be key to deploying this model.

What other data should be collected?

Additional data capturing accidents that did not result in a fatality would be a great addition
to the dataset. This should reduce the correlation and provide a better prediction model. In
addition, regional information could be captured in order to cluster based on weather
conditions, terrain, or even "hot spots" that are more prone to accidents. The FARS data is
reported data, but if more "automatic" collection methods were available, the model could be
used closer to real time instead of waiting for reported data.

How often would the model need to be updated, etc.?

The model should be updated as new data is obtained. In addition to attributes surrounding a
fatal accident, safety features of the automobiles should be captured. This may also help in
predicting a fatality, even together with outside conditions. For example, if an automobile has
side curtain airbags that deployed but the crash still resulted in a fatality 20% of the time, then
this safety feature may need to be modified. Or, if crashes involving high-end safety features
consistently show no fatalities, the model could focus on vehicles without those features.