[Project 2] Day 2: Initiation of data exploration – Advance Mathematical Statistics (MTH 522) Assignments

Today I started my exploration of the ‘fatal police shootings’ data.

The first thing I did was load the two CSVs, namely ‘fatal-police-shootings-data’ and ‘fatal-police-shootings-agencies’, into a Jupyter notebook.
The ‘fatal-police-shootings-data’ dataframe has 8770 instances and 19 features while the ‘fatal-police-shootings-agencies’ dataframe has 3322 instances and 5 features.
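A minimal sketch of loading a CSV and checking its size. The tiny in-memory table and its column names are made-up stand-ins so the snippet runs on its own; with the real files, the path to the CSV would be passed to ‘pd.read_csv()’ instead.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV (assumption: illustrative columns only).
sample_csv = io.StringIO(
    "id,name,agency_ids\n"
    "1,Person A,10\n"
    "2,Person B,11\n"
)

# With the real data this would be pd.read_csv("<path to the csv>").
shootings = pd.read_csv(sample_csv)

# .shape gives (number of instances, number of features).
n_rows, n_cols = shootings.shape
print(f"{n_rows} instances, {n_cols} features")
```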
On reading the column descriptions given on GitHub, I realized that the ‘ids’ column in the ‘fatal-police-shootings-agencies’ dataframe is the same as ‘agency_ids’ in the ‘fatal-police-shootings-data’ dataframe.
Hence, I changed the column name from ‘ids’ to ‘agency_ids’ in the ‘fatal-police-shootings-agencies’ dataframe.
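The rename can be sketched as follows. The small ‘agencies’ dataframe here is a hypothetical stand-in for the real agencies file, and ‘name’ is just an illustrative second column.

```python
import pandas as pd

# Hypothetical stand-in for the agencies dataframe.
agencies = pd.DataFrame({"ids": [10, 11], "name": ["Agency A", "Agency B"]})

# Rename the key column so both dataframes share the same column name.
agencies = agencies.rename(columns={"ids": "agency_ids"})

print(agencies.columns.tolist())
```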
Next, I tried to merge both dataframes on the ‘agency_ids’ column. However, I got an error stating that I could not merge on a column with two different data types.
On checking the data types of the columns using the ‘.info()’ method, I learnt that in one dataframe the column was of type object, while in the other dataframe it was of type int64.
To rectify this, I used the ‘pd.to_numeric()’ function to ensure that both columns are of type ‘int64’.
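A small sketch of the dtype check and conversion, using toy dataframes in which every cell holds a single id (an assumption made only to keep the example short; cells that hold more than one id would need to be split first).

```python
import pandas as pd

# Toy data: the key is stored as text in one dataframe and as integers in the other.
shootings = pd.DataFrame({"agency_ids": ["10", "11", "10"]})
agencies = pd.DataFrame({"agency_ids": [10, 11]})

# Inspect the dtypes of the key column on both sides before merging.
print(shootings["agency_ids"].dtype, agencies["agency_ids"].dtype)

# Convert the text column to a numeric dtype so the keys are comparable.
shootings["agency_ids"] = pd.to_numeric(shootings["agency_ids"])

print(shootings["agency_ids"].dtype, agencies["agency_ids"].dtype)
```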
Once again I tried to merge the data; however, I am currently getting an error because, in the ‘fatal-police-shootings-data’ dataframe, the ‘agency_ids’ column has multiple ids present in a single instance (or cell).
I am currently trying to split these cells into multiple rows.
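One possible way to do this split is with ‘str.split()’ followed by ‘explode()’. This is only a sketch on toy data: the ‘;’ separator is an assumption about how multiple ids are stored in a cell, and the dataframes and the ‘name’ column are hypothetical.

```python
import pandas as pd

# Toy data: the second record lists two agencies in a single cell.
shootings = pd.DataFrame({"id": [1, 2], "agency_ids": ["10", "11;12"]})
agencies = pd.DataFrame(
    {"agency_ids": [10, 11, 12], "name": ["Agency A", "Agency B", "Agency C"]}
)

# Turn each cell into a list of ids (assumed separator: ';').
shootings["agency_ids"] = shootings["agency_ids"].str.split(";")

# Give every id its own row; the other columns are repeated for each id.
exploded = shootings.explode("agency_ids", ignore_index=True)

# The exploded values are still text, so convert them before merging.
exploded["agency_ids"] = pd.to_numeric(exploded["agency_ids"])

# A left merge keeps every shooting record, including any without a matching agency.
merged = exploded.merge(agencies, on="agency_ids", how="left")
print(merged)
```

A record involving several agencies then appears once per agency, so any counts of incidents taken after the merge should be based on the unique ‘id’ values rather than on the number of rows.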
Once I split the cells, I will go further into the data exploration and start the data preprocessing.
