[Project 2] Day 8: Understanding Encoding of data

Today while working on the police-shooting data, I learnt about encoding of data.

  • Data encoding is the process of converting data from one form to another. We usually perform encoding for the purposes of transmission, storage, or analysis.
  • By the process of encoding, we can:
    • Prepare data for analysis by transforming it into a suitable format that can be processed by models and/or algorithms.
    • Create features by extracting relevant information from data and creating new variables to improve the accuracy of analysis.
    • Compress data by reducing its size or complexity without reducing its quality.
    • Encrypt the data so that we can prevent unauthorized access.
  • There are many types of encoding techniques used in data analysis; the few which I learnt about are:
    • One-hot encoding
    • Label encoding
    • Binary encoding
    • Hash encoding
    • Feature scaling
  • One-hot encoding is a technique to convert categorical variables to numerical. In this technique we create new variables that take on the values 0 and 1 to represent the original categorical values.
  • Label encoding is also a method to convert categorical variables to numerical type. The difference is that each categorical value is assigned an integer based on alphabetical order (see the sketch after this list).
  • Binary Encoding is a technique for encoding categorical variables with a large number of categories, which can pose a challenge for one-hot encoding or label encoding. Binary encoding converts each category into a binary code of 0s and 1s, where the length of the code is equal to the number of bits required to represent the number of categories.
  • Hash encoding is a technique for encoding categorical variables with a very high number of categories, which can pose a challenge for binary encoding or other encoding techniques.
  • Feature scaling is a technique for encoding numerical variables, which are variables that have continuous or discrete numerical values. For example, age, height, weight, or income are numerical variables.
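
A minimal sketch of one-hot encoding, label encoding, and feature scaling with pandas and scikit-learn; the column names and values below are made up for illustration and are not taken from the police-shooting data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame with one categorical and one numerical column (made-up values).
df = pd.DataFrame({
    "armed_with": ["gun", "knife", "unarmed", "gun"],
    "age": [34, 21, 45, 28],
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["armed_with"], prefix="armed")

# Label encoding: each category mapped to an integer, in alphabetical order.
df["armed_label"] = LabelEncoder().fit_transform(df["armed_with"])

# Feature scaling: standardize the numerical column to mean 0 and unit variance.
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()

print(pd.concat([df, one_hot], axis=1))
```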

[Project 2] Day 7: Intro to K-means, K-medoids and DBSCAN clustering

Today I learnt about the K-means, K-medoids and DBSCAN clustering methods.

  • K-means is a nonhierarchical clustering method. You tell it how many clusters you want, and it tries to find the “best” clustering.
  • “K means” refers to the following:
    1. The number of clusters you specify (K).
    2. The process of assigning observations to the cluster with the nearest center (mean).
  • The drawbacks of K-means are as follows:
    1. Sensitivity to initial conditions
    2. Difficulty in determining K
    3. Inability to handle categorical data
    4. Time complexity
  • K-medoids clustering is a variant of K-means that is more robust to noise and outliers.
  • Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is also a clustering algorithm.
  • Although it is an old algorithm (published in 1996) it is still used today because it is versatile and generates very high-quality clusters, all the points which don’t fit being designated as outliers.
  • There are two hyper-parameters in DBSCAN:
    1. epsilon: A distance measure that will be used to locate the points/to check the density in the neighborhood of any point.
    2. minPts: Minimum number of data points to define a cluster.
  • Hierarchical DBSCAN (HDBSCAN) is a more recent algorithm that essentially replaces the epsilon hyperparameter of DBSCAN with a more intuitive one called ‘min_cluster_size’ (a short scikit-learn sketch of K-means and DBSCAN follows this list).
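
A minimal sketch of K-means and DBSCAN using scikit-learn on synthetic blob data; the hyperparameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three dense blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# K-means: K must be chosen up front; each point is assigned to the nearest centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("K-means cluster labels:", np.unique(kmeans.labels_))

# DBSCAN: no K, but we tune eps (neighborhood radius) and min_samples (minPts).
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN cluster labels (-1 marks noise/outliers):", np.unique(dbscan.labels_))
```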

[Project 2] Day 6: Logistic Regression (Contd.)

Today, I read up further on Logistic Regression.

  • Logistic Regression is divided into three main types: Binary Logistic Regression, Ordinal Logistic Regression, and Multinomial Logistic Regression.
  • Binary Logistic Regression: 
    The most common of the three logistic regression types, Binary Logistic Regression, is used when the dependent variable is binary. It can only assume two possible outcomes.

    Examples:

    • Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
    • Evaluating the risk of cancer: Outcome = high or low.
    • Predicting a team’s win in a football match: Outcome = yes or no.
  • Ordinal Logistic Regression:
    The second type of logistic regression, Ordinal Logistic Regression, is employed when the dependent variable is ordinal. An ordinal variable can be logically ordered, but the intervals between the values are not necessarily equally spaced.

    Examples

    • Predicting whether a student will join a college, vocational/trade school, or corporate industry.
    • Estimating the type of food consumed by pets, the outcome may be wet food, dry food, or junk food.
  • Multinomial Logistic Regression:
    Multinomial Logistic Regression is the third type of logistic regression. It is utilized when the dependent variable is nominal and includes more than two levels with no order or priority (see the sketch at the end of this entry).

    Examples

    • Formal shirt size: Outcomes = XS/S/M/L/XL
    • Survey answers: Outcomes = Agree/Disagree/Unsure
    • Scores on a math test: Outcomes = Poor/Average/Good
  • Logistic Regression best practices are:
    1. Identify dependent variables to ensure the model’s consistency.
    2. Discover the technical requirements of the model.
    3. Estimate the model and evaluate the goodness of the fit.
    4. Appropriately interpret the results.
    5. Validate the observed results.
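
As a rough illustration of the multinomial case, here is a minimal scikit-learn sketch on the Iris dataset, chosen only because its target is nominal with three classes; it is unrelated to the project data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris has a nominal target with three classes, which suits multinomial logistic regression.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With more than two classes, scikit-learn fits a multinomial model by default.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
print("Class probabilities for one flower:", model.predict_proba(X_test[:1]))
```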

[Project 2] Day 5: Intro to Logistic Regression

Today I started learning logistic regression:

  • Logistic regression is used when analyzing datasets in which one or more independent variables determine a categorical outcome.
  • It is primarily used for binary classification problems, where the goal is to predict an outcome, such as whether an email is spam or not spam, whether a customer will buy a product or not, or whether a student will pass or fail an exam.
  • This type of statistical model (also known as logit model) is often used for classification and predictive analytics.
  • Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables.
  • Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
  • In logistic regression, a logit transformation is applied on the odds—that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of odds, and this logistic function is represented by the following formulas:

logit(p_i) = ln(p_i/(1 - p_i)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k

p_i = 1/(1 + exp(-(Beta_0 + Beta_1*X_1 + … + Beta_k*X_k)))

  • In this logistic regression equation, logit(p_i) is the dependent or response variable and the X's are the independent variables.
  • The beta parameter, or coefficient, in this model is commonly estimated via maximum likelihood estimation (MLE).
  • This method tests different values of beta through multiple iterations to optimize for the best fit of log odds.
  • All of these iterations produce the log likelihood function, and logistic regression seeks to maximize this function to find the best parameter estimate.
  • Once the optimal coefficient (or coefficients if there is more than one independent variable) is found, the conditional probabilities for each observation can be calculated, logged, and summed together to yield a predicted probability.
  • For binary classification, a probability less than .5 will predict 0 while a probability greater than .5 will predict 1 (see the sketch after this list).
  • After the model has been computed, it’s best practice to evaluate how well the model predicts the dependent variable, which is called goodness of fit.
  • The Hosmer–Lemeshow test is a popular method to assess model fit.
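
A minimal sketch of fitting a binary logistic regression by maximum likelihood with statsmodels, on synthetic data rather than the project data.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary-outcome data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))   # true logistic relationship
y = rng.binomial(1, p)

# Fit via maximum likelihood; the summary reports the log-likelihood and the betas.
X = sm.add_constant(x)
model = sm.Logit(y, X).fit()
print(model.summary())

# Predicted probabilities; threshold at 0.5 for a binary prediction.
pred_prob = model.predict(X)
pred_class = (pred_prob >= 0.5).astype(int)
print("Training accuracy:", (pred_class == y).mean())
```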

[Project 2] Day 4: Understanding Bias

Today I learnt about statistical Bias.

  • When the data does not give an accurate representation of the population from which it was collected, then we say that the data is Biased.
  • Sometimes data is flawed because the sample of people it surveys doesn’t accurately represent the population.

 A few of the common types of bias which I read about are given below:

1. Sampling Bias

Sampling bias occurs when some members of a population are more likely to be selected in a sample than others.

2. Bias in assignment

Assignment bias occurs when the way subjects or data points are assigned to groups in a study skews the results, for example when the assignment is not random and the groups differ systematically.

3.  Omitted variables

Omitted variable bias occurs when a statistical model fails to include one or more relevant variables. In other words, it means that you left out an important factor in your analysis.

4. Self-serving bias

A self-serving bias occurs because researchers and analysts tend to attribute positive effects to internal factors and negative effects to external factors. In other words, it occurs when we tend to favor a particular factor and discount the others.

[Project 2] Day 3: Intro to Cluster Analysis

Today I was introduced to clustering and cluster analysis.

  • Cluster analysis is an exploratory analysis that tries to identify structures within the data.  Cluster analysis is also called segmentation analysis or taxonomy analysis.
  • In data analytics we often have very large datasets whose observations are similar to each other; to organize them, we arrange the data into groups or ‘clusters’ based on that similarity.
  • There are various methods to perform cluster analysis; but they can be broadly classified as:
    –>  Hierarchical methods
    –> Non-hierarchical methods
  • Hierarchical methods are of two types, namely Agglomerative methods and Divisive methods.
  •  In Agglomerative methods, the observations start in their own separate cluster and the two most similar clusters are then combined. This is done repeatedly until all subjects are in one cluster. At the end, the best number of clusters is then chosen out of all cluster solutions.
  • In Divisive methods, all observations start in the same cluster. We then perform the reverse of the agglomerative strategy, splitting clusters until every subject is in a separate cluster.
  • Agglomerative methods are used more often than divisive methods, so I concentrated on the former rather than the latter (a short scipy sketch of the agglomerative approach follows this list).
  • Non-hierarchical methods are often referred to as ‘K-means clustering’. In this method, we divide a set of n observations into K clusters.
  • We use K-means clustering when we don’t have existing group labels and want to assign similar data points to the number of groups we specify (K).
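
A minimal sketch of the agglomerative (bottom-up) approach using scipy on synthetic data; Ward’s linkage is just one of several possible linkage criteria.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic data to illustrate agglomerative (bottom-up) clustering.
X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Each point starts in its own cluster; Ward's linkage repeatedly merges the two closest clusters.
Z = linkage(X, method="ward")

# Cut the resulting tree (dendrogram) into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```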

[Project 2] Day 2: Initiation of data exploration

Today I started my exploration of the ‘fatal police shootings’ data.

  • The first thing I did was load the two CSVs, namely ‘fatal-police-shootings-data’ and ‘fatal-police-shootings-agencies’, into a Jupyter notebook.
  • The ‘fatal-police-shootings-data’ dataframe has 8770 instances and 19 features, while the ‘fatal-police-shootings-agencies’ dataframe has 3322 instances and 5 features.
  • On reading the column descriptions given on GitHub, I realized that the ‘ids’ column in the ‘fatal-police-shootings-agencies’ dataframe is the same as ‘agency_ids’ in the ‘fatal-police-shootings-data’ dataframe.
  • Hence, I changed the column name from ‘ids’ to ‘agency_ids’ in the ‘fatal-police-shootings-agencies’ dataframe.
  • Next, I started to merge both CSVs on the ‘agency_ids’ column. However, I got an error which stated that I could not merge on a column with two different data types.
  • On checking the data types of the columns using the ‘.info()’ function, I learnt that in one dataframe the column was of type object, while in the other it was of type int64.
  • To rectify this, I used the ‘pd.to_numeric()’ function and ensured that both columns are of type ‘int64’.
  • Once again I started to merge the data; however, I am currently getting an error owing to the fact that in the ‘fatal-police-shootings-data’ dataframe, the ‘agency_ids’ column has multiple IDs present in a single instance (or cell).
  • I am currently trying to split these cells into multiple rows (a rough pandas sketch of the approach appears at the end of this entry).
  • Once I split the cells, I will go further into the data exploration and start the data preprocessing.
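
A rough pandas sketch of the steps described above. The file and column names follow this post, but the ‘;’ delimiter used to split the multi-ID cells is an assumption and would need to be checked against the actual data.

```python
import pandas as pd

shootings = pd.read_csv("fatal-police-shootings-data.csv")
agencies = pd.read_csv("fatal-police-shootings-agencies.csv")

# Rename the key column so both dataframes share the merge key.
agencies = agencies.rename(columns={"ids": "agency_ids"})

# Some cells hold several IDs; split them (assumed ';' delimiter) and explode into one row per ID.
shootings["agency_ids"] = shootings["agency_ids"].astype(str).str.split(";")
shootings = shootings.explode("agency_ids")

# Make both key columns the same (nullable) integer type before merging.
shootings["agency_ids"] = pd.to_numeric(shootings["agency_ids"], errors="coerce").astype("Int64")
agencies["agency_ids"] = pd.to_numeric(agencies["agency_ids"], errors="coerce").astype("Int64")

merged = shootings.merge(agencies, on="agency_ids", how="left")
print(merged.shape)
```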

[Project 2] Day 1: Understanding The Data & Formulating Questions

Today I started working on the data on fatal police shootings in the United States collected by the Washington Post.

As with the previous project, the first thing we have to identify is what questions we would like to answer. On looking at the data, six main questions stand out to me.

  1. What is the trend of fatal police shootings from 2015 to 2023?
  2. In which areas are fatal police shootings more predominant?
  3. To what extent does mental illness play a role in police shootings?
  4. Is there a gender, age or racial bias in these fatal police shootings?
  5. Are fatal shootings more common when the fugitive carries certain weapons rather than others?
  6. Do specific agencies account for a higher number of fatal shootings and, if so, which ones?

One thing to note would be that since this is ‘fatal’ police shootings data, we do not have data regarding police shootings where the fugitive survived. We also do not have data regarding the race of the cops involved in the shooting, the cause of the fugitive’s altercation with the police, or whether the shooting was justified. The lack of this data may bring about some limitations during our analysis.

[Project 1] Day 12: Writing Project Report

Today my group started working on our final analysis report on the CDC data.

  • The first thing we had to do was come up with a title which had relevance to the questions we wanted to pursue.
  • Next we jotted down the issues or questions we addressed.
  • Next we started listing our findings, which, in simple words, were the answers to the issues we previously mentioned. As each member of our group addressed different issues, there were many findings.
  • After that we started with Appendix A, which describes the methods we used to conduct the analysis. This section mentioned the plots which we made, like scatter plots, histograms, etc. It also mentioned the statistical tests which we ran, like the t-test, ANOVA, calculating Pearson's r, etc.
  • We are currently working on the last parts, Appendix B: Results and Appendix C: Code, which will show the results of the tests that were the basis of our findings and the Python code which we wrote up in Jupyter to conduct our analysis, respectively.

[Project 1] Day 11: Understanding the ANOVA Test

Today I learnt about and applied the ANOVA test to my CDC data.

  • Analysis of variance, a.k.a. the ANOVA test, is a statistical method which we use to separate the observed variance of data into different parts to use for additional tests.
  • ANOVA is helpful for testing three or more variables. Although it is similar to multiple two-sample t-tests,  ANOVA results in fewer type I errors and is appropriate for a range of issues.

    *A type I error, a.k.a. a false positive, takes place if a true null hypothesis is incorrectly rejected.
    *A type II error, a.k.a. a false negative, takes place if a false null hypothesis is incorrectly accepted, i.e. the alternative hypothesis is rejected.

  • ANOVA assesses group differences by comparing the means of each group, and involves partitioning the variance into different sources.
  • The type of ANOVA test used depends on a number of factors. It is typically applied to experimental data.
  • There are two main types of ANOVA: one-way and two-way. There are also variations of ANOVA, like MANOVA (multivariate ANOVA) and analysis of covariance (ANCOVA).
  • A one-way ANOVA is used for three or more groups of data, to gain information about the relationship between the dependent and independent variables.
  • A two-way ANOVA is similar to a one-way ANOVA, but there are two independent variables and a single dependent variable.
  • The formula for calculating the F statistic via ANOVA is:

    F = MST/MSE

    where:
    F = ANOVA coefficient (the F statistic),
    MST = mean sum of squares due to treatment, and
    MSE = mean sum of squares due to error.

  • In the CDC data I used the ANOVA test to check whether being inactive and having no health insurance have an effect on % Diabetic, and whether being obese and having no health insurance have an effect on % Diabetic (a rough scipy sketch is shown below).
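
A minimal sketch of a one-way ANOVA with scipy; the ‘% Diabetic’ column name follows this post, but the values and the inactivity grouping below are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical county-level data: % Diabetic with a made-up inactivity grouping.
cdc = pd.DataFrame({
    "% Diabetic": [8.1, 9.4, 10.2, 7.5, 11.0, 9.9, 8.8, 10.5, 9.1],
    "group": ["low_inactive", "low_inactive", "low_inactive",
              "mid_inactive", "mid_inactive", "mid_inactive",
              "high_inactive", "high_inactive", "high_inactive"],
})

# One-way ANOVA: does the mean % Diabetic differ across the three groups?
samples = [g["% Diabetic"].values for _, g in cdc.groupby("group")]
f_stat, p_value = stats.f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```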

[Project 1] Day 10: Understanding T-Tests

Today I started learning about the various T-tests and when to use them

  • A t-test is a statistical test which is used to compare the mean values of two groups. To do this we create two hypotheses:
    H0 = there is no significant difference
    H1 = there is a significant difference
    The type of data indicates when to use which type of t-test.
  • A paired t-test is used when the groups come from a single population (e.g., measuring before and after an experimental treatment), as in the case of the crab molt data.
  • A two-sample t-test, also called an independent t-test, is used when the groups come from two different populations (e.g., two different species of animal).
  • A one-sample t-test is performed when a single group is being compared against a standard value.
  • A one-tailed t-test should be performed when we want to know if the mean of one group or population is more than or less than that of the other.
  • A two-tailed t-test is performed when we only care to know whether the two groups are different from one another, regardless of direction (a scipy sketch of these tests appears below).
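
A minimal scipy sketch of the t-tests mentioned above, on synthetic data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(10, 2, size=30)            # e.g. measurements before a treatment
after = before + rng.normal(0.5, 1, size=30)   # the same subjects measured afterwards
other_group = rng.normal(11, 2, size=30)       # an independent second population

# Paired t-test: the same subjects measured twice.
print(stats.ttest_rel(before, after))

# Two-sample (independent) t-test: two different populations.
print(stats.ttest_ind(before, other_group))

# One-sample t-test: one group against a fixed standard value (here 10).
print(stats.ttest_1samp(before, popmean=10))

# One-tailed test: pass alternative="greater" (or "less") for a directional hypothesis.
print(stats.ttest_ind(other_group, before, alternative="greater"))
```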