[Project 3] Day 10: Writing the Project Report

Today my group and I worked on the final parts of our project report. Below are the findings that I contributed to the report:

  • Permit Distributions over time
    • From 2009 to 2010, there was a steep increase in the number of permits issued. There was also a drop in the number of permits from 2019 to 2020.
    • The largest number of permits is issued in October.
    • When we break the permit trends down by status, the number of active permits has stayed above 20,000 since 2007. The number of expired permits has remained between 10,000 and 20,000, with a downward trend since 2015 (apart from minor upward fluctuations).
    • Since 2010, the largest number of permits issued has been for the ‘short form building permit’, followed by the ‘Electrical Permit’.
    • The most frequently declared valuations fall in the range $ -1000000 to 1000.
    • The correlation between Sq_feet and declared valuation fluctuates a lot, but it was highest between 2017 and 2019.
    • The Kruskal-Wallis test shows that there is a statistically significant difference in permit durations between cities.

[Project 3] Day 9:

To analyze the data set I did the following:

  • First, I plotted a trend chart to show how the number of permits varies over time.
  • Next, I created a chart to check the status of permits over time.
  • After this, I created a heatmap to explore the types of permits over time.
  • I also drew up a line graph of the same information.
  • I will now try to see whether there is any correlation between certain categories of permits over time; a rough sketch of the plotting steps above is included below.
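As a rough sketch of the plotting steps above (the file name and the column names issued_date, status and worktype are my assumptions here, not necessarily the actual names in the Boston permit file), something along these lines should work:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical file/column names; the real permit data may use different ones
permits = pd.read_csv("boston_building_permits.csv", parse_dates=["issued_date"])
year = permits["issued_date"].dt.year

# 1. Trend chart: number of permits issued per year
permits.groupby(year).size().plot(kind="line", marker="o", title="Permits issued per year")
plt.ylabel("Number of permits")
plt.show()

# 2. Permit status over time: one line per status value
permits.groupby([year, "status"]).size().unstack(fill_value=0).plot(title="Permit status over time")
plt.show()

# 3. Heatmap of permit types by year
type_by_year = permits.groupby([year, "worktype"]).size().unstack(fill_value=0)
sns.heatmap(type_by_year.T, cmap="viridis")
plt.show()
```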

[Project 3] Day 8: Analysing the questions to solve in the data set

Today, after completing the preprocessing of the Boston building permit data, my group decided to answer certain questions, based on the following.

  • Distribution of Permit Types:
    • Question: What is the distribution of different types of permits in the dataset?
    • Analysis: Generate a bar chart or pie chart to visualize the frequency of each permit type.
  • Temporal Trends in Permit Issuance:
    • Question: How does the number of permits vary over time?
    • Analysis: Create a time series plot or bar chart to observe monthly or yearly trends in permit issuance.
  • Geographical Patterns:
    • Question: Are there any geographical patterns in the distribution of permits?
    • Analysis: Plot the permits on a map to identify clusters or patterns based on location.

[Project 3] Day 7: Introduction to Simple Exponential Smoothing

Today I learnt about Simple Exponential Smoothing:

  • The most basic of the exponentially smoothed methods goes by the name of simple exponential smoothing (SES).
  • It is best suited for predictive modeling tasks involving data that exhibits little to no discernible long-term trends or recurring patterns.
  • The central concept behind this approach is to presume that future market trends will closely resemble recent historical patterns observed in demand data.
  • In other words, the model will primarily rely on historical data to predict future levels of demand without accounting for any potential changes or shifts in consumer behavior or broader economic factors.
  • Compared to simpler forecasting methods like naive or moving average models, the exponential smoothing model possesses certain benefits:
    •  Exponential smoothing techniques require only three inputs to operate effectively: the latest forecast, the actual value from that time period, and a smoothing constant (or weighting factor) that determines how much importance is assigned to recent data points.
    •  By using an exponential smoothing method, we can generate forecasts for future time periods based on past performance. These forecasts are deemed accurate because they take into account any discrepancies between predicted and actual outcomes.
    • When applying smoothing techniques, we give more weight to recent observations than to earlier ones, making it simpler to identify patterns within the data. This approach smooths out the inherent unpredictability of individual observations, resulting in more reliable predictions. A small worked example is given below.
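To make this concrete, here is a minimal Simple Exponential Smoothing sketch using statsmodels on a made-up demand series (the numbers are purely illustrative):

```python
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# A made-up monthly demand series, purely for illustration
demand = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2022-01-01", periods=12, freq="MS"),
)

# alpha (smoothing_level) is the weighting factor that controls how much
# importance is given to recent observations
fit = SimpleExpSmoothing(demand, initialization_method="estimated").fit(
    smoothing_level=0.3, optimized=False
)

# SES produces a flat forecast: the same level is projected forward
print(fit.forecast(3))
```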

[Project 3] Day 5: VARMA vs VARMAX

Last time I learnt about Vector Autoregression. I went further into the topic today to learn about Vector Autoregression Moving-Average (VARMA) and Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX).

  1. Vector Autoregression Moving-Average (VARMA)
    The Vector Autoregression Moving-Average (VARMA) method models the upcoming value in multiple time series by utilising the ARMA model approach. It is the generalization of ARMA to multiple parallel time series, e.g. multivariate time series.
  2. Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
    The Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) method extends the VARMA model to also include the modelling of exogenous variables. It is a multivariate version of the ARMAX method.
  3. In essence, VARMAX represents an extension of VARMA that accommodates additional variables which are determined outside the system under investigation.
  4. These “exogenous” variables are not modelled by the system itself; however, they can still influence the endogenous variables and thereby affect its behaviour.
  5. To capture this complexity, VARMAX models each variable as a linear combination of its previous values, the collective histories of all other variables, current and past errors across all variables, and possibly delayed values of the exogenous variables.
  6. By doing so, VARMAX enables the inclusion of external influences, such as long-term trends, cyclical patterns, or deliberate interventions, which could otherwise go unaccounted for in simpler VARMA frameworks.
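Below is a minimal VARMAX sketch with statsmodels on simulated data (two endogenous series and one exogenous regressor, with all values invented for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.varmax import VARMAX

rng = np.random.default_rng(0)
n = 100

# Simulated data: two endogenous series y1, y2 and one exogenous regressor x
endog = pd.DataFrame({"y1": rng.normal(size=n), "y2": rng.normal(size=n)})
exog = pd.DataFrame({"x": rng.normal(size=n)})

# order=(1, 1) gives a VARMA(1, 1); the exog argument adds the exogenous variable
model = VARMAX(endog, exog=exog, order=(1, 1))
results = model.fit(disp=False)

# Forecasting requires future values of the exogenous variable
future_x = pd.DataFrame({"x": rng.normal(size=5)})
print(results.forecast(steps=5, exog=future_x))
```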

[Project 3] Day 5: Vector Auto Regression

Today I learnt more about Vector Autoregression.

  • Vector Autoregression (VAR) is a forecasting algorithm that can be used when two or more time series influence each other. That is, the relationship between the time series involved is bi-directional.
  • VAR modeling is a multi-step process, and a complete VAR analysis involves:

    1)  Specifying and estimating a VAR model.
    2) Using inferences to check and revise the model (as needed).
    3) Forecasting.
    4) Structural analysis.

  • In a VAR model, each variable is modeled as a linear function of past lags of itself and past lags of other variables in the system.
  • VAR models differ from univariate autoregressive models because they allow feedback to occur between the variables in the model.
  • An estimated VAR model can be used for forecasting, and the quality of the forecasts can be judged, in ways that are completely analogous to the methods used in univariate autoregressive modelling.
  • Using an autoregressive (AR) modeling approach, the vector autoregression (VAR) method examines the relationships between multiple time series variables at different time steps.
  • The VAR model’s parameter specification involves providing the order of the AR(p) model, which represents the number of lagged values included in the analysis.
  • Applied to multiple parallel time series, the VAR method offers a useful tool for investigating their interdependencies; it is best suited to multivariate series without trend or seasonal components. A minimal example is sketched below.
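As a minimal example, here is a VAR fit with statsmodels on two simulated, mutually influencing series (all values are synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Simulate two series that feed back into each other
rng = np.random.default_rng(1)
n = 200
y1, y2 = np.zeros(n), np.zeros(n)
for t in range(1, n):
    y1[t] = 0.5 * y1[t - 1] + 0.2 * y2[t - 1] + rng.normal()
    y2[t] = 0.3 * y1[t - 1] + 0.4 * y2[t - 1] + rng.normal()
data = pd.DataFrame({"y1": y1, "y2": y2})

# Fit a VAR and let the AIC choose the lag order p
results = VAR(data).fit(maxlags=8, ic="aic")
print("Chosen lag order p:", results.k_ar)

# Forecast 5 steps ahead from the last p observations
print(results.forecast(data.values[-results.k_ar:], steps=5))
```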

[Project 3] Day 4: Further Inspection of Economic Dataset

Today my team and I conducted further exploration of the economic dataset from Analyze Boston.

  • The dataset provides an overview of the economic factors of Boston from January 2013 to December 2019.
  • The Tourism factors include the flight activity at Logan International Airport along with the passenger traffic.
  • The Hospitality factors include the hotel occupancy rates and average daily rates.
  • The unemployment rate and the total number of jobs come under Labor factors.
  • Pipeline development, construction costs, and square footage could be classified under Construction Factors.
  • The Real Estate factors would include housing sales volume, median housing prices, foreclosure rates, and new housing construction permits.
  • If we were to take the time as a factor, we would more than likely perform time series analysis to understand the trends in Boston’s economic growth/fall.
  • We began preprocessing the data to convert the raw data into something more usable.
  • We also removed some descriptive statistics relevant to the dataset.

[Project 3] Day 3: Time Series Forecasting contd…

I continued my study of time series forecasting. Below is what I learnt:

  • There are 11 different classical time series forecasting methods which are:
    1. Autoregression (AR)
    2. Moving Average (MA)
    3. Autoregressive Moving Average (ARMA)
    4. Autoregressive Integrated Moving Average (ARIMA)
    5. Seasonal Autoregressive Integrated Moving-Average (SARIMA)
    6. Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
    7. Vector Autoregression (VAR)
    8. Vector Autoregression Moving-Average (VARMA)
    9. Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
    10. Simple Exponential Smoothing (SES)
    11. Holt Winter’s Exponential Smoothing (HWES)
  • Out of these, the three below are the ones I read about in depth.
  • ARIMA stands for Autoregressive Integrated Moving Average.
    • The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and the residual errors at earlier time steps.
    • To make the sequence stationary, the method combines the concepts of the Moving Average (MA) and Autoregression (AR) models with a differencing pre-processing phase known as integration (I).
    • For single-variable time series with a trend but no seasonal changes, the ARIMA method works well.
  • VAR stands for Vector Autoregression.
    • Using an AR model approach, the Vector Autoregression (VAR) method models each time series’ subsequent step. In essence, it expands the AR paradigm to accommodate several time series that are parallel, such as multivariate time series.
    • The model is specified by passing the lag order p, as in VAR(p), analogous to the order of an AR(p) model.
    • Multivariate time series devoid of trend and seasonal components can benefit from this strategy.
  • Holt Winter’s Exponential Smoothing (HWES) is also called the Triple Exponential Smoothing method.
    • It models the next time step as an exponentially weighted linear function of observations at prior time steps, taking trends and seasonality into account.
    • The method is suitable for univariate time series with trend and/or seasonal components. Sketches of ARIMA and HWES fits are given below.
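As a small illustration of two of these methods, here are ARIMA and HWES (Holt-Winters) fits with statsmodels on a made-up monthly series with a trend and yearly seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Made-up monthly series: upward trend plus a summer seasonal bump and some noise
idx = pd.date_range("2019-01-01", periods=48, freq="MS")
values = np.arange(48) + 10 * np.isin(idx.month, [6, 7, 8]) + np.random.default_rng(0).normal(size=48)
series = pd.Series(values, index=idx)

# ARIMA(p, d, q): the d=1 differencing handles the trend (no seasonal terms here)
arima_fit = ARIMA(series, order=(1, 1, 1)).fit()
print(arima_fit.forecast(6))

# HWES / triple exponential smoothing: additive trend and additive yearly seasonality
hwes_fit = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
print(hwes_fit.forecast(6))
```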

[Project 3] Day 2: Intro to Time Series Forecasting

Today in class we briefly touched on the topic of time series forecasting, so I decided to go a bit deeper into it.

  • Time series forecasting is basically making scientific forecasts based on past time-stamped data.
  • It entails creating models via historical analysis and applying them to draw conclusions and inform strategic choices in the future.
  • A significant differentiator in forecasting is that the future outcome is totally unknown at the time of the task and can only be approximated through rigorous analysis and data-backed priors.
  • Time series forecasting is the practice of utilizing modeling and statistics to analyze time series data in order to produce predictions and assist with strategic decision-making.
  • Forecasts are not always accurate, and their likelihood might vary greatly, particularly when dealing with variables in time series data that fluctuate frequently and uncontrollably.
  • Still, forecasting provides information about which possible scenarios are more likely—or less likely—to materialize. Generally speaking, our estimates can be more accurate the more complete the data we have.
  • There is a subtle difference between forecasting and “prediction”, even though they are often used interchangeably. In certain sectors of the economy, forecasting may pertain to data at a certain future point in time, whereas prediction relates to future data generally.

[Project 3] Day 1: Initial observation about ‘Analyze Boston’ Dataset

Today I commenced my exploration of the ‘Economic Dataset’ from ‘Analyze Boston’.

  • There are 19 columns which are divided as follows:
    • Date: which gives 2 columns of ‘Year’ and ‘Month’
    • Tourism: Has 2 columns which give ‘Number of domestic and international passengers at Logan Airport’ and ‘Total international flights at Logan Airport’
    • Hotel Market: Has 2 columns which give ‘Hotel occupancy for Boston’ and ‘Hotel average daily rate for Boston’
    • Labor Market: Has 3 columns which give details of ‘Total Jobs’ ‘Unemployment rate for Boston’ and ‘Labor rate for Boston’
    • Real Estate Board approved development projects: Has 4 columns which give the details of ‘Number of units approved’, ‘Total development cost of approved projects’, ‘Square feet of approved projects’ and ‘Construction jobs’.
    • Real Estate (Housing): has 6 columns which give the details of ‘Foreclosure house petitions’, ‘Foreclosure house deeds’, ‘Median housing sales price’, ‘Number of houses sold’, ‘New housing construction permits’ and ‘New affordable construction permits’.
  • Since this dataset references economic groups, my first thought is that I should perform some sort of cluster analysis.
  • It may also be possible to check the relation between the ‘Tourism’ and ‘Hotel market’ as well as the relation between ‘Labor market’ and ‘Real Estate’ variables.

[Project 2] Day 12: Continuing analysis

  • To check the trend in fatal shootings, I plotted a line graph.
  • From the first plot, we can see that fatal shootings spike in March and October and reach their lowest point in December.
  • To understand the distribution of fatal shootings across the various states, I created a bar plot.
  • From the plot, it can be seen that California has the highest number of fatal shootings while Rhode Island has the lowest.
  • I plotted a histogram to check the distribution of armed fugitives of only the “White”, “Black” and “Hispanic” races and got the following result.

[Project 2] Day 11: Understanding Random Forest

Today I attempted to build a random forest model to predict mental illness based on the fatal police shootings data.

  • For both classification and regression applications, Random Forest is a potent ensemble machine learning technique that is frequently utilized. It is a member of the decision tree-based algorithm family, which is renowned for its resilience and adaptability. The unique feature of Random Forest is its capacity to reduce overfitting and excessive variance, two major issues with individual decision trees.
  • Random Forest’s technical foundation is the construction of a set of decision trees, hence the name “forest.” The steps used in Random Forest are as follows:

1. Bootstrapping: As I previously learnt, this is a technique in which we create several subsets, referred to as bootstrap samples, by randomly sampling the dataset with replacement. Random Forest uses one of these samples to train each decision tree, adding diversity.

2. Feature Randomization: Random Forest chooses a random subset of features for every tree in order to increase diversity. This guarantees that no single feature controls the decision-making process and lowers the correlation between the trees.

3. Decision Tree Construction: A customized decision tree method is used to build each tree in the forest. These trees divide the data into nodes that maximize information gain or decrease impurity based on the attributes that have been selected.

4. Voting: Random Forest uses majority voting to aggregate the predictions of individual trees for classification problems and takes the average of the tree predictions for regression tasks.
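A minimal sketch of the kind of model I attempted is below. The file name and the target column ('signs_of_mental_illness') are my assumptions; the real data would first need the encoding and missing-value handling described in earlier posts.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed: an already-encoded version of the shootings data with a binary target column
df = pd.read_csv("fatal_police_shootings_encoded.csv")
X = df.drop(columns=["signs_of_mental_illness"])
y = df["signs_of_mental_illness"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 200 trees, each grown on a bootstrap sample, with a random subset of features per split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

print(classification_report(y_test, rf.predict(X_test)))
```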

[Project 2] Day 10: Using locations and GPS Co-ordinates in Analysis

Today I decided to look into using the GPS coordinates (i.e. latitude and longitude) for analysis.

  • On looking at the data, I initially thought it would be a good idea to use the coordinates given to visualize the clusters of shootings and the cities provided in the dataframe.
  • However, on further inspection of the data, it was clear that very few records had GPS co-ordinates. If I were to use only these co-ordinates, I would not get much information.
  • So, I decided to try to use only the city and state information provided to visualize the clusters.
  • Although this would not be a precise location (since we do not have the exact address or co-ordinates), it would be possible to use a heatmap to view the overall distribution of police shootings by city or state.
  • I am currently trying to use the GeoPandas library to create a geo-heatmap of the US with regard to the fatal police shootings; a rough sketch of what I have in mind is below.
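Here is a rough sketch of the state-level “geo-heatmap” (choropleth) I have in mind. It assumes a locally saved US states shapefile with a state-abbreviation column called STUSPS; the actual file and column names will depend on where the shapefile comes from.

```python
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

shootings = pd.read_csv("fatal-police-shootings-data.csv")

# Assumed: a US states shapefile downloaded separately (e.g. from the Census Bureau),
# with a state-abbreviation column named 'STUSPS'
states = gpd.read_file("us_states.shp")

# Count shootings per state and join the counts onto the state geometries
counts = shootings["state"].value_counts().rename_axis("STUSPS").reset_index(name="shootings")
states = states.merge(counts, on="STUSPS", how="left")

# Choropleth: darker states have more fatal shootings
states.plot(column="shootings", cmap="OrRd", legend=True, missing_kwds={"color": "lightgrey"})
plt.title("Fatal police shootings by state")
plt.show()
```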

[Project 2] Day 9: Descriptive Statistics of Data

Today, while checking my analysis and what I have done so far, I realized I had not properly noted down the descriptive statistics. So, I decided to note them down in today’s blog.

  • On checking the information for the dataframe, it can be seen that there are a maximum of 8002 non-null values, which essentially indicates that there are 8002 records.

  • It can also be seen that not all features have the same number of non-null values. This indicates missing values. So, next I checked the total number of missing values and got the following result:
    From this, I observed that the ‘race’ feature had the maximum number of missing values. This is followed by ‘flee’.
  • Next, I used the ‘describe()’ function. It displayed the following:

    The descriptions of the ‘id’, ‘longitude’ and ‘latitude’ features do not help much. The description for age shows a mean of 37.209 and a standard deviation of 12.979.
  • To visualize the skewness of the age distribution I created a plot.

    The data appears to be left skewed, with the mean being 37.2. The bulk of the ages lie between 27 and 45 years.
  • On creating a bar plot to view the distribution of ‘manner of death’, it can be seen that most deaths involve shooting alone, with barely 4.2% of the victims being tasered and then shot.

  • The bar plot of the gender distribution shows that the majority of victims are male, with fewer than 1000 female victims.
  • The plot of the race distribution shows that the largest share of victims are White (41%), followed by Black (22%) and Hispanic (15%) victims. We have to remember that we have over 1000 missing values for race, so these proportions, and the apparent predominance of White victims, should be interpreted with caution.

  • On checking the statistics for the ‘weapon type’ which the victim/fugitive possessed, it shows that over 4000 of them had a gun, while the count for the other types of weapons in possession is around 1200.

[Project 2] Day 8: Understanding Encoding of data

Today while working on the police-shooting data, I learnt about encoding of data.

  • Data encoding is the process of converting data from one form to another. We usually perform encoding for purpose of transmission, storage, or analysis.
  • By the process of encoding, we can:
    • Prepare data for analysis by transforming it into a suitable format that can be processed by models and/or algorithms.
    • Create features by extracting relevant information from data and creating new variables to improve the accuracy of analysis.
    • Compress data by reducing its size or complexity without reducing its quality.
    • Encrypt the data so that we can prevent unauthorized access.
  • There are many types of encoding techniques used in data analysis, the few which I learnt are:
    • One-hot encoding
    • Label encoding
    • Binary encoding
    • Hash encoding
    • Feature scaling
  • One-hot encoding is a technique to convert categorical variables to numerical. In this technique we create new variables that take on values 0 and 1 to represent the original categorical values.
  • Label encoding is also a method to convert categorical variables to numerical type. The difference is that we assign each categorical value an integer value based on alphabetical order.
  • Binary Encoding is a technique for encoding categorical variables with a large number of categories, which can pose a challenge for one-hot encoding or label encoding. Binary encoding converts each category into a binary code of 0s and 1s, where the length of the code is equal to the number of bits required to represent the number of categories.
  • Hash encoding is a technique for encoding categorical variables with a very high number of categories, which can pose a challenge for binary encoding or other encoding techniques.
  • Feature scaling is a technique for encoding numerical variables, which are variables that have continuous or discrete numerical values. For example, age, height, weight, or income are numerical variables.
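A small sketch of a few of these techniques on a toy frame (the columns and values are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data purely for illustration
df = pd.DataFrame({
    "race": ["White", "Black", "Hispanic", "White"],
    "flee": ["Not fleeing", "Car", "Foot", "Not fleeing"],
    "age": [34, 22, 41, 58],
})

# One-hot encoding: one 0/1 column per category of 'race'
one_hot = pd.get_dummies(df, columns=["race"])

# Label encoding: each 'flee' category mapped to an integer (alphabetical order)
df["flee_encoded"] = LabelEncoder().fit_transform(df["flee"])

# Feature scaling: standardize the numerical 'age' column
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()

print(one_hot)
print(df)
```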

[Project 2] Day 7: Intro to K-means, K-medoids and DBSCAN clustering

Today I learnt about the K-means, K-medoids and DBSCAN clustering methods.

  • K-means is a nonhierarchical clustering method. You tell it how many clusters you want, and it tries to find the “best” clustering.
  • “K means” refers to the following:
    1. The number of clusters you specify (K).
    2. The process of assigning observations to the cluster with the nearest center (mean).
  • The drawbacks of K-means are as follows:
    1. Sensitivity to initial conditions
    2.  Difficulty in Determining K
    3.  Inability to handle categorical data.
    4. Time complexity
  • K-medoids clustering is a variant of K-means that is more robust to noise and outliers.
  • Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is also a clustering algorithm.
  • Although it is an old algorithm (published in 1996) it is still used today because it is versatile and generates very high-quality clusters, all the points which don’t fit being designated as outliers.
  • There are two hyper-parameters in DBSCAN:
    1. epsilon: A distance measure that will be used to locate the points/to check the density in the neighborhood of any point.
    2. minPts: Minimum number of data points to define a cluster.
  • Hierarchical DBSCAN (HDBSCAN) is a more recent algorithm that essentially replaces the epsilon hyperparameter of DBSCAN with a more intuitive one called ‘min_cluster_size’.
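A minimal comparison of K-means and DBSCAN with scikit-learn on synthetic 2-D data (the data is generated just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three blobs, purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# K-means: K (the number of clusters) must be chosen up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means cluster labels:", np.unique(kmeans.labels_))

# DBSCAN: epsilon (neighbourhood radius) and min_samples (minPts); label -1 marks outliers
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN cluster labels (-1 = noise):", np.unique(dbscan.labels_))
```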

[Project 2] Day 6: Logistic Regression (contd.)

Today, I read up further on Logistic Regression.

  • Logistic Regression is divided into three main types: Binary Logistic Regression, Ordinal Logistic Regression, and Multinomial Logistic Regression.
  • Binary Logistic Regression: 
    The most common of the three logistic regression types, Binary Logistic Regression, is used when the dependent variable is binary. It can only assume two possible outcomes.

    Examples:

    • Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
    • Evaluating the risk of cancer: Outcome = high or low.
    • Predicting a team’s win in a football match: Outcome = yes or no.
  • Ordinal Logistic Regression:
    The second type of logistic regression, Ordinal Logistic Regression, is employed when the dependent variable is ordinal. An ordinal variable can be logically ordered, but the intervals between the values are not necessarily equally spaced.

    Examples

    • Predicting whether a student will join a college, vocational/trade school, or corporate industry.
    • Estimating the type of food consumed by pets, the outcome may be wet food, dry food, or junk food.
  • Multinomial Logistic regression:
    Multinomial Logistic Regression is the third type of logistic regression. It is utilized when the dependent variable is nominal and includes more than two levels with no order or priority.

    Examples

    • Formal shirt size: Outcomes = XS/S/M/L/XL
    • Survey answers: Outcomes = Agree/Disagree/Unsure
    • Scores on a math test: Outcomes = Poor/Average/Good
  • The best practices for Logistic Regression are:
    1. Identify dependent variables to ensure the model’s consistency.
    2. Discover the technical requirements of the model.
    3. Estimate the model and evaluate the goodness of the fit.
    4. Appropriately interpret the results.
    5. Validate the observed results.

[Project 2] Day 5: Intro to Logistic Regression

Today I started learning about logistic regression:

  • Logistic regression is used when you are analyzing datasets in which there are one or more independent variables that determine a categorical outcome.
  • It is primarily used for binary classification problems, where the goal is to predict an outcome, such as whether an email is spam or not spam, whether a customer will buy a product or not, or whether a student will pass or fail an exam.
  • This type of statistical model (also known as logit model) is often used for classification and predictive analytics.
  • Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables.
  • Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
  • In logistic regression, a logit transformation is applied on the odds—that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of odds, and this logistic function is represented by the following formulas:

pi = 1/(1 + exp(-(Beta_0 + Beta_1*X_1 + … + Beta_k*X_k)))

logit(pi) = ln(pi/(1-pi)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k

  • In this logistic regression equation, logit(pi) is the dependent or response variable and the x’s are the independent variables.
  • The beta parameter, or coefficient, in this model is commonly estimated via maximum likelihood estimation (MLE).
  • This method tests different values of beta through multiple iterations to optimize for the best fit of log odds.
  • All of these iterations produce the log likelihood function, and logistic regression seeks to maximize this function to find the best parameter estimate.
  • Once the optimal coefficient (or coefficients if there is more than one independent variable) is found, the conditional probabilities for each observation can be calculated, logged, and summed together to yield a predicted probability.
  • For binary classification, a probability less than 0.5 will predict 0 while a probability greater than 0.5 will predict 1.
  • After the model has been computed, it’s best practice to evaluate how well the model predicts the dependent variable, which is called goodness of fit.
  • The Hosmer–Lemeshow test is a popular method to assess model fit.
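To tie these ideas together, here is a minimal binary logistic regression sketch with scikit-learn on simulated data (the data and variable names are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated binary-classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The coefficients (betas) are estimated by (regularized) maximum likelihood
clf = LogisticRegression().fit(X_train, y_train)
print("Intercept (beta_0):", clf.intercept_, "Coefficients:", clf.coef_)

# predict_proba gives P(y = 1); class predictions use the 0.5 threshold described above
print("P(y=1) for the first test rows:", clf.predict_proba(X_test)[:5, 1])
print("Accuracy:", clf.score(X_test, y_test))
```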

[Project 2] Day 4: Understanding Bias

Today I learnt about statistical Bias.

  • When the data does not give an accurate representation of the population from which it was collected, then we say that the data is Biased.
  • Sometimes data is flawed because the sample of people it surveys doesn’t accurately represent the population.

 A few of the common types of bias which I read about are given below:

1. Sampling Bias

Sampling bias occurs when some members of a population are more likely to be selected in a sample than others.

2. Bias in assignment

Assignment bias occurs when the way subjects or data points are assigned to groups in a study skews its results.

3.  Omitted variables

Omitted variable bias occurs when a statistical model fails to include one or more relevant variables. In other words, it means that you left out an important factor in your analysis.

4. Self-serving bias

A self-serving bias occurs because researchers and analysts tend to attribute positive effects to internal factors and negative effects to external factors. In other words, it occurs when we tend to favor particular factors and disfavor others.

[Project 2] Day 3: Intro to Cluster Analysis

Today I was introduced to clustering and cluster analysis.

  • Cluster analysis is an exploratory analysis that tries to identify structures within the data.  Cluster analysis is also called segmentation analysis or taxonomy analysis.
  • In Data Analytics we often have very large datasets whose records are in many cases similar to each other; to organize them, we arrange the data into groups or ‘clusters’ based on their similarity.
  • There are various methods to perform cluster analysis; but they can be broadly classified as:
    –>  Hierarchical methods
    –> Non-hierarchical methods
  • Hierarchical methods are of 2 types, namely Agglomerative methods and Divisive methods.
  •  In Agglomerative methods, the observations start in their own separate cluster and the two most similar clusters are then combined. This is done repeatedly until all subjects are in one cluster. At the end, the best number of clusters is then chosen out of all cluster solutions.
  • In Divisive methods, all observations start in the same cluster. We then perform the reverse of the agglomerative strategy, splitting clusters until every subject is in a separate cluster.
  • Agglomerative methods are used more often than divisive methods, so I will concentrate on the former rather than the latter.
  • The most common non-hierarchical method is ‘K-means clustering’. In this method, we divide a set of (n) observations into K clusters.
  • We use K-means clustering when we don’t have existing group labels and want to assign similar data points to the number of groups we specify (K).

[Project 2] Day 2: Initiation of data exploration

Today I started my exploration of the ‘fatal police shootings’ data.

  • The first thing I did was load the 2 CSVs, namely ‘fatal-police-shootings-data’ and ‘fatal-police-shootings-agencies’, into a Jupyter notebook.
  •  The ‘fatal-police-shootings-data’ dataframe  has 8770  instances and 19 features while the  ‘fatal-police-shootings-agencies’ dataframe has  3322  instances and 5 features.
  • On reading the column descriptions given on GitHub, I realized that the ‘ids’ column in the ‘fatal-police-shootings-agencies’ dataframe is the same as ‘agency_ids’ in the ‘fatal-police-shootings-data’ dataframe.
  • Hence, I changed the column name from ‘ids’ to ‘agency_ids’ in the ‘fatal-police-shootings-agencies’ dataframe.
  • Next, I started to merge both dataframes on the ‘agency_ids’ column. However, I got an error which stated that I could not merge on a column with 2 different data types.
  • On checking the data types of the columns using the ‘.info()’ function, I learnt that in one dataframe the column was of type object while in the other it was of type int64.
  • To rectify this, I used the ‘pd.to_numeric()’ function and ensured that both columns are of type ‘int64’.
  • Once again I started to merge the data; however, I am currently getting an error owing to the fact that in the ‘fatal-police-shootings-data’ dataframe, the ‘agency_ids’ column has multiple ids present in a single instance (or cell).
  • I am currently trying to split these cells into multiple rows; a rough sketch of what I am attempting is below.
  • Once I split the cells, I will go further into the data exploration and start the data preprocessing.
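Here is a sketch of the merge-and-split approach I am attempting. The separator inside the multi-id cells (assumed to be ';' here) and the exact column names are assumptions based on my notes above.

```python
import pandas as pd

shootings = pd.read_csv("fatal-police-shootings-data.csv")
agencies = pd.read_csv("fatal-police-shootings-agencies.csv").rename(columns={"ids": "agency_ids"})

# Cells like "73;187" hold several ids: split them and explode each id into its own row
# (';' is an assumed separator; the real file may use a different one)
shootings["agency_ids"] = shootings["agency_ids"].astype(str).str.split(";")
shootings = shootings.explode("agency_ids")

# Make both key columns the same numeric type before merging
shootings["agency_ids"] = pd.to_numeric(shootings["agency_ids"], errors="coerce")
agencies["agency_ids"] = pd.to_numeric(agencies["agency_ids"], errors="coerce")

merged = shootings.merge(agencies, on="agency_ids", how="left", suffixes=("", "_agency"))
print(merged.info())
```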

[Project 2] Day 1: Understanding The Data & Formulating Questions

Today I started working on the data on fatal police shootings in the United States collected by the Washington Post.

As with the previous project, the first thing we would have to identify is what questions we would like to answer. On looking at the data, 6 main questions stand out to me.

  1. What is the trend of fatal police shootings from 2015 to 2023?
  2.  In which areas are fatal police shootings more predominant?
  3.  To what extent does mental illness play a role in police shootings?
  4.  Is there a gender, age or racial bias in these fatal police shootings?
  5. Do fugitives carrying certain weapons experience more fatal shootings than those carrying other weapons?
  6. Do specific agencies play a role in higher fatal shootings and if so which?

One thing to note would be that since this is ‘fatal’ police shootings data, we do not have data regarding police shootings where the fugitive survived. We also do not have data regarding the race of the officers involved in the shooting, the cause of the fugitive’s altercation with the police, or whether the shooting was justified. The lack of this data may bring about some limitations during our analysis.

[Project 1] Day 12: Writing Project Report

Today my group started working on our final analysis report on the CDC data.

  • The first thing we had to do was come up with a title which had relevance to the questions we wanted to pursue.
  • Next, we jotted down the issues or questions we addressed.
  • Next, we started listing our findings, which, in simple words, were the answers to the issues we previously mentioned. As each member of our group addressed different issues, there were many findings.
  • After that, we started on Appendix A, which describes the methods we used to conduct the analysis. This section mentioned the plots which we made, like scatter plots, histograms, etc. It also mentioned the statistical tests which we ran, like the t-test, ANOVA, calculation of Pearson’s R, etc.
  • We are currently working on the last parts, which are Appendix B: Results and Appendix C: Code. These would show the ‘results’ of the tests we conducted that were the basis of our findings, and the Python code which we wrote up in Jupyter to conduct our analysis, respectively.

[Project 1] Day 11: Understanding the ANOVA Test

Today I learnt about and applied the ANOVA test to my CDC data.

  • Analysis of variance, a.k.a. the ANOVA test, is a statistical method which we use to separate the observed variance of data into different parts to use for additional tests.
  • ANOVA is helpful for testing three or more variables. Although it is similar to multiple two-sample t-tests,  ANOVA results in fewer type I errors and is appropriate for a range of issues.

    *A type I error, a.k.a. a false positive, takes place if a true null hypothesis is incorrectly rejected.
    *A type II error, a.k.a. a false negative, takes place if a false null hypothesis is incorrectly accepted, i.e. the alternative hypothesis is wrongly rejected.

  • ANOVA assesses group differences by comparing the means of each group, and involves partitioning the observed variance into its different sources.
  • The type of ANOVA test used depends on a number of factors. It is typically applied to experimental data.
  • There are two main types of ANOVA: one-way and two-way. There are also variations of ANOVA like MANOVA (multivariate ANOVA) and Analysis of Covariance (ANCOVA).
  • A one-way ANOVA is used for three or more groups of data, to gain information about the relationship between the dependent and independent variables.
  • A two-way ANOVA is similar to a one-way ANOVA, but there are two independent variables and a single dependent variable.
  • The formula for calculating the F statistic via ANOVA is:

    F = MST / MSE

    where F is the ANOVA coefficient, MST is the mean sum of squares due to treatment, and MSE is the mean sum of squares due to error.

  • In the CDC data, I used the ANOVA test to check whether being inactive and having no health insurance has an effect on % Diabetic, and whether being obese and having no health insurance has an effect on % Diabetic. A minimal example of running such a test is sketched below.
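A minimal sketch of a one-way ANOVA with scipy is below; the dataframe and column names here are hypothetical stand-ins for the merged CDC data, not the real ones.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the merged CDC frame: % diabetic plus a grouping column
rng = np.random.default_rng(0)
cdc = pd.DataFrame({
    "pct_diabetic": rng.normal(loc=9, scale=1.5, size=300),
    "group": rng.choice(["inactive_no_insurance", "inactive_insured", "active"], size=300),
})

# One group of % diabetic values per category
groups = [g["pct_diabetic"] for _, g in cdc.groupby("group")]

# One-way ANOVA: F = MST / MSE; a small p-value suggests at least one group mean differs
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```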

[Project 1] Day 10: Understanding T-Tests

Today I started learning about the various T-tests and when to use them

  • A t-test is a statistical test which is used to compare the mean values of two groups. To do this we create two hypotheses:
    H0 = there is no significant difference
    H1 = there is a significant difference
    The type of data indicates which type of t-test to use.
  • A paired t-test is used when the groups come from a single population (e.g., measuring before and after an experimental treatment), as in the case of the crab molt data.
  • A two-sample t-test, also called an independent t-test, is used when the groups come from two different populations (e.g., two different species of animal).
  • A one-sample t-test is performed when a single group is being compared against a standard value.
  • A one-tailed t-test should be performed when we want to know specifically whether the mean of one group or population is greater than (or less than) the other.
  • A two-tailed t-test is performed when we only care to know whether the two groups are different from one another, in either direction. Examples of each test are sketched below.
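Examples of each test with scipy on simulated numbers (all values are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=130, scale=10, size=30)        # e.g. pre-treatment measurements
after = before + rng.normal(loc=5, scale=3, size=30)   # same subjects measured again
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# Paired t-test: the two sets of measurements come from the same subjects
print(stats.ttest_rel(before, after))

# Two-sample (independent) t-test: two different populations
print(stats.ttest_ind(group_a, group_b))

# One-sample t-test: one group against a standard value (here 52)
print(stats.ttest_1samp(group_b, popmean=52))

# One-tailed version: test specifically whether group_b's mean is greater than group_a's
print(stats.ttest_ind(group_b, group_a, alternative="greater"))
```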

[Project 1] Day 9: Handling Missing Data

Today while working on the CDC data I faced the problem of handling missing data.

  • There are many reasons in general for certain values of data being missing; the 2 which I believe mainly apply here are:
    1. Past data might get corrupted due to improper maintenance
    2. The data was not recorded (human error, etc.)
  • In the case of our data, the diabetes sheet has about 3k rows, but the obesity data has a little over 300 rows and the inactivity data a bit over 1k rows.
  • Due to this, on merging all three datasets I had a little over 300 rows of data. But when I checked the data, there were still 9 null values in the % inactive column.
  • Now, if the data were vast, I would have considered dropping the rows which had null values. However, since we only have 300 rows, I was not too keen on reducing the data further.
  • On researching, the main method I found which would help to handle these missing values was imputing the missing values. There are 4 ways of doing this.
  • Imputing an arbitrary value involves replacing missing values with a specified (arbitrary) value like -3, 0, 7, etc. This method has a number of disadvantages: the arbitrary value could inject bias into the dataset if it is not indicative of the ‘underlying data’, and it could also limit the variability of the data and make it difficult to find patterns.
  • Replacing with the mean involves, as the name suggests, imputing the data with the average or mean value of the column. The only scenario where this is not appropriate is if there are outliers. However, since we had already treated the outliers, I used this method to fill in the missing values and continued with my analysis of the data (a short sketch is below).
  • The other 2 ways of handling the missing data would be replacing with the median and replacing with the mode.
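A short sketch of the mean imputation; the file and column names are assumptions standing in for my merged CDC frame.

```python
import pandas as pd

# Assumed file/column names for the merged diabetes-obesity-inactivity frame
merged = pd.read_csv("cdc_merged.csv")

print("Missing before:", merged["% INACTIVE"].isna().sum())

# Mean imputation: fill the gaps with the column average (outliers already treated)
merged["% INACTIVE"] = merged["% INACTIVE"].fillna(merged["% INACTIVE"].mean())

# The median and mode alternatives would look like this:
# merged["% INACTIVE"].fillna(merged["% INACTIVE"].median())
# merged["% INACTIVE"].fillna(merged["% INACTIVE"].mode()[0])

print("Missing after:", merged["% INACTIVE"].isna().sum())
```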

[Project 1] Day 8: Conducting Cross-Validation and the Bootstrap

Today I went a little more in depth into how to do cross-validation and into understanding the bootstrap.

  • To start cross-validation, the data first has to be divided into 2 parts, i.e. the training data and the testing data.
  • To do this, we first need to decide into how many subsets or folds (k) we will split the total data; the training data would be k-1 folds and the testing data would be the remaining 1 fold.
  • Next, we need to select a performance metric. Performance metrics measure how well the model behaves and help us evaluate it.
  • We then repeat this process k times, each time holding out a different fold, and take the average of the performance metric as the estimate of the model’s overall performance.
  • By conducting cross-validation, we estimate the test error.
  • From my understanding, cross-validation is random sampling without replacement, whereas in the bootstrap there is replacement.
  • In cross-validation, we evaluate the model over a number of training/testing splits, which is useful if we have a large amount of data. The bootstrap is better when we have a smaller amount of data: by repeatedly resampling from the observed (limited) data, we can estimate the distribution of a statistic.
  • Coming back to the project: now that I have better understood the concepts, I believe that bootstrapping would better serve our model built on the CDC data, as it has only 300-odd data points. Both techniques are sketched below.
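Both techniques are sketched here on simulated data of roughly the same size as our merged CDC frame (about 300 rows); the data itself is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

# Synthetic regression data, roughly 300 observations like the merged CDC frame
X, y = make_regression(n_samples=300, n_features=2, noise=10, random_state=0)

# k-fold cross-validation: train on k-1 folds, test on the held-out fold, repeat k times
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores, "mean:", scores.mean())

# Bootstrap: resample with replacement many times to estimate the spread of a statistic
boot_means = [resample(y, replace=True, n_samples=len(y)).mean() for _ in range(1000)]
print("Bootstrap 95% interval for mean(y):", np.percentile(boot_means, [2.5, 97.5]))
```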

[Project 1] Day 7: Intro to Resampling Methods

Today I started learning about resampling techniques, in particular cross-validation.

  • What is resampling? From what I understood, since sampling is the process of data collection, resampling is the repeated drawing of samples from the same observed sample, or the creation of new samples on the basis of the first observed sample.
  • Why do we use resampling? When we create prediction models on some data, it is always good to test them on new data. But since we may not always have new data, we can use resampling methods to generate new samples.
  • The main use of cross-validation is to check our prediction model for test error due to overfitting.
    * Test error is the mean (avg.) error that comes from testing on new data, while training error is the error computed when testing on our training data.
    * What is overfitting? When we conduct our regression analysis, if we ‘fit’ the line extremely close to certain data points, the model is said to be overfit. This results in the model being fit only to this initial data and not giving good predictions for other data.
  • In cross-validation we divide the data into 2 parts: the training data and the validation data. In simple words, the training data would be used to ‘train’ or fit the model to the data, and this fitted model would then be used  to try and predict the outcomes in the validation data.
  • With relation to the current project on the CDC data, I’m still considering whether I want to use this approach or the bootstrap.

[Project 1] Day 6: Understanding Linear Regression

  • In statistical analysis, ‘regression’ is a method implemented to understand or determine the relationship between 2 or more variables. Using this relationship, we would be able to determine an unknown value which depends on our predicting variable.
    *Variable in statistics represents any quantity that can be measured or counted
    There are 2 main types of Variables:
    a) Categorical variables: the variables which represent groupings
    b) Quantitative variables: The variables which use numbers to represent total values.
    In the case of the CDC data we have 3 quantitative variables, namely % Diabetes, % Inactivity and % Obesity. The categorical variables are ‘STATE’ and ‘COUNTY’.
  • In regression, we classify the variables we want to analyze as the ‘dependent variable’ and the ‘independent variables’.
  • As the name suggests, ‘Linear Regression’ assumes that there is a linear relationship between the variables. The end result would have us plot a straight line through the data points on a plot which would best describe the relationship.
  • Simple Linear Regression fits a straight 2-dimensional line to find the relationship between 2 variables.
    In our project we would be doing simple linear regression to show the relationships between  % Diabetes &  % Inactivity, % Diabetes & %  Obesity and % Inactivity &% Obesity respectively . This line would be represented by the equation y = β₀ + β₁*x₁ + ε.
  • Multiple linear regression, on the other hand, with the data of % Diabetes, % Inactivity and % Obesity, would create a 3-dimensional plot, i.e. a plane.
  • Project work: On meeting with the group today, we have decided to explore our options after having calculated the R^2 value for the data. We have begun the ‘feature engineering’ process. On my part, I started feature scaling and am exploring the statistical tests which will be helpful towards the project. Sketches of simple and multiple regression fits are included below.
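Sketches of the simple and multiple regression fits with statsmodels are below; the file name and the column names ('% DIABETIC', '% INACTIVE', '% OBESE') are my assumptions for the merged CDC frame.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file/column names for the merged CDC frame
cdc = pd.read_csv("cdc_merged.csv")

# Simple linear regression: % Diabetic = b0 + b1 * % Inactivity + error
X_simple = sm.add_constant(cdc["% INACTIVE"])
simple_fit = sm.OLS(cdc["% DIABETIC"], X_simple).fit()
print(simple_fit.summary())

# Multiple linear regression: % Diabetic on both % Inactivity and % Obesity (a fitted plane)
X_multi = sm.add_constant(cdc[["% INACTIVE", "% OBESE"]])
multi_fit = sm.OLS(cdc["% DIABETIC"], X_multi).fit()
print("R^2 (multiple):", multi_fit.rsquared)
```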

[Project 1] Day 5: Analyzing Crab Molt Data

In today’s class, we discussed the analysis of the ‘crab molt’ data.

  • Crab molting is the action of the crab breaking its outer skeleton, or exoskeleton, in order to grow. Our objective was to try to predict the size of the crab before molting using the post-molt size of the crabs.
  • To begin with, we plotted a linear model and found that the value of R^2 was 0.98.
  • The descriptive statistics of the post-molt data showed a skewness of -2.3469 and kurtosis of 13.116 and that of the pre molt data showed a skewness of -2.00349 and a kurtosis of 9.76632
  • Looking at the graphs of the normal curve of both pre and post molt data, it appeared that they were similar with just a shift in mean.
  • To check whether this was true, we conducted a t-test.
    * A t-test is a statistical test which is used to compare the mean values of two groups. To do this we create two hypotheses:
    H0 = there is no significant difference
    H1 = there is a significant difference
  • The t-test we conducted on the crab molt data indicated that our null hypothesis (H0) was false; there is a significant difference between the means of the pre-molt and post-molt data. We confirmed this by carrying out a ‘Monte Carlo’ procedure to estimate the p-value for the observed difference of means.

[Project 1] Day 4: Diabetes, Inactivity and Obesity Distribution by State

Today I tried to see if there was any relation between the distributions of the average % diabetes, % inactivity and % obesity by state.
1. The bar plot of % diabetes showed that the states of South Carolina, Alabama, Mississippi, Delaware, Florida and Maryland have the highest average % of diabetics (>=10%).


2. The bar plot of % inactivity showed that the states of Alabama, Kentucky, Nebraska, New York and Oklahoma have the highest average % of inactivity (>=17%)


3. The bar plot of % obese showed that, barring 10 states, all the others have a high average % of obesity (>18%).


4. On plotting the correlation, there seems to be no correlation between ‘State’ and ‘% Diabetic’, ‘% Inactivity’ and ‘% Obesity’. However, there are a couple of states, like Alabama and Kentucky, that have correspondingly high or low values of each, which would indicate that there could be a correlation. One thing to note is that this correlation is computed on only 300 or so data points; there are many counties whose data we do not have, and this could account for the negative correlation.
5. Another thing to note is that there are positive correlations between ‘% Diabetic - % Inactivity’, ‘% Diabetic - % Obesity’ and ‘% Inactivity - % Obesity’ of 0.7032549, 0.5269106 and 0.590831, respectively.

[Project 1] Day 3: Continuing Data Exploration: Understanding Concepts

In my previous blog, I mentioned the term ‘heteroscedasticity’. However, I didn’t explain its relevance to the project, so I decided to revisit the concept.

  • While plotting the linear regression for diabetes and inactivity, it was observed that the percentage of diabetes explained by inactivity was 19.51%. On checking the residuals afterwards, the following plot was observed (noted in the last blog).

    While there is a significant amount of data near the zero line, in the range -0.5 to 0.5, there is a larger amount of data which is further from the line.
  • Heteroscedasticity is the ‘fanning out’ of residuals, i.e. the spread of the residuals increases along the line. Heteroscedasticity makes the regression results less reliable. The common methods to identify heteroscedasticity are visual plots and the Breusch-Pagan test.
  • When comparing the heteroscedasticity of the residual plot for the dataframe of the combined obesity-inactivity-diabetes data with that of the individual inactivity-diabetes dataframe, the spread of the outliers in the combined dataframe is smaller than that of the individual dataframe.
    This shows that the combined dataframe is more reliable than the individual data.
  • So, coming back to Pearson’s R-squared, it should confirm that the combined dataframe is more reliable than the limited dataframe.
  • To get started with calculating R^2, I have calculated the correlation between the data. I will go further into this in the next blog. A sketch of a Breusch-Pagan check is included below.
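A sketch of the Breusch-Pagan check with statsmodels is below; again the file and column names are assumptions standing in for the merged CDC frame.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Assumed file/column names for the merged CDC frame
cdc = pd.read_csv("cdc_merged.csv")

# Fit the regression whose residuals we want to test
X = sm.add_constant(cdc[["% INACTIVE", "% OBESE"]])
fit = sm.OLS(cdc["% DIABETIC"], X).fit()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity in the residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM p-value = {lm_pvalue:.4f}")
```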

[Project 1] Day 1: Exploring the CDC Data

Today my project group and I started our exploration of the CDC data.

  • 3 sheets are given in the Excel file: diabetes, obesity and inactivity. The titles of the sheets themselves hint at there being individual and combined relationships between the 3 groups of data.
  • At first glance it is obvious that the amount of data on the percentage of diabetes (about 3k rows) is far greater than the data on both obesity (a little over 300 rows) and inactivity (a bit over 1k rows). This was confirmed by merging the dataframes on the FIPS column.
  • In class today, we discussed the relationship between inactivity and diabetes. So, as a group, we decided to explore the relationship between diabetes and obesity.
  • In addition to calculating the descriptive statistics (mean, median, etc.), I visualized these calculations with the help of histograms, box plots and QQ plots.
  • Additionally, in class today I was introduced to the concept of ‘heteroscedasticity’. Through the scatter plot it was clearly visible that the data is unevenly scattered. The data will have to be ‘transformed’ (for lack of a better word) to bring the distribution closer to normal.
  • The obesity data was negatively skewed (-2.69210) and the diabetes data had a skewness of 0.089145. Although the calculations look correct, I feel there may be a mistake in my work and will go through it again to check for errors.