[Project 3] Day 5: VARMA vs VARMAX

Last time I learnt about Vector Auto Regression. Today I went further into the topic to learn about Vector Autoregression Moving-Average (VARMA) and Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX).

  1. Vector Autoregression Moving-Average (VARMA)
    The Vector Autoregression Moving-Average (VARMA) method models the next value in each of multiple time series using the ARMA model approach. It is the generalization of ARMA to multiple parallel time series, i.e. multivariate time series.
  2. Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
    The Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) method extends the VARMA model by also modelling exogenous variables. It is a multivariate version of the ARMAX method.
  3. In essence, VARMAX represents an extension of VARMA by accommodating additional variables with no causal connection to the system under investigation.
  4. These “exogenous” variables do not directly influence the internal workings of the system; however, they might still have an indirect effect through their interactions with the endogenous variables.
  5. To capture this complexity, VARMAX models each variable as a linear combination of its previous values, the collective histories of all other variables, current and past errors across all variables, and possibly delayed values of the exogenous variables.
  6. By doing so, VARMAX enables the inclusion of external influences, such as long-term trends, cyclical patterns, or deliberate interventions, which could otherwise go unaccounted for in simpler VARMA frameworks.
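
To make this concrete, below is a minimal sketch (my own, not from the project) of fitting a VARMAX model with statsmodels on synthetic data; the series names, the exogenous regressor and the order (1, 1) are illustrative assumptions, and dropping the exogenous term gives a plain VARMA model.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.varmax import VARMAX

    rng = np.random.default_rng(0)
    # Two endogenous series plus one exogenous regressor (synthetic data).
    endog = pd.DataFrame(rng.normal(size=(200, 2)), columns=["y1", "y2"])
    exog = pd.DataFrame(rng.normal(size=(200, 1)), columns=["x1"])

    # order=(p, q) gives a VARMA(p, q); omitting `exog` fits a plain VARMA model.
    model = VARMAX(endog, exog=exog, order=(1, 1))
    res = model.fit(disp=False)

    # Forecasting requires future values of the exogenous regressor.
    future_exog = pd.DataFrame(rng.normal(size=(5, 1)), columns=["x1"])
    print(res.forecast(steps=5, exog=future_exog))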

[Project 3] Day 5: Vector Auto Regression

Today I learnt more about Vector Auto Regression.

  • Vector Autoregression (VAR) is a forecasting algorithm that can be used when two or more time series influence each other. That is, the relationship between the time series involved is bi-directional.
  • VAR modeling is a multi-step process, and a complete VAR analysis involves:

    1) Specifying and estimating a VAR model.
    2) Using inferences to check and revise the model (as needed).
    3) Forecasting.
    4) Structural analysis.

  • In a VAR model, each variable is modeled as a linear function of past lags of itself and past lags of other variables in the system.
  • VAR models differ from univariate autoregressive models because they allow feedback to occur between the variables in the model.
  • An estimated VAR model can be used for forecasting, and the quality of the forecasts can be judged, in ways that are completely analogous to the methods used in univariate autoregressive modelling.
  • Using an autoregressive (AR) modeling approach, the vector autoregression (VAR) method examines the relationships between multiple time series variables at different time steps.
  • The VAR model’s parameter specification involves providing the order of the AR(p) model, which represents the number of lagged values included in the analysis.
  • Applied to multivariate time series without trend or seasonal components, the VAR method offers a useful tool for investigating the interdependencies between the series.
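
As a companion to these notes, here is a minimal sketch (on synthetic data, not the project dataset) of fitting a VAR(p) model with statsmodels, selecting the lag order by AIC and producing a short forecast.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.api import VAR

    rng = np.random.default_rng(1)
    # Two parallel series standing in for a multivariate time series.
    data = pd.DataFrame(rng.normal(size=(200, 2)), columns=["series_1", "series_2"])

    model = VAR(data)
    order = model.select_order(maxlags=8)   # compare information criteria up to lag 8
    res = model.fit(order.aic)              # fit VAR(p) with the AIC-selected lag

    # Forecast 5 steps ahead from the last p observations.
    print(res.forecast(data.values[-res.k_ar:], steps=5))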

[Project 3] Day 4: Further Inspection of Economic Dataset

Today my team and I conducted further exploration into the economic dataset from Analyze Boston.

  • The dataset provides an overview of the economic factors of Boston from January 2013 to December 2019.
  • The Tourism factors include the flight activity at Logan International Airport along with the passenger traffic.
  • The Hospitality factors include the hotel occupancy rates and average daily rates.
  • The unemployment rate and total number of jobs come under the Labor factors.
  • Pipeline development, construction costs, and square footage could be classified under Construction Factors.
  • The Real Estate factors would include housing sales volume, median housing prices, foreclosure rates, and new housing construction permits.
  • If we were to take time as a factor, we would most likely perform time series analysis to understand the trends in Boston’s economic growth or decline.
  • We began preprocessing the data to convert the raw data into something more usable.
  • We also noted some descriptive statistics relevant to the dataset.
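
A minimal sketch of the kind of preprocessing we started; the CSV file name and the ‘Year’/‘Month’ column names are assumptions here, not taken from the dataset documentation.

    import pandas as pd

    # Hypothetical file name for the Analyze Boston economic indicators export.
    df = pd.read_csv("economic-indicators.csv")

    # Build a proper monthly datetime index from the Year and Month columns.
    df["date"] = pd.to_datetime(dict(year=df["Year"], month=df["Month"], day=1))
    df = df.set_index("date").drop(columns=["Year", "Month"]).sort_index()

    # Descriptive statistics for the numeric columns.
    print(df.describe())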

[Project 3] Day 3: Time Series Forecasting contd…

I continued my study of time series forecasting. Below is what I learnt:

  • There are 11 different classical time series forecasting methods which are:
    1. Autoregression (AR)
    2. Moving Average (MA)
    3. Autoregressive Moving Average (ARMA)
    4. Autoregressive Integrated Moving Average (ARIMA)
    5. Seasonal Autoregressive Integrated Moving-Average (SARIMA)
    6. Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
    7. Vector Autoregression (VAR)
    8. Vector Autoregression Moving-Average (VARMA)
    9. Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
    10. Simple Exponential Smoothing (SES)
    11. Holt Winter’s Exponential Smoothing (HWES)
  • Out of these, the three below are the ones I read about in depth; a small fitting sketch follows after this list.
  • ARIMA stands for Autoregressive Integrated Moving Average.
    • The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at earlier time steps.
    • To make the series stationary, the method combines the concepts of the Moving Average (MA) and Autoregression (AR) models with a differencing pre-processing step known as integration (I).
    • For single-variable time series with a trend but no seasonal changes, the ARIMA method works well.
  • The full form of VAR is Vector Auto Regression.
    • Using an AR model approach, the Vector Autoregression (VAR) method models each time series’ subsequent step. In essence, it expands the AR paradigm to accommodate several time series that are parallel, such as multivariate time series.
    • The model notation involves specifying the order of the AR(p) model as a parameter to the VAR function, e.g. VAR(p).
    • Multivariate time series devoid of trend and seasonal components can benefit from this strategy.
  • Holt Winter’s Exponential Smoothing (HWES) is also called the Triple Exponential Smoothing method.
    • It models the next time step as an exponentially weighted linear function of observations at prior time steps, taking trends and seasonality into account.
    • The method is suitable for univariate time series with trend and/or seasonal components.
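
Here is the small fitting sketch referred to above, showing ARIMA and Holt-Winters (HWES) from statsmodels on a synthetic univariate series; the series itself, the ARIMA order (1, 1, 1) and the seasonal period of 12 are assumptions for illustration.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    rng = np.random.default_rng(2)
    t = np.arange(120)
    # Synthetic monthly series with a linear trend and yearly seasonality.
    y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2, size=120))

    # ARIMA(p, d, q): AR lags, differencing order, MA lags.
    arima_res = ARIMA(y, order=(1, 1, 1)).fit()
    print(arima_res.forecast(steps=12))

    # Holt-Winters (triple) exponential smoothing with additive trend and seasonality.
    hwes_res = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
    print(hwes_res.forecast(12))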

[Project 3] Day 2: Intro to Time Series Forecasting

Today in class we briefly touched on the topic of time series forecasting, so I decided to go a bit deeper into it.

  • Time series forecasting is basically making scientific forecasts based on past time-stamped data.
  • It entails creating models via historical analysis and applying them to draw conclusions and inform strategic choices in the future.
  • A significant differentiator in forecasting is that the future outcome is totally unknown at the time of the task and can only be estimated through rigorous analysis and data-supported priors.
  • Time series forecasting is the practice of utilizing modeling and statistics to analyze time series data in order to produce predictions and assist with strategic decision-making.
  • Forecasts are not always accurate, and their reliability can vary greatly, particularly when dealing with variables in time series data that fluctuate frequently and uncontrollably.
  • Still, forecasting provides information about which possible scenarios are more likely—or less likely—to materialize. Generally speaking, our estimates can be more accurate the more complete the data we have.
  • Although “forecasting” and “prediction” often mean the same thing, there is a meaningful difference between them. In certain sectors of the economy, forecasting may pertain to data at a certain future point in time, whereas prediction relates to future data in general.

[Project 3] Day 1: Initial observation about ‘Analyze Boston’ Dataset

Today I commenced my exploration of the ‘Economic Dataset’ from ‘Analyze Boston’.

  • There are 19 columns which are divided as follows:
    • Date: which gives 2 columns of ‘Year’ and ‘Month’
    • Tourism: Has 2 columns which give ‘Number of domestic and international passengers at Logan Airport’ and ‘Total international flights at Logan Airport’
    • Hotel Market: Has 2 columns which give ‘Hotel occupancy for Boston’ and ‘Hotel average daily rate for Boston’
    • Labor Market: Has 3 columns which give details of ‘Total Jobs’, ‘Unemployment rate for Boston’ and ‘Labor rate for Boston’
    • Real Estate Board approved development projects: Has 4 columns which give the details of ‘Number of units approved’, ‘Total development cost of approved projects’, ‘Square feet of approved projects’ and ‘Construction jobs’.
    • Real Estate (Housing): Has 6 columns which give the details of ‘Foreclosure house petitions’, ‘Foreclosure house deeds’, ‘Median housing sales price’, ‘Number of houses sold’, ‘New housing construction permits’ and ‘New affordable construction permits’.
  • Since this dataset gives reference to economic groups, my first thought is that I should perform some sort of cluster analysis.
  • It may also be possible to check the relation between the ‘Tourism’ and ‘Hotel market’ as well as the relation between ‘Labor market’ and ‘Real Estate’ variables.
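
To follow up on that idea, a minimal sketch of checking the correlation between the Tourism and Hotel Market variables; the file name and column names are assumed for illustration, not taken from the dataset documentation.

    import pandas as pd

    df = pd.read_csv("economic-indicators.csv")                 # hypothetical file name

    tourism_cols = ["logan_passengers", "logan_intl_flights"]   # assumed column names
    hotel_cols = ["hotel_occup_rate", "hotel_avg_daily_rate"]   # assumed column names

    # Pairwise Pearson correlations between the two groups of variables.
    print(df[tourism_cols + hotel_cols].corr())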

[Project 2] Day 12: Continuing analysis

  • To check the trend in fatal shootings, I plotted a line graph.
  • From the first plot, we can see that the fatal shootings spike in March and October and reach an all-time low in December.
  • To understand the distribution of fatal shootings across the various states, I created a bar plot.
  • From the plot it can be seen that California has the highest number of fatal shootings while Rhode Island has the lowest.
  • I plotted a histogram to check the distribution of armed fugitives of the “White”, “Black” and “Hispanic” races only, and got the following result.
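
A minimal sketch of the first two plots described above, assuming the fatal police shootings CSV with ‘date’ and ‘state’ columns (names assumed here):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("fatal-police-shootings-data.csv", parse_dates=["date"])

    # Line graph: fatal shootings per calendar month across all years.
    df.groupby(df["date"].dt.month).size().plot(kind="line", marker="o")
    plt.xlabel("Month")
    plt.ylabel("Fatal shootings")
    plt.show()

    # Bar plot: distribution of fatal shootings across states.
    df["state"].value_counts().plot(kind="bar", figsize=(12, 4))
    plt.ylabel("Fatal shootings")
    plt.show()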

[Project 2] Day 11: Understanding Random Forest

Today I attempted to build a random forest model to predict mental illness based on the fatal police shootings data.

  • For both classification and regression applications, Random Forest is a potent ensemble machine learning technique that is frequently utilized. It is a member of the decision tree-based algorithm family, which is renowned for its resilience and adaptability. The unique feature of Random Forest is its capacity to reduce overfitting and excessive variance, two major issues with individual decision trees.
  • Random Forest’s technical foundation is the construction of a set of decision trees, hence the name “forest”. The main steps used in Random Forest are as follows:

1. Bootstrapping: As I previously learnt, this is a technique in which we create several subsets, referred to as bootstrap samples, by randomly sampling the dataset with replacement. Random Forest uses one of these samples to train each decision tree, adding diversity.

2. Feature Randomization: Random Forest chooses a random subset of features for every tree in order to increase diversity. This guarantees that no single feature dominates the decision-making process and lowers the correlation between the trees.

3. Decision Tree Construction: A customized decision tree method is used to build each tree in the forest. These trees divide the data into nodes that maximize information gain or decrease impurity based on the attributes that have been selected.

4. Voting: Random Forest uses majority voting to aggregate the predictions of individual trees for classification problems and takes the average of the tree predictions for regression tasks.
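
A minimal sketch of the model I am attempting, with assumed column names from the fatal police shootings data; the feature set, encoding and target here are illustrative choices, not the final pipeline.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("fatal-police-shootings-data.csv")

    # Assumed features: drop rows with missing values, one-hot encode categoricals.
    features = pd.get_dummies(df[["age", "gender", "race", "armed", "flee"]].dropna())
    target = df.loc[features.index, "signs_of_mental_illness"]

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )

    # 200 trees, each grown on a bootstrap sample with feature randomization.
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))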

[Project 2] Day 10: Using locations and GPS Co-ordinates in Analysis

Today I decided to look into using the GPS coordinates (i.e. latitude and longitude) for analysis.

  • On looking at the data, I initially thought it would be a good idea to use the coordinates given to visualize the clusters of shootings and the cities provided in the dataframe.
  • However, on further inspection of the data, it was clear that very few records had GPS co-ordinates. If I were to use these co-ordinates, I would not get much information.
  • So, I decided to try and use only the city and state information provided to visualize the clusters.
  • Although this would not be a precise location (since we do not have the exact address or co-ordinates), it would be possible to use a heatmap to view the overall distribution of police shootings in a city or state.
  • I am currently trying to use the GeoPandas library to create a geo-heatmap of the US with regard to the fatal police shootings.
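
A rough sketch of what I am aiming for, assuming a local US state boundary file and its state-code column name (both placeholders), merged with shooting counts per state:

    import pandas as pd
    import geopandas as gpd
    import matplotlib.pyplot as plt

    shootings = pd.read_csv("fatal-police-shootings-data.csv")
    counts = shootings["state"].value_counts().rename_axis("state").reset_index(name="count")

    # "us_states.shp" is a placeholder for any US state boundary file; "STUSPS"
    # is the assumed two-letter state code column in that file.
    states = gpd.read_file("us_states.shp")
    states = states.merge(counts, left_on="STUSPS", right_on="state", how="left")

    # Choropleth ("geo-heatmap") of fatal shootings per state.
    states.plot(column="count", cmap="Reds", legend=True, figsize=(12, 6))
    plt.title("Fatal police shootings by state")
    plt.show()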

[Project 2] Day 9: Descriptive Statistics of Data

Today, while checking my analysis and what I have done so far, I realized I had not properly noted down the descriptive statistics. So, I decided to note them down in today’s blog.

  • On checking the information for the dataframe, it can be seen that there are a maximum of 8002 non-null values, which essentially indicates that there are 8002 records.

  • It can also be seen that not all features have the same number of non-null values. This indicates missing values. So, next I checked the total number of missing values and got the following result:
    From this, I observed that the ‘race’ feature had the maximum number of missing values. This is followed by ‘flee’.
  • Next, I used the ‘describe()’ function. It displayed the following:

    The descriptions of the ‘id’, ‘longitude’ and ‘latitude’ features do not help much. The description for ‘age’ shows a mean of 37.209 and a standard deviation of 12.979.
  • To visualize the skewness of the age distribution I created a plot.

    The data appears to be left-skewed, with a mean of 37.2. Most ages lie between 27 and 45 years.
  • On creating a bar plot to view the distribution of ‘manner of death’, it can be seen that the vast majority of deaths occur by shooting alone, with barely 4.2% of the victims being tasered and then shot.

  • The bar plot of the gender distribution shows that the majority of the victims are male, with fewer than 1000 female victims.
  • The bar plot of the race distribution shows that most of the victims are White (41%), followed by 22% Black and 15% Hispanic victims. We have to remember that there are over 1000 missing values for race, which could affect this apparent skew towards White victims.

  • On checking the statistics for the ‘weapon type’ that the victim/fugitive possessed, it shows that over 4000 of them had a gun, while other types of weapons in possession account for around 1200.
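
A minimal sketch of the checks above in pandas, with assumed column names from the fatal police shootings data:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("fatal-police-shootings-data.csv")

    df.info()                                               # non-null counts per column
    print(df.isnull().sum().sort_values(ascending=False))   # missing values per feature
    print(df.describe())                                    # summary statistics

    # Age distribution, and proportions for manner of death, gender, and race.
    df["age"].plot(kind="hist", bins=30)
    plt.xlabel("Age")
    plt.show()
    print(df["manner_of_death"].value_counts(normalize=True))
    print(df["gender"].value_counts(normalize=True))
    print(df["race"].value_counts(normalize=True))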