Cesar Vazquez | Data Science & Machine Learning Projects

Project: Classifying the handwritten digits dataset from the UCI Machine Learning Repository.

In this project I perform basic data compilation and manipulation, exploratory data analysis (EDA), principal component analysis (PCA), multidimensional scaling (MDS), and t-SNE.

Tasks:

  1. Read both the training and the test data set into R.
  2. Perform Exploratory Data Analysis (EDA) using a heatmap.
  3. Run PCA and obtain the scree plot.
  4. Try out classical MDS with a Manhattan-based distance matrix.
  5. Apply t-SNE to the data.
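
The project itself was done in R; as a rough illustration of task 4, the following Python sketch builds a Manhattan (L1) distance matrix and applies the double-centering step that classical MDS performs before its eigendecomposition. The toy pixel rows and the function names are assumptions made for this example and are not taken from the project.

```python
# Illustrative sketch only: made-up "images" (rows = samples, columns = pixel intensities).
toy_pixels = [
    [0, 0, 16, 8],
    [0, 1, 15, 9],
    [12, 16, 0, 0],
    [11, 15, 1, 0],
]


def manhattan_distance_matrix(rows):
    """Pairwise Manhattan (L1) distances between the rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = float(sum(abs(a - b) for a, b in zip(rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d
    return dist


def double_center(dist):
    """Classical MDS step: B = -1/2 * J * D^2 * J, where J is the centering matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]


D = manhattan_distance_matrix(toy_pixels)
B = double_center(D)
```

The leading eigenvectors of `B`, scaled by the square roots of their eigenvalues, would give the low-dimensional MDS coordinates; that step is left to a linear algebra library.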


Project: Determining defaulting likelihood on loans.

In this project I investigate whether certain variables determine whether someone will default on a loan. I perform agglomerative (hierarchical) clustering and K-means clustering.

Tasks:

  1. Read the data into R. List the missing rate (in percentage) for each variable.
  2. Clean the data by replacing missing values of both JOB and REASON with the constant “Unknown”.
  3. Apply the natural logarithm transformation.
  4. Impute all the remaining missing values.
  5. Obtain a distance matrix.
  6. Cluster the data, excluding the variable BAD.
  7. Make a two-way contingency table that compares the two clustering results.
  8. Conduct some post hoc analysis using various numerical and graphical tools.
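
The project itself was done in R; the Python sketch below only illustrates tasks 1, 2, and 7 on a tiny example. The records, the LOAN column, the cluster labels, and the function names are all made up for illustration; only the variable names BAD, JOB, and REASON come from the task list.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"BAD": 1, "LOAN": 1100, "JOB": "Other", "REASON": "HomeImp"},
    {"BAD": 0, "LOAN": 1300, "JOB": None, "REASON": "DebtCon"},
    {"BAD": 0, "LOAN": None, "JOB": "Office", "REASON": None},
    {"BAD": 1, "LOAN": 1700, "JOB": None, "REASON": "DebtCon"},
]


def missing_rate_percent(rows):
    """Percentage of missing (None) values for each variable."""
    n = len(rows)
    return {name: 100.0 * sum(r[name] is None for r in rows) / n for name in rows[0]}


def fill_constant(rows, columns, constant="Unknown"):
    """Replace missing values with a constant, but only in the named columns."""
    return [
        {k: (constant if k in columns and v is None else v) for k, v in r.items()}
        for r in rows
    ]


def contingency_table(labels_a, labels_b):
    """Two-way table of counts keyed by (label from run A, label from run B)."""
    return Counter(zip(labels_a, labels_b))


rates = missing_rate_percent(records)
cleaned = fill_constant(records, {"JOB", "REASON"})

# Hypothetical cluster labels from two different clustering runs.
hierarchical_labels = [1, 1, 2, 2]
kmeans_labels = [1, 2, 2, 2]
table = contingency_table(hierarchical_labels, kmeans_labels)
```

Here `table[(1, 2)]` counts the observations placed in cluster 1 by the first run and in cluster 2 by the second, which is the cell-by-cell comparison the contingency table is meant to support.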


Project: Predicting a baseball hitter's salary based on his performance variables.

In this project I predict a baseball hitter's salary (on the log scale) from his performance variables, using variable selection followed by model diagnostics.

Tasks:

  1. Perform EDA by plotting histograms of both salary and the natural logarithm of salary.
  2. Inspect for any missing data.
  3. Partition the data and apply variable selection methods.
  4. Output the necessary fitting results for each best model and apply it to the test data.
  5. Refit the final model using the entire data set.
  6. Check normality, homoscedasticity, independence, and linearity; detect outliers and assess multicollinearity.
  7. Apply the final model to predict the log-salary for the new data set.
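
The project itself was done in R; as a hedged illustration of the log transformation, the data partition, and the fit-then-test workflow, here is a minimal Python sketch. The data are synthetic (log-salary is constructed to be exactly linear in a single hypothetical predictor, `hits`), and a one-predictor least-squares fit stands in for the variable-selection models used in the project.

```python
import math
import random

# Synthetic data for illustration only: log-salary is exactly linear in hits.
hits = [50, 80, 100, 120, 150, 170, 190, 210]
salary = [math.exp(4.0 + 0.01 * h) for h in hits]

# Natural-log transformation of the response.
log_salary = [math.log(s) for s in salary]

# Seeded random partition into training and test indices.
indices = list(range(len(hits)))
random.Random(0).shuffle(indices)
train_idx, test_idx = indices[:6], indices[6:]


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_simple_ols(
    [hits[i] for i in train_idx], [log_salary[i] for i in train_idx]
)

# Mean squared prediction error on the held-out test set.
test_mse = sum(
    (log_salary[i] - (intercept + slope * hits[i])) ** 2 for i in test_idx
) / len(test_idx)
```

Because the synthetic response is noise-free, the fit recovers the generating intercept and slope and the test error is essentially zero; with real data the test error is what would be compared across candidate models.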


Project: Communities and crime.

In this project I analyze communities and crime data using five different regression models.

Tasks:

  1. Prepare data by removing, imputing, or replacing appropriately.
  2. Conduct some EDA, mostly by checking the distribution of the target variable.
  3. Partition the data.
  4. Do predictive modeling with LASSO Regression.
  5. Do predictive modeling with Principal Components Regression.
  6. Try Partial Least Squares Regression.
  7. Finally, try Weighted Orthogonal Components Regression as well as Stagewise Regression.
  8. Test and compare the models.
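
The project itself was done in R; the Python sketch below illustrates two small pieces of the workflow above under stated assumptions. `soft_threshold` is the operator that LASSO coordinate-descent solvers apply to each coefficient, and the test-set comparison picks the model with the lowest mean squared error. The held-out targets, the predictions, and the generic model names are made up and are not results from the project.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out targets and predictions from three fitted models.
y_test = [0.20, 0.35, 0.10, 0.50]
predictions = {
    "model_a": [0.22, 0.30, 0.12, 0.47],
    "model_b": [0.25, 0.40, 0.05, 0.40],
    "model_c": [0.21, 0.33, 0.15, 0.52],
}

# Score every model on the same test set and keep the one with the lowest error.
test_mse = {name: mean_squared_error(y_test, pred) for name, pred in predictions.items()}
best_model = min(test_mse, key=test_mse.get)
```

Scoring all models on the same held-out partition keeps the comparison fair, since none of them has seen those observations during fitting.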


Project: Seeing what factors affect liver disease.

In this project I consider the Indian Liver Patient dataset, fit logistic regression models, and compare the models using their ROC curves.

Tasks:

  1. Prepare and clean the data via listwise deletion or imputation.
  2. For each categorical predictor, use the chi-squared test of independence to assess its association with the binary response liver.
  3. For other types of predictors, use the two-sample t test.
  4. Using a threshold significance level of alpha = 0.20, exclude from the subsequent logistic model fitting any predictor whose p-value exceeds that threshold.
  5. Fit the full model with all predictors that have passed the screening.
  6. Select the best model by stepwise selection with the aid of BIC.
  7. Select the best model with one of the regularization methods with different types of penalties.
  8. Compute the jackknife predicted probabilities from every model.
  9. Plot their ROC curves and find their AUC values.
  10. Interpret the results within the liver disease diagnostic context.
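
The project itself was done in R; as a hedged illustration of task 9, this Python sketch computes ROC points and the area under the curve (AUC) by the trapezoidal rule. The labels and predicted probabilities are made up; in the project the scores would be the jackknife (leave-one-out) predicted probabilities from each model. The function names are assumptions for this example.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the distinct scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical binary outcomes (1 = disease) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = auc_trapezoid(points)
```

For this toy input, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9; an AUC of 0.5 would correspond to ranking no better than chance.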