Cesar Vazquez | Data Science & Machine Learning Projects

Project: Classifying handwritten digits dataset from UCI Machine Learning Repository

Tasks:

Tasks:

Read the data into R. List the missing rate (in percentage) for each variable.
Do some data cleaning by replacing missing values for both JOB and REASON with default constant “Unknown”.
Perform the (natural) logarithm transformation.
Impute all the remaining values.
Obtain a distance matrix.
Cluster the data by excluding the variable BAD.
Make a two-way contingency table that compares two clustering results.
conduct some post hoc analysis using various numerical and graphical tools.

Tasks:

EDA by obtaining the histograms of both salary and the logarithm (natural base) of salary.
Inspect for any missing data.
Partition the data and apply variable selection methods.
Output the necessary fitting results for each best model and apply it to the test data.
Refit the final model using the entire data.
Check normality, homoscedasticity, independence, linearity as well as detecting outliers and assessing multicollinearity.
Apply the final model to predict the log-salary for the new data set.

Tasks:

Prepare data by removing, imputing, or replacing appropriately.
Conduct some EDA, mostly by checking the distribution of the target variable.
Partition the data.
Do predictive modeling with LASSO Regression.
Do predictive modeling with Principal Components Regression.
Try Partial Least Squares Regression.
Finally, try Weighted Orthogonal Components Regression as well as Stagewise Regression.
Test and compare the models.

Tasks:

Prepare and clean the data via listwise deletion or imputation.
For each categorical predictor I use chi squared-test of independence to assess its association with the binary response liver.
For other types of predictors I use either two-sample t test.
Applying a threshold significance level alpha = 0.20, I exclude predictors that are associated with a p-value larger than that from the subsequent logistic model fitting.
Fit the full model with all predictors that have passed the screening.
Select the best model stepwise selection at the aid of BIC.
select the best model with one of the regularization methods with different types of penalties.
Compute the jackknife predicted probabilities from every model.
Plot their ROC curves and find their AUC values.
Interpret the results within the liver disease diagnostic context.