Data Analysis Projects
Asian American Quality of Life Analysis
Asian Americans, the fastest-growing minority group in the United States, currently represent 5.6% of the population and are projected to reach 10% by 2050. This rapid growth, combined with the unique challenges faced by new immigrant communities, highlights the need for a comprehensive understanding of their social and health needs.
In this data analysis project, I developed a linear regression model to predict the Quality of Life for the Asian American community. The process began with meticulous data cleaning, encoding, and outlier detection. Exploratory Data Analysis (EDA) was conducted to gain insights into the dataset.
Initially, I built a linear regression model using all available features. Its Root Mean Square Error (RMSE) was relatively high, indicating room for improvement. By examining the correlation matrix generated during EDA, I refined the model to focus on features highly correlated with the Quality of Life metric. This targeted approach significantly lowered the RMSE, demonstrating the model's enhanced predictive accuracy.
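A minimal sketch of this comparison follows. It runs on synthetic stand-in data rather than the AAQoL dataset, and the column names, the 0.3 correlation threshold, and the 80/20 split are assumptions made for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_for_features(df, features, target, seed=0):
    """Fit a linear regression on `features` and return the test-set RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


def select_by_correlation(df, target, threshold=0.3):
    """Return features whose absolute Pearson correlation with `target` exceeds `threshold`."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()


# Synthetic stand-in data: two informative columns (f1, f2) and three noise columns.
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame(rng.normal(size=(n, 5)), columns=["f1", "f2", "n1", "n2", "n3"])
data["quality_of_life"] = (
    2.0 * data["f1"] - 1.5 * data["f2"] + rng.normal(scale=0.5, size=n)
)

all_features = [c for c in data.columns if c != "quality_of_life"]
selected = select_by_correlation(data, "quality_of_life")

rmse_full = rmse_for_features(data, all_features, "quality_of_life")
rmse_selected = rmse_for_features(data, selected, "quality_of_life")
print(f"all features: RMSE={rmse_full:.3f}")
print(f"selected {selected}: RMSE={rmse_selected:.3f}")
```

RMSE is an error measure, so lower values indicate better predictions. In this sketch the correlations are computed on the full table for brevity; computing them on the training rows only would avoid leaking test information into the feature choice.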
Source: AAQoL Dataset
Full Model
Selected Features Model
Car Sales Analysis: Supervised Learning
In the automotive industry, pinpointing factors that contribute to vehicle wreckage is vital for enhancing safety features, improving vehicle design, and optimizing risk management strategies. To address this need, our project analyzed a fictional car sales dataset to determine which two features most significantly impact the likelihood of a vehicle being involved in a wreck. Our objective was to develop a predictive model with high accuracy, offering valuable insights for safety improvements and strategic decision-making in the automotive sector. Here’s a summary of the approach and findings:
Data Preparation:
Cleaned the dataset by removing redundant columns and NULL values.
Encoded categorical features into numerical values using LabelEncoder to facilitate model training.
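These preparation steps could be sketched as follows. The rows and column names (`listing_id`, `make`, `price`) are hypothetical; only the `Wrecked` target and the use of LabelEncoder come from the project description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def prepare(df, redundant_columns):
    """Drop redundant columns and rows with missing values, then label-encode text columns."""
    cleaned = df.drop(columns=redundant_columns).dropna().reset_index(drop=True)
    encoders = {}
    for col in cleaned.columns:
        if not pd.api.types.is_numeric_dtype(cleaned[col]):
            enc = LabelEncoder()
            cleaned[col] = enc.fit_transform(cleaned[col])
            encoders[col] = enc  # kept so integer codes can be mapped back to labels
    return cleaned, encoders


# Hypothetical rows for illustration only.
raw = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "make": ["Ford", "Toyota", None, "Ford"],
    "price": [12000, 18500, 9900, 15250],
    "Wrecked": ["no", "yes", "no", "no"],
})
cars, encoders = prepare(raw, redundant_columns=["listing_id"])
print(cars)
```

LabelEncoder assigns integer codes in sorted label order, which implies an ordering that may not exist in the data. That is usually harmless for tree-based models but can mislead linear models such as Logistic Regression.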
Feature Selection:
Identified the target variable as ‘Wrecked’ and all other columns as features.
Built Logistic Regression and Random Forest classification models.
Logistic Regression Model:
Iterated through all possible combinations of two features to identify the pair that achieved the highest accuracy in classifying the target variable.
Trained a Logistic Regression model for each feature combination and evaluated its performance based on accuracy.
Random Forest Model:
Repeated the feature combination testing using a Random Forest Classifier, an ensemble method that can capture non-linear relationships between the features and the target.
Compared the accuracy of different feature combinations and identified the pair that resulted in the highest accuracy.
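The pair search described for both models might look like the sketch below. The project description does not say how accuracy was measured during the search, so 5-fold cross-validation is used here as an assumption. The data is synthetic and the feature names are placeholders.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def rank_feature_pairs(X, y, feature_names, make_model, cv=5):
    """Score every two-feature combination by mean cross-validated accuracy, best first."""
    scores = []
    for i, j in combinations(range(len(feature_names)), 2):
        acc = cross_val_score(
            make_model(), X[:, [i, j]], y, cv=cv, scoring="accuracy"
        ).mean()
        scores.append(((feature_names[i], feature_names[j]), float(acc)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Synthetic stand-in data: the label depends only on features "a" and "c".
rng = np.random.default_rng(0)
names = ["a", "b", "c", "d"]
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

logit_ranking = rank_feature_pairs(
    X, y, names, lambda: LogisticRegression(max_iter=1000)
)
forest_ranking = rank_feature_pairs(
    X, y, names, lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)
print("logistic regression best pair:", logit_ranking[0])
print("random forest best pair:", forest_ranking[0])
```

Passing a model factory (`make_model`) lets the same loop serve both classifiers, so the two rankings differ only in the estimator being scored. With k features the loop trains k(k-1)/2 models per fold, which stays manageable for a dataset with a modest number of columns.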
Results:
Ran the pair search with both the Logistic Regression and Random Forest models and identified the feature combination with the highest classification accuracy.
The Random Forest model trained on the best feature pair achieved high accuracy, indicating that the selected features were effective for classification.
Visualization:
Plotted the top 10 feature combinations and their corresponding accuracies to visualize how different feature pairs performed in terms of classification accuracy.
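One way to draw such a chart is sketched below with matplotlib, which is an assumption since the plotting library is not named. The scores in the example are placeholders for illustration, not results from the project.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt


def plot_top_pairs(pair_scores, top_n=10):
    """Draw a horizontal bar chart of the `top_n` most accurate feature pairs."""
    top = sorted(pair_scores, key=lambda item: item[1], reverse=True)[:top_n]
    labels = [" + ".join(pair) for pair, _ in top]
    accuracies = [acc for _, acc in top]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(labels[::-1], accuracies[::-1])  # reversed so the best pair is drawn on top
    ax.set_xlabel("Accuracy")
    ax.set_title(f"Top {len(top)} feature pairs by accuracy")
    fig.tight_layout()
    return fig, ax


# Placeholder (pair, accuracy) tuples for illustration only.
example_scores = [(("a", "b"), 0.71), (("a", "c"), 0.93), (("b", "c"), 0.68)]
fig, ax = plot_top_pairs(example_scores)
```

Calling `fig.savefig("top_feature_pairs.png")` afterwards would write the chart to disk; the file name is arbitrary.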
Final Model Evaluation:
Trained a Random Forest model on the best feature pair using an 80/20 train/test split, then evaluated it on the test set to obtain the final classification accuracy.
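A minimal sketch of this final evaluation, assuming the two best features have already been extracted into a two-column array. The data is synthetic, and the stratified split, seed, and number of trees are illustrative choices rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_best_pair(X_pair, y, seed=42):
    """Train a Random Forest on 80% of the rows and report accuracy on the other 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_pair, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, float(accuracy), len(y_test)


# Synthetic stand-in for the two selected feature columns and the target.
rng = np.random.default_rng(1)
X_pair = rng.normal(size=(500, 2))
y = (X_pair[:, 0] - X_pair[:, 1] > 0).astype(int)

model, accuracy, n_test = evaluate_best_pair(X_pair, y)
print(f"test accuracy on {n_test} held-out rows: {accuracy:.3f}")
```

Stratifying the split keeps the class balance of the target similar in the training and test sets, which matters when one class (for example, wrecked vehicles) is rare.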
The project involved comprehensive feature analysis, model training, and performance evaluation to ensure the identified features and chosen model were optimal for achieving accurate classification results.
Analysing the Effects of the Paris Climate Agreement (2015) on Different Industries
Our project explores how various industries have evolved in response to the Paris Agreement, focusing on the forestry, agriculture, and buildings sectors. Using Tableau, RStudio, and Power BI, we analyzed the Climate Action Tracker database to assess the impact of the agreement on these sectors. Our findings highlight key trends in greenhouse gas emissions, building activity, and energy consumption across different countries, providing valuable insights into the progress and challenges faced by each industry in meeting climate goals.
Source: Paris Climate Action Dataset