Data Visualization and Analysis of NYPD Motor Vehicle Collision Data
Northeastern University (Graduate) 2019
I analyzed a dataset of motor vehicle collisions in New York City using RStudio to determine the main factors influencing the number of accidents.
I cleaned and filtered the data, then performed exploratory data analysis to uncover relationships between the contributing factors.
I visualized the data with line charts, bar graphs, pie charts, heatmaps, and treemaps built with the ggplot2 package in RStudio.
I also broke down the number of accidents in each borough at a given time or over a given period, along with their causes and the vehicles involved.
Skills Used
Data analysis
Data visualization
R
ggplot2
Exploratory data analysis
Data cleaning
Data filtering
Line charts
Bar graphs
Pie charts
Heatmaps
Treemaps
Impact
The insights from this project could be used to inform policies and interventions to reduce the number of accidents in New York City.
For example, the findings could be used to target enforcement efforts in areas where accidents are most common or to develop educational campaigns about safe driving practices.
Data Warehouse & Business Intelligence on CMS: Medicare|Medicaid data
Northeastern University (Graduate) 2019
This project involved the design, implementation, and deployment of a data warehouse for Medicare and Medicaid health insurance data. The data warehouse was designed using a star schema, which is a common design pattern for data warehouses that facilitates data analysis. The data warehouse was deployed on Azure SQL Database, which is a cloud-based relational database service that provides scalability and performance.
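The star schema described above can be sketched as one fact table surrounded by dimension tables. The example below is a minimal illustration in Python with an in-memory SQLite database rather than Azure SQL Database; the table names, column names, and rows (`fact_claim`, `dim_provider`, `dim_date`, `paid_amount`) are hypothetical and are not taken from the project.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
# All names and values here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_provider (
    provider_key  INTEGER PRIMARY KEY,
    provider_name TEXT,
    state         TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    quarter  INTEGER
);
CREATE TABLE fact_claim (
    claim_id     INTEGER PRIMARY KEY,
    provider_key INTEGER REFERENCES dim_provider(provider_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    paid_amount  REAL
);
""")

conn.executemany(
    "INSERT INTO dim_provider VALUES (?, ?, ?)",
    [(1, "Provider A", "MA"), (2, "Provider B", "NY")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20190101, 2019, 1), (20190401, 2019, 2)],
)
conn.executemany(
    "INSERT INTO fact_claim VALUES (?, ?, ?, ?)",
    [(1, 1, 20190101, 100.0), (2, 1, 20190401, 50.0), (3, 2, 20190101, 75.0)],
)


def paid_by_state(connection):
    """Aggregate the fact table's measure by one dimension attribute."""
    rows = connection.execute("""
        SELECT p.state, SUM(f.paid_amount)
        FROM fact_claim AS f
        JOIN dim_provider AS p ON p.provider_key = f.provider_key
        GROUP BY p.state
        ORDER BY p.state
    """).fetchall()
    return dict(rows)


totals = paid_by_state(conn)
```

Because every measure sits in the fact table and every descriptive attribute sits in a dimension, an analytical query is a single join per dimension followed by a `GROUP BY`.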
Data scrubbing involved identifying and removing inconsistent and irrelevant records from the source data, using Excel macros, VLOOKUP, HLOOKUP, and R. Calculated fields were then added to support additional analysis.
ETL packages were created in SSIS to automate data processing and loading, with sequence containers used to improve the efficiency of the workflow.
OLAP cubes were built in SSAS to produce reports according to business requirements. The reports provided insights into the data and tracked performance metrics.
Interactive and static dashboards were built in Tableau to showcase KPIs and performance metrics.
This project was a valuable learning experience that allowed me to apply my skills in data warehousing and business intelligence. I gained experience in designing, implementing, and deploying data warehouses, as well as in data scrubbing, ETL, OLAP, and BI.
Skills Used
Data warehousing
Business intelligence
Data scrubbing
ETL
OLAP
BI
Excel
R
SSIS
Tableau
Impact
Designed and implemented a star schema data warehouse for Medicare and Medicaid health insurance data, which improved the efficiency of data analysis.
Performed data scrubbing using Excel Macros, VLOOKUP, HLOOKUP, and R, which removed inconsistent and irrelevant data and improved the quality of the data.
Implemented sequence containers in SSIS to improve the efficiency of the workflow by 25%.
Built multidimensional OLAP cubes using SSAS to create reports according to business requirements, which provided insights into the data and helped track performance metrics.
Generated interactive and static dashboards in Tableau to provide BI insights and showcase KPIs and performance metrics, which made the data more accessible and easier to understand.
Single-Cell PBMC Multimodal REAP-seq Data
Internship 2018
I worked on a single-cell RNA sequencing project from August to December 2018. The goal of the project was to analyze multimodal data from peripheral blood mononuclear cells (PBMCs).
I loaded both the RNA UMI matrix and the ADT UMI matrix into a Seurat object. I added the protein expression levels to the Seurat object and appended "CITE_" to each of the ADT row names.
I drew a violin plot of the RNA data to check the average gene expression per cell. I then normalized and scaled the RNA data using log-normalization and the ADT data using centered log-ratio (CLR) normalization.
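The two normalization methods named above can be sketched for a single vector of counts. This is a plain-Python illustration, not the Seurat code used in the project: the function names, the pseudocount, and the choice of which vector to normalize are assumptions, and Seurat's own CLR implementation differs in detail.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize one cell: scale counts to a common total, then log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


def clr_normalize(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean of the logs.

    Subtracting the mean log is the same as dividing by the geometric mean.
    The pseudocount (an assumption here) keeps zero counts finite.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


rna_example = log_normalize([1, 1])       # two genes with equal counts
adt_example = clr_normalize([5, 0, 20])   # three antibody tags
```

CLR values sum to zero by construction, which puts antibody counts on a relative scale rather than one that depends on total capture.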
I identified the variable genes across cells and ran PCA on the RNA and ADT data separately. I then clustered the cells on the significant PCs at both the gene level and the protein level.
I visualized the resulting clusters on a t-SNE embedding and used feature and ridge plots to compare the distributions of RNA and protein expression.
The results of the project showed that the multimodal data was able to identify different cell types and states in PBMCs. This information could be used to better understand the immune system and to develop new treatments for diseases.
Impact
The identification of different cell types and states in PBMCs could help to better understand the immune system and how it responds to disease. This information could be used to develop new treatments for diseases such as cancer, HIV, and autoimmune disorders.
The use of multimodal data could provide a more comprehensive view of the immune system than is possible with single-cell RNA sequencing alone. This could lead to the discovery of new cell types and states that were previously unknown.
The development of new clustering algorithms could improve the ability to identify cell types and states in single-cell data. This could lead to the development of more accurate and effective treatments for diseases.
Developed and Populated Application Store Database
Northeastern University (Graduate) 2018
I worked on a database development project at Northeastern University from February to April 2018. The goal of the project was to develop a database for an application store.
I used Toad Data Modeler to model the database, defining entities, attributes, keys, and constraints to satisfy the business rules. I generated a DDL script to create the database in MS SQL Server and populated it in accordance with the keys and constraints.
I implemented complex SQL queries, procedures, and triggers to efficiently extract user-based information. I also created a web page using PHP, MySQL, and WAMP Server that could select, add, update, and delete user information in the database.
The results of the project were a well-designed and populated database that could be used to manage an application store.
Skills Used
Data modeling
SQL
PHP
MySQL
WAMP Server
Impact
The database could be used to track user activity and preferences, which could help the application store to improve its offerings.
The database could be used to generate reports on user behavior, which could help the application store to make better decisions about marketing and product development.
The database could be used to provide customer support, by allowing users to search for information about applications and to contact the application store with questions.
Predictive Analysis for In-Hospital Mortality of ICU Patients
Northeastern University (Graduate) 2018
I worked on a predictive modeling project at Northeastern University from February to April 2018. The goal of the project was to develop a model that could predict in-hospital mortality for ICU patients.
I used the MIMIC-III database to extract data on ICU admissions and mortality. I then used RStudio to clean, transform, and analyze the data. I removed parameters with more than 60% missing values, replaced the remaining missing values with the mean of each class, scaled the data, extracted features, and balanced the classes using the ROSE library.
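The first two cleaning steps above (dropping sparse parameters, then filling gaps with the class mean) can be sketched as follows. This is a plain-Python illustration rather than the R code used in the project; the function names, column names, and sample rows are hypothetical and are not MIMIC-III data.

```python
def drop_sparse_columns(rows, max_missing=0.6):
    """Drop every column whose share of missing (None) values exceeds max_missing."""
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing
    ]
    return [{c: r[c] for c in keep} for r in rows]


def impute_class_mean(rows, label, column):
    """Replace missing values in `column` with the mean of the row's class."""
    sums, counts = {}, {}
    for r in rows:
        if r[column] is not None:
            sums[r[label]] = sums.get(r[label], 0.0) + r[column]
            counts[r[label]] = counts.get(r[label], 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}
    # A class with no observed values has no mean; its gaps stay None.
    return [
        dict(r, **{column: means.get(r[label])}) if r[column] is None else dict(r)
        for r in rows
    ]


# Hypothetical sample: "lactate" is 80% missing, "heart_rate" is 40% missing.
sample = [
    {"died": 0, "heart_rate": 80.0, "lactate": None},
    {"died": 0, "heart_rate": None, "lactate": None},
    {"died": 0, "heart_rate": 90.0, "lactate": None},
    {"died": 1, "heart_rate": 110.0, "lactate": 4.0},
    {"died": 1, "heart_rate": None, "lactate": None},
]
cleaned = impute_class_mean(drop_sparse_columns(sample), "died", "heart_rate")
```

Class-mean imputation relies on the outcome label, so in a train/test split the means would normally be computed on the training rows only.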
I applied three machine learning models to the data: linear discriminant analysis (LDA), logistic regression, and support vector machine (SVM). I compared the accuracy, sensitivity, and specificity of the three models and found that the SVM model had the highest accuracy.
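The three metrics compared above all come from the confusion matrix of a binary classifier. The sketch below shows the definitions in plain Python; the labels and predictions are made-up values for illustration, not results from the project.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Share of true positives (deaths, in this setting) that were caught.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of true negatives (survivors) correctly identified.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up labels: 1 = died in hospital, 0 = survived.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
```

Reporting sensitivity and specificity alongside accuracy matters when one class is rare, since a model can score high accuracy while missing most positive cases.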
The results of the project showed that the SVM model was able to predict in-hospital mortality with an accuracy of 85%. This model could be used to help hospitals identify patients who are at risk of death and to take steps to improve their care.
Skills Used
Data cleaning
Data transformation
Data analysis
Machine learning
LDA
Logistic regression
SVM
Impact
Improved ability to predict in-hospital mortality for ICU patients
Potential to reduce the number of ICU deaths
Predictive Modeling on Prudential Life Insurance Risk Classification
Northeastern University (Graduate) 2018
I worked on a predictive modeling project using the Prudential Life Insurance data from February to April 2018. The goal of the project was to develop a model that could predict the risk of death for life insurance applicants.
I used RStudio to inspect, clean, transform, and model the data. I applied hypothesis testing, feature selection, data validation, and machine learning algorithms. I also ran principal component analysis (PCA) for dimension reduction, which helped resolve multicollinearity.
I applied three classification models: generalized linear model (GLM), support vector machine (SVM), and decision tree. I compared the accuracies of the three models and found that the GLM had the highest accuracy.
Skills Used
Data cleaning
Data transformation
Hypothesis testing
Feature selection
Data validation
Machine learning algorithms
Principal component analysis (PCA)
Generalized linear model (GLM)
Support vector machine (SVM)
Decision tree
Impact
Improved risk assessment process
More informed decisions about life insurance applications
Data Visualization and Analysis of Boston Housing
Northeastern University (Graduate) 2018
What was the Project About?
This academic project at Northeastern University investigated the factors that influence housing prices in Boston, focusing on the relationship between neighborhood attributes and prices. The analysis ran from February to March 2018.
Methodology Used:
Data processing and analysis were done in RStudio, an IDE for the R statistical language. The Boston Housing dataset was cleaned, filtered, and transformed to remove redundancies and discrepancies before analysis.
Data Visualization:
Boxplots were used to summarize the dataset's key characteristics and to identify patterns and outliers. These plots revealed trends and correlations in the data.
Findings:
The visual exploration showed how neighborhood attributes affect housing prices. Notably, higher levels of pollution were associated with lower housing prices, a finding that could inform decisions by prospective buyers and property investors.
Conclusion:
This project provided insight into the dynamics of the Boston housing market. Using statistical tools and visualization techniques, the study demonstrated a link between neighborhood attributes and housing prices, with practical implications for the real estate industry.
Ambulatory Stress Level Device
Vidyalankar Institute of Technology (Undergraduate) 2015
We developed a prototype that was built and integrated into a compact hardware model.
Researched stress parameters through online and offline sources, along with the existing systems in use and the components needed.
We studied the pin configurations of amplifiers INA128 and LM358, voltage regulator LM7805, and voltage inverter LMC7660.
Designed and successfully simulated the ECG amplifier, GSR, and power supply circuits in Multisim.
Implemented all the circuits on the PCB, soldered the parts, and performed the required tests on the PCB.
Solar Powered Model Boat
Vidyalankar Institute of Technology (Undergraduate) 2010
Designed and integrated different components used to make the working model.
Researched solar panels and motors suited to the hull design.
Conducted different tests to check the functionality of the model and components.