Data Visualization and Analysis of NYPD Motor Vehicle Collision Data

Northeastern University (Graduate) 2019

I analyzed a dataset of motor vehicle collisions in New York City using RStudio to determine the main factors influencing the number of accidents.

Cleaned and filtered the data, then performed exploratory data analysis to uncover relationships among the contributing factors.

I then visualized the data with line charts, bar graphs, pie charts, heatmaps, and treemaps built with the ggplot2 package in RStudio.

I also identified the accidents in each borough at a given time or over a given period, along with the causes and vehicles involved.

Skills Used

Data analysis

Data visualization

ggplot2

Exploratory data analysis

Data cleaning

Data filtering

Line charts

Bar graphs

Pie charts

Heatmaps

Treemaps

Impact

The insights from this project could be used to inform policies and interventions to reduce the number of accidents in New York City.

For example, the findings could be used to target enforcement efforts in areas where accidents are most common or to develop educational campaigns about safe driving practices.

Data Warehouse & Business Intelligence on CMS: Medicare|Medicaid data

Northeastern University (Graduate) 2019

This project involved the design, implementation, and deployment of a data warehouse for Medicare and Medicaid health insurance data. The data warehouse was designed using a star schema, which is a common design pattern for data warehouses that facilitates data analysis. The data warehouse was deployed on Azure SQL Database, which is a cloud-based relational database service that provides scalability and performance.
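As a rough illustration of the star schema pattern described above, the sketch below builds one fact table joined to two dimension tables in an in-memory SQLite database and runs a typical aggregate query. It is written in Python purely for illustration; the project itself was deployed on Azure SQL Database, and every table name, column name, and value here is hypothetical rather than taken from the actual warehouse.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table and two dimensions.

    All table and column names are hypothetical placeholders.
    """
    conn.executescript("""
        CREATE TABLE dim_provider (
            provider_key   INTEGER PRIMARY KEY,
            provider_state TEXT NOT NULL
        );
        CREATE TABLE dim_program (
            program_key  INTEGER PRIMARY KEY,
            program_name TEXT NOT NULL
        );
        CREATE TABLE fact_claim (
            claim_id     INTEGER PRIMARY KEY,
            provider_key INTEGER NOT NULL REFERENCES dim_provider(provider_key),
            program_key  INTEGER NOT NULL REFERENCES dim_program(program_key),
            paid_amount  REAL NOT NULL
        );
    """)
    # Made-up sample rows, only to make the query below return something.
    conn.executemany("INSERT INTO dim_provider VALUES (?, ?)",
                     [(1, "MA"), (2, "NY")])
    conn.executemany("INSERT INTO dim_program VALUES (?, ?)",
                     [(1, "Medicare"), (2, "Medicaid")])
    conn.executemany(
        "INSERT INTO fact_claim VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 100.0), (2, 1, 2, 50.0), (3, 2, 1, 75.0), (4, 2, 1, 25.0)],
    )


def total_paid_by_program(conn):
    """Aggregate the fact table by an attribute of one dimension."""
    rows = conn.execute("""
        SELECT p.program_name, SUM(f.paid_amount)
        FROM fact_claim AS f
        JOIN dim_program AS p ON p.program_key = f.program_key
        GROUP BY p.program_name
        ORDER BY p.program_name
    """).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
totals = total_paid_by_program(conn)
print(totals)
```

The point of the pattern is that measures live in the central fact table while descriptive attributes live in the dimensions, so most reporting queries reduce to a join from the fact table to one or two dimensions followed by a GROUP BY.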

The data scrubbing process involved identifying and removing inconsistent and irrelevant records from the source data. This was done using a variety of techniques, including Excel Macros, VLOOKUP, HLOOKUP, and R. Calculated fields were then added to provide additional insight.

ETL packages were created in SSIS to automate data processing and loading. Sequence containers were implemented in SSIS to improve the efficiency of the workflow.

OLAP cubes were built in SSAS to create reports according to business requirements. The reports were used to provide insights into the data and to track performance metrics.

Dashboards were generated in Tableau to provide interactive and static visualizations of the data. The dashboards were used to showcase KPIs and performance metrics.

This project was a valuable learning experience that allowed me to apply my skills in data warehousing and business intelligence. I gained experience in designing, implementing, and deploying data warehouses, as well as in data scrubbing, ETL, OLAP, and BI.

Skills Used

Data warehousing

Business intelligence

Data scrubbing

ETL

OLAP

Excel

SSIS

Tableau

Impact

Designed and implemented a star schema data warehouse for Medicare and Medicaid health insurance data, which improved the efficiency of data analysis.

Performed data scrubbing using Excel Macros, VLOOKUP, HLOOKUP, and R, which removed inconsistent and irrelevant data and improved the quality of the data.

Implemented sequence containers in SSIS to improve the efficiency of the workflow by 25%.

Built multidimensional OLAP cubes using SSAS to create reports according to business requirements, which provided insights into the data and helped track performance metrics.

Generated interactive and static dashboards in Tableau to provide BI insights and showcase KPIs and performance metrics, which made the data more accessible and easier to understand.

Single-Cell PBMC Multimodal Reap Sequencing data

Internship 2018

Worked on a single-cell RNA sequencing project from August to December 2018. The goal of the project was to analyze multimodal data from peripheral blood mononuclear cells (PBMCs).

I loaded both the RNA UMI matrix and the ADT UMI matrix into a Seurat object. I added the protein expression levels to the Seurat object and appended "CITE_" to each of the ADT row names.

Drew a violin plot of the RNA data to check the average gene expression per cell. I normalized and scaled the RNA data using the log-normalization method and the ADT data using centered log-ratio (CLR) normalization.
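The two normalization methods named above can be sketched in a few lines. This is an illustrative Python version, not the project's code: the project used Seurat in R, whose CLR implementation differs in some details, and the scale factor of 10,000 and the sample counts below are assumptions chosen for the example.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize one cell's counts: log1p(count / total * scale_factor).

    The scale factor of 10,000 is an assumed, commonly used default.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


def clr_normalize(counts):
    """Centered log-ratio with a pseudocount of 1.

    Each value becomes log(count + 1) minus the mean of log(count + 1),
    so the normalized values sum to zero.
    """
    logs = [math.log1p(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical counts for a single cell (not real data).
rna_cell = [0, 3, 7, 90]      # RNA UMI counts for four genes
adt_cell = [12, 150, 0, 48]   # ADT UMI counts for four antibodies

rna_norm = log_normalize(rna_cell)
adt_clr = clr_normalize(adt_cell)
print(rna_norm)
print(adt_clr)
```

Log-normalization corrects for differences in sequencing depth between cells, while CLR is often preferred for antibody-derived counts because it expresses each protein relative to the geometric mean of the others rather than relative to a library total.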

Found the variable genes between the cells and performed PCA on the RNA and ADT data separately. I then clustered the cells on the significant PCs at both the gene level and the protein level.

I visualized the resulting clusters in t-SNE space and used feature and ridge plots to compare the distributions of RNA and protein expression.

The results of the project showed that the multimodal data could distinguish different cell types and states in PBMCs. This information could be used to better understand the immune system and to develop new treatments for diseases.

Impact

The identification of different cell types and states in PBMCs could help to better understand the immune system and how it responds to disease. This information could be used to develop new treatments for diseases such as cancer, HIV, and autoimmune disorders.

The use of multimodal data could provide a more comprehensive view of the immune system than is possible with single-cell RNA sequencing alone. This could lead to the discovery of new cell types and states that were previously unknown.

The development of new clustering algorithms could improve the ability to identify cell types and states in single-cell data. This could lead to the development of more accurate and effective treatments for diseases.

Developed and Populated Application Store Database

Northeastern University (Graduate) 2018

I worked on a database development project at Northeastern University from February to April 2018. The goal of the project was to develop a database for an application store.

Used Toad Data Modeler to model the database. I defined entities, attributes, keys, and constraints to satisfy the business rules. I generated a DDL script to create the database in MS SQL Server and populated it in accordance with the keys and constraints.

Implemented complex SQL queries, procedures, and triggers to efficiently extract user-based information. I also created a web page using PHP, MySQL, and WAMP Server that could select, add, update, and delete user information from the database.

The results of the project were a well-designed and populated database that could be used to manage an application store.

Skills Used

Data modeling

SQL

PHP

MySQL

WAMP Server

Impact

The database could be used to track user activity and preferences, which could help the application store to improve its offerings.

The database could be used to generate reports on user behavior, which could help the application store to make better decisions about marketing and product development.

The database could be used to provide customer support by allowing users to search for information about applications and to contact the application store with questions.

Predictive Analysis for In-Hospital Mortality of ICU Patients

Northeastern University (Graduate) 2018

Worked on a predictive modeling project at Northeastern University from February to April 2018. The goal of the project was to develop a model that could predict in-hospital mortality for ICU patients.

I used the MIMIC-III database to extract data on ICU admissions and mortality. I then used RStudio to clean, transform, and analyze the data. I removed parameters with more than 60% missing values, replaced the remaining missing values with the mean of each class, scaled the data, extracted features, and balanced the data using the ROSE library.
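The first two cleaning steps above (dropping sparse parameters, then imputing by class mean) can be sketched as follows. This is an illustrative Python version rather than the project's R code; the column names and values are hypothetical, and scaling, feature extraction, and ROSE balancing are not shown.

```python
from statistics import mean


def drop_sparse_columns(rows, max_missing=0.6):
    """Drop columns whose share of missing (None) values exceeds max_missing."""
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing
    ]
    return [{col: row[col] for col in keep} for row in rows]


def impute_class_mean(rows, label_col):
    """Replace None with the mean of the observed values in the row's class."""
    features = [col for col in rows[0] if col != label_col]
    labels = {row[label_col] for row in rows}
    class_means = {}
    for col in features:
        for label in labels:
            observed = [
                r[col] for r in rows
                if r[label_col] == label and r[col] is not None
            ]
            # Stays None if a class has no observed value for this column.
            class_means[(col, label)] = mean(observed) if observed else None
    return [
        {
            col: (class_means[(col, row[label_col])]
                  if col != label_col and row[col] is None
                  else row[col])
            for col in row
        }
        for row in rows
    ]


# Hypothetical records; "died" is the class label (1 = in-hospital death).
patients = [
    {"heart_rate": 80.0, "lactate": None, "rare_lab": None, "died": 0},
    {"heart_rate": None, "lactate": 1.0, "rare_lab": None, "died": 0},
    {"heart_rate": 100.0, "lactate": 3.0, "rare_lab": None, "died": 0},
    {"heart_rate": 120.0, "lactate": 4.0, "rare_lab": 2.0, "died": 1},
    {"heart_rate": 110.0, "lactate": None, "rare_lab": None, "died": 1},
]

cleaned = impute_class_mean(drop_sparse_columns(patients), "died")
print(cleaned)
```

Imputing within each class keeps survivors' and non-survivors' averages apart. Because it uses the outcome label, the class means would need to be computed on training data only to avoid leaking information into a held-out evaluation.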

Applied three machine learning models to the data: linear discriminant analysis (LDA), logistic regression, and support vector machine (SVM). I analyzed the accuracy, sensitivity, and specificity of the three models and found that the SVM model had the highest accuracy.
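For reference, the three evaluation measures mentioned above follow directly from the confusion matrix. The sketch below is a minimal Python illustration; the labels are made up for the example and are not results from the project.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual positives that were detected (true positive rate).
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Share of actual negatives that were detected (true negative rate).
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical labels: 1 = died in hospital, 0 = survived.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

For a mortality model, sensitivity is often the more important figure, since a missed high-risk patient is usually costlier than a false alarm, which is why accuracy alone is reported alongside the other two.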

The results of the project showed that the SVM model was able to predict in-hospital mortality with an accuracy of 85%. This model could be used to help hospitals identify patients who are at risk of death and to take steps to improve their care.

Skills Used

Data cleaning

Data transformation

Data analysis

Machine learning

LDA

Logistic regression

SVM

Impact

Improved ability to predict in-hospital mortality for ICU patients

Potential to reduce the number of ICU deaths

Predictive Modeling on Prudential Life Insurance Risk Classification

Northeastern University (Graduate) 2018

Worked on a predictive modeling project for the Prudential Life Insurance data from February to April 2018. The goal of the project was to develop a model that could predict the risk of death for life insurance applicants.

Used RStudio to inspect, clean, transform, and model the data. I implemented hypothesis testing, feature selection, data validation, and machine learning algorithms. I also ran principal component analysis (PCA) for dimensionality reduction, which helped resolve multicollinearity.

I applied three classification models: a generalized linear model (GLM), a support vector machine (SVM), and a decision tree. I compared the accuracies of the three models and found that the GLM had the highest accuracy.

Skills Used

Data cleaning

Data transformation

Hypothesis testing

Feature selection

Data validation

Machine learning algorithms

Principal component analysis (PCA)

Generalized linear model (GLM)

Support vector machine (SVM)

Decision tree

Impact

Improved risk assessment process

More informed decisions about life insurance applications

Data Visualization and Analysis of Boston Housing

Northeastern University (Graduate) 2018

What was the Project About?
This academic project at Northeastern University investigated the factors that influence housing prices in Boston, focusing on the relationship between neighborhood attributes and housing prices. The analysis was carried out from February to March 2018.

Methodology Used:
The primary tool for data processing and analysis was RStudio, an integrated development environment for R. The Boston Housing dataset was cleaned, filtered, and transformed to eliminate redundancies and discrepancies before analysis.

Data Visualization:
Several visualization techniques were used to explore the data. In particular, boxplots summarized the dataset's key characteristics and helped identify patterns and outliers, from which trends and correlations were drawn.

Findings:
The visual exploration showed that neighborhood attributes affect housing prices. Notably, higher levels of pollution were associated with lower housing prices, a finding that could inform decisions by prospective buyers and property investors.

Conclusion:
This data visualization and analysis project provided insight into the dynamics of the Boston housing market. Using statistical tools and visualization techniques, the study demonstrated the link between neighborhood attributes and housing prices, with practical implications for the real estate industry.

Ambulatory Stress Level Device

Vidyalankar Institute of Technology (Undergraduate) 2015

We developed a prototype that was built and integrated into a compact hardware model.

Researched, both online and offline, the parameters of stress, the existing systems in use, and the components needed.

We studied the pin configurations of the INA128 and LM358 amplifiers, the LM7805 voltage regulator, and the LMC7660 voltage inverter.

Designed and successfully simulated the circuits of the ECG amplifier, GSR, and power supply in Multisim.

Implemented all the circuits on the PCB, soldered the parts, and performed the required tests on the PCB.

Solar Powered Model Boat

Vidyalankar Institute of Technology (Undergraduate) 2010

Designed and integrated different components used to make the working model.

Researched solar panels and motors suitable for the hull designs.

Conducted different tests to check the functionality of the model and components.
