Overview
Covid-19 presented a lot of data challenges due the complexity and wide reaching nature of the disease. In this project, I plan on utilizing SQL & Power BI to re-organize and analyse a Covid-19 dataset. The goal is to showcase usefulness of SQL and Power BI in similar problem cases.
Data Overview
-
General
-
Originating from the Johns Hopkins University CSSE, this data encompasses various attributes of individuals diagnosed with COVID-19 within a hospital setting, covering the period between January and April 2020.
-
-
Main Characteristics of Interest
-
Status
-
Their status in the hospital. A observation can either be released isolated or deceased ​
-
-
Month confirmed
-
When the observation entered the hospital​
-
-
Age
-
Nationality
-
-
Problems
-
The 'Nationality' field requires standardization. The field will be normalized by converting country-of-origin entries into their proper nationality formats (e.g., replacing 'France' with 'French') to maintain uniform categorization.
-
The high cardinality of the 'Nationality' field presents challenges for meaningful aggregate analysis. To address this, we will introduce a 'Continent' classification to consolidate the data into broader, more manageable groups. This strategic aggregation reduces categorical complexity and provides a clearer macro-level perspective for high-level reporting.
-
SQL Implementation
-
Goal
-
Standardize the nationality field to resolve naming inconsistencies between country nouns and demonyms, such as the use of France in place of French.
-
-
Process
-
Map all country-of-origin nouns to formal demonyms to ensure categorical consistency.
-
Convert stand-alone country names to standardized nationalities even if the target form does not currently exist in the dataset.
-
-
Goal
-
Derive a continent attribute based off the standardized nationality field to its respective geographic region.
-
-
Process
-
Applying a mapping schema to the nationality column to enable hierarchical data aggregation.​
-
Power BI Insights

General Data Summaries
-
The current analysis is performed on a filtered cohort derived from the primary dataset. This subset is restricted to observations identifying as Korean, resulting in a sample size of approximately 3.3k records.
​
-
The population is primarily distributed between the Released and Isolated categories, which account for 55.5% and 42.5% of the total observations, respectively. The Deceased cohort represents a marginal segment, comprising only 2.0% of the sample.
​
-
The age distribution of the sample exhibits a central tendency around 50 years, with the density of observations decreasing progressively as they deviate from this midpoint.
​
-
The confirmed dates chart reveals that observation volume reached its peak in March 2020, representing the highest frequency of cases recorded within the reporting period.
​
-
An analysis of the gender distribution reveals a marginal predominance of female subjects, who account for 56% of the sample. Male subjects comprise the remaining 44% of the total observations.

Status/Average Age Combo Chart
The Status and Average Age chart indicates mean ages of 70, 40, and 35 for deceased, isolated, and released groups. Higher mortality among the elderly, as documented by the CDC, explains the elevated average age within the deceased cohort.​

Status/Sex Combo Chart
The Status/Sex Combo chart indicates that males constitute a higher proportion of the deceased group, contrasting with the female majority in the general dataset and other status categories. This discrepancy potentially arises from the limited sample size within the deceased cohort. It may also suggest an increased mortality risk for males influenced by external or biological variables.
Summary
I standardized the nationality column and derived a continent field using SQL to facilitate analysis through lower cardinality. These enhancements enabled a demographic exploration of the Korean subset within Power BI. I then employed combo charts to visualize and interpret data relationships.