Chapter 2 Proposal

2.1 Research topic

Although the data is readily available, it is not possible to infer insights and trends simply by looking at it. To make the data useful and understandable for the public, we have decided to perform a thorough exploratory data analysis through interactive visualizations, which will enable us to infer valuable insights from the data available to us.

The columns we plan to use are:

YearStart (Year Start) / YearEnd (Year End) - We intend to merge these two columns into a single column named “Year”, as the values of YearStart and YearEnd are the same in every row
LocationAbbr (Location Abbreviation) - Short abbreviations of the locations. Useful for neat plots
LocationDesc (Location Description) - Full names of the locations abbreviated in the LocationAbbr column
DataSource (Data Source) - Source from which the particular data point was collected
Topic - Name of the disease
StratificationCategory1 - Category of stratification, consisting of race, ethnicity, and overall categories
Stratification1 - The value corresponding to the category given in the StratificationCategory1 column
GeoLocation - The geographic location of the particular data point. Useful for geospatial analysis
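
The planned merge of YearStart and YearEnd can be sketched in R as follows. This is a minimal illustration under stated assumptions, not our final implementation: the data frame `cdi` holds made-up rows, and the helper name `merge_year` is hypothetical. The sketch first checks that the two columns really agree in every row before collapsing them.

```r
# Toy rows standing in for the real data; all values are made up for illustration.
cdi <- data.frame(
  YearStart    = c(2019, 2020, 2021),
  YearEnd      = c(2019, 2020, 2021),
  LocationAbbr = c("NY", "CA", "TX"),
  Topic        = c("Diabetes", "Asthma", "Diabetes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the two year columns into a single "Year" column.
merge_year <- function(df) {
  # Stop with an error if any row has different start and end years.
  stopifnot(all(df$YearStart == df$YearEnd))
  df$Year <- df$YearStart
  # Drop the two original columns, which are now redundant.
  df$YearStart <- NULL
  df$YearEnd <- NULL
  df
}

cdi_year <- merge_year(cdi)
print(names(cdi_year))
```

Checking equality before dropping a column guards against silently losing information if a later version of the dataset contains multi-year rows.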

Insights that can potentially be inferred, or questions that we can potentially answer, by analyzing the data:

Q - How is each topic spread across locations, based on the geospatial data? This can be examined for every year to understand which locations are susceptible or improving with respect to a particular topic in each year.

Q - Which gender is more likely to be affected by a specific topic?

Q - What patterns exist between year/time and topic (highest, lowest, etc.)?

Q - Are there states where males or females are affected more?

Q - Which states are dealing with the largest number of topics? We can try to infer which states are more suitable to live in because they are dealing with fewer topics.
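
As an illustration of the last question, the number of distinct topics recorded per state could be counted along the following lines. This is only a sketch in base R under stated assumptions: the rows in `cdi` are made up, and `topics_per_state` is a hypothetical name. Counting distinct topics is one possible reading of "dealing with the most topics"; the real analysis may weight topics differently.

```r
# Toy rows standing in for the real data; all values are made up for illustration.
cdi <- data.frame(
  LocationAbbr = c("NY", "NY", "NY", "CA", "CA", "TX"),
  Topic        = c("Diabetes", "Asthma", "Diabetes", "Asthma", "Asthma", "Cancer"),
  stringsAsFactors = FALSE
)

# Split the topics by location and count the distinct topics in each group.
topics_per_state <- vapply(
  split(cdi$Topic, cdi$LocationAbbr),
  function(x) length(unique(x)),
  integer(1)
)

# Sort so that the locations with the most distinct topics come first.
topics_per_state <- sort(topics_per_state, decreasing = TRUE)
print(topics_per_state)
```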

2.2 Data availability

The dataset we are using is taken from data.gov [2]. It is provided and maintained by the Division of Population Health of the Centers for Disease Control and Prevention (CDC). The dataset was first published on 21 February 2021 and was last updated on 24 March 2022. It is publicly available and is licensed under the Open Database License (ODbL) [3].

The National Association of Chronic Disease Directors (NACDD), the Council of State and Territorial Epidemiologists (CSTE), and the CDC collaborated to develop the chronic disease indicators (CDIs), a collection of surveillance indicators that is accessible online. The CDC’s Division of Population Health provides about 124 indicators that allow states, territories, and large metropolitan areas to uniformly define, collect, and report chronic disease data that are integral to public health practice. In addition to providing access to state-specific indicator data, the CDI website serves as a gateway to supplementary information and data resources.

Public health authorities require access to the most pertinent, recent, and consistently defined chronic illness surveillance data at the state and county level in order to monitor diseases and risk factors over time and to plan and implement effective treatments. Public health experts and decision-makers can use the CDI website to obtain uniformly defined state-level and selected metropolitan-level statistics for chronic illnesses and risk factors that significantly affect public health. The monitoring, prioritization, and evaluation of public health interventions for chronic illness depend on these indicators. Most of the current indicators, if not all of them, are also available on other websites, provided either by the data sources or by categorical chronic illness programs.
The CDI website, on the other hand, is the only complete source with full access to a range of indicators for the surveillance of chronic illnesses, conditions, and risk factors at the state level and for selected large metropolitan areas.

Multiple primary surveillance data sources are used to obtain CDI data, and the expansion of the CDI set requires the inclusion of new sources of data. Previously, the CDIs relied on data from nine primary sources, including the American Community Survey (ACS), birth and death certificate data in the National Vital Statistics System (NVSS), the State Tobacco Activities Tracking and Evaluation System, the United States Renal Data System, and the Youth Risk Behavior Surveillance System. The revised CDIs retain the use of data from these sources, and additional data will be obtained from the Pregnancy Risk Assessment Monitoring System, the Alcohol Epidemiologic Data System, the Alcohol Policy Information System, alcohol policy legal research, the National Survey of Children’s Health, State Emergency Department Databases, State Inpatient Databases, the Centers for Medicare and Medicaid Services Chronic Condition Warehouse and the Medicare Current Beneficiary Survey, the U.S. Department of Agriculture, the CDC School Health Profiles, Achieving a State of Healthy Weight, Maternal Practices in Infant Nutrition and Care, the Breastfeeding Report Card, the Health Resources and Services Administration Uniform Data System, the National Immunization Survey, and the Water Fluoridation Reporting System. Many of these sources, such as the Behavioral Risk Factor Surveillance System (BRFSS), use complex sampling designs and weights that must be taken into account during analysis. Additional details on these data sources are provided in Table 4 of the cited paper [4].

The data is distributed as a comma-separated values (CSV) file. We have downloaded this file to our local machine and plan to import it by reading the CSV in R, after which we will proceed with further analysis.
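
The import step described above might look like the following in base R. This is a sketch under stated assumptions rather than our final code: a small temporary file with made-up rows stands in for the downloaded CSV so that the example runs on its own, and `csv_path` is a hypothetical variable that would instead point to the file saved from data.gov.

```r
# A temporary file stands in for the downloaded CSV (assumption for illustration);
# in practice csv_path would be the local path of the file saved from data.gov.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "YearStart,YearEnd,LocationAbbr,Topic",
  "2019,2019,NY,Diabetes",
  "2020,2020,CA,Asthma"
), csv_path)

# Read the file, keeping text columns as character vectors rather than factors.
cdi <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick structural checks before any further analysis.
str(cdi)
print(dim(cdi))
```

Inspecting the structure and dimensions immediately after import is a cheap way to confirm that the header row and column types were parsed as expected.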