Histogram Discrepancies of Those with And without Heart Disease Paper

BRFSS 2015 dataset on Kaggle

https://www.kaggle.com/alexteboul/heart-disease-he…

Chief task: Tell me a story about what factors could be behind HD from your explorations in this data set.
Remark 1: Make the histogram your primary tool.
Remark 2: You can construct and organize your story by including many factors' effects on HD.
Remark 3: For the sake of human knowledge, unimportant factors could be as important as important ones.
“Try to write a creative investigative story out of your computational explorations”.

1. Notebook on Kaggle:

https://www.kaggle.com/code/alexteboul/heart-disease-health-indicators-dataset-notebook/notebook

2. The following is from discussions on Kaggle.

GenHlth (1-5, where 1 is excellent and 5 is poor)
Age (convert to 14-level age buckets)
1 Age 18 to 24
2 Age 25 to 29
3 Age 30 to 34
4 Age 35 to 39
5 Age 40 to 44
6 Age 45 to 49
7 Age 50 to 54
8 Age 55 to 59
9 Age 60 to 64
10 Age 65 to 69
11 Age 70 to 74
12 Age 75 to 79
13 Age 80 or older
Diabetes: In the original dataset, that feature was “DIABETE3: (Ever told) you have diabetes (If “Yes” and respondent is female, ask “Was this only when you were pregnant?”. If respondent says pre-diabetes or borderline diabetes, use response code 4)”. After the change, 0 is no diabetes, 1 is pre-diabetes, and 2 is diabetes.
First word of data!
Analysis of Variation (ANOVA)?
Exploring real data for sciences!
Exploring small and big simulated data for checking merits of
theoretical results.
You make “No Mistakes” in explorations.
You teach yourself what is right as well as what is wrong.
This is a balancing mindset for theoretical results and scientific
knowledge.
Hsieh Fushing (UC Davis)
ANOVA
January 29, 2023
7 / 45
Normality: random variable and its density function.
What is a Normal random variable $Z \sim N(\mu, \sigma^2)$?
What is a random variable ($Z$)?
ANS: A designated observable entity, not necessarily a measurement per se, that comes with varying outcomes, not necessarily numerical values.
Sets of observable outcomes: {Red, Green, Blue} or {6ft2.780in, 5ft10.234in, …, 7ft0.590in}.
A concept of frequency (or simply proportion): {88/300 for Red, 132/300 for Green, 80/300 for Blue} and {1/3000 for 6ft2.780in, 1/3000 for 5ft10.234in, …, 1/3000 for 7ft0.590in}.
Something is “wrong” or at least “incoherent” or “different” here!!
Information?
Frequencies (or proportions) on {Red, Green, Blue}: {88/300 for Red, 132/300 for Green, 80/300 for Blue} indeed bring out “information”.
In contrast, frequencies (or proportions) on {measured heights}: {1/3000 for 6ft2.780in, 1/3000 for 5ft10.234in, …, 1/3000 for 7ft0.590in} do not exhibit “information”.
Why?
The essence of “information” is some sort of “aggregating” pattern.
How to make aggregating patterns for precisely measured quantities?
Is something fundamental shared by the patterns of these two datasets about these two kinds of “random variables”?
Aggregating pattern information.
“Distribution”
Describing: Where are data points located?
Categorical variable, like color, has its aggregating patterns on all
categories, like {Red, Green, Blue}.
Quantitative variable, like height, has its aggregating patterns
shown via “density” along the axis of measurement.
So, the proper description of “distribution” is the fundamental
information commonly contained in the two data-sets.
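The contrast between the two kinds of data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the colour counts come from the slide, while the 3000 heights are simulated with made-up parameters (mean 69 in, standard deviation 3 in), and the helper name `proportions` is hypothetical.

```python
from collections import Counter
import math
import random

def proportions(values):
    """Return {outcome: proportion} for a list of observed outcomes."""
    n = len(values)
    return {outcome: count / n for outcome, count in Counter(values).items()}

# Categorical data: the slide's 300 colour observations.
colors = ["Red"] * 88 + ["Green"] * 132 + ["Blue"] * 80
color_props = proportions(colors)

# Quantitative data: 3000 simulated heights in inches (assumed parameters).
rng = random.Random(0)
heights = [rng.gauss(69.0, 3.0) for _ in range(3000)]

# Raw, precisely measured values: nearly every outcome occurs exactly once.
raw_props = proportions(heights)

# Aggregating into one-inch bins brings a pattern back.
binned_props = proportions([math.floor(h) for h in heights])

print(color_props)
print(len(raw_props), "distinct raw heights")
print(len(binned_props), "one-inch bins")
```

The raw heights give thousands of outcomes with proportion about 1/3000 each, whereas the binned heights give a few dozen bars of visibly different heights, which is the aggregating pattern described above.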
Now you see.
Scatter plot of heights
• How many measurements are less than $t_i$? (ECDF: $F_n(t)$)
• This is a concept of “distribution”.
• Or how many are between successive $t_i$?
Figure: Distribution of heights.
Now you see.
Scatter plot of heights
• This is the concept of “distribution”: what are the proportions between successive $t_i$?
Figure: A histogram of heights.
Distributional shape
The information of a distribution can be described by its “shape”.
The distributional shape of a finite set of data points derived from a random variable is specifically called a histogram.
The histogram of a categorical variable has a fixed basis.
The histogram of a quantitative variable has a slightly fluid basis: some choices of bins are more revealing about the shape than others.
A histogram is a unified visible representation of the information in “a distributional shape” pertaining to a random variable, based on a finite sample.
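The “slightly fluid basis” of a quantitative histogram can be seen by counting the same sample with different numbers of bins. The sketch below is an assumption-laden illustration: the sample is simulated (two made-up groups centred at 62 and 72), the helper `bin_counts` is hypothetical, and it uses equal-width bins that are closed on the left, with the maximum placed in the last bin.

```python
import random

def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max].

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land at index n_bins, so clamp it.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts

# Simulated sample with two clusters (made-up parameters).
rng = random.Random(1)
sample = ([rng.gauss(62.0, 2.0) for _ in range(500)]
          + [rng.gauss(72.0, 2.0) for _ in range(500)])

for n_bins in (2, 10, 40):
    print(n_bins, "bins:", bin_counts(sample, n_bins))
```

With two bins the sample looks like two equal blocks; with ten bins the two clusters and the dip between them become visible; with forty bins the bar heights start to look noisy.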
Example of histogram
Remarks on histogram
A histogram can be built on any space to fit our visualization purposes.
Such spaces may or may not be metric.
The idea of “taking an average” might not be meaningful.
That is, the concepts of “mean”, “variance” or “correlation” are not suitably embraced in a histogram.
Histogram provides the unified piece of information content for any
1D-datasets, but mean and variance do not.
Friday entertainment!
Show Adele’s “Easy on Me” on YouTube.
Show her facial image.
The idea of “data”: her facial images.
My research question: how are Adele’s facial expressions linked to the song’s lyrics?
Comparing multiple 1D-datasets.
If a histogram indeed embraces the distributional shape information in data, should it be the basis for comparing multiple 1D-datasets?
If multiple individual histograms of multiple 1D-datasets share the same shape, then the histogram of their pooled dataset would have the same shape as well.
If all 1D-datasets, with the same sample size, share the same distributional information, then the proportions of data points from the different datasets are equal within each bar of the histogram of the pooled dataset.
Comparing multiple 1D-datasets.
Any unequal proportions within any single bar will shed a local piece
of information of non-equality.
Based on histogram with color-coding schemes, comparison of
multiple 1D-datasets becomes an easy task, at least visually and
practically speaking.
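The per-bar comparison described above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the helper name `bar_shares` and the toy numbers are assumptions, bins are taken as $(t_{k-1}, t_k]$, and the groups are assumed to have equal sample sizes as the slide requires.

```python
from bisect import bisect_left

def bar_shares(groups, edges):
    """Share of each group inside every bar of the pooled histogram.

    groups: {label: list of numbers}; edges: sorted bin edges t_0 < ... < t_K.
    Returns one dict per bin (t_{k-1}, t_k]; an empty bar gives {}.
    """
    counts = [{label: 0 for label in groups} for _ in range(len(edges) - 1)]
    for label, values in groups.items():
        for v in values:
            # bisect_left finds i with edges[i-1] < v <= edges[i].
            k = bisect_left(edges, v) - 1
            if 0 <= k < len(counts):
                counts[k][label] += 1
    shares = []
    for bar in counts:
        total = sum(bar.values())
        shares.append({lab: c / total for lab, c in bar.items()} if total else {})
    return shares

edges = [0, 1, 2, 3]
same = {"A": [0.5, 1.5, 2.5], "B": [0.5, 1.5, 2.5]}
shifted = {"A": [0.5, 0.6, 1.5], "B": [1.5, 2.5, 2.6]}

print(bar_shares(same, edges))     # every bar is split evenly
print(bar_shares(shifted, edges))  # the outer bars are dominated by one group
```

Any bar whose shares depart clearly from equality is a local piece of evidence that the datasets differ there; in a plot, these shares are what the colour-coding within each bar displays.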
Example of comparison via histogram
Figure: Example of comparing measurements of the four features (sepal length, sepal width, petal length, petal width) across three species of Iris.
Project #1 on Heart disease (HD)
BRFSS 2015 dataset on Kaggle
(https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset)
Chief task: Tell me a story about what factors could be behind HD
from your explorations in this data set.
Remark 1: Make the histogram your primary tool.
Remark 2: You can construct and organize your story by including many factors’ “effects” on HD.
Remark 3: For the sake of human knowledge, unimportant factors
could be as important as important ones.
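One possible starting point for the project is to compare, for a single factor, the histogram of respondents with HD against the histogram of those without. The sketch below makes several assumptions that should be checked against the downloaded file: that the data is a CSV with numeric columns, that the HD indicator column is named `HeartDiseaseorAttack`, and that `GenHlth` is a column name. The helpers `load_rows` and `group_histograms` are hypothetical, and the toy rows are made up purely to show the output shape; they are not taken from the dataset. Because the two groups may differ greatly in size, proportions within each group are compared rather than raw counts.

```python
import csv
from collections import Counter, defaultdict

def load_rows(path):
    """Read a CSV of numeric columns into a list of {column: float} dicts."""
    with open(path, newline="") as f:
        return [{k: float(v) for k, v in row.items()} for row in csv.DictReader(f)]

def group_histograms(rows, factor, target):
    """For each value of `target`, the proportion of rows at each level of `factor`."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[factor]] += 1
    return {
        group: {level: c / sum(cnt.values()) for level, c in sorted(cnt.items())}
        for group, cnt in counts.items()
    }

# Made-up toy rows (NOT real BRFSS data), only to demonstrate the function.
toy_rows = [
    {"HeartDiseaseorAttack": hd, "GenHlth": g}
    for hd, g in [(0.0, 1.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (1.0, 3.0), (1.0, 5.0)]
]

hist = group_histograms(toy_rows, "GenHlth", "HeartDiseaseorAttack")
for group, shape in hist.items():
    print("HD =", group, shape)
```

With the real file, `load_rows` would be called on the local path of the downloaded CSV, and the same comparison could be repeated for each factor, including those that turn out to show no discrepancy.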
Basic facts of Normal random variable
Related facts of a histogram of continuous measurements $\{z_i\}_{i=1}^{n}$:
$Z$’s cumulative distribution function (CDF): $F(t) = \mathrm{Prob}[Z \le t] \approx F_n(t) = \frac{\#\{z_i \le t\}}{n}$.
$f(t_k^*) = \frac{F(t_k) - F(t_{k-1})}{t_k - t_{k-1}}$. This is the density function $f(\cdot)$ at $t_k^* \in [t_{k-1}, t_k]$.
$\hat{f}(t_k^*) = \frac{F_n(t_k) - F_n(t_{k-1})}{t_k - t_{k-1}}$.
That is, $n \times (t_k - t_{k-1}) \times \hat{f}(t_k^*) = \#\{z_i \in (t_{k-1}, t_k]\}$.
“A proportion of counts within a bin $(t_{k-1}, t_k]$ re-scaled by its bin-width $t_k - t_{k-1}$ of a histogram” is an estimate of the density function $f(\cdot)$, which is the slope of the CDF $F(t)$, at $t_k^*$.
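The relation between the empirical CDF, the bin counts, and the density estimate can be checked numerically. The sketch below is a small illustration with a made-up sample of five values; the function names `ecdf` and `density_estimates` are hypothetical.

```python
def ecdf(sample, t):
    """Empirical CDF: F_n(t) = #{z_i <= t} / n."""
    return sum(1 for z in sample if z <= t) / len(sample)

def density_estimates(sample, edges):
    """f_hat on each bin (t_{k-1}, t_k]: difference of F_n divided by bin width."""
    return [
        (ecdf(sample, hi) - ecdf(sample, lo)) / (hi - lo)
        for lo, hi in zip(edges, edges[1:])
    ]

sample = [1, 2, 2, 3, 5]   # made-up measurements
edges = [0, 2, 4, 6]       # t_0 < t_1 < t_2 < t_3
f_hat = density_estimates(sample, edges)

n = len(sample)
for (lo, hi), f in zip(zip(edges, edges[1:]), f_hat):
    count = sum(1 for z in sample if lo < z <= hi)
    # n * bin width * f_hat recovers the count in the bin (up to rounding).
    print((lo, hi), "f_hat =", round(f, 3), "count =", count,
          "n*width*f_hat =", round(n * (hi - lo) * f, 6))
```

Because the bins here cover the whole sample, the bar areas (bin width times $\hat{f}$) add up to 1, which is what makes the re-scaled histogram comparable to a density.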
Practical issues in data analysis.
When the sample size $n$ is small, the concept of a density function $f(t)$ is not practical.
Only when $n$ is large can $f(t)$ be estimated via a histogram with very fine bin-width.
That is, the primary information of distribution shape resides in a histogram.
When the sample size $n$ is small but scientists still need the information of distribution shape for comparison purposes, what can they do?
Historically, statisticians offered a solution: assuming Normality, $N(\mu, \sigma^2)$.
Normality: random variable and its density function.
Normal random variable $Z \sim N(\mu, \sigma^2)$
What is a “density function”?
(z) ⇡ lim !0
(z : µ,
z
#Z
< 3000
