BRFSS 2015 dataset on Kaggle
https://www.kaggle.com/alexteboul/heart-disease-he…
Chief task: Tell me a story about what factors could be behind heart disease (HD) from your explorations in this data set.
Remark 1: Make the histogram your primary tool.
Remark 2: You can construct and organize your story by including many factors' effects on HD.
Remark 3: For the sake of human knowledge, unimportant factors could be as important as important ones.
“Try to write a creative investigative story out of your computational explorations.”
1. Notebook on Kaggle:
https://www.kaggle.com/code/alexteboul/heart-disease-health-indicators-dataset-notebook/notebook
2. The following is from discussions on Kaggle.
GenHlth (1-5, where 1 is excellent and 5 is poor)
Age (convert to 14-level age buckets)
1 Age 18 to 24
2 Age 25 to 29
3 Age 30 to 34
4 Age 35 to 39
5 Age 40 to 44
6 Age 45 to 49
7 Age 50 to 54
8 Age 55 to 59
9 Age 60 to 64
10 Age 65 to 69
11 Age 70 to 74
12 Age 75 to 79
13 Age 80 or older
Diabetes: In the original dataset, that feature was “DIABETE3: (Ever told) you have diabetes (If “Yes” and respondent is female, ask “Was this only when you were pregnant?”. If respondent says pre-diabetes or borderline diabetes, use response code 4)”. After the change, 0 is no diabetes, 1 is pre-diabetes, and 2 is diabetes.
First word of data!
Analysis of Variance (ANOVA)?
Exploring real data for sciences!
Exploring small and big simulated data for checking merits of
theoretical results.
You make “No Mistakes” in explorations.
You teach yourself what is right as well as what is wrong.
This is a balancing mindset for theoretical results and scientific
knowledge.
Hsieh Fushing (UC Davis)
ANOVA
January 29, 2023
Normality: random variable and its density function.
What is a Normal random variable $Z \sim N(\mu, \sigma^2)$?
What is a random variable ($Z$)?
ANS: A designated observable entity, not necessarily a measurement per se, that comes with varying outcomes, not necessarily numerical values.
Sets of observable outcomes: {Red, Green, Blue} or {6ft2.780in, 5ft10.234in, …, 7ft0.590in}.
A concept of frequency (or simply proportion): {88/300 for Red, 132/300 for Green, 80/300 for Blue}; {1/3000 for 6ft2.780in, 1/3000 for 5ft10.234in, …, 1/3000 for 7ft0.590in}.
Something is “wrong” or at least “incoherent” or “different” here!!
Information?
Frequencies (or proportions) on {Red, Green, Blue}: {88/300 for Red, 132/300 for Green, 80/300 for Blue}, indeed bring out “Information”.
In contrast, frequencies (or proportions) on {measured heights}: {1/3000 for 6ft2.780in, 1/3000 for 5ft10.234in, …, 1/3000 for 7ft0.590in}, do not exhibit “information”.
Why?
The essence of “information” is some sort of “aggregating” pattern.
How to make aggregating patterns for precisely measured
quantities?
Is something fundamental shared by the patterns in these two datasets, across these two kinds of “random variables”?
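The contrast between the two kinds of frequencies can be tried out directly. The sketch below is only an illustration: the color counts are the ones quoted on the slide, while the 3000 heights are simulated (an assumption, drawn from a Normal with mean 69 in and standard deviation 3 in), and the 1-inch bins are an arbitrary choice.

```python
import math
import random
from collections import Counter

# Categorical sample: 300 observed colors, with the counts quoted on the slide.
colors = ["Red"] * 88 + ["Green"] * 132 + ["Blue"] * 80
color_props = {c: k / len(colors) for c, k in Counter(colors).items()}

# Quantitative sample: 3000 SIMULATED heights in inches (hypothetical data).
rng = random.Random(0)
heights = [rng.gauss(69.0, 3.0) for _ in range(3000)]

# Raw frequencies: precisely measured values essentially never repeat,
# so nearly every value has frequency 1/3000 and no pattern is visible.
distinct_values = len(set(heights))

# Aggregating into 1-inch bins (an arbitrary width) creates the pattern.
binned = Counter(math.floor(h) for h in heights)

print(color_props)
print("distinct raw heights:", distinct_values)
for inch in sorted(binned):
    print(f"[{inch}, {inch + 1}) in: {binned[inch] / len(heights):.3f}")
```

The categories of the color variable do the aggregating for free; for heights, the bins have to be imposed before any proportion becomes informative.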
Aggregating pattern information.
“Distribution”
Describing: Where are data points located?
Categorical variable, like color, has its aggregating patterns on all
categories, like {Red, Green, Blue}.
Quantitative variable, like height, has its aggregating patterns
shown via “density” along the axis of measurement.
So, the proper description of “distribution” is the fundamental
information commonly contained in the two data-sets.
Now you see.
Scatter plot of heights
• How many measurements are less than $t_i$? (ECDF: $F_n(t)$)
• This is a concept of “distribution”.
• Or how many are between successive $t_i$?
Figure: Distribution of heights.
• This is the concept of “distribution”:
• What are the proportions between successive $t_i$?
Figure: A histogram of heights.
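Both questions on these slides (how many measurements fall at or below a cut point, and what proportion falls between successive cut points) can be answered with a few lines of code. This is a minimal sketch under stated assumptions: the ten heights and the cut points are made up for illustration, and the helper names `ecdf` and `bin_proportions` are hypothetical.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return F_n, where F_n(t) is the proportion of sample values <= t."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(t):
        # bisect_right counts how many sorted values are <= t
        return bisect_right(xs, t) / n

    return F_n

def bin_proportions(sample, cuts):
    """Proportion of the sample in each bin (cuts[k-1], cuts[k]]."""
    F_n = ecdf(sample)
    return [F_n(b) - F_n(a) for a, b in zip(cuts, cuts[1:])]

# Hypothetical heights in inches and hypothetical cut points t_i.
heights = [61.2, 63.5, 64.1, 66.0, 66.8, 67.3, 68.9, 70.4, 71.0, 74.6]
cuts = [60, 64, 68, 72, 76]

F_n = ecdf(heights)
props = bin_proportions(heights, cuts)
print(F_n(68))  # proportion of heights at or below 68 in
print(props)    # bar heights of the histogram, as proportions
```

The bar heights are just differences of the ECDF at successive cut points, which is why the two slides describe the same concept of distribution.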
Distributional shape
The information of a distribution can be described by its “shape”.
The distributional shape of a finite set of data points derived from a random variable is specifically called a histogram.
The histogram of a categorical variable has a fixed basis.
The histogram of a quantitative variable has a slightly fluid basis: some choices of bins are more revealing about the shape than others.
A histogram is a unified visual representation of the information of “a distributional shape” pertaining to a random variable, based on a finite sample.
Example of histogram
Remarks on histogram
A histogram can be built on any space to fit our visualization purposes.
Such spaces may or may not be metric.
The idea of “taking an average” might not be meaningful.
That is, the concepts of “mean”, “variance” or “correlation” are not suitably embraced in a histogram.
A histogram provides a unified piece of information content for any 1D-dataset, but the mean and variance do not.
Friday entertainment!
Show Adele’s “Easy on me” on YouTube.
Show her facial image.
The idea of “data”: her facial images.
My research question: how are Adele’s facial expressions linked to the song’s lyrics?
Comparing multiple 1D-datasets.
If a histogram indeed embraces the distributional shape information in data, should it be the basis for comparing multiple 1D-datasets?
If multiple individual histograms of multiple 1D-datasets share the same shape, then the histogram of their pooled dataset would have the same shape as well.
If all 1D-datasets, with the same sample size, share the same distributional information, then the proportions of data points from different datasets are equal within each bar of the histogram of the pooled dataset.
Any unequal proportions within any single bar reveal a local piece of information about non-equality.
Based on histograms with color-coding schemes, comparison of multiple 1D-datasets becomes an easy task, at least visually and practically speaking.
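The within-bar comparison can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the datasets A, B and C are made up, the bins are (a, b] intervals chosen by hand, and `pooled_bar_shares` is a hypothetical helper name. Each dataset here has the same sample size, as the argument requires.

```python
from collections import Counter

def pooled_bar_shares(datasets, cuts):
    """For each bin (cuts[k-1], cuts[k]] of the pooled histogram, return the
    share of that bar's points contributed by each labeled dataset."""
    bins = list(zip(cuts, cuts[1:]))
    counts = [Counter() for _ in bins]
    for label, values in datasets.items():
        for x in values:
            for k, (a, b) in enumerate(bins):
                if a < x <= b:
                    counts[k][label] += 1
                    break
    shares = []
    for c in counts:
        total = sum(c.values())
        shares.append({label: (c[label] / total if total else 0.0)
                       for label in datasets})
    return shares

cuts = [0, 1, 2, 3, 4]

# Two hypothetical datasets with the same shape: every bar splits 50/50.
same = pooled_bar_shares({"A": [0.5, 1.5, 2.5, 3.5],
                          "B": [0.7, 1.2, 2.9, 3.1]}, cuts)

# A hypothetical shifted dataset C: the bars split unequally.
shifted = pooled_bar_shares({"A": [0.5, 1.5, 2.5, 3.5],
                             "C": [2.5, 3.5, 3.6, 3.7]}, cuts)

print(same)
print(shifted)
```

In the second comparison the last bar is three quarters C, and the first two bars contain no C at all: each unequal bar is a local statement about where the two datasets differ. The shares are what a color-coded stacked bar would display.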
Example of comparison via histogram
Figure: Example of comparing measurements of the four features (sepal length, sepal width, petal length, petal width) of three species of Iris.
Project #1 on Heart disease (HD)
BRFSS 2015 dataset on Kaggle
(https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset)
Chief task: Tell me a story about what factors could be behind HD
from your explorations in this data set.
Remark 1: Make the histogram your primary tool.
Remark 2: You can construct and organize your story by including many factors’ “effects” on HD.
Remark 3: For the sake of human knowledge, unimportant factors
could be as important as important ones.
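One possible starting point for the project is to compare, level by level of a single factor, the share of respondents who report HD. The sketch below makes several assumptions that should be checked against the actual file: the target column name `HeartDiseaseorAttack` and its 0/1 coding are assumed, `hd_share_by_level` is a hypothetical helper, and the twelve toy records are invented purely so the code runs. They are not results from BRFSS 2015.

```python
from collections import Counter

def hd_share_by_level(records, factor, target="HeartDiseaseorAttack"):
    """Share of records with target == 1 within each level of `factor`.
    `records` is a list of dicts; the column names are assumptions."""
    totals, cases = Counter(), Counter()
    for row in records:
        level = row[factor]
        totals[level] += 1
        if row[target] == 1:
            cases[level] += 1
    return {level: cases[level] / totals[level] for level in sorted(totals)}

# Invented toy records (NOT real data): GenHlth levels 1, 3 and 5.
toy = (
    [{"GenHlth": 1, "HeartDiseaseorAttack": 0}] * 4
    + [{"GenHlth": 3, "HeartDiseaseorAttack": 0}] * 3
    + [{"GenHlth": 3, "HeartDiseaseorAttack": 1}] * 1
    + [{"GenHlth": 5, "HeartDiseaseorAttack": 0}] * 2
    + [{"GenHlth": 5, "HeartDiseaseorAttack": 1}] * 2
)

shares = hd_share_by_level(toy, "GenHlth")
print(shares)  # {1: 0.0, 3: 0.25, 5: 0.5}
```

A factor whose shares are flat across all levels is a candidate "unimportant" factor in the sense of Remark 3, and is worth reporting alongside the factors whose shares vary.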
Basic facts of Normal random variable
Related facts of a histogram of continuous measurements $\{z_i\}_{i=1}^{n}$
$Z$’s cumulative distribution function (CDF): $F(t) = \mathrm{Prob}[Z \le t] \approx F_n(t) = \frac{\#\{z_i \le t\}}{n}$.
$f(t_k^*) = \frac{F(t_k) - F(t_{k-1})}{t_k - t_{k-1}}$. This is the density function $f(\cdot)$ at $t_k^* \in [t_{k-1}, t_k]$.
$\hat{f}(t_k^*) = \frac{F_n(t_k) - F_n(t_{k-1})}{t_k - t_{k-1}}$
That is, $n \times (t_k - t_{k-1}) \times \hat{f}(t_k^*) = \#\{z_i \in (t_{k-1}, t_k]\}$.
“A proportion of counts within a bin ($(t_{k-1}, t_k]$) re-scaled by its bin-width $t_k - t_{k-1}$ of a histogram” is an estimate of the density function $f(\cdot)$, which is the slope of the CDF $F(t)$, at $t_k^*$.
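The identity linking bin counts, bin widths and the density estimate can be checked numerically. This is a minimal sketch: the eight measurements and the unequal-width cut points are hypothetical, and `histogram_density` is an illustrative helper name, not a library function.

```python
from bisect import bisect_right

def histogram_density(sample, cuts):
    """For each bin (t_{k-1}, t_k], return (t_{k-1}, t_k, count, f_hat), where
    f_hat = (F_n(t_k) - F_n(t_{k-1})) / (t_k - t_{k-1})."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(t):
        # empirical CDF: proportion of values <= t
        return bisect_right(xs, t) / n

    rows = []
    for a, b in zip(cuts, cuts[1:]):
        count = bisect_right(xs, b) - bisect_right(xs, a)
        f_hat = (F_n(b) - F_n(a)) / (b - a)
        rows.append((a, b, count, f_hat))
    return rows

# Hypothetical measurements z_i and cut points t_k (note the wider last bin).
z = [0.3, 0.8, 1.1, 1.4, 1.9, 2.2, 2.6, 3.5]
cuts = [0, 1, 2, 4]

rows = histogram_density(z, cuts)
n = len(z)
for a, b, count, f_hat in rows:
    # n * bin-width * f_hat recovers the raw count in the bin
    print(f"({a}, {b}]: count={count}, f_hat={f_hat}, n*width*f_hat={n * (b - a) * f_hat}")
```

The last two bins hold the same count, yet the wider bin gets half the density: re-scaling by bin width is what makes the bar heights comparable and makes the total area equal to one.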
Practical issues in data analysis.
When the sample size $n$ is small, the concept of a density function $f(t)$ is not practical.
Only when $n$ is large can $f(t)$ be estimated via a histogram with very fine bin-width.
That is, the primary information of distribution shape resides in a histogram.
When the sample size $n$ is small, scientists still need the information of distribution shape for comparison purposes. What can the scientists do?
Historically, statisticians offered a solution: assuming Normality $N(\mu, \sigma^2)$.
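What the Normality assumption buys with a small sample can be sketched with Python's standard library. The seven heights below are hypothetical, and fitting by the sample mean and sample standard deviation (which is what `statistics.NormalDist.from_samples` does) is one common choice rather than the only one.

```python
from statistics import NormalDist

# Hypothetical small sample (n = 7): far too few points for a fine histogram.
sample = [64.2, 66.1, 67.5, 68.0, 69.3, 70.8, 72.4]

# Assume Normality: the whole shape is summarized by mu-hat and sigma-hat.
fitted = NormalDist.from_samples(sample)

# The fitted curve gives a density value at any point ...
density_at_mean = fitted.pdf(fitted.mean)

# ... and a model-based proportion for any bin (a, b].
cuts = [60, 64, 68, 72, 76]
shares = [fitted.cdf(b) - fitted.cdf(a) for a, b in zip(cuts, cuts[1:])]

print(f"mu-hat = {fitted.mean:.2f}, sigma-hat = {fitted.stdev:.2f}")
print([round(s, 3) for s in shares])
```

These bin shares come from the assumed bell shape, not from counting data points, so they are only as trustworthy as the Normality assumption itself.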
Normality: random variable and its density function.
Normal random variable $Z \sim N(\mu, \sigma^2)$
What is a “density function”?
(z) ⇡ lim !0
(z : µ,
z
#Z
< 3000