Stroke Prediction Dashboard

Overview

According to the CDC's Stroke Facts, nearly 800,000 strokes occur in the United States each year. That averages out to one person suffering a stroke every 40 seconds, with someone dying from one every 3.5 minutes. Roughly 15% of stroke patients die within 30 days. Many of those who survive face either a very long road to recovery or serious long-term disability. By helping to uncover which factors are more likely to lead to strokes, we ultimately hope to help people avoid them in the first place.

In an attempt to do so, we used a dataset containing demographic and health information on 5,110 people, along with whether or not each of them had a stroke. Demographic information included gender, age, marital history, occupation, and residence type, while health information included hypertension, heart disease, glucose level, body mass index (BMI), and smoking status. (Information on individuals in the dataset can be viewed using the dropdown menu on the right.) Using that information, we aimed to see if we could:

predict whether or not a person would have a stroke based on that information

determine which of those factors contribute to having a stroke and by how much

Individual ID#:

Demographic/Health Info

Stroke Dataset Breakdown

* for hypertension, heart_disease, and stroke columns, 0 = no and 1 = yes

We were lucky enough to start out with a relatively clean dataset. It did include 201 null ("NaN") values for BMI and one record whose gender was listed as Other, all of which were dropped in the cleaning process, but that still left us with a substantial group of 4,908 people to work with. Unfortunately, the ultimate origin of the data was unknown, and that could influence the results. If, for instance, the data came from people who were already concerned about a stroke, or from just one hospital, or even from a particular region of the United States, any of those could introduce bias. As a result, we were left to ask our own questions about the data and how its origin could affect our findings.
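The cleaning step described above can be sketched in a few lines. This is only an illustration, not the code behind the dashboard: the column names "gender" and "bmi" are assumptions (the page only names hypertension, heart_disease, and stroke explicitly), and the sample rows are made up.

```python
# Hypothetical column names ("gender", "bmi"); rows are plain dicts.
def clean(rows):
    """Drop rows with a missing BMI or a gender of 'Other'."""
    kept = []
    for row in rows:
        bmi = row.get("bmi")
        # NaN is the only value that is not equal to itself.
        missing_bmi = bmi is None or bmi != bmi
        if missing_bmi or row.get("gender") == "Other":
            continue
        kept.append(row)
    return kept


# Tiny made-up sample, for illustration only.
sample = [
    {"gender": "Male", "bmi": 28.1},
    {"gender": "Female", "bmi": float("nan")},
    {"gender": "Other", "bmi": 22.4},
    {"gender": "Female", "bmi": None},
    {"gender": "Female", "bmi": 31.0},
]
cleaned = clean(sample)

# Row count implied by the figures quoted in the text.
remaining = 5110 - 201 - 1
print(len(cleaned), remaining)
```

The subtraction at the end matches the 4,908 people mentioned above, which also implies that the one Other record was not among the 201 rows with a missing BMI.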

Gender

Gender is one category where we could clearly see that the data wasn't a completely random sample. While you would expect something close to a 50/50 split, this was closer to 60/40 in favor of women. Did some of the data come from a women's clinic, introducing more women to our sample than men? Do women simply go to the hospital more often, and that's where our data came from? While this wasn't necessarily a problem, it did raise questions about how the sample was collected.
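As a quick check on the split, the shares can be worked out from the counts in the Stroke likelihood via Gender table further down the page. The helper name share is ours, used only for illustration.

```python
# Counts from the Gender table in this report (stroke + no stroke).
male = 89 + 1922
female = 120 + 2777
total = male + female


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


male_pct = share(male, total)
female_pct = share(female, total)
print(f"male {male_pct:.1f}%, female {female_pct:.1f}%")
```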

Age

Age was another category that piqued our interest. In a completely random sample you would expect an age distribution that approximates that of the general population, but here the number of people gradually increased with age before falling off after 50. Then, while we had a number of people who were exactly 82 years old, there was nobody in the dataset who was older, which seemed very strange. Was the overall trend because older people go to the hospital more often, and that's where the data came from? Was there a cutoff after the age of 82 for some reason? Again, this didn't really cause any problems, but it did raise questions about the original source of the data.

Hypertension

Heart Disease

Marriage Status

Employment

Employment seems like it would be both an interesting and a useful category to examine for a connection to strokes. Are people with certain occupations more prone to strokes than others? That seems like it would be a good thing to know! Unfortunately, this data was lacking in that regard. While it was complete, lumping the vast majority of people into a nebulous Private group wasn't particularly helpful. The category wasn't completely useless, but it would have been preferable to have access to more specific job titles.

Residence

Glucose

* Glucose level <100 = normal, 100-125 = prediabetic, >125 = diabetic
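The glucose bands in the note above translate directly into a small lookup function. This is a minimal sketch; the function name glucose_category is an assumption of ours and not something taken from the dashboard's code.

```python
def glucose_category(level):
    """Map a glucose reading to the bands listed in the note above."""
    if level < 100:
        return "normal"
    if level <= 125:
        return "prediabetic"
    return "diabetic"


for reading in (85, 100, 125, 180):
    print(reading, glucose_category(reading))
```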

Body Mass Index

* BMI <18.5 = underweight, 18.5-24.9 = normal, 25-29.9 = overweight, ≥30 = obese
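The BMI bands can be expressed the same way. One assumption is needed: the note leaves small gaps between bands (for example between 24.9 and 25), so this sketch treats each band as running up to, but not including, the start of the next one. The function name bmi_category is hypothetical.

```python
def bmi_category(bmi):
    """Map a BMI value to the bands in the note above.

    Assumption: values that fall in the gaps between the listed bands
    (such as 24.95) belong to the lower band.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


for value in (17.0, 22.3, 27.5, 34.1):
    print(value, bmi_category(value))
```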

Smoking Status

How much of an impact smoking has on having a stroke is definitely something we wanted to look at. While that information was included here, for some reason the status was listed as unknown for nearly 1,500 of the participants. With BMI we lacked information for only a small percentage of people, but here the gap was much larger. We didn't want to lose nearly one third of our dataset because of this, so we opted to leave the Unknown results in.

Stroke Result

Gender

Stroke likelihood via Gender

Gender	Stroke	No Stroke	Stroke %
male	89	1922	4.426%
female	120	2777	4.142%

Men were slightly more likely to have a stroke than women (4.426% vs 4.142%), but the two were fairly balanced.
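The Stroke % column in these tables is the number of strokes divided by the size of the group (stroke plus no stroke). A minimal sketch of that calculation, using the gender counts from the table above; the function name stroke_rate is ours.

```python
def stroke_rate(stroke, no_stroke):
    """Percentage of a group that had a stroke."""
    total = stroke + no_stroke
    if total == 0:
        return None  # empty group, so the rate is undefined
    return 100 * stroke / total


male_rate = stroke_rate(89, 1922)
female_rate = stroke_rate(120, 2777)
print(f"male {male_rate:.3f}%, female {female_rate:.3f}%")
```

Returning None for an empty group mirrors the n/a entry that appears in the BMI table for a band with no people in it.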

Age

Stroke likelihood via Age

Age	Stroke	No Stroke	Stroke %
0-5	0	311	0.000%
6-10	0	182	0.000%
11-15	1	233	0.427%
16-20	0	273	0.000%
21-25	0	264	0.000%
26-30	0	271	0.000%
31-35	1	312	0.319%
36-40	4	331	1.194%
41-45	5	358	1.377%
46-50	10	343	2.833%
51-55	15	403	3.589%
56-60	27	347	7.219%
61-65	15	309	4.630%
66-70	27	213	11.250%
71-75	24	198	10.811%
>75	80	351	18.561%

Age was the #1 predictor of whether or not someone would suffer a stroke. While strokes are thankfully exceedingly rare for people 35 and younger, they gradually become more common after that. As age continues to rise, the increase in stroke likelihood becomes much steeper, with people 65 and older being particularly prone to them.
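The five-year bands used in the table above can be reproduced with a small helper. This is a sketch under one assumption: fractional ages are rounded up before banding, so an age of 5.5 lands in 6-10. The function name age_band is hypothetical.

```python
import math


def age_band(age):
    """Return the table's age band label for a given age."""
    if age > 75:
        return ">75"
    if age <= 5:
        return "0-5"
    # Assumption: round fractional ages up, then find the band's lower bound.
    lower = 5 * ((math.ceil(age) - 1) // 5) + 1
    return f"{lower}-{lower + 4}"


for age in (3, 10, 40, 65, 82):
    print(age, age_band(age))
```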

Hypertension

Stroke likelihood via Hypertension

Hypertension	Stroke	No Stroke	Stroke %
healthy	149	4308	3.343%
hypertension	60	391	13.304%

Hypertension was strongly associated with having a stroke: people with hypertension were about four times as likely to have a stroke as those without it.

Heart Disease

Stroke likelihood via Heart Disease

Heart Disease	Stroke	No Stroke	Stroke %
heart disease	40	203	16.461%
healthy	169	4496	3.623%

Like hypertension, heart disease was another strong indicator of having a stroke (though luckily in both cases, having hypertension or heart disease is itself on the rare side). People with heart disease were 4.5 times as likely to have a stroke as those without it.
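The "times as likely" figures quoted for hypertension and heart disease are ratios of two stroke rates. A minimal sketch using the counts from the two tables above; the function names are ours.

```python
def stroke_rate(stroke, no_stroke):
    """Fraction of a group that had a stroke."""
    return stroke / (stroke + no_stroke)


def risk_ratio(with_condition, without_condition):
    """Each argument is a (stroke, no_stroke) pair of counts."""
    return stroke_rate(*with_condition) / stroke_rate(*without_condition)


hypertension_ratio = risk_ratio((60, 391), (149, 4308))
heart_disease_ratio = risk_ratio((40, 203), (169, 4496))
print(f"hypertension: {hypertension_ratio:.2f}x")
print(f"heart disease: {heart_disease_ratio:.2f}x")
```

A ratio like this compares groups as they appear in the data; it does not adjust for other factors such as age.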

Marriage Status

Stroke likelihood via Marriage Status

Ever Married	Stroke	No Stroke	Stroke %
ever married	186	3018	5.805%
never married	23	1681	1.350%

In what came as a surprise to us, it appears as if marriage status plays a role in whether or not someone has a stroke. Shockingly, people who were married at some point were 4.3 times as likely to have a stroke as those who have never been married. Yikes! Does this mean that people should avoid getting married if they don't want to end up having a stroke? Or could something else be in play here?

Employment

Stroke likelihood via Employment

Employment	Stroke	No Stroke	Stroke %
private	127	2683	4.520%
self-employed	53	722	6.839%
government	28	602	4.444%
children	1	670	0.149%
never worked	0	22	0.000%

While employment is something we would want to examine as a possible contributor to strokes, the data unfortunately lumped most people into a nebulous Private group, so this wasn't particularly useful. It does once again show that children are very unlikely to have a stroke (and the Never Worked category is too small to draw any real conclusions from), but the rest are in the same general ballpark. Self-employed people are somewhat more likely to have a stroke, which could indicate that the stress of being self-employed and running your own business increases the risk.

Residence

Stroke likelihood via Residence

Residence	Stroke	No Stroke	Stroke %
urban	109	2381	4.378%
rural	100	2318	4.136%

Residence is the category with the smallest difference in stroke rates, with results for people who live in urban vs rural environments being almost identical.

Glucose Level

Stroke likelihood via Glucose

Glucose Level	Stroke	No Stroke	Stroke %
50-75	38	1045	3.509%
76-100	55	1892	2.825%
101-125	29	916	3.069%
126-150	10	258	3.731%
151-175	6	116	4.918%
176-200	22	130	14.474%
201-225	28	212	11.667%
226-250	16	112	12.500%

* Glucose level <100 = normal, 100-125 = prediabetic, >125 = diabetic

Glucose levels had a large impact on having a stroke. While stroke rates are fairly consistent up until 150, they increase after that. Once blood sugar levels pass 175, the likelihood of having a stroke roughly triples.
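The size of the jump at 175 can be checked by comparing the two bands on either side of it in the table above. This sketch only compares those two adjacent bands; it is one of several reasonable ways to measure the jump.

```python
def rate(stroke, no_stroke):
    """Fraction of a band that had a stroke."""
    return stroke / (stroke + no_stroke)


below_175 = rate(6, 116)    # 151-175 band
above_175 = rate(22, 130)   # 176-200 band
jump = above_175 / below_175
print(f"{jump:.2f}x")
```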

Body Mass Index

Stroke likelihood via BMI

BMI	Stroke	No Stroke	Stroke %
<18.5	1	348	0.287%
18.5-24.9	37	1220	2.944%
25-30	73	1309	5.282%
30.1-40	80	1432	5.291%
40.1-50	17	312	5.167%
50.1-60	1	65	1.515%
60.1-70	0	9	0.000%
70.1-80	0	2	0.000%
80.1-90	0	0	n/a
>90	0	2	0.000%

* BMI <18.5 = underweight, 18.5-24.9 = normal, 25-29.9 = overweight, ≥30 = obese

Body Mass Index is another factor with a clear correlation to stroke results. Strokes among underweight people are exceedingly rare, and people at a healthy weight also have a low stroke rate at 2.944% (37 of 1,257). For overweight people, however, the rate climbs to roughly 5.3%, close to double. Curiously, stroke percentages don't continue to rise as BMI goes up; they stay more or less the same once people become overweight.

Smoking Status

Stroke likelihood via Smoking Status

Smoking Status	Stroke	No Stroke	Stroke %
formerly	57	779	6.818%
never smoked	84	1768	4.536%
smokes	39	698	5.292%
unknown	29	1454	1.955%

Smoking status didn't have a particularly large impact on whether or not someone had a stroke. Smokers (and former smokers) did suffer them more often than lifetime non-smokers, but the difference wasn't overly large. Curiously, the unknown category was clearly the lowest, and it doesn't appear to be the result of a small sample size, so something else must be in play here.

When we initially saw that having been married appeared to increase your chance of having a stroke by more than 4 times, this set off alarm bells. Could the stresses of being married really outweigh all of the benefits, and by such a large degree? While possible, that seemed suspect, so we decided to inspect things further. We hypothesized that maybe it wasn't marriage itself that was leading to the increase in strokes; it could be that people who had been married were older than people who hadn't. Since older people are much more likely to have strokes than younger people, Marriage Status could effectively be standing in as a placeholder for Age. In other words, having been married doesn't itself increase your chances of having a stroke, but having been married means that you're more likely to be older, which means that you're more likely to suffer a stroke. To test our hypothesis, we reexamined the data, looking only at people who were at least 40 years of age, which should eliminate most of the age discrepancy between the two marriage groups.

BONUS: Marriage Status (Age Adjusted)

(Only individuals 40+ were examined)

Stroke likelihood via Marriage Status

Ever Married	Stroke	No Stroke	Stroke %
ever married	182	2369	7.134%
never married	21	223	8.607%

Luckily (for people who are married or hope to be), our suspicions were confirmed! Once the age bias is removed, having been married no longer comes with a higher stroke rate. In fact, the data now shows that people who have never been married are actually more likely to suffer a stroke (8.607%) than those who have been (7.134%), though the difference isn't very large. Still, this should come as good news to people who are married or who hope to be!
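The effect of restricting to people aged 40 and over can be reproduced from the two Marriage Status tables. The sketch below recomputes both sets of rates and, by subtracting the 40+ counts from the all-ages counts, shows how much of each group is under 40. The variable and function names are ours.

```python
# (stroke, no_stroke) counts copied from the two Marriage Status tables.
all_ages = {"ever married": (186, 3018), "never married": (23, 1681)}
age_40_plus = {"ever married": (182, 2369), "never married": (21, 223)}


def rate(stroke, no_stroke):
    """Percentage of a group that had a stroke."""
    return 100 * stroke / (stroke + no_stroke)


under_40_share = {}
for group in all_ages:
    total_all = sum(all_ages[group])
    total_40 = sum(age_40_plus[group])
    # People under 40 are whoever is left after removing the 40+ rows.
    under_40_share[group] = 100 * (total_all - total_40) / total_all
    print(
        f"{group}: all ages {rate(*all_ages[group]):.3f}%, "
        f"40+ {rate(*age_40_plus[group]):.3f}%, "
        f"under 40: {under_40_share[group]:.1f}% of group"
    )
```

The last column is the point of the exercise: the never-married group is made up mostly of people under 40, which is why its overall stroke rate looked so low before the age restriction.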

Stroke Prediction

Random Forest Classification Report

Random Forest Confusion Matrix

Stroke Factor Correlation Matrix

Stroke Likelihood via Each Factor