Stroke Prediction Dashboard

Overview


According to the CDC's Stroke Facts, nearly 800,000 strokes occur in the United States each year. That is an average of one person suffering a stroke every 40 seconds, with someone dying from a stroke roughly every 3.5 minutes. The 30-day mortality rate is approximately 15%. Even among people who survive, many face a very long road to recovery or serious long-term disability. Strokes are a very serious issue for people in this country. By helping to uncover which factors are more likely to lead to strokes, we ultimately hope to help people avoid them in the first place.


To do so, we used a dataset featuring both demographic and health information on 5,110 people and whether or not they had a stroke. Demographic information included gender, age, marital history, occupation, and residence type, while health information included whether or not they have hypertension and heart disease, glucose levels, body mass index (BMI), and smoking status. (Information on individuals in the dataset can be viewed using the dropdown menu on the right.) Using that information, we aimed to see if we could:
  1. predict whether or not a person would have a stroke based on that information
  2. determine which of those factors contribute to having a stroke and by how much

Individual ID#:

Demographic/Health Info

Stroke Dataset Breakdown


centered image

* for hypertension, heart_disease, and stroke columns, 0 = no and 1 = yes


We were lucky enough to start out with a relatively clean dataset. It did include 201 null ("NaN") values for BMI and one record with a gender of Other, all of which were dropped in the cleaning process, but that still left us with a substantial group of 4,908 people to work with. Unfortunately, the ultimate origin of the data was unknown, and that could influence the results. If, for instance, the data came from people who were already concerned about a stroke, or from just one hospital, or even from a particular region in the United States, any of those could introduce bias into the data. As a result, we were left to ask our own questions about the data and how it could potentially impact our findings.
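The cleaning step described above can be sketched in a few lines. This is only an illustration, not the code used for the dashboard: the field names `gender` and `bmi` and the four sample rows are assumptions made for the example, with `None` standing in for a missing (NaN) BMI value.

```python
# Hypothetical miniature of the dataset. The field names "gender" and "bmi"
# are assumptions; None stands in for a missing (NaN) BMI value.
records = [
    {"id": 1, "gender": "Male", "bmi": 36.6},
    {"id": 2, "gender": "Female", "bmi": None},
    {"id": 3, "gender": "Other", "bmi": 22.4},
    {"id": 4, "gender": "Female", "bmi": 28.1},
]


def clean(rows):
    """Drop rows with a missing BMI or a gender of 'Other'."""
    return [r for r in rows if r["bmi"] is not None and r["gender"] != "Other"]


cleaned = clean(records)
print(len(records), "->", len(cleaned))

# The same bookkeeping on the real totals: 5,110 rows minus 201 missing BMI
# values minus 1 'Other' record leaves 4,908.
remaining = 5110 - 201 - 1
```

Dropping rows is the simplest option; filling in missing BMI values would keep more data at the cost of extra assumptions.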


Gender

centered image

Gender is one category where we could clearly see that the data wasn't completely randomly selected. While you would expect to see something close to a 50/50 split, this was closer to 60/40. Did some of the data come from a women's clinic, introducing more women to our sample than men? Do women simply go to the hospital more often, and that's where our data came from? While this wasn't necessarily a problem, it did make it clear that our data wasn't a completely random sample.
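As a rough check on how unusual this split would be in a random sample, the sketch below computes the share of women and a z-score against a 50/50 expectation. It uses the male and female totals from the gender table further down this page (after cleaning); the 50/50 baseline is the simplifying assumption stated above, not a census figure.

```python
import math

# Counts from the gender table further down this page (after cleaning).
men = 89 + 1922
women = 120 + 2777
total = men + women

share_women = women / total

# z-score for the observed number of women under a 50/50 assumption,
# using the normal approximation to the binomial distribution.
expected = 0.5 * total
std_dev = math.sqrt(total * 0.5 * 0.5)
z = (women - expected) / std_dev

print(f"women: {share_women:.1%}, z = {z:.1f}")
```

A z-score above 12 is far beyond anything sampling noise would produce, which supports the suspicion that the data was not drawn at random from the general population.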


Age

centered image

Age was another category that piqued our interest. In a completely random sample you would expect an age distribution that approximates that of the general population, but here the counts gradually increased with age before falling off after the age of 50. Then, while we had a number of people who were exactly 82 years old, there was nobody in the dataset who was older, which seemed very strange. Was the overall trend because older people go to the hospital more often, and that's where the data came from? Was there a cutoff after the age of 82 for some reason? Again, this didn't really cause any problems, but it did raise questions about the original source of the data.




Hypertension

centered image

Heart Disease

centered image


Marriage Status

centered image

Employment

centered image

Employment seems like it would be both a very interesting and useful category to look at when trying to see how it correlates to strokes. Are people with certain occupations more prone to strokes than others? That seems like it would be a good thing to know! Unfortunately, this data was lacking in that regard. While it was complete, lumping the vast majority of people into a nebulous Private group wasn't particularly helpful. While this wasn't completely useless, it would have been preferable to have had access to more specific job titles.




Residence

centered image

Glucose

centered image

* Glucose level <100 = normal, 100-125 = prediabetic, >125 = diabetic

Body Mass Index

centered image

* BMI <18.5 = underweight, 18.5-24.9 = normal, 25-29.9 = overweight, ≥30 = obese

Smoking Status

centered image

How much of an impact smoking has on having a stroke is definitely something we wanted to look at. While that information was included here, for some reason the status was listed as unknown for nearly 1,500 of the participants. With BMI we lacked information for only a small percentage of the people, but here the gap was far larger. We didn't want to lose nearly a third of our dataset because of this, so we opted to keep the Unknown results as their own category.


Stroke Result

centered image



Stroke Prediction


Since our data included the ultimate result that we were looking for (whether or not the person had a stroke), we used supervised machine learning to attempt to predict whether or not someone had a stroke given the ten different factors in the dataset. We used six different methods: logistic regression with naive random oversampling, SMOTE oversampling, logistic regression with undersampling, SMOTEENN combination (over- and under-) sampling, balanced random forest, and easy ensemble. In the end, balanced random forest performed best with 77% accuracy, while the other methods ranged from 50% (logistic regression with undersampling) to 76% (logistic regression with random oversampling).
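To illustrate what naive random oversampling does, here is a minimal standard-library sketch. It is not the implementation used for the results above (which is not shown on this page); the function name `random_oversample`, the field names, and the toy rows are all made up for the example.

```python
import random


def random_oversample(rows, label_key="stroke", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("cannot oversample an empty class")
    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# Toy training set with a similar kind of imbalance (values are made up).
train = [{"age": 30 + i, "stroke": 0} for i in range(20)]
train += [{"age": 70, "stroke": 1}, {"age": 81, "stroke": 1}]

balanced = random_oversample(train)
counts = {label: sum(r["stroke"] == label for r in balanced) for label in (0, 1)}
print(counts)
```

Resampling of this kind is normally applied to the training split only, so that the test set still reflects the real class balance.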


Random Forest Classification Report

centered image

Random Forest Confusion Matrix

centered image
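For context when reading the accuracy figures, it helps to know what a trivial baseline would score on data this imbalanced. The sketch below uses the class totals from the tables on this page (209 strokes and 4,699 non-strokes after cleaning) and a hypothetical "model" that always predicts "no stroke"; it is a worked calculation, not part of the original analysis.

```python
# Class counts after cleaning, taken from the tables on this page.
strokes = 209
no_strokes = 4699
total = strokes + no_strokes

# A "model" that always predicts "no stroke" is right for every non-stroke case.
baseline_accuracy = no_strokes / total
baseline_recall = 0 / strokes  # it never catches a single stroke

# Balanced accuracy averages the recall of each class.
balanced_accuracy = (baseline_recall + no_strokes / no_strokes) / 2

print(f"accuracy: {baseline_accuracy:.1%}")
print(f"stroke recall: {baseline_recall:.1%}")
print(f"balanced accuracy: {balanced_accuracy:.1%}")
```

Because plain accuracy rewards ignoring the rare class, per-class recall and balanced accuracy are the more informative numbers to compare across methods.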



Stroke Factor Correlation Matrix

centered image

This correlation matrix illustrates how each of the factors is correlated to one another. Ultimately we are interested in how (and how much) each factor is related to having a stroke. Since the last column (stroke_stroke) features people who have had a stroke, we can look at the bottom row to see how each of the other factors relates to that. Factors with the highest correlation are shaded closer to red (the highest being Age), while those with negative correlation (not having heart disease or hypertension) are more blue. Factors that didn't have much of an influence either way (Residence type) are right in the middle.
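Each cell in a matrix like this is typically a Pearson correlation coefficient. As a hedged illustration, the sketch below computes one such value by hand for hypertension versus stroke, rebuilding the two 0/1 columns from the counts in the hypertension table later on this page. The helper name `pearson` is an assumption for the example; the figure in the image may have been produced differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (hypertension, stroke, number of people) from the hypertension table.
groups = [(1, 1, 60), (1, 0, 391), (0, 1, 149), (0, 0, 4308)]

hypertension, stroke = [], []
for h, s, count in groups:
    hypertension += [h] * count
    stroke += [s] * count

r = pearson(hypertension, stroke)
print(f"hypertension vs stroke: r = {r:.3f}")
```

Note that a fourfold difference in stroke rate still yields a modest coefficient here, because strokes are rare overall; correlation strength and relative risk measure different things.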





Stroke Likelihood via Each Factor

Gender

centered image

Stroke likelihood via Gender

Gender    Stroke   No Stroke   Stroke %
male          89        1922     4.426%
female       120        2777     4.142%

While we found that men are slightly more likely to have strokes than women (4.426% vs 4.142%), the rates were fairly balanced between the two.
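The Stroke % column in this and the following tables is the stroke count divided by the group total (stroke plus no stroke). A minimal sketch of that calculation, using the gender counts above (the function name `stroke_rate` is just for illustration):

```python
def stroke_rate(stroke, no_stroke):
    """Share of a group that had a stroke, or None for an empty group."""
    total = stroke + no_stroke
    return stroke / total if total else None


gender_counts = {"male": (89, 1922), "female": (120, 2777)}

for group, (stroke, no_stroke) in gender_counts.items():
    print(f"{group}: {stroke_rate(stroke, no_stroke):.3%}")
# prints "male: 4.426%" and "female: 4.142%"
```

Returning `None` for an empty group corresponds to the "n/a" entries that appear when a bin contains no people.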


Age

centered image

Stroke likelihood via Age

Age      Stroke   No Stroke   Stroke %
0-5           0         311     0.000%
6-10          0         182     0.000%
11-15         1         233     0.427%
16-20         0         273     0.000%
21-25         0         264     0.000%
26-30         0         271     0.000%
31-35         1         312     0.319%
36-40         4         331     1.194%
41-45         5         358     1.377%
46-50        10         343     2.833%
51-55        15         403     3.589%
56-60        27         347     7.219%
61-65        15         309     4.630%
66-70        27         213    11.250%
71-75        24         198    10.811%
>75          80         351    18.561%

Age was the #1 predictor for whether or not someone would suffer a stroke. While strokes are thankfully exceedingly rare for people 35 and younger, they gradually become more common after that. As age continues to rise, the increase in stroke likelihood becomes much steeper, with people 65 and older being particularly prone to them.
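The five-year bins in the table above can be reproduced with a small helper. This is a sketch under one assumption that the page does not state: each bin's upper edge is inclusive, so any fractional age just above an edge (for example 5.5) falls into the next bin up.

```python
import math


def age_bin(age):
    """Return the bin label used in the age table (upper edges inclusive)."""
    if age > 75:
        return ">75"
    if age <= 5:
        return "0-5"
    upper = math.ceil(age / 5) * 5  # 6..10 -> 10, 11..15 -> 15, and so on
    return f"{upper - 4}-{upper}"


for sample_age in (3, 10, 40, 66, 82):
    print(sample_age, "->", age_bin(sample_age))
```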




Hypertension

centered image

Stroke likelihood via Hypertension

Hypertension   Stroke   No Stroke   Stroke %
healthy           149        4308     3.343%
hypertension       60         391    13.304%

Hypertension was a factor with a high correlation with having a stroke, with people with hypertension being four times as likely to have a stroke as those without it.
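The "four times as likely" figure is a relative risk: the stroke rate in one group divided by the rate in the other. The sketch below reproduces it from the table counts, and applies the same calculation to the heart disease counts that appear in the next section. The function names are illustrative only.

```python
def risk(stroke, no_stroke):
    """Stroke rate within one group."""
    return stroke / (stroke + no_stroke)


def relative_risk(exposed, unexposed):
    """Ratio of stroke rates; each argument is a (stroke, no_stroke) pair."""
    return risk(*exposed) / risk(*unexposed)


# (stroke, no stroke) counts from the tables on this page.
rr_hypertension = relative_risk((60, 391), (149, 4308))
rr_heart_disease = relative_risk((40, 203), (169, 4496))

print(f"hypertension:  {rr_hypertension:.2f}x")
print(f"heart disease: {rr_heart_disease:.2f}x")
```

A relative risk describes an association within this dataset only; it does not account for other factors such as age that may differ between the groups.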


Heart Disease

centered image

Stroke likelihood via Heart Disease

Heart Disease   Stroke   No Stroke   Stroke %
heart disease       40         203    16.461%
healthy            169        4496     3.623%

Like hypertension, heart disease was another high indicator of having a stroke (though luckily in both cases, having hypertension/heart disease itself is on the rare side). With heart disease, people are 4.5 times as likely to have a stroke as those without it are.




Marriage Status

centered image

Stroke likelihood via Marriage Status

Ever Married    Stroke   No Stroke   Stroke %
ever married       186        3018     5.805%
never married       23        1681     1.350%

In what came as a surprise to us, it appears as if marriage status plays a role in whether or not someone would have a stroke. Shockingly, people who were married at some point were 4.3 times as likely to have a stroke as those who have never been married. Yikes! Does this mean that people should avoid getting married if they don't want to end up having a stroke? Or could something else be in play here?


Employment

centered image

Stroke likelihood via Employment

Employment      Stroke   No Stroke   Stroke %
private            127        2683     4.520%
self-employed       53         722     6.839%
government          28         602     4.444%
children             1         670     0.149%
never worked         0          22     0.000%

While employment is something we wanted to examine as a possible contributor to strokes, the data unfortunately lumped most of the people into a nebulous Private group, so this wasn't particularly useful. It does once again show that children are very unlikely to have a stroke (and the Never Worked category is too small to draw any real conclusions from), but the rest are in the same general ballpark. However, self-employed people are somewhat more likely to have a stroke, which could indicate that the stress of being self-employed and running your own business increases the risk of a stroke.




Residence

centered image

Stroke likelihood via Residence

Residence   Stroke   No Stroke   Stroke %
urban          109        2381     4.378%
rural          100        2318     4.136%

Residence is the category with the least amount of difference between having and not having a stroke, with results from people who live in urban vs rural environments being almost identical.


Glucose Level

centered image

Stroke likelihood via Glucose

Glucose Level   Stroke   No Stroke   Stroke %
50-75               38        1045     3.509%
76-100              55        1892     2.825%
101-125             29         916     3.069%
126-150             10         258     3.731%
151-175              6         116     4.918%
176-200             22         130    14.474%
201-225             28         212    11.667%
226-250             16         112    12.500%

* Glucose level <100 = normal, 100-125 = prediabetic, >125 = diabetic


Glucose levels had a large impact on having a stroke. While stroke rates are fairly consistent up to a glucose level of 150, they increase after that. Once blood sugar levels exceed 175, the likelihood of having a stroke roughly triples.
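The category thresholds from the footnote above can be written as a small lookup. This is only a sketch of those stated cutoffs (below 100 normal, 100 to 125 prediabetic, above 125 diabetic); the function name is made up, and note that the table's bins do not line up exactly with these boundaries.

```python
def glucose_category(level):
    """Map a glucose level to the category named in the footnote."""
    if level < 100:
        return "normal"
    if level <= 125:
        return "prediabetic"
    return "diabetic"


for sample_level in (85, 100, 125, 180):
    print(sample_level, "->", glucose_category(sample_level))
```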




Body Mass Index

centered image

Stroke likelihood via BMI

BMI         Stroke   No Stroke   Stroke %
<18.5            1         348     0.287%
18.5-24.9       37        1220     2.943%
25-30           73        1309     5.282%
30.1-40         80        1432     5.291%
40.1-50         17         312     5.167%
50.1-60          1          65     1.515%
60.1-70          0           9     0.000%
70.1-80          0           2     0.000%
80.1-90          0           0        n/a
>90              0           2     0.000%

* BMI <18.5 = underweight, 18.5-24.9 = normal, 25-29.9 = overweight, ≥30 = obese


Body Mass Index is another factor with a high correlation to stroke results. Strokes among underweight people are exceedingly rare, and people at a healthy weight also have a low stroke rate at 2.943%. However, overweight people have roughly 1.8 times that rate, at a little over 5%. Curiously, stroke percentages don't continue to rise as BMI goes up; they more or less stay the same once people become overweight.


Smoking Status

centered image

Stroke likelihood via Smoking Status

Smoking Status   Stroke   No Stroke   Stroke %
formerly             57         779     6.818%
never smoked         84        1768     4.536%
smokes               39         698     5.292%
unknown              29        1454     1.955%

Smoking status didn't have a particularly large impact on whether or not someone had a stroke. Smokers (and former smokers) did suffer them more often than lifetime non-smokers, but the difference wasn't overly large. Curiously, the unknown category was clearly the lowest, and it doesn't appear to be the result of a small sample size, so something else must be in play here.












When we initially saw that having been married was associated with a stroke rate more than 4 times higher, this set off alarm bells. Could the stresses of being married really outweigh all of the benefits, and by such a large degree? While possible, that seemed suspect, so we decided to inspect things further. We hypothesized that maybe it wasn't marriage itself that was leading to the increase in strokes; it could be that people who had been married tend to be older than people who hadn't. Since older people are much more likely to have strokes than younger people, this could mean that Marriage Status effectively stood in as a proxy for Age. In other words, having been married doesn't itself increase your chances of having a stroke, but having been married means that you're more likely to be older, which means that you're more likely to suffer a stroke. To test our hypothesis, we reexamined the data, looking only at people who were at least 40 years of age, which should eliminate most of the age discrepancy between the two marriage groups.



BONUS: Marriage Status (Age Adjusted)

(Only individuals 40+ were examined)
centered image

Stroke likelihood via Marriage Status

Ever Married    Stroke   No Stroke   Stroke %
ever married       182        2369     7.134%
never married       21         223     8.607%

Luckily (for people who are married or hope to be), our suspicions were confirmed! Once you remove the age imbalance between the two groups, having been married is no longer associated with a higher stroke rate. In fact, the data now shows that it is people who have never been married who are more likely to suffer a stroke (8.607%) compared to those who have been (7.134%), though the difference isn't very large. Still, this should come as good news to people who are married or who hope to be!
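The age restriction used above can be sketched as a filter applied before the rates are computed. The rows below are invented purely to show the effect (age, not marriage, drives the strokes in them), and the field names `age`, `ever_married`, and `stroke` are assumptions rather than the dataset's confirmed column names.

```python
def stroke_rate_by_marriage(rows, min_age=0):
    """Stroke rate per marriage group, restricted to people aged min_age or older."""
    totals = {}
    for r in rows:
        if r["age"] < min_age:
            continue
        strokes, people = totals.get(r["ever_married"], (0, 0))
        totals[r["ever_married"]] = (strokes + r["stroke"], people + 1)
    return {group: s / n for group, (s, n) in totals.items()}


# Made-up rows: the young are mostly unmarried and stroke-free, while the
# older people have the same stroke rate regardless of marriage status.
rows = (
    [{"age": 25, "ever_married": False, "stroke": 0}] * 8
    + [{"age": 30, "ever_married": True, "stroke": 0}] * 2
    + [{"age": 60, "ever_married": True, "stroke": 1}] * 2
    + [{"age": 65, "ever_married": True, "stroke": 0}] * 6
    + [{"age": 70, "ever_married": False, "stroke": 1}] * 1
    + [{"age": 62, "ever_married": False, "stroke": 0}] * 3
)

everyone = stroke_rate_by_marriage(rows)
over_40 = stroke_rate_by_marriage(rows, min_age=40)

print("all ages:", everyone)
print("40+ only:", over_40)
```

In this toy data the married group looks more than twice as stroke-prone until the under-40 rows are removed, at which point the gap disappears. A single cutoff is a blunt tool; comparing within narrower age bands would control for age more thoroughly.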