Stroke Prediction-EDA-Classification-Models Python (2024)


First, what is a stroke?

  • A stroke is a medical emergency. It occurs when blood flow to part of the brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die within minutes. Through this data we will try to learn more about strokes and build a model to predict them.

Risk factors for having a stroke include:

  • Age: People aged 55 years and over
  • Hypertension: if the systolic pressure is 140 mm Hg or more, or the diastolic pressure is 90 mm Hg or more
  • Hypercholesterolemia: if the cholesterol level in the blood is 200 milligrams per deciliter or more
  • Smoking
  • Diabetes
  • Obesity: if the body mass index (BMI) is 30 or more
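
The thresholds in the list above can be written as a small helper. This is only an illustrative sketch: the function name `risk_flags` and its argument names are hypothetical, and the cut-offs are simply the ones quoted in the list. Smoking and diabetes are left out because the list gives no numeric threshold for them.

```python
def risk_flags(age, systolic, diastolic, cholesterol_mg_dl, bmi):
    """Return the names of the listed risk factors that apply (hypothetical helper)."""
    flags = []
    if age >= 55:
        flags.append("age")
    if systolic >= 140 or diastolic >= 90:
        flags.append("hypertension")
    if cholesterol_mg_dl >= 200:
        flags.append("hypercholesterolemia")
    if bmi >= 30:
        flags.append("obesity")
    return flags


# Example person: 60 years old, raised diastolic pressure, BMI above 30.
print(risk_flags(age=60, systolic=135, diastolic=92, cholesterol_mg_dl=180, bmi=31.5))
```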


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Read & Explore



Distribution of the Numeric Features

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
df.plot(kind="hist", y="age", bins=70, color="b", ax=axes[0][0])
df.plot(kind="hist", y="bmi", bins=100, color="r", ax=axes[0][1])
df.plot(kind="hist", y="heart_disease", bins=6, color="g", ax=axes[1][0])
df.plot(kind="hist", y="avg_glucose_level", bins=100, color="orange", ax=axes[1][1])
  • We have a good distribution for age
  • I think we have outliers in BMI
  • The average glucose distribution looks reasonable, because normal blood sugar is below 140. The downside is that this feature may not help us see whether diabetes is correlated with strokes

Data Summary (Check for Missing Values)

print("Rows : ", df.shape[0])
print("Columns : ", df.shape[1])
print("\nFeatures : \n", df.columns.tolist())
print("\nMissing values : ", df.isnull().sum().values.sum())
print("\nUnique values : \n", df.nunique())

Data Visualization

Stroke Pie Chart

labels = df['stroke'].value_counts(sort=True).index
sizes = df['stroke'].value_counts(sort=True)
colors = ["lightblue", "red"]
explode = (0.05, 0)

plt.figure(figsize=(7, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Stroke Breakdown')

Only about 5% of the people in the data have had a stroke!
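
With so few positive cases, plain accuracy can look good even for a model that never predicts a stroke. Here is a minimal sketch of that baseline, assuming a hypothetical label list with the roughly 5% positive rate noted above (these are not the real rows):

```python
# Hypothetical labels: 95 people without a stroke (0) and 5 with one (1).
labels = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class (no stroke).
predictions = [0] * len(labels)

# Share of predictions that match the label.
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Number of actual stroke cases the "model" identified.
caught = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))

print(f"accuracy: {accuracy:.2f}")
print(f"stroke cases detected: {caught}")
```

This is worth keeping in mind when reading the accuracy scores of the classifiers later on.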



The numbers of females and males in the data differ by about 1,000.

Correlation with average glucose level

Visualize some features that may be correlated with the average glucose level.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
df.plot(kind='scatter', x='age', y='avg_glucose_level', alpha=0.5, color='green', ax=axes[0], title="Age vs. avg_glucose_level")
df.plot(kind='scatter', x='bmi', y='avg_glucose_level', alpha=0.5, color='red', ax=axes[1], title="bmi vs. avg_glucose_level")
  • Average glucose level tends to be higher in older people
  • People with a BMI above 40 tend to have a low average glucose level

Heatmap Correlation


There is no correlation between stroke and BMI
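
A correlation heatmap is typically built from pairwise Pearson correlations. The following is a rough sketch of how a single stroke/BMI cell could be read off, using a small made-up frame in place of the real data (the column names follow the notebook, the numbers are illustrative only):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age":    [25, 40, 55, 63, 71, 80],
    "bmi":    [22.0, 31.5, 27.0, 24.3, 29.8, 26.1],
    "stroke": [0, 0, 0, 1, 0, 1],
})

# DataFrame.corr() computes Pearson correlation by default.
corr = toy.corr()

# One cell of the matrix: correlation between stroke and BMI.
print(corr.loc["stroke", "bmi"])

# All features ranked by their correlation with the target.
print(corr["stroke"].drop("stroke").sort_values(ascending=False))
```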

BMI Boxplot


We have many outliers, but before we fix them we must first understand BMI.


Body mass index is a value derived from the mass and height of a person.
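
Concretely, BMI is body mass in kilograms divided by the square of height in metres. A small sketch (the function name is illustrative):

```python
def body_mass_index(mass_kg, height_m):
    """BMI = mass (kg) divided by height (m) squared."""
    return mass_kg / height_m ** 2


bmi = body_mass_index(mass_kg=81.0, height_m=1.80)
print(round(bmi, 1))
```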

# stroke counts among the rows flagged as BMI outliers
print(bmi_outliers['stroke'].value_counts())
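
The cell that builds `bmi_outliers` is not reproduced in this post. One common way to flag the points a boxplot draws as outliers is the 1.5 × IQR rule. The sketch below works under that assumption, on a made-up frame; it may not match what the notebook actually did:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "bmi":    [18.5, 22.0, 24.5, 26.0, 28.4, 30.1, 33.0, 58.0, 71.2],
    "stroke": [0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Interquartile range and the usual boxplot whisker limits.
q1, q3 = toy["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the whiskers; the name mirrors the notebook's variable.
bmi_outliers = toy[(toy["bmi"] < lower) | (toy["bmi"] > upper)]
print(bmi_outliers["stroke"].value_counts())
```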
print("\nMissing values : ", df.isnull().sum().values.sum())

Double Check for missing values

# cap BMI values above 50 at 50
df["bmi"] = df["bmi"].apply(lambda x: 50 if x > 50 else x)
# fill the missing BMI values with 28.4
df["bmi"] = df["bmi"].fillna(28.4)
print("\nMissing values : ", df.isnull().sum().values.sum())
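
The same capping and filling can also be written with `Series.clip` and `fillna`. The sketch below uses a toy column; the cap of 50 and the fill value of 28.4 are the notebook's own choices, while filling with the column median is shown only as one possible alternative:

```python
import numpy as np
import pandas as pd

# Toy column with two extreme values and one missing value.
toy = pd.DataFrame({"bmi": [22.5, 61.0, np.nan, 35.2, 97.6]})

# Cap extreme values at 50 (missing values stay missing).
capped = toy["bmi"].clip(upper=50)

# Fill the gap with the notebook's constant.
filled_const = capped.fillna(28.4)

# Alternative (an assumption, not the notebook's method): fill with the median.
filled_median = capped.fillna(capped.median())

print(filled_const.tolist())
print(filled_median.tolist())
```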

Stroke or not in Categorical Features

cat_df = df[['gender', 'Residence_type', 'smoking_status', 'stroke']]
summary = pd.concat(
    [pd.crosstab(cat_df[x], cat_df.stroke) for x in cat_df.columns[:-1]],
    keys=cat_df.columns[:-1],
)
summary

Stroke/Ever Married


Stroke/Work Type


In this data, stroke cases are most common among people who work in the private sector.

Stroke/Smoking Status


Being a smoker or a former smoker increases your risk of having a stroke.

Residence Type


Residence type appears to have nothing to do with stroke, so we cannot rely on it as an indicator.

Stroke/Heart Disease


Most people who have had a stroke do not have any heart disease, but that does not rule it out as an influential factor.


More than 25% of stroke cases had hypertension.


  • Average glucose level tends to be higher in older people
  • People with a BMI above 40 tend to have a low average glucose level
  • Being unmarried is associated with a lower risk of stroke
  • Being a smoker or a former smoker increases your risk of having a stroke
  • More than 25% of stroke cases had hypertension

Data preprocessing

Encoding Categorical Features

# binary features: map one category to 1 and the other to 0
df["Residence_type"] = df["Residence_type"].apply(lambda x: 1 if x == "Urban" else 0)
df["ever_married"] = df["ever_married"].apply(lambda x: 1 if x == "Yes" else 0)
df["gender"] = df["gender"].apply(lambda x: 1 if x == "Male" else 0)

# multi-category features: one indicator column per category
df = pd.get_dummies(data=df, columns=['smoking_status'])
df = pd.get_dummies(data=df, columns=['work_type'])
df
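
To make the encoding step concrete, here is what it does to a three-row toy frame. The category values are examples only, and `dtype=int` is added here (it is not in the notebook) so that the dummy columns print as 0/1 regardless of the pandas version:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "smoking_status": ["smokes", "never smoked", "smokes"],
})

# Binary column: one category becomes 1, everything else 0.
toy["gender"] = (toy["gender"] == "Male").astype(int)

# Multi-category column: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["smoking_status"], dtype=int)
print(encoded)
```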

Scaling the Variance in Features

std = StandardScaler()
columns = ['avg_glucose_level', 'bmi', 'age']
scaled = std.fit_transform(df[columns])
scaled = pd.DataFrame(scaled, columns=columns)
# swap the raw columns for their scaled versions (rows are matched by index)
df = df.drop(columns=columns)
df = df.merge(scaled, left_index=True, right_index=True, how="left")
df.head()
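
`StandardScaler` subtracts each column's mean and divides by its standard deviation, so every scaled column ends up with mean 0 and unit variance. A hand-rolled sketch of the same calculation on a toy column (the column `age_scaled` is just an illustrative name):

```python
import pandas as pd

toy = pd.DataFrame({"age": [20.0, 40.0, 60.0, 80.0]})

# StandardScaler uses the population standard deviation (ddof=0).
mean = toy["age"].mean()
std = toy["age"].std(ddof=0)

# z-score: how many standard deviations each value is from the mean.
toy["age_scaled"] = (toy["age"] - mean) / std

print(toy)
```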

Drop ID feature and check for nulls


Classification Models

Target & Features

X = df.drop(['stroke'], axis=1).values
y = df['stroke'].values


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
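
Because only about 5% of the rows are positive, a plain random split can leave the test set with a different share of stroke cases than the training set, and the result changes on every run. One option, shown here as a sketch and not as what the notebook does, is to pass `stratify` and a fixed `random_state` to `train_test_split`. The toy arrays below are made up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 200 rows, 10 positives (5%), mirroring the imbalance in the data.
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 190)

# stratify keeps the class share equal in both parts; random_state makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print("train positives share:", y_tr.mean())
print("test positives share: ", y_te.mean())
```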

AdaBoost Classification

# create the AdaBoost classifier object
# (the base tree is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2)
ab_clf = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=100, learning_rate=0.5, random_state=100)
# train the AdaBoost classifier
ab_clf.fit(X_train, y_train)
print("training....\n")
# make predictions using the test set
ab_pred_stroke = ab_clf.predict(X_test)
print('prediction: \n', ab_pred_stroke)
print('\nparms: \n', ab_clf.get_params())
# score
ab_clf_score = ab_clf.score(X_test, y_test)
print("\nmean accuracy: %.2f" % ab_clf_score)


# scikit-learn's gradient boosting (the variable is named xgboost, but this is not the XGBoost library)
xgboost = GradientBoostingClassifier(random_state=0)
xgboost.fit(X_train, y_train)

# score
xgboost_score = xgboost.score(X_train, y_train)
xgboost_test = xgboost.score(X_test, y_test)

# testing model
y_pred = xgboost.predict(X_test)

# evaluation
cm = confusion_matrix(y_test, y_pred)
print('Training Score', xgboost_score)
print('Testing Score \n', xgboost_test)

# confusion matrix
plt.figure(figsize=(14, 5))
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'], index=['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Greens")
print(accuracy_score(y_test, y_pred))


svc = SVC(random_state=0)
svc.fit(X_train, y_train)

# score
svc_score = svc.score(X_train, y_train)
svc_test = svc.score(X_test, y_test)

# testing model
y_pred = svc.predict(X_test)

# evaluation
cm = confusion_matrix(y_test, y_pred)
print('Training Score', svc_score)
print('Testing Score \n', svc_test)
print(cm)

Random Forest Classifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

# score
forest_score = forest.score(X_train, y_train)
forest_test = forest.score(X_test, y_test)

# testing model
y_pred = forest.predict(X_test)

# evaluation
cm = confusion_matrix(y_test, y_pred)
print('Training Score', forest_score)
print('Testing Score \n', forest_test)
print(cm)

Logistic Regression

model = LogisticRegression()
model.fit(X_train, y_train)

# score
logistic_score = model.score(X_train, y_train)
logistic_test = model.score(X_test, y_test)
print('Testing Score \n', logistic_test)

# per-class precision, recall and F1
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

Feature Importance using Logistic Regression

# absolute value of each logistic regression coefficient
coef = model.coef_[0]
coef = [abs(number) for number in coef]
print(coef)

# feature names, with the target label removed by position lookup
cols = list(df.columns)
del cols[cols.index('stroke')]
cols

# print the feature names from largest to smallest absolute coefficient
sorted_index = sorted(range(len(coef)), key=lambda k: coef[k], reverse=True)
for idx in sorted_index:
    print(cols[idx])
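
The ranking logic is easier to follow with names and coefficients paired up. Here is a sketch with made-up coefficient values (they are not the model's output). Note that comparing coefficient magnitudes like this is only a rough guide, and it assumes the features are on comparable scales:

```python
# Hypothetical values; in the notebook they come from model.coef_[0].
feature_names = ["age", "hypertension", "heart_disease", "bmi"]
coefficients = [1.62, 0.41, -0.35, 0.02]

# Rank by absolute value so that strong negative effects are not ranked last.
ranked = sorted(
    zip(feature_names, coefficients),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, value in ranked:
    print(f"{name:15s} {value:+.2f}")
```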

Although BMI is considered an indicator for recognizing strokes, a large number of the values here fall in the normal range, and there is no high rate of high-BMI values that would indicate a stroke.

MLP NN Classifier

# drop the target and the features ranked as least useful
X = df.drop(['stroke', 'gender', 'bmi', 'Residence_type', 'work_type_Never_worked', 'smoking_status_Unknown'], axis=1).values
# X = df.drop(['stroke','bmi'], axis=1).values
y = df['stroke'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# mlp = MLPClassifier(hidden_layer_sizes=(1000,300, 300, 300), solver='adam', shuffle=False, tol = 0.0001)
mlp = MLPClassifier(hidden_layer_sizes=(300, 300, 300), max_iter=1000, alpha=0.00001, solver='adam', verbose=10, random_state=21)
mlp.fit(X_train, y_train)

# score
mlp_score = mlp.score(X_train, y_train)
mlp_test = mlp.score(X_test, y_test)

# testing model
mlp_pred = mlp.predict(X_test)
y_pred = mlp.predict(X_test)

# evaluation
cm = confusion_matrix(y_test, y_pred)
print('Training Score', mlp_score)
print('Testing Score \n', mlp_test)
print(cm)
Iteration 1, loss = 0.25073982
Iteration 2, loss = 0.15601721
Iteration 3, loss = 0.15148236
Iteration 4, loss = 0.15011300
Iteration 5, loss = 0.14801346
Iteration 6, loss = 0.14705574
Iteration 7, loss = 0.14343648
Iteration 8, loss = 0.14475396
Iteration 9, loss = 0.14122289
Iteration 10, loss = 0.14020491
Iteration 11, loss = 0.14082460
Iteration 12, loss = 0.13869296
Iteration 13, loss = 0.13551809
Iteration 14, loss = 0.13677271
Iteration 15, loss = 0.13306991
Iteration 16, loss = 0.13627428
Iteration 17, loss = 0.13310803
Iteration 18, loss = 0.13113676
Iteration 19, loss = 0.12786408
Iteration 20, loss = 0.12653028
Iteration 21, loss = 0.12525292
Iteration 22, loss = 0.12757926
Iteration 23, loss = 0.12214366
Iteration 24, loss = 0.12129737
Iteration 25, loss = 0.12211088
Iteration 26, loss = 0.12322562
Iteration 27, loss = 0.11950508
Iteration 28, loss = 0.11867142
Iteration 29, loss = 0.11774275
Iteration 30, loss = 0.11903667
Iteration 31, loss = 0.11632040
Iteration 32, loss = 0.11553193
Iteration 33, loss = 0.11295480
Iteration 34, loss = 0.11218260
Iteration 35, loss = 0.10999969
Iteration 36, loss = 0.11053086
Iteration 37, loss = 0.10904621
Iteration 38, loss = 0.10831232
Iteration 39, loss = 0.10686522
Iteration 40, loss = 0.10644428
Iteration 41, loss = 0.10688178
Iteration 42, loss = 0.10343191
Iteration 43, loss = 0.10450590
Iteration 44, loss = 0.10335569
Iteration 45, loss = 0.10186789
Iteration 46, loss = 0.10005436
Iteration 47, loss = 0.10356312
Iteration 48, loss = 0.10151862
Iteration 49, loss = 0.10214588
Iteration 50, loss = 0.10308373
Iteration 51, loss = 0.09923623
Iteration 52, loss = 0.09605030
Iteration 53, loss = 0.09936861
Iteration 54, loss = 0.09486939
Iteration 55, loss = 0.09245237
Iteration 56, loss = 0.09775333
Iteration 57, loss = 0.09387213
Iteration 58, loss = 0.09417488
Iteration 59, loss = 0.09496724
Iteration 60, loss = 0.09067467
Iteration 61, loss = 0.08957575
Iteration 62, loss = 0.09188115
Iteration 63, loss = 0.09131175
Iteration 64, loss = 0.08956810
Iteration 65, loss = 0.09027089
Iteration 66, loss = 0.09068501
Iteration 67, loss = 0.08620702
Iteration 68, loss = 0.08673546
Iteration 69, loss = 0.08283293
Iteration 70, loss = 0.08313578
Iteration 71, loss = 0.08808702
Iteration 72, loss = 0.08630748
Iteration 73, loss = 0.08130300
Iteration 74, loss = 0.08077653
Iteration 75, loss = 0.08214762
Iteration 76, loss = 0.08222929
Iteration 77, loss = 0.07996879
Iteration 78, loss = 0.08085455
Iteration 79, loss = 0.07764043
Iteration 80, loss = 0.08130066
Iteration 81, loss = 0.07998853
Iteration 82, loss = 0.07847984
Iteration 83, loss = 0.08112860
Iteration 84, loss = 0.07691877
Iteration 85, loss = 0.07564515
Iteration 86, loss = 0.07751632
Iteration 87, loss = 0.07696659
Iteration 88, loss = 0.08058930
Iteration 89, loss = 0.07747721
Iteration 90, loss = 0.07779515
Iteration 91, loss = 0.07564913
Iteration 92, loss = 0.07393943
Iteration 93, loss = 0.07744015
Iteration 94, loss = 0.07466905
Iteration 95, loss = 0.07443650
Iteration 96, loss = 0.07214443
Iteration 97, loss = 0.07238843
Iteration 98, loss = 0.07042956
Iteration 99, loss = 0.06888013
Iteration 100, loss = 0.06920919
Iteration 101, loss = 0.06901262
Iteration 102, loss = 0.07552961
Iteration 103, loss = 0.07174945
Iteration 104, loss = 0.07029673
Iteration 105, loss = 0.07013814
Iteration 106, loss = 0.06784715
Iteration 107, loss = 0.07159969
Iteration 108, loss = 0.06863485
Iteration 109, loss = 0.06673842
Iteration 110, loss = 0.06937063
Iteration 111, loss = 0.06617347
Iteration 112, loss = 0.06500215
Iteration 113, loss = 0.06340067
Iteration 114, loss = 0.06236733
Iteration 115, loss = 0.06458241
Iteration 116, loss = 0.06619115
Iteration 117, loss = 0.07260931
Iteration 118, loss = 0.06929901
Iteration 119, loss = 0.06682100
Iteration 120, loss = 0.06453708
Iteration 121, loss = 0.06246274
Iteration 122, loss = 0.06107513
Iteration 123, loss = 0.06234550
Iteration 124, loss = 0.06083020
Iteration 125, loss = 0.06177546
Iteration 126, loss = 0.05927088
Iteration 127, loss = 0.05970574
Iteration 128, loss = 0.06032682
Iteration 129, loss = 0.06070094
Iteration 130, loss = 0.06367095
Iteration 131, loss = 0.05975269
Iteration 132, loss = 0.06050048
Iteration 133, loss = 0.06072319
Iteration 134, loss = 0.06303969
Iteration 135, loss = 0.06479217
Iteration 136, loss = 0.06493533
Iteration 137, loss = 0.06678607
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Training Score 0.9751188146491473
Testing Score
 0.9347684279191129
[[1420   29]
 [  71   13]]
plt.figure(figsize=(14, 5))
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'], index=['Actual:0', 'Actual:1'])
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Reds")

Sensitivity & Specificity

print('The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) = ', (TP+TN)/float(TP+TN+FP+FN), '\n',
      'The misclassification rate = 1-Accuracy = ', 1-((TP+TN)/float(TP+TN+FP+FN)), '\n',
      'Sensitivity or True Positive Rate = TP/(TP+FN) = ', TP/float(TP+FN), '\n',
      'Specificity or True Negative Rate = TN/(TN+FP) = ', TN/float(TN+FP), '\n')
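
Plugging in the confusion matrix printed for the MLP run gives a concrete picture. The sketch assumes scikit-learn's layout, where rows are the actual classes and columns are the predicted classes:

```python
# Confusion matrix from the MLP run: [[TN, FP], [FN, TP]]
cm = [[1420, 29], [71, 13]]
(TN, FP), (FN, TP) = cm

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # share of actual stroke cases that were caught
specificity = TN / (TN + FP)   # share of non-stroke cases correctly cleared

print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
```

The accuracy agrees with the testing score reported for the MLP, yet only 13 of the 84 actual stroke cases in the test set are detected, which is why sensitivity matters more than accuracy for this problem.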

This notebook was written on Kaggle by Ahmed Ashour.
