Patient Segmentation Based on Prescription Patterns¶
Importing Necessary Libraries:¶
from warnings import filterwarnings
filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Reading the data, Understanding the data and Addressing basic inconsistencies if any...¶
df = pd.read_parquet('train.parquet')
df.head()
| | Patient-Uid | Date | Incident |
---|---|---|---|
0 | a0db1e73-1c7c-11ec-ae39-16262ee38c7f | 2019-03-09 | PRIMARY_DIAGNOSIS |
1 | a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f | 2015-05-16 | PRIMARY_DIAGNOSIS |
3 | a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f | 2018-01-30 | SYMPTOM_TYPE_0 |
4 | a0dc950b-1c7c-11ec-b6ec-16262ee38c7f | 2015-04-22 | DRUG_TYPE_0 |
8 | a0dc9543-1c7c-11ec-bb63-16262ee38c7f | 2016-06-18 | DRUG_TYPE_1 |
df.shape
(3220868, 3)
df.duplicated().sum()
35571
df.index.duplicated().sum()
166693
df.drop_duplicates(inplace = True)
df.reset_index(drop = True, inplace = True)
df.shape
(3185297, 3)
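The two duplicate counts above measure different things: `df.duplicated()` compares row contents, while `df.index.duplicated()` compares only index labels. A minimal sketch on a made-up frame (the values are illustrative and not taken from `train.parquet`):

```python
import pandas as pd

# Toy frame (hypothetical values): one fully repeated row and one repeated index label
toy = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p2", "p3"],
        "Date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-02-01", "2019-03-01"]),
        "Incident": ["DRUG_TYPE_0", "DRUG_TYPE_0", "SYMPTOM_TYPE_0", "DRUG_TYPE_1"],
    },
    index=[0, 1, 1, 3],
)

# Rows whose values repeat an earlier row
row_duplicates = int(toy.duplicated().sum())
# Index labels that repeat an earlier label, regardless of row contents
index_duplicates = int(toy.index.duplicated().sum())

# Drop repeated rows, then rebuild a clean 0..n-1 index
cleaned = toy.drop_duplicates().reset_index(drop=True)
print(row_duplicates, index_duplicates, cleaned.shape)
```

Resetting the index after dropping rows is what removes the repeated index labels, which is presumably why both steps appear together above.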
We have already explored the data in 001.ipynb, so let's jump straight into the context.
We only care about patients who took the target drug at least once. So, let's split those records out of the main dataframe and use them for clustering and analysis.
Pulling just the positive set out of original dataset...¶
positive_set = df[df['Incident'] == 'TARGET DRUG'].copy()  # copy, so the in-place edits below don't act on a slice of df
print(positive_set.shape)
positive_set.head()
(67218, 3)
| | Patient-Uid | Date | Incident |
---|---|---|---|
2065342 | a0eb742b-1c7c-11ec-8f61-16262ee38c7f | 2020-04-09 | TARGET DRUG |
2065362 | a0edaf09-1c7c-11ec-a360-16262ee38c7f | 2018-06-12 | TARGET DRUG |
2065502 | a0e9fa0e-1c7c-11ec-8dc7-16262ee38c7f | 2019-06-11 | TARGET DRUG |
2065613 | a0ecc615-1c7c-11ec-aa31-16262ee38c7f | 2019-11-15 | TARGET DRUG |
2065618 | a0ea612f-1c7c-11ec-8cf0-16262ee38c7f | 2020-03-18 | TARGET DRUG |
So, we have 67k+ instances where patients took the target drug. As already observed in 001.ipynb, around 9,300 unique patients took the target drug, which is about a third of all unique patients in the dataset.
As the Incident column now holds only TARGET DRUG, it is of no use, so let's drop it.
positive_set.drop('Incident', inplace = True, axis = 1)
# Sorting patients by Patient-uid and date to manage them better
positive_set.sort_values(by=['Patient-Uid', 'Date'], inplace=True)
Engineering new feature...¶
Let's extract a new feature from the Date column: the time interval between consecutive prescriptions of the same patient. The model will be trained on this feature to form the clusters.
The steps are simple:
Add a new TimeInterval column to the positive set by grouping on Patient-Uid and using the diff method to compute the gap between one date and the next.
Drop the null values in TimeInterval, since each patient's first date has no previous date and therefore gets NaT (Not a Time).
Let's do it...
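The two steps above can be sketched on a small made-up frame (patients `p1` and `p2` and their dates are hypothetical) to show what `groupby(...).diff()` produces before it is applied to the real data:

```python
import pandas as pd

# Hypothetical prescriptions for two patients, sorted by patient and date
toy = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p1", "p2", "p2"],
        "Date": pd.to_datetime(
            ["2020-01-01", "2020-01-29", "2020-03-01", "2020-02-10", "2020-02-20"]
        ),
    }
).sort_values(["Patient-Uid", "Date"])

# Gap to the previous prescription of the same patient; the first row per patient is NaT
toy["TimeInterval"] = toy.groupby("Patient-Uid")["Date"].diff()

# Keep only rows that have a previous prescription to compare against
intervals = toy.dropna(subset=["TimeInterval"])
interval_days = intervals["TimeInterval"].dt.days.tolist()
print(interval_days)
```

Because the diff is computed within each patient group, the first prescription of `p2` is not compared against the last prescription of `p1`.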
positive_set['TimeInterval'] = positive_set.groupby('Patient-Uid')['Date'].diff()
prescription_patterns = positive_set.copy()
prescription_patterns.dropna(subset=['TimeInterval'], inplace=True)
print(prescription_patterns['TimeInterval'].min(), prescription_patterns['TimeInterval'].max())
1 days 00:00:00 1219 days 00:00:00
print(df['Date'].min(), df['Date'].max())
2015-04-07 00:00:00 2020-09-03 00:00:00
print(positive_set['Date'].min(), positive_set['Date'].max())
2017-02-22 00:00:00 2020-09-03 00:00:00
As we can observe, the time interval ranges from 1 day to 1219 days.
The cells above also show that while the dataset has records from Apr 2015 to Sep 2020, target drug administration starts only in Feb 2017 and continues up to the last date on record. We will discuss this later if needed.
Now let's find the ideal number of clusters for k-means using the elbow method. The model input here is a raw timedelta column, which DBSCAN would not take as-is in this setup, so let's stick to k-means clustering.
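One hedged alternative, not used in this notebook: converting the intervals to plain numeric day counts gives an array that scikit-learn estimators, including DBSCAN, can consume, and it keeps inertia in squared days rather than in the squared nanoseconds that a raw timedelta column implies. The toy intervals and the `eps` and `min_samples` values below are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical intervals: a short-gap group and a long-gap group
intervals = pd.Series(pd.to_timedelta([27, 28, 29, 30, 31, 300, 305, 310], unit="D"))

# Plain float day counts instead of timedelta64[ns] values
X_days = intervals.dt.days.to_numpy(dtype=float).reshape(-1, 1)

# Both estimators accept the numeric array
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_days)
dbscan_labels = DBSCAN(eps=10, min_samples=2).fit_predict(X_days)

print(kmeans_labels, dbscan_labels)
```

Working in days also makes cluster centres directly readable as interval lengths.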
Elbow Plot to find ideal no. of clusters:¶
X = prescription_patterns['TimeInterval'].values.reshape(-1, 1)
k_values = range(1, 11)
inertia_scores = []
cluster_mapping = {}
for k in k_values:
    # init defaults to k-means++, which is a good choice here
    kmeans = KMeans(n_clusters=k, n_init='auto', random_state=42)
    kmeans.fit(X)
    inertia_score = kmeans.inertia_
    inertia_scores.append(inertia_score)
    cluster_mapping[f'{k} Cluster' if k == 1 else f'{k} Clusters'] = inertia_score
print(cluster_mapping)
sns.set_theme()
plt.figure(figsize=(10, 7))
plt.plot(k_values, inertia_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia Score')
plt.title('Elbow Method: Inertia Scores')
plt.show()
{'1 Cluster': 8.596245577653992e+35, '2 Clusters': 4.237133518141459e+35, '3 Clusters': 2.4248190479612254e+35, '4 Clusters': 1.3773206859526056e+35, '5 Clusters': 1.165582461295304e+35, '6 Clusters': 7.009182021866456e+34, '7 Clusters': 5.067819584613085e+34, '8 Clusters': 4.035829757754878e+34, '9 Clusters': 3.079275106145406e+34, '10 Clusters': 2.6283973130400936e+34}
4 clusters looks good. With some more time, I could have plotted the silhouette, Davies-Bouldin and Calinski-Harabasz scores to check cluster quality, which would have strengthened our choice of n_clusters. So, here let's go with 4 clusters and visualize the patterns.
In a previous clustering project, computing the silhouette, Davies-Bouldin and Calinski-Harabasz scores for n_clusters in the range [1, 10] took a very long time. That's the main reason for skipping them here, given the time constraint.
Now, let's cluster with k-means using 4 clusters.
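For reference, a minimal sketch of how the three skipped scores could be computed more cheaply. The silhouette score is usually the slow one, and scikit-learn's `silhouette_score` accepts a `sample_size` argument that evaluates it on a random subset. The data below is synthetic (`X_demo` is a made-up stand-in for interval lengths in days), so the numbers say nothing about the real clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
# Synthetic stand-in for interval lengths in days (not the real data)
X_demo = np.concatenate(
    [rng.normal(28, 3, 500), rng.normal(90, 10, 300), rng.normal(400, 40, 100)]
).reshape(-1, 1)

scores = {}
for k in range(2, 7):  # these metrics need at least 2 clusters, so k=1 is skipped
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = {
        # silhouette on a random subset of 300 points to limit the pairwise-distance cost
        "silhouette": silhouette_score(X_demo, labels, sample_size=300, random_state=42),
        "davies_bouldin": davies_bouldin_score(X_demo, labels),
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),
    }

for k, s in scores.items():
    print(k, {name: round(float(value), 3) for name, value in s.items()})
```

Higher silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters.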
Clustering:¶
kmeans = KMeans(n_clusters=4, random_state=42)
prescription_patterns['Cluster'] = kmeans.fit_predict(X) + 1
# Cluster labels start from 0 by default; shifting them to start from 1 for readability
prescription_patterns['Cluster'].value_counts().plot.pie(autopct='%.2f')
<Axes: ylabel='Cluster'>
Instant observation:¶
About 95% of patients fall in clusters 1 and 3, and the remaining 5% in clusters 2 and 4, one of which holds only around 0.6% of patients.
Cluster 1 has the most patients, followed by cluster 3.
Let's move forward and analyse further...
prescription_patterns.head()
| | Patient-Uid | Date | TimeInterval | Cluster |
---|---|---|---|---|
2094649 | a0e9c384-1c7c-11ec-81a0-16262ee38c7f | 2020-08-05 | 28 days | 3 |
2164002 | a0e9c384-1c7c-11ec-81a0-16262ee38c7f | 2020-09-02 | 28 days | 3 |
2637552 | a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f | 2018-05-17 | 23 days | 3 |
3171058 | a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f | 2018-06-13 | 27 days | 3 |
2375328 | a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f | 2018-08-07 | 55 days | 1 |
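Since the clusters are formed on TimeInterval alone, a per-cluster summary of the interval lengths is a quick way to see what each label stands for. A minimal sketch on a made-up frame (the intervals and labels below are hypothetical, loosely mirroring the rows shown above):

```python
import pandas as pd

# Hypothetical intervals with cluster labels
toy = pd.DataFrame(
    {
        "TimeInterval": pd.to_timedelta([23, 27, 28, 55, 60, 400], unit="D"),
        "Cluster": [3, 3, 3, 1, 1, 2],
    }
)

# Range of interval lengths (in days) covered by each cluster
summary = (
    toy.assign(Days=toy["TimeInterval"].dt.days)
    .groupby("Cluster")["Days"]
    .agg(["count", "min", "median", "max"])
)
print(summary)
```

Applied to `prescription_patterns`, the same aggregation would show the interval band behind each cluster number.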
Some Segregation and Grouping before we get into the visualization part...¶
# Segregating clusters into separate dataframes to visualize each of them with ease
# (.copy() so that columns can be added to each frame later without touching a slice)
cluster_1 = prescription_patterns[prescription_patterns['Cluster'] == 1].copy()
cluster_2 = prescription_patterns[prescription_patterns['Cluster'] == 2].copy()
cluster_3 = prescription_patterns[prescription_patterns['Cluster'] == 3].copy()
cluster_4 = prescription_patterns[prescription_patterns['Cluster'] == 4].copy()
# Grouping the data in each cluster by month and counting the number of prescriptions using Grouper object and groupby method
cluster_1_counts = cluster_1.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_2_counts = cluster_2.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_3_counts = cluster_3.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_4_counts = cluster_4.groupby(pd.Grouper(key='Date', freq='M')).size()
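The four monthly counts above could also be produced in one pass by grouping on the month and the cluster label together. A minimal sketch on a made-up frame (dates and labels are hypothetical):

```python
import pandas as pd

# Hypothetical prescriptions with cluster labels
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-10", "2020-02-25"]
        ),
        "Cluster": [1, 1, 1, 2, 2],
    }
)

# One row per calendar month, one column per cluster, cells are prescription counts
monthly_counts = (
    toy.groupby([toy["Date"].dt.to_period("M"), "Cluster"])
    .size()
    .unstack(fill_value=0)
)
print(monthly_counts)
```

One difference to keep in mind: `pd.Grouper` with a frequency also emits the empty months in between, whereas this version only contains months in which at least one cluster has a record.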
A first visualization before we dig deeper into the clusters...¶
plt.figure(figsize=(10, 8))
plt.plot(cluster_1_counts.index, cluster_1_counts.values, label='Cluster 1')
plt.plot(cluster_2_counts.index, cluster_2_counts.values, label='Cluster 2')
plt.plot(cluster_3_counts.index, cluster_3_counts.values, label='Cluster 3')
plt.plot(cluster_4_counts.index, cluster_4_counts.values, label='Cluster 4')
plt.xlabel('Period')
plt.ylabel('Number of Prescriptions')
plt.title('Number of Prescriptions overall as time progresses')
plt.legend()
plt.show()
Inferences from above visualization:¶
As we already observed earlier in this notebook, target drug administration started only after Feb 2017.
Cluster-1: This group of patients got the target drug prescribed only after the patients in cluster 3 did. But eventually, these patients had more prescriptions than cluster 3 and than any other cluster. This might be because cluster 1 has more patients than cluster 3. This cluster also had a few ups and downs, but it had a stark rise in the number of prescriptions as time progressed and hasn't had a comparably stark decrease.
Cluster-2: This group has had the least number of prescriptions throughout, without any steep rise or fall. They appear to take a tad more by mid-2020 but then go back to the same number of prescriptions. They also seem to be the latest group to start taking the target drug, only around Feb or Mar 2019. We need additional details to see the trend of this cluster.
Cluster-3: These patients seem to have started the target drug earlier than any other group and had a steep rise in the number of prescriptions during early 2018, after which the count more or less plateaued for the rest of the period. They have had a stark rise and then a plateau, as discussed, but never a steep decline in prescriptions as time went by.
Cluster-4: This group of patients is a bit better than cluster 2 in the overall picture. It has had a significant, though not steep, rise in prescriptions around Aug or Sep 2018 and maintained the same rate throughout.
Creating new feature - Month from TimeInterval Column...¶
cluster_1['Month'] = (cluster_1['TimeInterval'].dt.days / 30.44).astype(int)
cluster_2['Month'] = (cluster_2['TimeInterval'].dt.days / 30.44).astype(int)
cluster_3['Month'] = (cluster_3['TimeInterval'].dt.days / 30.44).astype(int)
cluster_4['Month'] = (cluster_4['TimeInterval'].dt.days / 30.44).astype(int)
# Calculating the avg no. of prescriptions per unique patient for each month in each cluster
cluster_1_prescription = cluster_1.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_2_prescription = cluster_2.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_3_prescription = cluster_3.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_4_prescription = cluster_4.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
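To make the Month feature concrete: it is the interval length expressed in whole months (days divided by 30.44 and truncated), so intervals shorter than about a month land in Month 0. A minimal sketch of the bucketing and the per-patient average on a made-up frame (patients and intervals are hypothetical):

```python
import pandas as pd

# Hypothetical intervals for two patients
toy = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p1", "p2", "p2", "p2"],
        "TimeInterval": pd.to_timedelta([28, 29, 65, 27, 62, 95], unit="D"),
    }
)

# Interval length in whole months; 30.44 is the average number of days per month
toy["Month"] = (toy["TimeInterval"].dt.days / 30.44).astype(int)

# Count rows per (Month, patient), then average those counts within each Month
avg = (
    toy.groupby(["Month", "Patient-Uid"]).size()
    .groupby("Month").mean()
    .reset_index(name="Average Prescriptions")
)
print(avg)
```

In this toy case Month 0 averages 1.5 because `p1` has two short intervals and `p2` has one.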
Visualizations explaining the pattern in each cluster with respect to average prescriptions every month:¶
std = cluster_1_prescription['Average Prescriptions'].std() # standard deviation calculation for errorbar
error_y = np.full(len(cluster_1_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_1_prescription['Month'],
    y=cluster_1_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 1',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:¶
Patients in this cluster have taken the target drug for at most 3 months. As we have observed already, this group has the most patients in it. So, it should be safe to assume that close to 60% of the patients who took the target drug across clusters took it for only 3 months.
Month 1: On average, a prescription is made at least thrice in this month.
Month 2: On average, a prescription is made at least twice in this month.
Month 3: On average, a prescription is made at least once in this month.
So, the overall pattern in this cluster is a decreasing trend in the average number of prescriptions, with patients taking the drug for 3 months at most and then quitting. This doesn't seem to be a healthy trend, so we have to get these patients more engaged with their prescriptions.
This looked like a good cluster of patients in the overall visualization. But, as I pointed out there, it looked better only because this cluster has more patients.
std = cluster_2_prescription['Average Prescriptions'].std()
error_y = np.full(len(cluster_2_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_2_prescription['Month'],
    y=cluster_2_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 2',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:¶
This cluster looks like a more consistent and stable cluster.
Starting from the 9th month, prescriptions seem consistent until the 27th month (a span of 18 months), with at least 1 prescription every month. That looks pretty good and is a healthier trend than having a high number of prescriptions in one month followed by a steep fall in the subsequent months.
From the 27th month until the 32nd month there seem to be negligible or no prescriptions. At the 32nd month the average was 1, and after that a prescription was taken 3 months later and again 5 months after that. Some patients might have had complications which led them to take the target drug even after the 18-month span. This might also be because of recurrent symptoms.
So, the overall pattern looks pretty healthy here despite a few gaps here and there.
std = cluster_3_prescription['Average Prescriptions'].std()
error_y = np.full(len(cluster_3_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_3_prescription['Month'],
    y=cluster_3_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 3',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:¶
This is the patient group that was the earliest to take the target drug, as we observed before.
Initially they seem to have taken around 3 prescriptions on average.
After the first prescription, there seems to be a steep decline in the average number of prescriptions, which fell to less than half of the initial month's average.
These patients seem to have stopped taking the target drug within a month or so, making them the worst group we have seen so far.
Overall, patients in this cluster represent the worst trend of all that we have seen so far, despite being the earliest drug takers and the cluster with the second most patients.
std = cluster_4_prescription['Average Prescriptions'].std()
error_y = np.full(len(cluster_4_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_4_prescription['Month'],
    y=cluster_4_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 4',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:¶
This group of patients seems to have taken around 1 drug on average throughout, from the 3rd month till the 9th month (a span of 6 months), with a slight decline from the initial administration.
Apart from that, this cluster seems to be a consistent and stable cluster and can be placed next to the best performing cluster, cluster 2.
Overall, this group, though it has the second least number of people, has performed really up to the mark. The span might simply be because the practitioner judged that this group needed to take the drug for only that many months. There might be other reasons as well.
Overall Summary:¶
At first glance, the overall visualization might have carried us away. Only when we dug deeper did we arrive at a picture that is quite contrary to what we might have thought.
Cluster 1 and Cluster 3, which together hold more than 95% of the patients who took the target drug, perform poorly. Cluster 1, which has 60% of the patients, shows a declining trend, and its patients took the drug for only 3 months. Cluster 3, which has 35% of the patients, seems to be the worst of all, as its patients took the drug for just a month or so, and that too on a declining trend.
Cluster 2 and Cluster 4, which we would have thought the worst at first glance, are actually the best. Cluster 2 is the best of all: its patients took the drug over a span of 18 months, which is pretty consistent, and they took at least 1 drug on average throughout, which is stable enough. Cluster 4 follows a similar trend to cluster 2, except that it has a slight decline in the number of prescriptions and its span is around 6 months, which still sounds fine and consistent.
So, this sums it all up. With some more time, we could have explored it further. Thanks for the opportunity though...