Patient Segmentation Based on Prescription Patterns
Importing Necessary Libraries:
from warnings import filterwarnings
filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Reading the data, understanding it and addressing basic inconsistencies, if any...
df = pd.read_parquet('train.parquet')
df.head()
| | Patient-Uid | Date | Incident |
|---|---|---|---|
| 0 | a0db1e73-1c7c-11ec-ae39-16262ee38c7f | 2019-03-09 | PRIMARY_DIAGNOSIS |
| 1 | a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f | 2015-05-16 | PRIMARY_DIAGNOSIS |
| 3 | a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f | 2018-01-30 | SYMPTOM_TYPE_0 |
| 4 | a0dc950b-1c7c-11ec-b6ec-16262ee38c7f | 2015-04-22 | DRUG_TYPE_0 |
| 8 | a0dc9543-1c7c-11ec-bb63-16262ee38c7f | 2016-06-18 | DRUG_TYPE_1 |
df.shape
(3220868, 3)
df.duplicated().sum()
35571
df.index.duplicated().sum()
166693
df.drop_duplicates(inplace = True)
df.reset_index(drop = True, inplace = True)
df.shape
(3185297, 3)
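The two duplicate counts above measure different things: `df.duplicated()` compares column values, while `df.index.duplicated()` compares index labels only. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from `train.parquet`) shows the distinction and the same clean-up written without `inplace=True`:

```python
import pandas as pd

# Toy frame (made-up values) with one fully duplicated row and one repeated index label
toy = pd.DataFrame(
    {
        "Patient-Uid": ["p1", "p1", "p2", "p3"],
        "Date": pd.to_datetime(["2019-03-09", "2019-03-09", "2015-05-16", "2018-01-30"]),
        "Incident": ["PRIMARY_DIAGNOSIS", "PRIMARY_DIAGNOSIS", "SYMPTOM_TYPE_0", "DRUG_TYPE_0"],
    },
    index=[0, 1, 1, 3],
)

n_dup_rows = int(toy.duplicated().sum())         # rows whose column values repeat an earlier row
n_dup_index = int(toy.index.duplicated().sum())  # index labels that repeat an earlier label

# Drop repeated rows, then rebuild a clean 0..n-1 index
clean = toy.drop_duplicates().reset_index(drop=True)
print(n_dup_rows, n_dup_index, clean.shape)
```

Resetting the index after dropping rows is what makes later positional reasoning about the frame safe.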
We already explored the data well enough in 001.ipynb, so let's jump straight into the task.
Only the patients who took the target drug at least once matter here. So, let's split them out of the main dataframe and use that subset to cluster and analyse.
Pulling just the positive set out of the original dataset...
positive_set = df[df['Incident'] == 'TARGET DRUG'].copy()  # copy, so later edits don't act on a slice of df
print(positive_set.shape)
positive_set.head()
(67218, 3)
| | Patient-Uid | Date | Incident |
|---|---|---|---|
| 2065342 | a0eb742b-1c7c-11ec-8f61-16262ee38c7f | 2020-04-09 | TARGET DRUG |
| 2065362 | a0edaf09-1c7c-11ec-a360-16262ee38c7f | 2018-06-12 | TARGET DRUG |
| 2065502 | a0e9fa0e-1c7c-11ec-8dc7-16262ee38c7f | 2019-06-11 | TARGET DRUG |
| 2065613 | a0ecc615-1c7c-11ec-aa31-16262ee38c7f | 2019-11-15 | TARGET DRUG |
| 2065618 | a0ea612f-1c7c-11ec-8cf0-16262ee38c7f | 2020-03-18 | TARGET DRUG |
So, we have 67k+ instances where patients took the target drug and, as we already observed in 001.ipynb, the total number of unique patients who took it is around 9300, which is about one third of all unique patients in the dataset.
As the Incident column now holds nothing but TARGET DRUG, it's of no use. So, let's drop it.
positive_set = positive_set.drop(columns='Incident')
# Sorting patients by Patient-uid and date to manage them better
positive_set.sort_values(by=['Patient-Uid', 'Date'], inplace=True)
Engineering a new feature...
The time interval between consecutive prescriptions, derived from the Date column, gives the model something to cluster on. So, that's what we are going to build below.
The steps are simple:
Add a new column to the positive set by grouping on Patient-Uid and using the diff method to compute the difference between each date and the previous one.
Drop the null values in the TimeInterval column, since the first date of every patient has no previous date and therefore gets NaT (Not a Time).
Let's do it...
positive_set['TimeInterval'] = positive_set.groupby('Patient-Uid')['Date'].diff()
prescription_patterns = positive_set.copy()
prescription_patterns.dropna(subset=['TimeInterval'], inplace=True)
print(prescription_patterns['TimeInterval'].min(), prescription_patterns['TimeInterval'].max())
1 days 00:00:00 1219 days 00:00:00
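The grouped `diff` step can be sketched on a toy frame to show where the NaT values come from. The patient ids and dates below are made up for illustration:

```python
import pandas as pd

# Two hypothetical patients with three and two prescriptions respectively
toy = pd.DataFrame(
    {
        "Patient-Uid": ["a", "a", "a", "b", "b"],
        "Date": pd.to_datetime(
            ["2020-01-01", "2020-01-29", "2020-03-24", "2019-06-01", "2019-06-24"]
        ),
    }
)

# Sort first so that diff() always subtracts the previous prescription of the same patient
toy = toy.sort_values(["Patient-Uid", "Date"])
toy["TimeInterval"] = toy.groupby("Patient-Uid")["Date"].diff()

# The first prescription of each patient has no predecessor, hence NaT
intervals = toy.dropna(subset=["TimeInterval"])
print(toy)
```

One consequence worth keeping in mind: a patient with a single prescription contributes no interval at all, so such patients drop out of the clustering input.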
print(df['Date'].min(), df['Date'].max())
2015-04-07 00:00:00 2020-09-03 00:00:00
print(positive_set['Date'].min(), positive_set['Date'].max())
2017-02-22 00:00:00 2020-09-03 00:00:00
The time interval, as we can observe, spans from 1 day to 1219 days.
Also, from the cells above we can see that while the dataset has records from Apr 2015 to Sep 2020, target drug administration started only in Feb 2017 and runs until the last date for which we have records. We will discuss this later on if needed.
Now let's find the ideal number of clusters for k-means clustering using the elbow method. As the model input is a timedelta column, DBSCAN won't accept it as is. So, let's stick to k-means clustering.
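A note on units: the inertia scores reported in the elbow step are on the order of 1e35, which is consistent with the timedelta intervals being treated as nanosecond counts. Converting the intervals to days first is an alternative that keeps the numbers readable. The sketch below is an assumption about how that could be done, not what this notebook ran; `X_days` is a hypothetical name.

```python
import pandas as pd

# A few illustrative intervals, including the observed minimum (1 day) and maximum (1219 days)
intervals = pd.Series(pd.to_timedelta([1, 28, 55, 1219], unit="D"))

# Whole days as plain integers, shaped (n_samples, 1) as scikit-learn expects
as_days = intervals.dt.days
X_days = as_days.to_numpy().reshape(-1, 1)

# Inertia is a sum of squared distances, so it scales with the square of the unit:
# moving from nanoseconds to days divides it by (nanoseconds per day) ** 2
ns_per_day = 24 * 60 * 60 * 1e9
scale = ns_per_day ** 2
print(X_days.ravel(), scale)
```

Because this is a uniform rescaling of a single feature, it should not change the shape of the elbow, only the magnitude of the reported scores.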
Elbow Plot to find ideal no. of clusters:
X = prescription_patterns['TimeInterval'].values.reshape(-1, 1)
k_values = range(1, 11)
inertia_scores = []
cluster_mapping = {}
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init='auto', random_state=42)  # init defaults to k-means++
    kmeans.fit(X)
    inertia_score = kmeans.inertia_
    inertia_scores.append(inertia_score)
    cluster_mapping[f'{k} Cluster' if k == 1 else f'{k} Clusters'] = inertia_score
print(cluster_mapping)
sns.set_theme()
plt.figure(figsize=(10, 7))
plt.plot(k_values, inertia_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia Score')
plt.title('Elbow Method: Inertia Scores')
plt.show()
{'1 Cluster': 8.596245577653992e+35, '2 Clusters': 4.237133518141459e+35, '3 Clusters': 2.4248190479612254e+35, '4 Clusters': 1.3773206859526056e+35, '5 Clusters': 1.165582461295304e+35, '6 Clusters': 7.009182021866456e+34, '7 Clusters': 5.067819584613085e+34, '8 Clusters': 4.035829757754878e+34, '9 Clusters': 3.079275106145406e+34, '10 Clusters': 2.6283973130400936e+34}
4 clusters looks good... With some more time, I could have plotted the silhouette, Davies-Bouldin and Calinski-Harabasz scores to check the quality of the clusters, which would have added some strength to our n-clusters pick. So, here let's go with 4 clusters and visualize the patterns.
During my previous clustering project, I ran into a huge delay in computing the silhouette, Davies-Bouldin and Calinski-Harabasz scores for n-clusters in the range of [1, 10]. That's the main reason why I am skipping them here, considering the time constraint.
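One way to keep these checks affordable is to compute the silhouette score on a random subsample through the `sample_size` argument of `silhouette_score`. The sketch below runs on synthetic stand-in data (not the real intervals), and the sample size of 200 is an arbitrary assumption. It starts at k = 2 because these scores are not defined for a single cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
# Synthetic stand-in for the interval column, in days (three made-up groups)
X_demo = np.concatenate(
    [rng.normal(28, 3, 300), rng.normal(90, 10, 100), rng.normal(400, 40, 30)]
).reshape(-1, 1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = {
        # Silhouette needs pairwise distances, so it is estimated on a subsample
        "silhouette": silhouette_score(X_demo, labels, sample_size=200, random_state=42),
        "davies_bouldin": davies_bouldin_score(X_demo, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),  # higher is better
    }

for k, s in scores.items():
    print(k, s)
```

Silhouette is the costly one because it needs pairwise distances, so subsampling helps there most. The other two scores only compare points against cluster centroids, so they are left on the full data in this sketch.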
Now, let's run k-means with 4 clusters...
Clustering:
kmeans = KMeans(n_clusters=4, random_state=42)
prescription_patterns['Cluster'] = kmeans.fit_predict(X) + 1
# Cluster labels start from 0 by default; shifting them to start from 1 for readability
prescription_patterns['Cluster'].value_counts().plot.pie(autopct='%.2f')
<Axes: ylabel='Cluster'>
Instant observation:
About 95% of patients fall in clusters 1 and 3, and the remaining 5% in clusters 2 and 4; one of those two holds only around 0.6% of patients.
Cluster 1 has the most patients, followed by Cluster 3.
Let's move forward and analyse further...
prescription_patterns.head()
| | Patient-Uid | Date | TimeInterval | Cluster |
|---|---|---|---|---|
| 2094649 | a0e9c384-1c7c-11ec-81a0-16262ee38c7f | 2020-08-05 | 28 days | 3 |
| 2164002 | a0e9c384-1c7c-11ec-81a0-16262ee38c7f | 2020-09-02 | 28 days | 3 |
| 2637552 | a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f | 2018-05-17 | 23 days | 3 |
| 3171058 | a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f | 2018-06-13 | 27 days | 3 |
| 2375328 | a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f | 2018-08-07 | 55 days | 1 |
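Since k-means here works on a single feature, each cluster corresponds to a range of interval lengths, and a quick per-cluster summary makes those ranges explicit. A minimal sketch on a toy frame: the 23, 27, 28 and 55 day rows mirror the table above, while the 60 and 400 day rows and their cluster labels are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "TimeInterval": pd.to_timedelta([23, 27, 28, 55, 60, 400], unit="D"),
        "Cluster": [3, 3, 3, 1, 1, 2],  # last two labels are hypothetical
    }
)

# Shortest interval, longest interval and row count per cluster, in days
summary = (
    toy.assign(Days=toy["TimeInterval"].dt.days)
    .groupby("Cluster")["Days"]
    .agg(["min", "max", "count"])
)
print(summary)
```

Note that the rows being clustered are intervals, so one patient can appear in more than one cluster, as the patient with both 27-day and 55-day gaps does in the table above.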
Some segregation and grouping before we get into the visualization part...
# Segregating clusters into separate dataframes to visualize each of them with ease
# (.copy() so that columns added later don't act on a slice of prescription_patterns)
cluster_1 = prescription_patterns[prescription_patterns['Cluster'] == 1].copy()
cluster_2 = prescription_patterns[prescription_patterns['Cluster'] == 2].copy()
cluster_3 = prescription_patterns[prescription_patterns['Cluster'] == 3].copy()
cluster_4 = prescription_patterns[prescription_patterns['Cluster'] == 4].copy()
# Grouping the data in each cluster by month and counting the number of prescriptions using Grouper object and groupby method
cluster_1_counts = cluster_1.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_2_counts = cluster_2.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_3_counts = cluster_3.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_4_counts = cluster_4.groupby(pd.Grouper(key='Date', freq='M')).size()
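The four near-identical groupby calls can also be expressed as one grouped count over month and cluster. This is a sketch of an alternative, not the code this notebook ran; the toy dates and labels are made up, and `monthly_counts` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2018-05-17", "2018-06-13", "2018-06-20", "2018-08-07", "2018-08-09"]
        ),
        "Cluster": [3, 3, 1, 1, 1],
    }
)

# One row per calendar month, one column per cluster, values = number of prescriptions
monthly_counts = (
    toy.groupby([toy["Date"].dt.to_period("M"), "Cluster"])
    .size()
    .unstack(fill_value=0)
)
print(monthly_counts)
```

One difference from `pd.Grouper(key='Date', freq='M')`: months with no rows at all (July 2018 in this toy data) do not appear here, whereas the Grouper version emits them with a count of 0.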
Some first line of visualization before we dig deeper into the clusters...
plt.figure(figsize=(10, 8))
plt.plot(cluster_1_counts.index, cluster_1_counts.values, label='Cluster 1')
plt.plot(cluster_2_counts.index, cluster_2_counts.values, label='Cluster 2')
plt.plot(cluster_3_counts.index, cluster_3_counts.values, label='Cluster 3')
plt.plot(cluster_4_counts.index, cluster_4_counts.values, label='Cluster 4')
plt.xlabel('Period')
plt.ylabel('Number of Prescriptions')
plt.title('Number of Prescriptions overall as time progresses')
plt.legend()
plt.show()
Inferences from above visualization:
As we already observed earlier in this notebook, target drug administration started only after Feb 2017.
Cluster-1: This group of patients got the target drug prescribed only after the patients in cluster 3 did. Eventually, though, these patients had more prescriptions than cluster 3 and than any other cluster. This might be because cluster 1 has more patients than cluster 3. This cluster had a few ups and downs, but it shows a stark rise in the number of prescriptions as time progresses and no comparably stark decrease.
Cluster-2: This group had the least number of prescriptions throughout, without any steep rise or fall. They seemed to take a tad more by mid-2020 but then went back to the same number of prescriptions. They also seem to be the latest group, having started on the target drug only around Feb or Mar 2019. We need additional details to see the trend of this cluster.
Cluster-3: These patients seem to have started on the target drug earlier than any other group. They had a steep rise in the number of prescriptions during early 2018, after which it stayed more or less on a plateau for the rest of the period. So, there was a stark rise and then a plateau, but never a steep decline in the number of prescriptions as time went by.
Cluster-4: This group does a bit better than cluster 2 in the overall picture. It had a not steep but significant rise in prescriptions around Aug or Sep 2018 and maintained the same rate throughout.
Creating new feature - Month from TimeInterval Column...
cluster_1['Month'] = (cluster_1['TimeInterval'].dt.days / 30.44).astype(int)
cluster_2['Month'] = (cluster_2['TimeInterval'].dt.days / 30.44).astype(int)
cluster_3['Month'] = (cluster_3['TimeInterval'].dt.days / 30.44).astype(int)
cluster_4['Month'] = (cluster_4['TimeInterval'].dt.days / 30.44).astype(int)
# Calculating the avg no. of prescriptions per unique patient for each month in each cluster
cluster_1_prescription = cluster_1.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_2_prescription = cluster_2.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_3_prescription = cluster_3.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_4_prescription = cluster_4.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
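It helps to be explicit about what `Month` holds: it is the gap since the previous prescription, expressed in units of 30.44 days and truncated to a whole number, so a gap under 30.44 days lands in Month 0. The toy frame below (made-up patients and gaps) walks through that arithmetic and the two-step average:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Patient-Uid": ["a", "a", "a", "b", "b"],
        "TimeInterval": pd.to_timedelta([28, 29, 55, 31, 70], unit="D"),
    }
)

# 28 / 30.44 = 0.92 -> 0,  55 / 30.44 = 1.81 -> 1,  70 / 30.44 = 2.30 -> 2
toy["Month"] = (toy["TimeInterval"].dt.days / 30.44).astype(int)

# Step 1: number of rows per (Month, patient)
per_patient = toy.groupby(["Month", "Patient-Uid"]).size()
# Step 2: average of those counts across the patients present in each Month bucket
avg = per_patient.groupby("Month").mean().reset_index(name="Average Prescriptions")
print(avg)
```

The average is taken only over patients who have at least one interval in a given bucket, so it can never fall below 1.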
Visualizations explaining the pattern in each cluster with respect to average prescriptions every month:
std = cluster_1_prescription['Average Prescriptions'].std()  # standard deviation calculation for errorbar
error_y = np.full(len(cluster_1_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_1_prescription['Month'],
    y=cluster_1_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 1',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:
Patients in this cluster have taken the target drug for at most 3 months. As we have observed already, this group has the most patients in it. So, it should be safe to assume that close to 60% of all patients who took the target drug across clusters took it for only 3 months.
Month 1: On average, a prescription is made at least thrice in this month.
Month 2: On average, a prescription is made at least twice in this month.
Month 3: On average, a prescription is made at least once in this month.
So, the overall pattern in this cluster is a decreasing trend in the average number of prescriptions, with patients taking the drug for 3 months at most and then quitting. This doesn't seem to be a healthy trend. So, we have to get them more engaged with their prescriptions.
This looked like a good cluster of patients in the overall visualization. But, as I pointed out there, it looked better mainly because of the larger number of patients in this cluster.
std = cluster_2_prescription['Average Prescriptions'].std()
error_y = np.full(len(cluster_2_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_2_prescription['Month'],
    y=cluster_2_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 2',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:
This cluster looks like a more consistent and stable cluster.
From the 9th month until the 27th month (a span of 18 months), prescriptions are consistent, with at least 1 prescription every month. That looks like a healthy trend, compared to a high number of prescriptions in one month followed by a steep fall in the subsequent months.
From the 27th until the 32nd month there are negligible or no prescriptions. At the 32nd month the average was 1, and after that prescriptions were taken 3 months later and then 5 months after that.
Some patients might have had complications which led them to take the target drug even after the 18-month span. This might also be because of recurrent symptoms.
So, the overall pattern looks pretty healthy here despite a few gaps here and there...
std = cluster_3_prescription['Average Prescriptions'].std()
error_y = np.full(len(cluster_3_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_3_prescription['Month'],
    y=cluster_3_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 3',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:
These are the patients who were the earliest to take the target drug, as we observed before.
Initially, they seem to have taken around 3 prescriptions on average.
After the first prescription, there is a steep decline in the average number of prescriptions, which drops to less than half of the initial month's average.
These patients seem to have stopped taking the target drug within a month or so, making them the worst of all the clusters we have seen so far.
Overall, this cluster represents the worst trend we have seen so far, despite its patients being the earliest drug takers and despite it holding the second largest number of patients.
std = cluster_4_prescription['Average Prescriptions'].std()
error_y = np.full(len(cluster_4_prescription), std)
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_4_prescription['Month'],
    y=cluster_4_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))
fig.update_layout(
    title='Average Prescriptions for Cluster 4',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)
fig.show()
Insights gained:
These patients all seem to have taken around 1 drug on average throughout, from the 3rd month till the 9th month (a span of 6 months), with a slight decline from the initial administration.
Apart from that, this cluster seems to be a consistent and stable one and can be placed next to the best performing cluster, cluster 2.
Overall, this group, though it has the second smallest number of people, has performed really well. The span might just reflect the practitioner deciding that this group needs the drug for only that many months. There might be other reasons as well.
Overall Summary:
At first glance, the overall visualization could have carried us away. Only when we dug deeper did the picture become clear, and it is quite contrary to that first impression.
Cluster 1 and Cluster 3, which together hold more than 95% of the patients who took the target drug, perform poorly. Cluster 1, with 60% of total patients, shows a declining trend and its patients took the drug for only 3 months. Cluster 3, with 35% of total patients, seems to be the worst of all, as its patients took the drug for just a month or so, and that too on a declining trend.
Cluster 2 and Cluster 4, which we would have thought the worst at first glance, are actually the best. Cluster 2 is the best of all: its patients took the drug over a span of 18 months, which is pretty consistent, and they took at least 1 drug on average throughout, which is stable enough. Cluster 4 follows a similar trend to cluster 2, except that it has a slight decline in the number of prescriptions and its span is around 6 months, which still sounds fine and consistent.
So, this sums it all up. With some more time, we could have explored it further... Thanks for the opportunity though...