Patient Segmentation Based on Prescription Patterns¶

Importing Necessary Libraries:¶

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Reading the data, Understanding the data and Addressing basic inconsistencies if any...¶

In [2]:
df = pd.read_parquet('train.parquet')
df.head()
Out[2]:
Patient-Uid Date Incident
0 a0db1e73-1c7c-11ec-ae39-16262ee38c7f 2019-03-09 PRIMARY_DIAGNOSIS
1 a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f 2015-05-16 PRIMARY_DIAGNOSIS
3 a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f 2018-01-30 SYMPTOM_TYPE_0
4 a0dc950b-1c7c-11ec-b6ec-16262ee38c7f 2015-04-22 DRUG_TYPE_0
8 a0dc9543-1c7c-11ec-bb63-16262ee38c7f 2016-06-18 DRUG_TYPE_1
In [3]:
df.shape
Out[3]:
(3220868, 3)
In [4]:
df.duplicated().sum()
Out[4]:
35571
In [5]:
df.index.duplicated().sum()
Out[5]:
166693
In [6]:
df.drop_duplicates(inplace = True)
df.reset_index(drop = True, inplace = True)
df.shape
Out[6]:
(3185297, 3)
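
The two checks above measure different things: `df.duplicated()` flags rows whose column values repeat an earlier row, while `df.index.duplicated()` flags repeated index labels. A minimal sketch on a made-up frame (the `toy` data and its values are assumptions for illustration, not taken from train.parquet):

```python
import pandas as pd

# Hypothetical toy frame: index labels 0 and 2 repeat, and two rows have identical content.
toy = pd.DataFrame(
    {
        "Patient-Uid": ["a", "a", "b", "c"],
        "Incident": ["DRUG_TYPE_0", "DRUG_TYPE_0", "DRUG_TYPE_1", "SYMPTOM_TYPE_0"],
    },
    index=[0, 0, 2, 2],
)

row_dupes = int(toy.duplicated().sum())          # rows repeating an earlier row's values
index_dupes = int(toy.index.duplicated().sum())  # index labels repeating an earlier label

# Dropping duplicate rows and rebuilding the index gives a clean 0..n-1 range.
cleaned = toy.drop_duplicates().reset_index(drop=True)
print(row_dupes, index_dupes, cleaned.shape)
```

Resetting the index after dropping rows is what removes the repeated labels, which is why the two calls are used together in the cell above.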

We have already understood the data well enough in 001.ipynb... So, let's straightaway jump into the context...

We only care about patients who took the target drug at least once... So, let's split those rows out of the main dataframe and use them for clustering and analysis...

Pulling just the positive set out of original dataset...¶

In [7]:
positive_set = df[df['Incident'] == 'TARGET DRUG'].copy()   # copy so later in-place edits don't act on a view of df
print(positive_set.shape)
positive_set.head()
(67218, 3)
Out[7]:
Patient-Uid Date Incident
2065342 a0eb742b-1c7c-11ec-8f61-16262ee38c7f 2020-04-09 TARGET DRUG
2065362 a0edaf09-1c7c-11ec-a360-16262ee38c7f 2018-06-12 TARGET DRUG
2065502 a0e9fa0e-1c7c-11ec-8dc7-16262ee38c7f 2019-06-11 TARGET DRUG
2065613 a0ecc615-1c7c-11ec-aa31-16262ee38c7f 2019-11-15 TARGET DRUG
2065618 a0ea612f-1c7c-11ec-8cf0-16262ee38c7f 2020-03-18 TARGET DRUG

So, we have 67k+ instances where a patient took the target drug. As already observed in 001.ipynb, the number of unique patients who took the target drug is around 9,300, which is about one third of the unique patients in the entire dataset...

Since the Incident column now holds only the target drug value, it carries no information, so let's drop it...

In [8]:
positive_set.drop('Incident', inplace = True, axis = 1)
In [9]:
# Sorting patients by Patient-uid and date to manage them better

positive_set.sort_values(by=['Patient-Uid', 'Date'], inplace=True)

Engineering new feature...¶

A new feature giving the time interval between consecutive prescriptions, derived from the Date column, gives us something to train the model on and cluster by... So, that's what we are going to do below...

Steps are simple...

  1. Add a new column to the positive set df by grouping on Patient-Uid and using the diff method to compute the difference between each date and the previous one...

  2. Drop the null values in the TimeInterval column, since each patient's first date has no previous date and so gets NaT (Not a Time)...
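
The two steps above can be sketched on toy data first. The patients and dates here are made up for illustration; only the column names match the notebook:

```python
import pandas as pd

# Hypothetical toy data: two patients, dates deliberately out of order.
toy = pd.DataFrame({
    "Patient-Uid": ["p1", "p2", "p1", "p1", "p2"],
    "Date": pd.to_datetime(
        ["2020-03-01", "2020-01-10", "2020-01-01", "2020-01-29", "2020-02-09"]
    ),
})

# diff() only makes sense once each patient's dates are in chronological order.
toy = toy.sort_values(["Patient-Uid", "Date"]).reset_index(drop=True)

# Step 1: gap between each prescription and the same patient's previous one.
toy["TimeInterval"] = toy.groupby("Patient-Uid")["Date"].diff()

# Step 2: the first prescription of each patient has no previous date (NaT), so drop it.
intervals = toy.dropna(subset=["TimeInterval"])
print(intervals)
```

Each patient loses exactly one row in step 2, so patients with a single prescription drop out of the clustering input entirely.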

Let's do it...

In [10]:
positive_set['TimeInterval'] = positive_set.groupby('Patient-Uid')['Date'].diff()

prescription_patterns = positive_set.copy()

prescription_patterns.dropna(subset=['TimeInterval'], inplace=True)
In [11]:
print(prescription_patterns['TimeInterval'].min(), prescription_patterns['TimeInterval'].max())
1 days 00:00:00 1219 days 00:00:00
In [12]:
print(df['Date'].min(), df['Date'].max())
2015-04-07 00:00:00 2020-09-03 00:00:00
In [13]:
print(positive_set['Date'].min(), positive_set['Date'].max())
2017-02-22 00:00:00 2020-09-03 00:00:00

As we can observe, the time interval ranges from 1 day to 1219 days...

From the two cells after that, we can also notice that while the dataset has records from Apr 2015 to Sep 2020, target drug administration only started in Feb 2017 and continues up to the last date on record... We will discuss this later on if needed...

Now let's find the ideal no. of clusters for k-means clustering using the elbow method... Our single feature is a timedelta column; once converted to a number (days, say) it could be fed to DBSCAN as well, but k-means is simpler to tune on one feature, so let's stick to k-means clustering...
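
As a rough sketch of the conversion mentioned above, a timedelta column can be expressed as a float number of days before clustering. The interval values, `eps` and `min_samples` below are assumptions picked for a toy example, not settings tuned for this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical intervals: a short-gap group and a long-gap group.
intervals = pd.Series(pd.to_timedelta([25, 28, 30, 31, 200, 210, 205], unit="D"))

# Express each interval as a float number of days, shaped (n_samples, 1).
X_days = (intervals / pd.Timedelta(days=1)).to_numpy().reshape(-1, 1)

# Both estimators accept the same numeric array.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_days)
dbscan_labels = DBSCAN(eps=10, min_samples=2).fit_predict(X_days)  # eps is in days here

print(kmeans_labels, dbscan_labels)
```

Working in days also keeps inertia and `eps` in units that are easy to read.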

Elbow Plot to find ideal no. of clusters:¶

In [14]:
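# TimeInterval is timedelta64[ns]; the inertia values printed below (~1e35) are consistent with distances measured in nanoseconds
X = prescription_patterns['TimeInterval'].values.reshape(-1, 1)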

k_values = range(1, 11)
inertia_scores = []
cluster_mapping = {}

for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init='auto', random_state=42)   # init defaults to 'k-means++', which is a sensible choice here...
    kmeans.fit(X)
    
    inertia_score = kmeans.inertia_
    inertia_scores.append(inertia_score)
    
    cluster_mapping[f'{k} Cluster' if k == 1 else f'{k} Clusters'] = inertia_score

print(cluster_mapping)

sns.set_theme()
plt.figure(figsize=(10, 7))
plt.plot(k_values, inertia_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia Score')
plt.title('Elbow Method: Inertia Scores')
plt.show()
{'1 Cluster': 8.596245577653992e+35, '2 Clusters': 4.237133518141459e+35, '3 Clusters': 2.4248190479612254e+35, '4 Clusters': 1.3773206859526056e+35, '5 Clusters': 1.165582461295304e+35, '6 Clusters': 7.009182021866456e+34, '7 Clusters': 5.067819584613085e+34, '8 Clusters': 4.035829757754878e+34, '9 Clusters': 3.079275106145406e+34, '10 Clusters': 2.6283973130400936e+34}
[Figure: line plot "Elbow Method: Inertia Scores" (x-axis: Number of Clusters (k), y-axis: Inertia Score)]

4 clusters looks good... With more time, I could have plotted the silhouette score, Davies-Bouldin score and Calinski-Harabasz score to check the quality of the clusters, which would have strengthened our choice of n_clusters... So, here let's go with 4 clusters and visualize the patterns...
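
One way to put a number on the elbow is the relative drop in inertia each time a cluster is added. The sketch below reuses the inertia values printed by the elbow cell; the `relative_drop` name is just for illustration:

```python
# Inertia values copied from the printed elbow output (k = 1..10).
inertia = [
    8.596245577653992e+35, 4.237133518141459e+35, 2.4248190479612254e+35,
    1.3773206859526056e+35, 1.165582461295304e+35, 7.009182021866456e+34,
    5.067819584613085e+34, 4.035829757754878e+34, 3.079275106145406e+34,
    2.6283973130400936e+34,
]

# relative_drop[k] = fraction of inertia removed by going from k - 1 to k clusters.
relative_drop = {
    k: (inertia[k - 2] - inertia[k - 1]) / inertia[k - 2]
    for k in range(2, len(inertia) + 1)
}

for k, drop in relative_drop.items():
    print(f"{k - 1} -> {k} clusters: inertia falls by {drop:.1%}")
```

The step from 4 to 5 clusters removes a much smaller share of inertia than the steps before it, which supports picking 4. The drop grows again from 5 to 6, so the elbow is not completely clear-cut.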

During my previous clustering project, computing the silhouette, Davies-Bouldin and Calinski-Harabasz scores for n_clusters in the range [2, 10] was very slow (these scores are not defined for a single cluster)... That's the main reason I am skipping it here, given the time constraint...
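
If time allows, the cost of the silhouette score can be capped with the `sample_size` argument of scikit-learn's `silhouette_score`, which scores a random subset instead of every pair of points. A minimal sketch on synthetic data (the three groups, the sample size of 200 and the range of k are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical stand-in for the interval column, in days: three loose groups.
rng = np.random.default_rng(42)
X = np.concatenate([
    rng.normal(30, 3, 300),
    rng.normal(90, 5, 300),
    rng.normal(365, 20, 300),
]).reshape(-1, 1)

scores = {}
for k in range(2, 7):  # these metrics need at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        # sample_size limits the pairwise-distance work to a random subset
        "silhouette": silhouette_score(X, labels, sample_size=200, random_state=42),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }

for k, s in scores.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
```

Higher is better for silhouette and Calinski-Harabasz, lower is better for Davies-Bouldin.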

Now, let's run k-means with 4 clusters...

Clustering:¶

In [15]:
kmeans = KMeans(n_clusters=4, random_state=42)
prescription_patterns['Cluster'] = kmeans.fit_predict(X) + 1

# Cluster labels start from 0 by default... For readability we make them start from 1
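
K-means label numbers are arbitrary, so cluster 1 is not necessarily the cluster with the shortest intervals. An optional sketch of relabelling by cluster centre, so that the numbers follow interval length; `raw_labels` and `centers` are hypothetical values (with scikit-learn they would come from `fit_predict` and `kmeans.cluster_centers_.ravel()`):

```python
import numpy as np

# Hypothetical k-means output: raw labels and the matching 1-D cluster centres (in days).
raw_labels = np.array([2, 0, 0, 3, 1, 2])
centers = np.array([60.0, 700.0, 25.0, 200.0])  # centre of raw cluster 0, 1, 2, 3

# order[j] is the raw cluster with the j-th smallest centre;
# rank[i] is the position of raw cluster i in that ordering.
order = np.argsort(centers)
rank = np.argsort(order)

ordered_labels = rank[raw_labels] + 1  # 1 = shortest intervals, 4 = longest
print(ordered_labels)
```

This is not what the notebook does; it keeps the raw labels shifted by one, so its cluster numbers carry no ordering.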
In [16]:
prescription_patterns['Cluster'].value_counts().plot.pie(autopct='%.2f')
Out[16]:
<Axes: ylabel='Cluster'>
[Figure: pie chart of the share of rows in each Cluster]

Instant observation:¶

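  • About 95% of the rows fall in clusters 1 and 3; the remaining 5% are in clusters 2 and 4, and the smaller of those two holds only around 0.6%. Each row is one prescription interval, so these shares are a proxy for patients rather than an exact patient count.

  • Cluster 1 is the largest, followed by Cluster 3.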
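
Because the clustering runs on intervals, one patient can land in several clusters. A small sketch of how row counts and unique-patient counts per cluster could be compared; the `toy` frame is made up, only the column names follow the notebook:

```python
import pandas as pd

# Hypothetical rows: one row per prescription interval, already labelled.
toy = pd.DataFrame({
    "Patient-Uid": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "Cluster": [3, 3, 1, 3, 2, 1],
})

# Rows and distinct patients per cluster, plus each cluster's share of all rows.
summary = toy.groupby("Cluster").agg(
    rows=("Patient-Uid", "size"),
    unique_patients=("Patient-Uid", "nunique"),
)
summary["row_share_%"] = (summary["rows"] / len(toy) * 100).round(2)

# Patients whose intervals are spread over more than one cluster.
multi_cluster_patients = int((toy.groupby("Patient-Uid")["Cluster"].nunique() > 1).sum())

print(summary)
print(multi_cluster_patients)
```

Running the same aggregation on `prescription_patterns` would show how far the pie chart's row shares are from true patient shares.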

Let's move forward and analyse further...

In [17]:
prescription_patterns.head()
Out[17]:
Patient-Uid Date TimeInterval Cluster
2094649 a0e9c384-1c7c-11ec-81a0-16262ee38c7f 2020-08-05 28 days 3
2164002 a0e9c384-1c7c-11ec-81a0-16262ee38c7f 2020-09-02 28 days 3
2637552 a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f 2018-05-17 23 days 3
3171058 a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f 2018-06-13 27 days 3
2375328 a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f 2018-08-07 55 days 1

Some Segregation and Grouping before we get into the visualization part...¶

In [18]:
# Segregating clusters into different dataframes to visualize each of them with ease...
# .copy() makes each one independent, so adding columns later does not act on a view

cluster_1 = prescription_patterns[prescription_patterns['Cluster'] == 1].copy()
cluster_2 = prescription_patterns[prescription_patterns['Cluster'] == 2].copy()
cluster_3 = prescription_patterns[prescription_patterns['Cluster'] == 3].copy()
cluster_4 = prescription_patterns[prescription_patterns['Cluster'] == 4].copy()
In [19]:
# Grouping the data in each cluster by month and counting the number of prescriptions using Grouper object and groupby method

cluster_1_counts = cluster_1.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_2_counts = cluster_2.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_3_counts = cluster_3.groupby(pd.Grouper(key='Date', freq='M')).size()
cluster_4_counts = cluster_4.groupby(pd.Grouper(key='Date', freq='M')).size()
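
The same monthly counts can also be produced in one pass, without a separate dataframe per cluster. A sketch on made-up rows (the `toy` frame is an assumption; `dt.to_period('M')` is used here in place of `pd.Grouper`):

```python
import pandas as pd

# Hypothetical labelled prescriptions.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-10", "2020-02-25"]
    ),
    "Cluster": [1, 3, 1, 1, 3],
})

# One row per calendar month, one column per cluster, zero where a cluster has no rows.
monthly_counts = (
    toy.groupby([toy["Date"].dt.to_period("M"), "Cluster"])
       .size()
       .unstack(fill_value=0)
)
print(monthly_counts)
```

A wide table like this can be plotted directly with `monthly_counts.plot()`, one line per cluster.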

Some first line of visualization before we dig deeper into the clusters...¶

In [20]:
plt.figure(figsize=(10, 8))

plt.plot(cluster_1_counts.index, cluster_1_counts.values, label='Cluster 1')
plt.plot(cluster_2_counts.index, cluster_2_counts.values, label='Cluster 2')
plt.plot(cluster_3_counts.index, cluster_3_counts.values, label='Cluster 3')
plt.plot(cluster_4_counts.index, cluster_4_counts.values, label='Cluster 4')

plt.xlabel('Period')
plt.ylabel('Number of Prescriptions')
plt.title('Number of Prescriptions overall as time progresses')
plt.legend()

plt.show()
[Figure: line plot "Number of Prescriptions overall as time progresses", one line per cluster (x-axis: Period, y-axis: Number of Prescriptions)]

Inferences from above visualization:¶


As we already observed earlier in this same notebook, the target drug administration started only after Feb 2017.


  • Cluster-1 : This group started getting the target drug prescribed only after the cluster 3 group did. Eventually, though, it had more prescriptions than cluster 3 and than any other cluster... This might simply be because cluster 1 is larger than cluster 3. The cluster has a few ups and downs, but shows a stark rise in no. of prescriptions as time progresses and no comparably stark decrease.

  • Cluster-2 : This group has had the fewest prescriptions throughout, without any steep rise or fall... It ticks up a tad around mid 2020 but then goes back to the same level. It also seems to be the latest group, starting on the target drug only around Feb or Mar 2019. We need additional details to see the trend of this cluster.

  • Cluster-3 : This group seems to have started on the target drug earlier than any other, with a steep rise in no. of prescriptions during early 2018, after which the count more or less plateaus for the rest of the period... So, a stark rise and then a plateau, but never a steep decline in no. of prescriptions as time went by.

  • Cluster-4 : This group does a bit better than cluster 2 in the overall picture, with a significant (though not steep) rise in prescriptions around Aug or Sep 2018 and about the same rate maintained from then on.

Creating new feature - Month from TimeInterval Column...¶

In [21]:
cluster_1['Month'] = (cluster_1['TimeInterval'].dt.days / 30.44).astype(int)
cluster_2['Month'] = (cluster_2['TimeInterval'].dt.days / 30.44).astype(int)
cluster_3['Month'] = (cluster_3['TimeInterval'].dt.days / 30.44).astype(int)
cluster_4['Month'] = (cluster_4['TimeInterval'].dt.days / 30.44).astype(int)
In [22]:
# Calculating the avg no. of prescriptions per unique patient for each month in each cluster

cluster_1_prescription = cluster_1.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_2_prescription = cluster_2.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_3_prescription = cluster_3.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
cluster_4_prescription = cluster_4.groupby(['Month', 'Patient-Uid']).size().groupby('Month').mean().reset_index(name='Average Prescriptions')
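
To make the two cells above concrete: `Month` is the gap since the previous prescription expressed in whole months (days divided by 30.44 and truncated), not a calendar month, and the average is taken over per-patient counts within each such bucket. A sketch on made-up intervals (the `toy` values are assumptions):

```python
import pandas as pd

# Hypothetical intervals for two patients.
toy = pd.DataFrame({
    "Patient-Uid": ["p1", "p1", "p1", "p2", "p2"],
    "TimeInterval": pd.to_timedelta([28, 55, 58, 61, 92], unit="D"),
})

# Whole months since the previous prescription (30.44 = average days per month).
toy["Month"] = (toy["TimeInterval"].dt.days / 30.44).astype(int)

# Count rows per (Month, patient), then average those counts within each Month.
per_patient = toy.groupby(["Month", "Patient-Uid"]).size()
avg = per_patient.groupby("Month").mean().reset_index(name="Average Prescriptions")
print(avg)
```

A 28-day gap therefore lands in month 0 and a 55-day gap in month 1, which is worth keeping in mind when reading the per-cluster plots.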

Visualizations explaining the pattern in each cluster with respect to average prescriptions every month:¶

In [23]:
std = cluster_1_prescription['Average Prescriptions'].std()    # standard deviation calculation for errorbar

error_y = np.full(len(cluster_1_prescription), std)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_1_prescription['Month'],
    y=cluster_1_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))

fig.update_layout(
    title='Average Prescriptions for Cluster 1',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)

fig.show()
[Figure: Average Prescriptions for Cluster 1 (x-axis: Month, y-axis: Average Prescriptions)]

Insights gained:¶

  • Patients in this cluster have taken the target drug for at most 3 months... As we have observed already, this group is the largest... So, it should be safe to assume that close to 60% of the patients who took the target drug across clusters took it for only 3 months...

  • Month 1 : On average, a prescription is made at least thrice in this month.

  • Month 2 : On average, a prescription is made at least twice in this month.

  • Month 3 : On average, a prescription is made at least once in this month.

So, the overall pattern in this cluster is a decreasing trend in the average no. of prescriptions, with patients taking the drug for 3 months at most and then quitting... This doesn't seem to be a healthy trend... So, we have to get them more engaged in taking their prescriptions...

This looked like a good cluster of patients in the overall visualization. But, as I pointed out there, it looked better mainly because of the larger number of patients in this cluster...

In [24]:
std = cluster_2_prescription['Average Prescriptions'].std()

error_y = np.full(len(cluster_2_prescription), std)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_2_prescription['Month'],
    y=cluster_2_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))

fig.update_layout(
    title='Average Prescriptions for Cluster 2',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)

fig.show()
[Figure: Average Prescriptions for Cluster 2 (x-axis: Month, y-axis: Average Prescriptions)]

Insights gained:¶

  • This cluster looks like a more consistent and stable one.

  • From the 9th month to the 27th month (a span of 18 months), prescriptions seem consistent, with at least 1 prescription every month. That looks like a healthy trend, as opposed to a high no. of prescriptions in one month followed by a steep fall in the subsequent months.

  • From the 27th to the 32nd month there seem to be negligible or no prescriptions. At the 32nd month the average was 1, and after that prescriptions appear again 3 months later and then 5 months after that. Some patients might have had complications which led them to take the target drug even after that 18-month span... This might also be because of recurrent symptoms...

So, the overall pattern looks pretty healthy here, despite a few gaps here and there...

In [25]:
std = cluster_3_prescription['Average Prescriptions'].std()

error_y = np.full(len(cluster_3_prescription), std)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_3_prescription['Month'],
    y=cluster_3_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))

fig.update_layout(
    title='Average Prescriptions for Cluster 3',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)

fig.show()
[Figure: Average Prescriptions for Cluster 3 (x-axis: Month, y-axis: Average Prescriptions)]

Insights gained:¶

  • This is the patient group that was the earliest to take the target drug, as we observed before.

  • Initially they seem to have taken around 3 prescriptions on average.

  • After the first prescription, there seems to be a steep decline in the average no. of prescriptions, to less than half of the initial month's average.

  • These patients seem to have stopped taking the target drug within a month or so, making them the worst of all that we have seen so far.

Overall, patients in this cluster represent the worst trend so far, despite being the earliest drug takers and the second largest group.

In [26]:
std = cluster_4_prescription['Average Prescriptions'].std()

error_y = np.full(len(cluster_4_prescription), std)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=cluster_4_prescription['Month'],
    y=cluster_4_prescription['Average Prescriptions'],
    error_y=dict(type='data', array=error_y, visible=True),
    mode='markers+lines',
    marker={'size': 16}
))

fig.update_layout(
    title='Average Prescriptions for Cluster 4',
    xaxis_title='Month',
    yaxis_title='Average Prescriptions'
)

fig.show()
[Figure: Average Prescriptions for Cluster 4 (x-axis: Month, y-axis: Average Prescriptions)]

Insights gained:¶

  • This group of patients seems to have taken around 1 prescription on average throughout, from the 3rd month till the 9th month (a span of 6 months), with a slight decline from the initial administration.

  • Apart from that, this cluster seems consistent and stable, and can be ranked next to the best performing cluster, cluster 2.

Overall, this group, though the second smallest, has performed really up to the mark. The 6-month span might just be because the practitioner decided that these patients need the drug for only that many months... There might be other reasons as well.

Overall Summary:¶


At first, looking at that overall visualization, we might have got carried away... Only when we dug deeper did the picture become clear, and it is quite contrary to what we might have thought.


  • Cluster 1 and Cluster 3, which together hold more than 95% of the patients who took the target drug, perform poorly. Cluster 1, with 60% of the patients, shows a declining trend and its patients took the drug for only 3 months, whereas cluster 3, with 35% of the patients, seems to be the worst of all, as its patients took the drug for just a month or so, and on a declining trend at that.

  • Cluster 2 and Cluster 4, which we would have thought the worst at first glance, are actually the best. Cluster 2 is the best of all, as its patients took the drug over a span of 18 months, which is pretty consistent, with at least 1 prescription on average throughout, which is stable enough. Cluster 4 follows a similar trend to cluster 2, except that it has a slight decline in no. of prescriptions and a span of around 6 months, which still sounds fine and consistent.

So, this sums it all up. With some more time, we could have explored it further... Thanks for the opportunity though...