with products, we might face the need to introduce some “rules”. Let me clarify what I mean by “rules” with practical examples:
- Imagine that we're seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For example, we found that the majority of fraudsters had specific user agents and IP addresses from certain countries.
- Another option is to send coupons to customers to use in our online store. However, we would like to target only customers who are likely to churn, since loyal users will return to the product anyway. We might figure out that the most feasible group is customers who joined less than a year ago and decreased their spending by 30%+ last month.
- Transactional businesses often have a segment of customers on which they are losing money. For example, a bank customer passed the verification and regularly reached out to customer support (so generated onboarding and servicing costs) while doing almost no transactions (so not generating any revenue). The bank might introduce a small monthly subscription fee for customers with less than $1000 in their account, since they are likely non-profitable.
Of course, in all these cases, we could have used a complex Machine Learning model that takes into account all the factors and predicts the probability (either of a customer being a fraudster or churning). Still, under some circumstances, we might prefer just a set of static rules for the following reasons:
- The speed and complexity of implementation. Deploying an ML model in production takes time and effort. If you are experiencing a fraud wave right now, it might be more feasible to go live with a set of static rules that can be implemented quickly and then work on a comprehensive solution.
- Interpretability. ML models are black boxes. Even though we might be able to understand at a high level how they work and which features are the most important ones, it's challenging to explain them to customers. In the example of subscription fees for non-profitable customers, it's important to share a set of transparent rules with customers so that they can understand the pricing.
- Compliance. Some industries, like finance or healthcare, might require auditable and rule-based decisions to meet compliance requirements.
In this article, I want to show you how we can solve business problems using such rules. We will take a practical example and go really deep into this topic:
- we will discuss which models we can use to mine such rules from data,
- we will build a Decision Tree Classifier from scratch to learn how it works,
- we will fit the sklearn Decision Tree Classifier model to extract the rules from the data,
- we will learn how to parse the Decision Tree structure to get the resulting segments,
- finally, we will explore different options for category encoding, since the sklearn implementation doesn't support categorical variables.
We have a lot of topics to cover, so let's jump into it.
Case
As usual, it's easier to learn something with a practical example. So, let's start by discussing the task we will be solving in this article.
We will work with the Bank Marketing dataset. This dataset contains data about the direct marketing campaigns of a Portuguese banking institution. For each customer, we know a bunch of features and whether they subscribed to a term deposit (our target).
Our business goal is to maximise the number of conversions (subscriptions) with limited operational resources. So, we can't call the whole user base, and we want to reach the best outcome with the resources we have.
The first step is to look at the data. So, let's load the dataset.
import pandas as pd

pd.set_option('display.max_colwidth', 5000)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_csv('bank-full.csv', sep = ';')
df = df.drop(['duration', 'campaign'], axis = 1)
# dropped columns related to the current marketing campaign,
# since they introduce data leakage

df.head()
We know a lot about the customers, including personal data (such as job type or marital status) and their previous behaviour (such as whether they have a loan or their average yearly balance).
The next step is to select a machine-learning model. There are two classes of models that are usually used when we need something easily interpretable:
- decision trees,
- linear or logistic regression.
Both options are feasible and can give us good models that can be easily implemented and interpreted. However, in this article, I would like to stick to the decision tree model because it produces actual rules, while logistic regression gives us a probability as a weighted sum of features.
Data Preprocessing
As we've seen in the data, there are plenty of categorical variables (such as education or marital status). Unfortunately, the sklearn decision tree implementation can't handle categorical data, so we need to do some preprocessing.
Let's start by transforming yes/no flags into integers.
for p in ['default', 'housing', 'loan', 'y']:
    df[p] = df[p].map(lambda x: 1 if x == 'yes' else 0)
The next step is to transform the month variable. We could use one-hot encoding for months, introducing flags like month_jan, month_feb, etc. However, there might be seasonal effects, and I believe it's more reasonable to convert months into integers following their order.
month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
# I saved 5 minutes by asking ChatGPT to do this mapping

df['month'] = df.month.map(lambda x: month_map[x] if x in month_map else x)
For all other categorical variables, let's use one-hot encoding. We will discuss different strategies for category encoding later, but for now, let's stick to the default approach.
The easiest way to do one-hot encoding is to leverage the get_dummies function in pandas.
fin_df = pd.get_dummies(
    df, columns=['job', 'marital', 'education', 'poutcome', 'contact'],
    dtype = int, # to convert to 0/1 flags
    drop_first = False # to keep all possible values
)
This function transforms each categorical variable into a separate 1/0 column for each possible value. We can see how it works for the poutcome column.
(fin_df.merge(df[['id', 'poutcome']])
    .groupby(['poutcome', 'poutcome_unknown', 'poutcome_failure',
        'poutcome_other', 'poutcome_success'], as_index = False).y.count()
    .rename(columns = {'y': 'cases'})
    .sort_values('cases', ascending = False))

Our data is now ready, and it's time to discuss how decision tree classifiers work.
Decision Tree Classifier: Theory
In this section, we'll explore the theory behind the Decision Tree Classifier and build the algorithm from scratch. If you're more interested in a practical example, feel free to skip ahead to the next part.
The easiest way to understand the decision tree model is to look at an example. So, let's build a simple model based on our data. We will use DecisionTreeClassifier from sklearn.
import sklearn.tree

feature_names = fin_df.drop(['y'], axis = 1).columns
model = sklearn.tree.DecisionTreeClassifier(
    max_depth = 2, min_samples_leaf = 1000)
model.fit(fin_df[feature_names], fin_df['y'])
The next step is to visualise the tree.
import graphviz

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True,
    proportion = True, precision = 2
    # to show shares of classes instead of absolute numbers
)
graph = graphviz.Source(dot_data)
graph

So, we can see that the model is straightforward. It's a set of binary splits that we can use as heuristics.
Let's figure out how the classifier works under the hood. As usual, the best way to understand a model is to build the logic from scratch.
The cornerstone of any such problem is the optimisation function. By default, the decision tree classifier optimises the Gini coefficient. Imagine drawing one random item from the sample and then another. The Gini coefficient equals the probability that these items are from different classes. So, our goal is to minimise the Gini coefficient.
In the case of just two classes (like in our example, where the marketing intervention was either successful or not), the Gini coefficient is defined by just one parameter p, where p is the probability of getting an item from one of the classes. Here's the formula:
\[ \textbf{gini}(\textsf{p}) = 1 - \textsf{p}^2 - (1 - \textsf{p})^2 = 2 \cdot \textsf{p} \cdot (1 - \textsf{p}) \]
If our classification is perfect and we are able to separate the classes completely, then the Gini coefficient equals 0. The worst-case scenario is p = 0.5, when the Gini coefficient equals 0.5.
With the formula above, we can calculate the Gini coefficient for each leaf of the tree. To calculate the Gini coefficient for a whole split, we need to combine the Gini coefficients of the two branches. For that, we can just take a weighted sum:
\[ \textbf{gini}_{\textsf{total}} = \textbf{gini}_{\textsf{left}} \cdot \frac{\textbf{n}_{\textsf{left}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}} + \textbf{gini}_{\textsf{right}} \cdot \frac{\textbf{n}_{\textsf{right}}}{\textbf{n}_{\textsf{left}} + \textbf{n}_{\textsf{right}}} \]
Now that we know what value we're optimising, we only need to define all possible binary splits, iterate through them and choose the best option.
Defining all possible binary splits is quite straightforward. We can do it one by one for each parameter: sort the possible values and pick thresholds between consecutive values. For example, for months (an integer from 1 to 12), the candidate thresholds are the midpoints between neighbouring values, as shown in the sketch below.

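Here's a minimal sketch of that idea for the month column (it relies on the fin_df and column names defined above):
month_values = list(sorted(fin_df['month'].unique()))
candidate_thresholds = [
    (month_values[i - 1] + month_values[i]) / 2
    for i in range(1, len(month_values))
]
print(candidate_thresholds)
# [1.5, 2.5, ..., 11.5] if all 12 months are present in the data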
Let's try to code it and see whether we come to the same result. First, we'll define functions that calculate the Gini coefficient for one dataset and for the combination of two.
def get_gini(df):
    p = df.y.mean()
    return 2*p*(1-p)

print(get_gini(fin_df))
# 0.2065
# close to what we see at the root node of the Decision Tree

def get_gini_comb(df1, df2):
    n1 = df1.shape[0]
    n2 = df2.shape[0]
    gini1 = get_gini(df1)
    gini2 = get_gini(df2)
    return (gini1*n1 + gini2*n2)/(n1 + n2)
The next step is to get all possible thresholds for one parameter and calculate their Gini coefficients.
import tqdm

def optimise_one_parameter(df, param):
    tmp = []
    possible_values = list(sorted(df[param].unique()))
    print(param)
    for i in tqdm.tqdm(range(1, len(possible_values))):
        threshold = (possible_values[i-1] + possible_values[i])/2
        gini = get_gini_comb(df[df[param] <= threshold],
            df[df[param] > threshold])
        tmp.append(
            {'param': param,
             'threshold': threshold,
             'gini': gini,
             'sizes': (df[df[param] <= threshold].shape[0],
                 df[df[param] > threshold].shape[0])
            }
        )
    return pd.DataFrame(tmp)
The final step is to iterate through all the features and calculate all possible splits.
tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(fin_df, feature))
opt_df = pd.concat(tmp_dfs)

opt_df.sort_values('gini', ascending = True).head(5)

Great, we've got the same result as our DecisionTreeClassifier model. The optimal split is whether poutcome = success or not. We've decreased the Gini coefficient from 0.2065 to 0.1872.
To continue building the tree, we need to repeat the process recursively. For example, going down the poutcome_success branch:
tmp_dfs = []
for feature in feature_names:
    tmp_dfs.append(optimise_one_parameter(
        fin_df[fin_df.poutcome_success == 1], feature))
opt_df = pd.concat(tmp_dfs)
opt_df.sort_values('gini', ascending = True).head(5)

The only question we still need to discuss is the stopping criteria. In our initial example, we used two conditions:
- max_depth = 2 — it just limits the maximum depth of the tree,
- min_samples_leaf = 1000 — it prevents us from getting leaf nodes with fewer than 1K samples. Because of this condition, we've chosen a binary split by contact_unknown even though age led to a lower Gini coefficient.
Also, I usually limit min_impurity_decrease, which prevents us from going further if the gains are too small. By gains, we mean the decrease of the Gini coefficient.
So, we’ve understood how the Decision Tree Classifier works, and now it’s time to use it in practice.
If you're interested in seeing how the Decision Tree Regressor works in detail, you can look it up in my previous article.
Decision Trees: Practice
We've already built a simple tree model with two layers, but it's definitely not enough, since it's too simple to get all the insights from the data. Let's train another Decision Tree, this time limiting only the minimum number of samples in leaves and the minimum impurity decrease (reduction of the Gini coefficient).
model = sklearn.tree.DecisionTreeClassifier(
    min_samples_leaf = 1000, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])

dot_data = sklearn.tree.export_graphviz(
    model, out_file=None, feature_names = feature_names, filled = True,
    proportion = True, precision=2, impurity = True)
graph = graphviz.Source(dot_data)

# saving the graph to a png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png','wb') as f:
    f.write(png_bytes)

That's it. We've got our rules to split customers into groups (leaves). Now, we can iterate through the groups and see which groups of customers we want to contact. Even though our model is relatively small, it's daunting to copy all the conditions from the image. Luckily, we can parse the tree structure and get all the groups from the model.
The Decision Tree classifier has an attribute tree_ that gives us access to low-level attributes of the tree, such as node_count.
n_nodes = model.tree_.node_count
print(n_nodes)
# 13
The tree_ variable also stores the whole tree structure as parallel arrays, where the i-th element of each array stores the information about node i. For the root, i equals 0.
Here are the arrays we have to represent the tree structure:
- children_left and children_right — the IDs of the left and right child nodes; if the node is a leaf, the value is -1.
- feature — the feature used to split node i.
- threshold — the threshold value used for the binary split of node i.
- n_node_samples — the number of training samples that reached node i.
- values — the shares of samples from each class.
Let’s save all these arrays.
children_left = model.tree_.children_left
# [ 1, 2, 3, 4, 5, 6, -1, -1, -1, -1, -1, -1, -1]
children_right = model.tree_.children_right
# [12, 11, 10, 9, 8, 7, -1, -1, -1, -1, -1, -1, -1]
features = model.tree_.feature
# [30, 34, 0, 3, 6, 6, -2, -2, -2, -2, -2, -2, -2]
thresholds = model.tree_.threshold
# [ 0.5, 0.5, 59.5, 0.5, 6.5, 2.5, -2. , -2. , -2. , -2. , -2. , -2. , -2. ]
num_nodes = model.tree_.n_node_samples
# [45211, 43700, 30692, 29328, 14165, 4165, 2053, 2112, 10000,
# 15163, 1364, 13008, 1511]
values = model.tree_.value
# [[[0.8830152 , 0.1169848 ]],
# [[0.90135011, 0.09864989]],
# [[0.87671054, 0.12328946]],
# [[0.88550191, 0.11449809]],
# [[0.8530886 , 0.1469114 ]],
# [[0.76686675, 0.23313325]],
# [[0.87043351, 0.12956649]],
# [[0.66619318, 0.33380682]],
# [[0.889 , 0.111 ]],
# [[0.91578184, 0.08421816]],
# [[0.68768328, 0.31231672]],
# [[0.95948647, 0.04051353]],
# [[0.35274653, 0.64725347]]]
It will be more convenient to work with a hierarchical view of the tree structure, so let's iterate through all the nodes and, for each node, save the parent node ID and whether it was a right or left branch.
hierarchy = {}
for node_id in range(n_nodes):
    if children_left[node_id] != -1:
        hierarchy[children_left[node_id]] = {
          'parent': node_id,
          'condition': 'left'
        }
    if children_right[node_id] != -1:
        hierarchy[children_right[node_id]] = {
          'parent': node_id,
          'condition': 'right'
        }

print(hierarchy)
# {1: {'parent': 0, 'condition': 'left'},
#  12: {'parent': 0, 'condition': 'right'},
#  2: {'parent': 1, 'condition': 'left'},
#  11: {'parent': 1, 'condition': 'right'},
#  3: {'parent': 2, 'condition': 'left'},
#  10: {'parent': 2, 'condition': 'right'},
#  4: {'parent': 3, 'condition': 'left'},
#  9: {'parent': 3, 'condition': 'right'},
#  5: {'parent': 4, 'condition': 'left'},
#  8: {'parent': 4, 'condition': 'right'},
#  6: {'parent': 5, 'condition': 'left'},
#  7: {'parent': 5, 'condition': 'right'}}
The next step is to filter out the leaf nodes, since they are terminal and the most interesting for us: they define the customer segments.
leaves = []
for node_id in range(n_nodes):
    if (children_left[node_id] == -1) and (children_right[node_id] == -1):
        leaves.append(node_id)
print(leaves)
# [6, 7, 8, 9, 10, 11, 12]
leaves_df = pd.DataFrame({'node_id': leaves})
The next step is to determine all the conditions applied to each group, since they will define our customer segments. The first function, get_condition, gives us the tuple of feature, condition type and threshold for a node.
def get_condition(node_id, condition, features, thresholds, feature_names):
    # print(node_id, condition)
    feature = feature_names[features[node_id]]
    threshold = thresholds[node_id]
    cond = '>' if condition == 'right' else '<='
    return (feature, cond, threshold)

print(get_condition(0, 'right', features, thresholds, feature_names))
# ('poutcome_success', '>', 0.5)
The next function allows us to recursively go from a leaf node up to the root and collect all the binary splits.
def get_decision_path_rec(node_id, decision_path, hierarchy):
    if node_id == 0:
        yield decision_path
    else:
        parent_id = hierarchy[node_id]['parent']
        condition = hierarchy[node_id]['condition']
        for res in get_decision_path_rec(parent_id, decision_path + [(parent_id, condition)], hierarchy):
            yield res
decision_path = list(get_decision_path_rec(12, [], hierarchy))[0]
print(decision_path)
# [(0, 'right')]

fmt_decision_path = list(map(
    lambda x: get_condition(x[0], x[1], features, thresholds, feature_names),
    decision_path))
print(fmt_decision_path)
# [('poutcome_success', '>', 0.5)]
Let's wrap the logic of executing the recursion and formatting into a single function.
def get_decision_path(node_id, features, thresholds, hierarchy, feature_names):
    decision_path = list(get_decision_path_rec(node_id, [], hierarchy))[0]
    return list(map(lambda x: get_condition(x[0], x[1], features, thresholds,
        feature_names), decision_path))
We've learned how to get each node's binary split conditions. The only remaining logic is to combine the conditions.
def get_decision_path_string(node_id, features, thresholds, hierarchy,
    feature_names):
    conditions_df = pd.DataFrame(get_decision_path(node_id, features, thresholds, hierarchy, feature_names))
    conditions_df.columns = ['feature', 'condition', 'threshold']

    left_conditions_df = conditions_df[conditions_df.condition == '<=']
    right_conditions_df = conditions_df[conditions_df.condition == '>']

    # deduplication: keep the tightest threshold for each feature
    left_conditions_df = left_conditions_df.groupby(['feature', 'condition'], as_index = False).min()
    right_conditions_df = right_conditions_df.groupby(['feature', 'condition'], as_index = False).max()

    # concatenation
    fin_conditions_df = pd.concat([left_conditions_df, right_conditions_df])\
        .sort_values(['feature', 'condition'], ascending = False)

    # formatting
    fin_conditions_df['cond_string'] = list(map(
        lambda x, y, z: '(%s %s %.2f)' % (x, y, z),
        fin_conditions_df.feature,
        fin_conditions_df.condition,
        fin_conditions_df.threshold
    ))
    return ' and '.join(fin_conditions_df.cond_string.values)

print(get_decision_path_string(12, features, thresholds, hierarchy,
    feature_names))
# (poutcome_success > 0.50)
Now, we can calculate the conditions for each group.
leaves_df['condition'] = leaves_df['node_id'].map(
    lambda x: get_decision_path_string(x, features, thresholds, hierarchy,
        feature_names)
)
The last step is to add the size and conversion of each group.
leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total)\
    .map(lambda x: int(round(x/100)))
leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()
Now, we can use these rules to make decisions. We can sort the groups by conversion (the probability of a successful contact) and pick the customers with the highest probability.
leaves_df.sort_values('conversion', ascending = False)\
    .drop('node_id', axis = 1).set_index('condition')

Imagine we have the resources to contact only around 10% of our user base; then we can focus on the first three groups. Even with such limited capacity, we would expect to get almost 40% conversion. It's a really good result, and we've achieved it with just a bunch of straightforward heuristics.
In real life, it's also worth testing the model (or heuristics) before deploying it in production. I would split the training dataset into training and validation parts (by time, to avoid leakage) and check the performance of the heuristics on the validation set to get a better view of the actual model quality.
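A minimal sketch of such an out-of-time holdout, assuming the rows are in chronological order (the UCI dataset description notes that the records are ordered by date):
split_idx = int(fin_df.shape[0] * 0.9)
time_train_df = fin_df.iloc[:split_idx]   # earlier contacts, used for training
time_valid_df = fin_df.iloc[split_idx:]   # later contacts, used for validation
print(time_train_df.shape[0], time_valid_df.shape[0])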
Working with high cardinality categories
Another topic worth discussing in this context is category encoding, since we have to encode the categorical variables for the sklearn implementation. We've used a straightforward approach with one-hot encoding, but in some cases it doesn't work.
Imagine we also have a region in the data. I've synthetically generated English cities for each row. We have 155 unique regions, so the number of features has increased to 190.
model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 100, min_impurity_decrease=0.001)
model.fit(fin_df[feature_names], fin_df['y'])
The basic tree now has lots of conditions based on regions, and it's not convenient to work with them.

In such a case, it might not be meaningful to explode the number of features, and it's time to think about encoding. There's a comprehensive article, “Categorically: Don't explode — encode!”, that shares a bunch of different options for handling high cardinality categorical variables. I think the most feasible ones in our case are the following two (a toy illustration of both follows the list):
- Count or Frequency Encoder, which shows good performance in benchmarks. This encoding assumes that categories of similar size have similar characteristics.
- Target Encoder, where we encode the category by the mean value of the target variable. It allows us to prioritise segments with higher conversion and deprioritise segments with lower conversion. Ideally, it would be better to use historical data to get the averages for the encoding, but we will use the existing dataset.
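As a minimal sketch of the mechanics of these two encoders on a small hypothetical column (the toy data below is made up purely for illustration):
import pandas as pd

toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'y':    [1,   0,   0,   1,   1,   0]
})
# Count / Frequency Encoding: replace the category with its size
toy['city_count'] = toy.groupby('city')['city'].transform('count')
# Target Encoding: replace the category with the mean of the target
toy['city_target'] = toy.groupby('city')['y'].transform('mean')
print(toy)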
However, it will be interesting to compare different approaches, so let's split our dataset into train and test, keeping 10% for validation. For simplicity, I've used one-hot encoding for all columns except region (since it has the highest cardinality).
from sklearn.model_selection import train_test_split
fin_df = pd.get_dummies(df, columns=['job', 'marital', 'education',
'poutcome', 'contact'], dtype = int, drop_first = False)
train_df, test_df = train_test_split(fin_df,test_size=0.1, random_state=42)
print(train_df.shape[0], test_df.shape[0])
# (40689, 4522)
For convenience, let's combine all the logic for parsing the tree into one function.
def get_model_definition(model, feature_names):
    n_nodes = model.tree_.node_count
    children_left = model.tree_.children_left
    children_right = model.tree_.children_right
    features = model.tree_.feature
    thresholds = model.tree_.threshold
    num_nodes = model.tree_.n_node_samples
    values = model.tree_.value

    hierarchy = {}
    for node_id in range(n_nodes):
        if children_left[node_id] != -1:
            hierarchy[children_left[node_id]] = {
              'parent': node_id,
              'condition': 'left'
            }
        if children_right[node_id] != -1:
            hierarchy[children_right[node_id]] = {
              'parent': node_id,
              'condition': 'right'
            }

    leaves = []
    for node_id in range(n_nodes):
        if (children_left[node_id] == -1) and (children_right[node_id] == -1):
            leaves.append(node_id)
    leaves_df = pd.DataFrame({'node_id': leaves})
    leaves_df['condition'] = leaves_df['node_id'].map(
        lambda x: get_decision_path_string(x, features, thresholds, hierarchy, feature_names)
    )
    leaves_df['total'] = leaves_df.node_id.map(lambda x: num_nodes[x])
    leaves_df['conversion'] = leaves_df['node_id'].map(lambda x: values[x][0][1])*100
    leaves_df['converted_users'] = (leaves_df.conversion * leaves_df.total).map(lambda x: int(round(x/100)))
    leaves_df['share_of_converted'] = 100*leaves_df['converted_users']/leaves_df['converted_users'].sum()
    leaves_df['share_of_total'] = 100*leaves_df['total']/leaves_df['total'].sum()
    leaves_df = leaves_df.sort_values('conversion', ascending = False)\
        .drop('node_id', axis = 1).set_index('condition')
    leaves_df['cum_share_of_total'] = leaves_df['share_of_total'].cumsum()
    leaves_df['cum_share_of_converted'] = leaves_df['share_of_converted'].cumsum()
    return leaves_df
Let's create an encodings data frame, calculating frequencies and conversions per region.
region_encoding_df = train_df.groupby('region', as_index = False)\
    .aggregate({'id': 'count', 'y': 'mean'}).rename(columns =
        {'id': 'region_count', 'y': 'region_target'})
Then, we merge it into our training and validation sets. For the validation set, we also fill NAs with the averages.
train_df = train_df.merge(region_encoding_df, on = 'region')
test_df = test_df.merge(region_encoding_df, on = 'region', how = 'left')

test_df['region_target'] = test_df['region_target']\
    .fillna(region_encoding_df.region_target.mean())
test_df['region_count'] = test_df['region_count']\
    .fillna(region_encoding_df.region_count.mean())
Now, we can fit the models and get their structures.
count_feature_names = train_df.drop(
['y', 'id', 'region_target', 'region'], axis = 1).columns
target_feature_names = train_df.drop(
['y', 'id', 'region_count', 'region'], axis = 1).columns
print(len(count_feature_names), len(target_feature_names))
# (36, 36)
count_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500,
min_impurity_decrease=0.001)
count_model.fit(train_df[count_feature_names], train_df['y'])
target_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500,
min_impurity_decrease=0.001)
target_model.fit(train_df[target_feature_names], train_df['y'])
count_model_def_df = get_model_definition(count_model, count_feature_names)
target_model_def_df = get_model_definition(target_model, target_feature_names)
Let's look at the structures and select the top categories covering up to 10-15% of our audience. We can also apply these conditions to our validation set to test the approach in practice.
Let's start with the Count Encoder.

# customers falling into the highest-conversion leaves of count_model
count_selected_df = test_df[
    (test_df.poutcome_success > 0.50) |
    ((test_df.poutcome_success <= 0.50) & (test_df.age > 60.50)) |
    ((test_df.region_count > 3645.50) & (test_df.poutcome_success <= 0.50)
        & (test_df.age <= 60.50))
]
print(count_selected_df.shape[0], count_selected_df.y.sum())
We can also see what regions have been selected, and it’s only Manchester.

Let’s continue with the Target encoding.

# customers falling into the highest-conversion leaves of target_model
target_selected_df = test_df[
    ((test_df.region_target > 0.21) & (test_df.poutcome_success > 0.50)) |
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50)
        & (test_df.month > 8.50) & (test_df.housing <= 0.50)) |
    ((test_df.region_target > 0.21) & (test_df.poutcome_success <= 0.50)
        & (test_df.month > 6.50) & (test_df.month <= 8.50))
]
print(target_selected_df.shape[0], target_selected_df.y.sum())
We see a slightly lower number of selected users for communication but a significantly higher number of conversions: 248 vs. 227 (+9.3%).
Let’s also look at the selected categories. We see that the model picked up all the cities with high conversions (Manchester, Liverpool, Bristol, Leicester, and New Castle), but there are also many small regions with high conversions solely due to chance.
region_encoding_df[region_encoding_df.region_target > 0.21]\
    .sort_values('region_count', ascending = False)

In our case, it doesn't influence the results much, since the share of such small cities is low. However, if you have many more small categories, you might see significant drawbacks from overfitting. Target Encoding can be tricky at this point, so it's worth keeping an eye on your model's output.
Luckily, there's an approach that can help you overcome this issue. Following the article “Encoding Categorical Variables: A Deep Dive into Target Encoding”, we can add smoothing. The idea is to blend the group's conversion rate with the overall average: the larger the group, the more weight its data carries, while smaller segments lean more towards the global average.
First, I've chosen parameters that make sense for our distribution, looking at a bunch of options. I chose to use the global average for groups under 100 people. This part is a bit subjective, so use common sense and your knowledge of the business domain.
import numpy as np
import matplotlib.pyplot as plt

global_mean = train_df.y.mean()

k = 100
f = 10

smooth_df = pd.DataFrame({'region_count': np.arange(1, 100001, 1)})
smooth_df['smoothing'] = (1 / (1 + np.exp(-(smooth_df.region_count - k) / f)))

ax = plt.scatter(smooth_df.region_count, smooth_df.smoothing)
plt.xscale('log')
plt.ylim([-.1, 1.1])
plt.title('Smoothing')

Then, based on the chosen parameters, we can calculate the smoothing coefficients and blended averages.
# keep the raw (unsmoothed) average under a separate name
region_encoding_df['raw_region_target'] = region_encoding_df['region_target']
region_encoding_df['smoothing'] = (1 / (1 + np.exp(-(region_encoding_df.region_count - k) / f)))
region_encoding_df['region_target'] = region_encoding_df.smoothing * region_encoding_df.raw_region_target \
    + (1 - region_encoding_df.smoothing) * global_mean
Then, we can fit another model with the smoothed target category encoding.
# replace the unsmoothed region_target with the smoothed version
train_df = train_df.drop('region_target', axis = 1)\
    .merge(region_encoding_df[['region', 'region_target']], on = 'region')
test_df = test_df.drop('region_target', axis = 1)\
    .merge(region_encoding_df[['region', 'region_target']], on = 'region', how = 'left')
test_df['region_target'] = test_df['region_target']\
    .fillna(region_encoding_df.region_target.mean())

target_v2_feature_names = train_df.drop(['y', 'id', 'region'], axis = 1)\
    .columns

target_v2_model = sklearn.tree.DecisionTreeClassifier(min_samples_leaf = 500,
    min_impurity_decrease=0.001)
target_v2_model.fit(train_df[target_v2_feature_names], train_df['y'])
target_v2_model_def_df = get_model_definition(target_v2_model,
    target_v2_feature_names)

# customers falling into the highest-conversion leaves of target_v2_model
target_v2_selected_df = test_df[
    ((test_df.region_target > 0.12) & (test_df.poutcome_success > 0.50)) |
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50)
        & (test_df.month > 8.50) & (test_df.housing <= 0.50)) |
    ((test_df.region_target > 0.12) & (test_df.poutcome_success <= 0.50)
        & (test_df.month > 6.50) & (test_df.month <= 8.50))
]
print(target_v2_selected_df.shape[0], target_v2_selected_df.y.sum())
We can see that we’ve eliminated the small cities and prevented overfitting in our model while keeping roughly the same performance, capturing 247 conversions.
region_encoding_df[region_encoding_df.region_target > 0.12]

It's also possible to use TargetEncoder from sklearn, which smooths and blends the category and global means depending on the segment size. However, it also relies on internal randomness, which isn't ideal for our case of heuristics.
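For reference, here's a minimal sketch of what that could look like, assuming scikit-learn 1.3+ (where TargetEncoder is available) and the column names used above:
from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder(smooth='auto', random_state=42)
# fit_transform uses internal cross-fitting, so the encoded training values
# are not a plain per-category mean of the target
train_region_encoded = encoder.fit_transform(train_df[['region']], train_df['y'])
test_region_encoded = encoder.transform(test_df[['region']])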
You can find the full code on GitHub.
Summary
In this article, we explored how to extract simple “rules” from data and use them to inform business decisions. We generated heuristics using a Decision Tree Classifier and touched on the important topic of categorical encoding, since decision tree algorithms in sklearn require categorical variables to be converted.
We saw that this rule-based approach can be surprisingly effective, helping you reach business decisions quickly. However, it's worth noting that this simplistic approach has its drawbacks:
- We are trading off the model's power and accuracy for simplicity and interpretability, so if you're optimising for accuracy, choose another approach.
- Even though we're using a set of static heuristics, your data can still change, and the rules might become outdated, so you need to recheck your model from time to time.
Thank you a lot for reading this article. I hope it was insightful to you. If you have any follow-up questions or comments, please leave them in the comments section.
Reference
Dataset: Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306