Creating an AI algorithm for ICP scoring in the B2B market
Many people have heard about the big skepticism in the field of artificial intelligence from final consumers: “We have been on the market for 50 years, so we know better what to do”.
You are talking about your AI, but how does it work? Can you write an algorithm for our sales department? How can we interpret the results?
From these desires, a popular task emerged: “ICP-segmentation for B2B”. ICP — ideal customer profile, target segment in sales. There are parameters of a company at the input: country, amount, income; at the output, we predict some other parameters:
- conversion of potential client
- what income can give you the client
- is this person the one who makes decisions in his company, etc.
Next, we calculate the scoring and, consequently, the probability of getting into one or another group.
The task looks clear enough: we take the input parameters, transform them into the required form, substitute them into the model, and train the model. However, as mentioned earlier, the end-users of our solution will want not only to see the result but also to understand why a certain resulting scheme solves their problem better than a team of professional employees with good intuition. Therefore, in addition to being effective, our decision must be interpretable.
Trying to find a compromise between the efficiency and interpretability of the problem, we came to the following math model:
Or, more simply, a truncated quadratic model with a constraint on the influence of a parameter. The final score is from -100 to 100, and everything is interpreted quite well. You can try to create a model based on intuition (we work with America, which means we will put the most weight on it), or you can find a model based on actual conversion data or profit.
This is where the brute-force problem came from, which was supposed to be solved using a genetic algorithm because the methods of ordinary linear programming (with simplifications) stumble over the amount of data. Each set(x,y)will be briefly referred to as a gene.
At each stage, we calculate the scoring using the formula for each row in the dataset and calculate, using one of several loss functions, how well this gene predicts the result or not.
There are several loss functions that have been investigated:
1 — Matrix: there is a matrix of penalties where we calculate a penalty or a bonus for each row, for example, if we try to split the sample into 2 groups, then in case of errors of the first or second kind, we give a penalty -1, and in case of falling into the correct group, we give a bonus 1
2 — By Average: Calculate the average target parameter for each group and try to find the maximum distance between the groups.
Despite the fact that the results of using the first loss function are easier to interpret, the resulting groups are too heterogeneous. As a result, the effectiveness is grouped when the number of groups is more than two. As a result, the groups are too heterogeneous due to the randomness of the data for groupings, so let’s choose the second option. Sample distribution after PCA (chaotic reference):
Algorithm Optimization Methods
A genetic algorithm is a set of transformations of the original sample and, at each stage, discarding the worst results. Initially, the algorithm is naturally multithreaded for large problems, and the growth is not linear. For example, running on 16 cores gives an increase of almost 10 times, in contrast to the algorithm without multithreading. It is also worth looking at optimizing the loss function for sampling. But there are also less obvious ways.
Reducing the Number of Map-reduces in Multithreaded Computation
The first thing I want to discuss is reducing the number of map-reduces during parallelization. The genetic algorithm is divided into several phases, namely: reproduction, mutation, calculation of the next generation. All this can be combined into one task with a trigger inside the parameters. That is, instead of:
CURRENT_POPULATION = mutation.Mutation(CURRENT_POPULATION,triggers)
CURRENT_POPULATION = reproduction.Reproduction_multythread(CURRENT_POPULATION)
Can be combined into one function:
CURRENT_POPULATION = mut_rep_iteration.Itr(CURRENT_POPULATION,triggers)
Results of comparing the speed of one iteration:
The combined function (new fn) is almost 10 percent better than the previous (old fn). The algorithm itself can count almost a full day, which means that the calculations can take 2–3 hours less.
Selection of the Initial Approximation
To begin with, let us recall the mathematical model that we are researching:
Obviously, if we take not random values in the genetic algorithm, but, for example, some valid gene, then the algorithm will need fewer iterations.
Let’s apply several controversial approximations.
1 — Approximate option only yⱼ:
- make a random set x, substitute it into the formula and get the usual linear equation in each row
- calculate each yⱼ as a system of equations
- to fit the conditions, substitute the extreme y values again into the formula, again solve the usual SLAE system already with fewer variables and repeat several times
round each yⱼ
- we obtain some initial approximation for each parameter yⱼ
This assumption makes sense since many parameters y are often repeated in different schemes. If some parameter is good (strongly affects the result), then it will have a maximum or minimum value in any scheme.
2 — General approximation variant (gradient descent):
- do option 1
- we solve the inverse problem: substitute the parameters y and calculate x
- repeat N times
Here, in addition to local parameters, we also want to see which group most strongly affects the final result. The speed of the algorithm is low. Therefore, for each variant, five runs were carried out.
Current parameter set:
Generation survivors: 50,000
Number of children: 25,000
Number of mutants: 12,500
Now let’s look at the results:
Variant 1 is the same as a clean launch, which means that the approximation did not lead to positive consequences, but the second one (variant 2) looks promising, but let’s look at the execution time graph.
The time taken to execute is significantly increased due to the time spent on the initial approximation. This is a bit unfair. So, let’s try to compare the usual algorithm with variant 2, giving more resources to the regular algorithm and approximating the time by expanding the sample at each iteration.
Despite the resources being far superior to option number 2 (fp2), achieving a consistently better result is impossible. In theory, this can be corrected by searching for hyperparameters and filtering the next generation to protect against dominance. Research in this direction continues, so the second version of the initial approximation is chosen.
An initial approximation in this problem stabilizes the answer and optimizes resources. The initial approximation was chosen as several iterations of gradient descent:
gen = ret_random_gen(0) #Full random gen
for i in range(0,iters):
gen = CalculateByGenMG(gen) # calculate only X
gen = CalculateByGenML(gen) # calculate only Y
In the end, I would like to provide data from the graphics of model validation in production.
Training is based on the data on the size of the check. Each lead falls into a group from Low to Very High, and as you can see, the higher the rank, the higher the probability of a high check. The incoming data structure has slightly changed, but the differentiation of groups remains stable.
A genetic algorithm is a resource-intensive process, and optimizing this process can become the main task in its creation. The algorithm is easy to use, helps find the optimal solution in brute force problems and proves its effectiveness.