1 Introduction

In the 1980s, Breiman et al. (1984) developed Classification and Regression Trees (CART) as alternative, non-parametric approaches to classification and regression. The tree-structured predictors are obtained by recursively partitioning the data space into a set of (usually) axis-parallel hyper-rectangles, with similar response values grouped together, so that, in the end, the resulting groups are as homogeneous as possible with respect to the considered response. The resulting decision trees suffer from high variance, i.e. the decision trees learned from different data samples drawn from the same distribution may be quite different. In other words, the learned models are unstable, i.e. small perturbations in the training set could cause large changes in the resulting predictors.

Breiman (1996) suggested to improve the stability of decision trees by learning an ensemble of diverse trees whose predictions are combined. In particular, by repeatedly perturbing the training set, learning multiple decision trees from these perturbed sets, and finally then combining their individual predictions into an overall prediction, it is possible to improve the quality of the predictions over those obtained from individual single trees. The general class of such predictors is called perturb and combine (P&C; see Breiman (1996) for other examples of P&C procedures), and they in turn form a special case of what is generally known as ensemble techniques in machine learning (Dietterich, 2000).

One of the best known P&C methods, bagging (Boostrap AGGregatING; Breiman 1996), perturbs the training set several times using bootstrapping, a general statistical method in which several (non-disjoint) training sets are obtained by drawing examples randomly, with replacement, from a single base dataset (Efron, 1982). The multiple predictors, which are learned from such bootstrap samples, are then combined by simple voting, in the case of classification, or by averaging for regression. In each iteration, the bootstrap samples are drawn with the same probability, which guarantees independent bootstrap training sets. Bagging without replacement was proposed as “subagging = subsample aggregating” by Bühlmann and Yu (2000, 2002) as a computational cheaper version of bagging (Bühlmann, 2003). Buja and Stuetzle (2006) analytically proved the equivalence of resampling with and without replacement in Bagging. A comparison between bagging with or without replacement can be found also in Friedman and Hall (2007).

As an alternative to bagging, Freund and Schapire (1996) and Freund and Schapire (1998) proposed boosting, another P&C method which aims at a faster reduction of training set errors. The key idea is to learn the ensemble of predictors sequentially, so that each predictor can focus on correcting the mistakes of the previous ones. This is obtained by increasing the probabilities for sampling examples that have been misclassified in previous iterations, so that they are more prevalent in subsequent iterations. Thus, unlike in bagging, the samples used in the boosting iterations are clearly not independent. Breiman (1998) considers boosting to be a special case of the class of arcing (adaptive resampling and combining) classifiers, where units that have frequently been classified as incorrect have an increased probability to be drawn in subsequent iterations.

Decision trees and ensemble methods have been widely applied to quantitative or qualitative response variables, in sociological, medical or psychological domains. For example, Austin (2012) compare inferences on the effect of in-hospital smoking cessation counselling on subsequent mortality in patients hospitalized with an acute myocardial infarction using ensemble-based methods (bagged regression trees, random forests, and boosted regression trees) to directly estimate average treatment effects by imputing potential outcomes. Stegmann et al. (2018) introduce nonlinear Longitudinal Recursive Partitioning (nLRP) and illustrate its use with empirical data from the kindergarten cohort of the Early Childhood Longitudinal Study. Or again, recently, Grimm and Jacobucci (2020) improve the reliability of splitting functions in the case of small data samples, and illustrate their performance using data on depression and suicidal ideation from the National Longitudinal Survey of Youth. In all these works decision trees and ensemble methods are applied in situations where the outcome is a categorical or a continuous variable but, to our knowledge, there is not much work in that area that allows to cope with ranking or preference data.

Preference data may be viewed as samples in which some subjects (voters, judges), characterized by a set of features, indicate their preferences over a set of alternatives (items). In many real-world cases, items may also receive the same preference by a judge, resulting in rankings with ties. The goal is to learn a function that allows to rank these items according to the predicted preferences of a new subject, given his or her characteristics. In preference learning (Fürnkranz & Hüllermeier, 2011), this setting is referred to as label ranking, to distinguish it from object ranking, a related but different scenario which does not involve judges, but tries to predict a universal preference degree for items based on their characteristics. A typical example is to learn to rank Web pages for a search result. Freund et al. (2003) discuss a boosting approach for preference data in object ranking problems, but very few works can be found in literature on label ranking problems. Aledo et al. (2017), taking as a basis the Label Ranking Tree (LRT)-based algorithm proposed by Hüllermeier et al. (2008) and described in (Cheng et al., 2009), design weaker tree-based models which can be learnt more efficiently and show that bagging these weak learners improves not only the LRT algorithm, but also the state-of-the-art competitors. Werbin-Ofir et al. (2019) focus mainly on the aggregation problem and again used the LRT label ranking base algorithm, which is based on the estimation of a Mallows model (Mallows, 1957) in each node of the tree. They propose to apply voting rules, typically used in the field of social choice, as the aggregation technique for label ranking ensembles. As they found that there is no single rule that consistently outperforms all other voting rules and that under different settings, different voting rules perform the best, they propose a novel aggregation method for label ranking ensembles, the voting rule selector (VRS), which learns the best voting rule to be used in a given setting. Both de Sá et al. (2017) and Zhou and Qiu (2018) use random forests (Breiman, 2001) for ensemble formation, the former considering label ranking trees (de Sá et al., 2015) as base classifiers, the latter using the top level in the ranking as the class in decision tree construction. A key, non-trivial step in ranking trees is to devise a procedure for aggregating the various rankings associated with all examples in a node into a consensus ranking. To this end, de Sá et al. (2017) use the average ranking, whereas Zhou and Qiu (2018) rely on Borda’s method. Dery and Shmueli (2020) propose a boosting algorithm, BoostLR, quite similar to the one proposed in our paper. They use LRT (Hüllermeier et al., 2008) as the base algorithm and weighted Borda’s count as aggregation method. We will show in Section 4.2 how and why our proposal differs from Dery and Shmueli (2020).

The purpose of this paper is to define, analytically and empirically, boosting and bagging algorithms for rankings, with or without ties, and to evaluate these algorithms on real and simulated data. We start with a brief description of ranking data and distances between rankings in Section 2. Section 3 introduces decision trees and how they can be applied to ranking data, describing the rank aggregation process. Section 4 shows how these ranking trees can be used in bagging and boosting algorithms. Finally, in Section 5, we discuss a real application and a simulation study, and, in the end, conclude with a brief conclusion.

2 Ranking and Preference Data

Ranking and classification are frequently used in order to grade various experiences in our lives. Grouping and ordering a set of elements is quite easy and meaningful. Examples include rankings of football teams, universities, and countries.

A special case of ranking data constitutes preference data, where some individuals (called judges) indicate their preferences over a set of alternatives (which, in the literature, are also referred to as options, stimuli, or items) and are asked to order them from most to least preferred.

2.1 Motivation

Preference data can be found as pairwise comparisons, when respondents are asked to select the more preferred alternative from each pair of alternatives. Note that paired comparison and ranking methods, especially when differences between choice alternatives are small, impose lower constraints on the response behaviour than rating methods. Rounds et al. (1978) say that “The method of paired comparisons … has been frequently applied in conjunction with the law of comparative judgment to scaling psychological variables.”, and they cite the seriousness of crimes, patients’ mental health, visual illusions, life goals, food preferences etc. as examples of stimuli that have been scaled with these methods. Presenting the objects in pair to judges, and then producing paired comparison rankings, could be the natural experimental procedure when the objects to be ranked are really similar and the introduction of other items may be confusing (David, 1969). Of course, as the number of stimuli increases, paired comparisons become exceedingly time-consuming and laborious for the subjects, while directly ranking the stimuli involves less labour. A great deal of data, in psychological research for example, can be considered the result of a choice process (Maydeu-Olivares and Bockenholt, 2009), and if item and person characteristics are available, appropriate statistical tools can be used to identify how respondents differ in their perception and preferences for a set of choice options. Maydeu-Olivares and Bockenholt (2005) stress that the methods of ranking and paired comparison play an essential role in the measurement of preferences, attitudes, and values. They report two applications: in the first one, a Spanish university wished to investigate career preferences among its undergraduate psychology students. A pilot study was performed in which a sample of 57 psychology sophomores was asked to express their preferences for four broad psychology career areas (Academic, Clinical, Educational, and Industrial) using a ranking task. The second one has the goal to model purchasing preferences for seven compact cars among Spanish college students. In both examples, covariates may be included to explain individual differences in the evaluation of choice alternatives.

Preference data are typical of marketing analyses, but also of political and social choices science surveys and, more generally, of behavioural studies (staff selections, policy alternatives assessment, etc.). The primary aim of market segmentation, for example, is to identify relevant groups of consumers that can be addressed efficiently by marketing or advertising campaigns. But, as Müllensiefen et al. (2018) have observed, the issue whether consumer groups can be identified from background variables, that are not brand-related, and particularly how much personality vs socio-demographic variables contribute to the identification of consumer clusters is very important to address. Indeed, “the term ‘psychographics’ describes the collection of data on variables that reflect consumer personality, attitudes, personal values, lifestyle and other psychological constructs in order to identify and describe subpopulations of consumers and ultimately to inform marketing processes” (Müllensiefen et al., 2018). Moreover, the authors underline how “personality information can be gathered indirectly from online data such as Facebook likes, Twitter profiles, musical preferences or spending behaviour. Thus, unlike other psychographics measures, personality has become a layer of information that can be obtained from various sources, especially online information, and that has proven to be useful for predicting a broad variety of outcome measures, such as substance use, political attitudes, or purchase satisfaction”.

In the last two decades, reasoning with and learning of preferences has received increased attention in the machine learning (Fürnkranz and Hüllermeier, 2011) and artificial intelligence (Rossi et al., 2011) literature.

Consequently, many models have been proposed able to cope with ranking data, when respondent-specific variables are available, and several of them focused specifically on decision tree models. Some of the best known include the distance-based tree models proposed by Lee and Yu (2010), decision trees with ad-hoc impurity functions suggested by Yu et al. (2010) and multivariate trees for rankings based on distance measures of D’Ambrosio (2008).

2.2 Rankings

Formally, a ranking of m items, labeled \((1,\dots , m)\), is a mapping a from the set of items \(\{1,\dots , m\}\) to the set of ranks \(\{1,\dots , m\}\). When all items are ranked in m distinct ranks, a complete ranking or linear ordering (Cook et al., 1986) is observed. A ranking a is, therefore, one of the m! possible permutations of m elements, representing the preference ordering of one particular judge for the items involved. But if a judge fails to distinguish between two or more objects and he/she assigns equal importance to them, a tied ranking or weak ordering is obtained. Moreover, rankings may not be complete: partial rankings occur when only a specific subset of q < m objects are ranked by judges, while incomplete rankings occur when judges freely rank different subsets of m objects (Cook et al., 1986). It is worth noting that different types of orderings will generate different sample spaces of rankings. With m items there are m! possible complete rankings, but this number increases when ties are allowed (for the universe cardinality in the presence of ties, we refer to Good (1980) and Marcus (2013). The size of the universe of all possible rankings with ties, Sm, is equal to the following quantity:

$$ S^{m}=\sum\limits_{r=0}^{m}r!{m\brace r}, $$

where {mr} states the Stirling number of the second kind, indicating the number of all possible ways to partition a set of m objects into r non-empty subsets, i.e. \({m\brace r} = \frac {1}{r!}{\sum }_{i=0}^{r}(-1)^{i} \binom {r}{i} (r-i)^{m} \) (D’Ambrosio et al., 2017).

Zhou et al. (2014) propose a taxonomy of label ranking algorithms, where label ranking studies a mapping from instances to rankings over a finite number of predefined labels, in other words ranking data. They mention four subtypes of label ranking: (a) Multiclass classification, (b) Multilabel classification, (c) Label ranking and (d) Multilabel ranking. Case a is when each instance is associated with a single label: this is equivalent to say that in a ranking we are interested only to the first (top-1) position. Case b is when we are interested only in a subset of the labels, that share the same importance: that is to say that we are interested only to the top-k positions, equally ranked. Case c corresponds to a linear ranking — all items are ranked and without ties. Case d, finally, is a typical top-k or what we have called partial ranking (Hall and Schimek, 2012; Sampath & Verducci, 2013; Svendova & Schimek, 2017). All these types can be handled in the framework of this paper. Further details on Label Ranking can be found in Vembu and Gärtner (2010).

2.3 Distances Between Rankings

A natural desiderata is to group subjects with similar preferences together. To this end, it is necessary to measure the spread between rankings through dissimilarity or distance measures, where a distance d between two rankings is a non-negative value, ranging from 0 to \(D_{\max \limits }\), where 0 is the distance between a ranking and itself. Many different distance functions between rankings have been proposed in the literature (Marcus, 2013), often defined as functions of the m × m score matrix (aij) (defined for the generic ranking a), whose elements are defined as:

$$ a_{ij}=\left\{ \begin{array}{rl} 1 & \text{if } i \text{ is preferred to } j,\\ 0 & \text{if } i \text{ is tied to } j \text{ or } i=j,\\ -1 & \text{if } j \text{ is preferred to } i. \end{array} \right. $$

Kemeny & Snell (1962) introduced a metric defined on linear and weak orders, known as Kemeny distance (or metric), which satisfies the constraints of a distance measure suitable for rankings. Cook et al. (1986) later generalized Kemeny distances to the framework of partial orders.

In this work, we consider the possibility of ties; therefore, we assume that the geometrical space of preference rankings is the generalized permutation polytope (Heiser and D’Ambrosio, 2013; D’Ambrosio et al., 2017), for which the natural distance measure is the Kemeny distance. It is defined as follows:

$$ d_{K}(a,b)=\frac{1}{2}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{m}|a_{ij}-b_{ij}|, $$
(1)

where aij and bij are the generic elements of the m × m score matrices associated with a and b

Each distance between orderings is in a one to one correspondence to a proper correlation coefficient, through the linear transformation \(c = 1 - \frac {2 d}{D_{\max \limits }}\)

(Emond & Mason, 2002). The rank correlation coefficient τx corresponding to the Kemeny-Snell distance is defined as (Emond & Mason, 2002)

$$ \tau_{x}(a,b)=\frac{{\sum}_{i=1}^{m}{\sum}_{j=1}^{m}a^{\prime}_{ij}b^{\prime}_{ij}}{m(m-1)}, $$
(2)

where \(a^{\prime }_{ij}\) and \(b^{\prime }_{ij}\) are the generic elements of the m × m score matrices associated with a and b respectively, assuming now that the value is equal to 1 if the item i is preferred to or tied with the item j, − 1 if item j is preferred over i, and 0 if i = j.

3 Decision Trees for Rankings

As discussed above, preference rankings can be considered as indicators of individual behaviours of a set of individuals, the judges. In the case of preference data, we also have characteristics of the individuals available, so that an important task is the identification of profiles of respondents (or judges) which exhibit a similar behaviour, i.e. which show similar preference rankings.

3.1 Decision Trees

Decision tree learning is a very popular data analysis technique, which has been independently discovered in statistics and machine learning (Murthy, 1998). The CART algorithm (Breiman et al., 1984), which we already briefly introduced in Section 1, starts from the root node that contains the whole learning sample, and uses a recursive partitioning algorithm to create a tree where each node t represents one set of the partition.

At each node, data are divided by choosing a split, among all possible splits, so that the resulting child nodes are the “purest”, where pureness is meant in terms of homogeneity (with respect to the response) of observations in the same node. The homogeneity of child nodes is measured by a so-called impurity functioni(t), a generic function satisfying the following three properties: (a) it is minimum when the node is pure, i.e. when all data points associated with this node have the same response; (b) it is maximum when the node is the most impure; (c) its value does not change if items are renamed. Specifically, a splitting criterion is based on the reduction in impurity resulting from the split s of node t, with the best split chosen as the one maximizing the impurity reduction

$$ {{\varDelta}} i(s,t)=i(t)-p_{L}i(t_{L})-p_{R}i(t_{R}), $$

where pL and pR are given by the proportions of units in node t assigned to the left child node tL and to the right child node tR, respectively, at the s th split.

3.2 Impurity Functions for Ranking Data

The key issue in adapting conventional univariate classification trees to the multivariate case is the generalization of the definition of the impurity function. In order to avoid the problems caused by the multivariate role of the ranking vector, following Sciandra et al. (2015) and Plaia and Sciandra (2019), the set of preferences are hereby valued as a unique multivariate structure. More precisely, the ranking process of m items can be seen as a permutation function from \(\{1,\dots ,m\}\) into \(\{1,\dots ,m\}\). Assigning a label to each permutation (with or without ties) obtained by a set of distinct items, the ranking vector can be uniquely identified by its label that can play the role of a response variable. This strategy enables us to work with the classical univariate CART methodology.

For a better understanding of our strategy, consider the following illustrative example: for ranking 4 items without ties, there are overall 4! permutations. The first permutation could represent the ranking 〈1,2,3,4〉, the second another ranking such as 〈2,1,3,4〉, the third one 〈1,2,4,3〉 and so forth. If we assign to each ranking the corresponding label “first”, “second”, “third” etc., the multivariate problem of rankings reduces to a univariate one, where the labels “first”, “second”, “third”, \(\dots \) represent the response variable, assuming that each one is in a one to one correspondence with a permutation.

However, the impurity function should represent the fact that some rankings are more similar to each other than others. For example, the ranking 〈1,2,3,4〉 is quite similar to the ranking 〈1,2,4,3〉 (only the order of the last two items in the ranking is reversed), but quite different from 〈4,3,2,1〉 (where all preference pairs are inverted). Thus, when the impurity is evaluated on ranking data, the impurity function i(t) can be suitably modified as:

$$ i(t)=\underset{a, b \in t: a\ne b}{\sum} d (a,b), $$
(3)

where d is a distance measure between orderings a and b, the sum being extended over all the pairs of orderings in node t. A decrease in node impurity at each step will be evaluated according to all covariates and respective split points, and the best among them is chosen. Piccarreta (2010) suggests an impurity function based on dissimilarities, without proposing specific distances for preference data. In order for the impurity function to take into account the ordinal nature of rankings, following Sciandra et al. (2015), we propose to use the Kemeny distance dK(a,b) as defined in (1). Therefore, the base classifier Cf(.) we will use as the main ingredient of our bagging and boosting procedure is the tree-based method for ranking data introduced by Sciandra et al. (2015) and Plaia and Sciandra (2019).

3.3 Rank Aggregation

Once the tree has been grown, we need to assign a class label or a class ranking to each leaf node. To this end, we need to compute a so-called consensus ranking, i.e. the ranking that mostly agrees with all the rankings in the node (D’Ambrosio et al., 2015a). Among the pool of several consensus ranking measures proposed in the literature, the median ranking will be used in this paper. The median ranking in a node is defined as the ranking that has the minimum sum of the distances to all rankings in the node. Equivalently, it is the ranking that maximizes the average rank correlation coefficient τx (2) between itself and the other rankings in the node (D’Ambrosio & Heiser, 2016). In other words, we look for the candidate, within the universe of the permutations with replacement of m elements Sm, that maximizes the rank correlation coefficient (2) or, conversely, minimizes the average Kemeny distance (1). It has been shown, as stated by Zhou and Qiu (2018), that the Kemeny optimal aggregation is the best compromise ranking.

A key problem with this approach is that the naïve computation of the median ranking is computationally prohibitive, as it essentially involves the evaluation of (in the worst case) m! different candidate rankings. Thus, for an efficient computation of the median ranking, we use a branch and bound algorithm proposed by Amodio et al. (2016), implemented in the R package ConsRank (D’Ambrosio et al., 2015b). Emond and Mason (2002) suggested the initial branch and bound algorithm to find the solution in a consensus ranking problem. Concerning the Kemeny approach, D’Ambrosio et al. (2015a), D’Ambrosio et al. (2015b) and Amodio et al. (2016) proposed two accurate algorithms (QUICK and FAST) for identifying the median ranking when treating with linear, weak and partial rankings. The authors showed that the QUICK and FAST algorithms are equivalent to Emond and Mason’s branch and bound, but ensure an impressive time-saving from a computational point of view, by starting with a “good” initial solution. The solution they come up with is optimal but might not find all the solutions in case of multiple solutions.

4 Ensemble Methods for Ranking Data

“The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models”. (Hastie et al., 2009, ch. 16). It is well known that, by aggregating many decision trees for a categorical or continuous outcome, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved (James et al., 2013). This is true with ranking data as well, as demonstrated by Dery and Shmueli (2020) who report (Figs. 3 and 4) the average improvements of ensemble methods over a single decision tree. Comparable enhancements are shown in our paper and will be illustrated more deeply in Section 5.

In this paper, we form ensembles using ranking trees as building blocks, in order to construct more powerful prediction models than a single tree. Ensemble learning can be broken down into two tasks: developing a population of base learners (in our case decision trees) from the training data, and combining them to form the composite predictor. We will first discuss the first point, before we show in Section 4.3 the way we aggregated the trees. We briefly outline the main difference between bagging and boosting, the ensemble methods we are going to work with. Bagging (Section 4.1) learns decision trees for many datasets of the same size, randomly drawn with replacement from the training set. Thereafter, a proper predicted ranking is assigned to each unit. Boosting on the other hand (Section 4.2) combines classifiers, iteratively created from weighted versions of the learning sample, with weights adaptively modified, iteration by iteration, so that previously misclassified rankings have a higher probability of being selected in subsequent iterations. The final predicted rankings are computed by weighted combination of the intermediate rankings of the iterative process.

4.1 Bagging for Ranking Data

The simplest implementation of the idea of generating quasi-replicate training sets is bagging (Breiman, 1996), which is briefly summarized in Algorithm 1.

figure a

In a few words, several training datasets Tb of the same size are sampled at random, with replacement, from the training set. Each Tb is used to train a decision tree Cb(.), leading to multiple predicted rankings Cb(xi) for each data point xi. A final ranking prediction \(\hat {y}_{i} = C_{f}(x_{i})\) is assigned to each unit xi through the consensus ranking process briefly described in Section 3.3 and detailed in Section 4.3. Combining multiple classifiers decreases the expected error by reducing the variance. More classifiers in the ensemble typically lead to a greater reduction of the error (at least on the training set). The procedure allows to determine which covariates are important and discriminant: the covariates’ importance can be estimated by averaging over their importance in each of the b trees.

4.2 Boosting for Rankings

Boosting also repeatedly uses a base learning algorithm on differently weighted versions of the training data, leading to a sequence of classifiers that are finally combined. However, unlike bagging, which always gives identical weights to all data points, boosting creates weighted versions of the learning sample. The example weights are adaptively adjusted at each iteration so that previously misclassified items are sampled with a higher probability. The final predictions are obtained by weighting the results of the iteratively produced predictors based on their observed error rate. There are many versions of boosting algorithms, but arguably the most frequently used is AdaBoost.M1 (Freund & Schapire, 1996), which we will adapt to ranking data in the following.

figure b

The AdaBoost.R algorithm (Algorithm 2) is described as follows. A weight w1(i) = 1/n is initially assigned to each observation in the training set T of size n (step1). The weights wb can be interpreted as probabilities for the corresponding example being included into the training set Tb for the b th iteration of the algorithm, in which a base classifier Cb(.) is trained on Tb. The classifiers Cb(.) are subsequently applied to each example in the training sample leading to a predicted ranking for each item \(\tilde {y}_{i}^{b} = C_{b}(x_{i})\) (steps 3 and 4). The ranking error eb of the ranking tree Cb(.) is estimated based on the distance between each predicted ranking \(\tilde {y}_{i}^{b}\) and its real value yi (step 5)

$$ e_{b}=\sum\limits_{i=1}^{n}w_{b}(i)\left[1-\frac{\tau_{x}(\tilde{y}_{i}^{b},y_{i})+1}{2}\right]. $$
(4)

An observation with a high value of \( \left (1-\frac {\tau _{x}(\tilde {y}_{i}^{b},y_{i})+1}{2}\right )\), i.e. a low value of \(\tau _{x}(\tilde {y}_{i}^{b},y_{i}) \), is poorly predicted.

Based on the errors eb, a factor αb is computed for updating the weights wb(i) (step 6). Following Freund and Schapire (1998),

$$ \alpha_{b}=\ln((1-e_{b})/e_{b}). $$
(5)

The weights wb(i) are updated after each iteration so that, at step b + 1, the observations xi that were misclassified by the classifier Cb(xi) have their weights increased, whereas the weights are decreased for those that were classified correctly. The new weight for the (b + 1)th iteration (step 7) is

$$ w_{b+1}(i)=w_{b}(i)\exp \alpha_{b}\tau_{x}(i), $$
(6)

normalized to sum to one.

This procedure ensures that the bigger the distance between the ranking associated with an observation and the original ranking, the higher is the probability that this observation is resampled in the new iteration (Alfaro et al., 2013). Thus, the sequence of trees tries to focus more on examples that are hard to predict. The α value can be interpreted as a local iteration-specific learning rate, calculated as a function of the error made in each iteration. Moreover, this value is also used in the final decision rule, giving more importance to the trees that made a lower error.

The iterative procedure continues until the stopping criterion (i.e. αb > 0.5) fires, or the maximum number of trees is reached. It is important to consider that fitting the training data too well can lead to overfitting, which increases the risk on future predictions (Hastie et al., 2009). We will return on this point in Section 5.

The importance of individual covariates can be estimated as a weighted average of their importance in each of the B trees, with weights αb given by (5).

4.3 Rank Aggregation and Test Error Measurement

For making a final prediction, both, bagging and boosting, use rank aggregation to combine the predictions of the individual ranking trees. Recall that in each iteration b, we have obtained a tree Cb(.), which predicts a ranking \(\tilde {y}_{i}^{b}\) for a given example xi.

In our experiments, we will evaluate our ensembles after each iteration, i.e. we have to compute an aggregated ranking \(\hat {y}_{ib}\) and an aggregated error err(b) after each iteration b. The aggregated ranking for a generic i th observation at the b th iteration, \(\hat {y}_{ib}\), will then be obtained as

$$ \hat{y}_{ib}=\arg\underset{y_{i} \in S^{m}} \max \sum\limits_{k=1}^{b}\alpha_{k}\tau_{x}(\tilde{y}_{i}^{k},y_{i}), \quad \quad \text{with }b=1,2,\dots,B, $$
(7)

where αk, the weight related to the k th tree, is computed as in (5) for boosting, and αk = 1 in the case of bagging. The process is illustrated in Table 1, where the i th column shows the predictions, weights, and errors of the i th tree Ci(.). The values in the last column correspond to the final lines of Algorithms 1 and 2.

Table 1 Predictor matrix

Once each unit has been assigned a final ranking tree by tree, the error assigned to each sequence of trees \(\{1,\dots , b\}\) is computed as

$$ \text{err}(b)=1-\frac{\tau_{x}(b)+1}{2},\quad \quad \text{with }b=1, \dots, B, $$
(8)

where \(\tau _{x}(b)=\frac 1n{\sum }_{i=1}^{n} \tau _{x}(\hat {y}_{ib},y_{i})\) is the average of τx of the b th tree over all the units in an example set T.

Before looking at the experimental results, we want to underline how and why our AdaBoost.R differs from Dery & Shmueli’s (2020) BoostLR. First, the base classifier we use is the one introduced by Sciandra et al. (2015) and Plaia and Sciandra (2019), whereas Dery and Shmueli (2020) use LRT proposed by Hüllermeier et al. (2008). BoostLR uses a transformation of Kendall’s τb coefficient to compute the “loss” (i.e. the factor multiplied by wb(i) in our (4)) for each training example: actually, the true transformation from a “correlation” to a distance is the one we use in our (4). Moreover, Emond and Mason (2002) have criticized τb because of its problematic way of handling ties — the all ties ranking has an undefined distance to any other ranking — and instead propose a refined version, τx (2), which fixes the problem assuming a different score matrix, with \(a^{\prime }_{ij}\) equal to 1 (and not to 0) if the item i is tied with the item j.

Finally, Dery and Shmueli (2020) aggregate the weak models’ outputs using weighted Borda, while, coherently with the previous steps in our algorithm, we use again a function of τx, now weighted with αb, as shown in (7). Critiques to Borda’s method can be found in Amodio et al. (2016).

5 Experimental Results

In order to evaluate the performance of the ensemble methods described in Section 4 we performed a simulation study (Section 5.1) and applied the method to three real datasets (Section 5.2).

5.1 Simulation Experiments

Following D’Ambrosio and Heiser (2016), we considered a predictor space (X1, X2), were X1 and X2 were generated according to continuous uniform distributions, with \(X_{1}\sim U(0,10)\) and \(X_{2}\sim U(0,6)\), which was partitioned into five regions as shown in Fig. 1. The number of datapoints (of 4 items) within each sub-partition was determined by: i) randomly drawing from a normal distribution N(10,2), ii) dividing them by their summation, iii) multiplying by the true sample size (n = 200,n = 500).

Fig. 1
figure 1

Theoretical partition of the predictor space (X1, X2) with 4 items

The rankings (datapoints) within each sub-partition were generated from a Mallows Model (Mallows, 1957), which is one of the first probability models proposed for rankings and frequently used in both theoretical and applied studies. It is an exponential model defined by a central permutation σ0 and a dispersion parameter 𝜃. When 𝜃 > 0, σ0 represents the mode of the distribution, i.e. the permutation that has the highest probability. The σ0 values for our simulation studies are shown in Fig. 1. The probability of any other ranking decays exponentially with increasing distance to the central permutation. The dispersion parameter controls the steepness of this decline. Assuming that σ is a generic ranking, the probability for this ranking is given by

$$ \Pr(\sigma)=\frac{\exp(-\theta d(\sigma,\sigma_{0}^{-1}))}{\psi(\theta)}, $$

where d is a ranking distance measure and ψ(𝜃) is a normalization constant. In our simulation, we generated rankings assuming the Kemeny distance and varying the dispersion parameter 𝜃, according to three different level of noise (low with 𝜃 = 2, medium with 𝜃 = 0.7 and high with 𝜃 = 0.4). Considering two levels for the sample size (n = 200,n = 500) the experimental design, hence, counts 3 × 2 = 6 different scenarios. Figure 2 shows one of the six datasets considered in our experiment (corresponding to 𝜃 = 2 and n = 500).

Fig. 2
figure 2

Empirical partition of the predictor space, generating high homogeneous groups of rankings (𝜃 = 2), with n = 500. A specific colour is assigned to the generated rankings. For instance, yellow is the colour associated with the modal ranking ADCB of the bottom-left sub-partition but, because of the heterogeneity among rankings, other colours, hence other rankings, are present, too

We applied the two ensemble methods defined in Section 4 to all six scenarios, fixing the number of trees B to 100, the depth of each tree to 4 (see Fig. 4 for different depths) and considering a training sample T of 2/3 of the observations. R code (R Core Team, 2020) was used for both the simulations and the application to real data, by opportunely modifying available functions in the R packages Consrank (D’Ambrosio et al., 2015b) and adabag (Alfaro et al., 2013).

Figure 3 compares boosting and bagging performances in all the simulated scenarios, plotting the error err(b) (8) vs. the number of trees b, both in the training and in the test set. The starting point of each graph is the error corresponding to the first tree (x-axis), therefore the plot shows the improvement by the bagging/boosting algorithms computed both on the training (2/3 of the observations) and on the test (1/3 of the observations) samples.

Fig. 3
figure 3

Bagging and boosting for all the simulated scenarios with 100 trees: different levels of homogeneity among the rankings, 𝜃 = (0.4,0.7,2), and two sample sizes, n = (200,500)

Looking at the errors produced by boosting, the methodology is able to perform very well when there is a high level of heterogeneity among the rankings (𝜃 = 0.4). Expectedly the performances for n = 200 are worse than for n = 500. The training error and the test error for n = 500 seem to converge after approximately 20 trees. In particular, it was enough stopping the procedure at B = 100 trees, since the errors showed to be quite stable. Stopping at B = 100 trees we also avoid overfitting, that, as known, can affect boosting: looking at the top-right plot in Fig. 3, for example, we can observe that the boosting training error (green line) keeps going down, while the corresponding test error (blue line) starts to increase.

Finally, we also see that bagging always performs worse than boosting. It is important to highlight that, even if a depth-4 tree could seem too small for a Bagging procedure, due to the cardinality of the universe of all possible rankings with ties, even a small tree (depth 4, i.e. 5 leaves) causes a non-negligible variability in the predicted rankings \(\tilde {y}_{i}^{b} \) along with the trees. Just as an example, considering 𝜃 = 0.7 and n = 200 (i.e. left-middle plot in figure 3, which considers only depth 4 trees), the τx computed on the B = 100 predicted values for each of the n = 200 units

$$ \tau_{x}(i)=\frac1B\sum\limits_{b=1}^{B} \tau_{x}(\tilde{y}_{ib},y_{i}) $$

ranges from 0.55 to 0.86. This means that even using small tree produces high variability.

In order to verify if the procedure is sensitive to the number of splits in each tree, i.e. to its depth, we consider boosting with two different tree depths (2 and 4 respectively). The results are shown in Fig. 4. It becomes clearer, going from 𝜃 = 0.4 to 𝜃 = 2, that the mean error computed using depth 4 (both in the training and in test sets) tends to be lower than that with depth 2. These results confirm the ones by Zhou and Mentch (2021), who found that depending on the signal-to-noise ratio of the underlying data, having smaller trees can be more beneficial to the ensemble predictions.

Fig. 4
figure 4

Boosting built up for each simulated dataset, using two levels of splitting for the trees (2 and 4)

5.2 Real Data Applications

The first dataset is a multi-class data set (Fig. 5) from the UCI repositoryFootnote 1. The dataset “vehicle” originally contains 946 units, 18 explanatory variables and four items useful for classification tasks. As the original dataset is only for classification, we followed the procedure proposed by Hüllermeier et al. (2008) to turn it into a ranking dataset: first a Naïve Bayes classifier is trained to estimate the probabilities for each class label, and then this predictd ranking is used as the target ranking for the ranking prediction task. We randomly split the data in 2/3 training and 1/3 test instances and learned trees with a maximum depth of 4.

Figure 5 compares the outcomes obtained applying boosting and bagging to the data for up to 350 trees. Again, boosting shows a better performance, while the error becomes stable after about 100 trees.

Fig. 5
figure 5

Boosting and bagging applied to Vehicle dataset

We also consider two more real datasets, which have also been used in previous works (de Sá et al., 2018; Werbin-Ofir et al., 2019; Dery and Shmueli, 2020). The datasets GermanElections2005 and GermanElections2009 contain socio-economic information from regions of Germany and their respective electoral results. The 413 records correspond to the administrative districts of Germany, which are described by 39 covariates, such as age and education of the population, economic indicators (e.g. GDP growth, percentage of unemployment), indicators of the labour workforce in different sectors such as production, public service, etc. In terms of the outcome, de Sá et al. (2018) transformed the election results of the five major political parties for the federal elections in 2005 and 2009 into rankings of five items: CDU (conservative), SPD (centre-left), FDP (liberal), Green (centre-left) and Left (left-wing). Because of the better performance of boosting over bagging, observed both in the simulation study and in the previous example, we consider only boosting, limiting the number of trees B to 100 with a maximum depth of 4.

Figure 6 shows how the error varies — starting from the first tree — in both the datasets: it is interesting to observe that in 1 case, after about 15 iterations, the test error increases, i.e. on the 2005 data we seem to have a strong problem with overfitting.

Fig. 6
figure 6

Boosting applied to German Elections datasets: err(b)

We also measured \(\tau _{x}(b)=\frac 1n{\sum }_{i=1}^{n} \tau _{x}(\hat {y}_{ib},y_{i})\) for all the trees: Fig. 7 shows the average of τx of the b th tree over all the units in an example set T (both training or test): as expected, τx(b) performance is specular to err(b)’s one.

Fig. 7
figure 7

Boosting applied to German Elections datasets: average τx

As explained in Section 4.2, the procedure allows to determine which covariates are important and discriminant: covariates’ importance can be computed averaging over their importance resulting in each of the B trees, with weights α given by (5). Figure 8 shows the variable relative importance at some steps in the procedure (i.e. after 1, 10, 50, 75 and 100 trees). It is interesting to notice how the relative importance of each variable can change along the iterations: for example the importance of the variable Unemploy, the most important covariate at the beginning, has substantially decreased at the end of the process (i.e. at 100 trees).

Fig. 8
figure 8

Variable Importance at intermediate and final step for 2009 German Elections dataset

6 Conclusion

In this paper, we adapted two ensemble methodologies, boosting and bagging, to ranking data to improve the prediction performance of single decision trees. Both algorithms combine the predictions of multiple decision trees, in particular the distance-based trees for ranking data proposed by Plaia and Sciandra (2019). These trees consider the Kemeny distance (Kemeny & Snell, 1962) as a measure of impurity in the splitting process, and its related rank correlation coefficient τx (proposed by Emond & Mason (2002)) for identifying the median ranking in the final nodes. As stated by Zhou and Qiu (2018), rank aggregation can be obtained by optimizing different rank distance measures, but it has been shown that the Kemeny optimal aggregation is the best compromise ranking. However, finding a Kemeny optimal aggregation is NP-hard even with four items, therefore, we use two algorithms QUICK and FAST (D’Ambrosio et al., 2015a; Amodio et al., 2016) for quickly approximating the median ranking. The same rank correlation coefficient τx is used to compute the final aggregated ranking over the predictions of every single tree. τx also represents the main ingredient both in computing the error err(b) associated with each tree and the factor αb (only in the boosting algorithm) to update the weights of each item in the bootstrap procedure.

We implemented both algorithms in R, and applied them to simulated data and real cases. The simulation study shows that boosting can perform very well when there is a high level of heterogeneity (𝜃 = 0.4) among the rankings, and its performance improves with the sample size. Moreover, depth-4 boosting outperforms depth-2 boosting with a high level of heterogeneity (𝜃 = 0.4), while they behave similarly with a low level of heterogeneity (𝜃 = 2). Bagging performs worse than boosting in all the simulated scenarios, and its error decreases when the heterogeneity decreases, independently on the sample size n. The results on real datasets were in line with the results of the simulation study. Both algorithms also allow to determine which covariates are significant and discriminant, by averaging over their importance in each of the b trees, with weights αb in case of boosting.

An interesting research direction would be to improve the scalability of the algorithms with respect to the number of items. And in this direction, the introduction of weights, both to give different importance to some positions—say the top-positions—or to items—some items could be more relevant than others—could help (García-Lapresta & Pérez-Román, 2010; Kumar & Vassilvitskii, 2010; Can, 2014; Plaia et al., 2021): both these types of weights could help reduce the size of the universe of rankings to explore to find the consensus, the most computer intensive step of the proposed procedures. Indeed, the distance-based trees for ranking data proposed by Plaia and Sciandra (2019) already implement position weights, even if not considered in the algorithms presented in this paper, and we are now studying how item weights can be considered. Both solutions could be then considered as base decision trees within the bagging and boosting algorithms proposed in this paper.