
programming - How to implement Kemeny-Young method? (rank aggregation problem)


Prehistory


I am trying to carry out a statistical analysis of some experimental data that arises from measurements made on an ordinal scale.


I am faced with the rank aggregation problem: obtaining from many "individual" orderings (of the same set of objects) one "collective" ordering.


The most "natural" approach to this problem is the Kemeny-Young method (see the primary source for a better description).


Surprisingly, I found out that there is no ready-made program for this method! (There is one C++ code, but it does not allow weak orderings, i.e., does not allow cases where several objects share the same position in an ordering.)


Previously I asked about some specific points (one, two) needed for constructing the code, but now I have decided to describe everything at once, because the problem is more complicated than I thought at first, and I may miss some points since I am a complete novice in Mathematica and in programming.


Description


Let $R_1, R_2, R_3, \ldots, R_m$ denote the $m$ "individual" weak orderings of the $N$ given objects $a, b, c, \ldots$ (note that $m$, the number of orderings, is unrelated to $N$, the number of objects).


Let us consider $R$ as a set of ordered pairs of objects. Then $(a, b)\in R$ has the usual interpretation: "$a$ is ordered at least as good as $b$ in $R$". If $(b, a)$ is also in $R$, we say that "both are ordered equally good", whereas if $(b, a)$ is not in $R$, we say that "$a$ is ordered above $b$" or "$b$ is ordered below $a$".



(After introducing the metric we will see that pairs such as $(a, a), (b, b), (c, c), \ldots$ can be neglected, since they make no further difference.)



To demonstrate the notation, here are all possible orderings for $N=3$ objects:


$\ R_1=\{(a, b), (a, c), (b, c)\}$


$\ R_2=\{(a, c), (a, b), (c, b)\}$


$\ R_3=\{(a, b), (a, c), (b, c), (c, b)\}$


$\ R_4=\{(b, a), (b, c), (a, c)\}$


$\ R_5=\{(b, c), (b, a), (c, a)\}$


$\ R_6=\{(b, a), (b, c), (a, c), (c, a)\}$


$\ R_7=\{(c, a), (c, b), (a, b)\}$



$\ R_8=\{(c, b), (c, a), (b, a)\}$


$\ R_9=\{(c, a), (c, b), (a, b), (b, a)\}$


$R_{10}=\{(a, b), (b, a), (a, c), (b, c)\}$


$R_{11}=\{(a, c), (c, a), (a, b), (c, b)\}$


$R_{12}=\{(b, c), (c, b), (b, a), (c, a)\}$


$R_{13}=\{(a, b), (b, a), (a, c), (c, a), (b, c), (c, b)\}$
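
(As an aside: if it helps to see these pair sets built programmatically, here is a minimal Mathematica sketch. The helper name pairsFromTiers is mine; it assumes a weak ordering is given as a list of tiers from best to worst, with tied objects sharing a tier, and it returns a list of ordered pairs.)

pairsFromTiers[tiers_List] :=
 Module[{rank},
  (* remember the tier index of every object *)
  MapIndexed[(rank[#1] = First[#2]) &, tiers, {2}];
  (* keep every ordered pair (x, y) with x != y and x ranked at least as high as y *)
  Select[Tuples[Flatten[tiers], 2],
   #[[1]] =!= #[[2]] && rank[#[[1]]] <= rank[#[[2]]] &]]

pairsFromTiers[{{a}, {b, c}}]
(* {{a, b}, {a, c}, {b, c}, {c, b}}, i.e. the set written as R_3 above *)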



With this notation we may introduce a metric $\delta$ (the so-called Kemeny distance) between any two orderings $R_1$ and $R_2$ in the following way: $$\delta(R_1,R_2)=|R_1\setminus R_2|+|R_2\setminus R_1|$$ where $\setminus$ denotes set-theoretic difference and $|\cdot|$ denotes the cardinality of a set (i.e., the number of elements, since the sets are finite).
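
For example, for $R_1$ and $R_2$ from the list above, $R_1\setminus R_2=\{(b, c)\}$ and $R_2\setminus R_1=\{(c, b)\}$, so $\delta(R_1,R_2)=1+1=2$.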


The required "collective" ordering $\Omega$ (the so-called Kemeny mean) is the ordering of the given objects that minimizes the sum of the squares of the Kemeny distances to all "individual" orderings, i.e.: $$\Omega=\arg\min_{R}\ \sum_{i=1}^{m} \delta(R_i,R)^2$$


In fact, $\Omega$ may not be unique: there may be several $\Omega_1, \Omega_2, \ldots$ for which the corresponding sums of squared Kemeny distances are equal and minimal. So in the general case $\Omega$ is the set of all such $\Omega_i$.



How to attack


So we have several $R_i$ as input, and the goal is to output $\Omega$.


It looks like this calls for brute force: generate the possible orderings one by one, calculate the sum of the squared Kemeny distances to all input orderings, and check for the minimum.


Even for small $N$ the number of possible orderings is huge (in my case $N=10$, and there are over $10^8$ possible weak orderings), so we should keep as little of this data in memory as possible.
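
(For reference: the number of weak orderings of $N$ objects is the $N$-th Fubini number, also called the ordered Bell number. A short memoized recurrence in Mathematica, with my own name fubini for it, reproduces the figures above.)

fubini[0] = 1;
fubini[n_Integer?Positive] := fubini[n] =
  Sum[Binomial[n, k] fubini[n - k], {k, 1, n}];

fubini /@ {3, 10}
(* {13, 102247563} *)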



Answer



[Warning: there could be issues with the averaging procedure. At the very least, pay attention to allPossibleOrderings; its definition probably needs to be improved. See the two *edit* notes in the text below.]


I'm probably doing something nasty here but if RAM is really not a problem for you, then… here you go, a blunt brute-force approach:


(* Kemeny distance: |R1 \ R2| + |R2 \ R1| for orderings given as lists of pairs *)
KemenyDistance =
  Plus @@
    Composition[Length, Complement] @@@
      Permutations[{##}, {2}] &;

FindKemenyMean[setOfOrderings:{__List}, allObjects:_List:{}] :=
 Block[{ weakAndNonWeakOrderings, allPossibleOrderings
       , kemenySumOfSquares
       , minFound, data, \[CapitalOmega]candidate },

  (* all candidate orderings, weak ones included, of the given objects *)
  weakAndNonWeakOrderings@objects_List :=
   With[{orderingFromOrderedList = Partition[#, 2, 1]&},
    Block[{nonWeakOrderingsOnly, addWeakOrderings},

     nonWeakOrderingsOnly =
      orderingFromOrderedList /@ Permutations[#, {Length@#}]&;

     addWeakOrderings@nonWeakOrdering_ :=
      Join[#, nonWeakOrdering]& /@
       (Permutations[Reverse/@#, Length@#]&@nonWeakOrdering);

     Flatten[addWeakOrderings /@ nonWeakOrderingsOnly@objects, 1]]];

  (* if no object list is given, collect the objects from the orderings themselves *)
  allPossibleOrderings =
   weakAndNonWeakOrderings @
    If[allObjects === {}
     , Composition[DeleteDuplicates, Flatten]@setOfOrderings
     , allObjects];

  kemenySumOfSquares[\[CapitalOmega]_, orderingsToAverage_List] :=
   Plus @@ (KemenyDistance[#, \[CapitalOmega]]^2 & /@ orderingsToAverage);

  (* scan all candidates, keeping the one with the smallest sum of squared distances *)
  Fold[
   With[
    { currentMin = First@Cases[#1, minFound@n_ :> n, {1}, 1]
    , currentOrdering = #2 }
    , With[
      { currentSum =
        kemenySumOfSquares[currentOrdering, setOfOrderings] }
      , If[currentSum < currentMin
        , data[minFound@currentSum, \[CapitalOmega]candidate@currentOrdering]
        , #1]]] &
   , data[minFound[\[Infinity]], \[CapitalOmega]candidate[]]
   , allPossibleOrderings]]


These two definitions give you KemenyDistance and FindKemenyMean precisely in the form you described. FindKemenyMean fetches the initial objects from the orderings list, which will always work if your orderings are linear (i.e., complete).


In[1]:= FindKemenyMean[
         { {{a, b}, {b, c}, {c, b}, {c, d}},
           {{b, d}, {d, c}, {c, a}},
           {{c, a}, {a, d}, {d, b}} }]

Out[1]= data[minFound[45], \[CapitalOmega]candidate[{{b, c}, {c, a}, {a, d}}]]

You can provide the list of all objects independently as the second argument, but I don't guarantee anything for incomplete orderings (I just don't have enough mathematical intuition for that).
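
For instance, a call with an explicit object list would look like this (the object names are placeholders, and I haven't checked how sensible the result is for such incomplete input):

FindKemenyMean[
  { {{a, b}, {b, c}},
    {{c, a}} },
  {a, b, c, d}]  (* d never appears in the input orderings *)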


Notes:



1) My calculation of all orderings, with weak ones included, for $N = 3$ evaluates to a list that is somewhat longer than the one you provided. I probably generate equivalent (hence redundant) weak orderings! [edit: Or maybe I should have converted them to some canonical form.]


2) The $R_i$ data structure for orderings is obviously redundant, hence definitely inefficient. [edit: not so if you don't presume that the relation is transitive, or if one must get to some kind of "complete form" of all $R_i$'s in order to calculate the Kemeny mean w.r.t. them] I don't understand a thing in statistics, but I suggest describing orderings in a more concise manner; Mma gives you lots of possibilities to do that. For example, you could use weighted sums, like $a + 2b + 3c$, to describe that $a < b < c$ (or maybe $a + (1+\varepsilon)b + (1+\varepsilon)^2c$ would be a better idea). Calculating the average ordering would then become an immensely simpler and more efficient procedure (you could just use the arithmetic mean, or something close to it), although I can't say much about its sanity. At least orderings that treat $a$ and $b$ as equal would always get averaged to orderings with the same property if you pick the simple arithmetic mean. A rough sketch of this idea is given after these notes.


3) I use a lot of nested scoping constructs. Normally, one would write a small package instead (a minimal skeleton of what that might look like is sketched below). I hope this won't make it significantly more difficult for you to find out what's going on here in case you're interested in Mathematica's core functions. Hopefully, the explicit names of symbols (I could say it's a [really good] Mma tradition) will be of help.
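
Regarding note 2, here is a very rough sketch of the rank-averaging idea. The representation (rank values per object, smaller meaning better), the helper name averageOrdering, and the sample input are all mine; this cheap average is not the Kemeny mean, only an illustration of how concise the data could become:

averageOrdering[rankings : {__Association}] :=
 Module[{objects, meanRank},
  objects = Union @@ (Keys /@ rankings);
  (* mean rank of every object across all "individual" rankings *)
  meanRank = AssociationMap[Mean[Lookup[rankings, #]] &, objects];
  (* smaller mean rank = ordered higher *)
  SortBy[objects, meanRank]]

averageOrdering[{<|a -> 1, b -> 2, c -> 3|>, <|a -> 2, b -> 1, c -> 3|>, <|a -> 1, b -> 3, c -> 2|>}]
(* {a, b, c} *)

Regarding note 3, such a package would have roughly this skeleton (the context name KemenyMean` is a placeholder):

BeginPackage["KemenyMean`"];

KemenyDistance::usage = "KemenyDistance[r1, r2] gives the Kemeny distance between two orderings.";
FindKemenyMean::usage = "FindKemenyMean[orderings] or FindKemenyMean[orderings, allObjects] gives the Kemeny mean.";

Begin["`Private`"];
(* ...the definitions from above go here... *)
End[];

EndPackage[];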

