Prehistory
I am trying to perform a statistical analysis of some experimental data that arises from measurements made on an ordinal scale.
I am faced with the problem of rank aggregation: obtaining, from many "individual" orderings (of the same set of objects), one "collective" ordering.
The most "natural" approach to this problem is the Kemeny–Young method (better to consult the primary source).
Surprisingly, I found out that there is no program implementing this method! (There is one C++ code, but it does not allow weak orderings, i.e., it does not handle cases where several objects share the same position in an ordering.)
Previously I asked about some specific points (one, two) of constructing the needed code, but now I have decided to lay everything out at once — because the problem is more complicated than I thought at first, and I may be missing some points, since I am a complete beginner in Mathematica and in programming.
Description
Let $R_1, R_2, R_3, \ldots, R_N$ denote $N$ "individual" weak orderings of a given set of objects $a, b, c, \ldots$.
Consider an ordering $R$ as a set of ordered pairs of objects. Then $(a, b)\in R$ has the usual interpretation: "$a$ is ordered at least as good as $b$ in $R$". If $(b, a)$ is also in $R$, we say that "both are ordered equally well"; if $(b, a)$ is not in $R$, we say that "$a$ is ordered above $b$" or "$b$ is ordered below $a$".
(After introducing the metric we will see that it is possible to neglect pairs such as $(a, a), (b, b), (c, c), \ldots$, since they make no further difference.)
To demonstrate the notation, here are all 13 possible weak orderings for three objects:
$\ R_1=\{(a, b), (a, c), (b, c)\}$
$\ R_2=\{(a, c), (a, b), (c, b)\}$
$\ R_3=\{(a, b), (a, c), (b, c), (c, b)\}$
$\ R_4=\{(b, a), (b, c), (a, c)\}$
$\ R_5=\{(b, c), (b, a), (c, a)\}$
$\ R_6=\{(b, a), (b, c), (a, c), (c, a)\}$
$\ R_7=\{(c, a), (c, b), (a, b)\}$
$\ R_8=\{(c, b), (c, a), (b, a)\}$
$\ R_9=\{(c, a), (c, b), (a, b), (b, a)\}$
$R_{10}=\{(a, b), (b, a), (a, c), (b, c)\}$
$R_{11}=\{(a, c), (c, a), (a, b), (c, b)\}$
$R_{12}=\{(b, c), (c, b), (b, a), (c, a)\}$
$R_{13}=\{(a, b), (b, a), (a, c), (c, a), (b, c), (c, b)\}$
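As a cross-check of this count, here is a small sketch (in Python rather than Mathematica, purely for illustration; all names are mine) that enumerates weak orderings as ordered partitions into tie groups and converts each one into the pair-set notation above. For three objects it yields exactly 13 distinct pair sets:

```python
from itertools import combinations

def ordered_partitions(objs):
    """Yield every ordered partition of objs into nonempty blocks (tie groups)."""
    objs = list(objs)
    if not objs:
        yield []
        return
    for r in range(1, len(objs) + 1):
        for block in combinations(objs, r):   # first tie group
            rest = [o for o in objs if o not in block]
            for tail in ordered_partitions(rest):
                yield [block] + tail

def pairs(partition):
    """(x, y) is in the relation iff x's tie group comes no later than y's;
    reflexive pairs (a, a) are dropped, as in the text."""
    level = {o: i for i, block in enumerate(partition) for o in block}
    return frozenset((x, y) for x in level for y in level
                     if x != y and level[x] <= level[y])

all_orderings = {pairs(p) for p in ordered_partitions("abc")}
print(len(all_orderings))  # 13, matching the list above
```

The count 13 is the ordered Bell number for three elements, which is exactly the number of weak orderings.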
With this notation we may introduce a metric $\delta$ (the so-called Kemeny distance) between any two orderings $R_1$ and $R_2$ as follows: $$\delta(R_1,R_2)=|R_1\setminus R_2|+|R_2\setminus R_1|,$$ where $\setminus$ denotes set-theoretic difference and $|\cdot|$ denotes the cardinality of a set (i.e., the number of elements, since the sets are finite).
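This is simply the size of the symmetric difference of the two pair sets. A sketch in Python (for illustration only; function names are mine, and sets of tuples stand in for the pair sets):

```python
def kemeny_distance(r1, r2):
    """|R1 \\ R2| + |R2 \\ R1| — the size of the symmetric difference."""
    return len(r1 - r2) + len(r2 - r1)  # equivalently len(r1 ^ r2)

R1 = {("a", "b"), ("a", "c"), ("b", "c")}               # a > b > c
R2 = {("a", "c"), ("a", "b"), ("c", "b")}               # a > c > b
R3 = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "b")}   # a > b ~ c
print(kemeny_distance(R1, R2))  # 2: the pairs (b,c) and (c,b) are swapped
print(kemeny_distance(R1, R3))  # 1: R3 additionally contains (c,b)
```

Note that the distance between a strict ordering and the same ordering with one adjacent pair tied is 1, while swapping an adjacent pair costs 2.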
The required "collective" ordering $\Omega$ (the so-called Kemeny mean) is the ordering of the given objects that minimizes the sum of the squares of the Kemeny distances to all "individual" orderings, i.e.: $$\Omega=\underset{\Omega}{\arg\min} \sum_{i=1}^{N} \delta(R_i,\Omega)^2$$
In fact, $\Omega$ may not be unique: there may be several orderings $\Omega_1, \Omega_2, \ldots$ for which the corresponding sums of squared Kemeny distances are equal and minimal. So in the general case $\Omega$ is the set of all such $\Omega_i$.
How to attack
So we have several orderings $R_i$ as input, and the goal is to output $\Omega$.
It looks like this calls for brute force: generate the possible orderings one by one, calculate the sum of the squares of the Kemeny distances to all input orderings, and check for a new minimum.
Even for a small number of objects the number of possible weak orderings is huge (in my data there are $10$ objects, which gives over $10^8$ possible orderings), so we should keep as little data in memory as possible.
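Such a brute-force scan could be sketched as follows (again in Python, purely for illustration; all helper names are mine). It enumerates weak orderings as ordered partitions into tie groups, keeps only the current minimum and its minimizers, and illustrates a notable feature of the squared objective: with two votes $a > b > c$ and one vote $b > a > c$, the mean ties $a$ with $b$ above $c$:

```python
from itertools import combinations

def ordered_partitions(objs):
    """Yield every ordered partition of objs into nonempty tie groups."""
    objs = list(objs)
    if not objs:
        yield []
        return
    for r in range(1, len(objs) + 1):
        for block in combinations(objs, r):
            rest = [o for o in objs if o not in block]
            for tail in ordered_partitions(rest):
                yield [block] + tail

def pairs(partition):
    """Convert a tie-group partition into the pair-set representation."""
    level = {o: i for i, block in enumerate(partition) for o in block}
    return frozenset((x, y) for x in level for y in level
                     if x != y and level[x] <= level[y])

def kemeny_mean(orderings, objects):
    """Brute force: scan every weak ordering, keep all minimizers."""
    best, best_sum = [], float("inf")
    for p in ordered_partitions(objects):
        omega = pairs(p)
        s = sum(len(r ^ omega) ** 2 for r in orderings)  # squared Kemeny distances
        if s < best_sum:
            best, best_sum = [p], s
        elif s == best_sum:
            best.append(p)
    return best, best_sum

# three voters over {a, b, c}
votes = [pairs([("a",), ("b",), ("c",)]),   # a > b > c
         pairs([("a",), ("b",), ("c",)]),   # a > b > c
         pairs([("b",), ("a",), ("c",)])]   # b > a > c
means, total = kemeny_mean(votes, "abc")
print(means, total)  # [[('a', 'b'), ('c',)]] 3 — the tie a ~ b > c wins
```

The tie wins because it is at distance 1 from each vote (sum of squares $1+1+1=3$), whereas $a > b > c$ itself is at distance 2 from the dissenting vote (sum $0+0+4=4$). Only the running minimum and its candidates are stored, so memory stays small even though the scan itself is exponential.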
Answer
[Warning: there could be issues with the averaging procedure. At the least, one should pay attention to allPossibleOrderings; its definition probably needs to be improved. See the other two *edit*s in the text below.]
I'm probably doing something nasty here, but if RAM is really not a problem for you, then… here you go, a blunt brute-force approach:
KemenyDistance =
Plus @@
Composition[Length, Complement] @@@
Permutations[{##}, {2}] &;
FindKemenyMean[setOfOrderings:{__List}, allObjects:_List:{}] :=
Block[{ weakAndNonWeakOrderings, allPossibleOrderings
, kemenySumOfSquares
, minFound, data, \[CapitalOmega]candidate },
weakAndNonWeakOrderings@objects_List :=
With[{orderingFromOrderedList = Partition[#, 2, 1]&},
Block[{nonWeakOrderingsOnly, addWeakOrderings},
nonWeakOrderingsOnly =
orderingFromOrderedList /@ Permutations[#, {Length@#}]&;
addWeakOrderings@nonWeakOrdering_ :=
Join[#, nonWeakOrdering]& /@
(Permutations[Reverse/@#, Length@#]&@nonWeakOrdering);
Flatten[addWeakOrderings /@ nonWeakOrderingsOnly@objects, 1]]];
allPossibleOrderings =
weakAndNonWeakOrderings @
If[allObjects === {}
, Composition[DeleteDuplicates, Flatten]@setOfOrderings
, allObjects];
kemenySumOfSquares[\[CapitalOmega]_, orderingsToAverage_List] :=
Plus @@ (KemenyDistance[#, \[CapitalOmega]]^2 & /@ orderingsToAverage);
Fold[
With[
{ currentMin = First@Cases[#1, minFound@n_ :> n, {1}, 1]
, currentOrdering = #2 }
, With[
{ currentSum =
kemenySumOfSquares[currentOrdering, setOfOrderings] }
, If[currentSum < currentMin
, data[minFound@currentSum, \[CapitalOmega]candidate@currentOrdering]
, #1]]] &
, data[minFound[\[Infinity]], \[CapitalOmega]candidate[]]
, allPossibleOrderings]]
These two definitions give you KemenyDistance and FindKemenyMean precisely in the form you described. FindKemenyMean fetches the initial objects from the orderings list, which will always work if your orderings are linear (= complete).
In[1]:= FindKemenyMean[
{ {{a, b}, {b, c}, {c, b}, {c, d}},
{{b, d}, {d, c}, {c, a}},
{{c, a}, {a, d}, {d, b}} }]
Out[1]= data[minFound[45], \[CapitalOmega]candidate[{{b, c}, {c, a}, {a, d}}]]
You can provide the list of all objects independently as its second argument, but I don't guarantee anything for incomplete orderings (I just don't have enough mathematical intuition for that).
Notes:
1) My calculation of all orderings, with weak ones included, for $N = 3$ evaluates to a list that is somewhat longer than the one you provided. I probably generate equivalent (hence redundant) weak orderings! [edit: or maybe I should have converted them to some canonical form.]
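One hypothetical canonical form (my suggestion, not part of the answer's code) is the transitive closure of the pair set: two pair lists that describe the same weak ordering but spell out different subsets of the implied pairs close to the same set, so duplicates can be removed by comparing closures. A Python sketch:

```python
def transitive_closure(pair_list):
    """Naive closure: while (x, y) and (y, z) are present, add (x, z).
    Reflexive pairs (a, a) are skipped, as in the question's notation."""
    rel = set(pair_list)
    changed = True
    while changed:
        changed = False
        for x, y in list(rel):
            for y2, z in list(rel):
                if y == y2 and x != z and (x, z) not in rel:
                    rel.add((x, z))
                    changed = True
    return frozenset(rel)

# Two pair lists describing the same chain a > b > c; the second one
# omits the implied pair (a, c):
r_full  = [("a", "b"), ("b", "c"), ("a", "c")]
r_short = [("a", "b"), ("b", "c")]
print(transitive_closure(r_full) == transitive_closure(r_short))  # True
```

Deduplicating the generated orderings through such a closure would shrink the list back to the expected count, at the cost of an extra pass over each candidate.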
2) The $R_i$ data structure for orderings is obviously redundant, hence definitely inefficient. [edit: not so if you don't presume that the relation is transitive, or if one must get to some kind of "complete form" of all $R_i$'s in order to calculate the Kemeny mean w.r.t. them.] I don't understand a thing in statistics, but I suggest describing orderings in a more concise manner; Mathematica gives you lots of possibilities to do that. For example, you could use weighted sums, like $a + 2b + 3c$, to describe that $a < b < c$ (or maybe $a + (1+\varepsilon)b + (1+\varepsilon)^2c$ would be a better idea). Calculating the average ordering would then become an immensely simpler and more efficient procedure (you could just use the arithmetic mean, or something close to it), although I can't say much about its sanity. At least, orderings that treat $a$ and $b$ as equal would always average to orderings with the same property if you pick the simple arithmetic mean.
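The averaging idea in this note might be sketched as follows (Python for illustration; the rank-vector encoding is my stand-in for the note's weighted-sum coefficients, and this is emphatically not the Kemeny mean, just the cheap approximation the note speculates about):

```python
def mean_ranks(rank_vectors):
    """Average each object's position across all voters."""
    objs = rank_vectors[0].keys()
    return {o: sum(rv[o] for rv in rank_vectors) / len(rank_vectors)
            for o in objs}

# positions 1, 2, 3 play the role of the coefficients in a + 2b + 3c
voters = [{"a": 1, "b": 2, "c": 3},
          {"a": 1, "b": 2, "c": 3},
          {"a": 2, "b": 1, "c": 3}]
avg = mean_ranks(voters)
order = sorted(avg, key=avg.get)  # objects sorted by mean position
print(order)  # ['a', 'b', 'c'] — a stays first on average
```

As the note says, objects tied in every input end up with identical mean positions, so ties are preserved; whether the resulting ordering is statistically meaningful is a separate question.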
3) I use a lot of nested scoping constructs. Normally, one would write a small package instead. I hope this won't make it significantly more difficult for you to figure out what's going on here, in case you're interested in Mathematica's core functions. Hopefully, the explicit symbol names (I could say it's a [really good] Mathematica tradition) will be of help.