I have a cost that varies by category and date. I also have the counts of each item type within each category and date. I want to distribute the cost to each of the item types. I'm trying to use GroupBy and Normalize to get the percentage distribution by item type. My plan is to then multiply these percentages by the cost for each category and date.
x1 = {
<|"cat" -> "a", "type" -> "abc1", "date" -> {2013, 1}, "count" -> 10|>,
<|"cat" -> "a", "type" -> "abc2", "date" -> {2013, 1}, "count" -> 20|>,
<|"cat" -> "a", "type" -> "abc3", "date" -> {2013, 1}, "count" -> 30|>,
<|"cat" -> "a", "type" -> "abc1", "date" -> {2014, 1}, "count" -> 40|>,
<|"cat" -> "a", "type" -> "abc2", "date" -> {2014, 1}, "count" -> 50|>,
<|"cat" -> "a", "type" -> "abc1", "date" -> {2015, 1}, "count" -> 13|>,
<|"cat" -> "b", "type" -> "abc1", "date" -> {2013, 1}, "count" -> 60|>,
<|"cat" -> "b", "type" -> "abc1", "date" -> {2014, 1}, "count" -> 70|>,
<|"cat" -> "b", "type" -> "abc2", "date" -> {2013, 1}, "count" -> 75|>
};
x2 = GroupBy[x1, {#cat, #date} &, Normalize[#, Total] &]
This results in the following:
<|{"a", {2013, 1}} ->
{<|"cat" -> 1/3, "type" -> ("abc1")/("abc1" + "abc2" + "abc3"), "date" -> {1/3, 1/3}, "count" -> 1/6|>,
<|"cat" -> 1/3, "type" -> ("abc2")/("abc1" + "abc2" + "abc3"), "date" -> {1/3, 1/3}, "count" -> 1/3|>,
<|"cat" -> 1/3, "type" -> ("abc3")/("abc1" + "abc2" + "abc3"), "date" -> {1/3, 1/3}, "count" -> 1/2|>},
{"a", {2014, 1}} ->
{<|"cat" -> 1/2, "type" -> ("abc1")/("abc1" + "abc2"), "date" -> {1/2, 1/2}, "count" -> 4/9|>,
<|"cat" -> 1/2, "type" -> ("abc2")/("abc1" + "abc2"), "date" -> {1/2, 1/2}, "count" -> 5/9|>},
{"a", {2015, 1}} ->
{<|"cat" -> 1, "type" -> 1, "date" -> {1, 1}, "count" -> 1|>},
{"b", {2013, 1}} ->
{<|"cat" -> 1/2, "type" -> ("abc1")/("abc1" + "abc2"), "date" -> {1/2,1/2}, "count" -> 4/9|>,
<|"cat" -> 1/2, "type" -> ("abc2")/("abc1" + "abc2"), "date" -> {1/2,1/2}, "count" -> 5/9|>},
{"b", {2014, 1}} -> {<|"cat" -> 1, "type" -> 1,"date" -> {1, 1}, "count" -> 1|>}|>
This is a mess. It does calculate the correct share for each type (in the "count" key), but now I'd have to somehow pull this apart and put it back together properly.
Originally I didn't use associations, and did a lot of joining. Using associations seems like a better idea, but this level of complication is daunting. I'm afraid that when I pull the result apart and reassemble it, the pieces may not end up in the same order.
Is there a better way to approach this? Or, is there a reasonable way to simplify the result?
Answer
Let's start with a function cost that will tell us the total cost for each category/date pair.
cost[{cat_, date_}] := 100.
For illustration purposes, this function has been defined trivially -- the real cost function will have to look up its values somewhere.
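For example (purely as an illustration; costData and its values below are made-up placeholders), the real function might look its values up in an association keyed by category/date pairs:

costData = <|
    {"a", {2013, 1}} -> 120., {"a", {2014, 1}} -> 150., {"a", {2015, 1}} -> 80.,
    {"b", {2013, 1}} -> 200., {"b", {2014, 1}} -> 90.
  |>;  (* hypothetical figures standing in for the real cost data *)

cost[{cat_, date_}] := Lookup[costData, Key[{cat, date}], 0.]

The Key wrapper is needed because the keys are themselves lists, which Lookup would otherwise treat as lists of separate keys. The rest of this answer keeps the trivial definition above.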
Using Query
Let's define total to be a map from a category/date pair to the total count for that pair:
total = x1 // Query[GroupBy[{#cat, #date}&], Total, "count"]
(* <| {a,{2013,1}} -> 60
, {a,{2014,1}} -> 90
, {a,{2015,1}} -> 13
, {b,{2013,1}} -> 135
, {b,{2014,1}} -> 70
|> *)
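For instance, looking up the total for category "a" in {2013, 1}:

total[{"a", {2013, 1}}]
(* 60 *)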
Now we can generate adjusted source data so that each entry has a new cost property containing the prorated cost of that entry:
x1 // Query[All, <| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>&]
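As a quick sanity check (just a sketch reusing the query above), we can regroup the adjusted entries and confirm that the prorated costs in each group add back up to that group's total cost, which is 100. for every group under the trivial cost function:

adjusted = x1 // Query[All, <| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>&];
adjusted // Query[GroupBy[{#cat, #date}&], Total, "cost"]
(* every value should come back as (numerically) 100. *)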
Without Using Query
The previous section uses Query, but it does not depend upon any particularly advanced functionality. We can perform the same operations without using Query at all:
total = GroupBy[x1, {#cat, #date}& -> (#count&), Total];
<| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>& /@ x1
(* same result as before *)
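If you want to convince yourself the two approaches agree, here is a quick comparison (a sketch; the names viaQuery and viaMap are just illustrative, and either definition of total can be used since both produce the same association):

viaQuery = x1 // Query[All, <| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>&];
viaMap = <| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>& /@ x1;
viaQuery === viaMap
(* True *)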
Sidebar: How does the Query version work?
Let's look at how Query is used above. First:
x1 // Query[GroupBy[{#cat, #date}&], Total, "count"]
Queries specify an operator to apply to each element of each level of an expression. The first operator is applied to level zero (the whole expression), the next to level one, etc. Each level is acted upon twice: once while descending into the expression and a second time when ascending back out to top-level. It is beyond the scope of this response to describe these phases in detail. However, here is the sequence of events for this particular query:
1. Level 0, descending: The GroupBy operator performs the necessary grouping and splits the original data into groups keyed by the "cat" and "date" keys. GroupBy is documented as being a special operator in that it inserts a new level into the data that did not exist before. While we started with a list of associations on level 0, after the grouping each level 1 element is a list of associations for a particular group.
2. Level 1, descending: The Total operator is an ascending operator, so in a descending context it is treated as All, i.e. Identity. This means that each group's list of associations will be left untouched as we descend.
3. Level 2, descending: The "count" operator will extract the count property from each of the associations. Thus, each association will be transformed into a count. This is the last operator, so now we switch to the ascending phase.
4. Level 2, ascending: The "count" operator is a descending operator, so it is treated as All in the ascending phase. Each count undergoes no further transformation.
5. Level 1, ascending: The Total operator now gets its chance to act since it is an ascending operator. It is presented with a list of the counts generated by step 4, returning a single number representing the sum of those counts.
6. Level 0, ascending: The GroupBy operator is a descending operator, so it is treated as All in the ascending phase. This means that the association containing counts keyed by groups is returned unchanged. This association is the final result of the first query (a manual reconstruction of these steps is sketched just after this list).
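To make the level-by-level description above concrete, here is a manual reconstruction of the first query using only GroupBy and Map (the intermediate names grouped, counts and totals are just for illustration):

grouped = GroupBy[x1, {#cat, #date}&];   (* level 0, descending: split into groups *)
counts = Map[#count&, grouped, {2}];     (* level 2, descending: extract the count from each association *)
totals = Map[Total, counts];             (* level 1, ascending: sum the counts within each group *)
totals === (x1 // Query[GroupBy[{#cat, #date}&], Total, "count"])
(* True *)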
The second query is essentially of the form:
Query[All, f]
Short-cutting the description a bit, this applies the All operator to level 0 and the f operator to level 1. Since All essentially means Identity, the query is equivalent to saying Map[f]. In fact, we can see this by applying Normal to the query:
Query[All, f] // Normal
(* Map[f] *)
This Normal trick works for any query.
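For instance, applying it to the second query from this answer should recover the plain mapping operator used in the "Without Using Query" section:

Query[All, <| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>&] // Normal
(* expect the equivalent Map[...] operator, matching the explicit /@ mapping above *)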
Tools described in Visualizing Type System Operations can be helpful in developing an intuition for dataset and query operations.