Skip to main content

list manipulation - Allocate amount A according to the distribution of amount B


I have a cost that varies by category and date. I also have the counts of each item type within each category and date. I want to distribute the cost to each of the item types. I'm trying to use GroupBy, and Normalize to get the percentage distribution by item type. My plan is to then multiply these percentages by the cost for each category and date.


x1 = {
<|"cat" -> "a", "type" -> "abc1", "date" -> {2013, 1}, "count" -> 10|>,
<|"cat" -> "a", "type" -> "abc2", "date" -> {2013, 1}, "count" -> 20|>,
<|"cat" -> "a", "type" -> "abc3", "date" -> {2013, 1}, "count" -> 30|>,
<|"cat" -> "a", "type" -> "abc1", "date" -> {2014, 1}, "count" -> 40|>,
<|"cat" -> "a", "type" -> "abc2", "date" -> {2014, 1}, "count" -> 50|>,

<|"cat" -> "a", "type" -> "abc1", "date" -> {2015, 1}, "count" -> 13|>,
<|"cat" -> "b", "type" -> "abc1", "date" -> {2013, 1}, "count" -> 60|>,
<|"cat" -> "b", "type" -> "abc1", "date" -> {2014, 1}, "count" -> 70|>,
<|"cat" -> "b", "type" -> "abc2", "date" -> {2013, 1}, "count" -> 75|>
};
x2 = GroupBy[x1, {#cat, #date} &, Normalize[#, Total] &]

This results in the following:


<|{"a", {2013, 1}} -> 
{<|"cat" -> 1/3, "type" -> ("abc1")/("abc1" + "abc2" + "abc3"), "date" -> {1/3, 1/3}, "count" -> 1/6|>,

<|"cat" -> 1/3, "type" -> ("abc2")/("abc1" + "abc2" + "abc3"), "date" -> {1/3, 1/3}, "count" -> 1/3|>,
<|"cat" -> 1/3, "type" -> ("abc3")/("abc1" + "abc2" + "abc3"), "date" -> {1/3, 1/3}, "count" -> 1/2|>},
{"a", {2014, 1}} ->
{<|"cat" -> 1/2, "type" -> ("abc1")/("abc1" + "abc2"), "date" -> {1/2, 1/2}, "count" -> 4/9|>,
<|"cat" -> 1/2, "type" -> ("abc2")/("abc1" + "abc2"), "date" -> {1/2, 1/2}, "count" -> 5/9|>},
{"a", {2015, 1}} ->
{<|"cat" -> 1, "type" -> 1, "date" -> {1, 1}, "count" -> 1|>},
{"b", {2013, 1}} ->
{<|"cat" -> 1/2, "type" -> ("abc1")/("abc1" + "abc2"), "date" -> {1/2,1/2}, "count" -> 4/9|>,
<|"cat" -> 1/2, "type" -> ("abc2")/("abc1" + "abc2"), "date" -> {1/2,1/2}, "count" -> 5/9|>},

{"b", {2014, 1}} -> {<|"cat" -> 1, "type" -> 1,"date" -> {1, 1}, "count" -> 1|>}|>

Which is a mess. It does calculate the correct share for each type (in the count key), but now I'd have to somehow pull this apart and put it back together properly.


Originally I didn't use associations, and did a lot of joining. Using associations seems like a better idea but this level of complication is daunting. I'm afraid that when I pull it apart and put it back together the pieces may not be in the same order.


Is there a better way to approach this? Or, is there a reasonable way to simplify the result?



Answer



Let's start with a function cost that will tell us the total cost for each category/date pair.


cost[{cat_, date_}] := 100.

For illustration purposes, this function has been defined trivially -- the real cost function will have to look up its values somewhere.



Using Query


Let's define total to be a map from a category/date pair to the total count for that pair:


total = x1 // Query[GroupBy[{#cat, #date}&], Total, "count"]

(* <| {a,{2013,1}} -> 60
, {a,{2014,1}} -> 90
, {a,{2015,1}} -> 13
, {b,{2013,1}} -> 135
, {b,{2014,1}} -> 70
|> *)


Now we can generate adjusted source data so that each entry has a new cost property containing the prorated cost of that entry:


x1 // Query[All, <| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>&]

result screenshot


Without Using Query


The previous section uses Query, but it does not depend upon any particularly advanced functionality. We can perform the same operations without using Query at all:


total = GroupBy[x1, {#cat, #date}& -> (#count&), Total];

<| #, "cost" -> cost[{#cat, #date}] #count / total[{#cat, #date}] |>& /@ x1

(* same result as before *)




Sidebar: How does the Query version work?


Let's look at how Query is used above. First:


x1 // Query[GroupBy[{#cat, #date}&], Total, "count"]

Queries specify an operator to apply to each element of each level of an expression. The first operator is applied to level zero (the whole expression), the next to level one, etc. Each level is acted upon twice: once while descending into the expression and a second time when ascending back out to top-level. It is beyond the scope of this response to describe these phases in detail. However, here is the sequence of events for this particular query:




  1. Level 0, descending The GroupBy operator performs the necessary grouping and splits the original data into groups keyed by the "cat" and "date" keys. GroupBy is documented as being a special operator in that it inserts a new level into the data that did not exist before. While we started with a list of associations on level 0, after the grouping each level 1 element is a list of associations for a particular group.

  2. Level 1, descending The Total operator is an ascending operator, so in a descending context it is treated as All, i.e. Identity. This means that each group's list of associations will be left untouched as we descend.

  3. Level 2, descending The "count" operator will extract the count property from each of the associations. Thus, each association will be transformed into a count. This is the last operator, so now we switch to the ascending phase.

  4. Level 2, ascending The "count" operator is a descending operator, so it is treated as All in the ascending phase. Each count undergoes no further transformation.

  5. Level 1, ascending The Total operator now gets its chance to act since it is an ascending operator. It is presented with a list of the counts generated by step 4, returning a single number representing the sum of those counts.

  6. Level 0, ascending The GroupBy operator is a descending operator so it is treated as All in the ascending phase. This means that association containing counts keyed by groups will be returned unchanged. This association is the final result of the first query.


The second query is essentially of the form:


Query[All, f]


Short-cutting the description a bit, this applies the All operator to level 0 and the f operator to level 1. Since All essentially means Identity, the query is equivalent to saying Map[f]. In fact, we can see this by applying Normal to the query:


Query[All, f] // Normal
(* Map[f] *)

This Normal trick works for any query.


Tools described in Visualizing Type System Operations can be helpful in developing an intuition for dataset and query operations.


Comments

Popular posts from this blog

plotting - Filling between two spheres in SphericalPlot3D

Manipulate[ SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, Mesh -> None, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], {n, 0, 1}] I cant' seem to be able to make a filling between two spheres. I've already tried the obvious Filling -> {1 -> {2}} but Mathematica doesn't seem to like that option. Is there any easy way around this or ... Answer There is no built-in filling in SphericalPlot3D . One option is to use ParametricPlot3D to draw the surfaces between the two shells: Manipulate[ Show[SphericalPlot3D[{1, 2 - n}, {θ, 0, Pi}, {ϕ, 0, 1.5 Pi}, PlotPoints -> 15, PlotRange -> {-2.2, 2.2}], ParametricPlot3D[{ r {Sin[t] Cos[1.5 Pi], Sin[t] Sin[1.5 Pi], Cos[t]}, r {Sin[t] Cos[0 Pi], Sin[t] Sin[0 Pi], Cos[t]}}, {r, 1, 2 - n}, {t, 0, Pi}, PlotStyle -> Yellow, Mesh -> {2, 15}]], {n, 0, 1}]

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Adding a thick curve to a regionplot

Suppose we have the following simple RegionPlot: f[x_] := 1 - x^2 g[x_] := 1 - 0.5 x^2 RegionPlot[{y < f[x], f[x] < y < g[x], y > g[x]}, {x, 0, 2}, {y, 0, 2}] Now I'm trying to change the curve defined by $y=g[x]$ into a thick black curve, while leaving all other boundaries in the plot unchanged. I've tried adding the region $y=g[x]$ and playing with the plotstyle, which didn't work, and I've tried BoundaryStyle, which changed all the boundaries in the plot. Now I'm kinda out of ideas... Any help would be appreciated! Answer With f[x_] := 1 - x^2 g[x_] := 1 - 0.5 x^2 You can use Epilog to add the thick line: RegionPlot[{y < f[x], f[x] < y < g[x], y > g[x]}, {x, 0, 2}, {y, 0, 2}, PlotPoints -> 50, Epilog -> (Plot[g[x], {x, 0, 2}, PlotStyle -> {Black, Thick}][[1]]), PlotStyle -> {Directive[Yellow, Opacity[0.4]], Directive[Pink, Opacity[0.4]],