
string manipulation - Dataset breaks multi-character StringSplit


Given


ds = Dataset[{"a b", "c-d"}]


StringSplit with a list of delimiters is broken with Dataset (10.1 regression?):


ds[All, StringSplit[#, {" ", "-"}] &]



though a single split character works:


ds[All, StringSplit[#, " "] &] // Normal    


{{"a", "b"}, {"c-d"}}




As does the plain non-Dataset version with multiple delimiters, of course, which returns {{"a", "b"}, {"c", "d"}}:


ds // Normal // Map[StringSplit[#, {" ", "-"}] &]

Answer



This issue is due to the same type-inferencing problem described here.


Using printSignatures from the referenced answer, we can see that the type inferencer will only accept a single string as the second argument, not a list:


printSignatures[StringSplit]
(*
{Vector[Atom[String], n_]}
{Atom[String]}

{Atom[String], Atom[String]}
{Vector[Atom[String], n_], Atom[String]}
*)

None of these signatures has a list in the second position, so a call such as StringSplit[#, {" ", "-"}] is rejected before it is ever evaluated.


The referenced answer shows how to dodge the type-inferencer. We can use similar work-arounds here: either by using Query directly on the raw data...


ds // Normal // Query[Dataset, StringSplit[#, {" ", "-"}] &]

dataset screenshot


... or by disguising the StringSplit operator:



ds[All, StringSplit&[][#, {" ", "-"}] &]

dataset screenshot


Notice how the second work-around loses useful type information in this case, causing the dataset visualization to fall back to a cruder form. We can restore the missing type information by inserting a terminal Dataset ascending operator into the query:


ds[Dataset, StringSplit&[][#, {" ", "-"}] &]

dataset screenshot


This last operation causes the proper type information to be deduced from the final output data (using TypeSystem`DeduceType), restoring the proper visualization.
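
The same principle can be seen without Query at all. The following is only a minimal sketch, not something from the referenced answer: it does the split on the raw list and then rewraps the result, on the assumption that the Dataset constructor deduces its type from the finished data. The names raw and split are purely illustrative.

```mathematica
(* Sketch only: split on the raw data, then rewrap the result. *)
ds = Dataset[{"a b", "c-d"}];

(* Normal strips the Dataset wrapper and returns the underlying list. *)
raw = Normal[ds];

(* Outside a Dataset query, StringSplit accepts a list of delimiters. *)
split = StringSplit[#, {" ", "-"}] & /@ raw;
(* split is {{"a", "b"}, {"c", "d"}} *)

(* Rewrapping builds a new Dataset from the already-split data. *)
Dataset[split]
```

This is more verbose than the in-query work-arounds, but it keeps the type inferencer out of the StringSplit call entirely.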

