string manipulation - Dataset breaks multi-character StringSplit


ds = Dataset[{"a b", "c-d"} ]

StringSplit with a list of several delimiters is broken when used inside a Dataset query (a 10.1 regression?):

ds[All, StringSplit[#, {" ", "-"}] &]

[screenshot of the failed output]

A single split character works, though:

ds[All, StringSplit[#, " "] &] // Normal    

{{"a", "b"}, {"c-d"}}

So does the plain, non-Dataset version with several delimiters, of course (it returns {{"a", "b"}, {"c", "d"}}):

ds // Normal // Map[StringSplit[#, {" ", "-"}] &]


This issue is due to the same type-inferencing problem described here.

Using printSignatures from the referenced answer, we can list the signatures that the type inferencer accepts for StringSplit:

{Vector[Atom[String], n_]}

{Atom[String], Atom[String]}
{Vector[Atom[String], n_], Atom[String]}

Every one of these signatures takes at most a single string as the second argument. None of them matches a list of delimiters such as {" ", "-"}, so the query is rejected.

The referenced answer shows how to dodge the type-inferencer. We can use similar work-arounds here: either by using Query directly on the raw data...

ds // Normal // Query[Dataset, StringSplit[#, {" ", "-"}] &]

dataset screenshot

... or by disguising the StringSplit operator:

ds[All, StringSplit&[][#, {" ", "-"}] &]

dataset screenshot
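As far as I can tell, the disguise works because StringSplit &[] is a pure function applied to no arguments: it only turns into the symbol StringSplit when it is evaluated, so the query as written has no bare StringSplit head for the inferencer to match. That explanation is my reading rather than anything documented, but the evaluation side of it is easy to check outside of Dataset:

```wolfram
(* StringSplit &[] parses as a pure function applied to zero arguments *)
FullForm[Hold[StringSplit &[]]]
(* Hold[Function[StringSplit][]] *)

(* evaluating it yields the plain symbol *)
StringSplit &[] === StringSplit
(* True *)

(* so the disguised call splits on both delimiters as usual *)
StringSplit &[][#, {" ", "-"}] & /@ {"a b", "c-d"}
(* {{"a", "b"}, {"c", "d"}} *)
```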

Notice how the second work-around loses useful type information in this case, causing the dataset visualization to fall back to a cruder form. We can restore the missing type information by inserting a terminal Dataset ascending operator into the query:

ds[Dataset, StringSplit&[][#, {" ", "-"}] &]

dataset screenshot

This last operation causes the proper type information to be deduced from the final output data (using TypeSystem`DeduceType), restoring the proper visualization.
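If the intermediate query isn't needed, a more pedestrian route (not from the referenced answer, just a sketch) is to split the raw data first and wrap the result in Dataset again, so that the type is deduced from the finished data in the same way. The call to TypeSystem`DeduceType is only there for inspection. It is an internal function, so calling it directly like this and the form of its output are assumptions that may not hold across versions:

```wolfram
ds = Dataset[{"a b", "c-d"}];

(* split outside of Dataset, so no query type inference is involved *)
split = StringSplit[#, {" ", "-"}] & /@ Normal[ds]
(* {{"a", "b"}, {"c", "d"}} *)

(* rewrapping deduces the type from the final data *)
Dataset[split]

(* inspect the deduced type (internal function; output not shown here) *)
TypeSystem`DeduceType[split]
```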
