string manipulation - Dataset breaks multi-character StringSplit

Given

ds = Dataset[{"a b", "c-d"} ]

splitting on several delimiters at once with StringSplit is broken inside a Dataset query (a 10.1 regression?):

ds[All, StringSplit[#, {" ", "-"}] &]

[screenshot of the failed query output]

though splitting on a single delimiter works:

ds[All, StringSplit[#, " "] &] // Normal

{{"a", "b"}, {"c-d"}}

The plain non-Dataset version of the multi-delimiter split works too, of course, and splits both strings, giving {{"a", "b"}, {"c", "d"}}:

ds // Normal // Map[StringSplit[#, {" ", "-"}] &]

Answer

This issue is due to the same type-inferencing problem described here.

Using printSignatures from the referenced answer, we can see that the type inferencer will only accept a single string as the second argument, not a list:

printSignatures[StringSplit]
  (*
    {Vector[Atom[String], n_]}
    {Atom[String]}

    {Atom[String], Atom[String]}
    {Vector[Atom[String], n_], Atom[String]}
  *)

None of these signatures takes a list of strings in the second position, which is why the query with {" ", "-"} fails type inference while the single-delimiter query passes.

The referenced answer shows how to dodge the type-inferencer. We can use similar work-arounds here: either by using Query directly on the raw data...

ds // Normal // Query[Dataset, StringSplit[#, {" ", "-"}] &]

dataset screenshot

... or by disguising the StringSplit operator:

ds[All, StringSplit&[][#, {" ", "-"}] &]

dataset screenshot

Notice how the second work-around loses useful type information in this case, causing the dataset visualization to fall back to a cruder form. We can restore the missing type information by inserting a terminal Dataset ascending operator into the query:

ds[Dataset, StringSplit&[][#, {" ", "-"}] &]

dataset screenshot

This last operation causes the proper type information to be deduced from the final output data (using TypeSystem`DeduceType), restoring the proper visualization.
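
A further variation on the first work-around, not taken from the referenced answer, is to do the split on the plain list and only wrap the finished result in Dataset. This is a minimal sketch using the sample strings from the question; the names raw, split and result are illustrative only.

```wolfram
(* Sample strings from the question; raw, split and result are illustrative names *)
raw = {"a b", "c-d"};

(* Split outside of any Dataset query, so the query type inferencer is never consulted *)
split = StringSplit[#, {" ", "-"}] & /@ raw;
(* {{"a", "b"}, {"c", "d"}} *)

(* Wrap the finished result; Dataset works out the type from the data it is given *)
result = Dataset[split]
```

This trades the convenience of the query syntax for a step that does not depend on StringSplit's registered signatures, so it may be the simplest option when the data starts out as a plain list anyway.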
