
graphics - How to create word clouds?



Word clouds are rather useless, but fancy and visually appealing, plots in which words are drawn at different sizes according to their frequency in a corpus. Many existing applications (Wordle, Tagxedo, etc.) serve as examples. I am interested in the algorithm that achieves the closest possible packing of words or other irregular shapes.


There is a method for computing the convex hull of an object (in the Computational Geometry package), but I think what is needed here is the boundary that encloses the least area. Once this is calculated, perhaps the packing method of graph layout can be exploited by assuming that points on the hull of a word correspond to graph vertices... but this is just speculation. So far I could only list and style the words (that was the easy part):


tally = Tally@
   Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}],
     Except@LetterCharacter], _?(StringLength@# > 4 &)];
tally = Cases[tally, _?(Last@# > 10 &)];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};

words = Style[First@#, FontFamily -> "Cardinal", FontWeight -> Bold,
     FontColor ->
      Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, .8}]],
     FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;

Framed[Grid@Partition[words, 10, 10, {1, 1}, {}],
 FrameStyle -> {Gray, Thick}, RoundingRadius -> 10, ImageMargins -> 5]

Mathematica graphics
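
As a minimal sketch of the frequency-to-size mapping used above, here is the same idea on a made-up toy sentence instead of the Alice text. The names toyText, toyTally, toyRange and toySizes are illustrative only: Tally counts each word, and Rescale maps the counts linearly onto a font-size interval.

```mathematica
(* toy input, not from the original post *)
toyText = "the cat and the dog and the bird";
toyTally = Tally[StringSplit[toyText]]
(* {{"the", 3}, {"cat", 1}, {"and", 2}, {"dog", 1}, {"bird", 1}} *)

(* smallest and largest count *)
toyRange = {Min[#], Max[#]} &[toyTally[[All, 2]]];

(* map counts linearly onto font sizes between 12 and 70 *)
toySizes = Rescale[#, toyRange, {12, 70}] & /@ toyTally[[All, 2]]
(* {70, 12, 41, 12, 12} *)
```

The most frequent word gets the largest size and every word with the minimum count gets the smallest, which is why the post filters out rare words before rescaling.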


Some possible specifications of the algorithm:



  • According to a link shared by cormullion, identifying the tightest boundary of each word is not enough, as words can be placed inside the holes of other glyphs (P, A, etc.). Thus intersections between words must indeed be tested.


  • According to Szabolcs, the code might be able to resize words to fit them better.

  • Many applications are able to arrange the cloud to fill up a user-specified shape (e.g. ellipse, apple, Che Guevara, etc.) instead of being casually positioned along the ever-increasing spiral.

  • It would be nice to allow individual words to have different rotations.

  • As usual, a fully vectorized version is preferred over image-processing methods (if the former is faster).

  • Also it would be nice to have post-rendering effects, like clickable words, mouseover effects, etc.
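
The intersection test mentioned in the first point could be sketched on rasterised masks as follows. This is only one assumed approach (pixel masks of equal dimensions, with 1 meaning "covered by a word"); overlapQ and the tiny test masks wordA, wordB and wordC are hypothetical names and data, not part of any answer.

```mathematica
(* masks: 1 = pixel covered by a word, 0 = empty;
   both images are assumed to have the same dimensions *)
overlapQ[maskA_Image, maskB_Image] :=
 Total[ImageData[ImageMultiply[maskA, maskB]], 2] > 0

wordA = Image[{{1, 1, 0, 0}, {1, 1, 0, 0}}];
wordB = Image[{{0, 0, 1, 1}, {0, 0, 1, 1}}];
wordC = Image[{{0, 1, 1, 0}, {0, 0, 0, 0}}];

overlapQ[wordA, wordB]  (* False: the masks share no pixel *)
overlapQ[wordA, wordC]  (* True: both cover row 1, column 2 *)
```

Because the product of two masks is nonzero only where both are nonzero, this test works for glyphs with holes as well as for solid shapes.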


One way to convert strings to vector graphics is:


First@ImportString[
  ExportString[
   Style["SomeText", Italic, FontFamily -> "Times", FontSize -> 36],
   "PDF"], "PDF", "TextMode" -> "Outlines"]





Answer



Here's what I came up with


Mathematica graphics


How I did it


First we need a list of words. Here, I've taken the original list, excluded the word "Alice", and sorted it by frequency in descending order.


tally = Tally@
   Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}],
     Except@LetterCharacter],
    _?(StringLength@# > 4 && # =!= "Alice" &)];
tally = Cases[tally, _?(Last@# > 10 &)];
tally = Reverse@SortBy[tally, Last];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};

words = Style[First@#, FontFamily -> "Cracked", FontWeight -> Bold,
     FontColor ->
      Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]],
     FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;

The words are rasterised and cropped to make sure the bounding box is as tight as possible.


wordsimg = ImageCrop[Image[Graphics[Text[#]]]] & /@ words;

To produce the image, the words are added one by one using Fold, with each new word placed as close to the centre of the existing image as possible. This is done by applying a max filter to the binarized negative of the current image, which turns forbidden positions white, and then looking for the black pixel that is closest to the centre of the image.


iteration[img1_, w_, fun_: (Norm[#1 - #2] &)] :=
 Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
  dimw = ImageDimensions[w];
  (* pad the canvas with white so the new word may land outside it *)
  padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
  (* white pixels of imdil mark forbidden positions for the word's centre *)
  imdil = MaxFilter[Binarize[ColorNegate[padded1], 0.01],
    Reverse@Floor[dimw/2 + 2]];
  centre = ImageDimensions[padded1]/2;
  (* allowed (black) position closest to the centre *)
  minpos = Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0],
      Reverse[centre], DistanceFunction -> fun][[1]];
  diff = ImageDimensions[imdil] - dimw;
  padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
  (* paste the word at minpos, then trim the surplus white border *)
  ImagePad[#, (-Min[#] {1, 1}) & /@ BorderDimensions[#]] &@
   ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]

Fold[iteration, wordsimg[[1]], Rest[wordsimg]]
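
The placement step can be illustrated on a small 0/1 matrix instead of an image. This is a simplified sketch rather than the code above: the names occupied, blocked, free and spot are made up, and the new word is assumed to be a 3×3 block, so a max filter of radius 1 marks every centre position at which it would touch an occupied cell.

```mathematica
(* 7x7 canvas with a single occupied cell in the middle *)
occupied = ReplacePart[ConstantArray[0, {7, 7}], {4, 4} -> 1];

(* dilate: a 3x3 word centred on any cell with value 1 would overlap *)
blocked = MaxFilter[occupied, 1];

(* candidate centres are the cells that are still 0 *)
free = Position[blocked, 0];

(* pick a free cell closest to the canvas centre *)
spot = First[Nearest[free, {4, 4}]];
EuclideanDistance[spot, {4, 4}]
(* 2 *)
```

The code above does the same thing with a filter radius of about half the word's size plus a small margin, so that words do not touch.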

You can play around with the distance function. For example for a distance function


fun = Norm[{1, 1/2} (#2 - #1)] &

you get an elliptical shape:


Fold[iteration[##, fun]&, wordsimg[[1]], Rest[wordsimg]]


Mathematica graphics
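
To see why this distance function stretches the cloud, it helps to evaluate it on two offsets of equal length. As far as I can tell from the code, Nearest is applied to the output of Position, so the points are {row, column} pairs and the factor 1/2 applies to the horizontal offset:

```mathematica
fun = Norm[{1, 1/2} (#2 - #1)] &;

(* arguments are {row, column} pairs, as returned by Position *)
fun[{0, 0}, {4, 0}]  (* 4: four rows away *)
fun[{0, 0}, {0, 4}]  (* 2: four columns away counts only half *)
```

Horizontal offsets are therefore cheaper than vertical ones, and free positions to the left and right of the centre are filled first.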




Updated version


The previous code places new words in the image by approximating them with rectangles. This works fine for horizontally or vertically oriented words, but not so well for rotated words or more general shapes. Luckily, the code can be easily modified to deal with this by replacing the MaxFilter with an ImageCorrelate:


iteration2[img1_, w_, fun_: (Norm[#1 - #2] &)] :=
 Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
  dimw = ImageDimensions[w];
  padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
  (* correlate the occupied pixels with the slightly dilated word mask;
     nonzero (white) results mark positions where the word would overlap *)
  imdil = Binarize[ImageCorrelate[Binarize[ColorNegate[padded1], 0.05],
     Dilation[Binarize[ColorNegate[w], .05], 1]]];
  centre = ImageDimensions[padded1]/2;
  minpos =
   Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0],
      Reverse[centre], DistanceFunction -> fun][[1]];
  Sow[minpos - centre]; (* for creating vector plot *)
  diff = ImageDimensions[imdil] - dimw;
  padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
  ImagePad[#, (-Min[#] {1, 1}) & /@ BorderDimensions[#]] &@
   ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]
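
The difference between the rectangle approximation and the correlation can be seen on a toy example. The sketch below uses ListCorrelate on plain matrices as a stand-in for ImageCorrelate; canvas, shape, shapeOverlap and boxOverlap are made-up names. Each entry of the result counts how many occupied cells the shape would cover at that offset, so a zero means the shape fits there.

```mathematica
canvas = {{1, 1, 0, 0},
          {1, 0, 0, 0},
          {0, 0, 0, 0}};   (* 1 = occupied *)
shape = {{0, 1},
         {1, 1}};          (* non-rectangular "word" *)

(* overlap counts for the actual shape *)
shapeOverlap = ListCorrelate[shape, canvas]
(* {{2, 0, 0}, {0, 0, 0}} *)

(* overlap counts for its 2x2 bounding box *)
boxOverlap = ListCorrelate[ConstantArray[1, {2, 2}], canvas]
(* {{3, 1, 0}, {1, 0, 0}} *)
```

The actual shape fits at five offsets while its bounding box fits at only three, which is the extra room that rotated or irregular words gain from the correlation.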


To test this code we use a list of rotated words. Note that I'm using ImagePad instead of ImageCrop to crop the images. This is because ImageCrop seems to clip the words sometimes.


words = Style[First@#, FontFamily -> "Times", 
FontColor ->
Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]],
FontSize -> (Last@Rescale[#, range, {12, 150}])] & /@ tally;

wordsimg = ImagePad[#, -3 -
BorderDimensions[#]] & /@ (Image[
Graphics[Text[Framed[#, FrameMargins -> 2]]]] & /@ words);


wordsimgRot = ImageRotate[#, RandomReal[2 Pi],
Background -> White] & /@ wordsimg;

The iteration loop is as before:


Fold[iteration2, wordsimgRot[[1]], Rest[wordsimgRot]]

which produces


Mathematica graphics


Second update



To create vector graphics from the previous result, we need to save the positions of the words in the image, for example by adding Sow[minpos - centre] to the definition of iteration2 somewhere towards the end of the code and using Reap to collect the results. We also need to keep the rotation angles of the words, so we'll replace wordsimgRot with


angles = RandomReal[2 Pi, Length[wordsimg]];

wordsimgRot = ImageRotate[##, Background -> White] & @@@
Transpose[{wordsimg, angles}];

As mentioned before, we use Reap to create the position list


poslist = Reap[img = Fold[iteration2, wordsimgRot[[1]], 
Rest[wordsimgRot]];][[2, 1]]
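
For readers unfamiliar with the Sow/Reap pattern, here is a toy version of the same construction with made-up names total and seen: a Fold that computes a running sum while sowing each element it consumes.

```mathematica
(* toy fold: running sum that also records every element it consumes *)
{total, {seen}} = Reap[Fold[(Sow[#2]; #1 + #2) &, 0, {1, 2, 3}]]
(* {6, {{1, 2, 3}}} *)

seen
(* {1, 2, 3} *)
```

Reap returns the result of the fold together with a list of everything that was sown, which is why the position list above is extracted with [[2, 1]]. Note that the first word is the starting value of the Fold and never passes through iteration2, so poslist has one entry fewer than the word list.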


The vector graphics can then be created with


Graphics[MapThread[Text[#1, Offset[#2, {0, 0}], {0, 0}, {Cos[#3], Sin[#3]}] &,
{words, Prepend[poslist, {0, 0}], angles}]]
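
Once the words are Text primitives, the mouseover effects from the question's wish list become possible. The following is only a sketch under assumed values: the word, angle and label are hypothetical and not taken from the Alice tally. It wraps a single rotated Text primitive in Tooltip.

```mathematica
(* hypothetical word, angle and label, for illustration only *)
word = Style["rabbit", FontFamily -> "Times", FontSize -> 30];
angle = Pi/6;

cloud = Graphics[{
    Tooltip[
     Text[word, {0, 0}, {0, 0}, {Cos[angle], Sin[angle]}],
     "rabbit (example label)"]}]
```

In the MapThread call above, the same wrapper could presumably be applied to each Text, with the word's frequency from tally as the label.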
