Skip to main content

programming - How to perform document classification (i.e. extracting topics from text)?


I need to build a multi-label text classifier. There are a bunch of techniques for doing this, and it's called automatic document classification. I'm trying to use the EM technique (expectation maximization algorithm) to write a simple machine learning program that determines the core topics (each topic is a set of feature words) in a set of related documents (each document is a string).


For example if given the Sarah Palin's emails, the output might be:


(*topics =>*) 
{{"Mccain", "election", "2008", "obama"}, {"family", "Trig", "son", "downs syndrome"},
{"Alaska", "oil", "governor", "scandal", "book burning"}};

The program has two steps:


Step 1: Find a subset important words using some heuristics, length, type (noun|adjective|etc...), frequency, etc... Here's my feature extracting code:


Words[s_String] := ToLowerCase /@ StringCases[s,RegularExpression["\\w(?
features[s_String, o___] := features[Words@s,o];
features[listOfWords_List, minLength_:6, maxFeatures_:10] := Module[{words},
words = Union @ Select[listOfWords, And[
StringLength[#] >= minLength,
With[{w = WordData[#,"PartsOfSpeech"]},
Or[
Head@w === WordData,
!FreeQ[WordData[w,"PartsOfSpeech"],"Noun"|"Adjective"]]]
]&];
words = Quiet@Commonest[words,maxFeatures];

Return @ words;
];

Step 2: Cluster the feature words into random topics sets and then iteratively permute them to maximize the expectation.


This is where I need a bit of help from Mathemaica v10 ML features...


Notes



  • Mathematica 10 has a natural-language classifier "FacebookTopic" for classification of Facebook posts built in but it doesn't help because the code is obfuscated and I need to retrain on my own data set.

  • Standford has a working topic classifier version for download.

  • Here's a nicely explained example of document classification using EM.




Answer



What you're describing is Topic modeling. Your link describes Latent Dirichlet Allocation (LDA), which is a popular model. You mention that you would like to use an expectation-maximization algorithm to implement the LDA model (through variational inference). That is, however, only one of many approaches to the necessary parameter estimation. Another is collapsed Gibbs sampling, which is easier to implement. In this technical report a pseudo-algorithm for LDA with Gibbs sampling is presented.


The download link to Sarah Palin's e-mails is broken. Instead, I will use Hillary Clinton's e-mails.


The Code:


(* Basic LDA with collapsed Gibbs sampling. *)
LatentDirichletAllocation[documents_, topics_, maxIterations_] :=
Module[
{words, alpha, beta, nd, nw, nt, theta, phi, wordsLength,
phiDenominator, maxFeatures, currentWord, probabilities, newTopic},

(* Parameters. *)
alpha = 50/topics;
beta = 0.1;
maxFeatures = 2000;
(* Data pre-processing.*)
words = DataPreprocessing[documents, maxFeatures];
wordsLength = Length@words;
(* Append z-values. *)
words = Append[#, 1 + RandomInteger[topics - 1]] & /@ words;
(* Initialize counts. *)

Print["Starting initialization."];
nd = ConstantArray[0, {Length@documents, topics}];
(nd[[#[[2]], #[[3]]]] += 1) & /@ words;
nw = <||>;
If[KeyExistsQ[nw, #[[1]]],
nw[#[[1]]] =
nw[#[[1]]] + Normal@SparseArray[{#[[3]] -> 1}, topics],
AssociateTo[
nw, #[[1]] -> Normal@SparseArray[{#[[3]] -> 1}, topics]]
] & /@ words;

nt = Count[words[[All, 3]], #] & /@ Range[topics];
(* Repeat iteration over all words. *)
Print["Starting sampling."];
Do[
Do[
currentWord = words[[i]];
(* Update counts downwards. *)
nd[[Sequence @@ currentWord[[{2, 3}]]]] -= 1;
nw[currentWord[[1]]] =
nw[currentWord[[1]]] -

Normal@SparseArray[{currentWord[[3]] -> 1}, topics];
nt[[currentWord[[3]]]] -= 1;
(* Create probability distribution. *)
probabilities = (nd[[currentWord[[2]], #]] +
alpha) (nw[currentWord[[1]]][[#]] + beta)/(nt[[#]] +
beta wordsLength) & /@ Range[topics];
(* Choose new topic. *)
newTopic = RandomChoice[probabilities -> Range[topics]];
words[[i, 3]] = newTopic;
(* Update counts upwards. *)

nd[[currentWord[[2]], newTopic]] += 1;
nw[currentWord[[1]]] =
nw[currentWord[[1]]] +
Normal@SparseArray[{newTopic -> 1}, topics];
nt[[newTopic]] += 1;
(* Done. Repeat for all elements. *)
, {i, wordsLength}],
(* Update all elements maxIterations times. *)
{iter, maxIterations}
];


(* Calculate theta and phi. *)
theta =
N@Table[(nd[[d, #]] + alpha)/(Total@nd[[d]] + alpha topics) & /@
Range[topics], {d, Length@documents}];
phiDenominator = nt + Length@Keys[nw] beta;
phi = KeyValueMap[{#1, (#2 + beta)/phiDenominator} &, nw];
(* Return estimated parameters. *)
Print["LDA done. Returning theta and phi."];
{theta, phi}

];

DataPreprocessing[texts_, maxFeature_] := Module[{newTexts, features},
features = Flatten@TransformDocument[#] & /@ texts;
features = DimensionalityReduction[features, maxFeature];
newTexts = Join @@ MapIndexed[Thread@{#1, First@#2} &, features];
newTexts
];

(* Function for features selection. *)

TransformDocument[text_] := Module[{newText, wordWithNumber, target},
newText = StringReplace[ToLowerCase@text, "\\n" -> " "];
(* Remove digits. *)
wordWithNumber[1] =
WordBoundary ~~ LetterCharacter ... ~~ DigitCharacter .. ~~
WordBoundary;

wordWithNumber[2] =
WordBoundary ~~ DigitCharacter .. ~~ LetterCharacter ... ~~
WordBoundary;


wordWithNumber[3] =
WordBoundary ~~ LetterCharacter ... ~~ DigitCharacter .. ~~
LetterCharacter ... ~~ WordBoundary;

target = (wordWithNumber[1] | wordWithNumber[2] |
wordWithNumber[3]);
newText = StringDelete[newText, target];
newText = DeleteStopwords@newText;
newText = Union@TextWords[newText];

newText
];

DimensionalityReduction[words_, wordMax_] :=
Module[{wordCount, wordSelected, wordNew},
wordCount =
Sort[Tally[Flatten@words], #1[[2]] > #2[[2]] &][[All, 1]];
wordSelected = Take[wordCount, Min[Length@wordCount, wordMax]];
wordNew = Select[#, MemberQ[wordSelected, #] &] & /@ words;
wordNew

];

The LatentDirichletAllocation[] function takes three arguments. The documents, the number of topics, and the number of sampling iterations. The model is also dependant on the parameters alpha, beta, and maxFeatures. Choosing the number of topics is more or less guesswork, I used 50 but I would consider trying with more than that (and less). The maxFeatures limits the number of words considered when the documents are pre-processed, it's mainly to save time.


Running LDA:


(* Load Hillary Clinton's e-mails. *)
Needs["DatabaseLink`"];
JDBCDrivers["SQLite"];
data = OpenSQLConnection[
JDBC["SQLite", NotebookDirectory[] <> "database.sqlite"]];
emails = SQLSelect[data, "Emails", {"ExtractedBodyText"}];


(* Parameters. *)
topics = 50;
maxIter = 100;

(* Start progress indicator. *)
iter = 0;
Dynamic@ProgressIndicator[iter/maxIter]

(* Run LDA. *)

results = LatentDirichletAllocation[emails, topics, maxIter];

Theta and Phi will be returned from the function and stored in "results". Theta reflects how each document is made up of words from different topics (it contains the proportions). Phi measures how words are associated with different topics.


Data Analysis:


There are some typical things one can do with the results once retrieved. For example, show topics and the words that constitute them. For example, show the top 10 words for random topics:


(* Data Analysis. *)
topWords = 10;
nrDocuments = 20;

(* Extract the top words from each category. *)

topWordsTables =
Transpose@(Thread[{#[[1]], #[[2]]}] & /@ results[[2]]);
topWordsTables =
Take[Sort[#, #1[[2]] > #2[[2]] &], topWords] & /@ topWordsTables;
topWordsTablesShow =
Table[TableForm[topWordsTables[[i]],
TableHeadings -> {None, {"Category " <> ToString@i,
"Weight"}}], {i, 1, topics}];

Show random sample:



(* Show a sample of categories. *)
RandomSample[topWordsTablesShow, 3]

enter image description here


The words themselves do not always provide enough context to understand what the topic is about, it can be helpful to browse through topics and documents associated with topics:


(* Print documents with color encoding. *)
selectedDocuments =
Reverse@Ordering[#, -nrDocuments] & /@ Transpose@results[[1]];
Manipulate[
(* Selected category and document. *)

selectedText =
ToLowerCase@TextWords@emails[[selectedDocuments[[C, D]]]];
selectedWords =
Rule[#, Style[#, Blue]] & /@ topWordsTables[[C, All, 1]];
selectedText = First@(selectedText /. selectedWords);
selectedText = Flatten@Thread[{selectedText, " "}];
(* Display. *)
Column[{
topWordsTablesShow[[C]],
Row[selectedText]

}, Spacings -> 2, Dividers -> Center]
, Style["Documents from selected category:", Bold,
Medium], {{C, 1, "Category"}, 1, topics, 1}, {{D, 1, "Document"}, 1,
nrDocuments, 1}, Delimiter,
Dynamic["Category: " <> ToString@C <> " Document: " <> ToString@D <>
" (ID: " <> ToString@selectedDocuments[[C, D]] <> ")"],
TrackedSymbols :> {C, D}
]

enter image description here



The documents are sorted by how much they belong to a certain topic. The chosen document is the ninth document from topic 44. The top 10 words from topic 44 have been colored blue in the text. Note that "calls" is not blue despite "call" being the top word: I did not use stemming because I wanted the output to be easily readable (but stemming is recommended). To achieve the effect I had to ruin the formatting, the ID can be used to show the original e-mail:


(* Show email in its original form using ID. *)
emails[[6116]]

(* Out *)
{"MS \[LongDash] I've heard back from Armitage and Wexler offices. \
Here are all of your pending calls in one e-mail:
-Tony Blair: asking to speak with you btw 11AM-12 Noon tomorrow \
(Friday), or sometime Saturday morning.
- Fmr. Dep. Sec Armitage: Available at 3:15 PM today or 12 noon \

tomorrow (Friday).
- Menenedez: Available between 5-5:30 PM today, or after 7:00 PM \
today.
- Robert Wexler: Available anytime after 4:30 PM today, or anytime \
tomorrow.
Just let me know if you would like for me to lock anything in.
###"}

So what are the topics about? Topic 44 is about calling people, different sort of schedule-related issues. Other topics found were about Benghazi, Obama, and several topics focused on a particular word such as "ok", "fyi", etc. another was about Sri Lanka. Far from all topics were interesting however.


A fundamental assumption in LDA is that a document combines many topics. Plotting theta for a document is a way to visualize what topics a document belongs to and to which degree:



(* Display document as a ListPlot using ID. *)
ListPlot[results[[1, 6116]], Filling -> Axis]

enter image description here


As we already knew the document 6116 belongs mostly to topic 44, but it is also associated with topic 14. What is topic 14? It contains words such as "am", "pm", "meeting", "office". Topic 44 is about scheduling who to call, topic 14 is about scheduling where to be. That the document 6116 belongs partly to both is therefore somewhat expected!


Remember that the topics change each time the LDA is done, and that the results depend on the choice of parameters.


Comments

Popular posts from this blog

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...