interpreter - Improving semantic interpretation of dates


SemanticInterpretation and Interpreter are promising but imperfect, for example when extracting dates and calendar expressions from semi-structured documents.


Consider this sample of 92 items extracted from legal documents by a team at USDOJ that develops regexes; dates are a major challenge for them.


doj92[0] = {"February 2006","July 2006 - December 2006","December, 2006","2/1/07 - 2/28/07","Jan. 1-31, 2007","February 1-28, 2007","1/1/07 through 1/31/07","01-31-07","DECEMBER 1 TO DECEMBER 31, 2007","DECEMBER 30, 2006 TO JANUARY 31, 2007","Jan/Feb 2007","Mar. 21 through 31, 2007","Jan 3, 2007","First Quarter 2007","Nov & Dec 2006","DECEMBER ----- 2006","JANUARY -- 2007","March-2007","11/2007","7/07","12-06","JULY, AUGUST, SEPTEMBER, OCTOBER-2006","3/1/07 to 3/31/07","July 1, 2006 - Sept. 30, 2006","Oct. 1,2006 - Dec. 31,2006","Jan 1, 2003 - Mar 31, 2003","May 1 2007 To May 31 2007","APRIL 16TH, THROUGH APRIL 30TH, 2007","May 9th, through May 31st, 2007","May 31,07","Month of July","(Month Ending)August","July 2006, August 2006 and April 2007","Aug. 1 Through Aug. 31, 2007","8/1/ 07 - 8/31/ 07","May 30 - June 2007","6/1/30/2007 to 6/30/2007","080107 - 083107","Januanry 1, 2007 - Janurary 31, 2007","June 1, 2007 - June 31, 2007","September 31, 2007","September2007","Sep 07","July 20--31, 2007","August (2007) ","09 01 07 to 09 30 07","04-16-07/09-30-07","October, 2006 and February, 2007","5/25 - 5/31 2007 May","6/1 -6/30 2007","Sept.30,2007","Sept. 19 - Sept. 30","May 30 - June 2007","October 1st to October 9th","10/31/'07","8/9/078/31/07","10/1-07-10/31/07","11//14/07","10/1/7 to 10/31/7","Octobeer 2007","Seotember 2007","Septmber 2007","September 1, 2007 to Setember 30, 2007","July through September 2007 ","3Q07","0/30/2007","Augusut 2007","September, 30 2006","3rdQuater,2007","Beginning 08012007 and Ending 08312007","4th Quarterly Report 2006","Beginning 110107 and Ending 113007","10/01/07 to 10/310/07","JULY, AUGUST, SEPTEMBER, OCTUBER 2007","11.07","Nov.2007","Novermber 2007","November 15, 2007 to Decmeber 15, 2007","093007 ","Novemer 2007","December 1, to 31, 2007","December 1,2007 through Deceber 31, 2007","Decmber 2007","DCEMBER 2008","12/1-31/07","Beginning 10/1/2007 and Ending 10/31/2007","Third Qtr 2007","November 2007, December 2007 & January 2008","12/01/207 to 12/31/2007","October 2007 - December 2007","January 16th through 31st, 2008","010108 to 013108"} // MapIndexed[<|"num" -> First@#2, "text" -> #1|> &] // Dataset;


Comparison table shown as compact screenshots:


doj92[1] = 
doj92[0][All,
Append[#, <| "SemanticInterp" -> SemanticInterpretation[#text],
"Interp/Date" -> Interpreter["Date"][#text],
"Interp/Computed" -> Interpreter["ComputedDate"][#text]|>] &];

Although SemanticInterpretation works better than Interpreter["Date"] and even Interpreter["ComputedDate"], notably on calendar expressions that yield date intervals, it fails or yields incorrect or ambiguous interpretations for 42 of the 92 items (46%):


doj92[1][{11, 15, 16, 17, 19, 20, 21, 22, 28, 29, 31, 32, 33, 36, 37,
38, 41, 43, 44, 47, 48, 49, 56, 57, 58, 59, 65, 66, 69, 70, 71, 72,
73, 74, 75, 79, 81, 85, 86, 88, 91, 92}]

[Three screenshots of the comparison table for these 42 items]


This small-scale test based on real-world data shows that significant pre-processing/normalization of the text is needed. Interpreter does accept pattern-matching options, but its failure rate is much higher (with a few exceptions, such as items 19-21).
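To make the kind of normalization meant here concrete, the following is a minimal sketch, not a tested solution. The names monthNames, fixMonth and normalizeDateText are hypothetical helpers, and the thresholds (at least 5 letters, at most 2 edits) are guesses that would need tuning. It only collapses repeated separators (as in items 16, 17, 44 and 58) and repairs misspelled month names (as in item 74); whether the cleaned strings are then interpreted correctly still has to be checked.

```mathematica
(* Reference spellings; all helper names here are hypothetical, not built-ins *)
monthNames = {"January", "February", "March", "April", "May", "June",
   "July", "August", "September", "October", "November", "December"};

(* Return the closest month name if the word matches it exactly (ignoring case),
   or if the word has at least 5 letters and is within 2 single-character edits;
   otherwise return the word unchanged. Thresholds are assumptions. *)
fixMonth[word_String] :=
 Module[{best, d},
  best = First@MinimalBy[monthNames,
     EditDistance[#, word, IgnoreCase -> True] &];
  d = EditDistance[best, word, IgnoreCase -> True];
  If[d == 0 || (StringLength[word] >= 5 && d <= 2), best, word]]

(* Trim, collapse runs of spaces, hyphens and slashes, then repair month words *)
normalizeDateText[s_String] :=
 StringReplace[
  StringReplace[StringTrim[s],
   {(" " ..) -> " ", ("-" ..) -> "-", ("/" ..) -> "/"}],
  w : (LetterCharacter ..) :> fixMonth[w]]

normalizeDateText["DECEMBER ----- 2006"]  (* "December - 2006" *)
normalizeDateText["11//14/07"]            (* "11/14/07" *)
normalizeDateText["JULY, AUGUST, SEPTEMBER, OCTUBER 2007"]
(* "July, August, September, October 2007" *)
```

The normalized text could be added as a further column, for instance doj92[0][All, Append[#, "normalized" -> normalizeDateText[#text]] &], and then passed to SemanticInterpretation in the same way as the raw text.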


Are there basic pre-processing steps or options for these methods that could improve performance and generalize? Or are large-scale tests needed to validate these tools?
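For larger tests, at least the outright failures could be tallied automatically. The following is a rough sketch under stated assumptions: dateLikeQ and successRate are hypothetical names, and a result is assumed to be usable when it is a DateObject, a DateInterval (only in versions that provide it), or a list of DateObjects. Interpretations that are date-like but wrong or ambiguous are not detected this way and would still need manual review.

```mathematica
(* Hypothetical helper: True when a result looks like a date or a date interval *)
dateLikeQ[r_] := MatchQ[r, _DateObject | _DateInterval | {__DateObject}]

(* Fraction of inputs for which the function f returns something date-like *)
successRate[f_, texts_List] :=
 N[Count[f /@ texts, _?dateLikeQ]/Length[texts]]

(* Possible usage on the sample above *)
texts = Normal[doj92[0][All, "text"]];
successRate[SemanticInterpretation, texts]
successRate[Interpreter["Date"], texts]
successRate[Interpreter["ComputedDate"], texts]
```

Comparing these rates before and after a pre-processing step gives a quick, if coarse, signal of whether that step helps.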



