Bug introduced in 8 or earlier and persisting through 11.1
When I try to download some files from the web, I sometimes get an error like this:
links = {"http://dip21.bundestag.de/dip21/btd/18/076/1807620.pdf",
"http://epaper.das-parlament.de/epaper/ausgabe.pdf",
"http://gemeindebund.at/images/uploads/downloads/2014/Publikationen/RFGs/2010/RFG_4-2010_-_E-Government_in_Gemeinden_(PDF__2_4MB).pdf",
"http://smart-government.eu/wp-content/uploads/2013/02/PB_IRML-SmartSchool.pdf",
"http://subs.emis.de/LNI/Proceedings/Proceedings261/113.pdf",
"http://subs.emis.de/LNI/Proceedings/Proceedings261/163.pdf",
"http://subs.emis.de/LNI/Proceedings/Proceedings261/51.pdf",
"http://subs.emis.de/LNI/Proceedings/Proceedings261/P-261.pdf",
"https://www.bundes-sgk.de/system/files/documents/impulse2_nov2015_final.pdf",
"https://www.demo-online.de/system/files/demo_05_06_ansicht_pdf_klein.pdf",
"https://www.gruene-bw.de/app/uploads/2016/05/GrueneBW-Koalitionsvertrag-2016-Entwurf.pdf",
"https://www.itdz-berlin.de/dokumente/splitter/splitter_2005_1.pdf",
"https://www.normenkontrollrat.bund.de/Webs/NKR/Content/DE/Publikationen/Jahresberichte/2014-10-01_nkr_jahresbericht_2014.pdf?__blob=publicationFile&v=1",
"http://www.beamten-informationen.de/media/pdf/Beamten_Magazin_2008_04.pdf",
"http://www.dgb.de/themen/++co++541b3e54-263c-11e5-9b43-52540023ef1a",
"http://www.staedtetag-nrw.de/imperia/md/content/stnrw/internet/2_fachinformationen/2008/ag_4_geoportale_formatiert_korr_kes.pdf",
"http://www.vitako.de/Themen%20Dokumente/Vitako_aktuell_01-16.pdf"};
res = Import[#, "Plaintext"] & /@ links
LinkObject::linkd: Unable to communicate with closed link LinkObject["C:\Program Files\Wolfram Research\Mathematica\11.0\SystemFiles\Converters\Binaries\Windows-x86-64\PDF.exe",2011,5].
What does this error mean and how can I cleanly handle it (without keeping a PDF.exe in memory)?
Answer
Just for reference, this is my workaround:
importPatched[url_, elements_String] := importPatched[url, {elements}];
importPatched[url_List, elements_List] := Dataset[importPatched[#, elements] & /@ url];
importPatched[url_String, elements_List] := Module[{fn, pdftohtml, dlResult, pdfQ, s, elems, result, evData, i},
pdftohtml = "pdftohtml.exe"; (* edit path if necessary *)
fn = CreateFile[];
result = Prepend[Association[Rule[#, Missing[]] & /@ elements], Rule["Link", url]];
(* download *)
evData = EvaluationData[URLDownload[url, fn, {"StatusCode", "ContentType"}]];
If[Not[evData["Success"]],
Quiet[DeleteFile[fn]]; (* do not leave the temporary file behind *)
Return[Prepend[result, "Status" -> "URLDownload Error " <> StringRiffle[ToString /@ evData["Messages"], ", "]]]];
dlResult = evData["Result"];
If[dlResult["StatusCode"] =!= 200,
Quiet[DeleteFile[fn]];
Return[Prepend[result, "Status" -> "HTTP Error " <> ToString[dlResult["StatusCode"]]]]; (* the status code is an integer *)
];
(* convert PDF to html *)
pdfQ = AnyTrue[{"application/acrobat", "application/pdf", "application/vnd.pdf", "application/x-pdf", "text/pdf", "text/x-pdf"},
StringContainsQ[dlResult["ContentType"], Verbatim[#], IgnoreCase -> True]& ];
s = If[pdfQ,
RenameFile[fn, fn <> ".pdf"];
fn = fn <> ".pdf";
evData = EvaluationData[StringJoin@ReadList["!" <> pdftohtml <> " -stdout " <> fn, Character]];
If[Not[evData["Success"]], Return[Prepend[result, "Status" -> "pdftohtml Error"]]];
evData["Result"]
, (* else *)
ReadString[fn]
];
Quiet[DeleteFile[fn]];
(* read requested elements *)
elems = Intersection[elements, ImportString[s, "Elements"]];
i = 1;
While[i <= Length[elems],
evData = EvaluationData[ImportString[s, elems[[i]]]];
If[Not[evData["Success"]], Return[Prepend[result, "Status" -> "Import Error " <> StringRiffle[ToString /@ evData["Messages"], ", "]]]];
result[elems[[i]]] = evData["Result"];
i++
];
Return[Prepend[result, "Status" -> "Success"]];
];
ans = importPatched[links, {"Title", "Hyperlinks"}]
It produces more or less reasonable results on all the links above. It relies on the great pdftohtml tool. The function downloads each link, checks whether the response is a PDF (in which case the file is converted to HTML), and then calls ImportString on the result.
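The conversion step might also be written with RunProcess, which takes the command as a list of arguments, so a temporary file path containing spaces needs no extra quoting. This is only a sketch: convertPdf is a hypothetical helper name that is not part of the workaround above, and it assumes that pdftohtml exits with code 0 on success and writes the HTML to standard output when called with -stdout, as in the code above.

```mathematica
(* hypothetical helper, not part of the original workaround *)
convertPdf[pdftohtml_String, file_String] := Module[{proc},
  (* RunProcess returns an Association with the keys "ExitCode",
     "StandardOutput" and "StandardError", or $Failed if the
     program cannot be started *)
  proc = Quiet[RunProcess[{pdftohtml, "-stdout", file}]];
  If[AssociationQ[proc] && proc["ExitCode"] === 0,
    proc["StandardOutput"],
    $Failed (* assumption: a nonzero exit code means the conversion failed *)
  ]
]
```

With this helper, the PDF branch could test for $Failed instead of wrapping the ReadList call in EvaluationData.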
For calling pdftohtml I asked a separate question and got help quickly.
Any hints on bugs or possible improvements to this code are very welcome! In particular, it would be interesting to know how to include pdftohtml.exe in a subdirectory of a Mathematica package.
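On the packaging question, one idiom that might work is to resolve the executable relative to the package file while the package is being loaded. A minimal sketch, assuming a hypothetical layout in which pdftohtml.exe sits in a Binaries subdirectory next to the package's .m file; the names packageDir and pdftohtmlPath are illustrative only:

```mathematica
(* meant to be evaluated while the package file is being loaded with
   Get or Needs; $InputFileName is an empty string when this is
   evaluated interactively, so the path is only meaningful at load time *)
packageDir = DirectoryName[$InputFileName];

(* "Binaries" is an assumed directory name, not a required convention *)
pdftohtmlPath = FileNameJoin[{packageDir, "Binaries", "pdftohtml.exe"}];
```

The resulting pdftohtmlPath could then replace the hard-coded "pdftohtml.exe" string inside importPatched.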