Short version: How can I convert an XMLElement that represents part of an HTML document to plain text?
Long version: The more general problem is extracting information from webpages. We can import the webpage as an XMLObject and extract the relevant part. But this may still be a complex expression of many nested XMLElements (several paragraphs, links, emphasis, etc.), while I'm typically only interested in the text.
Let's take a random example: extracting the text from this article. Using the developer tools of any modern browser, it's easy to find out that the relevant part is in a div with id="article-body-blocks". So we do
page = Import[
"http://www.guardian.co.uk/science/blog/2012/nov/13/science-enforced-humility",
"XMLObject"];
body = Cases[page,
XMLElement["div", {"id" -> "article-body-blocks"}, ___], Infinity];
The body is still a compound expression. Is there a built-in, direct way to extract the text?
My workaround is
ImportString[ExportString[First@body, "XML"], "HTML"]
but this is a hack (i.e. fragile, likely to break in future versions or with an input I didn't anticipate). Is there anything specifically meant for this purpose?
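For comparison, here is a minimal sketch of a pattern-based alternative that walks the element directly instead of round-tripping through a string. The name xmlText is a hypothetical helper of my own, not a built-in, and it assumes the usual XMLElement[tag, attributes, children] layout, where text appears as plain strings in the child list.

```mathematica
(* Hypothetical helper, not a built-in: keep only the string leaves of the child lists *)
xmlText[XMLElement[_, _, children_List]] := StringJoin[xmlText /@ children]
xmlText[s_String] := s
xmlText[_] := ""  (* ignore anything that is neither an element nor a string *)

(* body is the list returned by Cases above *)
xmlText[First@body]
```

Because it only descends into the third argument, attribute values such as link targets are left out. It also joins the strings with no separator, so paragraph breaks are lost unless the source already contains whitespace between blocks. So this is not a full replacement for the HTML import either.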