TL;DR Is there any way that we can parse HTML using CSS selectors in Mathematica, the way it is done in for example jQuery?
Extracting information from websites, i.e. web-scraping, in Mathematica can be time-consuming. The traditional techniques described in
are simply not enough for most serious web-scraping tasks. Since the most common technique is to import HTML as symbolic XML and then parse the XML with Cases
another user had the idea to abstract this method into a package that would turn CSS rules into patterns that can parse symbolic XML:
Although the effort is praiseworthy there are a couple of drawbacks with his solution, primarily because it is only a proof-of-concept. Unfortunately it would take an unreasonable amount of time to build a solution based on this start that is even close to as good as what is already out there for other langauges, such as jQuery or PHP Simple HTML DOM parser.
Is there any way we can get comparable functionality in Mathematica?
The questions on this site alone show that there is a demand for a solution to this problem. A solution would make it possible to provide elegant answers to at least the following questions:
There is also this question which demonstrates Leonid Shifrin's HTML parser. It could also have been avoided by starting from a jQuery-like HTML parser.
Answer
This answer pertains to the original release of jsoupLink. The interface changed completely in a later version. Please see the Github page for the current interface.
=================================
As much as I would like to see a solution to this problem written in Mathematica, this is very unlikely given the scope of the problem. I would like to share a way to solve this using JLink, in the hope that it may help someone.
JLink, for those who don't know, is a package that comes with Mathematica. It allows you to execute Java code from within Mathematica. This means you can use any Java library out there to solve your problems without leaving the notebook interface. For this particular problem I will use jSoup, which is a parser just like the ones mentioned in the question.
You can download the latest version as a zip file from here.
It is important that the files are kept in the correct folder, otherwise Mathematica will not be able to locate the Java files. Therefore, to install the package start by evaluating
FileNameJoin[{$UserBaseDirectory, "Applications"}]
in Mathematica and unzip the zip file you downloaded into this folder. Then use Needs["`jSoupLink`"]
to load the package.
The package contains three functions: ParseHTML
, ParseHTMLString
and ParseHTMLFragment
. Some information about these is contained in their usage messages, which, if you have loaded the package, you can view using for example
?jSoupLink`ParseHTML
Typically you will use ParseHTML
to download HTML source code from a website and then select a few elements. From these elements you will then extract some data. The general syntax is like this:
jSoupLink`ParseHTML[
website address,
CSS selector,
data elements to extract
]
some textwebsite address
is any URL, for example http://mathematica.stackexchange.com
. CSS selector
is basically any valid CSS3 selector. There is a list of CSS3 selector in jSoup's documentation. Data elements to extract
can be almost anything contained by the elements that you've selected. Most commonly you'll want to extract attributes such as src
if you've selected img
elements or href
if you've selected links (a
elements). There are a few keywords that aren't attributes such as text
to select the text contained by a selected element (some text in
) or html
to select the HTML contained by a selected element. You can glean the complete list from the package source code, and look them up in jSoup's documentation if you're not sure what they are.
Selecting images from Wikipedia
urls = jSoupLink`ParseHTML[
"http://en.wikipedia.org/wiki/Sweden", (* URL *)
"table.infobox img", (* CSS selector *)
"src" (* Attribute to retrieve *)
];
Partition[Import /@ urls, 2] // Grid
Select headlines (both text and URL) from NYT
headlines = Rest@jSoupLink`ParseHTML[
"http://www.nytimes.com/pages/politics/index.html",
"h2 a, h3 a",
{"text", "href"}
];
Take[headlines, 5] // TableForm
Build a database with information about Swedish municipalities, using data on Wikipedia
headers = jSoupLink`ParseHTML[
"http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
"table.wikitable.sortable th",
"text"
];
headers = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ headers; (* Remove units *)
headers = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ headers; (* Capitalize *)
headers = StringReplace[#, " " -> ""] & /@ headers;(* Remove spaces *)
municipalities = jSoupLink`ParseHTML[
"http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
"table.wikitable.sortable td",
"text"
];
municipalities = Partition[municipalities, 9];
ds = Dataset@Composition[
Map[AssociationThread],
Map[(headers -> #) &]
][municipalities];
Now if you want to select all municipalities that belong to the county Västra Götaland you just have to type
ds[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal
{"Ale Municipality", "Alingsås Municipality", "Bengtsfors \ Municipality", "Bollebygd Municipality", ...
Comments
Post a Comment