Extract information from HTML using CSS selectors?

TL;DR Is there any way that we can parse HTML using CSS selectors in Mathematica, the way it is done in for example jQuery?

Extracting information from websites, i.e. web-scraping, in Mathematica can be time-consuming. The traditional techniques described in

are simply not enough for most serious web-scraping tasks. Since the most common technique is to import HTML as symbolic XML and then parse the XML with Cases another user had the idea to abstract this method into a package that would turn CSS rules into patterns that can parse symbolic XML:

Although the effort is praiseworthy there are a couple of drawbacks with his solution, primarily because it is only a proof-of-concept. Unfortunately it would take an unreasonable amount of time to build a solution based on this start that is even close to as good as what is already out there for other langauges, such as jQuery or PHP Simple HTML DOM parser.

Is there any way we can get comparable functionality in Mathematica?

The questions on this site alone show that there is a demand for a solution to this problem. A solution would make it possible to provide elegant answers to at least the following questions:

There is also this question which demonstrates Leonid Shifrin's HTML parser. It could also have been avoided by starting from a jQuery-like HTML parser.

Answer

This answer pertains to the original release of jsoupLink. The interface changed completely in a later version. Please see the Github page for the current interface.

=================================

As much as I would like to see a solution to this problem written in Mathematica, this is very unlikely given the scope of the problem. I would like to share a way to solve this using JLink, in the hope that it may help someone.

JLink, for those who don't know, is a package that comes with Mathematica. It allows you to execute Java code from within Mathematica. This means you can use any Java library out there to solve your problems without leaving the notebook interface. For this particular problem I will use jSoup, which is a parser just like the ones mentioned in the question.

You can download the latest version as a zip file from here.

It is important that the files are kept in the correct folder, otherwise Mathematica will not be able to locate the Java files. Therefore, to install the package start by evaluating

FileNameJoin[{$UserBaseDirectory, "Applications"}]

in Mathematica and unzip the zip file you downloaded into this folder. Then use Needs["`jSoupLink`"] to load the package.

The package contains three functions: ParseHTML, ParseHTMLString and ParseHTMLFragment. Some information about these is contained in their usage messages, which, if you have loaded the package, you can view using for example

?jSoupLink`ParseHTML

Typically you will use ParseHTML to download HTML source code from a website and then select a few elements. From these elements you will then extract some data. The general syntax is like this:

jSoupLink`ParseHTML[
website address,
CSS selector,

data elements to extract
]

website address is any URL, for example http://mathematica.stackexchange.com. CSS selector is basically any valid CSS3 selector. There is a list of CSS3 selector in jSoup's documentation. Data elements to extract can be almost anything contained by the elements that you've selected. Most commonly you'll want to extract attributes such as src if you've selected img elements or href if you've selected links (a elements). There are a few keywords that aren't attributes such as text to select the text contained by a selected element (some text in

some text

) or html to select the HTML contained by a selected element. You can glean the complete list from the package source code, and look them up in jSoup's documentation if you're not sure what they are.

Selecting images from Wikipedia

urls = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/Sweden", (* URL *)
   "table.infobox img", (* CSS selector *)
   "src" (* Attribute to retrieve *)

   ];
Partition[Import /@ urls, 2] // Grid

Example images from Wikipedia

Select headlines (both text and URL) from NYT

headlines = Rest@jSoupLink`ParseHTML[
    "http://www.nytimes.com/pages/politics/index.html",
    "h2 a, h3 a",
    {"text", "href"}
    ];

Take[headlines, 5] // TableForm

NYT headlines

Build a database with information about Swedish municipalities, using data on Wikipedia

headers = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable th",
   "text"
   ];
headers = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ headers; (* Remove units *)

headers = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ headers; (* Capitalize *)
headers = StringReplace[#, " " -> ""] & /@ headers;(* Remove spaces *)

municipalities = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable td",
   "text"
   ];
municipalities = Partition[municipalities, 9];


ds = Dataset@Composition[
     Map[AssociationThread],
     Map[(headers -> #) &]
     ][municipalities];

Now if you want to select all municipalities that belong to the county Västra Götaland you just have to type

ds[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal

{"Ale Municipality", "Alingsås Municipality", "Bengtsfors \ Municipality", "Bollebygd Municipality", ...

Blog

Search This Blog

Extract information from HTML using CSS selectors?

Comments

Post a Comment

Popular posts from this blog

front end - keyboard shortcut to invoke Insert new matrix

How to thread a list

functions - Get leading series expansion term?