
Extract information from HTML using CSS selectors?


TL;DR: Is there any way to parse HTML using CSS selectors in Mathematica, the way it is done in, for example, jQuery?





Extracting information from websites, i.e. web-scraping, in Mathematica can be time-consuming. The traditional techniques described in

are simply not enough for most serious web-scraping tasks. Since the most common technique is to import HTML as symbolic XML and then parse the XML with Cases, another user had the idea of abstracting this method into a package that turns CSS rules into patterns for parsing symbolic XML:



Although the effort is praiseworthy, his solution has a couple of drawbacks, primarily because it is only a proof of concept. Unfortunately, it would take an unreasonable amount of time to build on this start a solution that comes even close to what already exists for other languages, such as jQuery or PHP Simple HTML DOM Parser.
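For reference, the traditional Cases-based technique on symbolic XML looks roughly like the following minimal sketch. The HTML string and the example.com URLs are made up for illustration, so that the code does not depend on any live website.

```mathematica
(* A small, made-up HTML document used only for illustration *)
html = "<html><body><a href=\"http://example.com/a\">First</a><p>Text</p><a href=\"http://example.com/b\">Second</a></body></html>";

(* Import the HTML as symbolic XML (an XMLObject containing nested XMLElement expressions) *)
xml = ImportString[html, {"HTML", "XMLObject"}];

(* Pick out every a element that has an href attribute and a single text child,
   returning {link text, URL} pairs *)
Cases[
 xml,
 XMLElement["a", {___, "href" -> url_, ___}, {text_String}] :> {text, url},
 Infinity
]
```

The pattern has to spell out the element name, the attribute list and the shape of the children explicitly, which is what a CSS selector such as a[href] would express in a few characters.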


Is there any way we can get comparable functionality in Mathematica?


The questions on this site alone show that there is a demand for a solution to this problem. A solution would make it possible to provide elegant answers to at least the following questions:



There is also this question which demonstrates Leonid Shifrin's HTML parser. It could also have been avoided by starting from a jQuery-like HTML parser.




Answer




This answer pertains to the original release of jSoupLink. The interface changed completely in a later version. Please see the GitHub page for the current interface.


=================================


As much as I would like to see a solution to this problem written in Mathematica, this is very unlikely given the scope of the problem. I would like to share a way to solve this using JLink, in the hope that it may help someone.


JLink, for those who don't know, is a package that comes with Mathematica. It allows you to execute Java code from within Mathematica. This means you can use any Java library out there to solve your problems without leaving the notebook interface. For this particular problem I will use jSoup, which is a parser just like the ones mentioned in the question.
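As a minimal sketch of what calling Java from Mathematica looks like, the following uses only classes that ship with Java itself, so no extra library is needed. It is not part of jSoupLink; it only illustrates the general calling convention that any Java library on the class path would follow.

```mathematica
Needs["JLink`"];
InstallJava[]; (* starts the Java runtime if it is not already running *)

(* Create a Java object and call its methods with the obj@method[args] syntax *)
builder = JavaNew["java.lang.StringBuilder", "jsoup"];
builder@reverse[];
builder@toString[] (* a Java String comes back as an ordinary Mathematica string *)

(* Static methods become available in a context named after the class once it is loaded *)
LoadJavaClass["java.lang.Runtime"];
Runtime`getRuntime[]@availableProcessors[]
```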



You can download the latest version as a zip file from here.


It is important that the files are kept in the correct folder, otherwise Mathematica will not be able to locate the Java files. Therefore, to install the package start by evaluating


FileNameJoin[{$UserBaseDirectory, "Applications"}]


in Mathematica and unzip the zip file you downloaded into this folder. Then use Needs["jSoupLink`"] to load the package.
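The installation can also be done from within Mathematica. The following is only a sketch: the location and name of the downloaded zip file (zipFile) are assumptions and need to be adjusted to wherever the file was actually saved.

```mathematica
(* Folder in which Mathematica looks for user-installed packages *)
appDir = FileNameJoin[{$UserBaseDirectory, "Applications"}];

(* Hypothetical path to the downloaded archive; change this to the real location *)
zipFile = FileNameJoin[{$HomeDirectory, "Downloads", "jSoupLink.zip"}];

(* Unpack the archive into the Applications folder if the file is there *)
If[FileExistsQ[zipFile], ExtractArchive[zipFile, appDir]];

(* List what ended up in the folder, then load the package *)
FileNames["jSoup*", appDir]
Needs["jSoupLink`"]
```

If Needs reports that the context cannot be found, the files are most likely not in the folder that the first line returns.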



The package contains three functions: ParseHTML, ParseHTMLString and ParseHTMLFragment. Some information about these is contained in their usage messages, which, if you have loaded the package, you can view using for example


?jSoupLink`ParseHTML

Typically you will use ParseHTML to download HTML source code from a website and then select a few elements. From these elements you will then extract some data. The general syntax is like this:


jSoupLink`ParseHTML[
 website address,
 CSS selector,
 data elements to extract
]

website address is any URL, for example http://mathematica.stackexchange.com. CSS selector is basically any valid CSS3 selector; there is a list of CSS3 selectors in jSoup's documentation. Data elements to extract can be almost anything contained in the elements that you've selected. Most commonly you'll want to extract attributes, such as src if you've selected img elements or href if you've selected links (a elements). There are also a few keywords that aren't attributes, such as text to select the text contained in a selected element (some text in <p>some text</p>) or html to select the HTML contained in a selected element. You can glean the complete list from the package source code and look the keywords up in jSoup's documentation if you're not sure what they are.



Selecting images from Wikipedia


urls = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/Sweden", (* URL *)
   "table.infobox img", (* CSS selector *)
   "src" (* Attribute to retrieve *)
   ];
Partition[Import /@ urls, 2] // Grid

Example images from Wikipedia


Selecting headlines (both text and URL) from NYT


headlines = Rest@jSoupLink`ParseHTML[
"http://www.nytimes.com/pages/politics/index.html",
"h2 a, h3 a",
{"text", "href"}
];

Take[headlines, 5] // TableForm

NYT headlines


Building a database with information about Swedish municipalities, using data on Wikipedia


headers = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable th",
   "text"
   ];
headers = StringReplace[#, "(" ~~ __ ~~ ")" -> ""] & /@ headers; (* Remove units *)
headers = StringReplace[#, WordBoundary ~~ x_ :> ToUpperCase[x]] & /@ headers; (* Capitalize *)
headers = StringReplace[#, " " -> ""] & /@ headers; (* Remove spaces *)

municipalities = jSoupLink`ParseHTML[
   "http://en.wikipedia.org/wiki/List_of_municipalities_of_Sweden",
   "table.wikitable.sortable td",
   "text"
   ];
municipalities = Partition[municipalities, 9];

ds = Dataset@Composition[
     Map[AssociationThread],
     Map[(headers -> #) &]
     ][municipalities];

Now, if you want to select all municipalities that belong to Västra Götaland County, you just have to type


ds[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal


{"Ale Municipality", "Alingsås Municipality", "Bengtsfors Municipality", "Bollebygd Municipality", ...
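To see how the headers and rows combine into a Dataset without downloading anything, here is a self-contained sketch on a toy table. The names toyHeaders, toyRows and toy are made up for this illustration, and the three rows are a hand-picked stand-in for the scraped table, not its actual contents.

```mathematica
(* Toy stand-ins for the scraped headers and the partitioned table rows *)
toyHeaders = {"Municipality", "County"};
toyRows = {
   {"Ale Municipality", "Västra Götaland County"},
   {"Alingsås Municipality", "Västra Götaland County"},
   {"Stockholm Municipality", "Stockholm County"}
   };

(* Each row becomes an association keyed by the headers; the list of associations becomes a Dataset *)
toy = Dataset[AssociationThread[toyHeaders -> #] & /@ toyRows];

(* The same kind of query as above: municipalities in one county *)
toy[Select[#County == "Västra Götaland County" &], "Municipality"] // Normal

(* Another query: number of municipalities per county *)
toy[Counts, "County"] // Normal
```

Because the cleaned-up headers contain no spaces, they can be used directly as slot names such as #County inside queries.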



