
xml - Problem importing some special characters in HTML


The content of the HTML file is as follows:





<p>被尊称为&ldquo;高贵的&rdquo;</p>



method1:


Import["C:\\Users\\HyperGroups\\Desktop\\test.html", "XML"]

XMLParserXMLGet::nfprserr: Entity 'ldquo' was not found at Line: 2 Character: 12 in C:\Users\HyperGroups\Desktop\test.html. >>


XMLParserXMLGet::nfprserr: Entity 'rdquo' was not found at Line: 2 Character: 22 in C:\Users\HyperGroups\Desktop\test.html. >>



(*
XMLObject[Document][{},XMLElement[p,{},{被尊称为\[EntityStart]ldquo\[EntityEnd]高贵的\[EntityStart]rdquo\[EntityEnd]}],{}]
*)

method2:


Import["C:\\Users\\HyperGroups\\Desktop\\test.html", "XMLObject"]

(*
XMLObject[Document][{XMLObject[Declaration][Version->1.0,Standalone->yes]},
 XMLElement[html,{version->-//W3C//DTD HTML 4.01 Transitional//EN,
   {http://www.w3.org/2000/xmlns/,xmlns}->http://www.w3.org/1999/xhtml},
  {XMLElement[body,{},{XMLElement[p,{},{被尊称为"高贵的"}]}]}],{}]
*)

Why does method1 generate those messages, and how can I avoid them and get a result like the one from method2?
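
A possible explanation, offered as a sketch rather than a confirmed answer: XML itself predefines only five entities (lt, gt, amp, apos, quot), so a strict XML parser has no definition for HTML entities such as ldquo and rdquo unless a DTD supplies them. My assumption is that "XML" forces the file through the XML importer, while "XMLObject" on a .html file is treated as an element of the HTML format, whose importer knows the HTML entities. If that assumption holds, naming the format explicitly should give the same result as method2:

```mathematica
(* Same file as in the question *)
file = "C:\\Users\\HyperGroups\\Desktop\\test.html";

(* Assumption: requesting the "XMLObject" element of the "HTML" format
   routes the file through the HTML importer, which resolves
   &ldquo; and &rdquo; instead of reporting them as unknown entities *)
Import[file, {"HTML", "XMLObject"}]
```

The {"HTML", "XMLObject"} form only makes the format choice explicit instead of relying on the file extension; it has not been verified against this particular file.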



