Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]

Here’s some fun valid XML for you:

    <!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
    <x>
        <a b="&y;>" />
        <![CDATA[[a>b <a>b <a]]>
        <?x <a> <!-- <b> ?> c --> d
    </x>
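
To make the failure concrete, here is a minimal sketch in Python that runs a naive tag-matching regex and a real XML parser over the snippet above, wrapped in the root element `x` that its DOCTYPE names. The regex is only one plausible naive attempt, and the variable names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The XML from above, inside the root element <x> that its DOCTYPE names.
doc = '''<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>'''

# Naive approach (an assumption about what a first regex might look like):
# treat every "<name ...>" as a start tag.
naive_tags = re.findall(r"<(\w+)[^>]*>", doc)

# Parser approach: let expat (via ElementTree) tokenize the document.
root = ET.fromstring(doc)
real_tags = [elem.tag for elem in root.iter()]
attr_value = root.find("a").get("b")

print(naive_tags)   # tag names the regex believes it saw
print(real_tags)    # elements that actually exist
print(attr_value)   # attribute value after entity expansion
```

The regex reports six names (`x`, `a`, `a`, `a`, `a`, `b`): it counts the `<a>` and `<b>` look-alikes inside the CDATA section and the processing instruction, and it cuts the real `<a>` tag short at the `>` inside its attribute value. The parser sees only `x` and `a`, and expands the attribute to `a]>b>`.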

And this little bundle of joy is valid HTML:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "" [
    <!ENTITY % e "href=""">
    <!ENTITY e "<a %e;>">
    ]>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!
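
For a taste of what even a lenient tokenizer has to cope with, here is a small Python sketch using the standard library's `html.parser` on one tag from the HTML above. It is not an SGML parser and not a browser-grade tree builder, so it will not handle the whole example. `TagCollector` is a hypothetical helper, and the regex is just one guess at a naive attribute pattern.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


# Unquoted value, spaces around "=", and a valueless attribute.
markup = "<p id  =  a:b center>text</p>"

# Naive attempt: only name="value" pairs count as attributes.
naive_attrs = re.findall(r'(\w+)="([^"]*)"', markup)

collector = TagCollector()
collector.feed(markup)
collector.close()

print(naive_attrs)      # what the regex found
print(collector.tags)   # what the tokenizer found
```

The regex finds no attributes at all, while the tokenizer reports `id` with the value `a:b` and `center` with no value (`None`).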

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

