[Catalyst] Alien::Dojo uses regexes to parse HTML, so what?
Matt S Trout
dbix-class at trout.me.uk
Mon May 29 20:13:23 CEST 2006
Dominique Quatravaux wrote:
> phaylon wrote:
>
>> Dominique Quatravaux said:
>>
>>> I rest my case, unless someone can provide compelling reasons for
>>> avoiding regexes *in general* for this task.
>>
>> mst gave only one to demonstrate the whole problem. It's like a
>> big, lightsucking black hole.
>
> No it's not. We are not trying to address the problem of parsing HTML
> in general, we are trying to address the problem of parsing *one
> single page*. Since I apparently have to be that explicit to make my
> point, consider
>
> my ($url) = qr{<a [^>]+
> href="(http://download.dojotoolkit.org/release[^"]+)"}sx
>
> or even
>
> my ($url) = qr{href="(http://download.dojotoolkit.org/release[^"]+)"}sx
<a href="http://download.dojotoolkit.org/release-notes.txt">
Congratulations, you're toast.
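To make the failure concrete, here is a minimal sketch of the second quoted pattern applied with a match operator to a made-up page. The page content and the tarball URL are assumptions for illustration only; the release-notes link is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical page: the release-notes link from above, followed by
# an invented tarball link standing in for the real download.
my $html = <<'HTML';
<a href="http://download.dojotoolkit.org/release-notes.txt">Release notes</a>
<a href="http://download.dojotoolkit.org/release-0.3.0/dojo-0.3.0-ajax.tar.gz">Download</a>
HTML

# The quoted pattern, used as a match in list context so that
# the first capture group lands in $url.
my ($url) = $html =~ m{href="(http://download.dojotoolkit.org/release[^"]+)"}sx;

# The first href starting with ".../release" wins, and that is the
# release-notes file, not the tarball.
print "$url\n";
```

The pattern says nothing about what kind of file follows "release", so whichever matching link appears first on the page is what gets downloaded.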
> and pray tell me what's wrong with those. HTML is a *text* language,
> for chrissake, it was designed *purposefully* so that I am able to do
> that sort of thing.
Regexps are a hammer. This problem is not a nail.
Get a canonical address from the dojo maintainers, or at the very least
consider a lightweight SGMLish parsing job. Regexps are only sane for
hacky one-off scripts, and do not belong on something you're promoting
as a viable CPAN module - or at least certainly not for production use.
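
As a rough sketch of what a lightweight parsing job could look like, the following uses HTML::LinkExtor (part of the HTML-Parser distribution on CPAN, assumed to be installed) to collect the href of every <a> tag and then filters the results. The sample page, the release_links helper and the URL filter are all hypothetical; the filter in particular still guesses at the download layout, which is why a canonical address from the maintainers remains the better fix.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # CPAN module from HTML-Parser, not core Perl

# Hypothetical helper: return the href of every <a> tag whose
# target passes the caller-supplied $wanted filter.
sub release_links {
    my ($html, $wanted) = @_;

    # With no callback, the parser collects links internally.
    my $parser = HTML::LinkExtor->new;
    $parser->parse($html);
    $parser->eof;

    my @urls;
    for my $link ($parser->links) {
        # Each entry is [ tag, attr => url, ... ].
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        push @urls, "$attr{href}" if $wanted->("$attr{href}");
    }
    return @urls;
}

# Made-up page content, for illustration only.
my $html = <<'HTML';
<a href="http://download.dojotoolkit.org/release-notes.txt">Release notes</a>
<a href="http://download.dojotoolkit.org/release-0.3.0/dojo-0.3.0-ajax.tar.gz">Download</a>
HTML

# Assumed filter: only accept tarballs under a versioned release directory.
my @found = release_links(
    $html,
    sub { $_[0] =~ m{^http://download\.dojotoolkit\.org/release-[\d.]+/[^/]+\.tar\.gz$} },
);

print "$_\n" for @found;
```

Here the tag and attribute handling is left to the parser, and the only pattern match is against an already extracted URL rather than against raw markup.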