[Catalyst] Alien::Dojo uses regexes to parse HTML, so what?

Mon May 29 20:13:23 CEST 2006

Dominique Quatravaux wrote:
> phaylon wrote:
> 
>> Dominique Quatravaux said:
>>
>>> I rest my case, unless someone can provide compelling reasons for
>>> avoiding regexes *in general* for this task.
>>
>> mst gave only one to demonstrate the whole problem. It's like a
>> big, lightsucking black hole.
> 
> No it's not. We are not trying to address the problem of parsing HTML
> in general, we are trying to address the problem of parsing *one
> single page*. Since I apparently have to be that explicit to make my
> point, consider
> 
>   my ($url) = qr{<a [^>]+
> href="(http://download.dojotoolkit.org/release[^"]+)"}sx
> 
> or even
> 
>   my ($url) = qr{href="(http://download.dojotoolkit.org/release[^"]+)"}sx

<a href="http://download.dojotoolkit.org/release-notes.txt">

Congratulations, you're toast.
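
To make that concrete, here is a minimal sketch that runs the second quoted pattern (with its capture group balanced) against a two-link page. The page content and the tarball URL are made up for illustration; only the release-notes link comes from the example above.

```perl
use strict;
use warnings;

# Second pattern from the quoted message, with the missing "(" restored.
my $re = qr{href="(http://download.dojotoolkit.org/release[^"]+)"}sx;

# Hypothetical page: the tarball path is an assumption, not a real Dojo URL.
my $page = <<'HTML';
<a href="http://download.dojotoolkit.org/release-notes.txt">Release notes</a>
<a href="http://download.dojotoolkit.org/release-0.0.0/dojo.tar.gz">Download</a>
HTML

# List-context match returns the first capture, i.e. the first link that fits.
my ($url) = $page =~ $re;
print "$url\n";
```

The first href that starts with ".../release" wins, so under these assumptions the module would fetch the notes file rather than the tarball.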

> and pray tell me what's wrong with those. HTML is a *text* language,
> for chrissake, it was designed *purposefully* so that I am able to do
> that sort of thing.

Regexps are a hammer. This problem is not a nail.

Get a canonical download address from the Dojo maintainers, or at the very
least do a lightweight SGMLish parse of the page. Regexps are only sane for
hacky one-off scripts, and do not belong in something you're promoting as a
viable CPAN module - or at the very least not in one meant for production use.
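
As a rough sketch of what a lightweight parse could look like, the following uses HTML::LinkExtor from the HTML-Parser distribution (a CPAN dependency, not core Perl). The helper name `download_links`, the list of archive suffixes, and the sample page are all assumptions for illustration, not anything Alien::Dojo actually does.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # ships with HTML-Parser on CPAN, not core Perl

# Hypothetical helper: return every <a href> under $prefix whose URL ends in
# an archive suffix we are willing to fetch.
sub download_links {
    my ($html, $prefix) = @_;
    my @suffixes = ('.tar.gz', '.zip');    # assumed archive types

    my $parser = HTML::LinkExtor->new;     # no callback: links are collected
    $parser->parse($html);
    $parser->eof;

    my @found;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;         # e.g. ('a', href => '...')
        my $href = $attr{href};
        next unless $tag eq 'a' && defined $href;
        next unless index($href, $prefix) == 0;
        next unless grep { substr($href, -length($_)) eq $_ } @suffixes;
        push @found, $href;
    }
    return @found;
}

# Hypothetical page: mixed case, single quotes and an attribute on its own line.
my $page = <<'HTML';
<A HREF='http://download.dojotoolkit.org/release-notes.txt'>notes</A>
<a class="dl"
   href="http://download.dojotoolkit.org/release-0.0.0/dojo.tar.gz">get it</a>
HTML

print "$_\n" for download_links($page, 'http://download.dojotoolkit.org/release');
```

The parser deals with tag case, quoting style and attribute order, so the filter only has to reason about URLs. With the sample page above, only the tarball link should come back.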