Regex and HTML

Does anyone know how to parse nested divs with a regex? The HTML looks like this:

<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>

If I try a regex like .*?<div class="content">(.*?)</div>.*?, it captures everything up to the first closing div, which is not the closing tag of that div but of an inner one, so it misses completely.
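To see the failure concretely, here is a minimal sketch in Python. The choice of Python's re module is an assumption, since the thread does not name a language, and re.DOTALL is added so that . can cross line breaks. The leading and trailing .*? are dropped because they do not change what is captured.

```python
import re

# The sample markup from the question.
html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# The lazy group stops at the FIRST </div>, which closes the inner
# profile-info div, not the outer content div.
pattern = re.compile(r'<div class="content">(.*?)</div>', re.DOTALL)
captured = pattern.search(html).group(1)

print(repr(captured))  # '\n<div class="profile-info">\nbla bla\n'
```

The second inner div never makes it into the capture, which is exactly the "misses completely" behaviour described above.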

This should be impossible, since HTML is not a regular language, but I'm curious how you would solve it without bs4 or other ready-made HTML/XML parsers.

I got this task in a job interview. I couldn't solve it because they required regex only, but I'd still like to know what a solution would look like.

Look into lookbehind and lookahead. For this example it's okay, but for any HTML more serious than that, I really don't know why anyone would ever do it.
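One way the lookahead idea could be applied, as a rough sketch rather than a general solution: match only divs that contain no other div tag, by checking with a negative lookahead before every consumed character. Python's re module and the names innermost and texts are assumptions made for illustration.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# (?:(?!</?div\b).)* consumes one character at a time, but only while
# the upcoming text does not start an opening or closing div tag.
# As a result only innermost divs (those without nested divs) can match.
innermost = re.compile(r'<div[^>]*>((?:(?!</?div\b).)*)</div>', re.DOTALL)
texts = [body.strip() for body in innermost.findall(html)]

print(texts)  # ['bla bla', 'blabla']
```

The outer content div is skipped because its body contains another opening div tag, so this only extracts leaf-level text.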

As far as I understand, you are already aware that it is theoretically impossible to fully parse HTML with regex =)
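The limitation comes from nesting: finding the closing tag that belongs to a given opening tag requires counting depth, which a classic regular expression cannot do. A hedged sketch of what that counting looks like when a regex is used only to find tags and a plain loop tracks the depth (so this is not a regex-only answer; the helper inner_html and the pattern name TAG are made up for illustration):

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# Matches any opening or closing div tag; group 1 is "/" for closing tags.
TAG = re.compile(r'<(/?)div\b[^>]*>')

def inner_html(source, opening_pattern):
    """Return the text between the matched opening div and its own closing tag."""
    start = re.search(opening_pattern, source)
    if start is None:
        return None
    depth = 1  # we are inside the div that was just opened
    for tag in TAG.finditer(source, start.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return source[start.end():tag.start()]
    return None  # unbalanced markup: the closing tag was never found

content = inner_html(html, r'<div class="content">')
print(content)
```

This ignores comments, scripts, and attribute values that contain tag-like text, so it is only suitable for small, well-formed snippets like the one in the question.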

So it looks like this task was about “hacky parsing”, and the simplest answer is something like this:

/.*"profile-info">\s*(.*?)\s*<\/div>.*"post-content">\s*(.*?)\s*<\/div>.*/sgm
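For anyone trying this outside a regex playground, here is a sketch of the same pattern in Python (an assumed language; the thread does not specify one). The s flag corresponds to re.DOTALL. The g and m flags are not carried over, since search returns a single match and the pattern has no ^ or $ anchors. The forward slash needs no escaping outside a slash-delimited literal.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# Anchor on the class names, then lazily capture up to the next </div>.
# The \s* on both sides keeps the surrounding line breaks out of the groups.
pattern = re.compile(
    r'.*"profile-info">\s*(.*?)\s*</div>.*"post-content">\s*(.*?)\s*</div>.*',
    re.DOTALL,
)
profile, post = pattern.search(html).groups()

print(profile)  # bla bla
print(post)     # blabla
```

This relies on both class names appearing in that order and on neither div containing a nested div.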

Here’s a link to a prepared playground, so you can play with this regexp:

You could also omit the div names from the closing tags and get a shorter but less readable version:

/.*"profile-info">\s*(.*?)\s*<\/.*"post-content">\s*(.*?)\s*<\/.*/s

If you’re looking for a more general answer and do not want to rely on class names, you can use this regex:

/<div.*>\s*(.*)\s*<\/div>/g
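A quick way to check what the general pattern returns on the sample, sketched in Python (an assumption, as above). The g flag maps to findall, and because there is no s flag, . does not cross line breaks.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# Without DOTALL, (.*) can only capture a single line, so the match
# succeeds only for divs whose body is one line of text.
general = re.compile(r'<div.*>\s*(.*)\s*</div>')
found = general.findall(html)

print(found)  # ['bla bla', 'blabla']
```

The outer div produces no match here because its body spans several lines, so this version works only when each div of interest holds a single line of text.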

Wow, I was stubbornly starting from the div tag instead of trying to start from the class name.

What does \s stand for?

According to the regex101 documentation, \s matches any whitespace character.

You can check the link above =) The sidebar contains a full explanation of the groups and special characters.
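A small illustration of why the patterns above wrap the capture group in \s*, sketched in Python with a made-up sample string: the whitespace around the text is consumed outside the group, so the captured value comes back trimmed.

```python
import re

# Hypothetical sample: text padded with spaces and line breaks.
sample = ">\n  bla bla \n<"

# \s* on each side absorbs spaces, tabs and newlines around the text.
trimmed = re.search(r'>\s*(.*?)\s*<', sample).group(1)

# A few single characters checked against \s on their own.
is_whitespace = {ch: bool(re.fullmatch(r'\s', ch)) for ch in [" ", "\t", "\n", "a"]}

print(repr(trimmed))  # 'bla bla'
print(is_whitespace)
```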


tyy
