Regex and HTML

venmo · April 25, 2024, 11:03am

Does anyone know how to parse nested divs with a regex? The HTML looks like this:

<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>

If I try a regex like .*?<div class="content">(.*?)</div>.*?, it captures everything up to the first closing div, which does not close that div but an inner one, so it misses completely.
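To make the failure concrete, here is a minimal sketch in Python (an assumption, the task did not name a language) that runs the core of that pattern against the sample HTML. The `re.DOTALL` flag is added so that `.` can cross line breaks.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# re.DOTALL lets "." match newlines, so the lazy group can span lines.
naive = re.compile(r'<div class="content">(.*?)</div>', re.DOTALL)
captured = naive.search(html).group(1)

# The lazy group stops at the first "</div>", which closes profile-info,
# so the post-content div is never reached.
print(repr(captured))
```

The captured text ends right after "bla bla", which is the behaviour described above.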

This should be impossible, since HTML is not a regular language, but I'm curious how you would solve it without bs4 or other ready-made HTML/XML parsers.

I got this task in a job interview. I couldn't solve it because they wanted regex only, but I'm still curious what the solution would look like.
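For reference, one way to handle arbitrary nesting without a full parser is to use a regex only to find the div tags and to count depth in ordinary code. This is a sketch that goes beyond the "regex only" constraint of the interview task. The helper name `outer_div_body` is hypothetical, and the sketch assumes well-formed markup with no comments, scripts, or `>` inside attribute values.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""


def outer_div_body(text, class_name):
    """Return the inner HTML of the first <div class="class_name">, or None."""
    opening = re.search(
        r'<div\b[^>]*class="%s"[^>]*>' % re.escape(class_name), text
    )
    if opening is None:
        return None
    rest = text[opening.end():]
    depth = 1
    # Visit every later opening or closing div tag and track the nesting level.
    for tag in re.finditer(r"<(/?)div\b[^>]*>", rest):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            # This closing tag balances the opening tag we started from.
            return rest[:tag.start()]
    return None  # unbalanced markup


body = outer_div_body(html, "content")
print(body)
```

The regex does the tokenising and the counter supplies the memory that a plain regular expression lacks.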

texhno · April 25, 2024, 5:10pm

Look into lookbehind and lookahead. For that example it's okay, but for any more serious HTML than that, I really don't know why anyone would ever do this.
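One possible reading of the lookahead hint, shown as a Python sketch (the pattern is an assumption, since none was given in the post): a negative lookahead that refuses to step over any div tag, so only the innermost divs can match.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# (?:(?!</?div\b).)* consumes one character at a time, but only while the
# next characters do not start an opening or closing div tag.
innermost = re.compile(r"<div\b[^>]*>((?:(?!</?div\b).)*)</div>", re.DOTALL)

texts = [m.strip() for m in innermost.findall(html)]
print(texts)
```

The outer div never matches here, because its body contains other div tags. That is fine for this sample, but it does not recover the nesting itself.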

He4eT · April 26, 2024, 11:32am

As far as I understand, you are already aware of the theoretical impossibility of fully parsing HTML with regex =)

So it looks like this task was about “hacky parsing”, and the simplest answer is something like this:

/.*"profile-info">\s*(.*?)\s*<\/div>.*"post-content">\s*(.*?)\s*<\/div>.*/sgm
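A rough Python equivalent of the pattern, as a sketch (the original uses slash-delimited syntax with flags, and Python is an assumption here). Only the `s` flag matters for this input, so it is mapped to `re.DOTALL`, and the `\/` escapes are not needed in Python.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

pattern = re.compile(
    r'.*"profile-info">\s*(.*?)\s*</div>.*"post-content">\s*(.*?)\s*</div>.*',
    re.DOTALL,  # counterpart of the "s" flag: "." also matches newlines
)

match = pattern.search(html)
profile, post = match.groups()
print(profile, "|", post)
```

The two groups hold the trimmed text of each inner div. The pattern depends on the class names appearing in this order.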

Here’s a link to a prepared playground, so you can play with this regexp:

He4eT · April 26, 2024, 11:41am

You could also omit these divs and get a shorter but less readable version:

/.*"profile-info">\s*(.*?)\s*<\/.*"post-content">\s*(.*?)\s*<\/.*/s

He4eT · April 26, 2024, 2:49pm

If you’re looking for a more general answer and do not want to rely on class names, then you can use this regex:

/<div.*>\s*(.*)\s*<\/div>/g
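A sketch of the same pattern in Python (an assumption about the language), where `re.findall` plays the role of the `g` flag. Since there is no `s` flag, `.` does not cross line breaks, so this only works when each div's text sits on a single line, as in the sample.

```python
import re

html = """<div class="content">
<div class="profile-info">
bla bla
</div>
<div class="post-content">
blabla
</div>
</div>"""

# No re.DOTALL here: "." stays within one line, only \s* steps over newlines.
general = re.compile(r"<div.*>\s*(.*)\s*</div>")

texts = general.findall(html)
print(texts)

# A div whose text spans two lines is not matched by this pattern.
multiline = general.findall("<div>\nline one\nline two\n</div>")
print(multiline)
```

The outer div is skipped because the line after its opening tag is another tag, not text followed by a closing div.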

venmo · April 27, 2024, 1:25pm

Wow, I was stubbornly starting with the div tag instead of trying to start from the class name.

What does this \s stand for?

He4eT · April 27, 2024, 1:48pm

According to the regex101 documentation, \s matches any whitespace character.

You can check the link above =) The sidebar contains a full explanation of the groups and special characters.
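A tiny illustration, using Python's `re` module as an assumed flavor (regex101 supports several flavors, and their exact whitespace sets can differ slightly):

```python
import re

sample = "a b\tc\nd"

whitespace = re.findall(r"\s", sample)   # each single whitespace character
collapsed = re.sub(r"\s+", " ", sample)  # runs of whitespace become one space
print(whitespace, repr(collapsed))
```

This is why `\s*` in the patterns above can step over the line breaks around the text inside each div.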

venmo · April 27, 2024, 9:15pm

tyy