python - Beautiful Soup Not able to get_text after using extract() -


I am working on web scraping and want to get the text of a website using Beautiful Soup. I found that the get_text() method was returning JavaScript code, and to avoid that I came across the advice to use the extract() method. But I have a weird problem: after extracting the script and style tags, Beautiful Soup doesn't recognize the body, even though it is present in the new HTML.

Let me make this clear. First I was doing this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlrawdata, 'html.parser')
print(soup.body)

Here the print statement prints the HTML data. But when I do this:

soup = BeautifulSoup(rawdata, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()    # rip it out
print(soup.body)

Now it prints None, as if the element were not present. While debugging, I called soup.prettify() and it prints the whole HTML, including the body tag, with no script or style tags :( I am confused about why this is happening: if the body is present, why does soup.body say None? Please help, thanks.

I am using Python 3 and bs4, and rawdata is the HTML extracted from a website.

Problem: using this HTML example:

<html>
<style>just style</style>
<span>main text.</span>
</html>

After extracting the style tag, calling get_text() still returns the text it was supposed to remove. This seems to be due to the double newline left in the HTML after using extract(). Call soup.contents before and after .extract() and you will see the issue.

before extract():

[<html>\n<style>just style</style>\n<span>main text.</span>\n</html>] 

after extract():

[<html>\n\n<span>main text.</span>\n</html>] 

You can see the double newline between html and span. This breaks get_text() for a reason I don't know. To validate this point, remove the newlines from the example and it will work properly.
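A minimal sketch for checking this on your own install, assuming bs4 is available and using the same example document. The variable names (raw, before, after) are only illustrative. Whether get_text() actually misbehaves may depend on the bs4 version, so the script prints the results rather than assuming them.

```python
from bs4 import BeautifulSoup

# The example document from above, with its newlines intact.
raw = "<html>\n<style>just style</style>\n<span>main text.</span>\n</html>"

soup = BeautifulSoup(raw, "html.parser")
before = str(soup)

# Remove every <script> and <style> tag from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

after = str(soup)

# Compare the serialized tree before and after the extraction,
# then look at what get_text() returns on the modified tree.
print(repr(before))
print(repr(after))
print(repr(soup.get_text()))
```

The second printed line should show the two adjacent newlines where the style tag used to be.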

Solutions:

1.- Parse the soup again after the extract() call.

soup = BeautifulSoup(str(soup), 'html.parser')

2.- Use a different parser.

soup = BeautifulSoup(raw, 'html5lib')

Note: solution #2 doesn't work if you extract 2 or more contiguous tags, because you end up with a double newline again.

Note: you have to install the parser first. Just do:

pip install html5lib 
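Solution 1 could be wrapped in a small helper like the one below. This is only a sketch: parse_without_tags and the sample document are made-up names for illustration, not part of bs4, and it assumes the built-in html.parser is good enough for your pages.

```python
from bs4 import BeautifulSoup


def parse_without_tags(raw_html, names=("script", "style")):
    """Hypothetical helper: drop the named tags, then re-parse the result."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(list(names)):
        tag.extract()
    # Serializing and parsing again builds a fresh tree (solution 1).
    return BeautifulSoup(str(soup), "html.parser")


# Illustrative input, not taken from the question.
sample = (
    "<html><head><script>var x = 1;</script></head>\n"
    "<body>\n<style>p { color: red; }</style>\n<p>main text.</p>\n</body></html>"
)

clean = parse_without_tags(sample)
text = clean.get_text()

print(clean.body is not None)
print(text.strip())
```

Re-parsing costs a second pass over the document, which is usually fine for single pages but worth keeping in mind when scraping in bulk.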
