python - Beautiful Soup: not able to get_text after using extract()

- July 15, 2014

I am working on web scraping and want to get the text from a website using Beautiful Soup. I found that the get_text() method was returning JavaScript code, and to avoid that I came across the advice to use the extract() method. But I have a weird problem: after extracting the script and style tags, Beautiful Soup doesn't recognize that the body is present in the new HTML.

Let me be clear. First I was doing this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlrawdata, 'html.parser')
print(soup.body)

Here the print statement prints the HTML data, but when I do this:

soup = BeautifulSoup(rawdata, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()    # rip it out
print(soup.body)

Now it prints None, as if the element is not present. While debugging, I called soup.prettify() afterwards, and it prints the whole HTML including the body tag, with no script or style tags. I am confused about why this is happening: if the body is present, why does soup.body return None? Please help, thanks.

I am using Python 3 and bs4, and rawdata is the HTML extracted from a website.

Problem: take this HTML as an example:

<html>
<style>just style</style>
<span>main text.</span>
</html>

After extracting the style tag, calling get_text() still returns the text it was supposed to remove. This is due to the double newline left in the HTML after using extract(). Call soup.contents before and after .extract() and you will see the issue.

before extract():

[<html>\n<style>just style</style>\n<span>main text.</span>\n</html>]

after extract():

[<html>\n\n<span>main text.</span>\n</html>]

You can see the double newline between html and span. This issue breaks get_text() for an unknown reason. To validate this point, remove the newlines from the example and it will work properly.
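The before/after check described above can be reproduced with a short script. This is only a sketch, assuming bs4 is installed and using the example HTML from this answer. The variable names (raw, before, after) are illustrative, and whether get_text() actually misbehaves afterwards may depend on the Beautiful Soup version, so the script only shows the tree contents.

```python
from bs4 import BeautifulSoup

# Example document from the answer, with a newline between each tag.
raw = "<html>\n<style>just style</style>\n<span>main text.</span>\n</html>"

soup = BeautifulSoup(raw, "html.parser")

# Serialized tree before anything is removed.
before = str(soup)
print(soup.contents)

# Remove every script and style tag from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

# Serialized tree after removal: the newline that preceded the style tag
# and the one that followed it now sit next to each other.
after = str(soup)
print(soup.contents)
```

Comparing the two printed lists shows the two adjacent newlines where the style tag used to be.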

Solutions:

1.- Parse the soup again after the extract() call.

soup = BeautifulSoup(str(soup), 'html.parser')

2.- Use a different parser.

soup = BeautifulSoup(raw, 'html5lib')

Note: solution #2 doesn't work if you extract 2 or more contiguous tags, because you end up with a double newline again.

Note: you have to install this parser first. Just do:

pip install html5lib
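Solution #1 could be wrapped in a small helper like the one below. This is a minimal sketch, not part of the original answer: the function name strip_and_reparse and the sample HTML are hypothetical, and it uses the built-in html.parser so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup


def strip_and_reparse(raw_html, parser="html.parser"):
    """Remove script/style tags, then rebuild the tree from its own markup."""
    soup = BeautifulSoup(raw_html, parser)
    for tag in soup(["script", "style"]):
        tag.extract()  # detach the tag (and its contents) from the tree
    # Re-parsing the serialized markup yields a fresh, consistent tree.
    return BeautifulSoup(str(soup), parser)


# Hypothetical sample page, for illustration only.
raw = (
    "<html>\n<body>\n"
    "<script>var x = 1;</script>\n"
    "<p>main text.</p>\n"
    "</body>\n</html>"
)

clean = strip_and_reparse(raw)
text = clean.get_text()
print(clean.body is not None)
print(text.strip())
```

The extra parse costs some time on large pages, but it keeps the workaround in one place if several pages need the same cleanup.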
