python - Beautiful Soup: not able to get_text() after using extract()
I am working on web scraping and want to get the text of a website using Beautiful Soup. I found that the get_text() method was also returning JavaScript code, and to avoid that I came across the advice that I should use the extract() method. But I have a weird problem: after extracting the script and style tags, Beautiful Soup doesn't recognize that a body is present in the new HTML.
Let me make this clear. First I was doing this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlrawdata, 'html.parser')
print(soup.body)
Here the print statement prints the HTML data. But when I do this:
soup = BeautifulSoup(rawdata, 'html.parser')
for script in soup(["script", "style"]):
    script.extract()  # rip it out
print(soup.body)
Now it prints None, as if the element were not present. While debugging, I called soup.prettify() and it printed the whole HTML, including the body tag, with no script or style tags :( I am confused about why this is happening and, if the body is present, why it says None.
Please help, thanks.
I am using Python 3 and bs4; rawdata is the HTML extracted from the website.
Problem: Using this HTML as an example:
<html>
<style>just style</style>
<span>main text.</span>
</html>
After extracting the style tag, calling get_text() returns the text it was supposed to remove. This seems to be due to a double newline left in the HTML after using extract(). Call soup.contents before and after .extract() and you will see the issue.
Before extract():
[<html>\n<style>just style</style>\n<span>main text.</span>\n</html>]
After extract():
[<html>\n\n<span>main text.</span>\n</html>]
You can see the double newline between html and span. This issue breaks get_text() for an unknown reason. To validate this point, remove the newlines in the example and it will work properly.
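To check this yourself, the following minimal sketch serializes the example document before and after the extract() call. The names html, before and after are illustrative and not from the original post; the sketch assumes bs4 is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# The example document from above, with its original newlines.
html = "<html>\n<style>just style</style>\n<span>main text.</span>\n</html>"

soup = BeautifulSoup(html, "html.parser")
before = str(soup)  # serialized tree before removing anything

# soup([...]) is shorthand for soup.find_all([...]).
for tag in soup(["script", "style"]):
    tag.extract()  # detach the tag from the tree

after = str(soup)  # serialized tree after removing the style tag

# repr() makes the newline characters visible.
print(repr(before))
print(repr(after))
```

Comparing the two printed strings shows that the newlines on either side of the removed style tag are now adjacent.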
Solutions:
1.- Parse the soup again after the extract() call.
BeautifulSoup(str(soup), 'html.parser')
2.- Use a different parser.
BeautifulSoup(raw, 'html5lib')
Note: Solution #2 doesn't work if you extract 2 or more contiguous tags, because you end up with a double newline again.
Note: You have to install the parser. To do so, run:
pip install html5lib
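Solution #1 could be wrapped in a small helper along these lines. This is only a sketch: strip_and_reparse and its parameters are hypothetical names, not part of Beautiful Soup, and it sticks to the built-in html.parser so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup


def strip_and_reparse(raw_html, tag_names=("script", "style")):
    """Remove the given tags, then re-parse the serialized result.

    Hypothetical helper for illustration; re-parsing rebuilds the tree
    from the cleaned-up markup, as suggested in solution #1.
    """
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(list(tag_names)):
        tag.extract()  # detach each unwanted tag from the tree
    # Serialize the modified tree and parse it again from scratch.
    return BeautifulSoup(str(soup), "html.parser")


raw = "<html>\n<style>just style</style>\n<span>main text.</span>\n</html>"
clean = strip_and_reparse(raw)
text = clean.get_text()
print(repr(text))
```

The same helper can be called on the full page source (rawdata in the question) before reading soup.body or calling get_text().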