mysql - Fetched website name contains HTML code in Python 2.7
I am facing a problem when running a Python script that downloads company details from business directories: company name, address, location, and web address.
When the script fetches a company's website name (for example www.example.com), it fetches HTML code instead of the bare website name, and that HTML code is then stored in the MySQL server.
I am using the following Python libraries: BeautifulSoup, lxml, html, hashlib, and urllib2. The website name ends up in MySQL wrapped in HTML code, like this:
<input><tr><td>www.example.com</td></tr></input>
I want to remove the HTML tags and store only the company's web URL, www.example.com, in the MySQL server.
My code is here:
for hit in soup2.findAll(attrs={'id': 'website_0'}):
    web = str(hit).replace('<input type="hidden" value="', '')
    web = web.replace('" id="website_0" />', '')
    if web == "":
        flog.write("\nwebsite extraction... failed")
        print "none"
    else:
        flog.write("\nwebsite extraction... ok")
        print web
        companyobj.setweb(web)
Is there any solution or suggestion for how to fix this?
You have (at least) two options: using re or using BeautifulSoup.
Using re
import re

# matches anything that looks like an HTML tag
cleanse_url = re.compile(r'<[^>]*>')

for hit in soup2.findAll(attrs={'id': 'website_0'}):
    web = str(hit).replace('<input type="hidden" value="', '')
    web = web.replace('" id="website_0" />', '')
    if web == "":
        flog.write("\nwebsite extraction... failed")
        print "none"
    else:
        web = cleanse_url.sub('', web)  # strip any remaining HTML tags
        flog.write("\nwebsite extraction... ok")
        print web
        companyobj.setweb(web)
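To see the regular expression on its own, here is a minimal, self-contained sketch that applies the same pattern to the HTML fragment from the question. The helper name strip_tags is made up for illustration, and the call to strip() is an addition that only trims surrounding whitespace.

```python
import re

# Same pattern as in the answer: anything between '<' and the next '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag (hypothetical helper)."""
    return TAG_RE.sub('', fragment).strip()


# The fragment that ended up in MySQL, as shown in the question.
stored = '<input><tr><td>www.example.com</td></tr></input>'
print(strip_tags(stored))  # www.example.com
```

This pattern is a blunt instrument: it would misfire on a literal '>' inside an attribute value, so it is probably only suitable for small, predictable fragments like this one.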
Using BeautifulSoup's tag.text
I think this option is better, because tag.text strips the tags and their attributes and returns only the text inside the element.
for hit in soup2.findAll(attrs={'id': 'website_0'}):
    web = hit.text  # let BeautifulSoup drop the tags
    if web == "":
        flog.write("\nwebsite extraction... failed")
        print "none"
    else:
        flog.write("\nwebsite extraction... ok")
        print web
        companyobj.setweb(web)
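One caveat worth checking: tag.text only returns text nested inside the element. If the URL actually lives in the value attribute of the hidden input, which the replace() calls in the question suggest may be the case, the text will be empty and the attribute has to be read instead. The sketch below covers both cases; it assumes the bs4 package (BeautifulSoup 4, where findAll is spelled find_all) and uses two made-up HTML snippets and a hypothetical helper, extract_website, purely for illustration.

```python
from bs4 import BeautifulSoup


def extract_website(hit):
    """Return the URL from a tag: nested text first, then the value attribute.

    Hypothetical helper, not part of BeautifulSoup.
    """
    text = hit.get_text(strip=True)
    if text:
        return text
    # Fall back to the attribute; get() returns None when it is missing.
    return (hit.get("value") or "").strip()


# Two made-up page fragments: URL as nested text, and URL as an attribute.
html_text = '<table><tr><td id="website_0"> www.example.com </td></tr></table>'
html_attr = '<form><input type="hidden" value="www.example.com" id="website_0" /></form>'

for html in (html_text, html_attr):
    soup = BeautifulSoup(html, "html.parser")
    for hit in soup.find_all(attrs={"id": "website_0"}):
        print(extract_website(hit))
```

Which branch applies depends on how the real directory page is marked up, so it may be worth printing str(hit) once to see what the element looks like before choosing.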