python - Regular Expression HTML Tag Exclusion -
Yes, yes, I've considered using an XML parser instead of regular expressions, but this seems like a simple enough case that a regex is suitable:
# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
from urllib import urlopen

tempsite = 'http://www.sumkindawebsiterighthur.com'
thetempsite = urlopen(tempsite).read()
currenttempsite = BeautifulSoup(thetempsite)
# grab every table row whose valign attribute is "top"
email = currenttempsite.findAll('tr', valign="top")
print email[0]
This currently results in:
<tr valign="top">
<td><p>phone number:</p></td>
<td> </td>
<td><p>706-878-8888</p></td>
</tr>
I'm trying to remove the markup (tr, td, p; dropping the empty cell would be nice too) so that the result is:
phone number: 706-878-8888
My problem is over-exclusion when multiple lines are being regex'd. I'm looking for an answer that outputs everything on a single line.
If the results are always this simple, the following regex will put 'phone number:' in capture group 1 and the number in capture group 2, as long as the re.DOTALL flag is set:
.*(phone number:).*?([-\d]+).*
You can then call re.sub() on the string with the replacement \1 \2.
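To see why the re.DOTALL flag matters here, a small sketch may help. It assumes the row looks exactly like the output in the question (the variable names row, without_flag, and with_flag are only for illustration). By default "." does not match a newline, so the label and the number, which sit on different lines, cannot both be part of one match.

```python
import re

# Sample row from the question, spread across several lines.
row = """<tr valign="top">
<td><p>phone number:</p></td>
<td> </td>
<td><p>706-878-8888</p></td>
</tr>"""

pattern = r'.*(phone number:).*?([-\d]+).*'

# Without re.DOTALL, "." stops at each newline, so nothing matches.
without_flag = re.search(pattern, row)

# With re.DOTALL, "." also matches newlines and the whole row matches.
with_flag = re.search(pattern, row, re.DOTALL)

print(without_flag)        # None
print(with_flag.groups())  # ('phone number:', '706-878-8888')
```

That is most likely the source of the multi-line trouble: a pattern that relies on "." to bridge the gap between cells needs re.DOTALL, or it never gets past the first line break.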
Here is a complete example that returns what you want:
>>> import re
>>> s = """<tr valign="top">
... <td><p>phone number:</p></td>
... <td> </td>
... <td><p>706-878-8888</p></td>
... </tr>"""
>>> regex = re.compile(r'.*(phone number:).*?([-\d]+).*', re.DOTALL)
>>> regex.sub(r'\1 \2', s)
'phone number: 706-878-8888'
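If the label is not always 'phone number:', a more general variant is to drop anything that looks like a tag and then squeeze the leftover whitespace onto one line. This is only a rough sketch: strip_markup is a made-up helper name, and it assumes simple markup like the row above (no ">" inside attribute values, no comments or scripts, and no HTML entities that need decoding).

```python
import re

def strip_markup(html):
    """Remove tag-like spans, then collapse all whitespace to single spaces."""
    # Replace each <...> span with a space so neighbouring cells don't fuse.
    text = re.sub(r'<[^>]+>', ' ', html)
    # split() with no arguments breaks on any whitespace run, newlines included.
    return ' '.join(text.split())

row = """<tr valign="top">
<td><p>phone number:</p></td>
<td> </td>
<td><p>706-878-8888</p></td>
</tr>"""

print(strip_markup(row))  # phone number: 706-878-8888
```

Because [^>] also matches newlines, this version does not need re.DOTALL. For anything messier than a row like this, pulling the text out with the parser you already have is probably the safer route.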
python html regex beautifulsoup