python - Regular Expression HTML Tag Exclusion -




Yes, yes, I've considered using an XML parser instead of regular expressions, but this seems like a simple enough case that a regex is suitable:

from BeautifulSoup import BeautifulSoup
from urllib import urlopen
tempsite = 'http://www.sumkindawebsiterighthur.com'
thetempsite = urlopen(tempsite).read()
currenttempsite = BeautifulSoup(thetempsite)
email = currenttempsite.findAll('tr', valign="top")
print email[0]

This currently results in:

<tr valign="top">
<td><p>phone number:</p></td>
<td>&nbsp;</td>
<td><p>706-878-8888</p></td>
</tr>

I'm trying to remove the markup (tr, td, p; dropping the &nbsp; would be nice too), so that the result is:

phone number: 706-878-8888

My problem is over-exclusion, with multiple lines being regex'd. I'm looking for an answer that outputs the result on a single line.

If the results are that simple, the following regex will put 'phone number:' in capture group 1 and the number in capture group 2, as long as the re.DOTALL flag is set:

.*(phone number:).*?([-\d]+).*

You can then call re.sub() on the string with the replacement \1 \2.
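
To see what the two capture groups hold, and why the re.DOTALL flag matters here, a minimal Python 3 sketch may help. The variable names are only for illustration, and the sample row is the one from the question:

```python
import re

html = """<tr valign="top">
<td><p>phone number:</p></td>
<td>&nbsp;</td>
<td><p>706-878-8888</p></td>
</tr>"""

pattern = r'.*(phone number:).*?([-\d]+).*'

# With re.DOTALL, '.' also matches newlines, so the pattern can span all lines.
match = re.search(pattern, html, re.DOTALL)
label, number = match.group(1), match.group(2)
print(label, number)

# Without the flag, '.' stops at each newline. The label and the number sit
# on different lines, so the pattern finds no match at all.
no_flag = re.search(pattern, html)
print(no_flag)
```

The lazy .*? between the groups keeps group 2 from starting any later than the first digit or hyphen after the label.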

Here is a complete example that returns what you want:

>>> import re
>>> s = """<tr valign="top">
... <td><p>phone number:</p></td>
... <td>&nbsp;</td>
... <td><p>706-878-8888</p></td>
... </tr>"""
>>> regex = re.compile(r'.*(phone number:).*?([-\d]+).*', re.DOTALL)
>>> regex.sub(r'\1 \2', s)
'phone number: 706-878-8888'
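
One caveat with the sub() approach: when the pattern does not match, sub() returns the input unchanged, markup and all. The following is a hedged Python 3 sketch that makes the no-match case explicit. The function name extract_phone is hypothetical and not part of the original answer, and because it uses search() rather than sub(), the leading and trailing .* are dropped:

```python
import re

# Same two capture groups as above; search() scans the string itself,
# so the surrounding '.*' parts are not needed.
PHONE_RE = re.compile(r'(phone number:).*?([-\d]+)', re.DOTALL)


def extract_phone(html):
    """Return 'phone number: <number>' on one line, or None if not found."""
    match = PHONE_RE.search(html)
    if match is None:
        return None
    return '{} {}'.format(match.group(1), match.group(2))


row = """<tr valign="top">
<td><p>phone number:</p></td>
<td>&nbsp;</td>
<td><p>706-878-8888</p></td>
</tr>"""

print(extract_phone(row))
print(extract_phone('<tr><td>no phone here</td></tr>'))
```

Returning None lets the caller tell "no phone number in this row" apart from a successful extraction, instead of silently passing raw HTML along.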

