uweziegenhagen.de » Blog Archive » Parsing Web Pages with Python

Parsing Web Pages with Python

2012-10-13, 20:14

Here is a short example of how to extract text from web pages with Python and BeautifulSoup. The code targets Python 2 and BeautifulSoup 3.

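import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)

# http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents
def textOf(soup):
    """Return all text nodes below `soup`, joined into one unicode string."""
    return u''.join(soup.findAll(text=True))

# Download the page and parse it.
soup = BeautifulSoup(urllib2.urlopen('http://www.fmylife.com/').read())

# The script looks for entries in <div class="post article"> elements.
for item in soup.findAll('div', attrs={'class': 'post article'}):
    text = textOf(item)
    # Keep only the part before the "FML#" marker. partition() returns the
    # whole text when the marker is missing; slicing with find() would drop
    # the last character in that case, because find() returns -1.
    print text.partition("FML#")[0]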
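
On Python 3, urllib2 no longer exists (it was split into urllib.request and urllib.error) and BeautifulSoup 3 does not run. As a rough sketch of the same idea, the following uses only the standard library's html.parser instead of BeautifulSoup, which is a swapped-in technique and not what the script above does. The names PostTextExtractor, extract_posts and VOID_TAGS, as well as the sample HTML, are made up for illustration. It assumes reasonably well-formed markup; badly nested tags would confuse the depth counter.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PostTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class attribute equals css_class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.posts = []    # finished texts, one string per matching div
        self._depth = 0    # open tags inside the current match; 0 means "outside"
        self._chunks = []  # text fragments of the div being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag not in VOID_TAGS:
                self._depth += 1
        elif tag == "div" and dict(attrs).get("class") == self.css_class:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._chunks))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html, css_class="post article", marker="FML#"):
    """Return the text of each matching div, cut off before `marker` if present."""
    parser = PostTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return [text.partition(marker)[0] for text in parser.posts]


# Hypothetical sample markup, so the sketch runs without network access.
sample = (
    '<div class="post article"><p>Today, it rained.</p> FML#123 - vote</div>'
    '<div class="sidebar">ignored</div>'
    '<div class="post article">No marker <b>here</b></div>'
)
print(extract_posts(sample))
```

To run this against a live page, the HTML string could come from urllib.request.urlopen(url).read().decode('utf-8'), assuming the page is UTF-8 encoded. The class name 'post article' and the "FML#" marker are taken from the script above and may no longer match the site's current markup.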

Uwe

Uwe Ziegenhagen likes LaTeX and Python, sometimes even combined. Do you like my content and would like to thank me for it? Consider making a small donation to my local fablab, the Dingfabrik Köln. Details on how to donate can be found under "Spenden für die Dingfabrik" (donations for the Dingfabrik).

