Parsing Web Pages with Python

Here is a short example of how to extract text from web pages using Python and BeautifulSoup.

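from urllib.request import urlopen

# Ported to Python 3 and BeautifulSoup 4 (package "bs4");
# the original snippet targeted Python 2 and BeautifulSoup 3.
from bs4 import BeautifulSoup

# http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents
def textOf(soup):
    """Return all text nodes below the given element, joined into one string."""
    return ''.join(soup.find_all(string=True))

# Download the page and parse it with the parser from the standard library
soup = BeautifulSoup(urlopen('http://www.fmylife.com/').read(), 'html.parser')

# Each post sits in a <div class="post article">
for item in soup.find_all('div', attrs={'class': 'post article'}):
    text = textOf(item)
    # Keep only the text before the "FML#" marker (all of it if the marker is missing)
    print(text.partition('FML#')[0])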
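
The extraction step in the script above can also be tried offline, without downloading anything. The following is only a minimal sketch under a few assumptions: it uses the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the technique and not the script's own method. The names DivTextExtractor, extract_posts and SAMPLE_HTML are hypothetical, and the sample markup is made up for illustration; it is not the real structure of the site.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class attribute equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # <div> nesting depth inside a matching <div>; 0 = outside
        self.chunks = []    # text pieces of the post currently being read
        self.posts = []     # text of all finished posts

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            self.depth += 1   # nested <div> inside a matching one
        elif dict(attrs).get('class') == self.target_class:
            self.depth = 1    # entering a matching <div>

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:   # the matching <div> was just closed
                self.posts.append(''.join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_posts(html, target_class='post article', marker='FML#'):
    """Return the text of each matching <div>, cut off at the marker if present."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    # partition() keeps the whole text when the marker does not occur
    return [post.partition(marker)[0].strip() for post in parser.posts]


# Made-up sample markup, only for illustration
SAMPLE_HTML = """
<div class="post article"><p>Today, my <b>cat</b> ignored me.</p> FML#123 <a href="#">comments</a></div>
<div class="sidebar">Not a post</div>
<div class="post article"><div>Today, it rained.</div> No marker here</div>
"""

if __name__ == '__main__':
    for post in extract_posts(SAMPLE_HTML):
        print(post)
```

Cutting with partition() instead of slicing up to find() is a deliberate choice in this sketch: find() returns -1 when the marker is missing, and slicing with -1 would silently drop the last character of the text.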

Uwe

Uwe Ziegenhagen likes LaTeX and Python, sometimes even combined. Do you like my content and would like to thank me for it? Consider making a small donation to my local fablab, the Dingfabrik Köln. Details on how to donate can be found under "Spenden für die Dingfabrik" (donations for the Dingfabrik).