Screen scraping with Python

Does Python have screen scraping libraries that offer JavaScript support?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?

Tags: python screen-scraping htmlunit pycurl

7 Answers

爷的心禁止访问

#2 · 2020-02-19 08:58

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Here you go: http://scrapy.org/

混吃等死

#3 · 2020-02-19 09:06

You could try spidermonkey:

This Python module allows for the implementation of JavaScript classes, objects and functions in Python, as well as the evaluation and calling of JavaScript scripts and functions. It borrows heavily from Claes Jacobssen's Javascript Perl module, which in turn is based on Mozilla's PerlConnect Perl binding.

小情绪 Triste *

#4 · 2020-02-19 09:09

I have not found anything for this. I use a combination of BeautifulSoup and custom routines...
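
The custom routines are not shown here, so the following is only an illustrative sketch of what such a routine for static HTML might look like. To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup, and the names LinkCollector and extract_links are hypothetical. It does not execute JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # skip anchors without an href (or with an empty one)
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in a static HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

The HTML string would come from whatever fetches the page (pycurl, for instance); relative links are resolved against the URL passed as base_url.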

你好瞎i

#5 · 2020-02-19 09:14

Selenium maybe? It lets you automate an actual browser (Firefox, IE, Safari) from Python (among other languages). It is meant for testing websites, but it seems it should be usable for scraping as well. (Disclaimer: I have never used it myself.)
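
A minimal sketch of how this could look for scraping, assuming Selenium 4 with Firefox and a matching driver installed. The helper names make_firefox and render are hypothetical, and the driver factory is a parameter only so the function can be tried with a different browser or a stand-in object.

```python
def make_firefox():
    """Start a headless Firefox (assumes Selenium 4 and Firefox are installed)."""
    # Imported lazily so this module still loads where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def render(url, driver_factory=make_firefox):
    """Return the page's HTML as the browser sees it after loading `url`."""
    driver = driver_factory()
    try:
        driver.get(url)
        # page_source reflects the DOM after the scripts run during load
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


# Usage (starts a real browser):
#   html = render("http://example.com/")
```

Content that scripts add some time after the page has loaded may not be in the returned HTML yet; in that case an explicit wait would be needed before reading page_source.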

Deceive 欺骗

#6 · 2020-02-19 09:14

The Webscraping library wraps the PyQt4 WebView into a simple and easy-to-use API.

Here is a simple example to download a web page rendered by WebKit and extract the title element using XPath (taken from the URL above):

from webscraping import download, xpath
D = download.Download()
# download and cache the Google Code webpage
html = D.get('http://code.google.com/p/webscraping')
# use xpath to extract the project title
print(xpath.get(html, '//div[@id="pname"]/a/span'))

祖国的老花朵

#7 · 2020-02-19 09:16

There are many options when dealing with static HTML, which the other responses cover. However, if you need JavaScript support and want to stay in Python, I recommend using WebKit to render the web page (including the JavaScript) and then examining the resulting HTML. For example:

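import sys
import signal
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in WebKit and keep the HTML produced after JavaScript has run."""
    def __init__(self, url):
        # a QApplication must exist before any other Qt object is created
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        # restore the default SIGINT handler so Ctrl+C can interrupt the event loop
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.loadFinished.connect(self._finished_loading)
        self.mainFrame().load(QUrl(url))
        # block here until _finished_loading() quits the event loop
        self.app.exec_()

    def _finished_loading(self, result):
        # result is False if the load failed; the frame's HTML is stored either way
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print('Usage: %s url' % sys.argv[0])
    else:
        javascript_html = Render(url).html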
