Awesome Web Crawler Frameworks and Platforms

A collection of commonly used crawler frameworks, libraries, and platforms.

Python

Based on Python

Scrapy
https://github.com/scrapy/scrapy
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
pyspider (no longer maintained)
https://github.com/binux/pyspider
A powerful spider (web crawler) system in Python.
Playwright (multi-language)
https://github.com/microsoft/playwright-python
Playwright is a Python library to automate Chromium, Firefox and WebKit browsers with a single API. Playwright delivers automation that is ever-green, capable, reliable and fast.
Selenium (multi-language: Java, Python, C#, Ruby, JavaScript, Kotlin)
https://github.com/seleniumhq/selenium
https://www.selenium.dev/
Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides an infrastructure for the W3C WebDriver specification, a platform- and language-neutral coding interface compatible with all major web browsers.
feapder
https://github.com/Boris-code/feapder
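feapder is an easy-to-use yet powerful Python crawler framework with four built-in spider types (AirSpider, Spider, TaskSpider, and BatchSpider) that cover different scenarios.
It supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication.
Its companion crawler management system, feaplat, provides convenient deployment and scheduling.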
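
The Python frameworks listed above share one core loop: fetch a page, extract structured data from it, and skip URLs that have already been visited. The following is a minimal standard-library sketch of that loop, not the API of Scrapy, feapder, or any other project in this list. The names `LinkExtractor`, `crawl`, `fetch`, and the in-memory `PAGES` table are hypothetical and exist only for illustration; a real crawler would pass in a function that performs HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the page title and absolute link URLs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl. `fetch` is a caller-supplied function: URL -> HTML string."""
    seen = {start_url}          # in-memory deduplication of discovered URLs
    queue = deque([start_url])
    items = []
    while queue and len(items) < max_pages:
        url = queue.popleft()
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        items.append({"url": url, "title": parser.title.strip()})
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items


# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/">home</a><a href="/b">B</a>',
    "http://example.test/b": '<title>Page B</title><a href="/a">A</a>',
}

items = crawl("http://example.test/", PAGES.__getitem__)
for item in items:
    print(item)
```

The frameworks above add what this sketch leaves out, such as concurrent requests, retries, persistent deduplication, and browser rendering, so it is best read as a picture of the workflow rather than a substitute for any of them.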

Based on Go

rod
https://github.com/go-rod/rod
Rod is a high-level driver directly based on the DevTools Protocol. It's designed for web automation and scraping at both a high and a low level: senior developers can use the low-level packages and functions to customize or build their own version of Rod, while the high-level functions serve as examples of how to build a default version.
goquery
https://github.com/PuerkitoBio/goquery
goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.
colly
https://github.com/gocolly/colly
https://go-colly.org/
Lightning Fast and Elegant Scraping Framework for Gophers
crawlab
https://github.com/crawlab-team/crawlab
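A Golang-based distributed crawler management platform that supports multiple programming languages, including Python, NodeJS, Go, Java, and PHP, as well as a variety of crawler frameworks.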

Based on NodeJS & JavaScript

playwright (multi-language: Python, C#, Java)
https://github.com/microsoft/playwright
https://playwright.dev/
Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API. Playwright is built to enable cross-browser web automation that is ever-green, capable, reliable and fast.
puppeteer
https://github.com/puppeteer/puppeteer
https://pptr.dev/
Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Puppeteer runs in headless mode by default, but can be configured to run in full ("headful") Chrome/Chromium.
Crawlee
https://github.com/apify/crawlee
https://crawlee.dev/
Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers quickly. Its crawlers are designed to appear human-like and stay under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.