WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.
WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
Other Java web crawler tools and open-source project resources: