PHPCrawl webcrawler library/framework

PHPCrawl FAQ

  1. Sometimes (almost) no information about a document is passed to the user function handleDocumentInfo(), and most properties of the corresponding PHPCrawlerDocumentInfo object are empty.

    The usual reason is an error that occurred while the document was being requested. In that case the PHPCrawlerDocumentInfo property "error_occured" is true and "error_string" contains the error report as a human-readable string. For timeout errors (such as "Socket-stream timed out"), try increasing the connection timeout and/or the stream timeout:

    $crawler->setStreamTimeout(5); // defaults to 2 seconds
    $crawler->setConnectionTimeout(10); // defaults to 5 seconds
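    To see why a document came back empty, it can help to log the error fields whenever they are set. The following is only a minimal sketch: describeRequestError() is a hypothetical helper, not part of PHPCrawl, and the stdClass object merely stands in for the PHPCrawlerDocumentInfo object that the crawler passes to handleDocumentInfo(). It reads the "error_occured" and "error_string" properties mentioned above, and assumes a "url" property holds the requested address.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns a one-line summary
// of a failed request, or null when no error was reported.
function describeRequestError($docInfo)
{
    if (empty($docInfo->error_occured)) {
        return null; // no error flagged, nothing to report
    }
    // "url" is assumed to exist on the object; fall back if it does not.
    $url = isset($docInfo->url) ? $docInfo->url : '(unknown URL)';
    return 'Request failed for ' . $url . ': ' . $docInfo->error_string;
}

// Stand-in for the object handed to handleDocumentInfo(); the values
// below are made up for illustration.
$fake = new stdClass();
$fake->url = 'http://www.example.com/';
$fake->error_occured = true;
$fake->error_string = 'Socket-stream timed out';

$message = describeRequestError($fake);
if ($message !== null) {
    echo $message . "\n";
}
```

    Inside a real crawler, the same helper could be called at the top of an overridden handleDocumentInfo() with the object it receives, so that timeouts become visible before the timeout values are adjusted.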

  2. When the crawler is started in multi-process mode, many warnings like "sem_get() [function.sem-get]: failed for key 0x5202e59f: No space left on device" are thrown.

    PHPCrawl uses semaphores for inter-process communication. When crawling processes are aborted, the semaphores they used are not removed. If this happens too often, no space is left for new semaphores and the above errors occur. To remove "dead" semaphores, use the following Unix command:

    for i in `ipcs -s | awk '/phpcrawl_user/ {print $2}'`; do (ipcrm -s $i); done

    ... where "phpcrawl_user" is the user running the crawler.