Expert Crawl Start

You can define URLs as start points for web page crawling and start the crawl here. "Crawling" means that YaCy downloads the given website, extracts all links from it, and then downloads the content behind those links. This is repeated up to the depth specified under "Crawling Depth". A crawl can also be started without this form, using wget and the POST arguments of this web page (see the sketch below).
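As a rough illustration of that wget-style start, the sketch below sends the same kind of POST request from Python. The endpoint path and parameter names (Crawler_p.html, crawlingMode, crawlingURL, crawlingDepth) are assumptions based on a typical YaCy peer; copy the exact field names from this form on your own peer and add admin credentials if it requires them.

```python
# Minimal sketch of starting a crawl over HTTP instead of through this form.
# Endpoint and parameter names are assumptions; verify them against the form
# fields of your own peer and add authentication if the peer requires it.
from urllib import parse, request

peer = "http://localhost:8090"              # default YaCy port, adjust as needed
params = {
    "crawlingMode": "url",                  # start from an explicit URL
    "crawlingURL": "https://example.org/",  # the start point
    "crawlingDepth": "2",                   # how many link levels to follow
}

data = parse.urlencode(params).encode("utf-8")
with request.urlopen(peer + "/Crawler_p.html", data=data) as response:
    print(response.status, response.reason)
```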


Crawl Job

A Crawl Job consists of one or more start points, crawl limitations, and document freshness rules.
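To keep those three parts in view while reading the sections below, here is a minimal sketch that groups them into one structure. The class and field names are invented for illustration and do not correspond to YaCy's internal API.

```python
# Illustrative grouping of a crawl job's parts: start points, crawler limits
# applied before loading, and a document freshness rule. All names are
# invented for this sketch.
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class CrawlJob:
    start_urls: list[str]                       # one or more start points
    crawl_depth: int = 2                        # how far to follow links
    must_match: str = ".*"                      # pre-load URL filter
    must_not_match: str = ""                    # pre-load URL exclusion
    stale_after: timedelta = timedelta(days=7)  # document freshness rule

job = CrawlJob(start_urls=["https://example.org/"], crawl_depth=3)
print(job)
```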


Start Point

One or more URLs from which the crawl begins.

Crawler Filter

These limitations apply to the crawl stacker: the filters are evaluated before a web page is loaded (a regex sketch follows the option list below).


Also load all linked non-parsable documents
Use Page Count: limit the maximum number of pages per domain
Accept URLs with query-part ('?')
Obey html-robots-noindex
Obey html-robots-nofollow

Load Filter on URLs:
must-match (Use Filter, Restrict to start domain(s), or Restrict to sub-path(s))
must-not-match

Load Filter on IPs:
must-match / must-not-match

Must-Match List for Country Codes:
No country code restriction
Use filter
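To make the must-match / must-not-match semantics concrete, the sketch below applies such pre-load filtering to candidate URLs with regular expressions. The patterns themselves are examples, not YaCy defaults.

```python
# Illustrative must-match / must-not-match URL filter as applied before
# loading: a URL is stacked only if it matches the must-match pattern and
# does not match the must-not-match pattern. Patterns are examples only.
import re

MUST_MATCH = re.compile(r"https?://(www\.)?example\.org/.*")  # e.g. restrict to the start domain
MUST_NOT_MATCH = re.compile(r".*\.(zip|exe)$")                # e.g. skip binary downloads

def should_stack(url: str) -> bool:
    """True if the URL passes both pre-load filters."""
    return bool(MUST_MATCH.fullmatch(url)) and not MUST_NOT_MATCH.fullmatch(url)

for candidate in ("https://example.org/docs/intro.html",
                  "https://example.org/files/archive.zip",
                  "https://other.net/page.html"):
    print(candidate, should_stack(candidate))
```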

Document Filter

These limitations apply to the index feeder: the filters are evaluated after a web page has been loaded (see the sketch after the filter list).

must-match / must-not-match filter on URLs
must-match / must-not-match filter on document content
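As a sketch of what post-load filtering means in contrast to the crawler filter above, the snippet below hands a loaded document to the indexer only if its extracted text satisfies a must-match pattern and no must-not-match pattern. The Document type and both patterns are invented for illustration.

```python
# Illustrative post-load document filter: it runs after the page has been
# downloaded and parsed, and decides whether the document reaches the indexer.
import re
from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

MUST_MATCH = re.compile(r"open[- ]source", re.IGNORECASE)      # example content requirement
MUST_NOT_MATCH = re.compile(r"page not found", re.IGNORECASE)  # example exclusion

def feed_to_index(doc: Document) -> bool:
    """True if the loaded document passes both content filters."""
    return bool(MUST_MATCH.search(doc.text)) and not MUST_NOT_MATCH.search(doc.text)

print(feed_to_index(Document("https://example.org/", "YaCy is open-source search software.")))
print(feed_to_index(Document("https://example.org/x", "Error: page not found")))
```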

Clean-Up before Crawl Start
Do not delete any document before the crawl is started.
For each host in the start URL list, delete all documents (in the given sub-path) from that host.
Treat documents that were loaded longer ago than a given age as stale and delete them before the crawl is started.
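A hedged sketch of how the three clean-up modes could be applied to an existing index follows; the enum, the list-of-tuples index, and the 30-day age are all invented for illustration.

```python
# Sketch of the three clean-up modes run against an existing index before a
# crawl starts. The index is faked as (url, loaded_at) pairs for illustration.
from datetime import datetime, timedelta
from enum import Enum
from urllib.parse import urlparse

class CleanUp(Enum):
    NO_DELETE = 1           # keep every existing document
    DELETE_START_HOSTS = 2  # delete documents from the hosts in the start URL list
    DELETE_STALE = 3        # delete documents older than a given age

def urls_to_delete(index, mode, start_urls, max_age=timedelta(days=30)):
    now = datetime.now()
    start_hosts = {urlparse(u).hostname for u in start_urls}
    for url, loaded_at in index:
        if mode is CleanUp.DELETE_START_HOSTS and urlparse(url).hostname in start_hosts:
            yield url
        elif mode is CleanUp.DELETE_STALE and now - loaded_at > max_age:
            yield url

index = [("https://example.org/a", datetime.now() - timedelta(days=90)),
         ("https://other.net/b", datetime.now())]
print(list(urls_to_delete(index, CleanUp.DELETE_START_HOSTS, ["https://example.org/"])))
```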

Double-Check Rules
Never load any page that is already known; only the start URL may be loaded again.
Treat documents that were loaded longer ago than a given age as stale and load them again; younger documents are ignored.
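A minimal sketch of the second rule: a known URL is re-loaded only when its previous load is older than a chosen staleness threshold. The seven-day threshold and the lookup of the last load time are illustrative.

```python
# Sketch of the age-based double-check rule: re-load a known URL only if the
# previous load is older than the staleness threshold. Values are examples.
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=7)  # example threshold

def should_reload(last_loaded: Optional[datetime]) -> bool:
    """Re-load a known URL only if its previous load is older than STALE_AFTER."""
    if last_loaded is None:       # unknown URL: always load
        return True
    return datetime.now() - last_loaded > STALE_AFTER  # stale: load again; fresh: ignore

print(should_reload(None))                                 # True  (never loaded)
print(should_reload(datetime.now() - timedelta(days=30)))  # True  (stale)
print(should_reload(datetime.now() - timedelta(days=1)))   # False (still fresh)
```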

Document Cache
The cache strategy decides whether the crawler may use its local cache of previously loaded content:
no cache (always load from the network)
if fresh (use the cached copy only while it is still fresh)
if exist (use the cached copy whenever one exists)
cache only (never go online; use only cached content)
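A minimal sketch of how the four strategies decide where a document is loaded from; the enum and the freshness flag are illustrative, not YaCy's internal types.

```python
# Sketch of the four cache strategies as a decision between the local cache
# and the network. Names mirror the options above; the logic is illustrative.
from enum import Enum, auto

class CacheStrategy(Enum):
    NO_CACHE = auto()    # always load from the network
    IF_FRESH = auto()    # use the cached copy only while it is still fresh
    IF_EXIST = auto()    # use the cached copy whenever one exists
    CACHE_ONLY = auto()  # never go online; serve only cached content

def load_source(strategy: CacheStrategy, cached: bool, fresh: bool):
    """Return 'cache', 'network', or None (cache-only with nothing cached)."""
    if strategy is CacheStrategy.NO_CACHE:
        return "network"
    if strategy is CacheStrategy.CACHE_ONLY:
        return "cache" if cached else None
    if cached and (strategy is CacheStrategy.IF_EXIST or fresh):
        return "cache"
    return "network"

print(load_source(CacheStrategy.IF_FRESH, cached=True, fresh=False))  # network
print(load_source(CacheStrategy.IF_EXIST, cached=True, fresh=False))  # cache
```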

Snapshot Creation
Replace old snapshots with new ones
Add new versions for each crawl

Index Attributes
index text
index media