search-engine-stract/configs
2024-12-09 15:52:52 +01:00
crawler                    group urls by domain centrality in planner and process groups by highest centrality first  2024-10-01 10:37:23 +02:00
indexer                    internet archive warc files does not seem to store the payload type. let's just assume it's html (records that can't be parsed are skipped anyway)  2024-02-08 16:14:24 +01:00
shortest_paths             optional max distance in shortest paths  2024-12-09 15:52:52 +01:00
webgraph                   Webgraph inverted index (#232)  2024-10-23 11:59:52 +02:00
api.toml                   better keyphrases  2024-08-07 09:44:41 +02:00
canonical_index.toml       canonical index to lookup canonical version of urls (if they are the same host) for a higher quality web graph  2024-04-20 16:38:34 +02:00
entity_search_server.toml  Move entity index out of normal search index and have dedicated search server for it  2024-01-23 14:53:33 +01:00
search_server.toml         update config files to match new dual_encoder path  2024-06-28 15:21:56 +02:00
site_stats.toml            Compute some site level statistics (number of pages, number of news articles and number of blog posts). This will be used to determine which sites to crawl in the live index (#220)  2024-09-07 15:10:15 +02:00
web_spell.toml             New spell corrector training.  2023-11-30 15:47:56 +01:00