0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	ba14aaab68	remove optional and multivalued columns from tantivy as we only use full columnar indices	2024-07-04 14:12:14 +02:00
Mikkel Denker	8af2144898	mark the query parser in tantivy with #[cfg(test)] to ensure we don't accidentally use it instead of stracts	2024-07-02 09:08:21 +02:00
Mikkel Denker	f46abd0511	move tantivy dependencies up to workspace for consistent versioning between crates	2024-07-01 20:40:15 +02:00
Mikkel Denker	297f79d46b	store number of position bytes as u64 instead of u32 so we can have more than 4gb of positions in a segment	2024-07-01 15:11:52 +02:00
Mikkel Denker	28d2eff2c2	remove unused feature flags from tantivy	2024-07-01 14:32:23 +02:00
Mikkel Denker	3e5875839b	use workspace fst in tantivy	2024-07-01 14:22:34 +02:00
Mikkel Denker	b15261b003	remove some unused dependencies	2024-07-01 14:12:58 +02:00
Mikkel Denker	4fb6af6fed	remove aggregations for simplicity	2024-07-01 13:25:32 +02:00
Mikkel Denker	4306586763	fix clippy warnings	2024-07-01 13:09:38 +02:00
Mikkel Denker	a3292143d3	fix clippy warnings	2024-07-01 12:42:02 +02:00
Mikkel Denker	454774dfa7	fork tantivy: our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble that of the average user. forking tantivy allows us to customize it directly for our use case	2024-07-01 11:26:41 +02:00
Mikkel Denker	3ab9c301e4	update config files to match new dual_encoder path	2024-06-28 15:21:56 +02:00
Mikkel Denker	e5cc6f8442	add total num hosts and domains to crawlplan stats	2024-06-28 14:05:55 +02:00
Mikkel Denker	f6995402a1	bump crawler version	2024-06-28 13:50:58 +02:00
Mikkel Denker	071e41d167	ensure reasonable limit for robotstxt files: currently same as html body (32mb)	2024-06-28 13:47:09 +02:00
Mikkel Denker	e1ccaf9251	crawler robustness: delay between robots.txt requests; retry robots.txt requests 3 times; if /robots.txt request returns anything except 404 or 200 (times out etc.), don’t crawl site; respect crawldelay (up to max limit); respect 429 Retry-After header (up to max limit); increase timeout in reqwest client; only visit sites on port 80 and 443	2024-06-28 13:35:37 +02:00
Mikkel Denker	303a2cf2da	accept unicode-3.0 license	2024-06-27 17:12:28 +02:00
Mikkel Denker	126eedc0d0	way more robust robotstxt parser	2024-06-27 16:10:41 +02:00
Mikkel Denker	95f0703602	approximate centrality naming consistency with exact harmonic	2024-06-24 11:51:57 +02:00
Mikkel Denker	385e8375c6	deduplicate urls during indexing	2024-06-24 10:43:12 +02:00
Mikkel Denker	3844e29bd1	make bm25 constants configurable for each field	2024-06-21 15:16:07 +02:00
Mikkel Denker	e4ae26470e	skip links to/from same domain when calculating harmonic centralities: this makes it more expensive for linkfarms	2024-06-21 14:21:31 +02:00
Mikkel Denker	de5c946b58	move last worker folder into correct output location after indexing	2024-06-21 12:16:18 +02:00
Mikkel Denker	1bcf74ec11	remove bm25+ again: while it scales low term frequencies, it also adds a positive score for pages even though they don't match the query at all. that seems like a really bad idea...	2024-06-20 15:58:06 +02:00
Mikkel Denker	5beae3b9a9	simplified bm25f that uses same IDF weight across all fields: e.g. the term 'the' might not be very common in titles but should still be scaled as a less important term than other terms in the query. instead of duplicating all text in the index we approximate the bm25f IDF weight as the highest IDF across the fields	2024-06-20 15:01:41 +02:00
Mikkel Denker	2fd6db3cfa	bm25+ to not over penalize long documents https://dl.acm.org/doi/10.1145/2063576.2063584	2024-06-20 10:07:08 +02:00
Mikkel Denker	ca9b249992	[crawler] don't re-crawl url when it redirects to another url	2024-06-18 15:05:16 +02:00
Mikkel Denker	5968185136	[crawler] normalize url before fetch	2024-06-18 14:46:50 +02:00
Mikkel Denker	ad617a8151	[crawler] another check to avoid re-crawling urls	2024-06-18 14:19:15 +02:00
Mikkel Denker	2376f33e19	[crawler] some sites seem to only host robots.txt files on the 'www.' subdomain without redirects	2024-06-18 14:06:56 +02:00
Mikkel Denker	07369e1005	[crawler] continue wandering while budget > 0 and we still know about uncrawled urls	2024-06-18 13:37:29 +02:00
Mikkel Denker	919441850b	parse html mime metadata	2024-06-17 14:26:45 +02:00
Mikkel Denker	2bc0d69e00	support both 'linkto:' and 'linksto:'	2024-06-17 12:47:18 +02:00
Mikkel Denker	6aec31525a	add a 'linksto' query operator	2024-06-17 12:33:10 +02:00
Mikkel Denker	a1c9721b6c	remove some telemetry hosts from list of ad servers so pages aren't mistakenly tagged as containing ads (#209)	2024-06-12 17:29:21 +02:00
Mikkel Denker	feafa7507a	fix webgraph merge bug of missing edges: the edges were compared by their sort_key, but some nodes don't have a centrality value (sort_key) so they got erroneously mistaken for other nodes and skipped. it's better to compare directly on the node id	2024-06-12 12:58:24 +02:00
Mikkel Denker	a7b61559db	optimize trivial segment merges	2024-06-11 15:57:47 +02:00
Mikkel Denker	a9eb8acd80	store 'rel' attribute for each edge in the webgraph: this allows us to skip links to tag pages etc. when calculating harmonic centrality which should greatly improve the centrality values for the page graph	2024-06-11 15:34:54 +02:00
Mikkel Denker	efa6f9fab0	ability to convert some of the wander budget to scheduled urls: we might know about more urls for each domain than the ones that got a page centrality > 0.0. these urls won't be scheduled unless we convert some of the wander budget back to scheduled urls	2024-06-10 14:11:36 +02:00
Mikkel Denker	817bda9738	optionally merge all webgraph segments into a single segment for improved read performance	2024-06-09 14:48:11 +02:00
Mikkel Denker	295444de4e	store sort-key in webgraph edges to properly apply limits with multiple segments: this should also allow us to merge segments which should improve read performance	2024-06-07 11:13:17 +02:00
Mikkel Denker	07663c1687	no need to consider nodes where we have found a better path	2024-06-06 17:31:27 +02:00
Mikkel Denker	90980a1aa4	no need to consider nodes where we have found a better path	2024-06-06 17:05:32 +02:00
Mikkel Denker	6bc7a5caf0	simplify dijkstra to reduce allocations and btree searches	2024-06-06 15:20:10 +02:00
Mikkel Denker	78c4676165	approx harmonic parallel graph executor	2024-06-06 14:11:46 +02:00
Mikkel Denker	4fbffe63b4	optionally store pages with 0 centrality from approx harmonic	2024-06-06 11:54:04 +02:00
Mikkel Denker	4b459b579b	reduce indirections in dht (and hopefully memory usage) by removing some arcs: most dht use cases store very small keys/values but store a ton of them, so the arcs don't really help a lot	2024-06-06 11:23:31 +02:00
Mikkel Denker	f7a354572e	reduce memory usage and serialization/deserialization in dht by storing keys and values as enums instead of vec<u8>	2024-06-06 10:39:00 +02:00
Mikkel Denker	3c4a0c480e	use remote webgraph in crawl planner	2024-06-05 09:41:03 +02:00
Mikkel Denker	265b1b7871	Ranking diff tool (#207): ranking diff tool structure; fix missing icon types; add admin for queries and experiments; minor cleanup; show experiment progress; upgrade node adapter for svelte; hopefully fix ci; display common queries between experiments; display serp diffs with top signals for each result; like experiments and show overview in queries; settings to toggle experiment shuffle and show/hide signals; keyboard shortcuts; visualise improvements by query category; document how to use tool	2024-06-03 15:00:16 +02:00
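
Commit 5beae3b9a9 above describes approximating a shared bm25f IDF weight as the highest IDF across the fields. The following is a minimal sketch of that idea, not the code from the repository: the `FieldStats` struct, the function names, and the choice of the common `ln(1 + (N - n + 0.5) / (n + 0.5))` IDF form are all assumptions made for illustration.

```rust
/// BM25 inverse document frequency in one common form (an assumption here):
/// ln(1 + (N - n + 0.5) / (n + 0.5)).
fn idf(total_docs: u64, docs_with_term: u64) -> f64 {
    let n = total_docs as f64;
    let df = docs_with_term as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Hypothetical per-field statistics for a single query term.
struct FieldStats {
    total_docs: u64,
    docs_with_term: u64,
}

/// Use one IDF weight for the term in every field by taking the
/// highest per-field IDF, as the commit message describes.
fn shared_idf(fields: &[FieldStats]) -> f64 {
    fields
        .iter()
        .map(|f| idf(f.total_docs, f.docs_with_term))
        .fold(0.0, f64::max)
}
```

With this shape, each field can still apply its own term-frequency saturation and length normalization while multiplying by the same `shared_idf` value, which avoids indexing a combined copy of all text just to compute a global document frequency.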

1308 commits
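
The robots.txt rules listed in commit e1ccaf9251 can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the crawler's actual code: the `RobotsOutcome` and `Decision` types, both function names, and the treatment of a 404 as "no restrictions" are hypothetical, and the retry loop and the concrete maximum delay are left to the caller.

```rust
use std::time::Duration;

/// Hypothetical result of fetching /robots.txt after all retries.
#[derive(Debug, PartialEq)]
enum RobotsOutcome {
    Status(u16),
    Failed, // timeout, connection error, etc.
}

/// Hypothetical crawl decision for the site.
#[derive(Debug, PartialEq)]
enum Decision {
    UseRules, // 200: parse the file and obey it
    AllowAll, // 404: assumed here to mean no restrictions
    SkipSite, // anything else: don't crawl the site
}

fn decide(outcome: &RobotsOutcome) -> Decision {
    match outcome {
        RobotsOutcome::Status(200) => Decision::UseRules,
        RobotsOutcome::Status(404) => Decision::AllowAll,
        _ => Decision::SkipSite,
    }
}

/// Respect a requested delay (crawl-delay or Retry-After) up to a maximum.
fn effective_delay(requested: Option<Duration>, default: Duration, max: Duration) -> Duration {
    requested.unwrap_or(default).min(max)
}
```

Capping the requested delay keeps a single site from stalling a worker indefinitely while still honoring reasonable crawl-delay and Retry-After values.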