* [WIP] remote webgraph client
* [WIP] use remote webgraph for backlinks during indexing. still need to properly batch the requests
* support batch requests in sonic
* [WIP] use remote webgraph in explore and make sure ranking pipeline always sets updated score
* use remote webgraph for inbound similarity
* return correct type from explore api
this allows us to potentially optimise the query before it is executed to avoid having to look up the same term in the same field multiple times. the optimisations should of course ensure that the results stay the same.
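a minimal sketch of the idea, with hypothetical names (the real query structures differ): memoize the lookup per (field, term) pair so a duplicated pair is only resolved against the index once.

```rust
use std::collections::HashMap;

// `Posting` stands in for whatever the index returns for a (field, term) pair
type Posting = Vec<u64>;

#[derive(Default)]
struct TermLookupCache {
    cache: HashMap<(String, String), Posting>,
}

impl TermLookupCache {
    /// look up a term in a field, reusing the previous result if the same
    /// (field, term) pair occurs more than once in the query
    fn lookup(
        &mut self,
        field: &str,
        term: &str,
        index_lookup: impl Fn(&str, &str) -> Posting,
    ) -> &Posting {
        self.cache
            .entry((field.to_string(), term.to_string()))
            .or_insert_with(|| index_lookup(field, term))
    }
}
```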
to truncate the database, we would have to implement deletes and possibly also some kind of auto merging strategy in speedy_kv. to keep things simple, we use redb for this db instead.
this basically describes most of our workloads. as an example, in the webgraph we know that we only ever get inserts when constructing the graph, after which all the reads will happen.
the key-value database consists of the following components:
* an fst index (key -> blob_id)
* a memory mapped blob index (blob_id -> blob_ptr)
* a memory mapped blob store (blob_ptr -> blob)
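a minimal sketch of how these pieces might fit together; the names and the fixed 16-byte blob index entry layout are assumptions for illustration, not the actual speedy_kv layout.

```rust
// key -> blob_id via the fst, blob_id -> (offset, len) via the blob index,
// and finally (offset, len) -> bytes in the blob store
struct SpeedyKv {
    index: fst::Map<memmap2::Mmap>, // fst index: key -> blob_id
    blob_index: memmap2::Mmap,      // blob_id -> blob_ptr (fixed-width entries)
    blob_store: memmap2::Mmap,      // blob_ptr -> blob
}

impl SpeedyKv {
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let blob_id = self.index.get(key)? as usize;
        // each blob index entry is assumed to be two little-endian u64s: (offset, len)
        let entry = &self.blob_index[blob_id * 16..blob_id * 16 + 16];
        let offset = u64::from_le_bytes(entry[..8].try_into().unwrap()) as usize;
        let len = u64::from_le_bytes(entry[8..].try_into().unwrap()) as usize;
        Some(&self.blob_store[offset..offset + len])
    }
}
```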
this allows us to move everything over from rocksdb to speedy_kv, thereby removing the rocksdb dependency.
* [WIP] structure for mapreduce -> ampc and introduce tables in dht
* temporarily disable failing lints in ampc/mod.rs
* establish dht connection in ampc
* support batch get/set in dht
* ampc implementation (not tested yet)
* dht upsert
* no more todos in ampc harmonic centrality impl
* return 'UpsertAction' instead of bool from upserts
this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive.
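a hypothetical sketch of the shape of such a type; the actual variants may differ, but the point is that the caller sees what happened instead of interpreting a bare bool.

```rust
// returned from upserts so the caller knows what the dht actually did
pub enum UpsertAction {
    Inserted, // the key was not present before
    Merged,   // an existing value was combined with the new one
}
```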
* add ability to have multiple dht tables for each ampc algorithm
gives better type-safety as each table can then have its own key-value type pair
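a sketch of the idea under assumed names: each table is a typed handle, so the compiler prevents writing one value type into a table that expects another.

```rust
use std::marker::PhantomData;

// a typed handle over a dht table; key/value types are fixed per table
struct Table<K, V> {
    name: &'static str,
    _types: PhantomData<(K, V)>,
}

// illustrative only, not the real table set for harmonic centrality
type NodeId = u64;
struct CentralityTables {
    counters: Table<NodeId, Vec<u8>>, // e.g. serialized cardinality sketches
    centrality: Table<NodeId, f64>,   // the resulting scores
}
```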
* some bundled bug/correctness fixes.
* await currently scheduled jobs after there are no more jobs to schedule.
* execute each mapper fully before scheduling the next mapper.
* compute centrality scores from set cardinalities.
* refactor into smaller functions
* happy path ampc dht test and split ampc into multiple files
* correct harmonic centrality calculation in ampc
* run distributed harmonic centrality worker and coordinator from cli
* stream key/values from dht using range queries in batches
* benchmark distributed centrality calculation
* faster hash in shard selection and drop table in background thread
* Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost
* dht store copy-on-write for keys and values to make table clone faster
* fix flaky dht test and improve .set performance using entries
* dynamic batch size based on number of shards in dht cluster
* Implemented `keybind` module to handle keyboard shortcuts
* Removal of direct DOM querying and the addition of searchbar keybindings
* Remove generics from 'keybind'
It's always used with 'Refs' as context, so there is no need to have it generic
* Revert 'Searchbar' to use simple keydown match instead of 'Keybind'
The functionality didn't work (for instance, enter didn't trigger a search). It would require a lot of additional complexity in 'Keybind' to also support the use case from the searchbar. It's okay to have some code duplication if it results in a simpler solution that will therefore be more readable and maintainable long term
* Remove need to know about keyboard event in keybind callbacks
This forces us to not rely on direct manipulation of the event, but instead implement the necessary functionality in helper methods in the different components
* forgot to remove a console.log...
---------
Co-authored-by: Mikkel Denker <mikkel@stract.com>
a very significant amount of time was spent looking up ids when constructing the crawlplan. this should speed it up considerably, as it reduces disk seeks significantly. if it uses too much memory, we might introduce some segments in id2nodedb so that each segment has an on-disk index we can binary search to find the store ranges.
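a sketch of that segmenting idea, with hypothetical names: each segment remembers the first id it covers, so picking the right segment is a binary search over those starting ids before touching the segment's own on-disk index.

```rust
struct Segment {
    first_id: u64,
    // per-segment on-disk index and store would live here
}

/// binary search for the segment whose range contains `id`
/// (segments are assumed to be sorted by `first_id`)
fn segment_for(segments: &[Segment], id: u64) -> Option<&Segment> {
    let idx = segments.partition_point(|s| s.first_id <= id);
    idx.checked_sub(1).map(|i| &segments[i])
}
```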
more descriptive name. it's used to compute the signals, not necessarily aggregate them. still not entirely satisfied with the naming, but it's at least better than aggregator
* refactor data that is re-used across fields for a particular page during indexing into an 'FnCache'
* automatically generate ALL_FIELDS and ALL_SIGNALS arrays with strum macro. ensures the arrays are always fully up to date
* split up schema fields into submodules
* add textfield trait with enum-dispatch
* add fastfield trait with enum-dispatch
* move field names into trait
* move some trivial functions from 'FastFieldEnum' and 'TextFieldEnum' into their respective traits
* move methods from Field into TextField and FastField traits
* extract html .as_tantivy into textfield trait
* extract html .as_tantivy into fastfield trait
* extract webpage .as_tantivy into field traits
* fix indexer example cleanup
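a minimal sketch of the enum-dispatch + strum pattern from the bullets above, with made-up field types and a single trait method; the real traits carry the tantivy conversions and more.

```rust
use enum_dispatch::enum_dispatch;
use strum::{EnumIter, IntoEnumIterator};

#[enum_dispatch]
pub trait TextField {
    fn name(&self) -> &'static str;
}

#[derive(Default, Clone, Copy)]
pub struct Title;
impl TextField for Title {
    fn name(&self) -> &'static str {
        "title"
    }
}

#[derive(Default, Clone, Copy)]
pub struct CleanBody;
impl TextField for CleanBody {
    fn name(&self) -> &'static str {
        "clean_body"
    }
}

// enum_dispatch forwards the trait methods to the variants without dynamic
// dispatch, and strum's EnumIter yields every variant, so an ALL_FIELDS-style
// list can never go stale when a new field is added
#[enum_dispatch(TextField)]
#[derive(EnumIter, Clone, Copy)]
pub enum TextFieldEnum {
    Title(Title),
    CleanBody(CleanBody),
}

pub fn all_field_names() -> Vec<&'static str> {
    TextFieldEnum::iter().map(|field| field.name()).collect()
}
```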
we now redirect clients with js disabled using a <noscript><meta ...</noscript> tag. this setup allows us to show an empty serp before the search has finished if js is enabled. it also makes it possible to add more customization options later, like applying optics directly from the search string. otherwise, the server would have no way of knowing what the optic rules for a particular optic name are.
* model that inbody:... intitle:... etc can have either simple term or phrase query as subterm
* re-write query parser using nom
* all whitespace queries should return empty terms vec
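a rough sketch of the term model described above, with hypothetical type names:

```rust
// the subterm of inbody:/intitle:/inurl:/site: is either a simple term or a phrase
enum SubTerm {
    Simple(String),      // intitle:foo
    Phrase(Vec<String>), // intitle:"foo bar"
}

enum Term {
    Simple(String),
    Phrase(Vec<String>),
    Site(SubTerm),
    Title(SubTerm),
    Body(SubTerm),
    Url(SubTerm),
    Not(Box<Term>),
}
```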
* [WIP] raft consensus using openraft on sonic networking
* handle rpcs on nodes
* handle get/set application requests
* dht get/set stubs that handles leader changes and retries
also improve sonic error handling. there is no need for handle to return a sonic::Result; it's better that the specific message has a Result<...> as its response, as this can then be properly handled on the caller side
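a sketch of the intended shape, using made-up types: the sonic layer only fails on transport problems, while application-level errors travel inside the response payload where the caller can match on them.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum DhtError {
    NotLeader { leader_hint: Option<u64> },
}

// each response variant carries its own Result instead of the handler
// returning a sonic::Result
#[derive(Serialize, Deserialize)]
enum Response {
    Get(Result<Option<Vec<u8>>, DhtError>),
    Set(Result<(), DhtError>),
}
```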
* join existing raft cluster
* make sure node state is consistent in case of crash -> rejoin
* ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to the normal connection, with sane defaults in the .send method
* add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries
* change key/value in dht to be arbitrary bytes
* dht chaos proptest
* make dht tests more reliable
in raft, writes are written to a majority quorum. if we have a cluster of 3 nodes, this means that we can only be sure that 2 of the nodes get the data. the test might therefore fail if we are unlucky and check the node that didn't get the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receive all writes.
* sharded dht client