* [WIP] remote webgraph client
* [WIP] use remote webgraph for backlinks during indexing. still need to properly batch the requests
* support batch requests in sonic
* [WIP] use remote webgraph in explore and make sure ranking pipeline always sets updated score
* use remote webgraph for inbound similarity
* return correct type from explore api
this allows us to potentially optimise the query before it is executed to avoid having to look up the same term in the same field multiple times. the optimisations should of course ensure that the results stay the same.
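a minimal sketch of the idea, with hypothetical names (the real query structures differ): memoize the lookup per (field, term) pair so a duplicated pair is only resolved against the index once.

```rust
use std::collections::HashMap;

// `Posting` stands in for whatever the index returns for a (field, term) pair
type Posting = Vec<u64>;

#[derive(Default)]
struct TermLookupCache {
    cache: HashMap<(String, String), Posting>,
}

impl TermLookupCache {
    /// look up a term in a field, reusing the previous result if the same
    /// (field, term) pair occurs more than once in the query
    fn lookup(
        &mut self,
        field: &str,
        term: &str,
        index_lookup: impl Fn(&str, &str) -> Posting,
    ) -> &Posting {
        self.cache
            .entry((field.to_string(), term.to_string()))
            .or_insert_with(|| index_lookup(field, term))
    }
}
```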
to truncate the database, we would have to implement deletes and possibly also some kind of auto merging strategy in speedy_kv. to keep things simple, we use redb for this db instead.
this basically describes most of our workloads. as an example, in the webgraph we know that we only ever get inserts when constructing the graph, after which all the reads will happen.
the key-value database consists of the following components:
* an fst index (key -> blob_id)
* a memory mapped blob index (blob_id -> blob_ptr)
* a memory mapped blob store (blob_ptr -> blob)
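a minimal sketch of how these pieces might fit together; the names and the fixed 16-byte blob index entry layout are assumptions for illustration, not the actual speedy_kv layout.

```rust
// key -> blob_id via the fst, blob_id -> (offset, len) via the blob index,
// and finally (offset, len) -> bytes in the blob store
struct SpeedyKv {
    index: fst::Map<memmap2::Mmap>, // fst index: key -> blob_id
    blob_index: memmap2::Mmap,      // blob_id -> blob_ptr (fixed-width entries)
    blob_store: memmap2::Mmap,      // blob_ptr -> blob
}

impl SpeedyKv {
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let blob_id = self.index.get(key)? as usize;
        // each blob index entry is assumed to be two little-endian u64s: (offset, len)
        let entry = &self.blob_index[blob_id * 16..blob_id * 16 + 16];
        let offset = u64::from_le_bytes(entry[..8].try_into().unwrap()) as usize;
        let len = u64::from_le_bytes(entry[8..].try_into().unwrap()) as usize;
        Some(&self.blob_store[offset..offset + len])
    }
}
```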
this allows us to move everything over from rocksdb to speedy_kv, thereby removing the rocksdb dependency.
* [WIP] structure for mapreduce -> ampc and introduce tables in dht
* temporarily disable failing lints in ampc/mod.rs
* establish dht connection in ampc
* support batch get/set in dht
* ampc implementation (not tested yet)
* dht upsert
* no more todos in ampc harmonic centrality impl
* return 'UpsertAction' instead of bool from upserts
this makes it easier to see what action was taken from the caller's perspective. a bool is not particularly descriptive.
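a hypothetical sketch of the shape of such a type; the actual variants may differ, but the point is that the caller sees what happened instead of interpreting a bare bool.

```rust
// returned from upserts so the caller knows what the dht actually did
pub enum UpsertAction {
    Inserted, // the key was not present before
    Merged,   // an existing value was combined with the new one
}
```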
* add ability to have multiple dht tables for each ampc algorithm
gives better type-safety as each table can then have its own key-value type pair
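a sketch of the idea under assumed names: each table is a typed handle, so the compiler prevents writing one value type into a table that expects another.

```rust
use std::marker::PhantomData;

// a typed handle over a dht table; key/value types are fixed per table
struct Table<K, V> {
    name: &'static str,
    _types: PhantomData<(K, V)>,
}

// illustrative only, not the real table set for harmonic centrality
type NodeId = u64;
struct CentralityTables {
    counters: Table<NodeId, Vec<u8>>, // e.g. serialized cardinality sketches
    centrality: Table<NodeId, f64>,   // the resulting scores
}
```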
* some bundled bug/correctness fixes.
* await currently scheduled jobs after there are no more jobs to schedule.
* execute each mapper fully before scheduling the next mapper.
* compute centrality scores from set cardinalities.
* refactor into smaller functions
* happy path ampc dht test and split ampc into multiple files
* correct harmonic centrality calculation in ampc
* run distributed harmonic centrality worker and coordinator from cli
* stream key/values from dht using range queries in batches
* benchmark distributed centrality calculation
* faster hash in shard selection and drop table in background thread
* Move all rpc communication to bincode2. This should give a significant serialization/deserialization performance boost
* dht store copy-on-write for keys and values to make table clone faster
* fix flaky dht test and improve .set performance using entries
* dynamic batch size based on number of shards in dht cluster
* Implemented `keybind` module to handle keyboard shortcuts
* Removal of direct DOM querying and the addition of searchbar keybindings
* Remove generics from 'keybind'
It's always used with 'Refs' as context, so there is no need to have it generic
* Revert 'Searchbar' to use simple keydown match instead of 'Keybind'
The functionality didn't work (for instance, enter didn't trigger a search). It would require a lot of additional complexity in 'Keybind' to also support the use case from the searchbar. It's okay to have some code duplication if it results in a simpler solution that will therefore be more readable and maintainable long term
* Remove need to know about keyboard event in keybind callbacks
This forces us to not rely on direct manipulation of the event, but instead implement the necessary functionality in helper methods in the different components
* forgot to remove a console.log...
---------
Co-authored-by: Mikkel Denker <mikkel@stract.com>
a very significant amount of time was spent looking up ids when constructing the crawlplan. this should speed it up considerably, as it reduces disk seeks significantly. if it uses too much memory, we might introduce some segments in id2nodedb so that each segment has an on-disk index we can binary search to find the store ranges.
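a sketch of that segmenting idea, with hypothetical names: each segment remembers the first id it covers, so picking the right segment is a binary search over those starting ids before touching the segment's own on-disk index.

```rust
struct Segment {
    first_id: u64,
    // per-segment on-disk index and store would live here
}

/// binary search for the segment whose range contains `id`
/// (segments are assumed to be sorted by `first_id`)
fn segment_for(segments: &[Segment], id: u64) -> Option<&Segment> {
    let idx = segments.partition_point(|s| s.first_id <= id);
    idx.checked_sub(1).map(|i| &segments[i])
}
```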
more descriptive name. it's used to compute the signals, not necessarily aggregate them. still not entirely satisfied with the naming, but it's at least better than aggregator
* refactor data that is re-used across fields for a particular page during indexing into an 'FnCache'
* automatically generate ALL_FIELDS and ALL_SIGNALS arrays with strum macro. ensures the arrays are always fully up to date
* split up schema fields into submodules
* add textfield trait with enum-dispatch
* add fastfield trait with enum-dispatch
* move field names into trait
* move some trivial functions from 'FastFieldEnum' and 'TextFieldEnum' into their respective traits
* move methods from Field into TextField and FastField traits
* extract html .as_tantivy into textfield trait
* extract html .as_tantivy into fastfield trait
* extract webpage .as_tantivy into field traits
* fix indexer example cleanup
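a minimal sketch of the enum-dispatch + strum pattern from the bullets above, with made-up field types and a single trait method; the real traits carry the tantivy conversions and more.

```rust
use enum_dispatch::enum_dispatch;
use strum::{EnumIter, IntoEnumIterator};

#[enum_dispatch]
pub trait TextField {
    fn name(&self) -> &'static str;
}

#[derive(Default, Clone, Copy)]
pub struct Title;
impl TextField for Title {
    fn name(&self) -> &'static str {
        "title"
    }
}

#[derive(Default, Clone, Copy)]
pub struct CleanBody;
impl TextField for CleanBody {
    fn name(&self) -> &'static str {
        "clean_body"
    }
}

// enum_dispatch forwards the trait methods to the variants without dynamic
// dispatch, and strum's EnumIter yields every variant, so an ALL_FIELDS-style
// list can never go stale when a new field is added
#[enum_dispatch(TextField)]
#[derive(EnumIter, Clone, Copy)]
pub enum TextFieldEnum {
    Title(Title),
    CleanBody(CleanBody),
}

pub fn all_field_names() -> Vec<&'static str> {
    TextFieldEnum::iter().map(|field| field.name()).collect()
}
```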
we now redirect clients with js disabled using a <noscript><meta ...</noscript> tag. this setup allows us to show an empty serp before the search has finished if js is enabled. it also makes it possible to add more customization options later, like applying optics directly from the search string. otherwise, the server would have no way of knowing what the optic rules for a particular optic name are.
* model that inbody:... intitle:... etc can have either simple term or phrase query as subterm
* re-write query parser using nom
* all whitespace queries should return empty terms vec
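a rough sketch of the term model described above, with hypothetical type names:

```rust
// the subterm of inbody:/intitle:/inurl:/site: is either a simple term or a phrase
enum SubTerm {
    Simple(String),      // intitle:foo
    Phrase(Vec<String>), // intitle:"foo bar"
}

enum Term {
    Simple(String),
    Phrase(Vec<String>),
    Site(SubTerm),
    Title(SubTerm),
    Body(SubTerm),
    Url(SubTerm),
    Not(Box<Term>),
}
```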
* [WIP] raft consensus using openraft on sonic networking
* handle rpcs on nodes
* handle get/set application requests
* dht get/set stubs that handles leader changes and retries
also improve sonic error handling. there is no need for handle to return a sonic::Result; it's better that the specific message has a Result<...> as its response, as this can then be properly handled on the caller side
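a sketch of the intended shape, using made-up types: the sonic layer only fails on transport problems, while application-level errors travel inside the response payload where the caller can match on them.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum DhtError {
    NotLeader { leader_hint: Option<u64> },
}

// each response variant carries its own Result instead of the handler
// returning a sonic::Result
#[derive(Serialize, Deserialize)]
enum Response {
    Get(Result<Option<Vec<u8>>, DhtError>),
    Set(Result<(), DhtError>),
}
```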
* join existing raft cluster
* make sure node state is consistent in case of crash -> rejoin
* ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to the normal connection, with sane defaults in the .send method
* add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries
* change key/value in dht to be arbitrary bytes
* dht chaos proptest
* make dht tests more reliable
in raft, writes are written to a majority quorum. if we have a cluster of 3 nodes, this means that we can only be sure that 2 of the nodes get the data. the test might therefore fail if we are unlucky and check the node that didn't get the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receive all writes.
* sharded dht client