Commit graph

48 commits

Author SHA1 Message Date
Mikkel Denker
de7291daa1 just update 2024-12-03 15:00:08 +01:00
Mikkel Denker
9e8dc92a41
Improve architecture documentation (#243)
* cleanup assets

* update crawler docs

* update search index docs

* update webgraph docs
2024-12-03 14:57:54 +01:00
Mikkel Denker
01de7a107b
Improve zimba+web-spell docs and release the modules under MIT (#242)
* improve web-spell docs

* improve zimba docs

* release zimba + web-spell under MIT
2024-12-02 13:21:26 +01:00
Mikkel Denker
040d04413e
Web spell as dedicated module (#240)
* separate web-spell into a dedicated module

* web-spell readme
2024-11-29 15:15:18 +01:00
Mikkel Denker
05d3cf9de5
urlencode user queries when forwarding to bang location to prevent open redirect vulnerabilities (#239) 2024-11-29 12:17:01 +01:00
Mikkel Denker
1d821ef4db rustup update and fix clippy warnings 2024-11-27 17:15:24 +01:00
Mikkel Denker
12e9502e80
Improve API documentation (#235)
* add docusaurus scalar api documentation structure

* bump openapi 3.0 to 3.1 so we can mark internal endpoints

* improve search api docs

* webgraph api docs

* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
4e8426888b just update 2024-11-01 15:28:40 +01:00
Mikkel Denker
31bfebf2c9 just update 2024-10-25 09:37:45 +02:00
Mikkel Denker
5ebdb24a07 just update 2024-10-01 09:51:11 +02:00
Mikkel Denker
365ed02813
Very simple WAL built on top of file-store primitives (#219)
Doesn't handle concurrent writes and flushes after each write. This will cause a lot of fsync's which will impact performance, but as this will be used for the live index where each item (a full webpage) is quite large, this will hopefully not be too detrimental.
2024-09-05 14:35:52 +02:00
Mikkel Denker
48789d75a2 cargo update 2024-09-04 16:18:11 +02:00
Mikkel Denker
21d7342ecd optionally construct only host graph or page graph 2024-08-13 15:25:39 +02:00
Mikkel Denker
ebbfffdb8e simpler keyphrase extraction
turns out this works even better than the more complex approach when the dataset is large enough
2024-08-09 17:28:46 +02:00
Mikkel Denker
b79233302b api admin interface 2024-08-06 18:26:14 +02:00
Mikkel Denker
fa5282a800 move some of the 'stream.next()' functionality into traits in a lending-iter crate so we can implement and re-use adapters 2024-07-25 13:58:07 +02:00
Mikkel Denker
8b2cbc98e0 cargo update 2024-07-24 17:45:17 +02:00
Mikkel Denker
cef67d6aaf automatically build autosuggest from search index keywords 2024-07-24 16:48:57 +02:00
Mikkel Denker
3e5875839b use workspace fst in tantivy 2024-07-01 14:22:34 +02:00
Mikkel Denker
b15261b003 remove some unused dependencies 2024-07-01 14:12:58 +02:00
Mikkel Denker
a3292143d3 fix clippy warnings 2024-07-01 12:42:02 +02:00
Mikkel Denker
303a2cf2da accept unicode-3.0 license 2024-06-27 17:12:28 +02:00
Mikkel Denker
126eedc0d0 way more robust robotstxt parser 2024-06-27 16:10:41 +02:00
Mikkel Denker
a9eb8acd80 store 'rel' attribute for each edge in the webgraph
this allows us to skip links to tag pages etc. when calculating harmonic centrality which should greatly improve the centrality values for the page graph
2024-06-11 15:34:54 +02:00
Mikkel Denker
e74978c5bc update image dependency 2024-05-23 12:44:01 +02:00
Mikkel Denker
d023552391 use binary heap for less cmp when merging spell correction dictionaries 2024-05-18 13:32:50 +02:00
Mikkel Denker
17fed5a75c
Show ranking signals (#201) 2024-05-17 16:39:33 +02:00
Mikkel Denker
e026fe5548
Sonic connection pool (#200)
* allow connection reuse by not taking ownership in send methods

* [sonic] continously handle requests from each connection in the server as long as the connection is not closed

* add connection pool to sonic based on deadpool

* use connection pool in remote webgraph and distributed searcher

* hopefully fix flaky test

* hopefully fix flaky test
2024-05-09 15:24:43 +02:00
Mikkel Denker
3c94cb7f81 approximate number of hits by assuming that each term is independent
this allows us to short-cirquit the query by default which significantly improves performance as we therefore don't have to iterate the non-scored results simply to count them
2024-05-06 15:21:17 +02:00
Mikkel Denker
73e5445018
update fend to 1.4.8 (#198) 2024-05-05 17:05:44 +02:00
Mikkel Denker
7e9da2e37c chore: upgrade dependencies for kuchiki 2024-05-03 12:23:10 +02:00
Mikkel Denker
9c983e5f96
Top k webgraph edges (#197)
* implement random access index in file_store where keys are u64 and values are serialised to a constant size

* cleanup: move all webgraph store writes into store_writer

* add a 'ConstIterableStore' that can store items on disk without needing to interleave headers in the case that all items can be serialized to a constant number of bytes known up front

* change edges file format to make edges for a given node iterable.
this allows us to only load a subset of the edges for a node in the future

* compress webgraph labels in blocks of 128

* ability to limit number of edges returned by webgraph

* sort edges in webgraph store by the host rank of the opposite node
2024-05-03 09:33:57 +02:00
Mikkel Denker
a6598b8169 add nlnet funding 2024-04-26 09:17:03 +02:00
Mikkel Denker
da4f930b03 cratify file-store 2024-04-22 21:30:59 +02:00
Oliver Bøving
18d9d279fb
Cratify bloom and speedy-kv (#193)
* Move bloom into separate crate

* Move speedy_kv into a separate crate

* add licenses

---------

Co-authored-by: Mikkel Denker <mikkel@stract.com>
2024-04-22 21:18:44 +02:00
Mikkel Denker
1447be7bfa revert tantivy upgrade a bit due to deprecation warning 2024-04-18 17:06:39 +02:00
Mikkel Denker
c43283cc5d use redb in live-index downloaded db.
to truncate the database, we would have to implement deletes and possibly also some kind of auto merging strategy in speedy_kv. to keep things simple, we use redb for this db instead.
2024-04-18 12:45:02 +02:00
Mikkel Denker
57c2affa50 new speedy-kv designed for very read heavy workloads without many small writes.
this basically describes most of our workloads. as an example, in the webgraph we know that we only ever get inserts when constructing the graph, after which all the reads will happen.
the key-value database consists of the following components:
* an fst index (key -> blob_id)
* a memory mapped blob index (blob_id -> blob_ptr)
* a memory mapped blob store (blob_ptr -> blob)

this allows us to move everything over from rocksdb to speedy_kv, and thereby removing the rocksdb dependency.
2024-04-17 14:14:39 +02:00
Mikkel Denker
3ab4f944e0
MapReduce -> AMPC (#189)
* [WIP] structure for mapreduce -> ampc and introduce tables in dht

* temporarily disable failing lints in ampc/mod.rs

* establish dht connection in ampc

* support batch get/set in dht

* ampc implementation (not tested yet)

* dht upsert

* no more todo's in ampc harmonic centrality impl

* return 'UpsertAction' instead of bool from upserts
this makes it easier to see what action was taken from the callers perspective. a bool is not particularly descriptive

* add ability to have multiple dht tables for each ampc algorithm
gives better type-safety as each table can then have their own key-value type pair

* some bundled bug/correctness fixes.
* await currently scheduled jobs after there are no more jobs to schedule.
* execute each mapper fully at a time before scheduling next mapper.
* compute centrality scores from set cardinalities.

* refactor into smaller functions

* happy path ampc dht test and split ampc into multiple files

* correct harmonic centrality calculation in ampc

* run distributed harmonic centrality worker and coordinator from cli

* stream key/values from dht using range queries in batches

* benchmark distributed centrality calculation

* faster hash in shard selection and drop table in background thread

* Move all rpc communication to bincode2. This should give a significant serilization/deserilization performance boost

* dht store copy-on-write for keys and values to make table clone faster

* fix flaky dht test and improve .set performance using entries

* dynamic batch size based on number of shards in dht cluster
2024-04-15 10:29:33 +02:00
Mikkel Denker
2dadbf70d6
Schema fields as traits (#185)
* refactor data that is re-used across fields for a particular page during indexing into an 'FnCache'

* automatically generate ALL_FIELDS and ALL_SIGNALS arrays with strum macro. ensures the arrays are always fully up to date

* split up schema fields into submodules

* add textfield trait with enum-dispatch

* add fastfield trait with enum-dispatch

* move field names into trait

* move some trivial functions from 'FastFieldEnum' and 'TextFieldEnum' into their respective traits

* move methods from Field into TextField and FastField traits

* extract html .as_tantivy into textfield trait

* extract html .as_tantivy into fastfield trait

* extract webpage .as_tantivy into field traits

* fix indexer example cleanup
2024-03-20 21:36:44 +01:00
Mikkel Denker
37b6c7d86c
Distributed in-memory key/value store for mapreduce (#181)
* [WIP] raft consensus using openraft on sonic networking

* handle rpc's on nodes

* handle get/set application requests

* dht get/set stubs that handles leader changes and retries
also improve sonic error handling. there is no need for handle to return a sonic::Result, it's better that the specific message has a Result<...> as their response as this can then be properly handled on the caller side

* join existing raft cluster

* make sure node state is consisten in case of crash -> rejoin

* ResilientConnection in sonic didn't retry requests, only connections, and was therefore a bit misleading. remove it and add a send_with_timeout_retry method to normal connection with sane defaults in .send method

* add Response::Empty to raft in order to avoid having to send back hacky Response::Set(Ok(())) for internal raft entries

* change key/value in dht to be arbitrary bytes

* dht chaos proptest

* make dht tests more reliable
in raft, writes are written to a majority quorom. if we have a cluster of 3 nodes, this means that we can only be sure that 2 of the nodes get's the data. the test might therefore fail if we are unlucky and check the node that didn't get the data yet. by having a cluster of 2 nodes instead, we can be sure that both nodes always receives all writes.

* sharded dht client
2024-03-17 16:04:07 +01:00
Mikkel Denker
32bbbc63ab
Semantic embeddings (#179)
* change indexer to prepare webpages in batches

* some clippy lints

* split 'IndexingWorker::prepare_webpages' into more readable sub functions and fix more clippy pedantic lints

* use dual encoder to embed title and keywords of page during indexing

* make sure we don't open harmonic centrality rocksdb in core/src during test...

* add indexer example used for benchmark

* add option to only compute embeddings of top ranking sites.
this is not really ideal, but it turns out to be way too slow to compute
the embeddings for all the sites in the index. this way, we at least get embeddings
for the sites that are most likely to appear in the search results while it is
still tractable to compute.

* store embeddings in index as bytes

* refactor ranking pipeline to statically ensure we score the different stages as expected

* use similarity between title and query embeddings during ranking

* use keyword embeddings during ranking

* handle missing fastfields in index gracefully

* remove unneeded Arc clone when constructing 'RecallRankingWebpage'
2024-03-11 14:28:17 +01:00
Mikkel Denker
519ecb52b7 chore: cargo update 2024-03-05 13:58:47 +01:00
Mikkel Denker
f8c58b3c03
use more sophisticated encoding detection when utf8 decoding fails. (#172)
some websites, especially older ones, sometimes use a different encoding scheme than utf8 or latin1. before, we simply tried different encoding schemes until one successfully decoded the bytes but this approach can fail unexpectedly as some encodings can erroneously get decoded by other encodings without errors being reported.
we now use the encoding detection crate 'chardetng' which is also [used in firefox](https://github.com/hsivonen/chardetng?tab=readme-ov-file#purpose).
2024-03-05 10:55:05 +01:00
Mikkel Denker
b678e678a6 add links to '/webmasters' information for crawler 2024-02-17 13:41:33 +01:00
Mikkel Denker
0b69853fa9 chore: 'cargo update' and remove some unused trait method.
also accept gplv3 licenses in libraries as this is permitted under section 13 of gplv3.
2024-02-12 13:49:20 +01:00
Mikkel Denker
e4e3044e47 finally ditch that pesky libtorch dependency! 2024-02-02 13:11:06 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?
2024-01-27 12:27:27 +01:00