* add docusaurus scalar api documentation structure
* bump openapi 3.0 to 3.1 so we can mark internal endpoints
* improve search api docs
* webgraph api docs
* point docs to prod
* add steps for chrome & firefox
* add steps for mainstream browsers
* add images to steps
* fix typo in filename
* fix typo
* remove unneeded word for brevity
* move crates into a 'crates' folder
* added cargo-about to check dependency licenses
* create ggml-sys bindings and build as a static library.
simple addition sanity test passes
* update licenses
* yeet alice
* yeet qa model
* yeet fact model
* [wip] idiomatic rust bindings for ggml
* [ggml] mul, add and sub ops implemented for tensors.
I think it would be easier to try to implement a bert model in order to figure out which ops we should include in the binding. For instance, are view and concat needed?
Each domain now starts with a score of 1.0, to which the scores of all incoming links for that domain are added.
A domain's score is distributed among all the outgoing links for that domain when it is sampled.
The intuition is that if a domain has many outgoing links, each link has relatively little value whereas if a domain has few outgoing links, each link
is more important.
This score is of course not stable and depends on the order we discover and crawl urls+domains. However, I think it will work quite well
as a crawl prioritization mechanism in practice.
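
A minimal sketch of the scoring rule described above, not the actual crawler code. `DomainScores`, `score` and `distribute` are hypothetical names. The even split across outgoing links and leaving the source score unchanged after sampling are assumptions; the description above does not specify either.

```rust
use std::collections::HashMap;

/// Hypothetical score table for crawl prioritization (illustrative names only).
#[derive(Default)]
pub struct DomainScores {
    scores: HashMap<String, f64>,
}

impl DomainScores {
    /// Current score of a domain. Domains we have not seen yet start at 1.0.
    pub fn score(&self, domain: &str) -> f64 {
        self.scores.get(domain).copied().unwrap_or(1.0)
    }

    /// Called when `source` is sampled: its score is split evenly across its
    /// outgoing links (assumption) and each share is added to the target's score.
    /// The source's own score is left untouched (assumption).
    pub fn distribute(&mut self, source: &str, outgoing: &[&str]) {
        if outgoing.is_empty() {
            return;
        }
        let share = self.score(source) / outgoing.len() as f64;
        for target in outgoing {
            *self.scores.entry((*target).to_string()).or_insert(1.0) += share;
        }
    }
}
```

With this sketch, a domain with two outgoing links passes half of its score to each target, while a domain with a single outgoing link passes its full score, which matches the intuition that links from domains with few outgoing links matter more.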
* Begin overview documentation in mdbook format
* Overview of the different docs
* Move overview documentation to mkdocs
* Reduce webgraph segment merges by introducing a webgraph commit mode that commits the live segment directly to the stored segment
* Parallel harmonic centrality calculations
* Even more parallelism in harmonic centrality calculations
* Way faster hyperloglog but also less accurate
* Dynamic exact counting threshold proportional to size of graph
* improve inbound similarity speed and fix hyperloglog out-of-bounds bug
* no need to load all nodes into memory for harmonic centrality
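
For reference, a small sketch of what harmonic centrality computes: for each node, the sum of 1/d(u, v) over all other nodes u that can reach it. This is the exact, sequential BFS formulation and is only practical for small graphs; it is not the parallel, hyperloglog-based approximation the commits above refer to. The function name and the `incoming` adjacency-list layout are assumptions for illustration.

```rust
use std::collections::VecDeque;

/// Exact harmonic centrality via one BFS per node over reversed edges.
/// `incoming[v]` lists the nodes that have an edge pointing into `v`.
pub fn harmonic_centrality(incoming: &[Vec<usize>]) -> Vec<f64> {
    let n = incoming.len();
    let mut result = vec![0.0; n];
    for v in 0..n {
        // dist[u] is the length of the shortest path u -> v, or MAX if unseen.
        let mut dist = vec![usize::MAX; n];
        let mut queue = VecDeque::new();
        dist[v] = 0;
        queue.push_back(v);
        while let Some(node) = queue.pop_front() {
            for &pred in &incoming[node] {
                if dist[pred] == usize::MAX {
                    dist[pred] = dist[node] + 1;
                    // Each newly reached node contributes 1 / distance.
                    result[v] += 1.0 / dist[pred] as f64;
                    queue.push_back(pred);
                }
            }
        }
    }
    result
}
```

For the chain 0 -> 1 -> 2, node 2 is reached from node 1 at distance 1 and from node 0 at distance 2, so its centrality is 1 + 1/2.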
* Use rayon directly in indexer.
Hopefully this fixes the bug where the indexer takes a new job before it has finished the first one. I think what happened was that the indexer thread took a new job when hitting the webgraph executor.
* single threaded webgraph when indexing
* No need for node2id anymore
* Use single thread in tantivy by default.
We introduce a method to optimize the index for search, which currently just sets the tantivy executor to be multithreaded. This should improve the indexing performance.
* Reduce memory arena in tantivy
* try jemalloc
* Revert tantivy memory arena reduction. Caused too many files to be created when indexing warc files