17 commits

Author SHA1 Message Date
Mikkel Denker
daff4d06d6
document supported search operators (#245) 2024-12-04 10:45:03 +01:00
Mikkel Denker
9e8dc92a41
Improve architecture documentation (#243)
* cleanup assets

* update crawler docs

* update search index docs

* update webgraph docs
2024-12-03 14:57:54 +01:00
Mikkel Denker
12e9502e80
Improve API documentation (#235)
* add docusaurus scalar api documentation structure

* bump openapi 3.0 to 3.1 so we can mark internal endpoints

* improve search api docs

* webgraph api docs

* point docs to prod
2024-11-19 13:43:42 +01:00
Mikkel Denker
8cdcc63371 [docs] change absolute link to mkdocs relative 2024-04-06 11:43:44 +02:00
Mikkel Denker
05e20434a6 add 'add_to_browser.md' to mkdocs navbar 2024-04-06 11:11:29 +02:00
Mikkel Denker
b678e678a6 add links to '/webmasters' information for crawler 2024-02-17 13:41:33 +01:00
jmillerv
ccd16df514
Add Stract to Web Browser Search Documentation (#135)
* add steps for chrome & firefox

* add steps for mainstream browsers

* add images to steps

* fix typo in filename

* fix typo

* remove unneeded word for brevity
2024-02-10 12:12:38 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
I think it would be easier to try to implement a BERT model in order to figure out which ops we should include in the binding. For instance, are view and concat needed?
2024-01-27 12:27:27 +01:00
Mikkel Denker
54fe19ddf6 trystract.com -> stract.com 2023-12-16 14:43:00 +01:00
Mikkel Denker
b096e7cd5b Deprecate old crawler docs.
The crawler architecture has changed tremendously with the planner etc. The docs need to be updated, but for now we will just hide them.
2023-11-23 10:13:01 +01:00
Mikkel Denker
ceb4c83c7f Better prioritization for which domains and urls to crawl.
Each domain now starts with a score of 1.0, to which the scores of all incoming links for that domain are added.
A domain's score is distributed amongst all the outgoing links for that domain when it is sampled.

The intuition is that if a domain has many outgoing links, each link has relatively little value, whereas if a domain has few outgoing links, each link is more important.

This score is of course not stable and depends on the order in which we discover and crawl urls+domains. However, I think it will work quite well as a crawl prioritization mechanism in practice.
2023-10-02 12:23:48 +02:00
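The scoring scheme described in commit ceb4c83c7f can be illustrated with a minimal sketch. This is not the code from the commit: the `DomainScores` type and its methods are hypothetical names chosen for illustration, the even split across outgoing links is an assumption (the message only says the score is "distributed"), and whether a domain's own score is reset after sampling is left out because the message does not say.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory score table (illustrative, not the crawler's real type).
#[derive(Default)]
struct DomainScores {
    scores: HashMap<String, f64>,
}

impl DomainScores {
    /// Mutable access to a domain's score; unseen domains start at 1.0.
    fn score_mut(&mut self, domain: &str) -> &mut f64 {
        self.scores.entry(domain.to_string()).or_insert(1.0)
    }

    /// Read a domain's score; unseen domains report the initial 1.0.
    fn score(&self, domain: &str) -> f64 {
        self.scores.get(domain).copied().unwrap_or(1.0)
    }

    /// Called when `domain` is sampled: its current score is split across
    /// its outgoing links (evenly, by assumption) and added to each target.
    fn distribute(&mut self, domain: &str, outgoing: &[&str]) {
        if outgoing.is_empty() {
            return;
        }
        // Many outgoing links means a small share per link; few links means a large one.
        let share = *self.score_mut(domain) / outgoing.len() as f64;
        for target in outgoing {
            *self.score_mut(target) += share;
        }
    }
}
```

With this sketch, sampling a domain that links to two targets gives each target half of its score, while a domain with a single outgoing link passes its whole score to that one target. Because shares depend on the score at sampling time, the result depends on crawl order, as the commit message notes.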
Mikkel Denker
22a8e7d4df preliminary api docs 2023-08-16 14:57:25 +02:00
Mikkel Denker
62264700fa rkyv serialization in crawl-db to increase performance quite a bit 2023-08-16 08:55:08 +02:00
Mikkel Denker
5a562b66d3 tune rocksdb options to reduce write amplification in crawl coordinator 2023-08-14 20:49:12 +02:00
Mikkel Denker
feec143db8 reduce crawler memory usage 2023-08-14 11:27:58 +02:00
Mikkel Denker
4c0b5e4d88 Fix sonic broken pipe due to low timeouts 2023-08-10 09:59:24 +02:00
Mikkel Denker
36f22e801e
Overview docs (#73)
* Begin overview documentation in mdbook format

* Overview of the different docs

* Move overview documentation to mkdocs

* Reduce webgraph segment merges by introducing a webgraph commit mode that commits the live segment directly to the stored segment

* Parallel harmonic centrality calculations

* Even more parallelism in harmonic centrality calculations

* Way faster hyperloglog but also less accurate

* Dynamic exact counting threshold proportional to size of graph

* improve inbound similarity speed and fix hyperloglog out-of-bounds bug

* no need to load all nodes into memory for harmonic centrality

* Use rayon directly in indexer.
Hopefully this fixes the bug where the indexer takes a new job before it has finished the first one. I think what happened was that the indexer thread took a new job when hitting the webgraph executor.

* single threaded webgraph when indexing

* No need for node2id anymore

* Use single thread in tantivy by default.
We introduce a method to optimize the index for search, which currently just sets the tantivy executor to be multithreaded. This should improve the indexing performance.

* Reduce memory arena in tantivy

* try jemalloc

* Revert tantivy memory arena reduction. Caused too many files to be created when indexing warc files
2023-08-08 06:32:44 +00:00