0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	daff4d06d6	document supported search operators (#245)	2024-12-04 10:45:03 +01:00
Mikkel Denker	9e8dc92a41	Improve architecture documentation (#243) * cleanup assets * update crawler docs * update search index docs * update webgraph docs	2024-12-03 14:57:54 +01:00
Mikkel Denker	12e9502e80	Improve API documentation (#235) * add docusaurus scalar api documentation structure * bump openapi 3.0 to 3.1 so we can mark internal endpoints * improve search api docs * webgraph api docs * point docs to prod	2024-11-19 13:43:42 +01:00
Mikkel Denker	8cdcc63371	[docs] change absolute link to mkdocs relative	2024-04-06 11:43:44 +02:00
Mikkel Denker	05e20434a6	add 'add_to_browser.md' to mkdocs navbar	2024-04-06 11:11:29 +02:00
Mikkel Denker	b678e678a6	add links to '/webmasters' information for crawler	2024-02-17 13:41:33 +01:00
jmillerv	ccd16df514	Add Stract to Web Browser Search Documentation (#135) * add steps for chrome & firefox * add steps for mainstream browsers * add images to steps * fix typo in filename * fix typo * remove word for unneeded word for brevity	2024-02-10 12:12:38 +01:00
Mikkel Denker	1a9f381d15	GGML Rust bindings (#122) * move crates into a 'crates' folder * added cargo-about to check dependency licenses * create ggml-sys bindings and build as a static library. simple addition sanity test passes * update licenses * yeet alice * yeet qa model * yeet fact model * [wip] idiomatic rust bindings for ggml * [ggml] mul, add and sub ops implemented for tensors. i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?	2024-01-27 12:27:27 +01:00
Mikkel Denker	54fe19ddf6	trystract.com -> stract.com	2023-12-16 14:43:00 +01:00
Mikkel Denker	b096e7cd5b	Deprecate old crawler docs. The crawler architecture has changed tremendously with the planner etc. The docs need to be updated, but for now we will just hide them.	2023-11-23 10:13:01 +01:00
Mikkel Denker	ceb4c83c7f	Better prioritization for which domains and urls to crawl. Each domain now starts with a score of 1.0, to which the scores of all incoming links for that domain are added. A domain's score is distributed among all the outgoing links for that domain when it is sampled. The intuition is that if a domain has many outgoing links, each link has relatively little value, whereas if a domain has few outgoing links, each link is more important. This score is of course not stable and depends on the order in which we discover and crawl urls+domains. However, I think it will work quite well as a crawl prioritization mechanism in practice.	2023-10-02 12:23:48 +02:00
Mikkel Denker	22a8e7d4df	preliminary api docs	2023-08-16 14:57:25 +02:00
Mikkel Denker	62264700fa	rkyv serialization in crawl-db to increase performance quite a bit	2023-08-16 08:55:08 +02:00
Mikkel Denker	5a562b66d3	tune rocksdb options to reduce write amplification in crawl coordinator	2023-08-14 20:49:12 +02:00
Mikkel Denker	feec143db8	reduce crawler memory usage	2023-08-14 11:27:58 +02:00
Mikkel Denker	4c0b5e4d88	Fix sonic broken pipe due to low timeouts	2023-08-10 09:59:24 +02:00
Mikkel Denker	36f22e801e	Overview docs (#73) * Begin overview documentation in mdbook format * Overview of the different docs * Move overview documentation to mkdocs * Reduce webgraph segment merges by introducing a webgraph commit mode that commits the live segment directly to the stored segment * Parallel harmonic centrality calculations * Even more parallelism in harmonic centrality calculations * Way faster hyperloglog but also less accurate * Dynamic exact counting threshold proportional to size of graph * improve inbound similarity speed and fix hyperloglog out-of-bounds bug * no need to load all nodes into memory for harmonic centrality * Use rayon directly in indexer. Hopefully this fixes the bug where the indexer takes a new job before it has finished the first one. I think what happened was that the indexer thread took a new job when hitting the webgraph executor. * single threaded webgraph when indexing * No need for node2id anymore * Use single thread in tantivy by default. We introduce a method to optimize the index for search, which currently just sets the tantivy executor to be multithreaded. This should improve the indexing performance. * Reduce memory arena in tantivy * try jemalloc * Revert tantivy memory arena reduction. Caused too many files to be created when indexing warc files	2023-08-08 06:32:44 +00:00

17 commits
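
The crawl prioritization described in commit ceb4c83c7f can be illustrated with a small sketch. This is not Stract's code (the project is written in Rust); it is a minimal Python illustration of the scoring rule as the commit message states it. The names `new_scores`, `distribute`, and the example domains are hypothetical, and the sketch assumes a sampled domain keeps its own score, which the commit message does not specify.

```python
from collections import defaultdict

INITIAL_SCORE = 1.0  # every domain starts at 1.0, per the commit message


def new_scores():
    """Score table in which unseen domains default to the initial score."""
    return defaultdict(lambda: INITIAL_SCORE)


def distribute(scores, outlinks, domain):
    """Split `domain`'s score evenly across the domains it links to.

    Assumption: the sampled domain keeps its own score afterwards; the
    commit message does not say whether it is reset.
    """
    targets = outlinks.get(domain, [])
    if not targets:
        return
    share = scores[domain] / len(targets)
    for target in targets:
        scores[target] += share


scores = new_scores()
outlinks = {  # hypothetical link graph between domains
    "hub.example": ["a.example", "b.example", "c.example", "d.example"],
    "niche.example": ["a.example"],
}
distribute(scores, outlinks, "hub.example")    # each target gains 1.0 / 4
distribute(scores, outlinks, "niche.example")  # the single target gains 1.0 / 1
```

A link from the domain with four outgoing links contributes 0.25, while the only link from the other domain contributes the full 1.0, which matches the intuition in the commit message. As the message also notes, the result depends on the order in which domains are sampled.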