Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
26b1f08ba6 new spell correction is more accurate, so can have lower threshold for suggestions 2024-06-03 13:12:25 +02:00
Mikkel Denker
b2dd6c8731 make sure active connections aren't reused from connection pool.
I don't know how this could even happen as we get a mutable reference from the connection pool, but this seems to solve the issue...
2024-05-28 11:32:46 +02:00
Mikkel Denker
a1381d667b fixed bug that caused error model in spell correction to always be empty 2024-05-27 11:45:07 +02:00
Mikkel Denker
e39987a2f7 remove summarizer
the probabilistic nature of llms means they have an inherent risk of hallucinating. even if they tend to cite correctly most of the time, the probability of hallucinations is still too large to be able to trust the output, thus defeating the purpose of the summary entirely. until these hallucinations are fixed (or the probability is extremely low) i don't see how it makes sense to include llms in search
2024-05-27 10:04:48 +02:00
Mikkel Denker
5c94ded567 use page centrality directly in crawl planner to prioritise pages.
this simplifies the crawl planner quite a bit and has made it easier to use the remote webgraph instead
2024-05-26 16:30:49 +02:00
Mikkel Denker
af73d33b39 forgot to push new accepted licenses 2024-05-23 13:01:34 +02:00
Mikkel Denker
e74978c5bc update image dependency 2024-05-23 12:44:01 +02:00
Mikkel Denker
bc3fa6974c no need to run workers independently anymore as they don't need to communicate 2024-05-23 09:53:14 +02:00
Mikkel Denker
d7aee00f72 reduce communication between workers in approx centrality
this makes it even more approximated, but it improves performance so we can actually run it
2024-05-23 09:30:37 +02:00
Mikkel Denker
c17824321b increase randomness of sampled nodes by using reservoir sampling to sample random nodes from graph
as the node ids are not based on md5 (or another cryptographic hash) anymore they cannot be assumed to be randomly distributed
2024-05-23 08:37:38 +02:00
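For readers unfamiliar with the technique named in the commit above, here is a minimal sketch of reservoir sampling (Algorithm R) over a stream of node ids. It is not the project's code: `XorShift64` and `reservoir_sample` are hypothetical names, node ids are assumed to be `u64`, and the small xorshift generator only stands in for a real random number generator so the example has no dependencies.

```rust
/// Tiny xorshift64 generator, a stand-in for a real RNG (assumption for this sketch).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in 0..bound (modulo bias is ignored in this sketch).
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Algorithm R: keep `k` items chosen uniformly from a stream of unknown length,
/// without relying on the ids themselves being randomly distributed.
fn reservoir_sample<I: Iterator<Item = u64>>(nodes: I, k: usize, rng: &mut XorShift64) -> Vec<u64> {
    let mut reservoir = Vec::with_capacity(k);
    for (i, node) in nodes.enumerate() {
        if reservoir.len() < k {
            // Fill the reservoir with the first k nodes.
            reservoir.push(node);
        } else {
            // Replace a random slot with probability k / (i + 1).
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = node;
            }
        }
    }
    reservoir
}
```

The sample depends only on stream positions, so it stays uniform however the ids are assigned.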
Mikkel Denker
215696ebeb avoid inf values in approximate centrality by pushing norm inside the sum so the values don't get too big 2024-05-22 20:11:47 +02:00
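The commit above does not show the actual expression, so the following is only an illustration of the general idea under assumed types: `f32` terms and a scalar `norm`. Dividing each term before accumulating keeps the running sum in range where summing first would overflow to infinity. Both function names are hypothetical.

```rust
/// Sum first, normalise afterwards: the intermediate sum can overflow to +inf.
fn normalize_after(terms: &[f32], norm: f32) -> f32 {
    terms.iter().sum::<f32>() / norm
}

/// Normalise every term before adding it, so the running sum stays small.
fn normalize_inside(terms: &[f32], norm: f32) -> f32 {
    terms.iter().map(|t| t / norm).sum()
}
```

With two terms equal to `f32::MAX` and a norm of 4.0, the first variant returns infinity while the second stays finite.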
Mikkel Denker
9209ad2048 close connections in connection pool after a ttl of 60 seconds 2024-05-22 12:49:07 +02:00
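A rough sketch of what a time-to-live on pooled connections can look like. The project's pool is based on deadpool; this standalone version does not use that API, and `TtlPool`, `Idle`, `put` and `take` are hypothetical names chosen for illustration.

```rust
use std::time::{Duration, Instant};

/// An idle connection together with the moment it was opened.
struct Idle<C> {
    conn: C,
    opened_at: Instant,
}

/// Hypothetical pool that discards connections older than `ttl`.
struct TtlPool<C> {
    idle: Vec<Idle<C>>,
    ttl: Duration,
}

impl<C> TtlPool<C> {
    fn new(ttl: Duration) -> Self {
        TtlPool { idle: Vec::new(), ttl }
    }

    /// Return a connection to the pool, remembering when it was opened.
    fn put(&mut self, conn: C, opened_at: Instant) {
        self.idle.push(Idle { conn, opened_at });
    }

    /// Take a connection that is still within its TTL at time `now`.
    /// Expired entries are dropped, which closes types such as `TcpStream`.
    fn take(&mut self, now: Instant) -> Option<(C, Instant)> {
        while let Some(entry) = self.idle.pop() {
            if now.saturating_duration_since(entry.opened_at) < self.ttl {
                return Some((entry.conn, entry.opened_at));
            }
        }
        None
    }
}
```

Passing `now` explicitly keeps the expiry check easy to test; a real pool would more likely call `Instant::now()` internally.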
Mikkel Denker
54b7eb1e17 sort thesaurus entries by their in-degree to prioritize most used definitions 2024-05-22 12:36:44 +02:00
Mikkel Denker
7f4e1d8d04 annotation tool name change 2024-05-22 12:01:09 +02:00
Mikkel Denker
798ef2fe9e for some reason, poll fn seems to block sometimes when checking if tcpstream is closed. let's instead check by trying to read 0-bytes with a timeout and see if it returns an err 2024-05-22 11:25:33 +02:00
Mikkel Denker
e94f02d401 if queue is empty and batch.len() < BATCH_SIZE, approximate harmonic didn't make progress 2024-05-22 08:26:25 +02:00
Mikkel Denker
df7ac44061 chunk id2node requests into smaller parts to avoid timeout 2024-05-21 16:48:55 +02:00
Mikkel Denker
127e1211b0 respect max distance in distributed approx harmonic 2024-05-21 16:22:33 +02:00
Mikkel Denker
4985ec8117 add progressbar to distributed approx harmonic 2024-05-21 15:12:00 +02:00
Mikkel Denker
823efa6716 implement distributed version of approximated harmonic centrality
the page graph still seems to be too big to calculate the exact centrality even when distributed across multiple workers (need more workers)
2024-05-21 14:53:12 +02:00
Mikkel Denker
c97541fa1f implement upsert for more hyperloglog sizes 2024-05-19 17:48:20 +02:00
Mikkel Denker
38416c6070 web-spell cleanup old dicts after merge 2024-05-19 14:01:15 +02:00
Mikkel Denker
d023552391 use binary heap for less cmp when merging spell correction dictionaries 2024-05-18 13:32:50 +02:00
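As an illustration of why a binary heap reduces comparisons when merging, here is a minimal k-way merge sketch. It assumes each dictionary is a list of `(term, count)` pairs already sorted by term, and that counts for equal terms are summed. The real dictionary format is not shown in the log, and `merge_sorted` is a hypothetical name.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several term-sorted dictionaries into one, summing counts of equal terms.
/// The heap holds one head entry per input, so each step costs O(log k)
/// comparisons instead of scanning all k heads.
fn merge_sorted(dicts: Vec<Vec<(String, u64)>>) -> Vec<(String, u64)> {
    let mut iters: Vec<_> = dicts.into_iter().map(|d| d.into_iter()).collect();
    let mut heap = BinaryHeap::new();

    // Seed the heap with the first entry of every dictionary.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some((term, count)) = it.next() {
            heap.push(Reverse((term, idx, count)));
        }
    }

    let mut merged: Vec<(String, u64)> = Vec::new();
    while let Some(Reverse((term, idx, count))) = heap.pop() {
        let same_term = merged.last().map_or(false, |(last, _)| *last == term);
        if same_term {
            // Same term as the previous output entry: accumulate its count.
            merged.last_mut().unwrap().1 += count;
        } else {
            merged.push((term, count));
        }
        // Refill the heap from the dictionary we just consumed from.
        if let Some((next_term, next_count)) = iters[idx].next() {
            heap.push(Reverse((next_term, idx, next_count)));
        }
    }
    merged
}
```

`Reverse` turns the standard max-heap into a min-heap, so the smallest term is always popped first.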
Mikkel Denker
17fed5a75c
Show ranking signals (#201) 2024-05-17 16:39:33 +02:00
Mikkel Denker
38e92c1813 add upsert for HyperLogLog<128> 2024-05-17 09:51:11 +02:00
Mikkel Denker
c9750a823a each worker in distributed harmonic should only use 1 thread for webgraph as they only need to traverse their edges which is not multithreaded anyway 2024-05-17 09:50:39 +02:00
Mikkel Denker
1531c7a02a fix logic bug in 'sorted_k' function that caused it to prune wrong elements 2024-05-17 09:48:38 +02:00
Mikkel Denker
0b57dbfe61 remove retry strategy from ampc and add the missing features to the one from sonic so it can be re-used 2024-05-16 13:23:30 +02:00
Mikkel Denker
cfdf66473e enable tcp linger and shutdown stream on timeout
this should prevent old headers/bodies arriving at the stream out of order
2024-05-16 11:06:20 +02:00
Mikkel Denker
1a6d8ff6be fixed bug in stupid_backoff model that caused last n_gram count to always be 0 2024-05-14 16:49:30 +02:00
Mikkel Denker
8d80ce285c reuse language detection calls by passing whatlang::Lang into tokenizer function for fields 2024-05-14 14:33:57 +02:00
Mikkel Denker
ea1c517da4 remove ranking signal from optics and add to api instead
this will simplify optic merging and make it easier to allow more than 1 optic to be applied to a search
2024-05-14 11:46:40 +02:00
Mikkel Denker
7e8781fe5b use binary heap for less cmp when merging speedy-kv segments 2024-05-13 10:44:17 +02:00
Mikkel Denker
19dab37daa update segment paths to new folder during move 2024-05-11 11:29:26 +02:00
Mikkel Denker
b302a8d5c7 coordinate changed nodes between workers in distributed harmonic to make sure a node that has been updated in worker A is also considered for updates on worker B 2024-05-10 15:09:44 +02:00
Mikkel Denker
e026fe5548
Sonic connection pool (#200)
* allow connection reuse by not taking ownership in send methods

* [sonic] continuously handle requests from each connection in the server as long as the connection is not closed

* add connection pool to sonic based on deadpool

* use connection pool in remote webgraph and distributed searcher

* hopefully fix flaky test

* hopefully fix flaky test
2024-05-09 15:24:43 +02:00
Mikkel Denker
76cd7e8f63 fixed bug that caused queries with special characters to crash ('c++' etc)
'c++' gets tokenized as ['c', '+', '+'] which we use in a phrase query to enforce that the result must have 'c++' in sequence instead of simply having 'c' somewhere on the page and '+' another place. however, some fields don't have the necessary position data stored which caused these queries to crash when trying to perform the phrase query on these fields
2024-05-07 12:51:32 +02:00
Mikkel Denker
d24dce8831 distinguish between itemtypes and regular keys in schema flattened json to ensure schema matchings in optics always start their match against an itemtype 2024-05-07 10:37:52 +02:00
Mikkel Denker
e5e2126e54 align snippet text and date horizontally 2024-05-07 09:33:30 +02:00
Mikkel Denker
905fec80ae make sure site query terms are treated as phrase for correct matching 2024-05-06 16:15:10 +02:00
Mikkel Denker
6c365f369a fix clippy warnings in tests 2024-05-06 15:56:23 +02:00
Mikkel Denker
3c94cb7f81 approximate number of hits by assuming that each term is independent
this allows us to short-circuit the query by default which significantly improves performance as we therefore don't have to iterate the non-scored results simply to count them
2024-05-06 15:21:17 +02:00
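The commit above does not spell out its formula. Under the usual independence assumption for a conjunctive query, the expected number of hits is the collection size times the product of each term's document-frequency ratio. The sketch below assumes that formula; `approx_num_hits` and its parameters are hypothetical names.

```rust
/// Estimate how many documents match all query terms, treating the terms as
/// statistically independent: N * product(df_i / N).
fn approx_num_hits(total_docs: u64, doc_freqs: &[u64]) -> u64 {
    if total_docs == 0 || doc_freqs.is_empty() {
        return 0;
    }
    let n = total_docs as f64;
    // Probability that a random document contains every term.
    let p: f64 = doc_freqs
        .iter()
        .map(|&df| (df as f64 / n).min(1.0))
        .product();
    (n * p).round() as u64
}
```

For example, with 1,000,000 documents and two terms appearing in 10,000 and 50,000 of them, the estimate is 500 hits, obtained from index statistics alone without iterating the matching documents.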
Mikkel Denker
bfa4ce9043 make sure long titles are truncated in serp 2024-05-06 12:14:48 +02:00
Mikkel Denker
d687270b78 small serp accessibility improvements
group results and make sure heading is read first
2024-05-06 11:52:05 +02:00
Mikkel Denker
651ff30aa7 don't augment long queries with ngram lookups for performance 2024-05-06 10:51:49 +02:00
Mikkel Denker
73e5445018
update fend to 1.4.8 (#198) 2024-05-05 17:05:44 +02:00
Mikkel Denker
ccc28d7ade make sure keyed each blocks have unique id to prevent infinite render retries from svelte when search results contain duplicate results 2024-05-04 17:54:30 +02:00
Mikkel Denker
84f56053a1 rename all 'type' to '_type' in api as 'type' might be reserved in some languages
also optionally return structured data from api
2024-05-03 17:03:56 +02:00
Mikkel Denker
7e9da2e37c chore: upgrade dependencies for kuchiki 2024-05-03 12:23:10 +02:00
Mikkel Denker
9c983e5f96
Top k webgraph edges (#197)
* implement random access index in file_store where keys are u64 and values are serialised to a constant size

* cleanup: move all webgraph store writes into store_writer

* add a 'ConstIterableStore' that can store items on disk without needing to interleave headers in the case that all items can be serialized to a constant number of bytes known up front

* change edges file format to make edges for a given node iterable.
this allows us to only load a subset of the edges for a node in the future

* compress webgraph labels in blocks of 128

* ability to limit number of edges returned by webgraph

* sort edges in webgraph store by the host rank of the opposite node
2024-05-03 09:33:57 +02:00