Mikkel Denker
feafa7507a
fix webgraph merge bug of missing edges
...
the edges were compared by their sort_key, but some nodes don't have a centrality value (sort_key), so they were erroneously mistaken for other nodes and skipped. it's better to compare directly on the node id
2024-06-12 12:58:24 +02:00
Mikkel Denker
a7b61559db
optimize trivial segment merges
2024-06-11 15:57:47 +02:00
Mikkel Denker
a9eb8acd80
store 'rel' attribute for each edge in the webgraph
...
this allows us to skip links to tag pages etc. when calculating harmonic centrality which should greatly improve the centrality values for the page graph
2024-06-11 15:34:54 +02:00
Mikkel Denker
efa6f9fab0
ability to convert some of the wander budget to scheduled urls
...
we might know about more urls for each domain than the ones that got a page centrality > 0.0. these urls won't be scheduled unless we convert some of the wander budget back to scheduled urls
2024-06-10 14:11:36 +02:00
Mikkel Denker
817bda9738
optionally merge all webgraph segments into a single segment for improved read performance
2024-06-09 14:48:11 +02:00
Mikkel Denker
295444de4e
store sort-key in webgraph edges to properly apply limits with multiple segments
...
this should also allow us to merge segments which should improve read performance
2024-06-07 11:13:17 +02:00
Mikkel Denker
07663c1687
no need to consider nodes where we have found a better path
2024-06-06 17:31:27 +02:00
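A minimal sketch of what "no need to consider nodes where we have found a better path" usually means in a heap-based Dijkstra: a popped entry is skipped when its distance is worse than the best one already recorded. The function name `dijkstra`, the adjacency-list layout and the `u64` weights are assumptions made for illustration, not the repository's actual code.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Shortest distances from `source` in a graph stored as adjacency lists of
/// (neighbor, weight) pairs. Unreachable nodes keep the value `u64::MAX`.
pub fn dijkstra(adj: &[Vec<(usize, u64)>], source: usize) -> Vec<u64> {
    let mut dist = vec![u64::MAX; adj.len()];
    let mut heap = BinaryHeap::new();

    dist[source] = 0;
    // `Reverse` turns the max-heap into a min-heap on (distance, node).
    heap.push(Reverse((0u64, source)));

    while let Some(Reverse((d, node))) = heap.pop() {
        // Stale entry: a better path to `node` was recorded after this entry
        // was pushed, so its edges do not need to be considered again.
        if d > dist[node] {
            continue;
        }

        for &(next, weight) in &adj[node] {
            let candidate = d.saturating_add(weight);
            if candidate < dist[next] {
                dist[next] = candidate;
                heap.push(Reverse((candidate, next)));
            }
        }
    }

    dist
}
```

Leaving stale entries in the heap and skipping them on pop avoids a decrease-key operation, at the cost of a slightly larger heap.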
Mikkel Denker
90980a1aa4
no need to consider nodes where we have found a better path
2024-06-06 17:05:32 +02:00
Mikkel Denker
6bc7a5caf0
simplify dijkstra to reduce allocations and btree searches
2024-06-06 15:20:10 +02:00
Mikkel Denker
78c4676165
approx harmonic parallel graph executor
2024-06-06 14:11:46 +02:00
Mikkel Denker
4fbffe63b4
optionally store pages with 0 centrality from approx harmonic
2024-06-06 11:54:04 +02:00
Mikkel Denker
4b459b579b
reduce indirections in dht (and hopefully memory usage) by removing some arc's
...
most dht use cases store very small keys/values but store a ton of them, so the arcs don't really help a lot
2024-06-06 11:23:31 +02:00
Mikkel Denker
f7a354572e
reduce memory usage and serialization/deserialization in dht by storing keys and values as enums instead of vec<u8>
2024-06-06 10:39:00 +02:00
Mikkel Denker
3c4a0c480e
use remote webgraph in crawl planner
2024-06-05 09:41:03 +02:00
Mikkel Denker
26b1f08ba6
new spell correction is more accurate, so can have lower threshold for suggestions
2024-06-03 13:12:25 +02:00
Mikkel Denker
b2dd6c8731
make sure active connections aren't reused from connection pool.
...
I don't know how this could even happen as we get a mutable reference from the connection pool, but this seems to solve the issue...
2024-05-28 11:32:46 +02:00
Mikkel Denker
a1381d667b
fixed bug that caused error model in spell correction to always be empty
2024-05-27 11:45:07 +02:00
Mikkel Denker
e39987a2f7
remove summarizer
...
the probabilistic nature of llms means they have an inherent risk of hallucinating. even if they tend to cite correctly most of the time, the probability of hallucinations is still too large to be able to trust the output, thus defeating the purpose of the summary entirely. until these hallucinations are fixed (or the probability is extremely low) i don't see how it makes sense to include llms in search
2024-05-27 10:04:48 +02:00
Mikkel Denker
5c94ded567
use page centrality directly in crawl planner to prioritise pages.
...
this simplifies the crawl planner quite a bit and has made it easier to use the remote webgraph instead
2024-05-26 16:30:49 +02:00
Mikkel Denker
e74978c5bc
update image dependency
2024-05-23 12:44:01 +02:00
Mikkel Denker
bc3fa6974c
no need to run workers independently anymore as they don't need to communicate
2024-05-23 09:53:14 +02:00
Mikkel Denker
d7aee00f72
reduce communication between workers in approx centrality
...
this makes it even more approximated, but it improves performance so we can actually run it
2024-05-23 09:30:37 +02:00
Mikkel Denker
c17824321b
increase randomness of sampled nodes by using reservoir sampling to sample random nodes from graph
...
as the node ids are not based on md5 (or another cryptographic hash) anymore they cannot be assumed to be randomly distributed
2024-05-23 08:37:38 +02:00
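For reference, reservoir sampling (Algorithm R) picks `k` items uniformly from a stream of unknown length in a single pass, which is why it does not depend on how the node ids are distributed. The sketch below is illustrative only: `XorShift64` and `reservoir_sample` are hypothetical names, and the tiny generator stands in for whatever random source the project actually uses.

```rust
/// Small xorshift64 generator so the sketch needs no external crates.
/// Not suitable for anything security related.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Algorithm R: keep the first `k` items, then replace a random slot with
/// item `i` (0-based) with probability k / (i + 1).
pub fn reservoir_sample<T, I>(items: I, k: usize, rng: &mut XorShift64) -> Vec<T>
where
    I: IntoIterator<Item = T>,
{
    let mut reservoir = Vec::with_capacity(k);

    for (i, item) in items.into_iter().enumerate() {
        if i < k {
            reservoir.push(item);
        } else {
            // Index in 0..=i; the modulo bias is negligible for a sketch.
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            if j < k {
                reservoir[j] = item;
            }
        }
    }

    reservoir
}
```

If the stream holds fewer than `k` items, all of them are returned in their original order.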
Mikkel Denker
215696ebeb
avoid inf values in approximate centrality by pushing norm inside the sum so the values don't get too big
2024-05-22 20:11:47 +02:00
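A generic illustration of why moving a normalization factor inside a sum can avoid `inf`: dividing each term first keeps the running total small enough to stay representable. What the "norm" is in the real approximation is not stated in the log, so the functions `normalize_after` and `normalize_inside` and the use of `f32` are assumptions chosen only to show the floating-point effect.

```rust
/// Sum first, divide once at the end. The intermediate sum can overflow to
/// infinity even when the final result would be representable.
pub fn normalize_after(values: &[f32], norm: f32) -> f32 {
    values.iter().sum::<f32>() / norm
}

/// Divide every term before adding it, so the running sum stays smaller.
pub fn normalize_inside(values: &[f32], norm: f32) -> f32 {
    values.iter().map(|v| v / norm).sum()
}
```

With two terms equal to `f32::MAX` and a norm of `4.0`, the first variant overflows while the second stays finite. The trade-off is one division per term instead of one in total.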
Mikkel Denker
9209ad2048
close connections in connection pool after a ttl of 60 seconds
2024-05-22 12:49:07 +02:00
Mikkel Denker
54b7eb1e17
sort thesaurus entries by their in-degree to prioritize most used definitions
2024-05-22 12:36:44 +02:00
Mikkel Denker
798ef2fe9e
for some reason, poll fn seems to block sometimes when checking if tcpstream is closed. let's instead check by trying to read 0-bytes with a timeout and see if it returns an err
2024-05-22 11:25:33 +02:00
Mikkel Denker
e94f02d401
if queue is empty and batch.len() < BATCH_SIZE, approximate harmonic didn't make progress
2024-05-22 08:26:25 +02:00
Mikkel Denker
df7ac44061
chunk id2node requests into smaller parts to avoid timeout
2024-05-21 16:48:55 +02:00
Mikkel Denker
127e1211b0
respect max distance in distributed approx harmonic
2024-05-21 16:22:33 +02:00
Mikkel Denker
4985ec8117
add progressbar to distributed approx harmonic
2024-05-21 15:12:00 +02:00
Mikkel Denker
823efa6716
implement distributed version of approximated harmonic centrality
...
the page graph still seems to be too big to calculate the exact centrality even when distributed across multiple workers (need more workers)
2024-05-21 14:53:12 +02:00
Mikkel Denker
c97541fa1f
implement upsert for more hyperloglog sizes
2024-05-19 17:48:20 +02:00
Mikkel Denker
38416c6070
web-spell cleanup old dicts after merge
2024-05-19 14:01:15 +02:00
Mikkel Denker
d023552391
use binary heap for less cmp when merging spell correction dictionaries
2024-05-18 13:32:50 +02:00
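A minimal sketch of the general technique, assuming the inputs are already sorted: a binary heap holds one head element per input, so each output element costs O(log n) comparisons instead of a linear scan over all n heads. `merge_sorted` is a hypothetical name, and the real dictionary merge presumably also combines entries with equal keys, which this sketch does not do.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several individually sorted vectors into one sorted vector.
pub fn merge_sorted<T: Ord>(inputs: Vec<Vec<T>>) -> Vec<T> {
    let total: usize = inputs.iter().map(Vec::len).sum();
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();

    // Min-heap of (head element, index of the input it came from).
    let mut heap = BinaryHeap::with_capacity(iters.len());
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(item) = it.next() {
            heap.push(Reverse((item, idx)));
        }
    }

    let mut out = Vec::with_capacity(total);
    while let Some(Reverse((item, idx))) = heap.pop() {
        out.push(item);
        // Refill the heap from the input that just produced the minimum.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }

    out
}
```

Ties between equal elements are broken by input index, so elements from earlier inputs come out first.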
Mikkel Denker
17fed5a75c
Show ranking signals ( #201 )
2024-05-17 16:39:33 +02:00
Mikkel Denker
38e92c1813
add upsert for HyperLogLog<128>
2024-05-17 09:51:11 +02:00
Mikkel Denker
c9750a823a
each worker in distributed harmonic should only use 1 thread for webgraph as they only need to traverse their edges which is not multithreaded anyway
2024-05-17 09:50:39 +02:00
Mikkel Denker
1531c7a02a
fix logic bug in 'sorted_k' function that caused it to prune wrong elements
2024-05-17 09:48:38 +02:00
Mikkel Denker
0b57dbfe61
remove retry strategy from ampc and add the missing features to the one from sonic so it can be re-used
2024-05-16 13:23:30 +02:00
Mikkel Denker
cfdf66473e
enable tcp linger and shutdown stream on timeout
...
this should prevent old headers/bodies arriving at the stream out of order
2024-05-16 11:06:20 +02:00
Mikkel Denker
1a6d8ff6be
fixed bug in stupid_backoff model that caused last n_gram count to always be 0
2024-05-14 16:49:30 +02:00
Mikkel Denker
8d80ce285c
reuse language detection calls by passing whatlang::Lang into tokenizer function for fields
2024-05-14 14:33:57 +02:00
Mikkel Denker
ea1c517da4
remove ranking signal from optics and add to api instead
...
this will simplify optic merging and make it easier to allow more than 1 optic to be applied to a search
2024-05-14 11:46:40 +02:00
Mikkel Denker
7e8781fe5b
use binary heap for less cmp when merging speedy-kv segments
2024-05-13 10:44:17 +02:00
Mikkel Denker
19dab37daa
update segment paths to new folder during move
2024-05-11 11:29:26 +02:00
Mikkel Denker
b302a8d5c7
coordinate changed nodes between workers in distributed harmonic to make sure a node that has been updated in worker A is also considered for updates on worker B
2024-05-10 15:09:44 +02:00
Mikkel Denker
e026fe5548
Sonic connection pool ( #200 )
...
* allow connection reuse by not taking ownership in send methods
* [sonic] continuously handle requests from each connection in the server as long as the connection is not closed
* add connection pool to sonic based on deadpool
* use connection pool in remote webgraph and distributed searcher
* hopefully fix flaky test
* hopefully fix flaky test
2024-05-09 15:24:43 +02:00
Mikkel Denker
76cd7e8f63
fixed bug that caused queries with special characters to crash ('c++' etc)
...
'c++' gets tokenized as ['c', '+', '+'] which we use in a phrase query to enforce that the result must have 'c++' in sequence instead of simply having 'c' somewhere on the page and '+' in another place. however, some fields don't have the necessary position data stored, which caused these queries to crash when trying to perform the phrase query on these fields
2024-05-07 12:51:32 +02:00
Mikkel Denker
d24dce8831
distinguish between itemtypes and regular keys in schema flattened json to ensure schema matchings in optics always start their match against an itemtype
2024-05-07 10:37:52 +02:00