Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
3b8bdc22b6 Disable webgraph compression.
An insane amount of time was spent decompressing (about 80%) when fetching inbound links.
This made it impractical to index backlink texts, which of course are very useful information to have in the index.
2023-08-17 10:50:55 +02:00
Mikkel Denker
00eda36984 webgraph api endpoints in docs 2023-08-16 20:53:05 +02:00
Mikkel Denker
19bc47ad2a added api docs badge to readme 2023-08-16 15:08:22 +02:00
Mikkel Denker
22a8e7d4df preliminary api docs 2023-08-16 14:57:25 +02:00
Mikkel Denker
d2719b333e remove alice improvements 2023-08-16 10:09:21 +02:00
Mikkel Denker
16253e187e centered badge 2023-08-16 10:01:20 +02:00
Mikkel Denker
a7ebc4d1a2 space between badge and image 2023-08-16 09:57:16 +02:00
Mikkel Denker
8e9c0c66e9 space between badge and image 2023-08-16 09:57:02 +02:00
Mikkel Denker
8011719210 space between badge and image 2023-08-16 09:56:40 +02:00
Mikkel Denker
a19bb9fcfb move overview docs badge 2023-08-16 09:54:00 +02:00
Mikkel Denker
5012e03a56 overview docs badge 2023-08-16 09:53:16 +02:00
Mikkel Denker
3a8dab019e failed to move webgraph when done: directory not empty 2023-08-16 09:02:41 +02:00
Mikkel Denker
62264700fa rkyv serialization in crawl-db to increase performance quite a bit 2023-08-16 08:55:08 +02:00
Mikkel Denker
5868119048 simplify crawldb by removing all ids and storing urls/domains directly 2023-08-15 13:49:54 +02:00
Mikkel Denker
cb9a04508b less locking in crawldb 2023-08-15 12:26:26 +02:00
Mikkel Denker
5a562b66d3 tune rocksdb options to reduce write amplification in crawl coordinator 2023-08-14 20:49:12 +02:00
Mikkel Denker
7819c968ef crawldb inserts should be sync 2023-08-14 15:01:07 +02:00
Mikkel Denker
6d358c567a crawl coordinator memory stress test 2023-08-14 14:54:23 +02:00
Mikkel Denker
feec143db8 reduce crawler memory usage 2023-08-14 11:27:58 +02:00
Mikkel Denker
d9bcc5f925 simple test for crawldb politeness 2023-08-14 09:11:05 +02:00
Mikkel Denker
49db3704f7 entity sidebar went off-screen on mobile 2023-08-14 09:02:00 +02:00
Mikkel Denker
f365ee91c2 Remove mutable reference in sonic service handle fn.
Each service should handle its required mutability with atomic/locking mechanisms.
Otherwise, there would be no way for the sonic server to accept multiple incoming requests even if the server had available capacity to handle the request.

Now the server spawns a tokio task for each incoming request.
2023-08-13 16:06:20 +02:00
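The idea in the commit above can be illustrated with a minimal sketch. It is not the project's code: `CounterService`, `handle`, and `serve_all` are hypothetical names, and plain OS threads stand in for tokio tasks so that the example only needs the standard library. The point is that a handler taking `&self` and mutating through an atomic can be shared by many concurrent requests.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical service: its state lives in an atomic, so `handle`
/// only needs a shared reference (`&self`), not `&mut self`.
struct CounterService {
    counter: AtomicU64,
}

impl CounterService {
    /// Adds `amount` to the counter and returns the new value.
    fn handle(&self, amount: u64) -> u64 {
        // `fetch_add` returns the previous value, so add `amount` again
        // to report the value after this request was applied.
        self.counter.fetch_add(amount, Ordering::SeqCst) + amount
    }
}

/// Serves every request on its own thread (an assumption standing in
/// for "one tokio task per incoming request") and collects the responses.
fn serve_all(service: Arc<CounterService>, requests: Vec<u64>) -> Vec<u64> {
    let handles: Vec<_> = requests
        .into_iter()
        .map(|request| {
            let service = Arc::clone(&service);
            thread::spawn(move || service.handle(request))
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("handler thread panicked"))
        .collect()
}
```

Calling `serve_all` with an `Arc<CounterService>` and the requests `[1, 2, 3]` leaves the counter at 6 regardless of the order in which the threads run; the individual responses depend on scheduling.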
Mikkel Denker
e9301e3e84 Slight change of sonic protocol so close happens from client side.
This should allow us to disable lingering for the socket since the client will only close the connection once it has received adequate data to satisfy the user request. No stray tcp packets should therefore hit the socket.
2023-08-13 15:44:28 +02:00
Mikkel Denker
5c0487ad89 Enable spell dictionary by default but reduce worst case compute 2023-08-13 11:30:54 +02:00
Mikkel Denker
0b11b20c6d Refactor sonic service to use primitive sonic connections 2023-08-13 10:40:39 +02:00
Mikkel Denker
28c8da906a Make spell dictionary optional 2023-08-12 15:10:07 +02:00
Mikkel Denker
4bba8feb0a Various different optimizations.
This commit should have been structured much better.
2023-08-12 14:22:45 +02:00
Mikkel Denker
9d4c01efc9 Url states to disk.
The idea is to reduce memory usage significantly. Don't know if it will be fast enough for big crawls.
2023-08-10 17:49:22 +02:00
Mikkel Denker
7fef2b7b0d Collector didn't correctly stop when max_docs was reached.
Also faster similar-sites as we now use the scorer to estimate the importance for the backlink nodes as well.
2023-08-10 13:16:03 +02:00
Mikkel Denker
4c0b5e4d88 Fix sonic broken pipe due to low timeouts 2023-08-10 09:59:24 +02:00
Mikkel Denker
ce1e15a11b Fix url root domain parsing and faster memory mapped crawldb. 2023-08-09 14:44:28 +02:00
Mikkel Denker
36f22e801e
Overview docs (#73)
* Begin overview documentation in mdbook format

* Overview of the different docs

* Move overview documentation to mkdocs

* Reduce webgraph segment merges by introducing a webgraph commit mode that commits the live segment directly to the stored segment

* Parallel harmonic centrality calculations

* Even more parallelism in harmonic centrality calculations

* Way faster hyperloglog but also less accurate

* Dynamic exact counting threshold proportional to size of graph

* improve inbound similarity speed and fix hyperloglog out-of-bounds bug

* no need to load all nodes into memory for harmonic centrality

* Use rayon directly in indexer.
Hopefully this fixes the bug where the indexer takes a new job before it has finished the first one. I think what happened was that the indexer thread took a new job when hitting the webgraph executor.

* single threaded webgraph when indexing

* No need for node2id anymore

* Use single thread in tantivy by default.
We introduce a method to optimize the index for search, which currently just sets the tantivy executor to be multithreaded. This should improve the indexing performance.

* Reduce memory arena in tantivy

* try jemalloc

* Revert tantivy memory arena reduction. Caused too many files to be created when indexing warc files
2023-08-08 06:32:44 +00:00
Oliver Bøving
345bd9c1b4
Deunsafication and testing of sonic protocol and introducing sonic::service (#72)
* deunsafication and testing of sonic protocol

Originally sonic used unsafe casts to read and write the Header for
packets. This was "safe" as it was, but relied on the reading and
writing code making sure that it used the correct types on each end of
the protocol.

This introduces bytemuck which does the same thing, but safely. Safety
comes from bytemuck asserting at compile-time that the Header is safe
to cast to bytes, and that there are enough bytes when interpreting
the received bytes.

In addition to this, this commit also introduces proptest testing of the
sonic communication, by generating arbitrary messages and sending them to
a server, and asserting that the request and response are always as
expected.

* introduce type safe sonic and sonic service

The core of sonic communication now establishes a protocol that has
the request and response types determined upon creation, instead of
deferred to sending and accepting. This upfront declaration makes it
harder to misuse the protocol, and more obvious what messages are
expected at each part of the protocol.

A blocker for implementing this was tailored request/response cycles,
where a specific response type was expected based on the given request
type. Formerly, this was handled ad hoc: the expected response
type was declared on the connection side to match what type the server
would send in response. This implicit coupling was brittle, and depended
upon the two call sites agreeing on what to send and when.

Based on the prior downsides, sonic::service is introduced to formalize
the request/response cycle. This is done by defining messages, which
implement a trait with an associated response type and a handler. When
a service connection sends a message to a server of the same
service, the connection statically knows what response to expect,
by use of the associated type of the sent message. This provides
connections with an RPC-like interface to the service, while the service
can be expanded with new messages without incurring any additional
overhead at the call sites of existing connections.

A new service is declared using a macro that defines appropriate
request and response enums, used during communication. It additionally
generates the code for an ergonomic interface that uses the associated
response type to hide the implementation details of tagging messages
while providing a type-safe interface.

Most servers are converted to use the new sonic::service, but some still
remain on the lower-level sonic, which is now type-safe, since they
are used in ways incompatible with the strict request/response +
predetermined types model; notably MapReduce is generic over the job,
and has non-service-like control flow.

* `: Clone` bye bye!

By adding a new `RequestRef<'a>` for the sending part of sonic::service
we get to take references to the sent messages, and thus remove the
need for copying each message sent!

This relies on `Request` and `RequestRef` being the same shape when
serialized and deserialized, since we serialize the ref version but
deserialize the non-ref version. Since both are generated by the same
macro, we can have quite high confidence that they are the same, but
we should perhaps add some tests to ensure that this is the case. I
don't think it would be possible to check exhaustively, but a proptest
cycling through a serialize/deserialize loop could give us some confidence
that it is correct.

* Added a proptest to cycle 'Request' and 'RequestRef'
This commit adds a proptest to the CounterService in order to increase our confidence that a 'Request' struct can be deserialized from a 'RequestRef' struct. It's important to note that while this proptest does not guarantee that the property holds, it does increase our confidence.
2023-08-04 09:11:34 +00:00
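The "message with an associated response type" design described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual sonic::service API: the `Message` trait, the `KeyValueStore` service, the `Get` and `Len` messages, and the in-process `Connection` are all hypothetical, and the real implementation generates request/response enums with a macro and sends them over the network.

```rust
/// Hypothetical trait: each message declares the response type it
/// expects and how the service `S` handles it.
trait Message<S> {
    type Response;
    fn handle(self, service: &S) -> Self::Response;
}

/// Hypothetical service holding a small list of key/value pairs.
struct KeyValueStore {
    entries: Vec<(String, u64)>,
}

/// Message asking for the value stored under `key`.
struct Get {
    key: String,
}

/// Message asking for the number of stored entries.
struct Len;

impl Message<KeyValueStore> for Get {
    type Response = Option<u64>;

    fn handle(self, service: &KeyValueStore) -> Option<u64> {
        service
            .entries
            .iter()
            .find(|(key, _)| *key == self.key)
            .map(|(_, value)| *value)
    }
}

impl Message<KeyValueStore> for Len {
    type Response = usize;

    fn handle(self, service: &KeyValueStore) -> usize {
        service.entries.len()
    }
}

/// Stand-in for a network connection: it calls the handler in-process.
/// The return type of `send` is fixed by the message's associated
/// `Response`, so the caller never names or guesses it separately.
struct Connection<'a, S> {
    service: &'a S,
}

impl<'a, S> Connection<'a, S> {
    fn send<M: Message<S>>(&self, message: M) -> M::Response {
        message.handle(self.service)
    }
}
```

With this shape, `conn.send(Get { .. })` is typed as `Option<u64>` and `conn.send(Len)` as `usize` at compile time, and adding a new message only requires a new `impl Message<KeyValueStore>` without touching existing call sites.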
Mikkel Denker
38fcd5ceb8 should have removed rocksdb cache when disabled 2023-08-04 10:01:22 +02:00
Mikkel Denker
6eeb528085 hopefully less rocksdb memory usage 2023-08-04 09:56:57 +02:00
Mikkel Denker
2e885e9d11 rocksdb bloom filters caused OOM errors. Updated filter sizes to more sane defaults 2023-08-03 15:01:09 +02:00
Mikkel Denker
b323fa9e4b limit cache of robots.txt files 2023-08-02 14:14:01 +02:00
Mikkel Denker
e28bbd75bf Spec compliant url parser.
This is a giant re-write to use the spec compliant 'url' crate. The parser we had before did not follow relative urls correctly, which caused many 404's when crawling.
There were probably also a bunch of other issues with the simple parser, so this should make webgraph and everything related to it more robust as well.
2023-08-02 10:26:52 +02:00
Mikkel Denker
0eec601520 less clones when inserting results into crawl coordinator 2023-08-01 12:06:36 +02:00
Mikkel Denker
8ee456c272 entity sidebar threshold 2023-08-01 11:11:01 +02:00
Mikkel Denker
69e06bb574 Some domain states got set to "Pending" even though a worker was already crawling them.
This primarily happened for popular sites that have many incoming links from many different sites.
2023-08-01 08:48:26 +02:00
Mikkel Denker
93dcb50691 reduce number of connections from crawl workers to coordinator 2023-07-31 16:30:22 +02:00
Mikkel Denker
03f26fc6b0 don't remove '#' from path part of urls 2023-07-31 12:11:03 +02:00
Mikkel Denker
f4cc94f7e5 if '#' is before '?' in an url, the query part is simply from '?' to url.len() 2023-07-31 12:00:28 +02:00
Mikkel Denker
df39af8c46 Remove utm queries during indexing 2023-07-31 10:36:12 +02:00
Mikkel Denker
e755068dad remove 'utm_*' parameters from urls when normalizing 2023-07-31 10:30:25 +02:00
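A minimal sketch of what removing 'utm_*' parameters during normalization might look like, using only the standard library. The function name `strip_utm_params` is hypothetical and this is not the project's implementation. It also assumes that everything after the first '#' is the fragment, which may differ from how the project's own parser treated '#' and '?' at this point in its history, and it matches parameter names case-sensitively.

```rust
/// Removes every query parameter whose name starts with "utm_".
/// Assumption: the fragment is everything after the first '#', and the
/// query is everything between the first '?' and the fragment.
fn strip_utm_params(url: &str) -> String {
    // Split off the fragment first so a '?' inside it is left untouched.
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, fragment)) => (head, Some(fragment)),
        None => (url, None),
    };

    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, Some(query)),
        None => (without_fragment, None),
    };

    let mut normalized = String::from(base);

    if let Some(query) = query {
        // Keep only the non-empty pairs that are not tracking parameters.
        let kept: Vec<&str> = query
            .split('&')
            .filter(|pair| !pair.is_empty() && !pair.starts_with("utm_"))
            .collect();

        // Drop the '?' entirely when no parameters remain.
        if !kept.is_empty() {
            normalized.push('?');
            normalized.push_str(&kept.join("&"));
        }
    }

    if let Some(fragment) = fragment {
        normalized.push('#');
        normalized.push_str(fragment);
    }

    normalized
}
```

For example, `strip_utm_params("https://example.com/a?utm_source=x&id=7&utm_medium=y#top")` yields `https://example.com/a?id=7#top`, and a URL whose only parameters are 'utm_*' loses its query string altogether.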
Mikkel Denker
4a2ca1472d Moved some hardcoded constants into configuration files.
Also a bunch of other stuff; I should really have split this commit into multiple parts...
2023-07-30 20:08:47 +02:00
Mikkel Denker
0d51eff321 better snippets for phrase queries 2023-07-24 13:24:06 +02:00
Mikkel Denker
f81722c4c8 update bigprime 2023-07-21 09:39:52 +02:00
Mikkel Denker
0c22064414 id2node cache 2023-07-21 09:37:00 +02:00