Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
05b87c95dd http protocol in explore 2023-07-20 15:48:15 +02:00
Mikkel Denker
3c2c0bf758 Use md5 digest as id for webgraph nodes.
There is a small collision risk where multiple nodes will get the same id in the graph.
This risk is extremely small since the digest is 128 bits, and the benefits of the significantly reduced complexity in everything surrounding the webgraph hugely outweigh it.
It's now much faster to get edges for a node in the graph, since we don't need to map segment_node_ids to node_ids. Segment merges are also far less complex.
2023-07-20 15:41:08 +02:00
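To put a rough number on "extremely small", here is a minimal sketch of the birthday-bound approximation for id collisions. It assumes the digests behave like uniformly random 128-bit values, and the node count used in the usage note is a hypothetical figure, not one taken from this log.

```rust
/// Birthday-bound approximation for the probability that at least two of
/// `n` uniformly distributed `bits`-bit ids collide: p ≈ n^2 / 2^(bits + 1).
/// The approximation is only meaningful while the result is far below 1.
fn collision_probability(n: f64, bits: i32) -> f64 {
    n * n / 2f64.powi(bits + 1)
}
```

Under these assumptions, `collision_probability(1e12, 128)` comes out at roughly 1.5e-15 for a hypothetical graph of one trillion nodes.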
Mikkel Denker
7c95231011 update crawler description text 2023-07-19 17:00:22 +02:00
Mikkel Denker
a545770b6e Change user agent of crawler.
Reddit seems to look for "bot" in the user agent. If it cannot find the substring, it returns a page that updates the title with JavaScript. This leaves us with a bunch of Reddit pages titled "Reddit - Dive into anything" in the search results.
2023-07-19 16:40:30 +02:00
Mikkel Denker
ceaf443884 calibrate discussions widget 2023-07-19 10:18:13 +02:00
Mikkel Denker
405e59335e better caching for node ids in webgraph 2023-07-18 19:09:58 +02:00
Mikkel Denker
d9b0fd7450 Improve read performance for webgraph.
We have split adjacency into small_adjacency and full_adjacency RocksDB databases (and the same for reverse). This lets us read edge labels only when we need them, which improves read performance for the webgraph edges at the expense of storage usage.
2023-07-18 16:17:45 +02:00
Mikkel Denker
802bddedae respect robots meta tag and use canonical url 2023-07-18 13:28:34 +02:00
Mikkel Denker
99438846ac forgot to add shadows to screenshot 2023-07-18 11:10:07 +02:00
Mikkel Denker
330c02a96f rust update and new git screenshot 2023-07-18 11:04:58 +02:00
Mikkel Denker
3f69e3f062 make type of centrality store explicit in config variable name 2023-07-17 11:09:32 +02:00
Mikkel Denker
361e0bbc2f More logging if some things look off during indexing.
Might be due to wrong path for centrality stores, webgraph etc.
2023-07-17 10:10:38 +02:00
Mikkel Denker
c68c7ca96c less approximations in similar sites finder 2023-07-16 18:28:45 +02:00
Mikkel Denker
c392f689f9 Remove www prefix when inserting into web graph 2023-07-15 21:58:54 +02:00
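A minimal sketch of what stripping the prefix at construction time could look like. The `Node` struct and its `name` field are hypothetical, since the real struct is not shown in this log, and only the `www.` rule named in the commit message is applied.

```rust
/// Hypothetical graph node; the real struct's fields are not shown in the log.
#[derive(Debug, PartialEq, Eq, Hash)]
struct Node {
    name: String,
}

impl Node {
    /// Normalizing in the constructor means every node goes through the
    /// same rule, so "www.example.com" and "example.com" map to one node.
    fn new(host: &str) -> Self {
        let name = host.strip_prefix("www.").unwrap_or(host).to_string();
        Node { name }
    }
}
```

With this sketch, `Node::new("www.example.com")` and `Node::new("example.com")` compare equal, while a host such as `wwwexample.com` is left untouched.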
Mikkel Denker
43bd806fef Retry urls that return 429 during crawl 2023-07-14 08:25:09 +02:00
Mikkel Denker
1af94339bb increase max politeness, faster setup and some documentation 2023-07-13 17:53:46 +02:00
Mikkel Denker
a31755b874 Language makes more sense than region.
Currently region is only detected by language, so from a user perspective it makes more sense to call the dropdown "language".
2023-07-13 13:54:42 +02:00
Mikkel Denker
aa357307d7 forgot to update primary key name in scylla chats table 2023-07-13 11:45:23 +02:00
Mikkel Denker
62d8a579dd test hardening 2023-07-12 14:11:05 +02:00
Mikkel Denker
d9bf123572 url digits and slashes are computable before search 2023-07-12 12:03:28 +02:00
Mikkel Denker
1e2ee6b8ea number of slashes and digits in url as ranking signal 2023-07-12 11:16:33 +02:00
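A minimal sketch of how such a signal might be derived. The `UrlSignals` struct and `url_signals` function are hypothetical names, and skipping the scheme separator is an assumption rather than something the commit message states.

```rust
/// Hypothetical per-URL features; the names are illustrative only.
#[derive(Debug, PartialEq)]
struct UrlSignals {
    slashes: usize,
    digits: usize,
}

fn url_signals(url: &str) -> UrlSignals {
    // Assumption: skip the scheme separator so "https://" does not add two slashes.
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    UrlSignals {
        slashes: rest.chars().filter(|&c| c == '/').count(),
        digits: rest.chars().filter(|c| c.is_ascii_digit()).count(),
    }
}
```

Because the counts depend only on the URL and not on the query, they can be computed once ahead of time rather than at search time.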
Mikkel Denker
4f1a7079e4 formatting and raw http for s3 endpoint 2023-07-11 17:47:42 +02:00
Mikkel Denker
5f737943cc normalize graph nodes whenever a node struct is constructed 2023-07-06 17:20:58 +02:00
Mikkel Denker
af1e43206b Normalize redirects in web graph 2023-07-06 16:39:52 +02:00
Mikkel Denker
2191b4734c a bunch of changes primarily to centrality store to make it store more stuff on disk 2023-07-06 14:28:02 +02:00
Mikkel Denker
ec76b7fb63 rename webgraph url level to page level 2023-07-06 12:12:38 +02:00
Mikkel Denker
3a87834547 Re-write webgraph to be based on rocksdb.
This allows us to create really big graphs (needed for page-level webgraphs).
2023-07-06 12:02:23 +02:00
Mikkel Denker
e858c004f1 Lower domain weight after the urls have been sampled 2023-07-06 08:05:44 +02:00
Mikkel Denker
0ea90aefdd harmonic centrality exact counting threshold 2023-07-05 19:59:53 +02:00
Mikkel Denker
6aa3766d79 configurable webgraph level 2023-07-05 18:45:55 +02:00
Mikkel Denker
bf68865f89 only search single replica with retries 2023-07-05 17:15:00 +02:00
Mikkel Denker
293527008f saving content length before replacing some characters causes an offset in the written warcfile 2023-07-05 12:43:40 +02:00
Mikkel Denker
f95271bfe0 If the docset is empty, there is for sure no pattern match 2023-07-05 12:41:01 +02:00
Mikkel Denker
e1c16e3f20 insert crawler job responses in batches 2023-07-04 13:37:14 +02:00
Mikkel Denker
def23ac59a simplify ranking pipeline and fix offsets 2023-07-04 10:31:10 +02:00
Mikkel Denker
cd3a65fe79 use http instead of https as default protocol 2023-07-04 08:25:58 +02:00
Mikkel Denker
74af6d486a Make sure that warc writes are atomic.
If part of a warc record write fails, we should not write other parts of the record. This could lead to the warc reader being out of sync, which would cause some html bodies to be saved under the wrong url.
2023-07-03 16:57:55 +02:00
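One way to approach this is to serialize the whole record into memory and hand it to the writer in a single call. The sketch below is an illustration under stated assumptions, not the project's implementation: `write_record` is a hypothetical helper, the header set is heavily simplified compared to a real WARC record, and stripping NUL bytes stands in for whatever character replacement the crawler actually performs.

```rust
use std::io::{self, Write};

/// Serialize one simplified record into memory, then write it in one call.
fn write_record<W: Write>(out: &mut W, url: &str, body: &str) -> io::Result<()> {
    // Sanitize first so the length below describes the bytes actually written.
    // Assumption: removing NUL bytes is a placeholder for the real replacement.
    let body = body.replace('\0', "");

    let mut buf = Vec::with_capacity(body.len() + 128);
    write!(buf, "WARC/1.0\r\n")?;
    write!(buf, "WARC-Target-URI: {url}\r\n")?;
    write!(buf, "Content-Length: {}\r\n\r\n", body.len())?;
    buf.extend_from_slice(body.as_bytes());
    buf.extend_from_slice(b"\r\n\r\n");

    // Nothing reaches `out` unless the whole record was built successfully.
    out.write_all(&buf)
}
```

Building the record up front removes the case where a header is written and a later serialization step fails. It does not protect against the underlying writer failing partway through `write_all`. Computing `Content-Length` after the replacement also keeps the header consistent with the bytes actually written.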
Mikkel Denker
cb0e8593bd fixed crawler sampling 2023-07-03 15:26:08 +02:00
Mikkel Denker
625e02590d Don't use self node for scoring in query centrality.
We don't want to bias the top search results when calculating query centrality. Each node in the search results should be scored by its similarity to all the other nodes (not itself).
2023-07-03 11:35:10 +02:00
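A minimal sketch of the idea, assuming a precomputed square similarity matrix and a plain sum as the aggregation. Both are assumptions made for illustration; the log does not say how similarities are combined.

```rust
/// Score each node by its summed similarity to every *other* node.
/// `similarity[i][j]` is assumed to be a precomputed square matrix.
fn scores_excluding_self(similarity: &[Vec<f64>]) -> Vec<f64> {
    similarity
        .iter()
        .enumerate()
        .map(|(i, row)| {
            row.iter()
                .enumerate()
                .filter(|&(j, _)| j != i) // skip the node's similarity with itself
                .map(|(_, s)| *s)
                .sum::<f64>()
        })
        .collect()
}
```

Leaving out the diagonal keeps a node's self-similarity from adding to its own score.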
Mikkel Denker
498e78b430 Remove online/personal centrality.
It simply wasn't good enough and was too computationally expensive. It is better to have a faster search experience and only rely on inbound similarity for the personalization part, as this tends to work quite well while also being way simpler. Kill your darlings and all that.
2023-07-03 10:48:43 +02:00
Mikkel Denker
db7abf3d5d added log messages to indicate where we are in centrality calculations 2023-06-30 17:00:10 +02:00
Mikkel Denker
fb3700dfaf fix s3 warc download 2023-06-30 15:59:06 +02:00
Mikkel Denker
ae79c7af90 s3 warc source 2023-06-30 14:23:30 +02:00
Mikkel Denker
aaf5bd064f Revert store url states in crawldb on disk.
It introduced a significant performance penalty and for some reason kept trying to allocate an extreme amount of memory, causing OOM errors.
2023-06-30 11:07:47 +02:00
Mikkel Denker
bdd9d415d7 try rkyv serialization in crawl coordinator to hopefully fix oom error 2023-06-30 10:53:43 +02:00
Mikkel Denker
142ddebeb5 there seems to be a bug in bincode that causes an OOM crash by trying to allocate an extreme number of bytes 2023-06-30 10:14:09 +02:00
Mikkel Denker
565e7e1e70 Each worker now fetches 2*num_workers jobs.
This will decrease the pressure on the coordinator
2023-06-30 10:01:34 +02:00
Mikkel Denker
864fe3a7c0 Significantly decrease memory usage by storing url states on disk.
Hopefully this will not hurt performance too much. It would allow us to make much larger crawls
2023-06-30 09:59:35 +02:00
Mikkel Denker
b4af8bf24a forgot to open rocksdb with correct options... 2023-06-30 09:19:08 +02:00
Mikkel Denker
a6e7d85cab Disable WAL in rocksdb for crawl coordinator.
We don't care about data loss in this part. If a crawl coordinator fails, we will have to restart the crawl anyway. Full speeeeeeed
2023-06-30 09:15:36 +02:00