Mikkel Denker
05b87c95dd
http protocol in explore
2023-07-20 15:48:15 +02:00
Mikkel Denker
3c2c0bf758
Use md5 digest as id for webgraph nodes.
...
There is a small collision risk where multiple nodes get the same id in the graph.
This risk is extremely small since the digest is 128 bits, and the benefits we get from the significantly reduced complexity of everything surrounding the webgraph hugely outweigh it.
It's now much faster to get edges for a node in the graph, since we don't need to map segment_node_ids to node_ids. Segment merges are also far less complex.
2023-07-20 15:41:08 +02:00
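To put the collision risk for a 128-bit id in perspective, here is a minimal sketch of the standard birthday-bound approximation. The function name and the node count of one trillion are hypothetical and do not come from the repository.

```rust
/// Birthday-bound approximation for the probability that at least two of
/// `num_nodes` uniformly random `digest_bits`-bit ids collide:
/// p ~= n * (n - 1) / 2 / 2^bits. Only meaningful while p is far below 1.
fn collision_probability(num_nodes: f64, digest_bits: i32) -> f64 {
    let pairs = num_nodes * (num_nodes - 1.0) / 2.0;
    pairs / 2f64.powi(digest_bits)
}
```

With a hypothetical graph of 1e12 nodes and 128-bit ids, `collision_probability(1e12, 128)` comes out on the order of 1e-15.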
Mikkel Denker
7c95231011
update crawler description text
2023-07-19 17:00:22 +02:00
Mikkel Denker
a545770b6e
Change user agent of crawler.
...
Reddit seems to look for "bot" in the user agent. If it cannot find the substring, it returns a page that updates the title with JavaScript. This leaves us with a bunch of Reddit pages titled "Reddit - Dive into anything" in the search results.
2023-07-19 16:40:30 +02:00
Mikkel Denker
ceaf443884
calibrate discussions widget
2023-07-19 10:18:13 +02:00
Mikkel Denker
405e59335e
better caching for node ids in webgraph
2023-07-18 19:09:58 +02:00
Mikkel Denker
d9b0fd7450
Improve read performance for webgraph.
...
We have split adjacency into small_adjacency and full_adjacency rocksdb databases (same for reverse). This allows us to read edge labels only when we need them, which increases read performance for the webgraph edges at the expense of storage usage.
2023-07-18 16:17:45 +02:00
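The small/full split can be sketched as follows. This is only an illustration: in-memory `HashMap`s stand in for the rocksdb databases, and the `Adjacency` type, its methods and the `u128` node id are assumptions, not the project's actual code.

```rust
use std::collections::HashMap;

// Assumption: a 128-bit node id.
type NodeId = u128;

#[derive(Default)]
struct Adjacency {
    // Neighbor ids only: cheap to read when labels are not needed.
    small: HashMap<NodeId, Vec<NodeId>>,
    // Neighbor ids plus edge labels: larger, read only on demand.
    full: HashMap<NodeId, Vec<(NodeId, String)>>,
}

impl Adjacency {
    // Every edge is stored twice, which is the storage cost of the split.
    fn insert(&mut self, from: NodeId, to: NodeId, label: &str) {
        self.small.entry(from).or_default().push(to);
        self.full.entry(from).or_default().push((to, label.to_string()));
    }

    fn neighbors(&self, from: NodeId) -> &[NodeId] {
        match self.small.get(&from) {
            Some(v) => v.as_slice(),
            None => &[],
        }
    }

    fn labeled_edges(&self, from: NodeId) -> &[(NodeId, String)] {
        match self.full.get(&from) {
            Some(v) => v.as_slice(),
            None => &[],
        }
    }
}
```

Callers that only traverse the graph would use `neighbors`, and only callers that need the labels would pay for `labeled_edges`.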
Mikkel Denker
802bddedae
respect robots meta tag and use canonical url
2023-07-18 13:28:34 +02:00
Mikkel Denker
99438846ac
forgot to add shadows to screenshot
2023-07-18 11:10:07 +02:00
Mikkel Denker
330c02a96f
rust update and new git screenshot
2023-07-18 11:04:58 +02:00
Mikkel Denker
3f69e3f062
make type of centrality store explicit in config variable name
2023-07-17 11:09:32 +02:00
Mikkel Denker
361e0bbc2f
More logging if something looks off during indexing.
...
This might be due to a wrong path for the centrality stores, webgraph, etc.
2023-07-17 10:10:38 +02:00
Mikkel Denker
c68c7ca96c
fewer approximations in similar sites finder
2023-07-16 18:28:45 +02:00
Mikkel Denker
c392f689f9
Remove www prefix when inserting into web graph
2023-07-15 21:58:54 +02:00
Mikkel Denker
43bd806fef
Retry urls that return 429 during crawl
2023-07-14 08:25:09 +02:00
Mikkel Denker
1af94339bb
increase max politeness, faster setup and some documentation
2023-07-13 17:53:46 +02:00
Mikkel Denker
a31755b874
Language makes more sense than region.
...
Currently region is only detected by language, so from a user perspective it makes more sense to call the dropdown "language".
2023-07-13 13:54:42 +02:00
Mikkel Denker
aa357307d7
forgot to update primary key name in scylla chats table
2023-07-13 11:45:23 +02:00
Mikkel Denker
62d8a579dd
test hardening
2023-07-12 14:11:05 +02:00
Mikkel Denker
d9bf123572
url digits and slashes are computable before search
2023-07-12 12:03:28 +02:00
Mikkel Denker
1e2ee6b8ea
number of slashes and digits in url as ranking signal
2023-07-12 11:16:33 +02:00
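A minimal sketch of how such a signal could be computed. The function name is hypothetical, and counting over the whole URL string (including the `//` after the scheme) is an assumption. How the counts feed into ranking is not shown.

```rust
/// Returns (number of '/' characters, number of ASCII digits) in `url`.
/// Assumption: both counts run over the entire URL string.
fn url_signals(url: &str) -> (usize, usize) {
    let slashes = url.chars().filter(|&c| c == '/').count();
    let digits = url.chars().filter(|c| c.is_ascii_digit()).count();
    (slashes, digits)
}
```

Both values depend only on the URL, so they can be computed ahead of search time.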
Mikkel Denker
4f1a7079e4
formatting and raw http for s3 endpoint
2023-07-11 17:47:42 +02:00
Mikkel Denker
5f737943cc
normalize graph nodes whenever a node struct is constructed
2023-07-06 17:20:58 +02:00
Mikkel Denker
af1e43206b
Normalize redirects in web graph
2023-07-06 16:39:52 +02:00
Mikkel Denker
2191b4734c
a bunch of changes primarily to centrality store to make it store more stuff on disk
2023-07-06 14:28:02 +02:00
Mikkel Denker
ec76b7fb63
rename webgraph url level to page level
2023-07-06 12:12:38 +02:00
Mikkel Denker
3a87834547
Re-write webgraph to be based on rocksdb.
...
This allows us to create really big graphs (needed for page-level webgraphs).
2023-07-06 12:02:23 +02:00
Mikkel Denker
e858c004f1
Lower domain weight after the urls have been sampled
2023-07-06 08:05:44 +02:00
Mikkel Denker
0ea90aefdd
harmonic centrality exact counting threshold
2023-07-05 19:59:53 +02:00
Mikkel Denker
6aa3766d79
configurable webgraph level
2023-07-05 18:45:55 +02:00
Mikkel Denker
bf68865f89
only search single replica with retries
2023-07-05 17:15:00 +02:00
Mikkel Denker
293527008f
saving content length before replacing some characters causes an offset in the written warcfile
2023-07-05 12:43:40 +02:00
Mikkel Denker
f95271bfe0
If the docset is empty, there is for sure no pattern match
2023-07-05 12:41:01 +02:00
Mikkel Denker
e1c16e3f20
insert crawler job responses in batches
2023-07-04 13:37:14 +02:00
Mikkel Denker
def23ac59a
simplify ranking pipeline and fix offsets
2023-07-04 10:31:10 +02:00
Mikkel Denker
cd3a65fe79
use http instead of https as default protocol
2023-07-04 08:25:58 +02:00
Mikkel Denker
74af6d486a
Make sure that warc writes are atomic.
...
If part of a warc record write fails, we should not write the other parts of the record. Doing so could put the warc reader out of sync, which would cause some html bodies to be saved under the wrong url.
2023-07-03 16:57:55 +02:00
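One way to get all-or-nothing record writes is to assemble the whole record in memory and hand it to the writer in a single call. The sketch below assumes a simplified, hypothetical record layout (not the real WARC header set) and a hypothetical `write_record` helper. It also takes the content length from the bytes that are actually written.

```rust
use std::io::{self, Write};

/// Hypothetical, simplified record; real WARC records carry more headers.
struct Record<'a> {
    url: &'a str,
    body: &'a str,
}

/// Builds the complete record in a buffer first, so a failure while
/// formatting any part leaves `out` untouched.
fn write_record<W: Write>(out: &mut W, rec: &Record<'_>) -> io::Result<()> {
    let body = rec.body.as_bytes();
    let mut buf: Vec<u8> = Vec::new();
    // Content-Length is computed from the final bytes, not an earlier copy.
    write!(
        buf,
        "WARC-Target-URI: {}\r\nContent-Length: {}\r\n\r\n",
        rec.url,
        body.len()
    )?;
    buf.extend_from_slice(body);
    buf.extend_from_slice(b"\r\n\r\n");
    // Single write of the whole record.
    out.write_all(&buf)
}
```

Note that `write_all` on a file can still fail partway through, so this narrows the window for a torn record but does not eliminate it.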
Mikkel Denker
cb0e8593bd
fixed crawler sampling
2023-07-03 15:26:08 +02:00
Mikkel Denker
625e02590d
Don't use self node for scoring in query centrality.
...
We don't want a bias towards the top search results when calculating query centrality. Each node in the search results should be scored by its similarity to all the other nodes (not to itself).
2023-07-03 11:35:10 +02:00
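The idea of leaving the self node out can be sketched as below. The pairwise similarity matrix and the function name are hypothetical stand-ins, and the actual similarity measure used for query centrality is not shown here.

```rust
/// For each node i, sums its similarity to every other node j != i.
/// `sim[i][j]` is an assumed, precomputed pairwise similarity.
fn query_centrality(sim: &[Vec<f64>]) -> Vec<f64> {
    (0..sim.len())
        .map(|i| {
            sim[i]
                .iter()
                .enumerate()
                .filter(|&(j, _)| j != i) // skip the node's similarity to itself
                .map(|(_, s)| *s)
                .sum::<f64>()
        })
        .collect()
}
```

Because the diagonal is skipped, a node's self-similarity cannot inflate its own score.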
Mikkel Denker
498e78b430
Remove online/personal centrality.
...
It simply wasn't good enough and was too computationally expensive. It is better to have a faster search experience and only rely on inbound similarity for the personalization part, as this tends to work quite well while also being way simpler. Kill your darlings and all that.
2023-07-03 10:48:43 +02:00
Mikkel Denker
db7abf3d5d
added log messages to indicate where we are in centrality calculations
2023-06-30 17:00:10 +02:00
Mikkel Denker
fb3700dfaf
fix s3 warc download
2023-06-30 15:59:06 +02:00
Mikkel Denker
ae79c7af90
s3 warc source
2023-06-30 14:23:30 +02:00
Mikkel Denker
aaf5bd064f
Revert store url states in crawldb on disk.
...
It introduced a significant performance penalty and for some reason kept trying to allocate an extreme amount of memory, causing oom crashes.
2023-06-30 11:07:47 +02:00
Mikkel Denker
bdd9d415d7
try rkyv serialization in crawl coordinator to hopefully fix oom error
2023-06-30 10:53:43 +02:00
Mikkel Denker
142ddebeb5
there seems to be a bug in bincode that causes an oom crash when it tries to allocate an extreme number of bytes
2023-06-30 10:14:09 +02:00
Mikkel Denker
565e7e1e70
Each worker now fetches 2*num_workers jobs.
...
This will decrease the pressure on the coordinator
2023-06-30 10:01:34 +02:00
Mikkel Denker
864fe3a7c0
Significantly decrease memory usage by storing url states on disk.
...
Hopefully this will not hurt performance too much. It would allow us to make much larger crawls.
2023-06-30 09:59:35 +02:00
Mikkel Denker
b4af8bf24a
forgot to open rocksdb with correct options...
2023-06-30 09:19:08 +02:00
Mikkel Denker
a6e7d85cab
Disable WAL in rocksdb for crawl coordinator.
...
We don't care about data loss in this part. If a crawl coordinator fails, we will have to restart the crawl anyway. Full speeeeeeed
2023-06-30 09:15:36 +02:00