Commit graph

1308 commits

Author | SHA1 | Message | Date
Mikkel Denker | cb46c3417a | Way better CLI | 2022-06-24 18:37:16 +02:00
Mikkel Denker | 085d6cae55 | Very simple searching working. Still a lot of bugs to figure out, but it actually works | 2022-06-24 00:15:40 +02:00
Mikkel Denker | da6c750375 | Indexing using mapreduce, centrality store, mapreduce worker state | 2022-06-23 21:44:46 +02:00
Mikkel Denker | ec0385b668 | Merge search indexes and make indexes serializable by freezing | 2022-06-23 15:30:04 +02:00
Mikkel Denker | 6e959f75f0 | ability to reduce into different type in mapreduce | 2022-06-23 14:01:57 +02:00
Mikkel Denker | 4164cd3aeb | limit warc files during webgraph building and other minor changes | 2022-06-23 12:38:43 +02:00
Mikkel Denker | 43d7ac4d93 | only build host webgraph for now | 2022-06-23 10:58:20 +02:00
Mikkel Denker | 9fa16c6d9f | Retry if warc download fails | 2022-06-23 09:48:59 +02:00
Mikkel Denker | 8ac2654f06 | cleanup warnings (unused functions and imports) | 2022-06-23 09:31:04 +02:00
Mikkel Denker | da66ffa2ca | mapreduce types doesn't have to be sync | 2022-06-18 21:09:06 +02:00
Mikkel Denker | 7db1d92dba | concurrent reduce and map. Mapreduce doesn't use async anymore | 2022-06-18 19:14:33 +02:00
Mikkel Denker | 5b98e498fc | various webgraph insert + merge optimizations | 2022-06-18 14:49:36 +02:00
Mikkel Denker | 919c3c583f | mapreduce simple wire protocol to handle cases where job or result is greater than buf size | 2022-06-17 14:48:05 +02:00
Mikkel Denker | 002981da1c | parsing speedup by using tl crate if possible | 2022-06-17 12:04:16 +02:00
Mikkel Denker | 2db70f2354 | less locking+cloning in webgraph. Default trait for webgraph store | 2022-06-16 09:03:41 +02:00
Mikkel Denker | abcada0245 | significant speedup in webgraph by inserting into memory and only flushing to disk when needed | 2022-06-14 21:22:13 +02:00
Mikkel Denker | 40a2c2c0f8 | serialize/deserialize webgraph | 2022-06-11 12:00:10 +02:00
Mikkel Denker | 6f66d04f35 | speedup webgraph with caching | 2022-06-08 19:57:30 +02:00
Mikkel Denker | 1c067a2443 | mapreduce success async | 2022-06-06 20:13:57 +02:00
Mikkel Denker | 22f74e2d9d | fixed case where mapreduce worker panic did not register as failure at manager if 0 is a valid answer to the job | 2022-06-03 22:47:18 +02:00
Mikkel Denker | e1a8ff33ec | mapreduce is now way more robust to failures | 2022-06-03 13:35:01 +02:00
Mikkel Denker | 2040134118 | forgot a todo item for mapreduce | 2022-06-02 15:56:54 +02:00
Mikkel Denker | 344dc3fd19 | simple mapreduce without much error handling | 2022-06-02 15:44:56 +02:00
Mikkel Denker | 718034fb9d | new mode: webgraph builder | 2022-06-01 14:03:35 +02:00
Mikkel Denker | 4433cc8965 | moved warc download into warcfile | 2022-06-01 13:56:11 +02:00
Mikkel Denker | 58d2ee41ef | node iterator refactoring | 2022-06-01 12:37:00 +02:00
Mikkel Denker | 6ce0b10c94 | merge graphs | 2022-06-01 12:32:47 +02:00
Mikkel Denker | c8b51db9ef | Custom tokenizer that stems multiple languages and fixed snippet generation as a result of the changed tokenizer | 2022-06-01 10:39:58 +02:00
Mikkel Denker | 11953e8bdb | Tokenizer can produce multiple tokens for each term. Should search for stemmed and non-stemmed version. | 2022-06-01 10:05:08 +02:00
Mikkel Denker | d90888d52f | host-level graph | 2022-05-31 15:51:36 +02:00
Mikkel Denker | 86b07e3f87 | tokenizer module | 2022-05-31 14:55:12 +02:00
Mikkel Denker | 19916da491 | navigational ranking only applies to homepage | 2022-05-31 11:14:04 +02:00
Mikkel Denker | 06fde74b37 | less confusion between searcher and tantivy::Searcher | 2022-05-31 10:15:07 +02:00
Mikkel Denker | 43fb159742 | refactoring | 2022-05-31 10:10:37 +02:00
Mikkel Denker | 8f98ebd0de | navigational query | 2022-05-30 17:36:50 +02:00
Mikkel Denker | 2870c5f5ad | simple snippet generation | 2022-05-30 12:13:03 +02:00
Mikkel Denker | 0d97184026 | Added license to cargo.toml | 2022-05-30 11:04:40 +02:00
Mikkel Denker | 7eb4ea84dc | Added license to source code files | 2022-05-30 10:56:56 +02:00
Mikkel Denker | 494c6f1727 | Added LICENSE.md | 2022-05-30 10:30:36 +02:00
Mikkel Denker | 86f5447437 | clippy fix | 2022-05-29 19:22:19 +02:00
Mikkel Denker | 82f8222999 | use harmonic centrality during ranking | 2022-05-29 19:20:57 +02:00
Mikkel Denker | 74723d64ba | searchable backlinks and store centrality in search engine | 2022-05-29 17:51:12 +02:00
Mikkel Denker | 05a438a64d | english stemming | 2022-05-29 16:38:17 +02:00
Mikkel Denker | c1c1f9b215 | very simple search | 2022-05-27 15:38:42 +02:00
Mikkel Denker | 0dc108772c | more structure | 2022-05-26 16:59:54 +02:00
Mikkel Denker | d6d10c6efb | added some tests that represents some of the features we want | 2022-05-26 16:44:12 +02:00
Mikkel Denker | c55af35f01 | process warc files concurrently | 2022-05-24 15:09:22 +02:00
Mikkel Denker | c6c200564c | download from http | 2022-05-24 14:52:22 +02:00
Mikkel Denker | 6f22aad455 | clippy | 2022-05-24 14:19:25 +02:00
Mikkel Denker | e638d062b2 | sled graph store seems to work | 2022-05-24 14:09:11 +02:00