Mikkel Denker
|
cb46c3417a
|
Way better CLI
|
2022-06-24 18:37:16 +02:00 |
|
Mikkel Denker
|
085d6cae55
|
Very simple searching working. Still a lot of bugs to figure out, but it actually works
|
2022-06-24 00:15:40 +02:00 |
|
Mikkel Denker
|
da6c750375
|
Indexing using mapreduce, centrality store, mapreduce worker state
|
2022-06-23 21:44:46 +02:00 |
|
Mikkel Denker
|
ec0385b668
|
Merge search indexes and make indexes serializable by freezing
|
2022-06-23 15:30:04 +02:00 |
|
Mikkel Denker
|
6e959f75f0
|
ability to reduce into different type in mapreduce
|
2022-06-23 14:01:57 +02:00 |
|
Mikkel Denker
|
4164cd3aeb
|
limit warc files during webgraph building and other minor changes
|
2022-06-23 12:38:43 +02:00 |
|
Mikkel Denker
|
43d7ac4d93
|
only build host webgraph for now
|
2022-06-23 10:58:20 +02:00 |
|
Mikkel Denker
|
9fa16c6d9f
|
Retry if warc download fails
|
2022-06-23 09:48:59 +02:00 |
|
Mikkel Denker
|
8ac2654f06
|
cleanup warnings (unused functions and imports)
|
2022-06-23 09:31:04 +02:00 |
|
Mikkel Denker
|
da66ffa2ca
|
mapreduce types don't have to be sync
|
2022-06-18 21:09:06 +02:00 |
|
Mikkel Denker
|
7db1d92dba
|
concurrent reduce and map. Mapreduce doesn't use async anymore
|
2022-06-18 19:14:33 +02:00 |
|
Mikkel Denker
|
5b98e498fc
|
various webgraph insert + merge optimizations
|
2022-06-18 14:49:36 +02:00 |
|
Mikkel Denker
|
919c3c583f
|
mapreduce simple wire protocol to handle cases where job or result is greater than buf size
|
2022-06-17 14:48:05 +02:00 |
|
Mikkel Denker
|
002981da1c
|
parsing speedup by using tl crate if possible
|
2022-06-17 12:04:16 +02:00 |
|
Mikkel Denker
|
2db70f2354
|
less locking+cloning in webgraph. Default trait for webgraph store
|
2022-06-16 09:03:41 +02:00 |
|
Mikkel Denker
|
abcada0245
|
significant speedup in webgraph by inserting into memory and only flushing to disk when needed
|
2022-06-14 21:22:13 +02:00 |
|
Mikkel Denker
|
40a2c2c0f8
|
serialize/deserialize webgraph
|
2022-06-11 12:00:10 +02:00 |
|
Mikkel Denker
|
6f66d04f35
|
speedup webgraph with caching
|
2022-06-08 19:57:30 +02:00 |
|
Mikkel Denker
|
1c067a2443
|
mapreduce success async
|
2022-06-06 20:13:57 +02:00 |
|
Mikkel Denker
|
22f74e2d9d
|
fixed case where mapreduce worker panic did not register as failure at manager if 0 is a valid answer to the job
|
2022-06-03 22:47:18 +02:00 |
|
Mikkel Denker
|
e1a8ff33ec
|
mapreduce is now way more robust to failures
|
2022-06-03 13:35:01 +02:00 |
|
Mikkel Denker
|
2040134118
|
forgot a todo item for mapreduce
|
2022-06-02 15:56:54 +02:00 |
|
Mikkel Denker
|
344dc3fd19
|
simple mapreduce without much error handling
|
2022-06-02 15:44:56 +02:00 |
|
Mikkel Denker
|
718034fb9d
|
new mode: webgraph builder
|
2022-06-01 14:03:35 +02:00 |
|
Mikkel Denker
|
4433cc8965
|
moved warc download into warcfile
|
2022-06-01 13:56:11 +02:00 |
|
Mikkel Denker
|
58d2ee41ef
|
node iterator refactoring
|
2022-06-01 12:37:00 +02:00 |
|
Mikkel Denker
|
6ce0b10c94
|
merge graphs
|
2022-06-01 12:32:47 +02:00 |
|
Mikkel Denker
|
c8b51db9ef
|
Custom tokenizer that stems multiple languages and fixed snippet generation as a result of the changed tokenizer
|
2022-06-01 10:39:58 +02:00 |
|
Mikkel Denker
|
11953e8bdb
|
Tokenizer can produce multiple tokens for each term. Should search for stemmed and non-stemmed version.
|
2022-06-01 10:05:08 +02:00 |
|
Mikkel Denker
|
d90888d52f
|
host-level graph
|
2022-05-31 15:51:36 +02:00 |
|
Mikkel Denker
|
86b07e3f87
|
tokenizer module
|
2022-05-31 14:55:12 +02:00 |
|
Mikkel Denker
|
19916da491
|
navigational ranking only applies to homepage
|
2022-05-31 11:14:04 +02:00 |
|
Mikkel Denker
|
06fde74b37
|
less confusion between searcher and tantivy::Searcher
|
2022-05-31 10:15:07 +02:00 |
|
Mikkel Denker
|
43fb159742
|
refactoring
|
2022-05-31 10:10:37 +02:00 |
|
Mikkel Denker
|
8f98ebd0de
|
navigational query
|
2022-05-30 17:36:50 +02:00 |
|
Mikkel Denker
|
2870c5f5ad
|
simple snippet generation
|
2022-05-30 12:13:03 +02:00 |
|
Mikkel Denker
|
0d97184026
|
Added license to cargo.toml
|
2022-05-30 11:04:40 +02:00 |
|
Mikkel Denker
|
7eb4ea84dc
|
Added license to source code files
|
2022-05-30 10:56:56 +02:00 |
|
Mikkel Denker
|
494c6f1727
|
Added LICENSE.md
|
2022-05-30 10:30:36 +02:00 |
|
Mikkel Denker
|
86f5447437
|
clippy fix
|
2022-05-29 19:22:19 +02:00 |
|
Mikkel Denker
|
82f8222999
|
use harmonic centrality during ranking
|
2022-05-29 19:20:57 +02:00 |
|
Mikkel Denker
|
74723d64ba
|
searchable backlinks and store centrality in search engine
|
2022-05-29 17:51:12 +02:00 |
|
Mikkel Denker
|
05a438a64d
|
english stemming
|
2022-05-29 16:38:17 +02:00 |
|
Mikkel Denker
|
c1c1f9b215
|
very simple search
|
2022-05-27 15:38:42 +02:00 |
|
Mikkel Denker
|
0dc108772c
|
more structure
|
2022-05-26 16:59:54 +02:00 |
|
Mikkel Denker
|
d6d10c6efb
|
added some tests that represent some of the features we want
|
2022-05-26 16:44:12 +02:00 |
|
Mikkel Denker
|
c55af35f01
|
process warc files concurrently
|
2022-05-24 15:09:22 +02:00 |
|
Mikkel Denker
|
c6c200564c
|
download from http
|
2022-05-24 14:52:22 +02:00 |
|
Mikkel Denker
|
6f22aad455
|
clippy
|
2022-05-24 14:19:25 +02:00 |
|
Mikkel Denker
|
e638d062b2
|
sled graph store seems to work
|
2022-05-24 14:09:11 +02:00 |
|