Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
21465a706a add rocksdb indexes to lfs (maybe) 2022-08-17 09:43:26 +02:00
Mikkel Denker
c3e51d23e8 remove some files from git lfs 2022-08-17 09:36:14 +02:00
Mikkel Denker
d8a0de3fa0 remove logfiles from data dir 2022-08-16 18:16:48 +02:00
Mikkel Denker
d2580902aa add data folder to git lfs 2022-08-16 17:30:55 +02:00
Mikkel Denker
2e11771b62 maybe fix lfs 2022-08-16 16:34:44 +02:00
Mikkel Denker
91f501ac1e data folder in lfs 2022-08-16 16:33:46 +02:00
Mikkel Denker
f7e88fab9d data folder in lfs 2022-08-16 16:31:38 +02:00
Mikkel Denker
c87f21e018 include Cargo.lock 2022-08-16 13:55:59 +02:00
Mikkel Denker
d1ece2af46 Ftr/entity sidebar (#11)
* show basic (and ugly) entity information during search

* refactor image downloads into an image downloader

* download entity images

* show entity image and link to wiki article

* show related entities

* show links in entity text and infobox

* don't match stopwords in entity titles during search

* remove stopwords from entity query

* fixed bug where some wikipedia images weren't downloaded

* prettify the primary image a bit
2022-08-16 13:52:31 +02:00
Mikkel Denker
8d5d375f0b clippy fix 2022-08-12 14:39:15 +02:00
Mikkel Denker
082957f732 download sample warc file if not present on local run 2022-08-12 14:04:11 +02:00
Mikkel Denker
ab10ac851e empty data folder in git 2022-08-12 13:56:32 +02:00
Mikkel Denker
3861386da6 take freshness into account during ranking 2022-08-10 09:52:06 +02:00
Mikkel Denker
b07c67559a Refactor inverted index to its own struct.
Extra information such as favicon and other images are now separated from the inverted index and everything is stored in the index. This also makes it easier to keep a list of all pending image download jobs and execute these asynchronously (which we now do).
2022-08-10 08:51:51 +02:00
Mikkel Denker
7f42601f46 Index (almost) all text on page. Only use clean text for snippets unless absolutely necessary. 2022-08-09 15:27:24 +02:00
Mikkel Denker
7e31e98bb7 Fixed bug that caused some queries to panic.
Since we didn't check if the positions[i] vector had elements, some searches caused the weights and positions vectors to have a different number of elements, thereby crashing on .zip_eq
2022-08-09 14:54:53 +02:00
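A minimal sketch of the kind of guard the fix above describes, not the project's actual code. The function name `collect_weighted` and the `(weight, positions)` input shape are assumptions made for illustration. The idea is to build both vectors in the same pass, so a term with no positions is skipped in both and the lengths can never diverge before a `.zip_eq`-style pairing.

```rust
/// Hypothetical helper (not from the repository): splits per-term data into
/// parallel `weights` and `positions` vectors.
///
/// Terms without any recorded positions are skipped in BOTH vectors, so the
/// two results always have the same length.
fn collect_weighted(terms: &[(f64, Vec<u32>)]) -> (Vec<f64>, Vec<Vec<u32>>) {
    let mut weights = Vec::new();
    let mut positions = Vec::new();

    for (weight, term_positions) in terms {
        // The guard: an empty position list contributes to neither vector.
        if term_positions.is_empty() {
            continue;
        }
        weights.push(*weight);
        positions.push(term_positions.clone());
    }

    (weights, positions)
}
```

Pushing to both vectors behind a single check keeps the length invariant local to one loop, rather than relying on two separate filters agreeing with each other.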
Mikkel Denker
499488d454 if no snippet found, use webpage description if present 2022-08-09 14:47:17 +02:00
Mikkel Denker
4f55c72dd2 show update time in search result 2022-08-09 14:28:53 +02:00
Mikkel Denker
549d3fb074 show primary image in search results 2022-08-09 13:00:27 +02:00
Mikkel Denker
2fe0c1b5dc specialized metadata functions 2022-08-09 09:51:43 +02:00
Mikkel Denker
82e7be6d20 specialized updated_time function 2022-08-08 16:51:43 +02:00
Mikkel Denker
975f229431 parse schema org json-ld data 2022-08-08 16:03:28 +02:00
Mikkel Denker
70163f970c show favicons in search result 2022-08-08 14:48:06 +02:00
Mikkel Denker
003b54b820 image store with filters 2022-08-08 10:24:17 +02:00
Mikkel Denker
d19597beba refactor kv trait out from webgraph 2022-08-08 10:24:08 +02:00
Mikkel Denker
0ed6bc236d Stemmed terms now show up in the generated snippets. We use the language of the webpage to determine which stemmer to use. 2022-08-05 12:52:47 +02:00
Mikkel Denker
87f5695913 fixed broken README link to license file 2022-08-04 13:32:48 +02:00
Mikkel Denker
c3c1cb7f00 use fetch time during ranking 2022-08-04 12:51:47 +02:00
Mikkel Denker
faddb613aa minor refactoring of url functions into a shared struct 2022-08-04 12:30:40 +02:00
Mikkel Denker
ddff7aebe6 show protocol in search url 2022-08-03 18:59:59 +02:00
Mikkel Denker
822213c10e term proximity ranking 2022-08-03 18:49:11 +02:00
Mikkel Denker
976cc70d34 spans for term proximity ranking 2022-08-03 17:01:32 +02:00
Mikkel Denker
e6c88d64e1 Match webpage terms across fields.
If a webpage matches TermA in url and TermB in body it should still be a correct match even though no single field has both TermB and TermA.
2022-07-22 11:08:37 +02:00
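The matching rule in the commit above can be sketched as follows. This is an illustration only: the `Page` struct, its two fields, and both function names are hypothetical, and the real implementation sits inside a tantivy query rather than in plain sets.

```rust
use std::collections::HashSet;

/// Hypothetical page with two already-tokenized fields.
struct Page {
    url_terms: HashSet<String>,
    body_terms: HashSet<String>,
}

/// Cross-field rule: every query term must occur in at least one field,
/// but different terms may be satisfied by different fields.
fn matches_across_fields(page: &Page, query: &[&str]) -> bool {
    query
        .iter()
        .all(|term| page.url_terms.contains(*term) || page.body_terms.contains(*term))
}

/// Stricter per-field rule, shown for contrast: a single field must contain
/// all query terms.
fn matches_single_field(page: &Page, query: &[&str]) -> bool {
    let all_in = |field: &HashSet<String>| query.iter().all(|term| field.contains(*term));
    all_in(&page.url_terms) || all_in(&page.body_terms)
}
```

A page with TermA only in the URL and TermB only in the body passes the cross-field rule and fails the per-field one, which is the case the commit message describes.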
Mikkel Denker
d5efc66c31 Custom tantivy query implementation.
Preparations for expanded spans ranking which allows us to rank documents based on term proximity. Overall this change gives more control over the search+ranking procedure.
2022-07-22 10:41:39 +02:00
Mikkel Denker
c37db38eea Bold suffix for suggestions 2022-07-01 18:30:48 +02:00
Mikkel Denker
d64847d874 Better UX for searchbar suggestions 2022-07-01 14:54:31 +02:00
Mikkel Denker
553d231da2 Move webgraph to use RocksDB as store 2022-06-30 15:41:11 +02:00
Mikkel Denker
96b2d3e6ea Refactor webgraph for easier KV-store additions.
GraphStore is now a struct backed by something that implements a new Kv trait.
This makes it far easier to implement another backing key-value store, as all the caching
etc. from the graph database is abstracted away.
This change is introduced since we want to try RocksDB instead of Sled as Sled caused huge memory consumption
when indexing 10_000 warc files. It brought a 256 GB RAM server to its knees.
2022-06-30 12:55:09 +02:00
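A rough sketch of the design described in the commit above, assuming a much smaller interface than the real one. Only the names `Kv` and `GraphStore` come from the commit message. The method signatures, the `MemoryKv` backend, and the newline-separated edge encoding are assumptions for illustration, and the caching mentioned in the message is left out.

```rust
use std::collections::HashMap;

/// Assumed minimal key-value interface; the real trait's methods are not
/// shown in the log.
trait Kv {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>);
}

/// Hypothetical in-memory backend. A RocksDB- or Sled-backed type would
/// implement the same trait.
#[derive(Default)]
struct MemoryKv {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl Kv for MemoryKv {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }

    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.map.insert(key, value);
    }
}

/// Graph store that is generic over its backing key-value store.
struct GraphStore<S: Kv> {
    store: S,
}

impl<S: Kv> GraphStore<S> {
    fn new(store: S) -> Self {
        Self { store }
    }

    /// Appends `to` to the newline-separated adjacency list stored under `from`.
    fn add_edge(&mut self, from: &str, to: &str) {
        let mut edges = self.store.get(from.as_bytes()).unwrap_or_default();
        if !edges.is_empty() {
            edges.push(b'\n');
        }
        edges.extend_from_slice(to.as_bytes());
        self.store.insert(from.as_bytes().to_vec(), edges);
    }

    /// Returns the targets of all edges leaving `from` (empty if none).
    fn outgoing(&self, from: &str) -> Vec<String> {
        let bytes = self.store.get(from.as_bytes()).unwrap_or_default();
        let text = String::from_utf8_lossy(&bytes);
        text.lines().map(str::to_owned).collect()
    }
}
```

Because `GraphStore` only talks to the trait, swapping the backend amounts to constructing it with a different `Kv` implementation, for example `GraphStore::new(MemoryKv::default())` in tests.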
Mikkel Denker
ea20d43df8 Highlight stemmed matches in snippet generation 2022-06-30 08:48:45 +02:00
Mikkel Denker
b526a782a6 Handle HTML comments in lexer 2022-06-29 19:48:23 +02:00
Mikkel Denker
2ebc020da2 Significant performance improvements to text extraction 2022-06-29 16:37:28 +02:00
Mikkel Denker
fda6c1698c Initial JustText algorithm.
Generate better snippets by guessing which parts of the website are viable content.
This also improves search accuracy as the BM25 score becomes way better (it is given cleaner data).
Performance could probably be immensely improved, but it seems to work.
2022-06-29 15:23:37 +02:00
Mikkel Denker
6bc6ee352d [WIP] Query suggestions 2022-06-29 08:55:30 +02:00
Mikkel Denker
c500c0e71c Custom HTML lexer based on logos.
There is no need to parse the entire html file into a tree since a stream of tags will be sufficient for extracting links+text.
This should be faster than parsing. It will also make it easier to implement the 'just_text' algorithm.
2022-06-28 13:49:50 +02:00
Mikkel Denker
97583078a6 Better navigational ranking 2022-06-27 11:10:41 +02:00
Mikkel Denker
0632110a8d Separate stemmed and non-stemmed fields.
This allows us to prioritize non-stemmed matches over stemmed matches.
2022-06-27 10:37:28 +02:00
Mikkel Denker
4bdfe59b36 show query in searchbar 2022-06-27 09:25:17 +02:00
Mikkel Denker
21149d8ced Merge all segments at the end of map, don't merge automatically.
This hugely increases the indexing performance as we don't merge as often.
2022-06-27 09:12:29 +02:00
Mikkel Denker
383d84266f Optimizations to index schema and indexing.
There seems to be a problem with large fast-fields, both in terms of crashes and performance.
We therefore now check whether a query-document match is navigational by indexing the host string
and boosting the field during search. In the future, we may want to add a boolean fast-field to
indicate if the given webpage is the homepage and only do navigational searches on those.
The commit also increases indexing resources in the hopes that this reduces the number of merges
during indexing.
2022-06-25 19:06:04 +02:00
Mikkel Denker
3c907b60b7 initial frontend 2022-06-25 17:14:07 +02:00