Commit graph

413 commits

Author SHA1 Message Date
Mikkel Denker
c7625a042e stash ggml
just found the 'candle' library from huggingface written in pure rust. this would be a much better library for us to use as they already implement all the quantization features, f16 etc. and we can be sure it's safe. unfortunately, it renders all the work i did over the past couple of days on ggml bindings useless... stashing it here before i delete it in case we need it in the future.
2024-02-01 14:13:31 +01:00
Mikkel Denker
ea3b7a4099 implement some layers in ggml
linear, embedding and multihead attention
2024-01-31 17:51:02 +01:00
Mikkel Denker
ca62a0c5ee entity index skip portal pages 2024-01-29 19:48:26 +01:00
Mikkel Denker
c34849cae9 move spellcheck into separate api endpoint and only correct non-special terms.
we don't want to correct "site:...", "inurl:..." etc. parts of the query.
2024-01-29 14:07:34 +01:00
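
The idea in the commit above (correct only the non-special terms) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the prefix list, `is_special` and `correct_query` are hypothetical names, and the `correct` closure stands in for whatever spell checker is actually used.

```rust
/// Hypothetical list of query-operator prefixes that should never be corrected.
const SPECIAL_PREFIXES: [&str; 2] = ["site:", "inurl:"];

/// Returns true if the term starts with one of the operator prefixes.
fn is_special(term: &str) -> bool {
    SPECIAL_PREFIXES.iter().any(|p| term.starts_with(p))
}

/// Applies `correct` to every whitespace-separated term except special ones,
/// which are passed through unchanged.
fn correct_query(query: &str, correct: impl Fn(&str) -> String) -> String {
    query
        .split_whitespace()
        .map(|term| {
            if is_special(term) {
                term.to_string()
            } else {
                correct(term)
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```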
Mikkel Denker
f4e7d1972c actually skip disambiguation pages for entity index.
turns out that none of the usual disambiguation elements from the online wiki are present in the .zim dump. instead, disambiguation pages seem to have a "<meta property='mw:PageProp/disambiguation'>" element which we can use.

the commit also includes a useful script to dump the html for a specific article from a zim file, which is very useful when debugging this stuff.
2024-01-29 09:59:39 +01:00
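
A rough sketch of the check described in the commit above, assuming the page html is available as a string. It uses a naive substring match rather than a real html parser, and `is_disambiguation` is a hypothetical name, so treat it as an illustration of the marker being looked for, not as the actual implementation.

```rust
/// Naive check for the disambiguation marker mentioned in the commit message.
/// Accepts both single- and double-quoted attribute values.
fn is_disambiguation(html: &str) -> bool {
    html.contains("property='mw:PageProp/disambiguation'")
        || html.contains("property=\"mw:PageProp/disambiguation\"")
}
```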
Mikkel Denker
6bb7311838 hopefully more robust way to detect disambiguation pages.
this is difficult to test as there are no disambiguation pages in the dev .zim file. let's just try on prod
2024-01-28 13:04:59 +01:00
Mikkel Denker
e008e70033 if there are no simple terms in the query, we should use prefix of description (or body if there is no description) as snippet 2024-01-27 21:23:09 +01:00
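
The fallback in the commit above could look something like this. It is only a sketch under assumptions: `fallback_snippet` and `max_chars` are hypothetical, and an empty description is treated the same as a missing one, which the commit message does not state.

```rust
/// Returns a prefix of the description, or of the body when there is no
/// (non-empty) description. The prefix is cut at a character boundary.
fn fallback_snippet(description: Option<&str>, body: &str, max_chars: usize) -> String {
    let source = description
        .filter(|d| !d.trim().is_empty())
        .unwrap_or(body);
    source.chars().take(max_chars).collect()
}
```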
Mikkel Denker
2205849cfe properly strip 'https://' and 'http://' prefix from explore 2024-01-27 17:14:20 +01:00
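
For reference, stripping either scheme prefix can be done with `str::strip_prefix` from the standard library. The function name `strip_scheme` is hypothetical and this is not necessarily how the commit does it.

```rust
/// Removes a leading "https://" or "http://" if present, otherwise returns
/// the input unchanged.
fn strip_scheme(url: &str) -> &str {
    url.strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url)
}
```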
Mikkel Denker
88c8cbce41 get_webpage only looked at the response from the shard with the quickest response.
this obviously caused webpages to not be found seemingly at random, which caused summarizer to fail
2024-01-27 15:06:51 +01:00
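
The bug above boils down to taking the first response to arrive instead of the first response that actually contains the page. A minimal sketch of the fixed selection step, assuming every shard's answer has already been collected as an `Option` (the name `first_found` and this shape of the data are assumptions for illustration):

```rust
/// Returns the first shard response that actually contains a value,
/// skipping shards that answered "not found".
fn first_found<T>(responses: impl IntoIterator<Item = Option<T>>) -> Option<T> {
    responses.into_iter().flatten().next()
}
```

With this, a quick `None` from a shard that does not hold the page no longer hides a later `Some` from the shard that does.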
Mikkel Denker
b4159d57ed deduplicate related entities by their images.
looks better
2024-01-27 14:28:30 +01:00
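
One way to deduplicate by image, sketched with a hypothetical `Entity` type: keep the first entity for each image and drop later ones that reuse it. Keeping entities without an image is an assumption here; the commit message does not say how they are handled.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a related entity.
struct Entity {
    title: String,
    image: Option<String>,
}

/// Keeps the first entity seen for each image, preserving the input order.
fn dedup_by_image(entities: Vec<Entity>) -> Vec<Entity> {
    let mut seen = HashSet::new();
    entities
        .into_iter()
        .filter(|e| match &e.image {
            // `insert` returns false when the image was already seen.
            Some(img) => seen.insert(img.clone()),
            None => true,
        })
        .collect()
}
```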
Mikkel Denker
6d3af18bac skip disambiguation pages when constructing entity index 2024-01-27 13:36:52 +01:00
Mikkel Denker
045d5d4b05 boost title field matches in entity index 2024-01-27 13:21:04 +01:00
Mikkel Denker
1a9f381d15 GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, are view and concat needed?
2024-01-27 12:27:27 +01:00