Commit graph

1308 commits

Author SHA1 Message Date
SekoiaTree
2a62f9c28a Make places that use rules use rule OR-ing (#133) 2024-02-08 18:51:12 +01:00
Mikkel Denker
aed64be27e internet archive warc files do not seem to store the payload type. let's just assume it's html (records that can't be parsed are skipped anyway) 2024-02-08 16:14:24 +01:00
Mikkel Denker
91ffe15cfd support internet archive warc files.
the ordering seems to be a bit different than the ones from commoncrawl. these changes should hopefully make the parser overall more robust.
2024-02-07 21:15:08 +01:00
Mikkel Denker
a4ce85a905 bump optics samples submodule 2024-02-07 13:28:57 +01:00
Mikkel Denker
be7bbd02fc Forgot to add serde defaults to the new snippet config fields... 2024-02-06 19:38:45 +01:00
SekoiaTree
359117bfbe Initial rule or-ing (#127) 2024-02-06 19:34:58 +01:00
Mikkel Denker
aa89813906 move some of the hardcoded snippet choices into the configuration file 2024-02-06 11:19:42 +01:00
Mikkel Denker
d9136d59fd force ttl directly in scylla table 2024-02-06 11:15:22 +01:00
Mikkel Denker
fd33eb4b66 fixed bug where search suggestions could not be closed in safari 2024-02-05 15:27:05 +01:00
Mikkel Denker
f61f1f6b0f fix bug where query suggestions couldn't be selected in safari 2024-02-05 14:57:26 +01:00
Mikkel Denker
bfbb6efeb0 update config file names 2024-02-05 12:04:20 +01:00
Mikkel Denker
e8489f4792 flatten fastfield reader from vec<vec<u64>> to a large vec<u64> 2024-02-04 15:14:49 +01:00
Mikkel Denker
625d6fc6b7 no need for enummap in fastfield reader as we know all fields statically 2024-02-04 15:05:14 +01:00
Mikkel Denker
eb7e96bf50 better cache locality for fast field reader.
instead of going from field -> doc -> value, we can go from doc -> field -> value and thereby reuse the doc -> field part for all fields in the document.
2024-02-04 14:51:50 +01:00
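A minimal sketch of the doc -> field -> value layout described in the fastfield commits above, assuming every field is a fixed-width u64 and the number of fields is known up front. `FlatFastFields` and its methods are hypothetical names for illustration, not the reader's actual API.

```rust
/// Hypothetical flat fast-field store (illustrative, not the project's real type).
/// All values live in one contiguous Vec<u64>, laid out row-major by document,
/// so the fields of a single document sit next to each other in memory.
pub struct FlatFastFields {
    num_fields: usize,
    values: Vec<u64>,
}

impl FlatFastFields {
    pub fn new(num_fields: usize) -> Self {
        Self { num_fields, values: Vec::new() }
    }

    /// Appends one document; `fields` must hold exactly one value per field.
    pub fn push_doc(&mut self, fields: &[u64]) {
        assert_eq!(fields.len(), self.num_fields, "one value per field expected");
        self.values.extend_from_slice(fields);
    }

    /// Resolves the doc -> fields part once; the slice can be reused for every field.
    pub fn doc(&self, doc: usize) -> Option<&[u64]> {
        let start = doc.checked_mul(self.num_fields)?;
        let end = start.checked_add(self.num_fields)?;
        self.values.get(start..end)
    }

    /// Looks up a single value via doc -> field.
    pub fn get(&self, doc: usize, field: usize) -> Option<u64> {
        self.doc(doc)?.get(field).copied()
    }
}
```

Reading several fields of the same document then costs one offset computation followed by adjacent loads, which is the cache-locality argument made in the commit message.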
Mikkel Denker
b88e7fd013 search example reduce words considered for snippet generation 2024-02-04 13:30:28 +01:00
Mikkel Denker
5164789a32 move search bench into an example so we can more easily profile with perf 2024-02-04 13:28:09 +01:00
Mikkel Denker
099282cefa don't need to send all ranking signals to frontend for discussions widget. sending final score is enough 2024-02-03 16:48:43 +01:00
Mikkel Denker
7919df0863 'optimize_for_search' actually seemed to make the searches slower as too many threads would fight for io access at once 2024-02-03 14:36:19 +01:00
Mikkel Denker
10043e7db6 reduce fastfield indirections 2024-02-03 14:25:17 +01:00
Mikkel Denker
c46a85a97e load all fastfields into memory.
this is an experiment to see how it affects performance vs memory usage
2024-02-02 22:06:54 +01:00
Mikkel Denker
45d8245374 keep fastfield reader open across searches
52% of time seems to be spent on opening the fastfields
2024-02-02 15:10:41 +01:00
Mikkel Denker
9bcaf054c3 bench based on queries from autosuggest.
this will hopefully show us where the hot paths are when caches aren't hit
2024-02-02 14:50:54 +01:00
Mikkel Denker
e4e3044e47 finally ditch that pesky libtorch dependency! 2024-02-02 13:11:06 +01:00
Mikkel Denker
d7e564d91a move neural network models from torch to candle 2024-02-02 12:36:39 +01:00
Mikkel Denker
c7625a042e stash ggml
just found the 'candle' library from huggingface written in pure rust. this would be a much better library for us to use as they already implement all the quantization features, f16 etc. and we can be sure it's safe. unfortunately, it renders all the work i did over the past couple of days on ggml bindings useless... stashing it here before i delete it in case we need it in the future.
2024-02-01 14:13:31 +01:00
Mikkel Denker
ea3b7a4099 implement some layers in ggml
linear, embedding and multihead attention
2024-01-31 17:51:02 +01:00
Crispy
ecdc4f89cf fix outdated link in settings/privacy (#123) 2024-01-30 13:33:28 +01:00
Mikkel Denker
ca62a0c5ee entity index skip portal pages 2024-01-29 19:48:26 +01:00
Mikkel Denker
e8732d7877 annotate summary box with aria-busy to indicate that screen readers might want to wait until summary is done generating 2024-01-29 15:46:11 +01:00
Mikkel Denker
c34849cae9 move spellcheck into separate api endpoint and only correct non-special terms.
we don't want to correct "site:...", "inurl:..." etc. parts of the query.
2024-01-29 14:07:34 +01:00
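A rough sketch of correcting only the non-special terms of a query, as the spellcheck commit above describes. The prefix list contains just the two operators named in the message, the whitespace tokenization is a simplification, and `correct_query` together with its `correct` callback are hypothetical stand-ins for the real spell checker.

```rust
/// Operators named in the commit message; the real list is likely longer (assumption).
const SPECIAL_PREFIXES: [&str; 2] = ["site:", "inurl:"];

/// Returns true for terms such as "site:example.com" that must be left untouched.
fn is_special(term: &str) -> bool {
    SPECIAL_PREFIXES.iter().any(|prefix| term.starts_with(prefix))
}

/// Applies `correct` to every plain term and passes special terms through unchanged.
/// `correct` returns Some(replacement) when it has a suggestion, None otherwise.
pub fn correct_query<F>(query: &str, mut correct: F) -> String
where
    F: FnMut(&str) -> Option<String>,
{
    query
        .split_whitespace()
        .map(|term| {
            if is_special(term) {
                term.to_string()
            } else {
                correct(term).unwrap_or_else(|| term.to_string())
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```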
Mikkel Denker
f4e7d1972c actually skip disambiguation pages for entity index.
turns out that none of the usual disambiguation elements from the online wiki are present in the .zim dump. instead, disambiguation pages seem to have a "<meta property='mw:PageProp/disambiguation'>" element which we can use.

the commit also includes a useful script to dump the html for a specific article from a zim file, which is very useful when debugging this stuff
2024-01-29 09:59:39 +01:00
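A simplified sketch of the detection idea from the commit above: look for a meta element carrying the `mw:PageProp/disambiguation` property. It uses plain string scanning instead of an HTML parser, which is an assumption made to keep the example dependency-free; `is_disambiguation_page` is a hypothetical name.

```rust
/// Returns true if the html contains a <meta ...> tag whose attributes mention
/// the disambiguation page property. Text outside of meta tags is ignored.
pub fn is_disambiguation_page(html: &str) -> bool {
    const MARKER: &str = "mw:PageProp/disambiguation";

    html.split('<')
        // Keep only fragments that begin a meta tag (case-insensitive).
        .filter(|fragment| fragment.trim_start().to_ascii_lowercase().starts_with("meta"))
        .any(|fragment| {
            // Only inspect the tag itself, not any text following the closing '>'.
            let tag = fragment.split('>').next().unwrap_or(fragment);
            tag.contains(MARKER)
        })
}
```

A real implementation would more likely query the parsed DOM; the string scan only illustrates which signal is being checked.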
Mikkel Denker
b3bcda2dfe simple script to dump article html from a zim file 2024-01-29 09:19:25 +01:00
Mikkel Denker
c91cf3d3d3 Modal with long domains moved “do you like results from ...” too far to the right 2024-01-28 15:01:24 +01:00
Mikkel Denker
a4a3a8d1ed don't show discussions widget if there is a user applied optic 2024-01-28 13:52:55 +01:00
Mikkel Denker
dca9f039df consistent discussions drop down arrow size 2024-01-28 13:42:38 +01:00
Mikkel Denker
cea9884395 make sure each result uses full width of result div 2024-01-28 13:31:49 +01:00
Mikkel Denker
6bb7311838 hopefully more robust way to detect disambiguation pages.
this is difficult to test as there are no disambiguation pages in the dev .zim file. let's just try on prod
2024-01-28 13:04:59 +01:00
Mikkel Denker
e008e70033 if there are no simple terms in the query, we should use prefix of description (or body if there is no description) as snippet 2024-01-27 21:23:09 +01:00
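A small sketch of the fallback described in the commit above: with no simple query terms to highlight, take a prefix of the description, or of the body when there is no description. `fallback_snippet` and `max_chars` are hypothetical, and treating a blank description as missing is an assumption.

```rust
/// Picks the description when present (and non-blank), otherwise the body,
/// and returns at most `max_chars` characters from its start.
pub fn fallback_snippet<'a>(
    description: Option<&'a str>,
    body: &'a str,
    max_chars: usize,
) -> &'a str {
    let source = match description {
        Some(d) if !d.trim().is_empty() => d,
        _ => body,
    };

    // Cut on a char boundary so multi-byte characters are never split.
    match source.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &source[..byte_idx],
        None => source,
    }
}
```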
Mikkel Denker
2205849cfe properly strip 'https://' and 'http://' prefix from explore 2024-01-27 17:14:20 +01:00
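For illustration, the prefix stripping mentioned above could look like this with the standard library's `str::strip_prefix`. The function name `strip_scheme` is hypothetical, and the actual fix may well live in frontend code rather than Rust.

```rust
/// Removes a leading "https://" or "http://" if present; other input is returned as-is.
pub fn strip_scheme(url: &str) -> &str {
    url.strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url)
}
```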
Mikkel Denker
88c8cbce41 get_webpage only looked at the response from the shard with the quickest response.
this obviously caused webpages to not be found seemingly at random, which caused summarizer to fail
2024-01-27 15:06:51 +01:00
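A sketch of the fix implied by the commit above: rather than trusting whichever shard answers first, look through all shard responses and take the first one that actually contains the page. `first_found` is a hypothetical helper; the real code presumably works on async responses from the search shards.

```rust
/// Each shard answers Some(page) if it holds the page and None otherwise.
/// Returns the first hit across all shards instead of the first response.
pub fn first_found<T, I>(responses: I) -> Option<T>
where
    I: IntoIterator<Item = Option<T>>,
{
    // flatten() skips the None answers, next() takes the first real hit.
    responses.into_iter().flatten().next()
}
```

With the old behaviour, a fast shard answering None would hide a page stored on a slower shard, which matches the seemingly random misses described in the message.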
Mikkel Denker
b4159d57ed deduplicate related entities by their images.
looks better
2024-01-27 14:28:30 +01:00
Mikkel Denker
6d3af18bac skip disambiguation pages when constructing entity index 2024-01-27 13:36:52 +01:00
Mikkel Denker
045d5d4b05 boost title field matches in entity index 2024-01-27 13:21:04 +01:00
Mikkel Denker
0bc8c2081e remove html files from github stats 2024-01-27 12:30:09 +01:00
Mikkel Denker
1a9f381d15 GGML Rust bindings (#122)
* move crates into a 'crates' folder

* added cargo-about to check dependency licenses

* create ggml-sys bindings and build as a static library.
simple addition sanity test passes

* update licenses

* yeet alice

* yeet qa model

* yeet fact model

* [wip] idiomatic rust bindings for ggml

* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?
2024-01-27 12:27:27 +01:00
Mikkel Denker
8f648601db Merge branch 'main' of github.com:StractOrg/stract 2024-01-24 11:09:33 +01:00
Mikkel Denker
788f92c8f4 split webgraph server into host and page.
allows us to host each graph on separate sets of servers.
2024-01-24 11:08:45 +01:00
dependabot[bot]
16bcd84a71 Bump shlex from 1.2.0 to 1.3.0 (#121)
Bumps [shlex](https://github.com/comex/rust-shlex) from 1.2.0 to 1.3.0.
- [Changelog](https://github.com/comex/rust-shlex/blob/master/CHANGELOG.md)
- [Commits](https://github.com/comex/rust-shlex/commits)

---
updated-dependencies:
- dependency-name: shlex
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-24 09:42:13 +01:00
Mikkel Denker
cc91935d0a Move entity index out of normal search index and have dedicated search server for it 2024-01-23 14:53:33 +01:00
Mikkel Denker
14e5466fb8 Move widget, discussions etc out from api searcher and into separate api endpoints.
This creates a better separation between the frontend and backend, and avoids e.g. 'fetch_discussions' abstraction leak.
It also allows for these endpoints to be used in different contexts separately without having to perform a full search.
2024-01-23 12:21:55 +01:00