SekoiaTree
2a62f9c28a
Make places that use rules use rule OR-ing ( #133 )
2024-02-08 18:51:12 +01:00
Mikkel Denker
aed64be27e
internet archive warc files do not seem to store the payload type. let's just assume it's html (records that can't be parsed are skipped anyway)
2024-02-08 16:14:24 +01:00
Mikkel Denker
91ffe15cfd
support internet archive warc files.
...
the ordering seems to be a bit different from that of the commoncrawl files. these changes should hopefully make the parser more robust overall.
2024-02-07 21:15:08 +01:00
Mikkel Denker
a4ce85a905
bump optics samples submodule
2024-02-07 13:28:57 +01:00
Mikkel Denker
be7bbd02fc
Forgot to add serde defaults to the new snippet config fields...
2024-02-06 19:38:45 +01:00
SekoiaTree
359117bfbe
Initial rule or-ing ( #127 )
2024-02-06 19:34:58 +01:00
Mikkel Denker
aa89813906
move some of the hardcoded snippet choices into the configuration file
2024-02-06 11:19:42 +01:00
Mikkel Denker
d9136d59fd
force ttl directly in scylla table
2024-02-06 11:15:22 +01:00
Mikkel Denker
fd33eb4b66
fixed bug where search suggestions could not be closed in safari
2024-02-05 15:27:05 +01:00
Mikkel Denker
f61f1f6b0f
fix bug where query suggestions couldn't be selected in safari
2024-02-05 14:57:26 +01:00
Mikkel Denker
bfbb6efeb0
update config file names
2024-02-05 12:04:20 +01:00
Mikkel Denker
e8489f4792
flatten fastfield reader from vec<vec<u64>> to a large vec<u64>
2024-02-04 15:14:49 +01:00
Mikkel Denker
625d6fc6b7
no need for enummap in fastfield reader as we know all fields statically
2024-02-04 15:05:14 +01:00
Mikkel Denker
eb7e96bf50
better cache locality for fast field reader.
...
instead of going from field -> doc -> value, we can go from doc -> field -> value and thereby reuse the doc -> field part for all fields in the document.
2024-02-04 14:51:50 +01:00
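The doc -> field -> value layout from eb7e96bf50, combined with the flattening into one large vector from e8489f4792, could look roughly like the sketch below. The field names, `NUM_FIELDS`, and the `FlatFastFields` type are hypothetical; the log does not show the real reader.

```rust
/// Hypothetical field set; the real fast fields are not listed in the log.
#[derive(Clone, Copy)]
enum Field {
    HostCentrality = 0,
    PageCentrality = 1,
    LastUpdated = 2,
}

/// Number of fields is known statically, so no map is needed.
const NUM_FIELDS: usize = 3;

/// All values live in one flat vector, laid out doc-major:
/// [doc0.f0, doc0.f1, doc0.f2, doc1.f0, ...]
struct FlatFastFields {
    values: Vec<u64>,
}

impl FlatFastFields {
    /// Build the flat vector from one fixed-size row per document.
    fn from_rows(rows: &[[u64; NUM_FIELDS]]) -> Self {
        let mut values = Vec::with_capacity(rows.len() * NUM_FIELDS);
        for row in rows {
            values.extend_from_slice(row);
        }
        Self { values }
    }

    /// The contiguous slice holding every field of one document.
    /// Returns None if the document id is out of range.
    fn doc(&self, doc_id: usize) -> Option<&[u64]> {
        let start = doc_id.checked_mul(NUM_FIELDS)?;
        let end = start.checked_add(NUM_FIELDS)?;
        self.values.get(start..end)
    }

    /// Look up a single field; the doc lookup is shared by all fields.
    fn get(&self, doc_id: usize, field: Field) -> Option<u64> {
        self.doc(doc_id).map(|row| row[field as usize])
    }
}
```

Because all fields of a document sit next to each other, reading several signals for the same document touches one small region of memory instead of one region per field.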
Mikkel Denker
b88e7fd013
search example reduce words considered for snippet generation
2024-02-04 13:30:28 +01:00
Mikkel Denker
5164789a32
move search bench into an example so we can profile it more easily with perf
2024-02-04 13:28:09 +01:00
Mikkel Denker
099282cefa
don't need to send all ranking signals to frontend for discussions widget. sending final score is enough
2024-02-03 16:48:43 +01:00
Mikkel Denker
7919df0863
'optimize_for_search' actually seemed to make the searches slower as too many threads would fight for io access at once
2024-02-03 14:36:19 +01:00
Mikkel Denker
10043e7db6
reduce fastfield indirections
2024-02-03 14:25:17 +01:00
Mikkel Denker
c46a85a97e
load all fastfields into memory.
...
this is an experiment to see how it affects performance vs memory usage
2024-02-02 22:06:54 +01:00
Mikkel Denker
45d8245374
keep fastfield reader open across searches
...
52% of time seems to be spent on opening the fastfields
2024-02-02 15:10:41 +01:00
Mikkel Denker
9bcaf054c3
bench based on queries from autosuggest.
...
this will hopefully show us where the hot paths are when caches aren't hit
2024-02-02 14:50:54 +01:00
Mikkel Denker
e4e3044e47
finally ditch that pesky libtorch dependency!
2024-02-02 13:11:06 +01:00
Mikkel Denker
d7e564d91a
move neural network models from torch to candle
2024-02-02 12:36:39 +01:00
Mikkel Denker
c7625a042e
stash ggml
...
just found the 'candle' library from huggingface written in pure rust. this would be a much better library for us to use as they already implement all the quantization features, f16 etc. and we can be sure it's safe. unfortunately, it renders all the work i did over the past couple of days on ggml bindings useless... stashing it here before i delete it in case we need it in the future.
2024-02-01 14:13:31 +01:00
Mikkel Denker
ea3b7a4099
implement some layers in ggml
...
linear, embedding and multihead attention
2024-01-31 17:51:02 +01:00
Crispy
ecdc4f89cf
fix outdated link in settings/privacy ( #123 )
2024-01-30 13:33:28 +01:00
Mikkel Denker
ca62a0c5ee
entity index skip portal pages
2024-01-29 19:48:26 +01:00
Mikkel Denker
e8732d7877
annotate summary box with aria-busy to indicate that screen readers might want to wait until summary is done generating
2024-01-29 15:46:11 +01:00
Mikkel Denker
c34849cae9
move spellcheck into separate api endpoint and only correct non-special terms.
...
we don't want to correct "site:...", "inurl:..." etc. parts of the query.
2024-01-29 14:07:34 +01:00
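The idea in c34849cae9 of correcting only non-special terms can be sketched as follows. This is a minimal illustration, not the project's code: `SPECIAL_PREFIXES`, `is_special`, and `correct_query` are hypothetical names, the prefix list contains only the two prefixes named in the commit message, and the actual spell checker is passed in as a closure.

```rust
/// Prefixes treated as "special". Only `site:` and `inurl:` are named in the
/// commit message; the full list used by the project is unknown.
const SPECIAL_PREFIXES: [&str; 2] = ["site:", "inurl:"];

/// True if the term is a query operator rather than plain text.
fn is_special(term: &str) -> bool {
    SPECIAL_PREFIXES.iter().any(|prefix| term.starts_with(prefix))
}

/// Apply `correct` to each whitespace-separated term, leaving special
/// terms untouched. `correct` returns None when it has no suggestion.
fn correct_query(query: &str, correct: impl Fn(&str) -> Option<String>) -> String {
    query
        .split_whitespace()
        .map(|term| {
            if is_special(term) {
                term.to_string()
            } else {
                correct(term).unwrap_or_else(|| term.to_string())
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Splitting on whitespace is a simplification; a real query parser would also have to handle quoted phrases.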
Mikkel Denker
f4e7d1972c
actually skip disambiguation pages for entity index.
...
turns out that none of the usual disambiguation elements from the online wiki are present in the .zim dump. instead, disambiguation pages seem to have a "<meta property='mw:PageProp/disambiguation'>" element which we can use.
the commit also includes a script to dump the html for a specific article from a zim file, which is very useful when debugging this stuff
2024-01-29 09:59:39 +01:00
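A rough sketch of the check described in f4e7d1972c, assuming a plain substring test on the article html is good enough. The function name is hypothetical, and the real implementation may well use an html parser instead.

```rust
/// Returns true if the html contains the disambiguation marker that the
/// commit message reports for .zim dumps. Both quote styles are accepted
/// because the attribute quoting in the dump is an assumption here.
fn is_disambiguation(html: &str) -> bool {
    html.contains("property='mw:PageProp/disambiguation'")
        || html.contains("property=\"mw:PageProp/disambiguation\"")
}
```

A substring test can in principle match the marker inside article text; parsing the `<head>` and inspecting `meta` elements would avoid that.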
Mikkel Denker
b3bcda2dfe
simple script to dump article html from a zim file
2024-01-29 09:19:25 +01:00
Mikkel Denker
c91cf3d3d3
Modal with long domains moved “do you like results from ...” too far to the right
2024-01-28 15:01:24 +01:00
Mikkel Denker
a4a3a8d1ed
don't show discussions widget if there is a user applied optic
2024-01-28 13:52:55 +01:00
Mikkel Denker
dca9f039df
consistent discussions drop down arrow size
2024-01-28 13:42:38 +01:00
Mikkel Denker
cea9884395
make sure each result uses full width of result div
2024-01-28 13:31:49 +01:00
Mikkel Denker
6bb7311838
hopefully more robust way to detect disambiguation pages.
...
this is difficult to test as there are no disambiguation pages in the dev .zim file. let's just try on prod
2024-01-28 13:04:59 +01:00
Mikkel Denker
e008e70033
if there are no simple terms in the query, we should use prefix of description (or body if there is no description) as snippet
2024-01-27 21:23:09 +01:00
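The fallback from e008e70033 could be sketched like this. The function name and the `max_chars` limit are hypothetical, and treating a blank description as missing is an assumption that the commit message does not state.

```rust
/// Pick a snippet when the query has no simple terms to highlight:
/// a prefix of the description, or of the body if there is no description.
fn fallback_snippet(description: Option<&str>, body: &str, max_chars: usize) -> String {
    // Assumption: a description that is only whitespace counts as missing.
    let source = description
        .filter(|d| !d.trim().is_empty())
        .unwrap_or(body);

    // Take characters rather than bytes so multi-byte text is never split.
    source.chars().take(max_chars).collect()
}
```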
Mikkel Denker
2205849cfe
properly strip 'https://' and 'http://' prefix from explore
2024-01-27 17:14:20 +01:00
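One way to strip either prefix, as 2205849cfe describes, is `str::strip_prefix` from the standard library. The helper name `strip_scheme` is hypothetical and the frontend code that was actually fixed is not shown in the log.

```rust
/// Remove a leading "https://" or "http://" if present; otherwise return
/// the input unchanged.
fn strip_scheme(url: &str) -> &str {
    url.strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url)
}
```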
Mikkel Denker
88c8cbce41
get_webpage only looked at the response from the shard with the quickest response.
...
this obviously caused webpages to not be found seemingly at random, which caused summarizer to fail
2024-01-27 15:06:51 +01:00
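The bug described in 88c8cbce41 can be illustrated with a simplified, synchronous sketch. It assumes each shard answers with `Some(page)` or `None` and that the slice is ordered by arrival time; `Webpage`, `first_response`, and `first_found` are hypothetical names, and the real code presumably works on asynchronous responses.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Webpage {
    url: String,
}

/// Buggy variant: trusts whichever shard answered first, even if that
/// shard simply does not hold the page.
fn first_response(responses: &[Option<Webpage>]) -> Option<Webpage> {
    responses.first().cloned().flatten()
}

/// Fixed variant: considers every shard's answer and returns the first
/// one that actually contains the page.
fn first_found(responses: &[Option<Webpage>]) -> Option<Webpage> {
    responses.iter().flatten().next().cloned()
}
```

With the buggy variant the result depends on which shard happens to be quickest, which matches the seemingly random failures the commit message mentions.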
Mikkel Denker
b4159d57ed
deduplicate related entities by their images.
...
looks better
2024-01-27 14:28:30 +01:00
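Deduplicating by image, as in b4159d57ed, might look like the sketch below. The `Entity` type is hypothetical, and keeping entities that have no image at all is an assumption; the commit message does not say how they are handled.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
struct Entity {
    title: String,
    image: Option<String>,
}

/// Keep the first entity for each distinct image, preserving order.
fn dedup_by_image(entities: Vec<Entity>) -> Vec<Entity> {
    let mut seen = HashSet::new();
    entities
        .into_iter()
        .filter(|entity| match &entity.image {
            // `insert` returns false when the image was already seen.
            Some(image) => seen.insert(image.clone()),
            // Assumption: entities without an image are always kept.
            None => true,
        })
        .collect()
}
```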
Mikkel Denker
6d3af18bac
skip disambiguation pages when constructing entity index
2024-01-27 13:36:52 +01:00
Mikkel Denker
045d5d4b05
boost title field matches in entity index
2024-01-27 13:21:04 +01:00
Mikkel Denker
0bc8c2081e
remove html files from github stats
2024-01-27 12:30:09 +01:00
Mikkel Denker
1a9f381d15
GGML Rust bindings ( #122 )
...
* move crates into a 'crates' folder
* added cargo-about to check dependency licenses
* create ggml-sys bindings and build as a static library.
simple addition sanity test passes
* update licenses
* yeet alice
* yeet qa model
* yeet fact model
* [wip] idiomatic rust bindings for ggml
* [ggml] mul, add and sub ops implemented for tensors.
i think it would be easier to try and implement a bert model in order to figure out which ops we should include in the binding. for instance, is view and concat needed?
2024-01-27 12:27:27 +01:00
Mikkel Denker
8f648601db
Merge branch 'main' of github.com:StractOrg/stract
2024-01-24 11:09:33 +01:00
Mikkel Denker
788f92c8f4
split webgraph server into host and page.
...
allows us to host each graph on separate sets of servers.
2024-01-24 11:08:45 +01:00
dependabot[bot]
16bcd84a71
Bump shlex from 1.2.0 to 1.3.0 ( #121 )
...
Bumps [shlex](https://github.com/comex/rust-shlex ) from 1.2.0 to 1.3.0.
- [Changelog](https://github.com/comex/rust-shlex/blob/master/CHANGELOG.md )
- [Commits](https://github.com/comex/rust-shlex/commits )
---
updated-dependencies:
- dependency-name: shlex
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-24 09:42:13 +01:00
Mikkel Denker
cc91935d0a
Move entity index out of normal search index and have dedicated search server for it
2024-01-23 14:53:33 +01:00
Mikkel Denker
14e5466fb8
Move widget, discussions etc out from api searcher and into separate api endpoints.
...
This creates a better separation between the frontend and backend, and avoids e.g. 'fetch_discussions' abstraction leak.
It also allows for these endpoints to be used in different contexts separately without having to perform a full search.
2024-01-23 12:21:55 +01:00