0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	46d385e696	empty region file should not crash region count	2022-10-06 10:20:07 +02:00
Mikkel Denker	002eb7f79c	index warc files as they become available	2022-10-05 17:21:59 +02:00
Mikkel Denker	8d1fdcc6fb	Use title hash, site hash and url hash in collector	2022-10-05 15:23:29 +02:00
Mikkel Denker	b8d6afc753	merge subdomain counts	2022-10-05 12:48:48 +02:00
Mikkel Denker	7a65ebbe00	fastfield cache	2022-10-05 12:02:20 +02:00
Mikkel Denker	473c2325e0	host centrality threshold during indexing	2022-10-04 15:32:54 +02:00
Mikkel Denker	a6ea2e7ba0	count number of subdomains during indexing	2022-10-04 15:03:57 +02:00
Mikkel Denker	677f3a40b1	Ftr/site ranking adjustment (#61) * gui * forgot to implement bang in search_server * added opensearch.xml file * reduce indexer memory by committing more often * fix query parser bug with special characters * change autosuggest browser url * remove '#' from url during indexing * pre-hashed domain field * we will also have a subscription model without ads * site query should not match partial domains from tokenization * ttl cache * send site rankings from js to backend during search * apply site rankings during search	2022-10-04 14:16:02 +02:00
Mikkel Denker	2064cc4ad5	refactor searcher to have separate paths for api and html	2022-09-30 15:31:08 +02:00
Mikkel Denker	a4ecf5c307	merge	2022-09-28 16:05:10 +02:00
Mikkel Denker	3cc7c84a32	Ftr/distributed search (#59) * refactor network communication into separate module and made mapreduce async again * sonic module is simple enough as is * rename Searcher -> LocalSearcher * [WIP] distributed searcher structure outlined * split index search into initial and retrieval steps * distributed searcher searching shards * make bucket in collector generic * no more todo!s. Waiting for indexing to finish to test implementation * distributed searcher seems to work. Needs an enormous refactor - the code is really ugly * cleanup search-server on exit in justfile	2022-09-28 15:50:45 +02:00
dependabot[bot]	0c9cda6fab	Bump axum-core from 0.2.6 to 0.2.8 (#58) Bumps [axum-core](https://github.com/tokio-rs/axum) from 0.2.6 to 0.2.8. - [Release notes](https://github.com/tokio-rs/axum/releases) - [Changelog](https://github.com/tokio-rs/axum/blob/main/CHANGELOG.md) - [Commits](https://github.com/tokio-rs/axum/compare/axum-core-v0.2.6...axum-core-v0.2.8) --- updated-dependencies: - dependency-name: axum-core dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-09-28 15:50:18 +02:00
Mikkel Denker	e22d4a037c	tweak ranking values in test case	2022-09-27 10:23:21 +02:00
Mikkel Denker	739608bed8	ranking tweaks and fixed dockerfile	2022-09-27 10:21:12 +02:00
Mikkel Denker	f70af22c35	if updated time is in the future, return None	2022-09-27 10:18:52 +02:00
Mikkel Denker	f4446dcd11	add additional fields to the schema that might be used in the future to reduce 'duplicates feeling' in search results	2022-09-27 10:04:22 +02:00
Mikkel Denker	963bd3aef7	forgot to implement is_computable_before_search function	2022-09-22 15:06:16 +02:00
Mikkel Denker	b902a3b7aa	Ftr/sort postings by score (#57) * sort postings by a pre-computed field * Limit the number of webpages considered for ranking during search. At the moment we rank top 10 million pages. This can be done since we have sorted the postinglists based on all pre-computable signals. The chance of "the best" webpage being at > 10 mil is extremely low, and this will allow us to speed up search significantly.	2022-09-22 13:11:54 +02:00
Mikkel Denker	5b4a7cc366	added 'try on {different engine}' buttons	2022-09-22 10:57:33 +02:00
Mikkel Denker	e649053260	update setup steps	2022-09-22 09:49:50 +02:00
Mikkel Denker	0cbca3363c	Remove git lfs for the time being. The data folder was starting to get too large, which led to some difficulties with LFS timing out (I think) when uploading the tar file. We will need a better strategy for this in the future, but for the time being it is disabled	2022-09-21 10:05:01 +02:00
Mikkel Denker	9f12cca77e	tweak ranking	2022-09-21 09:29:53 +02:00
Mikkel Denker	78b238d2bb	Score goggles as a const (while taking boost into account). Also added a goggles scale to prioritize goggles more	2022-09-20 19:28:22 +02:00
Mikkel Denker	02c799c51b	don't spellcheck non-alphabetic words	2022-09-20 18:16:05 +02:00
Mikkel Denker	2f40f7fc9a	reduce indexer memory by closing indexes	2022-09-20 15:32:39 +02:00
Mikkel Denker	164673761e	added url_without_query_hash field so we can reduce similar url search results even further if we want	2022-09-20 14:27:20 +02:00
Mikkel Denker	0da016f678	added domain_name_if_hompage field for better navigation queries	2022-09-20 14:14:19 +02:00
Mikkel Denker	574321f3d8	fixed very nasty bug where collector would not update the segment_id, leading to completely wrong search results	2022-09-20 13:15:45 +02:00
Mikkel Denker	bea5e8e60e	re-enable multi threaded search	2022-09-18 21:13:37 +02:00
Mikkel Denker	15fe347dc5	goggle performance speedup	2022-09-18 20:21:19 +02:00
Mikkel Denker	fd3e99ca50	embed all images directly in search page using base64	2022-09-16 13:28:38 +02:00
Mikkel Denker	b9e1b1352a	Dockerfile	2022-09-15 13:34:47 +02:00
Mikkel Denker	7176744102	tighten num buckets bound	2022-09-14 17:14:51 +02:00
Mikkel Denker	fb8cfe1f30	Ftr/de similar results (#56) * pre-hash site and penalize many results from same site * re-adjust deduplication penalty * fix performance by limiting number of buckets * fixed page offsets	2022-09-14 16:16:13 +02:00
Mikkel Denker	8c9ffede30	Ftr/page centrality (#55) * move signal from goggles into ranking module * refactor webpage test-constructor * add page_centrality field * use page centrality during ranking * small justfile refactoring * update index in lfs	2022-09-13 11:49:50 +02:00
Mikkel Denker	65cf8f9f53	Control the number of segments after merge. (#54) Before, we merged all segments into a single segment. This has the benefit of reducing the disk IO during search, but has the (quite huge) problem that multi-threaded search now becomes non-trivial. This PR implements a way for us to control how many segments the indexer should produce in the end. We would ideally like to target roughly the same number of segments as there are threads on the search server, such that each thread gets one segment.	2022-09-13 08:50:22 +02:00
Mikkel Denker	d81fe0cc58	The query 'linus torvalds' now produces the correct snippet. (#52) The issue came from the fact that we didn't grab the term-frequencies from the correct field, thus it looked like the term "torvalds" had 0 documents since all documents we were looking at had stemmed the term (probably to "torvald" or something).	2022-09-12 18:22:32 +02:00
Mikkel Denker	06e5a73348	simplify term proximity by not using ngrams, only slop	2022-09-12 14:36:21 +02:00
Oliver Bøving	0b0d453e6f	Refactor search page further	2022-09-11 17:08:41 +02:00
Mikkel Denker	8446c38c6d	Ftr/term proximity ranking (#49) * term proximity ranking seems to work * refactor term-proximity queries into separate function	2022-09-11 16:13:54 +02:00
Oliver Bøving	4204802b28	Make the main header icon smaller, and improve more responsiveness	2022-09-11 16:01:27 +02:00
Oliver Bøving	30af194bf0	Make search page more responsive	2022-09-11 15:00:20 +02:00
Oliver Bøving	5176802207	Move around some prettier options	2022-09-11 14:20:07 +02:00
Mikkel Denker	7bf0dc844f	Bug/retry image download (#48) * retry image-download if fail with exponential backoff * fixed warc-file code header	2022-09-11 13:53:13 +02:00
Mikkel Denker	51834be843	center entity again	2022-09-11 13:27:20 +02:00
Mikkel Denker	1c57e94687	merge main into branch	2022-09-11 13:27:02 +02:00
Oliver Bøving	0009b79ffd	Use faker data in askama templates during dev (#46) * Use faker data in askama templates during dev By adding a `$ {{lorem.limes}}` for example to askama`...` expressions, fake data is inserted during development. Additionally all askama specific text (such as {% for x in xs %}) is not produced during development for cleaner pages. * Add astro-icons/heroicons and cleanup some styling (#45) * Add astro-icons/heroicons and cleanup some styling * Make the header a component	2022-09-11 13:17:11 +02:00
Oliver Bøving	9012236ce6	Add mobile menu and more responsive tweaks	2022-09-11 13:07:03 +02:00
Oliver Bøving	f988616b8f	Use faker data in askama templates during dev By adding a `$ {{lorem.limes}}` for example to askama`...` expressions, fake data is inserted during development. Additionally all askama specific text (such as {% for x in xs %}) is not produced during development for cleaner pages.	2022-09-11 13:07:03 +02:00
Oliver Bøving	5223c57021	Add astro-icons/heroicons and cleanup some styling (#45) * Add astro-icons/heroicons and cleanup some styling * Make the header a component	2022-09-11 13:00:53 +02:00

1308 commits