0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	f494a11a1a	Accessibility overhaul (#231 ) * high contrast theme * improve '/explore' error messages when site is invalid * change aria-expanded when search suggestions are displayed * wrap search suggestions in <ul> and <li> items to ensure screen reader knows how many suggestions there are * move focus to modal when it opens. trap focus until modal is closed again * add aria-expanded to each result that is true iff. the modal is expanded for that result * entire navigation bar inside <nav> element * add skip link to navbar to jump to main content of page * improve focus indicators for selected /explore sites * more descriptive titles for explore page interactive elements * group settings in fieldsets and use title+description as legend * add language to setting input fields to ensure required fields error is read in correct language on screenreaders * add headings to serp * add title to hamburger menu on mobile * fix firefox accessibility errors * only show button outline on tab focus	2024-10-09 11:12:09 +02:00
Mikkel Denker	8f6b23734a	[live index] decrease compaction interval	2024-10-07 12:03:25 +02:00
Mikkel Denker	3a127af0a7	improve compaction performance in live index by performing initial segment merge on a read lock and only switch to the new segments on a write lock. this ensures that search requests can still be performed while the heavy part of merge is executing	2024-10-07 11:59:36 +02:00
Mikkel Denker	847c14251a	bump edge limit for ingoing nodes to 128 in SimilarHostsFinder now that nofollow links are filtered, we need to fetch more edges from the webgraph to have good enough accuracy on /explore page	2024-10-05 15:12:07 +02:00
Mikkel Denker	db2c6c3eb9	increase compaction interval merging the segments right now takes a write lock for the entirety of the operation. this causes all searches to time out whenever the live index compacts its segments. we should actually be able to split up the merge operation to create the merged segment on a read lock and only take a write lock when switching and cleaning the old segments for the new one. increasing the compaction interval is only a temporary fix	2024-10-04 13:15:20 +02:00
Mikkel Denker	87fbd3d709	index compaction shouldn’t update creation date as this would cause all segments to eventually have same creation date	2024-10-04 11:25:12 +02:00
Mikkel Denker	2492ebdd41	make sure live index node doesn't try to replicate to itself	2024-10-03 17:07:15 +02:00
Mikkel Denker	e16e353249	increase coefficient for update_timestamp to show more recent articles from live index	2024-10-03 14:04:26 +02:00
Mikkel Denker	c7ea2fd8ff	add smoothing factor to timestamp to give it a half life of 3 days	2024-10-03 13:09:31 +02:00
Mikkel Denker	bc45409887	re-open live index after compaction to reflect segment changes	2024-10-03 11:40:39 +02:00
Mikkel Denker	5a1a5a3225	remove minimum budget of 1 per day in live index	2024-10-03 10:19:44 +02:00
Mikkel Denker	3795dbb64b	sonic propagate err	2024-10-02 16:45:55 +02:00
Mikkel Denker	4915160449	harmonic centrality nearest neighbor calculation that uses the harmonic centrality of the highest neighbors node as a seed node proxy for the centrality of that node (with a discount factor)	2024-10-02 15:57:30 +02:00
Mikkel Denker	c3a1a82b66	prevent division by zero	2024-10-02 15:14:36 +02:00
Mikkel Denker	240009db93	optionally skip db init	2024-10-02 15:11:49 +02:00
Mikkel Denker	9fc02932d9	refactor GetSiteUrls into a response stream for reusability	2024-10-02 12:59:44 +02:00
Mikkel Denker	69ad95e4e8	add first h1, all h2 and all h3 tags as fields in index might be useful for ranking later	2024-10-02 10:31:48 +02:00
Mikkel Denker	36fab02c9d	implement '<base>' tags for relative urls	2024-10-02 09:56:55 +02:00
Mikkel Denker	089f609e70	add svelte.dev to devdocs	2024-10-02 09:35:35 +02:00
Mikkel Denker	f6bca073db	enforce edgelimit to only be applied in webgraph mod.rs so offset is only applied once	2024-10-02 09:22:09 +02:00
Mikkel Denker	9414c4277a	implement EdgeLimit::LimitAndOffset in webgraph. to make sure the edges are returned in sorted order, we implement a new FlatSortedBy struct that takes a vec of iters and outputs the sorted order of the iter items. it assumes that each iter is already sorted so it doesn't have to traverse the entire result set	2024-10-01 15:28:27 +02:00
Mikkel Denker	d53313388e	make sure no-follow links aren't used in /explore and to score host similarity	2024-10-01 13:53:27 +02:00
Mikkel Denker	b160436297	upper bounds for in degree and out degree in webgraph	2024-10-01 13:27:02 +02:00
Mikkel Denker	c4ea75ad2a	only consider simhash duplicates if we don't have enough non-duplicates	2024-10-01 12:04:27 +02:00
Mikkel Denker	c5a465b7d6	group urls by domain centrality in planner and process groups by highest centrality first	2024-10-01 10:37:23 +02:00
Mikkel Denker	945c358989	increase minimum delay between requests to 10 sec	2024-10-01 10:10:19 +02:00
Mikkel Denker	70e9e8181d	make api types public	2024-10-01 10:02:22 +02:00
Mikkel Denker	4e8c165a1c	cleanup temporary directories automatically in tests (#228)	2024-10-01 09:42:14 +02:00
Mikkel Denker	950862be9c	Re-open live index after it has been downloaded from replica (#227 ) * re-open index after it has been downloaded from replica * remove writer directory lock * update meta file with segment changes * flatten live index directory structure a bit for better overview * additional live index tests	2024-10-01 09:19:08 +02:00
Mikkel Denker	1b1677dd15	clear wal after inserts	2024-09-27 17:45:34 +02:00
Mikkel Denker	45f6cae5e7	cannot download index from self replica if a node restarted quickly after it shut down, it might see its old ghost in the cluster and try to download the index from itself, which would of course fail	2024-09-27 17:45:11 +02:00
Mikkel Denker	0114e8d9eb	normalize urls before they are crawled for better duplicates detection	2024-09-26 15:33:07 +02:00
Mikkel Denker	d94b46892e	include live index results when returning search count	2024-09-26 15:14:26 +02:00
Mikkel Denker	d61986c255	normalize urls	2024-09-26 15:14:03 +02:00
Mikkel Denker	8ebdd37653	derive default for some live index config fields	2024-09-26 15:13:40 +02:00
Mikkel Denker	7814bf7f1a	skip awaiting webgraph nodes in dev environments as they may not exist	2024-09-26 09:02:56 +02:00
Mikkel Denker	f6f39b77ca	explicitly wait for graph nodes to come online	2024-09-25 16:43:13 +02:00
Mikkel Denker	5cd174e415	Live index crawler (#226) * find links to feeds from pages and add them to site statistics (statistics is probably an incorrect name now) * robust url parse * shard download_db by the first two characters of the md5 hash of the host * count number of occurrences for each feed in site stats so we can filter '/comment' etc * filter feeds that only have a count of 1 from site stats * webgraph granularity as trait to ensure each struct gets a connection with the granularity they expect * remove cluster_id as it wasn't used for anything in practice * assign budgets to each site based on their category, host centrality and total daily budgets * entrypoint for live index crawler * implement feeds checker * sitemap checker * frontpage checker * remove unused imports + fix clippy warnings * remove redundant closure * fix deadlock when starting crawl of a site. the drip thread already had the lock so creating the guard would halt indefinitely * also seed crawled_db with urls from live index	2024-09-25 15:07:16 +02:00
Mikkel Denker	de3239716f	explicitly mark as unreachable	2024-09-18 12:26:18 +02:00
Mikkel Denker	55b39555aa	npm update	2024-09-18 12:19:14 +02:00
Mikkel Denker	ee0fc39eaa	rustup update and fix clippy	2024-09-18 11:18:55 +02:00
Mikkel Denker	bcc0819871	Live index replication (#224 ) * [WIP] join live-index node in setup mode before searches can be performed the node needs to check if there are other replicas already online and download the index from one of them. inserts should also still be inserted into setup node's wal * once a server receives an indexing request with a consistency threshold, replicate that request to the other replicas (servers with same shard as the server that received the request). only if the consistency threshold is met should the original request succeed * write to temp wal during setup * remote copy of files/directories to download the index from an existing replica * download index from existing replicas during setup * clippy fix * cleanup large test files in remote copy tests * [wip] live index tests * tests	2024-09-17 14:31:20 +02:00
Mikkel Denker	21d28ea86d	remove 'package-lock.json' from gitignore	2024-09-12 12:54:15 +02:00
Mikkel Denker	7f5a4e846d	boost and exclude domains in crawl plan	2024-09-10 13:41:06 +02:00
Mikkel Denker	bd8f69707b	disable http redirect following in reqwest and handle it directly in crawler instead. this ensures we have the proper delay between requests	2024-09-10 11:36:32 +02:00
Mikkel Denker	21ba4cde65	Live index without replication (#221 ) * [WIP] live index code structure with a ton of todos * update meta file with segment changes * add endpoint to index webpages into live index * compact segments by date * cleanup old segments * fix clippy warnings * fix clippy warnings	2024-09-10 11:03:52 +02:00
Mikkel Denker	38f33c80b5	use words tokenizer at query time for all *NoTokenizer fields to ensure that e.g. the query 'example query site.com' matches site.com against the field instead of the entire query	2024-09-09 10:42:56 +02:00
Mikkel Denker	b3aa2a80f6	weekends ranking experiments to improve top-300 recall. improves recall from ~71% -> ~83%. not ready for production yet as the top ranked urls are way off	2024-09-09 09:54:01 +02:00
Mikkel Denker	ace0a81fef	Compute some site level statistics (number of pages, number of news articles and number of blog posts). This will be used to determine which sites to crawl in the live index (#220)	2024-09-07 15:10:15 +02:00
Mikkel Denker	8d0ad573a7	start politeness factor at 2, decrease iff. we don't receive any 429 responses. also increase max delay to 180 seconds	2024-09-06 13:41:22 +02:00
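
Commit 9414c4277a describes a FlatSortedBy struct that takes a vec of already-sorted iterators and yields their items in sorted order without traversing the whole result set. A minimal sketch of that idea follows, as a lazy k-way merge over a binary heap. It is not the code from the repository: the name `MergeSorted`, the `Ord` bound (instead of a sort-by key) and the heap-based approach are all assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sketch (not the repository's FlatSortedBy): lazily merges
/// iterators that are each already sorted ascending into one ascending stream.
pub struct MergeSorted<I: Iterator>
where
    I::Item: Ord,
{
    iters: Vec<I>,
    // Min-heap holding the next pending item of each iterator,
    // paired with the index of the iterator it came from.
    heap: BinaryHeap<Reverse<(I::Item, usize)>>,
}

impl<I: Iterator> MergeSorted<I>
where
    I::Item: Ord,
{
    pub fn new(mut iters: Vec<I>) -> Self {
        let mut heap = BinaryHeap::with_capacity(iters.len());
        // Seed the heap with the first item of every non-empty iterator.
        for (idx, it) in iters.iter_mut().enumerate() {
            if let Some(item) = it.next() {
                heap.push(Reverse((item, idx)));
            }
        }
        Self { iters, heap }
    }
}

impl<I: Iterator> Iterator for MergeSorted<I>
where
    I::Item: Ord,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Pop the smallest pending item, then refill from the same source.
        let Reverse((item, idx)) = self.heap.pop()?;
        if let Some(next) = self.iters[idx].next() {
            self.heap.push(Reverse((next, idx)));
        }
        Some(item)
    }
}
```

Because the merge is lazy, a limit and offset can be expressed as `MergeSorted::new(iters).skip(offset).take(limit)`, which pulls only `offset + limit` items from the merged stream plus at most one look-ahead item per source iterator.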

413 commits