0ct0pu5/search-engine-stract

Author	SHA1	Message	Date
Mikkel Denker	21465a706a	add rocksdb indexes to lfs (maybe)	2022-08-17 09:43:26 +02:00
Mikkel Denker	c3e51d23e8	remove some files from git lfs	2022-08-17 09:36:14 +02:00
Mikkel Denker	d8a0de3fa0	remove logfiles from data dir	2022-08-16 18:16:48 +02:00
Mikkel Denker	d2580902aa	add data folder to git lfs	2022-08-16 17:30:55 +02:00
Mikkel Denker	2e11771b62	maybe fix lfs	2022-08-16 16:34:44 +02:00
Mikkel Denker	91f501ac1e	data folder in lfs	2022-08-16 16:33:46 +02:00
Mikkel Denker	f7e88fab9d	data folder in lfs	2022-08-16 16:31:38 +02:00
Mikkel Denker	c87f21e018	include Cargo.lock	2022-08-16 13:55:59 +02:00
Mikkel Denker	d1ece2af46	Ftr/entity sidebar (#11) * show basic (and ugly) entity information during search * refactor image downloads into an image downloader * download entity images * show entity image and link to wiki article * show related entities * show links in entity text and infobox * don't match stopwords in entity titles during search * remove stopwords from entity query * fixed bug where some wikipedia images weren't downloaded * prettify the primary image a bit	2022-08-16 13:52:31 +02:00
Mikkel Denker	8d5d375f0b	clippy fix	2022-08-12 14:39:15 +02:00
Mikkel Denker	082957f732	download sample warc file if not present on local run	2022-08-12 14:04:11 +02:00
Mikkel Denker	ab10ac851e	empty data folder in git	2022-08-12 13:56:32 +02:00
Mikkel Denker	3861386da6	take freshness into account during ranking	2022-08-10 09:52:06 +02:00
Mikkel Denker	b07c67559a	Refactor inverted index to its own struct. Extra information such as favicon and other images is now separated from the inverted index and everything is stored in the index. This also allows us to more easily keep a list of all pending image download jobs and execute these asynchronously (which we now do).	2022-08-10 08:51:51 +02:00
Mikkel Denker	7f42601f46	Index (almost) all text on page. Only use clean text for snippets unless absolutely necessary.	2022-08-09 15:27:24 +02:00
Mikkel Denker	7e31e98bb7	Fixed bug that caused some queries to panic. Since we didn't check if the positions[i] vector had elements, some searches caused the weights and positions vectors to have a different number of elements, thereby crashing on .zip_eq	2022-08-09 14:54:53 +02:00
Mikkel Denker	499488d454	if no snippet found, use webpage description if present	2022-08-09 14:47:17 +02:00
Mikkel Denker	4f55c72dd2	show update time in search result	2022-08-09 14:28:53 +02:00
Mikkel Denker	549d3fb074	show primary image in search results	2022-08-09 13:00:27 +02:00
Mikkel Denker	2fe0c1b5dc	specialized metadata functions	2022-08-09 09:51:43 +02:00
Mikkel Denker	82e7be6d20	specialized updated_time function	2022-08-08 16:51:43 +02:00
Mikkel Denker	975f229431	parse schema org json-ld data	2022-08-08 16:03:28 +02:00
Mikkel Denker	70163f970c	show favicons in search result	2022-08-08 14:48:06 +02:00
Mikkel Denker	003b54b820	image store with filters	2022-08-08 10:24:17 +02:00
Mikkel Denker	d19597beba	refactor kv trait out from webgraph	2022-08-08 10:24:08 +02:00
Mikkel Denker	0ed6bc236d	Stemmed terms now show up in the generated snippets. We use the language of the webpage to determine which stemmer to use.	2022-08-05 12:52:47 +02:00
Mikkel Denker	87f5695913	fixed broken README link to license file	2022-08-04 13:32:48 +02:00
Mikkel Denker	c3c1cb7f00	use fetch time during ranking	2022-08-04 12:51:47 +02:00
Mikkel Denker	faddb613aa	minor refactoring of url functions into a shared struct	2022-08-04 12:30:40 +02:00
Mikkel Denker	ddff7aebe6	show protocol in search url	2022-08-03 18:59:59 +02:00
Mikkel Denker	822213c10e	term proximity ranking	2022-08-03 18:49:11 +02:00
Mikkel Denker	976cc70d34	spans for term proximity ranking	2022-08-03 17:01:32 +02:00
Mikkel Denker	e6c88d64e1	Match webpage terms across fields. If a webpage matches TermA in url and TermB in body it should still be a correct match even though no single field has both TermB and TermA.	2022-07-22 11:08:37 +02:00
Mikkel Denker	d5efc66c31	Custom tantivy query implementation. Preparations for expanded spans ranking which allows us to rank documents based on term proximity. Overall this change gives more control over the search+ranking procedure.	2022-07-22 10:41:39 +02:00
Mikkel Denker	c37db38eea	Bold suffix for suggestions	2022-07-01 18:30:48 +02:00
Mikkel Denker	d64847d874	Better UX for searchbar suggestions	2022-07-01 14:54:31 +02:00
Mikkel Denker	553d231da2	Move webgraph to use RocksDB as store	2022-06-30 15:41:11 +02:00
Mikkel Denker	96b2d3e6ea	Refactor webgraph for easier KV-store additions. GraphStore is now a struct backed by something that implements a new Kv trait. This makes it far easier to implement another backing key-value store, as all the caching etc. from the graph database is abstracted away. This change is introduced since we want to try RocksDB instead of Sled as Sled caused huge memory consumption when indexing 10_000 warc files. It brought a 256GB ram server to its knees.	2022-06-30 12:55:09 +02:00
Mikkel Denker	ea20d43df8	Highlight stemmed matches in snippet generation	2022-06-30 08:48:45 +02:00
Mikkel Denker	b526a782a6	Handle HTML comments in lexer	2022-06-29 19:48:23 +02:00
Mikkel Denker	2ebc020da2	Significant performance improvements to text extraction	2022-06-29 16:37:28 +02:00
Mikkel Denker	fda6c1698c	Initial JustText algorithm. Generate better snippets by guessing which parts of the website are viable content. This also improves search accuracy as the BM25 score becomes way better (it is given cleaner data). Performance could probably be immensely improved, but it seems to work.	2022-06-29 15:23:37 +02:00
Mikkel Denker	6bc6ee352d	[WIP] Query suggestions	2022-06-29 08:55:30 +02:00
Mikkel Denker	c500c0e71c	Custom HTML lexer based on logos. There is no need to parse the entire html file into a tree since a stream of tags will be sufficient for extracting links+text. This should be faster than parsing. It will also make it easier to implement the 'just_text' algorithm.	2022-06-28 13:49:50 +02:00
Mikkel Denker	97583078a6	Better navigational ranking	2022-06-27 11:10:41 +02:00
Mikkel Denker	0632110a8d	Separate stemmed and non-stemmed fields. This allows us to prioritize non-stemmed matches over stemmed matches.	2022-06-27 10:37:28 +02:00
Mikkel Denker	4bdfe59b36	show query in searchbar	2022-06-27 09:25:17 +02:00
Mikkel Denker	21149d8ced	Merge all segments at the end of map, don't merge automatically. This hugely increases the indexing performance as we don't merge as often.	2022-06-27 09:12:29 +02:00
Mikkel Denker	383d84266f	Optimizations to index schema and indexing. There seems to be a problem with large fast-fields, both in terms of crashes and performance. We therefore now check whether a query-document match is navigational by indexing the host string and boosting the field during search. In the future, we may want to add a boolean fast-field to indicate if the given webpage is the homepage and only do navigational searches on those. The commit also increases indexing resources in the hopes that this reduces the number of merges during indexing.	2022-06-25 19:06:04 +02:00
Mikkel Denker	3c907b60b7	initial frontend	2022-06-25 17:14:07 +02:00


1308 commits