mwmbl

Author	SHA1	Message	Date
Daoud Clarke	eda7870788	Restrict to https and strip the prefix and / on the end	2022-08-11 22:23:14 +01:00
Daoud Clarke	23e47e963b	Simplify completions	2022-08-11 17:34:52 +01:00
Daoud Clarke	3bcb7f42c1	Use heuristic ranker	2022-08-09 22:56:12 +01:00
Daoud Clarke	c1b9e70743	Add new LTR model	2022-08-09 22:47:59 +01:00
Daoud Clarke	57476ed2c8	Tweak features	2022-08-09 22:23:36 +01:00
Daoud Clarke	c99e813398	Get best-performing configuration	2022-08-09 20:56:15 +01:00
Daoud Clarke	8b50643303	Add in match score feature (although it hurts the results)	2022-08-09 00:08:55 +01:00
Daoud Clarke	c60b73a403	Create a get_features function and make it work like the heuristic approach	2022-08-08 23:42:34 +01:00
Daoud Clarke	c1d361c0a0	New LTR model trained on more data	2022-08-08 22:52:37 +01:00
Daoud Clarke	b99d9d1c6a	Search for the term itself as well as its completion	2022-08-08 22:51:09 +01:00
Daoud Clarke	f40d82c449	Allow running with no background script	2022-08-01 23:33:02 +01:00
Daoud Clarke	ae658906dd	Store the best items, not the worst ones	2022-07-31 22:55:15 +01:00
Daoud Clarke	fc1742e24f	Reinstate correct num_pages	2022-07-31 00:45:00 +01:00
Daoud Clarke	bb5186196f	Use an in-memory queue	2022-07-31 00:43:58 +01:00
Daoud Clarke	62ba9ddc7e	Use a randomised timeout for getting a new batch	2022-07-30 23:10:37 +01:00
Daoud Clarke	2942d83673	Get URL scores in batches	2022-07-30 14:35:21 +01:00
Daoud Clarke	3709cb236f	Use correct index path; retrieve historical batches	2022-07-30 11:08:15 +01:00
Daoud Clarke	063ebb4504	args.index no longer exists	2022-07-30 10:57:15 +01:00
Daoud Clarke	ea32c0ba00	Double index size	2022-07-30 10:37:07 +01:00
Daoud Clarke	2d5235f6f6	More threads for retrieving batches	2022-07-30 10:10:11 +01:00
Daoud Clarke	218d873654	Delete unused SQL	2022-07-30 10:10:03 +01:00
Daoud Clarke	6209382d76	Index batches in memory	2022-07-24 15:44:01 +01:00
Daoud Clarke	1bceeae3df	Implement new indexing approach	2022-07-23 23:19:36 +01:00
Daoud Clarke	a8a6c67239	Use URL path to store locally so that we can easily get a local path from a URL	2022-07-20 22:21:35 +01:00
Daoud Clarke	0d1e7d841c	Implement a batch cache to store files locally before preprocessing	2022-07-19 21:18:43 +01:00
Daoud Clarke	5ce333cc9a	Log at info level	2022-07-18 23:46:01 +01:00
Daoud Clarke	a097ec9fbe	Allow more tries so that popular terms can be indexed	2022-07-18 23:42:09 +01:00
Daoud Clarke	cfca015efe	Enough preprocessing	2022-07-18 22:36:37 +01:00
Daoud Clarke	003cd217f4	Run preprocessing	2022-07-18 22:21:20 +01:00
Daoud Clarke	bcd31326b8	Just index a single page for now	2022-07-18 22:17:15 +01:00
Daoud Clarke	a471bc2437	Use a more specific exception in case we're discarding ones we shouldn't	2022-07-18 22:05:24 +01:00
Daoud Clarke	ce9f52267a	Run update	2022-07-18 21:55:27 +01:00
Daoud Clarke	09a9390c92	Catch corrupt data	2022-07-18 21:40:38 +01:00
Daoud Clarke	93307ad1ec	Add util script to send batch; add logging	2022-07-18 21:37:19 +01:00
Daoud Clarke	3c97fdb3a0	Merge pull request #66 from mwmbl/fix-unicode-encode-error: Fix unicode encode error; bigger index	2022-07-16 10:59:14 +01:00
Daoud Clarke	680fe1ca0c	Fix unicode encoding error	2022-07-16 10:54:25 +01:00
Daoud Clarke	fee5cbb400	10x index size	2022-07-10 17:15:10 +01:00
milovanderlinden	dfd3f3962e	Fix issue #60	2022-07-10 11:10:03 +02:00
Daoud Clarke	dba50b372f	Don't include web.archive.org as a curated domain	2022-07-04 15:44:28 +01:00
Daoud Clarke	43815c7322	Add a URL length penalty	2022-07-03 22:10:02 +01:00
Daoud Clarke	a3ff2f537f	Score domain and path, weight components	2022-07-03 21:55:20 +01:00
Daoud Clarke	9482ae5028	Delete documents that have been preprocessed from the database to save space	2022-07-03 09:44:51 +01:00
Daoud Clarke	f9fefa0b62	Record new batches as being local	2022-07-02 13:25:31 +01:00
Daoud Clarke	e578d55789	Allow crawling links from unknown domains	2022-07-01 21:35:34 +01:00
Daoud Clarke	db1aa1a928	Don't require a slash for the search URL	2022-07-01 20:43:38 +01:00
Daoud Clarke	24f82a3c2f	Actually used the passed in timestamp	2022-06-30 20:57:01 +01:00
Daoud Clarke	d47457b834	CONFIRMED no longer exists	2022-06-30 20:45:26 +01:00
Daoud Clarke	b6f29548db	Fix log message	2022-06-30 20:42:37 +01:00
Daoud Clarke	e9835edc45	Wrap background tasks in try/except	2022-06-30 20:00:38 +01:00
Daoud Clarke	6ea3a95684	Allow batches to fail silently	2022-06-30 19:52:58 +01:00
Daoud Clarke	ddc8664c11	Queue the right type of batch	2022-06-29 22:52:12 +01:00
Daoud Clarke	2b52b50569	Queue new batches for indexing	2022-06-29 22:49:24 +01:00
Daoud Clarke	b8c495bda8	Correctly insert new URLs	2022-06-29 22:39:21 +01:00
Daoud Clarke	955d650cf4	Prevent deadlock when inserting URLs	2022-06-28 22:34:46 +01:00
Daoud Clarke	1457cba2c2	Cache batches; start a background process	2022-06-27 23:44:25 +01:00
Daoud Clarke	ff2312a5ca	Use different scores for same domain links	2022-06-27 22:46:06 +01:00
Daoud Clarke	36b168a8f6	Fix logic in found URL logic SQL and allow crawling URLs crawled by one user for now	2022-06-26 21:23:57 +01:00
Daoud Clarke	5e1ec9ccd5	Temporarily disable startup background processes; add root domains; check for empty batches.	2022-06-26 21:15:52 +01:00
Daoud Clarke	e27d749e18	Investigate duplication of URLs in batches	2022-06-26 21:11:51 +01:00
Daoud Clarke	1d9b5cb3ca	Make more robust	2022-06-21 08:44:46 +01:00
Daoud Clarke	30e1e19072	Update queued pages in the index	2022-06-20 23:35:44 +01:00
Daoud Clarke	4330551e0f	Tokenize documents and store pages to be added to the index	2022-06-20 22:54:35 +01:00
Daoud Clarke	9594915de1	WIP: index continuously. Retrieve batches and store in Postgres	2022-06-19 23:23:57 +01:00
Daoud Clarke	b8b605daed	Factor out connection code	2022-06-19 16:52:25 +01:00
Daoud Clarke	c31cea710f	CORS is handled by nginx	2022-06-19 13:13:36 +01:00
Daoud Clarke	96da534ca5	Don't add CORS on the python side	2022-06-19 11:34:54 +01:00
Daoud Clarke	866c17f2dc	Use the dokku app storage	2022-06-19 09:53:19 +01:00
Daoud Clarke	16c2692099	Start processing historical data on startup	2022-06-19 08:56:55 +01:00
Daoud Clarke	d400950689	Add script to process historical data	2022-06-18 15:31:35 +01:00
Daoud Clarke	d7c6dcb5c2	Use the correct port for dokku	2022-06-17 23:54:22 +01:00
Daoud Clarke	77088a8a1b	Use a database URL env var	2022-06-17 23:39:24 +01:00
Daoud Clarke	476481c5f8	Put the resources in the package	2022-06-17 23:32:43 +01:00
Daoud Clarke	505e7521d4	Copy the resources	2022-06-17 23:29:04 +01:00
Daoud Clarke	5ea9efcfa2	Fix relative path	2022-06-17 23:19:30 +01:00
Daoud Clarke	1c7420e5fb	Don't depend on existing data	2022-06-17 23:12:22 +01:00
Daoud Clarke	e2eb405083	Combine crawler and search servers	2022-06-16 22:49:41 +01:00
Daoud Clarke	770b4b945b	Refactor feature extraction	2022-05-07 22:52:36 +01:00
Daoud Clarke	87d8b40cad	Make order_results public	2022-05-06 23:15:50 +01:00
Daoud Clarke	229819e57e	Refactor to allow LTR ranker	2022-03-27 22:32:44 +01:00
Daoud Clarke	94287cec01	Get features for each string separately	2022-03-21 21:49:10 +00:00
Daoud Clarke	4740d89c6a	Add domain score feature	2022-03-21 21:13:20 +00:00
Daoud Clarke	af6a28fac3	Implement learning to rank feature extraction and thresholding	2022-03-20 22:01:45 +00:00
Daoud Clarke	2d334074af	Make get_results() public for learning to rank	2022-03-20 17:25:54 +00:00
Daoud Clarke	ee5ca6bcf6	Experiment with score variations (best is simple weighted domain score)	2022-02-27 21:24:16 +00:00
Daoud Clarke	6fb310c363	Use addition instead of multiplication	2022-02-25 22:19:26 +00:00
Daoud Clarke	4e6516ccf1	Scale by 0.99	2022-02-25 22:14:49 +00:00
Daoud Clarke	f5afbed2e5	Handle empty list	2022-02-25 22:11:09 +00:00
Daoud Clarke	efafec5214	Rank using item score as well as match score	2022-02-25 22:08:37 +00:00
Daoud Clarke	e1e9e404a3	Dedupe before indexing	2022-02-24 22:01:42 +00:00
Daoud Clarke	f5b20d0128	Index link counts	2022-02-24 20:47:36 +00:00
Daoud Clarke	b5b2005323	Store computed link counts	2022-02-23 22:13:38 +00:00
Daoud Clarke	00d18c3474	Remove unused code	2022-02-23 21:59:24 +00:00
Daoud Clarke	04a33a134b	Fixes to mwmbl API for changes to the index	2022-02-22 22:27:02 +00:00
Daoud Clarke	ae3b334a7f	Fixes for API changes	2022-02-22 22:12:39 +00:00
Daoud Clarke	326f7e3d7f	Use JSON instead of struct to store metadata	2022-02-18 22:22:47 +00:00
Daoud Clarke	e6273c7f76	WIP: include metadata in index - using struct approach	2022-02-18 22:12:22 +00:00
Daoud Clarke	e03e379ccf	Refactor to enable easier evaluation	2022-02-09 22:43:47 +00:00
Daoud Clarke	6e5e56f99a	New index; more pages	2022-02-04 18:08:23 +00:00
Daoud Clarke	fe6ace93e6	Improve handling of incomplete words: correctly generate regex for incomplete vs complete words; return more than one top word from completer; correctly handle no terms	2022-01-31 21:20:59 +00:00
Daoud Clarke	7d829bc319	Use python 3.10; complete terms	2022-01-30 23:24:00 +00:00
Daoud Clarke	3c75dd1a74	WIP: implement term completer	2022-01-30 22:20:28 +00:00
Daoud Clarke	01a21337a9	Don't index partial words	2022-01-30 14:30:02 +00:00
Daoud Clarke	2ef8304919	Remove some debug print statements	2022-01-30 13:16:24 +00:00
Daoud Clarke	5b89bbf05d	Index Mwmbl crawled data	2022-01-29 08:26:42 +00:00
Daoud Clarke	70254ae160	Analyse crawled URLs and domains	2022-01-26 18:51:58 +00:00
Daoud Clarke	171fa645d2	Add script to export top domains	2022-01-23 22:04:30 +00:00
ColinEspinas	3481ad372b	Removed old front-end files and routes	2022-01-19 23:33:37 +01:00
Daoud Clarke	a41088ca9a	Add CORS; revert back to previous index as it timed out deploying	2022-01-03 18:31:03 +00:00
Daoud Clarke	25918e42ef	Export URLs to sqlite for evaluation purposes	2022-01-02 20:06:13 +00:00
nitred	fbdb93c86a	Using the app object to start uvicorn, instead of using a reference like "mwmbl.tinysearchengine.app:app"; fixes the issue when running the server using python -m mwmbl.tinysearchengine.app. When running the server using python -m, uvicorn seems to spawn a new process or interpreter session. At least it appears that way since already initialized & imported modules and variables appear to be uninitialized.	2021-12-31 02:15:16 +01:00
Daoud Clarke	e6655101ef	Add a component of the HN domain score when ranking	2021-12-30 22:20:10 +00:00
Daoud Clarke	02bcef640c	Merge pull request #25 from ColinEspinas/search-debounce: Added debounce on search input	2021-12-29 20:59:29 +00:00
ColinEspinas	c636be9089	Added debounce on search input (#8)	2021-12-29 21:03:47 +01:00
nitred	a72a08a7d9	added config and binary/entrypoint for mwmbl.tinysearchengine: using pydantic to validate the config; added a default bootstrap config at config/tinysearchengine.yaml; refactored app.py to include parsing CLI argument using argparse; refactored app.py to use fewer global variables; added "mwmbl-tinysearchengine" binary/entrypoint in pyproject.toml; updated Dockerfile to work with these changes and added comments to it	2021-12-29 15:26:33 +01:00
nitred	be40a15b27	Merge branch 'master' into mwmbl-package	2021-12-29 00:25:37 +01:00
nitred	11eedcde84	renamed package to mwmbl: renamed package to mwmbl in pyproject.toml; tinysearchengine and indexer modules have been moved into mwmbl package folder; analyse module has been left as is in the root of the repo; import statements in tinysearchengine now use mwmbl.tinysearchengine; import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths; import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine; final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app; fixed a couple of import statement errors in tinysearchengine/indexer.py	2021-12-28 12:35:46 +01:00

216 commits