Commit graph

447 commits

Author | SHA1 | Message | Date
Daoud Clarke | 65b366d30d | Add spacy | 2021-12-12 20:58:44 +00:00
Daoud Clarke | 16a8356a23 | Run multiple processes in parallel | 2021-12-12 09:09:44 +00:00
Daoud Clarke | 34dc50a6ed | Output processed items to an output queue | 2021-12-11 17:18:00 +00:00
Daoud Clarke | c46257c6d1 | Use our own filesystem-based queue | 2021-12-11 16:57:17 +00:00
Daoud Clarke | a76fd2d8f9 | Use multiprocessing | 2021-12-07 22:56:46 +00:00
Daoud Clarke | 2d554b14e7 | Save results to gzip file | 2021-12-07 22:10:16 +00:00
Daoud Clarke | 2562a5257a | Extract locally | 2021-12-05 22:25:37 +00:00
Daoud Clarke | c151fe3777 | Extract archive info | 2021-12-05 21:42:23 +00:00
Daoud Clarke | a173db319b | Add EMR deploy scripts | 2021-12-05 21:02:17 +00:00
Daoud Clarke | 14817d7657 | Optimise imports | 2021-12-05 20:38:05 +00:00
Daoud Clarke | 312f32bf61 | Add common crawl extract script and dependency management with poetry | 2021-12-05 20:31:49 +00:00
Daoud Clarke | 896f782379 | Improve typing of indexer | 2021-06-13 21:41:19 +01:00
Daoud Clarke | 0578f41a73 | Limit number of chars used in query | 2021-06-11 21:43:12 +01:00
Daoud Clarke | c81fc83900 | Abstract index to allow storing anything | 2021-06-05 22:22:31 +01:00
Daoud Clarke | fb5b6ffd45 | Count terms | 2021-05-30 21:30:34 +01:00
Daoud Clarke | 62d22d9d52 | Optimise imports | 2021-05-30 20:46:39 +01:00
Daoud Clarke | 16aec145d0 | Replace dots in query with spaces | 2021-05-25 21:47:19 +01:00
Daoud Clarke | 550c6f6acc | Check for term in title | 2021-05-25 21:21:38 +01:00
Daoud Clarke | d6cc81278f | Order results by Levenshtein distance to improve recall | 2021-05-23 22:14:07 +01:00
Daoud Clarke | 0e3069fdb3 | Use top urls for performance test | 2021-05-21 11:30:42 +01:00
Daoud Clarke | 974f18647a | Index queued items | 2021-05-19 21:48:03 +01:00
Daoud Clarke | 87fd458218 | Smaller queue | 2021-05-19 21:25:12 +01:00
Daoud Clarke | cc841c8b7e | Use a filesystem-based queue | 2021-05-05 22:16:27 +01:00
Daoud Clarke | 7b4a3897b5 | Set multithreading=True (but it doesn't seem to help) | 2021-05-03 08:37:30 +01:00
Daoud Clarke | ba45d950ef | Catch connection errors | 2021-04-25 11:41:44 +01:00
Daoud Clarke | 61ce4bb832 | Use queues | 2021-04-25 08:55:15 +01:00
Daoud Clarke | e76ce691d0 | Retrieve domain titles | 2021-04-25 07:58:01 +01:00
Daoud Clarke | ed90e49c5e | Print big pages | 2021-04-18 04:54:46 +01:00
Daoud Clarke | c84eeba92e | Use a separate page size for testing | 2021-04-16 22:01:01 +01:00
Daoud Clarke | ced0fceae8 | Record docs per page | 2021-04-16 05:28:51 +01:00
Daoud Clarke | fdb5cbbf3c | Implement retrieval | 2021-04-12 21:26:41 +01:00
Daoud Clarke | acc2a9194e | Index using compression | 2021-04-12 18:37:33 +01:00
Daoud Clarke | 634e490cff | Add a script for performance testing | 2021-04-11 15:10:02 +01:00
Daoud Clarke | d6809fc6f4 | Optimise queries | 2021-03-25 08:38:09 +00:00
Daoud Clarke | 3859b85fc8 | Speed up inserts | 2021-03-24 21:55:35 +00:00
Daoud Clarke | 14f820ff37 | Improve indexing; measure performance | 2021-03-23 22:03:48 +00:00
Daoud Clarke | 0c5bc061ae | Allow the exact term too! | 2021-03-22 21:31:49 +00:00
Daoud Clarke | d3ae0e00a2 | Complete by getting first term | 2021-03-22 21:29:19 +00:00
Daoud Clarke | c17c10ac4c | Index wiki | 2021-03-21 21:37:41 +00:00
Daoud Clarke | 2eb6afc3fe | Complete term | 2021-03-18 21:48:35 +00:00
Daoud Clarke | 257ea7a397 | Handle missing results | 2021-03-18 21:23:12 +00:00
Daoud Clarke | 8e6a67f31b | Parse wiki (slowly) | 2021-03-15 22:06:37 +00:00
Daoud Clarke | f4215352c9 | Searching | 2021-03-14 19:37:00 +00:00
Daoud Clarke | 980f084c08 | Add frontend | 2021-03-14 13:28:32 +00:00
Daoud Clarke | d7baa1f232 | Add timeout | 2021-03-14 12:53:23 +00:00
Daoud Clarke | 9815372297 | Create index | 2021-03-13 22:21:50 +00:00
Daoud Clarke | b1bfe1cdd4 | Initial commit | 2021-03-13 20:54:15 +00:00