306 commits

Author SHA1 Message Date
Daoud Clarke
23e47e963b Simplify completions 2022-08-11 17:34:52 +01:00
Daoud Clarke
c6773b46c4 Merge pull request #72 from mwmbl/improve-ranking-with-multi-term-search: Improve ranking with multi term search 2022-08-10 21:43:51 +01:00
Daoud Clarke
74107667b4 Improve printing of search results in script 2022-08-10 21:43:13 +01:00
Daoud Clarke
3bcb7f42c1 Use heuristic ranker 2022-08-09 22:56:12 +01:00
Daoud Clarke
c1b9e70743 Add new LTR model 2022-08-09 22:47:59 +01:00
Daoud Clarke
57476ed2c8 Tweak features 2022-08-09 22:23:36 +01:00
Daoud Clarke
c99e813398 Get best-performing configuration 2022-08-09 20:56:15 +01:00
Daoud Clarke
8b50643303 Add in match score feature (although it hurts the results) 2022-08-09 00:08:55 +01:00
Daoud Clarke
c60b73a403 Create a get_features function and make it work like the heuristic approach 2022-08-08 23:42:34 +01:00
Daoud Clarke
c1d361c0a0 New LTR model trained on more data 2022-08-08 22:52:37 +01:00
Daoud Clarke
b99d9d1c6a Search for the term itself as well as its completion 2022-08-08 22:51:09 +01:00
Daoud Clarke
f40d82c449 Allow running with no background script 2022-08-01 23:33:02 +01:00
Daoud Clarke
046f86f7e3 Merge pull request #71 from mwmbl/fix-missing-scores: Store the best items, not the worst ones 2022-08-01 23:32:24 +01:00
Daoud Clarke
ae658906dd Store the best items, not the worst ones 2022-07-31 22:55:15 +01:00
Daoud Clarke
aa5878fd2f Merge pull request #70 from mwmbl/reduce-new-batch-contention: Reduce new batch contention 2022-07-31 21:02:05 +01:00
Daoud Clarke
fc1742e24f Reinstate correct num_pages 2022-07-31 00:45:00 +01:00
Daoud Clarke
bb5186196f Use an in-memory queue 2022-07-31 00:43:58 +01:00
Daoud Clarke
62ba9ddc7e Use a randomised timeout for getting a new batch 2022-07-30 23:10:37 +01:00
Daoud Clarke
a54e093cf1 Merge pull request #69 from mwmbl/reduce-contention-for-client-queries: Reduce contention for client queries 2022-07-30 17:11:34 +01:00
Daoud Clarke
2942d83673 Get URL scores in batches 2022-07-30 14:35:21 +01:00
Daoud Clarke
3709cb236f Use correct index path; retrieve historical batches 2022-07-30 11:08:15 +01:00
Daoud Clarke
063ebb4504 args.index no longer exists 2022-07-30 10:57:15 +01:00
Daoud Clarke
ea32c0ba00 Double index size 2022-07-30 10:37:07 +01:00
Daoud Clarke
2d5235f6f6 More threads for retrieving batches 2022-07-30 10:10:11 +01:00
Daoud Clarke
218d873654 Delete unused SQL 2022-07-30 10:10:03 +01:00
Daoud Clarke
6209382d76 Index batches in memory 2022-07-24 15:44:01 +01:00
Daoud Clarke
1bceeae3df Implement new indexing approach 2022-07-23 23:19:36 +01:00
Daoud Clarke
a8a6c67239 Use URL path to store locally so that we can easily get a local path from a URL 2022-07-20 22:21:35 +01:00
Daoud Clarke
0d1e7d841c Implement a batch cache to store files locally before preprocessing 2022-07-19 21:18:43 +01:00
Daoud Clarke
27a4784d08 Merge pull request #68 from mwmbl/fix-missing-query: Fix missing query 2022-07-19 20:17:20 +01:00
Daoud Clarke
5ce333cc9a Log at info level 2022-07-18 23:46:01 +01:00
Daoud Clarke
a097ec9fbe Allow more tries so that popular terms can be indexed 2022-07-18 23:42:09 +01:00
Daoud Clarke
cfca015efe Enough preprocessing 2022-07-18 22:36:37 +01:00
Daoud Clarke
003cd217f4 Run preprocessing 2022-07-18 22:21:20 +01:00
Daoud Clarke
bcd31326b8 Just index a single page for now 2022-07-18 22:17:15 +01:00
Daoud Clarke
a471bc2437 Use a more specific exception in case we're discarding ones we shouldn't 2022-07-18 22:05:24 +01:00
Daoud Clarke
ce9f52267a Run update 2022-07-18 21:55:27 +01:00
Daoud Clarke
09a9390c92 Catch corrupt data 2022-07-18 21:40:38 +01:00
Daoud Clarke
93307ad1ec Add util script to send batch; add logging 2022-07-18 21:37:19 +01:00
Daoud Clarke
3c97fdb3a0 Merge pull request #66 from mwmbl/fix-unicode-encode-error: Fix unicode encode error; bigger index 2022-07-16 10:59:14 +01:00
Daoud Clarke
680fe1ca0c Fix unicode encoding error 2022-07-16 10:54:25 +01:00
Daoud Clarke
e1e1b0057b Merge pull request #61 from milovanderlinden/issue-60-consistent-use-of-env-vars: Fix issue #60 2022-07-10 21:06:09 +01:00
Daoud Clarke
fee5cbb400 10x index size 2022-07-10 17:15:10 +01:00
milovanderlinden
dfd3f3962e Fix issue #60 2022-07-10 11:10:03 +02:00
Daoud Clarke
dba50b372f Don't include web.archive.org as a curated domain 2022-07-04 15:44:28 +01:00
Daoud Clarke
2e40ae1dca Merge pull request #58 from mwmbl/improve-ranking-for-root-domains: Improve ranking for root domains 2022-07-03 22:10:55 +01:00
Daoud Clarke
43815c7322 Add a URL length penalty 2022-07-03 22:10:02 +01:00
Daoud Clarke
a3ff2f537f Score domain and path, weight components 2022-07-03 21:55:20 +01:00
Daoud Clarke
4b5df76ca5 Merge pull request #57 from mwmbl/clear-indexed-documents: Delete documents that have been preprocessed from the database to sav… 2022-07-03 09:45:52 +01:00
Daoud Clarke
9482ae5028 Delete documents that have been preprocessed from the database to save space 2022-07-03 09:44:51 +01:00