Commit graph

391 commits

Author SHA1 Message Date
Daoud Clarke
fc1742e24f Reinstate correct num_pages 2022-07-31 00:45:00 +01:00
Daoud Clarke
bb5186196f Use an in-memory queue 2022-07-31 00:43:58 +01:00
Daoud Clarke
62ba9ddc7e Use a randomised timeout for getting a new batch 2022-07-30 23:10:37 +01:00
Daoud Clarke
a54e093cf1 Merge pull request #69 from mwmbl/reduce-contention-for-client-queries: Reduce contention for client queries 2022-07-30 17:11:34 +01:00
Daoud Clarke
2942d83673 Get URL scores in batches 2022-07-30 14:35:21 +01:00
Daoud Clarke
3709cb236f Use correct index path; retrieve historical batches 2022-07-30 11:08:15 +01:00
Daoud Clarke
063ebb4504 args.index no longer exists 2022-07-30 10:57:15 +01:00
Daoud Clarke
ea32c0ba00 Double index size 2022-07-30 10:37:07 +01:00
Daoud Clarke
2d5235f6f6 More threads for retrieving batches 2022-07-30 10:10:11 +01:00
Daoud Clarke
218d873654 Delete unused SQL 2022-07-30 10:10:03 +01:00
Daoud Clarke
6209382d76 Index batches in memory 2022-07-24 15:44:01 +01:00
Daoud Clarke
1bceeae3df Implement new indexing approach 2022-07-23 23:19:36 +01:00
Daoud Clarke
a8a6c67239 Use URL path to store locally so that we can easily get a local path from a URL 2022-07-20 22:21:35 +01:00
Daoud Clarke
0d1e7d841c Implement a batch cache to store files locally before preprocessing 2022-07-19 21:18:43 +01:00
Daoud Clarke
27a4784d08 Merge pull request #68 from mwmbl/fix-missing-query: Fix missing query 2022-07-19 20:17:20 +01:00
Daoud Clarke
5ce333cc9a Log at info level 2022-07-18 23:46:01 +01:00
Daoud Clarke
a097ec9fbe Allow more tries so that popular terms can be indexed 2022-07-18 23:42:09 +01:00
Daoud Clarke
cfca015efe Enough preprocessing 2022-07-18 22:36:37 +01:00
Daoud Clarke
003cd217f4 Run preprocessing 2022-07-18 22:21:20 +01:00
Daoud Clarke
bcd31326b8 Just index a single page for now 2022-07-18 22:17:15 +01:00
Daoud Clarke
a471bc2437 Use a more specific exception in case we're discarding ones we shouldn't 2022-07-18 22:05:24 +01:00
Daoud Clarke
ce9f52267a Run update 2022-07-18 21:55:27 +01:00
Daoud Clarke
09a9390c92 Catch corrupt data 2022-07-18 21:40:38 +01:00
Daoud Clarke
93307ad1ec Add util script to send batch; add logging 2022-07-18 21:37:19 +01:00
Daoud Clarke
3c97fdb3a0 Merge pull request #66 from mwmbl/fix-unicode-encode-error: Fix unicode encode error; bigger index 2022-07-16 10:59:14 +01:00
Daoud Clarke
680fe1ca0c Fix unicode encoding error 2022-07-16 10:54:25 +01:00
Daoud Clarke
e1e1b0057b Merge pull request #61 from milovanderlinden/issue-60-consistent-use-of-env-vars: Fix issue #60 2022-07-10 21:06:09 +01:00
Daoud Clarke
fee5cbb400 10x index size 2022-07-10 17:15:10 +01:00
milovanderlinden
dfd3f3962e Fix issue #60 2022-07-10 11:10:03 +02:00
Daoud Clarke
dba50b372f Don't include web.archive.org as a curated domain 2022-07-04 15:44:28 +01:00
Daoud Clarke
2e40ae1dca Merge pull request #58 from mwmbl/improve-ranking-for-root-domains: Improve ranking for root domains 2022-07-03 22:10:55 +01:00
Daoud Clarke
43815c7322 Add a URL length penalty 2022-07-03 22:10:02 +01:00
Daoud Clarke
a3ff2f537f Score domain and path, weight components 2022-07-03 21:55:20 +01:00
Daoud Clarke
4b5df76ca5 Merge pull request #57 from mwmbl/clear-indexed-documents: Delete documents that have been preprocessed from the database to sav… 2022-07-03 09:45:52 +01:00
Daoud Clarke
9482ae5028 Delete documents that have been preprocessed from the database to save space 2022-07-03 09:44:51 +01:00
Daoud Clarke
6fa192daa4 Merge pull request #56 from mwmbl/allow-links-from-unknown-domains: Allow crawling links from unknown domains 2022-07-02 13:32:39 +01:00
Daoud Clarke
f9fefa0b62 Record new batches as being local 2022-07-02 13:25:31 +01:00
Daoud Clarke
e578d55789 Allow crawling links from unknown domains 2022-07-01 21:35:34 +01:00
Daoud Clarke
4967830ae1 Merge pull request #55 from mwmbl/index-continuously: Index continuously 2022-07-01 20:55:24 +01:00
Daoud Clarke
db1aa1a928 Don't require a slash for the search URL 2022-07-01 20:43:38 +01:00
Daoud Clarke
24f82a3c2f Actually used the passed in timestamp 2022-06-30 20:57:01 +01:00
Daoud Clarke
d47457b834 CONFIRMED no longer exists 2022-06-30 20:45:26 +01:00
Daoud Clarke
b6f29548db Fix log message 2022-06-30 20:42:37 +01:00
Daoud Clarke
e9835edc45 Wrap background tasks in try/except 2022-06-30 20:00:38 +01:00
Daoud Clarke
6ea3a95684 Allow batches to fail silently 2022-06-30 19:52:58 +01:00
Daoud Clarke
ddc8664c11 Queue the right type of batch 2022-06-29 22:52:12 +01:00
Daoud Clarke
2b52b50569 Queue new batches for indexing 2022-06-29 22:49:24 +01:00
Daoud Clarke
b8c495bda8 Correctly insert new URLs 2022-06-29 22:39:21 +01:00
Daoud Clarke
955d650cf4 Prevent deadlock when inserting URLs 2022-06-28 22:34:46 +01:00
Daoud Clarke
1457cba2c2 Cache batches; start a background process 2022-06-27 23:44:25 +01:00