Daoud Clarke
|
3709cb236f
|
Use correct index path; retrieve historical batches
|
2022-07-30 11:08:15 +01:00 |
|
Daoud Clarke
|
063ebb4504
|
args.index no longer exists
|
2022-07-30 10:57:15 +01:00 |
|
Daoud Clarke
|
ea32c0ba00
|
Double index size
|
2022-07-30 10:37:07 +01:00 |
|
Daoud Clarke
|
2d5235f6f6
|
More threads for retrieving batches
|
2022-07-30 10:10:11 +01:00 |
|
Daoud Clarke
|
218d873654
|
Delete unused SQL
|
2022-07-30 10:10:03 +01:00 |
|
Daoud Clarke
|
6209382d76
|
Index batches in memory
|
2022-07-24 15:44:01 +01:00 |
|
Daoud Clarke
|
1bceeae3df
|
Implement new indexing approach
|
2022-07-23 23:19:36 +01:00 |
|
Daoud Clarke
|
a8a6c67239
|
Use URL path to store locally so that we can easily get a local path from a URL
|
2022-07-20 22:21:35 +01:00 |
|
Daoud Clarke
|
0d1e7d841c
|
Implement a batch cache to store files locally before preprocessing
|
2022-07-19 21:18:43 +01:00 |
|
Daoud Clarke
|
27a4784d08
|
Merge pull request #68 from mwmbl/fix-missing-query
Fix missing query
|
2022-07-19 20:17:20 +01:00 |
|
Daoud Clarke
|
5ce333cc9a
|
Log at info level
|
2022-07-18 23:46:01 +01:00 |
|
Daoud Clarke
|
a097ec9fbe
|
Allow more tries so that popular terms can be indexed
|
2022-07-18 23:42:09 +01:00 |
|
Daoud Clarke
|
cfca015efe
|
Enough preprocessing
|
2022-07-18 22:36:37 +01:00 |
|
Daoud Clarke
|
003cd217f4
|
Run preprocessing
|
2022-07-18 22:21:20 +01:00 |
|
Daoud Clarke
|
bcd31326b8
|
Just index a single page for now
|
2022-07-18 22:17:15 +01:00 |
|
Daoud Clarke
|
a471bc2437
|
Use a more specific exception in case we're discarding ones we shouldn't
|
2022-07-18 22:05:24 +01:00 |
|
Daoud Clarke
|
ce9f52267a
|
Run update
|
2022-07-18 21:55:27 +01:00 |
|
Daoud Clarke
|
09a9390c92
|
Catch corrupt data
|
2022-07-18 21:40:38 +01:00 |
|
Daoud Clarke
|
93307ad1ec
|
Add util script to send batch; add logging
|
2022-07-18 21:37:19 +01:00 |
|
Daoud Clarke
|
3c97fdb3a0
|
Merge pull request #66 from mwmbl/fix-unicode-encode-error
Fix unicode encode error; bigger index
|
2022-07-16 10:59:14 +01:00 |
|
Daoud Clarke
|
680fe1ca0c
|
Fix unicode encoding error
|
2022-07-16 10:54:25 +01:00 |
|
Daoud Clarke
|
e1e1b0057b
|
Merge pull request #61 from milovanderlinden/issue-60-consistent-use-of-env-vars
Fix issue #60
|
2022-07-10 21:06:09 +01:00 |
|
Daoud Clarke
|
fee5cbb400
|
10x index size
|
2022-07-10 17:15:10 +01:00 |
|
milovanderlinden
|
dfd3f3962e
|
Fix issue #60
|
2022-07-10 11:10:03 +02:00 |
|
Daoud Clarke
|
dba50b372f
|
Don't include web.archive.org as a curated domain
|
2022-07-04 15:44:28 +01:00 |
|
Daoud Clarke
|
2e40ae1dca
|
Merge pull request #58 from mwmbl/improve-ranking-for-root-domains
Improve ranking for root domains
|
2022-07-03 22:10:55 +01:00 |
|
Daoud Clarke
|
43815c7322
|
Add a URL length penalty
|
2022-07-03 22:10:02 +01:00 |
|
Daoud Clarke
|
a3ff2f537f
|
Score domain and path, weight components
|
2022-07-03 21:55:20 +01:00 |
|
Daoud Clarke
|
4b5df76ca5
|
Merge pull request #57 from mwmbl/clear-indexed-documents
Delete documents that have been preprocessed from the database to sav…
|
2022-07-03 09:45:52 +01:00 |
|
Daoud Clarke
|
9482ae5028
|
Delete documents that have been preprocessed from the database to save space
|
2022-07-03 09:44:51 +01:00 |
|
Daoud Clarke
|
6fa192daa4
|
Merge pull request #56 from mwmbl/allow-links-from-unknown-domains
Allow crawling links from unknown domains
|
2022-07-02 13:32:39 +01:00 |
|
Daoud Clarke
|
f9fefa0b62
|
Record new batches as being local
|
2022-07-02 13:25:31 +01:00 |
|
Daoud Clarke
|
e578d55789
|
Allow crawling links from unknown domains
|
2022-07-01 21:35:34 +01:00 |
|
Daoud Clarke
|
4967830ae1
|
Merge pull request #55 from mwmbl/index-continuously
Index continuously
|
2022-07-01 20:55:24 +01:00 |
|
Daoud Clarke
|
db1aa1a928
|
Don't require a slash for the search URL
|
2022-07-01 20:43:38 +01:00 |
|
Daoud Clarke
|
24f82a3c2f
|
Actually use the passed-in timestamp
|
2022-06-30 20:57:01 +01:00 |
|
Daoud Clarke
|
d47457b834
|
CONFIRMED no longer exists
|
2022-06-30 20:45:26 +01:00 |
|
Daoud Clarke
|
b6f29548db
|
Fix log message
|
2022-06-30 20:42:37 +01:00 |
|
Daoud Clarke
|
e9835edc45
|
Wrap background tasks in try/except
|
2022-06-30 20:00:38 +01:00 |
|
Daoud Clarke
|
6ea3a95684
|
Allow batches to fail silently
|
2022-06-30 19:52:58 +01:00 |
|
Daoud Clarke
|
ddc8664c11
|
Queue the right type of batch
|
2022-06-29 22:52:12 +01:00 |
|
Daoud Clarke
|
2b52b50569
|
Queue new batches for indexing
|
2022-06-29 22:49:24 +01:00 |
|
Daoud Clarke
|
b8c495bda8
|
Correctly insert new URLs
|
2022-06-29 22:39:21 +01:00 |
|
Daoud Clarke
|
955d650cf4
|
Prevent deadlock when inserting URLs
|
2022-06-28 22:34:46 +01:00 |
|
Daoud Clarke
|
1457cba2c2
|
Cache batches; start a background process
|
2022-06-27 23:44:25 +01:00 |
|
Daoud Clarke
|
ff2312a5ca
|
Use different scores for same domain links
|
2022-06-27 22:46:06 +01:00 |
|
Daoud Clarke
|
36b168a8f6
|
Fix the found-URL SQL logic and allow crawling URLs crawled by one user for now
|
2022-06-26 21:23:57 +01:00 |
|
Daoud Clarke
|
5e1ec9ccd5
|
Temporarily disable startup background processes; add root domains; check for empty batches.
|
2022-06-26 21:15:52 +01:00 |
|
Daoud Clarke
|
e27d749e18
|
Investigate duplication of URLs in batches
|
2022-06-26 21:11:51 +01:00 |
|
Daoud Clarke
|
eb571fc5fe
|
Add a script to count URLs in the index
|
2022-06-21 21:55:38 +01:00 |
|