Daoud Clarke
|
dba50b372f
|
Don't include web.archive.org as a curated domain
|
2022-07-04 15:44:28 +01:00 |
|
Daoud Clarke
|
2e40ae1dca
|
Merge pull request #58 from mwmbl/improve-ranking-for-root-domains
Improve ranking for root domains
|
2022-07-03 22:10:55 +01:00 |
|
Daoud Clarke
|
43815c7322
|
Add a URL length penalty
|
2022-07-03 22:10:02 +01:00 |
|
Daoud Clarke
|
a3ff2f537f
|
Score domain and path, weight components
|
2022-07-03 21:55:20 +01:00 |
|
Daoud Clarke
|
4b5df76ca5
|
Merge pull request #57 from mwmbl/clear-indexed-documents
Delete documents that have been preprocessed from the database to sav…
|
2022-07-03 09:45:52 +01:00 |
|
Daoud Clarke
|
9482ae5028
|
Delete documents that have been preprocessed from the database to save space
|
2022-07-03 09:44:51 +01:00 |
|
Daoud Clarke
|
6fa192daa4
|
Merge pull request #56 from mwmbl/allow-links-from-unknown-domains
Allow crawling links from unknown domains
|
2022-07-02 13:32:39 +01:00 |
|
Daoud Clarke
|
f9fefa0b62
|
Record new batches as being local
|
2022-07-02 13:25:31 +01:00 |
|
Daoud Clarke
|
e578d55789
|
Allow crawling links from unknown domains
|
2022-07-01 21:35:34 +01:00 |
|
Daoud Clarke
|
4967830ae1
|
Merge pull request #55 from mwmbl/index-continuously
Index continuously
|
2022-07-01 20:55:24 +01:00 |
|
Daoud Clarke
|
db1aa1a928
|
Don't require a slash for the search URL
|
2022-07-01 20:43:38 +01:00 |
|
Daoud Clarke
|
24f82a3c2f
|
Actually use the passed-in timestamp
|
2022-06-30 20:57:01 +01:00 |
|
Daoud Clarke
|
d47457b834
|
CONFIRMED no longer exists
|
2022-06-30 20:45:26 +01:00 |
|
Daoud Clarke
|
b6f29548db
|
Fix log message
|
2022-06-30 20:42:37 +01:00 |
|
Daoud Clarke
|
e9835edc45
|
Wrap background tasks in try/except
|
2022-06-30 20:00:38 +01:00 |
|
Daoud Clarke
|
6ea3a95684
|
Allow batches to fail silently
|
2022-06-30 19:52:58 +01:00 |
|
Daoud Clarke
|
ddc8664c11
|
Queue the right type of batch
|
2022-06-29 22:52:12 +01:00 |
|
Daoud Clarke
|
2b52b50569
|
Queue new batches for indexing
|
2022-06-29 22:49:24 +01:00 |
|
Daoud Clarke
|
b8c495bda8
|
Correctly insert new URLs
|
2022-06-29 22:39:21 +01:00 |
|
Daoud Clarke
|
955d650cf4
|
Prevent deadlock when inserting URLs
|
2022-06-28 22:34:46 +01:00 |
|
Daoud Clarke
|
1457cba2c2
|
Cache batches; start a background process
|
2022-06-27 23:44:25 +01:00 |
|
Daoud Clarke
|
ff2312a5ca
|
Use different scores for same domain links
|
2022-06-27 22:46:06 +01:00 |
|
Daoud Clarke
|
36b168a8f6
|
Fix logic in found URL logic SQL and allow crawling URLs crawled by one user for now
|
2022-06-26 21:23:57 +01:00 |
|
Daoud Clarke
|
5e1ec9ccd5
|
Temporarily disable startup background processes; add root domains; check for empty batches.
|
2022-06-26 21:15:52 +01:00 |
|
Daoud Clarke
|
e27d749e18
|
Investigate duplication of URLs in batches
|
2022-06-26 21:11:51 +01:00 |
|
Daoud Clarke
|
eb571fc5fe
|
Add a script to count urls in the index
|
2022-06-21 21:55:38 +01:00 |
|
Daoud Clarke
|
1d9b5cb3ca
|
Make more robust
|
2022-06-21 08:44:46 +01:00 |
|
Daoud Clarke
|
30e1e19072
|
Update queued pages in the index
|
2022-06-20 23:35:44 +01:00 |
|
Daoud Clarke
|
4330551e0f
|
Tokenize documents and store pages to be added to the index
|
2022-06-20 22:54:35 +01:00 |
|
Daoud Clarke
|
9594915de1
|
WIP: index continuously. Retrieve batches and store in Postgres
|
2022-06-19 23:23:57 +01:00 |
|
Daoud Clarke
|
b8b605daed
|
Factor out connection code
|
2022-06-19 16:52:25 +01:00 |
|
Daoud Clarke
|
c31cea710f
|
CORS is handled by nginx
|
2022-06-19 13:13:36 +01:00 |
|
Daoud Clarke
|
96da534ca5
|
Don't add CORS on the python side
|
2022-06-19 11:34:54 +01:00 |
|
Daoud Clarke
|
9dbb724ba9
|
Use updated CORS settings
|
2022-06-19 11:31:55 +01:00 |
|
Daoud Clarke
|
e3baf87918
|
Remove seemingly extraneous backslashes
|
2022-06-19 11:27:37 +01:00 |
|
Daoud Clarke
|
c245be775b
|
Use an updated template
|
2022-06-19 11:25:38 +01:00 |
|
Daoud Clarke
|
01772517da
|
Remove problematic SSL_DIRECTIVES line
|
2022-06-19 11:23:01 +01:00 |
|
Daoud Clarke
|
a67ca7b298
|
Enable CORS in nginx
|
2022-06-19 11:16:03 +01:00 |
|
Daoud Clarke
|
866c17f2dc
|
Use the dokku app storage
|
2022-06-19 09:53:19 +01:00 |
|
Daoud Clarke
|
16c2692099
|
Start processing historical data on startup
|
2022-06-19 08:56:55 +01:00 |
|
Daoud Clarke
|
d400950689
|
Add script to process historical data
|
2022-06-18 15:31:35 +01:00 |
|
Daoud Clarke
|
eb1c59990c
|
Expose the port
|
2022-06-17 23:57:58 +01:00 |
|
Daoud Clarke
|
d7c6dcb5c2
|
Use the correct port for dokku
|
2022-06-17 23:54:22 +01:00 |
|
Daoud Clarke
|
77088a8a1b
|
Use a database URL env var
|
2022-06-17 23:39:24 +01:00 |
|
Daoud Clarke
|
476481c5f8
|
Put the resources in the package
|
2022-06-17 23:32:43 +01:00 |
|
Daoud Clarke
|
505e7521d4
|
Copy the resources
|
2022-06-17 23:29:04 +01:00 |
|
Daoud Clarke
|
5ea9efcfa2
|
Fix relative path
|
2022-06-17 23:19:30 +01:00 |
|
Daoud Clarke
|
1c7420e5fb
|
Don't depend on existing data
|
2022-06-17 23:12:22 +01:00 |
|
Daoud Clarke
|
a003914e91
|
Fix boto3 dependency
|
2022-06-17 22:14:55 +01:00 |
|
Daoud Clarke
|
363103468e
|
Update Dockerfile for changes
|
2022-06-17 21:26:21 +01:00 |
|