Commit graph

314 commits

Author SHA1 Message Date
Daoud Clarke
fee5cbb400 10x index size 2022-07-10 17:15:10 +01:00
milovanderlinden
dfd3f3962e Fix issue #60 2022-07-10 11:10:03 +02:00
Daoud Clarke
dba50b372f Don't include web.archive.org as a curated domain 2022-07-04 15:44:28 +01:00
Daoud Clarke
2e40ae1dca Merge pull request #58 from mwmbl/improve-ranking-for-root-domains: Improve ranking for root domains 2022-07-03 22:10:55 +01:00
Daoud Clarke
43815c7322 Add a URL length penalty 2022-07-03 22:10:02 +01:00
Daoud Clarke
a3ff2f537f Score domain and path, weight components 2022-07-03 21:55:20 +01:00
Daoud Clarke
4b5df76ca5 Merge pull request #57 from mwmbl/clear-indexed-documents: Delete documents that have been preprocessed from the database to sav… 2022-07-03 09:45:52 +01:00
Daoud Clarke
9482ae5028 Delete documents that have been preprocessed from the database to save space 2022-07-03 09:44:51 +01:00
Daoud Clarke
6fa192daa4 Merge pull request #56 from mwmbl/allow-links-from-unknown-domains: Allow crawling links from unknown domains 2022-07-02 13:32:39 +01:00
Daoud Clarke
f9fefa0b62 Record new batches as being local 2022-07-02 13:25:31 +01:00
Daoud Clarke
e578d55789 Allow crawling links from unknown domains 2022-07-01 21:35:34 +01:00
Daoud Clarke
4967830ae1 Merge pull request #55 from mwmbl/index-continuously: Index continuously 2022-07-01 20:55:24 +01:00
Daoud Clarke
db1aa1a928 Don't require a slash for the search URL 2022-07-01 20:43:38 +01:00
Daoud Clarke
24f82a3c2f Actually used the passed in timestamp 2022-06-30 20:57:01 +01:00
Daoud Clarke
d47457b834 CONFIRMED no longer exists 2022-06-30 20:45:26 +01:00
Daoud Clarke
b6f29548db Fix log message 2022-06-30 20:42:37 +01:00
Daoud Clarke
e9835edc45 Wrap background tasks in try/except 2022-06-30 20:00:38 +01:00
Daoud Clarke
6ea3a95684 Allow batches to fail silently 2022-06-30 19:52:58 +01:00
Daoud Clarke
ddc8664c11 Queue the right type of batch 2022-06-29 22:52:12 +01:00
Daoud Clarke
2b52b50569 Queue new batches for indexing 2022-06-29 22:49:24 +01:00
Daoud Clarke
b8c495bda8 Correctly insert new URLs 2022-06-29 22:39:21 +01:00
Daoud Clarke
955d650cf4 Prevent deadlock when inserting URLs 2022-06-28 22:34:46 +01:00
Daoud Clarke
1457cba2c2 Cache batches; start a background process 2022-06-27 23:44:25 +01:00
Daoud Clarke
ff2312a5ca Use different scores for same domain links 2022-06-27 22:46:06 +01:00
Daoud Clarke
36b168a8f6 Fix logic in found URL logic SQL and allow crawling URLs crawled by one user for now 2022-06-26 21:23:57 +01:00
Daoud Clarke
5e1ec9ccd5 Temporarily disable startup background processes; add root domains; check for empty batches. 2022-06-26 21:15:52 +01:00
Daoud Clarke
e27d749e18 Investigate duplication of URLs in batches 2022-06-26 21:11:51 +01:00
Daoud Clarke
eb571fc5fe Add a script to count urls in the index 2022-06-21 21:55:38 +01:00
Daoud Clarke
1d9b5cb3ca Make more robust 2022-06-21 08:44:46 +01:00
Daoud Clarke
30e1e19072 Update queued pages in the index 2022-06-20 23:35:44 +01:00
Daoud Clarke
4330551e0f Tokenize documents and store pages to be added to the index 2022-06-20 22:54:35 +01:00
Daoud Clarke
9594915de1 WIP: index continuously. Retrieve batches and store in Postgres 2022-06-19 23:23:57 +01:00
Daoud Clarke
b8b605daed Factor out connection code 2022-06-19 16:52:25 +01:00
Daoud Clarke
c31cea710f CORS is handled by nginx 2022-06-19 13:13:36 +01:00
Daoud Clarke
96da534ca5 Don't add CORS on the python side 2022-06-19 11:34:54 +01:00
Daoud Clarke
9dbb724ba9 Use updated CORS settings 2022-06-19 11:31:55 +01:00
Daoud Clarke
e3baf87918 Remove seemingly extraneous backslashes 2022-06-19 11:27:37 +01:00
Daoud Clarke
c245be775b Use an updated template 2022-06-19 11:25:38 +01:00
Daoud Clarke
01772517da Remove problematic SSL_DIRECTIVES line 2022-06-19 11:23:01 +01:00
Daoud Clarke
a67ca7b298 Enable CORS in nginx 2022-06-19 11:16:03 +01:00
Daoud Clarke
866c17f2dc Use the dokku app storage 2022-06-19 09:53:19 +01:00
Daoud Clarke
16c2692099 Start processing historical data on startup 2022-06-19 08:56:55 +01:00
Daoud Clarke
d400950689 Add script to process historical data 2022-06-18 15:31:35 +01:00
Daoud Clarke
eb1c59990c Expose the port 2022-06-17 23:57:58 +01:00
Daoud Clarke
d7c6dcb5c2 Use the correct port for dokku 2022-06-17 23:54:22 +01:00
Daoud Clarke
77088a8a1b Use a database URL env var 2022-06-17 23:39:24 +01:00
Daoud Clarke
476481c5f8 Put the resources in the package 2022-06-17 23:32:43 +01:00
Daoud Clarke
505e7521d4 Copy the resources 2022-06-17 23:29:04 +01:00
Daoud Clarke
5ea9efcfa2 Fix relative path 2022-06-17 23:19:30 +01:00
Daoud Clarke
1c7420e5fb Don't depend on existing data 2022-06-17 23:12:22 +01:00