Commit graph

  • a8a6c67239 Use URL path to store locally so that we can easily get a local path from a URL Daoud Clarke 2022-07-20 22:21:35 +01:00
  • 0d1e7d841c Implement a batch cache to store files locally before preprocessing Daoud Clarke 2022-07-19 21:18:43 +01:00
  • 27a4784d08 Merge pull request #68 from mwmbl/fix-missing-query Daoud Clarke 2022-07-19 20:17:20 +01:00
  • 5ce333cc9a Log at info level #68 Daoud Clarke 2022-07-18 23:46:01 +01:00
  • a097ec9fbe Allow more tries so that popular terms can be indexed Daoud Clarke 2022-07-18 23:42:09 +01:00
  • cfca015efe Enough preprocessing Daoud Clarke 2022-07-18 22:36:37 +01:00
  • 003cd217f4 Run preprocessing Daoud Clarke 2022-07-18 22:21:20 +01:00
  • bcd31326b8 Just index a single page for now Daoud Clarke 2022-07-18 22:17:15 +01:00
  • a471bc2437 Use a more specific exception in case we're discarding ones we shouldn't Daoud Clarke 2022-07-18 22:05:24 +01:00
  • ce9f52267a Run update Daoud Clarke 2022-07-18 21:55:27 +01:00
  • 09a9390c92 Catch corrupt data Daoud Clarke 2022-07-18 21:40:38 +01:00
  • 93307ad1ec Add util script to send batch; add logging Daoud Clarke 2022-07-18 21:37:19 +01:00
  • 3c97fdb3a0 Merge pull request #66 from mwmbl/fix-unicode-encode-error Daoud Clarke 2022-07-16 10:59:14 +01:00
  • 680fe1ca0c Fix unicode encoding error #66 Daoud Clarke 2022-07-16 10:54:25 +01:00
  • e1e1b0057b Merge pull request #61 from milovanderlinden/issue-60-consistent-use-of-env-vars Daoud Clarke 2022-07-10 21:06:09 +01:00
  • fee5cbb400 10x index size Daoud Clarke 2022-07-10 17:15:10 +01:00
  • a93fbe9d66 Add CORS for local testing Daoud Clarke 2022-07-10 17:13:27 +01:00
  • dfd3f3962e Fix issue #60 #61 milovanderlinden 2022-07-10 11:10:03 +02:00
  • dba50b372f Don't include web.archive.org as a curated domain Daoud Clarke 2022-07-04 15:44:28 +01:00
  • 2e40ae1dca Merge pull request #58 from mwmbl/improve-ranking-for-root-domains Daoud Clarke 2022-07-03 22:10:55 +01:00
  • 43815c7322 Add a URL length penalty #58 Daoud Clarke 2022-07-03 22:10:02 +01:00
  • a3ff2f537f Score domain and path, weight components Daoud Clarke 2022-07-03 21:55:20 +01:00
  • 4b5df76ca5 Merge pull request #57 from mwmbl/clear-indexed-documents Daoud Clarke 2022-07-03 09:45:52 +01:00
  • 9482ae5028 Delete documents that have been preprocessed from the database to save space #57 Daoud Clarke 2022-07-03 09:44:51 +01:00
  • 6fa192daa4 Merge pull request #56 from mwmbl/allow-links-from-unknown-domains Daoud Clarke 2022-07-02 13:32:39 +01:00
  • f9fefa0b62 Record new batches as being local #56 Daoud Clarke 2022-07-02 13:25:31 +01:00
  • e578d55789 Allow crawling links from unknown domains Daoud Clarke 2022-07-01 21:35:34 +01:00
  • 4967830ae1 Merge pull request #55 from mwmbl/index-continuously Daoud Clarke 2022-07-01 20:55:24 +01:00
  • db1aa1a928 Don't require a slash for the search URL #55 Daoud Clarke 2022-07-01 20:43:38 +01:00
  • 24f82a3c2f Actually used the passed in timestamp Daoud Clarke 2022-06-30 20:57:01 +01:00
  • d47457b834 CONFIRMED no longer exists Daoud Clarke 2022-06-30 20:45:26 +01:00
  • b6f29548db Fix log message Daoud Clarke 2022-06-30 20:42:37 +01:00
  • e9835edc45 Wrap background tasks in try/except Daoud Clarke 2022-06-30 20:00:38 +01:00
  • 6ea3a95684 Allow batches to fail silently Daoud Clarke 2022-06-30 19:52:58 +01:00
  • ddc8664c11 Queue the right type of batch Daoud Clarke 2022-06-29 22:52:12 +01:00
  • 2b52b50569 Queue new batches for indexing Daoud Clarke 2022-06-29 22:49:24 +01:00
  • b8c495bda8 Correctly insert new URLs Daoud Clarke 2022-06-29 22:39:21 +01:00
  • 955d650cf4 Prevent deadlock when inserting URLs Daoud Clarke 2022-06-28 22:34:46 +01:00
  • 1457cba2c2 Cache batches; start a background process Daoud Clarke 2022-06-27 23:44:25 +01:00
  • ff2312a5ca Use different scores for same domain links Daoud Clarke 2022-06-27 22:46:06 +01:00
  • 36b168a8f6 Fix logic in found URL logic SQL and allow crawling URLs crawled by one user for now Daoud Clarke 2022-06-26 21:23:57 +01:00
  • 5e1ec9ccd5 Temporarily disable startup background processes; add root domains; check for empty batches. Daoud Clarke 2022-06-26 21:15:52 +01:00
  • e27d749e18 Investigate duplication of URLs in batches Daoud Clarke 2022-06-26 21:11:51 +01:00
  • eb571fc5fe Add a script to count urls in the index Daoud Clarke 2022-06-21 21:55:38 +01:00
  • 1d9b5cb3ca Make more robust Daoud Clarke 2022-06-21 08:44:46 +01:00
  • 30e1e19072 Update queued pages in the index Daoud Clarke 2022-06-20 23:35:44 +01:00
  • 4330551e0f Tokenize documents and store pages to be added to the index Daoud Clarke 2022-06-20 22:54:35 +01:00
  • 9594915de1 WIP: index continuously. Retrieve batches and store in Postgres Daoud Clarke 2022-06-19 23:23:57 +01:00
  • b8b605daed Factor out connection code Daoud Clarke 2022-06-19 16:52:25 +01:00
  • c31cea710f CORS is handled by nginx Daoud Clarke 2022-06-19 13:13:36 +01:00
  • 96da534ca5 Don't add CORS on the python side Daoud Clarke 2022-06-19 11:34:54 +01:00
  • 9dbb724ba9 Use updated CORS settings Daoud Clarke 2022-06-19 11:31:55 +01:00
  • e3baf87918 Remove seemingly extraneous backslashes #54 monolith Daoud Clarke 2022-06-19 11:27:37 +01:00
  • c245be775b Use an updated template Daoud Clarke 2022-06-19 11:25:38 +01:00
  • 01772517da Remove problematic SSL_DIRECTIVES line Daoud Clarke 2022-06-19 11:23:01 +01:00
  • a67ca7b298 Enable CORS in nginx Daoud Clarke 2022-06-19 11:16:03 +01:00
  • 866c17f2dc Use the dokku app storage Daoud Clarke 2022-06-19 09:53:19 +01:00
  • 16c2692099 Start processing historical data on startup Daoud Clarke 2022-06-19 08:56:55 +01:00
  • d400950689 Add script to process historical data Daoud Clarke 2022-06-18 15:31:35 +01:00
  • eb1c59990c Expose the port Daoud Clarke 2022-06-17 23:57:58 +01:00
  • d7c6dcb5c2 Use the correct port for dokku Daoud Clarke 2022-06-17 23:54:22 +01:00
  • 77088a8a1b Use a database URL env var Daoud Clarke 2022-06-17 23:39:24 +01:00
  • 476481c5f8 Put the resources in the package Daoud Clarke 2022-06-17 23:32:43 +01:00
  • 505e7521d4 Copy the resources Daoud Clarke 2022-06-17 23:29:04 +01:00
  • 5ea9efcfa2 Fix relative path Daoud Clarke 2022-06-17 23:19:30 +01:00
  • 1c7420e5fb Don't depend on existing data Daoud Clarke 2022-06-17 23:12:22 +01:00
  • a003914e91 Fix boto3 dependency Daoud Clarke 2022-06-17 22:14:55 +01:00
  • 363103468e Update Dockerfile for changes Daoud Clarke 2022-06-17 21:26:21 +01:00
  • e2eb405083 Combine crawler and search servers Daoud Clarke 2022-06-16 22:49:41 +01:00
  • 7771657684 Merge pull request #53 from mwmbl/record-historical-batches Daoud Clarke 2022-06-16 22:09:12 +01:00
  • 14107acc75 Use new server #53 Daoud Clarke 2022-06-09 22:24:54 +01:00
  • aaca8b2b6e Record historical batches via the API Daoud Clarke 2022-06-05 09:15:04 +01:00
  • 617666e3b7 Merge pull request #51 from mwmbl/learning-to-rank Daoud Clarke 2022-06-04 12:36:15 +01:00
  • 770b4b945b Refactor feature extraction #51 Daoud Clarke 2022-05-07 22:52:36 +01:00
  • 87d8b40cad Make order_results public Daoud Clarke 2022-05-06 23:15:50 +01:00
  • 229819e57e Refactor to allow LTR ranker Daoud Clarke 2022-03-27 22:32:44 +01:00
  • 94287cec01 Get features for each string separately Daoud Clarke 2022-03-21 21:49:10 +00:00
  • 4740d89c6a Add domain score feature Daoud Clarke 2022-03-21 21:13:20 +00:00
  • af6a28fac3 Implement learning to rank feature extraction and thresholding Daoud Clarke 2022-03-20 22:01:45 +00:00
  • 2d334074af Make get_results() public for learning to rank Daoud Clarke 2022-03-20 17:25:54 +00:00
  • ee5ca6bcf6 Experiment with score variations (best is simple weighted domain score) update-index Daoud Clarke 2022-02-27 21:24:16 +00:00
  • 6fb310c363 Use addition instead of multiplication Daoud Clarke 2022-02-25 22:19:26 +00:00
  • 4e6516ccf1 Scale by 0.99 Daoud Clarke 2022-02-25 22:14:49 +00:00
  • f5afbed2e5 Handle empty list Daoud Clarke 2022-02-25 22:11:09 +00:00
  • efafec5214 Rank using item score as well as match score Daoud Clarke 2022-02-25 22:08:37 +00:00
  • e1e9e404a3 Dedupe before indexing Daoud Clarke 2022-02-24 22:01:42 +00:00
  • f5b20d0128 Index link counts Daoud Clarke 2022-02-24 20:47:36 +00:00
  • b5b2005323 Store computed link counts Daoud Clarke 2022-02-23 22:13:38 +00:00
  • 00d18c3474 Remove unused code Daoud Clarke 2022-02-23 21:59:24 +00:00
  • d19e0e51f7 Merge pull request #47 from mwmbl/include-metadata-in-index Daoud Clarke 2022-02-23 21:10:24 +00:00
  • 04a33a134b Fixes to mwmbl API for changes to the index #47 Daoud Clarke 2022-02-22 22:27:02 +00:00
  • ae3b334a7f Fixes for API changes Daoud Clarke 2022-02-22 22:12:39 +00:00
  • 326f7e3d7f Use JSON instead of struct to store metadata Daoud Clarke 2022-02-18 22:22:47 +00:00
  • e6273c7f76 WIP: include metadata in index - using struct approach Daoud Clarke 2022-02-18 22:12:22 +00:00
  • 82c46b50bc Merge pull request #46 from mwmbl/refactor-for-evaluation Daoud Clarke 2022-02-16 21:28:21 +00:00
  • e03e379ccf Refactor to enable easier evaluation #46 Daoud Clarke 2022-02-09 22:43:47 +00:00
  • 4e36ee198c Merge pull request #42 from mwmbl/update-readme-for-new-crawler Daoud Clarke 2022-02-04 23:26:11 +00:00
  • c4e86ce313 Update readme for recent changes #42 update-readme-for-new-crawler Daoud Clarke 2022-02-04 22:07:09 +00:00
  • 51f2dd2690 Merge branch 'master' of github.com:mwmbl/mwmbl Daoud Clarke 2022-02-04 21:49:40 +00:00
  • 9f78d19c8c Merge pull request #41 from ColinEspinas/add-branding Daoud Clarke 2022-02-04 21:28:41 +00:00