Compare commits


362 commits

Author SHA1 Message Date
Daoud Clarke
cfe18162f1 Blacklist another domain 2023-11-21 11:24:48 +00:00
Daoud Clarke
b868b6284b Encode URLs properly 2023-11-21 10:45:50 +00:00
Daoud Clarke
c1489a27cf
Merge pull request #130 from mwmbl/fix-csrf-requirement
Use CSRF only for curation requests
2023-11-19 20:53:55 +00:00
Daoud Clarke
a2fd3d95d8 Use CSRF only for curation requests 2023-11-19 20:48:18 +00:00
Daoud Clarke
5874720801
Merge pull request #129 from mwmbl/allow-running-old-api
Allow running old API
2023-11-19 10:02:27 +00:00
Daoud Clarke
da787a67db Unused setting 2023-11-19 10:01:48 +00:00
Daoud Clarke
56ee43e730 Remove unused settings 2023-11-19 10:01:04 +00:00
Daoud Clarke
69f6a16cce Reinstate old API 2023-11-19 10:00:31 +00:00
Daoud Clarke
8c45b94aa6 Outdated settings file 2023-11-18 20:21:57 +00:00
Daoud Clarke
3c61f5818d Whitespace to allow git push 2023-11-18 20:15:39 +00:00
Daoud Clarke
a3cc316d15
Merge pull request #128 from mwmbl/beta
Allow users to curate search results
2023-11-18 20:14:50 +00:00
Daoud Clarke
36df016445
Merge pull request #127 from mwmbl/add-term-info-to-index
Add term info to index
2023-11-18 18:56:53 +00:00
Daoud Clarke
204304e18e Add term info to index 2023-11-18 18:49:41 +00:00
Daoud Clarke
a2b872008f Add a script to evaluate how much it costs to add the term to the index
Old sizes mean 33.3673 0.08148019988498635
New sizes mean 32.1322 0.07700185221489449
2023-11-16 17:42:18 +00:00
Daoud Clarke
8790d758a3
Merge pull request #126 from mwmbl/mobile-adjustments
Mobile adjustments
2023-11-11 21:03:03 +00:00
Daoud Clarke
7b19273fe2 Improve usability on mobile 2023-11-11 21:01:50 +00:00
Daoud Clarke
f62b8c3cf2 WIP: buttons overlap text 2023-11-11 10:35:06 +00:00
Daoud Clarke
619f7631d9 Hide branding on small screens 2023-11-10 13:54:43 +00:00
Daoud Clarke
5c29792d14 Check we have some elements before creating a sortable 2023-11-09 21:36:34 +00:00
Daoud Clarke
86ffcbb039 Don't allow reordering items on main page; fix adding an item 2023-11-09 21:31:44 +00:00
Daoud Clarke
6b8d0b0161 Update input value from URL 2023-11-09 14:16:28 +00:00
Daoud Clarke
4583304c86 Sort activity groups by most recent 2023-11-09 13:59:11 +00:00
Daoud Clarke
62187fa8ce Accidental change to dockerfile 2023-11-09 13:44:43 +00:00
Daoud Clarke
1b23076286 Specify manifest path in prod 2023-11-09 13:43:34 +00:00
Daoud Clarke
66e2da0c89
Merge pull request #125 from mwmbl/performance-improvements
Use sortable js instead of jquery
2023-11-09 10:51:04 +00:00
Daoud Clarke
7dd131df9f Use sortable js instead of jquery 2023-11-09 10:48:41 +00:00
Daoud Clarke
5f47f45ebc
Merge pull request #124 from mwmbl/tidy-beta
Tidy beta
2023-11-09 09:53:33 +00:00
Daoud Clarke
463a1178d0 Root now served by django 2023-11-08 13:08:22 +00:00
Daoud Clarke
c0605e6bf7 Redo footer 2023-11-08 12:49:17 +00:00
Daoud Clarke
dc2bd082cf Show activity list when there are no queries 2023-11-08 09:59:29 +00:00
Daoud Clarke
cce962e845 Add no results found 2023-11-08 08:59:22 +00:00
Daoud Clarke
6263e65bb9 Fix title on deleting query 2023-11-08 08:56:58 +00:00
Daoud Clarke
ae39eb98e9 Add logout button 2023-11-08 08:45:39 +00:00
Daoud Clarke
6d8facf977 Improve login and signup buttons 2023-11-07 21:39:17 +00:00
Daoud Clarke
d8a3e29282 Reinstate save 2023-11-07 21:06:44 +00:00
Daoud Clarke
b28ec062ca Fix search bar width; consisting capitalization 2023-11-07 20:49:25 +00:00
Daoud Clarke
787c36bcfe Make search bar compact 2023-11-07 20:40:14 +00:00
Daoud Clarke
28b326aedf Fix broken JS 2023-11-07 18:59:38 +00:00
Daoud Clarke
d7ad64b4e0 Build JS instead of html in vite 2023-11-06 20:10:54 +00:00
Daoud Clarke
e54f55ad8e Remove some unused code 2023-11-06 18:42:49 +00:00
Daoud Clarke
3678f0117a Handle no results in all cases 2023-11-06 18:27:22 +00:00
Daoud Clarke
9933a529c3 Work without JS 2023-11-06 18:09:54 +00:00
Daoud Clarke
e3371233b5 Use replace header instead of push 2023-11-05 21:52:09 +00:00
Daoud Clarke
8293a7afa4 Update query string 2023-11-05 21:45:13 +00:00
Daoud Clarke
5dbe792579 Update title in response 2023-11-05 21:20:48 +00:00
Daoud Clarke
932579ead3 Remove more unused code 2023-11-05 18:51:55 +00:00
Daoud Clarke
60b27d639e Missing chart.js dependency 2023-11-05 14:08:56 +00:00
Daoud Clarke
19a8c8ac79
Merge pull request #123 from mwmbl/use-htmx-for-search-results
Use htmx for search results
2023-11-05 13:24:34 +00:00
Daoud Clarke
f25f7057a2 Remove unused code 2023-10-30 16:44:15 +00:00
Daoud Clarke
6e39893bc1 Fix fetch url to return HTML instead of JSON 2023-10-30 16:39:58 +00:00
Daoud Clarke
fb27053295 WIP: implement search using htmx 2023-10-30 08:53:25 +00:00
Daoud Clarke
ff212d6e15 Add template for results; remove some unused code 2023-10-29 14:22:25 +00:00
Daoud Clarke
95f9c56ba6
Merge pull request #122 from mwmbl/login-ui
Login UI
2023-10-29 14:03:43 +00:00
Daoud Clarke
03293384aa Fix static files location 2023-10-28 18:23:46 +01:00
Daoud Clarke
372d780da7 No input 2023-10-28 16:45:14 +01:00
Daoud Clarke
03fd7e8d9d Collect static files 2023-10-28 16:38:56 +01:00
Daoud Clarke
b39264d131 Add default from email 2023-10-27 19:39:09 +01:00
Daoud Clarke
999e7d599b Add email settings for Sendgrid 2023-10-27 19:28:33 +01:00
Daoud Clarke
5be536274f Trusted origin should include schema 2023-10-27 13:26:47 +01:00
Daoud Clarke
aa8761eeb7 Set debug to true temporarily 2023-10-27 13:19:29 +01:00
Daoud Clarke
68e28d2e92 Trusted origins needed for CSRF (https://stackoverflow.com/a/70509982) 2023-10-27 08:30:08 +01:00
Daoud Clarke
4f0b1c44cb Static files hosted in different place 2023-10-27 07:26:35 +01:00
Daoud Clarke
949f66e2b4 Keep secret key private in prod 2023-10-27 07:01:06 +01:00
Daoud Clarke
0f1f1d64f4 Revert "Temp remove nginx config"
This reverts commit 191a0db758.
2023-10-27 06:44:04 +01:00
Daoud Clarke
191a0db758 Temp remove nginx config 2023-10-27 06:39:28 +01:00
Daoud Clarke
fa847d0bf9 Add beta.mwmbl.org as an allowed host 2023-10-27 06:30:48 +01:00
Daoud Clarke
d845b53429 Run migrate in main script 2023-10-27 06:18:26 +01:00
Daoud Clarke
36ec3ae4e5 Add database config 2023-10-26 17:32:46 +01:00
Daoud Clarke
911b243239 Fix curation 2023-10-25 23:12:41 +01:00
Daoud Clarke
4d823497a6 Add original curation front-end 2023-10-25 19:17:02 +01:00
Daoud Clarke
bb9e6aa4bd Implement curation API using Django Ninja 2023-10-25 16:39:42 +01:00
Daoud Clarke
bd017079d5 Add login using allauth 2023-10-24 10:32:06 +01:00
Daoud Clarke
9e9ef8c36c Add rewrite log 2023-10-22 10:31:32 +01:00
Daoud Clarke
0db4158317
Merge pull request #121 from mwmbl/allow-no-bg-in-prod
Add settings without background computation for prod
2023-10-19 14:34:13 +01:00
Daoud Clarke
b583e18d6a Add settings without background computation for prod 2023-10-19 14:33:10 +01:00
Daoud Clarke
4917b882d2 Exclude more spam sites 2023-10-18 17:02:45 +01:00
Daoud Clarke
78a9bfbb11 Filter out more spam domains 2023-10-17 22:05:53 +01:00
Daoud Clarke
8c7ddda7d9 Use blacklist on initialisation, add tests 2023-10-17 21:51:23 +01:00
Daoud Clarke
ce844b59ae
Merge pull request #120 from mwmbl/temporary-fix-seo-spam
Exclude domains with bad pattern
2023-10-17 17:47:00 +01:00
Daoud Clarke
b426fa3b7e Exclude domains with bad pattern 2023-10-17 17:45:26 +01:00
Daoud Clarke
f00eacf8aa
Merge pull request #119 from mwmbl/fix-nginx-config
Fix nginx config
2023-10-17 13:18:45 +01:00
Daoud Clarke
755c1362d0 Revert nginx.conf 2023-10-17 13:17:57 +01:00
Daoud Clarke
734f0abe5a Allow mwmbl.org 2023-10-15 16:47:23 +01:00
Daoud Clarke
a4ee3cb195 Use default config file 2023-10-15 16:43:36 +01:00
Daoud Clarke
66cdbd3d47 disable ipv6 2023-10-15 16:41:31 +01:00
Daoud Clarke
2310e32636 Try a default config 2023-10-15 16:30:33 +01:00
Daoud Clarke
f0d6a11ca1 Reenable 2023-10-15 16:21:49 +01:00
Daoud Clarke
cac78d8e70 Disable redirect 2023-10-15 16:06:43 +01:00
Daoud Clarke
d65875642e Revert "Try a different approach"
This reverts commit 0601bd526a.
2023-10-14 21:24:33 +01:00
Daoud Clarke
0601bd526a Try a different approach 2023-10-14 20:53:46 +01:00
Daoud Clarke
cc93c7ddc3 Redirect all http to https 2023-10-14 19:22:03 +01:00
Daoud Clarke
70c06b78da
Merge pull request #118 from mwmbl/new-api-url
Duplicate the API at /api/v1/
2023-10-14 11:44:37 +01:00
Daoud Clarke
f883ea9f7a Duplicate the API at /api/v1/ 2023-10-14 11:43:15 +01:00
Daoud Clarke
8f1484d381 Only match stats at the beginning 2023-10-13 21:08:24 +01:00
Daoud Clarke
597093bb0f More try files 2023-10-13 20:58:54 +01:00
Daoud Clarke
64c9aa120f Set the root 2023-10-13 20:54:31 +01:00
Daoud Clarke
1620322118 Use try_files instead of root 2023-10-13 20:41:39 +01:00
Daoud Clarke
a409c240e1 Missed semicolongs 2023-10-13 13:48:56 +01:00
Daoud Clarke
c30fd607d4 Special cases for home and stats pages 2023-10-13 13:41:17 +01:00
Daoud Clarke
973fe8198c There may not be files there, so use -rf 2023-10-12 22:12:58 +01:00
Daoud Clarke
b182e92b75 Extra comme 2023-10-12 22:06:57 +01:00
Daoud Clarke
bd496aeab2 Add the app.json in Dockerfile 2023-10-12 22:02:27 +01:00
Daoud Clarke
ef6b87e9cd Use a predeploy script instead 2023-10-12 21:44:34 +01:00
Daoud Clarke
64f0d0a04a typo 2023-10-12 21:32:16 +01:00
Daoud Clarke
09de4a918e Copy to the mounted volume in python code 2023-10-12 21:21:30 +01:00
Daoud Clarke
88c3437456
Merge pull request #117 from mwmbl/include-front-end
Include front end
2023-10-12 20:53:40 +01:00
Daoud Clarke
2e4456a338 Build the front end 2023-10-12 20:52:51 +01:00
Daoud Clarke
013c007bb0 Serve front end 2023-10-12 17:38:19 +01:00
Daoud Clarke
536487b3d2 Add front end 2023-10-12 17:17:42 +01:00
Daoud Clarke
0b380b50bf Move the static spec down 2023-10-11 21:15:23 +01:00
Daoud Clarke
040921836c Add static hosting 2023-10-11 21:09:58 +01:00
Daoud Clarke
8494b746e7 Add api.mwmbl.org to allowed hosts 2023-10-10 20:39:31 +01:00
Daoud Clarke
8f0061f0c3 Fix port 2023-10-10 20:35:30 +01:00
Daoud Clarke
c3fd328237
Merge pull request #116 from mwmbl/django-rewrite-fixes
Fix some paths, use prod settings in Dockerfile
2023-10-10 20:23:15 +01:00
Daoud Clarke
1227ae33c8 Run poetry lock 2023-10-10 20:21:37 +01:00
Daoud Clarke
c6d9e6ebb0 Fix some paths, use prod settings in Dockerfile 2023-10-10 20:18:43 +01:00
Daoud Clarke
213bdaa365
Merge pull request #115 from mwmbl/django-rewrite
Django rewrite
2023-10-10 16:25:36 +01:00
Daoud Clarke
918eaa8709 Rename django app to mwmbl 2023-10-10 13:51:06 +01:00
Daoud Clarke
fab5e5c782 Use different dev and prod settings 2023-10-08 21:42:04 +01:00
Daoud Clarke
a1d6fd8bb1 Start background processes 2023-10-08 21:20:32 +01:00
Daoud Clarke
b6fd27352b Add crawler router 2023-10-08 14:13:38 +01:00
Daoud Clarke
ed64ca6c91 Merge branch 'main' into django-rewrite 2023-10-07 19:19:34 +01:00
Daoud Clarke
d716cb347f
Merge pull request #114 from mwmbl/exclude-domains-by-keyword
Exclude domains by keyword
2023-10-04 20:20:51 +01:00
Daoud Clarke
41061a695b Add tests 2023-10-04 20:19:42 +01:00
Daoud Clarke
593c71f689 Exclude domains by keyword 2023-10-04 19:51:33 +01:00
Daoud Clarke
a77dc3eb4c
Merge pull request #113 from mwmbl/more-stats
Add more stats
2023-10-02 22:19:20 +01:00
Daoud Clarke
988f3fd2a9 Add more stats 2023-10-02 22:19:02 +01:00
Daoud Clarke
7c3aea5ca0 Temporary just select some URLs at random for initialization 2023-09-29 22:32:31 +01:00
Daoud Clarke
826d3d6ba9
Merge pull request #112 from mwmbl/stats
Stats
2023-09-29 21:49:28 +01:00
Daoud Clarke
ab527c4b58 Use stats manager from redis URL 2023-09-29 21:48:36 +01:00
Daoud Clarke
0d795b7c64 Fix bugs with date method 2023-09-29 21:27:32 +01:00
Daoud Clarke
e1bf423e69 Get stats 2023-09-29 13:58:26 +01:00
Daoud Clarke
a55a027107 Store stats in redis 2023-09-29 13:37:54 +01:00
Daoud Clarke
db658daa88 Store stats in redis 2023-09-28 17:48:29 +01:00
Daoud Clarke
86a6524f0a WIP add search API to Django 2023-09-24 08:09:18 +01:00
Daoud Clarke
177324353f Merge branch 'main' into django-rewrite 2023-09-23 12:56:22 +01:00
Daoud Clarke
bec00cdab5 Exclude additional domain 2023-09-22 23:06:04 +01:00
Daoud Clarke
7e054d0854 Better blacklist 2023-09-22 23:04:37 +01:00
Daoud Clarke
ed96386f05
Merge pull request #110 from mwmbl/update-blacklist
Exclude blacklisted domains
2023-09-22 21:54:12 +01:00
Daoud Clarke
019095a4c1 Exclude blacklisted domains 2023-09-22 21:53:53 +01:00
Daoud Clarke
19cc196e34 Add django ninja 2023-09-22 19:56:42 +01:00
Daoud Clarke
4aefc48716 Add django 2023-08-27 07:37:15 +01:00
Daoud Clarke
18dc760a34 Temp disable CORS 2023-05-20 23:23:43 +01:00
Daoud Clarke
01bf4c21df Temporarily disable lemmy as connection is refused 2023-05-20 22:26:33 +01:00
Daoud Clarke
8851d86ff4 Justext is not extra 2023-05-20 22:17:23 +01:00
Daoud Clarke
b5b37629ce Clean unicode when formatting result 2023-05-20 22:11:51 +01:00
Daoud Clarke
dec7c4853d Whitespace fix 2023-05-20 21:52:33 +01:00
Daoud Clarke
3e08c6e804 Check response status; provide an answer when registering 2023-05-20 21:51:57 +01:00
Daoud Clarke
60980a6bc7
Merge pull request #100 from mwmbl/user-registration
User registration
2023-04-30 20:31:09 +01:00
Daoud Clarke
8d64af4f1b Keep track of curated couments 2023-04-30 18:25:48 +01:00
Daoud Clarke
f0592f99df Require a curated boolean flag 2023-04-13 06:27:51 +01:00
Daoud Clarke
759dbf07b9 Revert index 2023-04-13 05:37:43 +01:00
Daoud Clarke
00b5438492 Track curated items in the index 2023-04-09 06:26:23 +01:00
Daoud Clarke
a87d3d6def Store curated pages in the index 2023-04-09 05:31:23 +01:00
Daoud Clarke
61cdd4dd71 Merge branch 'main' into user-registration 2023-04-01 07:17:29 +01:00
Daoud Clarke
3e1f5da28e Off by one error with page size 2023-04-01 06:40:03 +01:00
Daoud Clarke
91269d5100 Handle a bad batch 2023-04-01 06:35:44 +01:00
Rishabh Singh Ahluwalia
e9dfd40ecb
Merge pull request #98 from mwmbl/rishabh-fix-trim-data
Fix trimming page size logic while adding to a page
2023-03-28 08:18:53 -07:00
Rishabh Singh Ahluwalia
f232badd67 fix comma formatting 2023-03-27 22:18:10 -07:00
Rishabh Singh Ahluwalia
8e197a09f9 Fix trimming page size logic while adding to a page 2023-03-26 10:04:05 -07:00
Daoud Clarke
23688bd3ad Merge branch 'master' into user-registration 2023-03-18 22:37:45 +00:00
Daoud Clarke
0838157185
Merge pull request #97 from mwmbl/initialize-with-found-urls
Initialize with found urls
2023-02-25 18:20:11 +00:00
Daoud Clarke
7d0c55c015 Fix broken test 2023-02-25 18:18:09 +00:00
Daoud Clarke
e5c08e0d24 Fix big with other URLs 2023-02-25 16:48:59 +00:00
Daoud Clarke
a24156ce5c Initialize URLs by processing them like all other URLs to avoid bias 2023-02-25 13:45:03 +00:00
Daoud Clarke
6bb8bdf0c2 Initialize with new URLs 2023-02-25 10:48:22 +00:00
Daoud Clarke
a9e2b48840
Merge pull request #96 from mwmbl/unique-urls-in-queue
Unique URLs in queue
2023-02-25 10:35:32 +00:00
Daoud Clarke
5c94dfa669 Shuffle URLs before batching 2023-02-25 10:35:10 +00:00
Daoud Clarke
6ff62fb119 Ensure URLs in queue are unique 2023-02-25 10:34:09 +00:00
Daoud Clarke
c36e1dffcb Remove picolisp as a top domain since there are duplicate URLs 2023-02-25 09:56:26 +00:00
Daoud Clarke
362f9bfa9e Write page to the correct location (metadata size offset bug fix) 2023-02-24 21:46:18 +00:00
Daoud Clarke
5616626fc1
Merge pull request #89 from mwmbl/update-urls-queue-quickly
Update urls queue quickly
2023-02-24 21:39:40 +00:00
Daoud Clarke
bc6be8b6d5 Merge branch 'master' into update-urls-queue-quickly 2023-02-24 21:37:54 +00:00
Daoud Clarke
a03b76e5cc Fix broken test 2023-02-24 21:37:32 +00:00
Daoud Clarke
c97d946fcf Go back to processing 10,000 batches at a time 2023-02-24 21:29:42 +00:00
Rishabh Singh Ahluwalia
38a5dbbf3c
Merge pull request #94 from mwmbl/rishabh-port-configuration
Allow configuration of port
2023-02-23 07:31:07 -08:00
Rishabh Singh Ahluwalia
2aa61a5121
Merge pull request #95 from mwmbl/rishabh-unit-testing-with-ci
Add PyUnit dependency + Unit Tests for completer.py + Github Actions CI for running unit tests
2023-02-23 07:30:48 -08:00
Rishabh Singh Ahluwalia
30aff3b920 Add pytest, unit tests for completer,gh actions ci 2023-02-22 21:37:10 -08:00
Rishabh Singh Ahluwalia
842aec19e2 Add port to args 2023-02-22 19:59:42 -08:00
Daoud Clarke
50a059410b
Merge pull request #93 from mwmbl/add-code-of-conduct-1
Create CODE_OF_CONDUCT.md
2023-02-15 20:36:31 +00:00
Rishabh Singh Ahluwalia
084a870f65
Merge pull request #92 from mwmbl/rishabh-add-launch-json
Add launch.json for vscode run/debugging
2023-02-12 07:17:47 -08:00
Daoud Clarke
68ecdee145
Create CONTRIBUTING.md 2023-02-11 15:17:35 +00:00
Daoud Clarke
3a07fb54b5
Create CODE_OF_CONDUCT.md 2023-02-11 15:13:08 +00:00
Daoud Clarke
d8dbe54f9c
Update README.md 2023-02-11 15:10:30 +00:00
Daoud Clarke
2daf902ca3
Merge pull request #90 from mwmbl/m1-mmap-issue-fix-2
Offset by metadata size manually to increase compatibility
2023-02-11 08:30:46 +00:00
Rishabh Singh Ahluwalia
7fdc8480bd add launch.json for vscode debugging 2023-02-10 20:59:09 -08:00
Daoud Clarke
e890e56661 Offset by metadata size manually to increase compatibility 2023-02-05 15:49:09 +00:00
Daoud Clarke
5783cee6b7 Fix bugs 2023-01-24 22:52:58 +00:00
Daoud Clarke
77e39b4a89 Optimise URL update 2023-01-22 20:28:18 +00:00
Daoud Clarke
66700f8a3e Speed up domain parsing 2023-01-20 20:53:50 +00:00
Daoud Clarke
2b36f2ccc1 Try and balance URLs before adding to queue 2023-01-19 21:56:40 +00:00
Daoud Clarke
603fcd4eb2 Create a custom URL queue 2023-01-14 21:59:31 +00:00
Daoud Clarke
01f08fd88d Return updated URLs 2023-01-14 19:17:16 +00:00
Daoud Clarke
bd0cc3863e Don't try and update an empty list of URLs 2023-01-09 21:02:40 +00:00
Daoud Clarke
d347a17d63 Update URL queue separately from the other background process to speed it up 2023-01-09 20:50:28 +00:00
Daoud Clarke
7bd12c1ead Fix some bugs in URL fetching query 2023-01-02 20:51:23 +00:00
Daoud Clarke
a50f1d8ae3 Fix postgres install 2023-01-02 12:19:10 +00:00
Daoud Clarke
1ab16b1fb4 Install postgres client 2023-01-02 12:18:03 +00:00
Daoud Clarke
dda5a25ad0 Add core domains 2023-01-02 12:05:22 +00:00
Daoud Clarke
ab37bbe0a5 Exclude google plus 2023-01-01 22:18:47 +00:00
Daoud Clarke
2336ed7f7d Allow posting extra links with lower score weighting 2023-01-01 20:37:41 +00:00
Daoud Clarke
6edf48693b Check the domain is correct, potential bug in psql 2023-01-01 01:30:44 +00:00
Daoud Clarke
b7984684c9 Tidy, improve logging 2023-01-01 01:14:05 +00:00
Daoud Clarke
7c14cd99f8 Update the URL queue earlier 2022-12-31 23:37:59 +00:00
Daoud Clarke
0d33b4f68f
Merge pull request #86 from mwmbl/improve-crawling
Improve crawling
2022-12-31 22:56:21 +00:00
Daoud Clarke
a86e172bf3 Reinstate background tasks 2022-12-31 22:52:17 +00:00
Daoud Clarke
d9cd3c585b Get results from other domains 2022-12-31 22:51:00 +00:00
Daoud Clarke
77f08d8f0a Update URL status 2022-12-31 22:25:05 +00:00
Daoud Clarke
36af579f7c Sample domains 2022-12-31 17:04:38 +00:00
Daoud Clarke
ea16e7b5cd WIP: improve method of getting URLs for crawling 2022-12-31 13:37:40 +00:00
Daoud Clarke
7dae39b780 WIP: improve method of getting URLs for crawling 2022-12-31 13:32:15 +00:00
Daoud Clarke
c69108cfcc Don't delete an index if the sizes don't match 2022-12-27 10:52:46 +00:00
Daoud Clarke
bb8a36a612 Number of pages is an int 2022-12-27 10:40:53 +00:00
Daoud Clarke
c01129cdb9 Merge branch 'master' of github.com:mwmbl/mwmbl 2022-12-27 10:25:41 +00:00
Daoud Clarke
26351a1072 Use the correct storage location in prod 2022-12-27 10:24:48 +00:00
Daoud Clarke
f3f3831a97
Merge pull request #83 from omasanori/spacy-deps-rework
Rework installation of spaCy models for clarity
2022-12-27 10:20:52 +00:00
Masanori Ogino
71187a3938 Rework installation of spaCy models for clarity
- Install the wheel package for compatibility with future pip
- Use `spacy download` for installing model(s)
- Use `spacy validate` for checking model compatibility explicitly

Signed-off-by: Masanori Ogino <167209+omasanori@users.noreply.github.com>
2022-12-27 11:33:52 +09:00
Daoud Clarke
d85067ec09 Remove apt command 2022-12-24 20:20:53 +00:00
Daoud Clarke
1ef60e8d5d Put install in correct place 2022-12-24 20:18:02 +00:00
Daoud Clarke
8e613dd368 Install psql client 2022-12-24 20:13:53 +00:00
Daoud Clarke
80282cfc7a Exclude a domain 2022-12-24 19:59:56 +00:00
Daoud Clarke
8676abbc63 Format fetched url 2022-12-24 19:59:15 +00:00
Daoud Clarke
57295846cb
Update README.md 2022-12-21 21:49:56 +00:00
Daoud Clarke
0a4e1e4aee Add endpoint to fetch a URL and return title and extract 2022-12-21 21:15:34 +00:00
Daoud Clarke
c7571120cc Implement validation 2022-12-21 15:32:30 +00:00
Daoud Clarke
061462460b Separate out the curation to make it easier to store in a comment 2022-12-20 19:11:01 +00:00
Daoud Clarke
6cf27fa47f Fix serialisation issue 2022-12-19 23:19:32 +00:00
Daoud Clarke
b559a50506 Require the whole result 2022-12-19 22:18:28 +00:00
Daoud Clarke
5eab543f3b Merge branch 'master' into user-registration 2022-12-19 21:53:11 +00:00
Daoud Clarke
a88a1a3e95 Rename some parameters; return curation ID 2022-12-19 21:51:26 +00:00
Daoud Clarke
efc8e8e383
Merge pull request #78 from mwmbl/make-dev-easier
Make it easier to run mwmbl locally
2022-12-19 21:50:54 +00:00
Daoud Clarke
31c27daca4 Add curations 2022-12-11 18:48:25 +00:00
Daoud Clarke
f89e1d6043 Create a post when beginning curation 2022-12-10 23:45:10 +00:00
Daoud Clarke
eadb7f3e28 Follow a begin curate/update curation workflow 2022-12-10 22:49:06 +00:00
Daoud Clarke
f8ab6092b0 Suggest using dokku instead of docker directly 2022-12-08 22:33:58 +00:00
Daoud Clarke
8aa51e548b Allow login 2022-12-08 22:23:48 +00:00
Daoud Clarke
cf6ceedfd5 Actually allow registration 2022-12-07 22:56:20 +00:00
Daoud Clarke
a50bc28436 Make it easier to rum mwmbl locally 2022-12-07 20:01:31 +00:00
Daoud Clarke
d8d7149f4a Start to implement user registration using Lemmy as a back end 2022-12-06 22:36:38 +00:00
Daoud Clarke
c0f89ba6c3
Update matrix badge 2022-12-05 18:47:26 +00:00
Daoud Clarke
dd4dd8a752 Exclude an annoying web site 2022-12-02 21:29:06 +00:00
Daoud Clarke
40f9eade9a Update index name 2022-08-27 09:38:39 +01:00
Daoud Clarke
b6183e00ea
Merge pull request #74 from mwmbl/evaluate-indexing
Evaluate indexing
2022-08-27 09:37:22 +01:00
Daoud Clarke
cf253ae524 Split out URL updating from indexing 2022-08-26 22:20:35 +01:00
Daoud Clarke
f4fb9f831a Use terms and bigrams from the beginning of the string only 2022-08-26 17:20:11 +01:00
Daoud Clarke
619b6c3a93 Don't remove stopwords 2022-08-24 21:08:33 +01:00
Daoud Clarke
578b705609 Don't replace full stops and commas 2022-08-23 22:06:43 +01:00
Daoud Clarke
4779371cf3 Use a custom tokenizer 2022-08-23 21:57:38 +01:00
Daoud Clarke
b1eea2457f Script to index local batch for evaluation 2022-08-22 22:47:42 +01:00
Daoud Clarke
480be85cfd Fix bug in completions with duplicated terms 2022-08-14 22:03:50 +01:00
Daoud Clarke
f7660bcd27
Merge pull request #73 from mwmbl/completion
Completion
2022-08-13 23:55:22 +01:00
Daoud Clarke
627f82d19f Suggest searching Google if there are no search results 2022-08-13 23:54:57 +01:00
Daoud Clarke
f1c77d1389 Search google if there are no results 2022-08-13 23:47:48 +01:00
Daoud Clarke
fe5eff7b64 Exclude web.archive.org as we're only crawling that right now 2022-08-13 10:52:31 +01:00
Daoud Clarke
00705703f3 Require matching at least half the terms 2022-08-11 23:27:30 +01:00
Daoud Clarke
eda7870788 Restrict to https and strip the prefix and / on the end 2022-08-11 22:23:14 +01:00
Daoud Clarke
23e47e963b Simplify completions 2022-08-11 17:34:52 +01:00
Daoud Clarke
c6773b46c4
Merge pull request #72 from mwmbl/improve-ranking-with-multi-term-search
Improve ranking with multi term search
2022-08-10 21:43:51 +01:00
Daoud Clarke
74107667b4 Improve printing of search results in script 2022-08-10 21:43:13 +01:00
Daoud Clarke
3bcb7f42c1 Use heuristic ranker 2022-08-09 22:56:12 +01:00
Daoud Clarke
c1b9e70743 Add new LTR model 2022-08-09 22:47:59 +01:00
Daoud Clarke
57476ed2c8 Tweak features 2022-08-09 22:23:36 +01:00
Daoud Clarke
c99e813398 Get best-performing configuration 2022-08-09 20:56:15 +01:00
Daoud Clarke
8b50643303 Add in match score feature (although it hurts the results) 2022-08-09 00:08:55 +01:00
Daoud Clarke
c60b73a403 Create a get_features function and make it work like the heuristic approach 2022-08-08 23:42:34 +01:00
Daoud Clarke
c1d361c0a0 New LTR model trained on more data 2022-08-08 22:52:37 +01:00
Daoud Clarke
b99d9d1c6a Search for the term itself as well as its completion 2022-08-08 22:51:09 +01:00
Daoud Clarke
f40d82c449 Allow running with no background script 2022-08-01 23:33:02 +01:00
Daoud Clarke
046f86f7e3
Merge pull request #71 from mwmbl/fix-missing-scores
Store the best items, not the worst ones
2022-08-01 23:32:24 +01:00
Daoud Clarke
ae658906dd Store the best items, not the worst ones 2022-07-31 22:55:15 +01:00
Daoud Clarke
aa5878fd2f
Merge pull request #70 from mwmbl/reduce-new-batch-contention
Reduce new batch contention
2022-07-31 21:02:05 +01:00
Daoud Clarke
fc1742e24f Reinstate correct num_pages 2022-07-31 00:45:00 +01:00
Daoud Clarke
bb5186196f Use an in-memory queue 2022-07-31 00:43:58 +01:00
Daoud Clarke
62ba9ddc7e Use a randomised timeout for getting a new batch 2022-07-30 23:10:37 +01:00
Daoud Clarke
a54e093cf1
Merge pull request #69 from mwmbl/reduce-contention-for-client-queries
Reduce contention for client queries
2022-07-30 17:11:34 +01:00
Daoud Clarke
2942d83673 Get URL scores in batches 2022-07-30 14:35:21 +01:00
Daoud Clarke
3709cb236f Use correct index path; retrieve historical batches 2022-07-30 11:08:15 +01:00
Daoud Clarke
063ebb4504 args.index no longer exists 2022-07-30 10:57:15 +01:00
Daoud Clarke
ea32c0ba00 Double index size 2022-07-30 10:37:07 +01:00
Daoud Clarke
2d5235f6f6 More threads for retrieving batches 2022-07-30 10:10:11 +01:00
Daoud Clarke
218d873654 Delete unused SQL 2022-07-30 10:10:03 +01:00
Daoud Clarke
6209382d76 Index batches in memory 2022-07-24 15:44:01 +01:00
Daoud Clarke
1bceeae3df Implement new indexing approach 2022-07-23 23:19:36 +01:00
Daoud Clarke
a8a6c67239 Use URL path to store locally so that we can easily get a local path from a URL 2022-07-20 22:21:35 +01:00
Daoud Clarke
0d1e7d841c Implement a batch cache to store files locally before preprocessing 2022-07-19 21:18:43 +01:00
Daoud Clarke
27a4784d08
Merge pull request #68 from mwmbl/fix-missing-query
Fix missing query
2022-07-19 20:17:20 +01:00
Daoud Clarke
5ce333cc9a Log at info level 2022-07-18 23:46:01 +01:00
Daoud Clarke
a097ec9fbe Allow more tries so that popular terms can be indexed 2022-07-18 23:42:09 +01:00
Daoud Clarke
cfca015efe Enough preprocessing 2022-07-18 22:36:37 +01:00
Daoud Clarke
003cd217f4 Run preprocessing 2022-07-18 22:21:20 +01:00
Daoud Clarke
bcd31326b8 Just index a single page for now 2022-07-18 22:17:15 +01:00
Daoud Clarke
a471bc2437 Use a more specific exception in case we're discarding ones we shouldn't 2022-07-18 22:05:24 +01:00
Daoud Clarke
ce9f52267a Run update 2022-07-18 21:55:27 +01:00
Daoud Clarke
09a9390c92 Catch corrupt data 2022-07-18 21:40:38 +01:00
Daoud Clarke
93307ad1ec Add util script to send batch; add logging 2022-07-18 21:37:19 +01:00
Daoud Clarke
3c97fdb3a0
Merge pull request #66 from mwmbl/fix-unicode-encode-error
Fix unicode encode error; bigger index
2022-07-16 10:59:14 +01:00
Daoud Clarke
680fe1ca0c Fix unicode encoding error 2022-07-16 10:54:25 +01:00
Daoud Clarke
e1e1b0057b
Merge pull request #61 from milovanderlinden/issue-60-consistent-use-of-env-vars
Fix issue #60
2022-07-10 21:06:09 +01:00
Daoud Clarke
fee5cbb400 10x index size 2022-07-10 17:15:10 +01:00
milovanderlinden
dfd3f3962e Fix issue #60 2022-07-10 11:10:03 +02:00
Daoud Clarke
dba50b372f Don't include web.archive.org as a curated domain 2022-07-04 15:44:28 +01:00
Daoud Clarke
2e40ae1dca
Merge pull request #58 from mwmbl/improve-ranking-for-root-domains
Improve ranking for root domains
2022-07-03 22:10:55 +01:00
Daoud Clarke
43815c7322 Add a URL length penalty 2022-07-03 22:10:02 +01:00
Daoud Clarke
a3ff2f537f Score domain and path, weight components 2022-07-03 21:55:20 +01:00
Daoud Clarke
4b5df76ca5
Merge pull request #57 from mwmbl/clear-indexed-documents
Delete documents that have been preprocessed from the database to sav…
2022-07-03 09:45:52 +01:00
Daoud Clarke
9482ae5028 Delete documents that have been preprocessed from the database to save space 2022-07-03 09:44:51 +01:00
Daoud Clarke
6fa192daa4
Merge pull request #56 from mwmbl/allow-links-from-unknown-domains
Allow crawling links from unknown domains
2022-07-02 13:32:39 +01:00
Daoud Clarke
f9fefa0b62 Record new batches as being local 2022-07-02 13:25:31 +01:00
Daoud Clarke
e578d55789 Allow crawling links from unknown domains 2022-07-01 21:35:34 +01:00
Daoud Clarke
4967830ae1
Merge pull request #55 from mwmbl/index-continuously
Index continuously
2022-07-01 20:55:24 +01:00
Daoud Clarke
db1aa1a928 Don't require a slash for the search URL 2022-07-01 20:43:38 +01:00
Daoud Clarke
24f82a3c2f Actually used the passed in timestamp 2022-06-30 20:57:01 +01:00
Daoud Clarke
d47457b834 CONFIRMED no longer exists 2022-06-30 20:45:26 +01:00
Daoud Clarke
b6f29548db Fix log message 2022-06-30 20:42:37 +01:00
Daoud Clarke
e9835edc45 Wrap background tasks in try/except 2022-06-30 20:00:38 +01:00
Daoud Clarke
6ea3a95684 Allow batches to fail silently 2022-06-30 19:52:58 +01:00
Daoud Clarke
ddc8664c11 Queue the right type of batch 2022-06-29 22:52:12 +01:00
Daoud Clarke
2b52b50569 Queue new batches for indexing 2022-06-29 22:49:24 +01:00
Daoud Clarke
b8c495bda8 Correctly insert new URLs 2022-06-29 22:39:21 +01:00
Daoud Clarke
955d650cf4 Prevent deadlock when inserting URLs 2022-06-28 22:34:46 +01:00
Daoud Clarke
1457cba2c2 Cache batches; start a background process 2022-06-27 23:44:25 +01:00
Daoud Clarke
ff2312a5ca Use different scores for same domain links 2022-06-27 22:46:06 +01:00
Daoud Clarke
36b168a8f6 Fix logic in found URL logic SQL and allow crawling URLs crawled by one user for now 2022-06-26 21:23:57 +01:00
Daoud Clarke
5e1ec9ccd5 Temporarily disable startup background processes; add root domains; check for empty batches. 2022-06-26 21:15:52 +01:00
Daoud Clarke
e27d749e18 Investigate duplication of URLs in batches 2022-06-26 21:11:51 +01:00
Daoud Clarke
eb571fc5fe Add a script to count urls in the index 2022-06-21 21:55:38 +01:00
Daoud Clarke
1d9b5cb3ca Make more robust 2022-06-21 08:44:46 +01:00
Daoud Clarke
30e1e19072 Update queued pages in the index 2022-06-20 23:35:44 +01:00
Daoud Clarke
4330551e0f Tokenize documents and store pages to be added to the index 2022-06-20 22:54:35 +01:00
Daoud Clarke
9594915de1 WIP: index continuously. Retrieve batches and store in Postgres 2022-06-19 23:23:57 +01:00
Daoud Clarke
b8b605daed Factor out connection code 2022-06-19 16:52:25 +01:00
Daoud Clarke
c31cea710f CORS is handled by nginx 2022-06-19 13:13:36 +01:00
Daoud Clarke
96da534ca5 Don't add CORS on the python side 2022-06-19 11:34:54 +01:00
Daoud Clarke
9dbb724ba9 Use updated CORS settings 2022-06-19 11:31:55 +01:00
Daoud Clarke
e3baf87918 Remove seemingly extraneous backslashes 2022-06-19 11:27:37 +01:00
Daoud Clarke
c245be775b Use an updated template 2022-06-19 11:25:38 +01:00
Daoud Clarke
01772517da Remove problematic SSL_DIRECTIVES line 2022-06-19 11:23:01 +01:00
Daoud Clarke
a67ca7b298 Enable CORS in nginx 2022-06-19 11:16:03 +01:00
Daoud Clarke
866c17f2dc Use the dokku app storage 2022-06-19 09:53:19 +01:00
Daoud Clarke
16c2692099 Start processing historical data on startup 2022-06-19 08:56:55 +01:00
Daoud Clarke
d400950689 Add script to process historical data 2022-06-18 15:31:35 +01:00
Daoud Clarke
eb1c59990c Expose the port 2022-06-17 23:57:58 +01:00
Daoud Clarke
d7c6dcb5c2 Use the correct port for dokku 2022-06-17 23:54:22 +01:00
Daoud Clarke
77088a8a1b Use a database URL env var 2022-06-17 23:39:24 +01:00
Daoud Clarke
476481c5f8 Put the resources in the package 2022-06-17 23:32:43 +01:00
Daoud Clarke
505e7521d4 Copy the resources 2022-06-17 23:29:04 +01:00
Daoud Clarke
5ea9efcfa2 Fix relative path 2022-06-17 23:19:30 +01:00
Daoud Clarke
1c7420e5fb Don't depend on existing data 2022-06-17 23:12:22 +01:00
Daoud Clarke
a003914e91 Fix boto3 dependency 2022-06-17 22:14:55 +01:00
Daoud Clarke
363103468e Update Dockerfile for changes 2022-06-17 21:26:21 +01:00
Daoud Clarke
e2eb405083 Combine crawler and search servers 2022-06-16 22:49:41 +01:00
Daoud Clarke
7771657684
Merge pull request #53 from mwmbl/record-historical-batches
Record historical batches
2022-06-16 22:09:12 +01:00
Daoud Clarke
14107acc75 Use new server 2022-06-09 22:24:54 +01:00
Daoud Clarke
aaca8b2b6e Record historical batches via the API 2022-06-05 09:15:04 +01:00
Daoud Clarke
617666e3b7
Merge pull request #51 from mwmbl/learning-to-rank
Learning to rank
2022-06-04 12:36:15 +01:00
Daoud Clarke
770b4b945b Refactor feature extraction 2022-05-07 22:52:36 +01:00
Daoud Clarke
87d8b40cad Make order_results public 2022-05-06 23:15:50 +01:00
Daoud Clarke
229819e57e Refactor to allow LTR ranker 2022-03-27 22:32:44 +01:00
Daoud Clarke
94287cec01 Get features for each string separately 2022-03-21 21:49:10 +00:00
Daoud Clarke
4740d89c6a Add domain score feature 2022-03-21 21:13:20 +00:00
Daoud Clarke
af6a28fac3 Implement learning to rank feature extraction and thresholding 2022-03-20 22:01:45 +00:00
Daoud Clarke
2d334074af Make get_results() public for learning to rank 2022-03-20 17:25:54 +00:00
137 changed files with 208,954 additions and 1,524 deletions

.github/workflows/ci.yml (new file, 59 lines)

@@ -0,0 +1,59 @@
name: CI
on:
push:
branches: [main]
pull_request:
jobs:
test:
runs-on: ubuntu-latest
steps:
#----------------------------------------------
# check-out repo and set-up python
#----------------------------------------------
- name: Check out repository
uses: actions/checkout@v3
- name: Set up python
id: setup-python
uses: actions/setup-python@v4
with:
python-version: '3.10'
#----------------------------------------------
# ----- install & configure poetry -----
#----------------------------------------------
- name: Install Poetry
uses: snok/install-poetry@v1.3.3
with:
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
#----------------------------------------------
# load cached venv if cache exists
#----------------------------------------------
- name: Load cached venv
id: cached-poetry-dependencies
uses: actions/cache@v3
with:
path: .venv
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}
#----------------------------------------------
# install dependencies if cache does not exist
#----------------------------------------------
- name: Install dependencies
if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
run: poetry install --no-interaction --no-root
#----------------------------------------------
# install your root project, if required
#----------------------------------------------
- name: Install project
run: poetry install --no-interaction
#----------------------------------------------
# run test suite
#----------------------------------------------
- name: Run tests
env:
DJANGO_SETTINGS_MODULE: mwmbl.settings_dev
run: |
poetry run pytest
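
The workflow above installs dependencies with Poetry and runs pytest against the dev settings module. As a rough local equivalent (a sketch assuming the same mwmbl.settings_dev module and a working Poetry environment, not an officially documented command):

    poetry install --no-interaction
    DJANGO_SETTINGS_MODULE=mwmbl.settings_dev poetry run pytest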

.gitignore (1 line added)

@@ -17,6 +17,7 @@ __pycache__/
build/
develop-eggs/
dist/
front-end/dist/
downloads/
eggs/
.eggs/

.vscode/launch.json (new file, 15 lines)

@@ -0,0 +1,15 @@
{
"version": "0.2.0",
"configurations": [
{
"name": "mwmbl",
"type": "python",
"request": "launch",
"module": "mwmbl.main",
"python": "${workspaceFolder}/.venv/bin/python",
"stopOnEntry": false,
"console": "integratedTerminal",
"justMyCode": true
}
]
}

CODE_OF_CONDUCT.md (new file, 128 lines)

@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
https://matrix.to/#/#mwmbl:matrix.org.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.

CONTRIBUTING.md (new file, 6 lines)

@@ -0,0 +1,6 @@
Contributions are very welcome!
Please join the discussion at https://matrix.to/#/#mwmbl:matrix.org and let us know what you're planning to do.
See https://book.mwmbl.org/page/developers/ for a guide to development.

Dockerfile

@@ -1,3 +1,10 @@
FROM node:hydrogen-bullseye as front-end
COPY front-end /front-end
WORKDIR /front-end
RUN npm install && npm run build
FROM python:3.10.2-bullseye as base
ENV PYTHONFAULTHANDLER=1 \
@@ -13,6 +20,7 @@ ENV PIP_DEFAULT_TIMEOUT=100 \
PIP_NO_CACHE_DIR=1 \
POETRY_VERSION=1.1.12
# Create a /venv directory & environment.
# This directory will be copied into the final stage of docker build.
RUN python -m venv /venv
@@ -25,19 +33,28 @@ COPY mwmbl /app/mwmbl
# Use pip to install the mwmbl python package
# PEP 518, PEP 517 and others have allowed for a standardized python packaging API, which allows
# pip to be able to install poetry packages.
RUN /venv/bin/pip install pip --upgrade && \
/venv/bin/pip install .
# en-core-web-sm requires a compatible version of spacy
RUN /venv/bin/pip install pip wheel --upgrade && \
/venv/bin/pip install . && \
/venv/bin/python -m spacy download en_core_web_sm-3.2.0 --direct && \
/venv/bin/python -m spacy validate
FROM base as final
RUN apt-get update && apt-get install -y postgresql-client
# Copy only the required /venv directory from the builder image that contains mwmbl and its dependencies
COPY --from=builder /venv /venv
# Working directory is /app
# Copying data and config into /app so that relative (default) paths in the config work
COPY data /app/data
COPY config /app/config
# Copy the front end build
COPY --from=front-end /front-end/dist /front-end-build
# Using the mwmbl-tinysearchengine binary/entrypoint which comes packaged with mwmbl
# TODO: fix the arguments for the recent changes
CMD ["/venv/bin/mwmbl-tinysearchengine", "--config", "config/tinysearchengine.yaml"]
ADD nginx.conf.sigil /app
# ADD app.json /app
# Set up a volume where the data will live
VOLUME ["/data"]
EXPOSE 5000
CMD ["/venv/bin/mwmbl-tinysearchengine"]

README.md

@@ -2,18 +2,19 @@
# Mwmbl - **No ads, no tracking, no cruft, no profit**
[![Matrix](https://img.shields.io/matrix/mwmbl:matrix.org?color=blue&label=Matrix&style=for-the-badge)](https://matrix.to/#/#mwmbl:matrix.org)
[![Matrix](https://img.shields.io/matrix/mwmbl:matrix.org)](https://matrix.to/#/#mwmbl:matrix.org)
Mwmbl is a non-profit, ad-free, free-libre and free-lunch search
engine with a focus on useability and speed. At the moment it is
little more than an idea together with a [proof of concept
implementation](https://mwmbl.org/) of
the web front-end and search technology on a very small index. A
crawler is still to be implemented.
the web front-end and search technology on a small index.
Our vision is a community working to provide top quality search
particularly for hackers, funded purely by donations.
![mwmbl](https://user-images.githubusercontent.com/1283077/218265959-be4220b4-dcf0-47ab-acd3-f06df0883b52.gif)
Crawling
========
@@ -119,16 +120,29 @@ author (email address is in the git commit history).
Development
===========
### Using Docker
1. Create a new folder called `data` in the root of the repository
2. Download the [index file](https://storage.googleapis.com/mwmbl/index.tinysearch) and place it the new data folder
3. Run `$ docker build . -t mwmbl`
4. Run `$ docker run -p 8080:8080 mwmbl`
### Local Testing
1. Create and activate a python (3.10) environment using any tool you like e.g. poetry,venv, conda etc.
2. Run `$ pip install .`
3. Run `$ mwmbl-tinysearchengine --config config/tinysearchengine.yaml`
This will run against a local test database without running background
tasks to update batches etc.
This is the simplest way to configure postgres, but you can set it up
how you like as long as the `DATABASE_URL` you give is correct for
your configuration.
1. Install postgres and create a user for your current username
2. Install [poetry](https://python-poetry.org/docs/#installation)
3. Run `poetry install` to install dependencies
4. Run `poetry shell` in the root directory to enter the virtual environment
5. Run `$ DATABASE_URL="postgres://username@" python -m mwmbl.main` replacing "username" with your username.
### Using Dokku
Note: this method is not recommended as it is more involved, and your index will not have any data in it unless you
set up a crawler to crawl to your server. You will need to set up your own Backblaze or S3 equivalent storage, or
have access to the production keys, which we probably won't give you.
Follow the [deployment instructions](https://github.com/mwmbl/mwmbl/wiki/Deployment)
Frequently Asked Question
=========================

analyse/add_term_info.py (new file, 51 lines)

@@ -0,0 +1,51 @@
"""
Investigate adding term information to the database.
How much extra space will it take?
"""
import os
from pathlib import Path
from random import Random
import numpy as np
from scipy.stats import sem
from mwmbl.tinysearchengine.indexer import TinyIndex, Document, _trim_items_to_page, astuple
from zstandard import ZstdCompressor
from mwmbl.utils import add_term_info
random = Random(1)
INDEX_PATH = Path(__file__).parent.parent / "devdata" / "index-v2.tinysearch"
def run():
compressor = ZstdCompressor()
with TinyIndex(Document, INDEX_PATH) as index:
# Get some random integers between 0 and index.num_pages:
pages = random.sample(range(index.num_pages), 10000)
old_sizes = []
new_sizes = []
for i in pages:
page = index.get_page(i)
term_documents = []
for document in page:
term_document = add_term_info(document, index, i)
term_documents.append(term_document)
value_tuples = [astuple(value) for value in term_documents]
num_fitting, compressed = _trim_items_to_page(compressor, index.page_size, value_tuples)
new_sizes.append(num_fitting)
old_sizes.append(len(page))
print("Old sizes mean", np.mean(old_sizes), sem(old_sizes))
print("New sizes mean", np.mean(new_sizes), sem(new_sizes))
if __name__ == '__main__':
run()

@@ -7,15 +7,23 @@ import json
from collections import defaultdict, Counter
from urllib.parse import urlparse
from mwmbl.indexer.paths import CRAWL_GLOB
from mwmbl.crawler import HashedBatch
from mwmbl.indexer import CRAWL_GLOB, MWMBL_DATA_DIR
# TODO: remove this line - temporary override
CRAWL_GLOB = str(MWMBL_DATA_DIR / "b2") + "/*/*/2022-06-23/*/*/*.json.gz"
def get_urls():
for path in glob.glob(CRAWL_GLOB):
data = json.load(gzip.open(path))
user = data['user_id_hash']
for item in data['items']:
yield user, item['url']
batch = HashedBatch.parse_obj(data)
user = batch.user_id_hash
for item in batch.items:
if item.content is not None:
for url in item.content.links:
yield user, url
def analyse_urls(urls):

@@ -1,7 +1,7 @@
import json
from mwmbl.indexer.paths import TOP_DOMAINS_JSON_PATH
from mwmbl.tinysearchengine.hn_top_domains_filtered import DOMAINS
from mwmbl.indexer import TOP_DOMAINS_JSON_PATH
from mwmbl.hn_top_domains_filtered import DOMAINS
def export_top_domains_to_json():

@@ -3,8 +3,8 @@ Export the list of unique URLs to a SQLite file for analysis/evaluation.
"""
import sqlite3
from mwmbl.indexer.paths import URLS_PATH
from mwmbl.tinysearchengine.app import get_config_and_index
from mwmbl.indexer import URLS_PATH
from mwmbl.app import get_config_and_index
def create_database():

@@ -0,0 +1,19 @@
"""
Count unique URLs in the index.
"""
from mwmbl.tinysearchengine import TinyIndex, Document
def run():
urls = set()
with TinyIndex(Document, 'data/index.tinysearch') as index:
for i in range(index.num_pages):
print("Page", i)
page = index.get_page(i)
new_urls = {doc.url for doc in page}
urls |= new_urls
print("URLs", len(urls))
if __name__ == '__main__':
run()

@@ -1,20 +0,0 @@
from mwmbl.tinysearchengine.indexer import TinyIndex, NUM_PAGES, PAGE_SIZE, Document
from mwmbl.indexer.paths import INDEX_PATH
def get_items():
tiny_index = TinyIndex(Document, INDEX_PATH, NUM_PAGES, PAGE_SIZE)
items = tiny_index.retrieve('soup')
if items:
for item in items:
print("Items", item)
def run():
tiny_index = TinyIndex(Document, INDEX_PATH, NUM_PAGES, PAGE_SIZE)
for i in range(100):
tiny_index.get_page(i)
if __name__ == '__main__':
run()

@@ -0,0 +1,4 @@
"""
Analyse recent batches looking for duplicates.
"""

@@ -0,0 +1,55 @@
"""
See how many unique URLs and root domains we have crawled.
"""
import glob
import gzip
import json
import requests
from mwmbl.indexer import CRAWL_GLOB
API_ENDPOINT = "http://95.216.215.29/batches/historical"
def total_num_batches():
return len(glob.glob(CRAWL_GLOB))
def get_batches():
for path in sorted(glob.glob(CRAWL_GLOB)):
hashed_batch = json.load(gzip.open(path))
yield hashed_batch
def convert_item(item):
return {
'url': item['url'],
'status': 200,
'timestamp': item['timestamp'],
'content': {
'title': item['title'],
'extract': item['extract'],
'links': item['links'],
}
}
def run():
total_batches = total_num_batches()
batches = get_batches()
for i, hashed_batch in enumerate(batches):
new_batch = {
'user_id_hash': hashed_batch['user_id_hash'],
'timestamp': hashed_batch['timestamp'],
'items': [convert_item(item) for item in hashed_batch['items']]
}
response = requests.post(API_ENDPOINT, json=new_batch)
print(f"Response {i} of {total_batches}", response)
if __name__ == '__main__':
run()

analyse/search.py (new file, 32 lines)

@@ -0,0 +1,32 @@
import logging
import sys
from itertools import islice
from mwmbl.indexer import INDEX_PATH
from mwmbl.tinysearchengine.completer import Completer
from mwmbl.tinysearchengine import TinyIndex, Document
from mwmbl.tinysearchengine.rank import HeuristicRanker
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
def clean(sequence):
return ''.join(x['value'] for x in sequence)
def run():
with TinyIndex(Document, INDEX_PATH) as tiny_index:
completer = Completer()
ranker = HeuristicRanker(tiny_index, completer)
items = ranker.search('jasper fforde')
print()
if items:
for i, item in enumerate(islice(items, 10)):
print(f"{i + 1}. {item['url']}")
print(clean(item['title']))
print(clean(item['extract']))
print()
if __name__ == '__main__':
run()

analyse/send_batch.py (new file, 27 lines)

@@ -0,0 +1,27 @@
"""
Send a batch to a running instance.
"""
import requests
from mwmbl.crawler import Batch, Item, ItemContent
URL = 'http://localhost:5000/crawler/batches/'
def run():
batch = Batch(user_id='test_user_id111111111111111111111111', items=[Item(
url='https://www.theguardian.com/stage/2007/nov/18/theatre',
content=ItemContent(
title='A nation in search of the new black | Theatre | The Guardian',
extract="Topic-stuffed and talk-filled, Kwame Kwei-Armah's new play proves that issue-driven drama is (despite reports of its death) still being written and staged…",
links=[]),
timestamp=123456,
status=200,
)])
result = requests.post(URL, data=batch.json())
print("Result", result.content)
if __name__ == '__main__':
run()

analyse/update_urls.py (new file, 26 lines)

@@ -0,0 +1,26 @@
import os
import pickle
from datetime import datetime
from pathlib import Path
from queue import Queue
from mwmbl.indexer import record_urls_in_database
def run_update_urls_on_fixed_batches():
with open(Path(os.environ["HOME"]) / "data" / "mwmbl" / "hashed-batches.pickle", "rb") as file:
batches = pickle.load(file)
# print("Batches", batches[:3])
queue = Queue()
start = datetime.now()
record_urls_in_database(batches, queue)
total_time = (datetime.now() - start).total_seconds()
print("Total time:", total_time)
if __name__ == '__main__':
run_update_urls_on_fixed_batches()

analyse/url_queue.py (new file, 35 lines)

@@ -0,0 +1,35 @@
import logging
import os
import pickle
import sys
from datetime import datetime
from pathlib import Path
from queue import Queue
from mwmbl.url_queue import URLQueue
FORMAT = '%(levelname)s %(name)s %(asctime)s %(message)s'
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, format=FORMAT)
def run_url_queue():
data = pickle.load(open(Path(os.environ["HOME"]) / "data" / "mwmbl" / "found-urls.pickle", "rb"))
print("First URLs", [x.url for x in data[:1000]])
new_item_queue = Queue()
queued_batches = Queue()
queue = URLQueue(new_item_queue, queued_batches)
new_item_queue.put(data)
start = datetime.now()
queue.update()
total_time = (datetime.now() - start).total_seconds()
print(f"Total time: {total_time}")
if __name__ == '__main__':
run_url_queue()

app.json (new file, 7 lines)

@@ -0,0 +1,7 @@
{
"scripts": {
"dokku": {
"predeploy": "rm -rf /app/static/* && cp -r /front-end-build/* /app/static/"
}
}
}

devdata/index-v2.tinysearch (new binary file, contents not shown)

@@ -0,0 +1,341 @@
body {
display: flex;
flex-direction: column;
overflow-y: scroll;
background-color: var(--light-color);
min-height: 100vh;
height: fit-content;
padding-top: 25px;
transition: padding 300ms ease;
}
@media (prefers-reduced-motion) {
body {
transition: none;
}
}
.branding {
display: flex;
align-items: center;
margin: 25px;
}
@media screen and (max-width: 600px) {
.branding {
display: none;
}
}
.brand-title {
text-align: center;
font-weight: var(--black-font-weight);
font-size: 1.5rem;
margin: 10px 15px 10px 10px;
}
.brand-icon {
height: 2.5rem;
}
.search-menu {
position: sticky;
top: 0;
display: flex;
flex-direction: column;
align-items: center;
max-width: 800px;
margin: 0 auto;
width: 100%;
padding: 10px;
background-color: rgba(248, 248, 248, .9);
z-index: 10;
}
.search-menu.compact {
flex-direction: row;
}
.search-menu.compact .branding {
margin: 0 25px 0 0;
}
.search-menu.compact .brand-title {
font-size: 1.2rem;
}
.search-menu.compact .brand-icon {
height: 2rem;
}
.search-bar {
position: relative;
width: 100%;
}
.search-bar-input {
background-color: var(--gray-color);
border: none;
padding: 15px 15px 15px 50px;
border-radius: 10px;
outline: none;
font-size: var(--default-font-size);
width: 100%;
font-weight: var(--bold-font-weight);
box-shadow: 0 0 0 0 var(--primary-color);
transition:
box-shadow 200ms ease-in-out;
}
.search-bar-input::placeholder {
color: var(--dark-color);
opacity: .3;
}
.search-bar-input:focus {
box-shadow: 0 0 0 0.2rem var(--primary-color);
}
.search-bar i {
position: absolute;
top: 50%;
left: 15px;
transform: translateY(-50%);
color: var(--dark-color);
opacity: .3;
font-size: 1.5rem;
pointer-events: none;
}
.main, footer {
display: block;
max-width: 800px;
width: 100%;
margin: 0 auto;
}
.results {
max-width: 100%;
list-style-type: none;
padding: 0;
}
.result {
min-height: 120px;
}
.result-container {
text-decoration: none;
color: var(--dark-color);
padding: 15px;
border-radius: 10px;
outline: 3px solid transparent;
outline-offset: 3px;
transition:
background-color 200ms ease-in-out,
outline 100ms ease-in-out;
}
.result-container:hover,.result-container:focus {
background-color: var(--gray-color);
}
.result-container:focus {
outline: 3px solid var(--primary-color);
}
.result .link {
font-size: .9rem;
}
.result .title, .result .title>* {
color: var(--primary-color);
font-size: 1.1rem;
}
.result .extract {
opacity: .8;
font-size: .9rem;
}
.empty-result, .home {
text-align: center;
opacity: .5;
font-weight: var(--bold-font-weight);
}
.footer {
position: sticky;
top: 100vh;
margin-bottom: 25px;
padding: 10px;
}
.footer-text {
text-align: center;
opacity: .5;
font-weight: var(--bold-font-weight);
margin-bottom: 10px;
}
.footer-list {
list-style-type: none;
padding: 0;
margin: 0;
display: flex;
justify-content: center;
gap: 10px;
}
.footer-link {
display: flex;
align-items: center;
text-decoration: none;
padding: 10px;
color: var(--dark-color);
border-radius: 10px;
background-color: var(--gray-color);
box-shadow: 0 0 0 0 var(--primary-color);
transition:
box-shadow 200ms ease-in-out;
}
.footer-link:hover {
box-shadow: 0 0 0 0.2rem var(--dark-color);
}
.footer-link i {
font-size: 1.2rem;
margin-right: 5px;
color: inherit;
}
.footer-link>span {
color: inherit;
font-size: var(--default-font-size);
font-weight: var(--bold-font-weight);
}
@media screen and (min-width:576px) {
.brand-title {
margin: 0 25px 0 15px;
}
}
.noscript {
display: flex;
flex-direction: column;
height: calc(100vh - 25px);
width: 100%;
justify-content: center;
align-items: center;
}
a {
font-weight: var(--bold-font-weight);
color: var(--primary-color);
text-decoration: none;
}
.curation-buttons {
display: grid;
grid-auto-flow: column;
grid-column-gap: 20px;
grid-auto-columns: max-content;
}
.result-container .button {
background-color: var(--dark-gray-color);
color: white;
padding: 5px 10px;
margin: 0;
font-size: var(--small-font-size);
font-weight: var(--bold-font-weight);
}
.validated {
background-color: green !important;
}
.modal {
/*display: none; !* Hidden by default *!*/
position: fixed; /* Stay in place */
z-index: 100; /* Sit on top */
left: 0;
top: 0;
width: 100%; /* Full width */
height: 100%; /* Full height */
overflow: auto; /* Enable scroll if needed */
background-color: rgb(0,0,0); /* Fallback color */
background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
}
/* Modal Content/Box */
.modal-content {
background-color: #fefefe;
margin: 15% auto; /* 15% from the top and centered */
padding: 20px;
border: 1px solid #888;
max-width: 800px;
width: 80%; /* Could be more or less, depending on screen size */
}
/* The Close Button */
.close {
color: #aaa;
float: right;
font-size: 28px;
font-weight: bold;
}
.close:hover,
.close:focus {
color: black;
text-decoration: none;
cursor: pointer;
}
.button {
background-color: var(--primary-color);
border: none;
color: white;
padding: 10px 20px;
margin: 10px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: var(--default-font-size);
border-radius: 50px;
cursor: pointer;
flex-shrink: 0;
transition: background-color 200ms ease-in-out;
}
@media screen and (max-width: 600px) {
.button {
padding: 5px 10px;
font-size: var(--small-font-size);
margin: 5px;
}
}
.button:hover {
background-color: var(--dark-color);
}
.login-info {
padding: 10px;
}
/* Sortable styling is not working in HTML 5 yet */
/*.sortable-drag {*/
/* opacity: 1.0;*/
/*}*/
/*.sortable-ghost {*/
/* opacity: 1.0;*/
/*}*/
/*.sortable-chosen {*/
/* opacity: 0;*/
/*}*/

View file

@ -0,0 +1,33 @@
/*
Josh's Custom CSS Reset
https://www.joshwcomeau.com/css/custom-css-reset/
*/
*, *::before, *::after {
box-sizing: border-box;
font-family: var(--regular-font);
color: var(--dark-color);
font-size: var(--default-font-size);
}
* {
margin: 0;
}
html, body {
height: 100%;
}
body {
line-height: 1.5;
-webkit-font-smoothing: antialiased;
}
img, picture, video, canvas, svg {
display: block;
max-width: 100%;
}
input, button, textarea, select {
font: inherit;
}
p, h1, h2, h3, h4, h5, h6 {
overflow-wrap: break-word;
}
#root, #__next {
isolation: isolate;
}

View file

@ -0,0 +1,20 @@
:root {
/* This is the theme file; use it to define theme variables. */
/* Colors: */
--dark-color: #0A1931;
--primary-color: #185ADB;
--gray-color: #EEEEEE;
--light-color: #F8F8F8;
--dark-gray-color: #767676;
/* Fonts: */
--regular-font: 'Inter', sans-serif;
--small-font-size: 12px;
--default-font-size: 16px;
--default-font-weight: 400;
--bold-font-weight: 700;
--black-font-weight: 900;
}

Binary files not shown.

View file

@ -0,0 +1,50 @@
@font-face {
font-family: 'Inter';
font-style: normal;
font-weight: 400;
font-display: swap;
src: url("Inter-Regular.woff2?v=3.19") format("woff2"),
url("Inter-Regular.woff?v=3.19") format("woff");
}
@font-face {
font-family: 'Inter';
font-style: italic;
font-weight: 400;
font-display: swap;
src: url("Inter-Italic.woff2?v=3.19") format("woff2"),
url("Inter-Italic.woff?v=3.19") format("woff");
}
@font-face {
font-family: 'Inter';
font-style: normal;
font-weight: 700;
font-display: swap;
src: url("Inter-Bold.woff2?v=3.19") format("woff2"),
url("Inter-Bold.woff?v=3.19") format("woff");
}
@font-face {
font-family: 'Inter';
font-style: italic;
font-weight: 700;
font-display: swap;
src: url("Inter-BoldItalic.woff2?v=3.19") format("woff2"),
url("Inter-BoldItalic.woff?v=3.19") format("woff");
}
@font-face {
font-family: 'Inter';
font-style: normal;
font-weight: 900;
font-display: swap;
src: url("Inter-Black.woff2?v=3.19") format("woff2"),
url("Inter-Black.woff?v=3.19") format("woff");
}
@font-face {
font-family: 'Inter';
font-style: italic;
font-weight: 900;
font-display: swap;
src: url("Inter-BlackItalic.woff2?v=3.19") format("woff2"),
url("Inter-BlackItalic.woff?v=3.19") format("woff");
}

Binary file not shown.

View file

@ -0,0 +1,122 @@
/*--------------------------------
Phosphor Web Font
-------------------------------- */
@font-face {
font-family: 'Phosphor';
src: url("Phosphor.woff2") format("woff2");
font-weight: normal;
font-style: normal;
font-display: swap;
}
/*------------------------
base class definition
-------------------------*/
[class^="ph-"],
[class*=" ph-"] {
display: inline-flex;
}
[class^="ph-"]:before,
[class*=" ph-"]:before {
font: normal normal normal 1em/1 "Phosphor";
color: inherit;
flex-shrink: 0;
speak: none;
text-transform: none;
text-decoration: inherit;
text-align: center;
/* Better Font Rendering */
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}
/*------------------------
change icon size
-------------------------*/
/* relative units */
.ph-xxs {
font-size: 0.5em;
}
.ph-xs {
font-size: 0.75em;
}
.ph-sm {
font-size: 0.875em;
}
.ph-lg {
font-size: 1.3333em;
line-height: 0.75em;
vertical-align: -0.0667em;
}
.ph-xl {
font-size: 1.5em;
line-height: 0.6666em;
vertical-align: -0.075em;
}
.ph-1x {
font-size: 1em;
}
.ph-2x {
font-size: 2em;
}
.ph-3x {
font-size: 3em;
}
.ph-4x {
font-size: 4em;
}
.ph-5x {
font-size: 5em;
}
.ph-6x {
font-size: 6em;
}
.ph-7x {
font-size: 7em;
}
.ph-8x {
font-size: 8em;
}
.ph-9x {
font-size: 9em;
}
.ph-10x {
font-size: 10em;
}
.ph-fw {
text-align: center;
width: 1.25em;
}
/*------------------------
icons (to add an icon you want to use,
copy it from the unused.css file)
-------------------------*/
.ph-magnifying-glass-bold::before {
content: "\f8bf";
}
.ph-github-logo-bold::before {
content: "\f852";
}
.ph-info-bold::before {
content: "\f88f";
}
.ph-book-bold::before {
content: "\f6fb";
}
.ph-browser-bold::before {
content: "\f70d";
}
.ph-youtube-logo-bold::before {
content: "\fa5d";
}
.ph-chat-circle-text-bold::before {
content: "\f74c";
}

File diff suppressed because it is too large

View file

@ -0,0 +1,14 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="100%" height="100%" viewBox="0 0 9375 9375" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xmlns:serif="http://www.serif.com/" style="fill-rule:evenodd;clip-rule:evenodd;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:1.5;">
<style>
path {
fill: #000;
}
@media ( prefers-color-scheme: dark ) {
path {
fill: #fff !important;
}
}
</style>
<path d="M6128.72,8251.56c495.65,0 919.697,-176.222 1272.13,-528.659c352.437,-352.438 528.659,-776.484 528.659,-1272.13l-0,-3358.75c-0,-94.644 -35.492,-176.841 -106.482,-246.581c-70.985,-69.739 -153.801,-104.612 -248.445,-104.612c-99.634,-0 -184.314,34.873 -254.054,104.612c-69.746,69.74 -104.612,151.937 -104.612,246.581l-0,3358.75c-0,301.373 -105.857,557.923 -317.571,769.63c-211.708,211.714 -468.251,317.571 -769.63,317.571c-298.89,0 -554.808,-105.857 -767.766,-317.571c-212.958,-211.707 -319.434,-468.257 -319.434,-769.63l-0,-3358.75c-0,-94.644 -34.873,-176.841 -104.613,-246.581c-69.739,-69.739 -154.426,-104.612 -254.054,-104.612c-94.649,-0 -176.841,34.873 -246.58,104.612c-69.74,69.74 -104.613,151.937 -104.613,246.581l0,3358.75c0,301.373 -106.476,557.923 -319.434,769.63c-212.959,211.714 -468.883,317.571 -767.766,317.571c-301.379,0 -557.923,-105.857 -769.636,-317.571c-211.708,-211.707 -317.565,-468.257 -317.565,-769.63l0,-3358.75c0,-94.644 -34.873,-176.841 -104.612,-246.581c-69.74,-69.739 -154.427,-104.612 -254.054,-104.612c-94.65,-0 -176.841,34.873 -246.581,104.612c-69.739,69.74 -104.612,151.937 -104.612,246.581l-0,3358.75c-0,326.283 80.327,627.662 240.976,904.131c160.656,276.469 378.593,495.031 653.817,655.686c275.224,160.649 575.984,240.977 902.267,240.977c291.416,0 563.525,-64.761 816.335,-194.277c252.81,-129.517 460.158,-307.608 622.058,-534.263c166.878,226.655 376.722,404.746 629.532,534.263c252.809,129.516 524.919,194.277 816.335,194.277Zm-0.96,-1617.39l-0.582,-0c-99.627,-0 -184.314,-34.873 -254.054,-104.612c-69.739,-69.74 -104.612,-151.938 -104.612,-246.581l-0,-3358.74c-0,-301.373 -105.857,-557.923 -317.565,-769.63c-210.698,-210.699 -465.799,-316.549 -765.32,-317.559c-299.521,1.01 -554.622,106.86 -765.314,317.559c-211.714,211.707 -317.571,468.257 -317.571,769.63l0,3358.75c0,94.644 -34.866,176.841 -104.606,246.581c-69.739,69.739 -154.426,104.612 -254.054,104.612l-8.638,0c-94.643,0 -176.841,-34.873 -246.58,-104.612c-69.74,-69.74 -104.613,-151.937 -104.613,-246.581l0,-3358.75c0,-301.373 -106.476,-557.923 -319.434,-769.63c-212.959,-211.714 -468.876,-317.571 -767.766,-317.571c-301.379,-0 -557.922,105.857 -769.63,317.571c-211.714,211.707 -317.571,468.257 -317.571,769.63l0,3358.75c0,94.644 -34.867,176.841 -104.612,246.581c-69.74,69.739 -154.42,104.612 -254.054,104.612c-94.644,0 -176.841,-34.873 -246.581,-104.612c-69.739,-69.74 -104.606,-151.937 -104.606,-246.581l0,-3358.75c0,-326.283 80.321,-627.662 240.977,-904.131c160.649,-276.469 378.586,-495.031 653.816,-655.686c275.224,-160.649 575.978,-240.977 902.261,-240.977c291.416,-0 563.526,64.761 816.335,194.277c252.81,129.517 460.164,307.608 622.058,534.263c166.878,-226.655 376.722,-404.746 629.532,-534.263c252.809,-129.516 524.919,-194.277 816.335,-194.277l8.638,-0c164.822,-0 323.472,20.718 475.941,62.154l5.239,1.431c41.114,11.263 81.609,24.024 121.497,38.284c72.687,25.87 143.907,56.675 213.652,92.408c250.636,128.408 456.592,304.549 617.866,528.412l4.328,5.665c166.872,-226.58 376.667,-404.598 629.396,-534.077c252.809,-129.516 524.925,-194.277 816.335,-194.277c495.657,-0 919.704,176.222 1272.14,528.659c352.437,352.438 528.653,776.484 528.653,1272.13l0,3358.75c0,94.644 -35.492,176.841 -106.476,246.581c-70.984,69.739 -153.801,104.612 -248.451,104.612c-99.627,0 -184.314,-34.873 -254.054,-104.612c-69.739,-69.74 -104.612,-151.937 -104.612,-246.581l-0,-3358.75c-0,-301.373 -105.851,-557.923 -317.565,-769.63c-211.713,-211.714 -468.257,-317.571 -769.636,-317.571c-298.883,-0 -554.807,105.857 -767.766,317.571c-212.952,211.707 -319.434,468.257 
-319.434,769.63l-0,3358.75c-0,94.644 -34.867,176.841 -104.606,246.581c-69.746,69.739 -154.427,104.612 -254.055,104.612l-0.582,-0.006Z" style="stroke:#185ADB;stroke-width:4.17px;"/></svg>


View file

@ -0,0 +1,4 @@
<svg width="300" height="300" viewBox="0 0 300 300" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect x="25.909" y="49.7723" width="250.569" height="200.455" fill="white"/>
<path fill-rule="evenodd" clip-rule="evenodd" d="M300 195C300 252.951 252.951 300 195 300H105C47.049 300 0 252.951 0 195V105C0 47.0489 47.049 0 105 0H195C252.951 0 300 47.0489 300 105V195ZM187.005 200.017H186.99C184.431 200.017 182.255 199.121 180.463 197.329C178.671 195.537 177.775 193.425 177.775 190.993V104.696C177.775 96.9523 175.055 90.3607 169.616 84.9212C164.202 79.5076 157.648 76.7879 149.952 76.762C142.256 76.7879 135.702 79.5076 130.288 84.9212C124.849 90.3607 122.129 96.9523 122.129 104.696V190.993C122.129 193.425 121.233 195.537 119.441 197.329C117.649 199.121 115.473 200.017 112.914 200.017H112.692C110.26 200.017 108.148 199.121 106.356 197.329C104.564 195.537 103.668 193.425 103.668 190.993V104.696C103.668 96.9523 100.933 90.3607 95.461 84.9212C89.9894 79.4815 83.414 76.7617 75.7345 76.7617C67.991 76.7617 61.3995 79.4815 55.96 84.9212C50.5203 90.3607 47.8005 96.9523 47.8005 104.696V190.993C47.8005 193.425 46.9047 195.537 45.1127 197.329C43.3208 199.121 41.1451 200.017 38.5851 200.017C36.1534 200.017 34.0415 199.121 32.2496 197.329C30.4578 195.537 29.5619 193.425 29.5619 190.993V104.696C29.5619 96.3123 31.6257 88.5688 35.7535 81.4654C39.8811 74.362 45.4806 68.7463 52.5523 64.6186C59.6237 60.4909 67.3511 58.427 75.7345 58.427C83.2219 58.427 90.2134 60.091 96.7089 63.4187C103.204 66.7464 108.532 71.3222 112.692 77.1457C116.979 71.3222 122.371 66.7464 128.867 63.4187C135.362 60.091 142.354 58.427 149.841 58.427H150.063C154.298 58.427 158.374 58.9594 162.292 60.024L162.426 60.0607C163.483 60.3501 164.523 60.678 165.548 61.0444C167.415 61.7091 169.245 62.5006 171.037 63.4187C177.477 66.7179 182.769 71.2436 186.912 76.9954L187.024 77.141C191.311 71.3193 196.701 66.7454 203.195 63.4187C209.691 60.091 216.682 58.427 224.169 58.427C236.905 58.427 247.8 62.9548 256.855 72.0101C265.91 81.0655 270.438 91.9607 270.438 104.696V190.993C270.438 193.425 269.526 195.537 267.702 197.329C265.879 199.121 263.751 200.017 261.319 200.017C258.759 200.017 256.583 199.121 254.791 197.329C252.999 195.537 252.103 193.425 252.103 190.993V104.696C252.103 96.9523 249.384 90.3607 243.944 84.9212C238.504 79.4815 231.913 76.7617 224.169 76.7617C216.49 76.7617 209.915 79.4815 204.443 84.9212C198.971 90.3607 196.236 96.9523 196.236 104.696V190.993C196.236 193.425 195.34 195.537 193.548 197.329C191.756 199.121 189.58 200.017 187.02 200.017L187.005 200.017ZM187.03 241.573C199.765 241.573 210.66 237.045 219.716 227.99C228.771 218.935 233.299 208.039 233.299 195.304V109.007C233.299 106.575 232.387 104.463 230.563 102.671C228.739 100.879 226.611 99.9832 224.179 99.9832C221.619 99.9832 219.444 100.879 217.652 102.671C215.86 104.463 214.964 106.575 214.964 109.007V195.304C214.964 203.048 212.244 209.639 206.804 215.079C201.365 220.518 194.773 223.238 187.03 223.238C179.35 223.238 172.775 220.518 167.303 215.079C161.832 209.639 159.096 203.048 159.096 195.304V109.007C159.096 106.575 158.2 104.463 156.408 102.671C154.616 100.879 152.44 99.9832 149.881 99.9832C147.449 99.9832 145.337 100.879 143.545 102.671C141.753 104.463 140.857 106.575 140.857 109.007V195.304C140.857 203.048 138.122 209.639 132.65 215.079C127.178 220.518 120.603 223.238 112.923 223.238C105.18 223.238 98.5884 220.518 93.1488 215.079C87.7093 209.639 84.9894 203.048 84.9894 195.304V109.007C84.9894 106.575 84.0934 104.463 82.3016 102.671C80.5097 100.879 78.3338 99.9832 75.7741 99.9832C73.3422 99.9832 71.2304 100.879 69.4386 102.671C67.6467 104.463 66.7507 106.575 66.7507 109.007V195.304C66.7507 203.688 68.8146 211.431 72.9422 218.535C77.07 225.638 82.6696 
231.254 89.741 235.381C96.8125 239.509 104.54 241.573 112.923 241.573C120.411 241.573 127.402 239.909 133.898 236.581C140.393 233.254 145.721 228.678 149.881 222.854C154.168 228.678 159.56 233.254 166.056 236.581C172.551 239.909 179.543 241.573 187.03 241.573V241.573Z" fill="#185ADB"/>
</svg>


File diff suppressed because one or more lines are too long

18
front-end/config.js Normal file
View file

@ -0,0 +1,18 @@
/**
* This file is made for tweaking parameters on the front-end
* without having to dive into the source code.
*
* THIS IS NOT A PLACE TO PUT SENSITIVE DATA LIKE API KEYS.
* THIS FILE IS PUBLIC.
*/
export default {
componentPrefix: 'mwmbl',
publicApiURL: '/api/v1/',
// publicApiURL: 'http://localhost:5000/',
searchQueryParam: 'q',
commands: {
'go: ': 'https://',
'search: google.com ': 'https://www.google.com/search?q=',
}
}
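
A hedged sketch of how this config is typically consumed by the modules later in this changeset (save.js builds API URLs from publicApiURL; define.js prefixes component names with componentPrefix); the endpoint and component names below are made up for illustration:

import config from './config.js';

// API URLs are assembled from the public API prefix; 'example/' is a hypothetical endpoint.
const exampleUrl = config.publicApiURL + 'example/';            // "/api/v1/example/"

// Custom element names are namespaced with the component prefix.
const exampleComponent = `${config.componentPrefix}-example`;   // "mwmbl-example"

console.log(exampleUrl, exampleComponent);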

1269
front-end/package-lock.json generated Normal file

File diff suppressed because it is too large

19
front-end/package.json Normal file
View file

@ -0,0 +1,19 @@
{
"name": "front-end",
"private": true,
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"preview": "vite preview"
},
"devDependencies": {
"@vitejs/plugin-legacy": "^2.3.1",
"terser": "^5.16.1",
"vite": "^3.2.3"
},
"dependencies": {
"chart.js": "^4.4.0",
"sortablejs": "^1.15.0"
}
}

View file

@ -0,0 +1,21 @@
import define from "../../utils/define.js";
export default define('add-button', class extends HTMLButtonElement {
constructor() {
super();
this.__setup();
}
__setup() {
this.__events();
}
__events() {
this.addEventListener('click', (e) => {
console.log("Add button");
document.querySelector('.modal').style.display = 'block';
document.querySelector('.modal input').focus();
})
}
}, { extends: 'button' });

View file

@ -0,0 +1,69 @@
import define from '../../utils/define.js';
import config from "../../../config.js";
import {globalBus} from "../../utils/events.js";
const FETCH_URL = '/app/fetch?'
const template = () => /*html*/`
<form class="modal-content">
<span class="close">&times;</span>
<input class="add-result" placeholder="Enter a URL...">
<button>Save</button>
</form>
`;
export default define('add-result', class extends HTMLDivElement {
constructor() {
super();
this.classList.add('modal');
this.__setup();
}
__setup() {
this.innerHTML = template();
this.__events();
this.style.display = 'none';
}
__events() {
this.querySelector('.close').addEventListener('click', e => {
if (e.target === this) {
this.style.display = 'none';
}
});
this.addEventListener('click', e => {
this.style.display = 'none';
});
this.querySelector('form').addEventListener('click', e => {
// Clicking on the form shouldn't close it
e.stopPropagation();
});
this.addEventListener('submit', this.__urlSubmitted.bind(this));
}
async __urlSubmitted(e) {
e.preventDefault();
const value = this.querySelector('input').value;
console.log("Input value", value);
const query = document.querySelector('.search-bar input').value;
const url = `${FETCH_URL}url=${encodeURIComponent(value)}&query=${encodeURIComponent(query)}`;
const response = await fetch(url);
if (response.status === 200) {
const data = await response.text();
console.log("Data", data);
const addResultEvent = new CustomEvent('curate-add-result', {detail: data});
globalBus.dispatch(addResultEvent);
} else {
console.log("Bad response", response);
// TODO
}
}
}, { extends: 'div' });

View file

@ -0,0 +1,35 @@
import define from "../../utils/define.js";
import {globalBus} from "../../utils/events.js";
export default define('delete-button', class extends HTMLButtonElement {
constructor() {
super();
this.__setup();
}
__setup() {
this.__events();
}
__events() {
this.addEventListener('click', (e) => {
console.log("Delete button");
const result = this.closest('.result');
const parent = result.parentNode;
const index = Array.prototype.indexOf.call(parent.getElementsByClassName('result'), result);
console.log("Delete index", index);
const beginCuratingEvent = new CustomEvent('curate-delete-result', {
detail: {
data: {
delete_index: index
}
}
});
globalBus.dispatch(beginCuratingEvent);
})
}
}, { extends: 'button' });

View file

@ -0,0 +1,45 @@
import define from '../../utils/define.js';
import escapeString from '../../utils/escapeString.js';
import { globalBus } from '../../utils/events.js';
export default define('result', class extends HTMLLIElement {
constructor() {
super();
this.classList.add('result');
this.__setup();
}
__setup() {
this.__events();
}
__events() {
this.addEventListener('keydown', (e) => {
if (this.firstElementChild === document.activeElement) {
if (e.key === 'ArrowDown') {
e.preventDefault();
this?.nextElementSibling?.firstElementChild.focus();
}
if (e.key === 'ArrowUp') {
e.preventDefault();
if (this.previousElementSibling)
this.previousElementSibling?.firstElementChild.focus();
else {
const focusSearchEvent = new CustomEvent('focus-search');
globalBus.dispatch(focusSearchEvent);
}
}
}
})
}
__handleBold(input) {
let text = '';
for (const part of input) {
if (part.is_bold) text += `<strong>${escapeString(part.value)}</strong>`;
else text += escapeString(part.value);
}
return text;
}
}, { extends: 'li' });

View file

@ -0,0 +1,53 @@
import define from "../../utils/define.js";
import {globalBus} from "../../utils/events.js";
const VALIDATED_CLASS = "validated";
export default define('validate-button', class extends HTMLButtonElement {
constructor() {
super();
this.__setup();
}
__setup() {
this.__events();
}
__events() {
this.addEventListener('click', (e) => {
console.log("Validate button");
const result = this.closest('.result');
const parent = result.parentNode;
const index = Array.prototype.indexOf.call(parent.getElementsByClassName('result'), result);
console.log("Validate index", index);
const curationValidateEvent = new CustomEvent('curate-validate-result', {
detail: {
data: {
validate_index: index
}
}
});
globalBus.dispatch(curationValidateEvent);
})
}
isValidated() {
return this.classList.contains(VALIDATED_CLASS);
}
validate() {
this.classList.add(VALIDATED_CLASS);
}
unvalidate() {
this.classList.remove(VALIDATED_CLASS);
}
toggleValidate() {
this.classList.toggle(VALIDATED_CLASS);
}
}, { extends: 'button' });

View file

@ -0,0 +1,191 @@
import {globalBus} from '../../utils/events.js';
import Sortable from 'sortablejs';
class ResultsHandler {
constructor() {
this.results = null;
this.oldIndex = null;
this.curating = false;
this.__setup();
}
__setup() {
this.__events();
this.__initializeResults();
}
__events() {
document.body.addEventListener('htmx:load', e => {
this.__initializeResults();
});
// Focus first element when coming from the search bar
globalBus.on('focus-result', () => {
this.results.firstElementChild.firstElementChild.focus();
});
globalBus.on('curate-delete-result', (e) => {
console.log("Curate delete result event", e);
this.__beginCurating.bind(this)();
const children = this.results.getElementsByClassName('result');
let deleteIndex = e.detail.data.delete_index;
const child = children[deleteIndex];
this.results.removeChild(child);
const newResults = this.__getResults();
const curationSaveEvent = new CustomEvent('save-curation', {
detail: {
type: 'delete',
data: {
timestamp: Date.now(),
url: document.location.href,
results: newResults,
curation: {
delete_index: deleteIndex
}
}
}
});
globalBus.dispatch(curationSaveEvent);
});
globalBus.on('curate-validate-result', (e) => {
console.log("Curate validate result event", e);
this.__beginCurating.bind(this)();
const children = this.results.getElementsByClassName('result');
const validateChild = children[e.detail.data.validate_index];
validateChild.querySelector('.curate-approve').toggleValidate();
const newResults = this.__getResults();
const curationStartEvent = new CustomEvent('save-curation', {
detail: {
type: 'validate',
data: {
timestamp: Date.now(),
url: document.location.href,
results: newResults,
curation: e.detail.data
}
}
});
globalBus.dispatch(curationStartEvent);
});
globalBus.on('begin-curating-results', (e) => {
// We might not be online, or logged in, so save the curation in local storage in case:
console.log("Begin curation event", e);
this.__beginCurating.bind(this)();
});
globalBus.on('curate-add-result', (e) => {
console.log("Add result", e);
this.__beginCurating();
const resultData = e.detail;
this.results.insertAdjacentHTML('afterbegin', resultData);
const newResults = this.__getResults();
const url = newResults[0].url;
let detail = {
type: 'add',
data: {
timestamp: Date.now(),
url: document.location.href,
results: newResults,
curation: {
insert_index: 0,
url: url
}
}
};
console.log("Detail", detail);
const curationSaveEvent = new CustomEvent('save-curation', {
detail: detail
});
globalBus.dispatch(curationSaveEvent);
});
}
__initializeResults() {
this.results = document.querySelector('.results');
if (this.results) {
const sortable = new Sortable(this.results, {
"onStart": this.__sortableActivate.bind(this),
"onEnd": this.__sortableDeactivate.bind(this),
"handle": ".handle",
});
}
this.curating = false;
}
__sortableActivate(event) {
console.log("Sortable activate", event);
this.__beginCurating();
this.oldIndex = event.oldIndex;
}
__beginCurating() {
if (!this.curating) {
const results = this.__getResults();
const curationStartEvent = new CustomEvent('save-curation', {
detail: {
type: 'begin',
data: {
timestamp: Date.now(),
url: document.location.href,
results: results,
curation: {}
}
}
});
globalBus.dispatch(curationStartEvent);
this.curating = true;
}
}
__getResults() {
const resultsElements = document.querySelectorAll('.results .result:not(.ui-sortable-placeholder)');
const results = [];
for (let resultElement of resultsElements) {
const result = {
url: resultElement.querySelector('a').href,
title: resultElement.querySelector('.title').innerText,
extract: resultElement.querySelector('.extract').innerText,
curated: resultElement.querySelector('.curate-approve').isValidated()
}
results.push(result);
}
console.log("Results", results);
return results;
}
__sortableDeactivate(event) {
const newIndex = event.newIndex;
console.log('Sortable deactivate', this.oldIndex, newIndex);
const newResults = this.__getResults();
const curationMoveEvent = new CustomEvent('save-curation', {
detail: {
type: 'move',
data: {
timestamp: Date.now(),
url: document.location.href,
results: newResults,
curation: {
old_index: this.oldIndex,
new_index: newIndex,
}
}
}
});
globalBus.dispatch(curationMoveEvent);
}
}
const resultsHandler = new ResultsHandler();

View file

@ -0,0 +1,112 @@
import define from '../../utils/define.js';
import {globalBus} from "../../utils/events.js";
import config from "../../../config.js";
const CURATION_KEY_PREFIX = "curation-";
const CURATION_URL = config.publicApiURL + "curation/";
const template = () => /*html*/`
<span></span>
`;
export default define('save', class extends HTMLDivElement {
constructor() {
super();
this.currentCurationId = null;
this.classList.add('save');
this.sendId = 0;
this.sending = false;
this.__setup();
}
__setup() {
this.innerHTML = template();
this.__events();
// TODO: figure out when to call __sendToApi()
// setInterval(this.__sendToApi.bind(this), 1000);
}
__events() {
globalBus.on('save-curation', (e) => {
// We might not be online, or logged in, so save the curation in local storage in case:
console.log("Curation event", e);
this.__setCuration(e.detail);
this.__sendToApi();
});
}
__setCuration(curation) {
this.sendId += 1;
const key = CURATION_KEY_PREFIX + this.sendId;
localStorage.setItem(key, JSON.stringify(curation));
}
__getOldestCurationKey() {
let oldestId = Number.MAX_SAFE_INTEGER;
let oldestKey = null;
for (let i=0; i<localStorage.length; ++i) {
const key = localStorage.key(i);
if (key.startsWith(CURATION_KEY_PREFIX)) {
const timestamp = parseInt(key.substring(CURATION_KEY_PREFIX.length));
if (timestamp < oldestId) {
oldestKey = key;
oldestId = timestamp;
}
}
}
return oldestKey;
}
async __sendToApi() {
if (this.sending) {
return;
}
this.sending = true;
const csrftoken = document.cookie
.split('; ')
.find((row) => row.startsWith('csrftoken='))
?.split('=')[1];
if (!csrftoken) {
console.log("No auth");
// Reset the in-flight flag so a later 'save-curation' event can retry once the user is logged in.
this.sending = false;
return;
}
const key = this.__getOldestCurationKey();
if (key !== null) {
const value = JSON.parse(localStorage.getItem(key));
console.log("Value", value);
const url = CURATION_URL + value['type'];
const data = value['data'];
console.log("Data", data);
const response = await fetch(url, {
method: 'POST',
cache: 'no-cache',
headers: {'Content-Type': 'application/json', 'X-CSRFToken': csrftoken},
credentials: "same-origin",
mode: "same-origin",
body: JSON.stringify(data),
});
console.log("Save curation API response", response);
if (response.status === 200) {
localStorage.removeItem(key);
} else {
console.log("Bad response, skipping");
// Clear the in-flight flag so the next curation event can retry sending.
this.sending = false;
return;
}
const responseData = await response.json();
console.log("Response data", responseData);
// There may be more to send, wait a second and see
setTimeout(this.__sendToApi.bind(this), 1000);
}
this.sending = false;
}
}, { extends: 'div' });
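
For reference, a hedged sketch of the 'save-curation' protocol this component listens for: results.js (earlier in this diff) dispatches these events, and this component queues them in localStorage under curation-<sendId> keys and forwards the oldest one to the curation API. The URL, title and extract below are placeholders:

import { globalBus } from '../../utils/events.js';

// Example payload mirroring what results.js sends for a 'validate' action.
globalBus.dispatch(new CustomEvent('save-curation', {
    detail: {
        type: 'validate',                       // becomes the API path suffix: /api/v1/curation/validate
        data: {
            timestamp: Date.now(),
            url: document.location.href,
            results: [
                { url: 'https://example.com/', title: 'Example', extract: 'Placeholder extract', curated: true },
            ],
            curation: { validate_index: 0 },
        },
    },
}));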

26
front-end/src/index.js Normal file
View file

@ -0,0 +1,26 @@
/**
* This file is mainly used as an entry point
* to import components or define globals.
*
* Please do not pollute this file; create util
* or component files instead where possible.
*/
import 'vite/modulepreload-polyfill';
// Waiting for top-level await to be better supported.
(async () => {
// Check if a suggestion redirect is needed.
const { redirectToSuggestions } = await import("./utils/suggestions.js");
const redirected = redirectToSuggestions();
if (!redirected) {
// Load components only after redirects are checked.
import("./components/organisms/results.js");
import("./components/organisms/save.js");
import("./components/molecules/add-button.js");
import("./components/molecules/add-result.js");
import("./components/molecules/delete-button.js");
import("./components/molecules/result.js");
import("./components/molecules/validate-button.js");
}
})();

View file

@ -0,0 +1,69 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Mwmbl Stats</title>
<!-- Favicons -->
<link rel="icon" href="/static/images/favicon.svg" type="image/svg+xml">
<!-- Fonts import -->
<link rel="preload" href="/static/fonts/inter/inter.css" as="style" onload="this.onload=null;this.rel='stylesheet'">
<noscript>
<link rel="stylesheet" href="/static/fonts/inter/inter.css">
</noscript>
<!-- CSS Stylesheets (this is critical CSS) -->
<link rel="stylesheet" type="text/css" href="/static/css/reset.css">
<link rel="stylesheet" type="text/css" href="/static/css/theme.css">
<link rel="stylesheet" type="text/css" href="/static/css/global.css">
<link rel="stylesheet" type="text/css" href="stats.css">
</head>
<body>
<section>
<div class="info">
<h1>Mwmbl Stats</h1>
<p>
Mwmbl is a <a href="https://matrix.to/#/#mwmbl:matrix.org">community</a> devoted to building a
<a href="https://en.wikipedia.org/wiki/Free_and_open-source_software">free</a> search engine. You can try it
out <a href="/">here</a> or help us improve the index by
<a href="https://en.wikipedia.org/wiki/Web_crawler">crawling</a> the web with our
<a href="https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/">Firefox extension</a>
or <a href="https://github.com/mwmbl/crawler-script">command line script</a>.
</p>
</div>
</section>
<section>
<div class="info">
<h1>Number of users crawling today: <span id="num-users"></span></h1>
<div class="wrap">
<canvas id="users-by-day"></canvas>
</div>
</div>
<div class="info">
<h1>Number of URLs crawled today: <span id="num-urls"></span></h1>
<div class="wrap">
<canvas id="urls-by-day"></canvas>
</div>
</div>
<div class="info">
<div class="wrap">
<canvas id="urls-by-hour"></canvas>
</div>
</div>
</section>
<section>
<div class="info tall">
<div class="wrap tall">
<canvas id="urls-by-user"></canvas>
</div>
</div>
<div class="info tall">
<div class="wrap tall">
<canvas id="urls-by-domain"></canvas>
</div>
</div>
</section>
<script src="./stats.js" type="module"></script>
</body>
</html>

View file

@ -0,0 +1,33 @@
body {
background: #eeeeee;
}
section {
display: flex;
flex-wrap: wrap;
}
.info {
flex: 1 500px;
margin: 10px;
padding: 50px;
background: #ffffff;
border-radius: 50px;
}
.wrap {
height: 512px;
}
#users-by-day-info {
width: 100%;
}
#url-info {
height: 3000px;
}
.tall {
height: 3000px;
}

View file

@ -0,0 +1,113 @@
import {Chart} from "chart.js/auto";
(async () => {
Chart.defaults.font.size = 16;
function createChart(elementId, labels, label) {
const canvas = document.getElementById(elementId);
return new Chart(canvas, {
type: 'line',
data: {
labels: labels,
datasets: [{
label: label,
borderWidth: 1
}]
},
options: {
scales: {
y: {
beginAtZero: true
}
},
maintainAspectRatio: false
}
});
}
const urlsCrawledDailyChart = createChart('urls-by-day', null, "URLs crawled by day");
const urlsCrawledHourlyChart = createChart('urls-by-hour', [...Array(24).keys()], "URLs crawled today by hour")
const usersCrawledDailyChart = createChart('users-by-day', null, "Number of users crawling by day")
const urlsByUserCanvas = document.getElementById('urls-by-user');
const byUserChart = new Chart(urlsByUserCanvas, {
type: 'bar',
data: {
datasets: [{
label: "Top users",
borderWidth: 1
// barThickness: 15
}]
},
options: {
scales: {
x: {
beginAtZero: true
}
},
indexAxis: 'y',
maintainAspectRatio: false
}
});
const urlsByDomainCanvas = document.getElementById('urls-by-domain');
const byDomainChart = new Chart(urlsByDomainCanvas, {
type: 'bar',
data: {
datasets: [{
label: "Top domains",
borderWidth: 1
}]
},
options: {
scales: {
x: {
beginAtZero: true
}
},
indexAxis: 'y',
maintainAspectRatio: false
}
});
function updateStats() {
fetch("https://api.mwmbl.org/crawler/stats").then(result => {
result.json().then(stats => {
console.log("Stats", stats);
const urlCountSpan = document.getElementById("num-urls");
urlCountSpan.innerText = stats.urls_crawled_today;
const numUsers = Object.values(stats.users_crawled_daily)[Object.keys(stats.users_crawled_daily).length - 1];
const userCountSpan = document.getElementById("num-users");
userCountSpan.innerText = numUsers;
usersCrawledDailyChart.data.labels = Object.keys(stats.users_crawled_daily);
usersCrawledDailyChart.data.datasets[0].data = Object.values(stats.users_crawled_daily);
usersCrawledDailyChart.update();
urlsCrawledHourlyChart.data.datasets[0].data = stats.urls_crawled_hourly;
urlsCrawledHourlyChart.update();
urlsCrawledDailyChart.data.labels = Object.keys(stats.urls_crawled_daily);
urlsCrawledDailyChart.data.datasets[0].data = Object.values(stats.urls_crawled_daily);
urlsCrawledDailyChart.update();
byUserChart.data.labels = Object.keys(stats.top_users);
byUserChart.data.datasets[0].data = Object.values(stats.top_users);
byUserChart.update();
byDomainChart.data.labels = Object.keys(stats.top_domains);
byDomainChart.data.datasets[0].data = Object.values(stats.top_domains);
byDomainChart.update();
})
});
}
updateStats();
setInterval(() => {
updateStats();
}, 5000);
})();

View file

@ -0,0 +1,13 @@
/**
* A debounce function to reduce input spam
* @param {*} callback Function that will be called
* @param {*} timeout Minimum amount of time between calls
* @returns The debounced function
*/
export default (callback, timeout = 100) => {
let timer;
return (...args) => {
clearTimeout(timer);
timer = setTimeout(() => { callback.apply(this, args); }, timeout);
};
}
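
A minimal usage sketch for this debounce helper; the selector and handler below are hypothetical and not part of this changeset:

import debounce from './utils/debounce.js';

// Only run the handler once typing has paused for 300 ms.
const onInput = debounce((event) => {
    console.log('Debounced value:', event.target.value);
}, 300);

document.querySelector('.search-bar input')?.addEventListener('input', onInput);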

View file

@ -0,0 +1,15 @@
import config from '../../config.js';
/** Define a web component. This is a wrapper
* around the `customElements.define` native function.
* @function define
* @param {string} name Name of the component (will be prefixed by the config `componentPrefix`)
* @param {CustomElementConstructor} constructor
* @param {ElementDefinitionOptions} [options]
* @returns {string} Returns the prefixed element name ready for the DOM (e.g. `<mwmbl-search-bar></mwmbl-search-bar>`)
*/
export default (name, constructor, options) => {
const componentName = `${config.componentPrefix}-${name}`;
if (!customElements.get(componentName)) customElements.define(componentName, constructor, options);
return componentName;
}
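
A small usage sketch for this define wrapper, assuming a hypothetical 'hello-box' component that is not part of this changeset:

import define from './utils/define.js';

// Registers the customized built-in element under the config.componentPrefix namespace
// and returns the prefixed name ("mwmbl-hello-box") so callers can build markup with it.
const helloBox = define('hello-box', class extends HTMLDivElement {
    constructor() {
        super();
        this.innerText = 'Hello from a custom element';
    }
}, { extends: 'div' });

// Customized built-in elements are used via the "is" attribute.
document.body.insertAdjacentHTML('beforeend', `<div is="${helloBox}"></div>`);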

View file

@ -0,0 +1,10 @@
/**
* Escapes a string using HTML character codes.
* @param {string} input String to escape
* @returns {string}
*/
export default (input) => {
return String(input).replace(/[^\w. ]/gi, (character) => {
return `&#${character.charCodeAt(0)};`;
});
}
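
A quick worked example of what this escaping produces (not code from the repository): every character outside letters, digits, underscore, dot and space is replaced with its numeric HTML character reference.

import escapeString from './utils/escapeString.js';

console.log(escapeString('<b>Hi & bye</b>'));
// -> "&#60;b&#62;Hi &#38; bye&#60;&#47;b&#62;"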

View file

@ -0,0 +1,30 @@
/**
* A class intended to be used as an event bus.
*
* It simply uses a div element
* to carry and dispatch events.
*/
class Bus {
constructor() {
this.element = document.createElement('div');
}
on(eventName, callback) {
this.element.addEventListener(eventName, callback);
}
dispatch(event) {
this.element.dispatchEvent(event);
}
}
/**
* A global event bus that can be used to
* dispatch events between components
* */
const globalBus = new Bus();
export {
Bus,
globalBus,
}
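
A minimal sketch of how the bus is used between components; the event name 'example-event' is made up (the real events in this changeset are things like 'save-curation' and 'focus-search'):

import { globalBus } from './utils/events.js';

// One component listens...
globalBus.on('example-event', (e) => {
    console.log('Received payload:', e.detail);
});

// ...and another dispatches a CustomEvent carrying data in `detail`.
globalBus.dispatch(new CustomEvent('example-event', { detail: { value: 42 } }));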

View file

@ -0,0 +1,24 @@
/**
* Handle redirect requests from the suggestion back-end.
*/
import config from "../../config.js";
const redirectToSuggestions = () => {
const search = decodeURIComponent(document.location.search).replace(/\+/g, ' ').substr(3);
console.log("Search", search);
for (const [command, urlTemplate] of Object.entries(config.commands)) {
console.log("Command", command);
if (search.startsWith(command)) {
const newUrl = urlTemplate + search.substr(command.length);
window.location.replace(newUrl);
return true;
}
}
return false;
}
export {
redirectToSuggestions
};
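
A worked example of the command handling above, assuming the default commands from config.js: the '?q=' prefix is stripped from the query string, and a query beginning with a command prefix is redirected to the mapped URL plus the remainder. A standalone sketch of the same rule:

// Default commands from config.js.
const commands = { 'go: ': 'https://', 'search: google.com ': 'https://www.google.com/search?q=' };

function resolveCommand(search) {
    for (const [command, urlTemplate] of Object.entries(commands)) {
        if (search.startsWith(command)) return urlTemplate + search.substring(command.length);
    }
    return null;
}

console.log(resolveCommand('go: mwmbl.org'));                   // "https://mwmbl.org"
console.log(resolveCommand('search: google.com cat pictures')); // "https://www.google.com/search?q=cat pictures"
console.log(resolveCommand('free search engine'));              // null -> no redirect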

24
front-end/vite.config.js Normal file
View file

@ -0,0 +1,24 @@
import legacy from '@vitejs/plugin-legacy'
import { resolve } from 'path'
export default {
root: './src',
base: '/static',
publicDir: '../assets',
build: {
outDir: '../dist',
manifest: true,
rollupOptions: {
input: {
index: resolve(__dirname, 'src/index.js'),
stats: resolve(__dirname, 'src/stats/index.html'),
},
},
minify: false,
},
plugins: [
legacy({
targets: ['defaults', 'not IE 11'],
}),
]
}

22
manage.py Executable file
View file

@ -0,0 +1,22 @@
#!/usr/bin/env python
"""Django's command-line utility for administrative tasks."""
import os
import sys
def main():
"""Run administrative tasks."""
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mwmbl.settings_dev')
try:
from django.core.management import execute_from_command_line
except ImportError as exc:
raise ImportError(
"Couldn't import Django. Are you sure it's installed and "
"available on your PYTHONPATH environment variable? Did you "
"forget to activate a virtual environment?"
) from exc
execute_from_command_line(sys.argv)
if __name__ == '__main__':
main()

8
mwmbl/admin.py Normal file
View file

@ -0,0 +1,8 @@
from django.contrib.admin import ModelAdmin
from django.contrib.auth.admin import UserAdmin
from django.contrib import admin
from mwmbl.models import MwmblUser, UserCuration
admin.site.register(MwmblUser, UserAdmin)
admin.site.register(UserCuration, ModelAdmin)

48
mwmbl/apps.py Normal file
View file

@ -0,0 +1,48 @@
import os
import shutil
from multiprocessing import Process, Queue
from pathlib import Path
from django.apps import AppConfig
from django.conf import settings
from mwmbl.crawler.urls import URLDatabase
from mwmbl.database import Database
from mwmbl.indexer.indexdb import IndexDatabase
class MwmblConfig(AppConfig):
name = "mwmbl"
verbose_name = "Mwmbl Application"
def ready(self):
# Imports here to avoid AppRegistryNotReady exception
from mwmbl.search_setup import queued_batches
from mwmbl import background
from mwmbl.indexer.paths import INDEX_NAME
from mwmbl.indexer.update_urls import update_urls_continuously
from mwmbl.tinysearchengine.indexer import TinyIndex, Document, PAGE_SIZE
from mwmbl.url_queue import update_queue_continuously
index_path = Path(settings.DATA_PATH) / INDEX_NAME
try:
existing_index = TinyIndex(item_factory=Document, index_path=index_path)
if existing_index.page_size != PAGE_SIZE or existing_index.num_pages != settings.NUM_PAGES:
raise ValueError(f"Existing index page sizes ({existing_index.page_size}) or number of pages "
f"({existing_index.num_pages}) do not match")
except FileNotFoundError:
print("Creating a new index")
TinyIndex.create(item_factory=Document, index_path=index_path, num_pages=settings.NUM_PAGES,
page_size=PAGE_SIZE)
with Database() as db:
url_db = URLDatabase(db.connection)
url_db.create_tables()
index_db = IndexDatabase(db.connection)
index_db.create_tables()
if settings.RUN_BACKGROUND_PROCESSES:
new_item_queue = Queue()
Process(target=background.run, args=(settings.DATA_PATH,)).start()
Process(target=update_queue_continuously, args=(new_item_queue, queued_batches,)).start()
Process(target=update_urls_continuously, args=(settings.DATA_PATH, new_item_queue)).start()

16
mwmbl/asgi.py Normal file
View file

@ -0,0 +1,16 @@
"""
ASGI config for app project.
It exposes the ASGI callable as a module-level variable named ``application``.
For more information on this file, see
https://docs.djangoproject.com/en/4.2/howto/deployment/asgi/
"""
import os
from django.core.asgi import get_asgi_application
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mwmbl.settings_dev')
application = get_asgi_application()

41
mwmbl/background.py Normal file
View file

@ -0,0 +1,41 @@
"""
Script that updates data in a background process.
"""
import logging
import sys
from logging import getLogger, basicConfig
from pathlib import Path
from time import sleep
from mwmbl.crawler.urls import URLDatabase
from mwmbl.database import Database
from mwmbl.indexer import index_batches, historical
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.paths import BATCH_DIR_NAME, INDEX_NAME
basicConfig(stream=sys.stdout, level=logging.INFO)
logger = getLogger(__name__)
def run(data_path: str):
logger.info("Started background process")
with Database() as db:
url_db = URLDatabase(db.connection)
url_db.create_tables()
historical.run()
index_path = Path(data_path) / INDEX_NAME
batch_cache = BatchCache(Path(data_path) / BATCH_DIR_NAME)
while True:
try:
batch_cache.retrieve_batches(num_batches=10000)
except Exception:
logger.exception("Error retrieving batches")
try:
index_batches.run(batch_cache, index_path)
except Exception:
logger.exception("Error indexing batches")
sleep(10)

View file

235
mwmbl/crawler/app.py Normal file
View file

@ -0,0 +1,235 @@
import gzip
import hashlib
import json
import os
from datetime import datetime, timezone, date
from queue import Queue, Empty
from typing import Union
from uuid import uuid4
import boto3
import requests
from fastapi import HTTPException
from ninja import NinjaAPI
from redis import Redis
from mwmbl.crawler.batch import Batch, NewBatchRequest, HashedBatch
from mwmbl.crawler.stats import MwmblStats, StatsManager
from mwmbl.crawler.urls import URLDatabase, FoundURL, URLStatus
from mwmbl.database import Database
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.indexdb import IndexDatabase, BatchInfo, BatchStatus
from mwmbl.settings import (
ENDPOINT_URL,
KEY_ID,
APPLICATION_KEY,
BUCKET_NAME,
MAX_BATCH_SIZE,
USER_ID_LENGTH,
VERSION,
PUBLIC_URL_PREFIX,
PUBLIC_USER_ID_LENGTH,
FILE_NAME_SUFFIX,
DATE_REGEX)
stats_manager = StatsManager(Redis.from_url(os.environ.get("REDIS_URL")))
def get_bucket(name):
s3 = boto3.resource('s3', endpoint_url=ENDPOINT_URL, aws_access_key_id=KEY_ID,
aws_secret_access_key=APPLICATION_KEY)
return s3.Object(BUCKET_NAME, name)
def upload(data: bytes, name: str):
bucket = get_bucket(name)
result = bucket.put(Body=data)
return result
last_batch = None
def create_router(batch_cache: BatchCache, queued_batches: Queue, version: str) -> NinjaAPI:
router = NinjaAPI(urls_namespace=f"crawler-{version}")
@router.post('/batches/')
def post_batch(request, batch: Batch):
if len(batch.items) > MAX_BATCH_SIZE:
raise HTTPException(400, f"Batch size too large (maximum {MAX_BATCH_SIZE}), got {len(batch.items)}")
if len(batch.user_id) != USER_ID_LENGTH:
raise HTTPException(400, f"User ID length is incorrect, should be {USER_ID_LENGTH} characters")
if len(batch.items) == 0:
return {
'status': 'ok',
}
user_id_hash = _get_user_id_hash(batch)
now = datetime.now(timezone.utc)
seconds = (now - datetime(now.year, now.month, now.day, tzinfo=timezone.utc)).seconds
# How to pad a string with zeros: https://stackoverflow.com/a/39402910
# Maximum seconds in a day is 60*60*24 = 86400, so 5 digits is enough
padded_seconds = str(seconds).zfill(5)
# See discussion here: https://stackoverflow.com/a/13484764
uid = str(uuid4())[:8]
filename = f'1/{VERSION}/{now.date()}/1/{user_id_hash}/{padded_seconds}__{uid}.json.gz'
# Using an approach from https://stackoverflow.com/a/30476450
epoch_time = (now - datetime(1970, 1, 1, tzinfo=timezone.utc)).total_seconds()
hashed_batch = HashedBatch(user_id_hash=user_id_hash, timestamp=epoch_time, items=batch.items)
stats_manager.record_batch(hashed_batch)
data = gzip.compress(hashed_batch.json().encode('utf8'))
upload(data, filename)
global last_batch
last_batch = hashed_batch
batch_url = f'{PUBLIC_URL_PREFIX}{filename}'
batch_cache.store(hashed_batch, batch_url)
# Record the batch as being local so that we don't retrieve it again when the server restarts
infos = [BatchInfo(batch_url, user_id_hash, BatchStatus.LOCAL)]
with Database() as db:
index_db = IndexDatabase(db.connection)
index_db.record_batches(infos)
return {
'status': 'ok',
'public_user_id': user_id_hash,
'url': batch_url,
}
@router.post('/batches/new')
def request_new_batch(request, batch_request: NewBatchRequest) -> list[str]:
user_id_hash = _get_user_id_hash(batch_request)
try:
urls = queued_batches.get(block=False)
except Empty:
return []
found_urls = [FoundURL(url, user_id_hash, 0.0, URLStatus.ASSIGNED, datetime.utcnow()) for url in urls]
with Database() as db:
url_db = URLDatabase(db.connection)
url_db.update_found_urls(found_urls)
return urls
@router.get('/batches/{date_str}/users/{public_user_id}')
def get_batches_for_date_and_user(request, date_str, public_user_id):
check_date_str(date_str)
check_public_user_id(public_user_id)
prefix = f'1/{VERSION}/{date_str}/1/{public_user_id}/'
return get_batch_ids_for_prefix(prefix)
@router.get('/batches/{date_str}/users/{public_user_id}/batch/{batch_id}')
def get_batch_from_id(request, date_str, public_user_id, batch_id):
url = get_batch_url(batch_id, date_str, public_user_id)
data = json.loads(gzip.decompress(requests.get(url).content))
return {
'url': url,
'batch': data,
}
@router.get('/latest-batch')
def get_latest_batch(request) -> list[HashedBatch]:
return [] if last_batch is None else [last_batch]
@router.get('/batches/{date_str}/users')
def get_user_id_hashes_for_date(request, date_str: str):
check_date_str(date_str)
prefix = f'1/{VERSION}/{date_str}/1/'
return get_subfolders(prefix)
@router.get('/stats')
def get_stats(request) -> MwmblStats:
return stats_manager.get_stats()
@router.get('/')
def status(request):
return {
'status': 'ok'
}
return router
def _get_user_id_hash(batch: Union[Batch, NewBatchRequest]):
return hashlib.sha3_256(batch.user_id.encode('utf8')).hexdigest()
def check_public_user_id(public_user_id):
if len(public_user_id) != PUBLIC_USER_ID_LENGTH:
raise HTTPException(400, f"Incorrect public user ID length, should be {PUBLIC_USER_ID_LENGTH}")
def get_batch_url(batch_id, date_str, public_user_id):
check_date_str(date_str)
check_public_user_id(public_user_id)
url = f'{PUBLIC_URL_PREFIX}1/{VERSION}/{date_str}/1/{public_user_id}/{batch_id}{FILE_NAME_SUFFIX}'
return url
def get_batch_id_from_file_name(file_name: str):
assert file_name.endswith(FILE_NAME_SUFFIX)
return file_name[:-len(FILE_NAME_SUFFIX)]
def get_batch_ids_for_prefix(prefix):
filenames = get_batches_for_prefix(prefix)
filename_endings = sorted(filename.rsplit('/', 1)[1] for filename in filenames)
results = {'batch_ids': [get_batch_id_from_file_name(name) for name in filename_endings]}
return results
def get_batches_for_prefix(prefix):
s3 = boto3.resource('s3', endpoint_url=ENDPOINT_URL, aws_access_key_id=KEY_ID,
aws_secret_access_key=APPLICATION_KEY)
bucket = s3.Bucket(BUCKET_NAME)
items = bucket.objects.filter(Prefix=prefix)
filenames = [item.key for item in items]
return filenames
def check_date_str(date_str):
if not DATE_REGEX.match(date_str):
raise HTTPException(400, f"Incorrect date format, should be YYYY-MM-DD")
def get_subfolders(prefix):
client = boto3.client('s3', endpoint_url=ENDPOINT_URL, aws_access_key_id=KEY_ID,
aws_secret_access_key=APPLICATION_KEY)
items = client.list_objects(Bucket=BUCKET_NAME,
Prefix=prefix,
Delimiter='/')
item_keys = [item['Prefix'][len(prefix):].strip('/') for item in items['CommonPrefixes']]
return item_keys
def get_batches_for_date(date_str):
check_date_str(date_str)
prefix = f'1/{VERSION}/{date_str}/1/'
cache_filename = prefix + 'batches.json.gz'
cache_url = PUBLIC_URL_PREFIX + cache_filename
try:
cached_batches = json.loads(gzip.decompress(requests.get(cache_url).content))
print(f"Got cached batches for {date_str}")
return cached_batches
except gzip.BadGzipFile:
pass
batches = get_batches_for_prefix(prefix)
result = {'batch_urls': [f'{PUBLIC_URL_PREFIX}{batch}' for batch in sorted(batches)]}
if date_str != str(date.today()):
# Don't cache data from today since it may change
data = gzip.compress(json.dumps(result).encode('utf8'))
upload(data, cache_filename)
print(f"Cached batches for {date_str} in {PUBLIC_URL_PREFIX}{cache_filename}")
return result
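
A hedged JavaScript sketch of how a crawler client might use the batch endpoints defined above. The /crawler/ mount prefix is an assumption (consistent with the https://api.mwmbl.org/crawler/stats URL used by the stats page in this changeset), the user ID is a placeholder that must be USER_ID_LENGTH characters long, and the item fields follow the Batch/Item schemas in mwmbl/crawler/batch.py below:

// Request a batch of URLs to crawl, then post back a crawled item.
const API = 'https://api.mwmbl.org/crawler/';   // assumed mount point
const USER_ID = 'placeholder-user-id';          // must be exactly USER_ID_LENGTH characters

async function crawlOnce() {
    const newBatch = await fetch(API + 'batches/new', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ user_id: USER_ID }),
    });
    const urls = await newBatch.json();         // a (possibly empty) list of URLs

    // A minimal crawled item for the first URL; content and error are optional.
    const items = urls.slice(0, 1).map(url => ({
        url,
        status: 200,
        timestamp: Date.now(),                  // placeholder integer timestamp
        content: { title: 'Placeholder title', extract: 'Placeholder extract', links: [] },
        error: null,
    }));

    await fetch(API + 'batches/', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ user_id: USER_ID, items }),
    });
}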

38
mwmbl/crawler/batch.py Normal file
View file

@ -0,0 +1,38 @@
from typing import Optional
from ninja import Schema
class ItemContent(Schema):
title: str
extract: str
links: list[str]
extra_links: Optional[list[str]]
class ItemError(Schema):
name: str
message: Optional[str]
class Item(Schema):
url: str
status: Optional[int]
timestamp: int
content: Optional[ItemContent]
error: Optional[ItemError]
class Batch(Schema):
user_id: str
items: list[Item]
class NewBatchRequest(Schema):
user_id: str
class HashedBatch(Schema):
user_id_hash: str
timestamp: int
items: list[Item]

133
mwmbl/crawler/stats.py Normal file
View file

@ -0,0 +1,133 @@
import gzip
from datetime import datetime, timedelta
from glob import glob
from itertools import islice
from logging import getLogger
from urllib.parse import urlparse
from pydantic import BaseModel
from redis import Redis
from mwmbl.crawler.batch import HashedBatch
from mwmbl.indexer.update_urls import get_datetime_from_timestamp
logger = getLogger(__name__)
URL_DATE_COUNT_KEY = "url-count-{date}"
URL_HOUR_COUNT_KEY = "url-count-hour-{hour}"
USERS_KEY = "users-{date}"
USER_COUNT_KEY = "user-count-{date}"
HOST_COUNT_KEY = "host-count-{date}"
SHORT_EXPIRE_SECONDS = 60 * 60 * 24
LONG_EXPIRE_SECONDS = 60 * 60 * 24 * 30
class MwmblStats(BaseModel):
urls_crawled_today: int
urls_crawled_daily: dict[str, int]
urls_crawled_hourly: list[int]
users_crawled_daily: dict[str, int]
top_users: dict[str, int]
top_domains: dict[str, int]
class StatsManager:
def __init__(self, redis: Redis):
self.redis = redis
def record_batch(self, hashed_batch: HashedBatch):
date_time = get_datetime_from_timestamp(hashed_batch.timestamp)
num_crawled_urls = sum(1 for item in hashed_batch.items if item.content is not None)
url_count_key = URL_DATE_COUNT_KEY.format(date=date_time.date())
self.redis.incrby(url_count_key, num_crawled_urls)
self.redis.expire(url_count_key, LONG_EXPIRE_SECONDS)
print("Date time", date_time)
hour = datetime(date_time.year, date_time.month, date_time.day, date_time.hour)
hour_key = URL_HOUR_COUNT_KEY.format(hour=hour)
self.redis.incrby(hour_key, num_crawled_urls)
self.redis.expire(hour_key, SHORT_EXPIRE_SECONDS)
users_key = USERS_KEY.format(date=date_time.date())
self.redis.sadd(users_key, hashed_batch.user_id_hash)
self.redis.expire(users_key, LONG_EXPIRE_SECONDS)
user_count_key = USER_COUNT_KEY.format(date=date_time.date())
self.redis.zincrby(user_count_key, num_crawled_urls, hashed_batch.user_id_hash)
self.redis.expire(user_count_key, SHORT_EXPIRE_SECONDS)
host_key = HOST_COUNT_KEY.format(date=date_time.date())
for item in hashed_batch.items:
if item.content is None:
continue
host = urlparse(item.url).netloc
self.redis.zincrby(host_key, 1, host)
self.redis.expire(host_key, SHORT_EXPIRE_SECONDS)
def get_stats(self) -> MwmblStats:
date_time = datetime.now()
date = date_time.date()
urls_crawled_daily = {}
users_crawled_daily = {}
for i in range(29, -1, -1):
date_i = date - timedelta(days=i)
url_count_key = URL_DATE_COUNT_KEY.format(date=date_i)
url_count = self.redis.get(url_count_key)
if url_count is None:
url_count = 0
urls_crawled_daily[str(date_i)] = url_count
user_day_count_key = USERS_KEY.format(date=date_i)
user_day_count = self.redis.scard(user_day_count_key)
users_crawled_daily[str(date_i)] = user_day_count
hour_counts = []
for i in range(date_time.hour + 1):
hour = datetime(date_time.year, date_time.month, date_time.day, i)
hour_key = URL_HOUR_COUNT_KEY.format(hour=hour)
hour_count = self.redis.get(hour_key)
if hour_count is None:
hour_count = 0
hour_counts.append(hour_count)
user_count_key = USER_COUNT_KEY.format(date=date_time.date())
user_counts = self.redis.zrevrange(user_count_key, 0, 100, withscores=True)
host_key = HOST_COUNT_KEY.format(date=date_time.date())
host_counts = self.redis.zrevrange(host_key, 0, 100, withscores=True)
urls_crawled_today = list(urls_crawled_daily.values())[-1]
return MwmblStats(
urls_crawled_today=urls_crawled_today,
urls_crawled_daily=urls_crawled_daily,
urls_crawled_hourly=hour_counts,
users_crawled_daily=users_crawled_daily,
top_users=user_counts,
top_domains=host_counts,
)
def get_test_batches():
for path in glob("./devdata/batches/**/*.json.gz", recursive=True):
print("Processing path", path)
with gzip.open(path) as gzip_file:
yield HashedBatch.parse_raw(gzip_file.read())
if __name__ == '__main__':
redis = Redis(host='localhost', port=6379, decode_responses=True)
stats = StatsManager(redis)
batches = get_test_batches()
start = datetime.now()
processed = 0
for batch in islice(batches, 10000):
stats.record_batch(batch)
processed += 1
total_time = (datetime.now() - start).total_seconds()
print("Processed", processed)
print("Total time", total_time)
print("Time per batch", total_time/processed)

156
mwmbl/crawler/urls.py Normal file
View file

@ -0,0 +1,156 @@
"""
Database storing info on URLs
"""
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from logging import getLogger
from psycopg2.extras import execute_values
from mwmbl.utils import batch
# A client has REASSIGN_MIN_HOURS hours to crawl a URL assigned to them, or it may be reassigned
REASSIGN_MIN_HOURS = 5
BATCH_SIZE = 100
MAX_URLS_PER_TOP_DOMAIN = 100
MAX_TOP_DOMAINS = 500
MAX_OTHER_DOMAINS = 50000
logger = getLogger(__name__)
class URLStatus(Enum):
"""
URL state update is idempotent and can only progress forwards.
"""
NEW = 0 # One user has identified this URL
QUEUED = 5 # The URL has been queued for crawling
ASSIGNED = 10 # The crawler has given the URL to a user to crawl
ERROR_TIMEOUT = 20 # Timeout while retrieving
ERROR_404 = 30 # 404 response
ERROR_OTHER = 40 # Some other error
ERROR_ROBOTS_DENIED = 50 # Robots disallow this page
CRAWLED = 100 # At least one user has crawled the URL
@dataclass
class FoundURL:
url: str
user_id_hash: str
score: float
status: URLStatus
timestamp: datetime
class URLDatabase:
def __init__(self, connection):
self.connection = connection
def create_tables(self):
logger.info("Creating URL tables")
sql = """
CREATE TABLE IF NOT EXISTS urls (
url VARCHAR PRIMARY KEY,
status INT NOT NULL DEFAULT 0,
user_id_hash VARCHAR NOT NULL,
score FLOAT NOT NULL DEFAULT 1,
updated TIMESTAMP NOT NULL DEFAULT NOW()
)
"""
with self.connection.cursor() as cursor:
cursor.execute(sql)
# cursor.execute(index_sql)
# cursor.execute(view_sql)
def update_found_urls(self, found_urls: list[FoundURL]) -> list[FoundURL]:
if len(found_urls) == 0:
return []
get_urls_sql = """
SELECT url FROM urls
WHERE url in %(urls)s
"""
lock_urls_sql = """
SELECT url FROM urls
WHERE url in %(urls)s
FOR UPDATE SKIP LOCKED
"""
insert_sql = f"""
INSERT INTO urls (url, status, user_id_hash, score, updated) values %s
ON CONFLICT (url) DO UPDATE SET
status = GREATEST(urls.status, excluded.status),
user_id_hash = CASE
WHEN urls.status > excluded.status THEN urls.user_id_hash ELSE excluded.user_id_hash
END,
score = urls.score + excluded.score,
updated = CASE
WHEN urls.status > excluded.status THEN urls.updated ELSE excluded.updated
END
RETURNING url, user_id_hash, score, status, updated
"""
input_urls = [x.url for x in found_urls]
assert len(input_urls) == len(set(input_urls))
with self.connection as connection:
with connection.cursor() as cursor:
logger.info(f"Input URLs: {len(input_urls)}")
cursor.execute(get_urls_sql, {'urls': tuple(input_urls)})
existing_urls = {x[0] for x in cursor.fetchall()}
new_urls = set(input_urls) - existing_urls
cursor.execute(lock_urls_sql, {'urls': tuple(input_urls)})
locked_urls = {x[0] for x in cursor.fetchall()}
urls_to_insert = new_urls | locked_urls
logger.info(f"URLs to insert: {len(urls_to_insert)}")
if len(urls_to_insert) != len(input_urls):
print(f"Only got {len(urls_to_insert)} instead of {len(input_urls)} - {len(new_urls)} new")
sorted_urls = sorted(found_urls, key=lambda x: x.url)
data = [
(found_url.url, found_url.status.value, found_url.user_id_hash, found_url.score, found_url.timestamp)
for found_url in sorted_urls if found_url.url in urls_to_insert]
logger.info(f"Data: {len(data)}")
results = execute_values(cursor, insert_sql, data, fetch=True)
logger.info(f"Results: {len(results)}")
updated = [FoundURL(*result) for result in results]
return updated
def get_urls(self, status: URLStatus, num_urls: int) -> list[FoundURL]:
sql = f"""
SELECT url, status, user_id_hash, score, updated FROM urls
WHERE status = %(status)s
LIMIT %(num_urls)s
"""
# TODO: reinstate this line once performance issue is resolved:
# ORDER BY score DESC
with self.connection.cursor() as cursor:
cursor.execute(sql, {'status': status.value, 'num_urls': num_urls})
results = cursor.fetchall()
return [FoundURL(url, user_id_hash, score, status, updated) for url, status, user_id_hash, score, updated in results]
def get_url_scores(self, urls: list[str]) -> dict[str, float]:
sql = f"""
SELECT url, score FROM urls WHERE url IN %(urls)s
"""
url_scores = {}
for url_batch in batch(urls, 10000):
with self.connection.cursor() as cursor:
cursor.execute(sql, {'urls': tuple(url_batch)})
results = cursor.fetchall()
url_scores.update({result[0]: result[1] for result in results})
return url_scores
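For reference, a minimal usage sketch of the URL store above; it assumes a reachable Postgres via the Database wrapper from mwmbl/database.py (shown below), and the example URL and user hash are made up:
from datetime import datetime, timezone
from mwmbl.crawler.urls import FoundURL, URLDatabase, URLStatus
from mwmbl.database import Database

found = [
    FoundURL(
        url="https://example.com/page",        # made-up URL
        user_id_hash="abc123",                 # made-up user hash
        score=1.0,
        status=URLStatus.NEW,
        timestamp=datetime.now(timezone.utc),
    )
]
with Database() as db:
    url_db = URLDatabase(db.connection)
    url_db.create_tables()
    updated = url_db.update_found_urls(found)
    # On conflict the upsert keeps the higher status and adds the scores, so
    # repeated sightings of the same URL raise its priority for crawling.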

mwmbl/database.py (new file)
@@ -0,0 +1,16 @@
from psycopg2 import connect
from mwmbl.settings import DATABASE_URL
class Database:
def __init__(self):
self.connection = None
def __enter__(self):
self.connection = connect(DATABASE_URL)
self.connection.set_session(autocommit=True)
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.connection.close()

mwmbl/format.py (new file)
@@ -0,0 +1,41 @@
import re
from mwmbl.tokenizer import tokenize, clean_unicode
def format_result_with_pattern(pattern, result):
formatted_result = {}
for content_type, content_raw in [('title', result.title), ('extract', result.extract)]:
content = clean_unicode(content_raw)
matches = re.finditer(pattern, content, re.IGNORECASE)
all_spans = [0] + sum((list(m.span()) for m in matches), []) + [len(content)]
content_result = []
for i in range(len(all_spans) - 1):
is_bold = i % 2 == 1
start = all_spans[i]
end = all_spans[i + 1]
content_result.append({'value': content[start:end], 'is_bold': is_bold})
formatted_result[content_type] = content_result
formatted_result['url'] = result.url
return formatted_result
def get_query_regex(terms, is_complete, is_url):
if not terms:
return ''
word_sep = r'\b' if is_url else ''
if is_complete:
term_patterns = [rf'{word_sep}{re.escape(term)}{word_sep}' for term in terms]
else:
term_patterns = [rf'{word_sep}{re.escape(term)}{word_sep}' for term in terms[:-1]] + [
rf'{word_sep}{re.escape(terms[-1])}']
pattern = '|'.join(term_patterns)
return pattern
def format_result(result, query):
tokens = tokenize(query)
pattern = get_query_regex(tokens, True, False)
return format_result_with_pattern(pattern, result)
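To illustrate the highlighting regex on its own, a small sketch assuming mwmbl/format.py above is importable; the sample text is made up:
import re
from mwmbl.format import get_query_regex

pattern = get_query_regex(["rust", "lang"], is_complete=False, is_url=False)
# With is_complete=False the last term is left open-ended, so a partial query
# still highlights the start of a longer word such as "language".
matches = re.finditer(pattern, "The Rust language", re.IGNORECASE)
print([m.span() for m in matches])   # [(4, 8), (9, 13)]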

mwmbl/hn_top_domains_filtered.py
@@ -103,7 +103,6 @@ DOMAINS = {'blog.samaltman.com': 0.9906157038365982,
'lists.gnu.org': 0.9719999849041815,
'www.ccc.de': 0.9719484596362211,
'googleprojectzero.blogspot.com': 0.9719076672640107,
'plus.google.com': 0.9718907817464862,
'blog.cloudflare.com': 0.9718848285343656,
'jeffhuang.com': 0.9718720207664465,
'duckduckgo.com': 0.9718309347264379,
@@ -776,7 +775,7 @@ DOMAINS = {'blog.samaltman.com': 0.9906157038365982,
'pi-hole.net': 0.9453523308356795,
'erik-engheim.medium.com': 0.9453523308356795,
'projecteuler.net': 0.9453380897239767,
'web.archive.org': 0.945336089251807,
# 'web.archive.org': 0.945336089251807,
'coreos.com': 0.945323515243916,
'slack.engineering': 0.9453141410084411,
'jenkins.io': 0.9452507411562461,
@@ -3379,7 +3378,6 @@ DOMAINS = {'blog.samaltman.com': 0.9906157038365982,
'mathpix.com': 0.8899039505478837,
'www.vulture.com': 0.8899034479557729,
'bair.berkeley.edu': 0.8898667877223271,
'picolisp.com': 0.8898372822592416,
'www.goldsborough.me': 0.8897894354492999,
'arkadiyt.com': 0.8897865060368211,
'flowingdata.com': 0.8897859193800971,

mwmbl/indexer/batch.py (deleted)
@@ -1,10 +0,0 @@
from itertools import islice
from typing import Iterator
def grouper(n: int, iterator: Iterator):
while True:
chunk = tuple(islice(iterator, n))
if not chunk:
return
yield chunk

mwmbl/indexer/batch_cache.py (new file)
@@ -0,0 +1,90 @@
"""
Store for local batches.
We store them in a directory on the local machine.
"""
import gzip
import json
import os
from logging import getLogger
from multiprocessing.pool import ThreadPool
from pathlib import Path
from urllib.parse import urlparse
from pydantic import ValidationError
from mwmbl.crawler.batch import HashedBatch
from mwmbl.database import Database
from mwmbl.indexer.indexdb import IndexDatabase, BatchStatus
from mwmbl.retry import retry_requests
logger = getLogger(__name__)
class BatchCache:
num_threads = 20
def __init__(self, repo_path):
os.makedirs(repo_path, exist_ok=True)
self.path = repo_path
def get_cached(self, batch_urls: list[str]) -> dict[str, HashedBatch]:
batches = {}
for url in batch_urls:
path = self.get_path_from_url(url)
try:
data = gzip.GzipFile(path).read()
except FileNotFoundError:
logger.exception(f"Missing batch file: {path}")
continue
try:
batch = HashedBatch.parse_raw(data)
except ValidationError:
logger.exception(f"Unable to parse batch, skipping: '{data}'")
continue
batches[url] = batch
return batches
def retrieve_batches(self, num_batches):
with Database() as db:
index_db = IndexDatabase(db.connection)
index_db.create_tables()
with Database() as db:
index_db = IndexDatabase(db.connection)
batches = index_db.get_batches_by_status(BatchStatus.REMOTE, num_batches)
logger.info(f"Found {len(batches)} remote batches")
if len(batches) == 0:
return
urls = [batch.url for batch in batches]
pool = ThreadPool(self.num_threads)
results = pool.imap_unordered(self.retrieve_batch, urls)
total_processed = 0
for result in results:
total_processed += result
logger.info(f"Processed batches with {total_processed} items")
index_db.update_batch_status(urls, BatchStatus.LOCAL)
def retrieve_batch(self, url):
data = json.loads(gzip.decompress(retry_requests.get(url).content))
try:
batch = HashedBatch.parse_obj(data)
except ValidationError:
logger.info(f"Failed to validate batch {data}")
return 0
if len(batch.items) > 0:
self.store(batch, url)
return len(batch.items)
def store(self, batch, url):
path = self.get_path_from_url(url)
logger.debug(f"Storing local batch at {path}")
os.makedirs(path.parent, exist_ok=True)
with open(path, 'wb') as output_file:
data = gzip.compress(batch.json().encode('utf8'))
output_file.write(data)
def get_path_from_url(self, url) -> Path:
url_path = urlparse(url).path
return Path(self.path) / url_path.lstrip('/')
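A small sketch of how batch URLs map onto local cache paths; the cache directory and batch URL are made up:
from mwmbl.indexer.batch_cache import BatchCache

cache = BatchCache("/tmp/mwmbl-batches")
url = "https://example-storage.mwmbl.org/1/2023-11-18/abc123/batch-0001.json.gz"  # made-up URL
print(cache.get_path_from_url(url))
# /tmp/mwmbl-batches/1/2023-11-18/abc123/batch-0001.json.gz
# retrieve_batches() downloads REMOTE batches into this layout and then marks
# them LOCAL in the batches table.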

mwmbl/indexer/blacklist.py (new file)
@@ -0,0 +1,33 @@
from datetime import timedelta
from requests_cache import CachedSession
from mwmbl.hn_top_domains_filtered import DOMAINS
from mwmbl.settings import BLACKLIST_DOMAINS_URL, EXCLUDED_DOMAINS, DOMAIN_BLACKLIST_REGEX
def get_blacklist_domains():
with CachedSession(expire_after=timedelta(days=1)) as session:
response = session.get(BLACKLIST_DOMAINS_URL)
return set(response.text.split())
def is_domain_blacklisted(domain: str, blacklist_domains: set[str]):
if domain in EXCLUDED_DOMAINS or DOMAIN_BLACKLIST_REGEX.search(domain) is not None \
or domain in blacklist_domains:
return True
if domain in DOMAINS:
return False
# TODO: this is to filter out spammy domains that look like:
# brofqpxj.uelinc.com
# gzsmjc.fba01.com
# 59648.etnomurcia.com
#
# Eventually we can figure out a better way to identify SEO spam
domain_parts = domain.split('.')
if (len(domain_parts) == 3 and domain_parts[2] == "com" and len(domain_parts[0]) in {6, 8}) or (
set(domain_parts[0]) <= set("1234567890")
):
return True
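A quick sanity check of the spam heuristic above; the domains are examples, and the expected results assume none of them match EXCLUDED_DOMAINS or DOMAIN_BLACKLIST_REGEX from settings:
from mwmbl.indexer.blacklist import is_domain_blacklisted

extra_blacklist = set()
print(is_domain_blacklisted("brofqpxj.uelinc.com", extra_blacklist))   # True: 8-character label under .com
print(is_domain_blacklisted("59648.etnomurcia.com", extra_blacklist))  # True: all-digit first label
print(is_domain_blacklisted("blog.cloudflare.com", extra_blacklist))   # False: listed in DOMAINS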

@@ -4,8 +4,9 @@ Dedupe pages that have been crawled more than once and prepare them for indexing
import glob
import gzip
import json
from itertools import islice
from typing import Iterator
from mwmbl.indexer.batch import grouper
from mwmbl.indexer.fsqueue import FSQueue, GzipJsonBlobSerializer
from mwmbl.indexer.paths import CRAWL_GLOB, TINYSEARCH_DATA_DIR
@@ -40,3 +41,11 @@ def run():
if __name__ == '__main__':
run()
def grouper(n: int, iterator: Iterator):
while True:
chunk = tuple(islice(iterator, n))
if not chunk:
return
yield chunk
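The grouper helper added above chunks an iterator into fixed-size tuples, with a possibly shorter final chunk; a tiny example:
items = iter(range(7))
print(list(grouper(3, items)))   # [(0, 1, 2), (3, 4, 5), (6,)]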

@@ -9,14 +9,6 @@ from pathlib import Path
import numpy as np
import pandas as pd
DATA_DIR = Path(os.environ['HOME']) / 'data' / 'tinysearch'
ALL_DOMAINS_PATH = DATA_DIR / 'hn-top-domains.csv'
TOP_DOMAINS_PATH = '../tinysearchengine/hn_top_domains_filtered.py'
MIN_COUNT = 10
PROBABILITY_THRESHOLD = 0.8
def get_top_domains():
data = pd.read_csv(ALL_DOMAINS_PATH, index_col='domain')
data = data[data.index.notnull()]

@@ -0,0 +1,28 @@
from datetime import date, timedelta
from mwmbl.crawler.app import get_batches_for_date
from mwmbl.database import Database
from mwmbl.indexer.indexdb import BatchInfo, BatchStatus, IndexDatabase
DAYS = 20
def run():
for day in range(DAYS):
date_str = str(date.today() - timedelta(days=day))
with Database() as db:
index_db = IndexDatabase(db.connection)
index_db.create_tables()
batches = get_batches_for_date(date_str)
batch_urls = batches['batch_urls']
print("Historical batches for date", date_str, len(batch_urls))
infos = [BatchInfo(url, get_user_id_hash_from_url(url), BatchStatus.REMOTE) for url in batch_urls]
index_db.record_batches(infos)
def get_user_id_hash_from_url(url):
return url.split('/')[9]
if __name__ == '__main__':
run()

mwmbl/indexer/index.py
@@ -1,35 +1,38 @@
"""
Create a search index
"""
from collections import Counter
from typing import Iterable
from urllib.parse import unquote
import pandas as pd
from mwmbl.tinysearchengine.indexer import Document, TokenizedDocument, TinyIndex
from mwmbl.tinysearchengine.indexer import TokenizedDocument
from mwmbl.tokenizer import tokenize, get_bigrams
DEFAULT_SCORE = 0
HTTP_START = 'http://'
HTTPS_START = 'https://'
BATCH_SIZE = 100
NUM_FIRST_TOKENS = 3
NUM_BIGRAMS = 5
def is_content_token(nlp, token):
lexeme = nlp.vocab[token.orth]
return (lexeme.is_alpha or lexeme.is_digit) and not token.is_stop
def tokenize(nlp, input_text):
cleaned_text = input_text.encode('utf8', 'replace').decode('utf8')
tokens = nlp.tokenizer(cleaned_text)
if input_text.endswith('…'):
# Discard the last two tokens since there will likely be a word cut in two
tokens = tokens[:-2]
content_tokens = [token for token in tokens if is_content_token(nlp, token)]
lowered = {nlp.vocab[token.orth].text.lower() for token in content_tokens}
return lowered
STOPWORDS = set("0,1,2,3,4,5,6,7,8,9,a,A,about,above,across,after,again,against,all,almost,alone,along,already,also," \
"although,always,am,among,an,and,another,any,anyone,anything,anywhere,are,aren't,around,as,at,b,B,back," \
"be,became,because,become,becomes,been,before,behind,being,below,between,both,but,by,c,C,can,cannot,can't," \
"could,couldn't,d,D,did,didn't,do,does,doesn't,doing,done,don't,down,during,e,E,each,either,enough,even," \
"ever,every,everyone,everything,everywhere,f,F,few,find,first,for,four,from,full,further,g,G,get,give,go," \
"h,H,had,hadn't,has,hasn't,have,haven't,having,he,he'd,he'll,her,here,here's,hers,herself,he's,him," \
"himself,his,how,however,how's,i,I,i'd,if,i'll,i'm,in,interest,into,is,isn't,it,it's,its,itself,i've," \
"j,J,k,K,keep,l,L,last,least,less,let's,m,M,made,many,may,me,might,more,most,mostly,much,must,mustn't," \
"my,myself,n,N,never,next,no,nobody,noone,nor,not,nothing,now,nowhere,o,O,of,off,often,on,once,one,only," \
"or,other,others,ought,our,ours,ourselves,out,over,own,p,P,part,per,perhaps,put,q,Q,r,R,rather,s,S,same," \
"see,seem,seemed,seeming,seems,several,shan't,she,she'd,she'll,she's,should,shouldn't,show,side,since,so," \
"some,someone,something,somewhere,still,such,t,T,take,than,that,that's,the,their,theirs,them,themselves," \
"then,there,therefore,there's,these,they,they'd,they'll,they're,they've,this,those,though,three,through," \
"thus,to,together,too,toward,two,u,U,under,until,up,upon,us,v,V,very,w,W,was,wasn't,we,we'd,we'll,well," \
"we're,were,weren't,we've,what,what's,when,when's,where,where's,whether,which,while,who,whole,whom,who's," \
"whose,why,why's,will,with,within,without,won't,would,wouldn't,x,X,y,Y,yet,you,you'd,you'll,your,you're," \
"yours,yourself,yourselves,you've,z,Z".split(','))
def prepare_url_for_tokenizing(url: str):
@@ -45,29 +48,29 @@ def prepare_url_for_tokenizing(url: str):
def get_pages(nlp, titles_urls_and_extracts, link_counts) -> Iterable[TokenizedDocument]:
for i, (title_cleaned, url, extract) in enumerate(titles_urls_and_extracts):
title_tokens = tokenize(nlp, title_cleaned)
prepared_url = prepare_url_for_tokenizing(unquote(url))
url_tokens = tokenize(nlp, prepared_url)
extract_tokens = tokenize(nlp, extract)
print("Extract tokens", extract_tokens)
tokens = title_tokens | url_tokens | extract_tokens
score = link_counts.get(url, DEFAULT_SCORE)
yield TokenizedDocument(tokens=list(tokens), url=url, title=title_cleaned, extract=extract, score=score)
yield tokenize_document(url, title_cleaned, extract, score)
if i % 1000 == 0:
print("Processed", i)
def index_titles_urls_and_extracts(indexer: TinyIndex, nlp, titles_urls_and_extracts, link_counts, terms_path):
terms = Counter()
pages = get_pages(nlp, titles_urls_and_extracts, link_counts)
for page in pages:
for token in page.tokens:
indexer.index(token, Document(url=page.url, title=page.title, extract=page.extract, score=page.score))
terms.update([t.lower() for t in page.tokens])
def get_index_tokens(tokens):
first_tokens = tokens[:NUM_FIRST_TOKENS]
bigrams = get_bigrams(NUM_BIGRAMS, tokens)
return set(first_tokens + bigrams)
term_df = pd.DataFrame({
'term': terms.keys(),
'count': terms.values(),
})
term_df.to_csv(terms_path)
def tokenize_document(url, title_cleaned, extract, score):
title_tokens = tokenize(title_cleaned)
prepared_url = prepare_url_for_tokenizing(unquote(url))
url_tokens = tokenize(prepared_url)
extract_tokens = tokenize(extract)
# print("Extract tokens", extract_tokens)
tokens = get_index_tokens(title_tokens) | get_index_tokens(url_tokens) | get_index_tokens(extract_tokens)
# doc = Document(title_cleaned, url, extract, score)
# token_scores = {token: score_result([token], doc, True) for token in tokens}
# high_scoring_tokens = [k for k, v in token_scores.items() if v > 0.5]
# print("High scoring", len(high_scoring_tokens), token_scores, doc)
document = TokenizedDocument(tokens=list(tokens), url=url, title=title_cleaned, extract=extract, score=score)
return document
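A hedged sketch of tokenize_document on a toy page; the exact token and bigram strings depend on mwmbl.tokenizer, so the comments are indicative only and the URL is made up:
from mwmbl.indexer.index import tokenize_document

doc = tokenize_document(
    url="https://example.com/rust-lang",                 # made-up URL
    title_cleaned="The Rust programming language",
    extract="Rust is a systems programming language.",
    score=2.5,
)
# doc.tokens holds the first NUM_FIRST_TOKENS tokens plus up to NUM_BIGRAMS
# bigrams from the title, the prepared URL and the extract.
print(doc.url, doc.score, sorted(doc.tokens))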

mwmbl/indexer/index_batches.py (new file)
@@ -0,0 +1,93 @@
"""
Index batches that are stored locally.
"""
from collections import defaultdict
from logging import getLogger
from typing import Collection, Iterable
import spacy
from mwmbl.indexer import process_batch
from spacy import Language
from mwmbl.crawler.batch import HashedBatch, Item
from mwmbl.crawler.urls import URLDatabase, URLStatus
from mwmbl.database import Database
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.index import tokenize_document
from mwmbl.indexer.indexdb import BatchStatus
from mwmbl.tinysearchengine.indexer import Document, TinyIndex
from mwmbl.utils import add_term_info, add_term_infos
logger = getLogger(__name__)
def get_documents_from_batches(batches: Collection[HashedBatch]) -> Iterable[tuple[str, str, str]]:
for batch in batches:
for item in batch.items:
if item.content is not None:
yield item.content.title, item.url, item.content.extract
def run(batch_cache: BatchCache, index_path: str):
def process(batches: Collection[HashedBatch]):
with Database() as db:
url_db = URLDatabase(db.connection)
index_batches(batches, index_path, url_db)
logger.info("Indexed pages")
process_batch.run(batch_cache, BatchStatus.URLS_UPDATED, BatchStatus.INDEXED, process)
def index_batches(batch_data: Collection[HashedBatch], index_path: str, url_db: URLDatabase):
document_tuples = list(get_documents_from_batches(batch_data))
urls = [url for title, url, extract in document_tuples]
url_scores = url_db.get_url_scores(urls)
logger.info(f"Indexing {len(urls)} document tuples and {len(url_scores)} URL scores")
documents = [Document(title, url, extract, url_scores.get(url, 1.0)) for title, url, extract in document_tuples]
page_documents = preprocess_documents(documents, index_path)
index_pages(index_path, page_documents)
def index_pages(index_path, page_documents):
with TinyIndex(Document, index_path, 'w') as indexer:
for page, documents in page_documents.items():
new_documents = []
existing_documents = indexer.get_page(page)
seen_urls = set()
seen_titles = set()
sorted_documents = sorted(documents + existing_documents, key=lambda x: x.score, reverse=True)
# TODO: for now we add the term here, until all the documents in the index have terms
sorted_documents_with_terms = add_term_infos(sorted_documents, indexer, page)
for document in sorted_documents_with_terms:
if document.title in seen_titles or document.url in seen_urls:
continue
new_documents.append(document)
seen_urls.add(document.url)
seen_titles.add(document.title)
logger.info(f"Storing {len(new_documents)} documents for page {page}, originally {len(existing_documents)}")
indexer.store_in_page(page, new_documents)
def preprocess_documents(documents, index_path):
page_documents = defaultdict(list)
with TinyIndex(Document, index_path, 'w') as indexer:
for document in documents:
tokenized = tokenize_document(document.url, document.title, document.extract, document.score)
for token in tokenized.tokens:
page = indexer.get_key_page_index(token)
term_document = Document(document.title, document.url, document.extract, document.score, token)
page_documents[page].append(term_document)
print(f"Preprocessed for {len(page_documents)} pages")
return page_documents
def get_url_error_status(item: Item):
if item.status == 404:
return URLStatus.ERROR_404
if item.error is not None:
if item.error.name == 'AbortError':
return URLStatus.ERROR_TIMEOUT
elif item.error.name == 'RobotsDenied':
return URLStatus.ERROR_ROBOTS_DENIED
return URLStatus.ERROR_OTHER
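Sketch wiring for indexing locally cached batches, using the path constants added in mwmbl/indexer/paths.py below; the data directory is illustrative and a Postgres database plus an existing index file are assumed:
from pathlib import Path
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.index_batches import run
from mwmbl.indexer.paths import BATCH_DIR_NAME, INDEX_NAME

data_path = Path("/srv/mwmbl-data")                  # illustrative
batch_cache = BatchCache(data_path / BATCH_DIR_NAME)
run(batch_cache, str(data_path / INDEX_NAME))
# Processes batches in status URLS_UPDATED and marks them INDEXED.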

@@ -1,49 +0,0 @@
"""
Index data crawled through the Mwmbl crawler.
"""
import json
from logging import getLogger
import spacy
from mwmbl.indexer.fsqueue import FSQueue, GzipJsonBlobSerializer, FSQueueError
from mwmbl.indexer.index import index_titles_urls_and_extracts
from mwmbl.indexer.paths import INDEX_PATH, MWMBL_CRAWL_TERMS_PATH, TINYSEARCH_DATA_DIR, LINK_COUNT_PATH
from mwmbl.tinysearchengine.indexer import TinyIndex, Document, NUM_PAGES, PAGE_SIZE
logger = getLogger(__name__)
def index_mwmbl_crawl_data():
nlp = spacy.load("en_core_web_sm")
titles_urls_and_extracts = get_mwmbl_crawl_titles_urls_and_extracts()
link_counts = json.load(open(LINK_COUNT_PATH))
TinyIndex.create(Document, INDEX_PATH, NUM_PAGES, PAGE_SIZE)
with TinyIndex(Document, INDEX_PATH, 'w') as indexer:
index_titles_urls_and_extracts(indexer, nlp, titles_urls_and_extracts, link_counts, MWMBL_CRAWL_TERMS_PATH)
def get_mwmbl_crawl_titles_urls_and_extracts():
input_queue = FSQueue(TINYSEARCH_DATA_DIR, 'mwmbl-search-items', GzipJsonBlobSerializer())
input_queue.unlock_all()
while True:
try:
next_item = input_queue.get()
except FSQueueError as e:
logger.exception(f'Error with item {e.item_id}')
input_queue.error(e.item_id)
continue
if next_item is None:
logger.info('Not more items to process, stopping')
break
item_id, item_data = next_item
logger.info(f'Processing item {item_id}')
for item in item_data['items']:
yield item['title'], item['url'], item['extract']
input_queue.done(item_id)
if __name__ == '__main__':
index_mwmbl_crawl_data()

mwmbl/indexer/indexdb.py (new file)
@@ -0,0 +1,71 @@
"""
Database interface for batches of crawled data.
"""
from dataclasses import dataclass
from enum import Enum
from psycopg2.extras import execute_values
class BatchStatus(Enum):
REMOTE = 0 # The batch only exists in long term storage
LOCAL = 10 # We have a copy of the batch locally in Postgresql
URLS_UPDATED = 20 # We've updated URLs from the batch
INDEXED = 30 # The batch has been indexed
@dataclass
class BatchInfo:
url: str
user_id_hash: str
status: BatchStatus
class IndexDatabase:
def __init__(self, connection):
self.connection = connection
def create_tables(self):
batches_sql = """
CREATE TABLE IF NOT EXISTS batches (
url VARCHAR PRIMARY KEY,
user_id_hash VARCHAR NOT NULL,
status INT NOT NULL
)
"""
with self.connection.cursor() as cursor:
cursor.execute(batches_sql)
def record_batches(self, batch_infos: list[BatchInfo]):
sql = """
INSERT INTO batches (url, user_id_hash, status) values %s
ON CONFLICT (url) DO NOTHING
"""
data = [(info.url, info.user_id_hash, info.status.value) for info in batch_infos]
with self.connection.cursor() as cursor:
execute_values(cursor, sql, data)
def get_batches_by_status(self, status: BatchStatus, num_batches=1000) -> list[BatchInfo]:
sql = """
SELECT * FROM batches WHERE status = %(status)s LIMIT %(num_batches)s
"""
with self.connection.cursor() as cursor:
cursor.execute(sql, {'status': status.value, 'num_batches': num_batches})
results = cursor.fetchall()
return [BatchInfo(url, user_id_hash, status) for url, user_id_hash, status in results]
def update_batch_status(self, batch_urls: list[str], status: BatchStatus):
if not batch_urls:
return
sql = """
UPDATE batches SET status = %(status)s
WHERE url IN %(urls)s
"""
with self.connection.cursor() as cursor:
cursor.execute(sql, {'status': status.value, 'urls': tuple(batch_urls)})
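A hedged sketch of the batch lifecycle tracked by this table: REMOTE when first recorded, LOCAL once downloaded, URLS_UPDATED after link scores are recorded, INDEXED at the end. The batch URL and user hash are made up:
from mwmbl.database import Database
from mwmbl.indexer.indexdb import BatchInfo, BatchStatus, IndexDatabase

with Database() as db:
    index_db = IndexDatabase(db.connection)
    index_db.create_tables()
    index_db.record_batches([
        BatchInfo("https://example.com/batches/abc.json.gz", "user-hash", BatchStatus.REMOTE),
    ])
    remote = index_db.get_batches_by_status(BatchStatus.REMOTE, num_batches=10)
    index_db.update_batch_status([b.url for b in remote], BatchStatus.LOCAL)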

mwmbl/indexer/paths.py
@@ -26,3 +26,6 @@ TOP_DOMAINS_JSON_PATH = TINYSEARCH_DATA_DIR / 'hn-top-domains.json'
MWMBL_DATA_DIR = DATA_DIR / "mwmbl"
CRAWL_GLOB = str(MWMBL_DATA_DIR / "b2") + "/*/*/*/*/*/*.json.gz"
LINK_COUNT_PATH = MWMBL_DATA_DIR / 'crawl-counts.json'
INDEX_NAME = 'index-v2.tinysearch'
BATCH_DIR_NAME = 'batches'

mwmbl/indexer/process_batch.py (new file)
@@ -0,0 +1,33 @@
from logging import getLogger
from typing import Callable, Collection
from mwmbl.crawler.batch import HashedBatch
from mwmbl.database import Database
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.indexdb import BatchStatus, IndexDatabase
logger = getLogger(__name__)
def run(batch_cache: BatchCache, start_status: BatchStatus, end_status: BatchStatus,
process: Callable[[Collection[HashedBatch], ...], None], *args):
with Database() as db:
index_db = IndexDatabase(db.connection)
logger.info(f"Getting batches with status {start_status}")
batches = index_db.get_batches_by_status(start_status, 10000)
logger.info(f"Got {len(batches)} batch urls")
if len(batches) == 0:
return
batch_data = batch_cache.get_cached([batch.url for batch in batches])
logger.info(f"Got {len(batch_data)} cached batches")
missing_batches = {batch.url for batch in batches} - batch_data.keys()
logger.info(f"Got {len(missing_batches)} missing batches")
index_db.update_batch_status(list(missing_batches), BatchStatus.REMOTE)
process(batch_data.values(), *args)
index_db.update_batch_status(list(batch_data.keys()), end_status)

mwmbl/indexer/update_urls.py (new file)
@@ -0,0 +1,105 @@
from collections import defaultdict
from datetime import datetime, timezone, timedelta
from logging import getLogger
from multiprocessing import Queue
from pathlib import Path
from time import sleep
from typing import Collection
from urllib.parse import urlparse
from mwmbl.crawler.batch import HashedBatch
from mwmbl.crawler.urls import URLDatabase, URLStatus, FoundURL
from mwmbl.database import Database
from mwmbl.hn_top_domains_filtered import DOMAINS
from mwmbl.indexer import process_batch
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.blacklist import get_blacklist_domains, is_domain_blacklisted
from mwmbl.indexer.index_batches import get_url_error_status
from mwmbl.indexer.indexdb import BatchStatus
from mwmbl.indexer.paths import BATCH_DIR_NAME
from mwmbl.settings import UNKNOWN_DOMAIN_MULTIPLIER, SCORE_FOR_SAME_DOMAIN, \
SCORE_FOR_DIFFERENT_DOMAIN, SCORE_FOR_ROOT_PATH, EXTRA_LINK_MULTIPLIER
from mwmbl.utils import get_domain
logger = getLogger(__name__)
def update_urls_continuously(data_path: str, new_item_queue: Queue):
batch_cache = BatchCache(Path(data_path) / BATCH_DIR_NAME)
while True:
try:
run(batch_cache, new_item_queue)
except Exception:
logger.exception("Error updating URLs")
sleep(10)
def run(batch_cache: BatchCache, new_item_queue: Queue):
process_batch.run(batch_cache, BatchStatus.LOCAL, BatchStatus.URLS_UPDATED, record_urls_in_database, new_item_queue)
def record_urls_in_database(batches: Collection[HashedBatch], new_item_queue: Queue):
start = datetime.now()
blacklist_domains = get_blacklist_domains()
blacklist_retrieval_time = datetime.now() - start
logger.info(f"Recording URLs in database for {len(batches)} batches, with {len(blacklist_domains)} blacklist "
f"domains, retrieved in {blacklist_retrieval_time.total_seconds()} seconds")
with Database() as db:
url_db = URLDatabase(db.connection)
url_scores = defaultdict(float)
url_users = {}
url_timestamps = {}
url_statuses = defaultdict(lambda: URLStatus.NEW)
for batch in batches:
for item in batch.items:
timestamp = get_datetime_from_timestamp(item.timestamp / 1000.0)
url_timestamps[item.url] = timestamp
url_users[item.url] = batch.user_id_hash
if item.content is None:
url_statuses[item.url] = get_url_error_status(item)
else:
url_statuses[item.url] = URLStatus.CRAWLED
try:
crawled_page_domain = get_domain(item.url)
except ValueError:
logger.info(f"Couldn't parse URL {item.url}")
continue
score_multiplier = 1 if crawled_page_domain in DOMAINS else UNKNOWN_DOMAIN_MULTIPLIER
for link in item.content.links:
process_link(batch.user_id_hash, crawled_page_domain, link, score_multiplier, timestamp, url_scores,
url_timestamps, url_users, False, blacklist_domains)
if item.content.extra_links:
for link in item.content.extra_links:
process_link(batch.user_id_hash, crawled_page_domain, link, score_multiplier, timestamp, url_scores,
url_timestamps, url_users, True, blacklist_domains)
found_urls = [FoundURL(url, url_users[url], url_scores[url], url_statuses[url], url_timestamps[url])
for url in url_scores.keys() | url_statuses.keys()]
logger.info(f"Found URLs, {len(found_urls)}")
urls = url_db.update_found_urls(found_urls)
new_item_queue.put(urls)
logger.info(f"Put {len(urls)} new items in the URL queue")
def process_link(user_id_hash, crawled_page_domain, link, unknown_domain_multiplier, timestamp, url_scores, url_timestamps, url_users, is_extra: bool, blacklist_domains):
parsed_link = urlparse(link)
if is_domain_blacklisted(parsed_link.netloc, blacklist_domains):
logger.debug(f"Excluding link for blacklisted domain: {parsed_link}")
return
extra_multiplier = EXTRA_LINK_MULTIPLIER if is_extra else 1.0
score = SCORE_FOR_SAME_DOMAIN if parsed_link.netloc == crawled_page_domain else SCORE_FOR_DIFFERENT_DOMAIN
url_scores[link] += score * unknown_domain_multiplier * extra_multiplier
url_users[link] = user_id_hash
url_timestamps[link] = timestamp
domain = f'{parsed_link.scheme}://{parsed_link.netloc}/'
url_scores[domain] += SCORE_FOR_ROOT_PATH * unknown_domain_multiplier
url_users[domain] = user_id_hash
url_timestamps[domain] = timestamp
def get_datetime_from_timestamp(timestamp: float) -> datetime:
batch_datetime = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=timestamp)
return batch_datetime
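To show how a single outgoing link feeds the score maps above, a sketch of one process_link call; the scoring constants live in mwmbl.settings and are not shown here, the link and user hash are made up, and the outcome assumes the target domain is not blacklisted:
from collections import defaultdict
from datetime import datetime, timezone
from mwmbl.indexer.update_urls import process_link

url_scores = defaultdict(float)
url_users, url_timestamps = {}, {}
process_link(
    user_id_hash="user-hash",                    # made up
    crawled_page_domain="example.com",
    link="https://other.example.org/article",    # made up
    unknown_domain_multiplier=1,
    timestamp=datetime.now(timezone.utc),
    url_scores=url_scores,
    url_timestamps=url_timestamps,
    url_users=url_users,
    is_extra=False,
    blacklist_domains=set(),
)
# The linked page gains SCORE_FOR_DIFFERENT_DOMAIN (the domains differ) and its
# root URL https://other.example.org/ gains SCORE_FOR_ROOT_PATH.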

mwmbl/main.py (new file)
@@ -0,0 +1,14 @@
import django
import uvicorn
from django.core.management import call_command
def run():
django.setup()
call_command("collectstatic", "--clear", "--noinput")
call_command("migrate")
uvicorn.run("mwmbl.asgi:application", host="0.0.0.0", port=5000)
if __name__ == "__main__":
run()

@@ -0,0 +1,58 @@
# Generated by Django 4.2.6 on 2023-10-25 11:55
from django.conf import settings
import django.contrib.auth.models
import django.contrib.auth.validators
from django.db import migrations, models
import django.db.models.deletion
import django.utils.timezone
class Migration(migrations.Migration):
initial = True
dependencies = [
('auth', '0012_alter_user_first_name_max_length'),
]
operations = [
migrations.CreateModel(
name='MwmblUser',
fields=[
('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('password', models.CharField(max_length=128, verbose_name='password')),
('last_login', models.DateTimeField(blank=True, null=True, verbose_name='last login')),
('is_superuser', models.BooleanField(default=False, help_text='Designates that this user has all permissions without explicitly assigning them.', verbose_name='superuser status')),
('username', models.CharField(error_messages={'unique': 'A user with that username already exists.'}, help_text='Required. 150 characters or fewer. Letters, digits and @/./+/-/_ only.', max_length=150, unique=True, validators=[django.contrib.auth.validators.UnicodeUsernameValidator()], verbose_name='username')),
('first_name', models.CharField(blank=True, max_length=150, verbose_name='first name')),
('last_name', models.CharField(blank=True, max_length=150, verbose_name='last name')),
('email', models.EmailField(blank=True, max_length=254, verbose_name='email address')),
('is_staff', models.BooleanField(default=False, help_text='Designates whether the user can log into this admin site.', verbose_name='staff status')),
('is_active', models.BooleanField(default=True, help_text='Designates whether this user should be treated as active. Unselect this instead of deleting accounts.', verbose_name='active')),
('date_joined', models.DateTimeField(default=django.utils.timezone.now, verbose_name='date joined')),
('groups', models.ManyToManyField(blank=True, help_text='The groups this user belongs to. A user will get all permissions granted to each of their groups.', related_name='user_set', related_query_name='user', to='auth.group', verbose_name='groups')),
('user_permissions', models.ManyToManyField(blank=True, help_text='Specific permissions for this user.', related_name='user_set', related_query_name='user', to='auth.permission', verbose_name='user permissions')),
],
options={
'verbose_name': 'user',
'verbose_name_plural': 'users',
'abstract': False,
},
managers=[
('objects', django.contrib.auth.models.UserManager()),
],
),
migrations.CreateModel(
name='UserCuration',
fields=[
('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('timestamp', models.DateTimeField()),
('url', models.CharField(max_length=300)),
('results', models.JSONField()),
('curation_type', models.CharField(max_length=20)),
('curation', models.JSONField()),
('user', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to=settings.AUTH_USER_MODEL)),
],
),
]


mwmbl/models.py (new file)
@@ -0,0 +1,15 @@
from django.db import models
from django.contrib.auth.models import AbstractUser
class MwmblUser(AbstractUser):
pass
class UserCuration(models.Model):
user = models.ForeignKey(MwmblUser, on_delete=models.CASCADE)
timestamp = models.DateTimeField()
url = models.CharField(max_length=300)
results = models.JSONField()
curation_type = models.CharField(max_length=20)
curation = models.JSONField()


mwmbl/platform/curate.py (new file)
@@ -0,0 +1,89 @@
from logging import getLogger
from typing import Any
from urllib.parse import parse_qs
from ninja import Router, NinjaAPI
from mwmbl.indexer.update_urls import get_datetime_from_timestamp
from mwmbl.models import UserCuration
from mwmbl.platform.data import CurateBegin, CurateMove, CurateDelete, CurateAdd, CurateValidate, \
make_curation_type
from mwmbl.tinysearchengine.indexer import TinyIndex, Document
from mwmbl.tokenizer import tokenize
from mwmbl.utils import add_term_info, add_term_infos
RESULT_URL = "https://mwmbl.org/?q="
MAX_CURATED_SCORE = 1_111_111.0
logger = getLogger(__name__)
def create_router(index_path: str, version: str) -> NinjaAPI:
router = NinjaAPI(urls_namespace=f"curate-{version}", csrf=True)
@router.post("/begin")
def user_begin_curate(request, curate_begin: make_curation_type(CurateBegin)):
return _curate(request, "curate_begin", curate_begin)
@router.post("/move")
def user_move_result(request, curate_move: make_curation_type(CurateMove)):
return _curate(request, "curate_move", curate_move)
@router.post("/delete")
def user_delete_result(request, curate_delete: make_curation_type(CurateDelete)):
return _curate(request, "curate_delete", curate_delete)
@router.post("/add")
def user_add_result(request, curate_add: make_curation_type(CurateAdd)):
return _curate(request, "curate_add", curate_add)
@router.post("/validate")
def user_validate_result(request, curate_validate: make_curation_type(CurateValidate)):
return _curate(request, "curate_validate", curate_validate)
def _curate(request, curation_type: str, curation: Any):
user_curation = UserCuration(
user=request.user,
timestamp=get_datetime_from_timestamp(curation.timestamp / 1000.0),
url=curation.url,
results=curation.dict()["results"],
curation_type=curation_type,
curation=curation.curation.dict(),
)
user_curation.save()
with TinyIndex(Document, index_path, 'w') as indexer:
query_string = parse_qs(curation.url)
if len(query_string) > 1:
raise ValueError(f"Should be one query string in the URL: {curation.url}")
queries = next(iter(query_string.values()))
if len(queries) > 1:
raise ValueError(f"Should be one query value in the URL: {curation.url}")
query = queries[0]
tokens = tokenize(query)
term = " ".join(tokens)
documents = [
Document(result.title, result.url, result.extract, MAX_CURATED_SCORE - i, term, result.curated)
for i, result in enumerate(curation.results)
]
page_index = indexer.get_key_page_index(term)
existing_documents_no_terms = indexer.get_page(page_index)
existing_documents = add_term_infos(existing_documents_no_terms, indexer, page_index)
other_documents = [doc for doc in existing_documents if doc.term != term]
logger.info(f"Found {len(other_documents)} other documents for term {term} at page {page_index} "
f"with terms { {doc.term for doc in other_documents} }")
all_documents = documents + other_documents
logger.info(f"Storing {len(all_documents)} documents at page {page_index}")
indexer.store_in_page(page_index, all_documents)
return {"curation": "ok"}
return router

mwmbl/platform/data.py (new file)
@@ -0,0 +1,46 @@
from datetime import datetime
from typing import TypeVar, Generic
from ninja import Schema
class Result(Schema):
url: str
title: str
extract: str
curated: bool
class CurateBegin(Schema):
pass
class CurateMove(Schema):
old_index: int
new_index: int
class CurateDelete(Schema):
delete_index: int
class CurateAdd(Schema):
insert_index: int
url: str
class CurateValidate(Schema):
validate_index: int
is_validated: bool
T = TypeVar('T', CurateBegin, CurateAdd, CurateDelete, CurateMove, CurateValidate)
def make_curation_type(t):
class Curation(Schema):
timestamp: int
url: str
results: list[Result]
curation: t
return Curation
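For reference, a hedged example of a request body for the /move curation endpoint that matches the schemas above; the values are illustrative, and the timestamp is in milliseconds because _curate divides it by 1000:
payload = {
    "timestamp": 1700000000000,           # milliseconds since the epoch
    "url": "https://mwmbl.org/?q=rust",   # illustrative curated query URL
    "results": [
        {"url": "https://www.rust-lang.org/", "title": "Rust", "extract": "...", "curated": True},
    ],
    "curation": {"old_index": 1, "new_index": 0},
}
# The other endpoints swap in CurateAdd, CurateDelete, CurateValidate or
# CurateBegin as the type of the "curation" field.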

Binary file not shown.

File diff suppressed because it is too large.

mwmbl/retry.py (new file)
@@ -0,0 +1,17 @@
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
MAX_RETRY = 2
MAX_RETRY_FOR_SESSION = 2
BACK_OFF_FACTOR = 0.3
TIME_BETWEEN_RETRIES = 1000
ERROR_CODES = (500, 502, 504)
retry_requests = requests.Session()
retry = Retry(total=5, backoff_factor=1)
adapter = HTTPAdapter(max_retries=retry)
retry_requests.mount('http://', adapter)
retry_requests.mount('https://', adapter)
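retry_requests behaves like a plain requests session but retries transient failures with exponential backoff; a sketch with a made-up URL:
from mwmbl.retry import retry_requests

response = retry_requests.get("https://example.com/batch.json.gz", timeout=30)
response.raise_for_status()
data = response.content   # gzipped batch bytes in the crawler pipeline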

mwmbl/search_setup.py (new file)
@@ -0,0 +1,19 @@
from multiprocessing import Queue
from pathlib import Path
from django.conf import settings
from mwmbl.indexer.batch_cache import BatchCache
from mwmbl.indexer.paths import INDEX_NAME, BATCH_DIR_NAME
from mwmbl.tinysearchengine.completer import Completer
from mwmbl.tinysearchengine.indexer import TinyIndex, Document
from mwmbl.tinysearchengine.rank import HeuristicRanker
queued_batches = Queue()
completer = Completer()
index_path = Path(settings.DATA_PATH) / INDEX_NAME
tiny_index = TinyIndex(item_factory=Document, index_path=index_path)
tiny_index.__enter__()
ranker = HeuristicRanker(tiny_index, completer)
batch_cache = BatchCache(Path(settings.DATA_PATH) / BATCH_DIR_NAME)

Some files were not shown because too many files have changed in this diff.