Commit graph

1308 commits

Author SHA1 Message Date
Mikkel Denker
ef09de73e2 more links for graph 2022-12-19 20:25:08 +01:00
Mikkel Denker
dd40f362a7 stop shortest path calculation when distance is too large 2022-12-16 11:15:22 +01:00
Mikkel Denker
b01ef0ada8 hyperloglog off by one 2022-12-16 10:51:06 +01:00
Mikkel Denker
08dc18e16a fixed hyperloglog precision 2022-12-16 10:34:16 +01:00
Mikkel Denker
f722dab687 speedup harmonic centrality calculations 2022-12-15 21:17:39 +01:00
Mikkel Denker
938d8d402c turn off betweenness centrality for proxy node selection (wasn't fast enough) 2022-12-15 09:49:05 +01:00
Mikkel Denker
024ae7cbc3 scaled test values 2022-12-14 16:41:34 +01:00
Mikkel Denker
f07245834a fixed page offset bug and prepared for multi stage ranking 2022-12-14 16:01:34 +01:00
Mikkel Denker
453aca402e get more detailed score from collector and fixed misalignment between some frontend and backend terminology 2022-12-12 14:06:32 +01:00
Mikkel Denker
fac468cc3c upgrade tantivy 2022-12-12 09:51:28 +01:00
Mikkel Denker
1b61921799 use localsearcher in all tests 2022-12-12 09:13:49 +01:00
Mikkel Denker
4b18636dc5 ability to skip full graph creation 2022-12-08 19:01:19 +01:00
Mikkel Denker
084198f6d1 stackoverflow sidebar 2022-12-08 15:47:00 +01:00
Mikkel Denker
58292e4f2e allow non-ssh cloning 2022-12-07 14:33:17 +01:00
Mikkel Denker
b518cf20b0 stackoverflow sidebar 2022-12-06 14:55:33 +01:00
Mikkel Denker
2172ddcf2f stackoverflow enhanced snippet 2022-12-05 14:48:04 +01:00
Mikkel Denker
78acb66829 remove block notion from webgraph 2022-12-02 13:10:34 +01:00
Mikkel Denker
465a49bf10 update quickstart.optic 2022-12-02 12:39:25 +01:00
Mikkel Denker
66ab5248ae merge webgraph sequentially 2022-12-01 15:55:18 +01:00
Mikkel Denker
51adbc9333 fix github linguist? 2022-12-01 15:26:52 +01:00
Mikkel Denker
a03a4957be
Ftr/optics language (#69)
* Store all schema_org from webpages in a field

* flatten json tokenizer

* rename goggles -> optics

* update optics syntax

* cargo workspace

* very simple lsp wasm connection

* optics as separate package

* hover stuff

* optics vscode extension published

* syntax errors on-save and begin schema-field

* Use separate targets for LSP and rest (#68)

By moving the different targets into separate workspaces, we avoid some
of the issues where rust-analyzer might just stop working.

By adding the two projects to .vscode/settings.json we keep the ability
to get completions, goto definitions, rename, and such operations.

This requires us to specify the dependency versions in the LSP crate, as
we can no longer refer to them by the workspace version. The positive of
this is that the WASM/LSP dependent crates are now moved to the LSP crate.

* schema.org syntax in optic

* optic can now perform schema searches

* simplified schema_org flattening

* wrote new quickstart.optic

* update like-text

Co-authored-by: Oliver Bøving <oliver@bvng.dk>
2022-12-01 14:59:49 +01:00
Mikkel Denker
ca010fa7c0
More schema org (#67)
* parse single microdata item

* parse microdata from entire website

* convert microdata to schema.org

* Refactor schema.org to have a json_ld parse module

* a lot of schema.org types

* Less types for schema.org
It would currently require waaaaaaaay too much work to define all the types for schema.org compared to the benefits we would get from having them defined.

* test with stackoverflow question and a recipe
2022-11-16 10:14:45 +01:00
Mikkel Denker
fe9fa4e945 harmonic centrality improvements (kahan sum, bloom filter, ability to control number of hyperloglog counters) 2022-11-08 20:32:17 +01:00
Mikkel Denker
56788b1c9f less locking in webgraph 2022-11-08 18:28:36 +01:00
Mikkel Denker
71f883ae0b less synchronization in hyperball 2022-11-07 20:27:45 +01:00
Mikkel Denker
213e118fff parallel hyperball 2022-11-07 20:18:05 +01:00
Mikkel Denker
86ed00d05a use intmap in betweenness centrality calculation to reduce memory 2022-11-07 19:48:24 +01:00
Mikkel Denker
a24da43546 if we always flush before calling .edges(), we cannot iterate the edges of a graph that is opened as read-only 2022-11-06 16:25:58 +01:00
Mikkel Denker
fc0a1bd756 more error resistant 2022-11-06 15:56:26 +01:00
Mikkel Denker
769890a64b calculate harmonic centrality using hyperball 2022-11-06 15:39:21 +01:00
Mikkel Denker
cfe03f9992 specialized integer hashmap implementation for speedup in approximate harmonic calculations 2022-11-05 17:05:14 +01:00
Mikkel Denker
39f5536c30 small code cleanup 2022-11-04 10:02:02 +01:00
Mikkel Denker
0d2fb24442 calculate query specific centrality during ranking 2022-11-03 20:47:28 +01:00
Mikkel Denker
bbca94c37e
Parse DMOZ data (#66)
* Parse DMOZ data

* index topics as facets

* calculate topic centrality

* fix serious bug in webgraph where some nodes disappeared (there is still a bug somewhere, but waaaay fewer nodes are missing now)

* apply topic centrality during search
2022-11-03 14:19:21 +01:00
Mikkel Denker
ce7a3f3599 Crawl stability field 2022-10-28 13:38:53 +02:00
Mikkel Denker
848d2e3c89 Remove duplicate search results based on simhash 2022-10-27 14:49:47 +02:00
Mikkel Denker
2c127d5f39
Ftr/configure command (#65)
* Add autosuggest scrape as a separate command

* Save queries continuously

* Save images as they get downloaded (way lower memory usage)

* Created configure subcommand

* Updated justfile and setup documentation
2022-10-26 14:58:26 +02:00
Mikkel Denker
647044f27d Fixed XSS caused by not properly escaping search query 2022-10-25 19:26:01 +02:00
Mikkel Denker
01326f4310 remove protocol from full-centrality 2022-10-25 16:15:58 +02:00
Mikkel Denker
1aa8f97ec6 Limit number of segments during indexing. Else we risk having too many open memmap's during merge 2022-10-25 15:08:04 +02:00
Mikkel Denker
0a4ad17c26 merge indexes from cli 2022-10-25 13:34:38 +02:00
Mikkel Denker
cfabb5a453 merge indexes from cli 2022-10-25 13:29:46 +02:00
Mikkel Denker
e221eb1292 spell correction only highlight corrected part 2022-10-25 10:09:27 +02:00
Mikkel Denker
76ed6624c5 unit test personal centrality from goggles 2022-10-16 17:21:05 +02:00
Mikkel Denker
596b58586e ensure that additional links from A -> B do not artificially increase the centrality of B 2022-10-14 14:45:53 +02:00
Mikkel Denker
ae64cd807a full webgraph should be without protocol 2022-10-14 13:51:08 +02:00
Mikkel Denker
d89149da2c cleanup host graph nodes 2022-10-14 13:41:52 +02:00
Mikkel Denker
83a37d9e32 lowercase webgraph node names 2022-10-14 13:27:07 +02:00
Mikkel Denker
fd97641eb8
Ftr/trust centrality (#63)
* refactor harmonic centrality into separate centrality module

* betweenness centrality

* betweenness speedup by not using hashmaps

* [WIP] trust centrality

* more robust warc download?

* unit test for trust centrality calculation

* refactor centrality store and also save trust centrality in store

* approx harmonic centrality working

* Order trusted nodes by betweenness, if the user specified too many, merge the worse trusted nodes into some of the better and update their weight

* dislike sites

* re-enable harmonic centrality calculation

* calculate betweenness on full graph

* added nodeid to schema

* use personal centrality during search

* sort centrality values in csv files

* liked and disliked sites in goggles syntax
2022-10-14 13:14:47 +02:00
Oliver Bøving
1d26bb2ec0
Integrate Alpine.js (#64)
* Move settings into a subfolder

This ensures that the paths are laid out the same on the frontend and
on the backend.

* Format goggles and sites .astro

* Add Alpine.js and use `ServeDir` instead of `Spa`

The `ServeDir` allows us to use the public dir and remove the extra
/assets/ also solving the issue with loading scripts without inline.

Currently only a few computes are ported to use Alpine.js, but the rest
should be doable!
2022-10-13 13:09:54 +02:00