there seems to be a bug somewhere in the decompression of one of the other codecs for enormous columns. the to_id and from_id columns in the webgraph seem to point to wrong ids (we don't know exactly why). as the ids are random hashes anyway, we don't gain much from compressing them and can therefore simply store the columns as raw u64 values on disk
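a minimal sketch of what storing an id column as raw little-endian u64 values could look like. the `RawU64Column` name and the file layout are assumptions for illustration, not the actual store implementation:

```rust
use std::fs::File;
use std::io::{self, BufWriter, Read, Write};
use std::path::Path;

/// hypothetical helper: one fixed-width u64 per row, little-endian, no compression.
/// row i lives at byte offset i * 8, so a lookup is a single seek + read.
struct RawU64Column {
    values: Vec<u64>,
}

impl RawU64Column {
    fn write(values: &[u64], path: &Path) -> io::Result<()> {
        let mut out = BufWriter::new(File::create(path)?);
        for v in values {
            out.write_all(&v.to_le_bytes())?;
        }
        out.flush()
    }

    fn read(path: &Path) -> io::Result<Self> {
        let mut bytes = Vec::new();
        File::open(path)?.read_to_end(&mut bytes)?;
        let values = bytes
            .chunks_exact(8)
            .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
            .collect();
        Ok(Self { values })
    }
}
```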
the mapping between old and new document ids is monotonic (new(a) < new(b) iff old(a) < old(b)), so each posting list is already sorted by the new document ids (even without index sorting). the posting lists can therefore be streamed to disk instead of reading the full lists into memory and sorting them
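a small sketch of the streaming idea under the monotonicity assumption above. the `remap` table and the on-disk encoding are made up for illustration:

```rust
use std::io::{self, Write};

/// hypothetical remapping table: old doc id -> new doc id, assumed monotonic
/// (old(a) < old(b) implies new(a) < new(b)), so order is preserved.
fn remap(old_doc: u32, old_to_new: &[u32]) -> u32 {
    old_to_new[old_doc as usize]
}

/// stream a posting list (sorted by old doc id) straight to the writer.
/// because the remapping is monotonic the output stays sorted, and we
/// never buffer the whole list in memory.
fn stream_postings<I, W>(postings: I, old_to_new: &[u32], mut out: W) -> io::Result<()>
where
    I: Iterator<Item = u32>,
    W: Write,
{
    for old_doc in postings {
        let new_doc = remap(old_doc, old_to_new);
        out.write_all(&new_doc.to_le_bytes())?;
    }
    Ok(())
}
```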
* overall structure for new webgraph store
* webgraph schema structure and HostLinksQuery
* deserialize edge
* forward/backlink queries
* full edge queries and iter smalledges
* [wip] use new store in webgraph
* remove id2node db
* short-circuit link queries
* [wip] remote webgraph trait structure
* [wip] shard awareness
* finish remote webgraph trait structure
* optimize read
* merge webgraphs
* construct webgraph store
* make sure 'just configure' works and everything looks correct
* re-open index after it has been downloaded from replica
* remove writer directory lock
* update meta file with segment changes
* flatten live index directory structure a bit for better overview
* additional live index tests
* [WIP] live index code structure with a ton of todos
* update meta file with segment changes
* add endpoint to index webpages into live index
* compact segments by date
* cleanup old segments
* fix clippy warnings
* fix clippy warnings
most of the time, we want to fetch multiple columns for each document in the result set. by ordering the fields row-wise, we can fetch all the relevant fields for a document with a minimum number of IO operations, whereas we would need at least one IO operation per field if they were column ordered
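a minimal sketch of the row-oriented layout idea. the fixed-width row format and field count are assumptions for illustration, not the actual store format; the point is that all fields of a document sit contiguously, so one seek and one read fetch every field:

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// fixed-width row: every document stores NUM_FIELDS u64 fields back to back.
const NUM_FIELDS: usize = 4;
const ROW_BYTES: u64 = (NUM_FIELDS * 8) as u64;

fn read_row<R: Read + Seek>(reader: &mut R, doc: u64) -> io::Result<[u64; NUM_FIELDS]> {
    // one IO operation per document, regardless of how many fields we need
    reader.seek(SeekFrom::Start(doc * ROW_BYTES))?;
    let mut buf = [0u8; NUM_FIELDS * 8];
    reader.read_exact(&mut buf)?;
    let mut row = [0u64; NUM_FIELDS];
    for (i, chunk) in buf.chunks_exact(8).enumerate() {
        row[i] = u64::from_le_bytes(chunk.try_into().unwrap());
    }
    Ok(row)
}
```

with a column-ordered layout the same lookup would need one seek per field, since each field lives in its own contiguous region.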
our segments are starting to grow too big, so the assumption that the number of position bytes fits in a u32 no longer holds. storing it in a u64 might not be what regular users of tantivy want, as our use of the library most likely doesn't resemble the average user's. forking tantivy allows us to customize it directly for our use case
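a small illustration of the overflow concern (not tantivy's actual code): a u32 byte offset tops out at roughly 4 GiB, so once a segment's position data grows past that, the count no longer fits and has to be widened to u64:

```rust
fn main() {
    let max_u32_bytes = u32::MAX as u64; // 4_294_967_295 bytes ≈ 4 GiB
    let position_bytes: u64 = 6 * 1024 * 1024 * 1024; // e.g. 6 GiB of positions in one segment

    // a checked conversion makes the failure explicit instead of silently wrapping
    match u32::try_from(position_bytes) {
        Ok(v) => println!("fits in u32: {v}"),
        Err(_) => println!(
            "{position_bytes} bytes exceeds the u32 limit of {max_u32_bytes}; need u64 offsets"
        ),
    }
}
```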