KV Data Loading: Commit Time #55

Open
thegreatfatzby opened this issue May 31, 2024 · 4 comments

Comments

@thegreatfatzby

Will the commit time field be required for updates or deletes?

@peiwenhu
Collaborator

It'll be required for every operation, including updates and deletes.

We need to depend on a client-defined time to determine a deterministic order, rather than on a server run-time decision, for reasons such as: 1. the pubsub-based data delivery is not in order; 2. internal optimizations for file-based data reading may make reads out of order.
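To illustrate the point about out-of-order delivery, here is a minimal sketch of how a store could use a client-supplied commit time to reach the same result regardless of arrival order. This is not the KV server's actual implementation; the names `Mutation`, `Store`, and `commit_time` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record shape: one key, one value, one client-defined time."""
    key: str
    value: Optional[str]  # None marks a delete
    commit_time: int      # supplied by the client, not by the server


class Store:
    """Toy store where the mutation with the highest commit time wins."""

    def __init__(self) -> None:
        # key -> (commit_time, value); deletes are kept as tombstones
        self._rows: Dict[str, Tuple[int, Optional[str]]] = {}

    def apply(self, m: Mutation) -> None:
        current = self._rows.get(m.key)
        # Ignore anything that is not strictly newer than what we already hold.
        if current is not None and current[0] >= m.commit_time:
            return
        self._rows[m.key] = (m.commit_time, m.value)

    def get(self, key: str) -> Optional[str]:
        entry = self._rows.get(key)
        return None if entry is None else entry[1]


updates = [
    Mutation("k", "old", 100),
    Mutation("k", "new", 200),
    Mutation("k", None, 300),  # delete
]

in_order, reversed_order = Store(), Store()
for m in updates:
    in_order.apply(m)
for m in reversed(updates):
    reversed_order.apply(m)
```

In this sketch the delete carries a commit time too, and is stored as a tombstone, so an older update that arrives late cannot bring the key back. Two mutations for the same key with equal commit times would still depend on arrival order here, so the sketch assumes the client assigns distinct times per key.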

@thegreatfatzby
Author

I see, so a micro and a macro question:

  • Will this be required per row? Or could it somehow be inferred per batch, like from a file, a file name, or other metadata?
  • Will this type of loading be the only type supported? This isn't required by all caching or other data storage solutions. Some of them sort of "replicate" that (pun intended) internally but don't require it as part of the API, and leave some of that ordering to the client, which is not unreasonable.

@thegreatfatzby
Author

Also @truemike and @swapnilpandit

@peiwenhu
Collaborator

peiwenhu commented Jun 3, 2024

> Will this be required per row? Or could it somehow be inferred per batch, like from a file, a file name, or other metadata?

Yes. It is required per row.

Technically it could also be inferred from elsewhere, but we try to keep things simple unless there is a strong reason not to. Given that we already have 2 ways to ingest data (pubsub, fs) and 2 data formats (Avro, Riegeli), and we may have more ways in the future, we want to keep the feature matrix as simple as possible.

> Will this type of loading be the only type supported?

We're open to suggestions, but this is the only type supported as of now. We design within the constraints of a TEE, which does not persist data across machine restarts. That makes it really hard to make the KV server the source of truth for the data, for two reasons: 1. decisions made by the KV server cannot persist across restarts without great complexity; 2. a consensus algorithm for making such decisions is also hard under these constraints, so it's much easier for each server to operate independently. Therefore it's much cleaner to let the client control this aspect. I suspect the other caching/storage solutions don't need to worry about this as much as we do. But it's always nice to find some inspiration.
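As a rough illustration of why client-controlled ordering lets servers operate independently, the sketch below rebuilds state from the same set of records in every possible delivery order and checks that the result is always identical. The function `replay` and the `(key, value, commit_time)` tuple layout are assumptions made for this example, not part of the KV server.

```python
from itertools import permutations


def replay(records):
    """Rebuild state from scratch, as a server with no persisted state might.

    For each key, keep the record with the highest client-defined commit time.
    A value of None stands for a delete.
    """
    state = {}
    for key, value, commit_time in records:
        if key not in state or commit_time > state[key][0]:
            state[key] = (commit_time, value)
    return state


# Hypothetical records; commit times are distinct per key.
records = [
    ("ad:1", "v1", 100),
    ("ad:1", "v2", 200),
    ("ad:2", "x", 150),
    ("ad:1", None, 300),
]

# Each permutation stands for one server (or one restart) seeing the
# records in a different order, with no coordination between them.
states = [replay(order) for order in permutations(records)]
all_equal = all(s == states[0] for s in states)
```

Because the outcome depends only on the records themselves, no server needs to remember an earlier decision or agree on one with its peers.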
