
BuckEvent: provide a Buck Event Publisher proto #685

Open · sluongng wants to merge 1 commit into main

Conversation

sluongng
Contributor

Provide a BuckEvent Publisher service which emits BuckEvents to an
external server implementation.

Closes #226

@facebook-github-bot added the CLA Signed label Jun 17, 2024
@sluongng sluongng marked this pull request as ready for review June 17, 2024 14:02
@facebook-github-bot
Contributor

@cjhopman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@JakobDegen
Contributor

Copying some comments left on phabricator:

I think we need the rest of the stack to accept this.
It is not clear how this will be used.
(Also I'm not 100% sure grpc is the right choice, maybe simple piping to stdin would be sufficient).

I think I agree that I'd like to see a little more code before landing this just yet


message BuckEventResponse {
  // A trace-unique 64-bit integer identifying the stream.
  uint64 stream_id = 1;
Contributor
Instead of having stream_id, we can just create a new gRPC call for each stream; that would be more natural.

Contributor
Actually, to partially respond to Stiopa on this: That seems hard to practically implement on the buck2 side, since we may create a number of different instances of the client for each command. And even if we didn't do that, it seems unwise to prevent ourselves from doing that in the future

@JakobDegen
Contributor

Actually, I have another thing to consider: Internally, we've found that the size of the event logs can be extremely substantial and that keeping the entire mechanism reliable hasn't been super straightforward. Now, while I don't know if anyone else's repo is big enough to be affected by that, and I'm sure that some of the unreliability is a result of misbehaviors specific to our infra, I can also imagine that some of the lessons learned from that are more generally useful.

To go into detail a bit, the way our scribe client works is roughly that it allocates the "real" client as a global, and gets a handle to it for each instance of the type you see. The global then keeps a queue of messages that is used to 1) batch writes, and 2) rate limit itself. While the rate limiting is probably not necessary in OSS, the write batching might be of interest. That might be something to consider for this API. Hopefully that information also explains what send_now does. I'm going to try and get our client open sourced for you to at least look at and evaluate what parts of it you want to reuse or not.

facebook-github-bot pushed a commit that referenced this pull request Jul 2, 2024
Summary: This won't be able to build in OSS, but we can at least make the code available. Of interest specifically to #685

Reviewed By: ndmitchell

Differential Revision: D59262895

fbshipit-source-id: e4e72a402e174dedd08715a627a84e7b16d1a225
@sluongng
Contributor Author

sluongng commented Jul 3, 2024

There seem to be 2 concerns here:

  1. Client initialization and stream_id:

I plan to mimic Bazel's approach here. In Bazel, each "invocation" gets a unique stream ID. As events are sent to the server, the server replies with an "ack" (identified via trace_id) for each event that has been persisted. A network disruption could happen, and in such cases the client might want to retry sending the events that were not acked by the server. Upon resend, the events could be sent to a different server instance, so the stream_id is needed to identify the retry stream as a continuation of the previously disrupted stream (see the sketch after this comment).

  2. Batching:

I am not sure why batching is needed. Multiple gRPC streams can multiplex over a single HTTP/2 connection, so the overhead of sending events one by one is relatively low. I could see a potential benefit in compressing multiple events together to send as a batch, but I would argue that this should be left to the server implementation to optimize (e.g. some implementations could provide a local forwarding proxy that handles the batching).

In Bazel, it's better to stream the events as soon as they arrive, since some events contain console log information, which enables the server to implement an alternative build UI on its website. I could imagine that we might be able to extend BuckEvent to support this use case (akin to the removed ControlEvent).

Could you please elaborate a bit more on why batch sending needs to be implemented on the client side? Perhaps you are thinking of potential non-grpc client implementations?
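
(As a rough illustration of the ack/retry flow described in point 1, the messages could look something like the sketch below. All names and field numbers here are hypothetical, loosely modeled on Bazel's Build Event Service, and are not the definitions proposed in this PR.)

service BuckEventPublisher {
  // One bidirectional stream per invocation. A client that loses its
  // connection opens a new stream carrying the same stream_id, so the
  // server can treat it as a resumption of the interrupted stream.
  rpc PublishBuckEvents(stream PublishBuckEventRequest)
      returns (stream PublishBuckEventResponse);
}

message PublishBuckEventRequest {
  // Identifies the logical stream across retries, even if the resent events
  // end up on a different server instance.
  uint64 stream_id = 1;
  // Monotonically increasing within a stream; the server acks by this number.
  uint64 sequence_number = 2;
  // The existing BuckEvent payload (import assumed).
  BuckEvent event = 3;
}

message PublishBuckEventResponse {
  uint64 stream_id = 1;
  // Highest sequence number the server has durably persisted. On reconnect,
  // the client resends everything after this point.
  uint64 acked_sequence_number = 2;
}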

@sluongng
Contributor Author

sluongng commented Jul 3, 2024

I think we need the rest of the stack to accept this.
It is not clear how this will be used.
(Also I'm not 100% sure grpc is the right choice, maybe simple piping to stdin would be sufficient).

From my perspective, it could potentially be a huge time investment to code the rest of the implementation without agreeing on the spec first. At least some agreement on the gRPC service would be a positive signal for me to put more time into this.

The current design is modeled after PublishBuildToolEventStream (https://github.com/bazelbuild/bazel/blob/0b280ac442aa51c7cca1c311658dbd1a6346dff5/third_party/googleapis/google/devtools/build/v1/publish_build_event.proto#L72-L84), which is a battle-tested RPC with multiple server implementations available in the REAPI ecosystem. A sample client implementation is available in Bazel (Java), and there are plenty of proxy implementations available on GitHub.
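
(For reference, the shape of the Bazel RPC linked above is approximately the following; see the linked file for the authoritative definition. Each request wraps an OrderedBuildEvent carrying a stream ID and a sequence number, and the server acknowledges persisted events by echoing those back.)

service PublishBuildEvent {
  rpc PublishBuildToolEventStream(stream PublishBuildToolEventStreamRequest)
      returns (stream PublishBuildToolEventStreamResponse);
}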

With that said, I do understand that this is going to be a Buck2-specific API. So if there are any alternative proposals, I would love to hear them.

@JakobDegen
Contributor

Could you please elaborate a bit more on why batch sending needs to be implemented on the client side?

So the thing to keep in mind is that our event sending is done synchronously in a bunch of latency sensitive places. While a benchmark would certainly be much better than guessing, my expectation is that doing a network op for each of these events might be prohibitively slow, at least in some of our builds.

That being said, I don't know that we have to figure this out now. Happy to let you experiment and see if it matters or not.
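
(If per-event network operations do turn out to be too slow, one way batching could surface at the API level is a request message that carries several events at once; the names below are hypothetical and not part of this PR.)

message PublishBuckEventBatchRequest {
  uint64 stream_id = 1;
  // Sequence number of the first event in the batch; the rest follow in order.
  uint64 first_sequence_number = 2;
  repeated BuckEvent events = 3;
}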

From my perspective, it could potentially be a huge time investment to code the rest of the implementation without agreeing on the spec first. At least some agreement on the gRPC service would be a positive signal for me to put more time into this.

Yeah, sorry, I was a bit too brief before in just asking for more code. I think you're right to ask for alignment on use of a GRPC API in general, and I'll bring that up at our team meeting tomorrow to make sure we're all good to move forward in general.

The main thing I would ask for from your side is that the service definition itself is clearly marked as being unstable until we've gathered some experience with using it. The way we've reported events to our own service has changed quite a lot in the past, and I imagine that much of that configuration may at some point need room in the API (the first thing that comes to mind is that we only send a subset of events right now, and I imagine not all consumers are going to agree on which subset is the right one...).

Other than that, I don't think it makes much of a difference whether we merge this PR now or wait for some additional code on the client before doing so, but hopefully the things I mentioned above are enough for you to be able to continue work on your side

@sluongng
Contributor Author

sluongng commented Jul 8, 2024

Yeah, sorry, I was a bit too brief before in just asking for more code. I think you're right to ask for alignment on use of a GRPC API in general, and I'll bring that up at our team meeting tomorrow to make sure we're all good to move forward in general.

Thank you so much. Looking forward to this.

So the thing to keep in mind is that our event sending is done synchronously in a bunch of latency sensitive places. While a benchmark would certainly be much better than guessing, my expectation is that doing a network op for each of these events might be prohibitively slow, at least in some of our builds.

This is a good call out.

In Bazel, there is a flag that switches between sending build events asynchronously and synchronously. The reason for synchronous sending is that in many OSS setups, the CI worker is an ephemeral container/VM that gets shut down right after the build is finished. In those scenarios, it might benefit folks to turn on synchronous event sending so that their builds wait for all events to be sent before exiting and initiating container shutdown.

I think sending events in async/sync mode is tangential to sending events via a stream vs. a batch request. Let's not conflate the two. There are ways to implement an async event sink using a gRPC bidi stream, as well as a batch request, while still supporting queue bypassing for send_now. For example, we could maintain an internal priority queue in the Buck daemon, and send_now events would always get the highest priority. I think we can hash this out in future PRs when we dive into implementation details.

the service definition itself is clearly marked as being unstable until we've gathered some experience with using it.

I have no problem with this. If there is any existing convention in Buck2 to mark an unstable service, I would be happy to follow it. If there isn't, I will name the configuration key that activates this something like unstable_buck_event_service = grpcs://my.service.com/buckevent or experimental_buck_event_service = grpcs://my.service.com/buckevent. We could also rename the .proto file or the service name itself, although I think that's a bit overkill.

Let me know what you prefer here.

Other than that, I don't think it makes much of a difference whether we merge this PR now or wait for some additional code on the client before doing so, but hopefully the things I mentioned above are enough for you to be able to continue work on your side

I think the main difference will be who has to pay the rebase cost.

If the review time for future PRs is short, I have no problem leaving this unmerged. However, if the lead time is long, I will have to rebase my PR stack and account for existing refactoring (codemods) in the Buck2 repo, which could be a pain to deal with. So I would prefer to have this merged earlier rather than later.

@sluongng
Contributor Author

@JakobDegen friendly ping. I am looking for a confirmation on the gRPC API direction before investing more time into this.

@JakobDegen
Contributor

Alright, sorry for the delay. We had some back and forth on this internally and are ourselves not quite sure what we think the right answer is.

The reason for synchronous sending is that in many OSS setups, the CI worker is an ephemeral container/VM that gets shut down right after the build is finished. In those scenarios, it might benefit folks to turn on synchronous event sending so that their builds wait for all events to be sent before exiting and initiating container shutdown.

So we actually have the same problem, which is why even though our event sending is asynchronous in the sense that it doesn't do a network op on each send call, we will still make sure to flush all our buffers at the end of the command before returning to the user. I don't think this is the reason you should be electing to go full sync.

I think sending events in async/sync mode is tangential to sending events via a stream vs. a batch request. Let's not conflate the two. There are ways to implement an async event sink using a gRPC bidi stream, as well as a batch request, while still supporting queue bypassing for send_now. For example, we could maintain an internal priority queue in the Buck daemon, and send_now events would always get the highest priority. I think we can hash this out in future PRs when we dive into implementation details.

👍

I have no problem with this. If there is any existing convention in Buck2 to mark an unstable service

I think a comment at the top of the proto file would be more than enough
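
(For example, something along these lines at the top of the .proto file; the exact wording is only a suggestion.)

// WARNING: This service definition is experimental and UNSTABLE. It may
// change in incompatible ways without notice while we gather experience
// with external consumers.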

If the review time for future PRs is short, I have no problem leaving this unmerged. However, if the lead time is long, I will have to rebase my PR stack and account for existing refactoring (codemods) in the Buck2 repo, which could be a pain to deal with. So I would prefer to have this merged earlier rather than later.

Yeah that's fair. Let's merge this then and we can iterate

I'll add one additional point that came up internally:

Given how big logs can sometimes be, and that users are often working from home on not-great internet (or god forbid, from a hotel or plane or something), the default state was that we had complete data for not even 90% of builds. Fixing that to get 99% complete data was a significant investment for us. The machine-local queues we currently have in a couple of different places, which allow us to bridge periods of reduced internet connectivity, were a big contributing part of that. It's not blocking, but I wanted to at least raise awareness of it so you get a chance to consider it.

If you're otherwise comfortable with this, let me know and I'll go ahead and merge it

Labels: CLA Signed
Successfully merging this pull request may close these issues: API design for BuckEvent data
3 participants