-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discrepancy in behavior of DoFn.process()
when yield
is combined with return
statement, or vice versa
#22969
Comments
yield
is combined with return
statement, or vice versayield
is combined with return
statement, or vice versa
yield
is combined with return
statement, or vice versaDoFn.process()
when yield
is combined with return
statement, or vice versa
Thanks for the detailed report @toransahu In the short-term we should at least update the docs to indicate that mixing return and yield leads to undefined behavior. (optional) medium-term: detect this situation and fail at construction time. Long-term: Enable mixing yield and return. CC: @tvalentyn @robertwb |
I think P1 is appropriate here since this could lead to data loss. I'm not sure updating docs is enough to drop to P2, since users could easily miss that. |
@TheNeuralBit Thanks for promptly triaging the issue. |
Has anything been done to mitigate this? It seems quite severe! |
This is in the backlog, awaiting available hands; doc update in flight. I think we can recommend |
.take-issue |
Taking a closer look, the unexpected behavior is not caused by Beam but rather how Python interprets a function with For example:
Seems that such mix-and-match was an error in Python 2. |
@liferoad Thanks for the PR. @tvalentyn Thanks for PR review & sharing the root cause. |
.close-issue |
The issue this warning message points to has been resolved showing it was an issue with Python 2 and not with beam. apache#22969
What happened?
Apache Beam SDK Version: 2.40.0
SDK Language: Python
Runner: All (DirectRunner, DataflowRunner, PortableRunner etc.)
TL;DR: If a
DoFn.process
hasyield <some value>
as well asreturn <some iterable>
statements, then it does not emit element for statementreturn <some iterable>
.The Apache Beam, Programming Guide, 4.2.1.2. Creating a DoFn states that:
That statement is correct when a
DoFn.process
:yield <some value>
onlyreturn <some iterable>
onlyto emit an element, within the definition.
If the combination of
yield
andreturn
are used in theDoFn.process()
definition, then it does not comply with the statement made in the document.See this example pipeline:
Actual Output:
Expected Output:
If the analogy of
DoFn.process
is similar to a Pythongenerator
, then on running a similar code (made purely withgenerator
):Outputs:
If we compare behavior of
DoFn.process
with Pythongenerator
, for all 4 cases:Case 4.
dofn_issue.py
&generator_eg.py
.dofn_issue.py
&generator_eg.py
.Case 3.
dofn_issue.py
&generator_eg.py
.dofn_issue.py
&generator_eg.py
.Case 2.
dofn_issue.py
. b) Output is expected forgenerator_eg.py
.dofn_issue.py
&generator_eg.py
.Case 1.
dofn_issue.py
. b) Output is expected forgenerator_eg.py
.dofn_issue.py
&generator_eg.py
.Discrepancy
If
dofn_issue.py
for case 3 emits on statementreturn <some iterable>
(which is NOT the case withgenerator_eg.py
), then why it does NOT emit for case 1 and case2's statementreturn <some iterable>
.Issue Priority
Priority: 1
Issue Component
Component: sdk-py-core
The text was updated successfully, but these errors were encountered: