Release Notes - Beam - Version 2.19.0

Sub-task

  • [BEAM-4455] - Provide automatic schema registration for Protos
  • [BEAM-5546] - Beam Dependency Update Request: commons-codec:commons-codec
  • [BEAM-7116] - Remove KV from Schema transforms
  • [BEAM-7850] - Make Environment a top level attribute of PTransform
  • [BEAM-7861] - Make it easy to change between multi-process and multi-thread mode for Python Direct runners (see the sketch after this list)
  • [BEAM-7949] - Add time-based cache threshold support in the data service of the Python SDK harness
  • [BEAM-7951] - Allow runners to configure a custom WindowedValue coder such as ValueOnlyWindowedValueCoder
  • [BEAM-8623] - Add an additional message field to the Provision API response for passing the status endpoint
  • [BEAM-8624] - Implement FnService for the status API in the Dataflow runner
  • [BEAM-8701] - Beam Dependency Update Request: commons-io:commons-io
  • [BEAM-8716] - Beam Dependency Update Request: org.apache.commons:commons-csv
  • [BEAM-8717] - Beam Dependency Update Request: org.apache.commons:commons-lang3
  • [BEAM-8749] - Beam Dependency Update Request: com.datastax.cassandra:cassandra-driver-mapping
  • [BEAM-8842] - Consistently timing out: BigQueryStreamingInsertTransformIntegrationTests.test_multiple_destinations_transform
  • [BEAM-8946] - Report collection size from MongoDBIOIT
  • [BEAM-8978] - Report saved data size from HadoopFormatIOIT
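
For BEAM-7861 above, a minimal sketch of switching the Python DirectRunner between execution modes. The --direct_running_mode and --direct_num_workers options and their accepted values are assumptions based on this work, not something these notes confirm.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumed values for --direct_running_mode: 'in_memory' (single process),
    # 'multi_threading', or 'multi_processing'; the pipeline itself is unchanged.
    options = PipelineOptions([
        '--direct_running_mode=multi_processing',
        '--direct_num_workers=4',
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(range(10))
         | 'Square' >> beam.Map(lambda x: x * x)
         | 'Print' >> beam.Map(print))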

Bug

  • [BEAM-2409] - Spark runner produces exactly twice the number of results in streaming mode when triggers are used to re-window results onto the global window
  • [BEAM-5495] - PipelineResources algorithm is not working in most environments
  • [BEAM-7156] - Nexmark Query 14 'SESSION_SIDE_INPUT_JOIN' is producing twice the number of results in Spark Runner
  • [BEAM-7868] - Hidden Flink Runner parameters are dropped in python pipelines
  • [BEAM-7991] - Race condition in the Gradle cleanPython task
  • [BEAM-8435] - Allow access to PaneInfo from Python DoFns (see the sketch after this list)
  • [BEAM-8496] - Remove SDF translators in the Flink streaming transform translator
  • [BEAM-8577] - FileSystems may not have been initialized during ResourceId deserialization
  • [BEAM-8581] - Python SDK labels on-time empty panes as late
  • [BEAM-8582] - Python SDK emits duplicate records for Default and AfterWatermark triggers
  • [BEAM-8810] - Dataflow runner - Work stuck in state COMMITTING with streaming commit rpcs
  • [BEAM-8830] - Fix Flatten tests in the Spark Structured Streaming runner
  • [BEAM-8846] - Force synchronization of the stream observer in BeamFnControlClient
  • [BEAM-8865] - FileIO's Javadoc is outdated: TypeDescriptors.KVs and unhandled IOException
  • [BEAM-8885] - PubsubGrpcClient doesn't respect PubsubOptions#getPubsubRootUrl
  • [BEAM-8943] - SDK harness servers don't shut down properly when SDK harness environment cleanup fails
  • [BEAM-8955] - AvroSchemaTest.testAvroPipelineGroupBy broken on Spark runner
  • [BEAM-8959] - Boolean pipeline options which default to true cannot be set to false
  • [BEAM-8962] - FlinkMetricContainer causes churn in the JobManager and makes the web frontend malfunction
  • [BEAM-8988] - apache_beam.io.gcp.bigquery_read_it_test failing with: NotImplementedError: BigQuery source must be split before being read
  • [BEAM-8989] - Backwards incompatible change in ParDo.getSideInputs (caught by failure when running Apache Nemo quickstart)
  • [BEAM-8995] - apache_beam.io.gcp.bigquery_read_it_test failing on Py3.5 PC with: TypeError: the JSON object must be str, not 'bytes'
  • [BEAM-8999] - PGBKCVOperation does not respect timestamp combiners
  • [BEAM-9006] - Metaspace memory leak caused by the shutdown hook of ProcessManager
  • [BEAM-9034] - Update environment_id for ExternalTransform in Python SDK
  • [BEAM-9050] - Beam pickler doesn't pickle classes that have __module__ set to None
  • [BEAM-9060] - Flink suppresses stdout/stderr during JobGraph generation from JAR
  • [BEAM-9065] - Spark runner incorrectly accumulates metrics between runs
  • [BEAM-9078] - Large tarball artifacts should use GCS resumable upload
  • [BEAM-9083] - PR9677 breaks ValidatesRunnerTest of open source runners
  • [BEAM-9123] - HadoopResourceId returns wrong directory name
  • [BEAM-9127] - postcommit: suites:portable:py2:crossLanguagePortableWordCount failing
  • [BEAM-9138] - beam_Release_Gradle_Build failure in Go
  • [BEAM-9144] - Beam's own Avro TimeConversion class in beam-sdk-java-core
  • [BEAM-9151] - Dataflow legacy worker tests are misconfigured
  • [BEAM-9423] - Re-Add the stop button to the Flink web interface for pipelines
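
For BEAM-8435 above, a minimal sketch of reading pane information inside a Python DoFn, assuming the beam.DoFn.PaneInfoParam default-value parameter exposed by this work; treat the exact parameter and attribute names as assumptions rather than a guarantee of the 2.19.0 API.

    import apache_beam as beam

    class TagWithPaneTiming(beam.DoFn):
        def process(self, element, pane_info=beam.DoFn.PaneInfoParam):
            # pane_info describes the trigger firing that produced this element,
            # e.g. is_first, is_last, and timing (EARLY / ON_TIME / LATE).
            yield element, pane_info.timing

    with beam.Pipeline() as p:
        (p
         | beam.Create([('k', 1), ('k', 2)])
         | beam.ParDo(TagWithPaneTiming())
         | beam.Map(print))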

New Feature

  • [BEAM-1440] - Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK (see the sketch after this list)
  • [BEAM-6671] - Beam 2.9.0 java.lang.NoSuchFieldError: internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
  • [BEAM-8139] - Execute portable Spark application jar
  • [BEAM-8630] - Prototype of BeamSQL Calc using ZetaSQL Expression Evaluator
  • [BEAM-8844] - [SQL] Create performance tests for BigQueryTable
  • [BEAM-9023] - Upgrade to ZetaSQL 2019.12.1
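
For BEAM-1440 above, a minimal sketch of a BigQuery read in the Python SDK using the long-standing BigQuerySource entry point; how the new iobase.BoundedSource-based implementation is surfaced in 2.19.0 is not stated in these notes, so the snippet only illustrates the kind of read it targets. The project, bucket, and query are hypothetical.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--project=my-project',                 # hypothetical project
        '--temp_location=gs://my-bucket/tmp',   # staging area typically needed for BigQuery reads
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadBQ' >> beam.io.Read(beam.io.BigQuerySource(
               query='SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`',
               use_standard_sql=True))
         | 'Print' >> beam.Map(print))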

Improvement

  • [BEAM-3419] - Enable iterable side input for Beam runners (see the sketch after this list)
  • [BEAM-3759] - Add support for PaneInfo descriptor in Python SDK
  • [BEAM-5192] - Support Elasticsearch 7.x
  • [BEAM-6008] - Improve error reporting in Java/Python PortableRunner
  • [BEAM-7961] - Add tests for all runner native transforms and some widely used composite transforms to cross-language validates runner test suite
  • [BEAM-8296] - Containerize the Spark job server
  • [BEAM-8536] - Migrate usage of DelayedBundleApplication.requested_execution_time to time duration
  • [BEAM-8745] - More fine-grained controls for the size of a BigQuery Load job
  • [BEAM-8746] - Allow the local job service to work from inside Docker
  • [BEAM-8801] - PubsubMessageToRow should not check useFlatSchema() in processElement
  • [BEAM-8816] - Load-balance bundle processing with multiple SDK workers
  • [BEAM-8837] - PCollectionVisualizationTest: possible bug
  • [BEAM-8886] - Add a Python mongodbio integration test that triggers a load split
  • [BEAM-8891] - Create and submit Spark portable jar in Python
  • [BEAM-8901] - add experimental flag for reusing flink local environment
  • [BEAM-8929] - Remove unnecessary exception handling in FnApiControlClientPoolService
  • [BEAM-8930] - External workers should receive the artifact endpoint when started from Python
  • [BEAM-8935] - Fail fast if SDK harness startup fails
  • [BEAM-8953] - Extend ParquetIO.Read/ReadFiles.Builder to support Avro GenericData model
  • [BEAM-8993] - [SQL] MongoDB should use predicate push-down
  • [BEAM-8996] - Auto-generate pipeline options documentation for FlinkRunner
  • [BEAM-9000] - Java Test Assertions without toString for GenericJson subclasses
  • [BEAM-9004] - Update Mockito Matchers usage to ArgumentMatchers since Matchers is deprecated in Mockito 2
  • [BEAM-9012] - Include `-> None` on Pipeline and PipelineOptions `__init__` methods for pytype compatibility
  • [BEAM-9019] - Improve Spark Encoders (wrappers of beam coders)
  • [BEAM-9020] - LengthPrefixUnknownCodersTest to avoid relying on AbstractMap's equality
  • [BEAM-9053] - Improve error message when unable to get the correct filesystem for specified path in Python SDK
  • [BEAM-9055] - Unify the config names of Fn Data API across languages
  • [BEAM-9122] - Add uses_keyed_state step property to the Python Dataflow runner
  • [BEAM-9163] - pydoc: Update sphinx_rtd_theme
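
For BEAM-3419 above, a minimal sketch of requesting a side input as an iterable with beam.pvalue.AsIter in the Python SDK; whether a particular runner serves it through the new iterable access pattern is a runner-side detail these notes do not spell out.

    import apache_beam as beam

    with beam.Pipeline() as p:
        stopwords = p | 'Stopwords' >> beam.Create(['the', 'a', 'of'])
        words = p | 'Words' >> beam.Create(['the', 'beam', 'model', 'a'])

        # AsIter makes the side input available as a lazily iterable view instead
        # of materializing it as an in-memory list up front.
        filtered = words | 'DropStopwords' >> beam.Filter(
            lambda word, stop: word not in stop,
            stop=beam.pvalue.AsIter(stopwords))

        filtered | beam.Map(print)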

Test

  • [BEAM-8512] - Add integration tests for Python "flink_runner.py"

Task

  • [BEAM-2572] - Implement an S3 filesystem for Python SDK (see the sketch after this list)
  • [BEAM-5690] - Issue with GroupByKey in BeamSql using SparkRunner
  • [BEAM-8342] - Upgrade the Samza runner to use Samza 1.3
  • [BEAM-9358] - BigQueryIO potential write speed regression
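
For BEAM-2572 above, a minimal sketch of reading and writing s3:// paths from the Python SDK. It assumes the S3 filesystem is installed (for example via an aws extra of the apache_beam package) and that AWS credentials come from the environment; the bucket and paths are hypothetical.

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'Read' >> beam.io.ReadFromText('s3://my-bucket/input/*.txt')
         | 'Count' >> beam.combiners.Count.Globally()
         | 'Write' >> beam.io.WriteToText('s3://my-bucket/output/line-count'))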
