Release Notes - Beam - Version 2.29.0 - HTML format

Sub-task

  • [BEAM-7092] - Add Spark 3 test jobs to the CI (Java 8)
  • [BEAM-7637] - Migrate S3FileSystem to AWS SDK for Java 2
  • [BEAM-8696] - Beam Dependency Update Request: com.google.protobuf:protobuf-java
  • [BEAM-9282] - Create new module for Spark 3 runner
  • [BEAM-9283] - Add Spark 3 test jobs to the CI (Java 11)
  • [BEAM-10944] - Support CREATE FUNCTION statement with Java UDF
  • [BEAM-11126] - Beam Dependency Update Request: org.checkerframework:checker-qual
  • [BEAM-11654] - Publish Spark 2 and 3 specific Job-Server containers
  • [BEAM-11747] - Reject the mixed Java UDF and ZetaSQL builtin operators cases
  • [BEAM-11771] - Beam Dependency Update Request: org.checkerframework:checker
  • [BEAM-11899] - Beam Dependency Update Request: org.apache.commons:commons-pool2

Bug

  • [BEAM-8221] - NullPointerException in reading from non-existent Kafka topic
  • [BEAM-9239] - Dependency conflict with Spark using aws io
  • [BEAM-10582] - Beam Dependency Update Request: pyarrow
  • [BEAM-11033] - Update Dataflow metrics processor to handle portable jobs
  • [BEAM-11125] - Beam Dependency Update Request: org.checkerframework
  • [BEAM-11326] - Enforce deadlines during splitAtFraction in BigQueryStorageStreamSource
  • [BEAM-11613] - Update Dataflow multi-language pipelines to use SDK harness images available in GCR
  • [BEAM-11647] - beam_PreCommit_Go_Cron flaky
  • [BEAM-11657] - Kafka read performance regression due to added header support
  • [BEAM-11706] - TriggerProto translation shows up as 1% cpu on some benchmarks
  • [BEAM-11719] - Enforce deterministic coding for GroupByKey and Stateful DoFns
  • [BEAM-11720] - Beam hardcodes pip path, which may be inconvenient for some custom container users.
  • [BEAM-11746] - GroupIntoBatchesTest.testInGlobalWindow flaky
  • [BEAM-11749] - Portable Flink runner skips timers when dynamic timer tags are used
  • [BEAM-11784] - Java pipeline proto serialization does not ensure topological ordering of root transforms
  • [BEAM-11801] - BigtableIO should not set useCachedDataPool when using an emulator
  • [BEAM-11807] - SDK Worker with multithreading causes boto3 the KeyError(endpoint_resolver)
  • [BEAM-11815] - Fail to read more than 1M of items with DynamoDBIO
  • [BEAM-11824] - WindowingStrategyTranslation does not set merge status in proto
  • [BEAM-11833] - UnboundedSourceAsSDFRestrictionTracker reports incorrect watermark after failed claim
  • [BEAM-11834] - Array elements are assumed not to be nullable.
  • [BEAM-11848] - publish_docker_images script fails to deploy images for 2.28.0 RC1
  • [BEAM-11861] - ParquetIO throws Coder not found when using parseGenericRecord or parseFilesGenericRecord
  • [BEAM-11862] - Write To Kafka does not work
  • [BEAM-11863] - Java Quick Start is not working on MAC M1
  • [BEAM-11864] - NPE when registering fromRowFunction
  • [BEAM-11881] - DataFrame subpartitioning order is incorrect
  • [BEAM-11884] - Deterministic coding enforcement causes BigQueryBatchFieldLoads/GroupFilesByTableDestinations to fail
  • [BEAM-11887] - testMergingCustomWindowsWithoutCustomWindowTypes failing on Flink VR
  • [BEAM-11910] - Increase subsequent page size for bags after the first
  • [BEAM-11921] - Github actions Java test permared
  • [BEAM-11929] - DataframeTransfom, BatchRowsAsDataFrame do not preserve field order when schema created with beam.Row
  • [BEAM-11967] - Dataflow metrics failing in runner v2
  • [BEAM-11972] - ParquetIO should close all opened channels/readers
  • [BEAM-11979] - Can't use ReadFromMongoDB with a datetime in filter
  • [BEAM-12030] - DataFrame read_* functions raise IndexError when no files exist
  • [BEAM-12042] - TVF with no arguments causes ArrayIndexOutOfBoundsException.
  • [BEAM-12043] - Terminal external transforms broken on Dataflow
  • [BEAM-12044] - JdbcIO should explicitly setAutoCommit to false
  • [BEAM-12054] - Mutator.close() has to be moved to @FinishBundle in WriteFn and DeleteFn
  • [BEAM-12071] - DataFrame IO sinks do not correctly partition by window
  • [BEAM-12095] - spark_runner.py broken by Spark 3 upgrade.
  • [BEAM-12292] - 2.29.0 cherrypick: WindmillStateCache has a 0% hit rate in 2.29

New Feature

  • [BEAM-5601] - Dataflow runner should support custom windowfn for portability
  • [BEAM-10861] - Adds URNs and payloads to PubSub transforms
  • [BEAM-10994] - Add Hot Key Logging in Dataflow Runner
  • [BEAM-11325] - KafkaIO should be able to read from new added topic/partition automatically during pipeline execution time
  • [BEAM-11628] - Implement GroupBy.apply
  • [BEAM-11658] - Match .snappy files into the given (de)compressor
  • [BEAM-11694] - Re-enable combiner packing for DataflowRunner, FnApiRunner and PortableRunner
  • [BEAM-11698] - Implement BIT_XOR as CombineFn for Zetasql
  • [BEAM-11772] - GCP BigQuery sink (file loads) uses runner determined sharding for unbounded data
  • [BEAM-11850] - Support DDL in SQL Transform
  • [BEAM-11932] - ServiceOptions for configuring Dataflow

Improvement

  • [BEAM-2530] - Make Beam compatible with next Java LTS version (Java 11)
  • [BEAM-10120] - Support Dynamic Timers in the Flink Portable Runner
  • [BEAM-10671] - Add environment configuration fields as first-class pipeline options
  • [BEAM-11634] - Give JobInvoker threads unique names.
  • [BEAM-11705] - Write to bigquery always assigns unique insert id per row causing performance issue
  • [BEAM-11736] - FnApiRunner should pass PipelineOptions to sdk_worker instances
  • [BEAM-11752] - Using LoadingCache instead of Map to cache BundleProcessor
  • [BEAM-11778] - Create an extension of SimpleCatalog.
  • [BEAM-11789] - Upgrade gradle-dependency-analyze plugin to 1.4.3.
  • [BEAM-11806] - KafkaIO - Partition Recognition in WriteRecords
  • [BEAM-11866] - Remove InvalidWindows from Java SDK and use already merged bit
  • [BEAM-11867] - Remove SYNCHRONIZED_PROCESSING_TIME time domain from model protos
  • [BEAM-11870] - IllegalArgumentExceptions from Runner.fromOptions in Pipeline.create should be raised as-is
  • [BEAM-11913] - Add support for Hadoop configuration on ParquetIO
  • [BEAM-11941] - Upgrade Flink runner to Flink version 1.12.2
  • [BEAM-11946] - Use ReadFromKafkaDoFn for KafkaIO.Read by default when beam_fn_api is enabled
  • [BEAM-11958] - Don't use new Jackson APIs to avoid classpath issues when parsing AWS configuration
  • [BEAM-11969] - Make row-group size configurable in ParquetIO.Sink
  • [BEAM-12010] - CalcMergeRule should not merge BeamCalcRel and BeamZetaSqlCalcRel.
  • [BEAM-12033] - Validate casts from double literals to numeric during expression conversion.
  • [BEAM-12057] - Add missing populateDisplayData methods to ParquetIO

Test

  • [BEAM-11023] - GroupByKeyTest testLargeKeys100MB and testGroupByKeyWithBadEqualsHashCode are failing on Spark Structured Streaming runner

Wish

  • [BEAM-11213] - Beam metrics should be displayed in Spark UI

Task

  • [BEAM-11265] - Java quickstart shouldn't use pom.xml as input
  • [BEAM-11324] - Additional verification in PartitioningSession

Edit/Copy Release Notes

The text area below allows the project release notes to be edited and copied to another document.