Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can I use bigquery storage to get the result of a query #802

Closed
fabiencelier opened this issue Jan 19, 2021 · 9 comments
Closed

How can I use bigquery storage to get the result of a query #802

fabiencelier opened this issue Jan 19, 2021 · 9 comments
Assignees
Labels
api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. type: question Request for information or clarification. Not an issue.

Comments

@fabiencelier
Copy link

I'd like to download the result of a query as fast as possible.
The examples in https://github.com/googleapis/java-bigquerystorage/tree/master/samples/snippets/src/main/java/com/example/bigquerystorage are all about downloading/uploading to a table.

How can use this library to send a query and download the result ? Is there any example about it somewhere ?

@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. label Jan 19, 2021
@meredithslota meredithslota added the type: question Request for information or clarification. Not an issue. label Jan 19, 2021
@stephaniewang526
Copy link
Contributor

Could you provide me with the specific endpoints you are asking about?

@fabiencelier fabiencelier changed the title How can I use bigquery storage to get the result ofa query How can I use bigquery storage to get the result of a query Jan 20, 2021
@fabiencelier
Copy link
Author

This example shows how to retrieve a table, I'd like to retrieve the result of a query in a similar way. Is this possible ?

@kmjung
Copy link
Contributor

kmjung commented Jan 20, 2021

Sure -- you can look at the Apache Beam BigQuery query source as one example. This won't necessarily be faster than the default method of fetching query results, however, particularly for small result sets. What is your use case?

@fabiencelier
Copy link
Author

What is your use case?

I want to do a query and download the result into Java memory as fast as possible.
My query result is quite big: several million rows

This won't necessarily be faster than the default method of fetching query results

Are you talking about the google-cloud-bigquery library ? It is very slow for me but I might not use it correctly. I do something like that:

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableResult result = bigquery.query(queryConfig);
for (FieldValueList row : result.iterateAll()) {
  // do stuff
}

It takes around 3 minutes to download 10M rows

Instead I have tried to do a query with the bigquery API then download the temporary result table with the bigquerystorage API, but I'm not sure if it is the good way to do it:

Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).build());
queryJob = queryJob.waitFor();
TableId temporaryTable = ((QueryJobConfiguration) queryJob.getConfiguration()).getDestinationTable();
// Use bigquerystorage to download the temporary table to Arrow

It takes around 10 seconds to download the same 10M rows.

you can look at the Apache Beam BigQuery query source as one example.

I am not familiar with Beam, I'll have a look at it :)

@stephaniewang526
Copy link
Contributor

stephaniewang526 commented Jan 21, 2021

What version of google-cloud-bigquery are you using to run the query? When you say "download" the result, what do you mean? I am curious as to why it is so slow. Does it take 3min to get the results back? Or 3min to "download". Thanks.

@fabiencelier
Copy link
Author

What version of google-cloud-bigquery are you using to run the query

I am using

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>libraries-bom</artifactId>
        <version>16.2.1</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

which gives me google-cloud-bigquery 1.126.1

When you say "download" the result, what do you mean?

I mean the time it takes to retrieve all the data into my local JVM

This code takes several minutes to iterate over 10M rows:

TableResult result = bigquery.query(queryConfig);

long count = 0;
for (FieldValueList row : result.iterateAll()) {
  for (FieldValue val : row) {
    count++;
  }
}

@stephaniewang526 stephaniewang526 self-assigned this Jan 25, 2021
@stephaniewang526
Copy link
Contributor

stephaniewang526 commented Jan 25, 2021

How wide is your row? How much data roughly is in your 10M+ row? I want to compare this to our benchmark to see if it's normal.

Also, can you share a code snippet of how you are doing the same operation using the bigquerystorage API which is taking 10s instead of 3min?

Thanks!

@fabiencelier
Copy link
Author

I have 10M rows with 3 columns (integer, string, float) and as CSV it represents ~200MB.
Unless there is something I am missing, the google-cloud-bigquery library is not option for medium to large results as it is way too slow.

Here is my benchmark code for bigquerystorage API (with 1 thread to keep it simple):

Example.java
import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;
import com.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions;
import com.google.common.base.Preconditions;
import io.atoti.Config.MyTable;
import java.io.IOException;

public class Example {

  public static final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

  public static void main(String[] args) throws InterruptedException, IOException {
    final String query = "SELECT * FROM table"; // whatever query with big result

    long start = System.currentTimeMillis();
    TableId temporaryResultTable = getTemporaryTable(query);
    long queryDone = System.currentTimeMillis();
    downloadTable(temporaryResultTable);
    long end = System.currentTimeMillis();


    System.out.println("query time:\t" + (queryDone-start) +"ms");
    System.out.println("download time:\t" + (end-queryDone) +"ms");
    System.out.println("total time:\t" + (end-start) +"ms");

  }

  public static TableId getTemporaryTable(String query) throws InterruptedException{
    QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
    Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).build());

    // Wait for the query to complete.
    queryJob = queryJob.waitFor();
    return ((QueryJobConfiguration) queryJob.getConfiguration()).getDestinationTable();
  }


  public static void downloadTable(TableId table) throws IOException {
    String parent = String.format("projects/%s", table.getProject());
    String srcTable =
        String.format(
            "projects/%s/datasets/%s/tables/%s",
            table.getProject(), table.getDataset(), table.getTable());

    TableReadOptions options = TableReadOptions.newBuilder().build();

    ReadSession.Builder sessionBuilder =
        ReadSession.newBuilder()
            .setTable(srcTable)
            .setDataFormat(DataFormat.ARROW)
            .setReadOptions(options);

    CreateReadSessionRequest.Builder builder =
        CreateReadSessionRequest.newBuilder()
            .setParent(parent)
            .setReadSession(sessionBuilder)
            .setMaxStreamCount(1);

    try (BigQueryReadClient client = BigQueryReadClient.create()) {
      ReadSession session = client.createReadSession(builder.build());
      String streamName = session.getStreams(0).getName();

      try (SimpleArrowRowReader reader = new SimpleArrowRowReader(session.getArrowSchema())) {
        ReadRowsRequest readRowsRequest =
            ReadRowsRequest.newBuilder().setReadStream(streamName).build();

        ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);
        for (ReadRowsResponse response : stream) {
          Preconditions.checkState(response.hasArrowRecordBatch());
          reader.processRows(response.getArrowRecordBatch());
        }
      }
    }
  }
}
SimpleArrowRowReader.java
import com.google.cloud.bigquery.storage.v1.ArrowRecordBatch;
import com.google.cloud.bigquery.storage.v1.ArrowSchema;
import com.google.common.base.Preconditions;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ReadChannel;
import org.apache.arrow.vector.ipc.message.MessageSerializer;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

public class SimpleArrowRowReader implements AutoCloseable {

    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
    private final VectorSchemaRoot root;
    private final VectorLoader loader;
    private long count = 0;

    public SimpleArrowRowReader(ArrowSchema arrowSchema) throws IOException {
      Schema schema =
          MessageSerializer.deserializeSchema(
              new ReadChannel(
                  new ByteArrayReadableSeekableByteChannel(
                      arrowSchema.getSerializedSchema().toByteArray())));
      Preconditions.checkNotNull(schema);
      List<FieldVector> vectors = new ArrayList<>();
      for (Field field : schema.getFields()) {
        vectors.add(field.createVector(allocator));
      }
      root = new VectorSchemaRoot(vectors);
      loader = new VectorLoader(root);
    }

    public void processRows(ArrowRecordBatch batch) throws IOException {
      org.apache.arrow.vector.ipc.message.ArrowRecordBatch deserializedBatch =
          MessageSerializer.deserializeRecordBatch(
              new ReadChannel(
                  new ByteArrayReadableSeekableByteChannel(
                      batch.getSerializedRecordBatch().toByteArray())),
              allocator);

      loader.load(deserializedBatch);
      deserializedBatch.close();
      long batchSize = root.getRowCount();
      // System.out.println("\tBatch size: "+ count);
      count += batchSize;
      root.clear();
    }

    @Override
    public void close() {
      System.out.println("Stream size: " + count);
      root.close();
      allocator.close();
    }
}

which outputs

Stream size: 10000000
query time:	928ms
download time:	8523ms
total time:	9451ms

I wonder if that is a viable solution because:

  • I am using the temporary table of the result so I don't know if it will exists until the end of the download
  • I have to cast to QueryJobConfiguration so it doesn't seems a part of the public API

But I don't see any solution more "official" for doing that with the same speed

@stephaniewang526
Copy link
Contributor

This seems reasonable. We potentially will be doing a BigQuery client re-write and this will be taken into consideration when we do that. Thank you!

shubhwip pushed a commit to shubhwip/java-bigquerystorage that referenced this issue Oct 7, 2023
…cies to v2.7.0 (googleapis#802)

[![WhiteSource Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [com.google.cloud:google-cloud-shared-dependencies](https://togithub.com/googleapis/java-shared-dependencies) | `2.6.0` -> `2.7.0` | [![age](https://badges.renovateapi.com/packages/maven/com.google.cloud:google-cloud-shared-dependencies/2.7.0/age-slim)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://badges.renovateapi.com/packages/maven/com.google.cloud:google-cloud-shared-dependencies/2.7.0/adoption-slim)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://badges.renovateapi.com/packages/maven/com.google.cloud:google-cloud-shared-dependencies/2.7.0/compatibility-slim/2.6.0)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://badges.renovateapi.com/packages/maven/com.google.cloud:google-cloud-shared-dependencies/2.7.0/confidence-slim/2.6.0)](https://docs.renovatebot.com/merge-confidence/) |

---

### Release Notes

<details>
<summary>googleapis/java-shared-dependencies</summary>

### [`v2.7.0`](https://togithub.com/googleapis/java-shared-dependencies/blob/HEAD/CHANGELOG.md#&#8203;270-httpsgithubcomgoogleapisjava-shared-dependenciescomparev260v270-2022-01-28)

[Compare Source](https://togithub.com/googleapis/java-shared-dependencies/compare/v2.6.0...v2.7.0)

##### Dependencies

-   update dependency com.fasterxml.jackson:jackson-bom to v2.13.1 ([#&#8203;555](https://togithub.com/googleapis/java-shared-dependencies/issues/555)) ([67b5663](https://togithub.com/googleapis/java-shared-dependencies/commit/67b56635509215fd132b2ffbcaed995a6de93879))
-   update dependency com.google.api-client:google-api-client-bom to v1.33.1 ([#&#8203;580](https://togithub.com/googleapis/java-shared-dependencies/issues/580)) ([895dba2](https://togithub.com/googleapis/java-shared-dependencies/commit/895dba2afef6c295b0e21a4ccd27feee1b6af8e2))
-   update dependency com.google.api:api-common to v2.1.3 ([#&#8203;587](https://togithub.com/googleapis/java-shared-dependencies/issues/587)) ([5dbbce8](https://togithub.com/googleapis/java-shared-dependencies/commit/5dbbce80baa045818757912e6db4f222c12b75f6))
-   update dependency com.google.auth:google-auth-library-bom to v1.4.0 ([#&#8203;578](https://togithub.com/googleapis/java-shared-dependencies/issues/578)) ([c274c62](https://togithub.com/googleapis/java-shared-dependencies/commit/c274c62905192bcc48c82104bf50cbeaa1894c2b))
-   update dependency com.google.errorprone:error_prone_annotations to v2.11.0 ([#&#8203;582](https://togithub.com/googleapis/java-shared-dependencies/issues/582)) ([314da38](https://togithub.com/googleapis/java-shared-dependencies/commit/314da38c9f12cf425b5d1be8297e9f2a78d13948))
-   update dependency com.google.http-client:google-http-client-bom to v1.41.2 ([#&#8203;579](https://togithub.com/googleapis/java-shared-dependencies/issues/579)) ([c8df06d](https://togithub.com/googleapis/java-shared-dependencies/commit/c8df06ddb3f4a975e928bbdd9dfae293e81efa39))
-   update dependency com.google.oauth-client:google-oauth-client-bom to v1.33.0 ([#&#8203;576](https://togithub.com/googleapis/java-shared-dependencies/issues/576)) ([3315988](https://togithub.com/googleapis/java-shared-dependencies/commit/3315988059ca0223f078b0f760960674bfa8c04a))
-   update dependency com.google.protobuf:protobuf-bom to v3.19.3 ([#&#8203;571](https://togithub.com/googleapis/java-shared-dependencies/issues/571)) ([c1931e1](https://togithub.com/googleapis/java-shared-dependencies/commit/c1931e1c4611ca3d1e2afac1c9e983cf1ec4a7de))
-   update dependency io.grpc:grpc-bom to v1.44.0 ([#&#8203;586](https://togithub.com/googleapis/java-shared-dependencies/issues/586)) ([52c47b7](https://togithub.com/googleapis/java-shared-dependencies/commit/52c47b7425af438ba1a82dac87f9002353a6a87a))
-   update dependency org.checkerframework:checker-qual to v3.21.1 ([#&#8203;554](https://togithub.com/googleapis/java-shared-dependencies/issues/554)) ([81bbb91](https://togithub.com/googleapis/java-shared-dependencies/commit/81bbb91774ab67161371bfd0dfd143de89dee4c9))
-   update gax.version to v2.11.0 ([#&#8203;581](https://togithub.com/googleapis/java-shared-dependencies/issues/581)) ([1899612](https://togithub.com/googleapis/java-shared-dependencies/commit/189961226f84cdd4e283780d2c7d4a4dffc3e00c))
-   update gax.version to v2.9.0 ([#&#8203;573](https://togithub.com/googleapis/java-shared-dependencies/issues/573)) ([ee5e38e](https://togithub.com/googleapis/java-shared-dependencies/commit/ee5e38efea574eeb2b5803192da98ba1ce1c8e6e))
-   update google.common-protos.version to v2.7.2 ([#&#8203;588](https://togithub.com/googleapis/java-shared-dependencies/issues/588)) ([caac0e6](https://togithub.com/googleapis/java-shared-dependencies/commit/caac0e68e336b801becb6322b5b81fce1824ad8a))
-   update google.core.version to v2.4.0 ([#&#8203;589](https://togithub.com/googleapis/java-shared-dependencies/issues/589)) ([6a292e3](https://togithub.com/googleapis/java-shared-dependencies/commit/6a292e3530a06d6b228f3154ed4e979ef68d7aeb))
-   update iam.version to v1.2.1 ([#&#8203;569](https://togithub.com/googleapis/java-shared-dependencies/issues/569)) ([62dd270](https://togithub.com/googleapis/java-shared-dependencies/commit/62dd2705b257c4bb539bfe619234f38d0e13b138))
-   update opencensus.version to v0.30.0 ([#&#8203;552](https://togithub.com/googleapis/java-shared-dependencies/issues/552)) ([ebae62e](https://togithub.com/googleapis/java-shared-dependencies/commit/ebae62eca75234db4fd571f42b8ed0f6ce1bd1f7))
-   update opencensus.version to v0.31.0 ([#&#8203;585](https://togithub.com/googleapis/java-shared-dependencies/issues/585)) ([021b066](https://togithub.com/googleapis/java-shared-dependencies/commit/021b066a833af1629c89fcca593d16b9786ee0bf))

</details>

---

### Configuration

📅 **Schedule**: At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox.

---

This PR has been generated by [WhiteSource Renovate](https://renovate.whitesourcesoftware.com). View repository job log [here](https://app.renovatebot.com/dashboard#github/googleapis/java-storage-nio).
shubhwip pushed a commit to shubhwip/java-bigquerystorage that referenced this issue Oct 7, 2023
🤖 I have created a release *beep* *boop*
---


### [0.123.19](googleapis/java-storage-nio@v0.123.18...v0.123.19) (2022-02-03)


### Dependencies

* **java:** update actions/github-script action to v5 ([googleapis#1339](googleapis/java-storage-nio#1339)) ([googleapis#800](googleapis/java-storage-nio#800)) ([4c82c37](googleapis/java-storage-nio@4c82c37))
* update actions/github-script action to v5 ([googleapis#799](googleapis/java-storage-nio#799)) ([40febb2](googleapis/java-storage-nio@40febb2))
* update dependency com.google.cloud:google-cloud-shared-dependencies to v2.7.0 ([googleapis#802](googleapis/java-storage-nio#802)) ([2beefb6](googleapis/java-storage-nio@2beefb6))
* update dependency com.google.cloud:google-cloud-storage to v2.2.3 ([googleapis#786](googleapis/java-storage-nio#786)) ([b82657c](googleapis/java-storage-nio@b82657c))
* update dependency com.google.cloud:google-cloud-storage to v2.3.0 ([googleapis#796](googleapis/java-storage-nio#796)) ([e822be5](googleapis/java-storage-nio@e822be5))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. type: question Request for information or clarification. Not an issue.
Projects
None yet
Development

No branches or pull requests

4 participants