load_table_from_dataframe support JSON column dtype #1966

Open

jlynchMicron opened this issue Jul 2, 2024 · 4 comments
Labels
api: bigquery - Issues related to the googleapis/python-bigquery API.
type: feature request - ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

jlynchMicron commented Jul 2, 2024

Is your feature request related to a problem? Please describe.
python-bigquery does not currently appear to support uploading DataFrames when one of the destination table's columns has the JSON type.

Quick partial code example (data is a pandas DataFrame and table_obj the destination table, both defined elsewhere):

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "JSON"),  # stored in the dataframe as a JSON-formatted string
]
job_config = bigquery.LoadJobConfig(schema=schema)

job = client.load_table_from_dataframe(data, table_obj, job_config=job_config)
print(job.result())

Result:
google.api_core.exceptions.BadRequest: 400 Unsupported field type: JSON; reason: invalid, message: Unsupported field type: JSON

Describe the solution you'd like
Implement support for loading DataFrames that contain JSON columns to BigQuery.

Additional context

$ conda list | grep bigquery
google-cloud-bigquery-core 3.25.0             pyhd8ed1ab_0  
google-cloud-bigquery-storage 2.25.0             pyhca7485f_0  
google-cloud-bigquery-storage-core 2.25.0             pyhca7485f_0

Related issues:
googleapis/python-bigquery-sqlalchemy#399
googleapis/python-bigquery-pandas#698
googleapis/python-bigquery-dataframes#816

product-auto-label bot added the api: bigquery label Jul 2, 2024

jlynchMicron (Author) commented Jul 2, 2024

I have also tried uploading the data as dtype STRING in hopes that it would be converted to JSON server-side, but that also results in the following error (table name redacted):

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/gdw-dev-sse/jobs?uploadType=resumable: Provided Schema does not match Table XXXXXXXXXXXXXXXX. Field summery_data has changed type from JSON to STRING

jlynchMicron (Author) commented

I was able to upload my data with the following code, but JSON support should be added for pandas dataframes.

Code:

# Datetime objects are not JSON serializable; convert them to strings first.
data['summery_datetime'] = data['summery_datetime'].astype(str)
# Serialize rows as dicts and load them as JSON instead of Parquet.
data = data.to_dict(orient='records')
job = client.load_table_from_json(data, table_obj, job_config=job_config)
print(job.result())
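
A slightly more general version of this workaround (a sketch, not from the thread; load_dataframe_as_json is a hypothetical helper): stringify every datetime-like column before handing the rows to load_table_from_json, since pandas.Timestamp values are not JSON serializable.

import pandas as pd
from google.cloud import bigquery

def load_dataframe_as_json(client, df, table, job_config):
    # Stringify datetime columns first; pandas.Timestamp values are
    # not JSON serializable.
    df = df.copy()
    for col in df.select_dtypes(include=["datetime", "datetimetz"]).columns:
        df[col] = df[col].astype(str)
    # Load the rows as JSON instead of Parquet.
    rows = df.to_dict(orient="records")
    return client.load_table_from_json(rows, table, job_config=job_config)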

tswast (Contributor) commented Jul 8, 2024

Thanks @jlynchMicron for providing a workaround. I think there are a few problems we'll need to work through. One is that the BigQuery backend doesn't support JSON in load jobs from Parquet files (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions), and Parquet is the format load_table_from_dataframe uses by default. Please file a feature request at https://cloud.google.com/support/docs/issue-trackers (specifically, "Create new BigQuery issue").

Another possible workaround is to use CSV as the source format. In your example:

schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "JSON"),  # stored in the dataframe as a JSON-formatted string
]
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.CSV,
)

job = client.load_table_from_dataframe(data, table_obj, job_config=job_config)
print(job.result())
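
With CSV as the source format, the client serializes the DataFrame via to_csv rather than Parquet, which should sidestep the Parquet type-conversion limitation above. The trade-off, noted below: CSV cannot represent nested STRUCT/ARRAY columns, so this route only works for flat schemas.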

For Googlers watching this issue, I have proposed a design (go/bf-json) that adds a JSONDtype to https://github.com/googleapis/python-db-dtypes-pandas, which would let us autodetect when JSON is used in a DataFrame. Before that, though, we could choose the appropriate serialization format based on the provided schema: Parquet must be used for STRUCT/ARRAY columns, while CSV must be used for JSON.
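
A rough sketch of that schema-based choice (illustrative only; pick_source_format is a hypothetical helper, not part of the client):

from google.cloud import bigquery

def pick_source_format(schema):
    # Parquet load jobs cannot carry JSON columns, and CSV cannot
    # represent nested STRUCT/ARRAY values, so inspect the column
    # types and pick whichever format they allow.
    has_json = any(field.field_type == "JSON" for field in schema)
    has_nested = any(
        field.field_type in ("RECORD", "STRUCT") or field.mode == "REPEATED"
        for field in schema
    )
    if has_json and has_nested:
        raise ValueError(
            "No single load format supports both JSON and STRUCT/ARRAY columns."
        )
    return bigquery.SourceFormat.CSV if has_json else bigquery.SourceFormat.PARQUET

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=pick_source_format(schema),
)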

tswast added the type: feature request label Jul 8, 2024
leahecole assigned Linchin and unassigned leahecole Jul 16, 2024
Linchin assigned chelsea-lin and unassigned Linchin Jul 23, 2024

Linchin (Contributor) commented Jul 23, 2024

googleapis/python-db-dtypes-pandas#284 can potentially fulfill this feature request.
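
If that lands, usage might look roughly like the following (a speculative sketch of the PR's direction; the JSONDtype name, the example table, and the autodetection behavior are all assumptions, not a released API):

import db_dtypes  # assumed to expose JSONDtype once #284 is released
import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {
        "system_id": ["a1"],
        # Assumption: a JSON extension dtype that stores Python objects
        # serialized as JSON.
        "summery_data": pd.Series([{"temp": 72}], dtype=db_dtypes.JSONDtype()),
    }
)

# Assumption: the client could then detect the JSON column and choose a
# compatible serialization format without an explicit schema override.
client = bigquery.Client()
client.load_table_from_dataframe(df, "my_dataset.my_table").result()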
