What do you do when you want to query over multiple parquet files but the schemas don’t quite line up? Let’s find out ðð»
I’ve got a set of parquet files in S3 (well, lakeFS1, but let’s not quibble over details) with the same datail split by year:
$ lakectl fs ls lakefs://drones03/main/drone-registrations/
object 2023-03-01 09:47:36 +0000 UTC 30.7 kB Registations-P107-Active-2016.parquet
object 2023-03-01 09:48:54 +0000 UTC 119.7 kB Registations-P107-Active-2017.parquet
object 2023-03-01 09:44:47 +0000 UTC 594.3 kB Registations-P107-Active-2018.parquet
object 2023-03-01 09:45:04 +0000 UTC 1.3 MB Registations-P107-Active-2019.parquet
object 2023-03-01 09:48:12 +0000 UTC 2.8 MB Registations-P107-Active-2020.parquet
object 2023-03-01 09:48:51 +0000 UTC 3.2 MB Registations-P107-Active-2021.parquet
I want to query and manipulate the data. DuckDB is my friend since it works with parquet files. I fire it up with an empty database:
$ duckdb drones.duckdb
A DESCRIBE gives me the schema:
D DESCRIBE SELECT * FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet') ;
âââââââââââââââââââââââââââ¬ââââââââââââââ¬ââââââââââ¬ââââââââââ¬ââââââââââ¬ââââââââââ
â column_name â column_type â null â key â default â extra â
â varchar â varchar â varchar â varchar â varchar â varchar â
âââââââââââââââââââââââââââŒââââââââââââââŒââââââââââŒââââââââââŒââââââââââŒââââââââââ€
â Registration Date â VARCHAR â YES â â â â
â Registion Expire Dt â VARCHAR â YES â â â â
â Asset Type â VARCHAR â YES â â â â
â RID Equipped â BOOLEAN â YES â â â â
â Asset Model â VARCHAR â YES â â â â
â Physical City â VARCHAR â YES â â â â
â Physical State/Province â VARCHAR â YES â â â â
â Physical Postal Code â BIGINT â YES â â â â
â Mailing City â VARCHAR â YES â â â â
â Mailing State/Province â VARCHAR â YES â â â â
â Mailing Postal Code â BIGINT â YES â â â â
âââââââââââââââââââââââââââŽââââââââââââââŽââââââââââŽââââââââââŽââââââââââŽââââââââââ€
â 11 rows 6 columns â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Now I should be able to query a sample of the data to check it out, right. Right?
D SELECT *
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet')
USING SAMPLE 5 ROWS;
100% ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Error: Conversion Error: Could not convert string 'V1M 2K9' to INT64
Spoiler: UNION_BY_NAME ð
After posting this blog, two people both suggested using the UNION_BY_NAME option which was added to DuckDB recently. This worked perfectly:
D SELECT *
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet',
union_by_name=True)
USING SAMPLE 5 ROWS;
100% ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
ââââââââââââââââââââââââ¬ââââââââââââââââââââââ¬ââââââââââââââââ¬âââââââââââââââ¬ââââ¬âââââââââââââââââââââââ¬âââââââââââââââ¬âââââââââââââââââââââââ¬ââââââââââââââââââââââ
â Registration Date â Registion Expire Dt â Asset Type â RID Equipped â ⊠â Physical Postal Code â Mailing City â Mailing State/Prov⊠â Mailing Postal Code â
â varchar â varchar â varchar â boolean â â varchar â varchar â varchar â varchar â
ââââââââââââââââââââââââŒââââââââââââââââââââââŒââââââââââââââââŒâââââââââââââââŒââââŒâââââââââââââââââââââââŒâââââââââââââââŒâââââââââââââââââââââââŒââââââââââââââââââââââ€
â 2021-09-27 17:17:1⊠â 2024-09-27 â HOMEBUILT_UAS â false â ⊠â 32177 â Palatka â FL â 32177 â
â 2020-12-06 02:18:5⊠â 2023-12-05 â PURCHASED â â ⊠â 10065 â New York â NY â 10065 â
â 2019-10-08 15:33:3⊠â 2022-10-08 â PURCHASED â â ⊠â 83706 â Boise â ID â 83706 â
â 2020-12-03 14:26:0⊠â 2023-12-03 â PURCHASED â â ⊠â 49506 â Grand Rapids â MI â 49506 â
â 2020-01-27 18:57:0⊠â 2023-01-27 â HOME_BUILT â â ⊠â 35758 â Madison â AL â 35758 â
ââââââââââââââââââââââââŽââââââââââââââââââââââŽââââââââââââââââŽâââââââââââââââŽââââŽâââââââââââââââââââââââŽâââââââââââââââŽâââââââââââââââââââââââŽââââââââââââââââââââââ€
â 5 rows 11 columns (8 shown) â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Thanks @mraasveldt and @__AlexMonahan__ðð»
Problem solved. But if you want to follow along with another option, read onâŠ
Back to the Detective Story ð
So, we have this problem:
D SELECT *
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet')
USING SAMPLE 5 ROWS;
100% ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Error: Conversion Error: Could not convert string 'V1M 2K9' to INT64
Huh. That sucks. Let’s try it on a single file:
D SELECT * FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet') USING SAMPLE 5 ROWS;
ââââââââââââââââââââââââ¬ââââââââââââââââââââââ¬ââââââââââââââââââ¬âââââââââââââââ¬ââââ¬âââââââââââââââââââââââ¬âââââââââââââââââ¬âââââââââââââââââââââââ¬ââââââââââââââââââââââ
â Registration Date â Registion Expire Dt â Asset Type â RID Equipped â ⊠â Physical Postal Code â Mailing City â Mailing State/Prov⊠â Mailing Postal Code â
â varchar â varchar â varchar â boolean â â int64 â varchar â varchar â int64 â
ââââââââââââââââââââââââŒââââââââââââââââââââââŒââââââââââââââââââŒâââââââââââââââŒââââŒâââââââââââââââââââââââŒâââââââââââââââââŒâââââââââââââââââââââââŒââââââââââââââââââââââ€
â 2016-09-28 19:27:5⊠â 2025-09-28 â TRADITIONAL_UAS â false â ⊠â 80203 â Denver â CO â 80203 â
â 2016-10-26 13:10:2⊠â 2022-10-26 â PURCHASED â â ⊠â 33611 â Tampa â FL â 33611 â
â 2016-10-25 15:58:4⊠â 2022-10-25 â PURCHASED â â ⊠â 23337 â Wallops Island â VA â 23337 â
â 2016-11-30 17:17:1⊠â 2022-11-30 â PURCHASED â â ⊠â 32114 â Daytona Beach â FL â 32114 â
â 2016-10-25 15:58:4⊠â 2022-10-25 â PURCHASED â â ⊠â 23337 â Wallops Island â VA â 23337 â
ââââââââââââââââââââââââŽââââââââââââââââââââââŽââââââââââââââââââŽâââââââââââââââŽââââŽâââââââââââââââââââââââŽâââââââââââââââââŽâââââââââââââââââââââââŽââââââââââââââââââââââ€
â 5 rows 11 columns (8 shown) â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
D
That works, so we’re going to have to narrow down the problem. As a side note, I should probably log an enhancement request for more detailed error messages (for example, which file had the error, and which field).
Looking at the error message there’s a data type problem with an INT64 field (BIGINT). In the schema there are two fields with that:
Physical Postal Code
Mailing Postal Code
DuckDB’s parquet docs page points me to the parquet_schema function, so let’s have a look at these fields over the files in question:
SELECT file_name, name, type, logical_type
FROM parquet_schema('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet')
WHERE name like '%Postal Code%';
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ¬âââââââââââââââââââââââ¬âââââââââââââ¬âââââââââââââââ
â file_name â name â type â logical_type â
â varchar â varchar â varchar â varchar â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŒâââââââââââââââââââââââŒâââââââââââââŒâââââââââââââââ€
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â Physical Postal Code â INT64 â â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â Mailing Postal Code â INT64 â â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â Physical Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â Mailing Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet â Physical Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet â Mailing Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet â Physical Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet â Mailing Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â Physical Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â Mailing Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â Physical Postal Code â BYTE_ARRAY â StringType() â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â Mailing Postal Code â BYTE_ARRAY â StringType() â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŽâââââââââââââââââââââââŽâââââââââââââŽâââââââââââââââ€
â 12 rows 4 columns â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
So there’s the problem; the 2016 file uses INT64 for those fields whilst the remaining files use a string. The SELECT that I ran above against all the files failed because when it tried to apply the schema of the first file against the data read from the others it ended up trying to convert a string to a number, which is never going to end well.
Here’s one way to fix things with a UNION ALL. Note the use of filename metadata column to help verify the data we’re getting is what’s expected. To start with let’s just try it against two files:
SELECT filename, CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code"
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
USING SAMPLE 5 ROWS
UNION ALL
SELECT filename, "Physical Postal Code"
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet', filename=true)
USING SAMPLE 5 ROWS;
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ¬âââââââââââââââââââââââ
â filename â Physical Postal Code â
â varchar â varchar â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŒâââââââââââââââââââââââ€
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 80237 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 35222 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 36112 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 87123 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 35806 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 68179 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 32114 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 93637 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 33611 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 36112 â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŽâââââââââââââââââââââââ€
â 10 rows 2 columns â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Now we just need to specify the remainder of the files. Previously we wildcarded, but we can’t include the 2016 file in that (since we’re handling that with a CAST in the first block of the UNION), so we need to modify it. We’ll test our new selection pattern first:
SELECT filename
FROM read_parquet(['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
GROUP BY filename;
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
â filename â
â varchar â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ€
â s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Let’s test this with the UNION:
SELECT filename, CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code"
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
USING SAMPLE 5 ROWS
UNION ALL
SELECT filename, "Physical Postal Code"
FROM read_parquet(['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'], filename=true)
USING SAMPLE 5 ROWS;
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ¬âââââââââââââââââââââââ
â filename â Physical Postal Code â
â varchar â varchar â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŒâââââââââââââââââââââââ€
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 87123 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 32114 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 67301 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 35806 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 32114 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 35806 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â 98290 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â 07641 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â 33133 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â 20166 â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŽâââââââââââââââââââââââ€
â 10 rows 2 columns â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Now we can put this all together to do what we were trying to do in the first place; look at a sample of rows from across the set of files - but making allowances for the mismatched datatypes of the schema.
WITH x AS (SELECT "Registration Date",
"Registion Expire Dt",
"Asset Type",
"RID Equipped",
"Asset Model",
"Physical City",
"Physical State/Province",
CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code",
"Mailing City",
"Mailing State/Province",
CAST("Mailing Postal Code" AS VARCHAR) AS "Mailing Postal Code",
filename
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
UNION ALL
SELECT *
FROM read_parquet (['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
)
SELECT * FROM x
USING SAMPLE 10 ROWS;
ââââââââââââââââââââââââââââââ¬ââââââââââââââââââââââ¬ââââââââââââââââââ¬âââââââââââââââ¬âââââââââââââââââââ¬ââââââââââââââââ¬ââââââââââââââââââââââââââ¬âââââââââââââââââââââââ¬âââââââââââââââ¬âââââââââââââââââââââââââ¬ââââââââââââââââââââââ¬âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
â Registration Date â Registion Expire Dt â Asset Type â RID Equipped â Asset Model â Physical City â Physical State/Province â Physical Postal Code â Mailing City â Mailing State/Province â Mailing Postal Code â filename â
â varchar â varchar â varchar â boolean â varchar â varchar â varchar â varchar â varchar â varchar â varchar â varchar â
ââââââââââââââââââââââââââââââŒââââââââââââââââââââââŒââââââââââââââââââŒâââââââââââââââŒâââââââââââââââââââŒââââââââââââââââŒââââââââââââââââââââââââââŒâââââââââââââââââââââââŒâââââââââââââââŒâââââââââââââââââââââââââŒââââââââââââââââââââââŒâââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ€
â 2021-03-24 13:35:06.817000 â 2024-03-24 â TRADITIONAL_UAS â false â Phantom 4 Pro V2 â Minneapolis â MN â 55423 â Minneapolis â MN â 55423 â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â
â 2021-12-06 12:43:38.265000 â 2024-12-06 â TRADITIONAL_UAS â false â Air 2S â Wo;;ots â CA â 95490 â Wo;;ots â CA â 95490 â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â
â 2021-12-11 01:49:20.235000 â 2024-12-10 â TRADITIONAL_UAS â false â Mavic Air â Rocklin â CA â 95677 â Rocklin â CA â 95677 â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â
â 2020-06-01 12:43:09.114000 â 2023-06-01 â PURCHASED â â Mavic 2 Air â Loveland â CO â 80538 â Loveland â CO â 80538 â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â
â 2021-10-19 22:03:03.630000 â 2024-10-19 â HOMEBUILT_UAS â false â X1 â Philadelphia â PA â 19146 â Philadelphia â PA â 19146 â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â
â 2020-12-31 00:59:16.326000 â 2023-12-30 â PURCHASED â â SP7100 â Cornelius â OR â 97113 â Cornelius â OR â 97113 â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â
â 2020-10-14 18:31:39.662000 â 2023-10-14 â PURCHASED â â Phantom Rtk 4 â San Antonio â TX â 78216 â San Antonio â TX â 78216 â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â
â 2019-05-14 23:10:35.507000 â 2025-05-14 â TRADITIONAL_UAS â false â Mavic Pro 2 â Greensboro â NC â 27409 â Greensboro â NC â 27409 â s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet â
â 2019-05-29 14:50:18.922000 â 2025-05-29 â TRADITIONAL_UAS â false â Phantom 4 Pro â Brighton â CO â 80601 â Brighton â CO â 80601 â s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet â
â 2021-02-06 14:56:38.651000 â 2024-02-06 â PURCHASED â â Mavic 2 Pro â Louisville â KY â 40204 â Louisville â KY â 40204 â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â
ââââââââââââââââââââââââââââââŽââââââââââââââââââââââŽââââââââââââââââââŽâââââââââââââââŽâââââââââââââââââââŽââââââââââââââââŽââââââââââââââââââââââââââŽâââââââââââââââââââââââŽâââââââââââââââŽâââââââââââââââââââââââââŽââââââââââââââââââââââŽâââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ€
â 10 rows 12 columns â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
This looks good, but we should check we’re getting data from all files. We’ll do that with a COUNT aggregate against the CTE:
WITH x AS (SELECT "Registration Date",
"Registion Expire Dt",
"Asset Type",
"RID Equipped",
"Asset Model",
"Physical City",
"Physical State/Province",
CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code",
"Mailing City",
"Mailing State/Province",
CAST("Mailing Postal Code" AS VARCHAR) AS "Mailing Postal Code",
filename
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
UNION ALL
SELECT *
FROM read_parquet (['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
)
SELECT filename, COUNT(*)
FROM x
GROUP BY filename
ORDER BY filename;ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ¬âââââââââââââââ
â filename â count_star() â
â varchar â int64 â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŒâââââââââââââââ€
â s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet â 1280 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet â 5819 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet â 24695 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet â 60105 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet â 143670 â
â s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet â 162826 â
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââŽâââââââââââââââ
Looks good to me!
For a finishing touch we could even wrap it in a VIEW:
CREATE OR REPLACE VIEW Registations_P107_Active AS
WITH x AS (SELECT "Registration Date",
"Registion Expire Dt",
"Asset Type",
"RID Equipped",
"Asset Model",
"Physical City",
"Physical State/Province",
CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code",
"Mailing City",
"Mailing State/Province",
CAST("Mailing Postal Code" AS VARCHAR) AS "Mailing Postal Code",
filename
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
UNION ALL
SELECT *
FROM read_parquet (['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
)
SELECT * FROM x;Which then can be used the same as above like this:
SELECT filename, COUNT(*)
FROM Registations_P107_Active
GROUP BY filename;