What do you do when you want to query over multiple parquet files but the schemas don’t quite line up? Let’s find out ππ»
I’ve got a set of parquet files in S3 (well, lakeFS1, but let’s not quibble over details) with the same datail split by year:
$ lakectl fs ls lakefs://drones03/main/drone-registrations/
object 2023-03-01 09:47:36 +0000 UTC 30.7 kB Registations-P107-Active-2016.parquet
object 2023-03-01 09:48:54 +0000 UTC 119.7 kB Registations-P107-Active-2017.parquet
object 2023-03-01 09:44:47 +0000 UTC 594.3 kB Registations-P107-Active-2018.parquet
object 2023-03-01 09:45:04 +0000 UTC 1.3 MB Registations-P107-Active-2019.parquet
object 2023-03-01 09:48:12 +0000 UTC 2.8 MB Registations-P107-Active-2020.parquet
object 2023-03-01 09:48:51 +0000 UTC 3.2 MB Registations-P107-Active-2021.parquet
I want to query and manipulate the data. DuckDB is my friend since it works with parquet files. I fire it up with an empty database:
$ duckdb drones.duckdb
A DESCRIBE
gives me the schema:
D DESCRIBE SELECT * FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet') ;
βββββββββββββββββββββββββββ¬ββββββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ
β column_name β column_type β null β key β default β extra β
β varchar β varchar β varchar β varchar β varchar β varchar β
βββββββββββββββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββ€
β Registration Date β VARCHAR β YES β β β β
β Registion Expire Dt β VARCHAR β YES β β β β
β Asset Type β VARCHAR β YES β β β β
β RID Equipped β BOOLEAN β YES β β β β
β Asset Model β VARCHAR β YES β β β β
β Physical City β VARCHAR β YES β β β β
β Physical State/Province β VARCHAR β YES β β β β
β Physical Postal Code β BIGINT β YES β β β β
β Mailing City β VARCHAR β YES β β β β
β Mailing State/Province β VARCHAR β YES β β β β
β Mailing Postal Code β BIGINT β YES β β β β
βββββββββββββββββββββββββββ΄ββββββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ€
β 11 rows 6 columns β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Now I should be able to query a sample of the data to check it out, right. Right?
D SELECT *
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet')
USING SAMPLE 5 ROWS;
100% ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Error: Conversion Error: Could not convert string 'V1M 2K9' to INT64
Spoiler: UNION_BY_NAME
π
After posting this blog, two people both suggested using the UNION_BY_NAME
option which was added to DuckDB recently. This worked perfectly:
D SELECT *
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet',
union_by_name=True)
USING SAMPLE 5 ROWS;
100% ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββ¬ββββββββββββββββββββββ¬ββββββββββββββββ¬βββββββββββββββ¬ββββ¬βββββββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββββ¬ββββββββββββββββββββββ
β Registration Date β Registion Expire Dt β Asset Type β RID Equipped β β¦ β Physical Postal Code β Mailing City β Mailing State/Provβ¦ β Mailing Postal Code β
β varchar β varchar β varchar β boolean β β varchar β varchar β varchar β varchar β
ββββββββββββββββββββββββΌββββββββββββββββββββββΌββββββββββββββββΌβββββββββββββββΌββββΌβββββββββββββββββββββββΌβββββββββββββββΌβββββββββββββββββββββββΌββββββββββββββββββββββ€
β 2021-09-27 17:17:1β¦ β 2024-09-27 β HOMEBUILT_UAS β false β β¦ β 32177 β Palatka β FL β 32177 β
β 2020-12-06 02:18:5β¦ β 2023-12-05 β PURCHASED β β β¦ β 10065 β New York β NY β 10065 β
β 2019-10-08 15:33:3β¦ β 2022-10-08 β PURCHASED β β β¦ β 83706 β Boise β ID β 83706 β
β 2020-12-03 14:26:0β¦ β 2023-12-03 β PURCHASED β β β¦ β 49506 β Grand Rapids β MI β 49506 β
β 2020-01-27 18:57:0β¦ β 2023-01-27 β HOME_BUILT β β β¦ β 35758 β Madison β AL β 35758 β
ββββββββββββββββββββββββ΄ββββββββββββββββββββββ΄ββββββββββββββββ΄βββββββββββββββ΄ββββ΄βββββββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββββ΄ββββββββββββββββββββββ€
β 5 rows 11 columns (8 shown) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Thanks @mraasveldt
and @__AlexMonahan__
ππ»
Problem solved. But if you want to follow along with another option, read onβ¦
Back to the Detective Story π
So, we have this problem:
D SELECT *
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet')
USING SAMPLE 5 ROWS;
100% ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Error: Conversion Error: Could not convert string 'V1M 2K9' to INT64
Huh. That sucks. Let’s try it on a single file:
D SELECT * FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet') USING SAMPLE 5 ROWS;
ββββββββββββββββββββββββ¬ββββββββββββββββββββββ¬ββββββββββββββββββ¬βββββββββββββββ¬ββββ¬βββββββββββββββββββββββ¬βββββββββββββββββ¬βββββββββββββββββββββββ¬ββββββββββββββββββββββ
β Registration Date β Registion Expire Dt β Asset Type β RID Equipped β β¦ β Physical Postal Code β Mailing City β Mailing State/Provβ¦ β Mailing Postal Code β
β varchar β varchar β varchar β boolean β β int64 β varchar β varchar β int64 β
ββββββββββββββββββββββββΌββββββββββββββββββββββΌββββββββββββββββββΌβββββββββββββββΌββββΌβββββββββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββββββββΌββββββββββββββββββββββ€
β 2016-09-28 19:27:5β¦ β 2025-09-28 β TRADITIONAL_UAS β false β β¦ β 80203 β Denver β CO β 80203 β
β 2016-10-26 13:10:2β¦ β 2022-10-26 β PURCHASED β β β¦ β 33611 β Tampa β FL β 33611 β
β 2016-10-25 15:58:4β¦ β 2022-10-25 β PURCHASED β β β¦ β 23337 β Wallops Island β VA β 23337 β
β 2016-11-30 17:17:1β¦ β 2022-11-30 β PURCHASED β β β¦ β 32114 β Daytona Beach β FL β 32114 β
β 2016-10-25 15:58:4β¦ β 2022-10-25 β PURCHASED β β β¦ β 23337 β Wallops Island β VA β 23337 β
ββββββββββββββββββββββββ΄ββββββββββββββββββββββ΄ββββββββββββββββββ΄βββββββββββββββ΄ββββ΄βββββββββββββββββββββββ΄βββββββββββββββββ΄βββββββββββββββββββββββ΄ββββββββββββββββββββββ€
β 5 rows 11 columns (8 shown) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
D
That works, so we’re going to have to narrow down the problem. As a side note, I should probably log an enhancement request for more detailed error messages (for example, which file had the error, and which field).
Looking at the error message there’s a data type problem with an INT64
field (BIGINT
). In the schema there are two fields with that:
Physical Postal Code
Mailing Postal Code
DuckDB’s parquet docs page points me to the parquet_schema
function, so let’s have a look at these fields over the files in question:
SELECT file_name, name, type, logical_type
FROM parquet_schema('s3://drones03/main/drone-registrations/Registations-P107-Active-*.parquet')
WHERE name like '%Postal Code%';
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββ¬βββββββββββββ¬βββββββββββββββ
β file_name β name β type β logical_type β
β varchar β varchar β varchar β varchar β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββΌβββββββββββββΌβββββββββββββββ€
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β Physical Postal Code β INT64 β β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β Mailing Postal Code β INT64 β β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β Physical Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β Mailing Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet β Physical Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet β Mailing Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet β Physical Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet β Mailing Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β Physical Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β Mailing Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β Physical Postal Code β BYTE_ARRAY β StringType() β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β Mailing Postal Code β BYTE_ARRAY β StringType() β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββ΄βββββββββββββ΄βββββββββββββββ€
β 12 rows 4 columns β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
So there’s the problem; the 2016 file uses INT64
for those fields whilst the remaining files use a string. The SELECT
that I ran above against all the files failed because when it tried to apply the schema of the first file against the data read from the others it ended up trying to convert a string to a number, which is never going to end well.
Here’s one way to fix things with a UNION ALL
. Note the use of filename
metadata column to help verify the data we’re getting is what’s expected. To start with let’s just try it against two files:
SELECT filename, CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code"
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
USING SAMPLE 5 ROWS
UNION ALL
SELECT filename, "Physical Postal Code"
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet', filename=true)
USING SAMPLE 5 ROWS;
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββ
β filename β Physical Postal Code β
β varchar β varchar β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββ€
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 80237 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 35222 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 36112 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 87123 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 35806 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 68179 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 32114 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 93637 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 33611 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 36112 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββ€
β 10 rows 2 columns β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Now we just need to specify the remainder of the files. Previously we wildcarded, but we can’t include the 2016 file in that (since we’re handling that with a CAST
in the first block of the UNION
), so we need to modify it. We’ll test our new selection pattern first:
SELECT filename
FROM read_parquet(['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
GROUP BY filename;
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β filename β
β varchar β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Let’s test this with the UNION
:
SELECT filename, CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code"
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
USING SAMPLE 5 ROWS
UNION ALL
SELECT filename, "Physical Postal Code"
FROM read_parquet(['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'], filename=true)
USING SAMPLE 5 ROWS;
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββ
β filename β Physical Postal Code β
β varchar β varchar β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββ€
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 87123 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 32114 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 67301 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 35806 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 32114 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 35806 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β 98290 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β 07641 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β 33133 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β 20166 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββ€
β 10 rows 2 columns β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Now we can put this all together to do what we were trying to do in the first place; look at a sample of rows from across the set of files - but making allowances for the mismatched datatypes of the schema.
WITH x AS (SELECT "Registration Date",
"Registion Expire Dt",
"Asset Type",
"RID Equipped",
"Asset Model",
"Physical City",
"Physical State/Province",
CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code",
"Mailing City",
"Mailing State/Province",
CAST("Mailing Postal Code" AS VARCHAR) AS "Mailing Postal Code",
filename
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
UNION ALL
SELECT *
FROM read_parquet (['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
)
SELECT * FROM x
USING SAMPLE 10 ROWS;
ββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββ¬ββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββ¬ββββββββββββββββ¬ββββββββββββββββββββββββββ¬βββββββββββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββββββ¬ββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Registration Date β Registion Expire Dt β Asset Type β RID Equipped β Asset Model β Physical City β Physical State/Province β Physical Postal Code β Mailing City β Mailing State/Province β Mailing Postal Code β filename β
β varchar β varchar β varchar β boolean β varchar β varchar β varchar β varchar β varchar β varchar β varchar β varchar β
ββββββββββββββββββββββββββββββΌββββββββββββββββββββββΌββββββββββββββββββΌβββββββββββββββΌβββββββββββββββββββΌββββββββββββββββΌββββββββββββββββββββββββββΌβββββββββββββββββββββββΌβββββββββββββββΌβββββββββββββββββββββββββΌββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 2021-03-24 13:35:06.817000 β 2024-03-24 β TRADITIONAL_UAS β false β Phantom 4 Pro V2 β Minneapolis β MN β 55423 β Minneapolis β MN β 55423 β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β
β 2021-12-06 12:43:38.265000 β 2024-12-06 β TRADITIONAL_UAS β false β Air 2S β Wo;;ots β CA β 95490 β Wo;;ots β CA β 95490 β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β
β 2021-12-11 01:49:20.235000 β 2024-12-10 β TRADITIONAL_UAS β false β Mavic Air β Rocklin β CA β 95677 β Rocklin β CA β 95677 β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β
β 2020-06-01 12:43:09.114000 β 2023-06-01 β PURCHASED β β Mavic 2 Air β Loveland β CO β 80538 β Loveland β CO β 80538 β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β
β 2021-10-19 22:03:03.630000 β 2024-10-19 β HOMEBUILT_UAS β false β X1 β Philadelphia β PA β 19146 β Philadelphia β PA β 19146 β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β
β 2020-12-31 00:59:16.326000 β 2023-12-30 β PURCHASED β β SP7100 β Cornelius β OR β 97113 β Cornelius β OR β 97113 β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β
β 2020-10-14 18:31:39.662000 β 2023-10-14 β PURCHASED β β Phantom Rtk 4 β San Antonio β TX β 78216 β San Antonio β TX β 78216 β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β
β 2019-05-14 23:10:35.507000 β 2025-05-14 β TRADITIONAL_UAS β false β Mavic Pro 2 β Greensboro β NC β 27409 β Greensboro β NC β 27409 β s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet β
β 2019-05-29 14:50:18.922000 β 2025-05-29 β TRADITIONAL_UAS β false β Phantom 4 Pro β Brighton β CO β 80601 β Brighton β CO β 80601 β s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet β
β 2021-02-06 14:56:38.651000 β 2024-02-06 β PURCHASED β β Mavic 2 Pro β Louisville β KY β 40204 β Louisville β KY β 40204 β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β
ββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββ΄ββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββ΄ββββββββββββββββ΄ββββββββββββββββββββββββββ΄βββββββββββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββββββ΄ββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 10 rows 12 columns β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
This looks good, but we should check we’re getting data from all files. We’ll do that with a COUNT
aggregate against the CTE:
WITH x AS (SELECT "Registration Date",
"Registion Expire Dt",
"Asset Type",
"RID Equipped",
"Asset Model",
"Physical City",
"Physical State/Province",
CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code",
"Mailing City",
"Mailing State/Province",
CAST("Mailing Postal Code" AS VARCHAR) AS "Mailing Postal Code",
filename
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
UNION ALL
SELECT *
FROM read_parquet (['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
)
SELECT filename, COUNT(*)
FROM x
GROUP BY filename
ORDER BY filename;
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββββββββ
β filename β count_star() β
β varchar β int64 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββ€
β s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet β 1280 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2017.parquet β 5819 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2018.parquet β 24695 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2019.parquet β 60105 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2020.parquet β 143670 β
β s3://drones03/main/drone-registrations/Registations-P107-Active-2021.parquet β 162826 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββ
Looks good to me!
For a finishing touch we could even wrap it in a VIEW
:
CREATE OR REPLACE VIEW Registations_P107_Active AS
WITH x AS (SELECT "Registration Date",
"Registion Expire Dt",
"Asset Type",
"RID Equipped",
"Asset Model",
"Physical City",
"Physical State/Province",
CAST("Physical Postal Code" AS VARCHAR) AS "Physical Postal Code",
"Mailing City",
"Mailing State/Province",
CAST("Mailing Postal Code" AS VARCHAR) AS "Mailing Postal Code",
filename
FROM read_parquet('s3://drones03/main/drone-registrations/Registations-P107-Active-2016.parquet', filename=true)
UNION ALL
SELECT *
FROM read_parquet (['s3://drones03/main/drone-registrations/Registations-P107-Active-201[7-9].parquet',
's3://drones03/main/drone-registrations/Registations-P107-Active-202*.parquet'],
filename=true)
)
SELECT * FROM x;
Which then can be used the same as above like this:
SELECT filename, COUNT(*)
FROM Registations_P107_Active
GROUP BY filename;