Does Trino INSERT OVERWRITE work for external tables in HDFS, both partitioned and unpartitioned?
The INSERT OVERWRITE operation is not supported by Trino when the table is stored on S3, on encrypted HDFS, or at an external location.
The documentation does not make it clear whether this works for HDFS external tables, or how it behaves for partitioned versus unpartitioned tables.
For context, I am really inspired by the Salesforce blog and how they achieved petabyte-scale ETL/ingestion. My goal is to give our data analyst team the power to create simple pipelines using TrinoSQL: since Trino supports a lot of connectors, bringing data from several locations into HDFS is very easy and convenient.
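For concreteness, the kind of table the question is about might be defined roughly like this (a sketch only; the catalog, schema, table, and column names as well as the HDFS path are hypothetical, not taken from the thread):

```sql
-- External, partitioned Hive table backed by an HDFS location:
CREATE TABLE hive.analytics.events (
    user_id bigint,
    action varchar,
    dt date
)
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['dt'],
    external_location = 'hdfs://namenode:8020/warehouse/events'
);
```

The question is whether an INSERT OVERWRITE against such a table (and against its unpartitioned equivalent) is supported when the data lives on HDFS rather than S3.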
I do not know what the original intention was, but I would say that supporting INSERT OVERWRITE for non-partitioned tables seems awkward. It does not bring much benefit over DROP + CREATE. I would say that we should not try to apply hive.insert-existing-partitions-behavior when inserting into a non-partitioned table. One option would be to ignore the config property; the other would be to fail if an INSERT is done to a non-partitioned table when the value is not APPEND.
In the case of partitioned tables, this behavior is applied only to the specific partitions affected by the operation, which is useful for ETL workloads.
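The per-partition behavior described above can be sketched as follows (the table and column names are hypothetical; `hive` is assumed to be the name of the Hive catalog):

```sql
-- Session property form of hive.insert-existing-partitions-behavior:
SET SESSION hive.insert_existing_partitions_behavior = 'OVERWRITE';

-- Only the dt partitions that receive rows from this query are
-- cleared and rewritten; all other partitions are left untouched.
INSERT INTO hive.analytics.events
SELECT user_id, action, CAST(now() AS date) AS dt
FROM postgres.public.events_staging;
```

Note that the catalog session property uses underscores (`insert_existing_partitions_behavior`), while the corresponding config property uses hyphens.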
Recently I tried using Airflow to insert data from Postgres into an external Hive table every 15 minutes, using INSERT OVERWRITE behavior by setting hive.insert_existing_partitions_behavior=OVERWRITE as a session property. It worked fine for one partition, but as soon as a new partition was added when the date changed, I saw duplicate data in the newly added partition as well. I will reconfirm this later next week, but meanwhile, could you please explain what you mean by “only applied to the specific partition”? Was the duplicate data caused by some misconfiguration in my Airflow job, or is Trino not aware of previously added partitions and their data?
I rechecked the code, and it seems that INSERT OVERWRITE should work for unpartitioned tables only in the STAGE_AND_MOVE_TO_TARGET_DIRECTORY write mode. That mode is selected based on this code:
```java
private boolean shouldUseTemporaryDirectory(ConnectorSession session, HdfsContext context, Path path, Optional<Path> externalLocation)
{
    return isTemporaryStagingDirectoryEnabled(session)
            // skip using temporary directory for S3
            && !isS3FileSystem(context, hdfsEnvironment, path)
            // skip using temporary directory if destination is encrypted; it's not possible to move a file between encryption zones
            && !isHdfsEncrypted(context, hdfsEnvironment, path)
            // skip using temporary directory if destination is external; target may be on a different file system
            && externalLocation.isEmpty();
}
```
From what you describe, this should work as expected. Each time you perform an INSERT with OVERWRITE enabled, all existing data is removed from the partitions matched by the current query, and only the newly inserted data remains in them.
But we would need an opinion from someone who has worked with HDFS more.
Got it, thank you so much. As for the drop part, my solution is to create an access policy that disallows DROP on that table, or perhaps to copy the internal table's data to a different HDFS location as a backup.
If you use now() then your dt is always different and the partitions will not be matched.
But I am converting now() to a date in the query, which gives me something like 2021-11-04. Will it still not work?
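For illustration, the conversion being discussed is roughly the following (a sketch only; the alias `dt` is the hypothetical partition column from the examples above):

```sql
-- now() returns a timestamp with sub-second precision, so every run of
-- the pipeline produces a distinct value and would never match an
-- existing partition. Casting to date truncates it to the day, so all
-- runs within the same calendar day target the same dt partition:
SELECT CAST(now() AS date) AS dt;
```

With the cast in place, repeated runs on the same day should overwrite the same partition rather than each creating a new one.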