Streamdal for Data Science
All data that is ingested by the Streamdal collectors is written as parquet files into an AWS S3 bucket and exposed via AWS Athena.
Customers who bring their own S3 bucket get direct access to the parquet data, which can then be used to power out-of-band data-science tasks.
Each collection in Streamdal is exposed in AWS Athena as a table. The table is fully managed and kept up to date by Streamdal as the schema in your collection evolves.
To get access to Athena, send us an email that includes your AWS account ID, and our support representatives will enable access.
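Once access is enabled, a collection can be inspected and queried like any other Athena table. The following is a minimal sketch; the database name `my_streamdal_db` and table name `my_collection` are hypothetical placeholders, not names Streamdal is known to use.

```sql
-- Hypothetical database and table names; substitute the ones visible in your Athena console.

-- Show the current columns of the table (useful after the collection's schema evolves).
DESCRIBE my_streamdal_db.my_collection;

-- Peek at a few rows.
SELECT *
FROM my_streamdal_db.my_collection
LIMIT 10;
```

Re-running `DESCRIBE` after a schema change is a quick way to confirm which columns are currently exposed.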
Snowflake (and most other data warehousing platforms) supports AWS S3 and the parquet format out of the box, which makes integration simple and quick.
Your data team has the following options:
- Use Streamdal’s parquet data as an external table
- Perform a traditional, periodic load (COPY INTO) of the parquet data into a Snowflake table
In order to gain access to the parquet data in S3, you will need to provide Snowflake with your AWS key and secret, along with the full S3 path to the parquet data. If you include a trailing / in the S3 path, Snowflake will recursively walk all of the “directories” and search for parquet data.
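For the external table option, a stage that points at a whole collection (note the trailing slash) can back a Snowflake external table. This is a sketch under stated assumptions: the stage name `collectionstage` and table name `collection_ext` are hypothetical, the credentials are placeholders, and the `raw_json` field is taken from the load example in this document.

```sql
-- Hypothetical stage over an entire collection; the trailing / makes Snowflake
-- walk every year=/month=/day= "directory" under the prefix.
create or replace stage collectionstage
  url = 's3://batchsh-datalakes/$team_id/$account_id/$datalake_id/$collection_id/'
  credentials = (aws_key_id = '<your_aws_key_id>' aws_secret_key = '<your_aws_secret_key>')
  file_format = (type = parquet);

-- List the files Snowflake can see under the prefix.
list @collectionstage;

-- External table over the staged files; each row is exposed through the VALUE variant column.
create or replace external table collection_ext
  with location = @collectionstage
  file_format = (type = parquet)
  auto_refresh = false;

-- With auto_refresh disabled, pick up newly written files manually.
alter external table collection_ext refresh;

-- Query a field directly from the parquet data without loading it.
select value:raw_json::varchar as raw_json
from collection_ext
limit 10;
```

An external table leaves the data in S3, so it suits ad hoc exploration; the COPY INTO approach is likely a better fit when query performance matters.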
Example of a traditional load of a single parquet file via S3
- Create a schema
- Create a parquet format
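No statements are shown for these first two steps; one plausible form is sketched below. The schema name `streamdal_demo` is a hypothetical placeholder, while `my_parquet_format` matches the format name referenced by `infer_schema` later in this example.

```sql
-- Hypothetical schema name; any schema you have privileges on will do.
create schema if not exists streamdal_demo;
use schema streamdal_demo;

-- Named file format, referenced later as 'my_parquet_format'.
create or replace file format my_parquet_format
  type = parquet;
```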
- Create a stage
create or replace stage onefilestage url='s3://batchsh-datalakes/$team_id/$account_id/$datalake_id/$collection_id/year=2021/month=11/day=23/1637629076391797918.parquet' credentials=(aws_key_id='<your_aws_key_id>' aws_secret_key='<your_aws_secret_key>') file_format = (type = parquet);
- Create a table (that points to the stage & format)
create table mytable using template ( select array_agg(object_construct(*)) from table( infer_schema( location=>'@onefilestage', file_format=>'my_parquet_format' ) ) );
- Load the data into the table
copy into MYTABLE from (select $1:raw_json::varchar, $1:batch::variant, $1:client::variant from @onefilestage );
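To confirm the load and to turn it into the periodic load mentioned earlier, something along these lines may work. It is a sketch, not a documented Streamdal setup: the task name `load_streamdal_parquet`, the warehouse `my_wh`, and the stage `collectionstage` (assumed to point at the collection prefix with a trailing /) are all hypothetical.

```sql
-- Sanity-check the single-file load.
select count(*) from mytable;
select * from mytable limit 10;

-- Hypothetical hourly task that re-runs the load against a stage covering the whole collection.
-- Snowflake tracks load metadata, so files that were already loaded are skipped by default.
create or replace task load_streamdal_parquet
  warehouse = my_wh
  schedule = 'USING CRON 0 * * * * UTC'
as
  copy into mytable
  from (
    select $1:raw_json::varchar, $1:batch::variant, $1:client::variant
    from @collectionstage
  )
  pattern = '.*[.]parquet';

-- Tasks are created suspended; resume to start the schedule.
alter task load_streamdal_parquet resume;
```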