OCSF to S3: Streaming with Kinesis, Firehose, and Zephflow - AI & LLM Workflows

OCSF

Pricing

Blog

Company

Contact

Docs

Try OCSF Mapper

Try Data API Builder

Workflow Builder

OCSF to S3: Streaming with Kinesis, Firehose, and Zephflow

In Part 2, we build the final stage of our pipeline. Learn to stream OCSF logs to S3 as Parquet using Kinesis Firehose, a Glue schema, and a Zephflow sink, making your data ready for large-scale analysis.

Gerrit Jansen van Vuuren

Fouding Engineer, Fleak

In this post, we’ll continue where we left off in our collaboration with Cardinal. If you haven’t read the first part of this series, please start there first.

In the previous post, we explored how to map a VPC log sample into OCSF and run that mapping. We ended up creating a Kinesis source to hold our VPC logs and then built a Zephflow workflow to ingest and transform them. That’s all good, but the ultimate goal is to send these transformed logs somewhere for analysis. That’s what this post is about.

By isolating the transformation step between two queues—in our case, Kinesis data streams—our workflow can focus solely on the transformation. We can scale this step to as many workers as needed. The transformation step then writes into a destination Kinesis stream. From here, we can either use our own software or AWS’s Firehose capability to process the data further.

For our post, we’ll get our OCSF data into S3 as Parquet files. This structured format is crucial as it prepares the data for powerful analytics. The setup enables tools like AWS Security Lake and is also ideal for providing the real-time observability and querying capabilities of the Cardinal platform. Luckily for us, Firehose can handle the conversion to Parquet and delivery to S3. We’ll just need to provide a Glue schema that fits our data. Let’s take a look at that first.

From OCSF To Glue

OCSF as a schema is a kind of lingua franca between formats and tools. Unfortunately, the reality is messier—most tools still have their own schemas. AWS Glue has its own expectations, though it does support Avro. At Fleak, we've built internal tools to convert directly from OCSF to Avro—which is no simple feat.

For this post, we’re giving you a Glue schema that supports a JSON message based on the OCSF network activity schema.

This schema has been simplified from the full OCSF definition. Any recursive types have been removed. OCSF allows infinite recursion, but most downstream tools—sensibly—do not. Avro, for instance, would frown at you.

We’re going to use Firehose to write out to Parquet, which requires a Glue table and schema. Let’s start by creating the database and table

Creating the Glue Table for Parquet Output

We’ll call the db ocsf, and the table networkactivity.

Type in the following command to create a Glue database.

aws glue create-database  --database-input '{"Name": "ocsf"}'

This creates a glue database—nothing like a postgres database and is purely a place to store schemas. Now we can create our table. Glue tables expect their data in S3. We’ll need an s3 bucket and folder for it. If you don’t have a bucket already create one or run the following command

aws s3api create-bucket --bucket ocsf

You may need to change the bucket name. We’ll use the name ocsf, but make sure you update it in the commands that follow.

Now we’re ready to create the Glue table with the right schema. We’ll call our table networkactivity and point it to the bucket we just created.

Update the S3 bucket in the following command and then run it to create the networkactivity glue table:

aws glue create-table \
  --database-name ocsf \
  --table-input '{
    "Name": "networkactivity",
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "classification": "parquet",
      "compressionType": "gzip",
      "typeOfData": "file"
    },
    "PartitionKeys": [
      { "Name": "year", "Type": "int" },
      { "Name": "month", "Type": "int" },
      { "Name": "day", "Type": "int" },
      { "Name": "hour", "Type": "int" }
    ],
    "StorageDescriptor": {
      "Columns": [
        { "Name": "time", "Type": "bigint" },
        { "Name": "proxy", "Type": "struct<ip:string,port:bigint,type:string,type_id:bigint>" },
        { "Name": "status", "Type": "string" },
        { "Name": "message", "Type": "string" },
        { "Name": "traffic", "Type": "struct<bytes:bigint,packets:bigint>" },
        { "Name": "metadata", "Type": "struct<uid:string,product:struct<name:string>,version:string,log_name:string,logged_time:bigint>" },
        { "Name": "type_uid", "Type": "bigint" },
        { "Name": "class_uid", "Type": "bigint" },
        { "Name": "status_id", "Type": "bigint" },
        { "Name": "activity_id", "Type": "bigint" },
        { "Name": "severity_id", "Type": "bigint" },
        { "Name": "status_code", "Type": "string" },
        { "Name": "category_uid", "Type": "bigint" },
        { "Name": "dst_endpoint", "Type": "struct<ip:string,port:bigint>" },
        { "Name": "src_endpoint", "Type": "struct<ip:string,port:bigint>" },
        { "Name": "status_detail", "Type": "string" },
        { "Name": "connection_info", "Type": "struct<uid:string,direction_id:bigint,protocol_num:bigint,protocol_name:string>" },
        { "Name": "timezone_offset", "Type": "bigint" }
      ],
      "Location": "s3://ocsf/networkactivity",
      "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
      }
    }
  }'

This may seem like a huge command but it is quite simple. This command creates a glue table that expects parquet files in the s3 location s3://ocsf/networkactivity. It sets up the standard hive compliant input output formats as parquet. What makes this command so long are the columns in the storage descriptor. This is the glue schema I spoke about earlier, that represents the OCSF schema.

Now that we have our glue database, we’re ready to create a Firehose pipeline, which will use a Kinesis data stream as input. AWS’s Firehose is a way to package up a full delivery stream from source—the Kinesis data stream—to destination—s3.

Creating a Firehose delivery pipeline

We’ll set up a firehose delivery stream and name it ocsf-parquet, and set the type as KinesisStreamAsSource—meaning it will read from Kinesis. We’ll need a source Kinesis stream first. So let’s create it now.

Run the following command to create a Kinesis stream. We’ll write data to this stream, and Firehose will pick it up and write to S3.

aws kinesis create-stream  --stream-name ocsf-stream  --shard-count 1

Before we can continue, we need a role with permissions to access Kinesis, CloudWatch, Glue, and S3—everything in AWS works with roles. Run the following command to create a role called firehose-ocsf-role. Make sure the

aws iam put-role-policy \
  --role-name firehose-ocsf-role \
  --policy-name firehose-ocsf-policy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "kinesis:DescribeStream",
          "kinesis:GetShardIterator",
          "kinesis:GetRecords",
          "kinesis:ListShards"
        ],
        "Resource": "arn:aws:kinesis:eu-west-1:123456789012:stream/ocsf-stream"
      },
      {
        "Effect": "Allow",
        "Action": [
          "s3:PutObject",
          "s3:GetBucketLocation",
          "s3:GetObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::ocsf",
          "arn:aws:s3:::ocsf/*"
        ]
      },
      {
        "Effect": "Allow",
        "Action": [
          "glue:GetTable",
          "glue:GetTableVersion",
          "glue:GetDatabase"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "logs:PutLogEvents"
        ],
        "Resource": "*"
      }
    ]
  }'

With our role and Kinesis setup, we can create the Firehose delivery stream.

For the following command, replace the role ARN and Kinesis ARNs with the ones you created above. Run the following command:

aws firehose create-delivery-stream \
  --delivery-stream-name ocsf-parquet \
  --delivery-stream-type KinesisStreamAsSource \
  --kinesis-stream-source-configuration '{
    "KinesisStreamARN": "<KINESIS STREAM>",
    "RoleARN": "<ROLE>"
  }' \
  --extended-s3-destination-configuration '{
    "RoleARN": "<ROLE>",
    "BucketARN": "arn:aws:s3:::ocsf",
    "Prefix": "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/",
    "ErrorOutputPrefix": "errors/",
    "BufferingHints": {
      "SizeInMBs": 64,
      "IntervalInSeconds": 300
    },
    "CompressionFormat": "UNCOMPRESSED",
    "CloudWatchLoggingOptions": {
      "Enabled": true,
      "LogGroupName": "/aws/kinesisfirehose/ocsf-parquet",
      "LogStreamName": "ocsf-parquet-stream"
    },
    "DataFormatConversionConfiguration": {
      "Enabled": true,
      "InputFormatConfiguration": {
        "Deserializer": {
          "OpenXJsonSerDe": {}
        }
      },
      "OutputFormatConfiguration": {
        "Serializer": {
          "ParquetSerDe": {
            "Compression": "GZIP"
          }
        }
      },
      "SchemaConfiguration": {
        "DatabaseName": "ocsf",
        "TableName": "networkactivity",
        "RoleARN": "<ROLE>"
      }
    }
  }'

This command will create a Firehose delivery stream that reads from the Kinesis stream we created. It will check that the incoming data is compatible with the Glue schema, and then write the data out into Parquet. We also specified buffering options that will either write out a file when its size is 64mb or when 5 minutes have passed. Also note that, because we expect JSON data from Kinesis, we specified the OpenXJsonSerde.

That’s it. We now have a full delivery pipeline that will read from our Kinesis stream and write data into S3 as Parquet files. The only thing left to do is update our Zephflow workflow and run it.

Updating the Zephflow Workflow

If you followed along in the previous post, you should have a Zephflow workflow named

vpc-ocsf.yml and a Python script to run it called vpc_ocsf_transformer.py.

All Zephflow workflows have a source and sink node. Our workflow currently reads from Kinesis, transforms the VPC logs into OCSF, and prints the results. We are going to replace the sink node with a “kinesissink” command and configure it to point to the Kinesis data stream we created in the previous section.

Open your workflow file, and replace the sink command with the data below. Replace the stream name and region with the Kinesis stream created and the AWS region where it lives.

 - id: sink
    commandName: kinesissink
    config: '{"encodingType": "JSON_OBJECT", "streamName": "<STREAM NAME>", "regionStr": "<REGION>"}'
    outputs: []

That’s it. The Zephflow part is refreshingly simple. Now comes the testing part. Run your workflow using the vpc_ocsf_transformer.py Python script. After a few minutes, you should see it writing data to Kinesis.

Debugging

Nothing ever just works. If it does, brilliant!. But it’s good to know what to look out for when things don't.

If there are no obvious errors from Zephflow, then take a look at the s3 bucket in the “error” or “errors” directory. If any files appear there, it means that Firehose couldn’t convert the data to the Glue schema you specified. Download the error file and check for clues as to which columns failed.

Another source of failure is permissions. AWS permissions are not always intuitive, and you will need to double-check that every phase has permissions. CloudWatch can also help you out. If all else fails, create an AWS help desk ticket or ask a friend.

Wrapping Up and Looking Ahead

In this post, we completed the final piece of our two-part series with Cardinal. In the previous post, we created an OCSF mapping using Fleak’s online OCSF mapper and explored how to run the mapping in a workflow using the open-source framework Zephflow.

We then configured AWS to write VPC logs to Kinesis and updated our Zephflow workflow to read them. In this post, we completed the example by creating a Firehose delivery that reads from Kinesis and writes the transformed OCSF data into S3 as Parquet files.

With this pipeline complete, your security data is now continuously transformed into the OCSF standard and stored efficiently in S3. This creates a flexible and powerful data asset, ready for advanced analytics. You can now feed this data into security tools like AWS Security Lake or connect it to platforms like Cardinal to enable real-time observability.

If you’ve followed along, you should have a clearer understanding of where Fleak’s OCSF mapper and the Zephflow framework fit in a data transformation pipeline. You don’t have to stick with Kinesis or VPC logs and can swap these pieces out, maintaining the same transformation logic and workflow core.