Introducing Syntrend - Synthetic Data made easy

Syntrend is a lightweight tool using an expressive project structure to generate randomized synthetic datasets for local development, quality assurance, load testing, and bug investigations.

✨ Introducing Syntrend!

I built Syntrend in response to a lack of tooling that helps developers build their products separately from production data pipelines. It's a synthetic data generator that uses YAML project files to generate random and calculated values and properties based on expressions in the project.

Its primary objectives are:

Be Lightweight: I want it to run anywhere a developer works.
Be Easy to Use: It should be easy to understand and use for all members of the team.
Be Environment Agnostic: Developers can work offline or online, using local or remote workspaces. QA Engineers need to work in a wide variety of environments: local, remote, in CI pipelines, and with integrated systems.
Support As Many Data Types As Possible: Data takes many forms, so we should be able to generate data into those many forms using an extendable toolset.
Be Expressive: Data can have a personality, and we need this data to express that personality wherever we are regardless of how much, how many, or how crazy that data can be.

As of right now (v0.3.0), a 45KB package could potentially generate terabytes of data from a single project multiple times over, expressing the different patterns and behaviours you would see in production.

🧭 Background

I worked many years as a consulting developer for secretive organisations that made it difficult to see their data until I was in their office. I learned to develop based on descriptions they gave me and got fairly good at it. So when I moved into a team role working on the cloud with access to production data, it was like a dream that I could see the data and develop/test around it. But I still relied on testing strategies using data patterns instead of solely relying on production data.

One thing I noticed about my peers was a tendency to build everything using that data, including the testing pipeline. I thought this was odd.

How could you replicate an event that caused a bug quickly?
How do you ensure the fix or feature you made supports this pattern and similar ones like it?
How can you develop at times you don't have access (like commuting to work or when CrowdStrike makes a mistake)?

So of course, when I asked about it, I got these responses:

"The data is too big to create snapshots"
"It's hard to replicate the data when you need it"
"It takes too long to write unit tests for everything we're doing"
"We have no infrastructure for passing alternative datasets to our code"

So I knew something had to be done. I looked around for data generators but they were either too simple (just random value generators) or too complex (using hosted services). So I built one.

❓ How Does It Work?

Easy! Build a project then run it! The easiest part is to run the project, so let's get that out of the way.

Once you've followed the Quickstart, you should have the tool installed. You're ready to run it.

syntrend generate project_file.yaml

That's it. You get all the data you want, in the way you want it, where you want it.

📝 The Project File

The whole structure is documented here, but you can start with a simple file like this one, which generates a random string of characters.

Basic (Generators)

Read more about Generator Types

type: string

Let's save it to "project.yaml" and run it with syntrend generate project.yaml:

"Ht3jCoxzpL"

What about a random name?

type: name

"Phillip Ho"

And how about a complex object, something with multiple values?

type: object
properties:
  attribute:
    type: integer
  content:
    type: name

{"attribute": 21, "content": "Christina Baird"}

You now have a project file that will generate random values.

Want two objects? Sure! Let's name the one we already created "sensor" and add a new one called "users":

objects:
  users:
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor:
    type: object
    properties:
      attribute:
        type: integer
      content:
        type: string

{"first_name": "Jamie", "last_name": "Wheeler", "user_uuid": "6ca70582-ec9b-41f8-87bc-adf78ee5a9a8"}
{"attribute": -418, "content": "M7Nfh0rk2Jte"}

Let's generate a few of each object now:

objects:
  users:
    output:
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor:
    output:
      count: 10
    type: object
    properties:
      attribute:
        type: integer
      content:
        type: string

{"first_name": "Lauren", "last_name": "Ward", "user_uuid": "6058b3a4-feaf-4638-a5b3-954d43c884d5"}
{"first_name": "Mark", "last_name": "Long", "user_uuid": "7e584fb3-d7fd-44c6-9354-c58c325588eb"}
{"first_name": "Brian", "last_name": "George", "user_uuid": "cf68be6e-7abf-4374-b5bc-c99d19fe75e3"}
{"attribute": 189, "content": "TMZFuHU2LGka"}
{"attribute": 52, "content": "LPdwWhxSFFtLtPWbF"}
{"attribute": 481, "content": "yPRfhLgc8Es"}
{"attribute": 457, "content": "GmzZ0zetxBeiLBhQ"}
{"attribute": -383, "content": "DTrRwRiFxWx6eaM12Q"}
{"attribute": -457, "content": "DtL7xGTArT"}
{"attribute": -360, "content": "nbj8eIEJGzgcFKQiGXX9"}
{"attribute": 356, "content": "qxOtpwdIw95Y"}
{"attribute": -447, "content": "B6FndtHh3UBe9"}
{"attribute": 338, "content": "5LqavzqitsQ0JD"}

Expressions/Trends

Read more about Expressions

Now I have two objects. Can I relate one object to another? Let's define "users" as a reference dataset and "sensor" as events referring to it.

In a way, setting collection: true will define
that dataset as reference data, since it is
generated before other datasets.

objects:
  users:
    output:
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor:
    output:
      count: 10
    type: object
    properties:
      user:
        type: uuid
        expression: users(random(1, 3)).user_uuid
      attribute:
        type: integer
      content:
        type: string

[
  {"first_name": "Jack", "last_name": "Davis", "user_uuid": "53cb62d0-e534-4b25-bd5c-6ab41905d93a"},
  {"first_name": "Jonathan", "last_name": "Browning", "user_uuid": "26b38baf-6e59-415a-943b-71c896c2e061"},
  {"first_name": "George", "last_name": "Willis", "user_uuid": "16f1c40e-0c65-4cb3-9a01-e2278014df8c"}
]
{"user": "16f1c40e-0c65-4cb3-9a01-e2278014df8c", "attribute": 282, "content": "vukz1J6qxM8jY"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": 200, "content": "zk9cz3w6itYZNFnF"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": 326, "content": "YZBOxep4D"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 140, "content": "JOL9SurzMfrjEzP4"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 403, "content": "6N66o9v4"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": -175, "content": "6OGwnZEI8qS8"}
{"user": "16f1c40e-0c65-4cb3-9a01-e2278014df8c", "attribute": -320, "content": "P9TvTI5cyNzzPeF8tm"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 324, "content": "JyZJSnvWo"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 305, "content": "nfY4jUPCvv1Ga6lUlNW"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": -129, "content": "aKEmLLnh01FyV8Ae"}
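To make the relationship concrete, here is a minimal Python sketch of the idea behind an expression like users(random(1, 3)).user_uuid: build the reference collection first, then let every event pick one of its records. It only illustrates the pattern and is not how Syntrend is implemented; make_users and make_sensor_events are hypothetical names, and the attribute range is an arbitrary assumption.

```python
import random
import uuid


def make_users(count):
    """Build the reference collection before any events are generated."""
    return [{"user_uuid": str(uuid.uuid4())} for _ in range(count)]


def make_sensor_events(users, count, rng):
    """Create events whose "user" field points at a random reference record."""
    events = []
    for _ in range(count):
        user = rng.choice(users)
        events.append({
            "user": user["user_uuid"],
            # Assumed range, chosen only for illustration.
            "attribute": rng.randint(-500, 500),
        })
    return events


rng = random.Random(42)
users = make_users(3)
events = make_sensor_events(users, 10, rng)
for event in events:
    print(event)
```

Every printed event carries a user value that exists in the users list, which is the property the project file above relies on.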

Let's also make sure the "attribute" value represents a pattern, maybe a unique one per user.

Notice the sensor object was duplicated so each one
represents a specific user and can have its own expression.

objects:
  users:
    output:
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor1:
    output:
      count: 5
    type: object
    properties:
      user:
        type: uuid
        expression: users(1).user_uuid
      attribute:
        type: integer
        expression: interval * 2 + 10
      content:
        type: string
  sensor2:
    output:
      count: 5
    type: object
    properties:
      user:
        type: uuid
        expression: users(2).user_uuid
      attribute:
        type: integer
        expression: 10 * sin(interval * 2/3) + 11
      content:
        type: string
  sensor3:
    output:
      count: 5
    type: object
    properties:
      user:
        type: uuid
        expression: users(3).user_uuid
      attribute:
        type: integer
        expression: 5 + interval ** 3
      content:
        type: string

[
  {"first_name": "Christine", "last_name": "Howell", "user_uuid": "c29d10c7-00db-4f2c-83b2-5cbae225393a"},
  {"first_name": "Susan", "last_name": "Chang", "user_uuid": "c2235e5e-3d82-4224-96ce-d31915da732e"},
  {"first_name": "Ryan", "last_name": "Jones", "user_uuid": "bbf63933-29b5-43f8-a488-5ae15153d128"}
]
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 10, "content": "S1bDSqs11JX9YZN6nnM"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 12, "content": "R2YcoJSoiWG0PlleesUI"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 14, "content": "3SEhZyp5cj2X"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 16, "content": "SBCIEOmEe1rWQpp3JARi"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 18, "content": "PZZvckWpf"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 11, "content": "AAVuPAJIONR"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 17, "content": "X3JobELFhU1W4e"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 20, "content": "HJflwdxhoHX56K9diB"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 20, "content": "PBZTlAOUvrZC"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 15, "content": "7vKIHVRwSZrma"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 5, "content": "mRjdv1HVZ"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 6, "content": "jLh7zqKH"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 13, "content": "JQutPK9UFMceQQ9Ifg96"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 32, "content": "BuYXN4NKniprAOj"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 69, "content": "YIcewp2WKC50"}

Now we have multiple user sensor objects: one with a linear sensor progression, another with a sine-wave pattern, and another with cubic growth.
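As a quick sanity check, the three attribute expressions can be evaluated by hand. The Python sketch below mirrors them, assuming interval starts at 0 and that integer properties are truncated; both assumptions are inferred from the sample output, not taken from the Syntrend documentation. TRENDS and evaluate are hypothetical names.

```python
import math

# Each function mirrors one "attribute" expression from the project file.
TRENDS = {
    "sensor1": lambda interval: interval * 2 + 10,
    "sensor2": lambda interval: 10 * math.sin(interval * 2 / 3) + 11,
    "sensor3": lambda interval: 5 + interval ** 3,
}


def evaluate(trend, count):
    """Evaluate a trend for intervals 0..count-1, truncating to an integer.

    Starting at 0 and truncating are assumptions inferred from the sample output.
    """
    return [int(trend(interval)) for interval in range(count)]


series = {name: evaluate(trend, 5) for name, trend in TRENDS.items()}
for name, values in series.items():
    print(name, values)
# sensor1 [10, 12, 14, 16, 18]
# sensor2 [11, 17, 20, 20, 15]
# sensor3 [5, 6, 13, 32, 69]
```

These are the same attribute values that appear in the generated records above.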

Data Formats

Read more about Formatting

How about writing things into different formats? Let's supply our user data as SQL to be loaded into a database, while the sensor data is XML.

Here, the sensor objects are the same type of object,
so we name that object "sensor" as the XML tag.

objects:
  users:
    output:
      format: sql
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor1:
    output:
      format: xml
      count: 5
      xml_tag: sensor
    type: object
    properties:
      user:
        type: uuid
        expression: users(1).user_uuid
        xml_attr: true
      attribute:
        type: integer
        expression: interval * 2 + 10
        xml_attr: true
      content:
        type: string
  sensor2:
    output:
      format: xml
      count: 5
      xml_tag: sensor
    type: object
    properties:
      user:
        type: uuid
        expression: users(2).user_uuid
        xml_attr: true
      attribute:
        type: integer
        expression: 10 * sin(interval * 2/3) + 11
        xml_attr: true
      content:
        type: string
  sensor3:
    output:
      format: xml
      count: 5
      xml_tag: sensor
    type: object
    properties:
      user:
        type: uuid
        expression: users(3).user_uuid
        xml_attr: true
      attribute:
        type: integer
        expression: 5 + interval ** 3
        xml_attr: true
      content:
        type: string

insert into users (first_name, last_name, user_uuid) values ("Lisa", "Lyons", "f46c3608-7ebd-48c8-8b44-d6442c0794e0");
insert into users (first_name, last_name, user_uuid) values ("Michael", "Taylor", "bf4c132d-189b-4e5d-9cdf-cbf543a985d4");
insert into users (first_name, last_name, user_uuid) values ("Kevin", "Reynolds", "11c3b09f-7d9b-408c-a3c4-2e68d35c4972");

w7vI4It

cpwcH0dnTipZ8

fgzvwHAHKSQAiCdLTP

wwULgvtqa1ggOODz2MZe

BNgcVGp22LwLzq

WztCGYEpHDc8Y9LORq

0qd2zWkEaia3Ma

QaR7wrpFEQZEBMo1OQ

lRGy52

DgjMbhE6e14rkIEvV9J7

OXzNLeua

CR5LtUhZDtEj5Vrc4P

6Cpoz4Larda

eLfUuj7quGeP6m

2T6Vzh7yJbuFgkH4ZFLb
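To picture what the format options do, here is a small Python sketch that renders one record as a SQL INSERT in the style of the output above, and another as an XML element whose xml_attr properties become attributes. The XML mapping is an assumption about what xml_tag and xml_attr mean, so the markup Syntrend actually emits may differ. to_sql and to_xml are hypothetical helpers, and example-user-id is a placeholder value.

```python
import xml.etree.ElementTree as ET


def to_sql(table, record):
    """Render one record as an INSERT statement in the style of the sample output.

    Values are not escaped, so this is for illustration only.
    """
    columns = ", ".join(record)
    values = ", ".join(f'"{value}"' for value in record.values())
    return f"insert into {table} ({columns}) values ({values});"


def to_xml(tag, record, attr_keys):
    """Render one record as an element; keys in attr_keys become XML attributes.

    Using the remaining values as element text is an assumption.
    """
    element = ET.Element(tag, {key: str(record[key]) for key in attr_keys})
    text = [str(value) for key, value in record.items() if key not in attr_keys]
    element.text = "".join(text)
    return ET.tostring(element, encoding="unicode")


user = {
    "first_name": "Lisa",
    "last_name": "Lyons",
    "user_uuid": "f46c3608-7ebd-48c8-8b44-d6442c0794e0",
}
sensor = {"user": "example-user-id", "attribute": 10, "content": "w7vI4It"}

sql_line = to_sql("users", user)
xml_line = to_xml("sensor", sensor, attr_keys=("user", "attribute"))
print(sql_line)
print(xml_line)
```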

Need to write them to files? OK! Add an output block with directory set to the directory you want the files to be written to.

output:
  directory: out
objects:
  ...

This is a global output for the project, and can be
redefined at the object-level to override specific values.

Now there's a new "out" directory containing 15 XML files (5 for each sensor object) and a SQL file with the 3 INSERT commands, just like what was in the console earlier.

Want some tabular data for a report? Got that too! You can duplicate one of the objects and make it reference the other object to be printed to console in a readable format.

For now, you'll need to set directory: '-' in the "users_report"
object's output block to signify you want to write to the console.

This will change with v1.0.

...
  users_report:
    output:
      format: table
      directory: '-'
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
        expression: users(interval + 1).first_name
      last_name:
        type: last_name
        expression: users(interval + 1).last_name
      user_uuid:
        type: uuid
        expression: users(interval + 1).user_uuid

first_name last_name user_uuid                            
===========================================================
 Amanda     Lowery    ccca1b25-b600-4729-adf5-1ef767491e28 
 Kim        Hammond   7b64afbc-c84b-4f50-b801-6576a7978fff 
 Eric       Johnson   ee3a9acb-7e5b-4d99-9f52-775f62b20175

Re-composing Projects

There are a few things that make YAML such a great choice for large, user-modified documents as compared to XML or JSON:

Easy-to-read, white-spaced syntax
Files may contain multiple documents (separated by ---)
Custom tags to help serialize objects
Anchors and Aliases with object overrides

All of these elements help to break large structures down into smaller, reusable elements. Unfortunately, there is one limitation with Anchors and Aliases: they do not support overriding nested objects. Instead, according to the YAML syntax, the alias will be applied, but since another instance of the same key name exists, the object will be fully replaced by the later key.

..., it is necessary to impose an order on mapping keys and
employ alias nodes to indicate a subsequent occurrence of a
previously encountered node

For example, the &obj1 anchor is re-applied for obj2 with some modifications.

obj1: &obj1
  key1: test
  key2: word
  key3:
    some: keys
obj2:
  <<: *obj1
  key2: words
  key3:
    other: keys
  key4: another test

which would serialize as:

obj1:
  key1: test
  key2: word
  key3:
    some: keys
obj2:
  key1: test
  key2: word
  key3:
    some: keys
  key2: words
  key3:
    other: keys
  key4: another test

The keys in obj2 have duplicates because of the alias. This is fine for updating simple keys like key2 and for introducing new keys like key4. For objects that contain nested keys (like key3), we don't get a merge of objects but a full replacement, which is not great for large structures.
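The same effect can be reproduced with plain Python dictionaries, which behave like the loaded YAML mappings here. This sketch uses a shallow merge to stand in for the alias; obj2_overrides is a hypothetical name for the keys written explicitly under obj2.

```python
obj1 = {
    "key1": "test",
    "key2": "word",
    "key3": {"some": "keys"},
}

# Keys written explicitly under obj2, next to the "<<: *obj1" merge.
obj2_overrides = {
    "key2": "words",
    "key3": {"other": "keys"},
    "key4": "another test",
}

# A shallow merge: later keys win, and nested mappings are replaced whole.
obj2 = {**obj1, **obj2_overrides}
print(obj2)
# {'key1': 'test', 'key2': 'words', 'key3': {'other': 'keys'}, 'key4': 'another test'}
```

Note that "some: keys" is gone from key3: the nested mapping was swapped out, not merged.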

To resolve this limitation, a few YAML tags and meta objects were introduced (!syntrend/ref, !syntrend/root, and bases). To see this in action, we can redefine the project from the previous section using these elements:

--- !syntrend/ref::users
output:
  collection: true
  count: 3
type: object
properties:
  first_name:
    type: first_name
  last_name:
    type: last_name
  user_uuid:
    type: uuid

--- !syntrend/ref::sensors
output:
  format: xml
  count: 5
  xml_tag: sensor
type: object
properties:
  user:
    type: uuid
    xml_attr: true
  attribute:
    type: integer
    xml_attr: true
  content:
    type: string

--- !syntrend/root
output:
  directory: out
objects:
  users:
    bases:
      - ref: users
    output:
      format: sql
  users_report:
    bases:
      - ref: users
    output:
      format: table
      directory: '-'
    properties:
      first_name:
        expression: users(interval + 1).first_name
      last_name:
        expression: users(interval + 1).last_name
      user_uuid:
        expression: users(interval + 1).user_uuid
  sensor1:
    bases:
      - ref: sensors
    properties:
      user:
        expression: users(1).user_uuid
      attribute:
        expression: interval * 2 + 10
  sensor2:
    bases:
      - ref: sensors
    properties:
      user:
        expression: users(2).user_uuid
      attribute:
        expression: 10 * sin(interval * 2/3) + 11
  sensor3:
    bases:
      - ref: sensors
    properties:
      user:
        expression: users(3).user_uuid
      attribute:
        expression: 5 + interval ** 3

Here, we've created two reference documents (named ref::users and ref::sensors) and one root document (tagged with !syntrend/root to remove confusion in parsing logic). Each object lists the references it builds on in a bases meta object (you can use _bases instead if you need bases for something else), followed by overrides for the values it changes in those references.
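Conceptually, resolving bases amounts to a deep merge: start from the referenced document, then layer the object's own keys on top without discarding nested values. The Python sketch below shows that idea on the sensor1 object. It is one plausible reading of the behaviour described here, not Syntrend's actual code, and deep_merge and resolve are hypothetical helpers.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(obj, refs):
    """Apply each referenced base in order, then the object's own keys."""
    result = {}
    for base in obj.get("bases", []):
        result = deep_merge(result, refs[base["ref"]])
    own_keys = {key: value for key, value in obj.items() if key != "bases"}
    return deep_merge(result, own_keys)


refs = {
    "sensors": {
        "output": {"format": "xml", "count": 5, "xml_tag": "sensor"},
        "type": "object",
        "properties": {
            "user": {"type": "uuid", "xml_attr": True},
            "attribute": {"type": "integer", "xml_attr": True},
            "content": {"type": "string"},
        },
    }
}

sensor1 = resolve(
    {
        "bases": [{"ref": "sensors"}],
        "properties": {
            "user": {"expression": "users(1).user_uuid"},
            "attribute": {"expression": "interval * 2 + 10"},
        },
    },
    refs,
)
print(sensor1["properties"]["user"])
# {'type': 'uuid', 'xml_attr': True, 'expression': 'users(1).user_uuid'}
```

The user property keeps its type and xml_attr from the reference and gains the expression from the override, which is exactly what a plain YAML alias cannot do.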

How Will Anyone Use It?

I started writing how different people on a team would use it, but it became a very lengthy essay.

Instead, I'm releasing this project to the public and I want to hear how you or those around you will use it. Since this tool will be used mostly for development, testing and/or benchmarking, I have a goal to get Syntrend to be as stable as possible so everyone can rely on it. That means gathering a collection of use cases that I can then use to solidify my testing framework and document the tool for others to learn from.

🔭 What's Next?

Learn from the users! I'm starting to identify things to get this to v1.0, but I need feedback from you. Specifically:

How would you use it?
Is there functionality missing that would help to make an impact on your workflow?
Any generators, formats or output targets you would like to see supported?

Please feel free to try it out, share contributions, suggestions, and/or issues.
