Syntrend is a lightweight tool using an expressive project structure to generate randomized synthetic datasets for local development, quality assurance, load testing, and bug investigations.
✨ Introducing Syntrend!
I built Syntrend in response to a lack of tooling that helps developers build their products separately from production data pipelines. It's a synthetic data generator that uses YAML project files to generate random and calculated values and properties based on expressions in the project.
Its primary objectives are:
- Be Lightweight: I want it to run anywhere a developer works.
- Be Easy to Use: It should be easy to understand and use for all members of the team.
- Be Environment Agnostic: Developers can work offline or online, using local or remote workspaces. QA Engineers need to work in a wide variety of environments: local, remote, in CI pipelines, and with integrated systems.
- Support As Many Data Types As Possible: Data takes many forms, so we should be able to generate data into those many forms using an extendable toolset.
- Be Expressive: Data can have a personality, and we need this data to express that personality wherever we are regardless of how much, how many, or how crazy that data can be.
As of right now (v0.3.0), a 45KB package could potentially generate terabytes of data from a single project, many times over, expressing the different patterns and behaviours you would see in production.
🧭 Background
I worked many years as a consulting developer for secretive organisations that made it difficult to see their data until I was in their office. I learned to develop based on descriptions they gave me and got fairly good at it. So when I moved into a team role working on the cloud with access to production data, it was like a dream that I could see the data and develop/test around it. But I still relied on testing strategies using data patterns instead of solely relying on production data.
One thing I noticed among my peers was a tendency to build everything using that data, including the testing pipeline. I thought this was odd.
- How could you replicate an event that caused a bug quickly?
- How do you ensure the fix or feature you made supports this pattern and similar ones like it?
- How can you develop at times you don't have access (like commuting to work or when CrowdStrike makes a mistake)?
So of course, when I asked about it, I got these responses:
- "The data is too big to create snapshots"
- "It's hard to replicate the data when you need it"
- "It takes too long to write unit tests for everything we're doing"
- "We have no infrastructure for passing alternative datasets to our code"
So I knew something had to be done. I looked around for data generators but they were either too simple (just random value generators) or too complex (using hosted services). So I built one.
❓ How Does It Work?
Easy! Build a project then run it! The easiest part is to run the project, so let's get that out of the way.
Once you follow the Quickstart, you should have the tool installed. You're ready to run it.
syntrend generate project_file.yaml
That's it. You get all the data you want, in the way you want it, where you want it.
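If you would rather drive that command from a script, a minimal Python sketch could look like the one below. It assumes the `syntrend` command is on your PATH and that the console output holds one JSON value per line, as in the samples in this post. The helper names `parse_json_lines` and `run_syntrend` are hypothetical and not part of Syntrend.

```python
import json
import subprocess


def parse_json_lines(text):
    """Parse console output that holds one JSON value per line (assumed layout)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def run_syntrend(project_file):
    """Run `syntrend generate <project_file>` and return the parsed records.

    Assumes the `syntrend` command is installed and on the PATH.
    """
    result = subprocess.run(
        ["syntrend", "generate", project_file],
        capture_output=True,
        text=True,
        check=True,  # raise if the command exits with an error
    )
    return parse_json_lines(result.stdout)
```

Calling `run_syntrend("project_file.yaml")` would then hand back a list of Python values. Objects written as a multi-line collection would need a different parser.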
📝 The Project File
The whole structure is documented here but you could start with a simple file to generate a random string of characters that looks like this.
Basic (Generators)
Read more about Generator Types
type: string
Let's save it to "project.yaml" and run it with syntrend generate project.yaml:
"Ht3jCoxzpL"
What about a random name?
type: name
"Phillip Ho"
And how about a complex object, something with multiple values?
type: object
properties:
  attribute:
    type: integer
  content:
    type: name
{"attribute": 21, "content": "Christina Baird"}
You now have a project file that will generate random values.
Want two objects? Sure! Let's name the one we created already "sensor" and a new one called "users":
objects:
  users:
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor:
    type: object
    properties:
      attribute:
        type: integer
      content:
        type: string
{"first_name": "Jamie", "last_name": "Wheeler", "user_uuid": "6ca70582-ec9b-41f8-87bc-adf78ee5a9a8"}
{"attribute": -418, "content": "M7Nfh0rk2Jte"}
Let's generate a few of each object now:
objects:
  users:
    output:
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor:
    output:
      count: 10
    type: object
    properties:
      attribute:
        type: integer
      content:
        type: string
{"first_name": "Lauren", "last_name": "Ward", "user_uuid": "6058b3a4-feaf-4638-a5b3-954d43c884d5"}
{"first_name": "Mark", "last_name": "Long", "user_uuid": "7e584fb3-d7fd-44c6-9354-c58c325588eb"}
{"first_name": "Brian", "last_name": "George", "user_uuid": "cf68be6e-7abf-4374-b5bc-c99d19fe75e3"}
{"attribute": 189, "content": "TMZFuHU2LGka"}
{"attribute": 52, "content": "LPdwWhxSFFtLtPWbF"}
{"attribute": 481, "content": "yPRfhLgc8Es"}
{"attribute": 457, "content": "GmzZ0zetxBeiLBhQ"}
{"attribute": -383, "content": "DTrRwRiFxWx6eaM12Q"}
{"attribute": -457, "content": "DtL7xGTArT"}
{"attribute": -360, "content": "nbj8eIEJGzgcFKQiGXX9"}
{"attribute": 356, "content": "qxOtpwdIw95Y"}
{"attribute": -447, "content": "B6FndtHh3UBe9"}
{"attribute": 338, "content": "5LqavzqitsQ0JD"}
Expressions/Trends
Read more about Expressions
Now I have two objects. Can I relate one object to another? Let's define "users" as a reference dataset and "sensor" as events referring to it.
In a way, setting collection: true will define that dataset as reference data, since it is generated before other datasets.
objects:
  users:
    output:
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor:
    output:
      count: 10
    type: object
    properties:
      user:
        type: uuid
        expression: users(random(1, 3)).user_uuid
      attribute:
        type: integer
      content:
        type: string
[
{"first_name": "Jack", "last_name": "Davis", "user_uuid": "53cb62d0-e534-4b25-bd5c-6ab41905d93a"},
{"first_name": "Jonathan", "last_name": "Browning", "user_uuid": "26b38baf-6e59-415a-943b-71c896c2e061"},
{"first_name": "George", "last_name": "Willis", "user_uuid": "16f1c40e-0c65-4cb3-9a01-e2278014df8c"}
]
{"user": "16f1c40e-0c65-4cb3-9a01-e2278014df8c", "attribute": 282, "content": "vukz1J6qxM8jY"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": 200, "content": "zk9cz3w6itYZNFnF"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": 326, "content": "YZBOxep4D"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 140, "content": "JOL9SurzMfrjEzP4"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 403, "content": "6N66o9v4"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": -175, "content": "6OGwnZEI8qS8"}
{"user": "16f1c40e-0c65-4cb3-9a01-e2278014df8c", "attribute": -320, "content": "P9TvTI5cyNzzPeF8tm"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 324, "content": "JyZJSnvWo"}
{"user": "53cb62d0-e534-4b25-bd5c-6ab41905d93a", "attribute": 305, "content": "nfY4jUPCvv1Ga6lUlNW"}
{"user": "26b38baf-6e59-415a-943b-71c896c2e061", "attribute": -129, "content": "aKEmLLnh01FyV8Ae"}
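To make the idea of a reference dataset concrete, here is a small Python sketch of the same pattern: build the users first, then let each sensor event pick one of them at random. This is only an illustration of the concept, not how Syntrend is implemented. The function names, the name lists, and the -500 to 500 range for `attribute` are assumptions made for the example.

```python
import random
import uuid


def generate_users(count, rng):
    """Build the reference dataset first, like an object with `collection: true`."""
    first_names = ["Jack", "Jonathan", "George"]  # illustrative values only
    last_names = ["Davis", "Browning", "Willis"]
    return [
        {
            "first_name": rng.choice(first_names),
            "last_name": rng.choice(last_names),
            "user_uuid": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        }
        for _ in range(count)
    ]


def generate_sensor_events(users, count, rng):
    """Each event refers to one randomly chosen user from the reference data."""
    return [
        {
            "user": rng.choice(users)["user_uuid"],
            "attribute": rng.randint(-500, 500),  # assumed range
        }
        for _ in range(count)
    ]


rng = random.Random(42)  # seeded so repeated runs give the same data
users = generate_users(3, rng)
events = generate_sensor_events(users, 10, rng)
```

Every `user` value in `events` is guaranteed to match a `user_uuid` in `users`, which is the property the project file above expresses declaratively.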
Let's also make sure the "attribute" value represents a pattern, maybe a unique one per user.
Notice the sensor object was duplicated so each one represents a specific user and can have its own expression.
objects:
  users:
    output:
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor1:
    output:
      count: 5
    type: object
    properties:
      user:
        type: uuid
        expression: users(1).user_uuid
      attribute:
        type: integer
        expression: interval * 2 + 10
      content:
        type: string
  sensor2:
    output:
      count: 5
    type: object
    properties:
      user:
        type: uuid
        expression: users(2).user_uuid
      attribute:
        type: integer
        expression: 10 * sin(interval * 2/3) + 11
      content:
        type: string
  sensor3:
    output:
      count: 5
    type: object
    properties:
      user:
        type: uuid
        expression: users(3).user_uuid
      attribute:
        type: integer
        expression: 5 + interval ** 3
      content:
        type: string
[
{"first_name": "Christine", "last_name": "Howell", "user_uuid": "c29d10c7-00db-4f2c-83b2-5cbae225393a"},
{"first_name": "Susan", "last_name": "Chang", "user_uuid": "c2235e5e-3d82-4224-96ce-d31915da732e"},
{"first_name": "Ryan", "last_name": "Jones", "user_uuid": "bbf63933-29b5-43f8-a488-5ae15153d128"}
]
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 10, "content": "S1bDSqs11JX9YZN6nnM"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 12, "content": "R2YcoJSoiWG0PlleesUI"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 14, "content": "3SEhZyp5cj2X"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 16, "content": "SBCIEOmEe1rWQpp3JARi"}
{"user": "bbf63933-29b5-43f8-a488-5ae15153d128", "attribute": 18, "content": "PZZvckWpf"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 11, "content": "AAVuPAJIONR"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 17, "content": "X3JobELFhU1W4e"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 20, "content": "HJflwdxhoHX56K9diB"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 20, "content": "PBZTlAOUvrZC"}
{"user": "c2235e5e-3d82-4224-96ce-d31915da732e", "attribute": 15, "content": "7vKIHVRwSZrma"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 5, "content": "mRjdv1HVZ"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 6, "content": "jLh7zqKH"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 13, "content": "JQutPK9UFMceQQ9Ifg96"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 32, "content": "BuYXN4NKniprAOj"}
{"user": "c29d10c7-00db-4f2c-83b2-5cbae225393a", "attribute": 69, "content": "YIcewp2WKC50"}
Now we have multiple user sensor objects: one with a linear sensor progression, another with a sine-wave pattern, and another with cubic growth.
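You can check the three `attribute` expressions by hand. The Python sketch below evaluates each one for intervals 0 through 4 and truncates the result to an integer. Both the starting interval of 0 and the truncation are assumptions on my part. They happen to reproduce the sample output above, but Syntrend's own evaluation may differ in other cases.

```python
import math


def linear(interval):
    return interval * 2 + 10


def sine_wave(interval):
    return 10 * math.sin(interval * 2 / 3) + 11


def cubic(interval):
    return 5 + interval ** 3


def series(expression, count=5):
    """Evaluate an expression for intervals 0..count-1, truncating to an integer."""
    return [int(expression(i)) for i in range(count)]


print(series(linear))     # [10, 12, 14, 16, 18]
print(series(sine_wave))  # [11, 17, 20, 20, 15]
print(series(cubic))      # [5, 6, 13, 32, 69]
```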
Data Formats
Read more about Formatting
How about writing things into different formats? Let's supply our user data as SQL to be loaded into a database, while the sensor data is XML.
Here, the sensor objects are the same type of object, so we name that object "sensor" as the XML tag.
objects:
  users:
    output:
      format: sql
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
      last_name:
        type: last_name
      user_uuid:
        type: uuid
  sensor1:
    output:
      format: xml
      count: 5
    xml_tag: sensor
    type: object
    properties:
      user:
        type: uuid
        expression: users(1).user_uuid
        xml_attr: true
      attribute:
        type: integer
        expression: interval * 2 + 10
        xml_attr: true
      content:
        type: string
  sensor2:
    output:
      format: xml
      count: 5
    xml_tag: sensor
    type: object
    properties:
      user:
        type: uuid
        expression: users(2).user_uuid
        xml_attr: true
      attribute:
        type: integer
        expression: 10 * sin(interval * 2/3) + 11
        xml_attr: true
      content:
        type: string
  sensor3:
    output:
      format: xml
      count: 5
    xml_tag: sensor
    type: object
    properties:
      user:
        type: uuid
        expression: users(3).user_uuid
        xml_attr: true
      attribute:
        type: integer
        expression: 5 + interval ** 3
        xml_attr: true
      content:
        type: string
insert into users (first_name, last_name, user_uuid) values ("Lisa", "Lyons", "f46c3608-7ebd-48c8-8b44-d6442c0794e0");
insert into users (first_name, last_name, user_uuid) values ("Michael", "Taylor", "bf4c132d-189b-4e5d-9cdf-cbf543a985d4");
insert into users (first_name, last_name, user_uuid) values ("Kevin", "Reynolds", "11c3b09f-7d9b-408c-a3c4-2e68d35c4972");
w7vI4It
cpwcH0dnTipZ8
fgzvwHAHKSQAiCdLTP
wwULgvtqa1ggOODz2MZe
BNgcVGp22LwLzq
WztCGYEpHDc8Y9LORq
0qd2zWkEaia3Ma
QaR7wrpFEQZEBMo1OQ
lRGy52
DgjMbhE6e14rkIEvV9J7
OXzNLeua
CR5LtUhZDtEj5Vrc4P
6Cpoz4Larda
eLfUuj7quGeP6m
2T6Vzh7yJbuFgkH4ZFLb
Need to write them to files? OK! Add an output block with directory set to the directory you want files to be written.
output:
  directory: out
objects:
  ...
This is a global output for the project, and can be redefined at the object-level to override specific values.
Now there's a new "out" directory containing 15 XML files (5 for each sensor object) and a SQL file with the 3 INSERT commands, just like what was in the console earlier.
Want some tabular data for a report? Got that too! You can duplicate one of the objects and make it reference the other object to be printed to console in a readable format.
For now, you'll need to set directory: '-' for the "users_report" object's output block to signify you want to write to console. This will change with v1.0.
  ...
  users_report:
    output:
      format: table
      directory: '-'
      collection: true
      count: 3
    type: object
    properties:
      first_name:
        type: first_name
        expression: users(interval + 1).first_name
      last_name:
        type: last_name
        expression: users(interval + 1).last_name
      user_uuid:
        type: uuid
        expression: users(interval + 1).user_uuid
first_name  last_name  user_uuid
===========================================================
Amanda      Lowery     ccca1b25-b600-4729-adf5-1ef767491e28
Kim         Hammond    7b64afbc-c84b-4f50-b801-6576a7978fff
Eric        Johnson    ee3a9acb-7e5b-4d99-9f52-775f62b20175
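For a feel of what a table layout like this involves, here is a rough Python sketch that renders a list of records as fixed-width columns. It is not Syntrend's formatter. The `render_table` name, the two-space column gap, and the `=` rule are assumptions picked to resemble the sample above.

```python
def render_table(records):
    """Render a list of flat dicts as a fixed-width text table."""
    headers = list(records[0])
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(h), *(len(str(r[h])) for r in records)) for h in headers]

    def row(values):
        cells = (str(v).ljust(w) for v, w in zip(values, widths))
        return "  ".join(cells).rstrip()

    rule = "=" * (sum(widths) + 2 * (len(widths) - 1))
    lines = [row(headers), rule]
    lines += [row([r[h] for h in headers]) for r in records]
    return "\n".join(lines)


records = [
    {"first_name": "Amanda", "last_name": "Lowery",
     "user_uuid": "ccca1b25-b600-4729-adf5-1ef767491e28"},
    {"first_name": "Kim", "last_name": "Hammond",
     "user_uuid": "7b64afbc-c84b-4f50-b801-6576a7978fff"},
    {"first_name": "Eric", "last_name": "Johnson",
     "user_uuid": "ee3a9acb-7e5b-4d99-9f52-775f62b20175"},
]
print(render_table(records))
```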
Re-composing Projects
There are a few things that make YAML such a great choice for large, user-modified documents as compared to XML or JSON:
- Easy-to-read, white-spaced syntax
- Files may contain multiple documents (separated by ---)
- Custom tags to help serialize objects
- Anchors and Aliases with object overrides
All of these elements help break down large structures into smaller, reusable elements. Unfortunately, there is one limitation with Anchors and Aliases: they do not support overriding nested objects. Instead, according to the YAML syntax, the alias will be applied, but since another instance of the same key name exists, the object will be fully replaced by the later key.
..., it is necessary to impose an order on mapping keys and
employ alias nodes to indicate a subsequent occurrence of a
previously encountered node
For example, the &obj1 anchor is re-applied for obj2 with some modifications.
obj1: &obj1
  key1: test
  key2: word
  key3:
    some: keys
obj2:
  <<: *obj1
  key2: words
  key3:
    other: keys
  key4: another test
which would serialize as:
obj1:
  key1: test
  key2: word
  key3:
    some: keys
obj2:
  key1: test
  key2: word
  key3:
    some: keys
  key2: words
  key3:
    other: keys
  key4: another test
The keys in obj2 have duplicates because of the alias. This is fine for updating simple keys like key2 and to introduce new keys like key4. For objects that contain nested keys (like key3), we don't get a merge of objects but a full replacement which is not great for large structures.
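The difference between a full replacement and a merge is easy to see with plain Python dictionaries. The sketch below treats the alias-plus-overrides pattern as a shallow merge and contrasts it with a recursive one. `deep_merge` is a hypothetical helper written for this illustration, not something taken from Syntrend or a YAML library.

```python
import copy

obj1 = {"key1": "test", "key2": "word", "key3": {"some": "keys"}}
overrides = {"key2": "words", "key3": {"other": "keys"}, "key4": "another test"}

# Shallow merge: later keys win outright, so key3 is replaced as a whole.
shallow = {**obj1, **overrides}


def deep_merge(base, override):
    """Return a new dict where nested dicts are merged instead of replaced."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


deep = deep_merge(obj1, overrides)
print(shallow["key3"])  # {'other': 'keys'}
print(deep["key3"])     # {'some': 'keys', 'other': 'keys'}
```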
To resolve this limitation, a few YAML tags and meta objects were introduced (!syntrend/ref, !syntrend/root, and bases). Seeing this in action, we can redefine the project in the previous section using these elements:
--- !syntrend/ref::users
output:
  collection: true
  count: 3
type: object
properties:
  first_name:
    type: first_name
  last_name:
    type: last_name
  user_uuid:
    type: uuid
--- !syntrend/ref::sensors
output:
  format: xml
  count: 5
xml_tag: sensor
type: object
properties:
  user:
    type: uuid
    xml_attr: true
  attribute:
    type: integer
    xml_attr: true
  content:
    type: string
--- !syntrend/root
output:
  directory: out
objects:
  users:
    bases:
      - ref: users
    output:
      format: sql
  users_report:
    bases:
      - ref: users
    output:
      format: table
      directory: '-'
    properties:
      first_name:
        expression: users(interval + 1).first_name
      last_name:
        expression: users(interval + 1).last_name
      user_uuid:
        expression: users(interval + 1).user_uuid
  sensor1:
    bases:
      - ref: sensors
    properties:
      user:
        expression: users(1).user_uuid
      attribute:
        expression: interval * 2 + 10
  sensor2:
    bases:
      - ref: sensors
    properties:
      user:
        expression: users(2).user_uuid
      attribute:
        expression: 10 * sin(interval * 2/3) + 11
  sensor3:
    bases:
      - ref: sensors
    properties:
      user:
        expression: users(3).user_uuid
      attribute:
        expression: 5 + interval ** 3
Here, we've created two reference documents (named ref::users and ref::sensors) and one root document (tagged with !syntrend/root to remove confusion in parsing logic). Each object contains a bases meta object (can also use _bases if you need to use bases for something else) with overrides for edits to the reference objects.
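Conceptually, resolving bases amounts to a nested merge: start from the referenced document, then layer the object's own keys on top. The Python sketch below shows that idea on the sensor1 object. It is a hypothetical resolver for illustration only and not Syntrend's actual implementation. `resolve_bases` and `deep_merge` are names I made up, and the `_bases` spelling is not handled.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where nested dicts are merged instead of replaced."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_bases(definition, references):
    """Apply each named base in order, then the object's own overrides."""
    resolved = {}
    for base in definition.get("bases", []):
        resolved = deep_merge(resolved, references[base["ref"]])
    own = {k: v for k, v in definition.items() if k != "bases"}
    return deep_merge(resolved, own)


# A trimmed-down copy of the ref::sensors document as a Python dict.
references = {
    "sensors": {
        "output": {"format": "xml", "count": 5},
        "type": "object",
        "properties": {
            "user": {"type": "uuid", "xml_attr": True},
            "attribute": {"type": "integer", "xml_attr": True},
            "content": {"type": "string"},
        },
    }
}
sensor1 = {
    "bases": [{"ref": "sensors"}],
    "properties": {
        "user": {"expression": "users(1).user_uuid"},
        "attribute": {"expression": "interval * 2 + 10"},
    },
}

resolved = resolve_bases(sensor1, references)
print(resolved["properties"]["user"])
# {'type': 'uuid', 'xml_attr': True, 'expression': 'users(1).user_uuid'}
```

The nested `properties` keep their `type` and `xml_attr` values from the reference while gaining the per-object `expression`, which is exactly what a plain YAML alias could not do.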
How Will Anyone Use It?
I started writing how different people on a team would use it, but it became a very lengthy essay.
Instead, I'm releasing this project to the public and I want to hear how you or those around you will use it. Since this tool will be used mostly for development, testing and/or benchmarking, I have a goal to get Syntrend to be as stable as possible so everyone can rely on it. That means gathering a collection of use cases that I can then use to solidify my testing framework and document the tool for others to learn from.
🔭 What's Next?
Learn from the users! I'm starting to identify things to get this to v1.0, but I need feedback from you. Specifically:
- How would you use it?
- Is there functionality missing that would help to make an impact on your workflow?
- Any generators, formats or output targets you would like to see supported?
Please feel free to try it out, share contributions, suggestions, and/or issues.