Document status: draft - 04 Mar 2020 - apucher

As Pinot matures and becomes easier to set up and explore, we are hitting limits on (a) which data sets we can include and (b) how much data we can package with the distribution. This applies to both the source distributions and the pre-made Docker images. Many public data sets are available for personal or academic use only, which, strictly speaking, prevents Apache Pinot from packaging or otherwise including them. Additionally, we can only package so much data before bloating the repository and the images. Finally, pre-existing data sets may not showcase or stress a specific part of Pinot for testing or demonstration purposes.

One way to work around these limitations is to generate synthetic "mock" data that looks and feels like real data sets without including the original data. Instead of shipping pre-made data sets, we can generate time series on demand from templates and features that we designed or extracted beforehand. This sidesteps both the licensing and the capacity issues, and lets us generate data that is well suited to a given test or demo.

Proposed approach

We want to add support for complex data generator "templates" to pinot-admin. The existing tool already has rudimentary abilities to generate data for benchmarking or testing, but this data is purely random noise and usually unsuited for dimensional breakdowns. We propose to add generator templates that produce time series which look familiar to developers, analysts, and other business stakeholders and intuitively "make sense". For example, these templates could produce diurnal (day-night) page view and click time series for an imaginary website, or long-tail (spiky) error metrics that decompose sensibly into multiple dimensions. The approach is easy to extend: new templates can be added as needed.

We would reuse pinot-admin's "GenerateData" command and extend the existing schema annotations with a "template" property that lets both Pinot contributors and Pinot users configure arbitrary generator templates in the familiar JSON format. We provide several examples below.

Additionally, we want to provide instructions, scripts, and examples for contributors and users to easily generate and load large amounts of synthetic data.

Examples - Time Series

Seasonal

The seasonal template is best suited to simulating page impression and click counts. It overlays a short-term diurnal (e.g. day-night) pattern on a longer-term cycle of linear scaling factors (e.g. weekdays), on top of a long-term baseline with a linear trend. Additionally, this template allows Gaussian noise to be mixed in.

seasonal generator config

[
  {
    "column": "hoursSinceEpoch",
    "template": {
      "type": "SEQUENCE",
      "start": 420768
    }
  },
  {
    "column": "views",
    "template": {
      "type": "SEASONAL",
      "mean": 80,
      "sigma": 1.5,
      "trend": 0.0005,
      "wavelength": 24,
      "amplitude": 70,
      "scalingFactors": [ 0.4, 0.9, 1.0, 1.0, 1.0, 0.8, 0.4 ]
    }
  }
]
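
As a rough illustration of how the parameters above could combine, the following Python sketch computes one value per time step. The formula (a sine wave for the diurnal cycle, one scaling factor per completed wavelength, additive Gaussian noise, clamping at zero) and the name seasonal_value are assumptions made for illustration; the actual generator may combine these terms differently.

```python
import math
import random


def seasonal_value(t, mean, sigma, trend, wavelength, amplitude,
                   scaling_factors, rng):
    """One sample at step t. Assumed formula, for illustration only."""
    # Long-term baseline plus linear trend.
    baseline = mean + trend * t
    # Short-term cycle: one full sine wave every `wavelength` steps.
    cycle = amplitude * math.sin(2 * math.pi * t / wavelength)
    # Longer-term cycle: one scaling factor per completed wavelength.
    scale = scaling_factors[(t // wavelength) % len(scaling_factors)]
    # Gaussian noise; the result is clamped so counts never go negative.
    noise = rng.gauss(0.0, sigma)
    return max(0.0, (baseline + cycle) * scale + noise)


rng = random.Random(42)  # fixed seed so runs are repeatable
factors = [0.4, 0.9, 1.0, 1.0, 1.0, 0.8, 0.4]
views = [seasonal_value(t, 80, 1.5, 0.0005, 24, 70, factors, rng)
         for t in range(24 * 7)]  # one week of hourly values
```

With a wavelength of 24 and seven scaling factors, the sketch repeats its pattern every 168 steps, apart from the trend and the noise.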

Rare events

The rare events template can simulate error count metrics, or it can be mixed in with other templates to produce irregular anomalies. The generated time series is mostly flat but contains spikes drawn from a long-tailed log-normal distribution (occasional large outliers), arriving at long-tailed (infrequent) intervals. Additionally, we support autoregressive smoothing to simulate decay over time.

spiky generator config

[
  {
    "column": "hoursSinceEpoch",
    "template": {
      "type": "SEQUENCE",
      "start": 420768
    }
  },
  {
    "column": "errors",
    "template": {
      "type": "SPIKE",
      "arrivalMean": 2,
      "arrivalSigma": 1,
      "magnitudeMean": 3,
      "magnitudeSigma": 1,
      "smoothing": 0.1
    }
  }
]
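
The behaviour described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: arrivalMean/arrivalSigma and magnitudeMean/magnitudeSigma are interpreted as the log-space parameters of log-normal distributions, and "smoothing" as the fraction of the previous value carried over to the next step. The function name spike_series is hypothetical and the real template may define these parameters differently.

```python
import random


def spike_series(n, arrival_mean, arrival_sigma, magnitude_mean,
                 magnitude_sigma, smoothing, seed=0):
    """Return n values of a mostly flat series with decaying spikes."""
    rng = random.Random(seed)
    values = []
    level = 0.0
    # Steps until the first spike, drawn from a log-normal distribution.
    next_spike = int(rng.lognormvariate(arrival_mean, arrival_sigma))
    for t in range(n):
        # Autoregressive smoothing: keep a fraction of the previous value.
        level *= smoothing
        if t >= next_spike:
            # Spike magnitude is log-normal as well, hence the long tail.
            level += rng.lognormvariate(magnitude_mean, magnitude_sigma)
            gap = int(rng.lognormvariate(arrival_mean, arrival_sigma))
            next_spike = t + 1 + gap
        values.append(level)
    return values


errors = spike_series(24 * 7, arrival_mean=2, arrival_sigma=1,
                      magnitude_mean=3, magnitude_sigma=1, smoothing=0.1)
```

In this sketch a smoothing value close to 0 makes each spike vanish almost immediately, while a value close to 1 lets it decay slowly over many steps.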

Mixture models

Mixture models combine values generated by multiple models in an alternating and/or additive way. Generator "bins" are drawn from in a cyclical fashion, and each bin can contain multiple templates whose sampled values are added up. We can use these models to generate data with irregular anomalies and to produce independent dimensional slices for a single metric. Additionally, we support default values (shown further below for dimensional models).

mixture model config

[
  {
    "column": "hoursSinceEpoch",
    "template": {
      "type": "SEQUENCE",
      "start": 420768
    }
  },
  {
    "column": "views",
    "template": {
      "type": "MIXTURE",
      "generatorBins": [
        [
          {
            "mean": 90,
            "sigma": 6.0,
            "amplitude": 80,
            "type": "SEASONAL",
            "wavelength": 24,
            "scalingFactors": [ 0.4, 0.9, 1.0, 1.0, 1.0, 0.8, 0.4 ]
          },
          {
            "type": "SPIKE",
            "arrivalMean": 5,
            "arrivalSigma": 1,
            "magnitudeMean": 5,
            "magnitudeSigma": 1,
            "smoothing": 0.7
          }
        ]
      ]
    }
  }
]
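
The alternating and additive behaviour of bins can be illustrated with a short Python sketch. Templates are modelled here as plain functions from a step index to a value, and the rule that a bin's templates advance by one step per full cycle through all bins is an assumption; mixture_series and the two example bins are hypothetical and not part of the proposed tool.

```python
def mixture_series(n, generator_bins):
    """generator_bins: list of bins, each a list of functions step -> value."""
    num_bins = len(generator_bins)
    values = []
    for row in range(n):
        # Bins are visited in a cyclical fashion, one bin per output row.
        templates = generator_bins[row % num_bins]
        # Assumption: a bin's templates advance one step per full cycle.
        step = row // num_bins
        # Templates within the same bin are sampled and added up.
        values.append(sum(template(step) for template in templates))
    return values


# Two hypothetical bins: a ramp plus a constant offset, and a flat line.
bins = [
    [lambda step: 10 * step, lambda step: 1],
    [lambda step: 100],
]
print(mixture_series(4, bins))  # [1, 100, 11, 100]
```

With a single bin, as in the config above, every row is simply the sum of the seasonal and spike values for that step.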

Dimensional

Dimensional models produce string labels in a deterministic cyclical pattern. Combined with mixture models, which also alternate their output templates cyclically, they let us generate dimensional cuts of a time series.

A current limitation of the column-based templates is that the number of repetitions for each dimension value must be specified manually to produce a full cross product. This also makes it difficult to generate sparse data in a deterministic way.

dimensional model config

[
  {
    "column": "hoursSinceEpoch",
    "template": {
      "type": "SEQUENCE", "start": 420768, "stepsize": 1, "repetitions": 18
    }
  },
  {
    "column": "country",
    "template": {
      "type": "STRING", "values": [ "us", "cn", "in" ], "repetitions": 6
    }
  },
  {
    "column": "platform",
    "template": {
      "type": "STRING", "values": [ "desktop", "mobile" ], "repetitions": 3
    }
  },
  {
    "column": "browser",
    "template": {
      "type": "STRING", "values": [ "chrome", "safari", "firefox" ]
    }
  },
  {
    "column": "errors",
    "template": {
      "type": "MIXTURE",
      "defaults": {
        "type": "SPIKE", "arrivalMean": 4, "arrivalSigma": 1, "magnitudeMean": 2, "magnitudeSigma": 1, "smoothing": 0.3
      },
      "generatorBins": [
        [ { "arrivalMean": 4, "magnitudeMean": 2.0 } ],
        [ { "arrivalMean": 3, "magnitudeMean": 1.0 } ],
        [ { "arrivalMean": 5, "magnitudeMean": 0.2 } ],
        [ { "arrivalMean": 4, "magnitudeMean": 2.5 } ],
        [ { "arrivalMean": 3, "magnitudeMean": 0.8 } ],
        [ { "arrivalMean": 5, "magnitudeMean": 0.1 } ],
        [ { "arrivalMean": 1, "magnitudeMean": 0.5 } ],
        [ { "arrivalMean": 3, "magnitudeMean": 1.0 } ],
        [ { "arrivalMean": 5, "magnitudeMean": 0.2 } ],
        [ { "arrivalMean": 1, "magnitudeMean": 0.5 } ],
        [ { "arrivalMean": 3, "magnitudeMean": 0.8 } ],
        [ { "arrivalMean": 5, "magnitudeMean": 0.1 } ],
        [ { "arrivalMean": 4, "magnitudeMean": 2.0 } ],
        [ { "arrivalMean": 3, "magnitudeMean": 1.0 } ],
        [ { "arrivalMean": 5, "magnitudeMean": 0.2 } ],
        [ { "arrivalMean": 4, "magnitudeMean": 2.5 } ],
        [ { "arrivalMean": 3, "magnitudeMean": 0.8 } ],
        [ { "arrivalMean": 5, "magnitudeMean": 0.1 } ]
      ]
    }
  }
]
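
The "repetitions" values in the config above follow from the cardinalities of the dimension columns. The Python sketch below shows one way to derive them and to check that the resulting rows form a full cross product. The helper names are hypothetical, and the sketch assumes that a column without a "repetitions" property repeats each value once.

```python
from math import prod


def repetitions_for(cardinalities):
    """Repetitions per column, ordered slowest- to fastest-cycling."""
    # Each value must repeat once for every combination of the columns
    # that cycle faster than it does.
    return [prod(cardinalities[i + 1:]) for i in range(len(cardinalities))]


def column_value(row, values, repetitions=1):
    """Value of a cyclical STRING-style column at a given output row."""
    return values[(row // repetitions) % len(values)]


dims = {
    "country": ["us", "cn", "in"],
    "platform": ["desktop", "mobile"],
    "browser": ["chrome", "safari", "firefox"],
}
reps = repetitions_for([len(v) for v in dims.values()])
rows_per_step = prod(len(v) for v in dims.values())
rows = [
    tuple(column_value(row, values, rep)
          for values, rep in zip(dims.values(), reps))
    for row in range(rows_per_step)
]
print(reps, rows_per_step)  # [6, 3, 1] 18
```

The result matches the config: 6 and 3 repetitions for country and platform, and 18 rows per time step, which is both the "repetitions" value of the hoursSinceEpoch column and the number of generator bins.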

Examples - Usage

Generate data from template

Synthetic data generation integrates with the existing pinot-admin "GenerateData" command and only adds a "format" specifier for convenience.

generator command-line

pinot-admin.sh GenerateData \
-numFiles 1 -numRecords 354780 -format csv \
-schemaFile ./pinot-tools/src/main/resources/generator/complexWebsite_schema.json \
-schemaAnnotationFile ./pinot-tools/src/main/resources/generator/complexWebsite_generator.json \
-outDir ./myTestData
