Test data generation 01 (the idea)


For products and solutions that use multiple data sources and storage formats, a tool that can generate various types of data is very important. Such a tool is especially handy for products in the field of ETL and data management.

Regular usage:

Plenty of tools are available for most of the regular use cases, and the ones that exist for complex cases are cumbersome to use or expensive. Besides, for some cases there may be no tool at all so far. Some of the more regular use cases are as follows. In my next blog I will discuss the details of how I plan to achieve this.

Generate data for a table in a relational system

  • data types
    • Primitive
    • Complex types like maps, lists, and arrays
    • Geo-spatial
    • Binary (e.g. BLOB, CLOB )
  • boundary values for columns
  • Introduce exceptions in rows
  • respect indexes (for example, generate unique values for a unique indexed column)
  • control the number of rows and length of a row
  • If certain special data types are needed, this requires knowledge of the metadata. For example: semantically valid emails or zip codes.
  • Statistical requirements for a column
    • Number of NULLs in a column
    • Distribution over the total number of rows. For example: uniform or skewed distribution.
    • Number of unique records
    • Sparse distribution
    • Some probability distributions like “Binomial” or “Normal”
    • Introduce exceptions in columns
  • Adding a bit more complexity, one can have valid cities and countries (regions) too. By valid I mean multiple things:
    • First of all, the city should be a valid one and not random text
    • If there are two columns, one for city and the other for country, then if we generate the city as SFO, we should generate the country as USA.
  • If query patterns are provided for the table, the tool can generate data based on them. For example: for a query, select <col list> from table where some_column like ‘%some text%’, we may generate data that follows the pattern as well as deliberate mismatches.
  • Ability to generate valid data in different languages.
  • Data and relevant scripts could be created too. Below are a few examples:
    • Delimited text files with “loader” scripts and control file depending on the database
    • Or directly load the data to the database.
    • Generate DDL, loader scripts or control files if the table needs to be created in some other database
    • Generate scripts to dynamically load specific targets
  • Ability to generate compressed data or data in specific formats.
  • Ability to take care of locale- or geography-sensitive data formats. For example: dates in the US are written MM/DD/YYYY while in India they are DD/MM/YYYY. In some countries, it is the norm to use a comma as the decimal separator.
  • If the table is “partitioned”, then generate data as per the partitions
  • All this needs to be done with no user intervention where possible, or at most bare-minimum intervention.
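To make the single-table requirements above concrete, here is a minimal sketch of such a generator. The column-spec format, the column names, and the redraw-on-collision strategy for unique indexes are all illustrative assumptions, not a description of any existing tool.

```python
import random
import string

# Hypothetical column spec: each column declares a type, an optional NULL
# fraction, and whether a unique index must be respected.
COLUMNS = [
    {"name": "id",    "type": "int",  "unique": True},
    {"name": "name",  "type": "str",  "null_fraction": 0.1},
    {"name": "score", "type": "float"},
]

def generate_rows(columns, n_rows, seed=42):
    rng = random.Random(seed)
    used = {c["name"]: set() for c in columns if c.get("unique")}
    rows = []
    for _ in range(n_rows):
        row = {}
        for col in columns:
            # Introduce NULLs at the requested rate for nullable columns.
            if rng.random() < col.get("null_fraction", 0.0):
                row[col["name"]] = None
                continue
            if col["type"] == "int":
                value = rng.randint(0, 10 * n_rows)
                # Respect a unique index by redrawing on collision.
                while col.get("unique") and value in used[col["name"]]:
                    value = rng.randint(0, 10 * n_rows)
            elif col["type"] == "float":
                value = rng.uniform(0.0, 100.0)
            else:
                value = "".join(rng.choices(string.ascii_lowercase, k=8))
            if col.get("unique"):
                used[col["name"]].add(value)
            row[col["name"]] = value
        rows.append(row)
    return rows

rows = generate_rows(COLUMNS, 100)
```

Boundary values, distributions, and special semantic types (emails, zip codes) would slot into the same per-column dispatch.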

Generate data for a set of tables (schema) in a relational system

  • Most of the needs are similar to those of a single table.
  • Need to take care of relationships defined through foreign keys (FKs)
  • Access-pattern-based datasets. Depending on the relationships, data could be generated for particular types of JOINs.
  • Since the schema definition is generic, the ability to generate data for multiple databases or file systems.
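The FK requirement above can be sketched as follows: generate the parent table first, then draw every child-row foreign key from the parent's key set, so JOINs never hit orphans. Table names and shapes here are purely illustrative.

```python
import random

rng = random.Random(7)

# Parent table: generated first, so its key set is known.
parents = [{"customer_id": i, "name": f"customer_{i}"} for i in range(10)]
parent_ids = [p["customer_id"] for p in parents]

# Child table: every order references an existing customer, so a JOIN on
# customer_id never produces an orphan row.
orders = [
    {"order_id": i,
     "customer_id": rng.choice(parent_ids),
     "amount": rng.uniform(1.0, 500.0)}
    for i in range(50)
]
```

Skewing the `rng.choice` distribution would then exercise different JOIN cardinalities.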

Ad-hoc Data

  • User defined structures / schema
  • Data type and boundary conditions
  • Special data types taken care of (city, zip code, country, email and so on)
  • Most of the requirements are the same as for a single table in terms of storage formats, compression, DDL, loader scripts, mappings, and controlling the number of rows and the length of a row.
  • The size of this data set may be defined in terms of “number of rows” or in raw byte size without compression.
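Sizing by raw bytes rather than row count, as the last bullet suggests, can be sketched like this; the CSV shape, the 1 KB target, and the payload width are illustrative assumptions.

```python
import csv
import io
import random
import string

def generate_until_size(target_bytes, seed=1):
    """Append CSV rows until the uncompressed output reaches target_bytes."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "payload"])
    row_id = 0
    while len(buf.getvalue().encode("utf-8")) < target_bytes:
        payload = "".join(rng.choices(string.ascii_letters, k=32))
        writer.writerow([row_id, payload])
        row_id += 1
    return buf.getvalue()

data = generate_until_size(1024)
```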

Special usage

This involves a combination of “searching” data sets from a maintained pool and generating some data at run-time. The pool may be curated in-house or built from public data sets.

Examples of maintained data pools, or, to use the fancier term, a “test data lake”:

  • Standard data schemas which come by default with database installers or applications
  • Data sets created during a proof of concept (POC) or while working through a customer story
  • QA data sets
  • Already stored public data sets

Examples of searchable public data sets:

  • Data made public on
    1. Elections
    2. Healthcare
    3. Policies
  • Data made public by organizations. A few examples:
    1. Big Data organisations, which typically demonstrate their prowess through datasets downloadable from their web sites
    2. Environment NGOs
    3. Airports
    4. Sea ports
    5. Any other corporates
  • Social Media

Data generation for NoSQL stores

  1. Document stores like MongoDB
  2. Key-value and column-family stores like HBase or Cassandra
  3. Graph databases like Neo4j or TitanDB
  4. Taking care of hierarchy and nesting
  5. Taking care of row-key design for different use cases in key-value stores like HBase. For example: keys can be designed based on access patterns or the number of nodes in a cluster
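As one example of the row-key point above, here is a sketch of a salted row key: prefixing a hash-derived salt spreads otherwise sequential keys across region servers. The bucket count of 8 and the key format are illustrative assumptions.

```python
import hashlib

N_BUCKETS = 8  # assumed number of salt buckets, e.g. tied to region count

def salted_row_key(natural_key: str, n_buckets: int = N_BUCKETS) -> str:
    """Prefix a deterministic salt so sequential keys spread across buckets."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % n_buckets
    return f"{salt:02d}|{natural_key}"

# Sequential event IDs end up scattered over the salt buckets.
keys = [salted_row_key(f"event-{i:06d}") for i in range(1000)]
```

Because the salt is derived from the natural key, point reads can recompute it; range scans, however, must fan out over all buckets, which is the usual trade-off of this design.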

Standard/typical schemas

  • OLTP
    • TPC-H
    • Public schemas from relational databases like MySQL
    • Control the scale factor (i.e. a scale factor of 1-5, with 1 being low data and 5 tending to maximum data)
    • Generate DDLs and loader scripts which are DB-specific
    • Generate scripts dynamically loading to specific targets
  • OLAP
    • TPC-DS
    • Star schema
    • Snowflake schema
    • Generate DDLs and loader scripts which are DB-specific
    • Generate scripts dynamically to load to specific targets
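The “DB-specific DDL” bullets above can be sketched as a single generic schema rendered through per-dialect type maps. The two dialects, the type map, and the fact table are illustrative; real dialects differ in many more ways (identity columns, storage clauses, and so on).

```python
# Minimal per-dialect type mapping; an assumption, not a complete mapping.
TYPE_MAP = {
    "mysql":    {"int": "INT",     "string": "VARCHAR(255)", "decimal": "DECIMAL(10,2)"},
    "postgres": {"int": "INTEGER", "string": "TEXT",         "decimal": "NUMERIC(10,2)"},
}

def generate_ddl(table, columns, dialect):
    """Render one generic (name, type) column list as dialect-specific DDL."""
    types = TYPE_MAP[dialect]
    cols = ",\n  ".join(f"{name} {types[ctype]}" for name, ctype in columns)
    return f"CREATE TABLE {table} (\n  {cols}\n);"

ddl = generate_ddl("fact_sales", [("sale_id", "int"), ("amount", "decimal")], "postgres")
```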

Domain specific standard schemas

  • Healthcare
    • Standard Enterprise Data warehouse
    • File formats like HIPAA
  • Telecom
    • CDR (call detail record) data
  • Insurance
  • Finance
  • Manufacturing

Data modules

  • HR
  • CRM

Machine generated data sets

Public data sets

Time series data and events in time series
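A sketch of the time-series case: events with irregular gaps between timestamps. The (timestamp, value) record shape, the start time, and the gap range are illustrative assumptions.

```python
import random
from datetime import datetime, timedelta

def generate_time_series(n_events, seed=3):
    """Emit timestamped events with irregular, strictly increasing gaps."""
    rng = random.Random(seed)
    t = datetime(2024, 1, 1)
    events = []
    for _ in range(n_events):
        t += timedelta(seconds=rng.randint(1, 60))  # irregular event spacing
        events.append({"ts": t.isoformat(), "value": rng.gauss(0.0, 1.0)})
    return events

events = generate_time_series(100)
```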


Real time data

The application serves as a streaming source and can connect to a queue.

  • Push data as a stream from the application
  • Integrate with blogs or social media to keep pushing real interaction data.
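The streaming-source idea can be sketched with an in-process queue standing in for a real broker; a production setup would target Kafka or similar, which is an assumption beyond this stdlib stand-in, and the event shape is illustrative.

```python
import json
import queue
import random

def push_events(q, n_events, seed=5):
    """Push JSON-encoded events onto a queue, acting as a streaming source."""
    rng = random.Random(seed)
    for i in range(n_events):
        event = {"event_id": i, "user": f"user_{rng.randint(1, 20)}"}
        q.put(json.dumps(event))

q = queue.Queue()
push_events(q, 10)
```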


Big data and data for distributed systems

This data is specifically for distributed systems like Hadoop.

  1. Mainly concentrate on logs and social-media data for HDFS files (mostly semi-structured data sets)
  2. Generate at high volume and velocity
  3. Event data for HBase
    1. Concentrate on the strengths of HBase, i.e. use it more for “real time”, “low latency” use cases.
  4. For “SQL over HDFS” systems like Hive, generate Hive DDL with different storage formats, compressions and “partitions”
  5. Generate based on access patterns.
  6. Generate Spark SQL, Scala code, Hive/Impala QL, and scripts for HDFS, Hive or HBase so that data can be moved around.
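Point 4 above can be sketched as simple templating of a Hive DDL statement with a storage format and partition columns. The table, columns, and partition key are illustrative; PARTITIONED BY and STORED AS are standard Hive DDL clauses.

```python
def hive_ddl(table, columns, partitions, storage="PARQUET"):
    """Render a Hive CREATE TABLE with partition columns and storage format."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    parts = ", ".join(f"{name} {ctype}" for name, ctype in partitions)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        f"STORED AS {storage};"
    )

ddl = hive_ddl("web_logs",
               [("url", "STRING"), ("status", "INT")],
               [("dt", "STRING")])
```

The generated data files would then be written under one directory per partition value so that `dt`-filtered queries prune partitions.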

