For products and solutions that use multiple data sources and storage formats, a tool that can generate various types of data is very important. Such a tool is especially handy for products in the field of ETL and data management.
Plenty of tools are available for the regular use cases, but the ones that cover complex cases tend to be cumbersome to use or expensive. For some cases, no tool may exist at all. Some of the more common use cases are as follows. In my next blog I will discuss the details of how I am going about achieving this.
Generate data for a table in a relational system
- data types
- Complex like Maps, Lists, Arrays
- Binary (e.g. BLOB, CLOB )
- boundary values for columns
- Introduce exceptions in rows
- respect indexes (for example, generate unique values for a unique indexed column)
- control the number of rows and length of a row
- Special data types that require knowledge of metadata. For example: semantically valid emails or zip codes.
- Statistical requirements for a column
- Number of NULLs in a column
- Distribution over the total number of rows. For example: uniform or skewed distribution.
- Number of unique records
- Sparse distribution
- Some probability distributions like “Binomial” or “Normal”
- Introduce exceptions in columns
- Adding a bit more complexity, one can have valid cities and countries (regions) too. By valid I mean multiple things:
- First of all, the city should be a real city, not random text
- If there are two columns, one for city and the other for country, then if we generate the city as SFO, we should generate the country as USA.
- If query patterns are provided for the table, the tool can generate data based on them. For example: for a query like select <col list> from table where some_column like ‘%some text%’, we may generate data that follows the pattern, plus deliberate mismatches.
- Ability to generate valid data in different languages.
- Data and relevant scripts could be generated too. Below are a few examples:
- Delimited text files with “loader” scripts and control file depending on the database
- Or directly load the data to the database.
- Generate DDL, loader scripts or control files if the table needs to be created in some other database
- Generate scripts to dynamically load specific targets
- Ability to generate compressed data or data in specific formats.
- Ability to take care of locale- or geography-sensitive data formats. For example: dates in the US are written MM/DD/YYYY while in India they are DD/MM/YYYY. In some countries, the norm is to use a comma as the decimal separator.
- If the table is “partitioned”, generate data as per the partitions
- All this needs to be done with no user intervention, or the bare minimum of it.
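Several of the points above (unique values for an indexed column, a controlled number of NULLs, a skewed distribution) can be sketched together. This is a minimal illustration, not a fixed design: the column layout and the use of a Pareto draw to model skew are my assumptions.

```python
import random
import string

# Hypothetical single-table generator. Column names and types are
# invented for illustration; only the techniques (unique values,
# NULL fraction, skewed distribution) mirror the points above.
def generate_rows(n_rows, null_fraction=0.2, skew=2.0, seed=42):
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        row_id = i + 1  # unique-indexed column: a counter guarantees uniqueness
        # Nullable column: NULL with the requested probability.
        name = None if rng.random() < null_fraction else \
            "".join(rng.choices(string.ascii_lowercase, k=8))
        # Skewed column: a Pareto draw makes small values far more common.
        amount = int(rng.paretovariate(skew))
        rows.append((row_id, name, amount))
    return rows

rows = generate_rows(1000)
```

A real tool would drive this from a column specification rather than hard-coded logic, but the seeded generator already gives reproducible datasets.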
Generate data for a set of tables (schema) in a relational system
- Most of the needs would be similar to those for a single table.
- Need to take care of relationships defined through foreign keys (FKs)
- Access-pattern-based datasets: depending on the relationships, data can be generated to exercise particular types of JOINs.
- Since the schema is generic, the ability to generate data for multiple databases or file systems.
- User defined structures / schema
- Data type and boundary conditions
- Special data types taken care of (city, zip code, country, email and so on)
- Most of the requirements are the same as for a single table in terms of storage formats, compression, DDL, loader scripts, controlling the number of rows and the length of a row, or mappings.
- The size of this data set may be defined in terms of “number of rows” or in raw byte size without compression.
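The foreign-key point above can be sketched with a minimal two-table example. The table and column names here are made up; the only point being illustrated is that child rows draw their keys from parent rows that actually exist.

```python
import random

# Foreign-key-aware sketch for a hypothetical customers/orders schema:
# every order references a customer id that exists in the parent table.
def generate_customers_orders(n_customers, n_orders, seed=7):
    rng = random.Random(seed)
    customers = [(cid, f"customer_{cid}") for cid in range(1, n_customers + 1)]
    customer_ids = [c[0] for c in customers]
    orders = []
    for oid in range(1, n_orders + 1):
        fk = rng.choice(customer_ids)  # FK: draw only from existing parents
        amount = round(rng.uniform(1, 500), 2)
        orders.append((oid, fk, amount))
    return customers, orders

customers, orders = generate_customers_orders(10, 50)
```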
This involves a combination of “searching” datasets from a maintained pool and generating some data at run time. The pool may be maintained in-house or built from public datasets.
Examples of maintained data pools (or, the fancier term, a “test data lake”):
- Standard sample schemas that ship with database installers or applications
- Datasets accumulated from proofs of concept (POCs) or customer engagements
- QA data sets
- Already stored public data sets
Examples of searchable public datasets:
- Data made public on topics such as:
- health care
- Data made public by organizations. A few examples below
- Big data organisations typically demonstrate their prowess through datasets downloadable from their web sites
- Environment NGOs
- Sea ports
- Any other corporate.
- Social Media
Data generation for NoSQL stores
- Document stores like MongoDB
- Key-value stores like HBase or Cassandra
- Graph databases like Neo4j or Titan
- Taking care of hierarchy and nesting
- Taking care of row-key design for different use cases in key-value stores like HBase. For example: it can be designed based on access patterns or the number of nodes in a cluster
Public schemas from relational databases like MySQL
- Control the scale factor (i.e. a scale factor of 1-5, with 1 being low data volume and 5 tending to the maximum)
- Generate DB-specific DDLs and loader scripts
- Generate scripts to dynamically load to specific targets
- Star Schema
- Snowflake schema
- Generate DB-specific DDLs and loader scripts
- Generate scripts to dynamically load to specific targets
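The HBase row-key design point in the NoSQL list above can be sketched as follows. Hash-based salting, with the salt bounded by a bucket count tied to the number of region servers, is one common way to spread sequential keys across nodes; the bucket count and key layout here are arbitrary assumptions.

```python
import hashlib

# Salted row-key sketch for HBase-style stores: a hash-derived prefix
# spreads otherwise sequential keys across num_buckets regions.
def salted_row_key(natural_key: str, num_buckets: int = 8) -> str:
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % num_buckets
    return f"{salt:02d}_{natural_key}"

keys = [salted_row_key(f"event_{i}") for i in range(100)]
```

Because the salt is derived from the key itself, reads can recompute the prefix deterministically instead of storing a mapping.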
Domain specific standard schemas
- Standard enterprise data warehouse schemas
- File formats like HIPAA
- CDR (call detail record) data
Machine generated data sets
- Log files (http://www.loganalyzer.net/log-analysis/log-file-format.html)
- Router logs (http://www.cisco.com/c/en/us/td/docs/routers/access/wireless/software/guide/SysMsgLogging.html)
- Seismological data
- Example: http://www.dbms2.com/2010/04/08/machine-generated-data-example/
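As a small illustration of machine-generated log data, here is a sketch that emits web-server access logs in Common Log Format, one of the formats covered at the log-analysis link above. The IP ranges, paths, and status-code weights are made-up assumptions.

```python
import random
from datetime import datetime, timedelta, timezone

# Emit synthetic access-log lines in Common Log Format:
# host - - [timestamp] "request" status bytes
def generate_access_log(n_lines, seed=1):
    rng = random.Random(seed)
    paths = ["/index.html", "/login", "/api/v1/items", "/images/logo.png"]
    statuses = [200, 200, 200, 301, 404, 500]  # weighted toward 200
    ts = datetime(2016, 1, 1, tzinfo=timezone.utc)
    lines = []
    for _ in range(n_lines):
        ts += timedelta(seconds=rng.randint(1, 30))  # monotonic timestamps
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        lines.append('%s - - [%s] "GET %s HTTP/1.1" %d %d' % (
            ip,
            ts.strftime("%d/%b/%Y:%H:%M:%S +0000"),
            rng.choice(paths),
            rng.choice(statuses),
            rng.randint(200, 5000),  # response size in bytes
        ))
    return lines

log_lines = generate_access_log(5)
```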
Public data sets
- Datasets on elections, the environment, or other policies. A few examples:
- Datasets made public by organizations like ad companies or NGOs
- University research datasets meant for, say, testing algorithms/logic:
- Learning based algorithms
- Heuristic based systems
- Text similarity logics (distance algorithms for fuzzy search)
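The fuzzy-search point above can be illustrated with a classic edit-distance (Levenshtein) computation, the kind of logic such research datasets are typically used to test.

```python
# Levenshtein distance via the standard two-row dynamic programming:
# the minimum number of single-character edits turning a into b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]
```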
Time series data and events in time series
- Event data in CEF format (https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf)
- Possibly generate data for OpenTSDB if needed
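A minimal sketch of CEF event generation, following the header layout from the white paper linked above (CEF:Version|Vendor|Product|DeviceVersion|SignatureID|Name|Severity|Extension). The vendor/product names and extension fields are invented.

```python
import random

# Generate synthetic security events as pipe-delimited CEF lines.
def generate_cef_events(n_events, seed=3):
    rng = random.Random(seed)
    events = []
    for i in range(n_events):
        severity = rng.randint(0, 10)  # CEF severity scale is 0-10
        extension = (f"src=10.0.0.{rng.randint(1, 254)} "
                     f"dst=10.0.1.{rng.randint(1, 254)} "
                     f"spt={rng.randint(1024, 65535)}")
        events.append(
            f"CEF:0|ExampleVendor|ExampleIDS|1.0|{100 + i}|"
            f"Port scan detected|{severity}|{extension}"
        )
    return events

cef = generate_cef_events(3)
```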
Real time data
The application serves as a streaming source and can connect with a queue.
- Push data as a stream from the application
- Integrate with blogs or social media to keep pushing real interaction data.
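The streaming-source idea can be sketched with a producer thread pushing JSON events onto a queue. In a real deployment the stdlib queue would be swapped for Kafka or a similar broker; that integration is assumed here, not shown.

```python
import json
import queue
import threading
import time

# Producer: emit n_events JSON-encoded events onto the queue,
# pacing them with a small sleep to mimic a stream.
def stream_events(q, n_events, interval=0.001):
    for i in range(n_events):
        event = {"id": i, "type": "click", "ts": time.time()}
        q.put(json.dumps(event))
        time.sleep(interval)

q = queue.Queue()
producer = threading.Thread(target=stream_events, args=(q, 10))
producer.start()
producer.join()
received = [json.loads(q.get()) for _ in range(10)]
```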
Big data and data for distributed systems
This data is specifically for distributed systems like Hadoop.
- Mainly concentrate on logs and data from social media as HDFS files (mostly semi-structured datasets)
- Generate at high volume and velocity
- Event data for HBase
- Concentrate on the strengths of HBase, using it mainly for “real time”, “low latency” use cases.
- For “SQL over HDFS” systems like Hive, generate Hive DDL with different storage formats, compression codecs, and “partitions”
- Generate based on access patterns.
- Generate Spark SQL, Scala code, Hive/Impala QL, and scripts for HDFS, Hive, or HBase so that data can be moved around.
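The Hive DDL generation described above can be sketched as a simple template over a column specification. The table name, columns, partition column, and storage format here are illustrative assumptions.

```python
# Emit Hive CREATE TABLE DDL with a partition column and storage format.
def hive_ddl(table, columns, partition_col, partition_type, storage="ORC"):
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_col} {partition_type})\n"
        f"STORED AS {storage};"
    )

ddl = hive_ddl(
    "web_events",
    [("event_id", "BIGINT"), ("url", "STRING"), ("bytes", "INT")],
    "event_date",
    "STRING",
)
print(ddl)
```

Extending the same template to emit compression properties, or an Impala/Spark SQL variant, is mostly a matter of adding clauses.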