Datatype mess & data management

For products on data management, there is always a big challenge of mapping data types of one data source to another. The challenge becomes bigger for testers and engineers in test who have to test applications and products doing data integration across various databases. Imagine a software product used for data migration from Oracle to SQLServer or there can be any other combination. I faced this issue while verifying data warehousing and migration capabilities. For example: if a table in Oracle is defined as “NUMBER” then how should the target be created in SQLServer because unlike Oracle SQLserver has various datatypes representing NUMBER such as FLOAT, INT, BIGINT,  SMALLINT or NUMERIC? What is the best fit? Should some sample data be considered to guess the data type? Should it be random data sample or something like the MAX value should be able to tell?

To overcome this challenge, I created an utility which tells me the mapping of the data type from one DB to another or how should the conversion work. This is loosely represented by a JSON which defines the conversion rules. For example: If the condition is met that the NUMBER is defined as NUMBER(18,0) or that data precision is less than or equal to 18, the target datatype could be BIGINT.

“@source_db”: “ORACLE”,
“@destination_db”: “MS SQL Server”,
“DataTypes”: [
“OriginalDataType”: “NUMBER”,
“Condition”: “DataPrecision EQ -1 AND DataScale EQ -1”,
“DestinationDataType”: “FLOAT”
“OriginalDataType”: “NUMBER”,
“Condition”: “DataPrecision EQ -1 AND DataScale EQ 0”,
“DestinationDataType”: “NUMERIC(38, 0)”
“OriginalDataType”: “NUMBER”,
“Condition”: “DataPrecision LTE 4 AND DataScale EQ 0”,
“DestinationDataType”: “SMALLINT”
“OriginalDataType”: “NUMBER”,
“Condition”: “DataPrecision LTE 9 AND DataScale EQ 0”,
“DestinationDataType”: “INT”
“OriginalDataType”: “NUMBER”,
“Condition”: “DataPrecision LTE 18 AND DataScale EQ 0”,
“DestinationDataType”: “BIGINT”
}, ……………..

The above is an incomplete JSON but this was just to demonstrate how one can create such rules and based on this utilities or applications be built such that this mapping of data type is easily verified. I have created such conversion rules for various other combinations too. I have created an util which I will be sharing in public soon but in case of any questions or people who want to reach out to me or provide any feedback do leave your question in the comment section below.



Apache logs and Hive

The aim of this blog is to simplify handling of apache logs through Hive. I just thought of helping new users to Hive and trying to parse Apache logs. There are some easy ways. Before starting we should know a little about the formats of the logs and here I will be mentioning about three different formats:

  • Common log format (CLF)
  • Combined log format
  • Custom log format

The details of the log formats can be found in

The CLF is typically of the form, “%h %l %u %t \”%r\” %>s %b”

The combined log format is of the form, “%h %l %u %t \”%r\” %>s %b \”%{Referer}i\” \”%{User-agent}i\””

The meanings of the terms are provided in the link,

Sample records of the apache log

Common log format: – – [07/Mar/2004:16:35:19 -0800] “GET /mailman/listinfo/business HTTP/1.1” 200 6379

Combined log format: – leon [01/Jul/2002:12:11:52 +0000] “GET /index.html HTTP/1.1″ 200 431″” “Mozilla/4.05 [en] (WinNT; I)” “USERID=CustomerA;IMPID=01234”

Creating a Hive table

Approach 1 (Regex):

CREATE TABLE access_log_combined(
ipaddress STRING, identd STRING, username STRING,finishtime STRING,
requestline string, returncode string, size string,referer string, agent string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
“input.regex” = “([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \”]*|\”[^\”]*\”) (-|[0-9]*) (-|[0-9]*)(?: ([^ \”]*|\”[^\”]*\”) ([^ \”]*|\”[^\”]*\”))?”

Approach 2 (using dynamic SerDe):

CREATE TABLE apachelog (ipaddress STRING, identd STRING, user STRING,finishtime STRING,requestline string, returncode INT, size INT) ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe’  WITH SERDEPROPERTIES (‘serialization.format’=’org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol’,‘quote.delim’='(“|\\[|\\])’,‘field.delim’=’ ‘,‘serialization.null.format’=’-‘) STORED AS TEXTFILE;


Personally I prefer the second approach. I haven’t looked into the performance aspect as yet though. The reason for it is because I can define the “serialization.null.format” where in a NULL character can be specified. In the logs mentioned above ‘-‘ is considered the NULL character. However not sure if this method is used nowadays often.

The above hive table could be then used as a source which can be accessed through SQL like language and If one uses a LOCATION clause with the above statements such that an external table is created pointing to a HDFS directory, one can keep adding files to it and keep querying the same way.

One can download sample logs from to try out.


Test data generation 01 (the idea)


Products and solutions which uses multiple data sources and storage formats, a tool which can generate various types of data is very important. Such a tool is very handy for products in the field of ETL and Data management.

Regular usage:

Plenty of tools available for it for most of the regular use cases and if some are available for complex cases, they are cumbersome to use or expensive. Besides, for some cases there may not be any tool so far.  Some of the more regular use cases are as follows. In my next blog I would be discussing details on how to achieve or what am I doing to achieve this.

Generate data for a table in a relational system

  • data types
    • Primitive
    • Complex like Maps, Lists, Arrays
    • Geo-spatial
    • Binary (e.g. BLOB, CLOB )
  • boundary values for columns
  • Introduce exceptions in rows
  • respect indexes (for example, generate unique values for a unique indexed column)
  • control the number of rows and length of a row
  • If certain special data types are needed (knowledge of metadata). For example: semantically valid emails or zip-codes.
  • Statistical requirement for a column
    • Number of NULLs in a column
    • Distribution over the total number of rows. For example: uniform or skewed distribution.
    • Number of unique records
    • Sparse distribution
    • Some probability distributions like “Binomial” or “Normal”
    • How to introduce exceptions in columns?
  • Adding a little bit of more complexity, one can have valid cities and countries (regions) too. By valid I mean multiple things
    • First of all the city should be valid and not random texts
    • If there are two columns, one city and the other country, then if we generate the city as SFO, we should generate the country as USA.
  • If there are query patterns provided for the table, the tool can generate based on the patterns. For example: for a query, select <col list> from table where some_Column like ‘%some text%’, we may generate data following the pattern or potential mismatches.
  • Ability to generate valid data in different languages.
  • Data and relevant scripts could be created too. Below are a few examples
    • Delimited text files with “loader” scripts and control file depending on the database
    • Or directly load the data to the database.
    • Generate DDL, loader scripts or control files if the table needs to be created in some other database
    • Generate scripts to dynamically load specific targets
  • Ability to generate compressed data or data in specific formats.
  • Ability to take care of locale or geography sensitive data formats. For example: date in US is about MM/DD/YYYY while in India it is DD/MM/YYYY. In some countries, it is a norm to use decimal separators as “comma”.
  • If table is “partitioned” then generate data as per the partitions
  • All this need to be done as much as possible with user intervention or bare minimum intervention.

Generate data for a set of tables (schema) in a relational system

  • Most of the needs would be similar to a single table.
  • Need to take care of relationships defined through FK
  • Access pattern based datasets. Depending on the relationships, there would be data which could be generated for particular type of JOIN.
  • Since this schema is generic, ability to generate data for multiple databases or file-system.

Ad-hoc Data

  • User defined structures / schema
  • Data type and boundary conditions
  • Special data types taken care of (city, zip code, country, email and so on)
  • Most of the requirements are same as of the single table in terms of storage formats, compression, DDL, loader scripts, controlling the number of rows and length of a row or mappings.
  • The size of this data set may be defined in terms of “number of rows” or in raw byte size without compression.

Special usage

This involves a combination of “searching” data sets from a maintained pool and generating some data at run-time. The pool may be maintained or may be from public data sets.

Example of maintained data pools or the fancier term could be a “test data lake”:

  • Standard data schema which comes by default with data base installers or applications
  • As a result of proof of concept(POC) or while encountering customer story.
  • QA data sets
  • Already stored public data sets

Example of searchable public data sets:

  • data made public on
    1. Elections
    2. health care
    3. Policies
  • Data made public by organizations. A few examples below
    1. Big Data organisations typically demonstrate their prowess through datasets downloadable in their web sites
    2. Environment NGOs
    3. Airports
    4. Sea ports
    5. Any other corporate.
  • Social Media

Data generation for No SQL stores

  1. Document stores like Mongo DB
  2. Key value pairs like HBase or Cassandra
  3. Graph databases like neo4J or titan DB
  4. Taking care of Hierarchy and nesting
  5. Taking care of row key design for different use cases in key-value stores like HBase. For example: it can be designed based on access patterns or number of nodes in a cluster

Standard/typical schemas

  • OLTP
    • TPC-H
    • Public schemas from relational data bases like MySQL
    • Control the scale factor (i.e. scale factor of 1-5 with 1 being low data and 5 tending to maximum data)
    • Generate DDLs and loader scripts which is DB specific
    • Generate scripts dynamically loading to specific targets
  • OLAP
  • TPC-DS
  • Star Schema
  • Snow flake schema
  • Generate DDLs and loader scripts which is DB specific
  • Generate scripts dynamically to load to specific targets

Domain specific standard schemas

  • Healthcare
    • Standard Enterprise Data warehouse
    • File formats like HIPAA
  • Telecom
    • CDR data
  • Insurance
  • Finance
  • Manufacturing

Data modules

  • HR
  • CRM

Machine generated data sets

Public data sets

Time series data and events in time series

Real time data

The application serves as a streaming source and can connect with a queue.

  • Push data as a stream from the application
  • Integrate with some blogs or social media to keep pushing real, interaction data.


Big data and data for distributed systems

This data is specifically for distributed systems like Hadoop.

  1. Mainly concentrate on logs and data from social media for HDFS files (mostly semi structured data sets)
  2. Generate in High Volume and velocity
  3. Event data for HBase
    1. Concentrate on features of Hbase like use it more for “real time”, “low latency” use cases.
  4. For “SQL over HDFS” systems like Hive, generate hive DDL in different storage formats, compressions and “partitions”
  5. Generate based on access patterns.
  6. Generate Spark SQL, Scala code, Hive/Impala QL, scripts for HDFS, Hive or HBase such that data may be moved around.