Sampling Methods

Why Sampling

Sampling matters because we usually cannot gather data from the entire population. Two of the main reasons are as follows:

  • A large, widely spread population makes some sections inaccessible for data collection
  • The time spent collecting data from a large population can be prohibitive

During the course of my work on data and enterprise products concerning data management, I found some important use cases where sampling helps a lot. I will discuss two simple use cases which are easy to relate to.

Use case 1: Determining the structure of a CSV file.

There is a text file whose structure, in terms of the delimiter separating the columns, is unknown. The file may be too large to read in full, and determining the columns from every line would take up a substantial amount of time and the patience of the user. In such cases a sample taken from the file can be of great help. So, pick a random sample of lines from the file and pass it through the parser, which returns the best fit as far as the delimiter goes. For example, from the sample we can say that the most likely “delimiter” or “column separator” is a comma or some other character like “|”. There may be some lines which don’t follow the pattern, and these lines can be logged as “bad records” to be processed later or as rows which need human intervention.
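
As a minimal sketch of this idea in Python (the file path, sample size and candidate delimiters are assumptions for illustration), one can reservoir-sample lines in a single pass and let the standard library’s csv.Sniffer pick the most likely separator:

import csv
import random

def guess_delimiter(path, sample_size=100, candidates=",|\t;"):
    # Reservoir-sample lines so the whole file never sits in memory
    sample = []
    with open(path, "r", errors="replace") as f:
        for i, line in enumerate(f):
            if i < sample_size:
                sample.append(line)
            else:
                j = random.randint(0, i)
                if j < sample_size:
                    sample[j] = line
    try:
        return csv.Sniffer().sniff("".join(sample), delimiters=candidates).delimiter
    except csv.Error:
        return None  # no consistent delimiter; route the file for human review

Lines that still fail to parse with the guessed delimiter can then be logged as the “bad records” mentioned above.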

Use case 2: Verifying the logic of a transformation applied on the data

Suppose a user applies a transformation to data coming from a certain source. The transformation is a complex piece of business logic which involves several decisions based on conditions; in programmatic terms such decisions can be represented as a “switch case” construct. Before finalizing the logic and passing the transformed data further, the user wants to test it by looking at the outcome of the logic on real data. The source, though, has millions of records, which means a lot of time would go into reviewing the logic and it may be impossible to look through all the rows. Instead, if a sample were provided to the user and the data review window were kept small, the user would have ample opportunity to verify the logic.
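
As an illustration only (the transform and the records below are hypothetical), the review loop can be as simple as drawing a small random sample and printing each input next to its output:

import random

def transform(record):
    # stand-in for the "switch case" business logic
    if record["amount"] > 1000:
        return {**record, "tier": "high"}
    elif record["amount"] > 100:
        return {**record, "tier": "medium"}
    else:
        return {**record, "tier": "low"}

records = [{"id": i, "amount": random.uniform(0, 2000)} for i in range(1_000_000)]
for row in random.sample(records, 20):   # review only a small random sample
    print(row, "->", transform(row))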

With these two examples, I will move on to the next section, where there will be a discussion of different types of sampling methods. These methods are again based on my experiences and what worked for me.

Methods of sampling

A quick search through research papers will reveal that sampling methods are usually put into two categories:

  • Probability sampling
  • Non-probability sampling

Probability sampling

In simple words, probability sampling is sampling which exercises some form of random selection. There are various techniques within this category. Different journals use different terms, but as I mentioned earlier, the techniques I describe here are purely from my experience. Below are the techniques.

Simple Random Sampling

In this technique every member of the population has an equal chance of getting selected. This is possibly the most fundamental and primitive of the random sampling methods.

In real life, we apply random sampling through methods like a lottery or pulling a few cards from a shuffled deck.

Use of a random number is another useful technique. For example:

  • If we want a sample of size “X” from a population of “N”, then associate the population members with numbers from 1 to N.
  • Repeat the steps below “X” times (see the sketch after this list):
    1. Generate a random number between 1 and N
    2. Choose the member corresponding to the number generated in step 1
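
A minimal sketch of this procedure in Python (the population and sample sizes are arbitrary):

import random

N = 1000                                # population size
X = 10                                  # desired sample size
population = list(range(1, N + 1))      # members numbered 1..N

sample = []
for _ in range(X):
    idx = random.randint(1, N)          # step 1: random number between 1 and N
    sample.append(population[idx - 1])  # step 2: pick the corresponding member

Note that this simple loop can pick the same member twice; random.sample(population, X), used later in this post, draws without replacement.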

Advantages:

  • No bias in the selection of members
  • Fair representation of the population; only bad luck can undermine it. If the sample is not representative of the population, that is called a “sampling error”

Disadvantages:

  • The need to have information about the entire population

Practical demonstration with Postgres:

Different databases provide methods to draw random samples from their systems. This saves users from implementing these algorithms on their own, but they should be aware of the available methods and their performance implications. Let us consider PostgreSQL. There is a simple query using the random() function.

The first query:

select * from fordemo order by random() limit 2 ;

PostgreSQL 9.5 introduced a new construct called TABLESAMPLE. One can specify two sampling methods here, SYSTEM and BERNOULLI. As per the documentation, SYSTEM is faster while BERNOULLI gives a truer random sample.

To use Bernoulli

sid=# select * from fordemo TABLESAMPLE bernoulli(10) ;

 id | name
----+-------
  4 | row 4
(1 row)

To use SYSTEM

sid=# select * from fordemo TABLESAMPLE SYSTEM(10) ;

The general query pattern is therefore as below:

select <cols/expression> from table TABLESAMPLE SAMPLING_METHOD(percentage)

To know more please refer to the blog, https://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/

Similarly, other databases provide such methods.

Practical demonstration with Hive

This is similar to the one discussed above. Suppose we want a sample of roughly 1000 records; we can do the following:

Select * from table order by rand() limit 1000;

However, performance may be a bottleneck: the data is forced into a single reducer, which has to sort the entire data set, even though the resulting sample is truly random.

Hive has other non-standard SQL in the form of “sort by” and “distribute by”. Using a combination of them yields a sample which may not be truly random, but the performance can improve considerably.

If the total size of the table is known, then you can easily randomly select some proportion of the data on a record-by-record basis as below:

Select * from table where rand() < 0.0001 distribute by rand() sort by rand() limit 10000;

I tried these methods after reading about them in the blog below, which is why I recommend it: http://www.joefkelley.com/736/

There are some other methods described in the hive docs which I will be discussing later on.

Practical demonstration with Python (for a sequence)

>>> import random

>>> demoList = list(range(1,100))

>>> demoList

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

>>> random.sample(demoList,5)  # first attempt

[54, 75, 51, 2, 61]

>>> random.sample(demoList,5)  # second attempt

[81, 83, 30, 28, 25]

>>> random.sample(demoList,5)  # third attempt

[2, 66, 34, 8, 68]

Stratified sampling

Exit polls during elections possibly use this method. As per one definition, stratified sampling is a probability sampling technique wherein the researcher divides the entire population into different subgroups or strata, then randomly selects the final subjects proportionally from the different strata.

Practical explanation

Suppose in a school the total number of students can be divided into sub groups or strata like the below:

  • Group 1: Male full-time students: 90
  • Group 2: Female full-time students: 63
  • Group 3: Male part-time students: 18
  • Group 4: Female part-time students: 9

Total students: 180

If one has to take a sample of 40, how would one proceed?

First, calculate the percentage constituted by each group.

  • Group 1: 90/180 = 0.50 or 50%
  • Group 2: 63/180 = 0.35 or 35%
  • Group 3: 18/180 = 0.10 or 10%
  • Group 4: 9/180 = 0.05 or 5%

So, how many members from each group should appear in the sample?

  • Group 1: 0.5 x 40 = 20
  • Group 2: 0.35 x 40 = 14
  • Group 3: 0.10 x 40 = 4
  • Group 4: 0.05 x 40 = 2

Now apply the simple random sampling method to each stratum (group) with its respective sample size, as sketched below.
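
A minimal sketch in Python, reusing the numbers above (the member identifiers are made up):

import random

# strata from the school example above
strata = {
    "male_full_time":   [f"MF{i}" for i in range(90)],
    "female_full_time": [f"FF{i}" for i in range(63)],
    "male_part_time":   [f"MP{i}" for i in range(18)],
    "female_part_time": [f"FP{i}" for i in range(9)],
}
total = sum(len(members) for members in strata.values())   # 180
sample_size = 40

sample = []
for members in strata.values():
    k = round(sample_size * len(members) / total)   # proportional allocation: 20, 14, 4, 2
    sample.extend(random.sample(members, k))         # simple random sample within the stratum

print(len(sample))   # 40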

Practical demonstration with Hive

Suppose there is a Hive table partitioned on, say, “YEAR”; each partition can be taken as a “stratum”.

We may assume that every partition has a similar number of records, but what matters for the sample is that records from every partition are in it. In such cases, if there are “N” partitions and the sample size is “X”, such that X > N and each partition has a good amount of data, then the number of records to take from each partition is X/N. For example, if the sample size is 40 and the number of partitions is 5, then we take 8 from each partition. So, apply the random sampling algorithm to each partition (by using a WHERE clause on the YEAR column) with a limit of 8, for example as sketched below.
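
For illustration only (the table name and partition values are hypothetical), the per-partition queries could be generated like this, reusing the ORDER BY rand() approach from earlier:

years = [2016, 2017, 2018, 2019, 2020]      # hypothetical YEAR partition values
sample_size = 40
per_partition = sample_size // len(years)    # 8 records per partition

queries = [
    f"SELECT * FROM events WHERE year = {y} ORDER BY rand() LIMIT {per_partition}"
    for y in years
]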

Hive also provides bucket sampling and block sampling methods. Please refer to the wiki page below for details.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

Non-probability sampling

In simple terms, in this method not every member of the population has a chance of getting selected. Various methods are mentioned in the literature, such as the following:

  • Convenience sampling: this is the quickest method; the sample is simply taken from the most easily accessible part of the population. In my experience, taking the TOP “n” rows is one such sampling method.
  • Judgemental sampling: the sample is taken with a specific purpose in mind. For example, if I were checking a piece of logic, I would take data which matches that logic.

There are other methods, but I haven’t used them much.

Conclusion:

I hope this gives an overall idea of sampling methods and how they can be useful. Different sampling methods work well under different circumstances, and knowing why the sample is needed is important. For example, to find the structure of a file, a simple random sample suffices, while for the analysis of exit polls, stratified sampling is needed. If the randomness of the data is not so important, then convenience or judgemental sampling may be applied, but these may lack statistical justification and can lead to sampling disasters.

I hope this is helpful to readers and if there is any feedback do let me know as comments.


Datatype mess & data management

For products on data management, there is always a big challenge in mapping the data types of one data source to another. The challenge becomes bigger for testers and engineers in test who have to test applications and products doing data integration across various databases. Imagine a software product used for data migration from Oracle to SQL Server, or any other combination. I faced this issue while verifying data warehousing and migration capabilities. For example, if a column in Oracle is defined as NUMBER, how should the target be created in SQL Server, which, unlike Oracle, has various data types representing numbers such as FLOAT, INT, BIGINT, SMALLINT or NUMERIC? What is the best fit? Should some sample data be considered to guess the data type? Should it be a random data sample, or should something like the MAX value be enough to tell?

To overcome this challenge, I created a utility which tells me how a data type in one DB should map, or be converted, to another. The rules are loosely represented by a JSON document which defines the conversions. For example, if the condition is met that the NUMBER is defined as NUMBER(18,0), i.e. the data precision is less than or equal to 18 with a scale of 0, the target data type could be BIGINT.

{
  "@source_db": "ORACLE",
  "@destination_db": "MS SQL Server",
  "DataTypes": [
    {
      "OriginalDataType": "NUMBER",
      "Condition": "DataPrecision EQ -1 AND DataScale EQ -1",
      "DestinationDataType": "FLOAT"
    },
    {
      "OriginalDataType": "NUMBER",
      "Condition": "DataPrecision EQ -1 AND DataScale EQ 0",
      "DestinationDataType": "NUMERIC(38, 0)"
    },
    {
      "OriginalDataType": "NUMBER",
      "Condition": "DataPrecision LTE 4 AND DataScale EQ 0",
      "DestinationDataType": "SMALLINT"
    },
    {
      "OriginalDataType": "NUMBER",
      "Condition": "DataPrecision LTE 9 AND DataScale EQ 0",
      "DestinationDataType": "INT"
    },
    {
      "OriginalDataType": "NUMBER",
      "Condition": "DataPrecision LTE 18 AND DataScale EQ 0",
      "DestinationDataType": "BIGINT"
    }, ……………..

The above JSON is incomplete, but it demonstrates how one can create such rules, and based on them, utilities or applications can be built so that the mapping of data types is easily verified. I have created such conversion rules for various other combinations too, and I have built a utility around them which I will be sharing publicly soon. In case of any questions or feedback, do leave a comment below.
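
For illustration only, here is a minimal sketch (not the actual utility) of how such rules might be evaluated against a column’s precision and scale, assuming a completed version of the JSON above is saved as oracle_to_mssql.json:

import json

def condition_holds(condition, precision, scale):
    # Supports the simple "DataPrecision OP x AND DataScale OP y" conditions used above
    ops = {"EQ": lambda a, b: a == b, "LTE": lambda a, b: a <= b}
    for clause in condition.split(" AND "):
        field, op, value = clause.split()
        actual = precision if field == "DataPrecision" else scale
        if not ops[op](actual, int(value)):
            return False
    return True

def map_type(rules, source_type, precision, scale):
    # First matching rule wins, so rules should be ordered from most to least restrictive
    for rule in rules["DataTypes"]:
        if rule["OriginalDataType"] == source_type and condition_holds(rule["Condition"], precision, scale):
            return rule["DestinationDataType"]
    return None

with open("oracle_to_mssql.json") as f:
    rules = json.load(f)
print(map_type(rules, "NUMBER", 18, 0))   # -> BIGINT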


Apache logs and Hive

The aim of this blog is to simplify the handling of Apache logs through Hive, and to help users who are new to Hive and trying to parse Apache logs. There are some easy ways. Before starting, we should know a little about the formats of the logs; here I will mention three different formats:

  • Common log format (CLF)
  • Combined log format
  • Custom log format

The details of the log formats can be found in https://httpd.apache.org/docs/1.3/logs.html

The CLF is typically of the form: "%h %l %u %t \"%r\" %>s %b"

The combined log format is of the form: "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

The meanings of the terms are provided in the link, https://httpd.apache.org/docs/1.3/mod/mod_log_config.html

Sample records of the apache log

Common log format:

64.242.88.10 - - [07/Mar/2004:16:35:19 -0800] "GET /mailman/listinfo/business HTTP/1.1" 200 6379

Combined log format:

216.67.1.91 - leon [01/Jul/2002:12:11:52 +0000] "GET /index.html HTTP/1.1" 200 431 "http://www.loganalyzer.net/" "Mozilla/4.05 [en] (WinNT; I)" "USERID=CustomerA;IMPID=01234"

Creating a Hive table

Approach 1 (Regex):

CREATE TABLE access_log_combined(
  ipaddress STRING, identd STRING, username STRING, finishtime STRING,
  requestline STRING, returncode STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE;
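
Before creating the table, one can sanity-check the regex against a sample line; below is a minimal sketch in Python (it uses the same pattern as above, with Hive’s extra backslash escaping removed):

import re

CLF_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

line = '64.242.88.10 - - [07/Mar/2004:16:35:19 -0800] "GET /mailman/listinfo/business HTTP/1.1" 200 6379'
print(CLF_PATTERN.match(line).groups())
# ('64.242.88.10', '-', '-', '[07/Mar/2004:16:35:19 -0800]',
#  '"GET /mailman/listinfo/business HTTP/1.1"', '200', '6379', None, None)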

Approach 2 (using dynamic SerDe):

CREATE TABLE apachelog (ipaddress STRING, identd STRING, user STRING, finishtime STRING, requestline STRING, returncode INT, size INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
  'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
  'quote.delim'='("|\\[|\\])',
  'field.delim'=' ',
  'serialization.null.format'='-')
STORED AS TEXTFILE;

Conclusion:

Personally, I prefer the second approach, although I haven’t looked into its performance yet. The reason is that I can define “serialization.null.format”, where a NULL character can be specified; in the logs above, ‘-’ is treated as the NULL character. However, I am not sure how often this method is used nowadays.

The above Hive table can then be used as a source accessed through a SQL-like language. If one uses a LOCATION clause with the above statements, so that an external table is created pointing to an HDFS directory, one can keep adding files to the directory and keep querying in the same way.

One can download sample logs from http://www.monitorware.com/en/logsamples/apache.php to try out.

 

Test data generation 01 (the idea)

Introduction:

For products and solutions which use multiple data sources and storage formats, a tool which can generate various types of data is very important. Such a tool is very handy for products in the field of ETL and data management.

Regular usage:

Plenty of tools are available for most of the regular use cases; where tools exist for complex cases, they are cumbersome to use or expensive, and for some cases there may be no tool at all. Some of the more regular use cases are as follows. In my next blog I will discuss the details of how to achieve this and what I am doing towards it.

Generate data for a table in a relational system

  • data types
    • Primitive
    • Complex like Maps, Lists, Arrays
    • Geo-spatial
    • Binary (e.g. BLOB, CLOB )
  • boundary values for columns
  • Introduce exceptions in rows
  • respect indexes (for example, generate unique values for a unique indexed column)
  • control the number of rows and length of a row
  • If certain special data types are needed (knowledge of metadata). For example: semantically valid emails or zip-codes.
  • Statistical requirement for a column
    • Number of NULLs in a column
    • Distribution over the total number of rows. For example: uniform or skewed distribution.
    • Number of unique records
    • Sparse distribution
    • Some probability distributions like “Binomial” or “Normal”
    • How to introduce exceptions in columns?
  • Adding a little more complexity, one can have valid cities and countries (regions) too. By valid I mean multiple things:
    • First of all the city should be valid and not random texts
    • If there are two columns, one city and the other country, then if we generate the city as SFO, we should generate the country as USA.
  • If there are query patterns provided for the table, the tool can generate based on the patterns. For example: for a query, select <col list> from table where some_Column like ‘%some text%’, we may generate data following the pattern or potential mismatches.
  • Ability to generate valid data in different languages.
  • Data and relevant scripts could be created too. Below are a few examples
    • Delimited text files with “loader” scripts and control file depending on the database
    • Or directly load the data to the database.
    • Generate DDL, loader scripts or control files if the table needs to be created in some other database
    • Generate scripts to dynamically load specific targets
  • Ability to generate compressed data or data in specific formats.
  • Ability to take care of locale or geography sensitive data formats. For example: date in US is about MM/DD/YYYY while in India it is DD/MM/YYYY. In some countries, it is a norm to use decimal separators as “comma”.
  • If table is “partitioned” then generate data as per the partitions
  • All this needs to be done, as far as possible, with no user intervention or only the bare minimum of it (a small sketch of the row-generation core follows this list).
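
For illustration only, a minimal sketch (with made-up column specs) of what the row-generation core might look like, covering just a few of the points above: controlling the row count, the fraction of NULLs, and unique values for an indexed column:

import random
import string

def random_string(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

def generate_rows(n_rows, null_fraction=0.1):
    # Each row gets a unique id (for a unique index), a random name,
    # and an "amount" column that is NULL for roughly null_fraction of the rows.
    rows = []
    for i in range(1, n_rows + 1):
        amount = None if random.random() < null_fraction else round(random.uniform(0, 1e6), 2)
        rows.append({"id": i, "name": random_string(), "amount": amount})
    return rows

rows = generate_rows(1000, null_fraction=0.05)   # 1000 rows, ~5% NULLs in "amount"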

Generate data for a set of tables (schema) in a relational system

  • Most of the needs would be similar to a single table.
  • Need to take care of relationships defined through foreign keys (FKs)
  • Access-pattern-based datasets: depending on the relationships, data could be generated for particular types of JOINs.
  • Since the schema description is generic, the ability to generate data for multiple databases or file systems.

Ad-hoc Data

  • User defined structures / schema
  • Data type and boundary conditions
  • Special data types taken care of (city, zip code, country, email and so on)
  • Most of the requirements are the same as for a single table in terms of storage formats, compression, DDL, loader scripts, controlling the number of rows and the length of a row, and mappings.
  • The size of this data set may be defined in terms of “number of rows” or in raw byte size without compression.

Special usage

This involves a combination of searching for data sets in a maintained pool and generating some data at run-time. The pool may be maintained internally or built from public data sets.

Examples of maintained data pools, or, to use a fancier term, a “test data lake”:

  • Standard data schema which comes by default with data base installers or applications
  • Data sets resulting from a proof of concept (POC) or encountered as part of a customer story
  • QA data sets
  • Already stored public data sets

Examples of searchable public data sets:

  • data made public on
    1. Elections
    2. health care
    3. Policies
  • Data made public by organizations. A few examples below
    1. Big Data organisations, which typically demonstrate their prowess through datasets downloadable from their web sites
    2. Environment NGOs
    3. Airports
    4. Sea ports
    5. Any other corporate.
  • Social Media

Data generation for NoSQL stores

  1. Document stores like MongoDB
  2. Key-value stores like HBase or Cassandra
  3. Graph databases like Neo4j or TitanDB
  4. Taking care of hierarchy and nesting
  5. Taking care of row key design for different use cases in key-value stores like HBase. For example, it can be designed based on access patterns or the number of nodes in a cluster

Standard/typical schemas

  • OLTP
    • TPC-H
    • Public schemas from relational databases like MySQL
    • Control the scale factor (e.g. a scale factor of 1 to 5, with 1 being the least data and 5 the most)
    • Generate DDLs and loader scripts which are DB-specific
    • Generate scripts to dynamically load specific targets
  • OLAP
    • TPC-DS
    • Star schema
    • Snowflake schema
    • Generate DDLs and loader scripts which are DB-specific
    • Generate scripts to dynamically load specific targets

Domain specific standard schemas

  • Healthcare
    • Standard Enterprise Data warehouse
    • File formats like HIPAA
  • Telecom
    • CDR data
  • Insurance
  • Finance
  • Manufacturing

Data modules

  • HR
  • CRM

Machine generated data sets

Public data sets

Time series data and events in time series

https://en.wikipedia.org/wiki/Time_series

Real time data

The application serves as a streaming source and can connect with a queue.

  • Push data as a stream from the application
  • Integrate with some blogs or social media to keep pushing real, interaction data.

 

Big data and data for distributed systems

This data is specifically for distributed systems like Hadoop.

  1. Mainly concentrate on logs and data from social media for HDFS files (mostly semi structured data sets)
  2. Generate in High Volume and velocity
  3. Event data for HBase
    1. Concentrate on features of HBase, e.g. using it for “real time”, “low latency” use cases
  4. For “SQL over HDFS” systems like Hive, generate hive DDL in different storage formats, compressions and “partitions”
  5. Generate based on access patterns.
  6. Generate Spark SQL, Scala code, Hive/Impala QL, scripts for HDFS, Hive or HBase such that data may be moved around.