Sampling is important because we usually cannot gather data from the entire population. There are two main reasons:
- A large, widely spread population makes some sections inaccessible for data collection
- The time it takes to collect data from a large population
During the course of my work with data and with enterprise products concerning data management, I found some important use cases where sampling helps a lot. I will discuss two simple use cases that are easy to relate to.
Suppose there is a text file whose structure we don’t know, in particular the delimiter that separates the columns. The file may be too large to read in full, and scanning every line just to determine the columns would take a substantial amount of time and test the patience of the user. In such cases a sample taken from the file can be of great help: pick a random sample of lines from the file and pass it through a parser that returns the best-fitting delimiter. For example, we may conclude from the sample that the most likely “delimiter” or “column separator” is a comma, or some other character like “|”. Lines that don’t follow the pattern can be logged as “bad records”, to be processed later or flagged as rows that need human intervention.
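The delimiter use case can be sketched in Python roughly as follows. This is only a minimal illustration under my own assumptions: the candidate delimiters, the function name `guess_delimiter` and the scoring rule (the delimiter whose most common per-line count is shared by the most sampled lines wins) are hypothetical choices, and the lines are assumed to be already available as a list.

```python
import random
from collections import Counter

# Assumed set of candidate delimiters; extend as needed.
CANDIDATE_DELIMITERS = [",", "|", "\t", ";"]

def guess_delimiter(lines, sample_size=100, seed=None):
    """Guess the column delimiter from a random sample of lines.

    Returns (delimiter, bad_records), where bad_records are the sampled
    lines that do not match the most common column count.
    """
    rng = random.Random(seed)
    sample = rng.sample(lines, min(sample_size, len(lines)))

    best, best_support, best_count = None, 0, 0
    for delim in CANDIDATE_DELIMITERS:
        # How many times does this delimiter occur on each sampled line?
        counts = Counter(line.count(delim) for line in sample)
        counts.pop(0, None)  # lines without the delimiter say nothing
        if not counts:
            continue
        per_line, support = counts.most_common(1)[0]
        if support > best_support:
            best, best_support, best_count = delim, support, per_line

    if best is None:
        return None, list(sample)
    bad_records = [line for line in sample if line.count(best) != best_count]
    return best, bad_records
```

Calling `guess_delimiter(lines, sample_size=200)` on the lines of a file would return the most likely separator along with the sampled lines that should be logged as “bad records”.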
Suppose a user has applied a transformation to the data coming from a certain source. The transformation is complex business logic that involves several “decisions based on some conditions”. In programmatic terms, such decisions can be represented as a “switch case” construct. Before finalizing the logic and passing the transformed data on, the user wants to test it by looking at its outcome on real data. The source, though, has millions of records, so reviewing the logic against all of them would take a lot of time and it would be tough to look through every row. If instead a sample is provided to the user, the amount of data to review stays small and the user gets ample opportunity to verify the logic.
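As a rough sketch of this second use case, the snippet below runs a transformation on a random sample and counts how often each branch was taken. The `classify` function is a made-up stand-in for the real business logic, and the `amount` field and thresholds are hypothetical.

```python
import random
from collections import Counter

def classify(record):
    """Hypothetical business logic standing in for a "switch case"."""
    amount = record["amount"]
    if amount < 0:
        return "refund"
    elif amount < 100:
        return "small"
    elif amount < 10_000:
        return "medium"
    return "large"

def review_on_sample(records, transform, sample_size, seed=None):
    """Apply `transform` to a random sample and summarize the outcomes."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    results = [(record, transform(record)) for record in sample]
    branch_counts = Counter(outcome for _, outcome in results)
    return results, branch_counts

# Stand-in for a source with many records.
records = [{"amount": a} for a in range(-50, 20_000)]
results, branch_counts = review_on_sample(records, classify, 200, seed=42)
print(branch_counts)  # how many sampled rows hit each branch
```

A branch that never shows up in `branch_counts` is a hint that the sample should be larger, or that this branch needs a hand-picked test record.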
With these two examples in mind, I will move on to the next section, which discusses different types of sampling methods. These methods are again based on my experience and what worked for me.
A quick search of research papers will reveal that sampling methods are usually categorized into two types:
- Probability sampling
- Non-probability sampling
In simple words, probability sampling is sampling that uses some form of random selection. There are various methods of this kind. Different journals use different terms for them, but as I mentioned earlier, the techniques that I will describe are purely from my experience. Below are the techniques.
In simple random sampling, every member of the population has an equal chance of being selected. This is possibly the most fundamental and primitive of the random sampling methods.
In real life, we apply random sampling through the lottery method or by pulling a few cards out of a shuffled deck.
Using random numbers is another useful technique. For example:
- If we want a sample of “X” from a population of “N”, associate each member of the population with a number from 1 to N.
- Repeat the steps below “X” times:
  a. Generate a random number between 1 and N
  b. Choose the member corresponding to the number generated in step “a”
Advantages:
- No bias in the selection of the members
- Fair representation of the population. Only luck can compromise its representativeness. If the sample is not representative of the population, this is called a “sampling error”
Disadvantage:
- The need to have information about the entire population
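The random-number procedure described above can be sketched in Python as below. One assumption of mine goes beyond the steps as listed: if a number comes up a second time, it is drawn again, so that no member appears twice in the sample. The function name is illustrative only.

```python
import random

def simple_random_sample(population, x, seed=None):
    """Pick x distinct members by drawing random numbers from 1 to N."""
    n = len(population)
    if x > n:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    chosen_numbers = set()
    sample = []
    while len(sample) < x:
        number = rng.randint(1, n)  # both ends inclusive
        if number in chosen_numbers:
            continue  # assumption: redraw instead of repeating a member
        chosen_numbers.add(number)
        # Member number k sits at index k - 1 of the list.
        sample.append(population[number - 1])
    return sample

members = [f"member-{i}" for i in range(1, 51)]
print(simple_random_sample(members, 5))
```

Note that the whole population has to be known and numbered before the first draw, which is exactly the disadvantage of this method.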
Practical demonstration with postgres:
Different databases provide or document ways to draw random samples from their systems. This saves users from implementing these algorithms themselves, but users should still be aware of which method to use and its performance implications. Let us consider postgres. There is a simple query using the random() function.
The first query:
select * from fordemo order by random() limit 2 ;
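Since postgres 9.5 there is a new construct called TABLESAMPLE. Two sampling methods can be specified with it: SYSTEM and BERNOULLI. As per the documentation, SYSTEM is faster because it samples whole blocks (pages) of the table, while BERNOULLI considers every row individually and therefore gives a more truly random sample. The argument is the approximate percentage of the table to sample.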
To use Bernoulli
sid=# select * from fordemo TABLESAMPLE bernoulli(10) ;
id | name
4 | row 4
To use system
sid=# select * from fordemo TABLESAMPLE SYSTEM(10) ;
The general query pattern is therefore as below:
select <cols/expression> from table TABLESAMPLE SAMPLING_METHOD(percentage)
To know more, please refer to this blog post: https://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
Similarly, other databases provide such methods.
Practical demonstration with Hive
This is similar to the first postgres query discussed above. Suppose we want a sample of 1000 records; we can do the following:
Select * from table order by rand() limit 1000;
The sample is truly random, but performance may be a bottleneck: in this case all the data may be forced into a single reducer, which then has to sort the entire data set.
Hive also has non-standard SQL clauses in the form of “sort by” and “distribute by”. Using a combination of them gives a sample that may not be truly random, but performance can definitely improve.
If the total size of the table is known, you can randomly select some proportion of the data on a record-by-record basis, as below:
Select * from table where rand() < 0.0001 distribute by rand() sort by rand() limit 10000;
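The threshold compared against rand() in the query above depends on the table size. Below is a small sketch of how I would work it out, together with a Python imitation of the record-by-record filter. The helper names and the `oversample` factor are my own assumptions, not anything Hive provides.

```python
import random

def sampling_fraction(table_rows, target_rows, oversample=1.0):
    """Fraction to compare against rand() so that roughly
    oversample * target_rows records pass the filter."""
    return min(1.0, oversample * target_rows / table_rows)

def bernoulli_filter(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`,
    like `where rand() < fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# A table of 100 million rows and a target of 10,000 records.
print(sampling_fraction(100_000_000, 10_000))

# The filter keeps roughly, not exactly, the requested share of rows.
kept = bernoulli_filter(range(100_000), 0.01, seed=0)
print(len(kept))
```

Because the filter returns only approximately the requested number of rows, it can make sense to pick a fraction somewhat larger than strictly needed and let the limit clause trim the result to the exact sample size.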
I tried these methods after reading about them in the blog post below, which is why I recommend it: http://www.joefkelley.com/736/
There are some other methods described in the hive docs which I will be discussing later on.
Practical demonstration with Python (for a sequence)
>>> import random
>>> demoList = list(range(1, 100))
>>> demoList
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
>>> random.sample(demoList, 5)  # first attempt
[54, 75, 51, 2, 61]
>>> random.sample(demoList, 5)  # second attempt
[81, 83, 30, 28, 25]
>>> random.sample(demoList, 5)  # third attempt
[2, 66, 34, 8, 68]
Stratified sampling is possibly the method used in exit polls of elections. By definition, stratified sampling is a probability sampling technique in which the researcher divides the entire population into different subgroups, or strata, and then randomly selects the final subjects proportionally from the different strata.
Suppose the students in a school can be divided into subgroups, or strata, as below:
- Group 1: Male full-time students: 90
- Group 2: Female full-time students: 63
- Group 3: Male part-time students: 18
- Group 4: Female part-time students: 9
Total students: 180
If one has to take a sample of 40, one would proceed as follows.
First, calculate the percentage constituted by each group:
- Group 1: 90/180 = 0.50 or 50%
- Group 2: 63/180 = 0.35 or 35%
- Group 3: 18/180 = 0.10 or 10%
- Group 4: 9/180 = 0.05 or 5%
So, the number of students from each group that should appear in the sample is:
- Group 1: 0.5 x 40 = 20
- Group 2: 0.35 x 40 = 14
- Group 3: 0.10 x 40 = 4
- Group 4: 0.05 x 40 = 2
Now apply the simple random sampling method to each stratum (group) with its respective sample size.
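The school example above can be reproduced with a short Python sketch. The function names are mine, and the rounding used for the allocation is an assumption: for other numbers the rounded group sizes may not add up exactly to the requested sample size and would need adjusting.

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Sample size per stratum, proportional to the stratum's share."""
    total = sum(strata_sizes.values())
    # Rounding is an assumption; the sum can drift from sample_size.
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

def stratified_sample(strata, sample_size, seed=None):
    """Simple random sample inside each stratum, sized proportionally."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

school = {
    "male full-time": 90,
    "female full-time": 63,
    "male part-time": 18,
    "female part-time": 9,
}
print(proportional_allocation(school, 40))

# Hypothetical student ids, just to have members to draw from.
strata = {name: [f"{name} #{i}" for i in range(size)]
          for name, size in school.items()}
sample = stratified_sample(strata, 40, seed=1)
print({name: len(picked) for name, picked in sample.items()})
```

Both print statements show 20, 14, 4 and 2 students for the four groups, matching the calculation done by hand.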
Practical demonstration with Hive
Suppose there is a hive table partitioned on, say, “YEAR”; each partition can be taken as a “stratum”.
We may assume that every partition has a similar number of records; what matters in the sample is that records from every partition are present. In such a case, if there are “N” partitions, the sample size is “X” with X > N, and each partition has a good amount of data, then the number of records to take from each partition is X/N. For example, if the sample size is 40 and the number of partitions is 5, then we get 8 from each partition. So, apply the random sampling query to each partition (using a WHERE clause that filters on the YEAR column) with a limit of 8.
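To make the per-partition idea concrete, here is a small Python sketch that builds one sampling query per partition, reusing the “order by rand() limit” pattern shown earlier. The table name `sales` and the list of years are hypothetical, and the queries are only generated as text, not executed.

```python
def partition_sample_queries(table, partition_column, partition_values,
                             sample_size):
    """One random-sample query per partition, X // N rows each."""
    per_partition = sample_size // len(partition_values)
    return [
        f"select * from {table} where {partition_column} = {value} "
        f"order by rand() limit {per_partition}"
        for value in partition_values
    ]

# Hypothetical table with five YEAR partitions and a sample size of 40.
queries = partition_sample_queries(
    "sales", "year", [2014, 2015, 2016, 2017, 2018], 40)
for query in queries:
    print(query)
```

Each generated query ends in “limit 8”, and the results of the five queries together form the sample of 40.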
Hive also provides bucket sampling and block sampling methods. Please refer to the wiki page below for details.
In simple terms, in non-probability sampling not every member of the population has a chance of being selected. Various methods are mentioned in the literature, such as the following:
- Convenience sampling: This is the quickest method; the sample is simply taken from the most easily accessible part of the population. In my experience, the TOP “n” query is one such sampling method.
- Judgemental sampling: This sample is taken with a specific purpose in mind. For example, if I were to check some logic, I would take data that matches the logic.
There are other methods, but I haven’t used them as such.
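A tiny Python illustration of why a TOP “n” style convenience sample can mislead, under the assumption that the data is stored in year order. The data and the function name are made up for the example.

```python
import random
from itertools import islice

def convenience_sample(records, n):
    """Take the first n records, like a TOP "n" query."""
    return list(islice(records, n))

# Hypothetical data stored in year order.
years = [2015] * 1000 + [2016] * 1000 + [2017] * 1000

top = convenience_sample(years, 10)
rand = random.Random(7).sample(years, 10)

print(set(top))           # only the earliest year is represented
print(sorted(set(rand)))  # a random sample usually spans several years
```

The first ten rows all come from 2015, so any conclusion drawn from them says nothing about the later years.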
I hope this gives an overall idea of sampling methods and how they can be useful. Different sampling methods work well under different circumstances, so it is important to know what the sample is needed for. For example, to find the structure of a file, a simple random sample would suffice, while for the analysis of exit polls stratified sampling is needed. If the randomness of the data is not so important, then convenience or judgemental sampling may be applied, but these have no statistical justification and can lead to badly unrepresentative samples.
I hope this is helpful to readers; if you have any feedback, do let me know in the comments.