Apache logs and Hive

The aim of this blog is to simplify handling of apache logs through Hive. I just thought of helping new users to Hive and trying to parse Apache logs. There are some easy ways. Before starting we should know a little about the formats of the logs and here I will be mentioning about three different formats:

  • Common log format (CLF)
  • Combined log format
  • Custom log format

The details of the log formats can be found in https://httpd.apache.org/docs/1.3/logs.html

The CLF is typically of the form, “%h %l %u %t \”%r\” %>s %b”

The combined log format is of the form, “%h %l %u %t \”%r\” %>s %b \”%{Referer}i\” \”%{User-agent}i\””

The meanings of the terms are provided in the link, https://httpd.apache.org/docs/1.3/mod/mod_log_config.html

Sample records of the apache log

Common log format:

64.242.88.10 – – [07/Mar/2004:16:35:19 -0800] “GET /mailman/listinfo/business HTTP/1.1” 200 6379

Combined log format:

216.67.1.91 – leon [01/Jul/2002:12:11:52 +0000] “GET /index.html HTTP/1.1″ 200 431″http://www.loganalyzer.net/” “Mozilla/4.05 [en] (WinNT; I)” “USERID=CustomerA;IMPID=01234”

Creating a Hive table

Approach 1 (Regex):

CREATE TABLE access_log_combined(
ipaddress STRING, identd STRING, username STRING,finishtime STRING,
requestline string, returncode string, size string,referer string, agent string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES (
“input.regex” = “([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \”]*|\”[^\”]*\”) (-|[0-9]*) (-|[0-9]*)(?: ([^ \”]*|\”[^\”]*\”) ([^ \”]*|\”[^\”]*\”))?”
)
STORED AS TEXTFILE

Approach 2 (using dynamic SerDe):

CREATE TABLE apachelog (ipaddress STRING, identd STRING, user STRING,finishtime STRING,requestline string, returncode INT, size INT) ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe’  WITH SERDEPROPERTIES (‘serialization.format’=’org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol’,‘quote.delim’='(“|\\[|\\])’,‘field.delim’=’ ‘,‘serialization.null.format’=’-‘) STORED AS TEXTFILE;

Conclusion:

Personally I prefer the second approach. I haven’t looked into the performance aspect as yet though. The reason for it is because I can define the “serialization.null.format” where in a NULL character can be specified. In the logs mentioned above ‘-‘ is considered the NULL character. However not sure if this method is used nowadays often.

The above hive table could be then used as a source which can be accessed through SQL like language and If one uses a LOCATION clause with the above statements such that an external table is created pointing to a HDFS directory, one can keep adding files to it and keep querying the same way.

One can download sample logs from http://www.monitorware.com/en/logsamples/apache.php to try out.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s