前回に引き続き、EmbulkのFILTER系プラグインを使ってみるということで、"embulk-filter-insert"と"embulk-filter-split"を使ってみます。
なお、今回も基となるデータはexampleコマンド⇒guessコマンドによって生成されるデータ・ymlファイルです。
使用データ
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
デフォルトでのpreviewコマンド実行結果
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
| 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+---------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+-------------------------+-------------------------+----------------------------+
| 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+---------+--------------+-------------------------+-------------------------+----------------------------+
1.embulk-filter-insert
"insert"とありますが、このプラグインはカラムのインサートを行います。単純にカラムをインサートするだけではなく、"どこ"にインサートするのか、"どの型"でインサートするのかを設定することができます。
特に"どこ"にインサートするのかの設定方法は多く用意されており、最初のカラムや最後のカラムといった指定から、何カラム目かの指定、◯◯カラムの前or後の指定ができるようになっています。
GitHub上のExampleの通り、最初のカラムとして"host_name"カラムを追加してみます。
Embulkの読み込み用ymlファイル(緑字を追記)
in:
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { host_name: host01 }, at: top }
out: {type: stdout}
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { host_name: host01 }, at: top }
out: {type: stdout}
デフォルトでのpreviewコマンド実行結果
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| host_name:string | id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| host01 | 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| host01 | 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| host01 | 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| host01 | 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| host_name:string | id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| host01 | 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| host01 | 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| host01 | 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| host01 | 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
最初のカラムにhost_nameカラムが追加されました。同様に"at"パラメータに"bottom"を指定すれば、最後のカラムとして追加されるようになりますし、数値を指定すれば、指定したカラムとして追加されるようになります。("2"を指定すれば、2カラム目として追加)
※なお、"at"パラメータには他に"head"、"tail"も指定できますが、実行してみた限りでは、"head"は"top"と同じ、"tail"は"bottom"と同じ動作をしていました。
また、"at"パラメータの代わりに"before"もしくは"after"パラメータの設定でカラム名を指定すると、指定したカラムの"前"or"後"にカラムをインサートすることができます。
冒頭で"どの型"でインサートするのかを設定できる、と書きましたが、上記の例では特に指定はしませんでした。その場合は"string"を設定したものと見なされます。
数値や時間をインサートする場合、正しい型になるように忘れずに設定しておかなくてはなりません。
例として、型設定に"long"を指定した場合と、何も指定しなかった場合を比較してみます。
"long"を指定した場合
Embulkの読み込み用ymlファイル(緑字を追記)
in:
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { insert_id: 123, as: long }, at: top }
out: {type: stdout}
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { insert_id: 123, as: long }, at: top }
out: {type: stdout}
previewコマンド実行結果
+----------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| insert_id:long | id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+----------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| 123 | 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 123 | 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 123 | 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 123 | 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+----------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| insert_id:long | id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+----------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| 123 | 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 123 | 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 123 | 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 123 | 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+----------------+---------+--------------+-------------------------+-------------------------+----------------------------+
何も指定しなかった場合
Embulkの読み込み用ymlファイル(緑字を追記)
in:
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { insert_id: 123 }, at: top }
out: {type: stdout}
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { insert_id: 123 }, at: top }
out: {type: stdout}
previewコマンド実行結果
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| insert_id:string | id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| 123 | 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 123 | 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 123 | 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 123 | 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| insert_id:string | id:long | account:long | time:timestamp | purchase:timestamp | comment:string |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
| 123 | 1 | 32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 123 | 2 | 14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 123 | 3 | 27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 123 | 4 | 11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+------------------+---------+--------------+-------------------------+-------------------------+----------------------------+
previewコマンド実行結果を比較すると、"insert_id"カラムの型が"long"と"string"に分かれて要ることが分かります。
最後に複数のカラムを一度にインサートしてみます。
Embulkの読み込み用ymlファイル(緑字を追記)
in:
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- type: insert
columns:
- service_name: service01
- { user_id: 1234567, as: long }
after: account
out: {type: stdout}
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- type: insert
columns:
- service_name: service01
- { user_id: 1234567, as: long }
after: account
out: {type: stdout}
previewコマンド実行結果
+---------+--------------+---------------------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long | service_name:string | user_id:long | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+---------------------+--------------+-------------------------+-------------------------+----------------------------+
| 1 | 32,864 | service01 | 1,234,567 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 14,824 | service01 | 1,234,567 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 27,559 | service01 | 1,234,567 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 11,270 | service01 | 1,234,567 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+---------+--------------+---------------------+--------------+-------------------------+-------------------------+----------------------------+
| id:long | account:long | service_name:string | user_id:long | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+---------------------+--------------+-------------------------+-------------------------+----------------------------+
| 1 | 32,864 | service01 | 1,234,567 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 14,824 | service01 | 1,234,567 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 27,559 | service01 | 1,234,567 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 11,270 | service01 | 1,234,567 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+---------+--------------+---------------------+--------------+-------------------------+-------------------------+----------------------------+
2つのカラムが"account"カラムの後にインサートされています。
ただし、この方法では上記の例の"service_name"は"account"カラムの後でいいけれど、"user_id"は"id"カラムの後にしたいという場合に対応していません。
その場合は下記のような設定で対応できます。
Embulkの読み込み用ymlファイル(緑字を追記)
in:
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { service_name: service01 }, after: account }
- { type: insert, column: { user_id: 1234567, as: long }, after: id }
out: {type: stdout}
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
filters:
- { type: insert, column: { service_name: service01 }, after: account }
- { type: insert, column: { user_id: 1234567, as: long }, after: id }
out: {type: stdout}
previewコマンド実行結果
+---------+--------------+--------------+---------------------+-------------------------+-------------------------+----------------------------+
| id:long | user_id:long | account:long | service_name:string | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+--------------+---------------------+-------------------------+-------------------------+----------------------------+
| 1 | 1,234,567 | 32,864 | service01 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 1,234,567 | 14,824 | service01 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 1,234,567 | 27,559 | service01 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 1,234,567 | 11,270 | service01 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+---------+--------------+--------------+---------------------+-------------------------+-------------------------+----------------------------+
| id:long | user_id:long | account:long | service_name:string | time:timestamp | purchase:timestamp | comment:string |
+---------+--------------+--------------+---------------------+-------------------------+-------------------------+----------------------------+
| 1 | 1,234,567 | 32,864 | service01 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC | embulk |
| 2 | 1,234,567 | 14,824 | service01 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC | embulk jruby |
| 3 | 1,234,567 | 27,559 | service01 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
| 4 | 1,234,567 | 11,270 | service01 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC | NULL |
+---------+--------------+--------------+---------------------+-------------------------+-------------------------+----------------------------+
2.embulk-filter-split
"split"から連想する通り、特定の文字でデータの切り分けを行うプラグインです。特定のカラムに対してデリミタになる文字を指定し、新しいカラムとしてデータを追加していきます。
どのような結果になるのか、実際の例を見てみると、理解しやすいと思います。
使用データ
id,keywords
1,"a,b,c"
2,"A,B,C"
3,ABC
1,"a,b,c"
2,"A,B,C"
3,ABC
Embulkの読み込み用ymlファイル(緑字を追記)
in:
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: keywords, type: string}
filters:
- type: split
delimiter: ','
keep_input: true
target_key: keywords
output_key: keyword
out: {type: stdout}
type: file
path_prefix: ~/lib/try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: keywords, type: string}
filters:
- type: split
delimiter: ','
keep_input: true
target_key: keywords
output_key: keyword
out: {type: stdout}
previewコマンド実行結果(プラグイン不使用)
+---------+-----------------+
| id:long | keywords:string |
+---------+-----------------+
| 1 | a,b,c |
| 2 | A,B,C |
| 3 | ABC |
+---------+-----------------+
| id:long | keywords:string |
+---------+-----------------+
| 1 | a,b,c |
| 2 | A,B,C |
| 3 | ABC |
+---------+-----------------+
previewコマンド実行結果(プラグイン使用)
+---------+-----------------+----------------+
| id:long | keywords:string | keyword:string |
+---------+-----------------+----------------+
| 1 | a,b,c | a |
| 1 | a,b,c | b |
| 1 | a,b,c | c |
| 2 | A,B,C | A |
| 2 | A,B,C | B |
| 2 | A,B,C | C |
| 3 | ABC | ABC |
+---------+-----------------+----------------+
| id:long | keywords:string | keyword:string |
+---------+-----------------+----------------+
| 1 | a,b,c | a |
| 1 | a,b,c | b |
| 1 | a,b,c | c |
| 2 | A,B,C | A |
| 2 | A,B,C | B |
| 2 | A,B,C | C |
| 3 | ABC | ABC |
+---------+-----------------+----------------+
結果を比較すると、行も列も増えていることが分かります。
追加されている"keyword"カラムは、値を見てみると何となく分かると思いますが、"keywords"カラムを","で切り分けた結果が入力されています。
ymlファイルでの設定とは、次のように対応しています。
- 切り分ける元のカラム名:target_key
- 切り分けた後の保存先カラム名:output_key
- デリミタ:delimiter
切り分けた結果で増えた行の他のカラムについては、切り分ける元となった行のデータがそのまま入力されるようになっています。
あくまで切り分けた結果しか必要でない場合は、"keep_input"をfalseにすることで実現できます。ただし、切り分けた結果のカラムしか表示されないので、"id"カラムは残しておきたいなどのケースには対応できません。
previewコマンド実行結果("keep_input"をfalseに設定)
+----------------+
| keyword:string |
+----------------+
| a |
| b |
| c |
| A |
| B |
| C |
| ABC |
+----------------+
| keyword:string |
+----------------+
| a |
| b |
| c |
| A |
| B |
| C |
| ABC |
+----------------+
次回は恐らく来年になると思いますが、Embulkに関してか、Railsに関しての題材にしようかなと考えています。
0 件のコメント:
コメントを投稿