Serde - OpenCSVSerde

CSV ํŒŒ์ผ์„ hadoop ์—์„œ ์ฝ์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” OpenCSVSerde๋ฅผ ์‚ฌ์šฉํ•ด๋ณด๊ณ , ๋ฐ์ดํ„ฐ ๊ฒ€์ฆ๊นŒ์ง€ ํ•ด๋ณผ ๊ฒƒ์ด๋‹ค.

* Hive 0.14 ๋ฒ„์ „๋ถ€ํ„ฐ ๊ธฐ๋ณธ ์ง€์›

๋จผ์ € ๋ฐ์ดํ„ฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ๋ฐ์ดํ„ฐ ๊ฒ€์ฆ ์˜ค๋ฅ˜๋ฅผ ์œ„ํ•ด ์ค‘๊ฐ„์— ์ค„๋ฐ”๊ฟˆ์„ ๋„ฃ์—ˆ๋‹ค.

# csvserdetest.csv

NO,TYPE,CODE,NAME,ADRES,RN_ADRES,TELNO,CTPRVN_CD
"1",ํŽธ์˜์ ,"123456",7์ผ๋ ˆ๋ธ,๊ฒฝ๊ธฐ๋„ ์„ค๋‚จ์‹œ ๋ถ„๋‹น๊ตฌ ์‚ผํ‰๋™ 656,๊ฒฝ๊ธฐ๋„์„ฑ๋‚จ์‹œ ๋ถ„๋‹น๊ตฌ ๋ถ„๋‹น๋‚ด๊ณก๋กœ 151,,"41"
"2",ํŽธ์˜์ ,"123456",GS25,๊ฒฝ๊ธฐ๋„ ์ˆ˜์›์‹œ ์žฅ์•ˆ๊ตฌ ์ฒœ์ฒœ๋™ 502-1, ๊ฒฝ๊ธฐ๋„ ์ˆ˜์›์‹œ ์žฅ์•ˆ๊ณ  ์ •์ž๋กœ 22,,"41"
"3",ํŽธ์˜์ ,"123456",๋ฏธ๋‹ˆ์Šคํ†ฑ,๊ฒฝ๊ธฐ๋„ ๊น€ํฌ์‹œ ์‚ฌ์šฐ๋™ 249-5,๊ฒฝ๊ธฐ๋„ ๊น€ํฌ์‹œ ์‚ฌ์šฐ์ค‘๋กœ 11๋ฒˆ๊ธธ 20,,"41"
"4",ํŽธ์˜์ ,"123456",electromart/ํ”ผ๊ทœ์–ด,์‚ผ์„ฑ,์—˜์ง€,์นด๋ฉ”๋ผ,๋Œ€ํ˜•๊ฐ€์ „,๋””์Šคํ”Œ๋ ˆ์ด,๊ฒฝ๊ธฐ๋„ ์„ฑ๋‚จ์‹œ ๋ถ„๋‹น๊ตฌ ๋ฐฑํ˜„๋™ 531, ๊ฒฝ๊ธฐ๋„ ์„ฑ๋‚จ์‹œ ๋ถ„๋‹น๊ตฌ ๋Œ€ํ™ฉํŒ๊ต๋กœ 606๋ฒˆ๊ธธ 10,,"41"
"5",ํŽธ์˜์ ,"123456",์œ„๋“œ๋ฏธ,์ถฉ์ฒญ๋ถ๋„ ์ฒญ์ฃผ์‹œ ์ฒญ์›๊ตฌ ์˜ค์ฐฝ์ ๊ฐ๋ฆฌ 646-1,์ถฉ์ฒญ๋ถ๋„ ์ฒญ์ฃผ์‹œ ์ฒญ์›๊ตฌ
์˜ค์ฐฝ์ ์˜ค์ฐฝ์ค‘์•™๋กœ 65,,"43"

Create๋ฌธ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

CREATE EXTERNAL TABLE csvserde_test(
	no			string,
    type		string,
    code		string,
    name		string,
    adres		string,
    rn_adres	string,
    telno		string,
    ctprvn_cd	string
)
ROW FORMAT SERDE
	'org.apache,hadoop,hive,serde2,OpenCSVSerde'
WITH SERDEPROPERTIES (
	'escapeChar'='\\',
    'quote'='"',
    'seperatorChar'=',')
)
STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
	'/user/hive/st/csvserde_test'
TBLPROPERTIES (
	'skip.header.line.count'='1'
)
;

- sepratorChar: ์นผ๋Ÿผ๊ฐ„์˜ ๊ตฌ๋ถ„์ž

- quoteChar: ์นผ๋Ÿผ์˜ ๊ฐ’์„ ์ง€์ •ํ•œ ๋ฌธ์ž๋กœ ๋ฌถ์–ด์ค€๋‹ค.

- escapeChar: ์นผ๋Ÿผ์— ๋ฐ์ดํ„ฐ๋ฅผ ์ž…๋ ฅํ•  ๋•Œ ํŒŒ์‹ฑํ•˜์ง€ ์•Š๊ณ  ๋ฌด์‹œ

ํ…Œ์ด๋ธ”์ด ์ƒ์„ฑ๋˜์—ˆ์œผ๋‹ˆ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด์ž.

ํ•ญ์ƒ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ๊ธฐ์ „์— ํ™•์ธํ•ด ์ค˜์•ผ ํ•  ๊ฒƒ์ด ์žˆ๋‹ค. ๋ฐ”๋กœ ์ธ์ฝ”๋”ฉ์ด๋‹ค.

๋ฆฌ๋ˆ…์Šค์—์„œ ํŒŒ์ผ ์ธ์ฝ”๋”ฉ ํ™•์ธ ๋ช…๋ น์–ด๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

$ file -bi csvserdetest.csv

๋งŒ์•ฝ utf-8์ด ์•„๋‹ˆ๋ผ๋ฉด ์•„๋ž˜ ๋ช…๋ น์–ด๋กœ ๋ฐ”๊ฟ”์ฃผ์ž

$ iconv -f euc-kr -t utf8 csvserdetest.csv > csvserdetest_utf8.csv

utf-8๋กœ ์ธ์ฝ”๋”ฉ์ด ๋ณ€ํ™˜๋œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‘ก์— ๋„ฃ์–ด๋ณด์ž

$ hadoop fs -put csvserdetest_utf8.csv /user/hive/st/csvserde_test

hive๋กœ ์กฐํšŒ๋ฅผ ํ•ด๋ณด๋ฉด ๋ฐ์ดํ„ฐ๊ฐ€ ๋‚˜์˜จ๋‹ค.

SELECT * FROM csvserde_test;

์ด์ œ ๋ฐ์ดํ„ฐ ๊ฒ€์ฆ์„ ํ•ด๋ณด์ž.

๋ฐ์ดํ„ฐ๋ฅผ ๋ณด๋ฉด 4๋ฒˆ ๋ฐ์ดํ„ฐ๋Š” ๋”ฐ์˜ดํ‘œ ์—†์ด ์‰ผํ‘œ๊ฐ€ ๋งŽ๊ณ , 5๋ฒˆ ๋ฐ์ดํ„ฐ๋Š” ์ค‘๊ฐ„์— ์ค„๋ฐ”๊ฟˆ์ด ์žˆ๋‹ค. ์ด๋Ÿฐ๊ฒƒ๋“ค์„ ๋ชจ๋‘ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค.

ํ•ด๋‹น ๋ฐ์ดํ„ฐ์˜ ๊ฒ€์ฆ์€ python์˜ csv ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜๋ฉด ๊ฐ„๋‹จํ•˜๊ฒŒ ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฝ”๋“œ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

# select_error_in_csv.py

import csv

a_list = []
f = open('C:/csvserdetest.csv', 'r', encoding='udf8')
reader = csv.reader(f)
for row in reader:
    if len(row) != 12:
        print(row)
f.close()

Leave a comment