Extracting a line from Pig

Problem description:

I want to group my data by url. My data is currently stored as one long line, for example: {"mobile", "country:US", "url:1234.com", "NEWUSER:Y"} and so on.

This is what I have so far:

RAW = LOAD '/data/events/raw/2014-08-21/' as (line:chararray); 
A = FILTER RAW BY (INDEXOF(line,'mobile') != -1);
B = LIMIT A 800; 
URL = GROUP B BY (INDEXOF(line, 'url')); 
STORE URL INTO '/user/hadoopuser/RS_traffic.txt'; 

Do I need to extract the url from the string before I can group by it? Can I use a regular expression?

Your input looks like JSON; you could try loading it with JsonStorage: http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore – 2014-08-29 07:09:28

This is not valid JSON – 2014-09-10 07:53:15

You can use the REGEX_EXTRACT() function:

REGEX_EXTRACT Javadoc

RAW = LOAD '/data/events/*' AS (line:chararray);
-- keep only the lines that mention 'mobile'
A = FILTER RAW BY (INDEXOF(line,'mobile') != -1);
-- pull the url out of each remaining line (capture group 1 of your pattern)
C = FOREACH A GENERATE REGEX_EXTRACT(line, '<your_pattern>', 1) AS url:chararray;
URL = GROUP C BY url;
....
STORE URL INTO '/user/hadoopuser/RS_traffic.txt';
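
As a rough illustration of what could stand in for '<your_pattern>', here is a minimal sketch. It assumes every record carries a field that looks like "url:1234.com", as in the sample line from the question. The pattern, the relation names, the per-url count, and the output path are all hypothetical and not part of the original answer, so adjust them to the real data.

```pig
-- Sketch only: assumes each line contains a field such as "url:1234.com".
RAW = LOAD '/data/events/raw/2014-08-21/' AS (line:chararray);

-- Keep only the lines that mention 'mobile'.
MOBILE = FILTER RAW BY INDEXOF(line, 'mobile') != -1;

-- Capture everything after 'url:' up to the next double quote, comma,
-- closing brace, or space (hypothetical pattern).
WITH_URL = FOREACH MOBILE GENERATE
    REGEX_EXTRACT(line, 'url:([^",} ]+)', 1) AS url:chararray,
    line;

-- REGEX_EXTRACT yields null when the pattern does not match; drop those rows
-- so they do not all end up in a single null group.
HAS_URL = FILTER WITH_URL BY url IS NOT NULL;

BY_URL = GROUP HAS_URL BY url;

-- Example aggregate: number of matching lines per url.
COUNTS = FOREACH BY_URL GENERATE group AS url, COUNT(HAS_URL) AS hits;

-- Hypothetical output directory.
STORE COUNTS INTO '/user/hadoopuser/RS_traffic_by_url';
```

The character class deliberately avoids backslash escapes, which keeps the pattern free of Pig's string-escaping rules. If the real url values can contain commas or spaces, the pattern would need to change.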