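Counting the number of occurrences of a word
Question:
I'm looking for a better SAS way to count the number of times a word appears in a string. For example, searching for the word 'wood' in the string: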
how much wood could a woodchuck chuck if a woodchuck could chuck wood
...would return a result of 2.
This is how I would usually do it, but it's a lot of code:
data _null_;
  length sentence word $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  found_count = 0;
  cnt=1;
  word = scan(sentence,cnt);
  do while (word ne '');
    num_times_found = sum(num_times_found, word eq search_term);
    cnt = cnt + 1;
    word = scan(sentence,cnt);
  end;
  put num_times_found=;
run;
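I could turn this into an FCMP function to make it more elegant, but I still feel there must be friendlier, more concise code.
Answer:
From a code review perspective, the above can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't need the initial assignment. You also have an extraneous variable, found_count; I'm not sure what that is for. Otherwise I think it's reasonable, at least for an uncomplicated solution.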
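data _null_;
  length sentence word $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  /* the loop increments cnt itself and stops once SCAN returns a blank word */
  do cnt=1 by 1 until (word eq '');
    word = scan(sentence,cnt);
    /* (word eq search_term) evaluates to 1 or 0; SUM ignores the initial missing value */
    num_times_found = sum(num_times_found, word eq search_term);
  end;
  put num_times_found=;
run;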
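It's also fairly fast: 1e6 iterations take under 9 seconds on my box. The PRX solution takes less time (6 seconds) when the o option is added to the regex string options, so it may be preferable with very large datasets or a large number of variables, though I'd expect the difference to be insignificant compared with I/O time. The FCMP solution is of the same order of time as this solution (around 8-9 seconds). Finally, the FINDW solution is the fastest, at around 2 seconds.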
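For anyone who wants to repeat this kind of timing, a harness along the following lines could be used. It is only a sketch: the variable names iter, start and elapsed are my own, the loop count simply matches the 1e6 iterations mentioned above, and the timings will differ from machine to machine.

```sas
data _null_;
  length sentence word $200 search_term $20;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  start = datetime();                 /* wall-clock time before the loop */
  do iter = 1 to 1e6;
    num_times_found = 0;              /* reset the counter for each repetition */
    do cnt = 1 by 1 until (word eq '');
      word = scan(sentence, cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;
  end;
  elapsed = datetime() - start;       /* elapsed seconds for all repetitions */
  put num_times_found= elapsed= 8.2;
run;
```

Swapping the inner loop for one of the other approaches (PRX, FCMP, FINDW) gives a like-for-like comparison on the same machine.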
Answer:
Try using prxchange to drop 'wood', then countw.
data _null_;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  count=countw(sentence,' ')-countw(prxchange('s/wood/$1/i',-1,sentence),' ');
  put _all_;
run;
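Note that the pattern above also strips 'wood' out of 'woodchuck'; the count still comes out right here because 'chuck' remains a word, but a string made up only of repeated 'wood' (such as 'woodwood') would be miscounted. A possible variant, which is my own adjustment rather than part of the answer above, anchors the pattern with \b word boundaries and adds the o (compile once) option mentioned in the timing notes:

```sas
data _null_;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  /* \bwood\b removes only standalone occurrences of the word (case-insensitive) */
  /* the o option asks SAS to compile the regular expression only once           */
  stripped = prxchange('s/\bwood\b//io', -1, sentence);
  /* every removed occurrence lowers the word count by exactly one */
  count = countw(sentence, ' ') - countw(stripped, ' ');
  put count=;
run;
```

For the sample sentence this should give count=2. Bear in mind that \b treats punctuation such as a hyphen as a boundary, so 'wood-chuck' would also be counted.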
Answer:
And for the sake of completeness, here it is as an FCMP function:
FCMP definition:
/* CMPLIB takes a two-level libref.dataset name; OUTLIB adds a package name */
options cmplib=work.temp;
proc fcmp outlib=work.temp.temp;
  function word_freq(sentence $, search_term $) ;
    /* only the local variable needs a length; sentence is an argument */
    length word $200;
    do cnt=1 by 1 until (word eq '');
      word = scan(sentence,cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;
    return (num_times_found);
  endsub;
run;
Usage:
data _null_;
  num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood');
  put num_times_found=;
run;
Result:
num_times_found=2
Answer:
There's no reason to SCAN every word when FINDW will do the scanning efficiently for you.
data _null_;
  length sentence search_term $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  cnt=0;
  do s=findw(sentence,strip(search_term),1) by 0 while(s);
    cnt+1;
    s=findw(sentence,strip(search_term),s+1);
  end;
  put cnt= search_term=;
  stop;
run;
cnt=2 search_term=wood
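Definitely much faster than the SCAN method. – Joe
I posted this here rather than on Code Review because I don't think Code Review would have any SAS audience. –
Isn't this just COUNTW? –
@data_null_ No. That was my first thought too, but 'countw()' only counts the total number of words, not the number of times a specific word appears. –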