Bash - swap values in a column

Question:

I have some CSV/tabular data in a file, like so:

1,7,3,2 
8,3,8,0 
4,9,5,3 
8,5,7,3 
5,6,1,9

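(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)

I'd like to randomly shuffle 40% of any one of the columns. As an example, say the third one. So perhaps 3 and 1 get swapped with each other. Now the third column is: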

1 << Came from the last position 
8 
5 
7 
3 << Came from the first position

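I am trying to do this in place in a file, from within a bash script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless rabbit holes, which makes me think I'm going the wrong way (the constant failure is what's tipping me off).

I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.

Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it allows me to specify the percentage of the column to swap. It also happens to work, which is always a plus.

For anyone who doesn't need this and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).

A word of warning about Rubens' solution, however. I took this: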

for (i = 1; i <= NF; ++i) { 
    delim = (i != NF) ? "," : ""; 
    ... 
} 
printf "\n";

took out the printf "\n"; and moved the newline character up like this:

for (i = 1; i <= NF; ++i) { 
    delim = (i != NF) ? "," : "\n"; 
    ... 
}

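because just having "" in the else case was causing awk to write broken characters (\00) at the end of each line. At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, that probably involved me doing something extra stupid on top of this problem.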
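The stray \00 bytes are consistent with how %c handles a string argument: it prints the first character of the string, and when the string is empty some awk implementations appear to emit a NUL byte rather than nothing. A minimal sketch, assuming a POSIX-style awk, that avoids the problem by printing the delimiter with %s instead (join_fields is a hypothetical name used only for this illustration):

```shell
# Hypothetical helper: re-print each record field by field.
# The delimiter goes through %s, so it is printed exactly as given,
# whether it is a comma, a newline, or an empty string.
join_fields() {
    awk -F, '{
        for (i = 1; i <= NF; ++i) {
            delim = (i != NF) ? "," : "\n"
            printf "%s%s", $i, delim
        }
    }' "$@"
}

printf '1,7,3,2\n8,3,8,0\n' | join_fields
```

With %s the output should be byte-for-byte identical to the input, which is easy to confirm with wc -c or od -c.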

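Randomization isn't a strength of text-processing tools like 'sed' or 'awk'. – 2013-03-19 04:52:55

Do you want to select 40% of the columns and shuffle them completely, or select one (or more) columns and randomly shuffle 40% of them? – FoolishSeth 2013-03-19 05:27:43

The latter (40% of the rows of column N). – 2013-03-19 05:28:49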

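Answer

Algorithm:

create a vector with n pairs, from 1 to the number of lines, each holding the line number and the respective value in that line (for the selected column), and then sort it randomly;
find how many lines should be randomized: num_random = percentage * num_lines / 100;
select the first num_random entries from your randomized vector;
you may sort the selected lines randomly, but they should already be randomly sorted;

printing output: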

i = 0 
for num_line, value in column; do 
    if num_line not in random_vector: 
     print value; # printing non-randomized value 
    else: 
     print random_vector[i]; # randomized entry 
     i++; 
done
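The second and third steps can be tried on their own. The snippet below is only an illustration and not part of this answer's script: it assumes GNU coreutils' shuf is installed, and the variable names n_lines, percentage, num_random and selected are hypothetical. Shell integer arithmetic truncates, so 40% of 5 lines selects 2 lines.

```shell
n_lines=5        # assumed: number of lines in the input file
percentage=40    # assumed: share of the column to randomize

# Integer division truncates, like bc with its default scale of 0.
num_random=$(( percentage * n_lines / 100 ))
echo "$num_random"

# Draw num_random distinct line numbers between 1 and n_lines.
selected=$(shuf -i 1-"$n_lines" -n "$num_random")
echo "$selected"
```

Only the lines whose numbers end up in selected would take part in the shuffle; every other line keeps its original value.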

Implementation:

#!/bin/bash
# Usage: ./script.sh <inpath> <column> <percentage>

infile=$1
col=$2
n_lines=$(wc -l < "$infile")
# Number of lines to randomize (bc truncates to an integer)
num_random=$(bc <<< "$3 * $n_lines / 100")

# Selected lines: "line_number,value" pairs in random order
tmp=$(mktemp)
paste -d ',' <(seq 1 "$n_lines") <(cut -d ',' -f "$col" "$infile") \
    | sort -R | head -n "$num_random" > "$tmp"

# Rewriting file
awk -v "col=$col" -v "tmp=$tmp" -F "," '
BEGIN {c = 1}
# Selection file: remember the chosen line numbers and keep their values
# in the random order produced by sort -R
(FILENAME == tmp) {id[$1]; value[++n] = $2; next}
{
    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "";
        # Selected lines get the next randomized value in the chosen column
        if (i == col && (FNR in id)) {printf "%s%s", value[c++], delim;}
        else {printf "%s%s", $i, delim;}
    }
    printf "\n";
}
' "$tmp" "$infile"

rm "$tmp"

If you want behavior close to in-place editing, you can pipe the output back into your input file using sponge.
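sponge ships with the moreutils package and may not be installed everywhere. Where it is missing, the same effect can be approximated with a temporary file. The helper below is a hedged sketch, not part of the original answer; in_place is a hypothetical name, and the script invocation is shown only as a comment:

```shell
# Hypothetical helper: run a command, capture its output in a temporary
# file, and overwrite the target only if the command succeeded.
in_place() {
    local target=$1
    shift
    local out
    out=$(mktemp) || return 1
    "$@" > "$out" && cat "$out" > "$target"
    local status=$?
    rm -f "$out"
    return "$status"
}

# Usage with the script from this answer:
#   in_place infile ./script.sh infile 3 40
# With moreutils installed, the equivalent would be:
#   ./script.sh infile 3 40 | sponge infile
```

Writing with cat rather than mv keeps the target file's permissions, at the cost of not being an atomic replacement.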

Execution:

To execute, simply use:

$ ./script.sh <inpath> <column> <percentage>

As in:

$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9

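Conclusion:

This allows you to select a column, randomly sort a percentage of the entries in that column, and put the new column in place of the old one in the original file.

This script is proof like no other, not only that shell scripting is extremely entertaining, but also that there are cases where it definitely should not be used. (:

Answer

This works for one specific, hardcoded column (the third), but should be enough to point you in the right direction. It works in modern bash shells, including Cygwin: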

paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)

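The operative feature is "process substitution".

The paste command joins files horizontally. The three pieces are split from the original file via cut, and the second piece (the column to be randomized) is run through the shuf command to reorder its lines. Here's the output from running it a couple of times: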

$ cat test.dat 
1,7,3,2 
8,3,8,0 
4,9,5,3 
8,5,7,3 
5,6,1,9 

$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat) 
1,7,1,2 
8,3,8,0 
4,9,7,3 
8,5,3,3 
5,6,5,9 

$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat) 
1,7,8,2 
8,3,1,0 
4,9,3,3 
8,5,7,3 
5,6,5,9
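The one-liner above hardcodes column 3 and needs to know which columns sit on either side of it. As a hedged sketch of one way to take the column as a parameter (shuffle_column is a hypothetical helper, and it still shuffles the whole column rather than 40% of it), the shuffled values can be pasted in front of each line and then copied into place with awk:

```shell
# Hypothetical helper: shuffle column $2 of the comma-separated file $1.
shuffle_column() {
    local file=$1 col=$2
    # Prepend the shuffled column as field 1, copy it over the target
    # column (now at position col + 1), then strip the helper field.
    paste -d, <(cut -d, -f"$col" "$file" | shuf) "$file" |
        awk -F, -v OFS=, -v col="$col" '{
            $(col + 1) = $1
            sub(/^[^,]*,/, "")
            print
        }'
}

# Example: shuffle_column test.dat 3
```

This relies on bash process substitution and GNU shuf, like the one-liner it generalizes.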

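+1. 'shuf' would likely need to be replaced with a custom shuffler to handle the 40% constraint, but otherwise nice (assuming the number of columns is fixed). – chepner 2013-03-19 12:11:20

Answer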

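I'd use a 2-pass approach: first get a count of the number of lines and read the column's values into an array, then use awk's rand() function to generate random numbers that identify the lines you'll change, then rand() again to determine which of those lines get swapped with each other, and then swap the array elements before printing. A rough sketch of the algorithm:

awk -F, -v OFS=, -v pct=40 -v col=3 '
NR == FNR {
    # Pass 1: remember the target column of every line
    val[++totNumLines] = $col
    next
}

FNR == 1 {
    pctNumLines = int(totNumLines * pct / 100)

    srand()

    for (i = 1; i <= int(pctNumLines / 2); i++) {
        # Pick two distinct line numbers that have not been swapped yet
        do { oldLineNr = int(rand() * totNumLines) + 1 } while (oldLineNr in swapped)
        swapped[oldLineNr]
        do { newLineNr = int(rand() * totNumLines) + 1 } while (newLineNr in swapped)
        swapped[newLineNr]

        # Swap the column values between the two lines
        tmpVal = val[oldLineNr]
        val[oldLineNr] = val[newLineNr]
        val[newLineNr] = tmpVal
    }
}

# Pass 2: print every line with its (possibly swapped) column value
{ $col = val[FNR]; print }

' "$file" "$file" > tmp &&
mv tmp "$file"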
