Loop through files in a directory and create output files (Linux)

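Question:

I want to loop through every file in a particular directory (called sequences) and run two commands on each file. I know the commands (the 'blastp' and 'cat' lines) work, because I can run them on individual files. Normally I would give a specific file name as the query, the output, and so on, but I'm trying to use a variable so the loop can work through many files. I believe I'm going badly wrong in how I use the file name inside the commands. As it stands, my code executes, but it creates a bunch of extra, unintended files. This is what I intend my script to do: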

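Line 1: Loop over every file in my "sequences" directory. (All of them end in ".fa", if that helps.)

Line 3: Recognize the file name as a variable. (I know, I know, I think I've done this horribly wrong.)

Line 4: Run blastp using the file name as the argument to the "query" flag, always using "database.faa" as the argument to the "db" flag, and write the results to a new file with the same name as the initial file but with ".txt" at the end.

Line 5: Write part of the line 4 output file to a new file with the same name as the initial file but with "_top_hits.txt" at the end.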

for sequence in ./sequences/{.,}*; 
    do 
      echo "$sequence"; 
      blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7 
      cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt 
    done

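When I run this code, it gives me six new files for each source file in the directory (they're all in the same directory; I'd rather have each set in its own folder. How can I do that?). They're all empty. Their suffixes are ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".

If I can provide any further information to clarify anything, please let me know.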

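It looks like you have at least one problem, which is that you're running the same commands multiple times in the same directory. Each time you run it, I believe your loop finds the new files generated by the previous run and tries to operate on them as well. As far as I can tell, you haven't restricted your file search to files ending in '*.fa', but I'd suggest you do so. Otherwise you'll keep processing the newly written .txt files and generating more bogus output. – aardvarkk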

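I agree, I definitely need to do that. I suppose another fix would be to write all the output files to a separate directory. How would I make it loop over only files ending in *.fa? Do I put that on the first line? – lynkyra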

Answer

If you're only interested in *.fa files, I would limit your input to only the files that match:

for sequence in sequences/*.fa; do
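
As a rough sketch of why this helps: the pattern `*.fa` only matches names that end in `.fa`, so `.txt` files left over from an earlier run are skipped when the loop runs again. The demo below uses a throwaway directory and made-up file names (both are assumptions for illustration, not taken from the question), and it also guards against the case where no file matches at all.

```shell
# Build a scratch directory with hypothetical input and leftover output files.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sequences"
touch "$demo_dir/sequences/a.fa" "$demo_dir/sequences/b.fa" \
      "$demo_dir/sequences/a.fa.txt" "$demo_dir/sequences/a.fa_top_hits.txt"

matched=""
for sequence in "$demo_dir"/sequences/*.fa; do
    # When nothing matches, the shell leaves the pattern as literal text,
    # so skip anything that is not an existing file.
    [ -e "$sequence" ] || continue
    matched="$matched $(basename "$sequence")"
done

echo "Matched:$matched"   # prints "Matched: a.fa b.fa"
rm -rf "$demo_dir"
```

Only the two `.fa` files are visited; the `.txt` leftovers are ignored, which is what stops the output files from multiplying on each run.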

Answer

I can suggest the following improvements:

for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do" 
do 
    # ${variable%something} is the part of $variable 
    # before the string "something" 
    # basename path/to/file is the name of the file 
    # without the full path 
    # $(some command) allows you to use the result of the command as a string 
    # Combining the above, we can form a string based on our fasta file 
    # This string can be useful to name stuff in a clean manner later 
    sequence_name=$(basename "${fasta_file%.fa}")
    echo "${sequence_name}"
    # Create a directory for the results for this sequence 
    # -p option avoids a failure in case the directory already exists 
    mkdir -p "${sequence_name}"
    # Define the name of the file for the results 
    # (including our previously created directory in its path) 
    blast_results=${sequence_name}/${sequence_name}_blast.txt
    blastp -query "${fasta_file}" -db database.faa \
     -out "${blast_results}" \
     -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits 
    top_hits=${sequence_name}/${sequence_name}_top_hits.txt 
    # alternatively, using "%" 
    #top_hits=${blast_results%_blast.txt}_top_hits.txt 
    # No need to cat: awk can take a file as argument 
    awk '/hits found/{getline;print}' "${blast_results}" \
     | grep -v "#" > "${top_hits}"
done

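I introduced more intermediate variables with (hopefully) meaningful names. I used \ to escape the line endings so that a command can be spread over several lines. I hope this improves the readability of the code.

I haven't tested it. There may be typos.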
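
To check what the name-building expansions produce without running blastp at all, something like the following can be pasted into a shell. The path `./sequences/protein1.fa` is a hypothetical example, not a file from the question; the variable names follow the script above.

```shell
fasta_file=./sequences/protein1.fa   # hypothetical input path

# ${fasta_file%.fa} removes the trailing ".fa" from the value.
without_ext=${fasta_file%.fa}
# basename then drops the leading directories.
sequence_name=$(basename "$without_ext")

# Result files go into a directory named after the sequence.
blast_results=${sequence_name}/${sequence_name}_blast.txt
# Swap the "_blast.txt" ending for "_top_hits.txt".
top_hits=${blast_results%_blast.txt}_top_hits.txt

echo "$without_ext"     # ./sequences/protein1
echo "$sequence_name"   # protein1
echo "$blast_results"   # protein1/protein1_blast.txt
echo "$top_hits"        # protein1/protein1_top_hits.txt
```

Printing the names first like this is a cheap way to confirm where each output will land before starting a long BLAST run.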
