Taking data from a large txt file and entering selective data into a csv











I have a long .txt file produced by another script, and I want to search it for selected pieces of information and write those to a much cleaner .csv file.



Currently the output looks like this (abridged):



>Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq
>Running on 1 core
>Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
>Finished in 7.48 s (260 us/read; 0.23 M reads/minute).

>=== Summary ===

>Total read pairs processed: 28,794
> Read 1 with adapter: 28,248 (98.1%)
> Read 2 with adapter: 3,232 (11.2%)
>Pairs written (passing filters): 28,794 (100.0%)


I want to grab three things and make a CSV file out of them: the sample name, which sits after the last / and before .fastq (MM12-112-pcr-mamm in the excerpt above), the number of total read pairs processed, and the number of pairs written.
The problem is that some samples have no reads processed, and some sample names appear more than once. I've written regex patterns that find my three fields, but I'm having trouble turning the matches into a CSV and entering None when a sample has no reads.



When the script encounters a block like the following, I need to keep the sample name and record 0, None, NA, or something similar rather than dropping the entry.



>Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq
Running on 1 core
Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
No reads processed! Either your input file is empty or you used the wrong -f/--format parameter.
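
A block like the one above can still yield a row if each field is looked up separately and a missing match is recorded as None. The following is a minimal sketch under stated assumptions, not the code from this question: `parse_block` and the three pattern names are hypothetical, and the name pattern assumes every file name ends in `-S<digits>-L<digits>_<digit>.fastq`, as in the two excerpts.

```python
import re

# Hypothetical patterns; they assume the report lines look exactly like the excerpts.
NAME_RE = re.compile(r"([^/\s]+?)-S\d+-L\d+_\d+\.fastq")
PROCESSED_RE = re.compile(r"Total read pairs processed:\s*([\d,]+)")
WRITTEN_RE = re.compile(r"Pairs written \(passing filters\):\s*([\d,]+)")


def _to_int(match):
    """Convert a matched count such as '28,794' to 28794; None if there was no match."""
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def parse_block(block):
    """Return (sample, processed, written) for one report block.

    Returns None only when no sample name is found; missing counts become None
    so that a 'No reads processed!' block still produces a row.
    """
    name = NAME_RE.search(block)
    if name is None:
        return None
    return (
        name.group(1),
        _to_int(PROCESSED_RE.search(block)),
        _to_int(WRITTEN_RE.search(block)),
    )
```

Calling `parse_block` on each chunk of the split report gives one tuple per sample, with None standing in for counts that never appeared.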


This is what I have so far. I was trying to store the results in a named tuple, and I may try a dictionary next, but I'm pretty lost and don't know where to go from here.



import pandas as pd
import re
import collections
from pathlib import Path

data = Path("cutadapt-report.txt").read_text()
split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")


Cleaned = collections.namedtuple('Cleaned', 'Sample_Name Total_Read_Pairs_Processed Pairs_Written')

def clean_adapt(filename):
    try:
        data = Path(filename).read_text()
    except FileNotFoundError:
        return 'Having trouble locating that file, please try again'
    # Split the report into one chunk per run
    split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")
    pattern_pairs = r"(?<=Total read pairs processed: ) *\d+,?\d+"
    pattern_name = r"((?<=QC/fastq/)\S+(?=-S))(?!.*\1)"
    # Literal parentheses in the report line must be escaped
    pattern_written = r"(?<=Pairs written \(passing filters\): ) *\d+,?\d+"
    lines = re.findall(pattern_name, data)
    pp = []
    wr = []
    for entry in split_data:
        # Search each chunk on its own so counts line up with samples
        ok = re.findall(pattern_pairs, entry)
        writ = re.findall(pattern_written, entry)
        pp.append(ok)
        wr.append(writ)

    print(lines)
    # return Cleaned(lines, pp, wr)

clean_adapt("cutadapt-report.txt")


My CSV file should look like this:



Sample ID, Total Read Pairs Processed, Pairs Written
MM12-112-pcr-mamm, 28,794, 28,794
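
One detail in the target layout is worth noting: a count written as 28,794 contains the same comma that separates the columns, so a CSV writer would have to quote it. The sketch below sidesteps that by writing plain integers. It is only an illustration under assumptions: `write_csv` and the `records` list are hypothetical, "NA" is one of the placeholders mentioned above, and keeping only the first occurrence of a repeated sample name is a guess at the desired behaviour.

```python
import csv
import io

# Hypothetical parsed records: (sample, processed, written).
# None marks a sample whose block said "No reads processed!".
records = [
    ("MM12-112-pcr-mamm", 28794, 28794),
    ("MM12-112-pcr-beet", None, None),
    ("MM12-112-pcr-mamm", 28794, 28794),  # repeated sample name
]


def write_csv(rows, handle, missing="NA"):
    """Write one CSV row per distinct sample, keeping the first occurrence."""
    writer = csv.writer(handle)
    writer.writerow(["Sample ID", "Total Read Pairs Processed", "Pairs Written"])
    seen = set()
    for sample, processed, written in rows:
        if sample in seen:
            continue  # assumption: later duplicates are dropped
        seen.add(sample)
        writer.writerow([
            sample,
            missing if processed is None else processed,
            missing if written is None else written,
        ])


# Write to an in-memory buffer here; a real file would work the same way.
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`, which is what the `csv` module documentation recommends to avoid doubled line endings.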









      python pandas csv






      edited Nov 8 at 19:35

























      asked Nov 7 at 18:09









      Molly Cassatt





























