Taking data from a large txt file and entering selective data into a csv
I have a long .txt file output from another script, and I want to search through it for specific bits of information and enter those into a much cleaner .csv file.
Currently my output is like this (abridged):
Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq
Running on 1 core
Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
Finished in 7.48 s (260 us/read; 0.23 M reads/minute).
=== Summary ===
Total read pairs processed: 28,794
  Read 1 with adapter: 28,248 (98.1%)
  Read 2 with adapter: 3,232 (11.2%)
Pairs written (passing filters): 28,794 (100.0%)
I want to grab the sample name (the part after the last / and before .fastq), the number of total read pairs processed, and the number of pairs written, and make a CSV file out of those.
The problem is that not all samples have any reads processed, and some sample names come up multiple times. I've written regex patterns that match my three desired values, but I'm having trouble turning those matches into a CSV and entering None when a sample doesn't have any reads.
When it hits a block like the one below, I need to keep the sample name and enter 0, None, NA, or something similar, rather than throwing the entry out.
Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq
Running on 1 core
Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
No reads processed! Either your input file is empty or you used the wrong -f/--format parameter.
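For just the sample-name part, a small sketch like this might do it (assuming the name is everything in the last path component up to the -S<number> suffix; the example path is taken from the output above):

import re

path = "Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq"
# Take the last path component and strip everything from "-S<digits>" onward.
match = re.search(r"([^/]+?)-S\d+[^/]*\.fastq$", path)
print(match.group(1))  # MM12-112-pcr-beet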
This is what I have so far. I was trying to store the results in a named tuple (maybe I'll try a dictionary next), but I'm pretty lost and don't know where to go from here.
import pandas as pd
import re
import collections
from pathlib import Path

data = Path("cutadapt-report.txt").read_text()
split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")

Cleaned = collections.namedtuple('Cleaned', 'Sample_Name Total_Read_Pairs_Processed Pairs_Written')

def clean_adapt(filename):
    try:
        data = Path(filename).read_text()
    except FileNotFoundError:
        return 'Having trouble locating that file, please try again'
    # Split the report into one chunk per sample.
    split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")
    pattern_pairs = r"(?<=Total read pairs processed: ) *\d+,?\d+"
    pattern_name = r"((?<=QC/fastq/)\S+(?=-S))(?!.*1)"
    pattern_written = r"(?<=Pairs written \(passing filters\): ) *\d+,?\d+"
    lines = re.findall(pattern_name, data)
    pp = []
    wr = []
    for entry in split_data:
        # Search only the current chunk, not the whole list of chunks.
        ok = re.findall(pattern_pairs, entry)
        writ = re.findall(pattern_written, entry)
        pp.append(ok)
        wr.append(writ)
    print(lines)
    # return Cleaned(lines, pp, wr)

clean_adapt("cutadapt-report.txt")
My CSV file should look like this:
Sample ID, Total Read Pairs Processed, Pairs Written
MM12-112-pcr-mamm, 28,794, 28,794
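One possible way to put the pieces together (a rough sketch, not a tested solution: it assumes every per-sample block starts with the same "Command line parameters: ... -g file:" line used in the split above, that the fastq paths always contain QC/fastq/ with the sample name running up to the -S<number> suffix, and that "NA" is an acceptable placeholder; the output filename cutadapt-summary.csv is just an example):

import re
import pandas as pd
from pathlib import Path

def parse_report(filename):
    text = Path(filename).read_text()
    # Each block is assumed to start with the cutadapt command-line echo;
    # the text before the first marker is header junk and is skipped.
    blocks = text.split("Command line parameters: -e .3 -f fastq -g file:")[1:]
    rows = []
    for block in blocks:
        # Sample name: text between "QC/fastq/" and the "-S<digits>" suffix.
        name = re.search(r"QC/fastq/(\S+?)-S\d+", block)
        pairs = re.search(r"Total read pairs processed:\s+([\d,]+)", block)
        written = re.search(r"Pairs written \(passing filters\):\s+([\d,]+)", block)
        rows.append({
            "Sample ID": name.group(1) if name else "NA",
            # Counts keep their commas here; strip them and cast to int
            # if plain integers are needed.
            "Total Read Pairs Processed": pairs.group(1) if pairs else "NA",
            "Pairs Written": written.group(1) if written else "NA",
        })
    return pd.DataFrame(rows)

df = parse_report("cutadapt-report.txt")
df.to_csv("cutadapt-summary.csv", index=False)

Because some sample names can show up in more than one block, the resulting DataFrame could be de-duplicated afterwards if needed, e.g. with df.drop_duplicates(subset="Sample ID").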
python pandas csv
edited Nov 8 at 19:35
asked Nov 7 at 18:09
Molly Cassatt