Taking data from a large txt file and entering selective data into a csv
I have a long .txt file output from another script, and I want to search through it for specific bits of information and enter those into a much cleaner .csv file.
Currently my output is like this (abridged):
Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq
Running on 1 core
Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
Finished in 7.48 s (260 us/read; 0.23 M reads/minute).
=== Summary ===
Total read pairs processed: 28,794
  Read 1 with adapter: 28,248 (98.1%)
  Read 2 with adapter: 3,232 (11.2%)
Pairs written (passing filters): 28,794 (100.0%)
I want to grab the sample name (the part after the last / and before .fastq), the number of total read pairs processed, and the number of pairs written, and make a CSV file out of those.
The problem is that not all samples have any reads processed, and some sample names come up multiple times. I've written regex patterns that match my three desired values, but I'm having trouble turning those matches into a CSV and entering None when a sample doesn't have any reads.
When it hits a block like the one below, I need to keep the sample name and enter 0, None, NA, or something similar, rather than throwing the entry out.
Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq
Running on 1 core
Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
No reads processed! Either your input file is empty or you used the wrong -f/--format parameter.
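For just the sample-name part, a small sketch like this might do it (assuming the name is everything in the last path component up to the -S<number> suffix; the example path is taken from the output above):

import re

path = "Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq"
# Take the last path component and strip everything from "-S<digits>" onward.
match = re.search(r"([^/]+?)-S\d+[^/]*\.fastq$", path)
print(match.group(1))  # MM12-112-pcr-beet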
This is what I have so far. I was trying to store the results in a named tuple (maybe I'll try a dictionary next), but I'm pretty lost and don't know where to go from here.
import pandas as pd
import re
import collections
from pathlib import Path

data = Path("cutadapt-report.txt").read_text()
split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")

Cleaned = collections.namedtuple('Cleaned', 'Sample_Name Total_Read_Pairs_Processed Pairs_Written')

def clean_adapt(filename):
    try:
        data = Path(filename).read_text()
    except FileNotFoundError:
        return 'Having trouble locating that file, please try again'
    # Split the report into one chunk per sample.
    split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")
    pattern_pairs = r"(?<=Total read pairs processed: ) *\d+,?\d+"
    pattern_name = r"((?<=QC/fastq/)\S+(?=-S))(?!.*1)"
    pattern_written = r"(?<=Pairs written \(passing filters\): ) *\d+,?\d+"
    lines = re.findall(pattern_name, data)
    pp = []
    wr = []
    for entry in split_data:
        # Search only the current chunk, not the whole list of chunks.
        ok = re.findall(pattern_pairs, entry)
        writ = re.findall(pattern_written, entry)
        pp.append(ok)
        wr.append(writ)
    print(lines)
    # return Cleaned(lines, pp, wr)

clean_adapt("cutadapt-report.txt")
My CSV file should look like this:
Sample ID, Total Read Pairs Processed, Pairs Written
MM12-112-pcr-mamm, 28,794, 28,794
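One possible way to put the pieces together (a rough sketch, not a tested solution: it assumes every per-sample block starts with the same "Command line parameters: ... -g file:" line used in the split above, that the fastq paths always contain QC/fastq/ with the sample name running up to the -S<number> suffix, and that "NA" is an acceptable placeholder; the output filename cutadapt-summary.csv is just an example):

import re
import pandas as pd
from pathlib import Path

def parse_report(filename):
    text = Path(filename).read_text()
    # Each block is assumed to start with the cutadapt command-line echo;
    # the text before the first marker is header junk and is skipped.
    blocks = text.split("Command line parameters: -e .3 -f fastq -g file:")[1:]
    rows = []
    for block in blocks:
        # Sample name: text between "QC/fastq/" and the "-S<digits>" suffix.
        name = re.search(r"QC/fastq/(\S+?)-S\d+", block)
        pairs = re.search(r"Total read pairs processed:\s+([\d,]+)", block)
        written = re.search(r"Pairs written \(passing filters\):\s+([\d,]+)", block)
        rows.append({
            "Sample ID": name.group(1) if name else "NA",
            # Counts keep their commas here; strip them and cast to int
            # if plain integers are needed.
            "Total Read Pairs Processed": pairs.group(1) if pairs else "NA",
            "Pairs Written": written.group(1) if written else "NA",
        })
    return pd.DataFrame(rows)

df = parse_report("cutadapt-report.txt")
df.to_csv("cutadapt-summary.csv", index=False)

Because some sample names can show up in more than one block, the resulting DataFrame could be de-duplicated afterwards if needed, e.g. with df.drop_duplicates(subset="Sample ID").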
python pandas csv
edited Nov 8 at 19:35
asked Nov 7 at 18:09
Molly Cassatt