How to read multiple gzipped files from S3 into a single RDD with an HTTP request?











I have to download many gzipped files stored on S3, like these:



crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz


To download them, you must add the prefix https://commoncrawl.s3.amazonaws.com/



I have to download and decompress the files, then assemble the content into a single RDD.



Something similar to this:



JavaRDD<String> text =
    sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");


I want to do the equivalent of this code with Spark:



    for (String key : keys) {
        // Download the object and decompress it on the fly
        S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));

        GZIPInputStream gzipStream = new GZIPInputStream(object.getObjectContent());
        InputStreamReader decoder = new InputStreamReader(gzipStream);
        BufferedReader buffered = new BufferedReader(decoder);

        List<String> sitemaps = new ArrayList<>();

        // Keep only the "Sitemap:" lines of each robots.txt record
        String line = buffered.readLine();
        while (line != null) {
            if (line.matches("Sitemap:.*")) {
                sitemaps.add(line);
            }
            line = buffered.readLine();
        }
    }
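Roughly, this is the kind of thing I am aiming for, just as a sketch, not code that works yet. It assumes keys is a List<String> I already have, builds the S3 client inside the closure (the client is not serializable), and uses the Spark 2.x flatMap that returns an Iterator:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;
    import org.apache.spark.api.java.JavaRDD;

    // Distribute the keys, then download and decompress each object on the executors.
    JavaRDD<String> lines = sc.parallelize(keys).flatMap(key -> {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
        S3Object object = s3.getObject(new GetObjectRequest("commoncrawl", key));
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(object.getObjectContent())));
        return reader.lines().iterator();
    });

    // Same filtering as in the loop above, but as an RDD transformation.
    JavaRDD<String> sitemaps = lines.filter(line -> line.startsWith("Sitemap:"));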









java apache-spark amazon-s3 common-crawl

asked Nov 8 at 10:36 by fra96, edited Nov 8 at 11:42


  • There is already a tool which extracts all sitemaps from Common Crawl robots.txt archives: github.com/commoncrawl/cc-mrjob/blob/master/… It's Python and based on mrjob, but it would be easy to port it to Spark, cf. cc-pyspark.
    – Sebastian Nagel
    2 days ago


1 Answer

To read something from S3, you can do this:



sc.textFile("s3n://path/to/dir")


If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory like this:



/root
  /a
    f1.gz
    f2.gz
  /b
    f3.gz


or even this:



/root
  f3.gz
  /a
    f1.gz
    f2.gz


then you should use a wildcard, like sc.textFile("s3n://path/to/dir/*"), and Spark will find the files in dir and its subdirectories.
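Applied to the Common Crawl layout in the question, that would look roughly like the sketch below. It assumes hadoop-aws (the s3a connector) is on the classpath; since commoncrawl is a public bucket, you may need to point fs.s3a.aws.credentials.provider at the anonymous provider. Two caveats: gzip files are not splittable, so each file becomes a single partition, and in my experience each * only matches one path segment, so nest wildcards if you need to go deeper.

    // Anonymous access to the public bucket (adjust to your credentials setup).
    sc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider");

    // All robots.txt WARC files of the segment, combined into a single RDD.
    JavaRDD<String> text = sc.textFile(
        "s3a://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*.warc.gz");

    // Alternatively, pass several explicit paths as a comma-separated list:
    JavaRDD<String> twoFiles = sc.textFile(
        "s3a://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz,"
        + "s3a://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz");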



Beware of this though: the wildcard will work, but you may hit latency issues on S3 in production, and you may want to use the AmazonS3Client to retrieve the paths yourself.
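That listing approach could look roughly like the following sketch (AWS SDK for Java v1; the bucket, prefix, and names like paths are just placeholders for your own values). The idea is to list the keys yourself and hand all paths to textFile in one call, so everything still ends up in a single RDD:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();

    // Collect the full s3a:// paths of every object under the prefix.
    List<String> paths = new ArrayList<>();
    ObjectListing listing = s3.listObjects("commoncrawl",
        "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/");
    while (true) {
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            paths.add("s3a://commoncrawl/" + summary.getKey());
        }
        if (!listing.isTruncated()) {
            break;
        }
        listing = s3.listNextBatchOfObjects(listing); // listings are paged (1000 keys max)
    }

    // textFile accepts a comma-separated list of paths, so this is still one RDD.
    JavaRDD<String> text = sc.textFile(String.join(",", paths));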






answered Nov 8 at 11:04 by Oli

  • Does this work even if the files are not mine?
    – fra96
    Nov 8 at 11:29

  • What do you mean, not yours?
    – Oli
    Nov 8 at 12:48

  • I get this error: The request signature we calculated does not match the signature you provided. Check your key and signing method.
    – fra96
    Nov 8 at 15:33

  • If I use s3n, the error is: Relative path in absolute URI: S3ObjectSummary
    – fra96
    Nov 8 at 15:51

  • Have you tried something like this? sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*")
    – Oli
    Nov 8 at 18:19










