How to read multiple gzipped files from S3 into a single RDD with an HTTP request?
I have to download many gzipped files stored on S3, like these:
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz
To download them, you must add the prefix https://commoncrawl.s3.amazonaws.com/.
I have to download and decompress the files, then assemble their content into a single RDD.
Something similar to this:
JavaRDD<String> text =
sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");
I want to do the equivalent of this code with Spark:
List<String> sitemaps = new ArrayList<>();   // collect matches across all keys
for (String key : keys) {
    S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
    GZIPInputStream gzipStream = new GZIPInputStream(object.getObjectContent());
    InputStreamReader decoder = new InputStreamReader(gzipStream);
    BufferedReader buffered = new BufferedReader(decoder);
    String line = buffered.readLine();
    while (line != null) {
        if (line.matches("Sitemap:.*")) {
            sitemaps.add(line);
        }
        line = buffered.readLine();
    }
}
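For what it's worth, here is a minimal sketch of what this could look like with Spark, assuming Spark 2.x and the AWS SDK for Java v1 on the classpath. The class name is just a placeholder, the two keys from above are hard-coded only to keep the example self-contained, and streams are not explicitly closed, for brevity; this is an illustration, not a tested solution:

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class SitemapLines {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("sitemap-lines");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> keys = Arrays.asList(
            "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz",
            "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz");

        // Distribute the keys, then download and gunzip each object on the executors.
        JavaRDD<String> lines = sc.parallelize(keys).flatMap(key -> {
            // One client per task; the commoncrawl bucket lives in us-east-1.
            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                    .withRegion(Regions.US_EAST_1).build();
            S3Object object = s3.getObject(new GetObjectRequest("commoncrawl", key));
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(object.getObjectContent()), StandardCharsets.UTF_8));
            return reader.lines().iterator();
        });

        // Keep only the "Sitemap:" lines, as in the sequential version above.
        JavaRDD<String> sitemaps = lines.filter(line -> line.startsWith("Sitemap:"));
        System.out.println(sitemaps.count());
        sc.stop();
    }
}

Whether you read the objects through the SDK like this, or let Spark read the S3 paths directly (see the answer below), is mostly a question of how much control you need over listing and decompression.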
java apache-spark amazon-s3 common-crawl
edited Nov 8 at 11:42
asked Nov 8 at 10:36
fra96
84
There is already a tool which extracts all sitemaps from Common Crawl robots.txt archives: github.com/commoncrawl/cc-mrjob/blob/master/… It's Python and based on mrjob, but it would be easy to port it to Spark, cf. cc-pyspark.
– Sebastian Nagel
2 days ago
1 Answer
To read something from S3, you can do this:
sc.textFile("s3n://path/to/dir")
If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory, like this:
/root
  /a
    f1.gz
    f2.gz
  /b
    f3.gz
or even this:
/root
  f3.gz
  /a
    f1.gz
    f2.gz
then you should use a wildcard, like this: sc.textFile("s3n://path/to/dir/*"), and Spark will recursively find the files in dir and its subdirectories.
Beware though: the wildcard will work, but you may hit latency issues on S3 in production, and you may prefer to use the AmazonS3Client to retrieve the paths yourself (see the sketch after the comments below).
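For the Common Crawl layout from the question, a minimal sketch of the same idea, assuming the hadoop-aws module (the s3a:// connector) and its AWS SDK dependency are on the classpath; the anonymous credentials provider is set because the bucket is publicly readable, which also sidesteps the signature error mentioned in the comments below:

// Read every robots.txt WARC of the segment into one RDD via the s3a connector (sketch).
sc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"); // public bucket, no keys needed
JavaRDD<String> text = sc.textFile(
        "s3a://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*.warc.gz");

Each .warc.gz is still read by a single task, since gzip is not splittable, but all matched files end up in one RDD.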
answered Nov 8 at 11:04
Oli
1,149212
Does this work even if the files are not mine?
– fra96
Nov 8 at 11:29
What do you mean not yours?
– Oli
Nov 8 at 12:48
I get this error: "The request signature we calculated does not match the signature you provided. Check your key and signing method."
– fra96
Nov 8 at 15:33
If I use s3n, the error is: "Relative path in absolute URI: S3ObjectSummary".
– fra96
Nov 8 at 15:51
Have you tried something like this? sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*")
– Oli
Nov 8 at 18:19
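Following the suggestion in the answer to retrieve the paths with the AmazonS3Client, here is a hedged sketch of that approach (AWS SDK for Java v1, default credentials chain): list the keys under the segment prefix, build s3a:// URIs from summary.getKey() (passing an S3ObjectSummary itself into the path is presumably what caused the "Relative path in absolute URI: S3ObjectSummary" error), and hand the comma-separated list to textFile, which accepts a list of paths:

// Sketch, AWS SDK for Java v1: list the keys under the prefix, then read them in one call.
// listObjectsV2 returns at most 1000 keys per call; pagination is omitted for brevity.
AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
List<String> paths = new ArrayList<>();
for (S3ObjectSummary summary : s3.listObjectsV2("commoncrawl",
        "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/").getObjectSummaries()) {
    paths.add("s3a://commoncrawl/" + summary.getKey()); // use getKey(), not the summary object
}
JavaRDD<String> text = sc.textFile(String.join(",", paths)); // comma-separated paths become one RDD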