How to read multiple gzipped files from S3 into a single RDD with an HTTP request?
I have to download many gzipped files stored on S3, like these:
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz
To download them, you must add the prefix https://commoncrawl.s3.amazonaws.com/.
I have to download and decompress the files, then assemble their content into a single RDD.
Something similar to this:
JavaRDD<String> text =
sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");
I want to do the equivalent of this code with Spark:
List<String> sitemaps = new ArrayList<>();   // collect matches across all keys
for (String key : keys) {
    S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));
    GZIPInputStream gzipStream = new GZIPInputStream(object.getObjectContent());
    InputStreamReader decoder = new InputStreamReader(gzipStream);
    BufferedReader buffered = new BufferedReader(decoder);
    String line = buffered.readLine();
    while (line != null) {
        if (line.matches("Sitemap:.*")) {
            sitemaps.add(line);
        }
        line = buffered.readLine();
    }
}
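For what it's worth, here is a minimal sketch of what this could look like with Spark, assuming Spark 2.x and the AWS SDK for Java v1 on the classpath. The class name is just a placeholder, the two keys from above are hard-coded only to keep the example self-contained, and streams are not explicitly closed, for brevity; this is an illustration, not a tested solution:

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class SitemapLines {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("sitemap-lines");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> keys = Arrays.asList(
            "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz",
            "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz");

        // Distribute the keys, then download and gunzip each object on the executors.
        JavaRDD<String> lines = sc.parallelize(keys).flatMap(key -> {
            // One client per task; the commoncrawl bucket lives in us-east-1.
            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                    .withRegion(Regions.US_EAST_1).build();
            S3Object object = s3.getObject(new GetObjectRequest("commoncrawl", key));
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(object.getObjectContent()), StandardCharsets.UTF_8));
            return reader.lines().iterator();
        });

        // Keep only the "Sitemap:" lines, as in the sequential version above.
        JavaRDD<String> sitemaps = lines.filter(line -> line.startsWith("Sitemap:"));
        System.out.println(sitemaps.count());
        sc.stop();
    }
}

Whether you read the objects through the SDK like this, or let Spark read the S3 paths directly (see the answer below), is mostly a question of how much control you need over listing and decompression.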
java apache-spark amazon-s3 common-crawl
edited Nov 8 at 11:42
asked Nov 8 at 10:36
fra96
84
There is already a tool which extracts all sitemaps from Common Crawl robots.txt archives: github.com/commoncrawl/cc-mrjob/blob/master/… It's Python and based on mrjob, but it would be easy to port it to Spark, cf. cc-pyspark.
– Sebastian Nagel
2 days ago
1 Answer
To read something from S3, you can do this:
sc.textFile("s3n://path/to/dir")
If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory, like this:
/root
  /a
    f1.gz
    f2.gz
  /b
    f3.gz
or even this:
/root
  f3.gz
  /a
    f1.gz
    f2.gz
then you should use a wildcard, like this: sc.textFile("s3n://path/to/dir/*"), and Spark will recursively find the files in dir and its subdirectories.
Beware though: the wildcard will work, but you may hit latency issues on S3 in production, and you may prefer to use the AmazonS3Client to retrieve the paths yourself (see the sketch after the comments below).
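For the Common Crawl layout from the question, a minimal sketch of the same idea, assuming the hadoop-aws module (the s3a:// connector) and its AWS SDK dependency are on the classpath; the anonymous credentials provider is set because the bucket is publicly readable, which also sidesteps the signature error mentioned in the comments below:

// Read every robots.txt WARC of the segment into one RDD via the s3a connector (sketch).
sc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"); // public bucket, no keys needed
JavaRDD<String> text = sc.textFile(
        "s3a://commoncrawl/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*.warc.gz");

Each .warc.gz is still read by a single task, since gzip is not splittable, but all matched files end up in one RDD.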
answered Nov 8 at 11:04
Oli
1,149212
Does this work even if the files are not mine?
– fra96
Nov 8 at 11:29
What do you mean not yours?
– Oli
Nov 8 at 12:48
I get this error: "The request signature we calculated does not match the signature you provided. Check your key and signing method."
– fra96
Nov 8 at 15:33
If I use s3n, the error is: "Relative path in absolute URI: S3ObjectSummary".
– fra96
Nov 8 at 15:51
Have you tried something like this? sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/*")
– Oli
Nov 8 at 18:19
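Following the suggestion in the answer to retrieve the paths with the AmazonS3Client, here is a hedged sketch of that approach (AWS SDK for Java v1, default credentials chain): list the keys under the segment prefix, build s3a:// URIs from summary.getKey() (passing an S3ObjectSummary itself into the path is presumably what caused the "Relative path in absolute URI: S3ObjectSummary" error), and hand the comma-separated list to textFile, which accepts a list of paths:

// Sketch, AWS SDK for Java v1: list the keys under the prefix, then read them in one call.
// listObjectsV2 returns at most 1000 keys per call; pagination is omitted for brevity.
AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
List<String> paths = new ArrayList<>();
for (S3ObjectSummary summary : s3.listObjectsV2("commoncrawl",
        "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/").getObjectSummaries()) {
    paths.add("s3a://commoncrawl/" + summary.getKey()); // use getKey(), not the summary object
}
JavaRDD<String> text = sc.textFile(String.join(",", paths)); // comma-separated paths become one RDD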