Python: make a list generator JSON serializable
How can I concatenate a list of JSON files into one huge JSON array? I have 5,000 files and 550,000 list items.
My first try was to use jq, but it looks like jq -s is not optimized for large input.
jq -s -r '[.]' *.js
This command works, but takes way too long to complete, and I would really like to solve this with Python.
Here is my current code:
import json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)
I'm getting:
TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable
Any attempt to load all the files into RAM triggers the Linux OOM killer. Do you have any ideas?
python json out-of-memory generator

asked Feb 9 '14 at 19:25 – Sebastian Wagner
How about just textually concatenating the documents, inserting commas between them? – bereal, Feb 9 '14 at 19:30
You need to remove the outer array of each file. Removing the first and last character of each file should work, but I'd like to control (and remove) the JSON indentation. – Sebastian Wagner, Feb 9 '14 at 19:37
How large are the files actually? Could it be that holding the complete serialized data is larger than your memory? – Alex, Feb 9 '14 at 20:03
Yes, that's why calling list(..) is not going to work. – Sebastian Wagner, Feb 9 '14 at 20:08
Do you also need to validate the JSON before processing it? If not, there is no need to convert string -> JSON -> string. Just put commas between each filestream and surround with [ and ]. – Joel Cornett, Jun 5 '14 at 6:28
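A minimal sketch of the streaming approach suggested in these comments: parse one input file at a time and write items straight to the output, so memory stays bounded by the largest single file. The function name is hypothetical, and it assumes every input file holds one top-level JSON array:

import json

def concat_files_streaming(out_name, in_file_names):
    # Write the output array element by element; only one input
    # file is ever held in memory at a time.
    with open(out_name, 'w') as out:
        out.write('[')
        first = True
        for in_name in in_file_names:
            with open(in_name, 'r') as f:
                items = json.load(f)  # assumes a top-level JSON array per file
            for item in items:
                if not first:
                    out.write(', ')
                json.dump(item, out)  # re-dumping normalizes the indentation
                first = False
        out.write(']')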
4 Answers
Accepted answer (16 votes):
You should derive from list and override the __iter__ method.
import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

    # according to the comment below
    def __len__(self):
        return 1

a = [1, 2, 3]
b = StreamArray()
print(json.dumps([1, a, b]))
The result is [1, [1, 2, 3], [20, 30, 40]].

answered Jun 4 '14 at 9:04, edited Mar 25 '15 at 14:18 – Vadim Pushtaev
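Applied to the question's concatFiles, a sketch of the same idea (my adaptation, not part of the answer). It assumes the wrapped generator is consumed exactly once; also note that, depending on the Python version, the C-accelerated encoder may still materialize the items internally, while passing indent forces the streaming pure-Python encoder:

import json

class StreamArray(list):
    # Wraps a generator so json.dump serializes it as a JSON array.
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self.generator

    def __len__(self):
        return 1  # a lie, so the encoder doesn't short-circuit on an "empty" list

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(StreamArray(listGenerator()), f)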
With Python 2.7.8, the StreamArray class also has to override the __len__ method and return a value greater than 0 (1, for instance). Otherwise the json encoder doesn't even call the __iter__ method. – Tristan, Mar 25 '15 at 8:56
Please note that this solution creates invalid JSON when used with the indent parameter and the iterable is "empty": json.dumps({"products": StreamArray()}, indent=2) # {"products": ]} – Mišo, May 25 '16 at 13:26
I believe we should not return 1 for the length if the iterable is "empty". – Vadim Pushtaev, May 25 '16 at 16:18
This is great - cheers. – frankster, May 10 '17 at 16:29
Answer (19 votes):
As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array:
# Since simplejson is backwards compatible, you should feel free to import
# it as `json`
import simplejson as json
json.dumps((i*i for i in range(10)), iterable_as_array=True)
The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].

answered Jul 20 '15 at 13:28 – Nick Babcock
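For the original task, the same option lets the asker's generator be dumped directly (a sketch of my own; it assumes simplejson >= 3.8.0 is installed):

import simplejson as json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        # iterable_as_array makes dump accept the generator and emit a JSON array
        json.dump(listGenerator(), f, iterable_as_array=True)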
Answer (5 votes):
A complete, simple, readable solution that can serialize a generator from a normal or empty iterable and works with .encode() or .iterencode(). Written tests. Tested with Python 2.7, 3.0, 3.3 and 3.6.
import itertools

class SerializableGenerator(list):
    """Generator that is serializable by JSON

    It is useful for serializing huge data by JSON
    >>> json.dumps(SerializableGenerator(iter([1, 2])))
    "[1, 2]"
    >>> json.dumps(SerializableGenerator(iter([])))
    "[]"

    It can be used in a generator of json chunks used e.g. for a stream
    >>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter([1])))
    >>> tuple(iter_json)
    ('[1', ']')
    # >>> for chunk in iter_json:
    # ...     stream.write(chunk)

    # >>> SerializableGenerator((x for x in range(3)))
    # [<generator object <genexpr> at 0x7f858b5180f8>]
    """

    def __init__(self, iterable):
        tmp_body = iter(iterable)
        try:
            self._head = iter([next(tmp_body)])
            self.append(tmp_body)
        except StopIteration:
            self._head = []

    def __iter__(self):
        return itertools.chain(self._head, *self[:1])
# -- test --

import unittest
import json

class Test(unittest.TestCase):

    def combined_dump_assert(self, iterable, expect):
        self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)

    def combined_iterencode_assert(self, iterable, expect):
        encoder = json.JSONEncoder().iterencode
        self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)

    def test_dump_data(self):
        self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')

    def test_dump_empty(self):
        self.combined_dump_assert(iter([]), '[]')

    def test_iterencode_data(self):
        self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))

    def test_iterencode_empty(self):
        self.combined_iterencode_assert(iter([]), ('[]',))

    def test_that_all_data_are_consumed(self):
        gen = SerializableGenerator(iter([1, 2]))
        list(gen)
        self.assertEqual(list(gen), [])
Used solutions: Vadim Pushtaev (incomplete), user1158559 (unnecessarily complicated) and Claude (in another question, also complicated).
Useful simplifications are:
- It is not necessary to evaluate the first item lazily; it can be done in __init__, because we can expect the SerializableGenerator to be constructed immediately before json.dumps (unlike user1158559's solution).
- It is not necessary to override many methods with NotImplementedError, because that still doesn't cover all methods, e.g. __repr__. It is better to also store the generator in the list, to provide meaningful results like [<generator object ...>] (unlike Claude's). The default __len__ and __bool__ methods then work correctly to recognize empty and non-empty objects, as the quick check below shows.
An advantage of this solution is that a standard JSON serializer can be used without extra params. If nested generators should be supported, or if wrapping with SerializableGenerator(iterator) is undesirable, then I recommend the IterEncoder answer.
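A quick check of that last point (my own sketch; the values follow from the tail generator being stored in the underlying list):

empty = SerializableGenerator(iter([]))
full = SerializableGenerator(iter([1, 2]))
print(len(empty), bool(empty))  # 0 False -- nothing was appended
print(len(full), bool(full))    # 1 True  -- the list holds the tail generator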
edited Nov 9 at 18:58, answered Oct 20 '17 at 3:19 – hynekcer

Nicely done, and +1 for having tests! – user1158559, Oct 21 '17 at 11:04
Answer (2 votes):
Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:
- The suggestion that self.__tail__ might be immutable
- That len(StreamArray(some_gen)) is either 0 or 1
class StreamArray(list):

    def __init__(self, gen):
        self.gen = gen

    def destructure(self):
        try:
            return self.__head__, self.__tail__, self.__len__
        except AttributeError:
            try:
                self.__head__ = self.gen.__next__()
                self.__tail__ = self.gen
                self.__len__ = 1  # A lie
            except StopIteration:
                self.__head__ = None
                self.__tail__ = []
                self.__len__ = 0
            return self.__head__, self.__tail__, self.__len__

    def rebuilt_gen(self):
        def rebuilt_gen_inner():
            head, tail, len_ = self.destructure()
            if len_ > 0:
                yield head
            for elem in tail:
                yield elem
        try:
            return self.__rebuilt_gen__
        except AttributeError:
            self.__rebuilt_gen__ = rebuilt_gen_inner()
            return self.__rebuilt_gen__

    def __iter__(self):
        return self.rebuilt_gen()

    def __next__(self):
        return self.rebuilt_gen()

    def __len__(self):
        return self.destructure()[2]
Single use only!
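Illustrating the single-use caveat above (my own sketch; the rebuilt generator is cached, so a second dump no longer sees the data):

import json

sa = StreamArray(iter([1, 2, 3]))
print(json.dumps(sa))  # [1, 2, 3]
print(json.dumps(sa))  # emits no items (typically "[]"): the cached generator is exhausted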
edited Oct 12 '17 at 18:58, answered Oct 5 '17 at 16:32 – user1158559

+1: Your solution works, but it is too complicated. I think I implemented the same thing more simply. Look at mine and see if you find any disadvantage compared to it. – hynekcer, Oct 20 '17 at 3:24
Yours looks fine! For my use case, lazily evaluating the first item is a feature. In hindsight there might be some simplification to be gained from itertools. Very pleased to know that this works as is. – user1158559, Oct 21 '17 at 11:09