Python: make a list generator JSON serializable
How can I concatenate a list of JSON files into one huge JSON array? I have 5,000 files and 550,000 list items.
My first try was to use jq, but it looks like jq -s is not optimized for large input.
jq -s -r '[.]' *.js
This command works, but takes way too long to complete, and I would really like to solve this with Python.
Here is my current code:
import json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)
I'm getting:
TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable
Any attempt to load all the files into RAM triggers the Linux OOM killer. Do you have any ideas?
python json out-of-memory generator

asked Feb 9 '14 at 19:25 – Sebastian Wagner
How about just textually concatenating the documents, inserting commas between them? – bereal, Feb 9 '14 at 19:30
You need to remove the outer array of each file. Removing the first and last character of each file should work, but I'd like to control (and remove) the JSON indentation. – Sebastian Wagner, Feb 9 '14 at 19:37
How large are the files actually? Could it be that holding the complete serialized data is larger than your memory? – Alex, Feb 9 '14 at 20:03
Yes, that's why calling list(..) is not going to work. – Sebastian Wagner, Feb 9 '14 at 20:08
Do you also need to validate the JSON before processing it? If not, there is no need to convert string -> JSON -> string. Just put commas between each filestream and surround with [ and ]. – Joel Cornett, Jun 5 '14 at 6:28
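A minimal sketch of the streaming approach suggested in these comments: parse one input file at a time and write items straight to the output, so memory stays bounded by the largest single file. The function name is hypothetical, and it assumes every input file holds one top-level JSON array:

import json

def concat_files_streaming(out_name, in_file_names):
    # Write the output array element by element; only one input
    # file is ever held in memory at a time.
    with open(out_name, 'w') as out:
        out.write('[')
        first = True
        for in_name in in_file_names:
            with open(in_name, 'r') as f:
                items = json.load(f)  # assumes a top-level JSON array per file
            for item in items:
                if not first:
                    out.write(', ')
                json.dump(item, out)  # re-dumping normalizes the indentation
                first = False
        out.write(']')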
4 Answers
Accepted answer (16 votes):
You should derive from list and override the __iter__ method.
import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

    # according to the comment below
    def __len__(self):
        return 1

a = [1, 2, 3]
b = StreamArray()
print(json.dumps([1, a, b]))
The result is [1, [1, 2, 3], [20, 30, 40]].

answered Jun 4 '14 at 9:04, edited Mar 25 '15 at 14:18 – Vadim Pushtaev
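Applied to the question's concatFiles, a sketch of the same idea (my adaptation, not part of the answer). It assumes the wrapped generator is consumed exactly once; also note that, depending on the Python version, the C-accelerated encoder may still materialize the items internally, while passing indent forces the streaming pure-Python encoder:

import json

class StreamArray(list):
    # Wraps a generator so json.dump serializes it as a JSON array.
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self.generator

    def __len__(self):
        return 1  # a lie, so the encoder doesn't short-circuit on an "empty" list

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(StreamArray(listGenerator()), f)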
With Python 2.7.8, the StreamArray class also has to override the __len__ method and return a value greater than 0 (1, for instance). Otherwise the json encoder doesn't even call the __iter__ method. – Tristan, Mar 25 '15 at 8:56
Please note that this solution creates invalid JSON when used with the indent parameter and the iterable is "empty": json.dumps({"products": StreamArray()}, indent=2) # {"products": ]} – Mišo, May 25 '16 at 13:26
I believe we should not return 1 for the length if the iterable is "empty". – Vadim Pushtaev, May 25 '16 at 16:18
This is great - cheers. – frankster, May 10 '17 at 16:29
Answer (19 votes):
As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array:
# Since simplejson is backwards compatible, you should feel free to import
# it as `json`
import simplejson as json
json.dumps((i*i for i in range(10)), iterable_as_array=True)
The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].

answered Jul 20 '15 at 13:28 – Nick Babcock
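For the original task, the same option lets the asker's generator be dumped directly (a sketch of my own; it assumes simplejson >= 3.8.0 is installed):

import simplejson as json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        # iterable_as_array makes dump accept the generator and emit a JSON array
        json.dump(listGenerator(), f, iterable_as_array=True)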
Answer (5 votes):
A complete, simple, readable solution that can serialize a generator from a normal or empty iterable and works with .encode() or .iterencode(). Written tests. Tested with Python 2.7, 3.0, 3.3 and 3.6.
import itertools

class SerializableGenerator(list):
    """Generator that is serializable by JSON

    It is useful for serializing huge data by JSON
    >>> json.dumps(SerializableGenerator(iter([1, 2])))
    "[1, 2]"
    >>> json.dumps(SerializableGenerator(iter([])))
    "[]"

    It can be used in a generator of json chunks used e.g. for a stream
    >>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter([1])))
    >>> tuple(iter_json)
    ('[1', ']')
    # >>> for chunk in iter_json:
    # ...     stream.write(chunk)

    # >>> SerializableGenerator((x for x in range(3)))
    # [<generator object <genexpr> at 0x7f858b5180f8>]
    """

    def __init__(self, iterable):
        tmp_body = iter(iterable)
        try:
            self._head = iter([next(tmp_body)])
            self.append(tmp_body)
        except StopIteration:
            self._head = []

    def __iter__(self):
        return itertools.chain(self._head, *self[:1])
# -- test --

import unittest
import json

class Test(unittest.TestCase):

    def combined_dump_assert(self, iterable, expect):
        self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)

    def combined_iterencode_assert(self, iterable, expect):
        encoder = json.JSONEncoder().iterencode
        self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)

    def test_dump_data(self):
        self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')

    def test_dump_empty(self):
        self.combined_dump_assert(iter([]), '[]')

    def test_iterencode_data(self):
        self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))

    def test_iterencode_empty(self):
        self.combined_iterencode_assert(iter([]), ('[]',))

    def test_that_all_data_are_consumed(self):
        gen = SerializableGenerator(iter([1, 2]))
        list(gen)
        self.assertEqual(list(gen), [])
Used solutions: Vadim Pushtaev (incomplete), user1158559 (unnecessarily complicated) and Claude (in another question, also complicated).
Useful simplifications are:
- It is not necessary to evaluate the first item lazily; it can be done in __init__, because we can expect the SerializableGenerator to be constructed immediately before json.dumps (unlike user1158559's solution).
- It is not necessary to override many methods with NotImplementedError, because that still doesn't cover all methods, e.g. __repr__. It is better to also store the generator in the list, to provide meaningful results like [<generator object ...>] (unlike Claude's). The default __len__ and __bool__ methods then work correctly to recognize empty and non-empty objects, as the quick check below shows.
An advantage of this solution is that a standard JSON serializer can be used without extra params. If nested generators should be supported, or if wrapping with SerializableGenerator(iterator) is undesirable, then I recommend the IterEncoder answer.
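A quick check of that last point (my own sketch; the values follow from the tail generator being stored in the underlying list):

empty = SerializableGenerator(iter([]))
full = SerializableGenerator(iter([1, 2]))
print(len(empty), bool(empty))  # 0 False -- nothing was appended
print(len(full), bool(full))    # 1 True  -- the list holds the tail generator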
edited Nov 9 at 18:58, answered Oct 20 '17 at 3:19 – hynekcer

Nicely done, and +1 for having tests! – user1158559, Oct 21 '17 at 11:04
Answer (2 votes):
Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:
- The suggestion that self.__tail__ might be immutable
- That len(StreamArray(some_gen)) is either 0 or 1
class StreamArray(list):

    def __init__(self, gen):
        self.gen = gen

    def destructure(self):
        try:
            return self.__head__, self.__tail__, self.__len__
        except AttributeError:
            try:
                self.__head__ = self.gen.__next__()
                self.__tail__ = self.gen
                self.__len__ = 1  # A lie
            except StopIteration:
                self.__head__ = None
                self.__tail__ = []
                self.__len__ = 0
            return self.__head__, self.__tail__, self.__len__

    def rebuilt_gen(self):
        def rebuilt_gen_inner():
            head, tail, len_ = self.destructure()
            if len_ > 0:
                yield head
            for elem in tail:
                yield elem
        try:
            return self.__rebuilt_gen__
        except AttributeError:
            self.__rebuilt_gen__ = rebuilt_gen_inner()
            return self.__rebuilt_gen__

    def __iter__(self):
        return self.rebuilt_gen()

    def __next__(self):
        return self.rebuilt_gen()

    def __len__(self):
        return self.destructure()[2]
Single use only!
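Illustrating the single-use caveat above (my own sketch; the rebuilt generator is cached, so a second dump no longer sees the data):

import json

sa = StreamArray(iter([1, 2, 3]))
print(json.dumps(sa))  # [1, 2, 3]
print(json.dumps(sa))  # emits no items (typically "[]"): the cached generator is exhausted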
edited Oct 12 '17 at 18:58, answered Oct 5 '17 at 16:32 – user1158559

+1: Your solution works, but it is too complicated. I think I implemented the same thing more simply. Look at mine and see if you find any disadvantage compared to it. – hynekcer, Oct 20 '17 at 3:24
Yours looks fine! For my use case, lazily evaluating the first item is a feature. In hindsight there might be some simplification to be gained from itertools. Very pleased to know that this works as is. – user1158559, Oct 21 '17 at 11:09