Python: make a list generator JSON serializable

How can I concatenate a list of JSON files into one huge JSON array? I have 5000 files with 550,000 list items in total.

My first try was to use jq, but it looks like jq -s is not optimized for a large input.

jq -s -r '[.]' *.js

This command works, but takes way too long to complete, and I would really like to solve this with Python.

Here is my current code:

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)

I'm getting:

TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable

Any attempt to load all the files into RAM triggers the Linux OOM killer. Do you have any ideas?

Tags: python json out-of-memory generator

asked Feb 9 '14 at 19:25 by Sebastian Wagner

  • How about just textually concatenating the documents, inserting commas between them? (See the sketch after these comments.)
    – bereal
    Feb 9 '14 at 19:30

  • You need to remove the outer array of each file. Removing the first and last character of each file should work, but I'd like to control (and remove) the JSON indentation.
    – Sebastian Wagner
    Feb 9 '14 at 19:37

  • How large are the files actually? Could it be that holding the complete serialized data is larger than your memory?
    – Alex
    Feb 9 '14 at 20:03

  • Yes, that's why calling list(..) is not going to work.
    – Sebastian Wagner
    Feb 9 '14 at 20:08

  • Do you also need to validate the JSON before processing it? If not, there is no need to convert string -> JSON -> string. Just put commas between each filestream and surround the whole thing with [ ].
    – Joel Cornett
    Jun 5 '14 at 6:28
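
A minimal sketch of the textual approach suggested in the comments above, assuming every input file contains exactly one (possibly indented) top-level JSON array; the function name and the bracket-stripping logic are illustrative, not from the question:

def concat_files_textually(outName, inFileNames):
    # Stream one file at a time instead of holding all 5000 in RAM.
    with open(outName, 'w') as out:
        out.write('[')
        first = True
        for inName in inFileNames:
            with open(inName, 'r') as f:
                body = f.read().strip()
            # Drop the outer brackets of this file's array (assumes the
            # file really is a single top-level array).
            body = body[1:-1].strip()
            if body:
                if not first:
                    out.write(',')
                out.write(body)
                first = False
        out.write(']')
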
4 Answers

Answer (accepted, 16 votes), answered Jun 4 '14 at 9:04 by Vadim Pushtaev, edited Mar 25 '15 at 14:18

You should derive from list and override the __iter__ method.

import json

def gen():
    yield 20
    yield 30
    yield 40

class StreamArray(list):
    def __iter__(self):
        return gen()

    # according to the comment below
    def __len__(self):
        return 1

a = [1, 2, 3]
b = StreamArray()

print(json.dumps([1, a, b]))

The result is [1, [1, 2, 3], [20, 30, 40]].
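
Applied to the question's concatFiles, a sketch might look like the following. json.dump uses the streaming pure-Python encoder, so only one parsed input file needs to be in memory at a time; the __len__ lie and its edge cases are discussed in the comments below:

import json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    class StreamArray(list):
        def __iter__(self):
            return listGenerator()

        # Pretend to be non-empty so the encoder bothers to iterate;
        # see the comments below for the empty-input caveat.
        def __len__(self):
            return 1

    with open(outName, 'w') as f:
        json.dump(StreamArray(), f)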

  • With Python 2.7.8, the StreamArray class also has to override the __len__ method and return a value greater than 0 (1 for instance). Otherwise the JSON encoder doesn't even call the __iter__ method.
    – Tristan
    Mar 25 '15 at 8:56

  • Please note that this solution creates invalid JSON when used with the indent parameter and the iterable is "empty": json.dumps({"products": StreamArray()}, indent=2) # {"products": ]}
    – Mišo
    May 25 '16 at 13:26

  • I believe we should not return 1 for the length if the iterable is "empty".
    – Vadim Pushtaev
    May 25 '16 at 16:18

  • This is great - cheers.
    – frankster
    May 10 '17 at 16:29

Answer (19 votes), answered Jul 20 '15 at 13:28 by Nick Babcock

As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array:

# Since simplejson is backwards compatible, you should feel free to import
# it as `json`
import simplejson as json
json.dumps((i*i for i in range(10)), iterable_as_array=True)

The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].
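
A sketch of this applied to the question's concatFiles, assuming simplejson.dump forwards iterable_as_array to its encoder the same way dumps does; the whole concatenation then collapses to:

import simplejson as json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f, iterable_as_array=True)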

Answer (5 votes), answered Oct 20 '17 at 3:19 by hynekcer, edited Nov 9 at 18:58

A complete, simple, readable solution that can serialize a generator from a normal or an empty iterable, and works with both .encode() and .iterencode(). Tests are included. Tested with Python 2.7, 3.0, 3.3 and 3.6.

import itertools


class SerializableGenerator(list):
    """Generator that is serializable by JSON

    It is useful for serializing huge data by JSON
    >>> json.dumps(SerializableGenerator(iter([1, 2])))
    '[1, 2]'
    >>> json.dumps(SerializableGenerator(iter([])))
    '[]'

    It can be used in a generator of json chunks used e.g. for a stream
    >>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter([1])))
    >>> tuple(iter_json)
    ('[1', ']')
    # >>> for chunk in iter_json:
    # ...     stream.write(chunk)
    # >>> SerializableGenerator((x for x in range(3)))
    # [<generator object <genexpr> at 0x7f858b5180f8>]
    """

    def __init__(self, iterable):
        tmp_body = iter(iterable)
        try:
            self._head = iter([next(tmp_body)])
            self.append(tmp_body)
        except StopIteration:
            self._head = []

    def __iter__(self):
        return itertools.chain(self._head, *self[:1])


# -- test --

import unittest
import json


class Test(unittest.TestCase):

    def combined_dump_assert(self, iterable, expect):
        self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)

    def combined_iterencode_assert(self, iterable, expect):
        encoder = json.JSONEncoder().iterencode
        self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)

    def test_dump_data(self):
        self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')

    def test_dump_empty(self):
        self.combined_dump_assert(iter([]), '[]')

    def test_iterencode_data(self):
        self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))

    def test_iterencode_empty(self):
        self.combined_iterencode_assert(iter([]), ('[]',))

    def test_that_all_data_are_consumed(self):
        gen = SerializableGenerator(iter([1, 2]))
        list(gen)
        self.assertEqual(list(gen), [])

This builds on earlier solutions: Vadim Pushtaev (incomplete), user1158559 (unnecessarily complicated) and Claude (in another question, also complicated).

Useful simplifications are:

  • It is not necessary to evaluate the first item lazily; it can be done in __init__, because we can expect the SerializableGenerator to be constructed immediately before json.dumps. (Unlike user1158559's solution.)

  • It is not necessary to override many methods with NotImplementedError, because that still does not cover all methods, e.g. __repr__. It is better to store the generator in the list as well, so that it provides meaningful results like [<generator object ...>]. (Unlike Claude's solution.) The default __len__ and __bool__ methods then work correctly to recognize empty and non-empty objects.

An advantage of this solution is that a standard JSON serializer can be used without extra parameters. If nested generators should be supported, or if wrapping each iterator in SerializableGenerator(iterator) is undesirable, then I recommend the IterEncoder answer.
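
A brief usage sketch following the docstring above: encoding chunk by chunk with iterencode means the whole output document never exists in memory at once. It assumes the SerializableGenerator class defined above; write_chunks and huge_rows are illustrative names, not part of the answer:

import json

def write_chunks(outName, huge_rows):
    # huge_rows: any (possibly huge) iterator of JSON-serializable items.
    encoder = json.JSONEncoder()
    with open(outName, 'w') as f:
        for chunk in encoder.iterencode(SerializableGenerator(huge_rows)):
            f.write(chunk)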

  • Nicely done, and +1 for having tests!
    – user1158559
    Oct 21 '17 at 11:04

Answer (2 votes), answered Oct 5 '17 at 16:32 by user1158559, edited Oct 12 '17 at 18:58

Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:

  1. The suggestion that self.__tail__ might be immutable
  2. len(StreamArray(some_gen)) is either 0 or 1

class StreamArray(list):

    def __init__(self, gen):
        self.gen = gen

    def destructure(self):
        try:
            return self.__head__, self.__tail__, self.__len__
        except AttributeError:
            try:
                self.__head__ = self.gen.__next__()
                self.__tail__ = self.gen
                self.__len__ = 1  # A lie
            except StopIteration:
                self.__head__ = None
                self.__tail__ = []
                self.__len__ = 0
            return self.__head__, self.__tail__, self.__len__

    def rebuilt_gen(self):
        def rebuilt_gen_inner():
            head, tail, len_ = self.destructure()
            if len_ > 0:
                yield head
            for elem in tail:
                yield elem
        try:
            return self.__rebuilt_gen__
        except AttributeError:
            self.__rebuilt_gen__ = rebuilt_gen_inner()
            return self.__rebuilt_gen__

    def __iter__(self):
        return self.rebuilt_gen()

    def __next__(self):
        return self.rebuilt_gen()

    def __len__(self):
        return self.destructure()[2]

Single use only!

  • +1: Your solution works, but it is too complicated. I think I implemented the same thing more simply; have a look at mine and see if you find any disadvantage.
    – hynekcer
    Oct 20 '17 at 3:24

  • Yours looks fine! For my use case, lazily evaluating the first item is a feature. In hindsight there might be some simplification to be gained from itertools. Very pleased to know that this works as is.
    – user1158559
    Oct 21 '17 at 11:09