Proxying Python file-like objects for fun and profit

Categories: English Geeky Uncategorized

As part of a project I’m working on, I wanted to be able to do some “side processing” while writing to a file-like object. The processing is basically checksumming on-the-fly. I’m essentially doing something like:

source = get_a_readable_file_like_object()
destination = get_a_writable_file_like_object()

destination.write(source.read())

what I’d like is to be able to also get the data read from source and use hashlib’s update mechanism to get a checksum of the object. The easiest way to do it would be using temporary storage (an actual file or a StringIO), but I’d prefer to avoid that since the files can be quite large. The second way to do it is to read the source twice. But since that may come from a network, it makes no sense to read it twice just to get the checksum. A third way would be to have destination be a file-like derivative that updates an internal hash with each read block from source, and then provides a way to retrieve the hash.

Instead of creating my own file-like where I’d mostly be “passing through” all the calls to the underlying destination object (which incidentally also writes to a network resource), I decided to use padme which already should do most of what I need. I just needed to unproxy a couple of methods, add a new method to retrieve the checksum at the end, and presto.

A first implementation looks like this:

#!/usr/bin/python
from __future__ import print_function
import urllib2 as requestlib
import hashlib
import padme

class sha256file(padme.proxy):
    @padme.unproxied
    def __init__(self, *args, **kwargs):
        self.hash = hashlib.new('sha256')
        return super(sha256file, self).__init__()

    @padme.unproxied
    def write(self, data):
        self.hash.update(data)
        return super(sha256file, self).write(data)

    @padme.unproxied
    def getsha256(self):
        return self.hash.hexdigest()

url = "http://www.canonical.com"
request = requestlib.Request(url)

reader = requestlib.urlopen(request)
with open("output.html", "wb") as destfile:
    proxy_destfile = sha256file(destfile)
    for read_chunk in reader:
        proxy_destfile.write(read_chunk)
print("SHA256 is {}".format(proxy_destfile.getsha256()))

This however doesn’t work for reasons I was unable to fathom on my own:

python ./cp2.py
Traceback (most recent call last):
   File "./cp2.py", line 33, in
     proxy_destfile.write(read_chunk)
   File "./cp2.py", line 20, in write
     return super(sha256file, self).write(data)
AttributeError: 'super' object has no attribute 'write'

This is clearly because super(sha256file, self) refers to the *class* and I need the *instance* which is the one with the write method. So Zygmunt helped me get a working version ready:

#!/usr/bin/python
from __future__ import print_function
try:
    import urllib2 as requestlib
except:
    from urllib import request as requestlib
import hashlib
import padme


from padme import _logger


class stateful_proxy(padme.proxy):

    @padme.unproxied
    def add_proxy_state(self, *names):
        """ make all of the names listed proxy state attributes """
        cls = type(self)
        cls.__unproxied__ = set(cls.__unproxied__)
        cls.__unproxied__.update(names)
        cls.__unproxied__ = frozenset(cls.__unproxied__)

    def __setattr__(self, name, value):
        cls = type(self)
        if name not in cls.__unproxied__:
            proxiee = cls.__proxiee__
            _logger.debug("__setattr__ %r on proxiee (%r)", name, proxiee)
            setattr(proxiee, name, value)
        else:
            _logger.debug("__setattr__ %r on proxy itself", name)
            object.__setattr__(self, name, value)

    def __delattr__(self, name):
        cls = type(self)
        if name not in cls.__unproxied__:
            proxiee = type(self).__proxiee__
            _logger.debug("__delattr__ %r on proxiee (%r)", name, proxiee)
            delattr(proxiee, name)
        else:
            _logger.debug("__delattr__ %r on proxy itself", name)
            object.__delattr__(self, name)


class sha256file(stateful_proxy):

    @padme.unproxied
    def __init__(self, *args, **kwargs):
        # Declare 'hash' as a state variable of the proxy itself
        self.add_proxy_state('_hash')
        self._hash = hashlib.new('sha256')
        return super(sha256file, self).__init__(*args, **kwargs)

    @padme.unproxied
    def write(self, data):
        self._hash.update(data)
        return type(self).__proxiee__.write(data)

    @padme.unproxied
    def getsha256(self):
        return self._hash.hexdigest()


url = "http://www.canonical.com"
request = requestlib.Request(url)

reader = requestlib.urlopen(request)
with open("output.html", "wb") as destfile:
    proxy_destfile = sha256file(destfile)
    for read_chunk in reader:
        proxy_destfile.write(read_chunk)
print("SHA256 is {}".format(proxy_destfile.getsha256()))

here’s the explanation of what was wrong:

– first of all the exception tells you that the super-object (which is a relative of base_proxy) has no write method. This is correct. A proxy is not a subclass of the proxied object’s class (some classes cannot be subclasses). The solution is to call the real write method. This can be accomplished with type(self).\__proxiee__.write()

– second of all we need to be able to hold state, namely the hash attribute (I’ve renamed it to _hash but it’s irrelevant to the problem at hand). Proxy objects can store state, it’s just not terribly easy to do. The proxied object (here a file) may or may not be able to store state (here it cannot). The solution is to make it possible to access some of the state via standard means. The new (small) satateful_proxy class implements __setattr__ and __delattr__ in the same way __getattribute__ was always implemented. That is, those methods look at the __unproxied__ set to know if access should be routed to the original or to the proxy. – the last problem is that __unproxied__ is only collected by the proxy_meta meta-class. It’s extremely hard to change that meta-class (because padme.proxy is not the real class that you ever use, it’s all a big fake to make proxy() both a function-like and class-like object.)

The really cool thing about all this is not so much that my code is now working, but that those ideas and features will make it into an upcoming version of Padme 🙂 So down the line the code should become a bit simpler.