It'd be a shame if anything happened to them...
I'll be describing a solution that Jazkarta built for Washington Trails Association earlier this year.
WTA is a large nonprofit in Washington state that maintains trails, advocates for their protection, and promotes hiking. They have a Plone site with a lot of images: hikers can upload photos they take while hiking in Washington state.
In spring of this year the blob storage was at 650GB and growing by gigabytes per day!
Our first thought was an app-level approach: add a new field on the image holding an S3 URL, use a celery task to move the blob to S3 and update that field, and serve from S3 whenever the field is set.
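To make that concrete, here is roughly the shape of that idea. This is a hypothetical sketch, not code we shipped; the get_plone_site helper, the s3_url field, and the bucket name are made up for illustration.

import boto3
from celery import shared_task

BUCKET = 'wta-photos'  # placeholder bucket name

@shared_task
def move_blob_to_s3(image_path):
    # Hypothetical task: copy one image's blob to S3 and record the URL.
    portal = get_plone_site()  # imaginary helper returning the site root
    obj = portal.unrestrictedTraverse(image_path)
    key = 'images/%s' % obj.UID()
    with obj.image.open() as blob_file:
        boto3.client('s3').upload_fileobj(blob_file, BUCKET, key)
    # The "new field" holding the S3 URL; templates would link here
    # instead of serving the blob through Plone once it is set.
    obj.s3_url = 'https://%s.s3.amazonaws.com/%s' % (BUCKET, key)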
But then we started thinking about how images are used in Plone. Plone needs access to a file on disk for things like building image scales and reading image dimensions.
Do we really want to have to customize every add-on that does something with images?
So we started looking for a ZODB-level solution.
Right away I thought of the blob cache that already exists in ZEO clients when you aren't using a shared blobs directory.
When you try to access a blob in the ZODB, the client first checks whether it is in a cache directory on the local disk; if not, it fetches it from the ZEO server over the network and then opens the file from the cache directory.
Even better, that cache directory is limited to a configurable maximum size.
Maybe all we needed to do was modify this to get the blob from S3 if it isn't found in ZEO.
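Here is a rough sketch of that idea as a blob-loading fallback. It is illustrative only, not the code we ended up with; the class name, the _hex helper, the bucket key scheme, and the cache layout are all assumptions.

import binascii
import os

import boto3
from ZODB.POSException import POSKeyError


def _hex(b):
    return binascii.hexlify(b).decode('ascii')


class BlobCacheWithS3Fallback(object):
    """Sketch: ask the wrapped storage for a blob first, and fall back to
    fetching it from an S3 bucket into a local cache directory."""

    def __init__(self, storage, cache_dir, bucket_name):
        self._storage = storage  # e.g. a ZEO ClientStorage
        self._cache_dir = cache_dir
        self._bucket = boto3.resource('s3').Bucket(bucket_name)

    def loadBlob(self, oid, serial):
        # Local access is assumed to be faster and cheaper than S3,
        # so try the wrapped storage (and its own blob cache) first.
        try:
            return self._storage.loadBlob(oid, serial)
        except POSKeyError:
            pass
        # Fall back to a shared on-disk cache of blobs fetched from S3.
        # A real implementation also has to cap the cache directory at a
        # maximum size and proxy the rest of the storage API.
        cached = os.path.join(self._cache_dir, _hex(oid), _hex(serial))
        if not os.path.exists(cached):
            os.makedirs(os.path.dirname(cached), exist_ok=True)
            key = '%s/%s' % (_hex(oid), _hex(serial))  # assumed key scheme
            self._bucket.download_file(key, cached)
        return cached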
As it turns out, Jim Fulton already built something like this when he was at Zope Corporation.
s3blobstorage is a modified ZEO client storage which fetches blobs from an S3 bucket in addition to ZEO.
But it really seems like experimental software, so we built our own package along the same lines: collective.s3blobs.
Blobs are looked for locally before going to S3. This is good because we're assuming that local access is both faster and cheaper than loading from S3.
This is basically copied from the ZEO client cache.
You can specify the maximum size the cache directory is allowed to reach before files are purged.
The cache is shared between multiple ZEO clients running on the same machine so we don't duplicate data.
And we don't waste cache space on blobs that are already on disk.
How do blobs get added to the S3 bucket? Using the archive-blobs script:

$ bin/archive-blobs -a 1 -s 2000000 -d

Selecting blobs by age and size gives some control for balancing fast local access against disk use. Newly created files are accessed locally until the script is run, so generating image scales stays fast.
It also means the transition to S3 storage can be done progressively, without downtime: start with the largest files.
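To make the idea concrete, here is a rough sketch of what one archiving pass could look like. This is not the actual archive-blobs script; the function name, the parameters, and the key scheme (the blob's path relative to the blob directory) are assumptions.

import os
import time

import boto3


def archive_blobs(blob_dir, bucket_name, min_age_days=1, min_size=2000000,
                  delete=False):
    # Sketch only: copy blob files that are old enough and large enough
    # to S3, optionally removing the local copy afterwards.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    cutoff = time.time() - min_age_days * 86400
    for dirpath, dirnames, filenames in os.walk(blob_dir):
        for name in filenames:
            if not name.endswith('.blob'):
                continue
            path = os.path.join(dirpath, name)
            stats = os.stat(path)
            if stats.st_mtime > cutoff or stats.st_size < min_size:
                continue  # too new or too small; keep serving it locally
            key = os.path.relpath(path, blob_dir)
            bucket.upload_file(path, key)
            if delete:
                os.remove(path)  # future reads go through the S3 cache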
This is what the configuration looks like in the zope2 instance section of your buildout:

storage-wrapper =
    %%import collective.s3blobs
    <s3blobcache>
      cache-dir ${buildout:directory}/var/blobcache
      cache-size 10000000000
      bucket-name my-blob-bucket
      %s
    </s3blobcache>

It gets written into the ZODB storage configuration in zope.conf, and %s is replaced with the "normal" storage definition that the cache wraps.
It also needs AWS keys which are loaded from environment variables.
Even if you're using different cloud storage and can't use collective.s3blobs, the storage wrapper pattern may be a good model for similar solutions.
S3BlobCache -> ClientStorage (ZEO)
or
S3BlobCache -> FileStorage
This means that you can use it with ZEO or not.
That can be nice for local development: just point your copy of the site at the same blob bucket on S3 and it can fetch the images. Access to the bucket is read-only, so a development copy can't touch the production blobs.
Caveat time!
Needs to be implemented: it would be nice to have a way to clean up blobs in the bucket that are no longer referenced after packing the ZODB.
We could keep track of removed blobs during the pack and then delete them from the bucket, or just list all the blobs in the bucket and check each one to see if it can still be loaded.
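A minimal sketch of that second idea, assuming the bucket keys encode the blob's oid and serial as '<oid-hex>/<serial-hex>' (the real key scheme may differ):

import binascii

import boto3
from ZODB.POSException import POSKeyError


def prune_archived_blobs(storage, bucket_name):
    # Sketch only: after packing, walk every key in the bucket and delete
    # the ones whose blob record no longer exists in the storage.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    for obj in bucket.objects.all():
        oid_hex, serial_hex = obj.key.split('/')
        oid = binascii.unhexlify(oid_hex)
        serial = binascii.unhexlify(serial_hex)
        try:
            storage.loadSerial(oid, serial)
        except POSKeyError:
            obj.delete()  # no longer referenced; safe to remove from S3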
Could be implemented: we thought about making some app-level changes to link directly to the files on AWS instead of serving them through the Plone web server.
That raises security considerations, though, and in the end it wasn't important for WTA's use case.
davisagli
Image credits: WTA users denemiles, Hkng grl, lomorg, Jiajia, SarahEC, journeycake, albertb539, Luckysnow