Comparing methodologies for accessing AWS S3 cloud object storage

More than just a mount point

It’s a common question: “How can I mount and use S3 storage for production workloads?” It makes a lot of sense – elastic, highly durable, and cost effective – sounds perfect! And if you Google it, you’ll find plenty of solutions and products available. Yet when you begin peeling the marketing layers off these solutions, you soon discover they come with pretty significant trade-offs.
What these solutions have in common is that they must deal with the two primary challenges associated with cloud object storage – latency and interface.

Latency

Applications expect their data to be next to them, and do not deal with latency well – most file based applications are developed with local disk/LAN type latency in mind. Therefore any use case for cloud object storage, outside of backup and archive, focuses on pulling the file from the cloud to the device where the application resides. This takes time, consumes space, and creates a new copy which must be kept in sync with the source of truth and kept secure.

Interface

There is no file system associated with object storage, so to use the data (objects) stored there, they must first be translated or ingested into an existing file system. In AWS-speak, you use “puts” and “gets” as you push a file up and into a bucket, or retrieve it. The only time you can avoid this is when the application itself has been refactored or written natively for object storage. In the case of a cloud native application, sharing the data with traditional applications becomes challenging.
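To make the interface gap concrete, here is a minimal sketch in Python. ToyObjectStore is a hypothetical in-memory stand-in for a bucket, not the AWS SDK; it assumes only what the paragraph above describes, namely that objects are pushed and retrieved whole with put and get. Changing a single byte then means a full get followed by a full put.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for a bucket (not a real AWS API)."""

    def __init__(self):
        self._objects = {}
        self.bytes_transferred = 0  # running total of bytes moved by put/get

    def put(self, key: str, data: bytes) -> None:
        # In this toy, a put replaces the whole object; no partial overwrite.
        self._objects[key] = bytes(data)
        self.bytes_transferred += len(data)

    def get(self, key: str) -> bytes:
        # In this toy, a get returns the whole object.
        data = self._objects[key]
        self.bytes_transferred += len(data)
        return data


def patch_byte(store: ToyObjectStore, key: str, offset: int, value: int) -> None:
    """Change one byte the only way put/get allows: get all, edit, put all."""
    buf = bytearray(store.get(key))
    buf[offset] = value
    store.put(key, bytes(buf))


store = ToyObjectStore()
store.put("report.bin", bytes(1_000_000))   # upload a 1 MB object
store.bytes_transferred = 0                 # reset the counter before the edit
patch_byte(store, "report.bin", 10, 0xFF)   # change a single byte
print(store.bytes_transferred)              # bytes moved for a 1-byte edit
```

In this toy model the one-byte edit moves the object twice, once down and once up, which is the overhead a file-oriented application never sees on a local disk.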

Breaking new ground

When we first started speaking with customers, describing how LucidLink allows them to mount and use cloud object storage as primary storage, they would often respond, “But can’t [xxxx] do that?” And the answer is usually, “sort of.” Those products often dealt with the same problem, but the solutions were constrained by existing technology – a cloud first, streaming, distributed file system, sold as a service, didn’t yet exist.
In the words of one senior level storage expert at AWS, “This is the first truly cloud first implementation for S3 storage I’ve seen, and I think it has the potential to really shake things up.”
Simply put, LucidLink is not a gateway, and not an app that syncs files over SMB, NFS or CIFS. It is a cloud file system optimized for object storage.
Let’s take a look at some of the approaches out there, and how LucidLink is different.

Full sync or sync on demand

This is the big hammer approach, and it was the first to address the issue, introduced over 10 years ago by Dropbox and Box. The idea is simple: put a copy of the file wherever you may need to access it, and keep changes synchronized across all devices.
The approach is OK for individual users with smaller sized files and data sets with casual sharing requirements, and both companies have achieved great success with their services.
However, as data sets and file sizes grow, this approach does not easily scale to production use for primary data. As anyone who has a large data set can attest, it is too noisy an approach for production.
While there are solutions for those seeking to use data stored in these services – generally connectors which pull the data from the service for specific applications – they are fraught with compromise. Users still need to inefficiently download and sync files across multiple locations, consuming precious time and resources.
Besides the noisy overhead, one of the biggest downsides we hear from users is the lack of customer control over where and how their data is stored. Supreme trust in your provider is a prerequisite.
Accessing object storage via file sync products is possible; they just weren’t designed to be used for live production with primary data.

Cloud storage as a mount point

These approaches expose the native cloud object storage get/put API commands through an emulated local OS mount point. They provide a view of your cloud storage with a local feel, but when you access a file, the whole file must be downloaded, modified, and uploaded again.
They are basically a way to map to your bucket, browse and sync files as required – very convenient for casual access of small files, but not for production or heavy collaboration. Examples include ExpanDrive, Mountain Duck and others.
In contrast, LucidLink is a distributed file system that leverages S3 compatible object storage as backend disk. Clients access files, in place, where they are stored in the cloud. To deal with latency, we distribute metadata and present files as if they are local, prefetching and caching only the bits the application requires, streaming on demand as requested.
LucidLink does not use a 1 file = 1 object approach. We use our own data layout writing the file across multiple objects. This means that for very large files, we can open multiple connections to the different objects comprising the file to improve performance, and can provide random access to modify only the parts of the file being changed.
For example, we published a test where we stored a 133 GB VHD in a LucidLink File Space, presented it to a local server with MS Hyper-V, and proceeded to boot a VM off that image – over a mobile data connection. Of the 133 GB, we only needed to download about 3 GB in order to boot it. If you were to use any of the products above, you would need to download the entire 133 GB file first.

(For more detail on the above, please visit the LucidLink forum topic.)
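The idea of writing one file across multiple objects can be sketched as follows. The 1 MiB chunk size and the "file-id/index" key naming are assumptions made for illustration only; LucidLink’s actual data layout is not described in this post. The function maps a byte range of a file to the objects that hold it, which is why a small read or write touches only a few objects rather than the whole file.

```python
CHUNK_SIZE = 1024 * 1024  # assumed chunk size (1 MiB), illustration only


def chunks_for_range(file_id: str, offset: int, length: int,
                     chunk_size: int = CHUNK_SIZE):
    """Return (object_key, start_in_chunk, end_in_chunk) for each object
    needed to serve the byte range [offset, offset + length)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if length == 0:
        return []
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    plan = []
    for index in range(first, last + 1):
        chunk_start = index * chunk_size
        # Clamp the requested range to this chunk, in chunk-local offsets.
        start = max(offset, chunk_start) - chunk_start
        end = min(offset + length, chunk_start + chunk_size) - chunk_start
        plan.append((f"{file_id}/{index:08d}", start, end))  # hypothetical key scheme
    return plan


# Read 4 KiB that straddles the boundary between chunks 2 and 3.
plan = chunks_for_range("vm-image", offset=3 * CHUNK_SIZE - 1024, length=4096)
for key, start, end in plan:
    print(key, start, end)
```

Each entry in the plan could be fetched over its own connection, and a write would need to replace only the objects its range touches.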

Gateway devices and File Fabric

This is infrastructure sitting between the client endpoints and the cloud storage. It could be a physical or a virtual appliance, but in either case it needs to be set up and managed. (Although there are products offering IaaS as an option, we characterize that as a business model differentiation, and not a true cloud first approach – someone still needs to set it up, monitor and manage it.)
For this approach to work, endpoints must be on the same LAN environment as the gateway device where the files are cached, and therefore it requires multiple devices for multiple locations, significantly increasing cost and complexity. Ultimately, users have simply traded storage devices for caching gateways that have the ability to tier to cloud storage.
Besides the additional infrastructure, there are scale-out ramifications. If there is a gateway involved, all connections and data must pass through it, making it a bottleneck. A few representative players here are Nasuni, Panzura, and SoftNAS.
In contrast, LucidLink clients read and write directly to cloud object storage. Each client connects independently to the store, and scale is (theoretically) infinite. Take a look at the test we performed for a customer comparing 300 concurrent connections accessing a 1GB dataset through a (massive) file server vs the same using LucidLink.

Massive scale out capability vs. fileserver
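The bottleneck argument can be expressed as a toy throughput model. The link speeds below are hypothetical, not figures from the customer test, and the model assumes the object store itself is never the limiting factor.

```python
def per_client_via_gateway(n_clients: int, gateway_bps: float,
                           client_bps: float) -> float:
    """All traffic shares one gateway uplink, so each client gets at most
    an equal share of it, capped by the client's own link."""
    return min(client_bps, gateway_bps / n_clients)


def per_client_direct(client_bps: float) -> float:
    """Each client talks to the object store over its own link."""
    return client_bps


GATEWAY = 10e9  # hypothetical 10 Gbps gateway uplink
CLIENT = 1e9    # hypothetical 1 Gbps link per client

for n in (1, 10, 300):
    via_gw = per_client_via_gateway(n, GATEWAY, CLIENT) / 1e6
    direct = per_client_direct(CLIENT) / 1e6
    print(f"{n:>3} clients: gateway {via_gw:7.1f} Mbps each, direct {direct:7.1f} Mbps each")
```

Under these assumptions the per-client share through the gateway shrinks as clients are added, while the direct path stays constant.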

Distributed File System

We believe that this is the best approach. It is a true cloud approach to the architecture, dealing with both the challenges and the advantages that cloud object storage presents. This is the approach that LucidLink took.
LucidLink distributes metadata to the devices, removing the bulk of the chatter over the wire between applications and the file system. This allows us to transfer the data much more efficiently to the client. We further enhance performance through caching and prefetching. The LucidLink service coordinates metadata and runs garbage collection on the side, outside the data path and completely transparent to the end user.
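As a rough illustration of caching plus prefetching, the sketch below keeps recently used blocks in a small LRU cache and, on each read, also fetches the next block. The fetch_block callback, the cache capacity, and the one-block read-ahead policy are all assumptions made for the example; it does not describe LucidLink’s actual algorithm, and a real client would prefetch asynchronously rather than inline.

```python
from collections import OrderedDict
from typing import Callable


class PrefetchingCache:
    """Toy LRU block cache with one-block sequential read-ahead."""

    def __init__(self, fetch_block: Callable[[int], bytes], capacity: int = 4):
        self._fetch = fetch_block      # hypothetical remote fetch function
        self._capacity = capacity
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.remote_fetches = 0        # how many times we went over the wire

    def _load(self, index: int) -> bytes:
        if index in self._blocks:
            self._blocks.move_to_end(index)   # mark as most recently used
            return self._blocks[index]
        data = self._fetch(index)
        self.remote_fetches += 1
        self._blocks[index] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return data

    def read(self, index: int) -> bytes:
        data = self._load(index)
        self._load(index + 1)                 # read ahead the next block
        return data


cache = PrefetchingCache(lambda i: f"block-{i}".encode())
for i in range(3):
    cache.read(i)                             # sequential reads of blocks 0..2
print(cache.remote_fetches)
```

With sequential access, every read after the first finds its block already cached, so the application waits on the network only for the initial miss.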
The benefits of this approach are an immediate view of all of your data, without consuming local storage resources, and nearly immediate access to that same data. LucidLink provides a significant advantage in the concept of “time to file”. This is the amount of time required before you can begin using the application in conjunction with your data. A dramatic example of this is opening a large (4.5 GB) ISO archive. With a popular EFSS solution, it took nearly 4 hours before we could open the archive, while with LucidLink we could mount it and start working with the files in under 30 seconds. Obviously we did not download the entire archive, and that is the point – with a streaming file system, you don’t have to!

Measuring the time required before using a 4.5 GB ISO. The file was shared over a high speed Comcast connection within the same metro area. Sync requires a file to be completely uploaded, then completely downloaded before the application can begin using it. LucidLink presents the file as if it is local and only streams the bits required by the application as they are requested.
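The difference in “time to file” comes down to simple arithmetic. The sketch below uses made-up link speeds (they are not the measurements from the test above) and assumes that a streaming client needs only a small, hypothetical portion of the file before the application can start.

```python
def sync_time_to_file(size_bytes: float, up_bps: float, down_bps: float) -> float:
    """Seconds until a synced file is usable: full upload, then full download."""
    bits = size_bytes * 8
    return bits / up_bps + bits / down_bps


def streaming_time_to_file(needed_bytes: float, down_bps: float) -> float:
    """Seconds until a streamed file is usable: only the needed bytes move."""
    return needed_bytes * 8 / down_bps


SIZE = 4.5e9            # 4.5 GB file, as in the ISO example
UP, DOWN = 10e6, 100e6  # hypothetical 10 Mbps up / 100 Mbps down
NEEDED = 50e6           # hypothetical: 50 MB needed to open the archive

print(f"sync:      {sync_time_to_file(SIZE, UP, DOWN):.0f} s")
print(f"streaming: {streaming_time_to_file(NEEDED, DOWN):.0f} s")
```

Whatever the real numbers are, the sync cost grows with the full file size on both legs, while the streaming cost grows only with the bytes the application actually touches.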

By implementing LucidLink as a file system, we ensure customers’ existing workflows and applications are preserved.
We believe end users should have full control over their data which is why we allow them to use their own credentials and bring their own cloud. They know exactly where their data is stored at all times.
Finally, we implemented end-to-end client-side strong encryption for additional protection of client data. Most storage services tout encryption, but it is often just “encryption at rest”. However, if a storage service can search, analyze or index your files for you, guess what – they can access your data.
LucidLink customers control their own keys, and data is encrypted on the device, in flight and at rest.

The future

If you read our post “How it all got started”, you will see that we began to wonder when the internet would be fast, stable and ubiquitous enough to directly access files remotely. We believe that we’re experiencing that today – connections within and between clouds are already there, and consumer connections are in transition.
Companies already using compute in the cloud can immediately benefit from using object storage as primary storage – even across vendors (Azure servers accessing S3 storage, for example).
Requirements for on-premises office locations are slightly more nuanced. Extremely high bandwidth or intensive transactional workloads may still require high performance all-flash arrays over Fibre Channel networks, but this is generally a fraction of the unstructured data often stored on premises. Companies may elect to shift some of the savings realized by moving their storage to the cloud toward improving the quality of their internet connection, enabling on-premises file services to access cloud object storage directly – without downloading, syncing, or caching everything first.
The user experience for edge cases depends on the quality of their connection. My Comcast connection in San Francisco is 200 Mbps download for $50/month, while my colleague in Melbourne, Australia still “suffers” with an 8 Mbps ADSL connection.
However, 5G rollout has already begun, and as we all get better connected, accessing files directly from the cloud will become commonplace.
Think about it – if you could connect to your data seamlessly from anywhere, as if it were stored locally, why wouldn’t you keep it in your own cloud account, fully protected and encrypted, in an elastic object store, with the highest availability and durability possible, at the lowest cost? LucidLink thinks you should.

The last storage upgrade you'll ever need

We believe cloud object storage has the power to fundamentally change the way individuals and businesses store and access their files.
