Cloud Data Archiving: A Cloud-based Archiving Solution

Introduction

Half a year ago, Mahoor and I joined the spirited team at MobiLab as full-stack developers. In that time we were given the chance to work on the company’s very first in-house product: Archivum. Building upon a previous project that had stood the test of time as a scalable data integration solution, we added novel features, shaping it into a unique cloud-native archiving solution designed with organizational needs in mind.

“What are those ‘organizational needs’?”, you might ask. The reasons for Archivum are numerous, but put simply: while aspiring to adopt data enablement, more and more businesses are facing exponential growth of data. Scaling and maintaining the on-premises infrastructure that holds it sooner or later becomes too difficult to manage, resulting in excessive running costs and overhead. This is exactly what we recognized as a huge opportunity for innovation…

Hot & Cold Storage

Data growth has made data management difficult for organizations: exponentially increasing storage costs and retention requirements imposed by governments and regulators add greater complexity than ever before. The problem is that handling this manually does not scale and cannot be employed at an organizational level. Organizations therefore need a new process that not only chooses the storage type smartly (cutting unnecessary costs) but also takes care of the data lifecycle (archiving & deleting).

With this in mind, Archivum’s development journey kicked off. We first tackled the problem by categorizing data based on access rates, and identified two kinds of data that companies mainly use:

  1. Online data – data that is needed frequently and should be instantly accessible at any time.

  2. Offline data – data that is rarely needed and for which access delays are tolerable.

The initial idea was to take advantage of the online & offline storage tiers managed by cloud providers (also known as “hot storage” & “cold storage”) to store the data correspondingly, and to use them in a configurable way that fits each organization’s internal processes. Since storing data in cold storage is cheaper than in hot storage, this separation enables businesses to absorb the exponential data inflow while flattening their cost curve.

[Figure: Hot and cold storage]
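
To make the tiering idea concrete, here is a minimal sketch of tier selection. It assumes AWS S3 and boto3 purely for illustration (the post does not name a provider), and the bucket and key names are made up:

```python
# Minimal sketch: storing a file in a hot or cold tier on AWS S3.
# The provider, bucket, and keys are illustrative assumptions only.
import boto3

s3 = boto3.client("s3")

def store(bucket: str, key: str, data: bytes, frequently_accessed: bool) -> None:
    """Write data to hot storage (STANDARD) or cold storage (GLACIER)."""
    storage_class = "STANDARD" if frequently_accessed else "GLACIER"
    s3.put_object(Bucket=bucket, Key=key, Body=data, StorageClass=storage_class)

# Online data goes to the hot tier, offline data to the cold tier.
store("archivum-demo", "reports/q1.pdf", b"...", frequently_accessed=True)
store("archivum-demo", "backups/2019.tar", b"...", frequently_accessed=False)
```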

The separation into hot & cold storage was already a good foundation to build on, but we had a greater vision for Archivum: from the very beginning, we intended not only to cut extra costs by using the appropriate storage tiers but also to automate the data lifecycle by considering users’ needs and moving files from hot to cold storage accordingly. Users of the platform should be faced with only a single unified storage on the cloud, which they can use to easily access their files from any device and share them with other users, without worrying about each file’s fate.

We approached this by logically grouping data into containers defined by the user. For each container, archiving durations and retention periods can be specified as well. Being able to define different policies for different data groups provides a decent degree of flexibility, making it convenient for organizations to fine-tune their usage with respect to existing processes. This also requires a certain restore logic, since access to files in cold storage is not immediate.
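
As an illustration of such per-container policies, here is a hypothetical sketch; the type and field names are our own, not Archivum’s actual data model:

```python
# Hypothetical sketch of a container lifecycle policy.
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class ContainerPolicy:
    name: str
    archive_after: timedelta  # move from hot to cold storage after this period
    retain_for: timedelta     # delete from cold storage after this period

# Different data groups get different lifecycles, e.g. invoices must be
# kept for ten years, while build artifacts can be dropped after one.
invoices = ContainerPolicy("invoices",
                           archive_after=timedelta(days=90),
                           retain_for=timedelta(days=3650))
artifacts = ContainerPolicy("build-artifacts",
                            archive_after=timedelta(days=7),
                            retain_for=timedelta(days=365))
```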

Keeping the whole process smooth and seamless from the user’s perspective was quite a challenge. We ended up using the messaging system in our architecture to periodically check files that are currently being restored against their storage status (hot/cold). With this approach, files that take a couple of hours to restore do not block other storage services. On top of that, we implemented an email service that notifies the user once files are restored to hot storage, abstracting away the complications of moving files between the cold and hot tiers.
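
Here is a minimal sketch of such a periodic status check, again assuming AWS S3’s restore API as a stand-in; Archivum’s actual messaging system and provider are not detailed in this post:

```python
# Sketch of the periodic restore check described above, using AWS S3's
# restore API as an illustrative stand-in.
import boto3

s3 = boto3.client("s3")

def request_restore(bucket: str, key: str, days: int = 1) -> None:
    """Ask the cold tier to make an archived object temporarily readable."""
    s3.restore_object(
        Bucket=bucket, Key=key,
        RestoreRequest={"Days": days,
                        "GlacierJobParameters": {"Tier": "Standard"}})

def is_restored(bucket: str, key: str) -> bool:
    """Check an object's restore status; polled periodically via the queue."""
    head = s3.head_object(Bucket=bucket, Key=key)
    restore = head.get("Restore", "")
    # While restoring: 'ongoing-request="true"'; once finished the header
    # reads 'ongoing-request="false", expiry-date="..."'.
    return 'ongoing-request="false"' in restore

# Once is_restored() flips to True, the email service can notify the user.
```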

The resulting design is quite magical: we were able to provide a flexible, configurable lifecycle for data based on its logical grouping in containers, and to automate the archiving process. Furthermore, Archivum now allows users to restore their data from cold to hot storage and to get notified when that data is ready to access.

—> Check out this blog post for a more in-depth look at this topic!

Easy Access

Sharing data within organizations is a critical and complicated matter. Each organization has its own data access and governance policies, with many rules defining which person or department may access which data based on their role and requirements. Implementing an access management system that satisfies organizations’ needs for defining access policies proved to be a tricky task: on one hand, the system had to be configurable and flexible enough to meet all the varying requirements different organizations might have; on the other hand, we had to be able to integrate it frictionlessly into our own platform.

The inspiration for our approach came from observing many different data access policies across multiple organizations. We realized they share a common pattern: hierarchical access policies. Each staff member or department must hold a role whose data access follows the overarching hierarchy, so Archivum’s design includes three hierarchical roles: administrator, normal user, and auditor. Auditors are a somewhat special case: as a third party, they may only access data to examine its trustworthiness and a file’s history, but they have no further operating permissions.

[Figure: Easy access]

These user levels make up the hierarchy from top to bottom, but inside organizations, new people join, others leave, and roles change. Typical user-based access management demands keeping each user’s rights aligned with every change in the organizational structure, and it is obvious that this approach does not scale. To replace it with a more flexible one, Archivum’s access management is based on grouping users and assigning groups to containers in line with the hierarchical roles. A single user can be a member of multiple groups (with different levels of access to different containers).

All of this provides the required flexibility to reflect existing company structures in the real world. Following the principle that any type of access to a system should be coupled with a matching type of access management, we added another authorization layer based on user-defined rules within containers, which we labeled “document-based access management”. Ultimately, this feature enables organizations not only to define hierarchically grouped access policies on containers but also to define fine-grained rules for a specific kind of access (e.g. read/write) to a particular group of documents within each container.
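
The following hypothetical sketch shows how the two layers could fit together; all role, group, container, and rule names are our own illustration, not Archivum’s actual implementation:

```python
# Hypothetical sketch of the two authorization layers described above.
from enum import Enum

class Role(Enum):
    AUDITOR = 1        # third party: may inspect data and history only
    USER = 2           # normal user: regular operating permissions
    ADMINISTRATOR = 3  # top of the hierarchy

# Layer 1: hierarchical access - groups are assigned a role per container.
container_roles = {
    ("finance-team", "invoices"): Role.ADMINISTRATOR,
    ("hr-team", "invoices"): Role.USER,
    ("external-audit", "invoices"): Role.AUDITOR,
}

# Layer 2: document-based access - fine-grained, user-defined rules for
# a particular group of documents within a container.
document_rules = {
    ("invoices", "2020/*"): {"hr-team": {"read"}},  # HR restricted to read here
}

def allowed(group: str, container: str, doc: str, action: str) -> bool:
    """Document-level rules, where present, override the container role."""
    rule = document_rules.get((container, doc))
    if rule is not None and group in rule:
        return action in rule[group]
    role = container_roles.get((group, container))
    if role is None:
        return False
    # Auditors may inspect data and history but not modify anything.
    return action == "read" or role.value >= Role.USER.value

print(allowed("hr-team", "invoices", "2020/*", "write"))        # False
print(allowed("finance-team", "invoices", "2019/*", "write"))   # True
print(allowed("external-audit", "invoices", "2019/*", "read"))  # True
```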

Trust, but verify

In addition to optimizing for both short-term access and long-term archiving through smart allocation between hot and cold storage, we also aimed for strong security and tamper-proof audits. Implementing such a system was a very complex task because regular users and third parties (like auditors) alike had to be able to trust it. To earn that trust, we had to be able to prove any illegal manipulation and make it trackable. For this, we tied Archivum to a “zero trust” policy.

[Figure: Trust, but verify]

Within Archivum, we use a service called Elasticsearch as a single source of truth to store and search all operations applied to documents. In other words, we record the whole history of each document (also known as an “audit trail”) and make all of it searchable. For this to work, we had to assure Archivum‘s users and auditors that any tampering with this service or with a document’s content is detectable and trackable. Using the Bitcoin network decentralizes the centrally stored audit trail by submitting each operation applied to a document as a transaction. Hence every activity on every document is verifiable, helping Archivum attain its compliance target of “zero trust”!
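
As a rough sketch of what anchoring an audit-trail entry to Bitcoin can look like: each entry is hashed and the digest is embedded in a transaction, for example via an OP_RETURN output. The `submit_op_return` helper below is hypothetical, since the post does not detail the integration:

```python
# Sketch: anchoring an audit-trail entry to the Bitcoin network.
import hashlib
import json

def audit_digest(entry: dict) -> bytes:
    """Deterministic SHA-256 digest of an audit-trail entry."""
    canonical = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).digest()

def submit_op_return(payload: bytes) -> None:
    """Hypothetical placeholder: broadcast a Bitcoin transaction whose
    OP_RETURN output carries `payload` (wiring up a real wallet or node
    RPC is beyond this sketch)."""
    raise NotImplementedError

def anchor(entry: dict) -> None:
    """Record one document operation on the Bitcoin network."""
    submit_op_return(audit_digest(entry))

entry = {"document": "contract-42.pdf", "operation": "update",
         "actor": "alice", "timestamp": "2020-06-01T12:00:00Z"}
print(audit_digest(entry).hex())  # the fingerprint that gets anchored
```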

[Figure: Chaining operations]

Aside from that, we not only use the Bitcoin network to keep our commitment to the zero-trust policy, but we were also inspired by its methodology of chaining blocks. Chaining together the operations applied to each document reduces untraceable actions and undetected tampering to zero.
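
Here is a minimal sketch of this chaining idea: each operation’s hash covers the previous hash, so tampering with any earlier entry changes every later hash:

```python
# Sketch: chaining a document's operations, inspired by how Bitcoin
# chains blocks. Illustrative only.
import hashlib
import json

def chain(operations: list) -> list:
    """Return one hash per operation, each covering its predecessor."""
    hashes, prev = [], "0" * 64  # fixed genesis value for the first entry
    for op in operations:
        record = json.dumps({"prev": prev, "op": op}, sort_keys=True)
        prev = hashlib.sha256(record.encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes

ops = [{"action": "create"}, {"action": "update"}, {"action": "archive"}]
print(chain(ops))  # altering any earlier op changes every later hash
```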

Utilizing the Bitcoin network allows data within Archivum to stay verifiable in a very innovative and unique way. Since every single operation is trackable in the decentralized environment, organizations can use our service without any second thoughts about possible audits by third parties.

—> Check out this blog post for a more in-depth look at this topic!

Conclusion

It was fascinating for us to experiment with cutting-edge technologies and wire them up in unprecedented ways to unlock their full potential. Maintaining the platform’s initial robustness, now that it has become so much more complex, was a big challenge as well. All in all, Archivum utilizes the best of both worlds by marrying centralized and decentralized architectures: the performance and cost-efficiency of scalable cloud storage alongside independent verification of every storage operation via Bitcoin.

[Figure: Triangle of advantages]

This first blog post was a brief overview of how we came up with and approached Archivum’s unique features. For a closer look at the three main components mentioned, subsequent posts will follow in the coming weeks. Be sure to check out the Archivum product page in the meantime and stay tuned!