TECHNOLOGY

My Vision for AI at Wasabi

May 2, 2024

By Aaron Edell
Senior Vice President, AI & Innovation

In 2016, I saw my first demo of a machine learning model that changed the trajectory of my career. Just prior to that, I was one of three employees at a startup called GrayMeta where we were trying to build a platform that extracted metadata from files to make them searchable by more than just their filename. We wanted to make life easier for the content creators, editors, knowledge workers, and others who waste up to 50% of their time looking for stuff in their company’s various storage locations. Media files like video and audio are particularly tricky because they can contain so much content, yet traditionally get only a few human-curated metadata fields in an asset management system somewhere. And once that archive grows, people’s ability to find what they need diminishes to near zero.

But we could only extract so much information from a media file: basically whatever metadata existed in the file headers and EXIF information. Useful, but not game changing. That all changed when I saw my first demo of a speech-to-text system that could easily be accessed via API and was reasonably accurate. Now we were turning media files into searchable text. Then came face recognition, logo detection, object and image classification, and so much more. We were suddenly able to create treasure troves of searchable, text-based metadata from unstructured files, automatically.

A quick primer on AI and ML

AI is a very general term, and honestly, I dislike using it. Any computer program that mimics a human in even the smallest of ways falls under the term artificial intelligence. For example, a program developed in the 1950s to play checkers was considered to be AI. I think you can even make a case to describe your kids as artificial intelligence.

What I’m most interested in is machine learning (ML). ML is a computer system that can learn and adapt without needing explicit instructions. One very popular way to achieve this capability is through a neural network (NN). An NN is software designed to learn the same way we learn. I got to experience what this looks like in humans when I had kids. For example, when my son was very young, he would point at a ball and make a sound. The adults around him would see what he was pointing at and utter the sound ‘ball’. He would hear the sound, and slightly adjust the sound he made when pointing at the ball. The closer he got to saying ‘ball’, the more excited we would get, reinforcing his learning when he made the correct sound. In his brain, the neural network was getting rewarded when it produced the correct output. Over many iterations, it eventually learned that a certain input (seeing a ball) leads to a certain output (saying the word ‘ball’).

In computing, NNs work in a very similar way. Here is my overly simplistic explanation of what is going on. NNs rely on training data to learn and improve their accuracy over time. There are layers of nodes connected to each other. Each node has a number associated with it, and if the output of any node connected to it exceeds that number, it activates and passes data to the next node in the next layer. When you train it, you tell it what the input is and what output you are expecting. It then figures out how to arrange those numbers in the internal layers (sometimes called hidden layers) of nodes such that the input will produce the desired output. A picture of a ball at the input leads to the word “ball” at the output.
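To make the idea of "arranging the numbers" concrete, here is a minimal sketch of a single node trained with the classic perceptron rule. It is far simpler than the multi-layer networks described above (no hidden layers, no backpropagation), and every name in it is made up for illustration. The two input features and the "ball" label are assumptions chosen only to mirror the example.

```python
# Toy single-node "network": a perceptron with a step activation.
# This is an illustration only, not code from any real product.

def predict(weights, bias, inputs):
    """Activate (return 1) only if the weighted input sum clears the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total + bias > 0 else 0


def train(samples, epochs=20, learning_rate=1):
    """Adjust the node's numbers a step at a time after every wrong answer."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, expected in samples:
            error = expected - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Training data as (inputs, expected output). The hypothetical features are
# "is round" and "bounces"; the node should say "ball" (1) only when both hold.
SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in SAMPLES]
print(predictions)
```

The training loop never gets told the rule. It only sees inputs and the expected outputs, and it keeps adjusting its numbers until the two agree.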

How does this work in the real world?

When I was 15 or so, one of my first jobs was interning for my dad at the television station where he worked. My job was to watch some tapes of his show and log everything that happened by hand. Humans are good at this. We have incredible pattern recognition and prediction capabilities. But it took me hours to get through 5 minutes of video.

For organizations that have massive libraries, archives, or collections of audio/visual material, having interns log it all is not economical. Humans, although great at the task, make errors when doing laborious, repetitive work en masse. Plus, there are not enough of us available to get through millions of hours of material. Repetitive, laborious tasks are great candidates to be taken over by machine learning.

This task, while simple enough for a 15-year-old to do, is pretty complex for a computer. To log a video file, you need to know what was said, what is going on, who is visible, and what words appear on screen. You need to tie this information to the time that it appears in the timeline. So, we built a platform that does just that with ML. It can transcribe the audio, turn any text on screen into text you can search for, recognize faces and people, and tag objects and images. And it keeps track of when in the timeline these things appear. What you get at the other end is an index that is searchable. So, someone who is editing together some marketing content can find relevant moments within a vast archive in seconds. They can then use those moments to complete their edited material and move on to the next task. No human needed to watch every second of that vast archive; it was all done in moments by software using ML.
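As a rough illustration of what a time-coded, searchable index could look like, here is a small sketch. It assumes the ML services have already produced their output; the Moment record, its fields, the file names, and the sample data are all hypothetical and are not the actual platform's data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Moment:
    """One ML-generated observation tied to a span of the timeline."""
    asset: str    # hypothetical file name
    start: float  # seconds from the start of the file
    end: float
    kind: str     # e.g. "speech", "face", "ocr", "object"
    text: str     # transcript snippet, person name, on-screen words, or tag


def build_index(moments):
    """Map each lowercase word to the moments whose text contains it."""
    index = defaultdict(list)
    for moment in moments:
        for word in set(moment.text.lower().split()):
            index[word].append(moment)
    return index


def search(index, query):
    """Return moments matching every word in the query, ordered by asset and time."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits, key=lambda m: (m.asset, m.start))


# Made-up sample output from transcription, face recognition, and on-screen text.
MOMENTS = [
    Moment("interview.mp4", 12.0, 18.5, "speech", "welcome to the evening news"),
    Moment("interview.mp4", 40.0, 55.0, "face", "Jane Doe"),
    Moment("promo.mp4", 3.0, 6.0, "ocr", "Evening News at 6"),
]
INDEX = build_index(MOMENTS)

for hit in search(INDEX, "evening news"):
    print(hit.asset, hit.start, hit.end, hit.kind)
```

Because every hit carries its start and end time, an editor can jump straight to the moment rather than scrubbing through the whole file.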

At this point, you can probably imagine all sorts of use cases for having every file in your storage environment tagged, indexed, transcribed, and otherwise made searchable. That is because there are many. But why is a cloud storage company pursuing this, and why did they buy this technology from GrayMeta?

Object storage and you

Object storage is a type of data storage architecture designed to handle large amounts of unstructured data. You can think of unstructured data as data that doesn’t fit into an Excel spreadsheet. So, media files, web pages, PDFs, etc. You need it because other types of storage architectures don’t scale. File storage, like what you have on your computer, requires organizing data into folders, which themselves get organized into folders, etc. Just like your filing cabinet in your office, if you want to retrieve a file you need to know where it is, go into that cabinet, and into that folder to retrieve it. As the number of files grows, it becomes time consuming to search and retrieve the data you are looking for.

One solution to this is to use block storage. Block storage breaks your files into separate blocks and stores them separately. Each chunk of data has an ID that can be used to reassemble your data later. You don’t need a single path to your data so you can store these blocks wherever it’s most convenient. But in order to do that, you need an operating system that knows how to reassemble these blocks. It’s also expensive to store data this way and it has no native way of storing metadata.
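A toy sketch of the block idea, assuming an unrealistically small block size and an in-memory dictionary standing in for the storage devices. The function names are invented for illustration. The ordered list of IDs plays the role of the operating system that knows how to put the blocks back together.

```python
import uuid

BLOCK_SIZE = 4  # bytes; deliberately tiny so the split is easy to see


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut data into fixed-size chunks, each stored under its own ID.

    Returns the block pool and the ordered list of IDs needed to rebuild.
    """
    pool = {}
    order = []
    for offset in range(0, len(data), block_size):
        block_id = str(uuid.uuid4())
        pool[block_id] = data[offset:offset + block_size]
        order.append(block_id)
    return pool, order


def reassemble(pool, order):
    """Rebuild the original bytes by reading the blocks back in order."""
    return b"".join(pool[block_id] for block_id in order)


original = b"hello object storage"
pool, order = split_into_blocks(original)
print(len(order), "blocks")
print(reassemble(pool, order))
```

Notice that the blocks themselves carry nothing but raw bytes. Lose the ordered list of IDs and the data is just unlabeled chunks, which is the metadata gap described above.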

Object storage saves files in a storage pool in a flat way. The ‘object’ is self-contained, meaning it has all the metadata, data, unique IDs, permissions, policies, etc. built into it. No need to have a hierarchical folder structure or an operating system. Because the environment is flat, it scales very nicely. Storage pools can be spread across different devices, giving you unlimited scale. Retrieving the objects you need (your files) is fast and complexity-free.
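Here is a minimal in-memory sketch of that flat model, with hypothetical class and method names that do not correspond to any real storage API. Each object carries its own metadata, and a key such as "videos/2024/interview.mp4" is just one string, not a path that has to be walked folder by folder.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A self-contained object: payload plus its own key and metadata."""
    key: str
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy storage pool: one flat namespace, no folder hierarchy."""

    def __init__(self):
        self._pool = {}

    def put(self, key, data, **metadata):
        self._pool[key] = StoredObject(key, data, dict(metadata))

    def get(self, key):
        # A single lookup by key; there are no directories to traverse.
        return self._pool[key]

    def list_keys(self, prefix=""):
        # "Folders" are only a naming convention: a filter on key prefixes.
        return sorted(k for k in self._pool if k.startswith(prefix))


store = FlatObjectStore()
store.put("videos/2024/interview.mp4", b"...", content_type="video/mp4", owner="news-desk")
store.put("docs/contract.pdf", b"...", content_type="application/pdf")

print(store.get("videos/2024/interview.mp4").metadata)
print(store.list_keys("videos/"))
```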

Metadata is an object’s best friend

What makes objects so powerful at scale is their ability to contain metadata, data about the data they store. This is why we’re marrying our ML capabilities to an object storage solution. Without metadata, it is very easy to lose track of your data in a storage environment. Keeping your metadata in an operating system or some other kind of proprietary data controller makes it vulnerable to being lost. What if the company that makes your controller goes out of business, or if you want to switch to another vendor, or if you discover that data isn’t backed up or kept secure?

It is much safer, and way more economical, if your metadata and your data are as tightly coupled as possible. There is a lot of metadata that ML can generate, especially about a video file, so I wouldn’t necessarily recommend embedding all of it into the object wrapper itself. Instead, your object storage should come with an index, one that is optimized for search, which is where the ML generated metadata would be most effective.

If you were to log into your favorite hyperscaler’s object storage console today, you’d be able to list all of your folders and objects, but you’d note that there is no search bar anywhere to be found. When you’ve got 60 million objects in your bucket, not being able to search that bucket becomes a problem. Object storage can scale to infinity (almost), but if you’re not indexing those objects, then you’re creating a larger problem for yourself. I think that this is the fundamental issue that I’m excited to solve.

Today and tomorrow

Today, I’m building a service at Wasabi that is object storage with an index. This index is pre-populated with all kinds of metadata that our ML services generate. From a user’s perspective, all you need to do is put your stuff into Wasabi’s storage. The rest takes care of itself.

This is the way it should be. Object storage should come with an index. Can you imagine building a library without a card catalogue (even if it’s digital)? What is the point of paying to store stuff if you can’t find it or use it?

Our customers will not only be able to find things instantly, but because the index is so closely coupled with the objects themselves, everything built on top of it can have new superpowers. Imagine automatically moving content that has interviews with a specific person or about a specific topic to your editing suites overnight because those topics or people were trending in the news. Or being able to review your titles for distribution to other countries 100x faster because most of the laborious translation or content moderation work is done for you.

It would take a whole new blog post to dive into the many use cases that come out of what we’ve built. The point I’m trying to make is that I remain focused on making data more accessible and valuable, and making people more efficient at work with ML. This mantra is what guides the decisions I make.

And I really believe we’re still at the very beginning of this journey. I am only accelerating our work on solving these problems for customers with ML. The emergence of ChatGPT, large language models (LLMs), and other multi-modal technologies makes this an even more exciting time to be using machine learning. Yet challenges in how we think about ML remain.

Machine learning is like electricity. It is a utility, a technology, a type of mathematics that we use to solve problems and build solutions. I’m always wary of those who point to AI or ML as the solution in and of itself. We don’t run around telling people our tech stack is built using electricity. If it were the late 1800s and we were advertising that our hotel features electricity and running water, that might be more appropriate. But those things are table stakes now. AI and ML will get there at some point as well. They are not the end goal, but simply another powerful tool that helps us scale what we do.

My vision is to continue to build technology that makes data more accessible and valuable to the people I serve, our customers. Machine learning will be a big part of how I go about solving their problems, but it is not the end goal.

In the boundless digital expanse, where data grows as fast as our ambitions, we at Wasabi stand as architects of data accessibility, transforming the amorphous cloud into a meticulously indexed tapestry. As your SVP of AI and Innovation, I pledge to weave only the best technology into the very fabric of your data's journey, ensuring that every byte serves not just as a static repository, but as a dynamic asset, eager to reveal its secrets. Together, we will not just store the future, we will empower it. One searchable, discoverable, and invaluable piece of data at a time.