Learnings from four years as the Head of Machine Learning at a VC-backed early-stage startup
This week concluded my four-year role as the Head of Machine Learning for Vic.ai, as I have moved to a research position. My years in this role were an epic journey in building a next-generation commercial AI product and the team that operates it today. When I accepted the role, Vic was a first-year startup in Oslo with around 15 employees and $1.5M of seed-stage funding. Vic has since grown exponentially and gained the backing of leading US-based VC investors. Today Vic operates globally, and thousands of companies have their accounting done with the AI I’ve designed for Vic.
The vision for Vic.ai is ambitious and far-reaching: building an autonomous AI that will entirely replace the manual accounting workflow for processing invoices. This reminded me of the initial reason for inventing computers: replacing the work of human “computers” whose occupation was to meticulously calculate and correct numbers. Like the human computers a century ago, accountants today increasingly work through volumes of invoices, instead of tasks requiring creative problem solving and human interaction. To an ML scientist, this problem had a real-world impact, and raised questions: how to build the AI, and if this type of task could be made autonomous, what other types of tedious work could be automated using autonomous AI?
The role of the initial “Head of Machine Learning” in early-growth startups is considerably different than in larger companies and involves wearing many hats. During my tenure, I managed all of the data science projects, designed the ML algorithms Vic uses today, and hired the current ML and backend engineers. Until the last year, I did all of the analysis to drive the AI development forward, made the technology choices for building the AI stack, trained the deployed models, and did a fair share of the analysis to resolve prediction errors. This workload sounds exhausting, but in the early phase each of these tasks takes limited time, and improved software has greatly reduced that time.
As someone who’s worked in AI research, startups, and consulting for the past two decades, Vic.ai struck me as unique in many ways. The usual pitfalls of AI projects were absent, and the work could focus on the core challenges of building an AI product. Many of these remaining challenges weren’t fully recognized by AI research or industry. I’ve decided to share the learnings gained along this journey, as these challenges are common to most organizations building AI, and this experience can be of interest to others.
CRISP-DM Meets DevOps Culture
Before Vic, my commercial AI experience was as an external consultant, and as an in-house ML developer for several startups. It hadn’t taken me long to notice the mismatch between managing AI and software development projects. The majority of AI development projects were conducted with the CRISP-DM framework originating in data mining consulting, resulting in a high success rate and fast turnaround for clients. Software development was done with Scrum-influenced practices, with less than half of the projects succeeding. The two solved different problems and didn’t fit together. CRISP-DM is non-linear, with undetermined outcomes. Scrum is linear, with planned outcomes. Most commonly, AI projects were prototyped and delivered for production use, but the organizational support failed to turn these into successful products.
Like most of the AI community, I grew increasingly pessimistic about commercial AI due to these organizational problems and the complete mismatch between CRISP-DM and Scrum. Encountering Vic changed this, as the company had an exceptionally professional early developer team. Many of them had CTO-level experience or went on from Vic to a CTO role. Engineering and the rest of the company practiced DevOps culture and, unlike Scrum, applied the first Agile value, “people before processes”. After close encounters with Scrum, this changed my perception of software engineering.
The ML efforts had the complete support of the company, and I was fortunate that the CTO had a passion for ML, having led ML projects for the past few years. There had been some temporary contractors who had done some initial AI work for Vic. Some of the organizational planning was solved before I started, and there was implicitly already a type of “Agile CRISP-DM” framework in practice.
My team consisted of 2–4 experienced backend and ML developers for the first three years, and this was the ideal team size for building the foundation of the AI codebase. Later, I expanded this into an 8-person MLOps team with more specialized roles. The work was asynchronous and fully remote, with typically weekly or biweekly meetings over a Kanban board. Areas of responsibility were clearly defined as per Conway’s Law, with the main splits being between backend/ML and dev/ops.
The team worked on a DevOps basis, both developing and maintaining the code in production with regular release cycles. If there were requests from other teams, these were addressed first, so that their work was unblocked. Our requests to other teams were handled likewise, usually within a day or two, so that company-wide goals progressed rapidly. This prevented silo effects from blocking projects and allowed us to consider new projects, since we could rely on rapid support and prioritization from other teams. Typically around half of any member’s work time went to development, and the other half to responding to issues and requests.
The goals and milestones for the team were negotiated at the company level, and projects were handled autonomously by the team. This meant that tasks were often rapidly added, changed, and reprioritized between sync calls. Early termination, parallel tasks, task halting, and even failures were all normal parts of the process. In the Scrum context, these would be unthinkable, but for developing AI this is just the normal routine. What mattered was reaching the long-term goals, not the outcome of individual tasks.
Another aspect of this framework was fully data-driven development for AI. Instead of executives mandating “use deep learning to solve X”, every problem went through the same process. The best solution for the problem could be string matching completed within a day, a complex AI solution taking several months to develop, or reporting that the problem would need to be revised. All of these were common outcomes. Using this approach, the best solutions for problems were achieved at a fast rate of completion. The overall outcome was an AI system that combined techniques from traditional software and AI, selected by data-driven analysis for each case.
Some time after developing my first models, I got feedback about one. Reported accuracy from initial testing was around 30%. This was stunning, as we trusted the development metrics, which gave good results for the model. Looking into the details, it seemed there were multiple data quality issues for any type of AI trained from this data. I understood that this was a technical problem and we could still produce high-accuracy models, by treating the data preprocessing and model development as a single problem. I started working on solutions for this type of AI, like many others who discovered the limits of model-centric AI, and around this time the field was named “data-centric AI”.
Developing labeling processes through “label engineering” was a critical type of data-centric AI for us. There are various data quality problems in accounting platforms, but the most severe for AI is that label information is often uncertain, incomplete, incorrect, outdated, or available only as indirectly matchable metadata. As a rule of thumb, 20% of labels are missing or have wrong values. Supervised ML trained on the data requires correct label information, or it reproduces the errors seen in training. Some of the techniques I used to address this included noisy regex matching, score-based heuristics, decision trees, information theory-based matching, and model-based pseudo-labeling. In some cases, these were best applied as an external process. In other cases, it was useful, even mandatory, that the ML model had an internal labeling process. For all of the problems I encountered, various techniques could be combined to yield high-quality training data, although at the cost of engineering and analyzing the labeling process for each model.
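To make the idea of combining labeling techniques concrete, here is a minimal sketch of noisy regex matching merged with a score-based heuristic. It is an illustration only, not the labeling code used at Vic: the rules, weights, label names, and the `engineer_label` function are all hypothetical. Each regex rule votes for a label with a weight, the label recorded on the platform counts as one more noisy vote, and the example is left unlabeled if no label collects enough evidence.

```python
import re
from collections import defaultdict

# Hypothetical rules: (pattern, label, weight). Weights are invented for illustration.
RULES = [
    (re.compile(r"\b(aws|azure|cloud hosting)\b", re.I), "it_services", 2.0),
    (re.compile(r"\b(flight|hotel|taxi)\b", re.I), "travel", 1.5),
    (re.compile(r"\b(licen[cs]e|subscription)\b", re.I), "it_services", 1.0),
]


def engineer_label(text, recorded_label=None, recorded_weight=1.0, min_score=2.0):
    """Return a training label for `text`, or None when the evidence is too weak.

    The recorded label is treated as one noisy vote rather than as ground truth.
    """
    scores = defaultdict(float)
    for pattern, label, weight in RULES:
        if pattern.search(text):
            scores[label] += weight
    if recorded_label is not None:
        scores[recorded_label] += recorded_weight
    if not scores:
        return None
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    # Abstain instead of passing a low-evidence label on to training.
    return best_label if best_score >= min_score else None
```

With these made-up rules, an invoice line reading "AWS cloud hosting subscription" that was recorded as travel would be relabeled as IT services, while "Taxi to airport" with no recorded label would be dropped from training rather than guessed.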
The labeling processes were the most crucial part to get completed, but data-centric AI presents a whole host of challenges on top of normal DevOps for ML. As more models were deployed and more customers were onboarded, more types of data issues arrived. These included data conversion issues, OCR errors, platform bugs, etc. Data issues were detected from a variety of sources: release QA, incident management, client feedback, bug reports, monitoring, data validation, metrics, etc. Many more went undetected until exploratory data analysis (EDA) was done, which took time. I essentially needed to automate EDA for data quality, and my first hire to the team was a dedicated developer to work on this.
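As a rough illustration of what automating one small piece of EDA for data quality can look like, the sketch below computes the share of missing values per field and flags fields above a threshold. The function name, field names, and the 5% default are assumptions made for this example and do not describe our actual tooling.

```python
def field_quality_report(records, fields, max_missing_rate=0.05):
    """Report the missing-value rate of each field across a list of dict records.

    A value counts as missing when the key is absent, None, or an empty string.
    """
    if not records:
        return {}
    report = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        rate = missing / len(records)
        report[field] = {"missing_rate": rate, "flagged": rate > max_missing_rate}
    return report
```

Run on a schedule against each client's incoming data, a report like this can surface conversion or OCR problems before they show up as prediction errors.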
Detecting the issues was crucial, but soon the majority of our time went to maintaining the models and keeping accuracy acceptable. It was obvious that we’d need better tooling and processes in the long run. Some of the ops tools we built early on were metrics and monitoring dashboards, with support from the engineering ops team. On the model side, we built held-out testing of all models as part of training, as well as model testing with a worker process at each pull request, and a QA procedure following each release. The model testing was a game-changer in preventing production issues from models. With these tests in place, we had close to no model failures and had to roll back models around once a year. As an additional layer of prevention, I added variable typing and validation to all of the models, so that data incoming to the models was always typed and validated.
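Typing and validating model inputs can be sketched with nothing more than the standard library. The example below is an assumption-laden illustration rather than our production schema: the `InvoiceInput` fields and the `parse_invoice_input` helper are hypothetical. Raw values are coerced to declared types once, at the boundary, and anything that fails a check raises an exception before it reaches a model.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class InvoiceInput:
    """Hypothetical typed input record for a model."""
    vendor_name: str
    total_amount: float
    currency: str
    invoice_date: date

    def __post_init__(self):
        if not isinstance(self.vendor_name, str) or not self.vendor_name.strip():
            raise ValueError("vendor_name must be a non-empty string")
        if not isinstance(self.total_amount, float) or not math.isfinite(self.total_amount):
            raise ValueError("total_amount must be a finite float")
        if not (isinstance(self.currency, str) and len(self.currency) == 3
                and self.currency.isalpha() and self.currency.isupper()):
            raise ValueError("currency must be a 3-letter uppercase code")
        if not isinstance(self.invoice_date, date):
            raise ValueError("invoice_date must be a date")


def parse_invoice_input(raw: dict) -> InvoiceInput:
    """Coerce a raw payload into a validated InvoiceInput, raising on bad data."""
    return InvoiceInput(
        vendor_name=raw["vendor_name"],
        total_amount=float(raw["total_amount"]),
        currency=str(raw["currency"]).strip().upper(),
        invoice_date=date.fromisoformat(raw["invoice_date"]),
    )
```

Failing loudly at this point turns a silent accuracy problem into an ordinary, traceable error.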
Once I had tackled the main data quality problems in building ML models in this domain, I started developing the autonomous AI solution for invoice processing. This was a challenging task, as there was no existing research on the topic, and it wasn’t known how well this could be done with the current technology. But I saw no problems from an ML design point of view. I had access to massive amounts of data, solid support from both engineering and management, and relaxed time-frames to build the solution. I had zero doubts about accepting the challenge.
The autonomous AI problem in accounting has some simplifying properties, compared to the commonly known problem of autonomous driving. Most importantly, we don’t need to automate all documents. With autonomous driving, the AI can’t choose to stop controlling the vehicle in a difficult situation. In our case, it is acceptable or even desirable that an anomalous document will get sent for human review instead of automation. Another simplification is that automation error rates don’t need to be arbitrarily close to 0. With autonomous driving, any type of error has a potential cost of a life. In our case, the worst types of errors can be easily prevented, many types of errors are inconsequential or average out, and many important types require only an error rate matching humans.
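The option of handing a document back to a human can be expressed as a simple routing rule. The sketch below is not the Autopilot design; it only illustrates the principle under assumed names and numbers. Each predicted field carries a confidence, each field has its own threshold reflecting how costly an error is, and a document is automated only when every field clears its threshold.

```python
def route_document(predictions, thresholds, default_threshold=0.99):
    """Decide whether a document is automated or sent for human review.

    predictions: dict mapping field name -> (predicted value, confidence in [0, 1])
    thresholds:  dict mapping field name -> minimum confidence for automation
    Returns (decision, fields that fell below their threshold).
    """
    if not predictions:
        # Nothing was predicted, so there is nothing safe to automate.
        return "human_review", []
    uncertain = sorted(
        field
        for field, (_value, confidence) in predictions.items()
        if confidence < thresholds.get(field, default_threshold)
    )
    if uncertain:
        return "human_review", uncertain
    return "automate", []
```

Fields where errors are inconsequential can be given lower thresholds than fields where an error is costly, which is how different acceptable error rates per error type translate into configuration.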
A key advantage was that I was designing the system on top of the platform we controlled, which at this stage was processing invoices for hundreds of companies, with an exponentially growing stream of documents. This meant I could develop the system to work on exactly the stream of data that it was deployed on, eliminating any bias coming from transferring an ML model from an external source. Also, I could fix any ML components causing issues, and request changes when needed to the backend, and the increasing data volume would enable better modeling later.
One complication in our case was the number and diversity of required predictions to automate. To automate the accounting workflow, the AI needed to predict all of the outcomes produced by an accountant, for invoices in different languages and jurisdictions. Fortunately, AI toolkits were going through constant evolution, providing better tools to address each modeling problem. When I started, Vowpal Wabbit and CPU-based gradient boosting were industry-standard solutions for large-scale problems. Available ML tools have since gone through numerous developments, such as GPU-based gradient boosting, PyTorch for deep learning, and transformer-based AI. Transformers in particular are a true revolution in applied ML and could become for AI what transistors were for electronics. This expanding toolkit allowed the design of solutions that were inconceivable just a few years before.
It took me a couple of months of focused work, and a third major iteration of the modeling approach to find a solution that worked well. I proposed the name “Accounting Autopilot” for the product, which got shortened to Autopilot. The solution was conceptually simple but would require considerable work getting the details right. Any solution to this problem would require high accuracy from all of the ML components, consistency of the platform, and detailed information from the platform at the time of each prediction. This meant I had to double down on the data-centric AI efforts, but at the same time, we were growing exponentially, so we had to upscale our AI stack and platform to match the growth.
Scaling AI on the Cloud
Since the beginning, we’ve had a partnership with one of the largest accounting technology companies in the Nordics, providing access to hundreds of millions of invoices. As we expanded across the US and EU, I had data on a similar scale arriving from our clients in multiple regions. It was clear the solution needed to scale massively, regularly training and predicting on millions of documents. The AI had to be scalable from the beginning, and portable to distributed cloud-based processing. Effectively this is composed of two parts: writing efficient, maintainable, and distributable code, and setting up distributed processing for both training and prediction.
The choice of Python tooling and best practices was a simple way to improve scalability. I discarded Jupyter Notebooks from any prototyping work. Instead, we deployed the same Python scripts that were prototyped. Soon after I came across Joel Grus’s famous “I Don’t Like Notebooks” presentation, which said everything I had been thinking about notebooks. Deploying scripts instead of notebooks removed the time required for production conversions, and meant that production issues could be debugged using the same exact code offline. The scripts could be optimized much further as well, starting with multiprocessing and serialization. This was done as far as reasonable with the Python data stack, so the base cost of operating the models was minimized. This included caching dataframes, Cython extensions for critical model code, use of GPU where possible, model pruning and compression, and regular analysis and optimization of processing bottlenecks.
A while before Vic, I had worked with Spark for a couple of years. The functional MapReduce paradigm it offered was exciting, but the PySpark integration had fundamental shortcomings. Inspired by Spark, I developed my own Python library Wordbatch, which could do MapReduce processing of any Python method or function, executed by a chosen scheduler backend, such as a serial process, multiprocessing, Dask, or Spark. I could develop code on a single node, and scale it on the cloud using Wordbatch, with nearly linear speedups in the number of cores. On a single 128-thread workstation, Wordbatch sped up most bottlenecks by 100x, enabling fast development iteration and model training. Soon after I developed the first models, Ray was announced. I started using Ray extensively as a Wordbatch backend, for both model development and training. It simplified using nested parallelism and allowed production training of models at a massive scale at a low cost. Today, we train and operate hundreds of thousands of models simultaneously in production.
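The pattern of running the same function under interchangeable scheduler backends can be shown with the standard library alone. To be clear, the sketch below is not Wordbatch's API; `map_reduce`, `count_tokens`, `merge_counts`, and the backend names are hypothetical stand-ins for the idea. Data is split into minibatches, a map function runs on each batch under the chosen backend, and the partial results are folded together.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import reduce


def split_batches(items, batch_size):
    """Split a list into consecutive minibatches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def map_reduce(map_fn, reduce_fn, items, batch_size=1000, backend="serial", workers=4):
    """Apply map_fn to each minibatch under the chosen backend, then fold with reduce_fn."""
    batches = split_batches(list(items), batch_size)
    if not batches:
        raise ValueError("items must be non-empty")
    if backend == "serial":
        mapped = [map_fn(batch) for batch in batches]
    elif backend == "threads":
        with ThreadPoolExecutor(max_workers=workers) as pool:
            mapped = list(pool.map(map_fn, batches))
    elif backend == "processes":
        # map_fn must be a picklable, module-level function for this backend.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            mapped = list(pool.map(map_fn, batches))
    else:
        raise ValueError(f"unknown backend: {backend}")
    return reduce(reduce_fn, mapped)


def count_tokens(batch):
    """Example map step: count whitespace-separated tokens in a batch of texts."""
    counts = Counter()
    for text in batch:
        counts.update(text.lower().split())
    return counts


def merge_counts(left, right):
    """Example reduce step: merge two partial token counts."""
    return left + right
```

Calling `map_reduce(count_tokens, merge_counts, texts, backend="serial")` while developing and switching to `backend="processes"` on a larger machine leaves the model code untouched, which is the property that made this style of processing useful. On platforms that spawn worker processes, the call with the process backend should sit under an `if __name__ == "__main__":` guard.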
On model serving, I considered the alternatives, as our initial Celery-based workers were neither scalable nor reliable. A couple of the senior developers in the team had been thinking about FastAPI-based model serving. I considered FastAPI and Ray Serve, and even discussed a collaboration with FastAPI creator Sebastián Ramírez. Ultimately, the Ops team was set on AWS serverless processing and wanted us to use Lambda and other AWS serverless tools. The choice was up to the backend developers, and the model serving side was scaled with AWS serverless tooling.
The third type of cloud computing was the use of external AI microservices. When I started, these were not available, but over the past four years, the cloud giants (Amazon, Google, Microsoft) started to offer an increasingly comprehensive selection of AI APIs, many usable for the type of document processing we were doing. None of these directly competed with our AI, but they offered useful basic building blocks for a solution, including OCR, machine translation, table processing, document AI, etc. I organized systematic evaluations of many of these, and we started using external APIs for some AI workloads. This had several benefits: the external services were scalable and reliable, the costs of each process were known, and the AI results were highly accurate in many cases. I could jumpstart projects knowing that the external API could be deployed to production within a couple of weeks, and the executive threshold for approving projects was lower than for launching internal R&D efforts with uncertain costs and outcomes.
Building an MLOps Team
Building and maintaining a commercial AI such as ours eventually required a larger team, due to the high requirements for accuracy and automation. I recruited the teams for Vic.ai in two stages, first as the 2–4 person ML team to build the initial solution, then expanding this to an 8-person MLOps team once the workloads became unsustainable for the small team. I received the full support of the execs to proceed with this, and personally planned each role and conducted the hiring processes.
One obstacle during this time was the “Great Resignation”, which was starting to affect hiring in AI. US-based candidates in particular had a wide selection of companies to choose from, high expectations for the role, and a low threshold to move on. Rather than compete for the talent pool in the US, engineering had the idea early on to hire EU-based developers and started an engineering office in Oslo before COVID. Many of these early developers were great hires and became valuable contributors.
I took this idea further and decided to hire in regions with time zones close to New York and Oslo, but with better talent pools for hiring: in particular, places with developers holding advanced engineering degrees in data and ML, but with few local AI companies to offer them roles. For Oslo, countries such as Poland, Spain, Portugal, Italy, and France had good engineering colleges and universities, but limited AI startup scenes. For New York, Brazil and Canada had similar qualities. Thanks to remote work, developers in the southern hemisphere were just as easy to collaborate with. Only the time difference to their team mattered, and even this not much.
I planned each role according to the current and future requirements we’d have for operating the AI, roughly corresponding to Google’s “Data science steps for ML”. For the small team, the hires were for data analysis, OCR, and backend engineering. With the larger team, I divided the team into data, ML, and ops, with 2–3 people operating each area. This meant that each area of the codebase would have at least two people who knew it in detail, in case of the proverbial bus incident, and multiple developers available for pair coding. The overall goal of the larger team would be building towards full MLOps maturity. All candidates were required to have at least a CS-related degree, some years of commercial software development experience, and specialization skills for their role.
The rest of our process was fairly standard. For all roles, we had multiple interviews, with the full team given a chance to interview and veto a candidate. All our advertised positions received hundreds of applicants, some thousands, so we needed to screen efficiently for suitable hires, if any were available. From the beginning, I conducted an initial screening interview, followed by a technical interview with the team, a coding test, and a final interview with one of the execs present. For the larger team, we had support from a technical recruiter, including an additional HR screening round. The process created a funnel that selected suitable candidates, but the time committed by the team was still considerable. Overall this was worth it. We got amazing coworkers who started contributing from their first weeks.
The most important lesson from these years is that the organizational foundation is a key requirement for building an AI solution. This includes several matters outside the work of AI developers, including the company vision, culture, executives, product, and industry connections. The executives and culture are required for AI development to have sufficient support and autonomy to succeed, while other matters external to AI development determine if the developed solution will lead to success in the market.
The success with Vic has shown that autonomous small remote teams are highly effective in building commercial AI. During the first 4 years, I designed, developed, and trained all of the models deployed in production, while the 1–3 other developers in the ML team handled the required ops, data engineering, and backend workloads. Moreover, the work was conducted almost entirely remotely, due to the pandemic. The effectiveness of small teams is well established in software engineering, and AI is not an exception. The crucial difference is in task management, where AI and data science require a different process at the project level.
Building a commercial AI solution is altogether separate from what is learned in ML degrees, research, or competitions, even if the theory and tools learned from these provide the required technical foundation. Commercial AI solutions are built in a cutting-edge environment, where the product vision strives for solutions not possible with last-generation technologies, and perhaps not with current ones. At the same time, the data-centric and scalability considerations introduce additional layers of complexity to the design problem. Finally, the work integrates with software engineering and product management, in a dynamic environment where various events determine the planning and long-term goals.
We’re currently in a golden age for AI, with the amounts of research, data, computation, software, developers, applications, companies, and funding all growing at exponential rates for the past decade. This is a period similar to what the 90s were for PCs, when every year brought unexpected ways to use increasingly better computers. At the same time, there’s increasingly heavy competition from both startups and the cloud giants. If one wants to build a product in AI, one strategy is to target a new AI application and establish a leadership position, much like Vic has done with autonomous accounting. Most of all, this is a time to be excited about the numerous uses for AI that are becoming available.