Behind the Disruption of Open Source Technology

Open source makes technology cheaper and better

If you are a developer, you can’t really avoid open source technologies. Your operating system may be Linux. Your database may be MySQL or MongoDB. From Hadoop and Spark to Pandas, from PyCharm to Jupyter, from scikit-learn to TensorFlow and PyTorch, open source software is everywhere in your development process. Beyond software, open source machine learning models are also used extensively in production systems, such as computer vision models pretrained on ImageNet, or natural language processing models like pretrained word embeddings or even BERT. When we talk about how AI is changing the world, we are really talking about how open source is changing the world. Most state-of-the-art tools and algorithms are published as academic research papers, open sourced as repositories on GitHub, and continuously adopted and improved by developers all over the world.

Chinese companies were long regarded as mere imitators. Now companies like Baidu, Alibaba and Tencent are becoming a strong force in the open source community. By open sourcing their own technology, they promote themselves and attract developers to join their ecosystems and build products on top of them. Contributing to a core open source project like TensorFlow is also good evidence of, and advertising for, a company’s technical capabilities. Nowadays not only internet companies but also traditional enterprises are starting to adopt their own open source policies.

With open source, even a junior student can start using the most advanced technology within a fairly short time. The barrier to entry for AI technology is fading away, bringing a lot of disruption.

However, Data is not cheap

When technology becomes free, data is king. Companies with large volumes of data enjoy high valuations from investors because data can create a barrier that competitors find almost impossible to overcome.

However, not all data can be used efficiently for machine learning models. Most data scientists share the same experience: the bulk of the time on a machine learning project is spent cleansing the data rather than building the model. And if we are unlucky enough not to have enough high-quality labeled data, we need even more time to crawl and label the data manually.

In other circumstances, different participants own different data that is not shared, creating multiple “data islands”. Without joining these datasets, it is hard to realize their full value. We are exploring cutting-edge technologies like Federated Learning to break down the barriers between data islands and build an ecosystem for the industry while protecting the participants’ data privacy and data ownership.
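To make the idea concrete, here is a minimal sketch of federated averaging, one common approach to federated learning, on a toy problem. Everything in it is an assumption made for illustration: the two synthetic "islands", the linear model, the learning rate and the helper names (`local_update`, `federated_average`, `make_island`) are hypothetical and do not describe any production system. The point is only that each participant trains on its own data and shares model weights, never the raw records.

```python
# Hypothetical illustration of federated averaging on a toy regression task.
# Each participant keeps its raw data local and shares only model weights.
import random


def local_update(weights, data, lr=0.05, epochs=20):
    """Run a few gradient descent steps on one participant's private data.

    The model is y = w * x + b; `data` is a list of (x, y) pairs.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates, sizes):
    """Average the participants' weights, weighted by local dataset size."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return w, b


def make_island(rng, n, x_low, x_high):
    """Synthetic private dataset drawn from y = 3x + 1 plus small noise."""
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.05)) for x in xs]


rng = random.Random(0)
# Two data islands that cover different ranges of x and are never merged.
islands = [make_island(rng, 40, 0.0, 1.0), make_island(rng, 60, 1.0, 2.0)]

global_weights = (0.0, 0.0)
for _ in range(200):  # communication rounds
    updates = [local_update(global_weights, d) for d in islands]
    global_weights = federated_average(updates, [len(d) for d in islands])

print(global_weights)  # should land close to the true (3, 1)
```

Only the `(w, b)` pairs cross the boundary between participants in this sketch. Real deployments typically add protections such as secure aggregation or differential privacy, which are omitted here.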

Computational Resource is not cheap

With the development of new technologies like deep learning, computational resources have become critical for using state-of-the-art algorithms. Just a few years ago I could run machine learning models as a hobby on a personal GPU card. Now pretrained models for computer vision and natural language processing often contain tens or even hundreds of millions of parameters, and training them is well beyond what standard personal computers can handle. In that sense the democratization of machine learning is a thing of the past: we have to pay more to get enough computing resources for state-of-the-art algorithms, probably from cloud services such as GPUs on AWS or TPUs on Google Cloud.

Product Development is not cheap

Open source covers the technology under the hood, but whether a company or a project succeeds depends entirely on the product it provides. If customers don’t buy your product, the fancy technology behind it counts for nothing. Developing a compelling product requires a lot of work. A beautiful and simple user interface is usually the key.

For example, we have seen a lot of open source AutoML projects where data scientists no longer need to tune the model and instead let the machine decide which model best fits the data. A polished AutoML product like DataRobot pushes that even further and lets users build models without writing a line of code. Instead, users just drag and drop the data, pick the prediction target and click run. Similar examples include Rasa for building chatbots and Prodigy for data annotation.

The same careful product development applies to insurance products backed by machine learning models. Risk thresholds need to be carefully tuned, privacy and fairness need to be thoroughly considered, claim fraud detection needs to have a low false positive rate, and personal recommendations need to be accurate and emotionally engaging.
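As one illustration of what tuning a threshold can look like, the sketch below picks a fraud-score cutoff under a false positive budget. The function name `pick_threshold`, the scores, the labels and the 15% budget are all made up for the example, and a real insurer would tune on far more data and weigh business costs as well.

```python
def pick_threshold(scores, labels, max_false_positive_rate):
    """Return the lowest threshold whose false positive rate stays in budget.

    A claim is flagged when its score is >= threshold. `labels` holds 1 for
    confirmed fraud and 0 for legitimate claims. Returns None if even the
    highest threshold exceeds the budget.
    """
    negatives = sum(1 for y in labels if y == 0)
    if negatives == 0:
        raise ValueError("need at least one legitimate claim to measure FPR")
    best = None
    # Lowering the threshold can only flag more claims, so the false positive
    # rate never decreases; stop at the first threshold that breaks the budget.
    for t in sorted(set(scores), reverse=True):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if false_pos / negatives > max_false_positive_rate:
            break
        best = t
    return best


# Hypothetical model scores for ten historical claims (1 = confirmed fraud).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

threshold = pick_threshold(scores, labels, max_false_positive_rate=0.15)
print(threshold)  # 0.7: catches all three frauds, wrongly flags 1 of 7 legitimate claims
```

Choosing the lowest threshold that fits the budget catches as much fraud as possible while keeping the share of wrongly flagged legitimate claims bounded, which is the trade-off the paragraph above describes.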

Domain Knowledge is not cheap

I often get questions from my actuarial and underwriting colleagues asking whether what we are doing will steal their jobs. The truth is that technology is not stealing jobs but revolutionizing them. As mentioned before, labeled data is the most important ingredient for training a supervised machine learning model. In industries like insurance, people cannot produce the right annotations without domain knowledge. Besides, the predictions made by machines should always be audited by domain experts to ensure their correctness and fairness. AI is not only Artificial Intelligence but, even more, Augmented Intelligence, whose goal is to make the jobs of professionals like underwriters and actuaries easier.

One thing I find amazing is how eager people with domain knowledge, such as actuaries, are to learn data science. In the near future they will be the new data scientists, with unparalleled domain knowledge.

And Data Scientists are not (yet) cheap

To put all of the above together, we still need the right people. Data science is still a new subject without a rigorous education system. It doesn’t matter whether data scientists hold PhDs from the best universities in the world or just Coursera certificates, as long as they can spot opportunities, solve problems and generate value. Companies are reaching out to these data science professionals, and they all want to recruit the best, who are still quite expensive at the moment.