Updated: Jul 17
What I learned from reading Robert Monarch’s book about Data Science.
Robert Monarch, in his lovely book, does a great job of showing how data science is more about logistics than it is about algorithms. Today’s greatest challenge with artificial intelligence (AI) is moving from frozen neural networks to ones that can keep learning on the fly. While Siri and other personal assistants have done a poor job of inviting a meaningful conversation, ChatGPT is the first to show signs of true intelligence. Even though ChatGPT is a huge leap forward for humanity, it still struggles to keep up with the latest developments on the Internet. It learned what it could at the time of training, and it needs an update to stay relevant.
Humans are an essential part of machine learning
As most functional deep-learning engines rely on supervised learning, the supervisors, who are still human, are key to the process. The same goes for any form of ‘living’ machine learning system. The loop of continuously adapting the weights and biases resembles the adaptation used in transfer learning, but it still requires humans to teach the machine right from wrong.
Monarch has done a terrific job of clearly defining the challenges and the state-of-the-art methodology to tackle them. The book is full of real-life examples, cheat sheets at the end of each chapter, clear diagrams and concise summaries. Examples are drawn from computer vision tasks, such as object detection and semantic segmentation, from natural language processing tasks, such as sequence labeling and text generation, and from other machine learning tasks, like information retrieval. The book is as comprehensive as one can find on practical data science.
Why I enjoyed this book and highly recommend it to data scientists
This is the most comprehensive book I have read on practical data science. As AI is all about statistics, the two pillars this book leans on are sampling for uncertainty and sampling for diversity.
The two are often intertwined and so are the sampling strategies proposed by Monarch. Those sampling strategies are crucial to meet the logistical challenge of needing human annotators in the loop.
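To give a feel for the uncertainty side, here is a minimal sketch of least-confidence sampling, one common way to score how unsure a model is. It is my own illustration rather than code from the book: the item ids, the probabilities and the function names are all made up, and it covers uncertainty only, not diversity.

```python
from typing import Dict, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Normalized least-confidence score in [0, 1]; 1 means most uncertain."""
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two classes")
    # (1 - top probability), rescaled so a uniform prediction scores 1.0.
    return (1.0 - max(probs)) * n / (n - 1)


def rank_for_annotation(predictions: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k items the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: item id -> predicted class probabilities.
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
}

# The two items that would go to the human annotator first.
to_label = rank_for_annotation(predictions, k=2)
print(to_label)
```

In this toy data the confident prediction (img_001) is left alone, and the annotator's time goes to the items where the model is closest to guessing.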
Perhaps the most intriguing strategies the book covers relate to active transfer learning. Ask your model about its own performance. Given only a small number of new annotations, the model can improve human-in-the-loop efficiency simply by learning to assess its own accuracy, or its applicability to data that differs from what it was trained on, and so pick the right samples for the human annotator to label.
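As I understand the idea, it can be sketched roughly like this: take validation items, mark each one by whether the model got it wrong, train a second small model to predict "wrong", and then send the unlabeled items it flags to the annotator. The sketch below is my own simplification, not the book's code. A tiny logistic regression stands in for retraining a network's final layer, and the two-number feature vectors are hypothetical stand-ins for the model's internal representation of each item.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predicted_error(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Estimated probability that the original model gets this item wrong."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_error_predictor(
    features: Sequence[Sequence[float]],
    was_wrong: Sequence[int],
    epochs: int = 500,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, was_wrong):
            grad = predicted_error(x, weights, bias) - y
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


# Hypothetical validation items: features plus 1 if the model was wrong, else 0.
validation_features = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
was_wrong = [1, 1, 0, 0]

weights, bias = train_error_predictor(validation_features, was_wrong)

# Hypothetical unlabeled items, scored by how likely the model is to be wrong.
unlabeled: Dict[str, List[float]] = {
    "a": [0.15, 0.85],
    "b": [0.85, 0.15],
    "c": [0.50, 0.50],
}
scores = {item_id: predicted_error(x, weights, bias) for item_id, x in unlabeled.items()}
to_review = sorted(scores, key=scores.get, reverse=True)[:1]
print(to_review)
```

Here item "a" looks most like the validation items the model got wrong, so it is the one passed to the human annotator.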
Applied adaptively, active transfer learning even lets you predict how effective the human intervention in the continual learning process will be, which can help with planning. The model's mistakes in fact help it improve, which is one of the greater goals to achieve with machine learning. Human intelligence learns from its mistakes all the time.
Embeddings and contextual representations are among the most practical ways to ease the annotation effort, where time-consuming annotation tasks are replaced by easier ones, allowing dramatically more data to be included in the model training. There is a chapter in the book covering embeddings alongside the use of synthetic data and other means to face the data annotation challenge.
The book is a real delight for data science practitioners. It is very easy to navigate and can be used as a reference. There are even code samples in the book and on an accompanying GitHub repository, as well as an introduction to eel, a Python library to easily connect Python with HTML, that is very relevant for human-in-the-loop active learning applications.
I obviously enjoyed this book and found it applicable to what we do every day. I would be happy to hear from anyone else who feels the same or is just getting started on a project and needs some advice.