The flux of data is increasing exponentially in this age of digital awakening. Data has become so important to major industries and sectors around the globe that it can literally be called digital gold! From simple company-centric applications to major platforms interweaving people from all around the world, data has started to shape major decisions, not only for autonomous machines but for the human race as a whole. This imagery is as intriguing as it is terrifying, but only if we make it so.
To handle this rapidly incoming data with relative ease, a competent system is required that can act instantly and deliver results on the fly. Otherwise, large-scale investments in data gathering and data generation go to waste, since the data is left in a dormant state without any active or competent agent acting on it. This is where the concept of real-time data streaming and processing comes in. So, what is real-time data streaming?
As we already know, data is being generated from various sources at a lightning pace. If we pause to ingest enough data, process it in batches and provide the results only after considerable time has passed, those results tend to lose their relevance and reflect outdated patterns and trends. This happens mainly because of the high variance in the incoming data and because of time constraints.
For instance, suppose you have a machine that tells you which horse to bet on in a horse race, and you have the option of changing your choice during the race until the last lap commences. If your machine makes its prediction from the first lap, where horse A was showing promise, and predicts that horse A will win, while horse B actually pulls ahead during the third lap, you will lose your bet simply because your machine lags two laps behind. This problem can be avoided by processing the incoming data instantly; in other words, by real-time data streaming. A stack of old or historical data is studied, and incoming records are processed against the learned patterns so that results are delivered within milliseconds. In our example, the prediction machine would already have studied data about the different horses in the race, and based on the incoming data (the horse number, the position of the horse, the time since the beginning of the race, the number of contestants, etc.), it would instantly rank the different participants with the help of real-time data streaming.
How to Go About Real-Time Data Streaming?
Apache Kafka has turned out to be one of the most widely used frameworks for implementing real-time, mission-critical applications. Apache Kafka is integrated with efficient machine learning frameworks to enable model training and speedy delivery of results by supporting real-time data streaming.
What is Apache Kafka?
Kafka's website defines the project and its tasks as follows:
“Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”
“The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log”, making it highly valuable for enterprise infrastructures to process streaming data.” – Wikipedia
These definitions might seem like a mouthful at first, but as we go through this subject step by step, you will get the hang of it in no time!
Why use TensorFlow as the machine learning platform to integrate with Apache Kafka?
TensorFlow is one of the most popular and efficient open source machine learning platforms available. Its well-designed dataflow architecture lets users and developers build large-scale projects with minimal hassle and maximal resource optimization. It is thus a very competent platform to integrate with Apache Kafka for serving real-time data streaming.
TensorFlow's tf.keras and tf.data modules are responsible for streaming data in and out. Previously, however, these modules were limited in their usage and supported only a few data formats; support for Kafka streaming was not included in earlier versions of TensorFlow. TensorFlow-specific formats like tf.Example and TFRecord were also difficult to use in the big data world and in the data science community as a whole, and were therefore rarely spotted there.
It was thus a difficult task to integrate the Apache Kafka and TensorFlow frameworks. A lot of intermediary bridges had to be constructed to establish reliable handshakes between the two frameworks and ensure smooth integration. This was a burdensome process, since it involved designing an entire infrastructure, and that infrastructure turned out to be fault-prone most of the time. These were the steps required to establish a working data-streaming flow:
Read data from the Kafka stream -> convert it to the TFRecord format -> call TensorFlow to read the TFRecord object from the file system -> execute the model and deliver the result -> save the result to the file system again -> write the results/inference back to Kafka
Source: Kafka Summit NYC 2019, Yong Tang
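To make that contrast concrete, here is a rough, hypothetical sketch of what such bridging code could look like, assuming the kafka-python client and TFRecord files on disk (all topic names and paths are illustrative, not code from the talk):

```python
# A rough sketch of the old bridging flow (illustrative only):
# Kafka -> TFRecord file on disk -> TensorFlow -> file -> Kafka.
import tensorflow as tf
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# 1. Read data from the Kafka stream and convert it to TFRecord format.
consumer = KafkaConsumer("input-topic", bootstrap_servers="localhost:9092")
with tf.io.TFRecordWriter("/tmp/batch.tfrecord") as writer:
    for i, message in enumerate(consumer):
        writer.write(message.value)  # each message: a serialized record
        if i >= 999:                 # cut off a fixed-size batch
            break

# 2. Call TensorFlow to read the TFRecord objects back from the file system.
dataset = tf.data.TFRecordDataset("/tmp/batch.tfrecord")

# 3. Execute the model (omitted here), save the results to the file
#    system, then write the results/inference back to Kafka.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for result in ["..."]:  # stand-in for real inference output
    producer.send("output-topic", value=result.encode("utf-8"))
producer.flush()
```

Every hop in this chain (consumer, file system, producer) is a separate failure point that has to be monitored and maintained.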
However, with the release of TensorFlow 2.0, the tables turned: support for an Apache Kafka data streaming module was shipped, along with support for a varied set of other data formats, in the interest of the data science and statistics community (released in TensorFlow's I/O package).
Source: Kafka Summit NYC 2019, Yong Tang
With this development, it is now possible to enable real-time streaming with Kafka and TensorFlow with relative ease and minimal error. The mechanism is the KafkaDataset module (written in C++), which is part of the new release of the TensorFlow I/O package. KafkaDataset is implemented as a subclass of tf.data.Dataset, so it works just like any other data-streaming module: users can simply read data from a Kafka stream and use it in a TensorFlow graph, or feed it to tf.keras and other TensorFlow-specific modules for model training and evaluation. Writing back through an output stream is, of course, also possible.
Here is how to implement data streaming, processing, model training and inference gathering in just a few lines of code with Kafka support on TensorFlow.
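The following is a minimal sketch of that flow. It assumes the tensorflow-io KafkaDataset and KafkaOutputSequence APIs of that era; the topic names, the simple CSV record format and the toy model are illustrative placeholders rather than the exact code from the talk:

```python
# Minimal sketch: stream from Kafka, train, write predictions back.
import tensorflow as tf
import tensorflow_io.kafka as kafka_io

def decode(message):
    # Hypothetical record format: a "feature,label" CSV string.
    feature, label = tf.io.decode_csv(message, record_defaults=[[0.0], [0.0]])
    return tf.reshape(feature, [1]), label

# Stream training records straight from a Kafka topic.
train = kafka_io.KafkaDataset(["train-topic"], group="demo", eof=True)
train = train.map(decode).batch(32)

# Define and train the model directly on the streamed dataset.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="adam", loss="mse")
model.fit(train, epochs=5)

# A Keras callback that pushes every prediction back to an output
# Kafka topic through KafkaOutputSequence.
class OutputCallback(tf.keras.callbacks.Callback):
    def __init__(self, topic, servers="localhost"):
        super().__init__()
        self._sequence = kafka_io.KafkaOutputSequence(topic=topic, servers=servers)
        self._index = 0

    def on_predict_batch_end(self, batch, logs=None):
        for value in logs["outputs"]:
            self._sequence.setitem(self._index, str(float(value)))
            self._index += 1

    def flush(self):
        self._sequence.flush()

# Run inference on a test stream and write the results back to Kafka.
test = kafka_io.KafkaDataset(["test-topic"], group="demo", eof=True)
test = test.map(lambda message: decode(message)[0]).batch(32)
output = OutputCallback("result-topic")
model.predict(test, callbacks=[output])
output.flush()
```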
The KafkaDataset call streams the data in, and data processing and modeling commence immediately afterwards. We then move on to the Keras callback stage. Keras callbacks are very informative, since they provide an overview of the internal stages and statistical details of the model during the training or prediction process. Here, a callback built around KafkaOutputSequence writes the results to the output stream (with so much relative ease!). Finally, the predict function is called to run the model and gather the inference on the test dataset.
Source: Kafka Summit NYC 2019, Yong Tang
Real-time data streaming with Kafka and TensorFlow has not only eliminated the complicated infrastructure that previously bridged the wide gap between the two popular platforms, but has also made the process less error-prone and more approachable for real-time, mission-critical machine learning and data science systems. The example above shows how easy it now is to use Kafka with TensorFlow, with just one call for data streaming. Further development in this area looks highly promising and is sure to contribute manifold to the ease of scalability and smooth integration of big data, live or real-time data streaming, machine learning and deep learning techniques in smart and autonomous systems across the globe!
Get a grip on machine learning, data science, big data and several other intriguing topics by following our blog, or dive deeper with our detailed courses.
TensorFlow 2.0 is coming soon. And boy, are we super-excited! TensorFlow helped mainstream the trend of open-sourcing AI and DL frameworks for use by the community. And what has been the result? TensorFlow has become an entire ML ecosystem for all kinds of AI technology. Just to give you an idea, here are the features that an absolutely incredible community has added to the original TensorFlow package:
Features of TensorFlow contributed by the open source community (image source: Medium.com)
TensorFlow started out as a difficult-to-learn framework for deep learning from Google, with one difference: it was open-sourced. That may look like folly for a commercial company focused on profits, but it was the right thing to do, because the open source community took it up as its own and ported it to nearly every platform available today, including mobile, web, IoT, embedded and Edge Computing. From Python and C, it was ported to JavaScript, C++, C#, Node.js, F#, React.js, Go, Julia, R, Rust, Swift, Kotlin, Android, and even Scala, Haskell and numerous other languages and platforms. Then, after that complete conquest, Google moved to the next level of optimization: hardware.
Which means we now have CUDA (the library for executing ML code on GPUs) v8, v9 and v10 (9.2 left out), GPGPU and GPU-native code, TPUs (Tensor Processing Units: custom hardware from Google designed specifically for TensorFlow), Cloud TPUs, FPGAs (Field-Programmable Gate Arrays: custom programmable hardware), ASICs (Application-Specific Integrated Circuits) designed for TensorFlow, plus Intel's MKL, BLAS optimization and LINPACK optimization (the last three being low-level software optimizations for matrix, vector and linear algebra packages), and far more than can fit in the space of this article. To give you a rough idea of what the TensorFlow architecture looks like now, have a look at this highly limited graphic:
Source: planspaces.org
Note: XLA stands for Accelerated Linear Algebra, a compiler (still in development) that provides highly optimized computational performance gains.
And Now TensorFlow 2.0
Google is expected to ship this release within the next six months. Some of its most exciting features are:
Keras Integration as the Main API instead of raw TensorFlow code
Simplified and Integrated Workflow
Eager Execution
More Support for TensorFlow Lite and TensorFlow Edge Computing
Extensions to TensorFlow.js for Web Applications and Node.js
TensorFlow Integration for Swift and iOS
TensorFlow Optimization for Android
Unified Programming Paradigms (Directed Acyclic Graph/Functional and Stack/Sequential)
Support for the new upcoming WebGPU Chrome RFC proposal
Integration of the best tf.contrib package implementations into the core package
Expansion of tf.contrib into Separate Repos
TensorFlow AIY (Artificial Intelligence for Yourself) support
Improved TPU & TPU Pod support, Distributed Computation Support
Improved HPC integration for Parallel Computing
Support for TPU Pods up to v3
Community Integration for Development, Support and Research
Domain-Specific Community Support
Extra Support for Model Validation and Reuse
End-to-End ML Pipelines and Products available at TensorFlow Hub
And yes – there is still much more that I can’t cover in this blog.
Wow – that’s an Ocean! What can you Expand Upon?
Yes, that is an ocean. But to keep things as simple as possible (and yes, to stick to the word limit, because I could write a thousand words on every one of these topics and end up with a book instead of a blog post!), we'll focus on the most exciting and striking topics. ALL of them are exciting; we'll cover the ones with the most scope for our audience.
1. Keras as the Main API to TensorFlow
From www.keras.io
Earlier, comments like these below were common on the Internet:
“TensorFlow is broken” – Reddit user
“Implementation so tightly coupled to specification that there is no scope for extension and modification easily in TensorFlow” – from a post on Blogger.com
“We need a better way to design deep learning systems than TensorFlow” – Google Plus user
In response to such feedback from the community, Keras was created as an open source project designed to be an easier interface to TensorFlow. Its popularity grew very rapidly, and now nearly 95% of real-world ML tasks can be written using Keras alone. Packaged as 'Deep Learning for Humans', Keras is simply easier to use. Though, of course, PyTorch gives it a real run for its money as far as simplicity is concerned!
In TensorFlow 2.0, Keras has been adopted as the main API for interacting with TensorFlow. Support for pure TensorFlow has not been removed, and a compatibility layer plus a conversion tool will help migrate TensorFlow 1.x code to TensorFlow 2.0 where implementation details differ. Kind of like the Python tool 2to3.py! So now Keras is the main API for TensorFlow deep learning applications, which removes a huge amount of unnecessary complexity from the ML engineer's plate.
2. Simplified and Integrated Workflow
TF 2.0 promotes a simple, recommended workflow:
Use tf.data for data loading and preprocessing, or use NumPy.
Use Keras or Premade Estimators to do your model construction and validation work.
Use tf.function for DAG graph-based execution or use eager execution ( a technique to smoothly debug and run your deep learning model, on by default in TF 2.0).
For TPUs, GPUs, distributed computing, or TPU Pods, utilize Distribution Strategy for high-performance distributed deep learning applications (a minimal sketch of this workflow follows below).
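To make that workflow concrete, here is a minimal, hypothetical sketch; the data, model and numbers are invented purely for illustration:

```python
import numpy as np
import tensorflow as tf

# 1. tf.data (or NumPy) for data loading and preprocessing.
x = np.random.rand(1000, 10).astype("float32")
y = (x.sum(axis=1) > 5.0).astype("float32")
train = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(32)

# 2. Keras for model construction, training and validation.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train, epochs=3)

# 3. tf.function for graph-based execution (eager is the default).
@tf.function
def predict_one(features):
    return model(features)

print(predict_one(tf.constant(x[:1])))

# 4. For multi-GPU/TPU work, the model-building code above would
#    simply be wrapped in a tf.distribute strategy scope.
```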
TF 2.0 standardizes SavedModel as the serialized version of a TensorFlow graph for a variety of platforms, from Mobile, JavaScript, Edge and Lite to TensorBoard, TensorFlow Hub and TensorFlow Serving. This makes it much easier to move models between architectures, a feature that was sorely needed compared to the older scenario.
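For example, exporting and reloading a model in the SavedModel format takes just a couple of calls (the trivial model and the path here are placeholders):

```python
import tensorflow as tf

# Any trained tf.keras model will do; a trivial one for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Export as a SavedModel: a serialized graph plus its weights.
tf.saved_model.save(model, "/tmp/my_saved_model")

# Reload the same artifact for serving, conversion or inspection.
reloaded = tf.saved_model.load("/tmp/my_saved_model")
print(reloaded.signatures)
```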
This means that now even novices at machine learning can perform deep learning tasks with relative ease. And of course, did we mention the wide variety of end-to-end pluggable deep learning solutions available on TensorFlow Hub and in the Tutorials section? And guess what: they're all free to download and use for commercial purposes. Google, you are truly the best friend of the open source community!
3. Expanded Support for Mobile (Android and iOS), Web (JavaScript), TF Lite, TF Edge and IoT
From Medium.com
On all the above platforms, where computational and memory resources are scarce, TF 2.0 follows a common set of trends:
Greater support for various ops in TF 2.0 and several deployment techniques
SIMD+ support for WebAssembly
Support for Swift (iOS) in Colab.
Increased support for data input pipelines, and data visualization libraries in JavaScript.
A smaller and lighter footprint for Edge Computing, Mobile Computing and IoT
Better support for audio and text-based models
Easier conversion of trained TF 2.0 graphs (see the sketch after this list)
Increased and improved mobile model optimization techniques
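As one concrete example of that easier conversion, here is a sketch of turning a SavedModel into a TensorFlow Lite flatbuffer for mobile and edge deployment (the model and paths are placeholders):

```python
import tensorflow as tf

# Build and export a trivial model first (placeholder for a real one).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
tf.saved_model.save(model, "/tmp/my_saved_model")

# Convert the SavedModel into a TF Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
tflite_model = converter.convert()
with open("/tmp/model.tflite", "wb") as f:
    f.write(tflite_model)
```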
As you can see, Google knows that Edge and Mobile are the future as far as computing is concerned, and has designed its products accordingly. TF Mobile should be replaced by TF Lite soon.
4. Unified Programming Models and Methodologies
There are two major ways to code deep learning networks in Keras, the first of which comes in both a graph (functional) and a sequential flavor. They are:
Graph-Based/Functional/Symbolic API
We build models symbolically by describing the structure of their DAG (Directed Acyclic Graph) or as a sequential stack. The following is an example of Keras code written symbolically.
From Medium.com TensorFlow publication
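In the same spirit, a minimal functional-API sketch might look like this (the layer sizes and shapes are arbitrary):

```python
import tensorflow as tf

# The model is declared as a DAG of layer calls from inputs to outputs.
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()  # the DAG structure is easy to inspect and visualize
```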
This looks familiar to most of us, since this is how we usually use Keras. The advantages of this approach are that it is easy to visualize, errors are usually caught at compile (graph construction) time rather than at runtime, and the code corresponds to our mental model of the deep learning network, making it easy to work with.
Stack-Based/Subclassing/Imperative API
The following code is an example of the subclassing (imperative) paradigm for building a deep learning network:
From Medium.com TensorFlow publication (code still in development)
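In the same spirit, a minimal subclassing sketch might look like this (again, the layer sizes are arbitrary):

```python
import tensorflow as tf

class MLP(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # Layers are created as attributes in __init__.
        self.hidden = tf.keras.layers.Dense(64, activation="relu")
        self.out = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, inputs):
        # The forward pass is plain, imperative Python.
        return self.out(self.hidden(inputs))

model = MLP()
predictions = model(tf.zeros((1, 784)))  # the model builds on first call
```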
Rather similar to object-oriented Python, this style was first introduced to the deep learning community in 2015 and has since been adopted by a variety of deep learning libraries. TF 2.0 has complete support for it. Although it appears simpler, it has some serious disadvantages.
Imperative models are not transparent data structures but opaque class instances, so many errors surface only at runtime with this approach. As a deep learning practitioner, you are obliged to know both the symbolic and the imperative/subclassing methodologies for coding your deep neural network: dynamic architectures such as recursive (tree-structured) neural networks, for example, cannot be conveniently expressed in the symbolic programming model. So it is good to know both, but be aware of their disparate advantages and disadvantages!
5. TensorFlow AIY
From slideshare.com
This is a brand new offering from Google, with similar kits available from other AI companies such as Intel. AIY stands for Artificial Intelligence for Yourself (a play on DIY, Do It Yourself) and is a new initiative from Google to show consumers how easy it is to use TensorFlow in their own DIY devices to create AI-enabled projects and gadgets. This is a very welcome trend, since it literally brings the power of AI to the masses at a very low price. I honestly feel the day is nearing when schoolchildren will bring their AIY projects to school exhibitions, and the next generation of whiz kids will be chock full of AI expertise, developing new, highly creative and innovative AI products. This is a fantastic trend, and now I have my own to-buy-and-play-with list, if I can order these products on Google at a minimal shipping charge. So cool!
6. Guidelines and New Incentives for Community Participation and Research Papers
We are running out of the word limit very fast! I had hoped to cover TPUs, TPU Pods and distributed computation, but for now, this is my final point. Recognizing the massive role the open source community has played in the development of TensorFlow as a worldwide brand for deep learning neural nets, the company has set up various guidelines to encourage domain-specific innovation and the collaborative authoring of research papers and white papers from the TensorFlow community. To quote:
Grow global TensorFlow communities and user groups.
Collaborate with partners to co-develop and publish research papers.
Continue to publish blog posts and YouTube videos showcasing applications of TensorFlow, and build user case studies for high-impact applications.
In fact, when I read more about the benefits of participating in the TensorFlow open source development process, I could not help myself: I joined the TensorFlow development community as well!
A Dimensionless Technologies employee contributing to TensorFlow!
Who knows – maybe, God-willing, one day my code will be a part of TensorFlow 2.0/2.x! Or – even better – there could be a research paper published under my name with collaborators, perhaps. The world is now built around open source technologies, and as a developer, there has never been a better time to be alive!
In Conclusion
So don't forget: on the day of writing this blog article, 31st January 2019, TensorFlow 2.0 is yet to be released. But since it's an open source project, there are no secrets, and Google is (literally) being completely 'open' about the steps it will take to push TF further as the world market leader in deep learning. I hope this article has increased your interest in AI, open source development, Google, TensorFlow, deep learning, and artificial neural nets. Finally, I would like to point you to some other articles on this blog that focus on Google TensorFlow. Visit any of the following blog posts for more details on TensorFlow, artificial intelligence trends and deep learning: