9923170071 / 8108094992 info@dimensionless.in
Real-Time Data Streaming with Kafka and TensorFlow

Real-Time Data Streaming with Kafka and TensorFlow

The flux of data is increasing exponentially in this age of Digital awakening. Data has become so important to major industries and sectors around the globe that it can literally be referred to as digital gold! From simple company centric applications to major platforms interweaving people from all around the world, data has started to shape major decisions for not only autonomous machines, but also for the human race as a whole. This imagery is as intriguing as it is terrifying, but only if we make it so.

In order to handle this rapidly incoming data with relative ease, a competent system is required to act instantly and deliver results on the fly. Otherwise, such large-scale investments on data gathering and data generation will go to waste since the data will be left in its dormant state without any active or competent agent acting on it. This is where the concept of real time data streaming and processing comes up. So, what is real time data streaming?

As is already known, data is being generated from various sources at a lightening pace. If we stop to ingest enough data, process it in batches and then provide the results after enough time has passed, the results will tend to lose its relevance and will reflect outdated patterns and trends. This happens majorly because of the high rate of variance in incoming data and also because of time constraints.

For instance, suppose that you have a machine which tells you which horse to bet upon in a horse race. You have the option of changing your choice during the race until the last lap commences. In such a case, if your machine gives you predictions based on the first lap where horse A was showing promise, and predicts that horse A will win, where in fact, during the third lap, horse B shows further promise, you will lose your bet just because of a machine which lags behind by two laps. This problem can be avoided by processing incoming data instantly, or in other words, real time data streaming. A stack of old data or historical data is studied and incoming records are processed based on the studied patterns such that the results are delivered within milliseconds. For our example, the horse race prediction machine would have already studied data about the different horses in the race previously and then based on the incoming data (the horse number, position of the horse, time since beginning of race, number of contestants, etc.), will be able to instantly allocate a rank for the different participants with the help of real time data streaming.

 

How to Go About Real-Time Data Streaming?

In real time mission-critical applications, Apache Kafka has turned out to be one of the most widely used frameworks for implementation. Apache Kafka is integrated with efficient machine learning frameworks in order to enable model training and speedy deliverance by supporting real time data streaming.

 

What is Apache Kafka?

As per Kafka’s website, it defines itself and its tasks as follows:

“Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”

“The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log”, making it highly valuable for enterprise infrastructures to process streaming data.” – Wikipedia

These definitions might seem like a mouthful at first, but as we go through with this subject step by step in this discussion, one will easily get the hang of it in no time!

 

Why use Tensorflow as the machine learning platform which is to be integrated with Apache Kafka?

 

Tensorflow is one of the most popular and efficient open source machine learning platforms available. It has a beautiful and well-suited architecture which enables data flow with extreme grace and optimization. It enables users and developers to establish large-scale projects with minimal hassles and maximal resource optimization. It is thus, a very competent platform to integrate with Apache Kafka for the purpose of serving real-time data streaming.

Tensorflow’s tf.keras and tf.data are responsible for streaming data in and out. Previously however, these modules were limited in their usage and could only support a few data formats. Support for Kafka streaming was not included during the earlier versions of Tensorflow. It was also difficult to use Tensorflow supported modules like tf. Examples and TFRecord in Big data and the general community of Data Science as a whole and were, therefore, rarely spotted.

 

It was thus, a difficult task to integrate the Apache Kafka and Tensorflow frameworks. A lot of intermediary bridges had to be constructed in order to establish reliable handshakes between these two frameworks and ensure smooth integration. This was a burdensome process since it included designing of an entire infrastructure which turned out to be a fault prone mechanism most of the time. These were the steps which were required to be followed in order to establish a working data streaming flow:

Read data from the Kafka stream -> Convert to TFRecord format -> call Tensorflow’s function to read the TFRecord object from file system -> execute model and deliver result -> save the result in the file system again -> write results/ inference back to Kafka

inference for Kafka

Source: Kafka Summit NYC 2019, Yong Tang

 

However, with the release of Tensorflow 2.0, the tables turned and the support for Apache Kafka data streaming module was issued along with support for a varied set of other data formats in the interest of the data science and statistics community (released in the IO package from Tensorflow: here).

kafka dataset for tensorflow

Source: Kafka Summit NYC 2019, Yong Tang

 

With this development, it is now possible to enable real time streaming with Kafka and Tensorflow with relative ease and minimized error. This process is implemented with the use of KafkaDataset module (written in C++) which is a part of the new release of the Tensorflow IO package. KafkaDataset module has been integrated as a subclass of tf.data.Dataset module. This module works just like any other data streaming module where users can simply read data from a kafka stream and use it in a Tensorflow graph or feed it to tf.keras and other Tensorflow specific modules for model training and evaluation purpose. The option of writing back through output stream is also possible of course.

Here is how to implement data streaming, processing, model training and inference gathering in just a few lines of code with Kafka support on Tensorflow:

1. import tensorflow_io.kafka as kafka_io

2.dataset = kafka_io.KafkaDataset(‘topic’, server=’localhost’,group=’’)

#Preprocessing, if required

3.dataset=dataset.map(lambda x: ….)

#Model building

4.model = tf.keras.models….

5.model.compile(…)

6.model.fit(dataset, epochs=5)

#keras callback

7.class OutputCallback(tf.keras.callbacks.Callback):

8.  def.__init__(self, batch_size, topic, servers):

9. self.sequence = kafka_io.KafkaOutputSequence(topic=topic, servers=servers)

10.  self._batch_size = batch_size

11. def on_predict_batch_end(self, batch, logs=None):

12. self._sequence.setitem(index, class_names[np.argmax(output)])

#results with callback for streaming input and output

13.model.predict(test_dataset, callbacks=[OutputCallback(32,’topic’,’localhost’)])


Source: Kafka Summit NYC 2019, Yong Tang

 

Code Overview/ Explanation:

Line 2 simply streams in data with the help of the KafkaDataset module and data processing and modeling are immediately commenced as can be seen in lines 3 and 4. Thereafter, we move on to the keras callback stage. Keras callbacks are very informative since they provide an overview of the internal stages and statistical details of the model during the training or prediction process. The callback function is written in the 7th line. The KafkaOutputSequence is responsible for writing the results to the output stream (with so much relative ease!). In line 13 the predict function is called to get the model details and inference on the test dataset.

Kafka Dataset

Source: Kafka Summit NYC 2019, Yong Tang

 

Real time data streaming with Kafka and Tensorflow has not only helped in the elimination of the complicated infrastructure which previously bridged the wide gap between the two popular platforms, but has also made the process less error prone and more approachable for real time mission critical systems with respect to machine learning and data science. The above picture shows how easy it is now to implement Kafka along with Tensorflow with just one call for data streaming. Further development in this area looks highly promising and is sure to contribute manifold in the ease of scalability and smooth integration when it comes to Big Data, live or real time data streaming, machine learning and deep learning techniques to develop smart and autonomous systems across the globe!

Get a grip on the machine learning, data science, big data and several other intriguing topics by following our blogs or even our detailed courses provided in the links below:

Follow this link, if you are looking to learn data science online!

You can follow this link for our Big Data course, which is a step further into advanced data analysis and processing!

Additionally, if you are having an interest in learning Data Science, click here to start the Online Data Science Course

Furthermore, if you want to read more about data science, read our Data Science Blogs

What’s New in TensorFlow 2.0

What’s New in TensorFlow 2.0

New Features in TensorFlow 2.0

TensorFlow 2.0 is coming soon. And boy, are we super-excited! TensorFlow first began the trend of open-sourcing AI and DL frameworks for use by the community. And what has been the result? TensorFlow has become an entire ML ecosystem for all kinds of AI technology. Just to give you an idea,  here are the features that an absolutely incredible community has added to the original TensorFlow package:

TF 2.0 Features

From Medium.com  

Features of TensorFlow contributed from the Open Source Community

TensorFlow started out as a difficult-to-learn framework for deep learning from Google. With one difference – it was open-sourced. That may appear as stupidity for a commercial company that focuses on profits, but it was the right thing to do. Because the open source community took it up as their own property and ported it to nearly every platform available today including mobile, web, IoT, embedded, Edge Computing and so much more. And even more: from Python and C, it was ported to JavaScript, C++, C#, Node.js, F#, React.js, Go, Julia, R, Rust, Android, Swift, Kotlin, and even a port to Scala, Haskell, and numerous other coding languages. Then, after that complete conquest, Google went into the next level for optimization – hardware.

Which means – now we have CUDA (library for executing ML code on GPUs) v8-v9-v10 (9.2 left out), GPGPU, GPU-Native Code, TPU (Tensor Processing Unit – custom hardware provided by Google specially designed for TensorFlow), Cloud TPUs, FPGAs (Field-Programmable Gate Arrays – Custom Programmable Hardware), ASIC (Application Specific Integrated Circuits) chip hardware specially designed for TensorFlow, and now MKL for Intel, BLAS optimization, LINPACK optimization (the last three all low-level software optimization for matrix algebra, vector algebra, and linear algebra packages), and so much more that I can’t fit it into the space I have to write this article. To give you a rough idea of what the TensorFlow architecture looks like now, have a look at this highly limited graphic:

Some of TensorFlow features

Source: planspaces.org

Note: XLA stands for A(X)ccelerated Linear Algebra compiler still in development that provides highly optimized computational performance gains.

And Now TensorFlow 2.0

This release is expected shortly in the next six months from Google. Some of its most exciting features are:

  1. Keras Integration as the Main API instead of raw TensorFlow code
  2. Simplified and Integrated Workflow
  3. Eager Execution
  4. More Support for TensorFlow Lite and TensorFlow Edge Computing
  5. Extensions to TensorFlow.js for Web Applications and Node.js
  6. TensorFlow Integration for Swift and iOS
  7. TensorFlow Optimization for Android
  8. Unified Programming Paradigms (Directed Acyclic Graph/Functional and Stack/Sequential)
  9. Support for the new upcoming WebGPU Chrome RFC proposal
  10. Integration of tf.contrib best Package implementations into the core package
  11. Expansion of tf.contrib into Separate Repos
  12. TensorFlow AIY (Artificial Intelligence for Yourself) support
  13. Improved TPU & TPU Pod support, Distributed Computation Support
  14. Improved HPC integration for Parallel Computing
  15. Support for TPU Pods up to v3
  16. Community Integration for Development, Support and Research
  17. Domain-Specific Community Support
  18. Extra Support for Model Validation and Reuse
  19. End-to-End ML Pipelines and Products available at TensorFlow Hub

And yes – there is still much more that I can’t cover in this blog.  

Wow – that’s an Ocean! What can you Expand Upon?

Yes – that is an ocean. But to keep things as simple as possible (and yes – stick to the word limit – cause I could write a thousand words on  every one of these topics and end up with a book instead of a blog post!) we’ll focus on the most exciting and striking topics (ALL are exciting – we’ll cover the ones with the most scope for our audience).

1. Keras as the Main API to TensorFlow

Keras Deep Learning

From www.keras.io

Earlier, comments like these below were common on the Internet:

“TensorFlow is broken” – Reddit user

“Implementation so tightly coupled to specification that there is no scope for extension and modification easily in TensorFlow” – from a post on Blogger.com

“We need a better way to design deep learning systems than TensorFlow” – Google Plus user

Understanding the feedback from the community, Keras was created as an open source project designed to be an easier interface to TensorFlow. Its popularity grew very rapidly, and now nearly 95% of ML tasks happening in the real world can be written just using Keras. Packaged as ‘Deep Learning for Humans’, Keras is simpler to use.  Though, of course, PyTorch gives it a real run for the money as far as simplicity is concerned!

In TensorFlow 2.0, Keras has been adopted as the main API to interact with TensorFlow. Support for pure TensorFlow has not been removed, and thus TensorFlow 2.0 will be completely backwards-compatible, including a conversion tool that can be used to convert TensorFlow 1.x to TensorFlow 2.0 where implementation details differ. Kind of like the Python tool 2to3.py! So now, Keras is the main API for TensorFlow deep learning applications – which takes out a huge amount of unnecessary complexity burdens from the ML engineer.

2. Simplified and Integrated WorkFlow

There is a step-by-step simpler and extremely flexible workflow process for designing deep learning models: (visit https://medium.com/tensorflow/whats-coming-in-tensorflow-2-0 for more details)

  1. Use tf.data for data loading and preprocessing or use NumPy.
  2. Use Keras or Premade Estimators to do your model construction and validation work.
  3. Use tf.function for DAG graph-based execution or use eager execution ( a technique to smoothly debug and run your deep learning model, on by default in TF 2.0).
  4. For TPUs, GPUs, distributed computing, or TPU Pods, utilize Distribution Strategy for high-performance-computing distributed deep learning applications.
  5. TF 2.0 standardizes SavedModel as a serialized version of a TensorFlow graph for a variety of different platforms like Mobile, JavaScript, Edge, Lite, TensorBoard, TensorHub, and TensorServing. This makes it easier to move models around different architectures. This was one feature that was highly necessary compared to the older scenario.

This means that now even novices at machine learning can perform deep learning tasks with relative ease. And of course, did we mention the wide variety of end-to-end pluggable deep learning solutions available at TensorHub and on the Tutorials section? And guess what – they’re all free to download and use for commercial purposes. Google, you are truly the best friend of the open source community!

3. Expanded Support for Mobile (Android and iOS), Web (JavaScript), TF Lite, TF Edge and IoT

TF Lite Architecture

From Medium.com

In all the above platforms, where computational and memory resources are scarce, there is a common trend in TF 2.0 that extends over most of these platforms.

  1. Greater support for various ops in TF 2.0 and several deployment techniques
  2. SIMD+ support for WebAssembly
  3. Support for Swift (iOS) in Colab.
  4. Increased support for data input pipelines, and data visualization libraries in JavaScript.
  5. A smaller and lighter footprint for Edge Computing, Mobile Computing and IoT
  6. Better support for audio and text-based models
  7. Easier conversion of trained TF 2.0 graphs
  8. Increased and improved mobile model optimization techniques

As you can see, Google knows that Edge and Mobile is the future as far as computing is concerned, and has designed its products accordingly. TF Mobile should be replaced by TF Lite soon.

4. Unified Programming Models and Methodologies

There are two/three major ways to code deep learning networks in Keras. They are:

  1. Symbolic or Declarative APIs
  2. Imperative APIs / Subclassing

We shall look at both of them in turn, in a very minute way. For more on this topic, visit https://medium.com/tensorflow/what-are-symbolic-and-imperative-apis-in-tensorflow-2-0

Symbolic/Declarative/Graph-Based/Functional API

We build models symbolically by describing the structure of its DAG (Directed Acyclic Graph) or a sequential stack. This following image is an example of Keras code written symbolically.

Keras code

From Medium.com TensorFlow publication

This looks familiar to most of us since this is how we use Keras usually. The advantages of this process are that it’s easy to visualize, has debugging errors usually only at compile time, and corresponds to our mental model of the deep learning network and is thus easy to work with.

Stack-Based/Subclassing/Imperative API

The following code is an example of the Sequential paradigm or subclassing paradigm to building a deep learning network:

Subclassing

From Medium.com TensorFlow publication (code still in development)

Rather similar to Object Oriented Python, this style was first introduced into the deep learning community in 2015 and has since been used by a variety of deep learning libraries. TF 2.0 has complete support for it. Although it appears simpler, it has some serious disadvantages.

Imperative models are not a data structure that is transparent but an opaque class instead. You are prone to many errors at runtime following this approach. As a deep learning practitioner, you are obliged to know both symbolic as well as imperative and subclassing methodologies of coding your deep neural network. For example, recursive or recurrent neural networks cannot be defined by the symbolic programming model. So it is good to know both. But be aware of the disparate advantages and disadvantages of them!

5. TensorFlow AIY

AIY / DIY

From slideshare.com

This is a brand new offering from Google and other AI companies such as Intel. AIY stands for Artificial Intelligence for Yourself (a play on DIY – Do It Yourself) and is a new marketing scheme from Google to show consumers how easy it is to use TensorFlow in your own DIY devices to create your own AI-enabled projects and gadgets. This is a very welcome trend, since it literally brings the power of AI to the masses, at a very low price. I honestly feel that now the day is nearing when schoolchildren will bring their AIY projects for school exhibitions and that the next generation of whiz kids will be chock full of AI expertise and development of new and highly creative and innovative AI products. This is a fantastic trend and now I have my own to-buy-and-play-with list if I can order these products on Google at a minimal shipping charge. So cool!

6. Guidelines and New Incentives for Community Participation and Research Papers

We are running out of the word limit very fast! I hoped to cover TPUs and TPU Pods and Distributed Computation, but for right now, this is my final point. Realizing and recognizing the massive role the open source community has played in the development of TensorFlow as a worldwide brand for deep learning neural nets, the company has set up various guidelines to introduce domain-specific innovation and the authoring of research papers and white papers from the TensorFlow community, in collaboration with each other. To quote:

From the website https://www.tensorflow.org/community/roadmap :

Community

  • Continue public feedback on significant design decisions through the Request-for-Comment (RFC) process.

  • Create a contributors’ guide to augment our published governance and process.

  • Grow global TensorFlow communities and user groups.

  • Collaborate with partners to co-develop and publish research papers.

  • Continue to publish blog posts and YouTube videos showcasing applications of TensorFlow and build user case studies for high impact application

In fact, when I read more of the benefits of participating in the TensorFlow community open source development process, I could not help it, I joined the TensorFlow development community, myself as well!

TensorFlow Community

A Dimensionless Technologies employee contributing to TensorFlow!

Who knows – maybe, God-willing, one day my code will be a part of TensorFlow 2.0/2.x! Or – even better – there could be a research paper published under my name with collaborators, perhaps. The world is now built around open source technologies, and as a developer, there has never been a better time to be alive!

In Conclusion

So don’t forget, on the day of writing this blog article, 31th January 2019, TensorFlow 2.0 is yet to be released, but since its an open source project, there are no secrets and Google is (literally) being completely ‘open’ about the steps it will take to take TF further as the world market leader in deep learning. I hope this article has increased your interest in AI, open source development, Google, TensorFlow, deep learning, and artificial neural nets. Finally, I would like to point you to some other articles on this blog that focus on Google TensorFlow. Visit any of the following blog posts for more details on TensorFlow, Artificial intelligence Trends and Deep Learning:

Top 10 Data Science Tools (other than SQL Python R)

Top Trends for Data Science in 2019

Finally, do apply for our Deep Learning course (link given below) if you truly wish to become acquainted with TensorFlow in detail:

Deep Learning

May the joy of learning something new never leave you, no matter how old or young you are. Cheers!