Unless you’ve been living under a rock for the last four years, you have almost certainly heard of Bitcoin. You will also have heard about the technology behind Bitcoin: blockchain. Cryptocurrencies are banned in most cases in India and China, while the Americas and Europe still use them extensively. In my opinion, Asia stands to lose a lot if blockchain is not adopted widely. Make no mistake about it: blockchain technology will change the world as we know it. Forever.
Blockchain is the technology powering Bitcoin and other cryptocurrencies. For an explanation of what blockchain is and what Bitcoin is, you can go through any one of the articles below. Don’t worry, these articles are carefully selected to be as interesting and fun to read as possible. (This also gives me space to add my own original ideas instead of copying or rewording existing articles, and I have plenty of ideas!)
In fact, that last link is so amazingly simple, visual, and clear that I recommend everyone read it, just so that we’re on the same page.
Let me cut to the chase with a little confession. I was asked to write this article nearly 16 days ago. I already had some experience with blockchain, having gone through it extensively as a research topic for my own blog. Then a remarkable idea hit me: an idea for a startup that could (in theory) become a multi-billion dollar enterprise. I spent a few days refining it, even going so far as to see if I could start a company in this area myself, until reality set in: I lacked the experience and the business skills.
No sooner had this realization struck me and the excitement cooled a little than another idea to improve blockchain struck me, and I promise to sketch out that idea as well. I am doing this for two reasons:
I am a staunch supporter of the FOSS (free and open source software) movement and would like to be credited with the idea. I am starting a free-to-use, open source project on GitHub, which is currently moving towards an alpha release.
I believe in the power of technology to remove economic inequality. Now you may say that technology has evolved to the point that 4-5 monolithic companies dominate the entire world. But I believe that technology when used ethically has the potential to create more opportunities than it removes.
Blockchain has two major problems – energy consumption and resource consumption. But there are techniques that can alleviate both of these problems. We’ll deal with that as well in Part 2.
Finally, the vaunted hype about security for blockchain and cryptocurrencies is ridiculous when you think about it. For the sake of brevity, I will address the main security issues with blockchain in a separate article on Medium – (not here, since it has no relation to data science).
Application – A Personal Blockchain For Every Person On The Planet
In points (I assume you’ve gone through the graphical explanation of blockchain at least – if not you can review it here):
The trouble with end products of all types produced today is that there are so many intermediaries between the producer and the consumer that the producer receives a pittance compared to the final price. It would be nice if we could track a product everywhere that it is used.
This also applies to books, music, articles, poems, pictures, and digital content of any sort. Currently Amazon and YouTube monopolize content distribution, the latter with a complete disregard for copyright and media ownership and payment. Suppose we had a tracking system that recorded every view of a video and rewarded the original producer for it?
To emphasize the previous point, let us consider the case of Lindsey Stirling, a famous contemporary violinist who dances while playing. Her 118 video uploads have earned her 2,575,305,706 views (approximately 2.6 billion), and her earnings from YouTube ads last month were about 100K. Her net worth as of 10th April 2019 is 12 million USD (12,000,000).
But suppose Lindsey Stirling distributed her videos at a price of 1 USD per view. Her net worth would be about 2.6 billion USD at the very least! She would be a multi-billionaire had this platform existed. It doesn’t, yet. And because it doesn’t exist, she is roughly 2.56 billion USD poorer!
Everyone who knows blockchain technology will now see the concept, how blockchain can be used to overcome this problem, and its power. Disruptive power!
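The back-of-the-envelope figures in this thought experiment can be checked directly. The short Python sketch below uses only the numbers quoted in the article; the flat price of 1 USD per view is the hypothetical from the text, and the function names are made up for illustration.

```python
# Figures quoted in the article (view count and net worth as of April 2019).
TOTAL_VIEWS = 2_575_305_706
CURRENT_NET_WORTH_USD = 12_000_000
PRICE_PER_VIEW_USD = 1  # hypothetical flat price from the thought experiment


def hypothetical_revenue(views, price_per_view):
    """Gross revenue if every single view were paid for at a flat price."""
    return views * price_per_view


def shortfall(views, price_per_view, current_net_worth):
    """Difference between the hypothetical revenue and the current net worth."""
    return hypothetical_revenue(views, price_per_view) - current_net_worth


if __name__ == "__main__":
    revenue = hypothetical_revenue(TOTAL_VIEWS, PRICE_PER_VIEW_USD)
    gap = shortfall(TOTAL_VIEWS, PRICE_PER_VIEW_USD, CURRENT_NET_WORTH_USD)
    print(f"Hypothetical revenue: {revenue / 1e9:.2f} billion USD")
    print(f"Gap to current net worth: {gap / 1e9:.2f} billion USD")
```

This ignores taxes, platform costs and the obvious fact that far fewer people would watch at 1 USD per view; it only checks the arithmetic.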
The blockchain is a service that immutably assigns ownership.
The blockchain is also a database that stores every single transaction on a particular digitisable entity.
Finally, Ethereum smart contract technology means that we can route payments to every person through their own personal blockchain of all their digitisable goods.
This means we can build a world where a producer pays a user-defined amount to every entity which created a particular digitisable product.
On this platform or website or marketplace, producers can adjust their prices and their payments and consumers can buy directly from them.
Everything can be tracked on the blockchain. Your own database of your own transactions can be used with smart contracts to pay the maximum possible fee to the most deserving person in the supply chain – fixed by each producer.
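To make the three ingredients above a little more concrete (immutable records, a per-person transaction log, and producer-defined payouts), here is a toy Python sketch. It is not Ethereum and not a real smart contract: the `PersonalLedger` class, the `split_payment` helper and the share percentages are all hypothetical names and values chosen for illustration, and a real system would also need consensus, signatures and an actual payment rail.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Block:
    index: int
    payload: dict
    prev_hash: str
    hash: str = field(init=False)

    def __post_init__(self):
        self.hash = self.compute_hash()

    def compute_hash(self) -> str:
        # Hash the block contents together with the previous block's hash.
        body = json.dumps(
            {"index": self.index, "payload": self.payload, "prev_hash": self.prev_hash},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


class PersonalLedger:
    """A hash-chained log of events for one owner (toy model, no consensus)."""

    def __init__(self, owner: str):
        self.owner = owner
        self.blocks = [Block(0, {"genesis": owner}, "0" * 64)]

    def record(self, payload: dict) -> Block:
        block = Block(len(self.blocks), payload, self.blocks[-1].hash)
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        # Every stored hash must match the block contents ...
        if any(b.hash != b.compute_hash() for b in self.blocks):
            return False
        # ... and every block must point at the hash of its predecessor.
        return all(
            cur.prev_hash == prev.hash
            for prev, cur in zip(self.blocks, self.blocks[1:])
        )


def split_payment(amount_cents: int, shares: Dict[str, int]) -> Dict[str, int]:
    """Split a payment by integer percentage shares fixed by the producer.

    Rounding leftovers go to the first party listed (an arbitrary choice).
    """
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    payouts = {party: amount_cents * pct // 100 for party, pct in shares.items()}
    first_party = next(iter(shares))
    payouts[first_party] += amount_cents - sum(payouts.values())
    return payouts


if __name__ == "__main__":
    ledger = PersonalLedger("artist")
    ledger.record({"event": "view", "item": "video-1", "price_cents": 100})
    print(ledger.is_valid())
    print(split_payment(100, {"artist": 85, "editor": 10, "platform": 5}))
```

Changing any recorded payload after the fact makes `is_valid()` return `False`, which is the property the article relies on when it calls ownership assignment immutable.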
Hugely, Massively Disruptive
If you are interested or want to know more, you can leave a comment below with your email address. If you want to be a part of this new revolution and the new decentralised world – with all services provided free – please provide a comment below asking for my email ID with a statement of what and how you want to contribute to this endeavor. I promise to reply to every sincere query.
This is a fledgling project and a lot of work remains to be done. I will be writing articles and creating a team to work on this idea. Those of you who are interested please mail me at firstname.lastname@example.org.
This will be an open source project and all services have to be offered free of cost. How do you go about making a profit from this? You don’t! The only way this can be fair to all players in countries like India is if it is specially designed to be applicable to anyone.
So this article gave a small glimpse into a world without intermediaries, corporations, money-making middlemen, and running purely on smart contracts. This is applicable to AI and data science since this technology will not reach anywhere significant without extensive use of AI and data science.
The more data that is available, the more analysis can be performed on it. And unless we have analysts running fraud detection and monitoring systems full-time on such a system, we might as well never build it, because while blockchain data integrity cannot be hacked, cryptocurrencies are hackable and have been hacked extensively since the beginning of Bitcoin.
For Part 2 of this series on Blockchain Applications of Data Science, you can go to the link below:
The boundaries of the enterprise are becoming diffused. You have data on the network, on the endpoint, and on the cloud. Enabling visibility into your data flows is a critical first step to understanding which data is at risk for theft or misuse. You need to know what data you have, where it’s located, and why that data exists in order to properly protect it. This is where data discovery and data classification come into play.
Data discovery is an important foundation for gaining that knowledge of the what, where, and why of your data. Data classification allows you to create a scalable security solution. Solutions such as file tagging can be used across platforms from Windows to Mac and also enable you to tag files across the endpoint, network and cloud. This, in turn, gives you visibility into data across all of your infrastructure so you can apply the appropriate policies.
What is Amazon EMR and Apache Atlas
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Apache Atlas is a one-stop solution for data governance and metadata management on enterprise Hadoop clusters. Atlas has a scalable and extensible architecture which can plug into many Hadoop components to manage their metadata in a central repository. In this blog, we are going to look at one such data discovery and classification tool, i.e. Apache Atlas, and we will be running it on Amazon EMR. Let us look at the features of Apache Atlas at our disposal!
Apache Atlas Features
1. Centralised Metadata Store: Atlas provides true visibility in Hadoop. By using the native connectors to Hadoop components, Atlas provides technical and operational tracking enriched by business taxonomical metadata. Atlas facilitates easy exchange of metadata by enabling any metadata consumer to share a common metadata store, which in turn facilitates interoperability across many metadata producers.
2. Data Classification: The ability to dynamically create classifications such as PII, EXPIRES_ON and DATA_QUALITY. Classifications can include attributes, such as an expiry_date attribute in the EXPIRES_ON classification. Entities can be associated with multiple classifications, enabling easier discovery and security enforcement. Propagation of classifications via lineage automatically ensures that classifications follow the data as it goes through various processing steps.
3. Data Lifecycle Management: It leverages existing investment in Apache Falcon with a focus on provenance, multi-cluster replication, data set retention and eviction, late data handling, and automation.
4. Centralised Security: Fine-grained security for metadata access, enabling controls on access to entity instances and operations like adding, updating or removing classifications. Integration with Apache Ranger enables authorization and data masking on data access based on classifications associated with entities in Apache Atlas. Integration with HDP security enables you to establish global security policies based on data classifications and leverages the Apache Ranger plug-in architecture for security policy enforcement.
The Architecture of Atlas
This is the basic structure of how Atlas works. It has a core component for ingestion and export. In the back end, it uses the HBase database as the metadata store. It also requires Solr for indexing, and messages between the different components are passed through Kafka. You can connect to Atlas with REST API calls, and there is an admin console to manage and monitor all the operations. So this is the core structure of Atlas. Since it requires HBase, we need a working HBase instance running before we install Atlas. We also need a Solr instance.
Amazon EMR–Apache Atlas Workflow
To demonstrate the functionality of Apache Atlas, we will be doing the following in this post:
Launch an Amazon EMR cluster using the AWS CLI or AWS CloudFormation
View the data lineage of a Hive table
Create a classification
Discover metadata using the Atlas domain-specific language
Step 1: Launching Amazon EMR and Atlas
Now we will look at running Atlas on Amazon EMR. I have one cluster running on AWS as a single-node instance, on which I have installed Atlas and all its prerequisite components, such as HBase and Kafka, beforehand.
Now we can open the Atlas window by clicking Atlas. The default credentials are admin as the username and admin as the password. Once you log in successfully, you will land on the Atlas page.
Once we are logged into Atlas, we can see a few tabs right in front of us: tags, taxonomy and search.
A tag is simply a way of grouping certain objects. For example, we have a PII (personally identifiable information) tag, which we can add to databases, tables or columns wherever we need it. Through tags, and by using Apache Ranger, we can control access to these objects.
On the search page, we have two options: text-based search or DSL. DSL stands for domain-specific language, and it is similar to SQL. You can write SQL-like queries and get the details. By default, there are a lot of data sources you can connect to. We will select Hive DB here. Once selected, it will show all the Hive databases available. Furthermore, any additional databases we create will also be listed here.
Step 2: Viewing Data Lineage
Data lineage is defined as a data life cycle that conveys where data originates and where it moves over time. In Apache Hive, if I create a table (TableA) and then insert data from another table (TableB), the data lineage will display TableA as the target and TableB as the source/origin. These two tables are linked together by a process “insert into Table..”, allowing a user to understand the data life cycle. In a Hadoop ecosystem, Apache Atlas contains the data lineage for various systems like Apache Hive, Apache Falcon and Apache Sqoop.
Data lineage is one of the most important features of Apache Atlas. If we click on any table or view, we can see how the data is flowing. In simple terms, we get a history of the data flow.
In the image above, we have an initial file. I created a table on Hive and loaded the values into it from a text file. We can easily see this data load process in the first two nodes. After the data load, we placed all the data into the table “sij” underscore. After that, we made a view of that table.
You can see a much more complex data lineage in the image below.
In short, this is the data lineage which Apache Atlas provides and this is how you can view it.
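Conceptually, lineage is a directed graph in which processes connect source entities to target entities. The Python sketch below is not how Atlas stores lineage internally; it is a minimal model, with a hypothetical `LineageGraph` class and made-up entity names, that shows how "where did this table come from?" can be answered by walking the graph backwards.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage model: each target remembers its (source, process) inputs."""

    def __init__(self):
        self._inputs = defaultdict(list)

    def add_process(self, source, target, process):
        """Record that `process` read from `source` and wrote to `target`."""
        self._inputs[target].append((source, process))

    def upstream(self, entity):
        """Return every entity that `entity` was derived from, nearest first."""
        found = []
        queue = deque([entity])
        while queue:
            current = queue.popleft()
            for source, _process in self._inputs.get(current, []):
                if source not in found:
                    found.append(source)
                    queue.append(source)
        return found


if __name__ == "__main__":
    graph = LineageGraph()
    # Entity and process names below are illustrative only.
    graph.add_process("TableB", "TableA", "insert into TableA select * from TableB")
    graph.add_process("TableA", "TableA_view", "create view TableA_view")
    print(graph.upstream("TableA_view"))
```

In Atlas the same information is collected automatically by the Hive hook and rendered in the lineage tab, so a sketch like this is only useful for understanding what the picture represents.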
Step 3: Creating a Classification
Classification propagation enables classifications associated with an entity to be automatically associated with other entities related to it. This is very useful in dealing with scenarios where a dataset derives its data from other datasets, like a table loaded with data in a file, a report generated from a table/view, etc. For example, when a table is classified as “PII”, tables or views that derive data from this table (via CTAS or ‘create view’ operation) will be automatically classified as “PII”.
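The propagation rule described here can be sketched as a walk over the "derived from" relationships. This is a simplified illustration in Python, not Atlas code: the `propagate_classification` function, the dictionary layout and the table names are assumptions made for the example.

```python
from collections import deque


def propagate_classification(derived, start, tag, tags=None):
    """Attach `tag` to `start` and to everything derived from it.

    `derived` maps an entity name to the list of entities created from it
    (for example via CTAS or 'create view'). Returns a dict of entity -> set of tags.
    """
    tags = {} if tags is None else tags
    queue = deque([start])
    visited = set()
    while queue:
        entity = queue.popleft()
        if entity in visited:
            continue
        visited.add(entity)
        tags.setdefault(entity, set()).add(tag)
        queue.extend(derived.get(entity, []))
    return tags


if __name__ == "__main__":
    # Hypothetical lineage: a table, a CTAS copy, a view and a report on the view.
    derived = {
        "customers": ["customers_ctas", "customers_view"],
        "customers_view": ["monthly_report"],
        "orders": ["orders_view"],
    }
    print(propagate_classification(derived, "customers", "PII"))
```

Entities that do not descend from the classified table (here `orders` and `orders_view`) are left untouched, which matches the behaviour described above.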
On the Atlas web UI, click CLASSIFICATION, then click the + icon.
On the “Create a new classification” pop-up, type in a name and an optional description for the classification. You can use the Select classifications to inherit attributes box to inherit attributes from other classifications. Click Add New Attributes to add one or more new attributes to the classification. Click Create to create the new classification.
The new classification appears in the Classifications list.
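As far as I understand the Atlas v2 REST API, the same classification can also be created without the web UI by posting a type definition to the `types/typedefs` endpoint. The Python sketch below only builds the request and does not send it. The host and port, the `expires_on` classification name and its `expiry_date` attribute are illustrative assumptions, and the payload field names should be verified against the REST documentation for your Atlas version.

```python
import base64
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"  # assumption: Atlas listening on its usual port


def classification_typedef(name, description="", attributes=None, super_types=None):
    """Build a typedefs payload containing a single classification definition."""
    attribute_defs = [
        {
            "name": attr_name,
            "typeName": type_name,
            "isOptional": True,
            "cardinality": "SINGLE",
        }
        for attr_name, type_name in (attributes or {}).items()
    ]
    return {
        "classificationDefs": [
            {
                "name": name,
                "description": description,
                "superTypes": list(super_types or []),
                "attributeDefs": attribute_defs,
            }
        ]
    }


def build_create_request(payload, user="admin", password="admin"):
    """Prepare (but do not send) the POST request that registers the typedef."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{ATLAS_URL}/api/atlas/v2/types/typedefs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    payload = classification_typedef(
        "expires_on", "Data with an expiry date", {"expiry_date": "date"}
    )
    request = build_create_request(payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
    # To actually send it: urllib.request.urlopen(request)
```

The default admin/admin credentials are used here only because the walkthrough uses them; they should be changed on anything that is not a throwaway test cluster.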
Step 4: Using DSL for Discovering Metadata
Atlas DSL (domain-specific language) is a SQL-like query language that enables you to search metadata using complex queries. In the search tab, you have two ways of searching through the databases: a normal text search and a DSL search.
After selecting the DSL search, we can use SQL-like queries to extract information from the tables present in our Hive database. We can write our queries in the optional conditions input. We wrote one such query here to extract all the error logs in a specific format. In the image below, you can see the results.
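For readers who prefer scripting to the UI, DSL queries can, to my knowledge, also be submitted through the `search/dsl` endpoint of the Atlas v2 REST API. The Python sketch below only builds the request URLs. The host, the table name `TableA` and the example queries are assumptions for illustration (they are not the query used in the screenshot), and DSL syntax can vary between Atlas versions, so treat them as starting points.

```python
from urllib.parse import urlencode

ATLAS_URL = "http://localhost:21000"  # assumption: Atlas listening on its usual port

# Illustrative DSL queries; type names such as hive_db and hive_table come
# from the Hive model that Atlas ships with.
EXAMPLE_QUERIES = [
    "hive_db",
    'hive_table where name = "TableA"',
    'hive_table where name = "TableA" select name, owner',
    "hive_table isa PII",
]


def dsl_search_url(query, limit=25, offset=0):
    """Return the GET URL for running one DSL query against Atlas."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{ATLAS_URL}/api/atlas/v2/search/dsl?{params}"


if __name__ == "__main__":
    for dsl in EXAMPLE_QUERIES:
        print(dsl_search_url(dsl))
```

The last query, `hive_table isa PII`, is the kind of search that ties discovery back to classification: it lists the tables carrying a given tag.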
In this article, we focused on Apache Atlas as an example to explain and demonstrate metadata management in enterprise governance. We had a look at important topics like data lineage, data discovery, and classification. Apache Atlas is one of the prime tools handling all the metadata management tasks and has a lot of future prospects.
I have just completed my survey of data (from articles, blogs, white papers, university websites, curated tech websites, and research papers all available online) about predictive analytics.
And I have a reason to believe that we are standing on the brink of a revolution that will transform everything we know about data science and predictive analytics.
But before we go there, you need to know: why the hype about predictive analytics? What is predictive analytics?
Let’s cover that first.
Importance of Predictive Analytics
By PhotoMix Ltd
According to Wikipedia:
Predictive analytics is an area of statistics that deals with extracting information from data and using it to predict trends and behavior patterns. The enhancement of predictive web analytics calculates statistical probabilities of future events online. Predictive analytics statistical techniques include data modeling, machine learning, AI, deep learning algorithms and data mining.
Predictive analytics is why every business wants data scientists. Analytics is not just about answering questions; it is also about finding the right questions to answer. The applications of this field are many. Nearly every human endeavor appears in the following excerpt from Wikipedia listing the applications of predictive analytics:
Predictive analytics is used in actuarial science, marketing, financial services, insurance, telecommunications, retail, travel, mobility, healthcare, child protection, pharmaceuticals, capacity planning, social networking, and a multitude of numerous other fields ranging from the military to online shopping websites, Internet of Things (IoT), and advertising.
In a very real sense, predictive analytics means applying data science models to given scenarios that forecast or generate a score of the likelihood of an event occurring. The data generated today is so voluminous that experts estimate that less than 1% is actually used for analysis, optimization, and prediction. In the case of Big Data, that estimate falls to 0.01% or less.
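To show what "a score of the likelihood of an event occurring" looks like in practice, here is a minimal logistic regression written from scratch in Python. It is a teaching sketch under stated assumptions: the tiny dataset (site visits last week, items in cart, and whether the customer bought) is invented, and a real project would use an established library and far more data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.05, epochs=1000):
    """Fit weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            predicted = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = predicted - label  # gradient of the log loss w.r.t. the logit
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    return weights, bias


def score(features, weights, bias):
    """Likelihood score between 0 and 1 for one new observation."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Invented toy data: [visits_last_week, items_in_cart] -> purchased (1) or not (0).
ROWS = [[0, 0], [1, 0], [1, 1], [4, 2], [5, 3], [6, 3]]
LABELS = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    weights, bias = train_logistic(ROWS, LABELS)
    for candidate in ([0, 0], [3, 1], [6, 3]):
        print(candidate, round(score(candidate, weights, bias), 3))
```

The output of `score` is exactly the kind of number a business acts on: customers above some threshold get the offer, the rest do not.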
Common Example Use-Cases of Predictive Analytics
Components of Predictive Analytics
A skilled data scientist can utilize the prediction scores to optimize and improve the profit margin of a business or a company by a massive amount. For example:
If you buy a book for children on the Amazon website, the website identifies that you have an interest in that author and that genre and shows you more books similar to the one you just browsed or purchased.
YouTube also has a very similar algorithm behind its video suggestions when you view a particular video. The site (or rather, the analytics algorithms running on the site) identifies more videos that you would enjoy watching based upon what you are watching now. In ML, this is called a recommender system.
Netflix is another famous example where recommender systems play a massive role in the suggestions for the ‘shows you may like’ section, and the recommendations are well-known for their accuracy in most cases.
Google AdWords (the text ads displayed at the top of every Google Search) is another example of a machine learning algorithm whose usage can be classified under predictive analytics.
Department stores often arrange products so that common groups are easy to find. For example, the fresh fruits and vegetables would be close to the health food supplements and diet control foods that weight-watchers commonly use. Coffee/tea/milk and biscuits/rusks make another possible grouping. You might think this is trivial, but department stores have recorded up to a 20% increase in sales when such optimal grouping and placement was performed, again through a form of analytics.
Bank loans and home loans are often approved with the credit scores of a customer. How is that calculated? An expert system of rules, classification, and extrapolation of existing patterns – you guessed it – using predictive analytics.
Allocating budgets in a company to maximize the total profit in the upcoming year is predictive analytics. This is simple at a startup, but imagine the situation in a company like Google, with thousands of departments and employees, all clamoring for funding. Predictive Analytics is the way to go in this case as well.
IoT (Internet of Things) smart devices are one of the most promising applications of predictive analytics. It will not be too long before predictive analytics on sensor data from aircraft parts tells operators that a part has a high likelihood of failure. Ditto for cars, refrigerators, military equipment, military infrastructure and aircraft, and anything else that uses IoT (which is nearly every embedded processing device available in the 21st century).
Fraud detection, malware detection, hacker intrusion detection, cryptocurrency hacking, and cryptocurrency theft are all ideal use cases for predictive analytics. In this case, the ML system detects anomalous behavior on an interface used by the hackers and cybercriminals to identify when a theft or a fraud is taking place, has taken place, or will take place in the future. Obviously, this is a dream come true for law enforcement agencies.
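Several of the examples above (book suggestions, video suggestions, product placement in stores) boil down to the same idea: items that show up together often are worth suggesting together. The Python sketch below is a deliberately simple co-occurrence recommender, not what Amazon, YouTube or Netflix actually run; the basket data and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def recommend(item, counts, k=3):
    """Return up to k items most frequently seen together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(k)]


if __name__ == "__main__":
    # Invented shopping baskets.
    baskets = [
        ["coffee", "biscuits", "milk"],
        ["coffee", "biscuits"],
        ["coffee", "biscuits"],
        ["tea", "milk"],
    ]
    counts = build_cooccurrence(baskets)
    print(recommend("coffee", counts, k=2))
```

Production recommenders add ratings, recency, user similarity and much more, but the grouping of coffee with biscuits in a store layout is this same calculation done at scale.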
So now you know what predictive analytics is and what it can do. Now let’s come to the revolutionary new technology.
End-to-End Predictive Analytics Product – for non-tech users!
In a remarkable first, a research team at MIT, USA has created a new science called social physics, or sociophysics. Much about this field is deliberately kept highly confidential because of its massive disruptive power as far as data science is concerned, especially predictive analytics. The only requirement of this science is that the system being modeled has to be a human-interaction-based environment. To keep the discussion simple, we shall explain the entire system in points.
All systems in which human beings are involved follow scientific laws.
These laws have been identified, verified experimentally and derived scientifically.
By laws we mean equations, such as (just as an example) Newton’s second law: F = ma (force equals mass times acceleration).
These equations establish laws of invariance that are the same regardless of which human-interaction system is being modeled.
Hence the term social physics: like Maxwell’s equations of electromagnetism or Newton’s theory of gravitation, these laws are a new discovery and are universal as long as the agents interacting in the system are humans.
The invariance and universality of these laws have several important consequences:
The need for large amounts of data disappears. Because of the laws, many of the predictive capacities of the model can be obtained with a minimal amount of data. Hence small companies now have the power to use analytics that was mostly used by the FAMGA (Facebook, Amazon, Microsoft, Google, Apple) set of companies, since they were the only ones with the money to maintain Big Data warehouses and data lakes.
There is no need for data cleaning. Since the model being used is canonical, it is independent of data problems like outliers, missing data, nonsense data, unavailable data, and data corruption. This is due to the orthogonality of the model being constructed (a Knowledge Sphere) and the data available.
Performance is superior to deep learning models built with the usual stack (Google TensorFlow, PyTorch and scikit-learn, in Python, R or Julia). Consistently, the model has outscored such models in Kaggle competitions, without any data pre-processing or data preparation and cleansing!
Data being orthogonal to interpretation and manipulation means that encrypted data can be used as-is. There is no need to decrypt encrypted data to perform a data science task or experiment. This is significant because the independence of the model functioning even for encrypted data opens the door to blockchain technology and blockchain data to be used in standard data science tasks. Furthermore, this allows hashing techniques to be used to hide confidential data and perform the data mining task without any knowledge of what the data indicates.
Are You Serious?
That’s a valid question given these claims! And that is why I recommend everyone who has the slightest or smallest interest in data science to visit and completely read and explore the following links:
Now when I say completely read, I mean completely read. Visit every section and read every bit of text that is available on the three sites above. You will soon understand why this is such a revolutionary idea.
These links above are articles about the social physics book and about the science of sociophysics in general.
For more details, please visit the following articles on Medium. These further document Endor.coin, a cryptocurrency built around the idea of sharing data with the public and getting paid for the usage of your data. Preferably read them all; if you are busy, at least read Article No. 1.
Upon every data set, the first action performed by the Endor Analytics Platform is clustering, also popularly known as automatic classification. Endor constructs what is known as a Knowledge Sphere, a canonical representation of the data set which can be constructed even with 10% of the data volume needed for the same project when deep learning was used.
Creation of the Knowledge Sphere takes 1-4 hours for a billion records dataset (which is pretty standard these days).
Now an explanation of the mathematics behind social physics is beyond our scope, but I will include the change in the data science process when the Endor platform was compared to a deep learning system built to solve the same problem the traditional way (with a 6-figure salary expert data scientist).
From Appendix A: Social Physics Explained, Section 3.1, pages 28-34 (some material not included):
Prediction Demonstration using the Endor System:
Data: The data that was used in this example originated from a retail financial investment platform and contained the entire investment transactions of members of an investment community. The data was anonymized and made public for research purposes at MIT (the data can be shared upon request).
Summary of the dataset:
– 7 days of data
– 3,719,023 rows
– 178,266 unique users
Automatic Clusters Extraction: Upon first analysis of the data the Endor system detects and extracts “behavioral clusters”, groups of users whose data dynamics violate the mathematical invariances of the Social Physics. These clusters are based on all the columns of the data, but are limited only to the last 7 days, as this is the data that was provided to the system as input.
Behavioural Clusters Summary
Number of clusters: 268,218
Cluster sizes: 62 (Mean), 15 (Median), 52,508 (Max), 5 (Min)
Clusters per user: 164 (Mean), 118 (Median), 703 (Max), 2 (Min)
Users in clusters: 102,770 out of the 178,266 users
Records per user: 6 (Median), 33 (Mean); applies only to users in clusters
Prediction Queries
The following prediction queries were defined:
1. New users to become “whales”: users who joined in the last 2 weeks that will generate at least $500 in commission in the next 90 days
2. Reducing activity: users who were active in the last week that will reduce activity by 50% in the next 30 days (but will not churn, and will still continue trading)
3. Churn in “whales”: currently active “whales” (as defined by their activity during the last 90 days), who were active in the past week, to become inactive for the next 30 days
4. Will trade in Apple share for the first time: users who had never invested in Apple share, and would buy it for the first time in the coming 30 days
Knowledge Sphere Manifestation of Queries
It is again important to note that the definition of the search queries is completely orthogonal to the extraction of behavioral clusters and the generation of the Knowledge Sphere, which was done independently of the queries definition.
Therefore, it is interesting to analyze the manifestation of the queries in the clusters detected by the system: Do the clusters contain information that is relevant to the definition of the queries, despite the fact that:
1. The clusters were extracted in a fully automatic way, using no semantic information about the data, and –
2. The queries were defined after the clusters were extracted, and did not affect this process.
This analysis is done by measuring the number of clusters that contain a very high concentration of “samples”; in other words, by looking for clusters that contain “many more examples than statistically expected”.
A high number of such clusters (provided that it is significantly higher than the amount received when randomly sampling the same population) proves the ability of this process to extract valuable relevant semantic insights in a fully automatic way.
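The white paper does not spell out the statistic it uses for "many more examples than statistically expected", so the following Python sketch is my own assumption of one reasonable way to measure it: the lift of a cluster over the base rate, and a hypergeometric tail probability for seeing at least that many positives by chance. All numbers in the usage example are invented.

```python
from math import comb


def lift(cluster_size, positives_in_cluster, population_size, positives_in_population):
    """How many times more concentrated the positives are inside the cluster."""
    cluster_rate = positives_in_cluster / cluster_size
    base_rate = positives_in_population / population_size
    return cluster_rate / base_rate


def hypergeom_tail(population_size, positives_in_population, cluster_size, positives_in_cluster):
    """P(X >= k): chance that a random cluster of this size holds at least k positives."""
    upper = min(cluster_size, positives_in_population)
    total = comb(population_size, cluster_size)
    favourable = sum(
        comb(positives_in_population, i)
        * comb(population_size - positives_in_population, cluster_size - i)
        for i in range(positives_in_cluster, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Invented example: 1,000 users, 20 of them "whales"; one cluster of 50 users holds 10.
    print(lift(50, 10, 1000, 20))
    print(hypergeom_tail(1000, 20, 50, 10))
```

A cluster with high lift and a tiny tail probability is one that "contains many more examples than statistically expected"; counting such clusters and comparing against random groupings of the same sizes mirrors the check described above.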
Comparison to Google TensorFlow
In this section a comparison between prediction process of the Endor system and Google’s TensorFlow is presented. It is important to note that TensorFlow, like any other Deep Learning library, faces some difficulties when dealing with data similar to the one under discussion:
1. An extremely uneven distribution of the number of records per user requires some canonization of the data, which in turn requires:
2. Some manual work, done by an individual who has at least some understanding of data science.
3. Some understanding of the semantics of the data, that requires an investment of time, as well as access to the owner or provider of the data
4. A single-class classification, using an extremely uneven distribution of positive vs. negative samples, tends to lead to the overfitting of the results and require some non-trivial maneuvering.
This again necessitates the involvement of an expert in Deep Learning (unlike the Endor system which can be used by Business, Product or Marketing experts, with no prerequisites in Machine Learning or Data Science).
An expert in Deep Learning, with sufficient expertise to handle the data, spent 2 weeks crafting a solution based on TensorFlow. The solution that was created used the following auxiliary techniques:
1. Trimming the data sequence to 200 records per customer, and padding the streams for users who have less than 200 records with neutral records.
2. Creating 200 training sets, each having 1,000 customers (50% known positive labels, 50% unknown) and then using these training sets to train the model.
3. Using sequence classification (RNN with 128 LSTMs) with 2 output neurons (positive, negative), with the overall result being the difference between the scores of the two.
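The first of these techniques, fixing every customer's history at 200 records, can be sketched in a few lines of Python. The white paper does not say which end of the sequence was trimmed, where the padding was placed, or what a "neutral record" looked like, so keeping the most recent records, padding at the front and using 0 as the neutral value are all assumptions of this sketch.

```python
def trim_and_pad(records, length=200, neutral=0):
    """Return exactly `length` records for one customer.

    Assumptions: the most recent records are the ones worth keeping, and
    padding goes at the front so the real records stay at the end.
    """
    recent = list(records)[-length:]
    padding = [neutral] * (length - len(recent))
    return padding + recent


if __name__ == "__main__":
    heavy_user = list(range(250))   # more than 200 records: gets trimmed
    light_user = [7, 8, 9]          # fewer than 200 records: gets padded
    print(len(trim_and_pad(heavy_user)), trim_and_pad(heavy_user)[:3])
    print(len(trim_and_pad(light_user)), trim_and_pad(light_user)[-5:])
```

Fixed-length inputs like these are what a sequence classifier such as the LSTM network in the third technique typically expects, which is why this step appears before any model training.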
Observations (all statistics available in the white paper – and it’s stunning)
1. Endor outperforms TensorFlow in 3 out of 4 queries, and achieves the same accuracy in the 4th.
2. The superiority of Endor is increasingly evident as the task becomes “more difficult”, that is, when focusing on the top-100 rather than the top-500.
3. There is a clear distinction between the “less dynamic queries” (becoming a whale, churn, reduce activity), for which static signals should likely be easier to detect, and the “Who will trade in Apple for the first time” query, which is (a) more dynamic and (b) has a very low baseline, such that for the latter, Endor is 10x more accurate!
4. As previously mentioned, the TensorFlow results illustrated here employ 2 weeks of manual improvements done by a Deep Learning expert, whereas the Endor results are 100% automatic and the entire prediction process in Endor took 4 hours.
Clearly, the path going forward for predictive analytics and data science is Endor, Endor, and Endor again!
Predictions for the Future
Personally, one thing has me sold – the robustness of the Endor system to handle noise and missing data. Earlier, this was the biggest bane of the data scientist in most companies (when data engineers are not available). 90% of the time of a professional data scientist would go into data cleaning and data preprocessing since our ML models were acutely sensitive to noise. This is the first solution that has eliminated this ‘grunt’ level work from data science completely.
The second prediction: the Endor system works upon principles of human interaction dynamics. My intuition tells me that data collected at random has its own dynamical systems that appear clearly to experts in complexity theory. I am completely certain that just as this tool developed a prediction tool with human society dynamical laws, data collected in general has its own laws of invariance. And the first person to identify these laws and build another Endor-style platform on them will be at the top of the data science pyramid – the alpha unicorn.
Final prediction – democratizing data science means that now data scientists are not required to have six-figure salaries. The success of the Endor platform means that anyone can perform advanced data science without resorting to TensorFlow, Python, R, Anaconda, etc. This platform will completely disrupt the entire data science technological sector. The first people to master it and build upon it to formalize the rules of invariance in the case of general data dynamics will for sure make a killing.
It is an exciting time to be a data science researcher!
Data Science is a broad field and it would require quite a few things to learn to master all these skills.
Computing infrastructure is an ever-changing landscape of technology advancements. Current changes affect the way companies deploy smart manufacturing systems to make the most of advancements.
The rise of edge computing capabilities coupled with traditional industrial control system (ICS) architectures provides increasing levels of flexibility. In addition, time-synchronized applications and analytics augment, or in some cases minimize, the need for larger Big Data operations in the cloud, regardless of cloud premise.
In this blog, we will start with the definition of edge computing. After that, we will discuss the need for edge computing and its applications. Also, we will try to understand the scope of edge computing in the future.
What is Edge computing
Consolidation and the centralized nature of cloud computing have proven cost-effective and flexible, but the rise of the IIoT and mobile computing has put a strain on networking bandwidth. Ultimately, not all smart devices need to use cloud computing to operate. In some cases, architects can — and should — avoid the back and forth. Edge computing could prove more efficient in some areas where cloud computing operates.
Furthermore, edge computing permits data processing closer to its origin (i.e., motors, pumps, generators or other sensors), reducing the need to transfer that data back and forth to the cloud.
Additionally, think of edge computing in manufacturing as a network of micro data centers capable of hosting, storage, computing and analysis on a localized basis while pushing aggregate data to a centralized plant or enterprise data center, or even the cloud (private or public, on-premise or off) for further analysis, deeper learning, or to feed an artificial intelligence (AI) engine hosted elsewhere.
According to Microsoft, in edge computing, compute resources are “placed closer to information-generation sources to reduce network latency and bandwidth usage generally associated with cloud computing.” This helps to ensure continuity of services and operations even if cloud connections aren’t steady.
Also, this moving of compute and storage to the “edge” of the network, away from the data centre and closer to the user, cuts down the amount of time it takes to exchange messages compared with traditional centralized cloud computing. Moreover, according to research by IEEE, it can help to balance network traffic, extend the life of IoT devices and, ultimately, reduce “response times for real-time IoT applications.”
Terms in Edge Computing
Like most technology areas, edge computing has its own lexicon. Here are brief definitions of some of the more commonly used terms:
Edge devices: These can be any device that produces data. These could be sensors, industrial machines or other devices that produce or collect data.
Edge: What the edge is depends on the use case. In telecommunications, the edge may be a cell phone or a cell tower. In an automotive scenario, the edge of the network could be a car. In manufacturing, it could be a machine on a shop floor. In enterprise IT, the edge could be a laptop.
Edge gateway: A gateway is a buffer between where edge computing processing is done and the broader fog network. The gateway is the window into the larger environment beyond the edge of the network.
Fat client: Software that can do some data processing on edge devices. This is the opposite of a thin client, which would merely transfer data.
Edge computing equipment: Edge computing uses a range of existing and new equipment. We can outfit many devices, sensors and machines to work in an edge computing environment by simply making them Internet-accessible. Cisco and other hardware vendors have a line of rugged network equipment that has hardened exteriors meant to be used in field environments. A range of compute servers and even storage-based hardware systems like Amazon Web Service’s Snowball have usage in edge computing deployments.
Mobile edge computing: This refers to the buildout of edge computing systems in telecommunications systems, particularly 5G scenarios.
Why Rise in Edge Computing
1. Latency in decision making
Businesses are getting a huge boost from computerised systems, especially as they evolve into the cloud era. But bringing that same level of technology across different sites has proven to be not so straightforward for many companies, particularly as the sites started generating more data. The main concern is latency, that being the time it takes for data to move between points. As with the NYSE, a little distance goes a long way in the computer world, so it stands to reason that delays in sending data needed to reach decisions will translate into delays for the business.
2. Decentralisation and scaling
To some, it may seem counterintuitive to move away from the centre. Wasn’t centralisation the whole point of cloud systems? But the cloud isn’t about pooling everything in the middle. It’s about scale and making it easier to access the services that the business uses every day. Also, the transfer gap problem between sites and data centres predates the cloud era. Yet cloud can exacerbate it. The only way to overcome this transfer gap is to move some of the data centres to where the data is.
3. Process Optimisation
With edge computing, data centres can execute rules that are time sensitive (like “stop the car” in case of driverless vehicles), and then stream data to the cloud in batches when bandwidth needs aren’t as high. Furthermore, the cloud can then take the time to analyze data from the edge and send back recommended rule changes, like “decelerate slowly when the car senses human activity within 50 feet.”
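The split between a local, time-sensitive rule and batched uploads can be sketched as follows. This is a minimal illustration under assumptions: the `EdgeNode` class, the `obstacle_ft` field, the action names and the batch size are hypothetical, and the `upload` callback stands in for whatever transport a real deployment would use.

```python
from typing import Callable, Dict, List

Reading = Dict[str, float]


class EdgeNode:
    """Applies a time-sensitive rule locally and batches readings for the cloud."""

    def __init__(self, upload: Callable[[List[Reading]], None],
                 threshold_ft: float = 10.0, batch_size: int = 2) -> None:
        self.upload = upload              # hypothetical cloud upload callback
        self.threshold_ft = threshold_ft  # rule parameter the cloud may revise
        self.batch_size = batch_size
        self.buffer: List[Reading] = []

    def handle(self, reading: Reading) -> str:
        # The decision is made on the edge, with no round trip to the cloud.
        action = "stop" if reading["obstacle_ft"] <= self.threshold_ft else "continue"
        self.buffer.append(reading)
        if len(self.buffer) >= self.batch_size:
            self.upload(self.buffer)      # send the batch when it is full
            self.buffer = []
        return action

    def update_rule(self, threshold_ft: float) -> None:
        """Apply a rule change recommended by the cloud."""
        self.threshold_ft = threshold_ft


uploaded: List[List[Reading]] = []
node = EdgeNode(upload=uploaded.append, threshold_ft=10.0, batch_size=2)
first = node.handle({"obstacle_ft": 30.0})   # decided locally
second = node.handle({"obstacle_ft": 5.0})   # buffer is full, so the batch is uploaded
node.update_rule(50.0)                       # rule change pushed back from the cloud
third = node.handle({"obstacle_ft": 30.0})
```

After the rule update, the same 30-foot reading that was previously ignored now triggers the local action.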
Cost is also a driving factor for edge computing. The bulk of telemetry data that is from the sensors and actuators is likely not relevant for the IoT application. The fact a temperature sensor reports a 20ºC reading every second might not be interesting until the sensor reports a 40ºC reading. Edge computing allows for the filtering and processing of data before sending it to the cloud. This reduces the network cost of data transmission. It also reduces the cloud storage and processing cost of data that is not relevant to the application.
Storing and processing data on the edge and only sending out to the cloud what will be used and useful saves bandwidth and server space.
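The temperature sensor example can be sketched as a simple "report only on change" filter running on the edge device. The function name and the `min_change` threshold are assumptions made for illustration; a real deployment would tune the threshold per sensor.

```python
from typing import Iterable, List, Optional


def report_by_exception(readings: Iterable[float], min_change: float = 1.0) -> List[float]:
    """Forward a reading only when it differs enough from the last one sent."""
    forwarded: List[float] = []
    last_sent: Optional[float] = None
    for value in readings:
        if last_sent is None or abs(value - last_sent) >= min_change:
            forwarded.append(value)   # worth sending to the cloud
            last_sent = value
    return forwarded


readings = [20.0, 20.0, 20.0, 20.2, 40.0, 40.0, 20.0]
to_cloud = report_by_exception(readings)
```

Here only three of the seven readings leave the device: the first value, the jump to 40, and the return to 20.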
Where all we are using it
1. Grid Edge Control and Analytics
Grid Edge computing solutions are helping the utility monitor and analyse these additional renewable power generating resources integrated into their grid, in real time. This is something legacy SCADA systems are unable to offer.
From residential rooftop solar to solar farms, commercial solar, electric vehicles and wind farms, smart meters are generating a ton of data that helps utilities to view the amount of energy available and required, allowing their demand response to become more efficient, avoid peaks and reduce costs. This data is first processed in the Grid Edge Controllers, which perform local computation and analysis of the data and send only the necessary actionable information over a wireless network to the utility.
2. Oil and Gas Remote Monitoring
Safety monitoring within critical infrastructures such as oil and gas utilities is of utmost importance. For this reason, many cutting edge IoT monitoring devices are being deployed in order to safeguard against disaster. Edge computing allows data to be analysed, processed, and then delivered to end-users in real-time, allowing control centres to access data as it occurs in order to foresee and prevent malfunctions or incidents before they occur. This is really important because, when dealing with critical infrastructures such as oil and gas or other energy services, any failure within a particular system has the potential to be catastrophic and should always warrant the highest levels of precaution.
3. Internet of Things
A smart window firm monitors windows for errors, weather information, maintenance needs and performance. This generates a massive stream of data as each device is regularly reporting information. Edge services filter this information and report a summary back to a centralized service that is running from the firm’s primary data centres. By summarizing information before reporting it, global bandwidth consumption is reduced by 99%.
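A summary-before-reporting step like the one in the smart window example might look like the sketch below. The device ID, the synthetic readings and the chosen summary fields are assumptions, and the byte counts are only a rough proxy for bandwidth, not a reproduction of the 99% figure.

```python
import json
from statistics import mean
from typing import Dict, List, Union


def summarize(device_id: str, values: List[float]) -> Dict[str, Union[str, int, float]]:
    """Collapse a stream of raw readings into one small summary record."""
    return {
        "device_id": device_id,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


raw = [20.0 + (i % 5) for i in range(1000)]   # synthetic readings for illustration
summary = summarize("window-001", raw)        # hypothetical device ID

raw_bytes = len(json.dumps(raw))              # payload if every reading were sent
summary_bytes = len(json.dumps(summary))      # payload after edge summarization
```

Comparing `raw_bytes` with `summary_bytes` gives a feel for how much traffic the edge service keeps off the network.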
An e-commerce company delivers images and static web content from a content delivery network. They also perform processing at edge data centres to quickly calculate product recommendations for customers.
A hedge fund pays an expensive premium for servers that are in close proximity to various stock exchanges to achieve extremely low latency trading. Trading algorithms are deployed on these machines. These servers are expensive and resource constrained. As such, they connect back to a cloud service for processing support.
A game platform executes certain real-time elements of the game experience on edge servers near the user. The edges connect to a cloud backend for support processing. The backend is run from three regions that need not be close to the end-user.
Predictions for Edge Computing in Future
According to IDC, by 2020 the IT spend on edge infrastructure will reach up to 18% of the total spend on IoT infrastructure. IDC adds that this spend is driven by the deployment of converged IT and OT systems, which reduces the time to value of data collected from connected devices. It’s what we explained and illustrated in a nutshell.
According to a November 1, 2017, announcement regarding research of the edge computing market across hardware, platforms, solutions and applications (smart city, augmented reality, analytics etc.) the global edge computing market is expected to reach USD 6.72 billion by 2022 at a compound annual growth rate of a whopping 35.4 per cent.
The major trends responsible for the growth of the market in North America are all too familiar: a growing number of IoT devices and a growing dependency on them, the need for faster processing, the increase in cloud adoption, and the increase in pressure on networks.
Edge is still in early-stage adoption, but one thing is clear: edge devices are subject to large-scale investments from cloud suppliers to offload the bandwidth and latency issues caused by an explosion of Internet of Things (IoT) data in both industrial and commercial applications.
Edge soon will likely increase in adoption where users have questions about how or if the cloud applies for the specific use case. Cloud-level interfaces and apps will migrate to the edge. Industrial application hosting and analytics will become common at the edge, using virtual servers and simplified operational technology-friendly hardware and software.
Benefits in network simplification, security and bandwidth accompany the IT simplification.
Follow this link, if you are looking to learn more about data science online!
You have finally trained your first Machine Learning model. Congratulations! What’s next though? How would you unleash the power of your model outside of your laptop? In this tutorial, we help you take the first step towards deploying models. We will use flask to create apps and flasgger to create a beautiful UI.
Github repository for this tutorial – https://github.com/DhruvilKarani/MLapp_using_flask_and_flasgger
You can install libraries using Anaconda Prompt (use the search option on Windows) by typing pip install <name of the library>. For flasgger, use pip install flasgger==0.8.1
Of course, don’t include the <>
Building an ML model
Before we deploy any model, let’s first build one. For now, let’s build a simple model on a simple dataset so that we can spend more time on the deployment part. We use the Iris dataset from sklearn’s datasets. The required imports are given below.
The Iris dataset looks something like this –
Image Credits: Analyticskhoj
The variable to be predicted, i.e., Species, has three categories – Setosa, Versicolour, Virginica. Now, we build our model in just 6 lines of code
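The original listing is not reproduced here, so the following is only a sketch of what such a model could look like. The choice of `RandomForestClassifier`, the 80/20 split and the `random_state` values are assumptions, not necessarily what the tutorial's repository uses.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the four measurements (features) and the species labels (0, 1, 2).
iris = load_iris()

# Hold out 20% of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Assumed classifier: a small random forest is enough for this dataset.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Fraction of held-out rows classified correctly.
accuracy = model.score(X_test, y_test)
```

Any scikit-learn classifier with `fit` and `predict` would slot into the rest of the tutorial in the same way.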
The model we build is saved as a pickle file. Pickling serializes a Python object, such as our trained model, into a binary form. Next time we want to use this model, we don’t have to train it again. We can merely load this pickle file.
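Saving and reloading with pickle can be sketched as below. To keep the example self-contained, a plain dictionary stands in for the trained model (any picklable Python object behaves the same way), and the file is written to the system's temporary folder; both are assumptions, so substitute your own model and path.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model object.
model = {"name": "stand-in for a trained model"}

# Assumed location; use forward slashes if you type a Windows path by hand.
path = os.path.join(tempfile.gettempdir(), "model.pkl")

# Write the object to disk in binary mode.
with open(path, "wb") as model_file:
    pickle.dump(model, model_file)

# Later (for example, in the Flask script), load it back without retraining.
with open(path, "rb") as model_file:
    restored = pickle.load(model_file)
```

Only load pickle files you created or trust, because unpickling can execute arbitrary code.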
The above command saves the model as a pickle file under the name model_pkl in the path specified (in this case – C:/Users….model.pkl). Also, make sure you have / and not \. You might also want to check once if the file is present in the folder. Once you have made sure the file exists, the next step is to use flask and flasgger to make a fantastic UI. Make a new Python script and import the following modules and read the pickle file.
Next, we create a Flask object and name it app. The argument to this Flask object is the special keyword __name__. To create an easy UI for this app, we use the Swagger module in the Flasgger library.
Now, we create two apps – One which accepts individual values for all 4 input values and the other which accepts a CSV file as inputs. Let’s create the first app
The first line @app.route('/predict') specifies the part of the URL which runs this particular app. If you do not understand this as of now, don’t worry. Things get more evident as we use the app. The next thing we do is create a function named predict_iris. Under this function, we have a long docstring. Swagger uses this string for creating a UI. It says that the app requires 4 parameters, namely S_length, S_width, P_length, P_width. All of these input values are of the query type. Next, the app uses the GET method to accept these values, which means that we need to enter the numbers ourselves. Then we pass these values to our model in a 2-D numpy array and return the predictions. Two things here –
Predictions returns the element in the prediction numpy array
We always output a string, never a numeric value to avoid errors.
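A sketch of the first app, following the description above, might look like this. Several things are assumptions: `StubModel` is a hypothetical stand-in for the unpickled classifier (a toy rule, not a trained model), a nested list stands in for the 2-D numpy array so that the sketch needs only Flask, and the flasgger import is made optional so the route still works if flasgger is not installed.

```python
from flask import Flask, request

try:
    from flasgger import Swagger   # only needed for the /apidocs UI
except ImportError:
    Swagger = None


class StubModel:
    """Hypothetical stand-in for the model loaded from the pickle file."""

    def predict(self, rows):
        # Toy rule on petal length only, NOT a trained model.
        return [0 if row[2] < 2.5 else 1 if row[2] < 4.9 else 2 for row in rows]


app = Flask(__name__)
if Swagger is not None:
    Swagger(app)                   # builds the UI from the docstrings below
model = StubModel()


@app.route("/predict")
def predict_iris():
    """Predict the iris class from four measurements.
    ---
    parameters:
      - name: S_length
        in: query
        type: number
        required: true
      - name: S_width
        in: query
        type: number
        required: true
      - name: P_length
        in: query
        type: number
        required: true
      - name: P_width
        in: query
        type: number
        required: true
    responses:
      200:
        description: Predicted class index, returned as a string
    """
    names = ["S_length", "S_width", "P_length", "P_width"]
    values = [float(request.args[name]) for name in names]
    prediction = model.predict([values])   # 2-D input: one row, four columns
    return str(prediction[0])              # always return a string
```

You can exercise the route without a browser through Flask's test client, for example `app.test_client().get("/predict?S_length=5.1&S_width=3.5&P_length=1.4&P_width=0.2")`.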
Now we build the second app, the one that accepts a file. However, before the app, we create a file that has all four variable values for which we predict the output. In the Python console, type the following
Select any 2-4 rows at random, copy them and save them in a CSV file. We use this file to test our second app. The file would look something like this
Notice the changes here. The @app.route decorator has '/predict_file' as its argument. The docstring under our new function predict_iris_file tells Swagger to set the file as an input file. Next, we read the CSV using read_csv and make sure the header is set to None if you haven’t set the column names while making the CSV. Next, we use the model to make the predictions and return them as a string.
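The second app could be sketched as follows. As before, `StubModel` is a hypothetical stand-in for the pickled classifier, the form field name `input_file` is an assumption, and the flasgger import is optional. The upload is read into memory first so that pandas receives an ordinary bytes buffer.

```python
import io

import pandas as pd
from flask import Flask, request

try:
    from flasgger import Swagger   # only needed for the /apidocs UI
except ImportError:
    Swagger = None


class StubModel:
    """Hypothetical stand-in for the model loaded from the pickle file."""

    def predict(self, rows):
        # Toy rule on petal length only, NOT a trained model.
        return [0 if row[2] < 2.5 else 1 if row[2] < 4.9 else 2 for row in rows]


app = Flask(__name__)
if Swagger is not None:
    Swagger(app)
model = StubModel()


@app.route("/predict_file", methods=["POST"])
def predict_iris_file():
    """Predict iris classes for every row of an uploaded CSV.
    ---
    consumes:
      - multipart/form-data
    parameters:
      - name: input_file
        in: formData
        type: file
        required: true
    responses:
      200:
        description: Predicted class indices, returned as a string
    """
    uploaded = request.files["input_file"].read()
    # header=None because the test CSV has no column names.
    frame = pd.read_csv(io.BytesIO(uploaded), header=None)
    predictions = model.predict(frame.values.tolist())
    return str(list(predictions))
```

Because a file upload travels in the request body, this route uses POST rather than GET.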
Finally, we run the app using
In the console, the output generates a local URL, something like this –
Copy the URL (the one highlighted) and paste it in your browser. Add /apidocs to it and hit Enter. For example http://127.0.0.1:5000/apidocs. Something like this opens up –
Click on default and then on /predict. You’ll find something like this –
Above is a UI for your first app. Go ahead, insert the four values and click on ‘Try it out!’. Under the response body, you find the predicted class. For our second app, upload the file by clicking on choose file.
When you try it out, you get a string of predicted classes.
Congratulations! You have just created a nice UI for your ML model. Feel free to play around and try out new things.
In very simple words, Amazon Web Services is a subsidiary of Amazon that provides on-demand cloud computing platforms to individuals, companies and governments, on a paid subscription basis. The technology allows subscribers to have at their disposal a virtual cluster of computers, available all the time, through the Internet.
Let us give a shot at a very technical description of AWS. Amazon Web Services (AWS) is a secure cloud services platform, offering computing power, database storage, content delivery and other functionality to help businesses scale and grow. Explore how millions of customers are currently leveraging AWS cloud products and solutions to build sophisticated applications with increased flexibility, scalability and reliability.
Websites & Website Hosting: Amazon Web Services offers cloud web hosting solutions that provide businesses, non-profits, and governmental organizations with low-cost ways to deliver their websites and web applications. Whether you’re looking for marketing, rich media, or e-commerce website, AWS offers a wide range of website hosting options, and we’ll help you select the one that is right for you.
Backup & Recovery: AWS offers the most storage services, data-transfer methods, and networking options to build solutions that protect your data with unmatched durability and security.
Data Archive: Amazon Web Services offers a complete set of cloud storage services for archiving. You can choose Amazon Glacier for affordable, non-time sensitive cloud storage, or Amazon Simple Storage Service (S3) for faster storage, depending on your needs. With AWS Storage Gateway and our solution provider ecosystem, you can build a comprehensive storage solution.
DevOps: AWS provides a set of flexible services designed to enable companies to more rapidly and reliably build and deliver products using AWS and DevOps practices. These services simplify provisioning and managing infrastructure, deploying application code, automating software release processes, and monitoring your application and infrastructure performance.
Big Data: AWS delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. AWS gives customers the widest array of analytics and machine learning services, for easy access to all relevant data, without compromising on security or governance.
Why learn AWS?
You don’t want your data scientists spending time on DevOps tasks like creating AMIs, defining Security Groups, and creating EC2 instances. Data science workloads benefit from large machines for exploratory analysis in tools like Jupyter or RStudio, as well as elastic scalability to support bursty demand from teams, or parallel execution of data science experiments, which are often computationally intensive.
Cost controls, resource monitoring, and reporting
Data science workloads often benefit from high-end hardware, which can be expensive. When data scientists have more access to scalable compute, how do you mitigate the risk of runaway costs, enforce limits, and attribute across multiple groups or teams?
Data scientists need agility to experiment with new open source tools and packages, which are evolving faster than ever before. System administrators must ensure stability and safety of environments. How can you balance these two points in tension?
Neural networks and other effective data science techniques benefit from GPU acceleration, but configuring and utilizing GPUs remains easier said than done. How can you provide efficient access to GPUs for your data scientists without miring them in DevOps configuration tasks?
AWS offers world-class security in their environment — but you must still make choices about how you configure security for your applications running on AWS. These choices can make a significant difference in mitigating risk as your data scientists transfer logic (source code) and data sets that may represent your most valuable intellectual property.
Our AWS Course
1. AWS Introduction
This section covers the basic and different concepts and terms which are AWS specific. This lays out the basic setting where learners are fed with all the AWS specific terms and are prepared for the deep dive.
2. VPC Subnet
A virtual private cloud (VPC) is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS Cloud. You can launch your AWS resources, such as Amazon EC2 instances, into your VPC.
A route table contains a set of rules, called routes, that are used to determine where network traffic is directed. Each subnet in your VPC must be associated with a route table; the table controls the routing for the subnet. A subnet can only be associated with one route table at a time, but you can associate multiple subnets with the same route table.
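How a route table picks a target can be sketched with Python's standard `ipaddress` module. When several routes match a destination, the most specific one (the longest prefix) wins. The CIDR blocks and the target name `igw-example` are made up for illustration, and this is a simplified model rather than AWS code.

```python
import ipaddress
from typing import Dict, Optional


def select_route(route_table: Dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific route that matches `destination`."""
    address = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if address in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    return max(matches)[1]   # the longest prefix is the most specific match


# Hypothetical route table for a subnet in a 10.0.0.0/16 VPC.
routes = {
    "10.0.0.0/16": "local",        # traffic that stays inside the VPC
    "0.0.0.0/0": "igw-example",    # everything else goes to an internet gateway
}

inside = select_route(routes, "10.0.1.25")
outside = select_route(routes, "8.8.8.8")
```

Traffic for `10.0.1.25` matches both routes but is sent to `local`, because `/16` is more specific than `/0`.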
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
AWS Identity and Access Management (IAM) is a web service that helps you securely control access to AWS resources. You use IAM to control who is authenticated (signed in) and authorized (has permissions) to use resources.
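The authorization half of IAM can be illustrated with a heavily simplified policy check. The policy below uses the real JSON policy layout (`Version`, `Statement`, `Effect`, `Action`, `Resource`), but the bucket name is made up and `is_allowed` is a hypothetical helper: real IAM evaluation also handles principals, conditions, lists of actions and several policy types. The sketch keeps only two rules: nothing is allowed unless a statement allows it, and an explicit Deny always wins.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict

policy: Dict[str, Any] = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
        {
            "Effect": "Deny",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example-bucket/secret/*",
        },
    ],
}


def is_allowed(policy: Dict[str, Any], action: str, resource: str) -> bool:
    """Simplified check: default deny, explicit Deny overrides any Allow."""
    allowed = False
    for statement in policy["Statement"]:
        matches = (fnmatchcase(action, statement["Action"])
                   and fnmatchcase(resource, statement["Resource"]))
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return False          # an explicit deny ends the evaluation
        allowed = True
    return allowed


public_read = is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-bucket/public/a.txt")
secret_read = is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-bucket/secret/key.txt")
public_write = is_allowed(policy, "s3:PutObject", "arn:aws:s3:::example-bucket/public/a.txt")
```

Reading a public object is allowed, reading under `secret/` is blocked by the explicit Deny, and writing is refused simply because no statement allows it.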
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.
AWS Lambda is a ‘compute’ service that lets you run code without provisioning or managing servers. AWS Lambda executes your code only when needed and scales automatically, from a few requests per day to thousands per second.
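In Python, the code you hand to Lambda is a handler function that receives an event and a context object. The sketch below is a minimal example: the `name` field in the event and the shape of the response are assumptions chosen for illustration, since the actual event layout depends on what triggers the function.

```python
import json
from typing import Any, Dict


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point that Lambda calls once per invocation."""
    name = event.get("name", "world")   # assumed field in the incoming event
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local dry run: call the handler directly with a sample event.
response = lambda_handler({"name": "Ada"}, None)
```

Because the handler is an ordinary function, it can be tested locally like this before it is deployed.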
Amazon Simple Notification Service (SNS) is a highly available, durable, secure, fully managed pub/sub messaging service that enables you to decouple microservices, distributed systems, and serverless applications. Amazon SNS provides topics for high-throughput, push-based, many-to-many messaging.
Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. SQS eliminates the complexity and overhead associated with managing and operating message-oriented middleware and empowers developers to focus on differentiating work.
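The difference between a topic and a queue, and how the two combine to decouple services, can be shown with a small in-memory analogy. This is not the AWS API: the `Topic` and `Queue` classes and the service names are hypothetical, and features such as retries, visibility timeouts and durability are left out.

```python
from collections import deque
from typing import Deque, List, Optional


class Queue:
    """Holds messages until a consumer asks for them (queue semantics)."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None


class Topic:
    """Pushes a copy of each message to every subscriber (pub/sub semantics)."""

    def __init__(self) -> None:
        self._subscribers: List[Queue] = []

    def subscribe(self, queue: Queue) -> None:
        self._subscribers.append(queue)

    def publish(self, message: str) -> None:
        for queue in self._subscribers:
            queue.send(message)


billing, shipping = Queue(), Queue()
orders = Topic()
orders.subscribe(billing)
orders.subscribe(shipping)
orders.publish("order-created")   # the publisher does not know who is listening
```

Each service drains its own queue at its own pace, which is the decoupling both descriptions above refer to.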
Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching and backups.
11. Dynamo DB
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-region, multi-master database with built-in security, backup and restores, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
13. Cloud Formation
AWS CloudFormation provides a common language for you to describe and provision all the infrastructure resources in your cloud environment. CloudFormation allows you to use a simple text file to model and provision, in an automated and secure manner, all the resources needed for your applications across all regions and accounts. This file serves as the single source of truth for your cloud environment.
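To give a feel for what that text file contains, the sketch below builds a minimal template as a Python dictionary and renders it as JSON, one of the formats CloudFormation accepts. The logical name `ExampleBucket` and the description are made up, and a real template would usually set properties on the resource.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal example: one S3 bucket",
    "Resources": {
        # Logical name chosen for illustration; CloudFormation creates the bucket.
        "ExampleBucket": {
            "Type": "AWS::S3::Bucket",
        },
    },
}

# This text is what would be saved to a file and handed to CloudFormation.
template_text = json.dumps(template, indent=2)
```

Keeping the template in version control is what lets it act as the single source of truth for the environment.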
No learning can happen without doing a project. This is our mantra at Dimensionless Technologies. We have different projects planned for our learners which will help in implementing all the learnings from the course.
Why Dimensionless as your learning partner?
Dimensionless Technologies provides instructor-led LIVE online training with hands-on practice on different problems. We do not provide classroom training, but we deliver more than classroom training could provide you with.
Are you sceptical of online training or you feel that online mode is not the best platform to learn? Let us clear your insecurities about online training!
Live and Interactive sessions We conduct classes through live sessions and not pre-recorded videos. The interactivity level is similar to classroom training and you get it in the comfort of your home.
Highly Experienced Faculty We have very highly experienced faculty with us (IITians) to help you grasp complex concepts and kick-start your career success journey.
Up-to-Date Course Content Our course content is up to date and involves all the latest technologies and tools. Our course is well equipped for learners to grasp the knowledge required to solve real-world problems through their data analytical skills.
Availability of software and computing resources Any laptop with 2GB RAM and Windows 7 or above is perfectly fine for this course. All the software used in this course is freely downloadable from the Internet. The trainers help you set it up on your systems. We also provide access to our Cloud-based online lab where it is already installed.
Industry-Based Projects During the training, you will be solving multiple case studies from different domains. Once the LIVE training is done, you will start implementing your learnings on Real Time Datasets. You can work on data from various domains like Retail, Manufacturing, Supply Chain, Operations, Telecom, Oil and Gas and many more.
Course Completion Certificate Yes, we will be issuing a course completion certificate to all individuals who successfully complete the training.
Placement Assistance We provide you with real-time industry requirements on a daily basis through our connections in the industry. These requirements generally come through referral channels, hence the probability of getting through increases manifold.
Dimensionless Technologies has the right courses for you if you are aiming to kick-start your career in the field of data science. Not only do we cover all the important concepts and technologies, but we also focus on their implementation and usage in real-world business problems. Follow the link to register yourself for the free demo of the courses!