If you are reading this blog, you have most likely gone through the artificial intelligence and machine learning proofs of concept, convinced yourself of their potential to solve business problems, and developed a fair idea of how they could help you gain deep business insights. You have also most likely built an AI-driven application prototype that seems to work fairly well, with pretty good accuracy. And now the moment of truth: you need to put this application into production, using production-ready technology that will enable it to deliver real AI-powered insights to your business. The application needs to be highly available and able to scale with your volume and computing needs.
So how do you embark on this exciting journey?
Natural Language Processing, Machine Learning, and Unsupervised Learning applications are fairly unique, and they come with their own challenges and strengths. Trying to retrofit relational, normalized/denormalized data models and their associated design principles onto these applications usually does not work well.
In this blog, I walk through an example of a Natural Language Processing (NLP) and Unsupervised Machine Learning (UML) application that can classify and cluster similar articles and predict their themes and topics. This is called topic modeling, and it is one of the most exciting features offered today as a service by most cloud vendors.
Application Description – The application in this example is powered by NLP and UML. It is regularly trained with terabytes of data to learn to extract the themes and content of articles, news feeds, blogs, wiki articles, Twitter feeds, internal company documents, and so on. The application is used to deliver customized content to employees. For example, an employee interested in Cloud and Big Data technologies will have related content sourced from various external sources and presented to them securely, on demand. You can search related content topics with keywords, enabling you to build a search application. The application also informs management about the quality and classification of documents being produced internally and can serve as a marker of where most employees spend their time. There are many potential uses, but I have chosen a simple use case to keep this blog simple.
The above blueprint shows the Document Processing Pipeline for such an application. We will break it down into sections and look at the computing needs of each.
Data Sources and ETL on documents: The typical data sources for these applications have a high ‘variety’ of data. Most likely you are dealing with heterogeneous sources: HTML, XML, JSON, plain text, data with non-ASCII characters, mixed languages, and more.
The data here is very dark and needs a lot of preprocessing and cleaning to bring it up to enterprise standards for any practical use. In my opinion, this is the most overlooked part of the pipeline and the major cause of application design flaws. Do not take this step lightly if you want your application to be considered truly production-ready!
Team Name ‘Ready’
Developer dynamics: You need developers with solid experience in web scraping who have worked with very diverse file and data formats. Developers here also need a solid understanding of publish/subscribe frameworks and familiarity with real-time streaming ETL (Extract, Transform, and Load) of documents. This is very different from traditional batch ETL on relational data.
Typical Data Cleaning operations: XML/HTML tag identification, content extraction from tags, API calls for content extraction, removal of special characters, image removal from web pages, irrelevant sub-topic removal from web pages, removal of non-ASCII characters, and removal of words not in the target language, to name a few.
I have seen applications with over 100 filters to clean data at this stage! We are talking about some serious, heavy ETL work on documents here to make them usable for the application.
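To make the filtering idea concrete, here is a minimal sketch of a cleaning-filter chain in Python. The filters, their names, and the sample input are all illustrative; a production pipeline would register many more filters (as noted above, easily 100 or more):

```python
import html
import re
import unicodedata

def strip_tags(text: str) -> str:
    """Remove XML/HTML tags and unescape entities like &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def drop_non_ascii(text: str) -> str:
    """Normalize accented characters, then drop remaining non-ASCII bytes."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def collapse_whitespace(text: str) -> str:
    """Fold runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

# Filters run in order; new ones are appended to the chain.
FILTERS = [strip_tags, drop_non_ascii, collapse_whitespace]

def clean(document: str) -> str:
    for f in FILTERS:
        document = f(document)
    return document

raw = "<p>Caf&eacute; news &amp; <b>cloud</b>\n updates</p>"
print(clean(raw))  # Cafe news & cloud updates
```

The chain-of-filters shape makes it easy to add, reorder, or drop filters without touching the rest of the ingestion code.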
Compute Environment: Here you are dealing with processing several thousand content files every day. Each individual article is not very large; 1 GB is probably the maximum size you would process. The typical compute environment is a Kafka-based content publish/subscribe pipeline. Design the ingestion pipeline with several broker services for high availability. A streaming ETL product like StreamSets is most likely what team ‘Ready’ would be using.
Data Preparation for Natural Language Processing
If team ‘Ready’ was successful so far, you have now converted the very dark data into grey data and placed it in the ‘clean documents’ box. The data here is considered grey because, although it has been cleaned, it is still not ready to be fed into the AI engine.
Team Name ‘Get’
Developer dynamics: These are data scientists with very unique skills. They have a strong command of the language they are dealing with and a very good understanding of NLP data modeling. What makes team ‘Get’ unique is that their work is as much a work of art as it is a work of science!
Typical data operations: stemming, case conversion, stop-word removal, lemmatization.
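The operations above can be sketched in a few lines of Python. Note this is a toy illustration: a real system would use a library such as NLTK or spaCy, and the stop-word list, lemma lookup table, and suffix rules here are made up for the example:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to"}
LEMMAS = {"better": "good", "mice": "mouse"}  # hypothetical lookup table

def stem(word: str) -> str:
    """Naive suffix stripping, a stand-in for a Porter-style stemmer."""
    return re.sub(r"(ing|ed|s)$", "", word)

def prepare(text: str) -> list:
    tokens = text.lower().split()                        # case conversion
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [LEMMAS.get(t, stem(t)) for t in tokens]      # lemmatize or stem

print(prepare("The clusters are forming"))  # ['cluster', 'form']
```

The point is the order of operations: lowercase first, drop noise words, then reduce each surviving token to a base form so that ‘cluster’, ‘clusters’, and ‘clustering’ all count as one term downstream.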
Compute Environment: Similar to Team ‘Ready’
Data Ready for Natural Language Processing
The data here is white data, and it can finally be fed into the NLP algorithms.
Team Name ‘Set’
Developer dynamics: Very skilled data scientists and statisticians with expertise in deep learning, allocation and clustering models, and topic modeling. Here again, their work is as much a work of art as it is a work of science. But errors here can drastically decrease the accuracy of the application. Do not try to lowball the developers you are hiring for team ‘Set’. Get the best you can afford.
Typical Data Operations: dictionary building, neutral-word removal, frequency- and occurrence-based word removal, building the document-term matrix, allocation-based model building, topic labeling, model training.
Compute Environment: Training and model creation can have exponentially increasing computing demands as the size of the training set grows. Consider Big Data, Hadoop, and distributed-computing technologies here for scalability and resiliency.
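Two of team ‘Set’'s operations, dictionary building and the document-term matrix, can be sketched in plain Python. The two-document corpus is invented for illustration; at real scale this runs on a distributed framework and the matrix feeds an allocation model such as LDA:

```python
from collections import Counter

docs = [
    ["cloud", "data", "cloud"],
    ["data", "model", "training"],
]

# Dictionary: every distinct token across the corpus gets an integer id.
vocab = sorted({w for d in docs for w in d})
dictionary = {w: i for i, w in enumerate(vocab)}

# Document-term matrix: one row per document, one column per term,
# each cell holding how often that term appears in that document.
dtm = [[Counter(d)[w] for w in vocab] for d in docs]

print(dictionary)  # {'cloud': 0, 'data': 1, 'model': 2, 'training': 3}
print(dtm)         # [[2, 1, 0, 0], [0, 1, 1, 1]]
```

In practice the matrix is extremely sparse, which is exactly why the frequency- and occurrence-based word removal mentioned above matters: it keeps the vocabulary, and therefore the matrix, to a tractable size.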
Topic Detection of Input data set
Any content fed into this application is assigned a topic label based on the trained model and is clustered with similar content.
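One simple way to picture this assignment step is cosine similarity between a new document's term vector and per-topic centroid vectors learned during training. The vocabulary, topic names, and centroid values below are fabricated for the sketch; the real model's scoring would come from the allocation model built by team ‘Set’:

```python
import math

VOCAB = ["cloud", "data", "model", "soccer", "goal"]
TOPIC_CENTROIDS = {
    "big data": [0.6, 0.7, 0.4, 0.0, 0.0],
    "sports":   [0.0, 0.1, 0.0, 0.7, 0.7],
}

def vectorize(tokens):
    """Count occurrences of each vocabulary term in the document."""
    return [tokens.count(w) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def assign_topic(tokens):
    """Label the document with its most similar topic centroid."""
    v = vectorize(tokens)
    return max(TOPIC_CENTROIDS, key=lambda t: cosine(v, TOPIC_CENTROIDS[t]))

print(assign_topic(["cloud", "data", "model"]))  # big data
```

The same similarity scores also support the clustering and keyword-search use cases described earlier: documents whose vectors sit close together land in the same cluster.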
Team Name ‘Go’
Developer dynamics: Web developers and database developers with Python/Java/Hadoop programming skills.
Typical Data Operations: content classification, content search, content clustering
Compute Environment: Similar to teams ‘Ready’ and ‘Get’. Consider high-availability, low-latency compute frameworks.
This blog talks about the developers you would need. But as with any enterprise application, you would also need to plan for scrum masters, process orchestration engineers, solution architects, project managers, and security and audit controls on your application.
At this point, building an AI application may seem like an overwhelming task. In a way, it is. If you want to develop it in house, there are a lot of nuts and bolts, ‘I wish I knew this before’ moments, and integration work needed to make it a success. Success depends heavily on the synergies between the ‘Ready’, ‘Get’, ‘Set’, and ‘Go’ teams, and trying to decouple them completely to build a conveyor-belt-style application development pipeline may not work well. We are all new to AI-based application development, so my recommendation is to start with a simple use case, get it into production, and keep enhancing it over time.
You may not have the budget, time, or resources to build an AI stack in house, but you may still have a business need for AI. This is where cloud-based AI services and the pay-per-use model come in.
In the next blog, I will discuss what to consider if you are looking at using a cloud-based AI service for your application. Stay tuned!
Let's get some deeper insights!