Topic identification, a form of pattern recognition, is a critical component in providing the context your data analysis needs if you want to achieve real, valuable business insights.
In our previous articles we discussed why undertaking data analysis can be crucial for companies looking to improve their growth and operations. You can read our intro article here, and our second article in this series, which focuses on using analysis as part of your compliance and regulatory processes, here. As we’ve mentioned before, valuable business insights can be achieved with NLP (Natural Language Processing) and ML (Machine Learning) analysis tools that identify patterns, behaviours and sentiments within customer interactions. Three important components make up these solutions, the first of which is topic identification.
Topic identification uses NLP and ML methods to identify the different subjects that are referenced in human interactions. It is the most important of the three components, and it is particularly complex to implement because of the conversational nature of human interaction. As you’ll know, when people communicate, a variety of subjects can be, and usually are, discussed in a single phone call or email. Being able to correctly identify the right topic at the right moment within a conversation is absolutely critical: without this information, any analysis will lack context and lose relevance. It’s all about context. An additional benefit of topic identification is that inaccuracies in transcription records have minimal impact on the overall results. While specific words or sentences may not be transcribed with 100% accuracy, it is the ‘gist’ and the context relevant to the topic that really provide the insight.
So how do you measure the accuracy of topic identification to ensure you’re getting the right information? Firstly, you must test against a pre-processed data set so that you know what information it contains. Using a manually labelled data set allows you to measure accuracy levels and determine which specific algorithms will give you the best results for that data set (this can vary based on factors such as the communication medium). Secondly, you need to evaluate the two measures used to assess accuracy: precision and recall. Precision is the proportion of instances flagged as a topic that genuinely are that topic, while recall is the proportion of the topic’s actual occurrences in the conversation that were successfully identified.
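To make the two measures concrete, here is a minimal sketch (not Synetec’s actual pipeline; the labels below are invented for illustration) of computing precision and recall for a binary “topic present” decision against a manually labelled data set:

```python
def precision_recall(truth, predicted):
    """truth/predicted: lists of 0/1 flags, one per conversation or passage."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)  # missed occurrences
    precision = tp / (tp + fp) if tp + fp else 0.0  # of everything flagged, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of everything present, how much was found
    return precision, recall

# 1 = the topic (e.g. advice) is present, 0 = it is not.
manual = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # ground truth from manual review
model  = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]   # hypothetical model output

p, r = precision_recall(manual, model)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.83, recall = 0.83
```

Here the model flagged six passages, five of which were correct (precision 5/6), and found five of the six passages that genuinely contained the topic (recall 5/6).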
Below you can see the testing we carried out here at Synetec on the detection of advice. We tested six different methods: Decision Tree, Logistic Regression, Support Vector Machines, Random Forest, XGBoost and Deep Learning (MLP). We measured the precision and recall levels for each method separately against their own benchmarks: 60% for precision and 75% for recall.
We chose these benchmarks because it matters more to us to identify as many instances of advice as possible, even if this also produces more false positives. Tuning for higher precision would lower our recall, resulting in fewer instances of advice being detected overall. We found (as illustrated in the graph below) that for our purposes Logistic Regression is the best performer when the two measures are combined.
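The benchmark check itself is straightforward to express in code. This is a hedged illustration only: the per-method scores below are invented placeholders, not Synetec’s measured results; only the 60%/75% benchmarks come from the text.

```python
# Benchmarks from the article: 60% precision, 75% recall.
PRECISION_BENCHMARK = 0.60
RECALL_BENCHMARK = 0.75

# Placeholder (precision, recall) scores per method -- illustrative only.
results = {
    "Decision Tree":       (0.58, 0.70),
    "Logistic Regression": (0.66, 0.81),
    "Random Forest":       (0.72, 0.68),
}

for method, (precision, recall) in results.items():
    meets = precision >= PRECISION_BENCHMARK and recall >= RECALL_BENCHMARK
    print(f"{method}: {'meets' if meets else 'misses'} both benchmarks")
```

A method only qualifies when it clears both thresholds at once, which is why a single combined view of precision and recall is what drives the final choice.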
To achieve relevant insights, it’s crucial that the method you choose delivers the accuracy levels you’ve benchmarked. These benchmarks will vary based on the volume and type of data you want to analyse, and the purpose of your analysis. As you can see below, another benefit of the Logistic Regression method is that precision and recall can be traded off against each other: increasing one effectively lowers the other. This means that when identifying different topics we can adjust the accuracy levels to align with an organisation’s analysis goals.
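The precision/recall trade-off can be sketched with a toy example. A logistic-regression-style classifier outputs a probability per conversation, and moving the decision threshold shifts the balance between the two measures. The scores below are invented for illustration, not real model output:

```python
def classify(probabilities, threshold):
    """Flag a conversation as containing the topic when its score clears the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(truth, predicted):
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    return (tp / (tp + fp) if tp + fp else 0.0,
            tp / (tp + fn) if tp + fn else 0.0)

# Invented example: probability that "advice" occurs in each call, plus ground truth.
truth  = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.55, 0.6, 0.4, 0.3, 0.45, 0.2, 0.1]

for threshold in (0.7, 0.5, 0.4):
    p, r = precision_recall(truth, classify(scores, threshold))
    print(f"threshold {threshold}: precision = {p:.2f}, recall = {r:.2f}")
# threshold 0.7: precision = 1.00, recall = 0.60
# threshold 0.5: precision = 0.80, recall = 0.80
# threshold 0.4: precision = 0.71, recall = 1.00
```

Lowering the threshold catches every instance of advice (recall rises to 100%) at the cost of more false positives (precision falls), which is exactly the lever an organisation can pull to match its analysis goals.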
As you can see, there are many important considerations when approaching topic identification. Missing large portions of relevant data, or returning too many irrelevant results, will make any business insights far less valuable and possibly inaccurate. Topic identification provides the context for your data analysis, regardless of its purpose, so it’s critical to invest in getting it right.
Stay tuned to read about behaviour identification and sentiment analysis which we’ll be tackling in our future articles!
If you would like further advice on topic identification, or on how machine learning and data analysis can make a meaningful contribution to your business, please contact us.
Synetec is an Agile solutions provider with expertise in diverse development technologies, such as Angular, the .NET Framework, SQL Server and other cloud-friendly data stores. We are certified in, and have successfully delivered projects across, different cloud technology stacks such as Microsoft Azure and AWS, and we have been delivering integration and development solutions since 2000.
We work with a number of the UK’s most respected financial institutions to deliver a range of innovative solutions. We have expertise in working with established businesses as well as start-ups and extreme-growth businesses.