Hi Folks,
I am learning artificial intelligence and trying out my first real-life AI application. I want to take various sentences as input and classify each one into one of X categories, based on the keywords and the 'action' in the sentence.
The keywords are, for example, merger, acquisition, award, product launch, etc. In essence, I am trying to detect whether the sentence in question talks about a merger between two organizations, an acquisition by an organization, a person or an organization winning an award, the launch of a new product, and so on.
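To make the classification step concrete, here is a minimal sketch of what I mean by keyword-based categorization. It is plain Python written for illustration, not my actual NLTK models; the keyword sets and the names `CATEGORY_KEYWORDS` and `classify` are hypothetical.

```python
import re

# Hypothetical keyword sets; real models would carry many more terms.
CATEGORY_KEYWORDS = {
    "merger": {"merger", "merge", "merges", "merged"},
    "acquisition": {"acquisition", "acquire", "acquires", "acquired"},
    "award": {"award", "awarded", "wins", "won", "prize"},
    "product_launch": {"launch", "launches", "launched", "unveils"},
}


def classify(sentence, keywords=CATEGORY_KEYWORDS):
    """Return the category whose keywords occur most often, or None."""
    # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Score each category by how many tokens match one of its keywords.
    scores = {
        category: sum(token in words for token in tokens)
        for category, words in keywords.items()
    }
    best = max(scores, key=scores.get)
    # No keyword matched at all: report "unknown" rather than guessing.
    return best if scores[best] > 0 else None


print(classify("Acme Corp acquires Widget Inc"))
```

A sentence with no matching keyword returns `None`, which is the case where I would ask the user for the category.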
To do this, I have built a custom model for each keyword, based on the basic NLTK package model, and I am trying to improve classification by dynamically updating the models with related keywords, synonyms, etc. Also, given a set of sentences, I present the user with the detected categorization and ask whether it is correct and, if not, what the correct categorization is. I also ask the user to identify the entities.
So the objective is first to classify the sentence into a category and then to detect the named entities in the sentence, based on that category.
The idea is to automatically retrain the models based on this feedback, so that performance improves over time with as little manual intervention as possible. For the sake of this project, we can assume that the user feedback is accurate.
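The feedback loop I have in mind could be sketched roughly as follows. This is a toy word-count classifier in plain Python, assumed purely for illustration; it is not NLTK's API, and `FeedbackClassifier`, `add_feedback`, and `retrain` are hypothetical names.

```python
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


class FeedbackClassifier:
    """Toy classifier that is rebuilt from every confirmed example."""

    def __init__(self):
        self.examples = []                       # (sentence, category) pairs
        self.word_counts = defaultdict(Counter)  # category -> word frequencies

    def add_feedback(self, sentence, correct_category):
        # User feedback is assumed to be accurate, so store it and retrain.
        self.examples.append((sentence, correct_category))
        self.retrain()

    def retrain(self):
        # Rebuild the per-category word counts from scratch.
        self.word_counts = defaultdict(Counter)
        for sentence, category in self.examples:
            self.word_counts[category].update(tokenize(sentence))

    def predict(self, sentence):
        tokens = tokenize(sentence)
        scores = {
            category: sum(counts[token] for token in tokens)
            for category, counts in self.word_counts.items()
        }
        # Nothing learned yet, or no overlap with any category.
        if not scores or max(scores.values()) == 0:
            return None
        return max(scores, key=scores.get)


clf = FeedbackClassifier()
clf.add_feedback("Acme acquires Widget", "acquisition")
clf.add_feedback("Jane Doe wins the Turing Award", "award")
print(clf.predict("Globex acquires Initech"))
```

Retraining from scratch on every correction is only workable for small data; a real system would presumably batch the corrections.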
The problem I am facing is that NLTK only seems to allow fixed-length entities during training, so, for example, a two-word award is detected as two separate awards.
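To illustrate the behaviour I am after, here is a minimal sketch that assumes each token carries a BIO-style tag (`B-` begins an entity, `I-` continues it, `O` is outside any entity) and merges adjacent tokens into one entity. It is plain Python for illustration only, not NLTK's API; `merge_bio` and the `PERSON`/`AWARD` labels are hypothetical.

```python
def merge_bio(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open entity, then start a new one.
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" tag: close any open entity.
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


tokens = ["Jane", "Doe", "won", "the", "Nobel", "Prize", "yesterday"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-AWARD", "I-AWARD", "O"]
print(merge_bio(tokens, tags))
```

With tags like these, "Nobel Prize" comes out as a single award instead of two, which is the result I cannot get at the moment.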
What should my approach be to solve this problem? Is there a better NLU tool (even a commercial one) that can address it? It seems to me that this would be a common AI problem and that I am missing something basic. I would love to get your input on this.
Thanks & Regards
Camillelola