Author: Ram Ranjeet Kumar

Sunday, November 3, 2019

Natural Language Processing



Introduction
Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of human language. The term ‘NLP’ is sometimes used rather more narrowly than that, often excluding information retrieval and sometimes even excluding machine translation. NLP is sometimes contrasted with ‘computational linguistics’, with NLP being thought of as more applied. Nowadays, alternative terms are often preferred, like ‘Language Technology’ or ‘Language Engineering’. Language is often used in contrast with speech (e.g., Speech and Language Technology).
NLP is essentially multidisciplinary: it is closely related to linguistics (although the extent to which NLP overtly draws on linguistic theory varies considerably). It also has links to research in cognitive science, psychology, philosophy and maths (especially logic). Within Computer Science (CS), it relates to formal language theory, compiler techniques, theorem proving, machine learning and human-computer interaction. Of course it is also related to Artificial Intelligence (AI), though nowadays it’s not generally thought of as part of AI.
NLP is an aspect of AI that helps computers understand, interpret, and utilize human languages. NLP allows computers to communicate with people using a human language. Natural Language Processing also provides computers with the ability to read text, hear speech, and interpret it. NLP draws from several disciplines, including computational linguistics and computer science, as it attempts to close the gap between human and computer communications.
Generally speaking, NLP breaks down language into shorter, more basic pieces, called tokens (words, periods, etc.), and attempts to understand the relationships of the tokens.
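As a small illustration of this idea, here is a minimal tokenization sketch in Python, assuming only the standard library; the regular expression and the example sentence are invented for demonstration.

import re

def tokenize(text):
    # Words (runs of letters/digits) and individual punctuation marks
    # each become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP breaks language into tokens."))
# -> ['NLP', 'breaks', 'language', 'into', 'tokens', '.']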

History
In the early 1900s, a Swiss linguistics professor named Ferdinand de Saussure died, and in the process, almost deprived the world of the concept of “Language as a Science.” From 1906 to 1911, Professor Saussure offered three courses at the University of Geneva, where he developed an approach describing languages as “systems.” Within the language, a sound represents a concept – a concept that shifts meaning as the context changes.
He argued that meaning is created inside language, in the relations and differences between its parts. Saussure proposed that "meaning" is created within a language's relationships and contrasts. A shared language system makes communication possible. Saussure viewed society as a system of "shared" social norms that provides conditions for reasonable, "extended" thinking, resulting in decisions and actions by individuals. (The same view can be applied to modern computer languages.) Saussure died in 1913 before publishing these ideas himself; two of his colleagues, Charles Bally and Albert Sechehaye, reconstructed them from lecture notes and published the Cours de Linguistique Générale in 1916, preserving the "Language as a Science" approach.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower.
You may have heard about ELIZA, the most popular AI bot of its time, developed at MIT in the mid-1960s. It did not really understand the meaning of the input sentence, and it did not pass the Turing test, but it is still remarkable for the way it produced its replies. So how does it work? The idea is simple to understand. There is a database of words called keywords. For each keyword there is an integer value that gives the rank of the keyword, a pattern to match against the input, and a specification of the output. When the user types a sentence, the program splits it into words and looks for a keyword that matches one in the database (we can say that this is where the program tries to understand the meaning of the sentence: the first component, NLU). If more than one keyword is found, it picks the one with the highest rank and, according to that, gives the output already stored in the database (this is where it actually generates the answer: the second component, NLG). If no keyword is found in the database, it simply generates statements like "Tell me more" or "Go on".
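To make the keyword-ranking idea concrete, here is a toy ELIZA-style responder in Python. The keywords, ranks, and canned replies below are invented for illustration; they are not ELIZA's actual script.

# keyword -> (rank, canned reply); invented for illustration
KEYWORDS = {
    "mother": (5, "Tell me more about your family."),
    "sad": (3, "Why do you feel sad?"),
    "computer": (2, "Do computers worry you?"),
}
DEFAULT_REPLIES = ["Tell me more.", "Go on."]

def respond(sentence, turn=0):
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    # Collect every keyword that appears in the input (the "understanding" step).
    matches = [KEYWORDS[w] for w in words if w in KEYWORDS]
    if matches:
        # Pick the reply attached to the highest-ranked keyword (the "generation" step).
        return max(matches)[1]
    # No keyword found: fall back to a generic prompt.
    return DEFAULT_REPLIES[turn % len(DEFAULT_REPLIES)]

print(respond("I had a fight with my mother"))  # Tell me more about your family.
print(respond("The weather is nice today"))     # Tell me more.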
In 1966, the National Research Council (NRC) and the Automatic Language Processing Advisory Committee (ALPAC) initiated the first AI and NLP stoppage by halting the funding of research on Natural Language Processing and machine translation. After twelve years of research and $20 million, machine translations were still more expensive than manual human translations. In 1966, Artificial Intelligence and Natural Language Processing (NLP) research was considered a dead end by many (though not all).
Until the 1980s, the majority of NLP systems used complex, "handwritten" rules. But in the late 1980s, a revolution in NLP came about. This was the result of both the steady increase in computational power and the shift to Machine Learning algorithms. While some of the early Machine Learning algorithms (decision trees are a good example) produced systems similar to the old-school handwritten rules, research increasingly focused on statistical models. These statistical models are capable of making soft, probabilistic decisions. Throughout the 1980s, IBM was responsible for the development of several successful, complicated statistical models.
In the 1990s, the popularity of statistical models for Natural Language Processing rose dramatically. Purely statistical NLP methods have become remarkably valuable in keeping pace with the tremendous flow of online text.
In 2001, Yoshua Bengio and his team proposed the first neural "language" model, using a feed-forward neural network. A feed-forward neural network is an artificial neural network whose connections do not form a cycle. In this type of network, the data moves in only one direction: from the input nodes, through any hidden nodes, and then on to the output nodes. The feed-forward neural network has no cycles or loops, and is quite different from recurrent neural networks.
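A minimal sketch of such a feed-forward pass is shown below, using NumPy; the layer sizes and random weights are arbitrary illustrations, not Bengio's actual language model.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input features
W1 = rng.normal(size=(4, 3))  # input -> hidden weights
W2 = rng.normal(size=(2, 4))  # hidden -> output weights

hidden = np.tanh(W1 @ x)      # data flows forward through one hidden layer
output = W2 @ hidden          # ...and on to the outputs, with no cycles or loops
print(output)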
By using Machine Learning techniques, the owner’s speaking pattern doesn’t have to match exactly with predefined expressions. The sounds just have to be reasonably close for an NLP system to translate the meaning correctly. By using a feedback loop, NLP engines can significantly improve the accuracy of their translations, and increase the system’s vocabulary. A well-trained system would understand the words, “Where can I get help with Big Data?” “Where can I find an expert in Big Data?,” or “I need help with Big Data,” and provide the appropriate response.
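As a toy illustration of mapping varied phrasings onto one intent, the sketch below uses plain string similarity from Python's standard library; real NLP engines use trained models, and the example phrases and intent names here are invented.

from difflib import SequenceMatcher

# One stored example phrase per intent (invented for illustration).
INTENT_EXAMPLES = {
    "big_data_help": "Where can I get help with Big Data?",
    "greeting": "Hello, how are you?",
}

def closest_intent(utterance):
    # Score the utterance against each stored example and keep the best match.
    scored = [(SequenceMatcher(None, utterance.lower(), ex.lower()).ratio(), intent)
              for intent, ex in INTENT_EXAMPLES.items()]
    return max(scored)

print(closest_intent("I need help with Big Data"))               # matches big_data_help
print(closest_intent("Where can I find an expert in Big Data?"))  # matches big_data_help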
The combination of a dialog manager with NLP makes it possible to develop a system capable of holding a conversation and sounding human-like, with back-and-forth questions, prompts, and answers. Our modern AIs, however, are still not able to pass Alan Turing's test, and currently do not sound like real human beings. (Not yet, anyway.)

Turing Test
The Turing test was developed by the computer scientist Alan Turing in 1950. He proposed it as a way to determine whether or not a computer (machine) can think intelligently like a human.
Imagine a game with three players: two humans and one computer. An interrogator (a human) is isolated from the other two players. The interrogator's job is to figure out which one is human and which one is the computer by asking both of them questions. To make things harder, the computer tries to make the interrogator guess wrongly; in other words, the computer tries to be as indistinguishable from a human as possible.
In the "standard interpretation" of the Turing test, player C, the interrogator, is given the task of trying to determine which player, A or B, is a computer and which is a human. The interrogator is limited to using the responses to written questions to make the determination.
The conversation between interrogator and computer would be like this:

C(Interrogator): Are you a computer?

A(Computer): No

C: Multiply one large number by another: 158745887 * 56755647

A: After a long pause, an incorrect answer!

C: Add 5478012, 4563145

A: (Pauses for about 20 seconds and then gives the answer) 10041157


If the interrogator cannot distinguish the answers provided by the human from those provided by the computer, then the computer passes the test and the machine is considered as intelligent as a human. In other words, a computer is considered intelligent if its conversation cannot easily be distinguished from a human's. Turing predicted that a computer "would be able to play the imitation game so well that an average interrogator will not have more than a 70-percent chance of making the right identification (machine or human) after five minutes of questioning." No computer has come close to this standard.
But in 1980, John Searle proposed the "Chinese room argument". He argued that the Turing test could not be used to determine whether or not a machine is as intelligent as a human: a machine like ELIZA or PARRY could pass the Turing test simply by manipulating symbols of which it had no understanding, and without understanding it could not be described as "thinking" in the same sense people do. We will discuss this further in the next article.

In 1990, the New York businessman Hugh Loebner announced a $100,000 prize for the first computer program to pass the test. However, no AI program has so far come close to passing an undiluted Turing test.

Applications of NLP
·         Speech Recognition: Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition, computer speech recognition, or speech to text.
·         Sentiment Analysis: Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information (see the sketch after this list).
·         Machine Translation: Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
·         Chatbot: A chatbot is a piece of software that conducts a conversation via auditory or textual methods. Such programs are often designed to convincingly simulate how a human would behave as a conversational partner, although as of 2019, they are far short of being able to pass the Turing test.
·         Spell Checking: A spell checker is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine.
·         Keyword Searching: Keyword research is a practice search engine optimization professionals use to find and research alternative search terms that people enter into search engines while looking for a similar subject.
·         Advertisement Matching: Matching advertisements to users based on their day-to-day browsing history.
·         Information Extraction: Information extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.
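Here is the sentiment analysis sketch referred to above: a minimal lexicon-based approach in Python. The tiny word lists are invented for illustration; real systems use large lexicons or trained classifiers.

POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    # Count positive and negative words and compare the totals.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this phone, the camera is great"))  # positive
print(sentiment("The battery life is terrible"))            # negative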

Components of NLP
There are mainly two components of NLP, namely
     1.   Natural Language Understanding (NLU) and
     2.  Natural Language Generation (NLG) 


The first one is Natural Language Understanding (NLU). As the name says, its main job is to understand the input given by the user as natural language. It deals with machine reading comprehension: the ability to read text, process it, and understand its meaning. It involves mapping the given input (let's take plain text as input) into useful representations and analyzing the different aspects of the language. Two common kinds of ambiguity it has to handle are:
Lexical Ambiguity: This type of ambiguity arises when a word can have multiple interpretations. For instance, in English, the word "back" can be a noun (back stage), an adjective (back door), or an adverb (back away); a toy sketch of this appears below.
Syntactic Ambiguity: This type of ambiguity arises when a sentence can be parsed in multiple syntactic forms. Take the following sentence: "I heard his cell phone ring in my office". The prepositional phrase "in my office" can be parsed in a way that modifies the noun or in another way that modifies the verb.
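The toy sketch below shows how an NLU component might represent lexical ambiguity: one surface form with several candidate parts of speech, where choosing among them requires the surrounding context. The tiny lexicon is invented for illustration.

# Candidate parts of speech per word form (invented for illustration).
LEXICON = {
    "back": ["noun", "adjective", "adverb"],  # "back stage" / "back door" / "back away"
    "door": ["noun"],
    "office": ["noun"],
}

def candidate_tags(word):
    # A lexicon lookup alone cannot resolve the ambiguity; a tagger must
    # use the neighbouring words to pick one of the candidates.
    return LEXICON.get(word.lower(), ["unknown"])

for word in ["back", "door", "ring"]:
    print(word, "->", candidate_tags(word))
# back -> ['noun', 'adjective', 'adverb']
# door -> ['noun']
# ring -> ['unknown']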

Now about the second one, Natural Language Generation (NLG). The main task of this component is to generate meaningful output in the form of natural language from some internal representation, according to the given input.
It includes the following stages (a minimal sketch follows this list):
     1.  Text planning: retrieving the relevant content from a database. Here the database can include vocabulary, sentences, knowledge, sample data, and much more.
     2.  Sentence planning: once we have our content from text planning, the next step is choosing the required words and forming meaningful sentences, setting the words in the right grammatical order.
     3.  Text realization: we now have everything needed to create the actual text in human language.
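A minimal sketch of these three stages is shown below, using a made-up weather "database"; the field names, wording, and template are invented for illustration.

DATABASE = {"city": "Delhi", "condition": "sunny", "temperature": 31}

def text_planning(db):
    # Text planning: pick the relevant content out of the database.
    return {"city": db["city"], "condition": db["condition"], "temp": db["temperature"]}

def sentence_planning(content):
    # Sentence planning: choose the required words and put them in grammatical order.
    return ["The weather in", content["city"], "is", content["condition"],
            "with a temperature of", str(content["temp"]), "degrees"]

def text_realization(words):
    # Text realization: produce the final text in human language.
    return " ".join(words) + "."

print(text_realization(sentence_planning(text_planning(DATABASE))))
# The weather in Delhi is sunny with a temperature of 31 degrees.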

Steps In Natural Language Processing
  1.  Lexical Analysis
  2.  Syntactic Analysis (Parsing)
  3.  Semantic Analysis
  4.  Discourse Integration
  5.  Pragmatic Analysis


1.       Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words (a small sketch follows this list).
2.       Syntactic Analysis (Parsing) − It involves analyzing the words in a sentence for grammar and arranging the words in a manner that shows the relationships among them. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.
3.       Semantic Analysis − It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness. This is done by mapping syntactic structures to objects in the task domain. The semantic analyzer disregards sentences such as "hot ice-cream".
4.       Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In addition, it also contributes to the meaning of the immediately succeeding sentence.
5.       Pragmatic Analysis − During this step, what was said is re-interpreted in terms of what it actually meant. It involves deriving those aspects of language which require real-world knowledge.
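As a small sketch of the first step, the Python snippet below splits a chunk of text into paragraphs, sentences, and words using only the standard library; the sample text is invented, and the later stages (syntactic, semantic, discourse, pragmatic) would build on this output.

import re

def lexical_analysis(text):
    # Paragraphs are separated by blank lines; sentences end in ., ! or ?;
    # words are runs of letters and digits.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [[re.findall(r"\w+", s) for s in re.split(r"(?<=[.!?])\s+", p) if s]
            for p in paragraphs]

sample = "The boy goes to school. He likes it.\n\nHot ice-cream is odd."
for paragraph in lexical_analysis(sample):
    print(paragraph)
# [['The', 'boy', 'goes', 'to', 'school'], ['He', 'likes', 'it']]
# [['Hot', 'ice', 'cream', 'is', 'odd']]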

Future
Currently, a great deal of work is going on in the field of NLP, and one of the most remarkable NLP systems being built is Google Duplex, from Google. NLP systems such as Alexa, Siri, and Cortana were already on the market, but Google Duplex is far more advanced and much more developed than those previous NLP systems.

Challenges
• Meeting the expectations of the user.
• Understanding ambiguity in natural language.
• Understanding the effect of context on meaning.
• Understanding the referents of phrases like: he, she, and it. (Anaphoric Referencing)
• Speed and efficiency of the interface.
• Recognising relevant data, while disregarding the irrelevant data like age & gender.

