Natural Language Processing
Introduction
Natural language
processing (NLP) can be defined as the automatic (or semi-automatic) processing
of human language. The term ‘NLP’ is sometimes used rather more narrowly than
that, often excluding information retrieval and sometimes even excluding
machine translation. NLP is sometimes contrasted with ‘computational
linguistics’, with NLP being thought of as more applied. Nowadays, alternative
terms are often preferred, like ‘Language Technology’ or ‘Language
Engineering’. Language is often used in contrast with speech (e.g., Speech and
Language Technology).
NLP is essentially
multidisciplinary: it is closely related to linguistics (although the extent to
which NLP overtly draws on linguistic theory varies considerably). It also has
links to research in cognitive science, psychology, philosophy and maths
(especially logic). Within Computer Science (CS), it relates to formal language
theory, compiler techniques, theorem proving, machine learning and
human-computer interaction. Of course it is also related to Artificial
Intelligence (AI), though nowadays it’s not generally thought of as part of AI.
NLP is often described as an area of AI that helps computers understand, interpret,
and make use of human languages. NLP allows computers to communicate with people
using a human language, and gives them the ability to read text, hear speech,
and interpret it. It draws on several disciplines, including computational
linguistics and computer science, as it attempts to close the gap between human
and computer communication.
Generally
speaking, NLP breaks down language into shorter, more basic pieces, called
tokens (words, periods, etc.), and attempts to understand the relationships of
the tokens.
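As a small illustration, the sketch below (assuming Python and a simple hand-written regular expression; real tokenizers handle many more cases, such as contractions and abbreviations) breaks a sentence into word and punctuation tokens:

```python
import re

# A minimal tokenization sketch: split text into word tokens and single
# punctuation tokens using one regular expression (illustrative only).
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP breaks language into tokens, like words and periods."))
# -> ['NLP', 'breaks', 'language', 'into', 'tokens', ',', 'like',
#     'words', 'and', 'periods', '.']
```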
History
In the early 1900s, the Swiss linguistics professor Ferdinand de Saussure died
before his ideas were published, nearly depriving the world of the concept of
“Language as a Science.” From 1906 to 1911, Saussure offered three courses at the
University of Geneva, where he developed an approach describing languages as
“systems.” Within a language, a sound represents a concept, and that concept
shifts meaning as the context changes.
He argued that meaning is created inside language, in the relations and
differences between its parts: “meaning” arises within a language’s
relationships and contrasts, and a shared language system is what makes
communication possible. Saussure viewed society as a system of “shared” social
norms that provides the conditions for reasonable, “extended” thinking,
resulting in decisions and actions by individuals. (The same view can be
applied to modern computer languages.)
In 1950, Alan
Turing published an article titled "Computing Machinery and
Intelligence" which proposed what is now called the Turing test as a
criterion of intelligence.
The Georgetown experiment in 1954 involved the fully automatic translation of
more than sixty Russian sentences into English. The authors claimed that within
three to five years, machine translation would be a solved problem. However,
real progress was much slower.
You may have heard of ELIZA, the most popular AI chatbot of its time, developed
at MIT in the mid-1960s. It did not really understand the meaning of the input
sentences, and it did not pass the Turing test, but it is still remembered for
the convincing way it produced replies. So how does it work? The idea is very
simple: there is a database of words called keywords. For each keyword there is
an integer value that gives the rank of the keyword, a pattern to match against
the input, and a specification of the output. When the user types a sentence,
the program splits it into words and looks for keywords that match entries in
the database (this is the program’s rudimentary attempt to understand the
meaning of the sentence, the NLU component). If more than one keyword is found,
it picks the one with the highest rank and produces the corresponding output
already stored in the database (this is where it actually generates the answer,
the NLG component). If no keyword is found in the database, it simply falls
back to generic statements such as “Tell me more” or “Go on.”
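The following is a minimal sketch of that keyword mechanism in Python; the keywords, ranks, patterns and responses below are invented for illustration and are not ELIZA’s actual script:

```python
import random
import re

# A sketch of an ELIZA-style keyword matcher; rules here are made up.
KEYWORDS = [
    # (rank, regex pattern to match in the input, possible responses)
    (10, r"\b(mother|father|family)\b", ["Tell me more about your family."]),
    (5,  r"\bI am (.+)",                ["Why do you say you are {0}?"]),
    (1,  r"\bcomputer\b",               ["Do computers worry you?"]),
]
FALLBACKS = ["Tell me more.", "Go on.", "I see."]

def reply(sentence):
    """Pick the highest-ranked keyword whose pattern matches; otherwise fall back."""
    best = None
    for rank, pattern, responses in KEYWORDS:
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match and (best is None or rank > best[0]):
            best = (rank, match, responses)
    if best is None:
        return random.choice(FALLBACKS)          # no keyword found in the database
    _, match, responses = best
    return random.choice(responses).format(*match.groups())

print(reply("I am feeling tired"))    # -> Why do you say you are feeling tired?
print(reply("The weather is nice"))   # -> Tell me more. / Go on. / I see.
```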
In 1966, the National Research Council (NRC) and the Automatic Language
Processing Advisory Committee (ALPAC) initiated the first AI and NLP stoppage
by halting the funding of research on natural language processing and machine
translation. After twelve years of research and $20 million, machine
translations were still more expensive than manual human translations, and in
1966 Artificial Intelligence and Natural Language Processing (NLP) research was
considered a dead end by many (though not all).
Until the 1980s, the majority of NLP systems used complex, “handwritten” rules.
But in the late 1980s, a revolution in NLP came about. This was the result of
both the steady increase in computational power and the shift to Machine
Learning algorithms. While some of the early Machine Learning algorithms
(decision trees are a good example) produced systems similar to the old
handwritten rules, research increasingly focused on statistical models. These
statistical models are capable of making soft, probabilistic decisions.
Throughout the 1980s, IBM was responsible for the development of several
successful, complicated statistical models.
In the 1990s, the popularity of statistical models for natural language
processing rose dramatically. Purely statistical NLP methods became remarkably
valuable in keeping pace with the tremendous flow of online text.
In 2001, Yoshua Bengio and his team proposed the first neural “language” model,
using a feed-forward neural network. A feed-forward neural network is an
artificial neural network whose connections do not form a cycle: the data moves
in only one direction, from the input nodes, through any hidden nodes, and on
to the output nodes. Because it has no cycles or loops, it is quite different
from recurrent neural networks.
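As a toy illustration of such a forward pass (not Bengio’s actual model), the sketch below uses Python and NumPy with an invented six-word vocabulary and random, untrained weights; data flows from the concatenated word embeddings through one hidden layer to a softmax over the vocabulary, with no cycles:

```python
import numpy as np

# A toy feed-forward neural language model: embed the previous words,
# concatenate, pass through one hidden layer, softmax over the vocabulary.
vocab = ["<s>", "the", "cat", "sat", "on", "mat"]
V, d, h, n_context = len(vocab), 8, 16, 2   # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)

C  = rng.normal(size=(V, d))                # word embedding matrix
W1 = rng.normal(size=(n_context * d, h))    # input -> hidden weights
W2 = rng.normal(size=(h, V))                # hidden -> output weights

def next_word_probs(context_ids):
    """Forward pass only: input -> hidden -> output, no cycles or loops."""
    x = np.concatenate([C[i] for i in context_ids])   # concatenated embeddings
    hidden = np.tanh(x @ W1)                          # hidden layer
    scores = hidden @ W2                              # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                            # softmax probabilities

context = [vocab.index("the"), vocab.index("cat")]
probs = next_word_probs(context)
print(vocab[int(np.argmax(probs))])   # untrained weights, so the prediction is arbitrary
```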
By using Machine Learning techniques, a user’s speaking pattern doesn’t have to
match predefined expressions exactly; the sounds just have to be reasonably
close for an NLP system to work out the meaning correctly. By using a feedback
loop, NLP engines can significantly improve the accuracy of their
interpretations and increase the system’s vocabulary. A well-trained system
would understand the words “Where can I get help with Big Data?”, “Where can I
find an expert in Big Data?” or “I need help with Big Data,” and provide the
appropriate response.
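A very rough sketch of this idea, assuming Python and a hand-written bag-of-words cosine similarity (real systems use trained models and acoustic features), might map different phrasings to the same intent like this:

```python
from collections import Counter
import math

# Invented intents and threshold, purely for illustration.
INTENTS = {
    "big_data_help": "where can I find help or an expert in big data",
    "greeting": "hello hi good morning",
}

def bow(text):
    """Bag-of-words: count the lowercased tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_intent(utterance, threshold=0.2):
    scores = {name: cosine(bow(utterance), bow(example)) for name, example in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

print(match_intent("I need help with Big Data"))            # -> big_data_help
print(match_intent("Where can I get help with Big Data?"))  # -> big_data_help
```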
The combination of a dialog manager with NLP makes it possible to develop a
system capable of holding a conversation and sounding human-like, with
back-and-forth questions, prompts, and answers. Our modern AIs, however, are
still not able to pass Alan Turing’s test, and currently do not sound like real
human beings. (Not yet, anyway.)
Turing Test
The Turing test was developed by the computer scientist Alan Turing in 1950. He
proposed it as a way to determine whether or not a computer (machine) can think
intelligently like a human.
Imagine a game with three players: two humans and one computer. An interrogator
(a human) is isolated from the other two players. The interrogator’s job is to
figure out which of the two is the human and which is the computer by asking
both of them questions. To make things harder, the computer tries to make the
interrogator guess wrongly; in other words, the computer tries to be as
indistinguishable from a human as possible.
In the “standard interpretation” of the Turing test, player C, the
interrogator, is given the task of trying to determine which player, A or B, is
a computer and which is a human. The interrogator is limited to using the
responses to written questions to make the determination.
The conversation between the interrogator and the computer would go like this:
C (Interrogator): Are you a computer?
A (Computer): No.
C: Multiply one large number by another: 158745887 * 56755647.
A: (After a long pause) an incorrect answer!
C: Add 5478012 and 4563145.
A: (Pauses for about 20 seconds and then answers) 10041157.
If the interrogator is not able to distinguish the answers provided by the
computer from those of the human, the computer passes the test and the machine
is considered as intelligent as a human. In other words, a computer would be
considered intelligent if its conversation couldn’t easily be distinguished
from a human’s. Turing predicted that a computer “would be able to play the
imitation game so well that an average interrogator will not have more than a
70-percent chance of making the right identification (machine or human) after
five minutes of questioning.” No computer has come close to this standard.
In 1980, John Searle proposed the “Chinese room argument.” He argued that the
Turing test could not be used to determine whether or not a machine is as
intelligent as a human: programs such as ELIZA and PARRY could easily pass the
Turing test simply by manipulating symbols of which they had no understanding,
and without understanding they could not be described as “thinking” in the same
sense people do. We will discuss this further in the next article.
In 1990, the New York businessman Hugh Loebner announced a $100,000 prize for
the first computer program to pass the test; however, no AI program has so far
come close to passing an undiluted Turing test.
Applications of NLP
· Speech Recognition: Speech recognition is an interdisciplinary subfield of
computational linguistics that develops methodologies and technologies enabling
the recognition and translation of spoken language into text by computers. It
is also known as automatic speech recognition, computer speech recognition or
speech-to-text.
· Sentiment Analysis: Sentiment analysis refers to the use of natural language
processing, text analysis, computational linguistics, and biometrics to
systematically identify, extract, quantify, and study affective states and
subjective information. (A rough lexicon-based sketch appears after this list.)
· Machine Translation: Machine translation, sometimes referred to by the
abbreviation MT, is a sub-field of computational linguistics that investigates
the use of software to translate text or speech from one language to another.
· Chatbot: A chatbot is a piece of software that conducts a conversation via
auditory or textual methods. Such programs are often designed to convincingly
simulate how a human would behave as a conversational partner, although as of
2019 they fall far short of being able to pass the Turing test.
· Spell Checking: A spell checker is a software feature that checks for
misspellings in a text. Spell-checking features are often embedded in software
or services such as word processors, email clients, electronic dictionaries,
and search engines.
· Keyword Searching: Keyword research is a practice search engine optimization
professionals use to find and research alternative search terms that people
enter into search engines while looking for a similar subject.
· Advertisement Matching: Matching advertisements to users based on their
day-to-day browsing history.
· Information Extraction: Information extraction is the task of automatically
extracting structured information from unstructured and/or semi-structured
machine-readable documents.
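As promised in the sentiment analysis item above, here is a rough lexicon-based sketch in Python; the word lists and the scoring rule are invented for illustration, whereas practical systems use trained statistical models:

```python
# Count positive and negative words from a tiny hand-written lexicon.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this phone and the camera is excellent"))  # -> positive
print(sentiment("The battery is terrible"))                        # -> negative
```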
Components of NLP
There are mainly two components of NLP, namely:
1. Natural Language Understanding (NLU), and
2. Natural Language Generation (NLG)
The first one is Natural Language Understanding (NLU). As the name says, its
main job is to understand the input given by the user as natural language. It
deals with machine reading comprehension: the ability to read text, process it,
and understand its meaning. It involves mapping the given input (say, plain
text) into useful representations and analyzing different aspects of the
language. A central difficulty for NLU is ambiguity, which comes in several
forms.
Lexical Ambiguity: This type of ambiguity arises when a single word can have
more than one meaning or part of speech. For instance, in English, the word
“back” can be a noun (the back of the stage), an adjective (back door) or an
adverb (back away).
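One way to see lexical ambiguity concretely is to look a word up in WordNet via NLTK, which lists its many senses across parts of speech (this assumes the nltk package and its “wordnet” data are installed; the exact sense counts depend on the WordNet version):

```python
import nltk
from nltk.corpus import wordnet

# Look up the ambiguous word "back": it has many senses spread across
# several parts of speech (noun 'n', verb 'v', adjective 'a'/'s', adverb 'r').
nltk.download("wordnet", quiet=True)

senses = wordnet.synsets("back")
print(len(senses), "senses of 'back' found")
for synset in senses[:5]:
    print(synset.name(), synset.pos(), "-", synset.definition())
```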
Syntactic Ambiguity: This type of ambiguity arises when a sentence can be
parsed in multiple syntactic ways. Take the following sentence: “I heard his
cell phone ring in my office.” The prepositional phrase “in my office” can be
parsed in a way that modifies the noun phrase or in another way that modifies
the verb.
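A small sketch of this attachment ambiguity, using NLTK’s chart parser and a toy grammar written just for this example, finds two parse trees for a simplified version of the sentence:

```python
import nltk

# Toy grammar: "in my office" can attach to the verb phrase (heard it while in
# my office) or to the noun phrase (the phone that is in my office), so the
# chart parser returns two distinct trees.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pron | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pron -> 'I'
Det -> 'his' | 'my'
N -> 'phone' | 'office'
V -> 'heard'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
sentence = "I heard his phone in my office".split()
for tree in parser.parse(sentence):
    print(tree)          # two trees are printed, one per attachment
```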
Now for the second component, Natural Language Generation (NLG). The main task
of this component is to generate meaningful output in natural language from
some internal representation, according to the given input.
It includes:
1. Text planning: retrieving the relevant content from a database, which can
include vocabulary, sentences, knowledge, sample data and more.
2. Sentence planning: once we have the content from text planning, the next
step is choosing the required words and forming meaningful sentences, putting
the words in the right grammatical order.
3. Text realisation: with everything in place, the actual text is produced in
human language.
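As a toy illustration of the three stages above, the following Python sketch uses a hand-written one-entry knowledge base and a very simple realisation step; real NLG systems are far more sophisticated:

```python
# Invented knowledge base for illustration only.
KNOWLEDGE_BASE = {
    "paris": {"type": "city", "country": "France", "population": "about 2.1 million"},
}

def text_planning(topic):
    """Text planning: retrieve the relevant content from the knowledge base."""
    return KNOWLEDGE_BASE[topic]

def sentence_planning(topic, facts):
    """Sentence planning: choose the words and put them in grammatical order."""
    return (("subject", topic.title()),
            ("verb", "is"),
            ("complement",
             f"a {facts['type']} in {facts['country']} with a population of {facts['population']}"))

def text_realisation(plan):
    """Text realisation: produce the final surface text."""
    return " ".join(value for _, value in plan) + "."

facts = text_planning("paris")
print(text_realisation(sentence_planning("paris", facts)))
# -> Paris is a city in France with a population of about 2.1 million.
```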
Steps
In Natural Language Processing
- 1. Lexical Analysis
- 2. Syntactic Analysis (Parsing)
- 3. Semantic Analysis
- 4. Discourse Integration
- 5. Pragmatic Analysis
1. Lexical Analysis − It involves identifying and analyzing the structure of
words. The lexicon of a language means the collection of words and phrases in
that language. Lexical analysis divides the whole chunk of text into
paragraphs, sentences, and words.
2. Syntactic Analysis (Parsing) − It involves analyzing the words in the
sentence for grammar and arranging the words in a manner that shows the
relationships among them. A sentence such as “The school goes to boy” is
rejected by an English syntactic analyzer.
3. Semantic Analysis − It draws the exact meaning, or the dictionary meaning,
from the text. The text is checked for meaningfulness. This is done by mapping
syntactic structures onto objects in the task domain. The semantic analyzer
disregards a sentence such as “hot ice-cream.”
4. Discourse Integration − The meaning of any sentence depends upon the meaning
of the sentence just before it. In addition, it also influences the meaning of
the sentence that immediately follows it.
5. Pragmatic Analysis − During this step, what was said is re-interpreted in
terms of what was actually meant. It involves deriving those aspects of
language which require real-world knowledge.
Future
Currently a great deal of work is going on in the field of NLP, but one of the
most remarkable NLP systems is being built by Google: Google Duplex. NLP
systems such as Alexa, Siri and Cortana were already on the market, but Google
Duplex is a far more advanced and more developed system than those earlier NLP
systems.
Challenges
• Meeting the expectations of the user.
• Understanding ambiguity in natural language.
• Understanding the effect of context on meaning.
• Understanding the referents of phrases like: he, she, and it. (Anaphoric
Referencing)
• Speed and efficiency of the interface.
• Recognising relevant data, while disregarding the irrelevant data like age
& gender.