NLP
Natural Language Processing (commonly abbreviated NLP) is the analysis of human language, usually as text input, to determine the meaning of a query or statement, whether through semantic associations, statistics, some combination of these, or other techniques.
Specifications
No major specs published to date, but several competing RFCs and intriguing approaches exist.
Word2Vec
Words to Vectors (commonly abbreviated Word2Vec) is a technique for producing "word embeddings": the words of a given language are converted into numerical vectors, and the same text may have one or more different numerical representations. This helps associate and group words by meaning, which speeds up tasks like classification, reasoning and NLP.
- wikipedia: Word2vec[1][2][3][4]
- DeepLearning4j -- Word2Vec, Doc2vec & GloVe (JAVA): https://deeplearning4j.org/word2vec.html (Neural Word Embeddings for Natural Language Processing)
- TensorFlow - Vector Representations of Words (PYTHON): https://www.tensorflow.org/tutorials/word2vec[5]
- Word2Vec script (SHELL): https://github.com/dav/word2vec (bag-of-words and skip-gram architectures for computing vector representations of words)
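The idea of grouping words by meaning once they are numbers can be sketched without any of the libraries above. The following is a minimal illustration, not Word2Vec itself: the three-dimensional vectors are hand-picked, hypothetical values (a trained model would learn far larger vectors from a corpus), and the helper names `cosine_similarity` and `most_similar` are chosen for this example only.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# A trained Word2Vec model would learn these values from a large corpus.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear":  [0.2, 0.1, 0.8],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors=VECTORS):
    """Return the other word whose vector points in the closest direction."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(target, vectors[w]))


if __name__ == "__main__":
    for word in VECTORS:
        print(word, "->", most_similar(word))
```

With these toy values, "king" pairs with "queen" and "apple" pairs with "pear", because their vectors point in nearly the same direction. Cosine similarity is used here because it compares direction rather than length, which is a common choice for comparing word vectors.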
EXAMPLE
An example would be:
Hello everyone, my name is Bryan
An NLP system should recognize that:
- Hello is a salutation
- everyone is the audience to which the opening salutation was directed
- my correlates the upcoming noun to a person (namely, the person entering the query, if they are speaking in 1st person)
- name indicates that the value to follow is a proper name (a title or way to refer to someone or something)
- is signifies a state of being, or a fact
- Bryan represents the value of the fact (i.e. in computer code that might look like: name="Bryan" or name.equals("Bryan") or name->"Bryan" etc...)
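The breakdown above can be sketched as a small rule-based interpreter. This is only a toy covering the example sentence, under stated assumptions: the salutation word list, the regular expression, and the function name `interpret` are all hypothetical, and a real NLP system would rely on tokenization, part-of-speech tagging and named entity recognition rather than fixed patterns.

```python
import re

# Hypothetical word list and pattern, chosen only to cover the example sentence.
SALUTATIONS = {"hello", "hi", "hey", "greetings"}
NAME_PATTERN = re.compile(r"\bmy\s+name\s+is\s+(?P<name>[A-Z][a-z]+)")


def interpret(sentence):
    """Extract the salutation, its audience, and any name fact from a sentence."""
    result = {"salutation": None, "audience": None, "facts": {}}

    # Treat the text before the first comma as the opening phrase.
    opening, comma, _ = sentence.partition(",")
    words = opening.split()
    if words and words[0].lower() in SALUTATIONS:
        result["salutation"] = words[0]
        # Only read an audience when a comma actually closed the opening phrase.
        if comma and len(words) > 1:
            result["audience"] = " ".join(words[1:])

    # "my name is X" states a fact: name = "X".
    match = NAME_PATTERN.search(sentence)
    if match:
        result["facts"]["name"] = match.group("name")
    return result


if __name__ == "__main__":
    print(interpret("Hello everyone, my name is Bryan"))
    # {'salutation': 'Hello', 'audience': 'everyone', 'facts': {'name': 'Bryan'}}
```

The returned dictionary plays the role of the name="Bryan" assignment mentioned above: the sentence has been reduced to a fact that other code can query.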
Tools
- Semantic Assistants: http://www.semanticsoftware.info/semantic-assistants-project
- List of 20+ Sentiment Analysis APIs: http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis
- List of 25+ Natural Language Processing APIs: http://blog.mashape.com/post/48946187179/list-of-25-natural-language-processing-apis
JAVA
- General Architecture for Text Engineering (GATE): http://gate.ac.uk/
- Apache - Lucene/Solr: http://lucene.apache.org/core/ (Lucene provides full-text indexing/searching/ranking/sorting while Solr provides an API and snowballing/clustering/tokenization/entity extraction)
- Apache - OpenNLP: http://opennlp.apache.org/
- OpenGALEN: http://www.opengalen.org/
- LingPipe: http://alias-i.com/lingpipe/
- The Stanford NLP Parser -- A statistical parser: http://nlp.stanford.edu/software/lex-parser.shtml | DEMO
- Stanford Log-linear Part-Of-Speech Tagger: http://nlp.stanford.edu/software/tagger.shtml
- Maui - Multi-purpose automatic topic indexing: http://code.google.com/p/maui-indexer/
Python
- Natural Language Toolkit (python lib and open source project for NLP): http://www.nltk.org/[6]
PHP
- NLP Tools: http://nlptools.atrilla.net/
- Bayesian Opinion Mining: http://phpir.com/bayesian-opinion-mining
- Apache Solr PHP client: http://code.google.com/p/solr-php-client/ (implements the Solr REST API of a local or remote instance)
- Natural Language Processing in PHP: http://nlp.sourceforge.net/ (alpha, release forthcoming)
Proprietary/API
- OpenCalais
- Zemanta: http://zemanta.com
- DBpedia Spotlight (extract Wikipedia entities/topics/categories): http://wiki.dbpedia.org/spotlight
- IBM Watson - Tone Analyzer: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/tone-analyzer.html[7]
- Apture: http://apture.com (acquired/retired by Google)
- Extractiv: http://extractiv.com/
- Alchemy API: http://www.alchemyapi.com/
- uClassify: http://uclassify.com/
- DocumentCloud: http://www.documentcloud.org
- Complexity Intelligence - Natural Language Processing API: http://www.complexityintelligence.com/en/homepage
- SmartyStreets -- LiveAddress API - Extract addresses from text: http://smartystreets.com/products/liveaddress-api/extract
START
START, the world's first Web-based question answering system, has been on-line and continuously operating since December 1993. It was developed by Boris Katz and his associates of the InfoLab Group at the MIT Computer Science and Artificial Intelligence Laboratory. Unlike information retrieval systems (e.g., search engines), START aims to supply users with "just the right information" instead of merely providing a list of hits. Currently, the system can answer millions of English questions about places (e.g., cities, countries, lakes, coordinates, weather, maps, demographics, political and economic systems), movies (e.g., titles, actors, directors), people (e.g., birth dates, biographies), dictionary definitions, and more.
- START: http://start.csail.mit.edu/
- START - project info: http://start.csail.mit.edu/start-system.html
- Entity Extraction & Content API Evaluation: http://blog.viewchange.org/2010/05/entity-extraction-content-api-evaluation/
Resources
- WordNet: http://wordnet.princeton.edu/
- UMBC webbase corpus: http://ebiquity.umbc.edu/resource/html/id/351 (3 billion+ English words)[8]
- Reggie - The Metadata Editor: http://metadata.net/dstc/
- University of Illinois - Cognitive Computation Group -- NLP Demos: http://cogcomp.cs.illinois.edu/page/demos
- LingPipe - toolkit for processing text using computational linguistics (DEMOS): http://alias-i.com/lingpipe/web/demos.html
- Snowball - word stemming library: http://snowball.tartarus.org/
Tutorials
- Normalizing free form text input: http://www.pratham.name/normalizing-free-text-input.html
- Word frequency algorithm for natural language processing: http://stackoverflow.com/questions/90580/word-frequency-algorithm-for-natural-language-processing
- Python NLTK/Neo4j -- Analysing the Transcripts of How I Met Your Mother: http://java.dzone.com/articles/python-nltkneo4j-analysing
- Apache OpenNLP - Hello, world! quickstart tutorial: http://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27848011
- Apache Ignite Word Count Streaming Example: http://java.dzone.com/articles/apache-ignite-word-count
External Links
- wikipedia: Natural language processing
- wikipedia: Part-of-speech tagging
- wikipedia: Name resolution
- wikipedia: Named entity recognition
- wikipedia: Stop words
- wikipedia: Poison words
- wikipedia: Function word
- wikipedia: Stemming
- wikipedia: Shallow parsing
References
- ↑ The amazing power of word vectors: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
- ↑ Word2Vec Tutorial - The Skip-Gram Model: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- ↑ Demystifying Word2Vec: https://www.deeplearningweekly.com/blog/demystifying-word2vec
- ↑ An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- ↑ models.word2vec – Deep learning with word2vec: https://radimrehurek.com/gensim/models/word2vec.html (the first major "word2vec" Python lib, based on the original C library)
- ↑ Natural-language/JJ Parsing/VBG For/IN The/DT Web/NN: http://nlp.naturalparsing.com/browserparser/parse
- ↑ Et tu, Watson? IBM's supercomputer can critique your writing: http://www.engadget.com/2015/07/17/ibm-watson-tone-analyzer-writing/
- ↑ UMBC WebBase corpus of 3B English words: http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
See Also
Semantic Web | AI | Text | WordNet | Translation