Machine Learning and Natural Language Processing [closed]_问答_开发者

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

开发者_开发问答

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 8 years ago.

Improve this question

Assume you know a student who wants to study Machine Learning and Natural Language Processing.

What specific computer science subjects should they focus on and which programming languages are specifically designed to solve these types of problems?

I am not looking for your favorite subjects and tools, but rather industry standards.

Example: I'm guessing that knowing Prolog and Matlab might help them. They also might want to study Discrete Structures*, Calculus, and Statistics.

*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

This related stackoverflow question has some nice answers: What are good starting points for someone interested in natural language processing?

This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific langauges: Lisp was created "as an afterthought" for doing AI research, while Prolog (with it's roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell) as they are very suited to this kind of analysis.

Here are some more specific pointers:

For Machine Learning, Stanford CS 229: Machine Learning is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.

Note the prerequisites:

Students are expected to have the following background: Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program. Familiarity with the basic probability theory. Familiarity with the basic linear algebra.

The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):

Christopher Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
Richard Duda, Peter Hart and David Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
Richard Sutton and Andrew Barto, Reinforcement Learning: An introduction. MIT Press, 1998

For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course Stanford CS 224: Natural Language Processing includes all the lectures online and has the following prerequisites:

Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.

Some recommended texts are:

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Second Edition. Prentice Hall.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
James Allen. 1995. Natural Language Understanding. Benjamin/Cummings, 2ed.
Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in Prolog. Addison-Wesley. (this is available online for free)
Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same text books. The required articificial intelligence course is also available online along with all the lecture notes and uses:

S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Second Edition

This is the standard Artificial Intelligence text and is also worth reading.

I use R for machine learning myself and really recommend it. For this, I would suggest looking at The Elements of Statistical Learning, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.

My recommendation would be either or all (depending on his amount and area of interest) of these:

The Oxford Handbook of Computational Linguistics:

Machine Learning and Natural Language Processing [closed]

_{(source: oup.com)}

Foundations of Statistical Natural Language Processing:

Machine Learning and Natural Language Processing [closed]

Introduction to Information Retrieval:

Machine Learning and Natural Language Processing [closed]

String algorithms, including suffix trees. Calculus and linear algebra. Varying varieties of statistics. Artificial intelligence optimization algorithms. Data clustering techniques... and a million other things. This is a very active field right now, depending on what you intend to do.

It doesn't really matter what language you choose to operate in. Python, for instance has the NLTK, which is a pretty nice free package for tinkering with computational linguistics.

I would say probabily & statistics is the most important prerequisite. Especially Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are very important both in machine learning and natural language processing (of course these subjects may be part of the course if it is introductory).

Then, I would say basic CS knowledge is also helpful, for example Algorithms, Formal Languages and basic Complexity theory.

Stanford CS 224: Natural Language Processing course that was mentioned already includes also videos online (in addition to other course materials). The videos aren't linked to on the course website, so many people may not notice them.

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(

Also, if you're a decent programmer it's never too early to toy around with NLP programs. NLTK comes to mind (Python). It has a book you can read free online that was published (by OReilly I think).

How about Markdown and an Introduction to Parsing Expression Grammars (PEG) posted by cletus on his site cforcoding?

ANTLR seems like a good place to start for natural language processing. I'm no expert though.

Broad question, but I certainly think that a knowledge of finite state automata and hidden Markov models would be useful. That requires knowledge of statistical learning, Bayesian parameter estimation, and entropy.

Latent semantic indexing is a commonly yet recently used tool in many machine learning problems. Some of the methods are rather easy to understand. There are a bunch of potential basic projects.

Find co-occurrences in text corpora for document/paragraph/sentence clustering.
Classify the mood of a text corpus.
Automatically annotate or summarize a document.
Find relationships among separate documents to automatically generate a "graph" among the documents.

EDIT: Nonnegative matrix factorization (NMF) is a tool that has grown considerably in popularity due to its simplicity and effectiveness. It's easy to understand. I currently research the use of NMF for music information retrieval; NMF has shown to be useful for latent semantic indexing of text corpora, as well. Here is one paper. PDF

Prolog will only help them academically it is also limited for logic constraints and semantic NLP based work. Prolog is not yet an industry friendly language so not yet practical in real-world. And, matlab also is an academic based tool unless they are doing a lot of scientific or quants based work they wouldn't really have much need for it. To start of they might want to pick up the 'Norvig' book and enter the world of AI get a grounding in all the areas. Understand some basic probability, statistics, databases, os, datastructures, and most likely an understanding and experience with a programming language. They need to be able to prove to themselves why AI techniques work and where they don't. Then look to specific areas like machine learning and NLP in further detail. In fact, the norvig book sources references after every chapter so they already have a lot of further reading available. There are a lot of reference material available for them over internet, books, journal papers for guidance. Don't just read the book try to build tools in a programming language then extrapolate 'meaningful' results. Did the learning algorithm actually learn as expected, if it didn't why was this the case, how could it be fixed.