Can someone explain what the difference is between classification and clustering in data mining?
If you can, please give examples of both to understand the main idea.
In general, in classification you have a set of predefined classes and want to know which class a new object belongs to.
Clustering tries to group a set of objects and find whether there is some relationship between the objects.
In the context of machine learning, classification is supervised learning and clustering is unsupervised learning.
Also have a look at Classification and Clustering at Wikipedia.
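To make the contrast concrete, here is a minimal sketch in plain Python. The numbers, the class names "small" and "large", and the helper names are all made up for illustration. The classifier is a simple nearest-centroid rule and the clustering is a simple gap-based grouping; both are just stand-ins for the many real algorithms.

```python
from statistics import mean

# Classification: the classes are predefined and every training example is labelled.
labelled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
            (5.0, "large"), (5.5, "large"), (4.7, "large")]

def train_centroids(examples):
    """Return the mean value of each predefined class."""
    by_class = {}
    for value, label in examples:
        by_class.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_class.items()}

def classify(value, centroids):
    """Assign a new object to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Clustering: no labels at all; group values that lie close to each other.
def cluster_by_gap(values, max_gap):
    """Start a new group whenever the gap to the previous sorted value exceeds max_gap."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups

centroids = train_centroids(labelled)
predicted = classify(4.9, centroids)            # uses the predefined classes
groups = cluster_by_gap([5.5, 1.0, 0.8, 4.7, 1.2, 5.0], max_gap=1.0)  # finds groups itself
print(predicted, groups)
```

Note that the classifier answers with one of the class names it was given, while the clustering only returns unnamed groups that you have to interpret yourself.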
Please read the following information:
If you ask anyone in data mining or machine learning, they will use the terms supervised learning and unsupervised learning to explain the difference between clustering and classification. So let me first explain the key words supervised and unsupervised.
Supervised learning: suppose you have a basket filled with fresh fruit, and your task is to put fruit of the same type together. Suppose the fruits are apples, bananas, cherries, and grapes. From your previous work you already know the shape of each fruit, so it is easy to put fruit of the same type together. Here your previous work is what data mining calls training data. You have already learned from your training data, because it has a response variable that tells you: if a fruit has such-and-such features, it is a grape, and likewise for every other fruit.
You get this kind of information from the training data. This type of learning is called supervised learning, and this type of problem comes under classification. Because you have already learned what you need, you can do your job confidently.
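As a rough sketch of the supervised case: the training data below pairs features with the known fruit name (the response variable). The particular features (color and size) and the helper `fit` are assumptions chosen for illustration, and the "model" is nothing more than a majority vote per feature combination.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (color, size) features plus the known fruit name.
training = [
    (("red", "big"), "apple"),
    (("red", "small"), "cherry"),
    (("green", "big"), "banana"),
    (("green", "small"), "grape"),
    (("red", "big"), "apple"),
]

def fit(examples):
    """Learn, for each feature combination, the label seen most often."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    return {features: counts.most_common(1)[0][0]
            for features, counts in votes.items()}

model = fit(training)

# A new, unlabelled fruit is classified using what was learned.
new_fruit = ("green", "small")
label = model.get(new_fruit, "unknown")
print(label)
```

The point is only that the labels exist before you start; the learner's job is to reproduce them for new fruit.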
Unsupervised learning: suppose you have a basket filled with fresh fruit, and your task is to put fruit of the same type together.
This time you know nothing about the fruits; you are seeing them for the first time. So how will you arrange fruit of the same type?
First you pick up a fruit and choose some physical characteristic of it. Suppose you choose color.
Then you arrange the fruits by color, and the groups will look something like this. RED COLOR GROUP: apples and cherries. GREEN COLOR GROUP: bananas and grapes. Next you take another physical characteristic, size, and the groups become: RED COLOR AND BIG SIZE: apples. RED COLOR AND SMALL SIZE: cherries. GREEN COLOR AND BIG SIZE: bananas. GREEN COLOR AND SMALL SIZE: grapes. Job done, happy ending.
Here you did not learn anything beforehand, which means no training data and no response variable. This type of learning is known as unsupervised learning. Clustering comes under unsupervised learning.
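The grouping steps above can be sketched like this. There are no fruit names anywhere in the data, only observed features; the ids, the feature values, and the helper `group_by` are made up for illustration. Real clustering algorithms work with a similarity measure rather than exact matches, so treat this as the simplest possible stand-in.

```python
from collections import defaultdict

# Unlabelled fruits: we only observe features, not what the fruit is called.
fruits = [
    {"id": 1, "color": "red", "size": "big"},
    {"id": 2, "color": "red", "size": "small"},
    {"id": 3, "color": "green", "size": "big"},
    {"id": 4, "color": "green", "size": "small"},
    {"id": 5, "color": "red", "size": "small"},
]

def group_by(items, keys):
    """Put items with identical values for the chosen features into one group."""
    groups = defaultdict(list)
    for item in items:
        groups[tuple(item[key] for key in keys)].append(item["id"])
    return dict(groups)

by_color = group_by(fruits, ["color"])                    # first pass: color only
by_color_and_size = group_by(fruits, ["color", "size"])   # second pass: add size
print(by_color)
print(by_color_and_size)
```

Which features you pick decides which groups you get, and nothing tells you that a group "is" cherries; that interpretation is left to you.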
+Classification: you are given some new data and have to assign a label to it.
For example, a company wants to classify its prospective customers. When a new customer arrives, the company has to determine whether this customer is going to buy its products or not.
+Clustering: you are given a set of historical transactions that record who bought what.
By using clustering techniques, you can work out the segmentation of your customers.
I am sure a number of you have heard about machine learning. A dozen of you might even know what it is. And a couple of you might have worked with machine learning algorithms too. You see where this is going? Not a lot of people are familiar with the technology that will be absolutely essential 5 years from now. Siri is machine learning. Amazon’s Alexa is machine learning. Ad and shopping item recommender systems are machine learning. Let’s try to understand machine learning with a simple analogy of a 2-year-old boy. Just for fun, let’s call him Kylo Ren.
Let’s assume Kylo Ren saw an elephant. What will his brain tell him? (Remember, he has minimal thinking capacity, even if he is the successor to Vader.) His brain will tell him that he saw a big moving creature which was grey in color. He sees a cat next, and his brain tells him that it is a small moving creature which is golden in color. Finally, he sees a light saber and his brain tells him that it is a non-living object which he can play with!
His brain at this point knows that the saber is different from the elephant and the cat, because the saber is something to play with and doesn’t move on its own. His brain can figure this much out even if Kylo doesn’t know what movable means. This simple phenomenon is called Clustering.
Machine learning is nothing but the mathematical version of this process. A lot of people who study statistics realized that they can make some equations work in the same way the brain works. The brain can cluster similar objects, learn from mistakes, and learn to identify things.
All of this can be represented with statistics, and the computer-based simulation of this process is called Machine Learning. Why do we need the computer-based simulation? Because computers can do heavy math faster than human brains. I would love to go into the mathematical/statistical part of machine learning, but you don’t want to jump into that without clearing up some concepts first.
Let’s get back to Kylo Ren. Let’s say Kylo picks up the saber and starts playing with it. He accidentally hits a stormtrooper and the stormtrooper gets injured. He doesn’t understand what’s going on and continues playing. Next he hits a cat and the cat gets injured. This time Kylo is sure he has done something bad, and tries to be somewhat careful. But given his bad saber skills, he hits the elephant and is absolutely sure that he is in trouble. He becomes extremely careful thereafter, and only hits his dad on purpose as we saw in Force Awakens!!
This entire process of learning from your mistakes can be mimicked with equations, where the feeling of doing something wrong is represented by an error or cost. This process of identifying what not to do with a saber is called Classification. Clustering and Classification are the absolute basics of machine learning. Let’s look at the difference between them.
Kylo differentiated between animals and the light saber because his brain decided that light sabers can’t move by themselves and are therefore different. The decision was based solely upon the objects present (data), and no external help or advice was provided. In contrast, Kylo learned the importance of being careful with the light saber by first observing what hitting an object can do. The decision wasn’t based only on the saber, but on what it could do to different objects. In short, there was some help here.
Because of this difference in learning, Clustering is called an unsupervised learning method and Classification is called a supervised learning method. They are very different in the machine learning world, and the choice between them is often dictated by the kind of data present. Obtaining labelled data (or things that help us learn, like the stormtrooper, elephant and cat in Kylo’s case) is often not easy and becomes very complicated when the data to be differentiated is large. On the other hand, learning without labels has its own disadvantages, like not knowing what the label titles are. If Kylo were to learn to be careful with the saber without any examples or help, he wouldn’t know what it would do. He would just know that it is not supposed to be done. It’s kind of a lame analogy but you get the point!
We are just getting started with Machine Learning. Supervised prediction itself can be prediction of continuous numbers or prediction of labels. For instance, if Kylo had to predict each stormtrooper’s height, there would be a lot of possible answers because the heights can be 5.0, 5.01, 5.011, etc. But a simple classification like types of light sabers (red, blue, green) would have very limited answers. In fact, they can be represented with simple numbers. Red can be 0, Blue can be 1 and Green can be 2.
If you know basic math, you know that 0, 1, 2 and 5.0, 5.01, 5.011 are different and are called discrete and continuous numbers respectively. Predicting discrete values is called Classification (Logistic Regression, despite its name, is one common method for it), and predicting continuous values is called Regression. Classification of discrete labels is also described as categorical classification, so don’t be confused when you read this term elsewhere.
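Here is a small sketch of the discrete versus continuous distinction. The label codes come from the saber example; everything else (the "boot size" and "hilt length" features, the numbers, and the nearest-neighbour helpers) is invented purely for illustration. The same neighbour lookup is used twice: averaging the neighbours' values gives a continuous prediction, and taking a majority vote gives a discrete one.

```python
from collections import Counter
from statistics import mean

# Discrete labels can be encoded as small integers, as in the saber example.
codes = {"red": 0, "blue": 1, "green": 2}
encoded = [codes[color] for color in ["red", "blue", "green", "red"]]

# Hypothetical data: one numeric feature per example plus a target.
heights = [(40, 5.0), (41, 5.01), (45, 5.5)]                  # (boot size, height)
sabers = [(1, "red"), (2, "red"), (9, "blue"), (10, "blue")]  # (hilt length, color)

def k_nearest(examples, x, k):
    """Return the k examples whose feature value is closest to x."""
    return sorted(examples, key=lambda example: abs(example[0] - x))[:k]

def predict_continuous(examples, x, k=2):
    """Continuous target: average the targets of the nearest examples."""
    return mean(target for _, target in k_nearest(examples, x, k))

def predict_label(examples, x, k=3):
    """Discrete target: majority vote among the nearest examples."""
    votes = Counter(target for _, target in k_nearest(examples, x, k))
    return votes.most_common(1)[0][0]

predicted_height = predict_continuous(heights, 40.4)  # any real number is possible
predicted_color = predict_label(sabers, 8)            # only known labels are possible
print(encoded, predicted_height, predicted_color)
```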
This was a very basic introduction to Machine Learning. I will delve into the statistical side in my next post. Please let me know if I need any corrections :)
Second part posted here.
I'm a newcomer to Data Mining, but as my textbook says, CLASSIFICATION is supposed to be supervised learning, and CLUSTERING unsupervised learning. The difference between supervised learning and unsupervised learning can be found here.
Classification
Is the assignment of predefined classes to new observations, based on learning from examples.
It is one of the key tasks in machine learning.
Clustering (or Cluster Analysis)
While popularly dismissed as "unsupervised classification" it is quite different.
In contrast to what many machine learners will teach you, it is not about assigning "classes" to objects without having them predefined. This is the very limited view of people who did too much classification; a typical example of "if you have a hammer (classifier), everything looks like a nail (classification problem) to you". But it is also why classification people do not get the hang of clustering.
Instead, consider it as structure discovery. The task of clustering is to find structure (e.g. groups) in your data that you did not know before. Clustering has been successful if you learned something new. It failed, if you only got the structure you already knew.
Cluster analysis is a key task of data mining (and the ugly duckling in machine-learning, so don't listen to machine learners dismissing clustering).
"Unsupervised learning" is somewhat an Oxymoron
This has been iterated up and down the literature, but unsupervised learning is bllsht. It does not exist; it is an oxymoron like "military intelligence".
Either the algorithm learns from examples (then it is "supervised learning"), or it does not learn. If all the clustering methods are "learning", then computing the minimum, maximum and average of a data set is "unsupervised learning", too. Then any computation "learned" its output. Thus the term 'unsupervised learning' is totally meaningless, it means everything and nothing.
Some "unsupervised learning" algorithms do, however, fall into the optimization category. For example, k-means is a least-squares optimization. Such methods are all over statistics, so I don't think we need to label them "unsupervised learning"; we should instead continue to call them "optimization problems". It's more precise, and more meaningful. There are plenty of clustering algorithms that do not involve optimization and that do not fit into machine-learning paradigms well. So stop squeezing them in there under the umbrella "unsupervised learning".
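To make the k-means remark concrete, here is a minimal one-dimensional sketch of the usual alternating procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points). The data, the starting centroids, and the function names are assumptions for illustration. The quantity being minimised is the sum of squared distances from each point to its centroid, and the sketch records it after every iteration.

```python
def sum_of_squares(points, centroids, assignment):
    """Least-squares objective: squared distance of each point to its centroid."""
    return sum((p - centroids[a]) ** 2 for p, a in zip(points, assignment))

def kmeans_1d(points, initial_centroids, iterations=5):
    centroids = list(initial_centroids)
    history = []
    for _ in range(iterations):
        # Assignment step: each point goes to the nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = sum(members) / len(members)
        history.append(sum_of_squares(points, centroids, assignment))
    return centroids, assignment, history

points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, assignment, history = kmeans_1d(points, initial_centroids=[1.0, 2.0])
print(centroids, assignment, history)
```

No labels and no "examples to learn from" appear anywhere; the procedure only drives the objective down, which is why it can reasonably be described as an optimization problem.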
There is some "learning" associated with clustering, but it is not the program that learns. It is the user that is supposed to learn new things about his data set.
With clustering, you can group data according to properties you choose, such as the number, shape, and other properties of the extracted clusters, whereas in classification the number and shape of the groups are fixed. Most clustering algorithms take the number of clusters as a parameter. However, there are some approaches for finding an appropriate number of clusters.
First of all, like many answers state here: classification is supervised learning and clustering is unsupervised. This means:
Classification needs labeled data so that the classifier can be trained on it and then start classifying new, unseen data based on what it has learned. Unsupervised learning such as clustering does not use labeled data; what it actually does is discover intrinsic structures in the data, like groups.
Another difference between the two techniques (related to the previous one) is that classification is a prediction problem with a discrete output: the output is a categorical dependent variable. Clustering, in contrast, yields a set of subsets called groups. The way to evaluate these two kinds of model is also different for the same reason: in classification you often have to check precision and recall, overfitting and underfitting, etc. Those things tell you how good the model is. But in clustering you usually need the eye of an expert to interpret what you find, because you don't know what type of structure you have (type of group or cluster). That's why clustering belongs to exploratory data analysis.
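For the evaluation point, here is a small sketch of how precision and recall are computed for one class. The labels and predictions are made up, and the function name is just for illustration. Precision is the fraction of predicted positives that are truly positive; recall is the fraction of true positives that were found.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and classifier outputs.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

precision, recall = precision_recall(y_true, y_pred, positive="spam")
print(precision, recall)
```

This only works because the true labels are known. With clustering there is usually no such ground truth to compare against, which is the difference the paragraph above is pointing at.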
Finally, I would say that applications are the main difference between the two. Classification, as the word says, is used to discriminate between instances that belong to one class or another, for example a man or a woman, a cat or a dog, etc. Clustering is often used in the diagnosis of medical illness, discovery of patterns, etc.
Classification: Predict results in a discrete output => map input variables into discrete categories
Popular use cases:
Email classification : Spam or non-Spam
Sanction loan to customer : Yes if the customer is capable of paying the EMI for the sanctioned loan amount, No otherwise
Cancer tumour cells identification : Is it critical or non-critical?
Sentiment analysis of tweets : Is the tweet positive or negative or neutral
Classification of news : Classify the news into one of predefined classes - Politics, Sports, Health etc
Clustering: is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)
Popular use cases:
Marketing : Discover customer segments for marketing purposes
Biology : Classification among different species of plants and animals
Libraries : Clustering different books on the basis of topics and information
Insurance : Identify groups of customers and their policies, and detect fraud
City Planning : Group houses and study their values based on geographical location and other factors
Earthquake studies : Identify dangerous zones
Recommendation system :
References:
geeksforgeeks
dataaspirant
3leafnodes
Classification
– Predicts categorical class labels
– Classifies data (constructs a model) based on a training set and the values (class labels) in a class label attribute
– Uses the model in classifying new data
Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
Clustering aims at finding groups in data. “Cluster” is an intuitive concept and does not have a mathematically rigorous definition. The members of one cluster should be similar to one another and dissimilar to the members of other clusters. A clustering algorithm operates on an unlabeled data set Z and produces a partition on it.
For classes and class labels: a class contains similar objects, whereas objects from different classes are dissimilar. Some classes have a clear-cut meaning, and in the simplest case are mutually exclusive. For example, in signature verification, the signature is either genuine or forged. The true class is one of the two, even though we might not be able to guess it correctly from observing a particular signature.
Clustering is a method of grouping objects in such a way that objects with similar features come together, and objects with dissimilar features go apart. It is a common technique for statistical data analysis used in machine learning and data mining.
Classification is a process of categorization where objects are recognized, differentiated and understood on the basis of the training set of data. Classification is a supervised learning technique where a training set and correctly defined observations are available.
From the book Mahout in Action, which I think explains the difference very well:
Classification algorithms are related to, but still quite different from, clustering algorithms such as the k-means algorithm.
Classification algorithms are a form of supervised learning, as opposed to unsupervised learning, which happens with clustering algorithms.
A supervised learning algorithm is one that’s given examples that contain the desired value of a target variable. Unsupervised algorithms aren’t given the desired answer, but instead must find something plausible on their own.
One liner for Classification:
Classifying data into pre-defined categories
One liner for Clustering:
Grouping data into a set of categories
Key difference:
Classification takes data and puts it into pre-defined categories, whereas in Clustering the set of categories that you want to group the data into is not known beforehand.
Conclusion:
- Classification assigns a category to one new item based on already labeled items, while Clustering takes a bunch of unlabeled items and divides them into categories
- In Classification, the categories/groups are known beforehand, while in Clustering they are unknown beforehand
- In Classification there are two phases, a training phase and then a test phase, while in Clustering there is only one phase: dividing the data into clusters
- Classification is Supervised Learning while Clustering is Unsupervised Learning
I have written a long post on the same topic which you can find here:
https://neelbhatt40.wordpress.com/2017/11/21/classification-and-clustering-machine-learning-interview-questions-answers-part-i/
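The two phases of classification mentioned in the conclusion can be sketched as follows. The data, the labels "a" and "b", and the split are invented for illustration, and the classifier is a plain 1-nearest-neighbour rule chosen only because it is short.

```python
def nearest_label(train, x):
    """Predict the label of the training example closest to x."""
    return min(train, key=lambda example: abs(example[0] - x))[1]

# Hypothetical labelled data: (feature value, label).
data = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.2, "a"),
        (5.0, "b"), (5.2, "b"), (4.8, "b"), (5.1, "b")]

# Phase 1, training: keep part of the labelled data as the model's knowledge.
train = data[::2]
# Phase 2, testing: check predictions on labelled data the model has not seen.
test = data[1::2]

correct = sum(1 for x, y in test if nearest_label(train, x) == y)
accuracy = correct / len(test)
print(accuracy)
```

Clustering has no counterpart to the second phase here, because there are no held-out labels to score the groups against.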
If you are filing a large number of sheets onto your shelf (based on date or some other specification of the file), you are CLASSIFYING.
If you were to create clusters from the set of sheets, it would mean that there is something similar among the sheets in each cluster.
There are two terms in data mining: "Supervised" and "Unsupervised". When someone tells the computer, algorithm, code, ... that this thing is an apple and that thing is an orange, this is supervised learning, and when you use supervised learning (tags for each sample in a data set) to classify the data, you get classification. On the other hand, if you let the computer find out what is what and differentiate between the features of the given data set, that is, learn unsupervised, then grouping the data set this way is called clustering. In this case the data fed to the algorithm have no tags, and the algorithm has to find the different classes itself.
Machine Learning or AI is largely perceived by the task it performs/achieves.
In my opinion, thinking about Clustering and Classification in terms of the task they achieve can really help to understand the difference between the two.
Clustering is to group things and Classification is to, kind of, label things.
Let's assume you are in a party hall where all men are in suits and women are in gowns.
Now, you ask your friend a few questions:
Q1: Heyy, can you help me group people?
Possible answers that your friend can give are:
1: He can group people based on Gender, Male or Female
2: He can group people based on their clothes, 1 wearing suits other wearing gowns
3: He can group people based on the color of their hair
4: He can group people based on their age group, etc. etc. etc.
There are numerous ways your friend can complete this task.
Of course, you can influence his decision-making process by providing extra inputs like:
Can you help me group these people based on gender (or age group, or hair color, or dress, etc.)?
Q2:
Before Q2, you need to do some pre-work.
You have to teach or inform your friend so that he can make an informed decision. So, let's say you told your friend that:
People with long hair are Women.
People with short hair are Men.
Q2. Now, you point to a person with long hair and ask your friend: is it a Man or a Woman?
The only answer that you can expect is: Woman.
Of course, there can be men with long hair and women with short hair at the party. But the answer is correct based on the learning you provided to your friend. You can further improve the process by teaching your friend more about how to differentiate between the two.
In the above example,
Q1 represents the task that Clustering achieves.
In Clustering you provide the data (people) to the algorithm (your friend) and ask it to group the data.
Now it's up to the algorithm to decide what the best way to group is (gender, color or age group).
Again, you can definitely influence the decision made by the algorithm by providing extra inputs.
Q2 represents the task that Classification achieves.
There, you give your algorithm (your friend) some data (people), called training data, and make it learn which data corresponds to which label (Male or Female). Then you point your algorithm to certain data, called test data, and ask it to determine whether it is Male or Female. The better your teaching is, the better its prediction.
And the pre-work in Q2, or Classification, is nothing but training your model so that it can learn how to differentiate. In Clustering, or Q1, this pre-work is part of the grouping itself.
Hope this helps someone.
Thanks
Classification - A data set can have different groups/classes: red, green and black. Classification will try to find rules that divide the data into those classes.
Clustering - If a data set does not have any classes and you want to put its points into some class/grouping, you do clustering. The purple circles above.
If the classification rules are not good, you will have misclassification in testing, i.e. your rules are not correct enough.
If the clustering is not good, you will have a lot of outliers, i.e. data points that do not fall into any cluster.
The key differences between Classification and Clustering are: Classification is the process of classifying data with the help of class labels. Clustering, on the other hand, is similar to classification, but there are no predefined class labels. Classification is associated with supervised learning, whereas clustering is also known as unsupervised learning. A training sample is provided in classification, while in clustering no training data is provided.
Hope this will help!
I believe classification is assigning the records in a data set to predefined classes. It falls under supervised learning, i.e. one knows what parameters to look for and the correlation between them, along with critical levels. I believe it requires some understanding of statistics and maths.
Clustering, on the other hand, falls under unsupervised learning, i.e. one does not know what he/she is looking for while mining the data, and the groups are defined on the go. I look at it as a good starting point for any valuable data mining.