I am new to clustering and need some advice on how to approach this problem...
Let's say I have thousands of sentences, but a few from the sample could be:
There are two related (but technique-wise distinct) questions here; the first relates to the choice of clustering technique for this data.
The second question, on which the first is predicated, relates to the data model: for each sentence in the raw data, how to transform it into a data vector suitable for input to a clustering algorithm.
k-means is probably the most popular clustering technique, but there are many better ones; consider how k-means works: the user selects a small number of data points from among the data (the cluster centers for the initial iteration of the k-means algorithm, aka centroids). Next, the distance between each data point and the set of centroids is computed, and each data point is assigned to the centroid it is closest to; then new centroids are determined from the mean value of the data points assigned to the same cluster. These two steps are repeated until some convergence criterion is reached (e.g., between two consecutive iterations, the combined movement of the centroids falls below some threshold).
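The loop above can be sketched with scikit-learn's KMeans; the sentence vectors here are random placeholder data (an assumption) standing in for your real bag-of-words vectors:

```python
# Minimal k-means sketch using scikit-learn; X is placeholder data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 8))          # 100 "sentences", 8 features each

# n_clusters is the number of centroids; tol is the convergence threshold
# on centroid movement between consecutive iterations.
km = KMeans(n_clusters=3, n_init=10, tol=1e-4, random_state=0).fit(X)

labels = km.labels_               # cluster assignment for each data point
centroids = km.cluster_centers_   # one centroid per cluster
```

In practice you would replace `X` with the matrix produced by your data model (one row per sentence).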
The better clustering techniques do much more than just move the cluster centers around; for instance, spectral clustering techniques rotate and stretch/squeeze the data to find a single axis of maximum variance, then determine additional axes orthogonal to the first and to each other, i.e., a transformed feature space. PCA (principal component analysis), LDA (linear discriminant analysis), and kPCA (kernel PCA) are all members of this class, the defining characteristic of which is the calculation of eigenvalue/eigenvector pairs for the features in the original data or for the covariance matrix. Scikit-learn has a module for PCA computation.
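Here's a small sketch of that scikit-learn PCA module; again, the data is random placeholder data (an assumption) standing in for sentence vectors:

```python
# PCA sketch: project data onto its axes of maximum variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((50, 10))          # placeholder "sentence" vectors

# Keep the 2 principal components (axes of maximum variance); the
# components are mutually orthogonal, as described above.
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)      # data re-expressed in the new feature space
```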
As you have observed, the common dilemma in constructing a data model from unstructured text is that including a feature for every word in the entire corpus (minus stop words) often results in very high sparsity over the dataset (i.e., each sentence includes only a small fraction of the total words across all sentences, so each data vector is consequently sparse); on the other hand, if the corpus is trimmed so that, for instance, only the top 10% of words are used as features, then some or many of the sentences have completely unpopulated data vectors.
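You can see the sparsity directly with a simple bag-of-words model; the sentences below are invented examples (an assumption):

```python
# Demonstrate sparsity: each sentence uses only a few of the corpus's
# words, so its bag-of-words vector is mostly zeros.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "fluent in python required",
    "strong sql skills required",
    "capable java developer needed",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(sentences)        # sparse matrix: sentences x vocabulary

total_cells = X.shape[0] * X.shape[1]
sparsity = 1.0 - X.nnz / total_cells    # fraction of zero entries
```

Even on three short sentences most entries are zero; over thousands of sentences the effect is far more pronounced.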
Here's one common sequence of techniques to help solve this problem, which might be particularly effective given your data: combine related terms into a single term using the common processing sequence of normalizing, stemming, and synonymizing.
This is intuitive: e.g.,
Normalize: transform all words to lowercase (Python strings have a lower method, so this step is trivial).
Obviously, this prevents Required, REquired, and required from comprising three separate features in your data vector, and instead collapses them into a single term.
Stem: after stemming, required, require, and requiring are collapsed to a single token, requir.
Two of the most common stemmers are the Porter and Lancaster stemmers (NLTK, discussed below, has both).
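For example, with NLTK's two stemmers (no corpus downloads are needed for the stemmers themselves):

```python
# Porter and Lancaster stemmers from NLTK.
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["required", "require", "requiring"]
porter_stems = [porter.stem(w) for w in words]      # all collapse to "requir"
lancaster_stems = [lancaster.stem(w) for w in words]
```

The Lancaster stemmer is generally more aggressive than Porter, so its stems tend to be shorter.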
Synonymize: terms like fluent, capable, and skilled can, depending on context, all be collapsed to a single term by looking them up in a common synonym list.
The excellent Python NLP library NLTK has several excellent synonym compilations, or digital thesauri, to help you do all three of these steps programmatically.
For instance, nltk.corpus.reader.lin is one (just one; there are at least several more synonym-finders in the NLTK), and it's simple to use: just import the module and call synonym, passing in a term.
Multiple stemmers are in NLTK's stem package.