Monday, March 31, 2014

installing and using matplotlib on centos 6.4

yum install -y python-matplotlib
yum install pygtk2

http://stackoverflow.com/questions/13336823/matplotlib-python-error
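import matplotlib as mpl
# Select the non-interactive Agg backend before importing pyplot,
# so plotting works on a server without a display.
mpl.use('Agg')
import matplotlib.pyplot as plt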

Monday, March 17, 2014

installing scipy

yum install scipy

installing module sklearn python on centos 6.4

yum install gcc-c++
pip install -U scikit-learn


summary hilary mason machine learning intro - part 1/2/3/4

Code : https://github.com/hmason/ml_class
Google Prediction API : https://cloud.google.com/products/prediction-api/

Classification :
1. Using NYTimes Developer API
2. Naive Bayes algo
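
To make the Naive Bayes item concrete, here is a minimal sketch of a word-count classifier with add-one smoothing. The function names `train` and `classify` and the toy training sentences are invented for illustration; this is not the code from the class repository and it does not call the NYTimes API.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the product.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
training = [
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("voters elected a new senate", "politics"),
]
model = train(training)
print(classify("the senate vote", model))  # politics
print(classify("the game", model))         # sports
```

Working in log space avoids multiplying many small probabilities together, which would underflow on longer documents.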

Clustering :
1. Agglomerative
2. K-means
3. pycluster
4. cluster delicious bookmarks
5. Recommendation systems are examples of clustering.
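
As a rough illustration of the K-means item, the sketch below alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The `kmeans` function, the sample points, and the choice to seed centroids with the first k points are all assumptions made for this example; real implementations usually pick starting centroids at random.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points; returns (centroids, assignments)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignments[idx] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Two visually separate groups of made-up points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```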

summary hilary mason machine learning intro - part 5

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
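
A minimal sketch of the idea, assuming a fixed-size bit list and bit positions derived by salting SHA-256 (the `BloomFilter` class, its method names, and the sizes are illustrative choices, not from the lecture):

```python
import hashlib


class BloomFilter:
    """A small Bloom filter backed by a list of bits (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            data = "{}:{}".format(seed, item).encode("utf-8")
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "may be in the set".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
for word in ["apple", "banana", "cherry"]:
    bloom.add(word)

print(bloom.might_contain("apple"))   # True: added items are never reported missing
print(bloom.might_contain("durian"))  # usually False, but a false positive is possible
```

The filter never stores the items themselves, only the bits they set, which is where the memory saving comes from. It is also why an unrelated item whose positions all happen to be set is reported as possibly present.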

Suppose you have two sets, A and B, and you would like to know how similar they are. First you might ask, how big is their intersection?

|A \cap B|

That’s nice, but it isn’t comparable across sets of different sizes, so let’s normalize it by the size of the union of the two sets.

\frac{|A \cap B|}{|A \cup B|}

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
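
The ratio translates directly into Python sets. This is a small sketch; the function name `jaccard_index` is made up here, and returning 1.0 for two empty sets is an assumed convention, since the ratio is 0/0 in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty: the ratio is undefined, so this sketch
        # treats two empty sets as identical (an assumption).
        return 1.0
    return len(a & b) / len(union)


print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 0.5 (2 shared out of 4 total)
print(jaccard_index({1, 2}, {3, 4}))        # 0.0 (disjoint)
print(jaccard_index({1, 2}, {1, 2}))        # 1.0 (identical)
```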


SimHash
A hash function usually maps different values to completely different hash values, even when the inputs are nearly identical.
SimHash is a hash function where similar items are hashed to similar hash values.
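
One common way to build such a hash is to let every token vote on each bit of the fingerprint. Below is a minimal sketch under some assumptions: tokens are whitespace-separated words, each token is hashed with SHA-256 truncated to 64 bits, and all tokens carry equal weight. The names `simhash` and `hamming_distance` and the sample sentences are illustrative.

```python
import hashlib


def simhash(text, num_bits=64):
    """Return a num_bits-wide fingerprint of whitespace-separated tokens."""
    mask = (1 << num_bits) - 1
    counts = [0] * num_bits
    for token in text.lower().split():
        # Hash each token and keep the low num_bits bits.
        h = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) & mask
        for i in range(num_bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x, y):
    """Number of bit positions where two fingerprints differ."""
    return bin(x ^ y).count("1")


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
doc_c = "installing scipy and matplotlib on a headless server"

# Near-duplicate texts share most tokens, so most bit votes agree.
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
print(hamming_distance(simhash(doc_a), simhash(doc_c)))
```

Similarity is then judged by the Hamming distance between fingerprints: the fewer differing bits, the more tokens the two texts are likely to share.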
