#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
Created on Sep 16, 2010
kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)
            dataSet: size m data set of known vectors (NxM)
            labels: data set labels (1xM vector)
            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label

@author: pbharrin
'''
import operator
from os import listdir

from numpy import array, shape, tile, zeros

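def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest training samples.

    inX is the input vector to classify, dataSet is the training sample set,
    labels is the label vector, and k is the number of nearest neighbors.
    """
    dataSetSize = dataSet.shape[0]  # number of rows, e.g. 4 for a 4x2 array
    # Distance calculation.
    # tile repeats inX once per training row (e.g. [0, 0] becomes 4 rows)
    # so that it can be subtracted from dataSet element by element.
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2  # square each coordinate difference
    sqDistances = sqDiffMat.sum(axis=1)  # sum each row, i.e. a^2 + b^2
    distances = sqDistances ** 0.5  # square root, i.e. sqrt(a^2 + b^2)
    # Indices that sort the distances in ascending order, e.g. 2, 3, 1, 0
    # when classifying [0, 0] against the createDataSet() samples.
    sortedDistIndicies = distances.argsort()
    classCount = {}  # maps label -> number of votes
    # Select the k points with the smallest distances.
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count the vote
    # Sort the labels by vote count, highest first.
    # items() replaces the Python 2-only iteritems() and works on both versions.
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # label with the most votes
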
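# A minimal sketch of the same k-nearest-neighbors vote as classify0 above,
# offered as an alternative formulation and not as the author's method.
# Assumptions: NumPy broadcasting stands in for tile(), and collections.Counter
# stands in for the dict-and-sort tally. The name knn_classify_sketch is
# hypothetical and is not part of the original file.
import numpy as np
from collections import Counter


def knn_classify_sketch(in_x, data_set, labels, k):
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row without building a tiled copy.
    diffs = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the label with the most votes.
    return votes.most_common(1)[0][0]

# Usage: with the four toy samples [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
# labelled ['A', 'A', 'B', 'B'], knn_classify_sketch([0, 0], samples, labels, 3)
# picks 'B', since two of the three nearest samples carry that label.
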
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def file2matrix(filename):
    """Parse a tab-separated file into a feature matrix and a class label vector."""
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)  # number of lines in the file, e.g. 1000
    returnMat = zeros((numberOfLines, 3))  # one row of 3 features per line
    classLabelVector = []
    index = 0  # current row of the feature matrix
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')  # split the line into a list of fields
        # The first three fields are the features.
        returnMat[index, :] = [float(value) for value in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))  # the last field is the class label
        index += 1
    return returnMat, classLabelVector


def autoNorm(dataSet):
    """Normalize every feature to the range [0, 1]."""
    minVals = dataSet.min(0)  # per-column minimum
    maxVals = dataSet.max(0)  # per-column maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))  # difference from the minimum
    normDataSet = normDataSet / tile(ranges, (m, 1))  # divide the difference by the range
    return normDataSet, ranges, minVals


def datingClassTest():
    hoRatio = 0.10  # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify held-out row i against the remaining rows with k = 3.
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("classifier result: %d, actual result: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("overall error rate: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)


def classifyPerson():
    resultList = ['not at all', 'a little', 'very much']
    # input() assumes Python 3; on Python 2 use raw_input() instead.
    percentTats = float(input('Enter the percentage of time spent playing games: '))
    ffMiles = float(input('Enter the total flight mileage: '))
    iceCream = float(input('Enter the liters of ice cream: '))
    datingDateMat, datingLabels = file2matrix("datingTestSet2.txt")  # load the data
    normMat, ranges, minVals = autoNorm(datingDateMat)  # normalize
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("Degree of liking: %s" % resultList[classifierResult - 1])


def img2vector(filename):
    """Convert a 32x32 binary image text file into a 1x1024 vector."""
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

def handwritingClassTest():
    # Prepare the training data.
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))  # one row per training sample to compare against
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # file name without the extension
        classNumStr = int(fileStr.split('_')[0])  # digit encoded in the file name
        hwLabels.append(classNumStr)  # build the label vector from the file names
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    # Prepare the test data.
    testFileList = listdir('digits/testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    # Predict each test sample.
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # file name without the extension
        classNumStr = int(fileStr.split('_')[0])  # digit encoded in the file name
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)  # input vector to classify
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("classifier result: %d, actual result: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nnumber of prediction errors: %d" % errorCount)
    print("\nprediction error rate: %f" % (errorCount / float(mTest)))


if __name__ == '__main__':
    # group, labels = createDataSet()
    # classify0([0, 0], group, labels, 3)
    # datingDateMat, datingLabels = file2matrix("datingTestSet2.txt")
    # print(datingDateMat)
    # print(datingLabels)
    # import matplotlib
    # import matplotlib.pyplot as plt
    # fig = plt.figure()
    # ax = fig.add_subplot(111)  # subplot position
    # Scatter arguments: x and y coordinates, point size, and color.
    # ax.scatter(datingDateMat[:, 1], datingDateMat[:, 2],
    #            15.0 * array(datingLabels), 15.0 * array(datingLabels))
    # plt.show()
    #
    # normMat, ranges, minVals = autoNorm(datingDateMat)
    # print(normMat)
    # print(ranges)
    # print(minVals)

    # datingClassTest()
    # classifyPerson()
    # testVector = img2vector("digits/testDigits/0_0.txt")
    # print(testVector[0, 0:31])
    handwritingClassTest()