TensorFlow Simple Captcha Classification (Based on K-means Clustering)
I previously built an SVM-based captcha recognizer (https://blog.****.net/qq_42686550/article/details/81514233).
This time I use the same data to build an unsupervised-learning version and see how it does (the results are worse than the SVM's).
The clustering code is adapted from: https://blog.****.net/qq_40077103/article/details/82747283
Content:
1. Clustering
2. Getting the data
3. Clustering code
1. What is clustering
Put simply, clustering repeatedly computes Euclidean distances and iterates until the cluster centers stop moving. K-means is a fairly simple unsupervised learning method, and here we use Google's TensorFlow to implement it. First, the data source: https://pan.baidu.com/s/1sC5HFDMAvm9uGpjm115EeQ
Extraction code: s29m
The archive contains images of the 10 digits 0-9,
organized into 10 folders.
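To make the "iterate until the centers stop moving" idea concrete, here is a minimal plain-NumPy k-means sketch; the function and variable names are my own, not from any library:

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Plain-NumPy k-means sketch: assign each point to its nearest
    centroid, move each centroid to its cluster mean, and stop once
    the assignments no longer change."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # Squared Euclidean distance of every point to every centroid
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # assignments have stabilized, so the centers have too
        assignments = new_assignments
        for c in range(k):
            members = data[assignments == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assignments
```

On two well-separated blobs of points this converges in a couple of iterations; the TensorFlow version below is the same loop expressed as a computation graph.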
2. Getting the data
First, how do we turn a JPG image into numbers suitable for machine learning?
def get_feature(address, assume, dir, file, f, a):
    # 'assume' is the assumed (ground-truth) label, written at the start of the line
    f.write(assume)
    im = Image.open(address + dir + a + file)
    count = 0
    width, height = im.size
    for i in range(height):
        c = 0
        for j in range(width):
            if im.getpixel((j, i)) == 0: c += 1  # pixel value 0 means black
        f.write(' %d' % c)
        count += 1
    for i in range(width):
        c = 0
        for j in range(height):
            if im.getpixel((i, j)) == 0: c += 1
        f.write(' %d' % c)
        count += 1
    f.write('\n')
The idea is simple: each image is 9*16, and we count the black cells (the images have already been binarized) along every row and every column. For example, with width 9, scan from the first column to the ninth and record the number of black cells in each column; since 9 + 16 = 25, every image is assigned 25 features.
This gives us a txt file recording the features of 862 digit images.
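The counting scheme can be sketched in NumPy on a toy in-memory binarized image; the 16x9 array and the sample stroke below are illustrative assumptions, not the real data:

```python
import numpy as np

def projection_features(binary):
    """Count black pixels (value 0) in every row and every column.
    For a width x height image this yields height + width features."""
    black = (binary == 0)
    row_counts = black.sum(axis=1)   # one count per row (height values)
    col_counts = black.sum(axis=0)   # one count per column (width values)
    return np.concatenate([row_counts, col_counts])

# A toy 16x9 (height x width) binarized digit: 255 = white, 0 = black
img = np.full((16, 9), 255)
img[4:12, 4] = 0                     # a vertical stroke, roughly like a "1"
features = projection_features(img)
print(features.shape)                # (25,) -> 16 row counts + 9 column counts
```

This is the same 25-dimensional feature vector `get_feature` writes out, just computed on an array instead of a PIL image.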
Next we need to load this txt file into Python as an array; the logic is simple enough:
f = open('./train.txt')
line = f.readline()
data_list = []
while line:
    num = list(map(float, line.split()))
    data_list.append(num)
    line = f.readline()
f.close()
data_array = np.array(data_list)
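As a side note, since every line of train.txt is just whitespace-separated numbers (the assumed label followed by 25 counts), the whole loop can be replaced by a single np.loadtxt call; demonstrated here on an in-memory sample rather than the real file:

```python
import numpy as np
from io import StringIO

# Each line: a label followed by counts, all whitespace-separated,
# so np.loadtxt can replace the manual readline loop.
sample = StringIO("3 1 0 2\n7 4 5 6\n")
data_array = np.loadtxt(sample)
print(data_array.shape)   # (2, 4)
```

For the real file you would pass the path directly: `np.loadtxt('./train.txt')`.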
3. Clustering code
With the data loaded, we move on to the clustering code.
First we initialize and define a few quantities:
N = 862
K = 10
variables = 25
points = tf.Variable(data_array)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))  # cluster assignment of each sample
centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, variables]))  # initial cluster centers
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(centroids)
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, variables])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, variables])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            reduction_indices=2)  # squared Euclidean distances
best_centroids = tf.argmin(sum_squares, 1)  # index of the nearest cluster center for each sample
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))
Let me briefly explain: N is the number of samples, K is the number of clusters (the digits 0-9 give 10), and variables is the number of features.
The overall procedure is to initialize the cluster centers, reshape the data into the layout TensorFlow expects, and compute the Euclidean distances.
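The tile-and-reshape trick is just a way to build an N x K x variables difference array; in plain NumPy the same distance table falls out of a single broadcast, which is handy for checking the logic on toy shapes (the sizes below are illustrative, not the real 862 x 10 x 25):

```python
import numpy as np

N, K, V = 6, 2, 3
rng = np.random.default_rng(1)
points = rng.random((N, V))
centroids = points[:K]                 # same init as the TF code: the first K points

# Equivalent of rep_points - rep_centroids without any tiling:
# broadcasting (N,1,V) against (1,K,V) yields an (N,K,V) difference array.
diff = points[:, None, :] - centroids[None, :, :]
sum_squares = (diff ** 2).sum(axis=2)  # (N, K) squared Euclidean distances
best_centroids = sum_squares.argmin(axis=1)
print(best_centroids.shape)            # (6,)
```

Since the first K points serve as the initial centers, point 0 is at distance zero from center 0, so its first assignment is cluster 0.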
That is essentially all there is to it; what remains is to set an iteration limit and the stopping condition. For easier inspection we can plot the result with matplotlib.
The plot shows no obvious separation, which is expected: with 25 dimensions, a 2D plot cannot reveal much. So let's look at the assignments instead:
Clusters 0, 4, and 5 seem to be separated fairly well, while the rest are mediocre. That is understandable for a clustering problem, since no manual labeling of each class was done (the "0, 4, 5" here are cluster names, not necessarily the digits '0, 4, 5'). Feel free to experiment further. The complete code follows:
import tensorflow as tf
import numpy as np
import time
import matplotlib
import matplotlib.pyplot as plt
import cv2
from PIL import Image
import os
start = time.time()
def _get_dynamic_binary_image(img_name):
    # Adaptive binarization: convert to grayscale, then threshold
    filename = './out_img/' + img_name.split('.')[0] + '-binary.jpg'
    img_name = './out_img' + '/' + img_name
    print('.....' + img_name)
    image = cv2.imread(img_name)
    image2 = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    th1 = cv2.adaptiveThreshold(image2, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 21, 1)
    cv2.imwrite(filename, th1)
    return th1
def get_feature(address, assume, dir, file, f, a):
    # 'assume' is the assumed (ground-truth) label, written at the start of the line
    f.write(assume)
    im = Image.open(address + dir + a + file)
    count = 0
    width, height = im.size
    for i in range(height):
        c = 0
        for j in range(width):
            if im.getpixel((j, i)) == 0: c += 1  # pixel value 0 means black
        f.write(' %d' % c)
        count += 1
    for i in range(width):
        c = 0
        for j in range(height):
            if im.getpixel((i, j)) == 0: c += 1
        f.write(' %d' % c)
        count += 1
    f.write('\n')
'''
address = './dataset/'
f1 = open('./train.txt', 'w')
dirs = os.listdir(address)
for dir in dirs:
    files = os.listdir(address + dir)
    for file in files:
        get_feature(address, dir, dir, file, f1, '/')
f1.close()
'''
f = open('./train.txt')
line = f.readline()
data_list = []
while line:
    num = list(map(float, line.split()))
    data_list.append(num)
    line = f.readline()
f.close()
data_array = np.array(data_list)
N = 862
K = 10
variables = 25
points = tf.Variable(data_array)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))  # cluster assignment of each sample
centroids = tf.Variable(tf.slice(points.initialized_value(), [0, 0], [K, variables]))  # initial cluster centers
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(centroids)
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, variables])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, variables])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            reduction_indices=2)  # squared Euclidean distances
best_centroids = tf.argmin(sum_squares, 1)  # index of the nearest cluster center for each sample
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))
def bucket_mean(data, bucket_ids, num_buckets):
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count

means = bucket_mean(points, best_centroids, K)
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
        centroids.assign(means),
        cluster_assignments.assign(best_centroids))
changed = True
MAX_ITERS = 1000
iters = 1
fig, ax = plt.subplots()
colourindexes = [2, 1, 4, 3, 5, 6, 7, 8, 9, 10]
while changed and iters < MAX_ITERS:
    fig, ax = plt.subplots()
    iters += 1
    [changed, _] = sess.run([did_assignments_change, do_updates])
    [centers, assignments] = sess.run([centroids, cluster_assignments])
    ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker='o', s=200, c=assignments,
               cmap=plt.cm.coolwarm)
    ax.scatter(centers[:, 0], centers[:, 1], marker='^', s=550, c=colourindexes, cmap=plt.cm.plasma)
    ax.set_title('Iteration ' + str(iters))
    plt.savefig("kmeans" + str(iters) + ".png")
print(assignments)
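To go beyond eyeballing the printed assignments, one can score how well the clusters line up with the true digits using cluster purity: each cluster is credited with its most common true label. The helper below is my own sketch, and `true_labels` is assumed to be the label column that get_feature wrote into train.txt:

```python
import numpy as np

def purity(assignments, true_labels):
    """Fraction of samples whose cluster's majority label matches
    their own true label; higher is better (1.0 = perfect)."""
    assignments = np.asarray(assignments)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(assignments):
        members = true_labels[assignments == c]
        values, counts = np.unique(members, return_counts=True)
        correct += counts.max()   # credit the cluster's majority label
    return correct / len(true_labels)

# Toy example: two clean clusters and one mixed one
print(purity([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 3, 7]))  # 5/6 ≈ 0.833
```

Note that purity rewards the "0, 4, 5 look good" observation numerically, but it inflates as K grows (with one cluster per sample it is trivially 1.0), so it is only a rough sanity check.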