张一极
date:20220825-22:42
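Keywords: stratified sampling
This implements a stratified sampling algorithm for a dataset. Starting from the first sample, set up len(classes) baskets. For each sample, tentatively place its objects into a basket, compute the resulting data distribution, and once the placement is done, move on to the next sample.
As soon as a class reaches its target proportion, stop placing objects of that class. The loop then continues with the next sample and starts filling the next class.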
import os


def auto_split(input_path="/backup/datasets/new_HD/new_HD/labels/",
               output_path="/backup/datasets/new_HD/new_HD/",
               train_percent=0.7,
               val_percent=0.3):
    # get_train_obj_dict, get_val_obj_dict, get_the_yolo_labels and write_txt
    # are helpers this function expects to find elsewhere in the module.
    labels_list = os.listdir(input_path)

    # Pass 1: count how many objects of each class the whole dataset contains.
    dict_classes = {}
    for label_file in labels_list:
        with open(input_path + label_file, encoding="utf-8") as f:
            lines = f.read().split("\n")
        for line in lines:
            obj_name = line.split(" ")[0]
            if obj_name:  # skip blank lines
                dict_classes[obj_name] = dict_classes.get(obj_name, 0) + 1

    # Per-class object quotas. The val quota is computed but not enforced
    # below: val simply receives every sample that train does not take.
    train_obj_dict_expect = get_train_obj_dict(dict_classes, train_percent)
    val_obj_dict_expect = get_val_obj_dict(dict_classes, val_percent)
    train_obj_dict_now = {}
    val_obj_dict_now = {}
    count_train_sample = 0
    count_val_sample = 0

    # Pass 2: greedily assign every sample to a basket.
    for sample in labels_list:
        flag_train_added = False
        sample_info = get_the_yolo_labels(input_path + sample)
        for obj_ in sample_info:
            class_name = obj_.split(" ")[0]
            if class_name == "":
                continue
            current = train_obj_dict_now.get(class_name, 0)
            # Count the object toward train only while its class is below quota.
            if current + 1 <= train_obj_dict_expect[class_name]:
                train_obj_dict_now[class_name] = current + 1
                flag_train_added = True
        if flag_train_added:
            write_txt(output_path + "train_list.txt", sample)
            count_train_sample += 1
        else:
            # No object in this sample could be counted toward train,
            # so the sample goes to val.
            write_txt(output_path + "val_list.txt", sample)
            count_val_sample += 1
            for obj_ in sample_info:
                class_name = obj_.split(" ")[0]
                if class_name != "":
                    val_obj_dict_now[class_name] = val_obj_dict_now.get(class_name, 0) + 1

    print("trainset objs distribution : ", train_obj_dict_now)
    print("valset objs distribution : ", val_obj_dict_now)
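The listing calls get_train_obj_dict, get_val_obj_dict, get_the_yolo_labels and write_txt without showing them. The sketch below is one possible reading of what they might do, inferred only from how they are called: the quota is assumed to be the class total times the percentage, label files are assumed to hold one object per line, and write_txt is assumed to append one line per call. The real helpers may differ.

```python
def get_train_obj_dict(dict_classes, train_percent):
    """Assumed behavior: per-class train quota = class total * train_percent."""
    return {name: count * train_percent for name, count in dict_classes.items()}


def get_val_obj_dict(dict_classes, val_percent):
    """Assumed behavior: per-class val quota = class total * val_percent."""
    return {name: count * val_percent for name, count in dict_classes.items()}


def get_the_yolo_labels(label_path):
    """Assumed behavior: return the non-empty lines of a YOLO label file."""
    with open(label_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def write_txt(txt_path, content):
    """Assumed behavior: append `content` as one line to `txt_path`."""
    with open(txt_path, "a", encoding="utf-8") as f:
        f.write(content + "\n")
```

With these definitions the quotas are floats, so a class with 10 objects and train_percent=0.7 gets a train quota of about 7 objects.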
The result is a dataset split with a fairly balanced class distribution.
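One way to check how balanced the split is: compute, for each class, the fraction of its objects that landed in the train set and compare it with train_percent. The sketch below is only an illustration. The names class_counts and train_share are hypothetical, and the samples are passed as in-memory lists of YOLO label lines rather than read from train_list.txt and val_list.txt.

```python
from collections import Counter


def class_counts(samples):
    """Count objects per class; each sample is a list of YOLO label lines."""
    counts = Counter()
    for lines in samples:
        for line in lines:
            parts = line.split()
            if parts:  # skip blank lines
                counts[parts[0]] += 1
    return counts


def train_share(train_samples, val_samples):
    """Fraction of each class's objects that ended up in the train set."""
    train = class_counts(train_samples)
    val = class_counts(val_samples)
    classes = sorted(set(train) | set(val))
    return {c: train[c] / (train[c] + val[c]) for c in classes}


# Toy data: class "0" has 3 train objects and 1 val object,
# class "1" has 1 train object and 1 val object.
train_samples = [
    ["0 0.5 0.5 0.2 0.2", "1 0.3 0.3 0.1 0.1"],
    ["0 0.4 0.4 0.2 0.2", "0 0.6 0.6 0.2 0.2"],
]
val_samples = [["0 0.5 0.5 0.2 0.2"], ["1 0.7 0.7 0.1 0.1"]]
print(train_share(train_samples, val_samples))
```

If the per-class shares sit close to train_percent, the split is balanced. A share far from it suggests that the class is rare or mostly appears alongside classes that filled their quota early.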