变压器训练损耗不减 - Val. Acc/cy 停留在 0.58

transformer training loss not decreasing - Val. Acc/cy stuck at 0.58

提问人:Manuel Gk 提问时间:5/10/2023 最后编辑:Manuel Gk 更新时间:7/17/2023 访问量:264

问:

我正在训练一个基于 pytorch 的多分支管道,用于 deepfake 检测。我正在尝试实现论文中提出的以下模型:

一种基于Transformer的DeepFake检测方法 用于面部器官 (薛子宇等,电子学 2022 ,1 ,4143.https://doi.org/10.3390/ 电子11244143 )

多分支遮挡鲁棒 Deepfake 检测器

模型/架构:对于变压器编码器,我有自己的 MHA 自定义实现,如下所示。我还尝试通过torchvision.models在Imagenet1K上使用预训练的ViT,但观察到相同的结果。我以 [60,128,256] 的批量大小进行训练,训练损失在 [0.5 - 0.9] 范围内波动,并且没有进一步减少。

数据集:我在 FaceForensics++ 数据集上进行训练,全尺寸为 44335 张图像,具有 80/20 的分割和 1/4 的类不平衡。对于转换器,我使用模型/嵌入尺寸 dk=252。

我尝试使用 gam=0.9 的指数 lr 衰减调度器将 lr 从 [0.1 , 0.0001, 1e-6] 甚至 1e-8 更改。 你能提供我可以尝试的任何替代方案吗? (已检查数据集;没有 NAN 值。此外,检查从图像中检测到并用作变压器输入的物体,并且质量正常)。

  #main model class
        class VisionTransformer(nn.Module):
            def __init__(
                self,
                embed_dim,
                hidden_dim,
                num_channels,
                num_heads,
                num_layers,
                num_classes,
                patch_size,
                num_patches,
                batch_size,
                device,
                dropout=0.2, #20% prob dropout
            ):
                """
                Inputs:
                    embed_dim - Dimensionality of the input feature vectors to the Transformer
                    hidden_dim - Dimensionality of the hidden layer in the feed-forward networks
                                within the Transformer
                    num_channels - Number of channels of the input (3 for RGB)
                    num_heads - Number of heads to use in the Multi-Head Attention block
                    num_layers - Number of layers to use in the Transformer
                    num_classes - Number of classes to predict
                    patch_size - Number of pixels that the patches have per dimension
                    num_patches - Maximum number of patches an image can have
                    dropout - Amount of dropout to apply in the feed-forward network and
                            on the input encoding
                """
                super().__init__()

                self.patch_size = patch_size
                self.patch_dim = (3,256,256)
                self.organ_dim = (1,3,256,256)
                # Layers/Networks
                self.input_layer = nn.Linear(num_channels * (patch_size**2), embed_dim)    #takes [ , #patches, features] for embedd
                self.embed_dim = embed_dim # size of feature inputs to linear layer - BEFORE batch embeddings
     
                self.device =device
               

                self.input_layers = torch.nn.ModuleDict({
                    'in_mouth' : nn.Linear(1014,embed_dim),
                    'in_right_eyebrow': nn.Linear(1014,embed_dim),
                    'in_left_eyebrow': nn.Linear(1014,embed_dim),
                    'in_right_eye':nn.Linear(1014,embed_dim),
                    'in_left_eye':nn.Linear(1014,embed_dim),
                    'in_nose':nn.Linear(1014,embed_dim),
                    'in_jaw':nn.Linear(1014,embed_dim),
                    'in_face':nn.Linear(1014,embed_dim)

                })


                self.cnn_dict = torch.nn.ModuleDict({
                    'cnn_mouth': CNN_encoder1().to(device) ,
                    'cnn_right_eyebrow': CNN_encoder2().to(device),
                    'cnn_left_eyebrow': CNN_encoder2().to(device),
                    'cnn_right_eye': CNN_encoder2().to(device),
                    'cnn_left_eye': CNN_encoder2().to(device),
                    'cnn_nose': CNN_encoder3().to(device),
                    'cnn_jaw': CNN_encoder2().to(device),
                    'cnn_face': CNN_encoder2().to(device),

                })


                self.transformer_dict = torch.nn.ModuleDict({
                    'tf_mouth':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_right_eyebrow':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_left_eyebrow':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_right_eye':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_left_eye':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_nose':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_jaw':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        ),
                    'tf_0':nn.Sequential(
                        *(AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout) for _ in range(num_layers))
                        )

                })



                self.mlp_head_dict =torch.nn.ModuleDict({
                                           'o_mouth': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),    #embded dim=transformer output,
                                            'o_right_eyebrow': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),

                                            'o_left_eyebrow': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),

                                            'o_right_eye': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),
                                            'o_left_eye': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),

                                            'o_nose': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),

                                            'o_jaw': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),

                                            'o_8': nn.Sequential( nn.LayerNorm(self.embed_dim),

                                            #nn.Linear(8*embed_dim, embed_dim),
                                            nn.Linear(self.embed_dim,1)),

                                            })
                                                                                                                      #num_classes = #ouptut neurons
                self.classify_head = nn.Linear(8,2)
                self.dropout = nn.Dropout(dropout)

                # Parameters/Embeddings
                self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))                    #b , img_flattened , embed
                self.pos_embedding = nn.Parameter(torch.randn(1, 1 + num_patches, embed_dim))  #b, numpathces=32 ,embed

                self.num_patches = num_patches
                self.num_organs = 7

 
            def forward(self, z):
                _B = z[0].shape[0]
                (x , paths) = z   #patch size B


                self.vect_gr = torch.empty((_B , 8*self.embed_dim))   #6 = 5 organs + 1 face. init on each forward call 

                """
                1st brach: organ-level transformer
                """
    
                #print("1")
                org_list_dict = {
                    "o_mouth":torch.zeros((0,self.embed_dim)).to(self.device),
                    "o_right_eyebrow":torch.zeros((0,self.embed_dim)).to(self.device),
                    "o_left_eyebrow":torch.zeros((0,self.embed_dim)).to(self.device),
                    "o_right_eye":torch.zeros((0,self.embed_dim)).to(self.device),
                    "o_left_eye":torch.zeros((0,self.embed_dim)).to(self.device),
                    "o_nose":torch.zeros((0,self.embed_dim)).to(self.device),
                    "o_jaw":torch.zeros((0,self.embed_dim)).to(self.device)
                }
               # print("2")


                for j in range(_B):         #iterate over batch images
                    selected_organs = self.organ_selector( paths[j] , organs=True)    #organ selection module -> returns dict of detected organs in all images.

                    if selected_organs != {}:

                        for idx,(name, loc) in enumerate(FACIAL_LANDMARKS_IDXS.items()):      #iter over organs 

                            if name in selected_organs:                                     #iter over orans detected

                                #run entire pipeline from cnn to tf output. Store outputs per organ
                                #for all batches
                                x1 = self.cnn_dict["cnn_"+name](selected_organs[name].to(self.device))

                                #transformer output size

                                x1 = img_to_patch(x1, self.patch_size)          #(1,flattened h*w , c*p_h*p_w)
                                                                                #(40,*)

                                x1 = self.input_layers["in_"+name](x1.to(device))                       #(*,250) *=(1,h*w)

                                B, T ,_ = x1.shape
                                #patch dims

                                cls_token = self.cls_token.repeat(1, 1, 1)  #repeat b times at d=0


                                x1 = torch.cat([x1 , cls_token] , 1)

                                x1 = x1 + self.pos_embedding[:, : T + 1]  #([:,:t+1). keep max T = N patches 

                                x1=self.dropout(x1)

                                x1=x1.transpose(0,1)

                                #transformer input dims

                                x1 = self.transformer_dict["tf_"+name](x1)  #(1,1014)

                                x1=x1[0]

                                #transformer output dims are
                 
                                #cat organ name to existing organ list dict
                                org_list_dict["o_"+name] = torch.cat([org_list_dict["o_"+name], x1] , dim = 0)   #concat with rest organs of batch



                            else:    #no organ of name detected
                                org_list_dict["o_"+name] = torch.cat([ org_list_dict["o_"+name] ,  torch.zeros((1, self.embed_dim)).to(self.device)] , dim = 0)


                    else:
                        #no organs detected,7 expected
                        for name,_ in FACIAL_LANDMARKS_IDXS.items():
                            org_list_dict["o_"+name] = torch.cat([ org_list_dict["o_"+name] ,torch.zeros((1, self.embed_dim)).to(self.device) ])

                """
                2nd brach: face-level transformer
                """
                img = torch.empty((0,3,260,260))
                for j in range(_B):
                    print(x[j].size())
                    x2 = self.organ_selector(x[j] , organs = False)  #whole face
                    x2 = x2.permute(0,3,1,2)
                    img = torch.cat([img, x2] ,)

                x2 = self.cnn_dict["cnn_face"](img.to(self.device))

                x2 = img_to_patch(x2, self.patch_size)


                x2 = self.input_layers["in_face"](x2.to(self.device))
                B, T, _ = x2.shape #(T is number of patches)


                # Add CLS token and positional encoding
                cls_token = self.cls_token.repeat(_B, 1, 1)  #repeat b times at d=0
                x2 = torch.cat([cls_token, x2], dim=1)

                x2 = x2 + self.pos_embedding[:, : T + 1]


                # Apply Transforrmer
                x2 = self.dropout(x2)
                x2 = x2.transpose(0, 1)   #transpose 0th with 1st dims

                x2 = self.transformer_dict["tf_"+str(0)](x2)
                x2 = x2[0]#cls

       
                predictions = torch.empty((10,0)).to(self.device)
        #                 pred_organ_lvl = torch.empty((0,1014))
                for idx, (name,_) in enumerate(FACIAL_LANDMARKS_IDXS.items()):
                        #if torch.count_nonzero( org_list_dict["o_"+str(name)]
                        #if torch.count_nonzero( org_list_dict["o_"+str(name)])>0:
                         pred1 =  org_list_dict["o_"+str(name)]
                         print(f"predictions size is :{pred1.size()}")
                         predictions = torch.cat([predictions, self.mlp_head_dict["o_"+name](pred1) ],1)


                x = torch.cat([predictions , self.mlp_head_dict["o_8"](x2)],1)
                x = self.classify_head(x)


                return x

            def organ_selector(self, x, organs=False):
                if organs== True:
                    detected = face_shape_extractor(x , isPath=False)     #extract valid organs at dictionary detected
      
                    x = detected
                else:
                    #Loading the file
                    img2 = x.cpu().numpy().transpose(1, 2, 0)
                    #Format for the Mul:0 Tensor
                    img2= cv2.resize(img2,dsize=(260,260), interpolation = cv2.INTER_CUBIC)
                    #Numpy array
                    np_image_data = np.asarray(img2)
                    #maybe insert float convertion here - see edit remark!
                    np_final = np.expand_dims(np_image_data,axis=0)

                    x=torch.from_numpy(np_final)
        #             x1 = x.squeeze(0).permute(1,2,0)
        #             cv2.resize(x1 , (3,260,260)) #resize for particular edge
        #             x = x1
                return x

对于训练,我使用以下逻辑:







        #CLEAR CUDA
        import gc
        gc.collect()
        torch.cuda.empty_cache()


        def train(model, loss_func, device, train_loader, optimizer,scheduler, epoch):
            #set model to training mode
            model.train()
            torch.set_grad_enabled(True)
            running_loss = []

            for batch_idx, (data, labels , paths) in enumerate(train_loader):

                data, labels = data.to(device), labels.squeeze(0).float().to(device)
                optimizer.zero_grad()
                output = model((data,paths))

                #check for any Nan Paramter
                is_nan = torch.stack([torch.isnan(p).any() for p in model.parameters()]).any()
                #print(f"nans detected ; {is_nan}")

                labels_copy =labels.clone()
                labels_inv = labels.cpu().apply_(lambda x: abs(x - 1))
                labels = torch.cat([labels_copy , labels_inv.to(device)] , 1)

                #labels = labels.repeat(2,1)
                #print(torch.isnan(output).any())
                #print(f"output is {output},size {output.size()} labels are {labels} , w size {labels.size()}")
                #loss=0
                #for j in range(8):
                    #loss_og = loss_func(output[BATCH_SIZE*j : BATCH_SIZE*(j+1) , :] , labels)
                    #print(f"Loss for organ {j} is: {loss_og.item()}")
                    #loss+=loss_og
               # print(f"The batch total loss is : {loss.item()}")
                loss = loss_func(output,labels)
                print(f"The batch total loss is: {loss}")
               # for name, p in model.named_parameters():
               #     if p.grad is not None:
               #         print(f"Printing parameter {p},name {name} data {p.grad.data}")
        #      #           print(p.grad.data)
               #     if p is None:
               #         print(f"Printing  None parameter {p}")

                loss.backward()

                #print("backward pass check")

                #exploding grads normalize
                torch.nn.utils.clip_grad_norm_(model.parameters(), 5 , error_if_nonfinite=True)
                optimizer.step()
                #LOG BATCH LOSS
                # wandb.log({'batch loss' : loss})
                running_loss.append(loss.detach().cpu().numpy())
                #         if batch_idx % 5 == 0:
            # #             wandb.log({'train-step-loss': np.mean(running_loss[-10:])})
            # #             pbar.set_postfix(loss='{:.3f} ({:.3f})'.format(running_loss[-1], np.mean(running_loss)))
                #             print("Epoch {} Iteration {}: Loss = {}, Number of mined triplets = {}".format(epoch, batch_idx, loss)
                #             )
                #LOG AVG EPOCH LOSS
                # wandb.log({'train-step-loss': np.mean(running_loss [-10:])})

                train_loss = np.mean(running_loss)
                #step scheduler
                scheduler.step()

            #print(f"Epoch loss is {np.mean(running_loss)}")
                #log epoch loss
        #         wandb.log({'train-epoch-loss': train_loss})
            pass

        # define validation logic
        @torch.no_grad()
        def validate_epoch(model, device, val_dataloader, criterion):
            model.eval()

            running_loss, y_true, y_pred = [], [], []
            for _,(x, y ,p) in enumerate(val_dataloader):
                x = x.to(device)
                y = y.to(device)

                outputs = model((x,p))
                labels_c =labels.copy()
                labels_inv = labels_c.cpu().apply_(lambda x: abs(x-1))
                labels = torch.cat([labels , labels_inv.to(device)] , 1)
                loss = criterion(outputs, labels)

                # loss calculation over batch
                running_loss.append(loss.cpu().numpy())

                # accuracy calculation over batch

                y_true.append(y.cpu())
                y_pred.append(outputs.cpu())

            y_true = torch.cat(y_true, 0).numpy()
            y_pred = torch.cat(y_pred, 0).numpy()

            #acc2 = accuracy()
            val_loss = np.mean(running_loss)
            acc = 100. * np.mean(y_true == y_pred)
            print(f"Validation loss is {val_loss} , accuracy {acc}")
            return {'val_loss': val_loss}



        device = torch.device("cuda")

        model = VisionTransformer(embed_dim=args.dk ,
                                hidden_dim = 750,
                                num_channels= 3,
                                num_heads=12,
                                num_layers=6,
                                num_classes = 1,
                                patch_size = 13,
                                num_patches = 64,
                                batch_size = BATCH_SIZE,
                                device=device,dropout=0.2 ).to(device)

        #freeze pretrained vit w
       # for _ , vit in model.transformer_dict.items():
       #     for param in vit.encoder.parameters():
       #         param.requires_grad = False

        #init params xavier p
        #linear_cls = model.mlp_head[1]
        #torch.nn.init.xavier_uniform(linear_cls.weight)

        #register fwd hook for nan checking in forward pass
        activation = {}
        def get_activation(name):
            def hook(model, input, output):
                activation[name] = output.detach()
            return hook


        # model.fc2.register_forward_hook(get_activation('fc2'))
        # #get rdn dataloader sample
        # output = model(x)
        # print(activation['fc2'])

        lr =  0.1

        optimizer = optim.AdamW(model.parameters(), lr=lr , weight_decay =0.1)

        d_model = 1000
        model_opt = NoamOpt(d_model, 1, 400,
                torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))


        scheduler1 = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

        #scheduler1 = optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        # scheduler1=lr_scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1600, 150], gamma=0.1)
        num_epochs = 100

        loss_func = torch.nn.functional.cross_entropy
        criterion = torch.nn.CrossEntropyLoss()
        criterion2 = torch.nn.BCELoss()
        criterion=criterion
        torch.autograd.set_detect_anomaly(True)

        # wandb.init()
        # wandb.log(hypers_dict)
        for epoch in range(1, num_epochs + 1):
            print(f"Current Epoch is: {epoch}")
            train(model, criterion, device, train_ds, optimizer,scheduler1, epoch)
            scheduler1.step()
            metrics = validate_epoch(model, device, val_ds, criterion)
            # wandb.log(metrics)
            print("The val.loss is:{}".format(metrics["val_loss"]))

机器学习 NLP 深度伪造

评论

0赞 Radek Svoboda 9/6/2023
您必须意识到,公众没有足够的计算能力或数据来训练 Vision Transformer 模型。只有像 G 或 MS 或 META 这样的大佬才能做到这一点。我们无法用 20M 的图像数据训练这样的模型。你只有 40k。为了能够训练 ViT,并获得比 CNN 更好的结果,您需要拥有 >100M 图像,并且还需要花费至少 2000 个 GPU/天基于论文进行训练,以至少达到比 CNN 更好的性能。

答: 暂无答案