Getting a custom iterator to work in Rust over a &String

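Question:

I am currently working on an iterator that splits a given string and yields the substrings. A special character is returned on its own; otherwise the iterator returns a substring of alphanumeric characters, splitting at whitespace.

I suspect there is some problem with indexing because of UTF-8 characters, but I don't know how to handle it.

Here is the struct and its iterator implementation.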

pub struct SpecialStr<'a> {
    string: &'a str,
    back: usize,//index of the back of the &str substring.
}

impl<'a> SpecialStr<'a> {
    pub fn new(input : &'a str) -> Self {
        SpecialStr {string: input, back: 0}
    }
}


//anything which is not alphanumeric or whitespace.
pub fn is_special(c: char) -> bool {
    !c.is_ascii_alphanumeric() && !c.is_whitespace()
}

impl<'a> Iterator for SpecialStr<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        let input_string: &str = self.string;
        let max_index = self.string.len();

        for front in self.back..max_index {

            let character = match self.string.chars().nth(front) {
                Some(character) => character,
                None => return None,
            };

            //if the present char is a special character just return it by itself.
            if is_special(character) {
                self.back += character.len_utf8();
                return Some(&input_string[self.back-character.len_utf8()..self.back]);
            } else if !character.is_whitespace() {
                //if it is not a special character then we are going to select a substring whose end will be at :
                //--the one before the next following special character
                //--or the one before a whitespace
                //--or the one before the end of the sentence.
                //then we are going to determine the substring to be selected based on this comparison.
                for back in front+character.len_utf8()..max_index {
                    let character_2 = match self.string.chars().nth(back) {
                        Some(character) => character,
                        None => return None,
                    };
                    if is_special(character_2) || character_2.is_whitespace() || back == max_index-1 {
                        self.back = back;
                        
                        return Some(&input_string[front..self.back]);
                    }
                }
            } else {
                self.back += 1;
            }
        }

        None

    }

}

And this is the test.

fn divide_n_print_3() {
use super::tokenisation::SpecialStr;

    let input = "` i love mine, too . happy motherï¿½s day to all";
    let new_one = SpecialStr::new(&input);
    
    for i in new_one.into_iter() {
        println!("{}", i); 
    }

}

I get this error:

thread 'feature_extraction::tokenisation_test::divide_n_print_3' panicked at 'byte index 38 is not a char boundary; it is inside '½' (bytes 37..39) of `` i love mine, too . happy motherï¿½s day to all`', src\feature_extraction\tokenisation.rs:74:38

I understand what the error means, but I don't know how to fix it. Any help would be greatly appreciated.

Tags: string, rust, utf-8, iterator

Answer:
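
The panic in the question comes from mixing two kinds of index: `chars().nth(i)` counts characters, while `&s[a..b]` slices by byte offset, and the two stop agreeing as soon as the string contains a multi-byte character. The following is only a small sketch to make that visible; the sample string and the helper name `char_vs_byte` are assumptions made for illustration and do not appear in the original posts. `char_indices()` yields the byte offset of each character, which is the value that is safe to slice with.

```rust
/// Hypothetical helper for illustration: for each char in `s`, returns
/// (char position, byte offset, char).
fn char_vs_byte(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .enumerate()
        .map(|(char_pos, (byte_off, c))| (char_pos, byte_off, c))
        .collect()
}

fn main() {
    // 'a' and 's' take 1 byte each; 'é' and '½' take 2 bytes each in UTF-8.
    let s = "aé½s";
    for (char_pos, byte_off, c) in char_vs_byte(s) {
        println!("char #{char_pos} starts at byte {byte_off}: {c:?}");
    }

    // Byte 1 is where 'é' starts, so slicing there is fine.
    assert!(s.is_char_boundary(1));
    // Byte 2 is in the middle of 'é'; `&s[..2]` would panic with
    // "byte index 2 is not a char boundary".
    assert!(!s.is_char_boundary(2));
}
```

In the question's code, `front` and `back` are used both as an argument to `chars().nth(..)` and as slice bounds, so once a 2-byte character has been passed the two interpretations drift apart.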

struct Tokenizer<'a> {
    s: &'a str,
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        self.s = self.s.trim_start();
        let c = self.s.chars().next()?;
        let len = if c.is_ascii_alphanumeric() {
            self.s
                .find(|c: char| !c.is_ascii_alphanumeric())
                .unwrap_or(self.s.len())
        } else {
            c.len_utf8()
        };
        let result;
        (result, self.s) = self.s.split_at(len);
        Some(result)
    }
}

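This avoids most of the problems you ran into by using string methods for the actual iteration: trim_start() skips whitespace, and find() locates the end of a run of alphanumeric characters.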
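
As a usage sketch, the tokenizer above can be run on a string containing a multi-byte character. The `tokenize` wrapper and the shortened sample input are assumptions added for illustration, and the comments are mine rather than the answerer's; the iterator logic is the same, with the tuple assignment spelled out as two statements.

```rust
struct Tokenizer<'a> {
    s: &'a str,
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        // Skip leading whitespace; an empty remainder ends the iteration.
        self.s = self.s.trim_start();
        let c = self.s.chars().next()?;
        let len = if c.is_ascii_alphanumeric() {
            // Byte offset of the first non-alphanumeric char,
            // or the whole remainder if there is none.
            self.s
                .find(|c: char| !c.is_ascii_alphanumeric())
                .unwrap_or(self.s.len())
        } else {
            // A special character is returned alone, so the token
            // length is that character's UTF-8 width in bytes.
            c.len_utf8()
        };
        // `len` always falls on a char boundary, so `split_at` cannot panic.
        let (token, rest) = self.s.split_at(len);
        self.s = rest;
        Some(token)
    }
}

/// Hypothetical convenience wrapper, not part of the original answer.
fn tokenize(input: &str) -> Vec<&str> {
    Tokenizer { s: input }.collect()
}

fn main() {
    let input = "` i love mine, too . happy mother½s day";
    for token in tokenize(input) {
        println!("{token}");
    }
}
```

Because every length comes either from `find()` or from `len_utf8()`, the code never computes a byte offset by counting characters, which is what triggered the panic in the question.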
