Asked by: Monolith | Asked: 12/27/2020 | Updated: 12/22/2022 | Views: 771
How do I cut HTML so that the closing tags are preserved?
Question:
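How can I create a preview of a blog post that is stored as HTML? In other words, how can I "cut" the HTML while making sure the tags are closed properly? Currently I render the whole thing on the frontend (using React's dangerouslySetInnerHTML) and then set overflow: hidden and height: 150px. I would much prefer a way to cut the HTML directly, so that I don't need to send the entire HTML to the frontend; with 10 blog post previews, that would be a lot of HTML the visitor never even sees.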
If I have this HTML (say it is the whole blog post):
<body>
<h1>Test</h1>
<p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>
Slicing it to make a preview won't work, because the tags become mismatched:
<body>
<h1>Test</h1>
<p>This is a long string of text <!-- Oops! unclosed tags -->
What I really want is:
<body>
<h1>Test</h1>
<p>This is a long string of text</p>
</body>
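I'm using Next.js, so any Node.js solution should work fine. Is there a way to do this (for example, a library that runs on the Next.js server side)? Or do I just have to parse the HTML myself (server-side) and then fix the unclosed tags?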
Answer:
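Guessing the height of each element before it is rendered is quite complicated. You can, however, cut the entry by character count using the following pseudo-rules: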
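- First, define the maximum number of characters to keep.
- Starting from the beginning: if you encounter an HTML tag (detected with a regex for < .. > or < .. />), find its matching closing tag.
- Then continue searching for tags from where you stopped.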
Here is a quick suggestion in JavaScript that I just wrote (it can probably be improved, but this is the idea):
let str = `<body>
<h1>Test</h1>
<p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>`;
const MAXIMUM = 100; // Maximum characters for the preview
let currentChars = 0; // Will hold how many characters we kept until now
let list = str.split(/(<\/?[A-Za-z0-9]*>)/g); // Split by tags (only tags without attributes are matched)
const isATag = (s) => (s[0] === '<'); // Returns true if it is a tag
const tagName = (s) => (s.replace('<', '').replace('>', '').replace('\/', '')); // Get the tag name
const findMatchingTag = (list, i) => {
  let name = tagName(list[i]);
  let searchingregex = new RegExp(`<\/ *${name} *>`,'g'); // The regex for the matching closing tag
  let sametagregex = new RegExp(`< *${name} *>`,'g'); // The regex for the same opening tag (nested tags with the same name must be skipped)
  let buffer = 0; // Counts how many tags with the same name are open at an inner hierarchy level; those must be skipped
  for(let j=i+1;j<list.length;j++){
    if(list[j].match(sametagregex)!=null) buffer++;
    if(list[j].match(searchingregex)!=null){
      if(buffer>0) buffer--;
      else{
        return j;
      }
    }
  }
  return -1;
}
let k = 0;
let endCut = false;
let cutArray = new Array(list.length);
while (currentChars < MAXIMUM && !endCut && k < list.length) { // While we are still within the character limit and within the array
  if (cutArray[k] !== undefined) { // Closing tag that was already placed and counted together with its opening tag
    k++;
    continue;
  }
  if (isATag(list[k])) { // Handling tags: find the matching closing tag
    let matchingTagindex = findMatchingTag(list, k);
    if (matchingTagindex != -1) {
      if (list[k].length + list[matchingTagindex].length + currentChars < MAXIMUM) { // Keep the pair only if both the tag and its closing tag fit within the limit
        currentChars += list[k].length + list[matchingTagindex].length;
        cutArray[k] = list[k];
        cutArray[matchingTagindex] = list[matchingTagindex];
      }
      else { // Otherwise drop the pair and end the cut process
        endCut = true;
      }
    }
    else {
      if (list[k].length + currentChars < MAXIMUM) { // Keep an unmatched tag only if it fits within the limit
        currentChars += list[k].length;
        cutArray[k] = list[k];
      }
      else { // Otherwise drop it and end the cut process
        endCut = true;
      }
    }
  }
  else { // Not a tag: trim the text to the remaining budget
    let cutstr = list[k].substring(0, MAXIMUM - currentChars)
    currentChars += cutstr.length;
    cutArray[k] = cutstr;
  }
  k++;
}
console.log(cutArray.join('')) // Slots that were never filled are joined as empty strings
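The same idea can also be sketched with a stack of open tags instead of a forward search for each closing tag. This is not the code above, just an alternative illustration: `truncateHtml` and `VOID_TAGS` are hypothetical names, the limit counts text characters only (tags are not counted), and the input is assumed to be reasonably well-formed HTML with no `>` inside attribute values.

```javascript
// Void elements never have a closing tag, so they are never pushed on the stack.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: keep at most `limit` characters of text,
// then close every tag that is still open at the cut point.
function truncateHtml(html, limit) {
  const tokens = html.split(/(<[^>]+>)/g); // the capture group keeps tags in the result
  const open = []; // names of the currently open tags, innermost last
  let out = "";
  let remaining = limit;

  for (const token of tokens) {
    if (token === "") continue;

    if (token.startsWith("<")) {
      const m = token.match(/^<(\/?)([A-Za-z][A-Za-z0-9-]*)/);
      if (m) {
        const name = m[2].toLowerCase();
        if (m[1]) {
          // Closing tag: drop it (and anything left open inside it) from the stack.
          const idx = open.lastIndexOf(name);
          if (idx !== -1) open.splice(idx);
        } else if (!token.endsWith("/>") && !VOID_TAGS.has(name)) {
          open.push(name);
        }
      }
      out += token; // comments and doctypes are copied through unchanged
      continue;
    }

    // Text: copy as much as the remaining budget allows.
    out += token.slice(0, remaining);
    remaining -= Math.min(token.length, remaining);
    if (remaining === 0) break;
  }

  // Close whatever is still open, innermost first.
  for (let i = open.length - 1; i >= 0; i--) out += `</${open[i]}>`;
  return out;
}

const post = "<body>\n<h1>Test</h1>\n<p>This is a long string of text that I may want to cut</p>\n</body>";
console.log(truncateHtml(post, 35));
```

Because tags are not counted against the limit, attributes of any length do not shorten the visible text. For untrusted or malformed input, a real HTML parser would be a safer basis than a regex split.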
Answer:
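This was a challenging task that had me struggling for about two days, and it led me to publish my first NPM package, post-preview, which can solve your problem. Everything is described in its README, but in case you want to know how to use it for your specific problem: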
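You can run it before users submit their blog post to the server and send its result (the preview) to the backend together with the full post. The backend then validates the preview's length, sanitizes its HTML, and saves it to the backend storage (a DB, etc.). Whenever you want to show users a preview of a blog post instead of the full post, send the stored preview back.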
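The server-side length check mentioned above could look roughly like the sketch below. `plainTextLength` and `isPreviewAcceptable` are hypothetical helpers, not part of post-preview, and the sketch assumes the limit applies to visible text rather than to the raw HTML string. Stripping tags with a regex only gives a rough length estimate; it is not a sanitizer, so the HTML still needs to be cleaned by a dedicated sanitizing step.

```javascript
// Hypothetical helper: approximate the visible text length by dropping anything that looks like a tag.
function plainTextLength(html) {
  return html.replace(/<[^>]*>/g, "").length;
}

// Hypothetical helper: reject previews that are not strings or whose text exceeds the allowed length.
function isPreviewAcceptable(previewHtml, maxChars) {
  return typeof previewHtml === "string" && plainTextLength(previewHtml) <= maxChars;
}

console.log(isPreviewAcceptable("<p>Hello</p>", 5));
console.log(isPreviewAcceptable("<p>Hello!</p>", 5));
```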
Example:
The following code takes an HTMLElement (.blogPostContainer) as input and returns an HTML string version of its summary, at most 200 characters long.
You can see the preview in the .preview container:
js:
import postPreview from "post-preview";
const postContainer = document.querySelector(".blogPostContainer");
const previewContainer = document.querySelector(".preview");
previewContainer.innerHTML = postPreview(postContainer, 200);
html (the full blog post):
<div class="blogPostContainer">
<div>
<h2>Lorem ipsum</h2>
<p>
Lorem ipsum, dolor sit amet consectetur adipisicing elit. Neque, fugit hic! Quas similique
cupiditate illum vitae eligendi harum. Magnam quam ex dolor nihil natus dolore voluptates
accusantium. Reprehenderit, explicabo blanditiis?
</p>
</div>
<p>
Lorem ipsum dolor sit amet consectetur adipisicing elit. Ipsam non incidunt, corporis debitis
ducimus eum iure sed ab. Impedit, doloribus! Quos accusamus eos, incidunt enim amet maiores
doloribus placeat explicabo.Eaque dolores tempore, quia temporibus placeat, consequuntur hic
ullam quasi rem eveniet cupiditate est aliquam nisi aut suscipit fugit maiores ad neque sunt
atque explicabo unde! Explicabo quae quia voluptatem.
</p>
</div>
<div class="preview"></div>
Result (the blog post preview):
<div class="preview">
<div class="blogPostContainer">
<div>
<h2>Lorem ipsum</h2>
<p>
Lorem ipsum, dolor sit amet consectetur adipisicing elit. Neque, fugit hic! Quas similique
cupiditate illum vitae eligendi ha
</p>
</div>
</div>
</div>
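This is a synchronous task, so if you want to run it on several posts at once, you are better off running it in a worker thread for better performance.
Thanks for getting me to do some research!
Good luck!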
Answer:
I used the solution proposed by SomoKRoceS and it really helped me. But then I found a couple of problems:
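- If the HTML content that exceeds the limit is wrapped in a single tag, it is omitted entirely.
- If a tag contains any attributes, such as class="width100" or style="text-align:center", it will not match the provided RegExp.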
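The attribute problem is easy to reproduce with the split pattern from the earlier answer. The sketch below compares it with a looser pattern; `looseTagRegex` is only an illustrative assumption and is not the RegExp used in the adjusted solution.

```javascript
// Split pattern from the earlier answer: only matches tags without attributes.
const originalTagRegex = /(<\/?[A-Za-z0-9]*>)/g;
// Illustrative looser pattern (an assumption): anything from "<" plus a letter up to the next ">".
const looseTagRegex = /(<\/?[A-Za-z][^>]*>)/g;

const sample = '<p class="width100">Hi</p>';

// The opening tag is not recognized, so it stays glued to the text.
const originalParts = sample.split(originalTagRegex).filter(Boolean);
// The opening tag, the text and the closing tag come out as separate items.
const looseParts = sample.split(looseTagRegex).filter(Boolean);

console.log(originalParts);
console.log(looseParts);
```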
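I made some adjustments to overcome these issues. This solution cuts an exact amount of plain text to fit the limit and preserves all the HTML wrapping.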
class HtmlTrimmer {
  HTML_TAG_REGEXP = /(<\/?[a-zA-Z]+[\s a-zA-Z0-9="'-;:%]*[^<]*>)/g;
  // <p style="align-items: center; width: 100%;">
  HTML_TAGNAME_REGEXP = /<\/?([a-zA-Z0-9]+)[\sa-zA-Z0-9="'-_:;%]*>/;

  getPlainText(html) {
    return html
      .split(this.HTML_TAG_REGEXP)
      .filter(text => !this.isTag(text))
      .map(text => text.trim())
      .join('');
  }

  isTag(text) {
    return text[0] === '<';
  }

  getTagName(tag) {
    return tag.replace(this.HTML_TAGNAME_REGEXP, '$1');
  }

  findClosingTagIndex(list, openedTagIndex) {
    const name = this.getTagName(list[openedTagIndex]);
    // The regex for closing matching tag
    const closingTagRegex = new RegExp(`</ *${name} *>`, 'g');
    // The regex for matching tag (in case there are inner scoped same tags, we want to pass those)
    const sameTagRegex = new RegExp(`< *${name}[\\sa-zA-Z0-9="'-_:;%]*>`, 'g');
    // Will count how many tags with the same name are in an inner hierarchy level, we need to pass those
    let sameTagsInsideCount = 0;
    for (let j = openedTagIndex + 1; j < list.length; j++) {
      if (list[j].match(sameTagRegex) !== null) sameTagsInsideCount++;
      if (list[j].match(closingTagRegex) !== null) {
        if (sameTagsInsideCount > 0) sameTagsInsideCount--;
        else {
          return j;
        }
      }
    }
    return -1;
  }

  trimHtmlContent(html: string, limit: number): string {
    let trimmed = '';
    const innerItems = html.split(this.HTML_TAG_REGEXP);
    for (let i = 0; i < innerItems.length; i++) {
      const item = innerItems[i];
      const trimmedTextLength = this.getPlainText(trimmed).length;
      if (this.isTag(item)) {
        const closingTagIndex = this.findClosingTagIndex(innerItems, i);
        if (closingTagIndex === -1) {
          trimmed = trimmed + item;
        } else {
          const innerHtml = innerItems.slice(i + 1, closingTagIndex).join('');
          trimmed = trimmed
            + item
            + this.trimHtmlContent(innerHtml, limit - trimmedTextLength)
            + innerItems[closingTagIndex];
          i = closingTagIndex;
        }
      } else {
        if (trimmedTextLength + item.length > limit) {
          trimmed = trimmed + item.slice(0, limit - trimmedTextLength);
          return trimmed + '...';
        } else {
          trimmed = trimmed + item;
        }
      }
    }
    return trimmed;
  }
}

const htmlTrimmer = new HtmlTrimmer();
const trimmedHtml = htmlTrimmer.trimHtmlContent(html, 100);