y21 / tl

Fast, zero-copy HTML Parser written in Rust

Home Page: https://docs.rs/tl

Modifying nodes

Skallwar opened this issue · comments

It would be great if tl users were able to modify Nodes.

My use case is a web downloader, and thus I need to change the target href for images, for example.

From what I see this is not available yet, but it would be great, as tl seems to be a speedy HTML parser.

I might try to implement this, but it seems that a big chunk of code will need to be changed, so I might need some help figuring out some of the implications of such a change.

Anyway, awesome project, keep up the good work.

commented

Hi. Yeah, I've been meaning to do this for a while (making nodes mutable), but I never really got to it because I didn't know if it was needed and was holding off until someone asked for it.
And yep, because everything is (immutably) borrowed from the input string, this requires some more changes in the code. I'm not sure how complex the changes would be, but I guess this means making substrings copy-on-write to avoid having to copy them during parsing.

commented

My initial idea was to change Bytes (struct that holds substrings) to something like:

--- a/src/bytes.rs
+++ b/src/bytes.rs
@@ -3,7 +3,12 @@ use std::borrow::Cow;
 
 #[derive(PartialEq, Eq, PartialOrd, Ord, Hash, Clone)]
-pub struct Bytes<'a>(&'a [u8]);
+pub enum Bytes<'a> {
+    /// Borrowed bytes
+    Borrowed(&'a [u8]),
+    /// An owned string
+    Owned(Box<str>),
+}

That is, make it an enum representing either borrowed (unmodified) data, or an owned string in case it's mutated later after parsing (much like how Cow in std works).
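For illustration, this is roughly how reads and writes on such a copy-on-write Bytes could look (a minimal sketch; the method names are hypothetical, not tl's actual API):

```rust
/// Sketch of a copy-on-write byte container, mirroring the enum above.
pub enum Bytes<'a> {
    /// Borrowed, unmodified bytes from the input string
    Borrowed(&'a [u8]),
    /// An owned string, created only once the value is mutated
    Owned(Box<str>),
}

impl<'a> Bytes<'a> {
    /// Reading works the same for both variants.
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            Bytes::Borrowed(b) => b,
            Bytes::Owned(s) => s.as_bytes(),
        }
    }

    /// Mutating switches a borrowed slice over to an owned copy,
    /// so the original input string is never touched.
    pub fn set(&mut self, data: &str) {
        *self = Bytes::Owned(data.into());
    }
}
```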
Sadly, this brings the size from 16 bytes to 24 bytes, and this change is visible in benchmarks. I haven't looked at call graphs, but I guess this is because we have a lot of Bytes living on the stack that are moved around quite a bit, and moving means a memcpy (unless it's optimized out, which it likely isn't, as I saw when I did some profiling a few days ago).

change:
  time:   [+30.296% +30.417% +30.540%] (p = 0.00 < 0.10)
  thrpt:  [-23.395% -23.323% -23.252%]

That being said, I have some more ideas on how to implement it without changing the Bytes struct in a way that introduces a lot of overhead.

Thanks for the effort. I would gladly take a look at your PR, but I have a lot of school projects to finish first.

Anyway, happy new year 🎉

Hey guys, what can I work on to help with this effort? I too need to be able to mutate hrefs.

commented

Hey guys, what can I work on to help with this effort? I too need to be able to mutate hrefs.

The PR that implements this (#10) needs a few more things before it can be merged.

  • One of them is that some of the data structures used by the parser (InlineVec and InlineHashMap, which store up to N elements on the stack and thereby avoid heap allocations for the most part, at the cost of using more stack space) currently do not run the destructor of their elements when inlined (allocated on the stack), due to how they work: elements in a [MaybeUninit<T>; N] must be dropped manually, since the container doesn't know which slots are and aren't initialized (see the first sketch after this list).
    That was fine up to this point, because the elements (T) did not manage any heap allocations themselves, but with this PR it's possible that some elements do manage allocations, such as strings (this happens when changing the contents of nodes in some way).
    That means that without Drop code, those data structures in their current form would leak memory (bad).

  • As I mentioned in my previous comment here, storing a &[u8] in an enum brings the size from 16 to 24 bytes, which caused a huge performance regression. A workaround is to store a raw pointer to the start of the slice and a u32 for the length, (*const u8, u32), and reconstruct the slice from its raw parts whenever needed (see the second sketch after this list). The problem is potential overflows when a substring contains more than 2^32-1 (u32::MAX) bytes. As far as I can tell, this isn't necessarily bad for borrowed bytes (as in, it cannot cause UB or panics), but it can lead to confusing bugs, such as the slice length wrapping around, resulting in the slice being smaller than it actually is.
    For owned bytes, checks are definitely needed (and they already exist), because dropping a Box<[u8]> with the wrong length is not okay.
    I'm not sure whether it's okay to just consider an overflowing length for borrowed bytes intended behavior and document it, or whether we should have checks in some places to prevent it from happening (this would mean tl::parse() needs to return a Result, which we'd have to do sooner or later anyway). (edit: now that I think about it, we definitely should do the latter)
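On the first point, a minimal sketch of what the missing Drop code could look like (simplified; tl's actual InlineVec also has a heap fallback, and the names here are only illustrative):

```rust
use std::mem::MaybeUninit;

/// Illustrative inline vector: up to N elements stored on the stack.
struct InlineVec<T, const N: usize> {
    data: [MaybeUninit<T>; N],
    len: usize,
}

impl<T, const N: usize> Drop for InlineVec<T, N> {
    fn drop(&mut self) {
        // Only the first `len` slots are initialized; the compiler cannot
        // know that, so we have to drop them manually. Otherwise the
        // elements' own destructors (e.g. freeing an owned string) never run.
        for slot in &mut self.data[..self.len] {
            unsafe { slot.assume_init_drop() };
        }
    }
}
```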
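And on the second point, a rough sketch of the pointer-plus-u32-length representation for borrowed bytes (illustrative only; as discussed above, a real version needs to decide how to handle lengths above u32::MAX):

```rust
use std::slice;

/// Illustrative borrowed-bytes representation: raw pointer + 32-bit length,
/// which avoids growing the struct the way a full &[u8] (pointer + usize) would.
struct RawBytes {
    ptr: *const u8,
    len: u32,
}

impl RawBytes {
    /// A real constructor must check the length
    /// instead of silently wrapping it.
    fn from_slice(bytes: &[u8]) -> Self {
        assert!(bytes.len() <= u32::MAX as usize, "substring too long for a u32 length");
        RawBytes {
            ptr: bytes.as_ptr(),
            len: bytes.len() as u32,
        }
    }

    /// Reconstruct the slice from its raw parts.
    /// Safety: `ptr` and `len` must still refer to the original, live input buffer.
    unsafe fn as_slice(&self) -> &[u8] {
        slice::from_raw_parts(self.ptr, self.len as usize)
    }
}
```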

I was a bit busy the last few days and didn't have a lot of time to spend here, but now I have more time.

commented

@Skallwar Hey, I'm still not sure I understand your use-case entirely. What happens after modifying href attributes?
Given that it's a scraper that downloads websites, I guess after it's changed all of the href, src, etc attributes it converts the tree to a string again and e.g. saves it to disk? I don't want to assume too many things so I thought I'd ask here before making changes that aren't even needed or changes that miss the point of the feature request.
If that's the case, then it needs a bit more work than "just" changing Bytes to be copy-on-write. For example, HTMLTag::inner_html() right now simply returns the referenced substring, and changes to one of the attributes are not reflected.
Instead of directly returning the referenced substring, we'd need to do this similarly to how inner_text() works. That is, manually build a string from the tag's "components" (attributes) and recursively call inner_* on all of the contained subnodes.
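Something along these lines, as a simplified sketch (the types and fields here are stand-ins, not tl's actual API):

```rust
// Simplified node model for illustration only.
enum Node {
    Raw(String),
    Tag(HtmlTag),
}

struct HtmlTag {
    name: String,
    attributes: Vec<(String, Option<String>)>,
    children: Vec<Node>,
}

/// Build the HTML string from the node's components instead of returning
/// the original substring, so mutated attributes are reflected.
fn serialize(node: &Node, out: &mut String) {
    match node {
        Node::Raw(text) => out.push_str(text),
        Node::Tag(tag) => {
            out.push('<');
            out.push_str(&tag.name);
            for (key, value) in &tag.attributes {
                out.push(' ');
                out.push_str(key);
                if let Some(value) = value {
                    out.push_str("=\"");
                    out.push_str(value);
                    out.push('"');
                }
            }
            out.push('>');
            // Recurse so changes deeper in the tree show up as well.
            for child in &tag.children {
                serialize(child, out);
            }
            out.push_str("</");
            out.push_str(&tag.name);
            out.push('>');
        }
    }
}
```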

I guess after it's changed all of the href, src, etc attributes it converts the tree to a string again and e.g. saves it to disk?

Exactly

right now simply returns the referenced substring, and changes to one of the attributes are not reflected.

Yeah, I saw that. When I have more time, I would love to help you with both mutable nodes and serialization.

As far as I can tell, most of what was requested here is already implemented, right? Modifying href is even one of the examples.
But I would like to ask @y21 how extensive you plan to make the DOM manipulation features. Would it be possible at any point in time to also add new nodes to the DOM (e.g. nodes extracted from a different DOM)?

I would need this kind of feature for my fun project, and depending on whether you plan to implement those at all (and maybe even with some idea of "when"?), I would either just stick to tl or put a second layer of abstraction (e.g. "rctree") on top: using tl to build up the DOM and rctree to manipulate the heck out of it before writing it out as a string.

So any info you can share would be very appreciated.

commented

Yeah, most of the things in the DOM can be mutated now.

Would it be possible at any point in time to also add new nodes to the DOM

I think adding nodes to the DOM should be fairly easy to support. I can try to get it implemented later today.

I can try to get it implemented later today.

I greatly appreciate it 🙂

I wasn't so sure if that would be viable, as it appears to me that you rely on some properties of the InnerNodeHandle which might then no longer hold. In particular, some iteration (Children#all) and length-calculation (QueryIterable#len) code seemed to expect the InnerNodeHandle of the children to be strictly increasing, without skipping any value.

commented

Oh, that's true. I forgot that the query selector iterator relies on the value of the node ids. That indeed makes it more difficult than I thought.

A thought on this topic I would like to share:

Instead of making the model itself mutable through and through, maybe add special methods on the parser, inspired by CQRS. That way you keep the still super-fast parsing and iteration over the whole model, with as little overhead and copying as possible.
But then you also have a method to inject a list of Commands to manipulate the DOM.
All the commands are performed as a batch, which could take a bit of time and copy stuff around. Once it's done, all indices are re-calculated and you once again have your fast model for querying.
One step of this batch processing would be to give each and every node a new, higher ID. Those new IDs once again follow the current restrictions of strictly increasing values without gaps. You also keep a dictionary of mappings from old IDs to new IDs; if an old ID is not in that mapping, the node was deleted and get returns None.
(And checking whether an ID is old should be simple, assuming the root ID is always the lowest 'currently active' ID.)
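A very rough sketch of what such a command batch could look like (entirely hypothetical; none of these types exist in tl):

```rust
use std::collections::HashMap;

type NodeId = u32;

/// Hypothetical mutation commands, applied in one batch.
enum Command {
    SetAttribute { node: NodeId, name: String, value: String },
    InsertChild { parent: NodeId, html: String },
    Remove { node: NodeId },
}

struct Dom {
    // ... the parsed tree, with strictly increasing, gapless node IDs ...
    /// Maps IDs from before the last batch to the freshly assigned IDs.
    /// Absent keys mean the node was removed in that batch.
    id_remap: HashMap<NodeId, NodeId>,
}

impl Dom {
    /// Apply all commands at once, then renumber every node so the
    /// "strictly increasing, no gaps" invariant holds again.
    fn apply(&mut self, _commands: Vec<Command>) {
        // 1. perform the mutations (may copy substrings, insert/remove nodes)
        // 2. assign each surviving node a new, higher ID in document order
        // 3. record old -> new pairs in `id_remap`
        unimplemented!("illustration only")
    }

    /// Resolve a handle obtained before the batch; `None` means deleted.
    fn resolve(&self, old_id: NodeId) -> Option<NodeId> {
        self.id_remap.get(&old_id).copied()
    }
}
```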