There’s a lot more than just the t-linehead missing. All the “t” class IDs are stripped, and other markup too, e.g. t-byline. All of this is meaningful and should be preserved. And this is why standoff properties are such a good idea!
There are two issues at stake here. First, we need to preserve the integrity of the text metadata. Second, we need to have the text in a form that is useful. Too much inline metadata makes it basically unreadable.
One of the problems is that of overlapping hierarchies, i.e. the “t” line numbers don’t line up with the segment boundaries. Again, this is precisely the problem that standoff properties aim to solve.
It seems to me there are two approaches to solving this.
Use standoff properties. Take the metadata (or most of it) out of the text file and keep it in a separate file. The location of each tag is recorded in JSON as a segment number plus a glyph number within that segment.
Make two separate versions, both of which are segmented the same way. In one, the metadata is kept complete. In the second, the metadata is stripped. The second one is used for the translation. Before publishing, the translated segments are merged back into the first one.
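To make the first approach concrete, here is a minimal sketch. It assumes the tags are empty spans of the form `<span class="t-..." id="..."></span>`, which is only a stand-in for whatever the real files contain, and the function names and JSON field names are made up for illustration. It pulls the tags out of one segment, records where each one sat, and shows that the original can be rebuilt from the two parts.

```python
import json
import re

# Hypothetical tag shape; the real markup may well differ.
TAG = re.compile(r'<span class="(t-[a-z]+)" id="([^"]+)"></span>')

def extract_standoff(segment, marked_up):
    """Return (plain_text, records) for one marked-up segment."""
    plain_parts = []
    records = []
    pos = 0      # read position in the marked-up string
    glyph = 0    # number of plain-text glyphs emitted so far
    for m in TAG.finditer(marked_up):
        chunk = marked_up[pos:m.start()]
        plain_parts.append(chunk)
        glyph += len(chunk)
        records.append({"segment": segment, "glyph": glyph,
                        "class": m.group(1), "id": m.group(2)})
        pos = m.end()
    plain_parts.append(marked_up[pos:])
    return "".join(plain_parts), records

def restore(plain, records):
    """Re-insert the tags, to show that nothing was lost."""
    out = []
    pos = 0
    # sorted() is stable, so tags at the same glyph keep their order.
    for r in sorted(records, key=lambda r: r["glyph"]):
        out.append(plain[pos:r["glyph"]])
        out.append(f'<span class="{r["class"]}" id="{r["id"]}"></span>')
        pos = r["glyph"]
    out.append(plain[pos:])
    return "".join(out)

source = ('<span class="t-linehead" id="a03"></span>如是我聞'
          '<span class="t-linehead" id="a04"></span>一時')
plain, records = extract_standoff(1, source)
print(plain)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

The plain text is what the translator sees; the JSON goes in the separate file.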
The first solution is more elegant, but I could not say which is easier.
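For comparison, the merge step of the second approach might look something like the sketch below. It assumes both versions are held as dictionaries keyed by the same segment IDs; the IDs, the function name and the output shape are all hypothetical. The one thing it insists on is that the two versions really are segmented identically.

```python
def merge_translations(full, translated):
    """Attach each translated segment to its fully marked-up source segment.

    full:       {segment_id: source text with all metadata}
    translated: {segment_id: translated text}
    """
    missing = full.keys() - translated.keys()
    extra = translated.keys() - full.keys()
    if missing or extra:
        # The approach only works if the segmentation is identical.
        raise ValueError(
            f"segment mismatch: missing={sorted(missing)}, extra={sorted(extra)}")
    return {seg_id: {"source": full[seg_id], "translation": translated[seg_id]}
            for seg_id in full}

full = {
    "seg1": '<span class="t-linehead" id="a03"></span>如是我聞',
    "seg2": "一時",
}
translated = {"seg1": "Thus have I heard.", "seg2": "At one time"}
merged = merge_translations(full, translated)
print(merged["seg1"]["translation"])
```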
One problem with the second option is that it keeps the metadata tied to the original Chinese text. There’s no trivial way, so far as I can see, to then associate a particular segment of the translated text with a line number, unless you’re using the two-language view. Ideally, however, the metadata should be independent of any one version of the text. You should be able to ask, “What is the translation of line so-and-so?” Obviously, the tags cannot be placed in the translation precisely per glyph as they are in the original, but they can be placed per segment.
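Here is a rough sketch of what a per-segment lookup could look like, assuming line heads are stored as standoff records with an integer segment number and a glyph number, in document order. The record fields, line IDs and function names are invented for the example. A line is taken to run from its own line head up to the next one, so it may cover more than one segment, and a segment may belong to more than one line.

```python
def segments_for_line(line_id, lineheads, last_segment):
    """Return the segment numbers that a given line touches."""
    for i, rec in enumerate(lineheads):
        if rec["id"] != line_id:
            continue
        start = rec["segment"]
        if i + 1 < len(lineheads):
            nxt = lineheads[i + 1]
            # If the next line starts at glyph 0 of a segment, that
            # segment belongs wholly to the next line.
            end = nxt["segment"] if nxt["glyph"] > 0 else nxt["segment"] - 1
        else:
            end = last_segment
        return list(range(start, max(start, end) + 1))
    raise KeyError(line_id)

def translation_of_line(line_id, lineheads, translations):
    """Answer "what is the translation of line so-and-so?" per segment."""
    segs = segments_for_line(line_id, lineheads, max(translations))
    return [translations[s] for s in segs]

lineheads = [
    {"id": "a03", "segment": 1, "glyph": 0},
    {"id": "a04", "segment": 2, "glyph": 3},  # starts mid-segment
    {"id": "a05", "segment": 4, "glyph": 0},
]
translations = {1: "Thus have I heard.", 2: "At one time",
                3: "the Buddha was staying", 4: "at Savatthi."}
print(translation_of_line("a04", lineheads, translations))
```

Because line a04 begins partway through segment 2, that segment is returned for both a03 and a04. That is the price of working per segment rather than per glyph.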
Anyway, please have a think about this. It will take some time to get it right.