Backport #28935 by @silverwind
The `ToUTF8*` functions were stripping BOM, while BOM is actually valid
in UTF8, so the stripping must be optional depending on use case. This
does:
- Add a options struct to all `ToUTF8*` functions, that by default will
strip BOM to preserve existing behaviour
- Remove `ToUTF8` function, it was dead code
- Rename `ToUTF8WithErr` to `ToUTF8`
- Preserve BOM in Monaco Editor
- Remove a unnecessary newline in the textarea value. Browsers did
ignore it, it seems but it's better not to rely on this behaviour.
Fixes: https://github.com/go-gitea/gitea/issues/28743
Related: https://github.com/go-gitea/gitea/issues/6716 which seems to
have once introduced a mechanism that strips and re-adds the BOM, but
from what I can tell, this mechanism was removed at some point after
that PR.
Co-authored-by: silverwind <me@silverwind.io>
(cherry picked from commit b8e6cffd31)
Change all license headers to comply with REUSE specification.
Fix #16132
Co-authored-by: flynnnnnnnnnn <flynnnnnnnnnn@github>
Co-authored-by: John Olheiser <john.olheiser@gmail.com>
Our character detection algorithm can potentially incorrectly detect utf-8 as iso-8859-x
if there is a truncated character at the end of the partially read file.
This PR changes the detection algorithm to truncated utf8 characters at the end of the
buffer.
Fix #19743
Signed-off-by: Andrew Thornton <art27@cantab.net>
TestToUTF8WithFallback is the cause of recurrent spurious test failures
even despite code to set the detected charset order.
The reason why this happens is because the preferred detected charset order
is not being initialised for these tests.
This PR simply ensures that this is set at the start of each test and would
allow different tests to be written to allow differing orders.
Replaces #12571
Close #12571
Signed-off-by: Andrew Thornton <art27@cantab.net>
* Fix chardet test and add ordering option
Signed-off-by: Andrew Thornton <art27@cantab.net>
* minor fixes
Signed-off-by: Andrew Thornton <art27@cantab.net>
* remove log
Signed-off-by: Andrew Thornton <art27@cantab.net>
* remove log2
Signed-off-by: Andrew Thornton <art27@cantab.net>
* only iterate through top results
Signed-off-by: Andrew Thornton <art27@cantab.net>
* Update docs/content/doc/advanced/config-cheat-sheet.en-us.md
* slight restructure of for loop
Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
* Convert files to utf-8 for indexing
* Move utf8 functions to modules/base
* Bump repoIndexerLatestVersion to 3
* Add tests for base/encoding.go
* Changes to pass gosimple
* Move UTF8 funcs into new modules/charset package