textualheritage.org

Zoran Kostic

Dear colleagues,
I think it is time to summarise our discussion about the definition of OCS in the PUA. Five years ago we started to make a Standard of OCS for the language, not for Unicode, with the idea of solving all present problems in the encoding (compatibility) of fonts and of enabling all kinds of users (ordinary people dedicated to OCS, monks, church people, students, different types of scholars, database users, makers of electronic dictionaries...) to use OCS fonts as simply as possible, taking into consideration the complex way of writing.

After five years, after one conference in Belgrade dedicated to this problem alone, many presentations at other conferences and a lot of discussion on the Web, a great majority of the scholars involved in this matter agreed on and adopted the Belgrade Standard as the Standard for the language.

The next step was to transfer our Standard of OCS into Unicode. As we agreed, Unicode rules do not allow us to register the full Standard in the official part of Unicode. Given this fact, we decided to make our Internal Unicode Standard in the PUA. So far there has been one complete proposal for the Internal Standard in the PUA, made by me, and a lot of different ideas about what we need in this Internal Standard.

Making such a PUA code table, after we have adopted the Standard for the language, is not part of Slavonic science at all. Some colleagues like to make decisions in a field which is not science but a craft for typographers and font makers. And, as usual, everybody has his own, different opinion. Instead of looking for what is missing from the table, they argue that we don't need all of it, or that there is too much, and so on. Most of them have a particular interest in their own field and do not see the big picture: that someone else has a different interest and needs other things.

There is a new religion of people dedicated to Unicode. When I ask them how they would do something, they mention special programs which no scholar uses. Scholars and other users (99.5%) use Word for typing OCS texts. With the present, and also the future, registration of OCS characters in Unicode, nobody is able to reproduce OCS writing. With "my system" it is possible, but again they are against it. Who is crazy, I don't know.
When we discuss these things we should be realistic.
When we say, "we need a system which will enable writing the full OCS Standard in a common program like Word, which most scholars use", this is realistic. Special programs are not realistic at all. Who will force users to learn and use such programs?
When we say, "we need contemporary Cyrillic, OCS and Greek in one font for databases which operate on the basis of one data field, one font only", this is realistic and is based on the present situation with commercial databases. They say we should not use such a font (most probably because this is against Unicode rules) and that we should alter the databases instead. This is not realistic. Do we have to ask big software companies to alter databases for us? As far as I know, FileMaker Pro (a database program) can have more than one font in a data field. The problem is that you cannot export these fields to other database programs. If we use "my system", which enables contemporary Cyrillic, OCS and Greek in one font, we have no further problem with databases, and this is realistic.
Moreover, they can take from "my system" whatever they need. Nobody is forced to take everything. If they think that one superscript letter is enough for all cases (small and capital letter, writing between two letters, writing with an offset to the left, writing with or without titlo), they can take only one code and produce the font they imagine. Nobody is stopping them. But why they want to stop me from producing one font which will cover at least three centuries and a full redaction, that is the question.

For all those who were against putting glyphs and ligatures in our Internal Standard in the PUA (and that is the natural place for them), the best example is the last post from Baranov: http://www.mufi.info/specs/MUFI-CodeChart-3-0-a.pdf.

In this file you may see exactly the same solution as in my proposal: PUA-Codes-BS-full with glyphs.pdf. They have a similar problem to ours, and their proposal is similar. So, if nobody has a better and more universal solution, let us concentrate on searching for errors in my proposal.

I will repeat: the problem is that there must be three systems, and we cannot avoid that:
1. Pure Unicode (for simple transliteration)
2. Pure PUA (besides transliteration, for databases and fine typography)
3. Mixture of Unicode and PUA (besides transliteration, for fine typography)
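
The three systems listed above can be told apart mechanically by looking at where the code points of a text fall. The following is only a minimal sketch: the function name `encoding_system` and the sample strings are illustrative assumptions, not part of the Belgrade proposal, and whitespace is ignored because the posts do not say how spaces are encoded.

```python
# Basic Multilingual Plane Private Use Area, as defined by Unicode.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def in_pua(ch: str) -> bool:
    """Return True if the character lies in the BMP Private Use Area."""
    return PUA_FIRST <= ord(ch) <= PUA_LAST


def encoding_system(text: str) -> str:
    """Classify a text as one of the three systems (hypothetical helper)."""
    chars = [c for c in text if not c.isspace()]
    pua_count = sum(in_pua(c) for c in chars)
    if pua_count == 0:
        return "pure Unicode"
    if pua_count == len(chars):
        return "pure PUA"
    return "mixed Unicode/PUA"


# The PUA code points below are placeholders, not real assignments.
print(encoding_system("\u0430\u0437\u044a"))  # pure Unicode
print(encoding_system("\ue000\ue001"))        # pure PUA
print(encoding_system("\u0430\ue001"))        # mixed Unicode/PUA
```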

Zoran Kostic

PUA codes run from E000 to F8FF, which means 6400 places for characters. The area is free for use without any limitation (that is why it is called Private).
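
The extent of the area can be verified against the Unicode character database that ships with Python. This is only an illustrative check: it counts the code points of the Basic Multilingual Plane whose general category is `Co` (private use). The helper name `private_use_code_points` is an assumption made for this sketch.

```python
import unicodedata


def private_use_code_points(limit: int = 0x10000) -> list[int]:
    """Collect code points below `limit` with general category Co (private use)."""
    return [cp for cp in range(limit) if unicodedata.category(chr(cp)) == "Co"]


bmp_pua = private_use_code_points()
print(f"first: U+{bmp_pua[0]:04X}")   # first: U+E000
print(f"last:  U+{bmp_pua[-1]:04X}")  # last:  U+F8FF
print(f"count: {len(bmp_pua)}")       # count: 6400
```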

You tried to find a place in the PUA which is not occupied by other users. It is not important how many users/groups use the same PUA code for their own glyph. One code can have (and already has) several tens of users. This is no obstacle at all, because it is font dependent. Everyone will have his own glyph in his own font.

Latin medieval characters are divided into different zones (subranges), as ours are. We can discuss whether there are enough zones or whether we have to make additional ones, and why.

Latin medieval characters include ligatures, from p. 88, as ours do.

Latin medieval characters include composite characters (from p. 145), which are limited in number. In our case this is not possible, because any letter can be lifted and put above or between two letters, with or without titlo, and this gives about 100,000 composite letters, including letters with diacritical signs. That is why we have all superscript letters for both capital and small letters.

Latin medieval characters include variant letter forms (glyphs), on p. 201, as ours do.

They put all medieval characters in the PUA, and that is logical. But the big difference between the Latin and Cyrillic cases is that they do not have one code for the contemporary and the old letter, as we have. So they have no problem having contemporary and old letters in one font. We do. That is why we must have the whole BS in the PUA, not only those characters that are not already registered in Unicode. It will not do any harm to anybody, but it will allow fonts for databases.

I hope this short explanation will help in further discussion.

Victor Baranov

Yes, the same principles.
I only wish to draw attention to two circumstances.
Of course, we can be indifferent to a possible clash with Chinese characters. But I propose placing the Latin characters and our characters in different ranges. Both are symbols of medieval Europe. I think there are manuscripts that use both. Accordingly, it would be right to be able to have one font for both sets of characters. If we knew of such a project for medieval Greek, we should avoid coinciding with it in the PUA as well. I think that finding three different PUA ranges for the three sets is possible and necessary.
Although Latin medieval characters had a significantly different style from modern ones, they are not repeated in the PUA. The reason is clear.
But again I propose a solution that has already been proposed:
those characters that are in Unicode are also placed in the PUA, but in a separate range. This is the decision we need.
In other words, all the characters will be in the PUA, but each type of character will have its own subrange.

Zoran Kostic

I can agree that it would be nice to avoid the same range. The problem is that they occupy 4685 places in the PUA out of the 6400 existing (E000 to F8FF). You may see that on p. 88, "Category 1: Base characters". We need about 5500 places, so we cannot put our characters in the remaining part of the PUA. It means that we would have to move to another place, such as F0000..FFFFD (Supplementary Private Use Area-A) or 100000..10FFFD (Supplementary Private Use Area-B).
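
The arithmetic behind this argument can be written out as a short check. The figures 4685 and 5500 are the ones quoted in the post; the range boundaries are those defined by Unicode. The variable names are illustrative only.

```python
# Range boundaries as defined by Unicode (first and last code point, inclusive).
BMP_PUA = (0xE000, 0xF8FF)
SUPPLEMENTARY_PUA_A = (0xF0000, 0xFFFFD)

# Figures taken from the post above.
MUFI_OCCUPIED = 4685  # places used by the Latin medieval characters
OCS_NEEDED = 5500     # approximate number of places needed for OCS


def range_size(bounds: tuple[int, int]) -> int:
    """Number of code points in an inclusive range."""
    first, last = bounds
    return last - first + 1


free_in_bmp_pua = range_size(BMP_PUA) - MUFI_OCCUPIED
fits_beside_mufi = OCS_NEEDED <= free_in_bmp_pua
fits_in_supplementary_a = OCS_NEEDED <= range_size(SUPPLEMENTARY_PUA_A)

print(free_in_bmp_pua)          # 1715
print(fits_beside_mufi)         # False
print(fits_in_supplementary_a)  # True
```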

"Although Latin medieval characters had a significantly different style from modern ones, they are not repeated in the PUA. The reason is clear."

I explained why. The Latin and Cyrillic medieval cases have similar, but not identical, problems. Please read carefully.
... But the big difference between the Latin and Cyrillic cases is that they do not have one code for the contemporary and the old letter, as we have (i.e. "a" and "az").

"But again I propose a solution that has already been proposed:
those characters that are in Unicode are also placed in the PUA, but in a separate range. This is the decision we need. In other words, all the characters will be in the PUA, but each type of character will have its own subrange."

Again, I don't understand why it is necessary. What will this concept improve? My table is clear and consistent, and each character can be sorted with a Unicode sort (there is no need for special sorting algorithms), which is important for databases.
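
The sorting point can be illustrated with a minimal sketch. It assumes, purely for illustration, that the letters are assigned consecutive PUA codes in alphabetical order; the four code points below are hypothetical and are not taken from the actual table. Under that assumption a plain code point sort, which is what Python's `sorted` does for strings, already gives alphabetical order.

```python
# Hypothetical PUA assignments in alphabetical order (not the real table).
PUA_CODES = {
    "az": "\ue000",
    "buky": "\ue001",
    "vede": "\ue002",
    "glagoli": "\ue003",
}
LETTER_NAMES = {code: name for name, code in PUA_CODES.items()}


def encode(letter_names: list[str]) -> str:
    """Build a PUA-encoded string from a list of letter names."""
    return "".join(PUA_CODES[name] for name in letter_names)


def decode(text: str) -> list[str]:
    """Turn a PUA-encoded string back into letter names for display."""
    return [LETTER_NAMES[ch] for ch in text]


words = [
    encode(["vede", "az"]),
    encode(["az", "buky"]),
    encode(["buky", "az"]),
]

# Plain code point ordering; no special collation algorithm involved.
for word in sorted(words):
    print(decode(word))
# ['az', 'buky']
# ['buky', 'az']
# ['vede', 'az']
```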

With your concept you will have, for example, 30 already registered superscript letters without titlo in one subrange1 (characters already registered in Unicode) and the remaining 70 in another subrange2 (characters not registered in Unicode). Tomorrow, when they register another 20 superscript letters from subrange2, what will you do? Your subrange2 will then contain both types of characters. Why make such a division, which is only temporary? There is no logic in that at all.

Victor Baranov

The division into ranges that I suggested is necessary so that those who will use both Unicode and the PUA can specify only those ranges of glyphs/characters that they need to use.
If some signs are moved from the PUA into Unicode, everyone should stop using them in the PUA.
If we want a standard, we must think about all situations and users, including those who did not agree to placing all the characters in the PUA. This is to ensure that our proposals have a good chance with everyone, which is why I suggested the two compromise solutions set out in previous posts.
Otherwise, only one group will be encoded.

Zoran Kostic

I don't understand what is unclear in my explanations and why Victor persists with a wrong idea/concept.

"The division into ranges that I suggested is necessary so that those who will use both Unicode and the PUA can specify only those ranges of glyphs/characters that they need to use."

When you make a font, you do not specify a certain range; you choose your characters and then assign Unicode codes to them, as I explained in my attachment Examples.pdf of 19 November: http://forum.textualheritage.org/download.php?id=5.

"If some signs are moved from the PUA into Unicode, everyone should stop using them in the PUA."

Yes, you should stop using them, but you must keep them for backward compatibility.
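
One way to read "keep them for backward compatibility" is that the old PUA code stays in the table and in the fonts, while texts can be converted to the official code point on demand. The sketch below assumes a hypothetical PUA code U+E100 that was used for a letter later registered as U+A661; only U+A661 is a real assignment, the PUA code and the function name are made up for illustration.

```python
# Hypothetical: U+E100 was used in the PUA before the letter was
# registered in Unicode as U+A661. Only U+A661 is a real assignment.
DEPRECATED_PUA_TO_UNICODE = {0xE100: 0xA661}


def migrate(text: str) -> str:
    """Replace deprecated PUA codes with their official Unicode code points.

    Characters without an entry in the table are left unchanged, so old
    texts that still use other PUA codes remain readable.
    """
    return text.translate(DEPRECATED_PUA_TO_UNICODE)


old_text = "\ue100\u0430"   # deprecated PUA code followed by Cyrillic "а"
new_text = migrate(old_text)
print([f"U+{ord(c):04X}" for c in new_text])  # ['U+A661', 'U+0430']
```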

"If we want a standard, we must think about all situations and users, including those who did not agree to placing all the characters in the PUA. This is to ensure that our proposals have a good chance with everyone, which is why I suggested the two compromise solutions set out in previous posts."

There is no compromise on the PUA code table. It is a good table in every sense. We can compromise on the contents of certain fonts as standard fonts for OCS. That is really important for Slavists.

This table DOES NOT stop you from using it for any kind of font you imagine. If you can give me one example (not many) of what is wrong or what you cannot do with this table, then I will start to think about where I made a mistake. I have shown you with one example what is wrong with your concept.

"Otherwise, only one group will be encoded."

I don't understand: what one group? All letters and superscript letters are, for example, in one group (subrange). It is logical. What is wrong with that?

Victor Baranov

Ranges. They are needed so that characters with different functions are handled correctly by various programs.
If there is a PUA range for the characters that are in Unicode, it is enough to state that this range contains the same characters as Unicode, rather than to give a table of correspondences for each character.
I do not understand why all alphabetic characters must be together in the same range. The order of the characters in the code table can be anything. I do not understand why only that solution must be the correct one. Dividing the characters into ranges on the principle of whether a character is in Unicode or not does not violate the idea of having all the characters in the PUA. With the division into ranges, the proposals become more attractive to a wide range of software for automated processing of Slavic medieval texts. Is that not what we seek?
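
One possible reading of a PUA range that "has the same characters as Unicode" without a per-character table is a mirror range at a fixed offset. The sketch below is only an interpretation: the post does not say that the mirror range keeps the Unicode order, and the start of the mirror range (U+F000) is a hypothetical value chosen for illustration.

```python
# The real Cyrillic block in Unicode.
CYRILLIC_FIRST, CYRILLIC_LAST = 0x0400, 0x04FF
# Hypothetical start of a mirror range inside the PUA (an assumption).
MIRROR_FIRST = 0xF000
OFFSET = MIRROR_FIRST - CYRILLIC_FIRST
MIRROR_LAST = CYRILLIC_LAST + OFFSET


def to_mirror(text: str) -> str:
    """Move Cyrillic-block characters into the mirror range; leave others alone."""
    return "".join(
        chr(ord(c) + OFFSET) if CYRILLIC_FIRST <= ord(c) <= CYRILLIC_LAST else c
        for c in text
    )


def from_mirror(text: str) -> str:
    """Move mirror-range characters back to the Cyrillic block."""
    return "".join(
        chr(ord(c) - OFFSET) if MIRROR_FIRST <= ord(c) <= MIRROR_LAST else c
        for c in text
    )


word = "\u0430\u0437\u044a"  # "азъ" in standard Unicode
mirrored = to_mirror(word)
print([f"U+{ord(c):04X}" for c in mirrored])  # ['U+F030', 'U+F037', 'U+F04A']
print(from_mirror(mirrored) == word)          # True
```

With such a rule a program needs only the two range boundaries, not a lookup table, which seems to be the convenience the post has in mind.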

The coincidence with the medieval Latin range in the PUA. If we agree on the previous question, this issue could be resolved as follows: (1) the non-combined main characters (letters, superscript letters, punctuation, ligatures, diacritics and some others; there are not many of these) are placed in a range that does not coincide with the Latin range; (2) combined and palaeographic characters are placed in a range of variants, which coincides with the Latin range.

I would like to say once again: what we are trying to do will be used only if it satisfies a broad range of users, software developers and others. I ask you to take into account the views of those who have great experience and authority in creating scripts and programs for automatic processing of medieval texts. See previous posts. Adopting the compromise proposals, which I support, offers hope that the opponents of the Belgrade proposal would use its new version. Keeping the Belgrade proposal without the necessary modifications will lead to its isolation.

Victor Baranov

Dear colleagues,

We have begun preparing proposals to the Unicode Consortium on two items:
- reversed Tse,
- superscript letters.

Examples of reversed Tse and the justification have been prepared.

Examples have been collected for the following superscript letters:
omega,
octal i (и),
decimal i (і),
epsilon,
single uk,
yery with ь as its left part,
yery with ъ as its left part,
yer (ъ),
yer' (ь),
broad e (anchor-shaped).

Points of the justification:
- Unicode version 5.1 includes superscript Cyrillic letters that were and still are used in printed editions of Church Slavonic texts;
- Slavic manuscripts of the 11th-17th centuries contain superscript letters that were not included in version 5.1;
- the superscript letters we propose to add perform the same functions as those already included in Unicode: abbreviating a word form, and signalling the end of a word form with omission of the final vowel (the spelling дво(р) instead of дворъ).

Urgent help is needed in finding in manuscripts the superscript forms of:
zelo,
broad o,
ocular o,
fert,
iotified e,
iotified little yus,
ksi,
psi,
izhitsa,
digraph uk (the ligature uk has already been found).

What is needed:
a copy of the page (sample file name: кси_ИпатЛет_17об.jpg),
full bibliographic details of the manuscript / edition,
the folio, page and line numbers.

And also: additional arguments for including superscript letters in Unicode.

The deadline is very tight: Thursday, 22 January.
Thank you in advance for your help.

Victor Baranov

This topic is closed, as the tasks set have been completed.
Reversed Tse has been registered in Unicode 6.0 as U+A660, U+A661.
Proposals for the composition and placement in the PUA of Cyrillic characters absent from the standard Unicode ranges have been published in the journal Scripta & e-Scripta:
Victor Baranov, David J. Birnbaum, Ralph Cleminson, Heinz Miklas, Achim Rabus. Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area // Scripta & e-Scripta : The Journal of Interdisciplinary Mediaeval Studies. Vol. 8-9. – Sofia : “Boyan Penev” Publishing Center ; Institute of Literature, BAS, 2010. – P. 9-26. – ISSN 1312-238X.
Work now needs to continue on an analogous (micro)standard for Glagolitic.