For a given memoryless source whose symbol probability vector is known, the optimal code can be obtained from the well-known Huffman algorithm. Clearly, if the symbol probabilities are not precisely known, the algorithm can no longer be applied directly. The problem addressed in this thesis is source coding in the absence of exact symbol-probability information, given partial prior "information" about the source features and a "finite" observed sequence of source output symbols.

One of the simplest methods for estimating the symbol probabilities from observed source output is the empirical frequency estimator, which assigns to each symbol a probability proportional to the number of times it occurs in the observed sequence. This estimator does not depend on any prior source information. The add- estimator is another method for estimating the symbol probabilities; it yields the code that is optimal with respect to average redundancy when the symbol probability vector has a Dirichlet prior distribution.

The criterion used in this study to evaluate code performance is the probability that the Huffman codes obtained from the estimated and the actual probability distributions do not coincide, which we call the error probability of estimation. Calculations show that in some cases, source coding based purely on prior information has a lower error probability than coding based on the empirical frequency estimator. Specifically, for monotone sources with 4, 5, and 6 symbols, if the length of the observed sequence is less than 62, 25, and 11, respectively, then choosing the code via the empirical frequency estimator has, on average, a higher error probability than source coding from prior information alone. This thesis introduces source-coding methods that minimize the error probability by using the prior information and the observed output sequence of the source simultaneously.
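The error-probability criterion described above can be sketched with a small Monte Carlo simulation: draw finite sequences from a known source, build a Huffman code from the empirical frequencies, and count how often it disagrees with the Huffman code built from the true distribution. This is an illustrative sketch only; the function names are hypothetical, and the mismatch test (comparing per-symbol codeword lengths) is a simplification that ignores ties among equally optimal codes, which the thesis's exact criterion would need to handle.

```python
import heapq
import random
from collections import Counter

def huffman_lengths(probs):
    """Return per-symbol codeword lengths of a binary Huffman code for `probs`."""
    # Each heap entry: (probability, tiebreak id, symbols merged into this node).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)  # two least-probable nodes
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                # every merge adds one bit to each symbol
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

def empirical_estimate(sample, alphabet_size):
    """Empirical frequency estimator: relative frequency of each symbol."""
    counts = Counter(sample)
    n = len(sample)
    return [counts[i] / n for i in range(alphabet_size)]

def mismatch_rate(true_probs, n, trials=2000, seed=0):
    """Estimate the probability that the Huffman code from an empirical
    estimate (sequence length n) differs from the true-distribution code."""
    rng = random.Random(seed)
    k = len(true_probs)
    true_len = huffman_lengths(true_probs)
    errors = 0
    for _ in range(trials):
        sample = rng.choices(range(k), weights=true_probs, k=n)
        est = empirical_estimate(sample, k)
        if huffman_lengths(est) != true_len:
            errors += 1
    return errors / trials
```

Under this simplified mismatch test, lengthening the observed sequence drives the empirical estimate toward the true probabilities, so `mismatch_rate` typically decreases with `n`, matching the trade-off against pure prior information discussed above.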
For example, for sources with 8 and 9 symbols, if the length of the observed sequence is less than 100, then one of the proposed methods reduces the error probability by at least 0.068 and 0.074, respectively, compared with the empirical frequency method.

Keywords: Source coding, Data compression, Optimality of the source code, Empirical frequency estimator.