3 m�da�B � @ s� d dl mZmZ d dlmZmZmZmZ yd dlm Z W n e k rP eZ Y nX d dlZddl mZmZmZmZ ddlmZmZmZ ddlmZ dd lmZmZ dd lmZmZmZmZm Z m!Z! ej"d�Z#e#j$ej%� ej&� Z'e'j(ej)d�� e#j*e'� de+e,e,e-ee ee e.e.ed� dd�Z/dee,e,e-ee ee e.e.ed� dd�Z0d e e,e,e-ee ee e.e.ed� dd�Z1d!e e,e,e-ee ee e.ed�dd�Z2dS )"� )�basename�splitext)�BinaryIO�List�Optional�Set)�PathLikeN� )�coherence_ratio�encoding_languages�mb_encoding_languages�merge_coherence_ratios)�IANA_SUPPORTED�TOO_BIG_SEQUENCE�TOO_SMALL_SEQUENCE)� mess_ratio)�CharsetMatch�CharsetMatches)�any_specified_encoding� iana_name�identify_sig_or_bom� is_cp_similar�is_multi_byte_encoding�should_strip_sig_or_bomZcharset_normalizerz)%(asctime)s | %(levelname)s | %(message)s� � 皙�����?TF) � sequences�steps� chunk_size� threshold�cp_isolation�cp_exclusion�preemptive_behaviour�explain�returnc / C s t | ttf�s tdjt| ����|s2tjtj � ntjtj � t| �}|dkrptjd� t t| dddg d�g�S |dk r�tjd d j|�� dd� |D �}ng }|dk r�tjd d j|�� dd� |D �}ng }||| kr�tjd|||� d}|}|dk�r|| |k �rt|| �}t| �tk } t| �tk} | �rDtjdj|�� n| �rZtjdj|�� g }|dk�rpt| �nd}|dk �r�|j|� tjd|� t� } g }g }d}d}d}t � }t| �\}}|dk �r�|j|� tjdt|�|� |jd� d|k�r|jd� �xJ|t D �]<}|�r*||k�r*�q|�r>||k�r>�q|| k�rL�q| j|� d}||k}|�ont|�}|d5k�r�|dk�r�tjd|� �qyt|�}W n* ttfk �r� tjd|� �wY nX yr| �r|dk�rt|dk�r�| dtd�� n| t|�td�� |d� n&t|dk�r&| n| t|�d� |d�}W nT t t!fk �r� } z2t |t!��sttjd|t|�� |j|� �wW Y dd}~X nX d}x |D ]}t"||��r�d}P �q�W |�r�tjd||� �qt#|dk�r�dnt|�|t|| ��}|�o|dk �ot|�|k }|�r&tjd|� tt|�d �} | d!k �rDd!} d}!g }"g }#�x@|D �]6}$| |$|$| � }%|�r�|dk�r�||% }%|%j$|d"d#�}&|�rB|$dk�rB| |$ d$k�rB|d%k�r�d%n|}'|�rB|&d|'� |k�rBxdt#|$|$d d6�D ]P}(| |(|$| � }%|�r|dk�r||% }%|%j$|d"d#�}&|&d|'� |k�r�P �q�W |"j|&� |#jt%|&|�� |#d7 |k�rr|!d7 }!|!| k�s�|�rX|dk�rXP �qXW |#�r�t&|#�t|#� })nd})|)|k�s�|!| k�r4|j|� tjd&||!t'|)d' d(d)�� |dd|gk�rt| ||dg |�}*||k�r|*}n|dk�r,|*}n|*}�qtjd*|t'|)d' d(d)�� |�s^t(|�}+nt)|�}+|+�r�tjd+j|t|+��� g },x4|"D ],}&t*|&d,|+�r�d-j|+�nd�}-|,j|-� �q�W t+|,�}.|.�r�tjd.j|.|�� |jt| ||)||.|�� ||ddgk�r(|)d,k �r(tjd/|� t || g�S ||k�rtjd0|� t || g�S �qW t|�dk�r |�sr|�sr|�r|tjd1� |�r�tjd2|j,� |j|� nd|�r�|dk�s�|�r�|�r�|j-|j-k�s�|dk �r�tjd3� |j|� n|�r tjd4� |j|� |S )8aD Given a raw bytes sequence, return the best possibles charset usable to render str objects. If there is no results, it is a strong indicator that the source is binary/not text. By default, the process will extract 5 blocs of 512o each to assess the mess and coherence of a given sequence. And will give up a particular code page after 20% of measured mess. Those criteria are customizable at will. The preemptive behavior DOES NOT replace the traditional detection workflow, it prioritize a particular code page but never take it for granted. Can improve the performance. You may want to focus your attention to some code page or/and not others, use cp_isolation and cp_exclusion for that purpose. This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32. z4Expected object of type bytes or bytearray, got: {0}r zXGiven content is empty, stopping the process very early, returning empty utf_8 str match�utf_8g F� Nz`cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : %s.z, c S s g | ]}t |d ��qS )F)r )�.0�cp� r* �/usr/lib/python3.6/api.py� <listcomp>X s zfrom_bytes.<locals>.<listcomp>zacp_exclusion is set. use this flag for debugging purpose. limited list of encoding excluded : %s.c S s g | ]}t |d ��qS )F)r )r( r) r* r* r+ r, b s z^override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.r z>Trying to detect encoding from a tiny portion of ({}) byte(s).zIUsing lazy str decoding because the payload is quite large, ({}) byte(s).Tz@Detected declarative mark in sequence. Priority +1 given for %s.zIDetected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.�ascii�utf_16�utf_32z[Encoding %s wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.z2Encoding %s does not provide an IncrementalDecoderg ��A)�encodingz9Code page %s does not fit given bytes sequence at ALL. %szW%s is deemed too similar to code page %s and was consider unsuited already. Continuing!zpCode page %s is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.� � �ignore)�errors� � zc%s was excluded because of initial chaos probing. Gave up %i time(s). Computed mean chaos is %f %%.�d � )Zndigitsz=%s passed initial chaos probing. Mean measured chaos is %f %%z&{} should target any language(s) of {}g�������?�,z We detected language {} using {}z0%s is most likely the one. Stopping the process.z[%s is most likely the one as we detected a BOM or SIG within the beginning of the sequence.zONothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.z#%s will be used as a fallback matchz&utf_8 will be used as a fallback matchz&ascii will be used as a fallback match> r/ r. ���r: ).� isinstance� bytearray�bytes� TypeError�format�type�logger�setLevel�loggingZCRITICAL�INFO�lenZwarningr r �join�intr r �infor �append�setr r �addr r �ModuleNotFoundError�ImportError�debug�str�UnicodeDecodeError�LookupErrorr �range�decoder �sum�roundr r r r r0 Zfingerprint)/r r r r r! r"