Releases: TeamMsgExtractor/msg-extractor
Version 0.46.2
v0.46.2
- Adjusted typing information on regular expressions. They were using a subscript that was added in Python 3.9 (apparently that is something the type checker doesn't check for), which made the module incompatible with Python 3.8. If you are using Python 3.9 or higher a version check will switch to the more specific typing.
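A version check of the kind described above might look like the following sketch. The alias name `StrPattern` and the sample pattern are assumptions made for illustration, not names taken from the module.

```python
import re
import sys

# re.Pattern only became subscriptable at runtime in Python 3.9, so the
# more specific alias is selected behind a version check.
if sys.version_info >= (3, 9):
    StrPattern = re.Pattern[str]
else:
    StrPattern = re.Pattern

# Hypothetical module-level pattern annotated with the alias.
HEADER_NAME: StrPattern = re.compile(r"^([A-Za-z-]+):")
```

On Python 3.8 the annotation falls back to the unsubscripted class, so importing the file no longer fails.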
Version 0.46.1
v0.46.1
- [TeamMsgExtractor #394] Fix typo in props that caused the wrong number of bytes to be given to a struct.
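As a general illustration of this class of bug (not the module's actual code), one way to guard against handing a struct the wrong number of bytes is to slice by the struct's own size. The record layout and the function name below are hypothetical.

```python
import struct

# Hypothetical 12-byte record: a 4-byte unsigned int followed by an
# 8-byte unsigned int, both little-endian.
RECORD = struct.Struct("<IQ")


def read_record(data: bytes, offset: int = 0):
    """Unpack one record, slicing exactly RECORD.size bytes."""
    chunk = data[offset:offset + RECORD.size]
    if len(chunk) != RECORD.size:
        raise ValueError(f"expected {RECORD.size} bytes, got {len(chunk)}")
    return RECORD.unpack(chunk)
```

Deriving the slice length from `RECORD.size` keeps the format string and the byte count from drifting apart.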
Version 0.46.0
v0.46.0
- [TeamMsgExtractor #95] Adjusted the `overrideEncoding` property of `MSGFile` to allow automatic encoding detection. Simply set the property to the string `"chardet"` and, assuming the `chardet` module is installed, it will analyze a number of the strings to try and form a consensus about the encoding. This will ignore the specified encoding only if it successfully detects one. Otherwise it will log a warning and fall back to the default behavior.
- [TeamMsgExtractor #387] Changed `extract_msg.utils.decodeRfc2047` to not throw decoding errors if the content given is not ASCII.
- [TeamMsgExtractor #387] Changed header parsing policy to `email.policy.compat32` to prevent partial parsing of quoted header fields.
- [TeamMsgExtractor #388] Updated documentation of `MSGFile.export` to specify that updated fields on an `MSGFile` instance (and its subclasses) will not be reflected in the result of the function. Many of the functions do use the newest version of a `cached_property`, but this is not one of them.
- Removed methods deprecated in `v0.45.0`.
- Changed the base class of `EntryID` from no base class to `abc.ABC`.
- Added `position` property to `EntryID` to tell how many bytes were used to create the `EntryID`.
- Added a number of properties to `MSGFile` from [MS-OXCMSG].
- Moved some properties down to `MessageBase` from its subclasses.
- Added support for Journal objects.
- Changed internal code of `PermanentEntryID` to correctly parse the data. Previously the distinguished name did not actually end at the null character, instead ending at the end of the bytes provided. If there was trailing data, it would be captured inadvertently.
- Finished definition for `StoreObjectEntryID`.
- Added new keyword arguments for MSG files: `dateFormat` and `datetimeFormat`. These allow the user to easily override the strings being used to format dates and dates that include a time component, respectively.
  - In unifying all the formats into 2 options, you may notice that some will look a bit different starting from this version, as there was an unfortunately large amount of variation.
- Fixed code for `MessageBase.parsedDate` which could have incorrect values.
- Fixed issues with `MessageBase.date` and related things either being incorrectly documented or doing things that are not specified by the documentation. It was supposed to have been changed to use `datetime` objects, but it was still using strings.
- Removed unused function `extract_msg.utils.isEmptyString`.
- Removed unused function `extract_msg.utils.properHex`.
- Added a `getStream` helper function to `CustomAttachmentHandler`.
- Added a `getStreamAs` helper function to `CustomAttachmentHandler`.
- Added new custom attachment handler for journal-associated attachments.
- Changed `EntryID.autoCreate` to return `None` if given `None` or empty bytes.
- Changed `EntryID.autoCreate` to raise a `FeatureNotImplemented` exception if no valid entry ID class is found.
- Fix typing annotations for `CustomAttachmentHandler`.
- Removed unneeded `imapclient` dependency.
- Changed `getJson` to have values be null if they aren't found rather than an empty string.
- Implemented the `getJson` method correctly for a number of classes.
- Changed `Task.percentComplete` to always return a float.
- Changed the `NotImplementedError` for custom attachment handler not being found to `FeatureNotImplemented`. Additionally, changed the error message to specify the CLSID found on the attachment to better enable people to report issues.
- Changed code for `Recipient` and `MessageBase` so that it relies on `MessageBase.recipientTypeClass` to determine the class to use for the `recipientType` property. Adjusted the typing of `Recipient` to have it reflect the type that will be used.
- Correctly changed the returned value for `ResponseStatus.fromIter` to actually return a List instead of a set.
- Filled out typing information for a significant portion of the module where variables or functions were missing it. This includes the entirety of the constants submodule.
- Corrected a number of minor issues.
- Extended values for `DVAspect` enum.
- Added new enums to go with parsing for `OLEPresentationStream`.
- Changed `NNTPNewsgroupFolderEntryID.newsgroupName` to bytes instead of string since it is ANSI.
- Fixed an issue that would cause headers to fail to parse properly if the header text starts with "Microsoft Mail Internet Headers Version 2.0", which is common on some MSG files. This is fixed by stripping that from the beginning before actually parsing the text. This is to circumvent CPython issue #85329, confirmed to still exist in at least some of the supported Python versions.
- Added `listDir` and `slistDir` as methods to `AttachmentBase`, `Recipient`, and `Named`. These always exclude the prefix, returning as if their directory is the root of the object. This allows the returned names to be directly used for accessing those files.
- Numerous spelling fixes in docstrings, comments, and exceptions.
- Reduced the amount of initialization performed by `MessageBase`. Much of this initialization was there from before a lot of stuff changed to `cached_property` and a number of internal variables were being used. Now all of the relevant variables will be initialized by the way they are accessed.
- Added new exception `DependencyError`.
- Changed the errors for missing optional dependencies from `ImportError` to `DependencyError`.
- Removed all instances of the `rawData` property in favor of the `toBytes` method. For now, many of these will simply return the raw data used, specifically those that are still unmodifiable. Any whose properties have the ability to be modified will have properly implemented versions. These classes also allow `None` to be passed as the value for their data, which will be the default if no arguments have been passed to the constructor. If no arguments or `None` is given as the data, it will create a new instance with default values. This is all in an effort to move towards the ability to create new MSG files and the `MSGWriter` class. All `toBytes` methods will either exclusively return `bytes` or will return `None` to specify that the structure isn't valid to convert to bytes. Structures that may be invalid will be annotated as `Optional[bytes]` for the return type.
  - Additionally, these objects will also support the `__bytes__` method. If the object returned is not bytes, the method will throw a `TypeError`.
- Removed the individual `PropBase` flag properties and changed the main `flags` property to return an enum containing the flags.
- Changed various data structs to allow modification and creation of new instances for writing to an MSG file.
- Changed `TZRule` to use unsigned values where applicable.
- Changed `TZRule` to require the 14 null bytes (I commented it out completely by accident instead of swapping it to a plain read). It now logs a warning if the bytes are not null.
- Removed unneeded function `windowsUnicode`.
- Moved `FixedLengthProperty.parseType` to the private API. This was not intended for external use anyway, so leaving it as public API didn't make sense.
- Fixed check for type in `ContactAddressEntryID` being the wrong value.
- Modified `inputToBytes` to support objects with the `__bytes__` method. If the method exists and works then it will be used as a last resort.
- Modified `OleWriter` to accept objects with a `__bytes__` method for the data to use for an entry.
- Added `__bytes__` method to `MSGFile`. This is equivalent to calling `MSGFile.exportBytes`.
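The `toBytes` and `__bytes__` conventions described in this release can be pictured with a small stand-in class. This is only a sketch: `ExampleStruct`, its four-byte layout, and the `valid` attribute are hypothetical and do not come from the module.

```python
from typing import Optional


class ExampleStruct:
    """Hypothetical structure following the toBytes/__bytes__ convention."""

    def __init__(self, data: Optional[bytes] = None):
        # Passing None (or nothing) creates an instance with default values.
        self.value = 0 if data is None else int.from_bytes(data[:4], "little")
        # Assumed flag marking whether the structure can be serialized.
        self.valid = True

    def toBytes(self) -> Optional[bytes]:
        """Return the serialized form, or None if the structure is invalid."""
        if not self.valid:
            return None
        return self.value.to_bytes(4, "little")

    def __bytes__(self) -> bytes:
        data = self.toBytes()
        if data is None:
            # bytes() would raise TypeError for a non-bytes result anyway;
            # raising here just gives a clearer message.
            raise TypeError("structure is not valid to convert to bytes")
        return data
```

With this shape, `bytes(ExampleStruct())` yields the default serialization, while code that needs to tolerate invalid structures can call `toBytes` and check for `None`.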
Version 0.45.0
v0.45.0
- BREAKING: Changed parsing of string multiple properties to remove the trailing null byte. This will cause the output of parsing them to differ.
- Updated typing information for some functions and classes.
- Fixed a bug with `MessageSignedBase.attachments` that would cause it to return None instead of an empty list if the number of normal attachments was 0 and the error behavior was set to ignore violations of the standard.
- Updated `MessageSignedBase.attachments` to use `functools.cached_property` instead of `property`.
- Fixed spelling errors in some exception strings.
- Made `NamedPropertyBase` a subclass of `abc.ABC`.
- Cleaned up some of the code for named properties to remove unused variables and remove inefficient code.
- Changed `PropBase` to be a subclass of `abc.ABC`.
- Added detailed versioning info to the README.
- Deprecated many private functions, including methods on many of the classes. Of primary note are `_getStream` and `_getStringStream`, which have been moved to the public API as `getStream` and `getStringStream`. Any deprecated functions still exist and will forward to a public API function if they are not being removed. Additionally, all internal usage of them has been removed. This change is one of the big preparations that is needed for the `1.0.0` release.
  - As mentioned, a number of these deprecated functions have been moved to the public API. It is recommended that you run tests with your code after enabling deprecation warnings to see what should be changed.
- Removed items deprecated in or before `0.42.0`.
- Changed the API for the private method `_genRecipient`. This is not intended for use outside of the module except for subclasses. The change removed the allowance of ints for the second argument, requiring that it be a valid enum type.
- Convert many enum types to `IntEnum`.
- Extended functionality of `PropertiesStore` to allow for integer property names and getting a property based on just the ID. You can also get a list of all properties that use a given ID.
- Added new function `PropertiesStore.getProperties` which gets a list of all properties matching the property ID. Return type is a list of `PropBase` instances.
- Added new function `PropertiesStore.getValue` which looks for the first matching `FixedLengthProp` and returns the value from it.
- Improved internal code related to getting a property with a potentially unknown type.
- Added a number of entirely new functions to the public API on `MSGFile`, `AttachmentBase`, `PropertiesStore`, and `Recipient` objects:
  - `getMultipleBinary`: Gets a multiple binary property as a list of `bytes` objects.
  - `getSingleOrMultipleBinary`: A combination of `getStream` and `getMultipleBinary` which prefers a single binary stream. Returns a single `bytes` object or a list of `bytes` objects.
  - `getMultipleString`: Gets a multiple string property as a list of `str` objects.
  - `getSingleOrMultipleString`: A combination of `getStringStream` and `getMultipleString` which prefers a single string stream. Returns a single `str` object or a list of `str` objects.
  - `getPropertyVal`: Shortcut for `instance.props.getValue` that allows new behavior to be added by overriding it.
  - `getNamedProp`: Shortcut for `instance.namedProperties.get((propertyName, guid), default)` that allows new behavior to be added by overriding it.
- Removed `Named._getStringStream` and `Named.sExists`. The named properties storage will always use regular streams and not string streams.
- Changed all `Named` methods to no longer have a prefix argument. The prefix should always be false since the named property mapping will only exist in the top level directory.
- Adjusted `tryGetMimeType` to allow any attachments whose `data` property would return a `bytes` instance.
- Changed internal code to use public API functions wherever possible. This includes making many private API functions use calls to the public API for getting bits of data.
- Fixed potential issue with `AttachmentBase.clsid` which had the potential to cause some attachments to fail to generate a CLSID.
- Outright removed or changed a significant portion of the private API. I have rarely, if ever, seen references to these parts, so this should cause you no issues. Some of these have also been moved to the public API, either identically or with changes, and the mapping is as such:
  - `_getNamedAs` -> `getNamedAs`: Changed to always require a conversion argument. If you were previously using it to plainly get a named property or to handle the property being None or a real value, you should use the return value of `getNamedProp` instead.
  - `_getPropertyAs` -> `getPropertyAs`: Same as above, use `getPropertyVal` instead for None or plain access.
  - `_getStreamAs` -> `getStreamAs`, `getStringStreamAs`: Once again, see above. Use `getStream` and `getStringStream`, respectively.
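The forwarding pattern for deprecated private functions described in this release could look roughly like the sketch below. `ExampleStore` is a hypothetical stand-in class, not the module's implementation; only the method names `getStream` and `_getStream` are taken from the notes above.

```python
import warnings


class ExampleStore:
    """Hypothetical container used only to illustrate the pattern."""

    def __init__(self, streams):
        self._streams = dict(streams)

    def getStream(self, name):
        """Public accessor; returns None if the stream is missing."""
        return self._streams.get(name)

    def _getStream(self, name):
        """Deprecated private alias that forwards to the public method."""
        warnings.warn(
            "_getStream is deprecated, use getStream instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.getStream(name)
```

To surface such warnings in your own test run, Python's standard `-W error::DeprecationWarning` command line option turns them into exceptions, which makes remaining uses of the old names easy to find.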
Version 0.44.0
v0.44.0
- Fixed a bug that caused `MessageBase.headerInit` to always return `False` after the 0.42.0 update.
- Changed `MessageBase.headerInit` to a property.
- Fixed `extract_msg.utils.__all__`.
- Minor reorganization within `extract_msg/utils.py`.
- Minor changes to docstrings.
- Minor README updates.
- Fix issue with folded header fields decoding incorrectly when given to `extract_msg.utils.decodeRfc2047`.
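For readers unfamiliar with folded header fields, the sketch below shows one way to unfold a value and decode RFC 2047 encoded words using only the standard library. It is not the code of `decodeRfc2047`; the function name `decode_folded_header` is an assumption for illustration.

```python
import re
from email.header import decode_header, make_header


def decode_folded_header(value: str) -> str:
    """Unfold a header value, then decode any RFC 2047 encoded words."""
    # A line break followed by a space or tab is a fold, not content,
    # so the break is removed and the whitespace kept.
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", value)
    return str(make_header(decode_header(unfolded)))
```

Unfolding first means the decoder sees the value as a single logical line, which is how RFC 5322 defines the field's content.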
Version 0.43.0
v0.43.0
- [TeamMsgExtractor #56] [TeamMsgExtractor #248] Added new function `MessageBase.asEmailMessage` which will convert the `MessageBase` instance, if possible, to an `email.message.EmailMessage` object. If an embedded MSG file on a `MessageBase` object is of a class that does not have this function, it will simply be attached to the instance as bytes.
- Changed imports in `message_base.py` to help with type checkers.
- Changed from using `email.parser.EmailParser` to `email.parser.HeaderParser` in `MessageBase.header`.
- Changed some of the internal code for `MessageBase.header`. This should improve usage of it, and should not have any noticeable negative changes. You may notice some of the values parse slightly differently, but this effect should be mostly suppressed.
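To give a sense of what a conversion to `email.message.EmailMessage` involves, here is a standard-library sketch that builds a message from plain values and attaches an embedded file as raw bytes. The helper `build_email` and its parameters are hypothetical; the module's `asEmailMessage` may work quite differently.

```python
from email.message import EmailMessage
from typing import Dict, List, Tuple


def build_email(
    headers: Dict[str, str],
    body: str,
    attachments: List[Tuple[str, bytes]],
) -> EmailMessage:
    """Assemble an EmailMessage from headers, a text body, and raw attachments."""
    msg = EmailMessage()
    for name, value in headers.items():
        msg[name] = value
    msg.set_content(body)
    for filename, data in attachments:
        # Unconvertible embedded items are attached as opaque bytes.
        msg.add_attachment(
            data,
            maintype="application",
            subtype="octet-stream",
            filename=filename,
        )
    return msg
```

Attaching as `application/octet-stream` mirrors the fallback described above, where an embedded MSG file that cannot be converted is simply carried along as bytes.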
Version 0.42.2
v0.42.2
- Fix bug in `AttachmentBase.mimetype` that would cause it to throw an error when accessed. This bug was introduced in `v0.42.0`.
Version 0.42.1
v0.42.1
- Fixed some constants being accessed with the wrong name (names were changed in reorganization).
- Removed unused regular expression.
Version 0.42.0
v0.42.0
- [TeamMsgExtractor #372] Changed the way that the save functions return a value. This makes the return value from all save functions much more informative, allowing a user to tell whether a file or a folder (or more than one) was saved by the function. It also guarantees that all classes from this module will return the relevant path(s) if data is actually saved.
- [TeamMsgExtractor #288] Added feature to allow attachment save functions to simply overwrite existing files of the same name. This can be done with the `overwriteExisting` keyword argument from code or the `--overwrite-existing` option from the command line.
- [TeamMsgExtractor #40] Added new submodule `custom_attachments`. This submodule provides an extendable way to handle custom attachment types, attachment types whose structure and formatting are not defined in the Microsoft documentation for MSG files. This includes a handler to at least partially cover support for Outlook images.
- [TeamMsgExtractor #373] Added the `encoding` submodule for encoding tasks, including proper support for Microsoft's implementation of CP950. This gets added to the codecs list as "windows-950".
  - Added infrastructure to make it easy to add variable-byte (up to two bytes) encodings and single-byte encodings.
  - Added the following encodings:
    - windows-874
    - x-mac-ce
    - x-mac-cyrillic
    - x-mac-greek
    - x-mac-icelandic
    - x-mac-turkish
- Fixed an issue in the save functions that left the possibility for the zip files to not end up closing if the save function created it and then had an exception.
- Added new property `AttachmentBase.clsid` which returns the listed CLSID value of the data stream/storage of the attachment.
- Changed internal behavior of `MSGFile.attachments`. This should not cause any noticeable changes to the output.
- Refactored code significantly to make it more organized.
- Changed the exports from the main module to only include an important subset of the module. For other items, you'll have to import the submodule that it falls under to access it. Submodules export all important pieces, so it will be easier to find.
- This includes having many modules be under entirely new paths. Some of these changes have been done with no deprecation, something I generally try to avoid. This is happening at the same time as the public API is significantly changing, which makes it more acceptable.
- Fixed `__main__` using the wrong enum for error behavior.
- Fixed `Named.get` being severely out of date (it's not used anywhere by the module, which is why it wasn't noticed).
- Fixed `Named.__getitem__` being entirely case-sensitive.
- Switched much of the internal code (and the `treePath` property of all classes that have it) to using `weakref.ReferenceType` to avoid hard cyclic references.
- Fixed `Recipient._getTypedStream` never returning a value.
- Added additional type hints in various places.
- Modified tests.py to only run if it is run as a file instead of imported.
- Changed `knownMsgClass` to a private function since it is explicitly not being exported by any part of the module.
- Removed unused function `getFullClassName`.
- Fixes to the HTML body when saving as HTML will no longer require the `preparedHtml`/`--prepared-html` option.
- Removed unused exceptions.
- Entirely reorganized the way attachments are initialized, including the class that will be used in various circumstances. Embedded MSG files, custom attachments, and web attachments will all use dedicated classes that are subclasses of `AttachmentBase`.
  - With this change, the way to specify a new `Attachment` class is to override the function used when creating attachments. This can be done by passing `attachmentInit = myFunction` as an option to `openMsg`. This function MUST return an instance of `AttachmentBase`.
- Added first implementation of web attachments. Saving is not currently possible, but basic relevant property access is now possible. Saving will not be stopped by this attachment if `skipNotImplemented = True` is passed to the save function.
- Changed the option to suppress `RTFDE` errors to fall under the `ErrorBehavior` enum. Usage of the original option will be allowable, but is being marked as deprecated. However, it is still a dedicated option from the command line.
  - Also fixed the option not properly ignoring some `RTFDE` errors, specifically the ones that it is normal for the module to throw.
- Removed some constants that are not used by the module.
- Updated to support `RTFDE` version `0.1.0`. Users encountering random errors from that module should find that those errors have disappeared. If you get errors from it still, bring up the issue on their GitHub.
- Fixed bug that would cause weird behavior if you gave an empty string as the path for an MSG file.
- Added support for `IPM.StickyNote`.
- Fixed an issue that would cause an MSG file to never close if an error happened during any of the `__init__` functions for MSG classes.
- Removed unneeded `chardet` dependency.
- Removed `Contact.__init__` as it didn't provide any unique behavior.
- Changed the documentation of `openMsg` to specify that it accepts all options recognized by `MSGFile` subclasses, allowing the doc string to not be modified every time one of them is changed.
  - Changed the documentation of various `__init__` methods to do the same thing.
- Added `dataType` property to `AttachmentBase` and `SignedAttachment` for checking the class that the data will be, if accessible. Returns `None` if the data is inaccessible, including because accessing it would throw an exception.
- Added new enum `InsecureFeatures` and option `insecureFeatures`. This option will allow certain features with security implications to be used for files that you trust. Currently the only feature it supports is the usage of `PIL`/`Pillow` to open and modify images. All features like this will be opt-in to reduce possible vulnerabilities.
- Modified all custom exceptions the module uses to derive from a single base class for better organization.
- Added new exceptions to handle some of the situations previously handled by base Python exceptions.
- Changed internal handling of the `prefix` option for `MSGFile.__init__` (and therefore `openMsg`). If you are not setting this manually, you should notice little difference.
- Made enums less strict and converted all using `fromBits` to be `IntFlag` enums.
- Fixed `CalendarBase.keywords` being blatantly incorrect (it was so bad I don't know how it slipped through).
- Fixed `Contact.gender` being blatantly incorrect.
- Fixed sender not being properly decoded in some circumstances.
- Changed behavior of `MSGFile` to have `olefile` raise defects of type `DEFECT_INCORRECT` and above instead of just `DEFECT_FATAL`. Uncaught issues of `DEFECT_INCORRECT` can often cause the module to have parsing issues that may be misleading; this just ensures the issue is clarified. This behavior can be reverted to the previous one with `ErrorBehavior.OLE_DEFECT_INCORRECT`.
- Fixed potential issues that may have made it possible for certain attachments to ignore filename conflict resolution code.
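The switch to weak references for tree paths, mentioned in this release, can be illustrated with a small sketch. `Node` and `tree_path` are hypothetical names; the module's own classes and its `treePath` property are more involved.

```python
import weakref
from typing import List, Optional


class Node:
    """Hypothetical tree node whose path holds only weak references."""

    def __init__(self, name: str, parent: Optional["Node"] = None):
        self.name = name
        parent_path = parent.tree_path if parent is not None else []
        # Each entry is a weak reference, so a child never keeps its
        # ancestors alive and no hard reference cycle is formed.
        self.tree_path: List[weakref.ReferenceType] = parent_path + [
            weakref.ref(self)
        ]
```

Calling an entry, for example `child.tree_path[0]()`, returns the referenced object, or `None` once that object has been garbage collected, so callers are expected to check the result before using it.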
Version 0.41.5
v0.41.5
- Fixed an issue from version `0.41.3` where the header being present but missing the `From` field would cause an exception.