Data - Purging Best - Practices - FINAL
Data Purging
Best Practices & Guidelines
Version 0.2 FINAL
Data Purging Best Practices & Guidelines
Contents
Introduction
Technical Considerations
Purge Job
  Customer Retention
  Order Retention
  Custom Object Retention
Introduction
This document explains why data purging is needed, describes data purging techniques, and lists best
practices for achieving optimal response times. It provides techniques for determining whether a given
environment needs purging and covers the configuration of business object retention.
This document does not prescribe a single purging guideline that works for everyone. Instead, it describes
techniques you can use to determine the need for purging and the ways to purge the data. The data purging
approach differs with each customer’s business requirements; however, we provide a framework that helps
you work proactively on data volume and its impact on performance.
We also discuss best practices from a code and system design perspective, both to get optimal response
times and to build a system that supports data purging.
Technical Considerations
Demandware often receives questions such as “How many orders can live in the Demandware platform?” or
“How many customers can live in the Demandware platform?”
There are no simple answers to these questions because no two implementations are the same.
The Demandware platform allows customers to extend it by adding custom objects or attributes.
Custom objects and attributes are just one aspect of customization; another is custom logic or
integrations, which might impact the stability of the platform. Badly written logic can degrade
performance irrespective of the data volume.
Web site efficiency is influenced by many other factors, including integration with external systems, the
processing time spent on the web adaptor, application server, and database tiers, and the complexity of
the HTML pages the browser renders to create the customer experience.
The platform scales to meet traffic and order peaks most effectively when the web tier handles the majority
of transaction requests. As you design your site, it’s important to minimize the number of transactions that
pass through the web tier to the application tier and then from the application tier to the database.
A million customers of Client A are not equivalent to a million customers of Client B. If we know the
bottleneck in the system, we can adjust the data model to achieve acceptable performance far above the
current limit. The underlying database has no difficulty maintaining large sets of data, but the extension
model might introduce problems, which differ for every implementation.
From a scalability standpoint, it is important to purge obsolete or stale data periodically.
Object churn reports may help in identifying business objects that are purge candidates. An object churn
report provides insight into how much data is updated for every insert. It is a good idea to analyze the
business objects and determine their churn pattern over time.
If data is frequently updated, it might not be a good candidate for purging; if objects are not updated over
time and become stale, they might be good candidates for purging. However, no hard and fast rule applies
to every customer; it all depends on the data churn rate and the data volume for a given client.
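As a rough illustration of the churn analysis described above, the following Python sketch computes an
updates-per-insert ratio for each object type and flags types with a low ratio as possible purge
candidates. The counts, the object type names, and the 0.5 threshold are all hypothetical; the actual
churn report format and a sensible threshold depend on the implementation.

```python
from dataclasses import dataclass


@dataclass
class ChurnRow:
    """One row of a hypothetical churn summary (not the real report format)."""
    object_type: str
    inserts: int
    updates: int


def updates_per_insert(row: ChurnRow) -> float:
    """Return how many updates were recorded for every insert."""
    if row.inserts == 0:
        return 0.0
    return row.updates / row.inserts


def purge_candidates(rows, max_ratio=0.5):
    """Return object types whose data is rarely updated after creation.

    max_ratio is an illustrative threshold, not a platform recommendation.
    """
    return [r.object_type for r in rows if updates_per_insert(r) <= max_ratio]


# Made-up numbers for illustration only.
sample = [
    ChurnRow("Order", inserts=100_000, updates=20_000),
    ChurnRow("Profile", inserts=50_000, updates=400_000),
    ChurnRow("CustomObject.Analytics", inserts=900_000, updates=0),
]
candidates = purge_candidates(sample)
print(candidates)
```

A low ratio only suggests that objects go stale after creation; whether they can actually be purged
still depends on the business requirements.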
It is advisable to purge older orders. This helped some long-term clients considerably, as they kept only
the orders from the last 90 days on our platform. This led to fewer than 500k orders in their system at any
time, and searches, exports, etc. performed better. However, this may bring other disadvantages to your
online business. You might not be able to reliably set up promotions for recurring buyers, first-time
buyers, etc., if you only have a history of the most recent x days for each customer.
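The volume side of this trade-off can be estimated with simple arithmetic: in steady state, the number
of retained orders is roughly the average daily order count multiplied by the retention window. The
sketch below assumes a constant daily order rate, which real traffic will not follow; the function names
and the sample rate are hypothetical.

```python
def retained_orders(orders_per_day: float, retention_days: int) -> int:
    """Estimate the steady-state order count for a retention window.

    Assumes a constant daily order rate (a simplification).
    """
    return round(orders_per_day * retention_days)


def max_daily_rate(order_limit: int, retention_days: int) -> float:
    """Return the daily order rate that keeps volume at the given limit."""
    return order_limit / retention_days


# 5,000 orders per day kept for 90 days.
print(retained_orders(5_000, 90))
# Daily rate at which a 90-day window reaches 500k orders.
print(round(max_daily_rate(500_000, 90)))
```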
Purge Job
The problem is not so much having that many records in the database; it is really the query
performance. With the Search APIs for customers and orders (which use Elasticsearch to retrieve the data),
it is now feasible to maintain a high number of customer and order records in the database with
consistent response times. However, the Demandware platform is not designed as a system of record for
maintaining a very high volume of data. To achieve an optimal eCommerce experience, it is advised to
maintain only the required records in the database.
This document generally revolves around the customer and order objects; however, the concepts discussed
here are applicable to other object types as well, including but not limited to custom objects.
Customer Retention
Demandware cleans up obsolete customer data based on the settings merchants/administrators apply in
Business Manager.
You can configure the lifetime of the following data. In Business Manager, go to Administration >
Global Preferences > Retention Settings to specify retention settings.
• Product lists created by anonymous customers such as wish lists and gift registries.
• Inventory records that no longer reference products.
• Product price records that no longer reference products.
In addition, a system job called "PurgeObsoleteData" checks the last visited time stamp (and, if that is
not available, the last login and creation dates) to find customers that are eligible for removal according
to the retention preference configured in Business Manager.
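The eligibility check described above can be sketched as follows. This is not the platform's actual job
logic: the fallback order (last visit, then last login, then creation date) is an assumption based on the
description, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def reference_date(last_visited: Optional[datetime],
                   last_login: Optional[datetime],
                   created: datetime) -> datetime:
    """Pick the date used to judge staleness.

    Assumption: fall back from last visit to last login to creation date.
    """
    return last_visited or last_login or created


def is_eligible_for_purge(last_visited: Optional[datetime],
                          last_login: Optional[datetime],
                          created: datetime,
                          retention_days: int,
                          now: datetime) -> bool:
    """Return True when the customer has been inactive longer than the window."""
    cutoff = now - timedelta(days=retention_days)
    return reference_date(last_visited, last_login, created) < cutoff


now = datetime(2024, 1, 1)
# Never visited or logged in, created two years ago, 365-day retention.
stale = is_eligible_for_purge(None, None, datetime(2022, 1, 1), 365, now)
# Visited one month ago, 365-day retention.
active = is_eligible_for_purge(datetime(2023, 12, 1), None, datetime(2020, 1, 1), 365, now)
print(stale, active)
```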
Order Retention
In general, it is best to keep only as many objects in the database as needed. If you keep your orders in
a third-party OMS and don’t require them in Demandware, there is no reason not to purge them. Other
customers, however, may use Demandware as the system of record, or may want to display the order history
for the last 7 years to consumers. There might also be legal obligations forcing them to keep that data.
Demandware cleans up obsolete order data based on the settings merchants/administrators apply in
Business Manager. You can configure the lifetime of an order in Business Manager at the following path:
Merchant Tools > Site Preferences > Order > Order Preference
The system job “PurgeObsoleteData” uses the specified number of days as its configuration. Orders older
than the specified number of days are automatically removed from the system. Leave the field blank if
orders should never be purged from the system.
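The retention rule above can be modeled in a few lines. The sketch below is an illustration rather than
the job's implementation: it assumes order age is measured from the creation date, represents a blank
preference as None, and uses hypothetical names and sample data.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def orders_to_purge(orders: List[Tuple[str, datetime]],
                    retention_days: Optional[int],
                    now: datetime) -> List[str]:
    """Return the order numbers older than the retention window.

    orders is a list of (order_no, creation_date) pairs.
    retention_days=None models a blank preference: nothing is purged.
    """
    if retention_days is None:
        return []
    cutoff = now - timedelta(days=retention_days)
    return [order_no for order_no, created in orders if created < cutoff]


now = datetime(2024, 6, 1)
orders = [
    ("00001", datetime(2023, 1, 15)),
    ("00002", datetime(2024, 5, 20)),
]
with_retention = orders_to_purge(orders, 90, now)
blank_preference = orders_to_purge(orders, None, now)
print(with_retention, blank_preference)
```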
Best Practices
• Custom objects
The system stores custom attributes and localizable system attributes in the database in tables with the
system objects. You access these using a compound key of the attribute ID, the locale, and the
corresponding system object ID. Accessing these attributes in the database is expensive - especially as the
data set grows over time. Defining many custom attributes and creating object queries with many attribute
conditions impedes performance.
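To see why attribute access becomes expensive as data grows, consider a simplified model in which every
custom or localizable attribute value is a separate row keyed by object ID, attribute ID, and locale.
This is an illustrative model only, not the platform's actual schema; all names below are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Compound key: (object_id, attribute_id, locale) -> value
AttributeTable = Dict[Tuple[str, str, str], str]


def attribute_row_count(objects: int, attributes: int, locales: int = 1) -> int:
    """Rows needed if every object carries every attribute in every locale."""
    return objects * attributes * locales


def read_attribute(table: AttributeTable, object_id: str,
                   attribute_id: str, locale: str = "default") -> Optional[str]:
    """Look up one value by its full compound key."""
    return table.get((object_id, attribute_id, locale))


table: AttributeTable = {
    ("order-1", "giftMessage", "default"): "Happy birthday",
    ("order-1", "fraudScore", "default"): "12",
}
print(read_attribute(table, "order-1", "giftMessage"))
# One million orders with 40 custom attributes each.
print(attribute_row_count(1_000_000, 40))
```

In this model the attribute table grows with the product of objects, attributes, and locales, which is
why trimming attribute definitions pays off as much as trimming objects.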
The system processes custom objects similarly, so use custom objects judiciously, as well. For example, if
you implement custom analytics using custom objects, you’ll have to write to these objects for each
request. This implementation might seem fine in your sandbox, but on production, customers can generate
hundreds or thousands of these custom objects, degrading performance. A valid use of custom objects is
to store small, temporary data sets or data that an administrator manages using Business Manager. A
common use case for custom objects is to store temporary data to configure analytics integrations.
Be sure to access custom objects and system object attributes using primary keys rather than secondary
keys. You can determine the primary key for each object in Business Manager. The primary key is the
object’s attribute shown with the key icon.
The problems we face are mainly caused by the extension model, specifically the required join between
the order/customer table and the corresponding attribute table. The attribute table is easily 30 to 50
times bigger than the actual object table, and this is where the problems start. If you know and think
about this upfront, you can:
• Keep the number of distinct attribute definitions to the bare minimum and get rid of the ones not
required any longer
• Remove temporary attribute values (e.g. fraud check after order was processed)
• Re-use existing, but not required system (native) attributes as much as possible
• Think about the data access paths upfront and set up the attribute types and values accordingly
• Group attributes that you never need to query in a single "details" attribute stored as XML or JSON
in a text value (not indexed, no effect on search)
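The last point in the list above, grouping non-queried attributes into one "details" value, might look
like the following sketch. It uses plain Python dictionaries rather than platform objects, and the
attribute names are hypothetical.

```python
import json


def pack_details(attributes: dict, queryable: set) -> dict:
    """Split attributes into queryable ones and a single JSON "details" value."""
    kept = {k: v for k, v in attributes.items() if k in queryable}
    details = {k: v for k, v in attributes.items() if k not in queryable}
    # sort_keys keeps the serialized text stable across runs
    kept["details"] = json.dumps(details, sort_keys=True)
    return kept


def unpack_details(stored: dict) -> dict:
    """Restore the original flat attribute dictionary."""
    restored = {k: v for k, v in stored.items() if k != "details"}
    restored.update(json.loads(stored["details"]))
    return restored


order_attributes = {
    "externalOrderNo": "A-1001",   # needed in queries
    "giftWrapColor": "red",        # display only
    "packingNote": "fragile",      # display only
}
stored = pack_details(order_attributes, queryable={"externalOrderNo"})
# Three attribute values collapse into two stored values.
print(len(stored))
```

The trade-off is that values inside the "details" text can no longer be used in query conditions, so
this only suits attributes that are read together with the object and never searched on.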
Customer Search
A customer search may take place at various interaction points. It is important to understand that a
high data volume might drastically increase customer search response times or the number of search
timeouts.
For the purpose of this document, we discuss the recommended APIs and list the APIs to avoid.
You should use the searchProfiles() and processProfiles() methods of the CustomerMgr class for retrieving
and manipulating customer objects. These methods use the latest Search service. Elasticsearch nodes are
the actual search backbone for these new APIs.
Please note: the queryProfile() method will be deprecated in a future release because it fetches records
directly from the database. Performance is fine as long as you use indexed attributes for the query;
however, if you use a custom attribute, the database needs to join the tables, which might degrade
performance. Therefore, it is recommended to use the searchProfiles() and processProfiles() methods to get
consistent performance.
Order Search
performance of the storefront and/or other tools (Business Manager) - which accesses this data. This is
also why the recommendation (at this time) is to store large-scale historical order data external to the
Demandware system, and access it by user request – for performance/scalability reasons.
One option for customers who have no OMS, or whose OMS has no order web service for lookup, is CouchDB,
an open source document database with a REST API: http://couchdb.apache.org/