Tariff classification under the Harmonized System (HS) is one of the most knowledge-intensive tasks in international trade. With over 5,000 six-digit HS codes globally, and country-specific extensions pushing the number to over 18,000 in the US Harmonized Tariff Schedule alone, accurate classification requires deep expertise in product characteristics, trade rules, and regulatory interpretations. Human classifiers, even experienced ones, can disagree on the correct classification for ambiguous products. AI and machine learning models offer the potential to standardize and accelerate this process, but how well do they actually perform?
Classification accuracy is typically measured at different levels of the HS hierarchy. At the two-digit chapter level, modern AI models routinely achieve accuracy rates above 95%. This is relatively straightforward because chapters represent broad product categories (e.g., Chapter 84 for machinery, Chapter 61 for knitted apparel). At the four-digit heading level, accuracy drops to roughly 88-93%, depending on the model and the product domain. The real challenge comes at the six-digit subheading and the eight- or ten-digit national tariff line level, where accuracy ranges from 75% to 88% in the best-performing systems.
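The hierarchy-level accuracy described above can be computed by truncating predicted and true codes before comparing them. The codes below are illustrative, not real classifications:

```python
# Sketch: accuracy at the 2-, 4-, and 6-digit HS levels, computed by
# truncating codes before comparison. Example codes are illustrative.

def accuracy_at_level(predicted, actual, digits):
    """Share of predictions matching the true code in the first `digits` digits."""
    pairs = list(zip(predicted, actual))
    hits = sum(1 for p, a in pairs if p[:digits] == a[:digits])
    return hits / len(pairs)

predicted = ["846291", "610910", "392690", "846299"]
actual    = ["846291", "610990", "392640", "852990"]

for level in (2, 4, 6):
    print(f"{level}-digit accuracy: {accuracy_at_level(predicted, actual, level):.2f}")
```

Note how accuracy can only decrease (or stay flat) as more digits are required, which is exactly the pattern the benchmarks above show.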
A commonly cited benchmark is that experienced human classifiers agree with each other approximately 85-92% of the time at the 6-digit level. AI models that approach or exceed this range are performing at or above human parity for routine classifications.
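The human-parity benchmark is just a pairwise agreement rate. A minimal sketch, with illustrative codes chosen so the result falls inside the cited range:

```python
# Sketch: 6-digit agreement rate between two human classifiers, the
# benchmark AI models are compared against. Codes are illustrative.

def agreement_rate(codes_a, codes_b):
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

classifier_a = ["640399", "640391", "610910", "847130", "847141",
                "847149", "845710", "852990", "610990", "392690"]
classifier_b = ["640399", "640391", "610910", "847130", "847150",
                "847149", "845710", "852990", "610990", "392690"]

print(f"6-digit agreement: {agreement_rate(classifier_a, classifier_b):.0%}")
```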
Several factors influence how well an AI classification model performs. Training data quality is paramount: models trained on millions of real customs declarations with validated classifications significantly outperform those trained on product catalogs or general text descriptions. Product domain specificity also matters. Models fine-tuned for specific industries, such as chemicals, textiles, or electronics, tend to outperform general-purpose models within those domains. The input data quality is equally important: a detailed product description with material composition, intended use, and technical specifications will yield far better results than a vague one-line description.
The emergence of large language models (LLMs) has introduced a new paradigm in customs classification. Unlike traditional machine learning models that learn statistical patterns from labeled classification data, LLMs can reason about product descriptions, interpret classification rules, and apply General Rules of Interpretation (GRI) in a manner that more closely resembles human expert reasoning. When combined with retrieval-augmented generation (RAG) that feeds the model relevant tariff schedule sections and classification rulings, LLMs have shown significant improvements in handling edge cases and ambiguous classifications.
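The RAG pattern described above can be sketched as follows. The tariff excerpts, the keyword-overlap retrieval, and the prompt wording are all hypothetical placeholders; a production system would retrieve from the full tariff schedule and rulings database (typically with embeddings) and send the assembled prompt to an actual LLM API:

```python
# Hedged sketch of retrieval-augmented classification: retrieve relevant
# tariff excerpts for a product description, then assemble an LLM prompt.
# Excerpts and retrieval logic are illustrative assumptions.

TARIFF_EXCERPTS = [
    "6109: T-shirts, singlets and other vests, knitted or crocheted",
    "6110: Jerseys, pullovers, cardigans, knitted or crocheted",
    "6205: Men's or boys' shirts, not knitted or crocheted",
]

def retrieve(description, corpus, k=2):
    """Naive keyword-overlap retrieval; real systems use vector embeddings."""
    words = set(description.lower().split())
    scored = sorted(corpus, key=lambda t: -len(words & set(t.lower().split())))
    return scored[:k]

def build_prompt(description):
    context = "\n".join(retrieve(description, TARIFF_EXCERPTS))
    return (
        "Classify the product below to a 6-digit HS code, applying the "
        "General Rules of Interpretation.\n\n"
        f"Relevant tariff excerpts:\n{context}\n\n"
        f"Product: {description}"
    )

print(build_prompt("knitted cotton t-shirts for men"))
```

Grounding the model in the retrieved tariff text is what lets it cite the relevant heading language instead of relying on memorized patterns alone.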
For importers considering AI classification tools, the key question is not whether the AI is perfect, but whether it improves upon their current process. If your current classification process relies on a single customs broker or an internal team without systematic quality controls, an AI tool that achieves 85% accuracy at the 6-digit level while flagging low-confidence classifications for human review will likely improve both accuracy and consistency. The most effective implementations use AI as a first-pass classifier that handles routine products automatically while routing complex or ambiguous items to human experts.
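The first-pass triage pattern described above is straightforward to implement once the model emits a confidence score per classification. The threshold and the (code, confidence) outputs below are illustrative assumptions:

```python
# Sketch: accept high-confidence AI classifications automatically and
# route the rest to a human reviewer. Threshold and predictions are
# illustrative, not recommendations.

CONFIDENCE_THRESHOLD = 0.90

def triage(items):
    auto, review = [], []
    for description, code, confidence in items:
        if confidence >= CONFIDENCE_THRESHOLD:
            auto.append((description, code))
        else:
            review.append((description, code, confidence))
    return auto, review

predictions = [
    ("stainless steel bolts, M6", "731815", 0.97),
    ("smart ring with heart-rate sensor", "851762", 0.62),
    ("knitted cotton t-shirt", "610910", 0.94),
]

auto, review = triage(predictions)
print(f"auto-classified: {len(auto)}, flagged for review: {len(review)}")
```

In this sketch the routine bolts and t-shirt pass through automatically, while the novel smart ring, exactly the kind of ambiguous product the article flags, goes to a human expert.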
When evaluating AI classification solutions, ask vendors for accuracy metrics at each HS level, broken down by product category. Request a pilot or proof of concept using your actual product data, not generic test sets. Look for features like confidence scoring (which indicates how certain the model is about each classification), support for multiple country tariff schedules, and the ability to incorporate your historical classification data as additional training input. Also evaluate how the system handles updates when tariff schedules change, which happens annually in most countries.
Ask any AI classification vendor: What is your accuracy at the 6-digit and 10-digit level? How was it measured? What training data do you use? How do you handle low-confidence classifications? How quickly do you incorporate tariff schedule updates?
AI customs classification is improving rapidly. As models are trained on larger and more diverse datasets, and as LLM reasoning capabilities continue to advance, we can expect accuracy at the national tariff line level to approach and eventually exceed human expert performance for most product categories. However, the most complex classifications, particularly those involving novel products, multi-material compositions, or competing classification interpretations, will continue to require human expertise for the foreseeable future. The winning strategy is not AI versus humans, but AI augmenting humans to achieve both speed and accuracy.
Camtom Team