Sound Familiar?

3 PM. Your phone rings, right on schedule.

“I posted an announcement on the website, but the text looks weird. There are question marks everywhere.”

The person on the other end insists they “just typed it normally, nothing special.” But if you’ve dealt with Korean systems long enough, you already know where this is going. A few more questions confirm it: they copied and pasted from an HWP document directly into the web editor. It looked fine in the preview. But after saving, this is what appeared on screen:

Course Registration ? How to Apply
Tuition ?150,000 KRW (installment available)

What is HWP?

HWP (Hangul Word Processor) is a word processing application developed by Hancom, a South Korean software company. It holds a near-monopoly in South Korea’s public sector — government agencies, public institutions, schools, and many large enterprises use it as their primary document format. Think of it as Korea’s Microsoft Word, only far more entrenched in official settings.

If you work with Korean organizations or government clients, encountering HWP files is essentially unavoidable. And HWP documents are well known for containing Unicode special characters that fall outside the safe range for non-Unicode database columns — which is exactly what this post is about. Working with Korean institutions means working with HWP, and working with HWP means this issue will eventually find you.

A webpage where special characters appear as '?' — the result of saving HWP special characters into a VARCHAR column
A webpage where special characters appear as '?' — the result of saving HWP special characters into a VARCHAR column
Screenshot: codeslog

The middle-dot character that was clearly in the original text had been replaced with ? somewhere along the way.

Checking the database…

sql
SELECT content FROM notices WHERE id = 123456;
-- Result: Course Registration ? How to Apply

Sure enough, the data was already stored as ?. The corruption happened at save time.

The HWP file is innocent. The network is innocent. The browser is innocent.
The suspect was the VARCHAR column all along.

This post explains why this happens and how to fix it, step by step.


Which Characters Are the Problem?

We tested a set of characters commonly used in HWP documents — middle dots, quotes, and dashes — against a VARCHAR column. The results varied:

Middle Dot Variants

데이터 표
CharUnicodeNameVARCHAR Result
U+30FBKatakana Middle Dot?
U+2022Bullet?
·U+00B7Middle Dot✅ OK
˙U+02D9Dot Above✅ OK

Single Quote Variants

데이터 표
CharUnicodeNameVARCHAR Result
'U+2018Left Single Quotation Mark✅ OK
'U+2019Right Single Quotation Mark✅ OK
U+2032Prime✅ OK
´U+00B4Acute Accent✅ OK

Double Quote Variants

데이터 표
CharUnicodeNameVARCHAR Result
"U+201CLeft Double Quotation Mark✅ OK
"U+201DRight Double Quotation Mark✅ OK
U+2033Double Prime✅ OK
˝U+02DDDouble Acute Accent✅ OK

Only (Katakana Middle Dot) and (Bullet) came out as ?.

What do those two have in common? They are not included in the server’s code page.


The Root Cause: Code Pages and Unicode

What Is a Code Page?

When a computer stores text, it ultimately converts characters into numbers (bytes). The mapping table that defines “which character maps to which number” is called a code page.

In Korean-language environments, the most common code page is CP949 (an extension of EUC-KR). For example: 0xB0A1, 0xB3AA.

The problem is that characters not listed in this mapping table simply cannot be stored. SQL Server silently replaces them with ?.

  • (U+30FB, Katakana Middle Dot) → not in CP949 → ?
  • (U+2022, Bullet) → not in CP949 → ?
  • · (U+00B7, Middle Dot) → included in KS X 1001 character set → stored correctly

What Is Unicode?

Unicode is a universal standard that assigns a unique number to virtually every character in the world. = U+AC00, A = U+0041, = U+30FB, and so on.

MSSQL’s NVARCHAR stores text using UTF-16LE encoding, which is Unicode-based. Because it doesn’t rely on code pages, any special character copied from HWP can be stored without corruption.

Conceptual diagram comparing CP949 and Unicode — characters missing from CP949 (• ・) are replaced with ?, while NVARCHAR's Unicode encoding stores all characters safely
Conceptual diagram comparing CP949 and Unicode — characters missing from CP949 (• ・) are replaced with ?, while NVARCHAR's Unicode encoding stores all characters safely
Image: Generated with Nanobanana AI

VARCHAR vs NVARCHAR: What’s the Difference?

데이터 표
FeatureVARCHARNVARCHAR
Storage methodServer code page (CP949, etc.)Unicode (UTF-16LE)
Bytes per character1 byte (ASCII), 2 bytes (Korean)2 bytes for all (4 bytes for surrogate pairs)
Storage exampleVARCHAR(100) = up to 100 bytesNVARCHAR(100) = up to 200 bytes
Characters outside code pageCorrupted to ?Stored correctly
Recommended forASCII-heavy data (e.g., English-only logs)Korean, special characters, multilingual data

The Storage Size Misconception

Since NVARCHAR uses 2 bytes per character, it takes up twice the storage for the same length. This leads some legacy systems to stick with VARCHAR in the name of “saving space.”

But given modern storage costs, this difference is negligible for user-facing data.

Bytes saved with VARCHAR vs. late nights avoided with NVARCHAR — which matters more?

VARCHAR vs NVARCHAR comparison — VARCHAR risks data corruption due to code page limitations, NVARCHAR safely stores all characters using Unicode
VARCHAR vs NVARCHAR comparison — VARCHAR risks data corruption due to code page limitations, NVARCHAR safely stores all characters using Unicode
Image: Generated with Nanobanana AI

How to Fix It

Step 1: Change the Column Type to NVARCHAR

sql
ALTER TABLE notices
ALTER COLUMN content NVARCHAR(4000);

Note: Data already stored as ? cannot be recovered by changing the column type. The original content must be re-entered.

Step 2: Add the N Prefix to Query Literals

This step is easy to miss. Even if the column is NVARCHAR, string literals without the N prefix are still processed through the code page.

sql
-- ❌ May still result in ?
INSERT INTO notices (content) VALUES ('Course Registration ・ How to Apply');

-- ✅ Correct
INSERT INTO notices (content) VALUES (N'Course Registration ・ How to Apply');

-- Same applies to UPDATE
UPDATE notices SET content = N'Tuition ・ 150,000 KRW' WHERE id = 42;

The N'...' notation explicitly tells SQL Server: “treat this string as Unicode.”

Handling It in the Application Layer

Beyond writing raw SQL, check the parameter type when using ADO.NET or similar libraries.

csharp
// ❌ Binding as SqlDbType.VarChar may cause corruption
cmd.Parameters.Add("@content", SqlDbType.VarChar).Value = content;

// ✅ Explicitly use NVarChar
cmd.Parameters.Add("@content", SqlDbType.NVarChar).Value = content;

Most modern ORMs (like Entity Framework) automatically map string types to NVARCHAR, but always double-check when working with legacy code or raw ADO.NET.


When You Can’t Change the DB: Server-Side Character Replacement

Switching to NVARCHAR is the cleanest fix, but it’s not always immediately feasible. Legacy systems, third-party solutions, or tight deployment schedules can make schema changes difficult. In those cases, replacing problematic characters on the server side before saving is a practical alternative.

Think of it as placing an interpreter between HWP and your database. When arrives from an HWP document, the interpreter swaps it for the safe · before handing it off to the DB.

Replacement Principles

  • Prefer a visually similar character when one exists (e.g., ·)
  • When no close match is available, use the nearest ASCII equivalent (e.g., -)

C# Example

csharp
private static readonly Dictionary<char, string> HwpCharReplacements = new()
{
    { '\u30FB', "·" },  // Katakana Middle Dot → Middle Dot (U+00B7)
    { '\u2022', "·" },  // Bullet → Middle Dot
    { '\u2027', "·" },  // Hyphenation Point → Middle Dot
    { '\u2014', "-" },  // Em Dash → Hyphen
    { '\u2013', "-" },  // En Dash → Hyphen
    { '\u2010', "-" },  // Hyphen (U+2010) → Hyphen-Minus
    { '\u2015', "-" },  // Horizontal Bar → Hyphen
    { '\u2192', "->" }, // Right Arrow
    { '\u21D2', "=>" }, // Double Right Arrow
    { '\u3000', " " },  // Ideographic Space → Regular Space
    { '\uFF01', "!" },  // Fullwidth Exclamation Mark
    { '\uFF08', "(" },  // Fullwidth Left Parenthesis
    { '\uFF09', ")" },  // Fullwidth Right Parenthesis
    { '\uFF1A', ":" },  // Fullwidth Colon
};

public static string ReplaceUnsafeChars(string input)
{
    if (string.IsNullOrEmpty(input)) return input;

    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (HwpCharReplacements.TryGetValue(c, out var replacement))
            sb.Append(replacement);
        else
            sb.Append(c);
    }
    return sb.ToString();
}

Call it before saving:

csharp
string sanitized = ReplaceUnsafeChars(userInput);
cmd.Parameters.Add("@content", SqlDbType.VarChar).Value = sanitized;

To check upfront whether a string contains characters that CP949 can’t handle:

csharp
public static bool IsVarCharSafe(string input)
{
    var cp949 = Encoding.GetEncoding(949);
    byte[] bytes = cp949.GetBytes(input);
    return cp949.GetString(bytes) == input;
}

If the round-tripped string differs from the original, there are characters that will be corrupted to ?.

Python Example

For Python users connecting to MSSQL via pyodbc or pymssql:

python
CHAR_REPLACEMENTS = {
    '\u30FB': '·',  # Katakana Middle Dot
    '\u2022': '·',  # Bullet
    '\u2027': '·',  # Hyphenation Point
    '\u2014': '-',  # Em Dash
    '\u2013': '-',  # En Dash
    '\u2010': '-',  # Hyphen
    '\u2015': '-',  # Horizontal Bar
    '\u2192': '->',
    '\u21D2': '=>',
    '\u3000': ' ',  # Ideographic Space
    '\uFF01': '!',
    '\uFF08': '(',
    '\uFF09': ')',
    '\uFF1A': ':',
}

def replace_unsafe_chars(text: str) -> str:
    return ''.join(CHAR_REPLACEMENTS.get(c, c) for c in text)

Limitation: Any special character not in the replacement table will still corrupt to ?. This is not a complete solution. Migrate to NVARCHAR as soon as your schedule allows.


What About Other Databases?

MSSQL isn’t alone in having this issue — but each database handles it differently.

데이터 표
DBVARCHAR BehaviorRecommended Setting
MSSQLStored using server code page. Characters outside the code page corrupt to ?NVARCHAR + N prefix
MySQLCharacter set can be set per column, table, or database. Default latin1 causes the same problemCHARACTER SET utf8mb4
PostgreSQLWith UTF8 encoding at DB creation, even VARCHAR stores Unicode correctly. Defaults to UTF-8, so this rarely occurs in practiceENCODING 'UTF8' (usually the default)
OracleVARCHAR2 follows the DB character set. NVARCHAR2 is Unicode-onlyNVARCHAR2 or set DB to AL32UTF8
SQLiteStores all text as UTF-8 internally. TEXT type has virtually no Unicode issuesDefault is sufficient

Had this system been built on PostgreSQL or SQLite — both of which default to UTF-8 — this problem likely never would have come up. MSSQL carries the legacy of its code-page-based design, which still has real consequences today, especially in Korean-language environments.

Server racks in a data center — each database handles character encoding differently
Server racks in a data center — each database handles character encoding differently
Photo: Unsplash

Practical Checklist

Auditing an Existing Database

sql
-- Find rows with ? in VARCHAR columns
SELECT id, content
FROM notices
WHERE content LIKE '%?%';

-- List all VARCHAR/CHAR/TEXT columns that may contain user input
SELECT
    TABLE_NAME,
    COLUMN_NAME,
    DATA_TYPE,
    CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('varchar', 'char', 'text')
  AND TABLE_SCHEMA = 'dbo'
ORDER BY TABLE_NAME, COLUMN_NAME;

Best Practices for New Table Design

In any MSSQL environment handling Korean data, columns that accept user input should be designed as NVARCHAR from the start.

sql
CREATE TABLE notices (
    id       INT            IDENTITY(1,1) PRIMARY KEY,
    title    NVARCHAR(500)  NOT NULL,
    content  NVARCHAR(MAX)  NOT NULL,  -- MAX = up to ~2GB
    reg_date DATETIME       DEFAULT GETDATE()
);

NVARCHAR(MAX) supports up to approximately 2GB and is the right choice when content length is unpredictable, such as in bulletin board posts.

A combination lock on a keyboard — data corruption happens silently. Designing with NVARCHAR from the start is the best way to protect your data.
A combination lock on a keyboard — data corruption happens silently. Designing with NVARCHAR from the start is the best way to protect your data.
Photo: Unsplash

Summary

Now you know why text copied from HWP ends up as ? in MSSQL:

  1. VARCHAR stores data based on the server’s code page (e.g., CP949).
  2. Some special characters in HWP documents are not included in that code page, so they corrupt to ?.
  3. Change the column type to NVARCHAR and add the N prefix to string literals to resolve the issue.
  4. PostgreSQL and SQLite default to UTF-8 and rarely show this problem, but MSSQL requires extra care, especially in Korean-language environments.

If you’re designing a new system that handles Korean data, defaulting all text columns to NVARCHAR is the safest starting point. Recovering corrupted data after the fact is far more painful than getting it right upfront.

Data corruption happens silently. By the time you find it, it’s often too late.


References