Why Does Copy-Pasting from HWP Cause '?' in MSSQL — VARCHAR vs NVARCHAR Explained

Sound Familiar?

3 PM. Your phone rings, right on schedule.

“I posted an announcement on the website, but the text looks weird. There are question marks everywhere.”

The person on the other end insists they “just typed it normally, nothing special.” But if you’ve dealt with Korean systems long enough, you already know where this is going. A few more questions confirm it: they copied and pasted from an HWP document directly into the web editor. It looked fine in the preview. But after saving, this is what appeared on screen:

Course Registration ? How to Apply
Tuition ?150,000 KRW (installment available)

What is HWP?
HWP (Hangul Word Processor) is a word processing application developed by Hancom, a South Korean software company. It holds a near-monopoly in South Korea’s public sector — government agencies, public institutions, schools, and many large enterprises use it as their primary document format. Think of it as Korea’s Microsoft Word, only far more entrenched in official settings.
If you work with Korean organizations or government clients, you will almost certainly run into HWP files. HWP documents are also well known for containing Unicode special characters that a non-Unicode database column cannot store, which is exactly what this post is about. Work with HWP long enough and this issue will find you.

A webpage where special characters appear as '?' — the result of saving HWP special characters into a VARCHAR column
Screenshot: codeslog

The middle-dot character that was clearly in the original text had been replaced with ? somewhere along the way.

Checking the database…

sql

SELECT content FROM notices WHERE id = 123456;
-- Result: Course Registration ? How to Apply

Sure enough, the data was already stored as ?. The corruption happened at save time.

The HWP file is innocent. The network is innocent. The browser is innocent.
The suspect was the VARCHAR column all along.

This post explains why this happens and how to fix it, step by step.

Which Characters Are the Problem?

We tested a set of characters commonly used in HWP documents — middle dots, quotes, and dashes — against a VARCHAR column. The results varied:

Middle Dot Variants

Char	Unicode	Name	VARCHAR Result
`・`	U+30FB	Katakana Middle Dot	❌ `?`
`•`	U+2022	Bullet	❌ `?`
`·`	U+00B7	Middle Dot	✅ OK
`˙`	U+02D9	Dot Above	✅ OK

Single Quote Variants

Char	Unicode	Name	VARCHAR Result
`‘`	U+2018	Left Single Quotation Mark	✅ OK
`’`	U+2019	Right Single Quotation Mark	✅ OK
`′`	U+2032	Prime	✅ OK
`´`	U+00B4	Acute Accent	✅ OK

Double Quote Variants

Char	Unicode	Name	VARCHAR Result
`“`	U+201C	Left Double Quotation Mark	✅ OK
`”`	U+201D	Right Double Quotation Mark	✅ OK
`″`	U+2033	Double Prime	✅ OK
`˝`	U+02DD	Double Acute Accent	✅ OK

Only ・ (Katakana Middle Dot) and • (Bullet) came out as ?.

What do those two have in common? They are not included in the server’s code page.

The Root Cause: Code Pages and Unicode

What Is a Code Page?

When a computer stores text, it ultimately converts characters into numbers (bytes). The mapping table that defines “which character maps to which number” is called a code page.

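In Korean-language environments, the most common code page is CP949 (an extension of EUC-KR). For example: 가 → 0xB0A1, 나 → 0xB3AA. In SQL Server, a VARCHAR column’s code page comes from its collation; Korean_Wansung_CI_AS, for example, uses code page 949.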

The problem is that characters not listed in this mapping table simply cannot be stored. SQL Server silently replaces them with ?.

・ (U+30FB, Katakana Middle Dot) → not in CP949 → ?
• (U+2022, Bullet) → not in CP949 → ?
· (U+00B7, Middle Dot) → included in KS X 1001 character set → stored correctly
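
The three cases above can be reproduced without a database. The sketch below uses Python's built-in `cp949` codec as a stand-in for the conversion SQL Server performs. `simulate_varchar_store` is a hypothetical helper name, and the server's own conversion may differ in edge cases, so treat it as an approximation.

```python
def simulate_varchar_store(text: str, codepage: str = "cp949") -> str:
    """Approximate what a code-page VARCHAR column keeps of `text`.

    Characters the code page cannot represent are replaced with '?',
    mirroring the substitution described above.
    """
    return text.encode(codepage, errors="replace").decode(codepage)


# Katakana Middle Dot, Bullet, Middle Dot
for ch in ("\u30fb", "\u2022", "\u00b7"):
    print(f"U+{ord(ch):04X} -> {simulate_varchar_store(ch)!r}")
```

Running this against a whole announcement before saving gives a quick preview of which characters would be lost.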

What Is Unicode?

Unicode is a universal standard that assigns a unique number to virtually every character in the world. 가 = U+AC00, A = U+0041, ・ = U+30FB, and so on.

MSSQL’s NVARCHAR stores text using UTF-16LE encoding, which is Unicode-based. Because it doesn’t rely on code pages, any special character copied from HWP can be stored without corruption.
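
To make the code point and UTF-16LE relationship concrete, here is a small sketch that prints both for the characters mentioned above. `describe` is a hypothetical helper, and the byte strings are what Python's `utf-16-le` codec produces, which should match the layout NVARCHAR uses for these characters.

```python
def describe(ch: str) -> tuple:
    """Return (code point label, UTF-16LE bytes as hex) for one character."""
    code_point = f"U+{ord(ch):04X}"
    utf16_bytes = ch.encode("utf-16-le").hex(" ").upper()  # needs Python 3.8+
    return code_point, utf16_bytes


for ch in ("가", "A", "\u30fb"):
    print(ch, *describe(ch))
```

Every one of these characters occupies exactly two bytes, with the low byte first (that is the "LE" in UTF-16LE).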

Conceptual diagram comparing CP949 and Unicode — characters missing from CP949 (• ・) are replaced with ?, while NVARCHAR's Unicode encoding stores all characters safely
Image: Generated with Nanobanana AI

VARCHAR vs NVARCHAR: What’s the Difference?

Feature	`VARCHAR`	`NVARCHAR`
Storage method	Server code page (CP949, etc.)	Unicode (UTF-16LE)
Bytes per character	1 byte (ASCII), 2 bytes (Korean)	2 bytes for all (4 bytes for surrogate pairs)
Storage example	`VARCHAR(100)` = up to 100 bytes	`NVARCHAR(100)` = up to 200 bytes
Characters outside code page	Corrupted to `?`	Stored correctly
Recommended for	ASCII-heavy data (e.g., English-only logs)	Korean, special characters, multilingual data

The Storage Size Misconception

Since NVARCHAR uses 2 bytes per character, ASCII text takes twice the storage it would in VARCHAR. Korean characters already take 2 bytes each under CP949, so Korean-heavy content barely grows. Even so, some legacy systems stick with VARCHAR in the name of “saving space.”

But given modern storage costs, this difference is negligible for user-facing data.
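
The byte counts from the comparison table can be checked with a few lines of Python. This is a rough sketch: `storage_bytes` is a hypothetical helper that only measures the encoded text, and it ignores row overhead, compression, and anything else the storage engine adds.

```python
def storage_bytes(text: str) -> dict:
    """Compare encoded sizes under CP949 (VARCHAR) and UTF-16LE (NVARCHAR)."""
    return {
        "varchar_cp949": len(text.encode("cp949")),
        "nvarchar_utf16": len(text.encode("utf-16-le")),
    }


print(storage_bytes("Notice"))    # ASCII only: NVARCHAR is twice the size
print(storage_bytes("공지사항"))   # Korean only: both encodings use 2 bytes per character
```

The doubling only applies to the ASCII portion of the data, which is worth keeping in mind before ruling out NVARCHAR on size grounds.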

Bytes saved with VARCHAR vs. late nights avoided with NVARCHAR — which matters more?

VARCHAR vs NVARCHAR comparison — VARCHAR risks data corruption due to code page limitations, NVARCHAR safely stores all characters using Unicode
Image: Generated with Nanobanana AI

How to Fix It

Step 1: Change the Column Type to NVARCHAR

sql

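-- Append NULL or NOT NULL to match the existing column; if omitted, nullability falls back to session defaults.
-- NVARCHAR(n) tops out at 4000; use NVARCHAR(MAX) if existing values may be longer.
ALTER TABLE notices
ALTER COLUMN content NVARCHAR(4000);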

Note: Data already stored as ? cannot be recovered by changing the column type. The original content must be re-entered.

Step 2: Add the N Prefix to Query Literals

This step is easy to miss. Even if the column is NVARCHAR, a string literal without the N prefix is first converted through the database’s code page, so unsupported characters become ? before they ever reach the column.

sql

-- ❌ May still result in ?
INSERT INTO notices (content) VALUES ('Course Registration ・ How to Apply');

-- ✅ Correct
INSERT INTO notices (content) VALUES (N'Course Registration ・ How to Apply');

-- Same applies to UPDATE
UPDATE notices SET content = N'Tuition ・ 150,000 KRW' WHERE id = 42;

The N'...' notation explicitly tells SQL Server: “treat this string as Unicode.”
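
If you generate seed or migration scripts from text, a tiny helper can make sure the prefix is never forgotten. The sketch below is an assumption-laden convenience, not a replacement for parameterized queries: `to_nvarchar_literal` is a hypothetical name, and it should only be fed trusted text, never raw user input.

```python
def to_nvarchar_literal(value: str) -> str:
    """Return `value` as a T-SQL Unicode literal: N'...' with single quotes doubled."""
    return "N'" + value.replace("'", "''") + "'"


sql = (
    "INSERT INTO notices (content) VALUES ("
    + to_nvarchar_literal("Course Registration \u30fb How to Apply")
    + ");"
)
print(sql)
```

For anything that comes from users, bind the value as a parameter instead of building the statement as a string.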

Handling It in the Application Layer

Beyond writing raw SQL, check the parameter type when using ADO.NET or similar libraries.

csharp

// ❌ Binding as SqlDbType.VarChar may cause corruption
cmd.Parameters.Add("@content", SqlDbType.VarChar).Value = content;

// ✅ Explicitly use NVarChar
cmd.Parameters.Add("@content", SqlDbType.NVarChar).Value = content;

Most modern ORMs (like Entity Framework) automatically map string types to NVARCHAR, but always double-check when working with legacy code or raw ADO.NET.
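
The same idea applies in Python. The sketch below binds the text as a parameter rather than embedding it in the SQL string. `save_notice` is a hypothetical function, and `cursor` is assumed to be a DB-API cursor that uses `?` placeholders, as pyodbc does. As far as I know, pyodbc sends Python `str` parameters as Unicode by default, but that is worth verifying against your driver version and settings.

```python
def save_notice(cursor, content: str) -> None:
    """Insert a notice using a bound parameter instead of a string literal.

    `cursor` is assumed to be a DB-API cursor with '?' placeholders
    (for example, one obtained from a pyodbc connection).
    """
    cursor.execute("INSERT INTO notices (content) VALUES (?)", (content,))
```

With a real connection this would be called as `save_notice(conn.cursor(), text)` followed by `conn.commit()`. Because no literal appears in the SQL text, the N prefix question does not arise.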

When You Can’t Change the DB: Server-Side Character Replacement

Switching to NVARCHAR is the cleanest fix, but it’s not always immediately feasible. Legacy systems, third-party solutions, or tight deployment schedules can make schema changes difficult. In those cases, replacing problematic characters on the server side before saving is a practical alternative.

Think of it as placing an interpreter between HWP and your database. When ・ arrives from an HWP document, the interpreter swaps it for the safe · before handing it off to the DB.

Replacement Principles

Prefer a visually similar character when one exists (e.g., ・ → ·)
When no close match is available, use the nearest ASCII equivalent (e.g., — → -)

C# Example

csharp

using System.Collections.Generic;
using System.Text;

// Target-typed new() requires C# 9 or later.
private static readonly Dictionary<char, string> HwpCharReplacements = new()
{
    { '\u30FB', "·" },  // Katakana Middle Dot → Middle Dot (U+00B7)
    { '\u2022', "·" },  // Bullet → Middle Dot
    { '\u2027', "·" },  // Hyphenation Point → Middle Dot
    { '\u2014', "-" },  // Em Dash → Hyphen
    { '\u2013', "-" },  // En Dash → Hyphen
    { '\u2010', "-" },  // Hyphen (U+2010) → Hyphen-Minus
    { '\u2015', "-" },  // Horizontal Bar → Hyphen
    { '\u2192', "->" }, // Right Arrow
    { '\u21D2', "=>" }, // Double Right Arrow
    { '\u3000', " " },  // Ideographic Space → Regular Space
    { '\uFF01', "!" },  // Fullwidth Exclamation Mark
    { '\uFF08', "(" },  // Fullwidth Left Parenthesis
    { '\uFF09', ")" },  // Fullwidth Right Parenthesis
    { '\uFF1A', ":" },  // Fullwidth Colon
};

public static string ReplaceUnsafeChars(string input)
{
    if (string.IsNullOrEmpty(input)) return input;

    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (HwpCharReplacements.TryGetValue(c, out var replacement))
            sb.Append(replacement);
        else
            sb.Append(c);
    }
    return sb.ToString();
}

Call it before saving:

csharp

string sanitized = ReplaceUnsafeChars(userInput);
cmd.Parameters.Add("@content", SqlDbType.VarChar).Value = sanitized;

To check upfront whether a string contains characters that CP949 can’t handle:

csharp

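public static bool IsVarCharSafe(string input)
{
    // On .NET Core / .NET 5+, code page 949 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) once at startup.
    var cp949 = Encoding.GetEncoding(949);
    // Unencodable characters are substituted, so the round trip no longer matches.
    byte[] bytes = cp949.GetBytes(input);
    return cp949.GetString(bytes) == input;
}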

If the round-tripped string differs from the original, there are characters that will be corrupted to ?.

Python Example

For Python users connecting to MSSQL via pyodbc or pymssql:

python

CHAR_REPLACEMENTS = {
    '\u30FB': '·',  # Katakana Middle Dot
    '\u2022': '·',  # Bullet
    '\u2027': '·',  # Hyphenation Point
    '\u2014': '-',  # Em Dash
    '\u2013': '-',  # En Dash
    '\u2010': '-',  # Hyphen
    '\u2015': '-',  # Horizontal Bar
    '\u2192': '->',
    '\u21D2': '=>',
    '\u3000': ' ',  # Ideographic Space
    '\uFF01': '!',
    '\uFF08': '(',
    '\uFF09': ')',
    '\uFF1A': ':',
}

def replace_unsafe_chars(text: str) -> str:
    return ''.join(CHAR_REPLACEMENTS.get(c, c) for c in text)

Limitation: Any special character not in the replacement table will still corrupt to ?. This is not a complete solution. Migrate to NVARCHAR as soon as your schedule allows.
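
One way to soften this limitation is to report whatever the table missed, so new offenders show up in logs instead of in production pages. The sketch below is a minimal illustration under a few assumptions: `REPLACEMENTS` is a trimmed-down copy of the table above, `sanitize_and_report` is a hypothetical name, and Python's `cp949` codec stands in for the server's conversion.

```python
# Trimmed-down replacement table (see the full table above).
REPLACEMENTS = str.maketrans({
    "\u30fb": "\u00b7",  # Katakana Middle Dot -> Middle Dot
    "\u2022": "\u00b7",  # Bullet -> Middle Dot
    "\u2014": "-",       # Em Dash -> Hyphen-Minus
})


def sanitize_and_report(text: str, codepage: str = "cp949"):
    """Apply the replacements, then list code points the code page still cannot store."""
    cleaned = text.translate(REPLACEMENTS)
    leftovers = []
    for ch in cleaned:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            leftovers.append(f"U+{ord(ch):04X}")
    return cleaned, leftovers


cleaned, leftovers = sanitize_and_report("Tuition \u30fb 150,000 KRW \U0001F600")
print(cleaned)
print(leftovers)  # anything listed here would still be stored as '?'
```

Logging the leftovers gives you a concrete list of characters to add to the table, or a concrete argument for scheduling the NVARCHAR migration.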

What About Other Databases?

MSSQL isn’t alone in having this issue — but each database handles it differently.

DB	VARCHAR Behavior	Recommended Setting
MSSQL	Stored using the code page of the column's collation. Characters outside the code page corrupt to `?`	`NVARCHAR` + `N` prefix
MySQL	Character set can be set per column, table, or database. The `latin1` default of MySQL 5.7 and earlier causes the same problem (8.0 defaults to `utf8mb4`)	`CHARACTER SET utf8mb4`
PostgreSQL	With `UTF8` encoding at DB creation, even `VARCHAR` stores Unicode correctly. Defaults to UTF-8, so this rarely occurs in practice	`ENCODING 'UTF8'` (usually the default)
Oracle	`VARCHAR2` follows the DB character set. `NVARCHAR2` is Unicode-only	`NVARCHAR2` or set DB to `AL32UTF8`
SQLite	Stores all text as UTF-8 internally. `TEXT` type has virtually no Unicode issues	Default is sufficient

Had this system been built on PostgreSQL or SQLite — both of which default to UTF-8 — this problem likely never would have come up. MSSQL carries the legacy of its code-page-based design, which still has real consequences today, especially in Korean-language environments.

Server racks in a data center — each database handles character encoding differently
Photo: Unsplash

Practical Checklist

Auditing an Existing Database

sql

-- Find rows with ? in VARCHAR columns
-- (this also matches legitimate question marks, so review the hits manually)
SELECT id, content
FROM notices
WHERE content LIKE '%?%';

-- List all VARCHAR/CHAR/TEXT columns that may contain user input
SELECT
    TABLE_NAME,
    COLUMN_NAME,
    DATA_TYPE,
    CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('varchar', 'char', 'text')
  AND TABLE_SCHEMA = 'dbo'
ORDER BY TABLE_NAME, COLUMN_NAME;
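
Once the column list is in hand, the conversion statements can be drafted mechanically. The sketch below is only a starting point under several assumptions: `nvarchar_alter_statement` is a hypothetical helper, the audit query is assumed to also select `IS_NULLABLE`, identifiers are assumed not to contain `]`, and indexes, constraints, and defaults on the column are not handled at all.

```python
def nvarchar_alter_statement(table, column, data_type, max_length, is_nullable, schema="dbo"):
    """Draft an ALTER COLUMN statement that moves a code-page column to Unicode.

    max_length follows INFORMATION_SCHEMA conventions: -1 means varchar(max),
    and legacy `text` columns report a very large value.
    """
    data_type = data_type.lower()
    if data_type == "char" and 0 < max_length <= 4000:
        new_type = f"NCHAR({max_length})"
    elif data_type == "varchar" and 0 < max_length <= 4000:
        new_type = f"NVARCHAR({max_length})"
    else:
        # varchar(max), text, or anything longer than 4000 characters
        new_type = "NVARCHAR(MAX)"
    nullability = "NULL" if is_nullable == "YES" else "NOT NULL"
    return f"ALTER TABLE [{schema}].[{table}] ALTER COLUMN [{column}] {new_type} {nullability};"


print(nvarchar_alter_statement("notices", "title", "varchar", 200, "YES"))
print(nvarchar_alter_statement("notices", "content", "varchar", 8000, "NO"))
```

Review every generated statement before running it, and test on a copy of the database first; columns that take part in indexes or foreign keys usually need extra steps.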

Best Practices for New Table Design

In any MSSQL environment handling Korean data, columns that accept user input should be designed as NVARCHAR from the start.

sql

CREATE TABLE notices (
    id       INT            IDENTITY(1,1) PRIMARY KEY,
    title    NVARCHAR(500)  NOT NULL,
    content  NVARCHAR(MAX)  NOT NULL,  -- MAX = up to ~2GB
    reg_date DATETIME       DEFAULT GETDATE()
);

NVARCHAR(MAX) supports up to approximately 2GB and is the right choice when content length is unpredictable, such as in bulletin board posts.

A combination lock on a keyboard — data corruption happens silently. Designing with NVARCHAR from the start is the best way to protect your data.
Photo: Unsplash

Summary

Now you know why text copied from HWP ends up as ? in MSSQL:

VARCHAR stores data using the code page of the column’s collation (e.g., CP949).
Some special characters in HWP documents are not included in that code page, so they corrupt to ?.
Change the column type to NVARCHAR and add the N prefix to string literals to resolve the issue.
PostgreSQL and SQLite default to UTF-8 and rarely show this problem, but MSSQL requires extra care, especially in Korean-language environments.

If you’re designing a new system that handles Korean data, defaulting all text columns to NVARCHAR is the safest starting point. Recovering corrupted data after the fact is far more painful than getting it right upfront.

Data corruption happens silently. By the time you find it, it’s often too late.

References

KS X 1001 — Wikipedia: Character list included in the KS X 1001 standard. Useful for verifying why characters like ′ (U+2032) and ˙ (U+02D9) are safe in CP949.
CP949 to Unicode Mapping — Unicode Consortium: The official Unicode mapping table for Windows Code Page 949.
