Semantic Address-to-Postcode Retrieval with Ollama and Qdrant


This is the second problem from the previous post. Not only is the address data in the wrong format, some of it is missing!

In tax-related documents, the most important piece of data is the postcode, which for some reason some people didn't include. So I decided to tackle this problem with semantic search. First, I created a vector store in Qdrant using Thailand address data from ThepExcel.

Thailand subdistrict, district, province, and postcode database V3: more complete data - Thep Excel
Following on from my attempt to build a "central database", or what I call a "Common Database", that many people would probably want to use as well, such as a database of subdistricts, districts, provinces, and regions
download/ThepExcel-Thailand-Tambon.xlsx at master · ThepExcel/download

Again, I used n8n to create the data points and insert them into Qdrant. This step needs an embedding model. At first I used Google's text-embedding-004 because it is fast and free, but I later found out that it didn't support Thai, so I switched to bge-m3:567m on Ollama instead.
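Outside n8n, the same embedding step can be sketched with Ollama's HTTP API. This is a minimal sketch, not the workflow I actually ran: it assumes Ollama is listening on its default local port and that the bge-m3:567m model has already been pulled. The function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "bge-m3:567m"


def build_embed_request(text: str, model: str = MODEL) -> dict:
    """Build the JSON body for Ollama's /api/embed endpoint."""
    return {"model": model, "input": text}


def embed(text: str, model: str = MODEL) -> list[float]:
    """Send one text to Ollama and return its embedding vector."""
    body = json.dumps(build_embed_request(text, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # /api/embed returns one vector per input under the "embeddings" key.
    return data["embeddings"][0]
```

Calling `embed("some address text")` only works with a running Ollama server; `build_embed_request` is split out so the request body can be inspected without one.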

I used cosine as the vector distance and 1024 as the vector size, which matches the dimension of bge-m3's dense embeddings.
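For reference, here is roughly what those two settings look like when creating the collection through Qdrant's REST API instead of the UI or n8n. It is a sketch under assumptions: Qdrant on its default local port, and `thai_postcodes` as a hypothetical collection name. The small `cosine_similarity` helper is only there to show what the distance setting measures.

```python
import json
import math
import urllib.request

QDRANT_URL = "http://localhost:6333"  # assumption: default local Qdrant port
COLLECTION = "thai_postcodes"         # hypothetical collection name


def collection_config(size: int = 1024, distance: str = "Cosine") -> dict:
    """Body for PUT /collections/{name}: vector size and distance metric."""
    return {"vectors": {"size": size, "distance": distance}}


def create_collection(name: str = COLLECTION) -> dict:
    """Create the collection on a running Qdrant instance."""
    body = json.dumps(collection_config()).encode("utf-8")
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The vector size has to match the embedding model's output, so switching to a different model later means recreating the collection with that model's dimension.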

Then, in n8n, I arranged the nodes like this:

I used a chunk size of 100 with 0 overlap. Each record is a single row averaging around 70-80 tokens, so I saw no need for more than 100.
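Because every row fits inside one chunk, each row ends up as exactly one point in Qdrant. A rough sketch of that mapping is below. The column names (`tambon`, `district`, `province`, `postcode`) and the payload layout are hypothetical, chosen for illustration; they are not the exact headers of the ThepExcel file or the payload that n8n writes.

```python
def row_to_text(row: dict) -> str:
    """Join one spreadsheet row into the text that gets embedded."""
    return f"{row['tambon']} {row['district']} {row['province']} {row['postcode']}"


def row_to_point(point_id: int, row: dict, vector: list[float]) -> dict:
    """Build one Qdrant point: an id, the embedding, and the row as payload."""
    return {
        "id": point_id,
        "vector": vector,
        "payload": {"text": row_to_text(row), **row},
    }


def upsert_body(points: list[dict]) -> dict:
    """Body for PUT /collections/{name}/points."""
    return {"points": points}


# Illustrative row only; the vector is a placeholder, not a real embedding.
sample_row = {
    "tambon": "Si Lom",
    "district": "Bang Rak",
    "province": "Bangkok",
    "postcode": "10500",
}
sample_point = row_to_point(1, sample_row, [0.0] * 1024)
```

Keeping the postcode in the payload means a search hit already carries the answer, without a second lookup in the spreadsheet.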

Finally, I used the vector store to convert addresses into postcodes: the raw address is the query, and the text output is converted into a JSON object so that n8n can write the data to a CSV file.
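The last step can be sketched in plain Python as well. This is an illustration under assumptions, not the n8n nodes themselves: it assumes the text output contains a JSON object with a `postcode` key (possibly wrapped in extra text or a code fence), and the CSV column names are hypothetical. `search_body` shows the request body for Qdrant's `POST /collections/{name}/points/search` endpoint, given an already-embedded address.

```python
import csv
import io
import json
import re

FIELDNAMES = ["address", "postcode"]  # hypothetical CSV columns


def search_body(vector: list[float], limit: int = 1) -> dict:
    """Body for a Qdrant vector search that also returns the stored payload."""
    return {"vector": vector, "limit": limit, "with_payload": True}


def parse_json_output(text: str) -> dict:
    """Extract a JSON object from text, from the first '{' to the last '}'."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in output")
    return json.loads(match.group(0))


def write_rows(rows: list[dict], fileobj) -> None:
    """Write address/postcode rows as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Illustrative usage with a made-up text output.
raw_output = 'Here is the result:\n{"postcode": "10500"}'
parsed = parse_json_output(raw_output)
buffer = io.StringIO()
write_rows([{"address": "Si Lom, Bang Rak", "postcode": parsed["postcode"]}], buffer)
```

Keeping the postcode as a string rather than a number keeps it exactly as returned and avoids accidental numeric formatting when the CSV is opened in a spreadsheet.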
