I am working on a project wherein I have a set of keyw开发者_高级运维ords [abc, xyz, klm]`. I also have a bunch of text files with content [1.txt, 2.txt, 3.txt].
What I am doing is bumping the keywords against the text files to find the line where the keyword occurs and it can do so multiple times. So I want to store the ID (text file name without .txt), Extracted_Data, Line_Number, Spwaned_Across (keyword may be spread across 2 lines)
for each occurence.
I decided to create a table for each keyword to store this data.
Tables : abc, xyz, klm
Table abc sample data :
ID Extracted_Data Line_Number Spawned_Across
12 MySQL is wonderful. What is 'abc' 34 1
So I end up with a table for each keyword. In my project, there are about 150 keywords and it can grow. So 150 tables.
Why did I choose to do this way?
For now I am required to find if the keyword exists in a file and I am sure in the future I will be asked to show where or how it occurred in the file. I am planning on creating a table automatically for each new keyword, this way I don't have to manually created each one of them or a giant table with 100s of columns.
Did I make the right decision? Your input is highly appreciated.
Don't do that. No database library is optimized for dynamic table names and you'll end up having to create your query from scratch each time you want to access a table. Also, how would you answer questions like "what data did I find on line 34 of file 12"?
You'll want three tables. In PostgreSQL syntax [*], that'd be:
CREATE TABLE source (sourceid SERIAL, filename VARCHAR NOT NULL);
CREATE TABLE keyword (keywordid SERIAL, keyword VARCHAR NOT NULL);
CREATE TABLE location (locationid SERIAL,
sourceid INTEGER NOT NULL REFERENCES source(sourceid),
keyword INTEGER NOT NULL REFERENCES keyword(keywordid),
data VARCHAR NOT NULL,
line INTEGER NOT NULL,
span INTEGER NOT NULL);
When you start processing a new text file, create a new source
tuple and remember its sourceid. When you encounter a keyword, either insert a new record for it and remember its keywordid or look up the old record. Then insert that sourceid, keywordid, and other relevant data into location
.
To answer the question I posed earlier:
SELECT * FROM
location JOIN source ON location.sourceid = source.sourceid
JOIN keyword ON location.keywordid = keyword.keywordid
WHERE
source.filename = 'foo.txt' AND
location.line = 34;
Yes, it's more work up front to do it the "right" way but you'll be paid back a million times over in performance, ease of maintenance, and ease of using the results.
[*] The MySQL syntax will be similar but I don't remember it off the top of my head and you can figure out the differences pretty easily.
I can't see why you can't just store the keyword along the data in one table.
ID Keyword Extracted_Data Line_Number Spawned_Across
12 abc Abc or xyz?.. 31337 1
12 xyz Abc or xyz?.. 31337 1
12 xyz just xyz here 66666 1
13 xyz xyz travels! 123 1
So you'll have to query by keyword or by file, or by both, all data is present. To normalize further you can store keywords separately in the "keywords" table and keep only the foreign key in the "occurences" table.
Also it's not very popular to name "ID" anything other than the primary key.
This is definitely a very bad decision .
Millions of rows is better than millions of tables.
Create 2 tables with the suitable foreign keys and you will be fine.
I will be asked to show where or how it occurred in the file.
This can still be done in 2 tables
I don't think this is efficient. I'm not even sure that a relational database is the right tool for the job.
New keywords will mean more tables. That's not scalable.
Keywords and files make me think of indexing and unstructured search. I'd be thinking about Lucene before a relational database.
精彩评论